How to Run a Local LLM on Your Server

If you're searching for "local LLM" or "how to run LLM locally", you probably want one of two things: privacy (your data never leaves your machine) or cost savings (no per-token API fees). Running a self-hosted LLM gives you both—plus the freedom to use uncensored models and customize behavior however you want.

The tradeoff: local models are generally less capable than cloud APIs like GPT-4 or Claude. For many tasks—summarization, code completion, chat, simple reasoning—the gap is small enough to be worth it. For complex multi-step reasoning or frontier capabilities, you'll still want cloud models.

This tutorial shows you how to set up Ollama, the simplest way to run LLMs locally.

Why run LLMs locally?

Privacy: Your prompts and data never leave your server. For sensitive documents, proprietary code, or personal conversations, this matters.

Cost: No per-token API fees. Once you have the hardware, inference costs little beyond electricity. If you're running thousands of queries per day, local models pay for themselves quickly.

Control: Use any model you want. Fine-tune on your own data. No content policies or usage restrictions.

Latency: For on-device use, local inference can be faster than round-tripping to an API.

Offline access: Works without internet connectivity.

Hardware requirements

LLM inference is limited primarily by memory: GPU memory (VRAM) when running on a GPU, or system RAM for CPU inference.

| Model size | Minimum VRAM/RAM | Example models |
| --- | --- | --- |
| 7-8B parameters | 8GB | Llama 3.1 8B, Mistral 7B, Qwen2 7B |
| 13B parameters | 16GB | Llama 2 13B, CodeLlama 13B |
| 70B parameters | 48GB+ | Llama 3.1 70B (quantized) |

For CPU inference (no GPU): you need roughly 1GB of RAM per billion parameters for quantized models. A 7B model runs acceptably on 16GB RAM machines, though much slower than GPU inference.
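
Not sure how much memory you have to spare? On a Linux server with an NVIDIA GPU and drivers installed, these two standard commands report VRAM and system RAM (adjust if your setup differs):

nvidia-smi --query-gpu=name,memory.total --format=csv   # GPU model and total VRAM
free -h                                                 # total and available system RAM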

Cloud VMs with GPUs work well. Providers like Lambda Labs, RunPod, or Vast.ai offer GPU instances starting around $0.20/hour.

Installing Ollama

Ollama is the easiest way to get started: one command to install, one command to run a model. On Linux, install it with the official script:

curl -fsSL https://ollama.com/install.sh | sh

Verify the installation:

ollama --version

Start the Ollama server if it isn't already running (the Linux installer usually starts it for you). It runs as a background daemon:

ollama serve &
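
On most Linux systems the install script also registers a systemd service, so rather than backgrounding the process by hand you can let systemd keep it running across reboots. A minimal sketch, assuming the default service name ollama:

sudo systemctl enable --now ollama   # start now and on every boot
systemctl status ollama              # confirm it's active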

Running your first model

Pull and run Llama 3.2 (a good general-purpose model):

ollama run llama3.2

This downloads the model (about 2GB for the default 3B variant) and starts an interactive chat. Type your prompts, press Enter, get responses.

To exit: type /bye or press Ctrl+C.
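
For scripting or one-off questions you don't need the interactive session; ollama run also accepts a prompt as an argument and exits after printing the response:

ollama run llama3.2 "Summarize the plot of Hamlet in two sentences."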

Running models via API

Ollama serves an HTTP API on localhost:11434: its own endpoints under /api, plus an OpenAI-compatible endpoint under /v1, so you can use it with any tool that speaks the OpenAI API. The examples below use the native endpoints.

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain quantum computing in one sentence."
}'
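
By default /api/generate streams the answer back as a series of JSON lines, one chunk per line. If you'd rather receive a single JSON object (easier to parse in a quick script), disable streaming:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain quantum computing in one sentence.",
  "stream": false
}'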

For chat completions (multi-turn conversations):

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "What is the capital of France?"}
  ]
}'
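
The two requests above use Ollama's native endpoints. Tools that expect the OpenAI wire format can talk to the same server through /v1 instead; no API key is required:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'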

Model recommendations

General purpose: Llama 3.2 (3B) or Mistral 7B. Fast, capable, good for most tasks.

Coding: CodeLlama or DeepSeek Coder. Trained specifically on code.

Long context: recent Mistral releases or other models with 32k+ context windows.

Smallest footprint: Phi-3 Mini (3.8B parameters) or Gemma 2B. Run on laptops or low-memory devices.

Most capable (local): Llama 3.1 70B if you have the hardware. Approaches cloud model quality.

List the models you've already downloaded (browse the full catalog at ollama.com/library):

ollama list

Pull a specific model:

ollama pull codellama
ollama pull mistral
ollama pull deepseek-coder
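
Most models come in several sizes and quantization levels, selected with a tag after the model name. Tag names vary by model, so check each model's page on ollama.com/library; for example:

ollama pull llama3.1:70b   # the 70B variant instead of the default 8B
ollama pull phi3:mini      # the 3.8B Phi-3 Mini model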

Ollama vs vLLM vs other options

Ollama: Easiest setup. Great for single-user or light workloads. Handles model management automatically.

vLLM: Higher throughput for production workloads. Better batching and scheduling. More complex setup.

llama.cpp: The underlying engine many tools use. Maximum flexibility, minimum convenience.

LM Studio: GUI application for Mac/Windows. Good if you want a visual interface.

For most people starting out, Ollama is the right choice. Move to vLLM when you need to serve many concurrent users or maximize throughput.
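
For comparison, a minimal vLLM setup looks roughly like this. It's a sketch, not a tuned deployment: it assumes a CUDA-capable GPU, a recent vLLM release with the vllm serve command, and uses a Hugging Face model ID purely as an example. The server speaks the OpenAI API on port 8000 by default.

pip install vllm
vllm serve mistralai/Mistral-7B-Instruct-v0.2   # OpenAI-compatible server on :8000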

Running on a remote server

If you're running Ollama on a cloud server or Zo Computer, you can expose it to your local machine.

On the server, start Ollama bound to all interfaces:

OLLAMA_HOST=0.0.0.0 ollama serve
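
If Ollama runs under systemd (the default after the Linux install script), an inline environment variable won't survive a restart. The documented way to make it stick is a systemd override:

sudo systemctl edit ollama
# in the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama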

Then access it remotely (replace your-server-ip):

curl http://your-server-ip:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello!"
}'
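
Keep in mind that binding to 0.0.0.0 exposes the API to anything that can reach the server, and Ollama has no built-in authentication. Unless the port is locked down by a firewall, a safer pattern is to leave Ollama on localhost and forward the port over SSH from your local machine (replace user and your-server-ip):

ssh -N -L 11434:localhost:11434 user@your-server-ip
# http://localhost:11434 on your laptop now reaches Ollama on the server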

For Zo Computer, you can set up a persistent service using the built-in service management.

When to use local vs cloud models

Use local LLMs when:

  • Privacy is critical (medical, legal, personal data)

  • You have predictable, high-volume usage

  • You need offline access

  • You want to experiment with fine-tuning or custom models

Use cloud APIs when:

  • You need frontier capabilities (complex reasoning, large context)

  • Usage is sporadic (pay-per-token is cheaper than idle hardware)

  • You need multimodal capabilities (vision, audio)

  • You want managed infrastructure

Many workflows combine both: use local models for routine tasks and high-volume processing, cloud models for complex reasoning when you need it.

Next steps

Once Ollama is running, you can:

  • Pull a few of the models recommended above and compare them on your own prompts

  • Point your scripts, editor, or other tools at the local API on port 11434

  • Set it up as a persistent service on a remote server so a model is always available
