If you're searching for "local LLM" or "how to run LLM locally", you probably want one of two things: privacy (your data never leaves your machine) or cost savings (no per-token API fees). Running a self-hosted LLM gives you both—plus the freedom to use uncensored models and customize behavior however you want.
The tradeoff: local models are generally less capable than cloud APIs like GPT-4 or Claude. For many tasks—summarization, code completion, chat, simple reasoning—the gap is small enough to be worth it. For complex multi-step reasoning or frontier capabilities, you'll still want cloud models.
This tutorial shows you how to set up Ollama, the simplest way to run LLMs locally.
Why run LLMs locally?
Privacy: Your prompts and data never leave your server. For sensitive documents, proprietary code, or personal conversations, this matters.
Cost: No API fees. Once you have the hardware, inference is free. If you're running thousands of queries per day, local models pay for themselves quickly.
Control: Use any model you want. Fine-tune on your own data. No content policies or usage restrictions.
Latency: For on-device use, local inference can be faster than round-tripping to an API.
Offline access: Works without internet connectivity.
Hardware requirements
LLM inference is primarily limited by GPU memory (VRAM) or RAM for CPU inference.
| Model size | Minimum VRAM/RAM | Example models |
| --- | --- | --- |
| 7-8B parameters | 8GB | Llama 3.1 8B, Mistral 7B, Qwen2 7B |
| 13B parameters | 16GB | Llama 2 13B, CodeLlama 13B |
| 70B parameters | 48GB+ | Llama 3.1 70B (quantized) |
For CPU inference (no GPU): you need roughly 1GB of RAM per billion parameters for quantized models. A 7B model runs acceptably on 16GB RAM machines, though much slower than GPU inference.
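To see what you're working with, two standard Linux commands cover it (nvidia-smi requires the NVIDIA drivers to be installed):
nvidia-smi   # GPU model and total/used VRAM
free -h      # total and available system RAM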
Cloud VMs with GPUs work well. Providers like Lambda Labs, RunPod, or Vast.ai offer GPU instances starting around $0.20/hour.
Installing Ollama
Ollama is the easiest way to get started. One command to install, one command to run a model.
curl -fsSL https://ollama.ai/install.sh | sh
Verify the installation:
ollama --version
Start the Ollama service (it runs as a background daemon):
ollama serve &
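On Linux, the install script usually starts Ollama as a system service already; if ollama serve reports the address is in use, it's already running. Either way, you can confirm the server is listening:
curl http://localhost:11434/api/version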
Running your first model
Pull and run Llama 3.2 (a good general-purpose model):
ollama run llama3.2
This downloads the model (roughly 2GB for the default 3B variant) and starts an interactive chat. Type your prompts, press Enter, get responses.
To exit, type /bye or press Ctrl+D.
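You can also pass a prompt as an argument for a one-off, non-interactive answer, which is convenient in scripts:
ollama run llama3.2 "Summarize the plot of Hamlet in one sentence."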
Running models via API
Ollama serves an HTTP API on localhost:11434, including an OpenAI-compatible endpoint under /v1, so you can point most tools that speak the OpenAI API at it. The examples below use Ollama's native endpoints.
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Explain quantum computing in one sentence."
}'
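The generate endpoint streams its response as newline-delimited JSON by default. If you want a single JSON object back instead, set the stream flag to false:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Explain quantum computing in one sentence.",
"stream": false
}'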
For chat completions (multi-turn conversations):
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}'
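If a tool expects the OpenAI API shape specifically, use the /v1 path. A minimal sketch of the OpenAI-compatible chat endpoint (clients that insist on an API key can be given any placeholder value, since Ollama ignores it):
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}'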
Model recommendations
General purpose: Llama 3.2 3B or Mistral 7B. Fast, capable, good for most tasks.
Coding: CodeLlama or DeepSeek Coder. Trained specifically on code.
Long context: Mistral (32k in recent versions) or other models with 32k+ context windows.
Smallest footprint: Phi-3 Mini (3.8B parameters) or Gemma 2B. Run on laptops or low-memory devices.
Most capable (local): Llama 3.1 70B if you have the hardware. Approaches cloud model quality.
List the models you've downloaded so far (the full catalog is at ollama.com/library):
ollama list
Pull a specific model:
ollama pull codellama
ollama pull mistral
ollama pull deepseek-coder
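Tags select a specific size or variant of a model. For example, to grab the 70B model from the hardware table above (assuming you have the VRAM for it):
ollama pull llama3.1:70b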
Ollama vs vLLM vs other options
Ollama: Easiest setup. Great for single-user or light workloads. Handles model management automatically.
vLLM: Higher throughput for production workloads. Better batching and scheduling. More complex setup.
llama.cpp: The underlying engine many tools use. Maximum flexibility, minimum convenience.
LM Studio: GUI application for Mac/Windows. Good if you want a visual interface.
For most people starting out, Ollama is the right choice. Move to vLLM when you need to serve many concurrent users or maximize throughput.
Running on a remote server
If you're running Ollama on a cloud server or Zo Computer, you can expose it to your local machine.
On the server, start Ollama bound to all interfaces:
OLLAMA_HOST=0.0.0.0 ollama serve
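If you used the install script on Linux, Ollama is often already running as a systemd service, and the environment variable then needs to be set on the service rather than in your shell. A sketch, assuming the installer's default service name (ollama):
sudo systemctl edit ollama
# in the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl restart ollama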
Then access it remotely (replace your-server-ip):
curl http://your-server-ip:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Hello!"
}'
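Binding to 0.0.0.0 exposes the port to anyone who can reach the server, and Ollama has no built-in authentication. If you only need access from your own machine, an SSH tunnel is a safer alternative (replace user and your-server-ip):
ssh -N -L 11434:localhost:11434 user@your-server-ip
With the tunnel open, the earlier localhost:11434 examples work unchanged from your local machine.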
For Zo Computer, you can set up a persistent service using the built-in service management.
When to use local vs cloud models
Use local LLMs when:
Privacy is critical (medical, legal, personal data)
You have predictable, high-volume usage
You need offline access
You want to experiment with fine-tuning or custom models
Use cloud APIs when:
You need frontier capabilities (complex reasoning, large context)
Usage is sporadic (pay-per-token is cheaper than idle hardware)
You need multimodal capabilities (vision, audio)
You want managed infrastructure
Many workflows combine both: use local models for routine tasks and high-volume processing, cloud models for complex reasoning when you need it.
Next steps
Once Ollama is running, you can:
Connect it to your existing tools via the OpenAI-compatible API
Build applications using the Ollama Python library
Customize model behavior (system prompt, default parameters) with Ollama's Modelfile, or import fine-tuned weights you've trained elsewhere (see the sketch below)
Set up a web UI with Open WebUI
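As a taste of the Modelfile customization mentioned above, here's a minimal sketch that wraps llama3.2 with a custom system prompt and a lower temperature (the name my-assistant is just an example):
cat > Modelfile <<'EOF'
FROM llama3.2
SYSTEM """
You are a concise assistant. Answer in at most three sentences.
"""
PARAMETER temperature 0.2
EOF
ollama create my-assistant -f Modelfile
ollama run my-assistant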
For more on AI automation, see: