I run all my AI locally now. No API keys, no usage limits, no monthly bills, no data leaving my machine. Six months ago I was spending $40-60/month on various AI APIs. Now I spend exactly zero, and for 90% of my use cases, the experience is just as good. Let me walk you through how to set this up and what to actually expect.
Why Run LLMs Locally?
There are four reasons people do this, and they matter in different proportions depending on who you are:
Privacy. When you send a prompt to ChatGPT or Claude’s API, that data travels to their servers. For personal projects, who cares. But if you’re working with client data, proprietary code, medical records, legal documents, or anything under NDA – local inference means the data never leaves your hardware. Full stop. No terms of service to read, no data processing agreements to negotiate.
Cost. API calls add up fast. If you’re making hundreds of requests a day for coding assistance, document summarization, or data extraction, you’re looking at real money. Local models have an upfront hardware cost but zero marginal cost per query. Run a million inferences and your electricity bill goes up by a few dollars.
Offline access. I travel a lot. Airports, trains, coffee shops with terrible Wi-Fi. Having a capable language model available without internet access is genuinely useful. I’ve drafted entire blog posts, debugged code, and summarized research papers on flights with no connectivity.
Customization. You can fine-tune local models on your own data, create custom system prompts without token limits, and tweak generation parameters however you want. No content filters you can’t control, no rate limits, no “I can’t help with that” when you’re doing legitimate security research or creative writing.
Getting Started with Ollama
Ollama is where I tell everyone to start. It’s the Docker of local LLMs – it abstracts away all the painful setup and gives you a clean interface to download and run models.
Installation is almost insultingly simple. On macOS, download the app from ollama.com. On Linux, it’s a one-line curl command:
curl -fsSL https://ollama.com/install.sh | sh
That’s it. No Python environment to configure, no CUDA drivers to wrestle with (Ollama handles GPU detection automatically), no dependency hell. It just works.
Once installed, pulling a model is like pulling a Docker image:
ollama pull llama3.1
This downloads Meta’s Llama 3.1 8B model, which is about 4.7GB. First download takes a few minutes depending on your connection. After that, running it is instant:
ollama run llama3.1
You’re now chatting with a local AI model. No account, no API key, no internet required after the initial download. Type your prompt, get a response. It also exposes a local API on port 11434 that’s compatible with the OpenAI API format, so you can point most AI-enabled tools at it with minimal configuration changes.
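To see what that looks like, here’s a minimal sketch of hitting the local API with curl – the native endpoint lives at /api/generate, and the OpenAI-compatible one at /v1/chat/completions (the prompt is just an example):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain quantization in one sentence.",
  "stream": false
}'

curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Explain quantization in one sentence."}]
}'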
My personal setup uses three models for different tasks: llama3.1 8B for quick questions and casual chat, codellama 13B for programming assistance, and mistral 7B for writing tasks. Switching between them is just ollama run modelname.
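Reproducing that setup is three pulls, and Ollama’s Modelfile feature is how I get the customization I mentioned earlier – a system prompt and sampling parameters baked into a named model. A rough sketch (the “editor” name and prompt are just examples):

ollama pull llama3.1
ollama pull codellama:13b
ollama pull mistral

# a Modelfile bakes a system prompt and parameters into a reusable named model
cat > Modelfile <<'EOF'
FROM mistral
PARAMETER temperature 0.4
SYSTEM """You are a blunt writing editor. Tighten prose and flag filler words."""
EOF

ollama create editor -f Modelfile
ollama run editor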
The Alternatives Worth Knowing
LM Studio is what I recommend to anyone who prefers a graphical interface. It’s a desktop app – available on Windows, Mac, and Linux – that gives you a ChatGPT-like interface for local models. You browse a model catalog, click download, and start chatting. It also lets you adjust temperature, context length, and other parameters through sliders instead of command-line flags. If “terminal” makes you nervous, start here.
llama.cpp is the engine under the hood of both Ollama and LM Studio. It’s a C/C++ implementation of LLM inference that’s been obsessively optimized for performance. If you want maximum speed, minimum memory usage, and total control over every parameter, you run llama.cpp directly. The tradeoff is that setup requires more technical knowledge – compiling from source, manually downloading model files, understanding quantization formats. I’d only recommend this to developers who enjoy that kind of tinkering.
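If you do want to go that route, the bare-metal workflow looks roughly like this – recent llama.cpp builds use CMake and ship a llama-cli binary, and you supply the GGUF model file yourself (the filename below is just an example):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# run a prompt against a GGUF file you've downloaded, e.g. from Hugging Face
./build/bin/llama-cli -m models/llama-3.1-8b-instruct-q5_k_m.gguf -p "Hello" -n 128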
vLLM is aimed at a different use case: serving models to multiple users. If you want to run a local model and expose it as an API for your team or your application, vLLM handles concurrent requests, batching, and memory management far better than Ollama. It’s production infrastructure, not a personal tool. The setup is more involved and it really wants a proper NVIDIA GPU.
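For completeness, here’s a minimal sketch of standing up a vLLM server – this assumes a CUDA-capable GPU, a working Python environment, and access to the model weights (the model name is illustrative):

pip install vllm

# start an OpenAI-compatible server (defaults to port 8000)
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct

# clients then talk to it like any OpenAI endpoint
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "prompt": "Hello",
  "max_tokens": 32
}'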
Hardware Reality Check
This is where the marketing stops and physics starts. Local LLMs need RAM – lots of it. Here’s what actually works:
RAM Requirements by Model Size
8GB RAM is the bare minimum for running 7B parameter models. You’ll be using heavily quantized versions and the experience will be… functional. Responses come in at maybe 5-8 tokens per second on CPU. Usable for simple tasks, but you’ll feel the wait on longer generations. I wouldn’t call this comfortable.
16GB RAM is where things get comfortable for 7B models and feasible for 13B models. If you have a Mac with Apple Silicon (M1 or newer), 16GB of unified memory gives you surprisingly good performance because the GPU and CPU share the same memory pool. On a 16GB M2 MacBook Air, I get 15-20 tokens per second with Llama 3.1 8B. That’s fast enough for interactive conversation.
32GB RAM opens up the really good stuff. You can run 30B+ parameter models with reasonable quantization, and 13B models at high quality. This is my sweet spot recommendation. A 32GB M2 Pro Mac Mini runs circles around what I expected when I first tried local inference.
NVIDIA GPUs change the equation dramatically. A used RTX 3090 (24GB VRAM) can be found for $700-900 and runs 30B-class models at interactive speeds, with 70B models within reach via aggressive quantization or partial CPU offload. That puts you in the same neighborhood as the models behind many commercial AI products. If you’re on a desktop and willing to invest in a GPU, this is the best performance per dollar available.
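If you’re not sure whether a model actually fits on your GPU, it’s easy to check – assuming an NVIDIA card and a reasonably recent Ollama build, something like:

# load a model in one terminal
ollama run llama3.1

# in another terminal: lists loaded models and how they're split between CPU and GPU
ollama ps

# watch VRAM usage directly
nvidia-smi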
Quantization: JPEG for AI Models
You’ll see model files labeled things like Q4_K_M, Q5_K_M, Q8_0, and wonder what that means. Here’s the simple version: quantization compresses model weights from their original precision (usually 16-bit floating point) to lower precision, reducing file size and memory requirements at the cost of some quality.
Think of it like image compression. A RAW photo might be 50MB. A high-quality JPEG is 5MB. A compressed JPEG is 500KB. They all show the same picture, but the details degrade as you compress more. Same principle with model quantization.
The practical comparison:
- Q4_K_M – About 4 bits per weight, so files are roughly a quarter of the original size. This is the most popular quantization for daily use. Quality loss is noticeable but acceptable for most tasks. A 7B model at Q4_K_M is around 4GB. Great balance of size and capability.
- Q5_K_M – About 5 bits per weight. Slightly larger files, noticeably better quality than Q4 especially on reasoning tasks. If you have the RAM headroom, this is my recommendation. A 7B model at Q5_K_M runs about 5GB.
- Q8_0 – 8 bits per weight. Minimal quality loss from the original model. File sizes are roughly half the original 16-bit size. A 7B model at Q8 is around 7.5GB. Use this when quality matters and you have plenty of RAM.
My rule of thumb: use Q5_K_M as the default. Drop to Q4_K_M if you’re memory constrained. Go Q8 only if you have RAM to spare and you’re doing tasks where precision matters, like code generation or math.
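In Ollama, the quantization is just part of the tag you pull. The exact tag names vary by model, so check the library page, but for Llama 3.1 they look roughly like this:

ollama pull llama3.1:8b-instruct-q4_K_M   # smallest, fine on memory-constrained machines
ollama pull llama3.1:8b-instruct-q5_K_M   # my default when there's RAM headroom
ollama pull llama3.1:8b-instruct-q8_0     # near-lossless, needs the most memory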
Recommended Setups by Budget
Let me be blunt about what different spending levels get you:
$0 (existing hardware): If you have a laptop with 16GB RAM made in the last 3-4 years, install Ollama and run Llama 3.1 8B at Q4_K_M. You’ll get a capable general-purpose assistant that handles writing, coding help, summarization, and brainstorming. Won’t match GPT-4, but it’s free and private.
$500-800 (used GPU): Buy a used RTX 3060 12GB or RTX 3080 10GB. Pair it with 32GB system RAM. You can now run 13B models at good quality and 7B models at maximum quality with fast generation speeds. This is genuinely useful for daily development work.
$1000-1500 (serious setup): Used RTX 3090 with 24GB VRAM, 64GB system RAM. Run 30-34B parameter models comfortably, 70B models with some offloading to CPU. At this tier, the quality gap between local and cloud AI narrows significantly. For coding, summarization, and analysis, you might not miss the API at all.
$2000+ (no compromises): RTX 4090 24GB or dual 3090s, plus 128GB system RAM. Dual 24GB cards fit 70B models entirely in VRAM at full speed; a single 4090 handles 30-34B models with room to spare. At this point you’re essentially running a personal AI server. Overkill for most people, but researchers and AI-focused developers find it worthwhile.
The best part about local LLMs isn’t the cost savings or the privacy – it’s the feeling of ownership. This model runs on my hardware, on my terms, and nobody can change the terms of service, raise prices, or shut it down. In an era where every digital service is a subscription that can be altered at any time, that independence is worth something.
Start with Ollama, pull Llama 3.1, and run your first local prompt. It takes five minutes. Once you see it working – this AI running entirely on your own machine – you’ll understand why so many developers are making the switch.