I spent two weeks running seven open-source large language models through the same battery of tests – coding tasks, creative writing, logical reasoning, factual Q&A, and instruction following. I used the same prompts for every model, tracked response quality on a simple 1-5 scale, and kept notes on the experience of actually using each one. Not benchmark scores from a leaderboard. Real-world, hands-on impressions from someone who writes code and prose for a living.
Here’s what I found. Some of it surprised me.
The Setup
My test machine: a desktop with an RTX 4090 (24GB VRAM), 64GB RAM, running Ubuntu. For models that fit, I ran them locally through Ollama (dead simple, highly recommend) and LM Studio (better UI, more configuration options). For larger models, I used quantized versions – mostly Q4_K_M, which gives you a solid balance between quality and memory usage.
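To keep the comparisons fair, every model saw an identical prompt set. Here's a minimal sketch of how you could script that kind of run with the ollama Python client – the model tags and prompts below are illustrative placeholders, not my actual test battery:

import ollama

models = ["llama3.1:8b", "mistral:7b", "phi3:14b"]   # illustrative tags, pulled beforehand
prompts = [
    "Write a Python function that merges two sorted lists.",
    "Explain why the sky is blue in two sentences.",
]

for model in models:
    for prompt in prompts:
        reply = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
        text = reply["message"]["content"]
        print(f"--- {model} ---\n{text[:300]}\n")
        # each response then gets a hand-assigned 1-5 quality score, logged next to the prompt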
A quick note on quantization for the uninitiated: it’s a compression technique that shrinks model files by reducing numerical precision. A 70-billion-parameter model that needs roughly 140GB of memory at 16-bit precision can be squeezed into ~40GB at Q4 quantization. You lose some quality, but in practice, the difference is smaller than you’d expect.
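The arithmetic behind those numbers is simple enough to sanity-check yourself. This is a rough weights-only estimate; real files add overhead for the KV cache, embeddings, and mixed-precision layers:

def estimate_model_gb(params_billions: float, bits_per_param: float) -> float:
    """Rough size of the model weights alone, in decimal GB."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

print(estimate_model_gb(70, 16))   # ~140 GB at FP16
print(estimate_model_gb(70, 4.5))  # ~39 GB at Q4_K_M (~4.5 effective bits per parameter)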
The Seven Contenders
1. Llama 3.1 (Meta) – 8B / 70B / 405B
I tested the 8B and 70B variants. The 8B is the best “small” model I’ve used, full stop. For its size, it handles instruction following, basic coding, and conversational tasks remarkably well. Runs on a laptop with 8GB VRAM. The 70B version is where things get serious – it competes with GPT-3.5-turbo on most tasks and beats it on several. Coding output is clean, reasoning is solid, and it handles long contexts without completely falling apart.
Quick verdict: The default recommendation. If you’re new to local LLMs, start here. The 8B model punches absurdly above its weight, and the 70B is genuinely production-quality for many use cases.
2. Mistral (Mistral AI) – 7B / Mixtral 8x7B
Mistral 7B was the model that proved small models could be great, and it still holds up. Fast, efficient, surprisingly good at structured output (JSON, code). The Mixtral 8x7B variant uses a mixture-of-experts architecture – a router activates only two of its eight expert networks for each token, so far fewer parameters do work per step than in a dense model of comparable quality, which is where the speed comes from. In my testing, Mixtral was the speed king. Responses came back noticeably faster than comparably sized models, and the quality on factual Q&A was excellent.
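To make the mixture-of-experts idea concrete, here's a toy sketch of that routing step. It's a deliberate simplification – the shapes, names, and random "experts" are illustrative, not Mixtral's actual implementation:

import numpy as np

def moe_layer(token_vec, experts, router_weights, top_k=2):
    """Toy mixture-of-experts forward pass for a single token."""
    scores = router_weights @ token_vec               # one score per expert
    top = np.argsort(scores)[-top_k:]                 # keep only the top-k experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over the chosen experts
    # Only the chosen experts run; the rest of the parameters stay idle for this token.
    return sum(g * experts[i](token_vec) for g, i in zip(gates, top))

dim = 16
# 8 tiny stand-in "experts" (random linear maps), mirroring Mixtral's 8-expert layout
experts = [lambda x, W=np.random.randn(dim, dim): W @ x for _ in range(8)]
router_weights = np.random.randn(8, dim)
out = moe_layer(np.random.randn(dim), experts, router_weights)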
Quick verdict: Best for speed-sensitive applications. If you need low latency and solid general capability, Mixtral is hard to beat. The base 7B is aging a bit compared to newer models, though.
3. Phi-3 (Microsoft) – 3.8B / 14B
Microsoft’s “small but mighty” pitch is real. The Phi-3 Mini at 3.8B parameters runs on a phone, and it handles basic tasks – summarization, simple Q&A, light coding – better than some 7B models. The 14B variant is genuinely impressive for reasoning tasks. Microsoft trained these on heavily curated “textbook quality” data, and you can feel it: the outputs are structured, logical, and surprisingly nuanced for the size.
Quick verdict: Best for constrained hardware. If you’re running on a laptop with no dedicated GPU, Phi-3 Mini gives you surprisingly usable AI. The 14B is a hidden gem for reasoning tasks.
4. Qwen 2.5 (Alibaba) – 7B / 72B
Qwen is the model that Western tech circles keep sleeping on. The 72B version is, in my testing, the best open-source model for coding. Period. It handled Python, JavaScript, Rust, and even obscure SQL dialects with a fluency that felt almost wrong for a free model. On my coding benchmarks, it outperformed Llama 3.1 70B by a noticeable margin. It also has excellent multilingual capabilities – not surprising given its origin, but useful if you work with non-English content.
Quick verdict: The dark horse. If coding is your primary use case, Qwen 2.5 72B should be your first choice. The 7B variant is decent but nothing special compared to Llama 3.1 8B.
5. Gemma 2 (Google) – 9B / 27B
Google’s open-source offering improved dramatically from version 1 to version 2. The 9B model is competitive with Llama 3.1 8B across most benchmarks, with slightly better performance on factual accuracy – presumably because Google’s training data advantages translate to better world knowledge. The 27B version fits a sweet spot: too large for most laptops, but very manageable on a desktop with a decent GPU. Its instruction following is among the best I tested.
Quick verdict: Best for factual accuracy. If you’re building something where getting facts right matters more than creative writing or coding, Gemma 2 27B is a strong pick. Drier and more “clinical” in its writing style, though.
6. Command R / Command R+ (Cohere) – 35B / 104B
Command R is built specifically for retrieval-augmented generation (RAG) – the technique where you feed a model relevant documents and ask it to answer questions based on them. And at that specific task, it’s phenomenal. It cites sources properly, sticks to the provided context, and resists the urge to hallucinate information that isn’t in the documents. For general chat and coding, it’s middling. But for its intended use case? Best in class.
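If RAG is new to you, the core pattern is just "retrieve, stuff into the prompt, ask." Here's a bare-bones sketch with Ollama as the backend – the keyword-overlap retrieval is purely for illustration (a real pipeline would use embeddings and a vector store), and the command-r tag assumes you've already pulled that model:

import ollama

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am to 5pm, Monday through Friday.",
]

def retrieve(question, documents, k=1):
    """Naive retrieval: rank documents by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(documents, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]

def rag_answer(question):
    context = "\n".join(retrieve(question, docs))
    prompt = (
        "Answer using ONLY the context below. If the answer isn't there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    reply = ollama.chat(model="command-r", messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]

print(rag_answer("How long do I have to return an item?"))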
Quick verdict: Specialist pick. If you’re building a RAG pipeline – a chatbot over your company docs, a research assistant, a legal document analyzer – Command R is purpose-built for you. For everything else, look elsewhere.
7. DeepSeek (DeepSeek AI) – V2.5 / V3
DeepSeek V3 deserves its own section because of what it represents. This is a 671-billion parameter mixture-of-experts model that, using clever engineering, reportedly cost only $5.5 million to train – a fraction of what GPT-4 or Gemini cost. In my testing (using the full model through an API and the smaller distilled variants locally), DeepSeek V3 is the closest thing to GPT-4 performance in the open-source world. Reasoning is exceptional. It handles multi-step math and logic problems that trip up every other model on this list. Creative writing is vivid and varied.
Quick verdict: Most impressive overall. The full model is too large for consumer hardware, but the distilled versions and API access make it usable. If you want the best raw intelligence available in open source, this is it.
[Chart: benchmark scores out of 10 for reasoning and chat quality across the top open-source LLMs, based on community benchmarks and independent testing.]
The “Best For” Breakdown
- Best for coding: Qwen 2.5 72B, followed closely by DeepSeek V3
- Best for creative writing: DeepSeek V3, then Llama 3.1 70B
- Best for reasoning/math: DeepSeek V3, with Phi-3 14B as a lightweight alternative
- Best for RAG/document Q&A: Command R, no contest
- Best for constrained hardware: Phi-3 Mini (3.8B) or Llama 3.1 8B
- Best for speed: Mixtral 8x7B
- Best for factual accuracy: Gemma 2 27B
- Best all-rounder: Llama 3.1 70B (most balanced across all tasks)
Hardware Reality Check
Let me be blunt about hardware requirements, because too many articles gloss over this:
- 8B models (Llama 3.1 8B, Mistral 7B, Phi-3 Mini): 8GB VRAM minimum. A MacBook with M1/M2/M3 handles these comfortably. An RTX 3060 works fine.
- 13-27B models (Phi-3 14B, Gemma 2 27B): 16GB VRAM recommended. M2 Pro/Max MacBooks or RTX 4070+ territory.
- 70B models (Llama 3.1 70B, Qwen 2.5 72B): 40GB+ of memory at Q4 quantization. That means a 48GB card like the A6000, a Mac with plenty of unified memory (M2 Ultra territory), two consumer GPUs, or an RTX 4090 with part of the model offloaded to system RAM – workable, but noticeably slower.
- 100B+ models (Command R+ 104B, DeepSeek V3): Multiple high-end GPUs or cloud instances. Not practical for most individuals locally.
Ollama vs. LM Studio – Which Should You Use?
Both are excellent. Ollama is command-line-first, lightweight, and brilliant for developers who want to integrate local LLMs into their workflow. Install it, run ollama run llama3.1, and you’re chatting in seconds. It also exposes an OpenAI-compatible API, making it trivial to swap a local model into any project that uses the OpenAI SDK.
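That OpenAI-compatible API is the killer feature for developers. Here's roughly what the swap looks like with the official openai Python package, assuming Ollama is running on its default port (11434) and llama3.1 has already been pulled:

from openai import OpenAI

# Point the OpenAI client at the local Ollama server instead of api.openai.com
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # a key is required but unused

resp = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Summarize what quantization does in one sentence."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)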
LM Studio has a proper GUI, lets you browse and download models from a built-in catalog, and gives you more control over generation parameters (temperature, top-p, repetition penalty). It’s better for experimentation and non-technical users. I use Ollama for daily work and LM Studio when I want to tinker with settings.
The Honest Bottom Line
Open-source LLMs in early 2026 are good enough for a surprising range of real work. Not “good enough for a free thing” – genuinely good enough, period. Qwen 2.5 72B writes better code than GPT-3.5-turbo ever did. DeepSeek V3 reasons through complex problems at a level that was exclusive to frontier closed models eighteen months ago.
But – and this matters – they’re not GPT-4o or Claude Opus. The gap has shrunk from a canyon to a crack, but it’s still there, especially on nuanced reasoning, long-form coherence, and the ability to follow complex, multi-constraint instructions precisely. If your use case demands the absolute best quality and cost isn’t a concern, closed models still win.
Where open-source wins: privacy (your data never leaves your machine), cost (no per-token fees), customization (fine-tune to your specific domain), and availability (no rate limits, no outages, no sudden API changes). For many businesses and developers, those advantages outweigh the quality gap.
My personal setup? I run Llama 3.1 8B through Ollama for quick tasks throughout the day. For serious coding sessions, I switch to Qwen 2.5 72B. And when I need the absolute best output for client work, I still reach for Claude or GPT-4o. That hybrid approach – local models for the 80% and frontier models for the 20% – feels like the sweet spot right now.
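If you wanted to formalize that 80/20 split, it can be as simple as a routing function that sends most traffic to the local model and escalates only when a task looks hard. This is a hypothetical heuristic to illustrate the idea, not part of my actual workflow:

def pick_backend(prompt: str) -> str:
    """Crude router: escalate long or high-stakes prompts to a frontier model."""
    hard_signals = ["refactor", "legal", "multi-step", "prove", "client deliverable"]
    if len(prompt) > 2000 or any(s in prompt.lower() for s in hard_signals):
        return "frontier-api"   # e.g. Claude or GPT-4o via a hosted API
    return "local-ollama"       # e.g. Llama 3.1 8B running locally

print(pick_backend("Rename this variable across the file."))             # local-ollama
print(pick_backend("Refactor this module and prove it's thread-safe."))  # frontier-api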
The open-source LLM space is moving fast enough that this article will probably be outdated in six months. And honestly? That’s the most exciting part.