Three months ago, I set out to fine-tune a large language model on our company’s internal documentation. What followed was one of the most humbling learning experiences of my career. I’d read the blog posts, watched the YouTube tutorials, and figured I could knock it out in a weekend. It took six weeks, three failed attempts, and a near-total rethink of my approach before I got something that actually worked.
Here’s everything I wish someone had told me before I started.
First Question: Do You Actually Need Fine-Tuning?
Before you touch a single training script, you need a brutally honest conversation with yourself about whether fine-tuning is even the right approach. I wasted two weeks because I skipped this step.
Here’s the decision framework I use now:
Fine-Tuning Decision Tree: Prompt Engineering → RAG → Fine-Tuning – choose the right approach before investing time and money.
- Prompt engineering – Try this first. Always. If you can get 80% of what you need by crafting better prompts, system messages, and few-shot examples, stop there. It’s cheaper, faster, and infinitely more maintainable.
- RAG (Retrieval-Augmented Generation) – If your problem is that the model doesn’t know your data, RAG is almost certainly the better path. Embed your documents, retrieve relevant chunks, feed them into the context window. For most knowledge-base applications, this beats fine-tuning hands down. (There’s a minimal RAG sketch right after this list.)
- Fine-tuning – This is for when you need the model to behave differently, not just know different things. Specific tone, particular output formats, domain-specific reasoning patterns, or tasks where latency from retrieval is unacceptable.
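To make the RAG option concrete, here’s roughly what that path looks like in code. This is a minimal sketch, not what we shipped: the embedding model, the hard-coded chunks, and the prompt template are all placeholder choices.

```python
# Minimal RAG sketch: embed documents, retrieve the closest chunks, and stuff
# them into the prompt. Model name, chunks, and prompt wording are illustrative.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast embedding model

# In practice these chunks come from your Confluence/Google Docs export.
chunks = [
    "Our deployment pipeline uses blue-green releases...",
    "The internal summary format starts with a one-line TL;DR...",
]
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)

def build_prompt(question: str, top_k: int = 3) -> str:
    """Retrieve the most relevant chunks and prepend them to the question."""
    query_embedding = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, chunk_embeddings, top_k=top_k)[0]
    context = "\n\n".join(chunks[hit["corpus_id"]] for hit in hits)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("What does our summary format look like?"))
```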
My actual use case – generating technical summaries in a very specific internal format with domain jargon – turned out to be a legitimate fine-tuning candidate. But I could have saved myself grief by trying RAG more seriously first.
Data Preparation: The 80% You Don’t Want to Hear About
Everyone talks about model selection and hyperparameters. Nobody wants to talk about the boring part: your data is probably garbage, and fixing it will consume most of your project timeline.
Our internal docs consisted of roughly 12,000 pages across Confluence, Google Docs, and some ancient SharePoint archives. I initially thought I could just dump them into a training pipeline. That was mistake number one.
What I actually had to do:
- Filter out duplicates, outdated pages, and placeholder content – this eliminated about 40% of the corpus.
- Convert everything into clean instruction-response pairs. This is tedious, manual work. I ended up with 2,847 high-quality training examples from those 12,000 pages. (The formatting and split step is sketched after this list.)
- Validate a random sample of 200 pairs by hand. Found errors in 23% of them. Back to step two.
- Create a held-out test set of 300 examples that the model would never see during training.
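For reference, the formatting-and-split step boils down to something like the sketch below. The field names, file paths, and the TL;DR-style response are illustrative assumptions; use whatever schema your training framework expects.

```python
# Sketch of pair formatting plus a held-out split. "instruction"/"response"
# field names and the file paths are assumptions, not a fixed requirement.
import json
import random

pairs = [
    {
        "instruction": "Summarize the following incident report in our internal format.",
        "response": "TL;DR: ...\nImpact: ...\nRoot cause: ...",
    },
    # ... a few thousand more, built by hand from the cleaned docs
]

random.seed(42)          # deterministic split so the test set never leaks into training
random.shuffle(pairs)
test_set, train_set = pairs[:300], pairs[300:]

for path, rows in [("train.jsonl", train_set), ("test.jsonl", test_set)]:
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
```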
This process took three weeks. The actual training? A few hours.
Choosing a Base Model and Method
I started with Llama 2 7B because it was the default recommendation everywhere at the time. For my task, it turned out that Mistral 7B performed noticeably better as a starting point – about 15% higher on my evaluation metrics before any fine-tuning. Base model selection matters more than most guides suggest.
For the fine-tuning method, I went with QLoRA (Quantized Low-Rank Adaptation). Full fine-tuning of a 7B model requires serious GPU memory – we’re talking 60+ GB of VRAM. QLoRA let me train on a single A100 40GB, and honestly, you can get away with a single RTX 4090 (24GB) for 7B models. You do not need an H100 cluster. I rented a single A100 instance on Lambda Labs for about $1.10/hour. Total compute cost for my final successful run: around $47.
Key QLoRA settings that worked for me: rank 64, alpha 128, dropout 0.05, targeting all linear layers. I tried rank 16 first – too constrained for the behavioral changes I needed.
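In code, that setup looks roughly like the following, assuming the Hugging Face Transformers + PEFT + bitsandbytes stack I describe later and Mistral 7B as the base. The target module names are the standard linear projections for Llama/Mistral-style architectures; adjust them if your base model differs.

```python
# QLoRA sketch: load the base model in 4-bit, then attach LoRA adapters with
# the settings above (rank 64, alpha 128, dropout 0.05, all linear layers).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only a small fraction is trainable
```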
The Three Failures
Attempt One: Overfitting on Small Data
My first training run used only 800 examples. The model memorized them almost perfectly – training loss dropped to near zero – and then produced incoherent outputs on anything outside that narrow set. My evaluation score: 0.31 on the held-out test set. Basically useless.
Attempt Two: Catastrophic Forgetting
I overcorrected by training for too many epochs (8) on the larger dataset. The model got decent at my specific task but lost the ability to write coherent English in general contexts. It would output our internal format structure even when asked simple questions. Training for 3 epochs with a lower learning rate (2e-5 instead of 1e-4) fixed this.
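Expressed as Hugging Face TrainingArguments, the settings that finally behaved look something like this sketch. The epoch count and learning rate are the ones above; the batch size, scheduler, and warmup are illustrative defaults rather than tuned recommendations.

```python
# Training settings sketch: 3 epochs at 2e-5, checkpointing every 100 steps.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qlora-run",
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    logging_steps=10,
    save_steps=100,          # checkpoint often; the best one is rarely the last
    bf16=True,
    report_to="wandb",       # experiment tracking, covered in the tooling section
)
```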
Attempt Three: The Data Quality Wall
Even with better hyperparameters, I plateaued at 0.64 on my eval metrics. That’s when I went back and audited the training data again. Found that about 15% of my instruction-response pairs had subtle issues: responses that were technically correct but stylistically inconsistent, instructions that were ambiguous, or examples that contradicted each other. Cleaning those up and retraining pushed me to 0.82. Data quality was the bottleneck, not the model or the method.
What Actually Worked: Lessons as a List
- Start with evaluation. Define how you’ll measure success before you train anything. I used a combination of automated metrics (ROUGE-L, a custom format-compliance scorer) and human evaluation on 50 randomly sampled outputs. There’s a minimal scoring sketch after this list.
- More data isn’t always better. Cleaner data almost always is. Going from 2,847 messy examples to 2,400 clean ones improved performance more than adding 1,000 new examples did.
- Train for fewer epochs than you think. For QLoRA on instruction-following tasks, 2-4 epochs is usually the sweet spot. Beyond that, you’re memorizing.
- Save checkpoints every 100 steps and evaluate each one. My best checkpoint was rarely the final one.
- Merge and quantize for deployment. After training, I merged the LoRA weights back into the base model and quantized to GGUF format for inference. Runs on much cheaper hardware. The merge step is sketched after this list.
- Version your datasets like you version your code. I cannot stress this enough. When something breaks, you need to know what changed.
- Test for regressions on general capabilities. Run the fine-tuned model through a standard benchmark (I used a subset of MMLU) to make sure you haven’t broken core reasoning.
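For the evaluation point above, here’s a minimal sketch of the kind of scorer I mean: ROUGE-L from the evaluate library plus a toy format-compliance check. The required section headers here are invented for illustration; substitute whatever your internal format actually demands.

```python
# Evaluation sketch: ROUGE-L plus a simple format-compliance score.
# REQUIRED_SECTIONS is a hypothetical format, not our real one.
import evaluate

rouge = evaluate.load("rouge")
REQUIRED_SECTIONS = ["TL;DR:", "Impact:", "Root cause:"]

def format_compliance(text: str) -> float:
    """Fraction of required section headers present in the output."""
    return sum(h in text for h in REQUIRED_SECTIONS) / len(REQUIRED_SECTIONS)

def score(predictions: list[str], references: list[str]) -> dict:
    rouge_l = rouge.compute(predictions=predictions, references=references)["rougeL"]
    compliance = sum(format_compliance(p) for p in predictions) / len(predictions)
    return {"rougeL": rouge_l, "format_compliance": compliance}

print(score(
    ["TL;DR: db outage\nImpact: none\nRoot cause: config"],
    ["TL;DR: database outage\nImpact: minimal\nRoot cause: bad config"],
))
```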
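And for the merge-and-quantize step, the merge itself is only a few lines with PEFT. The checkpoint path is a placeholder, and the GGUF conversion afterwards is handled by llama.cpp’s conversion script, whose exact name varies by version, so check your checkout.

```python
# Merge the LoRA adapter back into the base model before deployment.
# "qlora-run/checkpoint-best" is a placeholder for whichever checkpoint won.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "qlora-run/checkpoint-best").merge_and_unload()
merged.save_pretrained("merged-model")  # then convert/quantize (e.g. to GGUF) for cheap inference
```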
The Tooling That Made It Manageable
- Hugging Face Transformers and PEFT – the foundation. Solid documentation, wide community support, and the PEFT library makes LoRA/QLoRA straightforward to implement.
- Axolotl – a wrapper that handles a lot of the configuration boilerplate. I switched to it after my second attempt and it eliminated several classes of mistakes I was making with raw training scripts. Config-driven training is significantly less error-prone than writing custom loops.
- Unsloth – this one surprised me. It patches the training pipeline for speed improvements, and I measured a genuine 1.8x speedup on my A100 runs. The memory savings also meant I could increase my batch size, which helped training stability.
- Weights & Biases – for tracking experiments. When you’re on your fifth training run with slightly different hyperparameters, you will lose track without proper logging. Non-negotiable.
Is It Worth It? An Honest Assessment
After six weeks of work and roughly $200 in total compute costs (including all the failed runs), my fine-tuned model scores 0.82 on our evaluation suite compared to 0.54 for the base model with optimized prompts and 0.71 for RAG with the same base model.
Is that gap worth it? For our use case – generating hundreds of these summaries daily – yes. The format compliance alone saves our team about 12 hours of editing per week. Over a year, that easily justifies the development investment.
But I’ll be direct: for probably 70% of the use cases I see people attempting to fine-tune for, prompt engineering or RAG would get them close enough at a fraction of the effort. Fine-tuning is a precision tool. Reach for it when you’ve genuinely exhausted the simpler options, when you need behavioral changes rather than knowledge injection, and when you have enough quality data to make it work.
The uncomfortable truth about fine-tuning is that the model is the easy part. The hard part is being honest about your data – how much of it is actually good, how consistent it is, and whether it truly represents what you want the model to learn. Get that right, and the rest is almost mechanical.