Fine-tuning is the answer to a question most enterprises aren't actually asking.
The pitch is intuitive. Your model doesn't know your business. Take your business data, train the model on it, now the model knows your business. The vendors offering this are well-funded, the demos are persuasive, and "we fine-tuned a model on our data" sounds like a sentence a serious AI team should be saying.
First, in plain terms, because the words get thrown around loosely. A model like the one behind ChatGPT arrives pre-built. It was trained once, at enormous cost, on a huge slice of the public internet. "Fine-tuning" (also sold as "training the AI on our data") means taking that finished model and training it a bit further on your documents, so its internal wiring shifts toward your world. The promise is that your knowledge ends up inside the model, and from then on the model simply knows. That's the part worth examining, because it's where the intuition quietly breaks.
For the problem most enterprises actually have, fine-tuning is the wrong tool. What follows isn't a blanket dismissal. It's a sharp distinction between what fine-tuning is good at and what people are using it for.
What fine-tuning is actually good at
Let's give the technique its due. Fine-tuning genuinely works for:
- Style and format. Teaching a model to respond in a specific tone, structure, or schema. This is what most successful fine-tunes do.
- Domain language. Teaching a model the surface vocabulary of medicine, law, or a specific technical field. The facts the model already had; the words it needed help with.
- Narrow task specialization. Code generation in a specific framework, document classification in a specific taxonomy, agent tool-calling in a specific format.
- Guardrails and refusals. Shaping what the model will and won't do.
These are formatting and behavior problems. Fine-tuning is a reasonable solution for them.
What people actually want fine-tuning to do
Almost everyone we talk to wants fine-tuning to do something different. They want to bake their institutional knowledge into the model so it answers questions correctly about their business. The customer relationships. The product taxonomy. The org-specific definitions. The exception cases. The current state of the data.
This is the problem fine-tuning is bad at. And it's bad at it for structural reasons that don't go away with more compute.
Five reasons it doesn't work
1. The frequency is wrong.
Institutional knowledge changes constantly. Sales reorganizes a region. Finance updates a definition. A new product launches. A customer gets renamed after an acquisition. A reporting line gets deprecated. None of this is a quarterly event. It's daily.
A fine-tuning loop is a pipeline. Curate the data, run the training, evaluate, deploy, monitor. The fastest serious team can do this in days; most do it in weeks or months. By the time the new fine-tune lands, the truth has moved.
You end up with a model that confidently recites a version of your business that no longer exists.
2. You can't audit a fact in a weight.
When the controller asks why the agent gave the wrong number for Q3 revenue, the answer is buried in the model's "weights": billions of internal numbers, tuned during training, that no person can read or trace. You can't open it. You can't point to the line that taught it the wrong thing. You can't roll back one fact without rolling back the entire fine-tune.
For most enterprise use cases this is disqualifying on its own. Compliance, finance, legal, healthcare: these workflows require knowing where an answer came from. A fine-tuned weight has no provenance. It's a probability distribution that happened to favor the right token. That isn't an audit trail.
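To make the contrast concrete, here is a minimal sketch of what "provenance" can look like when a fact lives outside the model. It is an illustration, not a reference design: the `Fact` and `FactLog` names, the source ids, and the revenue figures are all made up for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    key: str          # what the fact is about
    value: str
    source: str       # hypothetical document id the value came from
    approved_by: str  # who signed off on it
    as_of: date


class FactLog:
    """Append-only history, so a single fact can be traced or rolled back."""

    def __init__(self) -> None:
        self._history: dict[str, list[Fact]] = {}

    def record(self, fact: Fact) -> None:
        self._history.setdefault(fact.key, []).append(fact)

    def current(self, key: str) -> Fact:
        return self._history[key][-1]

    def rollback(self, key: str) -> Fact:
        """Drop the latest version of one fact; every other fact is untouched."""
        versions = self._history[key]
        if len(versions) < 2:
            raise ValueError(f"no earlier version of {key!r} to roll back to")
        versions.pop()
        return versions[-1]


# Illustrative values only: the second entry is the "wrong number".
log = FactLog()
log.record(Fact("q3_revenue_usd", "41.2M", "finance/close-q3-v1", "controller", date(2025, 10, 3)))
log.record(Fact("q3_revenue_usd", "14.2M", "finance/close-q3-v2", "analyst", date(2025, 10, 9)))

bad = log.current("q3_revenue_usd")
print(f"{bad.value} came from {bad.source}, approved by {bad.approved_by}")

restored = log.rollback("q3_revenue_usd")
print(f"rolled back to {restored.value} from {restored.source}")
```

The controller's question has an answer here: which entry, from which source, approved by whom. Undoing it touches one key. A weight offers neither.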
3. Architectures get outdated, and your fine-tune dies with them.
The base model gets a new release every few months, and every few quarters one of those releases changes the architecture itself. Llama 3 to Llama 4. Dense to mixture-of-experts. New attention schemes, new positional encodings, new context-window mechanics, new tokenizers. Each such shift makes the previous architecture's weights structurally incompatible with the new one.
When that happens, the work you did doesn't carry over. A fine-tune is welded to the shape of the specific model generation it was trained on. You don't get to "upgrade" it. You re-do it: re-curate the data, re-train, re-evaluate, re-deploy, re-validate against months of accumulated corrections. In the lighter-weight cases this is more annoyance than catastrophe, and tooling has made it faster. But it is never free, and it is never finished. Every time the frontier moves, you run the pipeline again.
So the choice on every architecture shift is: stay on the older base model (forgo the new capabilities, eventually lose support, watch competitors pull ahead) or pay to redo the fine-tune on a recurring schedule, indefinitely. Either way, your institutional knowledge is structurally chained to whichever model generation you committed to.
Compare to a knowledge layer that sits outside the model: when a new architecture ships, you swap the model underneath and the layer is unchanged. Years of accumulated corrections, terminology, and entity mappings carry forward intact. The model becomes a part you can swap out, not a dependency you're married to.
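A rough sketch of that separation, assuming the model is reached through a single text-in, text-out call. The `Model` protocol, the two stub models, and the sample definition are hypothetical stand-ins, not any vendor's API.

```python
from typing import Protocol


class Model(Protocol):
    def complete(self, prompt: str) -> str: ...


class OldModel:
    """Stub standing in for the model generation you started on."""

    def complete(self, prompt: str) -> str:
        return f"[old-model] {prompt}"


class NewModel:
    """Stub standing in for a later generation with a different architecture."""

    def complete(self, prompt: str) -> str:
        return f"[new-model] {prompt}"


# The layer: plain data kept outside any model (the entry is made up).
DEFINITIONS = {
    "active customer": "a customer with a paid invoice in the last 90 days",
}


def build_prompt(question: str) -> str:
    """Prepend every definition whose term appears in the question."""
    notes = [
        f"{term}: {meaning}"
        for term, meaning in DEFINITIONS.items()
        if term in question.lower()
    ]
    return "\n".join(notes + [question])


def answer(model: Model, question: str) -> str:
    return model.complete(build_prompt(question))


question = "How many active customer accounts do we have?"
print(answer(OldModel(), question))
print(answer(NewModel(), question))  # same layer, different model underneath
```

Nothing in `DEFINITIONS` or `build_prompt` knows which model it is feeding, which is the point: the upgrade is a one-line change at the call site.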
4. The contamination problem.
A fine-tuned model is correct on the things it was trained on and quietly wrong on the things next to them. The wrongness is the dangerous part: it's confident, fluent, and looks exactly like the right answers. Six months in, no one can reliably tell which answers came from your data and which the model simply made up in your house style. Neither can the model.
This is not a UX problem. It's a governance problem. You've shipped a system whose behavior is shaped by a training run nobody can fully introspect.
5. The economics are upside-down.
Neither approach is free, so be honest about where the money goes. A knowledge layer has a real cost too: people have to define the terms, map the entities, and keep the structure current. But that's work you can see, and most of it is done once and reused. A correction to the layer is a database write that takes effect immediately and stays fixed. Fine-tuning, by contrast, charges you again every cycle: corpus size, training runs, model size, and a full re-do each time the model underneath changes. As your business grows, the layer gets cheaper per correction because the structure is built once and reused; fine-tuning gets more expensive with every cycle, forever.
Put plainly: one approach asks you to maintain something you own and can inspect. The other asks you to keep repurchasing a result you can't.
What about RL fine-tuning?
The current frontier of "fine-tuning" (RL on weights from expert preferences) is more sophisticated than supervised fine-tuning. It's a real research bet and we take it seriously.
But the structural objections still apply. RL fine-tuning still bakes facts into weights, still requires retraining cycles, still produces models you can't audit fact-by-fact, still ties you to a base model's release cadence. The technique is better; the architectural fit for institutional knowledge isn't.
For some problems (agent tool-use behavior, complex multi-step planning, format adherence under pressure), RL fine-tuning is the right tool. For "the model should know what we mean by ACV in Q3 of FY26," it isn't.
"Isn't this just RAG?"
If you've spent any time around AI vendors, you've heard the other answer to "the model doesn't know our business": RAG, short for retrieval-augmented generation. The idea is simpler than the name. Instead of baking facts into the model, you leave the facts in a searchable store, look up the relevant ones at the moment a question is asked, and hand them to the model to read. The model stays generic; the knowledge stays outside it, where you can see and change it. On the core question of this post, RAG already has the right instinct: don't train facts into weights.
So why don't we just say "use RAG and move on"? Because basic RAG only solves the easy half. It can find a document that mentions "Acme." It cannot tell you which Acme, what your company specifically means by "active customer," which version of a policy supersedes the others, or how the routine case differs from the exception. Hand a model ten documents that disagree and it will confidently pick one. The hard part was never fetching text. It's resolving your business into something unambiguous before the model ever sees it.
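Here is a deliberately naive sketch of the "easy half." It uses keyword overlap in place of a real search index, and the documents and ids are invented for the example, but the failure it shows does not depend on how good the search is.

```python
import re
from dataclasses import dataclass

STOPWORDS = {"a", "an", "is", "the"}


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str


# Made-up documents: two different companies that both answer to "Acme".
DOCS = [
    Doc("crm-001", "Acme Corp renewed its contract and is an active customer."),
    Doc("crm-002", "Acme Ltd churned last year and is no longer an active customer."),
    Doc("hr-017", "The travel policy was updated in March."),
]


def tokens(text: str) -> set[str]:
    """Lowercase words with punctuation and a few stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def retrieve(query: str, docs: list[Doc]) -> list[Doc]:
    """Return every document that shares at least one word with the query."""
    wanted = tokens(query)
    return [d for d in docs if wanted & tokens(d.text)]


hits = retrieve("Is Acme an active customer?", DOCS)
for d in hits:
    print(d.doc_id, "->", d.text)
```

Both Acme records come back, and they disagree. Retrieval did its job; nothing in it knows which Acme the question meant.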
That's the layer we're describing. It's the same instinct as RAG, taken to where the real difficulty is.
The architectural alternative
The right place for institutional knowledge is a layer between your data and the model:
- Entity resolution so the agent knows which Acme is which.
- Terminology and definitions so the agent uses your meaning of "active customer."
- Process and exception structure so the agent handles the common case and the edge cases differently.
- Authoritativeness and recency so the agent knows which document supersedes which.
- Corrections from experts that update the layer immediately, no retraining needed.
The model becomes the runtime, not the knowledge store. When the model improves, you keep the layer. When the layer changes, you don't touch the model. Each piece does what it's good at.
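Two of those pieces, entity resolution and "which document supersedes which," can be sketched in a few lines. This is a toy under stated assumptions: the alias table, customer ids, and policy versions are invented, and a real layer would need far more than exact-match lookups.

```python
from dataclasses import dataclass
from datetime import date

# Entity resolution: every known alias maps to exactly one canonical id.
ENTITY_ALIASES = {
    "acme corp": "cust-1001",
    "acme corporation": "cust-1001",
    "acme ltd": "cust-2002",
}


@dataclass(frozen=True)
class Policy:
    name: str
    version: int
    effective: date
    text: str


# Authoritativeness and recency: versions carry the date they take effect.
POLICIES = [
    Policy("refunds", 1, date(2024, 1, 1), "Refunds within 30 days."),
    Policy("refunds", 2, date(2025, 6, 1), "Refunds within 14 days."),
]


def resolve_entity(mention: str) -> str:
    """Map a mention to one canonical id, or fail loudly instead of guessing."""
    key = mention.strip().lower()
    if key not in ENTITY_ALIASES:
        raise LookupError(f"unresolved entity: {mention!r}")
    return ENTITY_ALIASES[key]


def policy_in_force(name: str, on: date) -> Policy:
    """Pick the latest version that was already effective on the given date."""
    candidates = [p for p in POLICIES if p.name == name and p.effective <= on]
    if not candidates:
        raise LookupError(f"no {name!r} policy in force on {on}")
    return max(candidates, key=lambda p: p.effective)


print(resolve_entity("Acme Corporation"))
print(policy_in_force("refunds", date(2025, 7, 1)).text)
```

Note what happens with a bare "Acme": the lookup raises rather than picking one. The ambiguity is resolved, or surfaced, before the model ever sees the question.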
When you actually should fine-tune
To be specific: fine-tune the model when you need consistent format, tone, or task behavior that prompting can't reliably get you. Don't fine-tune the model to teach it facts about your business. The first is what fine-tuning is for. The second is a category error that gets repeated because the pitch is intuitive and the alternative is harder to describe.
If you've been considering a fine-tuning project to make the model "know your business better," the question worth asking before signing the contract is: when this fine-tune is wrong six months from now, and an expert tells you the answer is wrong, what is your plan to fix it?
If the plan is "another fine-tune," you've signed up for an indefinite training pipeline.
If the plan is "we'll update the knowledge layer," you didn't need the fine-tune in the first place.