Why Fine-Tuning Won't Teach Your Model Your Business

Fine-tuning is a great tool for the wrong problem. Here's why baking enterprise knowledge into model weights is the most expensive way to ship a system you can't audit, version, or trust.

Fine-tuning is the answer to a question most enterprises aren't actually asking.

The pitch is intuitive. Your model doesn't know your business. Take your business data, train the model on it, now the model knows your business. The vendors offering this are well-funded, the demos are persuasive, and "we fine-tuned a model on our data" sounds like a sentence a serious AI team should be saying.

We think it's a mistake for the problem most enterprises have, and we want to lay out why — not as a blanket dismissal, but as a sharp distinction between what fine-tuning is good at and what people are using it for.

What fine-tuning is actually good at

Let's give the technique its due. Fine-tuning genuinely works for:

  • Style and format. Teaching a model to respond in a specific tone, structure, or schema. This is what most successful fine-tunes do.
  • Domain language. Teaching a model the surface vocabulary of medicine, law, or a specific technical field. The model already had the facts; what it needed help with was the words.
  • Narrow task specialization. Code generation in a specific framework, document classification in a specific taxonomy, agent tool-calling in a specific format.
  • Guardrails and refusals. Shaping what the model will and won't do.

These are formatting and behavior problems. Fine-tuning is a reasonable solution for them.
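
To make the distinction concrete, here is a minimal sketch of what a style-and-format training record looks like, written as Python that emits one line of a JSONL training file. The chat-messages schema follows the convention used by several hosted fine-tuning APIs; the field contents are invented for illustration. Notice what the record teaches: tone and output shape, not facts about the business.

    import json

    # One training record in the chat-messages JSONL convention used by
    # several hosted fine-tuning APIs. Everything here is invented for
    # illustration. Note what it teaches: tone and output schema, not
    # facts; the specifics are left for the runtime to supply.
    record = {
        "messages": [
            {"role": "system",
             "content": "You are a support agent. Reply in JSON with keys "
                        "'summary' and 'next_step'. Keep summaries under 20 words."},
            {"role": "user",
             "content": "Customer says the invoice total looks wrong."},
            {"role": "assistant",
             "content": json.dumps({
                 "summary": "Customer disputes invoice total.",
                 "next_step": "Route to billing with the invoice ID.",
             })},
        ]
    }

    # Each line of the training file is one such record.
    with open("train.jsonl", "w") as f:
        f.write(json.dumps(record) + "\n")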

What people actually want fine-tuning to do

Almost everyone we talk to wants fine-tuning to do something different. They want to bake their institutional knowledge into the model so it answers questions correctly about their business. The customer relationships. The product taxonomy. The org-specific definitions. The exception cases. The current state of the data.

This is the problem fine-tuning is bad at. And it's bad at it for structural reasons that don't go away with more compute.

Five reasons it doesn't work

1. The frequency is wrong.

Institutional knowledge changes constantly. Sales reorganizes a region. Finance updates a definition. A new product launches. A customer gets renamed after an acquisition. A reporting line gets deprecated. None of this is a quarterly event — it's daily.

A fine-tuning loop is a pipeline: curate the data, run the training, evaluate, deploy, monitor. The fastest serious teams can do this in days; most take weeks or months. By the time the new fine-tune lands, the truth has moved.

You end up with a model that confidently recites a version of your business that no longer exists.

2. You can't audit a fact in a weight.

When the controller asks why the agent gave the wrong number for Q3 revenue, the answer is somewhere in the weights of a 70B-parameter model. You can't open it. You can't point to the line that taught it the wrong thing. You can't roll back one fact without rolling back the entire fine-tune.

For most enterprise use cases this is disqualifying on its own. Compliance, finance, legal, healthcare — these workflows require knowing where an answer came from. A fine-tuned weight has no provenance. It's a probability distribution that happened to favor the right token. That isn't an audit trail.

3. Architectures get outdated, and your fine-tune dies with them.

The base model gets a new release every few months, and every few quarters the architecture underneath shifts with it: Llama 3 to Llama 4, dense to mixture-of-experts, new attention schemes, new positional encodings, new context-window mechanics, new tokenizers. Each shift moves the frontier in a way that makes the previous architecture's weights structurally incompatible with the new one.

When that happens, a LoRA adapter trained on the old architecture doesn't load onto the new one. A full fine-tune is a binary tied to the previous generation's parameter shape. You don't get to "upgrade" the fine-tune. You re-do it from scratch — re-curate the data, re-train, re-evaluate, re-deploy, re-validate against months of accumulated corrections. And then you do it again the next time the frontier moves.
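
Here is a toy sketch of why that happens, in PyTorch with made-up dimensions standing in for real model shapes. A LoRA adapter is a pair of low-rank matrices sized to one architecture's layers, and the update it encodes simply cannot be applied to a layer of a different width.

    import torch

    # Toy illustration, with made-up dimensions, of why an adapter is
    # chained to one architecture. A LoRA adapter stores low-rank factors
    # A and B whose shapes come from the base model's layer dimensions.
    old_hidden, new_hidden, rank = 4096, 5120, 16

    # Adapter trained against the old generation's projection layer.
    lora_A = torch.randn(rank, old_hidden)   # (r, d_in) for the old model
    lora_B = torch.randn(old_hidden, rank)   # (d_out, r) for the old model

    # The new generation's projection layer has a different width.
    new_proj = torch.randn(new_hidden, new_hidden)

    # Applying the old adapter's update (B @ A) to the new layer fails:
    # the delta is 4096 x 4096 but the layer is 5120 x 5120.
    delta = lora_B @ lora_A
    try:
        new_proj += delta
    except RuntimeError as err:
        print("adapter does not load:", err)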

So the choice on every architecture shift is: stay on the older base model (forgo the new capabilities, eventually lose support, watch competitors pull ahead) or pay the full fine-tune cost on a recurring schedule indefinitely. Either way, your institutional knowledge is structurally chained to whichever model generation you committed to.

Compare to a knowledge layer that sits outside the model: when a new architecture ships, you swap the runtime and the layer is unchanged. Years of accumulated corrections, terminology, and entity mappings carry forward intact. The model becomes a swappable component instead of a multi-million-dollar dependency.

4. The contamination problem.

A fine-tuned model is correct on the things it was trained on and quietly wrong on the things adjacent to them. The wrongness is the dangerous part — it's confident, fluent, and looks just like the right answers. Six months in, you can't reliably tell which answers came from the fine-tune and which came from the base model's prior. Neither can the model.

This is not a UX problem. It's a governance problem. You've shipped a system whose behavior is shaped by a training run nobody can fully introspect.

5. The economics are upside-down.

Fine-tuning costs scale with corpus size, training frequency, and model size. Knowledge-layer corrections cost roughly nothing: each one is a database write. As your enterprise grows and its institutional knowledge expands, fine-tuning gets linearly more expensive, while a knowledge layer gets cheaper per correction because the schema cost is amortized across every update.
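
To make "a database write" literal, here is a minimal sketch of a correction landing in a knowledge layer, using SQLite with an invented schema; the table and field names are illustrative, not any specific product's.

    import sqlite3
    from datetime import datetime, timezone

    # A correction as a literal database write. The schema and field
    # names are invented for this sketch, not any specific product's.
    db = sqlite3.connect("knowledge.db")
    db.execute("""
        CREATE TABLE IF NOT EXISTS corrections (
            term       TEXT PRIMARY KEY,
            definition TEXT NOT NULL,
            author     TEXT NOT NULL,
            updated_at TEXT NOT NULL
        )
    """)

    # Finance updates a definition: one upsert, no training run.
    db.execute(
        "INSERT INTO corrections VALUES (?, ?, ?, ?) "
        "ON CONFLICT(term) DO UPDATE SET definition = excluded.definition, "
        "author = excluded.author, updated_at = excluded.updated_at",
        ("active customer",
         "Any account with a paid invoice in the trailing 90 days.",
         "controller@example.com",
         datetime.now(timezone.utc).isoformat()),
    )
    db.commit()

Each row carries its own provenance, who changed it and when, which is exactly the audit trail a weight can't give you. And it costs one write.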

This is why the labs love selling fine-tuning, and why most enterprises shouldn't buy it for this purpose. The cost structure benefits the vendor, not the buyer.

What about RL fine-tuning?

The current frontier of "fine-tuning" is RL on the weights, driven by expert preference data. It's more sophisticated than supervised fine-tuning, it's a real research bet, and we take it seriously.

But the structural objections still apply. RL fine-tuning still bakes facts into weights, still requires retraining cycles, still produces models you can't audit fact-by-fact, still ties you to a base model's release cadence. The technique is better; the architectural fit for institutional knowledge isn't.

For some problems — agent tool-use behavior, complex multi-step planning, format adherence under pressure — RL fine-tuning is the right tool. For "the model should know what we mean by ACV in Q3 of FY26," it isn't.

The architectural alternative

The right place for institutional knowledge is a layer between your data and the model:

  • Entity resolution so the agent knows which Acme is which.
  • Terminology and definitions so the agent uses your meaning of "active customer."
  • Process and exception structure so the agent handles the common case and the edge cases differently.
  • Authoritativeness and recency so the agent knows which document supersedes which.
  • Corrections from experts that update the layer immediately, no retraining needed.

The model becomes the runtime, not the knowledge store. When the model improves, you keep the layer. When the layer changes, you don't touch the model. Each piece does what it's good at.
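
What that layer looks like in practice is easier to show than describe. Below is a minimal sketch with entirely hypothetical names and records: the layer resolves an entity alias, supplies the org's definition of a term, and picks the most recent authoritative document, then hands the assembled context to whatever model happens to be the runtime.

    # All names and records below are hypothetical. The layer resolves an
    # entity alias, supplies the org's definition of a term, and picks the
    # most recent authoritative document; the model only ever receives the
    # assembled context and never stores it.
    ENTITIES = {
        "acme": {"canonical": "Acme Holdings, Inc.", "crm_id": "A-1042"},
    }
    TERMS = {
        "active customer": "Any account with a paid invoice in the trailing 90 days.",
    }
    DOCS = [
        {"title": "Rev rec policy v3", "updated": "2025-01-10", "authoritative": True},
        {"title": "Rev rec policy v2", "updated": "2023-06-01", "authoritative": True},
    ]

    def build_context(question: str) -> str:
        q = question.lower()
        parts = []
        for alias, ent in ENTITIES.items():
            if alias in q:
                parts.append(f"'{alias}' refers to {ent['canonical']} ({ent['crm_id']}).")
        for term, definition in TERMS.items():
            if term in q:
                parts.append(f"Definition of '{term}': {definition}")
        current = max((d for d in DOCS if d["authoritative"]), key=lambda d: d["updated"])
        parts.append(f"Authoritative source: {current['title']} ({current['updated']}).")
        return "\n".join(parts)

    # Prepended to the model call at inference time; swap the model and
    # nothing in this layer changes.
    print(build_context("Is Acme still an active customer?"))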

When you actually should fine-tune

To be specific: fine-tune the model when you need consistent format, tone, or task behavior that prompting can't reliably get you. Don't fine-tune the model to teach it facts about your business. The first is what fine-tuning is for. The second is a category error that gets repeated because the pitch is intuitive and the alternative is harder to describe.

If you've been considering a fine-tuning project to make the model "know your business better," the question worth asking before you sign the contract is: six months from now, when this fine-tune is wrong and an expert flags the bad answer, what is your plan to fix it?

If the plan is "another fine-tune," you've signed up for an indefinite training pipeline.

If the plan is "we'll update the knowledge layer," you didn't need the fine-tune in the first place.

Ready to make AI understand your enterprise?

See how Phyvant gives your AI tools the context they need to get things right.

Talk to us