Fine-tuning vs in-context learning
Published on April 3, 2024
In Dwarkesh’s recent podcast with Sholto Douglas (DeepMind) and Trenton Bricken (Anthropic), Sholto speculated that in a world of long context, fine-tuning might disappear. I’m skeptical this makes economic sense even in a world where 1) the quadratic penalty on longer contexts is solved, and 2) we have more compute. I think it likely that the fixed-cost nature of fine-tuning will mean it remains viable and, in most cases, preferred.
The precise prediction was the following:
With long-context, there's also a degree to which fine-tuning might disappear, to be honest. These two things are very important today. With today's landscape models, we have whole different tiers of model sizes and we have fine-tuned models of different things. You can imagine a future where you just actually have a dynamic bundle of compute and infinite context, and that specializes your model to different things.
First, what are the current trade-offs between fine-tuning and in-context learning? Today, for a given task requiring specialization beyond the pre-trained model you’re using, you can either fine-tune (in effect, additional training applied after pre-training) or provide the data you would have fine-tuned on as in-context examples. Currently, in-context learning suffers two distinct disadvantages:
The quadratic penalty – the computational cost of processing an input grows quadratically with its length (i.e., doubling the input increases the computational cost by 4x). This is a consequence of how transformers work: every token attends to every previous token in the input. It places a practical limit on how much data can be provided in-context (illustrated in the first sketch below).
Fine-tuning is a fixed cost. For a given task where you have 100 examples to ‘teach’ the model, you can pay once to fine-tune on those examples and then scale inference as much as required. In-context learning essentially requires you to pay to learn the same thing at every inference call, which quickly drives up costs relative to a single fine-tuning run when deployed to production at scale (see the second sketch below).
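To make the quadratic penalty concrete, here is a minimal sketch of how the number of attention score computations grows with input length. It is illustrative only: constant factors and the linear terms are ignored, and only the scaling behaviour matters.

```python
# Minimal illustration of the quadratic penalty: in standard self-attention,
# every token attends to every other token, so the number of pairwise
# attention scores grows with the square of the sequence length.
# (Constant factors and linear terms are ignored; only the scaling matters.)

def attention_pairs(sequence_length: int) -> int:
    """Number of query-key score computations for one attention layer."""
    return sequence_length * sequence_length

for tokens in (1_000, 2_000, 4_000, 8_000):
    print(f"{tokens:>6} tokens -> {attention_pairs(tokens):>12,} attention scores")

# Doubling the input (1,000 -> 2,000 tokens) quadruples the work
# (1,000,000 -> 4,000,000 scores), matching the 4x figure above.
```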
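And to make the fixed-vs-variable cost point concrete, here is a rough back-of-the-envelope sketch. Every number in it (per-token prices, example lengths, query length) is an assumption chosen purely for illustration, not real pricing from any provider.

```python
# Back-of-the-envelope comparison: fine-tuning pays for the examples once,
# while in-context learning pays to re-process them on every request.
# All prices and lengths below are assumed, illustrative values.

EXAMPLES = 100                    # task examples used for specialization
TOKENS_PER_EXAMPLE = 500          # assumed average length of one example
QUERY_TOKENS = 1_000              # assumed prompt + completion per request

FINETUNE_COST_PER_TOKEN = 1e-5    # assumed one-off training price per token
INFERENCE_COST_PER_TOKEN = 1e-6   # assumed serving price per token

def finetuned_cost(requests: int) -> float:
    """Fixed training cost up front, then only query tokens per request."""
    fixed = EXAMPLES * TOKENS_PER_EXAMPLE * FINETUNE_COST_PER_TOKEN
    variable = requests * QUERY_TOKENS * INFERENCE_COST_PER_TOKEN
    return fixed + variable

def in_context_cost(requests: int) -> float:
    """Every request re-processes all examples plus the query."""
    context_tokens = EXAMPLES * TOKENS_PER_EXAMPLE + QUERY_TOKENS
    return requests * context_tokens * INFERENCE_COST_PER_TOKEN

for n in (10, 1_000, 100_000):
    print(f"{n:>7} requests: fine-tuned ${finetuned_cost(n):,.2f} "
          f"vs in-context ${in_context_cost(n):,.2f}")

# With these assumed numbers, the one-off fine-tuning cost is recovered after
# only a handful of requests; at production scale the gap keeps widening.
```

The exact crossover point depends on real prices, but the shape of the argument does not: one side is a fixed cost paid once, the other is a per-request cost proportional to the examples supplied.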
In future, it seems likely that the quadratic penalty will be reduced or removed altogether through something like sparse attention or another algorithmic improvement. It is entirely possible this has already been achieved, given Gemini 1.5’s 1M-token context length and Magic’s claimed 10M-token context length; indeed, it seems hard to imagine how inference on those models is affordable without such an improvement.

However, the distinction between fixed and variable costs is not something algorithmic improvements can remove. Even if the quadratic penalty is entirely eliminated and inference costs continue to fall dramatically (which we should assume they will), users will still be driven to fine-tune for cost savings in the majority of cases, unless compute becomes so abundant (‘too cheap to meter’) as to be an insignificant cost. There may be some exceptions, where what needs to be learned is constantly changing. Or perhaps at some point compute will truly be too cheap to meter and it won’t matter, but I suspect we are a long way from that point.