Maximizing performance on bespoke evals

Published on September 22, 2024

LLMs exhibit Jagged Intelligence – they perform incredibly well on some tasks and extremely poorly on others. Without a lot of experimentation, it’s very hard to know what to expect a priori. In fact, it’s even harder than that: without a lot of experimentation for your specific use case, you will not know whether an LLM is a valuable tool at all. Over time you can certainly develop an intuition for the capabilities of a given model, but even then, the models change so quickly that you need to constantly update your views. By spending a lot of time experimenting with various models, building tools, and testing their capabilities, I’ve slowly built a framework for quickly setting up the right tooling to do these evaluations, determining how an LLM will perform, and iterating towards higher performance. Below I’ll explain how I think you should approach this problem conceptually, and some of the specifics of how I go about it.

What to optimize for

Very simply, my recommended approach is:

  1. Ignore costs and speed while trying to get the best possible performance

  2. Then try to speed it up

  3. Then try to optimize costs

When starting out, it’s tempting to try and jump straight to cheaper models, especially if you do the rough math and realize that your current approach with GPT-4o (or o1-mini/preview!) is too expensive to justify in production. Fight this urge. Without fail, if I have been able to make something work as measured by my bespoke eval, I have been able to subsequently optimize the cost while maintaining quality. And this is before considering that quality-adjusted model costs are dropping >50% every 6 months. The first challenge, always, is to determine if you can make something work. If you can, you’ll either be able to make it cheaper later, or the models will simply come down in price and rescue you. The reason this works is that in experimenting to find the approach that performs well, you’ll learn to recognize the jagged frontier of model capabilities. And in doing so you will develop an intuition for how to optimize costs and speed. More on this later.

How to improve performance

Below is my (very) high level framework for testing performance. For each step below I’ll provide some details and an example. For the example I’ll use the case of attempting to have an LLM build a roster from a template (i.e., fill in the shifts while respecting various constraints such as shift details, team, employee qualifications, etc).

LLM Building Framework

1. Develop a small test set: Start by building a bespoke set of tests for your use cases. Start small so that you can move quickly; you can get started with ~30 examples. Don’t try to overengineer this. Often you can write these manually, and you can usually get an LLM to write them if you specify the format you need. I usually set up a text file containing a JSON array, where each test case is an object with params for the inputs and the acceptable output. In my example I would build 30 blank rosters, each with some number of shifts that have a date, start and end time, and a team. My other input would be a list of valid employees and the teams they can work in. This ‘workforce data’ would be common across sample cases for simplicity, and an LLM could absolutely create it, as well as the rosters.
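For the rostering example, a minimal sketch of what generating that test file might look like is below. The field names (shifts, workforce, expected) and the specific dates are purely illustrative, not a required schema.

```python
# A minimal sketch of generating a test-case file for the rostering example.
# Field names (shifts, workforce, expected) are illustrative placeholders;
# use whatever matches your own inputs and acceptable outputs.
import json

test_cases = [
    {
        "id": "roster_001",
        "shifts": [
            {"date": "2024-09-02", "start": "09:00", "end": "17:00", "team": "Front of House"},
            {"date": "2024-09-02", "start": "17:00", "end": "23:00", "team": "Kitchen"},
        ],
        # 'Workforce data' kept identical across test cases for simplicity.
        "workforce": [
            {"name": "Alice", "teams": ["Front of House"]},
            {"name": "Bob", "teams": ["Kitchen", "Front of House"]},
        ],
        # The acceptable output: every shift filled with a valid employee from the right team.
        "expected": {"all_shifts_filled": True, "valid_employees_only": True},
    },
    # ...29 more cases, hand-written or LLM-generated
]

with open("test_cases.json", "w") as f:
    json.dump(test_cases, f, indent=2)
```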

2. Create multiple, specific, deterministic evals: You want to be able to assess as quickly as possible a) how well your LLM is doing, and b) where it is failing. Think of this like building unit tests. ‘Deterministic’ might seem redundant given that an eval is typically deterministic but I say this to make sure you don’t rely on asking an LLM to judge the output. At this early stage, it won’t be well calibrated or reliable enough. In my rostering example I would start with evals for:

  • How many of the total shifts were correctly filled (i.e., the model returned a shift that exactly matched the input and contained an employee)

  • How many shifts were not filled (to determine if it is missing shifts altogether)

  • How many shifts were filled with invalid employees (i.e., not in the input data)

  • Any other constraints (e.g., how many shifts were created that clashed with another shift for the same employee)

This will immediately point you towards where the model is failing. This is important because the solutions for ‘the model keeps hallucinating fake employees’ and ‘the model is booking shifts that clash for the same employee’ are very different.
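As a rough sketch, the checks above might look something like this in Python for the rostering example. It assumes the model’s output has already been parsed into a list of shift dicts; adapt the field names and shapes to your own data.

```python
# Minimal deterministic evals for the rostering example. Assumes the model output has
# been parsed into dicts like {"date": ..., "start": ..., "end": ..., "team": ..., "employee": ...}.

def shift_key(shift):
    """Identify a shift by its template fields (ignoring the assigned employee)."""
    return (shift["date"], shift["start"], shift["end"], shift["team"])

def evaluate_roster(template_shifts, model_shifts, valid_employees):
    """Compare a model-produced roster against the template and count each error type."""
    template_keys = {shift_key(s) for s in template_shifts}
    returned_keys = {shift_key(s) for s in model_shifts}

    filled = 0
    invalid_employee = 0
    clashes = 0
    bookings = set()  # (employee, date, start) tuples, to catch double-bookings

    for s in model_shifts:
        employee = s.get("employee")
        if shift_key(s) in template_keys and employee:
            filled += 1
        if employee not in valid_employees:
            invalid_employee += 1  # hallucinated or otherwise invalid employee
        booking = (employee, s["date"], s["start"])
        if booking in bookings:
            clashes += 1  # crude clash check; a real one would compare overlapping time ranges
        bookings.add(booking)

    return {
        "shifts_filled": filled,
        "shifts_missing": len(template_keys - returned_keys),
        "invalid_employee_errors": invalid_employee,
        "clash_errors": clashes,
        "accuracy": filled / max(len(template_keys), 1),
    }
```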

3. Build simple testing infrastructure: Once you have your test set and your evals, write a small app to run your tests. Your core app should read a config; pull in the test data, evals, and your prompt (more on that next); build the prompt; send the request to OpenAI/Anthropic/Google; parse the response; run the evals; and store the results. To keep it simple, just store a CSV with columns for (at minimum):

  • Timestamp

  • Model

  • Prompt_file_name

  • Actual prompt

  • Input data

  • Input tokens

  • Output tokens

  • Runtime (this is even more important now with o1 models as they tend to have high variance in inference time, even for identical prompts)

    • Note: One thing to watch out for here: if you run lots of tests and hit rate limits, make sure any backoff or wait time is not included in your runtime logs

  • Eval results (as many columns as needed e.g., overall_accuracy, num_fake_shift_errors, etc)

The config I mentioned can be as simple as a few lines where you specify the inputs for a given test. For example, you probably want to be able to specify the model name and prompt, so that you can easily test different variations. For even better infrastructure, set up configs that take lists of models and prompts, and then run tests on all unique combinations of these inputs to gather more data quickly.
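To make that concrete, here is a minimal sketch of such a harness. It assumes an OpenAI-style SDK, the evaluate_roster function from the earlier sketch, prompt files that use {input} and {workforce} placeholders, and a model that returns parseable JSON; all of those details are illustrative rather than requirements.

```python
# A minimal test runner: read the config, build prompts, call the model,
# run the evals, and append one CSV row per test.
import csv
import itertools
import json
import time
from datetime import datetime, timezone
from pathlib import Path

from openai import OpenAI  # pip install openai (swap in your provider's client as needed)

CONFIG = {
    "models": ["gpt-4o", "gpt-4o-mini"],            # every model...
    "prompt_files": ["prompts/v1_baseline.txt",     # ...crossed with every prompt
                     "prompts/v2_one_example.txt"],
    "test_file": "test_cases.json",
    "log_file": "results.csv",
}

client = OpenAI()
test_cases = json.loads(Path(CONFIG["test_file"]).read_text())

with open(CONFIG["log_file"], "a", newline="") as f:
    writer = csv.writer(f)
    for model, prompt_file in itertools.product(CONFIG["models"], CONFIG["prompt_files"]):
        template = Path(prompt_file).read_text()  # assumes {input} and {workforce} placeholders
        for case in test_cases:
            prompt = template.format(input=json.dumps(case["shifts"]),
                                     workforce=json.dumps(case["workforce"]))
            start = time.monotonic()
            resp = client.chat.completions.create(
                model=model, messages=[{"role": "user", "content": prompt}]
            )
            runtime = time.monotonic() - start  # keep any retry/backoff you add outside this window

            output = resp.choices[0].message.content  # assumed to be a JSON list of filled shifts
            evals = evaluate_roster(case["shifts"], json.loads(output),
                                    {e["name"] for e in case["workforce"]})

            writer.writerow([
                datetime.now(timezone.utc).isoformat(), model, prompt_file, prompt,
                json.dumps(case["shifts"]), resp.usage.prompt_tokens, resp.usage.completion_tokens,
                round(runtime, 2), evals["accuracy"],
                evals["invalid_employee_errors"], evals["clash_errors"],
            ])
```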

4. Write (and track!) your prompt: Now it is time to write your first prompt. First, I strongly recommend you version control your prompts. A big part of improving performance will be prompt engineering, and you want to know what works. You should have a folder in your project where you store txt files with each prompt you try. This is also why I included prompt_file_name as a column above, so that you can easily track which prompt you used for a given result. My one other tip is to edit only one thing per variation – for example, use the same baseline prompt but add one in-context example, then add five in another prompt. This will allow you to systematically track the impact of incremental changes, which will make it easier to mix-and-match later when you have enough data. There are lots of guides on how to prompt well so we won’t spend much time here.
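One illustration of how simple this can be (the file names and folder layout are just one possible convention, not a prescription):

```python
# One possible prompt-versioning convention: plain txt files in a prompts/ folder,
# one change per file, named so the difference from the baseline is obvious.
#
#   prompts/v1_baseline.txt
#   prompts/v2_one_example.txt        # v1 + one in-context example
#   prompts/v3_five_examples.txt      # v1 + five in-context examples
#   prompts/v4_five_examples_cot.txt  # v3 + chain-of-thought instructions
from pathlib import Path

def load_prompt(file_name: str) -> str:
    """Read a versioned prompt; log file_name with every result so runs stay reproducible."""
    return (Path("prompts") / file_name).read_text()
```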

5. Run tests: You’re ready to run tests. If you’ve set up the infrastructure right, this should take a few seconds. Simply edit the inputs and run.

6. Log the full results: We discussed this earlier, but one thing I recommend is to take your CSV with the results and drop it into a spreadsheet. In fact, it’s fairly simple to have the results appended to a Google Sheet – Claude/GPT will even set it up for you and walk you through the Google Cloud Console setup. It’s important to make it really simple to view both the aggregate results and specific tests, so you can quickly get a feel for what the model is getting right and wrong.
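If you go the Google Sheets route, one common approach (not necessarily the only or best one) is the gspread library. The sketch below assumes you have created a service account in the Google Cloud Console, saved its key to service_account.json, and shared the sheet with that account’s email.

```python
# Append each result row to a Google Sheet instead of (or as well as) the local CSV.
import gspread  # pip install gspread

gc = gspread.service_account(filename="service_account.json")
worksheet = gc.open("LLM eval results").sheet1  # the sheet name is a placeholder

def log_to_sheet(row: list) -> None:
    """Append one result row (same columns as the CSV) to the sheet."""
    worksheet.append_row(row, value_input_option="USER_ENTERED")
```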

7. Review logs for successful pathways: Once it’s in the sheet, I’d recommend running some very simple calculations, such as:

  • Min, max, and average accuracy by model

  • Same as above but for prompt

  • Error type by count and percentage for each prompt
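A few lines of pandas over the results CSV gets you most of the way; the column names below assume the log format from step 3.

```python
# Quick summary stats over the results log.
import pandas as pd

df = pd.read_csv("results.csv", names=[
    "timestamp", "model", "prompt_file", "prompt", "input_data",
    "input_tokens", "output_tokens", "runtime",
    "accuracy", "invalid_employee_errors", "clash_errors",
])

# Min, max, and average accuracy by model, then by prompt.
print(df.groupby("model")["accuracy"].agg(["min", "max", "mean"]))
print(df.groupby("prompt_file")["accuracy"].agg(["min", "max", "mean"]))

# Error counts per prompt, and each error type as a share of that prompt's total errors.
error_cols = ["invalid_employee_errors", "clash_errors"]
errors = df.groupby("prompt_file")[error_cols].sum()
print(errors)
print(errors.div(errors.sum(axis=1), axis=0))
```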

Once you have this all available it’s very simple to start seeing where the model is going wrong. Maybe you notice it repeatedly creates clashing shifts, so you add some in-context examples with Chain-of-Thought reasoning to steer it away from this failure mode. Once you’ve done that you might try different methods of demonstrating the examples to the model, testing each one in a separate prompt and comparing the accuracy. As you add complexity you may need to add new evals and new logs, but the overall framework remains unchanged. The goal is to maximize your iteration speed and ability to identify which ideas work and which don’t. If you start here, I can almost guarantee that even if something doesn’t end up working, you’ll figure that out faster and be able to move on to something that does.