
CorpusBench

CorpusBench is a new agentic customer service benchmark, designed to test models in a more realistic scenario than benchmarks like taubench. It presents the agent with various issues to resolve, but no policy guidance in the prompt. What the agent does receive is access to the simulated business’s historical data - emails with customers, internal communication regarding policy changes, order history, product catalogue, and so on. The agent’s job is to use this data to infer the correct policy, and then apply it. The agent is measured on two things: whether it took the correct action, and whether it provided the correct rationale for that action.
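To make the two grading axes concrete, here is a minimal sketch of how a single episode might be scored. Everything here is hypothetical - the `Episode` fields, the exact-match action check, and the keyword-based rationale check are illustrative stand-ins, not the benchmark's actual API or grader.

```python
from dataclasses import dataclass

# Hypothetical sketch of scoring one CorpusBench-style episode on the
# two axes described above: correct action, and correct rationale.

@dataclass
class Episode:
    issue: str                          # the customer issue to resolve
    gold_action: str                    # correct action per the (unstated) policy
    gold_rationale_keywords: list[str]  # evidence the rationale should cite

def score_episode(action: str, rationale: str, ep: Episode) -> dict:
    action_ok = action == ep.gold_action
    # A crude rationale check: does the agent cite the right precedent?
    rationale_ok = all(k.lower() in rationale.lower()
                       for k in ep.gold_rationale_keywords)
    return {"action": action_ok, "rationale": rationale_ok}

ep = Episode(
    issue="Customer requests a refund 45 days after delivery",
    gold_action="deny_refund",
    gold_rationale_keywords=["30-day", "policy update"],
)
result = score_episode(
    "deny_refund",
    "Per the March policy update, refunds close after the 30-day window.",
    ep,
)
print(result)  # → {'action': True, 'rationale': True}
```

A real grader would need something far more robust than keyword matching for rationales (likely a model-based judge), but the separation of the two scores is the point: an agent can take the right action for the wrong reason, and the benchmark penalizes that.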

Why build it?

When I wrote about what was holding back AI diffusion, one of the examples I gave that had worked extremely well for deploying AI was this:

The second breakthrough, which was likely even more important, was that instead of trying to hardcode behavior into prompts, we simply instruct the agent to begin by searching Gmail for the most recent similar cases, and use this to inform the response. Initially I didn’t realize what a huge improvement this would be but it quickly became clear. Not only do we not have to maintain prompts, but the model naturally picks up the style, tone and length of our responses. It also inherits our policies, like how we handle refunds or returns, and when we deviate from the default, because it can see how we’ve handled it in the past. Perhaps the greatest benefit of this approach is that in a way, it learns over time. I’m using the word loosely of course. But consider the following case: the agent searches for similar cases and drafts a response. The user decides the response was wrong and edits it before sending. Next time the agent faces that situation, or any that are similar, it follows the most recent behavior. This required no explicit intervention from the user.

This pattern is incredibly powerful for deploying agents into a business. If the agent can access historical precedent, it can infer huge amounts of what it needs to learn within a single context window. As memory improves this becomes more powerful still. Not all use cases support this [side quote: I do not think it is a coincidence that many of the largest AI use cases share this characteristic: coding has the codebase, customer service has the historical cases, and legal work has the contracts and documents, along with historical redlines. In each case, a sufficiently intelligent agent can use these things to infer how to do work correctly without needing to be explicitly instructed], but when they do, deployment is theoretically far easier, because we can potentially realize the dream of the drop-in digital worker - an agent that can simply be connected to relevant systems and data, and learn all that it needs to complete some piece of work while adhering to the correct policies.
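The precedent-search workflow from the quote above can be sketched in a few lines. Everything here is a hypothetical stand-in: the in-memory case list replaces a real Gmail search, the word-overlap scoring replaces real retrieval, and `draft_from_precedents` replaces the model call. Note how the second, more recent case wins, which is the "learns over time" behavior described in the quote.

```python
from datetime import date

# Stand-in corpus: (date sent, customer issue, response actually sent).
# The June case reflects an edited response - a policy change in action.
CASES = [
    (date(2024, 1, 5), "refund after 20 days", "Refund approved."),
    (date(2024, 6, 2), "refund after 20 days", "Sorry, refunds now close at 14 days."),
]

def similar_cases(issue: str, corpus: list, k: int = 3) -> list:
    # Stand-in for search: naive word overlap, most recent first.
    words = set(issue.lower().split())
    scored = [(len(words & set(c[1].lower().split())), c) for c in corpus]
    relevant = [c for score, c in scored if score > 0]
    return sorted(relevant, key=lambda c: c[0], reverse=True)[:k]

def draft_from_precedents(issue: str, precedents: list) -> str:
    # Stand-in for the model call: follow the most recent precedent.
    return precedents[0][2] if precedents else "No precedent found; escalating."

precedents = similar_cases("customer wants a refund after 20 days", CASES)
print(draft_from_precedents("customer wants a refund after 20 days", precedents))
# → Sorry, refunds now close at 14 days.
```

When the user edits a bad draft before sending, the corrected response simply becomes the newest entry in the corpus, so the next draft follows it - no prompt maintenance required.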

However, as far as I can tell, existing benchmarks do not test this. The canonical agentic customer service benchmark is taubench from Sierra. While it is an excellent benchmark, the agent is handed the full policy manual in the prompt. That still tests something valuable, but it does not test the agent’s ability to search for and gather the correct context itself. CorpusBench aims to fill this gap.

For all the excitement about agents in greenfield projects, the total set of opportunities is dominated by transforming existing businesses and their workflows. This is much harder, for many reasons. One of the main ones is that context is dispersed across the organization: it lives in different systems and formats, much of it is tacit, and large amounts of it are outdated. CorpusBench is an early attempt to measure an agent’s ability to solve this problem when given access to historical data.