I want to discuss an idea about how to design complex compound AI systems that I’ve come to believe strongly over my last few years of building with frontier models. But to get there, I want to lay out some shared foundation about the idea of composition in general.
Composition and abstraction
In software design, we manage complexity through composition and abstraction. We compose hardware instructions or variables to build functions, and compose functions to build libraries and programs. To implement a user interface, I might abstract specific inputs and conceptual groups of functionality into components like buttons and tabs and panels, and compose them together by placing them next to or inside one another to build up the interface. A complex web application may be composed of thousands of services interacting with each other through requests and responses, each service abstracting over some detail that other services shouldn’t worry about.
Together, composition and abstraction help us contend with the necessary complexity of software systems that must express some inherently complex real-world idea. By finding the right way to break down a complex problem into small parts which each manage their own details and expose clearly-defined ways to interact with other parts, we build a system that’s easier to understand and easier to evolve over time. When used well, composition and abstraction localize future failures and changes to the system to specific parts rather than everywhere all the time. Abstraction also often helps with reuse of the same ideas in different places across a system.
In mature subfields of software engineering, there are strong conventions around composition and abstraction boundaries. Through time and trial and error, mature fields end up with a few evolutionarily proven patterns for carving up a complex software system so that failure and change stay isolated to small parts. Look at operating systems as an example. Although there are exceptions, the general pattern of a kernel abstracting over hardware with system calls, processes and threads for multitasking, synchronous and asynchronous I/O primitives… we can find these in most operating systems in roughly the same shape. Where there are differences (for example, APIs for async I/O), it’s an indication of an active area of evolution. Though less mature, user interfaces on the Web also feel close to a steady state of compositional patterns. If you’re building a Web interface today, you’re probably choosing to compose it out of interface components that have a functional kind of behavior, holding onto some state and re-rendering in some way when state updates as a consequence of some event somewhere else in the system.
In software, the evolutionary pressure on the gene pool of compositional patterns is very strong. If your system cannot isolate operational faults in production or cannot confidently accommodate the need to change its behavior over time without the changes propagating chaotically, the design pattern will get weeded out of the ecosystem in favor of patterns that soak up change with ease.
This brings us to the question of how we should compose compound AI systems, particularly ones that make heavy use of prompted and fine-tuned language models (which I’ll just call “models” going forward).
Tasks are the building block of capabilities
In a new domain absent any convention, one tempting way to try to find good abstraction boundaries is to look for areas in the system’s implementation where we could reuse code to reduce repetition. This often leads to abstracting along implementation details. Let’s call this approach composing implementations.
Many libraries and frameworks for working with models make this mistake. They push their consumers to think of things like prompts, tools, models, or some combination thereof (sometimes called an “agent”) as the reusable unit of composing larger systems. In this paradigm, the application defines specific prompts or tools, and by making a careful set of these available to the model for any specific task, the system ought to be capable of the desired behavior.
I have struggled to build robust, changeable systems with this approach, and I believe it produces fragile systems that are hard to evolve confidently.
When we build an AI system by composing implementation details, we are making a few assumptions:
- We assume this design localizes future changes to specific parts of the system. That is, we believe we’ll be able to accommodate future changes to our system by changing either the prompt, or some tools, or the model, independently of each other.
- We assume it makes sense to reason about failures at the level of these components. In other words, we assume that it’s constructive to think of a system failure as a “prompt” being bad, or a “tool” being bad.
In practice, when a compound AI system fails at some task or needs to change how it performs on a task, I’ve often needed to change all the components (prompts/context, tools, and the underlying model) together. Likewise, if a system doesn’t perform up to par on a particular task, we cannot necessarily isolate the fault to a bad prompt or a bad model; both are contributing factors, and we may be able to fix the problem by improving one or the other, or possibly both.
As a consequence of this mismatch between reality and the assumptions behind composing implementations, I’ve found that this paradigm creates a lot of unnecessary complexity that doesn’t actually make the system easier to understand or evolve.
Instead of composing implementations, we should define our abstractions to correspond to specific tasks and subtasks we want our system to perform, and compose tasks to assemble larger systems, in much the same way we compose functions to build larger programs.
When I write “task”, I mean some mapping of a domain of inputs to a domain of outputs, paired with some evaluation criteria on the output domain. For any given problem, there are potentially many different valid ways to factor the problem out into subtasks. Let’s take a question answering system as an illustrative example.
- End-to-end: As a trivial case, we could define the entire end-to-end problem as a single task, where the input is a user query and output is an answer. The evaluation criteria could be some combination of concision, accuracy, and relevance. As part of this definition, we may also define the input domain to consider certain questions “out of scope” — irrelevant to whether the task is being performed well or not.
- Factored: Many real-world QA systems instead subdivide the task into three subtasks:
  - Query expansion: mapping a user query to one or more queries rewritten to be optimized for some backend search engine.
  - Retrieval: mapping some internal search query (often looking different from user queries, e.g. complex full text search queries) to a list of ranked, filtered document matches.
  - Generation: mapping the original user query and retrieved documents to a specific phrasing of a response.
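To make the definition concrete, here is a minimal Python sketch of a task as a mapping plus an evaluation function, applied to the two factorings above. Everything in it is a hypothetical stand-in: the `Task` class, the toy `DOCS` corpus, and the keyword-matching `expand`, `retrieve`, and `generate` functions take the place of real models and search engines, and the evaluation criteria are deliberately crude.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


@dataclass(frozen=True)
class Task(Generic[In, Out]):
    """A mapping from inputs to outputs, paired with evaluation criteria."""
    name: str
    run: Callable[[In], Out]
    # Scores an (input, output) pair; higher is better.
    evaluate: Callable[[In, Out], float]


# Toy corpus and stopword list, standing in for a real search backend.
DOCS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]
STOPWORDS = {"what", "is", "the", "of"}
NO_ANSWER = "I don't know."


def expand(query: str) -> list[str]:
    # Stand-in for a model call: reduce the user query to search terms.
    words = [w.strip("?.,").lower() for w in query.split()]
    return [w for w in words if w and w not in STOPWORDS]


def retrieve(terms: list[str]) -> list[str]:
    # Rank documents by how many search terms they contain; drop non-matches.
    scored = [(sum(t in d.lower() for t in terms), d) for d in DOCS]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc for score, doc in ranked if score > 0]


def generate(args: tuple[str, list[str]]) -> str:
    # Stand-in for a model call: answer with the top-ranked document.
    _query, docs = args
    return docs[0] if docs else NO_ANSWER


# Factored: three subtasks, each with its own boundary and criteria.
query_expansion = Task("query_expansion", expand, lambda q, terms: float(bool(terms)))
retrieval = Task("retrieval", retrieve, lambda terms, docs: float(bool(docs)))
generation = Task("generation", generate, lambda args, ans: float(ans != NO_ANSWER))


def answer(query: str) -> str:
    terms = query_expansion.run(query)
    docs = retrieval.run(terms)
    return generation.run((query, docs))


# End-to-end: the same problem exposed as a single task.
qa = Task("qa", answer, lambda q, ans: float(ans != NO_ANSWER))
```

Note that `qa` has the same shape as its subtasks, so a caller can’t tell whether it is implemented end-to-end or as a pipeline of smaller tasks.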
Out of context, one factoring of this task isn’t obviously better than the other. Deciding which is better requires knowing the technical and engineering context of the situation. The factored approach may be preferred by a team that already has a working retrieval system, or where it’s possible to evaluate the quality of the query expander independently of the result, e.g. from an opinionated guide on how to use the search engine effectively. The end-to-end approach may be better suited for a context where the strongest correctness signal only exists for the final result, and where it’s possible to propagate that feedback signal directly through the entire system with methods like RL. Even if a team begins with one factoring of a task, the right architecture may change over time as the available technologies improve or become cheaper. For example, as end-to-end RL becomes more viable, a team with a factored implementation of QA may decide to move to an end-to-end design.
Let’s compare composing tasks to composing implementations from above. In my experience:
- Future changes to an AI system tend to localize to specific tasks. If I need to upgrade the backing model for query expansion, or rearchitect the answer generation subtask for latency, as long as that subtask evaluates well, I can trust that other parts of the system will interoperate well with the upgraded version. Changes are better isolated to individual subtasks, meaning the abstraction boundaries are true boundaries.
- It’s often much easier to talk about failures at the level of specific tasks than at the level of prompts, tools, or models. When I see a bad answer output by a question answering system, it’s easier for me to say, “This would have been better if only the model had the right context,” or “The search engine clearly didn’t return the right document.” It’s harder for me to say, “If only the prompt had been better,” because maybe a better model would have solved the problem too.
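One way to picture the first point is a subtask whose implementation can be swapped as long as it still clears that subtask’s own evaluations. The sketch below is only an illustration under assumptions: `EVAL_SET`, `score`, `upgrade`, and the three `expand_*` functions are hypothetical names, and the plain string functions stand in for what would really be different prompts or models.

```python
from typing import Callable

Expander = Callable[[str], list[str]]

# Hypothetical evaluation set for the query expansion subtask:
# each user query is paired with terms the expansion must contain.
EVAL_SET = [
    ("What is the capital of France?", {"capital", "france"}),
    ("Who wrote Hamlet?", {"wrote", "hamlet"}),
]


def score(expand: Expander) -> float:
    """Fraction of eval examples whose expected terms all appear in the output."""
    hits = sum(expected <= set(expand(query)) for query, expected in EVAL_SET)
    return hits / len(EVAL_SET)


def expand_v1(query: str) -> list[str]:
    # Current implementation: every word, lowercased.
    return [w.strip("?").lower() for w in query.split()]


def expand_v2(query: str) -> list[str]:
    # Candidate implementation: same contract, but drops filler words.
    filler = {"what", "is", "the", "of", "who"}
    return [w for w in expand_v1(query) if w not in filler]


def expand_first_word(query: str) -> list[str]:
    # A regression: keeps only the first word of the query.
    return expand_v1(query)[:1]


def upgrade(current: Expander, candidate: Expander, threshold: float = 1.0) -> Expander:
    """Adopt the candidate only if it clears the bar on the subtask's evals."""
    return candidate if score(candidate) >= threshold else current
```

Here `upgrade(expand_v1, expand_v2)` adopts the new implementation, while `upgrade(expand_v1, expand_first_word)` keeps the current one. Nothing downstream of query expansion has to be consulted to make that decision.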
Note that this approach isn’t incompatible with a bias to end-to-end approaches to solving big problems. As technology improves I might choose to bundle the three subtasks in a QA problem into a single end-to-end task, which I could then reuse in a bigger, more complex report generation task. I’m arguing for inputs and outputs as abstraction boundaries, rather than implementation details of pipeline harnesses; I’m not necessarily arguing for always factoring tasks into smaller tasks.
A system built by composing subtasks often looks less like a conventionally implemented “agent”, and more like a “workflow” or a “pipeline,” with each task accepting some well-defined input and returning some output to the next task. In this design, when implementing new workflows, it becomes natural to try to see if we can assemble them in terms of smaller pre-existing workflows. In a world where prompts and tools are the unit of abstraction, using prompts and tools in new contexts often requires tweaking those implementation details. By contrast, when assembling new tasks out of existing tasks, the abstraction boundaries stay clear — either the subtask is performing well, benefiting the larger task, or the subtask needs to perform better. If the subtask’s output isn’t quite the right format, we can pass it through another subtask to get the right output.
It’s a more functional programming-inspired design, and just as functional programming conventions tend to make it easier to reason about complex logic, composing by tasks makes it easier to reason about complex AI systems.
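As a rough sketch of that functional flavor, assume two existing subtasks whose formats don’t line up: a retriever that returns scored matches and a summarizer that expects plain passages. Rather than tweaking either one, a small adapter subtask sits between them. The names `pipe`, `retrieve`, `to_passages`, and `summarize` are hypothetical, and the bodies are toy stand-ins for real implementations.

```python
from typing import Callable, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


def pipe(first: Callable[[A], B], second: Callable[[B], C]) -> Callable[[A], C]:
    """Compose two tasks: the output of `first` becomes the input of `second`."""
    def composed(x: A) -> C:
        return second(first(x))
    return composed


# Toy corpus of scored matches, standing in for a search backend.
CORPUS = [
    {"text": "Paris is the capital of France.", "score": 0.9},
    {"text": "France is in Europe.", "score": 0.4},
    {"text": "Berlin is the capital of Germany.", "score": 0.8},
]


def retrieve(query: str) -> list[dict]:
    # Existing subtask: returns match records that mention the query.
    return [m for m in CORPUS if query.lower() in m["text"].lower()]


def summarize(passages: list[str]) -> str:
    # Existing subtask: expects plain strings, not match records.
    return " ".join(passages)


def to_passages(matches: list[dict]) -> list[str]:
    # Adapter subtask: reshapes retrieval output into summarizer input.
    return [m["text"] for m in matches]


# A new workflow assembled entirely from existing tasks plus one adapter.
summarize_topic = pipe(pipe(retrieve, to_passages), summarize)
```

Neither `retrieve` nor `summarize` changes to accommodate the new workflow; the mismatch is absorbed by `to_passages`, which is itself a task with a clear input and output.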
Composing tasks is also not incompatible with a “give a model lots of tools” style of architecture; a model may choose to invoke a subtask as a tool, and consume its output as a tool call output. (This begins to look like what I’ve seen referred to as “subagents.”) This approach keeps the part that I consider the most fundamental strength of “agentic” systems, which is that the model chooses what tasks to perform next. The model controls the control flow. Even in this world, by reasoning about tasks in terms of inputs, outputs, and evaluation criteria, we can make the whole system easier to evolve and understand.
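A minimal sketch of that arrangement, under heavy assumptions: `TOOLS` is a hypothetical registry that exposes a subtask by name, and `scripted_model` is a hard-coded stand-in for a real model deciding what to do next. No real model API or agent framework is being described here.

```python
from typing import Callable

# Toy knowledge base, standing in for a real retrieval subtask.
DOCS = {
    "france": "Paris is the capital of France.",
    "germany": "Berlin is the capital of Germany.",
}


def search_docs(query: str) -> str:
    # A subtask with a clear input (query) and output (best matching document).
    for key, doc in DOCS.items():
        if key in query.lower():
            return doc
    return "no match"


# Subtasks exposed to the model as tools, keyed by name.
TOOLS: dict[str, Callable[[str], str]] = {"search_docs": search_docs}


def scripted_model(history: list[str]) -> tuple[str, str]:
    """Stand-in for a model choosing the next step.

    Returns (tool_name, argument), or ("final", answer) to stop.
    """
    if len(history) == 1:
        return ("search_docs", history[0])
    return ("final", history[-1])


def run_agent(
    user_query: str,
    choose_next: Callable[[list[str]], tuple[str, str]] = scripted_model,
    max_steps: int = 5,
) -> str:
    history = [user_query]
    for _ in range(max_steps):
        action, arg = choose_next(history)  # the model owns the control flow
        if action == "final":
            return arg
        history.append(TOOLS[action](arg))  # tool output is just a task output
    return history[-1]
```

The loop doesn’t care how `search_docs` is implemented. It could be a single function, a pipeline, or another model, and it can still be evaluated on its own inputs and outputs.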
Composing tasks in the real world
A pattern emerges when we engineer systems by composing tasks. As underlying technologies evolve and become more general, we often need to re-draw the boundaries between subtasks to get maximum performance. This is a consequence of the Bitter Lesson: as more data and compute become available to deploy against a problem, it becomes more attractive to solve more tasks end-to-end rather than decompose them into smaller, easier tasks.
I’ve found that this applies not just when using computers to solve software tasks, but also when working with people and software to solve real-world tasks inside teams. As technology improves, often towards more general capabilities, it makes sense to periodically check whether a group of people and software can work better together by re-drawing the boundaries between tasks. A common naïve version of this is to shift from a human performing some task to a machine performing the task and a human validating its output, but I think this leaves a lot of creativity and upside on the table. If a workflow happens once every N days today, you could consider moving to a streaming design. If there are many workflows that all depend on some element that looks very similar, you might unify that shared laborious task across many workflows into a program and let everyone “call” the program as needed, on demand and at much lower cost. This would mean that any improvement to the performance and reliability of that shared subtask lifts the quality of all downstream workflows.
As I spend more time working across artificial intelligence, user interfaces, and understanding ecosystems of people collaborating, I’ve come to really believe that these are small interconnected components of a larger system design problem. AI, UI, and social coordination are each a part of an important broader question: how to design a hybrid system of people and software that’s robust to change and improves gracefully over time. When looking at big complex problems this way, often a paradigm that works well in one domain, like interface design, points to analogues in sibling domains, like social coordination. Task composition feels like one such principle.