Suppose you’re a product engineer working on an app that needs to understand natural language. Maybe you’re trying to understand human-language questions and provide answers, or maybe you want to understand what humans are talking about on social media, to group and categorize them for easier browsing. Today, there is no shortage of tools you may reach for to solve this problem. But if you have a lot of money, a lot of compute hardware, and you’re feeling a little adventurous, you may find yourself reaching for the biggest hammer of all the NLP hammers: the large autoregressive language model, GPT-3 and friends.
Autoregressive models let engineers take advantage of computers’ language understanding through a simple interface: the model continues a given piece of text in whatever way it predicts is most likely. If you give it the start of a Wikipedia entry, it will write a convincingly thorough Wikipedia article; if you give it the start of a conversation log between friends or a forum thread between black hat hackers, it will continue those conversations plausibly.
If you’re an engineer or designer tasked with channeling the power of these models into a software interface, the easiest and most natural way to “wrap” this capability into a UI is a conversational interface: if the user wants to ask a question, the interface can embed the user’s query into a script for a customer-support conversation, and the model can respond with something reasonable. This is what Google’s LaMDA does. It wraps a generative language model in a script for an agreeable conversation, and exposes one side of the conversation to the human operator. Another natural interface is to expose the model’s text-completion interface directly. This kind of “direct completion” interface may actually be the most useful thing if you’re building, say, an AI-assisted writing tool, where “finish this paragraph for me” may be a useful feature for unblocking authors stuck in creative ruts.
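To make the “conversational wrapper” idea concrete, here’s a minimal sketch. Everything in it is hypothetical — the prompt template, the `complete` function (a stand-in for any text-completion API), and the agent framing — and it is not how LaMDA is actually implemented:

```python
def build_prompt(user_query: str) -> str:
    # Hypothetical template: frame the user's question as one side of a
    # support conversation, and let the model continue as the "Agent".
    return (
        "The following is a conversation with a friendly, knowledgeable "
        "support agent.\n\n"
        f"Customer: {user_query}\n"
        "Agent:"
    )

def answer(user_query: str, complete) -> str:
    # `complete` stands in for any text-completion API: it takes a prompt
    # string and returns the model's predicted continuation.
    prompt = build_prompt(user_query)
    return complete(prompt).strip()
```

The point of the sketch is how thin the wrapper is: the user never sees the prompt scaffolding, only the text the model generates on the “Agent” side of the script.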
But when I ponder the question “what interface primitives should language model-infused software use?”, it doesn’t seem like exposing the raw text-completion interface is going to be the most interesting, powerful, or creative bet long-term. When we look back at history, for every new capability of computers, the first generation of software interfaces tend to expose the most direct and raw interface to that capability. Over time, though, subsequent generations of interface designs tend to explore what kinds of entirely new metaphors are possible that build on the fundamental new capability of the computer, but aren’t tethered to the conduits through which the algorithms speak.
Long live files
Perhaps the most striking example of interface evolution is the notion of “files” which dominated the desktop computing paradigm for decades before being transfigured out of recognition in the switch to mobile. Operating systems still think of lots of pieces of data as files on disk, encoded in some file format, sitting at the end of some hierarchy of folders. Software developers and creative professionals still work with them on a daily basis, but for the commonplace “personal computing” tasks like going on social media, texting friends, streaming video, or even editing photos and blogging, humans don’t need to think about files anymore.
Files still exist, but the industry has found better interface primitives for mediating most kinds of personal computing workflows. Imagine if the Photos app on the iPhone made you deal with files and folders to search your camera roll, or if Spotify exposed a hierarchical-folders interface to browsing your playlists. Where files are exposed directly, end users can often skim a “Recently used” list of 5-10 files or search for a few keywords to find files – no trudging through folder hierarchies necessary. We can also depend on pretty reliable cloud syncing services to make sure our important files are “on every device”, though of course, that’s not really how files work. We’ve just evolved the interface primitive of a “file” to become more useful as our needs changed.
The path here is clear: we found a software primitive (files and folders), built initial interfaces faithfully around them (Windows 98), and gradually replaced most use cases with more natural interface ideas or augmented the initial metaphor with more effective tools.
A similar transition has been happening over the last decade for URLs on the web. Initially, URLs were front-and-center in web browsers. In a web full of static webpages linking to each other through short, memorable URLs written mostly by humans, URLs were a legible and important part of the user interface of the web. But as the web became aggregated by social media and powerful search engines, URLs became less important. As URLs became machine-generated references to ephemeral database records, the web embraced new ways to label and navigate websites – bookmarks and favorites, algorithmic feeds, and ever-more-powerful search. Most browsers these days don’t show full URLs of webpages by default.
With software interfaces for language models, we’re just at the tippy-tip of the beginning stages of exploration. We’re exposing the algorithm’s raw interface – text completion – directly to end users. It seems to me that the odds of this being the most effective interface for harnessing language models’ capabilities are low. Over the next decade, we’re likely to see new interface ideas come and go that explore the true breadth of novel interfaces through which humans can harness computers’ understanding of language.
Future interfaces are always difficult to imagine, but taking a page from Bret Victor’s book, I want to explore at least one possible future by studying one way a currently popular interface idea, conversational UIs, falls short.
Conversations are a terrible way to keep track of information
A conversational interface puts the user in conversation with some other simulated agent. The user accomplishes things by talking to this fictional agent. Siri is the prototypical example, but we also find these in (usually unsatisfying) customer support portals, in phone trees, in online order forms, and many other places where there may be a broad set of questions the user might ask, but only a fixed set of things the computer can do in response (like “rebook a flight”).
A basic conversational UI is easy to build on top of language models, because it’s just a thin wrapper around generative LMs’ raw interface: continuing a text prompt. But for most personal computing tasks, I think CUIs are not ideal. This is because conversations are a bad way to keep track of information, and most useful tasks require us to keep information in our working memory.
Here are some tasks that involve keeping track of information throughout:
- Travel planning (what you’ve seen, which places/bookings you made)
- Project management (what have I done? what’s on my plate?)
- Researching a topic (why do people keep all those browser tabs open if not to keep state?)
- Decision making (what choices do I have? which is better, and why?)
- Following instructions (what have I done? did I miss a step? how much is left?)
- Editing podcasts, videos, papers
- Understanding a complex system, like reading a map or financial forecast
Current interfaces (mobile apps, websites) help the user “keep track of information” in these workflows by simply continuing to display relevant information on-screen while the user is performing some action. When I go to Expedia to book a trip, for example, even while I search for my return flight, I can see the date and time at which I depart, and on which airline I’ll be flying. In a conversational UI, these pieces of information can’t simply “stick around” – the user needs to keep them in mind somehow. And as the complexity of conversations and tasks increases, the user may find themselves interacting not with a kind and knowledgeable interlocutor, but with a narrow and frustrating conduit of words through which a bot is trying to squeeze a whole screenful of information, one message at a time.
Not all tasks are so complex, though, and some tasks don’t really involve keeping anything in our working memory. These are good for CUIs, and include:
- Querying specific trivia (time, weather, calendar events, todos)
- Fire-and-forget tasks (Send X a message, play music)
- AI as conversational partner (e.g. brainstorming, but then you’d need to “keep state” in another place like meeting notes)
If the user has to keep track of information in a conversation, they have to hold that information in their working memory (hard for no reason) or keep asking the interface (what was step one again? what were my options again?). It simply doesn’t make sense for some tasks.
So what’s the solution to collaborating with language models on more complex tasks?
One solution may be documents you can talk to. Instead of holding a conversation with a bot, you and the bot collaborate together to write a document and build up a record of the salient points and ideas to remember. Think GitHub Copilot for everything else.
More generally, when there’s some shared context that a language model-powered agent and the human operator share, I think it’s best to reify that context into a real thing in the interface rather than try to have a conversation subsume it. A while ago, I made a language model-powered UNIX shell, where instead of typing in code like
cp my-file.txt new-file.txt, I would type in natural-language commands:
Starting to think that a terminal shell you can talk to may not be the worst interface to just doing ~general stuff on your computer~?— Linus (@thesephist) May 7, 2022
Thinking about hooking up something like this to an OS-wide CMd+K or something.🤔 pic.twitter.com/4DifMyKP7H
Playing with this experiment, I appreciated that rather than trying to access the world through the narrow pipe of a text conversation, instead the agent (shell) and I were both manipulating some shared environment collaboratively.
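A rough sketch of the translation step at the heart of an experiment like this — not my shell’s actual implementation, just the general few-shot pattern, with `complete` again standing in for a hypothetical text-completion API:

```python
def to_shell_command(instruction: str, complete) -> str:
    # Few-shot prompt: a couple of worked examples teach the model the
    # instruction -> command mapping, then we append the user's request.
    prompt = (
        "Translate each instruction into a single shell command.\n\n"
        "Instruction: copy my-file.txt to new-file.txt\n"
        "Command: cp my-file.txt new-file.txt\n\n"
        "Instruction: show files modified in the last day\n"
        "Command: find . -mtime -1\n\n"
        f"Instruction: {instruction}\n"
        "Command:"
    )
    # Keep only the first line of the continuation, since the model may
    # keep going and invent further Instruction/Command pairs.
    return complete(prompt).strip().splitlines()[0]
```

In practice you’d want the shell to show the proposed command and ask for confirmation before running it – which is itself an example of reifying the shared context (the pending command) into a visible thing, rather than burying it in conversation.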
Swimming in latent space
My second interface idea is much more abstract and less developed, but it’s a current area of research for me, so I want to plant a seed of this idea in your mind.
In a standard text generation process with an LM, we control the generated text through a single lever: the prompt. Prompts can be very expressive, but the best prompts are not always obvious. There is no sense in which we can use prompts to “directly manipulate” the text being generated – we’re merely pulling levers, with only some rules of thumb to guide us, and the levers adjust the model’s output through some black-box series of digital gears and knobs. Mechanistic interpretability research, understanding how these models work by breaking them down into well-understood sub-components and layers, is showing progress, but I don’t expect even a fully-understood language model (whatever that would mean) to give us the feeling of directly, tactilely guiding text being generated by a language model as if we were “in the loop”.
I’m interested in giving humans the ability to more directly manipulate text generation from language models. In the same way we moved from command-line incantations for moving things around in software space to a more continuous and natural multi-touch paradigm, I want to be able to:
- point a language model towards some “destination” concept, towards which the model tries to guide the text it generates.
- drag some “handle” on an essay to make it less formal or more academic in tone.
- interpolate between two different ideas by dragging one sentence onto the other and seeing which sentences are revealed in between them.
There is active research towards these ideas today. Many researchers are looking into how to build more “guidable” conversational agents out of language models, for example. And in the same way models like DALL-E 2 guide the synthesis of an image using some text prompt, we may also be able to guide synthesis of sentences or paragraphs using high-level prompts.
I’ve been researching how we could give humans the ability to manipulate embeddings in the latent space of sentences and paragraphs, to be able to interpolate between ideas or drag sentences across spaces of meaning. The primary interface challenge here is one of dimensionality: the “space of meaning” that large language models construct in training is hundreds or thousands of dimensions large, and humans struggle to navigate spaces more than 3-4 dimensions deep. What visual and sensory tricks can we use to coax our visual-perceptual systems to understand and manipulate objects in higher dimensions? Projects like Gray Crawford’s Xoromancy explore this question for generative image models (BigGAN). I’m interested in similar possibilities for generative text models.
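The interpolation idea itself is simple to state in code. Here’s a minimal sketch of linearly interpolating between two points in an embedding space – the vectors here are toy stand-ins for real sentence embeddings, and a real system would also need an encoder to produce them and a decoder to turn the intermediate points back into text, neither of which is shown:

```python
import numpy as np

def interpolate(a: np.ndarray, b: np.ndarray, steps: int) -> list:
    # Walk a straight line from embedding `a` to embedding `b`.
    # Each intermediate vector is a candidate "idea in between" the two
    # sentences; a decoder (not shown) would map it back into words.
    return [(1 - t) * a + t * b for t in np.linspace(0.0, 1.0, steps)]
```

Even this naive linear path glosses over real questions – latent spaces are rarely uniform, so points along the straight line may not decode into fluent text – but it captures the core interaction: a continuous handle between two ideas.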
I’ve written before about the feedback loop difference between work and play. I wrote then:
The main difference between work and play is that “work” makes you wait to reap the rewards of your labor until the end, where “play” simply comes with much tighter, often immediate feedback loops. If you can take a project with all the rewards and feedback concentrated at the end and make the feedback loops more immediate, you can make almost any work more play-like, and make every piece of that work a little more motivating.
This is the job of game designers – taking a 10-hour videogame and skillfully distributing little rewards throughout the gameplay so that you never have to ask yourself, “ugh, there’s so much of the game left, and I don’t know if I’m motivated enough to finish it.” What a ridiculous question to ask about a game! And yet, that’s a product of very deliberate design, the same process of design we can take to every other aspect of our work.
If we can build an interface to LMs that lets humans directly guide and manipulate the conceptual “path” a model takes when generating words, it would create a feedback loop much tighter and more engaging than the prompt-wait-retry cycle we’re used to today. It may also give us a new way to think about language models. Rather than “text-completion”, language models may be able to become tools for humans to explore and map out interesting latent spaces of ideas.
Interfaces amplify capabilities
Large language models represent a fundamentally new capability computers have: computers can now understand natural language at a human-or-better level. The cost to do this will get cheaper over time, and the speed and scale at which we can do it will go up quickly. When we imagine software interfaces to harness this language capability for building tools and games, we should ask not “what can we do with a program that completes my sentences?” but “what should a computer that understands language do for us?”
Language understanding unlocks a new world of possible things computers can help humans accomplish and imagine, and the best interfaces for most of those tasks have yet to be imagined. Maybe in a decade we’ll be synthesizing entirely new interfaces just-in-time for every task:
inching closer to fully generated OSs and environments— Gray Crawford 🪡🦯🥢 (@graycrawford) July 24, 2022
imagine interfaces "prompted" in realtime by our actions
70mm f/2.3 photograph of a vast 3D floating document room of a misty submerged hypermodern macOS desktop file system UI, photograph of 3D iOS room pic.twitter.com/XJjpJaoY02
Not “computers can complete text prompts, now what?” but “computers can understand language, now what?”