Some applied research problems in machine learning

There’s a lot of great research on foundation models these days that are yielding deeper understanding of their mechanics (how they work) and phenomenology (how they behave in different scenarios). But despite a growing body of research literature, in my work at Notion and conversations with founders, I’ve noticed a big gap between what we currently know about foundation models and what we still need to build valuable tools out of them. This post is a non-exhaustive list of applied research questions I’ve collected from conversations with engineers and founders over the last couple of months.

To be abundantly clear, I don’t mean to say that nobody is working on a question when I include it here, but only that more people should be working on them. If you are working on one of these problems in your research, I’d love to hear from you.

Modeling and data

How do models represent style, and how can we more precisely extract and steer it?

A commonly requested feature in almost any LLM-based writing application is “I want the AI to respond in my style of writing,” or “I want the AI to adhere to this style guide.” Aside from costly and complicated multi-stage finetuning processes like Anthropic’s RL with AI feedback (RLAIF; Constitutional AI), we don’t really have a good way to imbue language models with a specific desired voice.

In the previous “AI boom”, generative models like GANs gathered interest for their ability to perform neural style transfer, absorbing the style of one image or video and overlaying it onto another. There is some literature on how to do this without re-training or costly fine-tuning for modern diffusion models as well, but few seem to have gotten traction.

Steering mechanisms like ControlNet may be another avenue of exploration for style transfer, but most applications of ControlNet seem to be about steering a model to include specific objects or layouts of objects in a scene rather than steering a model toward a particular style.

I’m currently optimistic that mechanistic interpretability techniques like sparse autoencoders can make advances here by discovering precise, interpretable “features” or concepts corresponding to style. By intervening on these specific features during the generation process, we may be able to manipulate style of model outputs very precisely.

When does it make sense to prefer one of RL, DPO/KTO, supervised fine-tuning? When is synthetic data useful?

In most fundamental ML research, model training usually happens in one of two very clear regimes: “data-rich” environments where we assume the amount of data is never a bottleneck, or “data-efficient” settings where we assume we have a hard limit on how much data we can learn from. Academic datasets are generally assumed to perfectly describe the task being studied. For example, most studies using the ImageNet or MS COCO computer vision datasets simply train on the entire dataset and no more, and assume that the dataset only contains correct results.

In reality, in industrial settings, none of these assumptions are true. Instead:

Datasets can always be expanded by paying for human labor to find or label more samples. For frontier models today, each data sample may cost up to tens of dollars to acquire.
We can’t assume even “high quality” datasets always demonstrate correct task behavior. Academic evaluation sets like MMLU contain many instances of incorrect or nonsensical answers, and I’ve encountered subtle reasoning mistakes in datasets used to train models I’ve personally worked on.
There’s often a smooth tradeoff curve between quality and quantity of training data. Teams can choose to pay more per high-quality sample to build a smaller dataset, or spend the same budget to build a much larger, lower-quality dataset. In absence of scaling law-style empirical rules for training data quality vs quality tradeoffs, teams are left to make educated guesses on these decisions.

Once a budget and data quality bar has been set, teams then need to decide between a growing zoo of alignment techniques like RLHF, DPO, KTO, and classic supervised training. How do these techniques compare in apples-to-apples studies considering different data quality and quantity regimes? We don’t know. Furthermore, there’s a talent shortage as well for some teams. Many companies trying to build AI products don’t have in-house expertise to effectively deploy RL, for example.

It’s not glamorous work, but empirically studying tradeoffs in the frontiers of these techniques would help push the industry forward.

How can models effectively attend to elements of visual design, like precise color, typography, vector shapes, and components?

Imagine a real-world application of graphic design or photography. Chances are, imagery is only a small part of the final creative artifact. These “primary source” visuals are combined with type, animation, layout, and even software interface components to produce a final asset like a poster, an advertisement, or a product mock-up.

In a production pipeline for, say, a brand campaign, original illustrations and photography consume a small portion of the total budget, which includes pre-production like concepting but also tasks like copywriting, typography, overlaying brand assets like logos, and producing layouts for different display formats.

Even the most expensive frontier multimodal models like Claude 3 and GPT-4 can’t reliably differentiate similar colors or type families, and certainly can’t generate images with pixel-perfect alignment between components in a layout. These kinds of tasks probably require domain-specific data or adding on some new component to the model designed for numerically precise visuals.

I suspect companies like Figma are working hard on teaching language models how to work with vector shapes and text in design, but more work from the research community can likely benefit everyone.

How can we make interacting with conversational models feel more natural?

Every conversational interface to a language model adopts the same pattern:

A chat history sidebar, with each conversation lasting just a few turns
New sessions always begin in a brand-new thread
Every user query must always elicit exactly one response

None of these assumptions are true for human conversations, and in general, I’m excited about every advance that closes the gap between this stilted experience and human-like dialogue. As a successful example of this idea, pushing latency down to sub-second levels makes GPT-4o from OpenAI feel much less robotic in an audio conversation compared to other similar systems.

What problems, if solved, could enable more natural dialogue?

Interruptible interaction. The user and the AI system should both feel comfortable interjecting at any point in the conversation. Because conversations are currently uninterruptible, humans and models are both incentivized to send messages with maximal context and content, where in interruptible human conversations interlocutors get away with saying much less in each message and progressively sharing more.
Meaningful sense of time. When I come back to a conversation in ChatGPT or Claude, the AI system has little idea whether a few minutes, a few days, or a few months passed in between. Having this kind of context would probably lead to even more natural interactions.
Continuous, indefinite conversations. I don’t start a new texting thread on my phone every day with the same person; I use a single long-running thread. In a conversation that spans weeks or months or years, I can continue to reference both recent and nearly forgotten ideas. If we can move past the “each session is a blank slate” problem, I think we can make conversations feel much more natural.

Part of this challenge is that the AI system must pick up on when I want to change topics. In human conversations, when I come back to a conversation hours or days later, it’s clear I’ll probably want to start a new conversation. In absence of meaningful sense of time, models may have a hard time knowing when to faithfully reference previous messages rather than moving onto a new topic.

Knowledge representations and applied interpretability

Many of these questions assume that within the next year, we’ll see high-quality open datasets of steering vectors and sparse autoencoder features that allow for precisely “reading out” and influencing what production-scale models are thinking. There has been tremendous interest and progress in mechanistic interpretability in the last six months, owing to early pioneering and field building work of labs like Conjecture and Anthropic, and it seems all but certain that if the trend continues, this assumption will prove out.

In a world where anyone can choose which of hundreds of million concepts to inject into a generating model to flexibly steer its behavior and output, what applied ML and interface design research questions should we be asking?

How can we communicate to end users what a “feature” is?

I’ve been giving talks and speaking with engineers and non-technical audiences about interpretability since 2022, and I still struggle to explain exactly what a “feature” is. I often use words like “concept” or “style”, or establish metaphors to debugging programs or making fMRI scans of brains. Both metaphors help people outside of the subfield understand core motivations of interpretability research, but don’t actually help people imagine what real model features may look like.

I’ve found that the best way to develop intuition for features is just to spend a lot of time looking at real features from real models, using interfaces like my Prism tool. It turns out features can represent pretty much anything about some input, like:

Style attributes, from voice, tone, and word choice in text models to artistic style, color grading, and lighting in image models
Topics, like sports, cooking, artificial intelligence, parenthood, and law
Structure or input category, like “cooking recipe”, “televised broadcast announcement”, “enumeration with roman numerals”, and “first-person narratives with quotes”
Language or context, like “written in French”, “C++ code snippets”, and “legal contracts”
Even higher level semantic features like “internal character conflict”, “predictions of future growth”, and “philosophical reflections on humanity”

Given the breadth of ideas that features can represent, how can we help the user build a mental model of what features are, and why they’re useful? Are features colors on a color palette? One in a selection of brushes? Knobs and levers in a machine? I think we need to discover the right interface metaphors as a foundation for building more powerful products on this technology.

What’s the best way for an end user to organize and explore millions of latent space features?

I’ve found tens of thousands of interpretable features in my experiments, and frontier labs have demonstrated results with a thousand times more features in production-scale models. No doubt, as interpretability techniques advance, we’ll see feature maps that are many orders of magnitude larger. How can we enable a human to navigate such a vast space of features and find useful ones for a particular task?

The exact interaction mechanics will probably differ for each use case, but I can think of a few broad patterns to borrow from prior art:

Taxonomies. Researchers and designers could categorize millions of features into clear hierarchical categories (a bit like what I did above) and allow exploring them by traversing down the tree, from broad groups to specific feature vectors. Because similar features tend to also be nearby each other in latent space, this may also be possible to automate through techniques like clustering.
Spatial maps. Perhaps most common today, features’ geometric nature lends itself nicely to spatial visualization like the ones in Anthropic’s recent work on finding features in their Claude 3 model. I tend to shy away from spatial visualization because it’s often more eye candy than utility, but it may help end users explore a space of features and get a sense of what kinds of concepts a model represents, at which levels of abstraction.
Delegating to a conversational AI system. Another way to work around the sheer number of features may simply be to delegate the task to a language model that can attend to millions of features at once and figure out which features are best suited for a particular task on behalf of the user. Imagine specifying a writing style, for example, and having a chat model automatically choose and activate the most appropriate features to evoke the user’s desired tone of voice.
Templates. End users may interact with features mostly by installing “feature packs” or “collections” that group together dozens, hundreds, or even thousands of features into a higher-level, more meaningful package. For example, a feature collection may combine hundreds of different style features that together describe a particular author’s writing voice.
New notation or abstraction. I think it’s possible that the best way to represent features may be something entirely new. Just as mathematicians introduce new symbols and notational conventions alongside a new abstraction like “integral” or “modular arithmetic”, maybe we should try to invent new ways of writing down features. See also, Notational intelligence.

How do we compare and reconcile features across models of different scales, families, and modalities?

As understanding and steering models mechanically become more popular, users may expect to be able take their favorite features from one model to another, or across modalities (e.g. from image to video).

We know from existing research that (1) models trained on similar data distributions tend to learn similar feature spaces and (2) sparse autoencoders trained on similar models tend to learn similar features. It would be interesting to try to build a kind of “model-agnostic feature space” that lets users bring features or styles across models and modalities.

What does direct manipulation in latent space look like?

There is some precedent for direct manipulation in a space that isn’t the concrete output space of a modality. In Photoshop and other image editing programs, users can directly manipulate representations of an image that isn’t the raw pixels. For example, I can edit an image in “color space” by dragging over the color curves of an image. In doing so, I’m manipulating the image in a different more abstract dimension than pixels in space, but I’m still directly manipulating an interface element to explore my options and decide on a result.

With interpretable generative models, the number of possible levers and filters explodes to millions or even billions. While it’s easy to imagine a directly draggable dial for some features like “time of day”, great UI affordances are less obvious for other higher-level features like “symbols associated with speed or agility”.

Closely related to manipulation is selection. In existing design domains like photo or text editing, the industry invented, and our culture collectively absorbed, a rich family of metaphors for how to make a selection, what selection looks like, and what we can do once some slice of text or image is selected. When we can select higher-level concepts like the “verbosity of a sentence” or “vintage-ness of an image”, how should these software interface metaphors evolve?

From primitives to useful building blocks

Historically, use cases and cultural uptake of a new technology go up not when the fundamental primitives are discovered, but when those primitives are combined into more opinionated, more diverse building blocks that are closer to end users’ workflows and needs. Fundamental research helps discover the primitives of a new technology, like language models, alignment techniques (RLHF), and feature vectors, but building for serious contexts of use will inspire the right building blocks to make these primitives really valuable and useful.

I think we’re still in the period of technology adoption that favors implementation purity over usefulness. We build chat experiences with LLMs the way we do not because it’s the best chat interface, but because LLMs have context window limits. We generate images from text rather than image-native features and descriptions because we train our models on paired text-image data.

When we can move beyond polishing primitives toward more opinionated building blocks designed for humans, I think we’ll see a rejuvenation in possibilities at the application layer.

← In the beginning… was the command line

Personalization, measuring with taste, and intrinsic interfaces →

I share new posts on my newsletter. If you liked this one, you should consider joining the list.

Have a comment or response? You can email me.