Some applied research problems in machine learning

9 June 2024
Manhattan, New York

There’s a lot of great research on foundation models these days that is yielding a deeper understanding of their mechanics (how they work) and phenomenology (how they behave in different scenarios). But despite a growing body of research literature, in my work at Notion and in conversations with founders, I’ve noticed a big gap between what we currently know about foundation models and what we still need to know to build valuable tools out of them. This post is a non-exhaustive list of applied research questions I’ve collected from conversations with engineers and founders over the last couple of months.

To be abundantly clear, I don’t mean to say that nobody is working on a question when I include it here, only that more people should be working on it. If you are working on one of these problems in your research, I’d love to hear from you.

Modeling and data

How do models represent style, and how can we more precisely extract and steer it?

A commonly requested feature in almost any LLM-based writing application is “I want the AI to respond in my style of writing,” or “I want the AI to adhere to this style guide.” Aside from costly and complicated multi-stage finetuning processes like Anthropic’s RL with AI feedback (RLAIF; Constitutional AI), we don’t really have a good way to imbue language models with a specific desired voice.

In the previous “AI boom”, generative models like GANs gathered interest for their ability to perform neural style transfer, absorbing the style of one image or video and overlaying it onto another. There is some literature on how to do this without re-training or costly fine-tuning for modern diffusion models as well, but few of these approaches seem to have gotten traction.

Steering mechanisms like ControlNet may be another avenue of exploration for style transfer, but most applications of ControlNet seem to be about steering a model to include specific objects or layouts of objects in a scene rather than steering a model toward a particular style.

I’m currently optimistic that mechanistic interpretability techniques like sparse autoencoders can make advances here by discovering precise, interpretable “features” or concepts corresponding to style. By intervening on these specific features during the generation process, we may be able to manipulate the style of model outputs very precisely.
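
As a rough illustration of what “intervening on a feature” could mean, here is a minimal sketch in plain Python. It assumes each feature corresponds to a direction in the model’s activation space (for example, one row of a sparse autoencoder’s decoder), and that steering means adding a scaled copy of that direction to a hidden activation. The vectors and the names `steer`, `feature_activation`, and `formal_style` are hypothetical, and a real system would do this on tensors inside a forward pass rather than on Python lists.

```python
import math

def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def feature_activation(activation, direction):
    """How strongly the activation points along the feature direction."""
    unit = normalize(direction)
    return sum(a * u for a, u in zip(activation, unit))

def steer(activation, direction, strength):
    """Add `strength` units of the feature direction to the activation."""
    unit = normalize(direction)
    return [a + strength * u for a, u in zip(activation, unit)]

# Hypothetical 3-dimensional hidden state and "formal style" feature direction.
hidden = [0.5, -1.0, 2.0]
formal_style = [0.0, 3.0, 4.0]

steered = steer(hidden, formal_style, strength=2.0)
print(feature_activation(hidden, formal_style))   # projection before steering
print(feature_activation(steered, formal_style))  # projection after steering
```

Under these assumptions, the projection onto the feature direction goes up by exactly `strength`, which is part of why a single scalar “dial” per feature is a plausible control surface.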

When does it make sense to prefer one of RL, DPO/KTO, supervised fine-tuning? When is synthetic data useful?

In most fundamental ML research, model training usually happens in one of two very clear regimes: “data-rich” environments where we assume the amount of data is never a bottleneck, or “data-efficient” settings where we assume we have a hard limit on how much data we can learn from. Academic datasets are generally assumed to perfectly describe the task being studied. For example, most studies using the ImageNet or MS COCO computer vision datasets simply train on the entire dataset and no more, and assume that the dataset only contains correct results.

In reality, in industrial settings, none of these assumptions are true. Instead:

Once a budget and a data quality bar have been set, teams then need to decide between a growing zoo of alignment techniques like RLHF, DPO, KTO, and classic supervised training. How do these techniques compare in apples-to-apples studies across different data quality and quantity regimes? We don’t know. Some teams also face a talent shortage. Many companies trying to build AI products don’t have the in-house expertise to effectively deploy RL, for example.
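
One concrete way to see why the choice matters is to look at what each objective needs as input. The sketch below, in plain Python, contrasts a supervised fine-tuning loss on one demonstration with the DPO loss on one preference pair. It assumes the sequence log-probabilities have already been computed elsewhere; the function names and the default `beta` are illustrative choices, not recommendations.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def sft_loss(logp_target):
    """Supervised fine-tuning: negative log-likelihood of one demonstration."""
    return -logp_target

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Each argument is the log-probability of a full response under either
    the policy being trained or the frozen reference model.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(beta * margin)) == softplus(-beta * margin)
    return softplus(-beta * margin)

# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-3.0, -3.0, -3.0, -3.0))
# Policy already favors the chosen response relative to the reference.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

Even in this toy form the data requirements differ: the supervised loss needs only good demonstrations, while DPO needs paired good and bad responses plus log-probabilities from a reference model, which changes what a team has to collect and pay for.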

It’s not glamorous work, but empirically studying tradeoffs in the frontiers of these techniques would help push the industry forward.

How can models effectively attend to elements of visual design, like precise color, typography, vector shapes, and components?

Imagine a real-world application of graphic design or photography. Chances are, imagery is only a small part of the final creative artifact. These “primary source” visuals are combined with type, animation, layout, and even software interface components to produce a final asset like a poster, an advertisement, or a product mock-up.

In a production pipeline for, say, a brand campaign, original illustrations and photography consume a small portion of the total budget, which includes pre-production like concepting but also tasks like copywriting, typography, overlaying brand assets like logos, and producing layouts for different display formats.

Even the most expensive frontier multimodal models like Claude 3 and GPT-4 can’t reliably differentiate similar colors or type families, and certainly can’t generate images with pixel-perfect alignment between components in a layout. These kinds of tasks probably require domain-specific data or adding on some new component to the model designed for numerically precise visuals.

I suspect companies like Figma are working hard on teaching language models how to work with vector shapes and text in design, but more work from the research community can likely benefit everyone.

How can we make interacting with conversational models feel more natural?

Every conversational interface to a language model adopts the same pattern:

None of these assumptions are true for human conversations, and in general, I’m excited about every advance that closes the gap between this stilted experience and human-like dialogue. As a successful example of this idea, pushing latency down to sub-second levels makes GPT-4o from OpenAI feel much less robotic in an audio conversation compared to other similar systems.

What problems, if solved, could enable more natural dialogue?

Knowledge representations and applied interpretability

Many of these questions assume that within the next year, we’ll see high-quality open datasets of steering vectors and sparse autoencoder features that allow for precisely “reading out” and influencing what production-scale models are thinking. There has been tremendous interest and progress in mechanistic interpretability in the last six months, owing to the early pioneering and field-building work of labs like Conjecture and Anthropic, and it seems all but certain that if the trend continues, this assumption will hold.

In a world where anyone can choose which of hundreds of millions of concepts to inject into a generating model to flexibly steer its behavior and output, what applied ML and interface design research questions should we be asking?

How can we communicate to end users what a “feature” is?

I’ve been giving talks and speaking with engineers and non-technical audiences about interpretability since 2022, and I still struggle to explain exactly what a “feature” is. I often use words like “concept” or “style”, or reach for metaphors like debugging programs or taking fMRI scans of brains. Both metaphors help people outside of the subfield understand the core motivations of interpretability research, but they don’t actually help people imagine what real model features may look like.

I’ve found that the best way to develop intuition for features is just to spend a lot of time looking at real features from real models, using interfaces like my Prism tool. It turns out features can represent pretty much anything about some input, like:

Given the breadth of ideas that features can represent, how can we help the user build a mental model of what features are, and why they’re useful? Are features colors on a color palette? One in a selection of brushes? Knobs and levers in a machine? I think we need to discover the right interface metaphors as a foundation for building more powerful products on this technology.

What’s the best way for an end user to organize and explore millions of latent space features?

I’ve found tens of thousands of interpretable features in my experiments, and frontier labs have demonstrated results with a thousand times more features in production-scale models. No doubt, as interpretability techniques advance, we’ll see feature maps that are many orders of magnitude larger. How can we enable a human to navigate such a vast space of features and find useful ones for a particular task?

The exact interaction mechanics will probably differ for each use case, but I can think of a few broad patterns to borrow from prior art:

How do we compare and reconcile features across models of different scales, families, and modalities?

As understanding and steering models mechanically become more popular, users may expect to be able to take their favorite features from one model to another, or across modalities (e.g. from image to video).

We know from existing research that (1) models trained on similar data distributions tend to learn similar feature spaces and (2) sparse autoencoders trained on similar models tend to learn similar features. It would be interesting to try to build a kind of “model-agnostic feature space” that lets users bring features or styles across models and modalities.
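
A first step toward such a space might be simply matching features between two models. The sketch below assumes one possible approach: since two models’ activation spaces are not directly comparable, each feature is described by its “activation profile” over a shared set of probe inputs, and features are matched by cosine similarity of those profiles. The feature names and numbers are made up for illustration, and this is one way to frame the comparison rather than an established method from the research mentioned above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def match_features(profiles_a, profiles_b):
    """For each feature in model A, find the most similar feature in model B.

    Each profile lists how strongly a feature fires on the same
    ordered set of probe inputs, so profiles are comparable across models.
    """
    matches = {}
    for name_a, profile_a in profiles_a.items():
        best = max(profiles_b, key=lambda nb: cosine(profile_a, profiles_b[nb]))
        matches[name_a] = (best, cosine(profile_a, profiles_b[best]))
    return matches

# Hypothetical activations on four shared probe inputs.
model_a = {"a_formal": [0.9, 0.0, 0.8, 0.1], "a_french": [0.0, 1.0, 0.0, 0.9]}
model_b = {"b_17": [0.1, 2.0, 0.0, 1.7], "b_42": [1.8, 0.1, 1.5, 0.0]}

matches = match_features(model_a, model_b)
for name, (partner, score) in matches.items():
    print(name, "->", partner, round(score, 3))
```

Cosine similarity ignores overall scale, which matters here because two models may express the same concept at very different activation magnitudes.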

What does direct manipulation in latent space look like?

There is some precedent for direct manipulation in a space that isn’t the concrete output space of a modality. In Photoshop and other image editing programs, users can directly manipulate representations of an image other than its raw pixels. For example, I can edit an image in “color space” by dragging over the color curves of an image. In doing so, I’m manipulating the image in a different, more abstract dimension than pixels in space, but I’m still directly manipulating an interface element to explore my options and decide on a result.

With interpretable generative models, the number of possible levers and filters explodes to millions or even billions. While it’s easy to imagine a directly draggable dial for some features like “time of day”, great UI affordances are less obvious for other higher-level features like “symbols associated with speed or agility”.

Closely related to manipulation is selection. In existing design domains like photo or text editing, the industry invented, and our culture collectively absorbed, a rich family of metaphors for how to make a selection, what selection looks like, and what we can do once some slice of text or image is selected. When we can select higher-level concepts like the “verbosity of a sentence” or “vintage-ness of an image”, how should these software interface metaphors evolve?

From primitives to useful building blocks

Historically, use cases and cultural uptake of a new technology go up not when the fundamental primitives are discovered, but when those primitives are combined into more opinionated, more diverse building blocks that are closer to end users’ workflows and needs. Fundamental research helps discover the primitives of a new technology, like language models, alignment techniques (RLHF), and feature vectors, but building for serious contexts of use will inspire the right building blocks to make these primitives really valuable and useful.

I think we’re still in the period of technology adoption that favors implementation purity over usefulness. We build chat experiences with LLMs the way we do not because it’s the best chat interface, but because LLMs have context window limits. We generate images from text rather than image-native features and descriptions because we train our models on paired text-image data.

When we can move beyond polishing primitives toward more opinionated building blocks designed for humans, I think we’ll see a rejuvenation in possibilities at the application layer.

