Beyond chat as interface

One of the great failures of modern computing is how thoroughly it has ignored the presence of the human body, beyond the slightest acknowledgement that humans have a pair of eyeballs and a few fingertips. Mostly, we sit in chairs staring at screens, our hands relegated to pecking at a keyboard or poking at a tiny touchscreen. Compared to the way we use other tools and instruments—from spatulas to screwdrivers to accordions to violins—the way we use computers today grossly underutilizes both the expressiveness and the sensitivity of our bodies.

Consider, for instance, our hands. Human hands are capable of so much more than typing and swiping. Human fingertips are astonishingly sensitive, able to detect wrinkles on a smooth surface with wavelengths as small as 10 nm, features so small they are invisible to the eye. Hands can both manipulate and communicate, act on the world and perceive its most minuscule features.

Enthusiasm around large language models (LLMs) and text-to-image systems, however, has made it easy to forget the impressive facility of our hands and the astonishing capabilities of our visual system, leading instead to proclamations about the impending death of the graphical user interface as an input system. Instead of steering us toward a future where computers make greater use of our bodies’ expressiveness, LLMs pull us further off that course. Why manipulate anything on a computer with a gesture or a point and click when you can type or speak? Just tell the AI what you want, in natural language.

The shortcomings of this assertion should be obvious: not everything can or should be expressed in spoken or written language. The future of computing cannot and will not be only talking to computers. For all their ease of use and benefits, text-based interfaces—whether spoken or written—are a limited interaction modality, a narrow straw of expressive bandwidth. Sometimes it’s very hard to say what you want.

From word processing to photo and video editing to drawing to 3D modeling—most creative work that happens on computers involves manipulation. It’s easier and more natural to trim a video by dragging a slider on a track than by typing or speaking “trim the second clip on the timeline by 3.4 seconds.” It’s quicker to select a portion of a photo for editing by tapping it with your finger or circling it with a lasso than by telling the computer “please select the left eyebrow of the third woman from the right.”1

With direct manipulation we bypass the language centers of our brains and act without thinking, relying instead on muscular gestalts2, which lets us enter a flow state.

The exciting thing about a world where computers “understand” natural language isn’t that text-based interfaces will supplant the GUI. What’s exciting is how they will augment it. The most compelling demos of AI-powered text-based interfaces, especially in creative contexts, always revolve around pairing them with traditional input modalities. Sketch to image. Inpainting and style transfer from reference images. Photoshop’s generative fill. All of these still require the graphical user interface as an input system.

The future of user interfaces isn’t just chat; it’s multimodal.