It’s Far More Interesting that LLMs Can Read (and See) Than Write
Giving computers eyes for the first time
Not many people can write well. Myself included. Writing well is a nuanced craft, one that most people (like me) find hard.
So it makes sense that we’re collectively mesmerised by the fact that large language models (LLMs) can “write.” When they respond to prompts or spin out sentences that read like they came from a competent human, it’s captivating. But focusing on what they can output from a prompt is only half the story.
Something potentially far bigger: LLMs can read. And not just text: they can “see” images, understand screenshots of products, and pull insights from video footage.
It makes me think of the scene in every crime movie where the detective spends hours scrubbing through CCTV footage to spot the suspect. That concept is dead.
Today, we could upload that footage to an LLM, type in “Find the person in the red coat,” or “Count the people entering and exiting between 8 p.m. and midnight,” and it’s done in minutes.
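To make that concrete, here’s a minimal sketch of what that workflow could look like today: sample a frame every so often and ask a vision-capable model about it. Everything here is an assumption for illustration, the model name, the 30-second sampling interval, the file name and the prompt, rather than any particular product.

```python
# A minimal sketch: sample frames from CCTV footage and ask a vision-capable
# model about each one. Model name, sampling interval, file name and prompt
# are illustrative assumptions, not a specific product.
import base64

import cv2  # pip install opencv-python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def frames_every_n_seconds(path: str, seconds: int = 30):
    """Yield (timestamp, JPEG bytes) for one frame every `seconds` of footage."""
    video = cv2.VideoCapture(path)
    fps = video.get(cv2.CAP_PROP_FPS) or 25
    step = int(fps * seconds)
    index = 0
    while True:
        ok, frame = video.read()
        if not ok:
            break
        if index % step == 0:
            encoded, jpeg = cv2.imencode(".jpg", frame)
            if encoded:
                yield index / fps, jpeg.tobytes()
        index += 1
    video.release()


for timestamp, jpeg in frames_every_n_seconds("cctv_footage.mp4"):
    b64 = base64.b64encode(jpeg).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Is there a person in a red coat in this frame? Answer yes or no."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    answer = response.choices[0].message.content or ""
    if answer.lower().startswith("yes"):
        print(f"Possible match around {timestamp:.0f} seconds in")
```

Sampling one frame every 30 seconds keeps the number of API calls manageable for hours of footage; a real tool would batch frames or narrow the time window first. The point is how little glue code sits between “footage” and “answer.”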
Take another recent example from Ethan Mollick, using Claude’s computer use capability and the following prompt:
Watch here: Ethan Mollick (@emollick) on X
The mind boggles at what else this could be applied to. Below are a few quick ideas that jumped into my head:
Imagine being able to constantly monitor your product’s interface and receive instant notifications the moment there’s a visual change, no manual tracking required. The LLM just sees it. UI, front-end, and mobile testing have always been a headache when building products, but now the pain of keeping up with every little change could be a thing of the past (a rough sketch of what this could look like follows these examples).
As a teenager, I used to record football matches for a local club, capturing footage for the coaches to analyse later. Now, imagine an LLM running the analysis in real time, offering player-specific data like pass accuracy by foot, position, and maybe even head swivels for midfielders, measuring how often they scan the pitch. Some software can handle the basics, but with this tech we could uncover entirely new stats we never thought possible.
Most home appliances still come with instruction manuals, but I know of companies working on a solution: simply point your phone at a complex machine, and an LLM can instantly provide a step-by-step setup guide based on the model it recognises. I’d love to record myself assembling Ikea furniture and get a real-time alert if I start making a mistake (while kindly ignoring any swearing in the process).
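Returning to the UI-monitoring idea above, here’s the rough sketch I mentioned, under some obvious assumptions: an OpenAI-style vision model and two placeholder screenshot files of the same screen taken a day apart. A real pipeline would capture the screenshots automatically and wire the answer into an alert.

```python
# A rough sketch of the UI-monitoring idea: show yesterday's and today's
# screenshots of the same screen to a vision-capable model and ask whether
# anything visible changed. File names and model are illustrative assumptions.
import base64

from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def as_image_part(path: str) -> dict:
    """Package a PNG screenshot as an image part of a chat message."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"}}


response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": ("These are two screenshots of the same product screen, "
                      "taken a day apart. List any visual differences, or reply "
                      "'no change' if they look identical.")},
            as_image_part("checkout_yesterday.png"),
            as_image_part("checkout_today.png"),
        ],
    }],
)
print(response.choices[0].message.content)
```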
For the last 18 months, everyone’s been focused on what LLMs can generate in response to prompts. And yes, that will keep evolving. But allowing LLMs to see, to genuinely interpret and act on the world around them, feels like an exciting next stage. The potential applications that leverage LLMs’ ability to see and respond will be fascinating to explore.
It’s still early days, of course. But I don’t think I’ve ever been this excited about the future, at least when it comes to technology and the products waiting to be built. It feels like we’re giving computers eyes for the first time, and that’s going to be really interesting.