Originally released Dec 20, 2024.

Below are my thoughts on the market valuation of AI, and how we should progress to computer use agents.

THIS ESSAY IS ABOUT COMPUTER USE/SCREEN AGENTS.

AI’s market value is underpriced on a 5+ year horizon. However, in my opinion, the basis of the industry’s current valuation is misguided.

I don’t think it’s productive to speculate about A(G/M/S)I, but since much of the value currently placed on AI rests on the possibility of A(G/M/S)I, I’m going to lay out a few axiomatic beliefs I hold and what I think they imply.

Beliefs:

  1. Current architectures, specifically transformer LLMs, are sufficient to replace a good portion of economically valuable work. In fact, I don’t think we even need better ML techniques. I think that, for the most part, there is simply a data bottleneck.

    1. It’s quite easy to see that no current LLM agent works. I think this is caused by a lot of (very smart but) misguided efforts. Attempting to scale computer-use data by paying for it or by launching not-yet-useful products, and then brute-force scaling VLMs on it, is feasible, but it is a path of significant resistance.

      Computer environments, while mostly more discrete than the real world, are many different ‘worlds’ rolled up into one. This makes them very hard to collect data for effectively, which in turn makes it difficult to get useful computer use models. This, I think, is the most significant bottleneck. The big labs have more than enough talent to produce a computer use agent if they had sufficient data. They don’t.

      I think outside observers overlook the possibility of alternative paths toward working computer use agents, and assume that the infeasibility of the current path implies that no agents will work anytime soon. As a result, current market sentiment ascribes most of AI’s market value to some discounted estimate of the value A(G/M/S)I would bring.

  2. I think a significant portion of this value will come in the form of general computer use agents and physical robots. The former will likely come first.

    1. I think the value is self-evident, and impossible to price.

    2. I think the former will come first because there is a significantly easier way of scaling data for computer-use agents compared to, say, humanoid robotics.

  3. Most people do not care if the models get moderately ‘smarter’ or can answer some questions you may need a PhD for. They just need them to do things and be robust.

    1. The biggest models today outperform >90% of people at cognitive work, albeit while being a bit amnesic. They cannot, however, take those people’s jobs, even if we only consider computer-driven tasks.

    2. I don’t want to wait a few minutes for an LLM just to order my pizza incorrectly.

      • Tangentially, I think most of the initial value of computer use agents will come from resolving small, simple tasks that span multiple contexts and would otherwise force the user to switch contexts. The point is to decrease the cognitive load of menial but annoying novel tasks (e.g., pull up the set of notes I have for XYZ class, draft an email about why these numbers look weird, etc.). I think the pizza example is terribly uncreative; it’s not really a problem, just a fun toy example. Think alt-tab, but for dispatching background agents.

  4. You need to learn dynamics for effective computer use agents.

    1. A large part of the difficulty for computer use agents comes, as previously mentioned, from the differing dynamics between programs/websites. The word ‘dynamics’ is a bit loaded here, so I’ll clarify. In my mind, dynamics lies on a continuum with respect to time horizon: it’s the ability to understand what a single action, or a sequence of actions, will do.

      • Higher level dynamics (e.g., planning a trip may require booking plane tickets) differ from lower level dynamics (this button does this), which differ from medium level dynamics (this website fulfills this esoteric desire), in how present they are in training data and how easy they are to collect. These levels are relative, which makes life quite convenient.

    2. I think this is almost obvious. Taken to the extreme, if the agent had no understanding of dynamics, it’d be useless. The question is: how much does it have to understand to be useful?

  5. You need mass consumer adoption to collect enough dynamics data to MAYBE get a minimally-scaffolded computer use agent using a very smart VLM. But there is no chicken and egg problem with computer use agents.

    1. Rationally, there is no ‘divine’ reason for there to be a chicken and egg problem.

    2. It’s unclear to me how well/quickly paying for data will work due to the “multiple different ‘worlds’” problem.

    3. I think the most difficult data to obtain is low and medium level dynamics for enough environments, regardless of whether it comes in the form of text/images/video. How are you going to get this from anyone but a significant number of users?

      • Higher level dynamics exist in sufficient mass inside readily available data.

        Granted, for specific programs, I think it’s easy enough to specify low & medium level dynamics (this is an IDE with XYZ debugging functionality, it lets you write code, and this button runs code), and higher level dynamics don’t really matter in that scenario.

        What about the open-ended web?

        I don’t think this information can be expected from all programs/developers, to any consistent degree of quality.

        Also, you know, UIs change, updates happen, etc.

      • I don’t even think you need action-by-action workflows for medium level dynamics to get effective-enough computer use agents; i.e., your medium level dynamics don’t have to be that granular. Low level dynamics are quite easy to determine; there are just a lot of them to learn. How they can be combined with higher level dynamics to fill this middle gap is a matter of execution. There is some obsession with learning and retrieving workflows; I think this is prima facie not robust, and also unnecessary.

        I think the idea of action-by-action workflows as a necessity for medium level dynamics is slightly misguided. The most important part of these workflows isn’t demonstrating exactly how something can be done in some environment, but that some specific website lets you fulfill some potentially esoteric desire. With sufficient training data, as long as an agent knows that something can be done in some rough location (e.g., this site has a settings page which allows you to deactivate your account), good enough low+high level dynamics are sufficient.

        However, even this will be very difficult to scale by brute force.

      • Exploration is very difficult for agents, and very painful for users. You mainly want to understand enough dynamics to minimize this for users.

      • In short, I think you need very good low level dynamics, some rough medium level dynamics, and good high level dynamics to get useful (but not fully robust) computer use agents that are good enough for mass adoption. Going from there to robust computer use agents is a comparatively clearer path. (A rough code sketch of this decomposition follows this list.)

      • I think medium level dynamics will be the greatest roadblock to robust computer use agents. Low level dynamics, while more numerous, are easier to learn.

    4. To argue for training: no amount of scaffolded use of the largest available LM will beat a small, preference-aligned VLM trained for computer use. The latter is simply quicker and, by definition, serves the user better. Furthermore, you need to understand these agent-related preferences. Do you need to do a lot of (if any) training for a working product? Likely not. Will all LLM-driven agents be purpose-trained in a ~decade? Yes.

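To make the decomposition above concrete, here is a minimal, hypothetical sketch (Python; every class, function, and site name is invented for illustration, not taken from any real system) of how rough medium level hints could bridge high level intent and low level grounding without action-by-action workflows:

```python
# Illustrative only: all names here are hypothetical stand-ins.
from dataclasses import dataclass, field

@dataclass
class MediumLevelHint:
    """Rough, non-granular knowledge: where a capability lives, not how to click through it."""
    site: str
    capability: str      # e.g. "deactivate account"
    rough_location: str  # e.g. "settings"

@dataclass
class LowLevelAction:
    """A grounded UI primitive the agent is assumed to already handle well (click, type, scroll)."""
    kind: str
    target: str

@dataclass
class AgentKnowledge:
    hints: list = field(default_factory=list)

    def plan(self, goal: str, site: str) -> list:
        """High level dynamics decompose the goal; medium level hints narrow the search;
        low level dynamics (stubbed as actions here) fill in the rest."""
        hint = next((h for h in self.hints
                     if h.site == site and h.capability in goal), None)
        if hint is None:
            # No medium level knowledge: fall back to exploration,
            # the costly case we want to minimize for users.
            return [LowLevelAction("explore", site)]
        return [
            LowLevelAction("navigate", f"{site}/{hint.rough_location}"),
            LowLevelAction("locate", hint.capability),
            LowLevelAction("confirm", hint.capability),
        ]

knowledge = AgentKnowledge(hints=[
    MediumLevelHint("example-social.com", "deactivate account", "settings"),
])
print(knowledge.plan("deactivate account", "example-social.com"))
```

The only point of the sketch is the shape of the knowledge: the medium level entry records roughly where a capability lives, and everything more granular is left to low level grounding rather than to stored action-by-action workflows.
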
Implications:

I think there should be a focus on developing a scalable, ‘good-enough’, data-efficient scaffolding scheme -> productize -> gather data -> get general computer use agents. This path is very underexplored.

I think the feasibility of such a scheme is heavily underestimated, and that the current value judgement of AI lies in the promise of some unclearly defined A(G/M/S)I rather than in the immediacy of robust computer use agents. A new HCI is likely to introduce use-cases we are currently unaware of, ones that are impossible to price or to guess the feel of. We sort of ‘feel the AGI’, but we haven’t ‘felt’ agents yet.

How can we abuse code-gen and known engineering techniques to build scalable ‘good-enough’ scaffolding?
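One possible shape of an answer, sketched very loosely in Python under my own assumptions: generate a throwaway script per task, run it, fall back to a slower step-by-step screen agent when it fails, and keep every trace as dynamics data either way. `call_llm` and `stepwise_agent` are hypothetical placeholders, not real APIs.

```python
# A hedged sketch, not a working product: the model call and the fallback
# agent are stubs; only the overall control flow is the point.
import json
import subprocess
import sys
import tempfile

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a code-gen model call. Returns Python source for the task."""
    return "print('pretend this script drives the browser for the task')"

def stepwise_agent(task: str) -> dict:
    """Hypothetical fallback: a slower screen-agent loop that acts one step at a time."""
    return {"task": task, "mode": "stepwise", "succeeded": True}

def run_task(task: str, log_path: str = "dynamics_traces.jsonl") -> dict:
    # 1. Ask the model for a small script that should complete the task outright.
    source = call_llm(f"Write a short Python script that completes: {task}")
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        script_path = f.name

    # 2. Run the generated script; 'good-enough' scaffolding means most tasks end here.
    proc = subprocess.run([sys.executable, script_path],
                          capture_output=True, text=True, timeout=120)
    trace = {"task": task, "mode": "codegen", "succeeded": proc.returncode == 0,
             "stdout": proc.stdout, "stderr": proc.stderr}

    # 3. If code-gen wasn't enough, fall back to step-by-step control so the
    #    product stays usable, and the failure itself becomes dynamics data.
    if not trace["succeeded"]:
        trace = stepwise_agent(task)

    # 4. Either way, log the trace; this is the "gather data" step of the pipeline.
    with open(log_path, "a") as log:
        log.write(json.dumps(trace) + "\n")
    return trace

if __name__ == "__main__":
    print(run_task("export last month's invoices from the billing page"))
```

Whether something like this is ‘good enough’ comes down to how often step 2 succeeds on the long tail of sites, which is exactly the medium level dynamics question from the beliefs above.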