Earlier this year, we mentioned that we’re bringing computer use capabilities to developers via the Gemini API. Today, we’re releasing the Gemini 2.5 Computer Use model, our new specialized model built on Gemini 2.5 Pro’s visual understanding and reasoning capabilities, which powers agents capable of interacting with user interfaces (UIs). It outperforms leading alternatives on multiple web and mobile control benchmarks, all with lower latency. Developers can access these capabilities via the Gemini API in Google AI Studio and Vertex AI.
While AI models can interface with software through structured APIs, many digital tasks still require direct interaction with graphical user interfaces, for example, filling in and submitting forms. To complete these tasks, agents must navigate web pages and applications just as humans do: by clicking, typing and scrolling. The ability to natively fill out forms, manipulate interactive elements like dropdowns and filters, and operate behind logins is a crucial next step in building powerful, general-purpose agents.
How it works
The model’s core capabilities are exposed through the new `computer_use` tool in the Gemini API and should be operated within a loop. Inputs to the tool are the user request, a screenshot of the environment, and a history of recent actions. The input can also specify functions to exclude from the full list of supported UI actions, or additional custom functions to include.
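The loop and its inputs can be pictured with a small, self-contained Python sketch. It does not call the Gemini API, and every name in it is hypothetical: the action names in `SUPPORTED_ACTIONS`, the `ToolInput` and `Action` types, `run_agent_loop`, and the scripted `stub_model` that stands in for the model. It also assumes that the model signals completion by returning no action and that the caller caps the number of steps; neither detail is stated above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

# Hypothetical action names; the real list of supported UI actions is defined by the API.
SUPPORTED_ACTIONS = ["click", "type_text", "scroll"]


@dataclass
class Action:
    name: str
    args: dict = field(default_factory=dict)


@dataclass
class ToolInput:
    user_request: str              # what the user asked for
    screenshot: bytes              # current view of the environment
    recent_actions: List[Action]   # short history of what was already done
    allowed_actions: List[str]     # supported actions minus exclusions, plus custom ones


def build_allowed_actions(excluded: Sequence[str] = (), custom: Sequence[str] = ()) -> List[str]:
    """Drop excluded functions from the supported list, then append custom ones."""
    return [a for a in SUPPORTED_ACTIONS if a not in excluded] + list(custom)


def run_agent_loop(
    user_request: str,
    model: Callable[[ToolInput], Optional[Action]],
    take_screenshot: Callable[[], bytes],
    execute: Callable[[Action], None],
    excluded: Sequence[str] = (),
    custom: Sequence[str] = (),
    history_size: int = 3,
    max_steps: int = 10,
) -> List[Action]:
    """Screenshot, ask the model for an action, execute it, and repeat."""
    allowed = build_allowed_actions(excluded, custom)
    history: List[Action] = []
    for _ in range(max_steps):  # assumed safety cap on the number of iterations
        tool_input = ToolInput(user_request, take_screenshot(), history[-history_size:], allowed)
        action = model(tool_input)
        if action is None:  # assumed convention: no action means the task is finished
            break
        if action.name not in allowed:
            raise ValueError(f"model proposed a disallowed action: {action.name}")
        execute(action)
        history.append(action)
    return history


def stub_model(tool_input: ToolInput) -> Optional[Action]:
    """Stand-in for the model: types once, clicks once, then stops."""
    script = [Action("type_text", {"text": "hello"}), Action("click", {"x": 10, "y": 20})]
    step = len(tool_input.recent_actions)
    return script[step] if step < len(script) else None


if __name__ == "__main__":
    done = run_agent_loop(
        "fill in the form",
        stub_model,
        take_screenshot=lambda: b"",       # placeholder screenshot bytes
        execute=lambda action: None,       # placeholder executor that does nothing
        excluded=["scroll"],
        custom=["ask_user"],
    )
    print([a.name for a in done])
```

In a real client, `take_screenshot` and `execute` would be backed by a browser or device automation layer, and `model` would wrap the actual API request. The sketch only shows how the three inputs and the exclude/custom lists fit together on each iteration.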

