@@ -29,33 +29,24 @@ All models that support `ComputerAgent.run()` also support `ComputerAgent.predic
2929
3030These models are optimized specifically for click prediction and UI element grounding:
3131
32- ### GTA1-7B
32+ ### OmniParser
3333
34- State- of-the-art grounding model from the [ GUI Agent Grounding Leaderboard ] ( https://gui-agent.github.io/grounding-leaderboard/ ) :
34+ OCR-focused set- of-marks model that requires an LLM for click prediction :
3535
36- - ` huggingface-local/HelloKKMe/GTA1-7B `
36+ - ` omniparser ` (requires combination with any LiteLLM vision model)
3737
38- ``` python
39- agent = ComputerAgent(" huggingface-local/HelloKKMe/GTA1-7B" , tools = [computer])
38+ ### GTA1-7B
4039
41- # Predict click coordinates for UI elements
42- coords = agent.predict_click(" find the submit button" )
43- print (f " Click coordinates: { coords} " ) # (450, 320)
40+ State-of-the-art grounding model from the [ GUI Agent Grounding Leaderboard] ( https://gui-agent.github.io/grounding-leaderboard/ ) :
4441
45- # Note: GTA1 cannot perform autonomous task planning
46- # This will raise an error:
47- # agent.run("Fill out the form and submit it")
48- ```
42+ - ` huggingface-local/HelloKKMe/GTA1-7B `
4943
5044## Usage Examples
5145
5246``` python
5347# Using any grounding model for click prediction
5448agent = ComputerAgent(" claude-3-5-sonnet-20241022" , tools = [computer])
5549
56- # Take a screenshot first
57- screenshot = agent.computer.screenshot()
58-
5950# Predict coordinates for specific elements
6051login_coords = agent.predict_click(" find the login button" )
6152search_coords = agent.predict_click(" locate the search text field" )
@@ -66,6 +57,33 @@ print(f"Search field: {search_coords}")
6657print (f " Menu icon: { menu_coords} " )
6758```
6859
60+ ``` python
61+ # OmniParser is just for OCR, so it requires an LLM for predict_click
62+ agent = ComputerAgent(" omniparser+anthropic/claude-3-5-sonnet-20241022" , tools = [computer])
63+
64+ # Predict click coordinates using composed agent
65+ coords = agent.predict_click(" find the submit button" )
66+ print (f " Click coordinates: { coords} " ) # (450, 320)
67+
68+ # Note: Cannot use omniparser alone for click prediction
69+ # This will raise an error:
70+ # agent = ComputerAgent("omniparser", tools=[computer])
71+ # coords = agent.predict_click("find button") # Error!
72+ ```
73+
74+ ``` python
75+ agent = ComputerAgent(" huggingface-local/HelloKKMe/GTA1-7B" , tools = [computer])
76+
77+ # Predict click coordinates for UI elements
78+ coords = agent.predict_click(" find the submit button" )
79+ print (f " Click coordinates: { coords} " ) # (450, 320)
80+
81+ # Note: GTA1 cannot perform autonomous task planning
82+ # This will raise an error:
83+ # agent.run("Fill out the form and submit it")
84+ ```
85+
86+
6987---
7088
7189For information on combining grounding models with planning capabilities, see [ Composed Agents] ( ./composed-agents ) .
0 commit comments