
Commit 4eccf05

Added omniparser to grounding page

1 parent 760faf1 commit 4eccf05

File tree

1 file changed

+33
-15
lines changed


docs/content/docs/agent-sdk/supported-agents/grounding-models.mdx

Lines changed: 33 additions & 15 deletions
````diff
@@ -29,33 +29,24 @@ All models that support `ComputerAgent.run()` also support `ComputerAgent.predic
 
 These models are optimized specifically for click prediction and UI element grounding:
 
-### GTA1-7B
+### OmniParser
 
-State-of-the-art grounding model from the [GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/):
+OCR-focused set-of-marks model that requires an LLM for click prediction:
 
-- `huggingface-local/HelloKKMe/GTA1-7B`
+- `omniparser` (requires combination with any LiteLLM vision model)
 
-```python
-agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
+### GTA1-7B
 
-# Predict click coordinates for UI elements
-coords = agent.predict_click("find the submit button")
-print(f"Click coordinates: {coords}") # (450, 320)
+State-of-the-art grounding model from the [GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/):
 
-# Note: GTA1 cannot perform autonomous task planning
-# This will raise an error:
-# agent.run("Fill out the form and submit it")
-```
+- `huggingface-local/HelloKKMe/GTA1-7B`
 
 ## Usage Examples
 
 ```python
 # Using any grounding model for click prediction
 agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])
 
-# Take a screenshot first
-screenshot = agent.computer.screenshot()
-
 # Predict coordinates for specific elements
 login_coords = agent.predict_click("find the login button")
 search_coords = agent.predict_click("locate the search text field")
@@ -66,6 +57,33 @@ print(f"Search field: {search_coords}")
 print(f"Menu icon: {menu_coords}")
 ```
 
+```python
+# OmniParser is just for OCR, so it requires an LLM for predict_click
+agent = ComputerAgent("omniparser+anthropic/claude-3-5-sonnet-20241022", tools=[computer])
+
+# Predict click coordinates using composed agent
+coords = agent.predict_click("find the submit button")
+print(f"Click coordinates: {coords}") # (450, 320)
+
+# Note: Cannot use omniparser alone for click prediction
+# This will raise an error:
+# agent = ComputerAgent("omniparser", tools=[computer])
+# coords = agent.predict_click("find button") # Error!
+```
+
+```python
+agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
+
+# Predict click coordinates for UI elements
+coords = agent.predict_click("find the submit button")
+print(f"Click coordinates: {coords}") # (450, 320)
+
+# Note: GTA1 cannot perform autonomous task planning
+# This will raise an error:
+# agent.run("Fill out the form and submit it")
+```
+
+
 ---
 
 For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents).
````

0 commit comments
