
Commit 4eccf05

Added omniparser to grounding page

1 parent 760faf1 commit 4eccf05

File tree

1 file changed

+33
-15
lines changed


docs/content/docs/agent-sdk/supported-agents/grounding-models.mdx

Lines changed: 33 additions & 15 deletions
````diff
@@ -29,33 +29,24 @@ All models that support `ComputerAgent.run()` also support `ComputerAgent.predic
 
 These models are optimized specifically for click prediction and UI element grounding:
 
-### GTA1-7B
+### OmniParser
 
-State-of-the-art grounding model from the [GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/):
+OCR-focused set-of-marks model that requires an LLM for click prediction:
 
-- `huggingface-local/HelloKKMe/GTA1-7B`
+- `omniparser` (requires combination with any LiteLLM vision model)
 
-```python
-agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
+### GTA1-7B
 
-# Predict click coordinates for UI elements
-coords = agent.predict_click("find the submit button")
-print(f"Click coordinates: {coords}") # (450, 320)
+State-of-the-art grounding model from the [GUI Agent Grounding Leaderboard](https://gui-agent.github.io/grounding-leaderboard/):
 
-# Note: GTA1 cannot perform autonomous task planning
-# This will raise an error:
-# agent.run("Fill out the form and submit it")
-```
+- `huggingface-local/HelloKKMe/GTA1-7B`
 
 ## Usage Examples
 
 ```python
 # Using any grounding model for click prediction
 agent = ComputerAgent("claude-3-5-sonnet-20241022", tools=[computer])
 
-# Take a screenshot first
-screenshot = agent.computer.screenshot()
-
 # Predict coordinates for specific elements
 login_coords = agent.predict_click("find the login button")
 search_coords = agent.predict_click("locate the search text field")
@@ -66,6 +57,33 @@ print(f"Search field: {search_coords}")
 print(f"Menu icon: {menu_coords}")
 ```
 
+```python
+# OmniParser is just for OCR, so it requires an LLM for predict_click
+agent = ComputerAgent("omniparser+anthropic/claude-3-5-sonnet-20241022", tools=[computer])
+
+# Predict click coordinates using composed agent
+coords = agent.predict_click("find the submit button")
+print(f"Click coordinates: {coords}") # (450, 320)
+
+# Note: Cannot use omniparser alone for click prediction
+# This will raise an error:
+# agent = ComputerAgent("omniparser", tools=[computer])
+# coords = agent.predict_click("find button") # Error!
+```
+
+```python
+agent = ComputerAgent("huggingface-local/HelloKKMe/GTA1-7B", tools=[computer])
+
+# Predict click coordinates for UI elements
+coords = agent.predict_click("find the submit button")
+print(f"Click coordinates: {coords}") # (450, 320)
+
+# Note: GTA1 cannot perform autonomous task planning
+# This will raise an error:
+# agent.run("Fill out the form and submit it")
+```
+
+
 ---
 
 For information on combining grounding models with planning capabilities, see [Composed Agents](./composed-agents).
````

0 commit comments
