- [2026-01-16] Paper released on arXiv: OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding
- [2026-01] Dataset & Framework released
A benchmark framework for evaluating the instruction-following capabilities of AI coding agents. It intercepts API calls via a LiteLLM Proxy, collects complete interaction trajectories, and scores them automatically with an LLM.
- Multi-Scaffold Support: Supports Claude Code, Kilo-Dev, Droid and other AI development tools
- Trajectory Collection: Automatically intercepts and records complete API call trajectories
- Automated Evaluation: Multi-dimensional LLM scoring of trajectories against each task's checklist
- Docker Isolation: Each task instance runs in an isolated container with a clean environment
- Proxy Startup: LiteLLM Proxy runs on the host machine, intercepting all API calls
- Task Execution: Scaffolds (Claude Code, Kilo, Droid) complete tasks in Docker containers
- Trajectory Collection: Each API request/response is recorded to individual JSONL files (raw trajectories)
- Trajectory Processing: Use the convert/ tools to deduplicate and merge raw trajectories into complete conversation trajectories
- Automated Evaluation: Score merged trajectories with an LLM against each task's checklist
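To make the interception step above concrete, the sketch below sends a single request through a locally running proxy, much as a scaffold would. Assumptions: the proxy listens on its default port 4000, the model name matches an entry in litellm_config.yaml, and LiteLLM's OpenAI-compatible endpoint is used.

```python
# Minimal sketch (not part of the repository): exercise the proxy directly.
# Assumes the proxy is running on localhost:4000 (the LITELLM_PORT default)
# and that the model name matches a model_name entry in litellm_config.yaml.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",  # the proxy, not the upstream provider
    api_key="dummy",                   # real keys live in proxy/env.sh; use the
                                       # proxy master key here if one is configured
)

response = client.chat.completions.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Please help me analyze this function"}],
)
print(response.choices[0].message.content)
```

Requests routed this way are written to the raw trajectory JSONL files by the logging callback; scaffolds have their API base pointed at the same address, so every call they make while solving a task is captured without modifying the scaffold itself.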
- Python 3.11+
- Docker
- LLM API Key (Anthropic / MiniMax / Gemini, etc.)
pip install -r requirements.txt

cd proxy
cp env.sh.example env.sh
# Edit env.sh and fill in your API Keys
source env.sh

# 1. Start Proxy (Terminal 1)
cd proxy
source env.sh
python start_proxy.py
# 2. Run evaluation pipeline (Terminal 2)
./run.sh
# Specify model
./run.sh --model claude-opus-4-5-20251101

benchmark/
├── run.sh                    # One-click run script (task execution + trajectory processing + evaluation)
├── benchmark_runner.py       # Benchmark runner main program
├── evaluate.py               # Trajectory evaluation script
├── requirements.txt          # Python dependencies
│
├── scaffolds/                # Scaffold modules (multi-tool support)
│   ├── __init__.py           # Scaffold registry and factory functions
│   ├── base.py               # Abstract base class definition
│   ├── claudecode.py         # Claude Code scaffold implementation
│   ├── kilo_dev.py           # Kilo-Dev scaffold implementation
│   └── droid.py              # Droid scaffold implementation
│
├── proxy/                    # LiteLLM Proxy component (trajectory collection)
│   ├── start_proxy.py        # Proxy startup script
│   ├── trajectory_logger.py  # Trajectory logger (custom callback)
│   ├── litellm_config.yaml   # LiteLLM model configuration
│   ├── env.sh.example        # Environment variable template
│   └── Dockerfile            # Proxy containerization config
│
└── convert/                  # Trajectory processing tools (dedup & merge)
    ├── convert_cc_traj_to_msg.py  # Main program: Ray-parallel trajectory processing
    ├── dedup.py                   # Deduplication logic
    └── utils.py                   # Completion data structures + format conversion
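New tools are added through the scaffolds/ layer. The real interface lives in scaffolds/base.py and the registry in scaffolds/__init__.py; the sketch below only illustrates the shape of such an extension, and every name in it (BaseScaffold, run, SCAFFOLDS) is a hypothetical placeholder rather than the repository's actual API.

```python
# Illustrative placeholder names only; see scaffolds/base.py and
# scaffolds/__init__.py for the real abstract class and registry.
from abc import ABC, abstractmethod


class BaseScaffold(ABC):
    """One scaffold wraps one coding agent (Claude Code, Kilo-Dev, Droid, ...)."""

    @abstractmethod
    def run(self, instance: dict) -> None:
        """Run a single task instance inside its Docker container."""


class MyToolScaffold(BaseScaffold):
    def run(self, instance: dict) -> None:
        # Launch the agent CLI inside instance["image"], point its API base
        # at the proxy, and feed it the queries from instance["user_query"].
        ...


# Registry keyed by the `scaffold.name` field of task instances.
SCAFFOLDS = {"mytool": MyToolScaffold}
```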
Task instances are loaded from MiniMaxAI/OctoBench; each record is in JSON format:
{
"instance_id": "benchmark-example-001",
"user_query": ["Please help me analyze how this function works"],
"system_prompt": "",
"category": "Claude.md",
"image": "docker-image:tag",
"workspace_abs_path": "/app",
"scaffold": {
"name": "claudecode",
"version": "2.0.69"
},
"checklist": {
"SP": {
"description": "System Prompt constraints",
"checks": [
{
"check_id": "SP_language_match",
"description": "Check if correct language is used",
"check_type": "compliance"
}
]
}
}
}

Key Fields:
- scaffold.name: Scaffold name (claudecode / kilo-dev / droid)
- user_query: List of user queries; supports multi-turn conversations
- checklist: Evaluation check items, organized by category
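To browse instances locally, the sketch below loads the dataset with the Hugging Face datasets library. The split name "test" and the assumption that nested fields arrive as dicts (rather than JSON strings) are guesses; check the dataset card for the actual schema.

```python
import json

from datasets import load_dataset

# The split name "test" is an assumption; check the MiniMaxAI/OctoBench
# dataset card for the splits that are actually published.
ds = load_dataset("MiniMaxAI/OctoBench", split="test")

for instance in ds:
    # Depending on serialization, nested fields such as `checklist` may be
    # stored as JSON strings instead of dicts.
    checklist = instance["checklist"]
    if isinstance(checklist, str):
        checklist = json.loads(checklist)
    n_checks = sum(len(group["checks"]) for group in checklist.values())
    print(instance["instance_id"], instance["category"], n_checks)
```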
Raw trajectories are collected by the Proxy, one record per API call:
{
"instance_id": "benchmark-example-001",
"timestamp": "2024-12-27T10:00:00.000Z",
"success": true,
"model": "claude-sonnet-4-5-20250929",
"request": {
"messages": [...],
"tools": [...],
"system": [...]
},
"response": {
"content": "...",
"thinking_blocks": [...],
"tool_calls": [...],
"finish_reason": "end_turn"
},
"usage": {
"prompt_tokens": 1000,
"completion_tokens": 500,
"total_tokens": 1500
}
}

Complete conversation trajectories after convert/ processing:
{
"meta": {
"session_id": "abc123",
"biz_id": "benchmark",
"model": "claude-sonnet-4-5-20250929",
"max_tokens": 8192
},
"tools": [
{
"type": "function",
"function": {
"name": "Read",
"description": "Read file content",
"parameters": { "type": "object", "properties": {...} }
}
}
],
"messages": [
{ "role": "system", "content": "You are a helpful assistant..." },
{ "role": "user", "content": "Please help me analyze this function" },
{
"role": "assistant",
"content": "OK, let me read the file first...",
"reasoning_content": "User needs to analyze function, I should first...",
"tool_calls": [{ "name": "Read", "arguments": {...} }],
"generation": true
},
{ "role": "tool", "tool_name": "Read", "content": "File content..." },
{
"role": "assistant",
"content": "This function does...",
"reasoning_content": "Based on the code content...",
"generation": true
}
]
}

Key Fields:
- reasoning_content: Model's thinking process (thinking block)
- tool_calls: List of tool calls
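The heavy lifting is done by convert/convert_cc_traj_to_msg.py (in parallel via Ray) together with dedup.py, but the core idea can be sketched briefly: each raw record repeats the conversation so far in request.messages, so keeping the longest successful request per instance and appending its final response approximates the merged trajectory. The sketch below is a deliberate simplification, not the repository's actual logic; thinking blocks and tool calls are ignored.

```python
import json
from pathlib import Path


def merge_raw_trajectory(jsonl_path: str) -> dict:
    """Simplified stand-in for the convert/ tools: collapse raw per-call
    records into one conversation trajectory."""
    records = [
        json.loads(line)
        for line in Path(jsonl_path).read_text().splitlines()
        if line.strip()
    ]
    # The last successful call carries the longest (most complete) history.
    last = max(
        (r for r in records if r.get("success")),
        key=lambda r: len(r["request"]["messages"]),
    )
    messages = list(last["request"]["messages"])
    messages.append({
        "role": "assistant",
        "content": last["response"].get("content", ""),
        "generation": True,
    })
    return {"meta": {"model": last["model"]}, "messages": messages}
```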
Evaluation results produced by evaluate.py:

{
"results": [
{
"instance_id": "benchmark-example-001",
"success": true,
"reward": 0.85,
"eval_result": {
"SP": {
"reasoning": "Overall analysis...",
"checklist": [
{
"check_id": "SP_language_match",
"reasoning": "Specific analysis...",
"result": "success"
}
]
}
}
}
],
"summary": {
"total": 10,
"success_count": 9,
"avg_reward": 0.82
}
}
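A short sketch for post-processing these results, e.g. to see which checklist items fail most often; the file name results.json is a placeholder for wherever evaluate.py writes its output.

```python
import json
from collections import Counter

# "results.json" is a placeholder path; point it at the file written by evaluate.py.
with open("results.json") as f:
    report = json.load(f)

fail_counts = Counter()
for item in report["results"]:
    for category in item["eval_result"].values():
        for check in category["checklist"]:
            if check["result"] != "success":
                fail_counts[check["check_id"]] += 1

print("avg_reward:", report["summary"]["avg_reward"])
for check_id, n in fail_counts.most_common(5):
    print(f"{check_id}: failed {n} times")
```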
Model routing is configured in proxy/litellm_config.yaml:

model_list:
  # Anthropic Claude
  - model_name: claude-sonnet-4-5-20250929
    litellm_params:
      model: anthropic/claude-sonnet-4-5-20250929
      api_key: os.environ/ANTHROPIC_API_KEY

  # Google Gemini
  - model_name: gemini-3-pro
    litellm_params:
      model: gemini/gemini-3-pro-preview-05-06
      api_key: os.environ/GEMINI_API_KEY

  # MiniMax
  - model_name: MiniMax-M2.1
    litellm_params:
      model: anthropic/MiniMax-M2.1
      api_base: https://api.minimaxi.com/anthropic
      api_key: os.environ/MINIMAX_API_KEY

The proxy reads the following environment variables (template in proxy/env.sh.example):

| Variable | Description | Default |
|---|---|---|
| TRAJECTORY_OUTPUT_DIR | Trajectory output directory | ./trajectories |
| LITELLM_PORT | Proxy listening port | 4000 |
| ANTHROPIC_API_KEY | Anthropic API Key | - |
| OPENAI_API_KEY | OpenAI API Key (for evaluation) | - |
| OPENAI_BASE_URL | OpenAI API Base URL | - |
cd proxy
docker build -t benchmark-proxy .
docker run -d \
-p 4000:4000 \
-v /path/to/trajectories:/app/trajectories \
-e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
benchmark-proxy

To add custom fields to trajectory records, inherit the TrajectoryLogger class and override the _build_record method.
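For example, a minimal sketch (the _build_record signature is assumed, since only the class and method names are documented here; match it to the actual definition in proxy/trajectory_logger.py):

```python
# Sketch: add custom fields to every trajectory record. The _build_record
# signature below is an assumption; adapt it to proxy/trajectory_logger.py.
from trajectory_logger import TrajectoryLogger


class TaggedTrajectoryLogger(TrajectoryLogger):
    def _build_record(self, *args, **kwargs) -> dict:
        record = super()._build_record(*args, **kwargs)
        # Any extra metadata you want in each JSONL record goes here.
        record["run_tag"] = "ablation-01"
        return record
```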
MIT License