Master Terminal-Bench 2.0 on MacOS: Ultimate Guide to Running Custom Agents Locally

How I spent way too long debugging Terminal-Bench 2.0 so you don't have to.

Master Terminal-Bench 2.0 on MacOS without the debugging headaches – your complete guide to running custom agents locally.

Read more: Master Terminal-Bench 2.0 on MacOS: Ultimate Guide to Running Custom Agents Locally

Understanding Terminal-Bench for MacOS

Terminal-Bench 2.0 represents the ultimate benchmark for AI agents, featuring 100 real-world terminal challenges that test everything from git operations to security CTF-style tasks.

When you run Terminal-Bench locally on MacOS, you’ll discover why even Claude Code running Opus 4.5 ranks at 20th place on the Terminal-Bench leaderboard. These authentic challenges cover file manipulation, package management, data processing, and system administration tasks. They separate capable agents from exceptional ones** (according to them)

Terminal-Bench MacOS benchmark testing for custom agents

Running your custom CLI agent against Terminal-Bench on MacOS reveals how it performs against industry standards. However, the setup process contains hidden obstacles that can derail your benchmarking efforts.

Narrator: It was much harder than expected.

Common MacOS Terminal-Bench Challenges

When configuring Terminal-Bench locally on MacOS with custom agents, you’ll encounter several technical obstacles that require specific solutions:

  1. Harbor configuration complexity — The plugin-based evaluation framework demands precise conventions for custom agents on MacOS
  2. Apple Silicon architecture conflicts — M1/M2/M3 Mac processors create compatibility issues with x86 Docker containers
  3. TTY pseudo-terminal limitations — CLI agents expecting real terminals fail when Docker containers lack TTY support
  4. Docker networking isolation — Agents mysteriously lose API connectivity despite container internet access
  5. Harbor configuration gotchas — Undocumented but critical settings can break your entire setup
  6. Extended timeout requirements — Complex tests like COBOL Modernization require 2+ hours to complete

*Me, after the 47th “Connection Issue. Retrying…” message while running Terminal-Bench*


Essential Prerequisites for MacOS

Before running Terminal-Bench locally, verify these components are installed and operational on your MacOS system. Each requirement serves a critical function in the benchmarking workflow:

Docker Desktop Installation

Docker provides isolated container environments for each Terminal-Bench task on MacOS. This containerization prevents task failures from corrupting subsequent test environments across the 100-task benchmark suite.

brew install --cask docker

# Verify Docker installation on MacOS
docker info

Important: Launch Docker Desktop from Applications before running Terminal-Bench. If “Docker daemon not running” appears, simply open Docker.app and wait for initialization.

Python 3.10+ and uv Package Manager

Harbor CLI requires Python 3.10 or newer for Terminal-Bench orchestration. The uv package manager delivers significantly faster installation compared to traditional pip workflows on MacOS.

python3 --version

brew install [email protected]

# Install uv package manager for Harbor
curl -LsSf https://astral.sh/uv/install.sh | sh

Harbor CLI Framework

Harbor orchestrates the complete Terminal-Bench execution workflow on MacOS, managing container lifecycles, task distribution to agents, output capture, and result scoring. Consider Harbor the conductor coordinating your entire benchmark symphony.

uv tool install harbor

# Verify Harbor CLI installation
harbor --version

If Harbor doesn’t appear in your MacOS PATH after installation, add this to your shell profile (.zshrc or .bashrc):

export PATH="$HOME/.local/bin:$PATH"
  • Your custom CLI agent built locally or available as an installable package
  • Patience and caffeine — preferably black coffee with a hint of milk

Bonus Resource: Access the complete working reference repository with installation scripts at github.com/vipulgupta2048/terminalbenchdocker. Clone it for comprehensive Terminal-Bench MacOS setup examples.


Harbor Architecture and Agent Integration

Understanding the Terminal-Bench architecture on MacOS requires grasping Harbor’s plugin-based framework. While Harbor includes native support for agents like Claude, your custom agent needs a translation layer—a Harbor plugin that defines installation procedures, execution commands, and output capture methods.

Harbor CLI architecture diagram for Terminal-Bench MacOS setup

The Terminal-Bench agent lifecycle on MacOS follows this sequence:

  1. Setup Phase: Container initialization, dependency installation, configuration file uploads, CLI verification
  2. Task Phase: Harbor transmits task instructions to your agent (example: “fix this git repository”)
  3. Execution Phase: Your agent executes commands within the container environment to solve assigned tasks
  4. Verification Phase: Harbor validates task completion through automated checks

This cycle repeats 100 times sequentially (or in parallel with proper configuration). Your Harbor plugin must handle each phase reliably when running Terminal-Bench locally on MacOS.


Building Custom Harbor Agent Plugins

A Harbor agent plugin for Terminal-Bench MacOS requires two coordinated components: a Python class defining agent behavior and a shell script template handling Docker container installation.

Agent Class Implementation

Your agent class inherits from Harbor’s BaseInstalledAgent and implements these essential methods for Terminal-Bench integration:

  • name() — Returns your agent’s unique identifier string
  • version() — Version tracking for your agent builds
  • setup() — One-time container installation and authentication file uploads
  • create_run_agent_commands() — Converts instructions into executable shell commands

The critical challenge in create_run_agent_commands() involves shell escaping. Special characters ($, “, backticks) break naive instruction passing. The proven solution: Base64 encode instructions before shell execution, then decode within the container. This eliminates escaping nightmares entirely when running Terminal-Bench locally.

Why Base64 encoding? Binary-safe encoding transforms any instruction—regardless of quotes, dollar signs, or newlines—into clean alphanumeric strings that shells won’t misinterpret.

For complete implementation examples optimized for Terminal-Bench MacOS, reference the working agent class repository with production-ready patterns.

Installation Script Templates

Harbor executes a Jinja2 template installing your agent into Docker containers. This template specifies system dependencies, agent binary installation (npm, pip, curl methods), and authentication configuration for Terminal-Bench execution on MacOS.

Standard template flow: update apt packages, install system dependencies, install agent binaries, create working directories. The setup phase can upload MacOS authentication files into containers before task execution begins. Review the complete template in the reference repository.


Apple Silicon Docker Configuration

M1/M2/M3 Mac users encounter immediate architecture conflicts when running Terminal-Bench locally. Default Terminal-Bench Docker images target x86 (AMD64) architecture, creating slow and unstable emulation when Docker attempts x86 emulation on Apple Silicon ARM processors.

Cryptic error messages indicate architecture mismatches:

exec /bin/sh: exec format error

Translation: “Attempting x86 binary execution on ARM processor architecture.”

Docker containers for Terminal-Bench MacOS configuration

The elegant solution for Terminal-Bench on Apple Silicon MacOS: Add this environment variable to your shell profile (.zshrc or .bashrc):

export DOCKER_DEFAULT_PLATFORM=linux/amd64

This configuration forces Docker to use amd64 images on Apple Silicon, with Docker’s emulation layer handling architecture translation transparently. Performance won’t match native x86 machines, but reliability improves dramatically for Terminal-Bench MacOS execution.


Resolving TTY Terminal Issues

CLI tools often check for “real” terminal environments, behaving differently without TTY (pseudo-terminal) features like colored output and interactive prompts. Docker containers lack default TTY support, causing agents to fail with errors when running Terminal-Bench locally:

Error: Input is required but no TTY is available
the input device is not a TTY

The proven solution: Deploy the unbuffer utility from the `expect` package. This Unix tool creates pseudo-TTY wrappers around commands, convincing agents they’re operating in real terminal environments.

Add expect to your Terminal-Bench MacOS installation script:

apt-get install -y expect

Then wrap agent commands with unbuffer: Instead of my-agent, execute unbuffer my-agent. Your agent now detects TTY support despite containerized execution.


Fixing Docker Network Connectivity

Network connectivity issues present the most frustrating Terminal-Bench MacOS debugging challenge. Symptoms appear contradictory when running Terminal-Bench locally:

  • Container successfully reaches internet (curl google.com works)
  • Agent fails reaching API endpoints (LLM providers, external services)
  • Perfect local MacOS execution
  • Complete Docker container failures

Root cause: Docker’s default bridge networking creates network isolation. Containers reach public internet, but traffic routes through virtual network bridges interfering with DNS resolution and connection establishment. Some services implement rate limiting or IP-based restrictions confused by Docker’s network isolation.

The definitive solution: Enable host networking mode. This configures Docker to bypass bridge networking entirely, sharing your MacOS network stack directly with containers for Terminal-Bench execution.

Add host networking to your Harbor configuration file:

environment:
  type: docker
  network_mode: host

This configuration grants agents direct MacOS network access, eliminating persistent “Connection Issue. Retrying…” messages when running Terminal-Bench locally.


Complete Harbor Configuration

Here’s a production-ready Terminal-Bench MacOS configuration file incorporating all optimization techniques. Pass this configuration to Harbor when executing benchmarks:

agents:
  - import_path: my_agent:MyAwesomeAgent
    override_timeout_sec: 600

datasets:
  - name: terminal-bench
    registry: {}

environment:
  type: docker
  network_mode: host

orchestrator:
  n_concurrent_trials: 4

Configuration breakdown for Terminal-Bench on MacOS:

  • import_path specifies your custom agent class location for Harbor
  • override_timeout_sec: 600 allocates 10 minutes per task (increase to 1200 for complex challenges)
  • registry: {} references Harbor’s default Terminal-Bench registry
  • network_mode: host — critical MacOS networking fix discussed above
  • n_concurrent_trials: 4 — parallel execution of 4 simultaneous tasks (adjust based on MacOS CPU cores and RAM, each container requires ~2-4GB)

For aggressive parallelization on powerful MacOS systems, create separate configurations (config.yaml for sequential, config-parallel.yaml for 8 concurrent tasks).


Executing Terminal-Bench Benchmarks

With configuration complete, execute Terminal-Bench locally on MacOS. First, establish environment variables ensuring Docker PATH and Apple Silicon compatibility:

export PATH="/Applications/Docker.app/Contents/Resources/bin:$PATH"
export DOCKER_DEFAULT_PLATFORM=linux/amd64

Execute Harbor with your Terminal-Bench MacOS configuration:

uv run harbor run -c my-agent-config.yaml --timeout-multiplier 6
Terminal-bench running

The --timeout-multiplier 6 increases base timeouts for Terminal-Bench tasks genuinely requiring 30+ minutes. Without this multiplier, spurious timeout failures occur. Increase further (–timeout-multiplier 15 or 20) for exceptionally complex tasks.

Pro tip: Before running complete 100-task suites (~17 hours sequentially on MacOS), validate with single simple tasks:

uv run harbor run -c my-agent-config.yaml \
  --dataset-task-names "hello-world" \
  --timeout-multiplier 6

This validates agent functionality, configuration correctness, and Docker/networking setup in ~5 minutes instead of 17 hours. Highly recommended before committing to full Terminal-Bench MacOS runs.

Automated Bash Script Execution

The reference repository includes convenient bash scripts wrapping Harbor commands for Terminal-Bench MacOS:

# Execute complete suite with default configuration
./run_test.sh

# Run specific Terminal-Bench task
./run_test.sh -t "git-init"

# Configure custom concurrency
./run_test.sh -c 8

# Execute random task subset
./run_random.sh 5

These scripts automate environment setup and Harbor invocation. Explore script files for comprehensive option documentation.


Analyzing Results and Debugging

Terminal-Bench MacOS results save to timestamped jobs/ directories with organized but detailed structure:

jobs/2024-01-15__10-30-45/
├── result.json          # Aggregate benchmark results
├── job.log              # Execution log
│
└── git-init__abc123/    # Individual task directory
    ├── result.json      # Task result (reward, pass/fail)
    ├── trial.log        # Execution log
    │
    ├── agent/
    │   ├── install.sh   # Generated install script
    │   ├── setup/       # Setup phase output
    │   └── command-0/   # Task execution output
    │       ├── stdout.txt
    │       └── return-code.txt
    │
    └── verifier/        # Verification results
        ├── reward.txt   # Score (0.0 to 1.0)
        └── test-stdout.txt

When debugging Terminal-Bench failures on MacOS, examine jobs///agent/command-0/stdout.txt for agent output. Most failures trace to missing dependencies (installation logs) or networking issues (verify host mode enabled). Scores range 0.0 (failed) to 1.0 (passed).


Accelerating Benchmark Execution

Sequential execution of 100 Terminal-Bench tasks on MacOS requires approximately 17 hours. Your MacOS system gets hot running Terminal-Bench locally. Consider these acceleration options:

Local Parallel Execution

Increase n_concurrent_trials for Terminal-Bench MacOS configurations. Systems with 8 CPU cores and sufficient RAM can execute 8 concurrent tasks:

orchestrator:
  n_concurrent_trials: 8

This optimization reduces Terminal-Bench execution from 17 hours to ~2 hours on MacOS. Your laptop fans will activate aggressively, but functionality remains reliable. Monitor RAM usage—dial back to 4 or 6 concurrent trials if memory limits appear.

Cloud Provider Integration

Harbor integrates with Daytona, Modal, and E2B for massive cloud parallelization. Daytona (recommended for Terminal-Bench) enables 40+ concurrent trials, completing 100 tasks in ~30 minutes versus 17 hours on local MacOS:

export DAYTONA_API_KEY="your-key"

uv run harbor run -c my-agent-config.yaml \
  --env daytona \
  --n-concurrent-trials 40

Daytona API keys available at app.daytona.io (free tier available). The 30-minute versus 17-hour speedup justifies cloud execution for final Terminal-Bench validation runs.


Terminal-Bench MacOS Troubleshooting

Docker Initialization Failures

If “Docker daemon not running” appears when starting Terminal-Bench on MacOS, launch Docker.app from Applications and await initialization. Verify with docker info.

Low disk space requires Docker resource cleanup before running Terminal-Bench locally:

docker system prune -a
docker container prune -f
docker image prune -a -f

Harbor Agent Discovery Issues

Ensure PYTHONPATH includes your Terminal-Bench project directory on MacOS:

export PYTHONPATH="$(pwd):$PYTHONPATH"

Container Authentication Failures

Verify credentials upload to containers during Terminal-Bench MacOS setup phases. Check setup logs for upload errors. Ensure auth file readability: chmod 600 ~/.my_agent/*.

Extended Timeout Requirements

Complex Terminal-Bench tasks legitimately require extended timeouts. Increase override_timeout_sec in MacOS configuration or apply higher --timeout-multiplier values.

Missing Container Commands

Verify your install.sh.j2 template installs required tools for Terminal-Bench MacOS. Test Docker images manually:

./build-image.sh
docker run -it my-agent-image bash

Interactive container exploration verifies complete installation before Terminal-Bench execution.


Key Terminal-Bench MacOS Insights

Terminal-Bench 2.0 delivers brutally honest benchmarking without flashy demos—it tests whether agents solve authentic problems. Running Terminal-Bench locally on MacOS demands understanding these non-obvious requirements:

  1. Architecture configuration: Set DOCKER_DEFAULT_PLATFORM=linux/amd64 on Apple Silicon MacOS
  2. TTY support: Deploy unbuffer for agents expecting terminal environments
  3. Network configuration: Enable network_mode: host for API-dependent agents
  4. Instruction encoding: Base64 encode instructions avoiding shell escaping chaos
  5. Performance optimization: Local MacOS parallelism is free but slow; cloud providers deliver speed requiring API keys

Transform Terminal-Bench execution from perpetual errors to successful completion:

┌─────────────────┐
│ Reward: 0.0 │
│ Errors: 100 │
│ Time: ∞ │
│ Mood: 😭 │
└─────────────────┘

To optimized Terminal-Bench MacOS results:

┌─────────────────┐
│ Reward: X.X │
│ Errors: 0 │
│ Time: 30min │
│ Mood: 🎉 │
└─────────────────┘

Benchmark your agent and compare results on the Terminal-Bench leaderboard. When obstacles arise running Terminal-Bench locally, reference the comprehensive MacOS setup repository built specifically for troubleshooting moments.


Essential Terminal-Bench Resources

Written after extensive debugging, copious coffee, and existential questioning of Docker networking design decisions while running Terminal-Bench on MacOS.

If this guide enabled successful Terminal-Bench execution on your MacOS system, consider starring the reference repository. Achieve leaderboard placement? Share your Terminal-Bench results—I’d appreciate hearing your success story.

Till then, live in the mix!

A worthy bucket to drop in your thoughts, feedback or rant.

This site uses Akismet to reduce spam. Learn how your comment data is processed.