🏆 Modu Benchmarks

Real-world insights into leading coding agents. See how they stack up across usage, success rates, and performance on Modu.

These leaderboards evaluate frontier coding agents on enterprise-grade engineering tasks in production codebases on Modu, including multi-file changes and work in large, dependency-heavy codebases.

About these benchmarks
Industry benchmark: hundreds of thousands of PRs analyzed over a rolling 90-day window. Per-agent minimum ≥ 300 PRs. Production data (opt-in, anonymized).
Notes: "n" values shown in tooltips are per-agent counts within the current window. Data reflects business usage from organizations using Modu in production (opt-in, anonymized).
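A minimal sketch of how these filters compose, assuming a flat PR record with hypothetical field names (agent, created_at, merged, draft); the actual pipeline isn't published:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

WINDOW_DAYS = 90         # rolling window from the notes above
MIN_PRS_PER_AGENT = 300  # per-agent minimum from the notes above

def merge_rates(prs: list[dict]) -> dict[str, float]:
    """Per-agent merge rate over the rolling window.

    Draft PRs are excluded from the denominator (see the
    data-collection notes further down), and agents with fewer
    than 300 in-window PRs are dropped. Timestamps are assumed
    to be timezone-aware.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=WINDOW_DAYS)
    in_window = [p for p in prs if p["created_at"] >= cutoff and not p["draft"]]

    totals, merged = Counter(), Counter()
    for p in in_window:
        totals[p["agent"]] += 1
        merged[p["agent"]] += int(p["merged"])

    return {agent: merged[agent] / n
            for agent, n in totals.items() if n >= MIN_PRS_PER_AGENT}
```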

Merge Rate Leaderboard

Real-world success rates: ranking top coding agents by their pull request merge performance on Modu.

Window: Last 90 days (as of 11/4/2025)
Top 5 by Merge Success
Filters: Language = All Languages
| Rank | Name         | Success Rate | Organization |
|------|--------------|--------------|--------------|
| #1   | Amp Code     | 77.0%        | Sourcegraph  |
| #2   | Factory      | 75.8%        | Factory      |
| #3   | OpenAI Codex | 75.6%        | OpenAI       |
| #4   | Claude Code  | 72.2%        | Anthropic    |
| #5   | Devin        | 70.2%        | Cognition    |

PR Outcome Distribution Leaderboard

How coding agents perform across one-shot, iterated, and human-assisted merges. Percentages sum to 100.

Window: Last 90 days (as of 11/4/2025)
PR Outcomes (Top 5)
Filters: Language = All Languages • Complexity = Blended (70% Simple / 30% Complex)
| Rank | Agent                    | One-shot | Iterated | Human-assist | Merged total | Not merged |
|------|--------------------------|----------|----------|--------------|--------------|------------|
| #1   | Amp Code                 | 37.42%   | 29.08%   | 11.60%       | 78.10%       | 21.90%     |
| #2   | Factory                  | 36.42%   | 29.00%   | 11.61%       | 77.03%       | 22.97%     |
| #3   | OpenAI Codex             | 34.51%   | 28.43%   | 12.38%       | 75.32%       | 24.68%     |
| #4   | Claude Code              | 32.94%   | 27.44%   | 12.63%       | 73.01%       | 26.99%     |
| #5   | Cursor Background Agents | 30.85%   | 28.87%   | 13.19%       | 72.91%       | 27.09%     |
ℹ️ Understanding PR Outcome Distribution

Outcome Categories

  • One-shot merged: PR merged immediately without additional iterations.
  • Agent-iterated → merged: PR required agent iterations before being merged.
  • Human-assisted → merged: PR required human intervention before being merged.
  • Not merged: PR was not merged into the repository.

PR Complexity Definitions

  • Simple PRs: ~10 minutes of work or ~10k total tokens.
  • Complex PRs: ~30 minutes of work or ~72k total tokens.
  • Blended: Weighted average of 70% Simple + 30% Complex PRs (reflects typical team usage patterns).

Data Collection & Analysis Notes

  • Draft PRs: Drafts are excluded from the denominator until they're ready for review; otherwise "not merged" rates would be inflated for tools that prefer draft PRs.
  • Squash vs merge-commit: Categorization is based on the PR's conversation and who authored follow-ups, not commit history post-squash.
  • Multi-PR tasks: When an agent opens several PRs to solve one issue, each PR is treated independently for these percentages.
  • Model choice: One-shot rates can drop with smaller, cheaper models; this table is model-agnostic, so its figures are conservative.

All percentages are portions of total PRs submitted. "Merged total" sums the first three categories. Data sorted by one-shot merged percentage (descending).
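Those two invariants (buckets sum to 100%, and "merged total" is the sum of the first three) make the bookkeeping easy to check. A minimal sketch, assuming a hypothetical `outcome` field on each PR record:

```python
import math

OUTCOMES = ("one_shot", "iterated", "human_assist", "not_merged")

def outcome_distribution(prs: list[dict]) -> dict[str, float]:
    """Share of an agent's PRs in each outcome bucket."""
    counts = {k: 0 for k in OUTCOMES}
    for p in prs:
        counts[p["outcome"]] += 1  # hypothetical field name
    total = sum(counts.values())
    dist = {k: v / total for k, v in counts.items()}

    # Invariants from the table above.
    assert math.isclose(sum(dist.values()), 1.0)
    merged_total = sum(dist[k] for k in OUTCOMES[:3])
    assert math.isclose(merged_total, 1.0 - dist["not_merged"])
    return dist
```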

Usage Leaderboard

Market share measured by created and merged pull requests on Modu.

Window: Last 90 days (as of 11/4/2025)
Top 5 by Share
Metric: Created PRs Share
| Rank | Agent                    | Organization | Share  |
|------|--------------------------|--------------|--------|
| #1   | Claude Code              | Anthropic    | 28.70% |
| #2   | OpenAI Codex             | OpenAI       | 21.80% |
| #3   | Cursor Background Agents | Cursor       | 19.10% |
| #4   | Gemini CLI               | Google       | 10.80% |
| #5   | Amp Code                 | Sourcegraph  | 7.90%  |

Average Cost per Task

Blended: 70% simple tasks + 30% complex tasks; pricing normalized across seat and usage models.

Window: Last 90 days (as of 11/4/2025)
Top 5
| Rank | Name           | Simple      | Complex     | Blended Avg | Billing Basis |
|------|----------------|-------------|-------------|-------------|---------------|
| #1   | Gemini CLI     | $0.00–$0.01 | $0.01–$0.05 | $0.00–$0.02 | Free (individual); token overages via API tiers in team/enterprise |
| #2   | Factory        | $0.00–$0.02 | $0.02–$0.08 | $0.01–$0.04 | Per-user seat ($20/mo incl. "20M standard tokens") + usage; CLI for CI/CD |
| #3   | Codegen        | $0.05–$0.12 | $0.05–$0.12 | $0.07–$0.11 | Seat/month (Individual $9.99); flat tier amortized by volume |
| #4   | OpenAI Codex   | $0.06–$0.12 | $0.25–$0.70 | $0.12–$0.28 | Seat/month (Plus/Pro/Team) or API tokens (model-dependent) |
| #5   | OpenCode (BYO) | $0.02–$0.12 | $0.12–$1.10 | $0.05–$0.38 | Your connected model's tokens (BYO/OpenCode Zen) |
â„šī¸How This Table Is Standardized

Two task profiles

  • Simple ≈ 10 minutes of agentic work or ~10k total tokens (blended in/out).
  • Complex ≈ 30 minutes or ~72k total tokens across 5–20 calls.

Blended Average

70% Simple + 30% Complex — reflects real-world engineering team averages.
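The weighting is just a fixed 70/30 mix. A quick consistency check in Python, using the upper bounds of the Gemini CLI row above:

```python
def blended(simple: float, complex_: float) -> float:
    """Blended average cost: 70% simple + 30% complex."""
    return 0.7 * simple + 0.3 * complex_

# Upper bounds of the Gemini CLI row: simple $0.01, complex $0.05.
print(round(blended(0.01, 0.05), 3))  # 0.022 -> matches the $0.02 upper bound shown
```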

Key pricing notes

  • Models & pass-throughs: Model-agnostic tools follow the underlying model pricing (e.g., Sonnet 4.5; Gemini Flash-Lite).
  • Factory tokens: $20/mo plan includes "20M standard tokens"; marginal per-task cost ≈ $0 until the included pool is exhausted.
  • Seat plans: Per-task numbers amortize monthly seats over ~60 tasks (see the sketch after this list).
  • ACU/time pricing (Devin): Scales with autonomous runtime; complex tickets can consume many ACUs.
  • Quota systems (Augment): Message-metered plans convert to per-task cost by typical message counts.
  • Background agents (Cursor): Multi-step chains incur additional metered calls → wider ranges.
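A minimal sketch of that seat amortization. The $20/mo seat price and ~60 tasks/month are illustrative figures taken from the notes above, not a reproduction of any specific row (the exact volumes behind each row aren't published):

```python
def per_task_cost(seat_price_per_month: float, tasks_per_month: int) -> float:
    """Amortize a flat monthly seat price over a monthly task volume."""
    return seat_price_per_month / tasks_per_month

# Illustrative: a $20/mo seat spread over ~60 tasks/month.
print(round(per_task_cost(20.0, 60), 2))  # ~$0.33 per task, before any usage overages
```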

Average Cost per Merged PR

Blended: 70% simple PRs + 30% complex PRs; token-metered models normalized.

As of 11/4/2025
Top 5
| Rank | Name                     | Simple | Complex | Blended Avg | Billing Basis |
|------|--------------------------|--------|---------|-------------|---------------|
| #1   | Gemini CLI               | $0.00  | $0.00   | $0.00       | Free (individual); teams use Gemini API price card (overages apply) |
| #2   | Codegen                  | $0.11  | $0.30   | $0.17       | Seat/month (Individual $9.99); flat tier amortized by volume |
| #3   | Claude Code (Sonnet 4.5) | $0.12  | $0.57   | $0.25       | Seat or tokens (API: $3/M input, $15/M output; cache/batch may reduce) |
| #4   | OpenAI Codex             | $0.12  | $0.61   | $0.27       | Seat/month (Plus/Pro/Team) or API route (model-dependent) |
| #5   | OpenCode (BYO / Zen)     | $0.13  | $0.64   | $0.27       | Tokens from your connected model (BYO / Zen PAYG) |
â„šī¸How This Table Is Standardized

Two PR profiles

  • Simple PR ≈ 10 minutes of work or ~10k total tokens (assume ≈250 PRs/month).
  • Complex PR ≈ 30 minutes or ~72k total tokens across multiple steps (assume ≈60 PRs/month).

Blended Average

70% Simple + 30% Complex — reflects real-world engineering team averages.

Key pricing notes

  • Token-metered entries: Costs reflect the most recently published prices; large-context models run ~3–5× higher (a token-cost sketch follows this list).
  • Seat plans: Per-PR figures amortize seats using the PR volumes above; fewer PRs/month raise effective cost.
  • Factory plan: Inside the "20M standard tokens" pool, marginal cost is near zero; heavy CI/CD usage may incur overages.
  • Augment message quotas: Per-PR cost scales with conversation length.
  • Cursor background agents: Long agent chains incur additional metered calls; more variance on complex PRs.
  • Devin (ACUs): Cost scales with autonomous runtime (minutes → hours per PR), not tokens.
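A minimal sketch of the token-metered math, using the Sonnet 4.5 API prices quoted in the table ($3/M input, $15/M output). The 60/40 input/output split is an assumption, since the underlying split isn't published:

```python
def token_cost(input_tokens: int, output_tokens: int,
               in_price_per_m: float = 3.0,
               out_price_per_m: float = 15.0) -> float:
    """Dollar cost of one PR at per-million-token API prices."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

# Complex PR: ~72k total tokens; a 60/40 input/output split is assumed.
print(round(token_cost(43_200, 28_800), 2))  # ~$0.56, near the $0.57 shown above
```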