Provisional scores: Current rankings are based on public benchmark signals, pricing/context data, and manual curation. Approved practitioner reviews will be shown separately as they are submitted.
Methodology: Model-fit scores combine public sources such as Artificial Analysis, provider pricing/context data, and manual weighting for agent use cases. Have production experience? Add the first real review.
Choose the right LLM for your agent workload
Source-checked model pages that combine provider pricing, context limits, benchmark signals, and practical agent-use-case notes. Reviews remain separate until manually approved.
- Tracked models: 10
- Source-checked: 10
- Approved reviews: 1
Leaderboard
Filter by use case, then open model pages for fit notes, caveats, and sources.
Showing 10 models
| Model | Best fit | Score | Price / Context | Confidence | Dimensions |
|---|---|---|---|---|---|
| Gemini 3.1 Flash-Lite (Google, `gemini-3.1-flash-lite`) | High-volume classification, translation, extraction, and lightweight support agents | 8.6/10 | $0.25 in / $1.5 out, 1M context | Provisional, 0 practitioner reviews | TC CE LA AR CQ |
| GPT-5.4 mini (OpenAI, `gpt-5.4-mini`) | Coding assistants and subagents that need OpenAI tool support at lower cost | 8.6/10 | $0.75 in / $4.5 out, 400K context | Provisional, 0 practitioner reviews | TC CE LA AR CQ |
| Claude Opus 4.7 (Anthropic, `claude-opus-4-7`) | Complex software engineering tasks that need careful long-horizon reasoning | 8.4/10 | $5 in / $25 out, 1M context | Provisional, 0 practitioner reviews | TC CE LA AR CQ |
| DeepSeek V4 Pro (DeepSeek, `deepseek-v4-pro`) | Cost-sensitive coding and RAG agents that still need long context | 8.4/10 | $0.43 in / $0.87 out, 1M context | Provisional, 0 practitioner reviews | TC CE LA AR CQ |
| Grok 4.3 (xAI, `grok-4.3`) | Agents that need a 1M context window and configurable reasoning | 8.4/10 | $1.25 in / $2.5 out, 1M context | Provisional, 0 practitioner reviews | TC CE LA AR CQ |
| GPT-5.5 (OpenAI, `gpt-5.5`) | Hard coding and multi-step agent tasks where quality matters more than unit cost | 8.2/10 | $5 in / $30 out, 1.1M context | Provisional, 0 practitioner reviews | TC CE LA AR CQ |
| Kimi K2.6 (Moonshot AI, `kimi-k2.6`) | Agentic coding workflows where Kimi-specific behavior has been tested | 8.2/10 | $0.95 in / $4 out, 256K context | Provisional, 0 practitioner reviews | TC CE LA AR CQ |
| Qwen3.6 Max Preview (Alibaba Cloud, `qwen3.6-max-preview`) | Agent and coding evaluations where Qwen3.6 Max Preview beats cheaper Qwen tiers | 8.2/10 | $0.86 in / $3.44 out, 256K context | Provisional, 0 practitioner reviews | TC CE LA AR CQ |
| Gemini 3.1 Pro Preview (Google, `gemini-3.1-pro-preview`) | Long-context multimodal RAG over documents, images, video, audio, or PDFs | 8.0/10 | $2 in / $12 out, 1M context | Provisional, 0 practitioner reviews | TC CE LA AR CQ |
| Llama 4 Maverick (Meta via OpenRouter, `meta-llama/llama-4-maverick`) | Budget-sensitive RAG and general tasks where open-weight models are acceptable | 7.6/10 | $0.15 in / $0.6 out, 1M context | Provisional, 0 practitioner reviews | TC CE LA AR CQ |
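
To turn the Price column into a rough budget for your own workload, a minimal sketch follows. It assumes the listed "in" and "out" prices are USD per 1M tokens, which this page does not state, so check each model page before relying on the numbers. The names `Pricing`, `estimate_cost`, and `rank_by_cost` are hypothetical helpers for illustration only; the prices are copied from four of the rows above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_usd: float   # listed "in" price
    output_usd: float  # listed "out" price


# ASSUMPTION: prices are USD per 1M tokens; the leaderboard does not state the unit.
TOKENS_PER_PRICE_UNIT = 1_000_000

# Prices copied from the leaderboard rows, keyed by model ID.
PRICES = {
    "gemini-3.1-flash-lite": Pricing(0.25, 1.5),
    "gpt-5.4-mini": Pricing(0.75, 4.5),
    "claude-opus-4-7": Pricing(5.0, 25.0),
    "deepseek-v4-pro": Pricing(0.43, 0.87),
}


def estimate_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a workload for one model."""
    price = PRICES[model_id]
    total = input_tokens * price.input_usd + output_tokens * price.output_usd
    return total / TOKENS_PER_PRICE_UNIT


def rank_by_cost(input_tokens: int, output_tokens: int) -> list[tuple[str, float]]:
    """Return (model_id, estimated_cost) pairs, cheapest first."""
    costs = {m: estimate_cost(m, input_tokens, output_tokens) for m in PRICES}
    return sorted(costs.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Example workload: 20M input tokens and 2M output tokens per month.
    for model_id, cost in rank_by_cost(20_000_000, 2_000_000):
        print(f"{model_id}: ${cost:,.2f}")
```

Cost is only one input to model choice. The estimate ignores context limits, caching discounts, and the fit notes on each model page.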
Built for developers who ship agents to production. See methodology for scoring and source rules.