Provisional scores: Current rankings are based on public benchmark signals, pricing/context data, and manual curation. Practitioner reviews will be shown separately once they are submitted and approved.

Methodology: Model-fit scores combine public sources such as Artificial Analysis, provider pricing/context data, and manual weighting for agent use cases. Have production experience? Add the first real review.

Choose the right LLM for your agent workload

Source-checked model pages that combine provider pricing, context limits, benchmark signals, and practical agent-use-case notes. Reviews remain separate until manually approved.

Tracked models: 10
Source-checked: 10
Approved reviews: 1

Leaderboard

Filter by use case, then open model pages for fit notes, caveats, and sources.

Showing 10 models

| Model | Best fit | Score | Price / Context | Confidence | Dimensions |
| --- | --- | --- | --- | --- | --- |
| Gemini 3.1 Flash-Lite (Google - gemini-3.1-flash-lite) | High-volume classification, translation, extraction, and lightweight support agents | 8.6/10 | $0.25 in / $1.5 out, 1M context | Provisional, 0 practitioner reviews | TC, CE, LA, AR, CQ |
| GPT-5.4 mini (OpenAI - gpt-5.4-mini) | Coding assistants and subagents that need OpenAI tool support at lower cost | 8.6/10 | $0.75 in / $4.5 out, 400K context | Provisional, 0 practitioner reviews | TC, CE, LA, AR, CQ |
| Claude Opus 4.7 (Anthropic - claude-opus-4-7) | Complex software engineering tasks that need careful long-horizon reasoning | 8.4/10 | $5 in / $25 out, 1M context | Provisional, 0 practitioner reviews | TC, CE, LA, AR, CQ |
| DeepSeek V4 Pro (DeepSeek - deepseek-v4-pro) | Cost-sensitive coding and RAG agents that still need long context | 8.4/10 | $0.43 in / $0.87 out, 1M context | Provisional, 0 practitioner reviews | TC, CE, LA, AR, CQ |
| Grok 4.3 (xAI - grok-4.3) | Agents that need a 1M context window and configurable reasoning | 8.4/10 | $1.25 in / $2.5 out, 1M context | Provisional, 0 practitioner reviews | TC, CE, LA, AR, CQ |
| GPT-5.5 (OpenAI - gpt-5.5) | Hard coding and multi-step agent tasks where quality matters more than unit cost | 8.2/10 | $5 in / $30 out, 1.1M context | Provisional, 0 practitioner reviews | TC, CE, LA, AR, CQ |
| Kimi K2.6 (Moonshot AI - kimi-k2.6) | Agentic coding workflows where Kimi-specific behavior has been tested | 8.2/10 | $0.95 in / $4 out, 256K context | Provisional, 0 practitioner reviews | TC, CE, LA, AR, CQ |
| Qwen3.6 Max Preview (Alibaba Cloud - qwen3.6-max-preview) | Agent and coding evaluations where Qwen3.6 Max Preview beats cheaper Qwen tiers | 8.2/10 | $0.86 in / $3.44 out, 256K context | Provisional, 0 practitioner reviews | TC, CE, LA, AR, CQ |
| Gemini 3.1 Pro Preview (Google - gemini-3.1-pro-preview) | Long-context multimodal RAG over documents, images, video, audio, or PDFs | 8/10 | $2 in / $12 out, 1M context | Provisional, 0 practitioner reviews | TC, CE, LA, AR, CQ |
| Llama 4 Maverick (Meta via OpenRouter - meta-llama/llama-4-maverick) | Budget-sensitive RAG and general tasks where open-weight models are acceptable | 7.6/10 | $0.15 in / $0.6 out, 1M context | Provisional, 0 practitioner reviews | TC, CE, LA, AR, CQ |
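
The price and context figures in the leaderboard can be turned into a rough per-workload cost comparison. The sketch below is illustrative only and rests on assumptions the page does not state: that listed prices are USD per 1M tokens, that "1M", "400K", and "256K" mean 1,000,000, 400,000, and 256,000 tokens, and that a request fits when input plus output tokens stay within the context window. The names `Model`, `run_cost`, and `cheapest_fits` are hypothetical helpers, not part of any provider API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    usd_in: float        # listed input price; ASSUMED to be USD per 1M tokens
    usd_out: float       # listed output price; ASSUMED to be USD per 1M tokens
    context_tokens: int  # listed context window; "K"/"M" ASSUMED decimal


# Figures copied from the leaderboard rows.
MODELS = [
    Model("Gemini 3.1 Flash-Lite", 0.25, 1.5, 1_000_000),
    Model("GPT-5.4 mini", 0.75, 4.5, 400_000),
    Model("Claude Opus 4.7", 5, 25, 1_000_000),
    Model("DeepSeek V4 Pro", 0.43, 0.87, 1_000_000),
    Model("Grok 4.3", 1.25, 2.5, 1_000_000),
    Model("GPT-5.5", 5, 30, 1_100_000),
    Model("Kimi K2.6", 0.95, 4, 256_000),
    Model("Qwen3.6 Max Preview", 0.86, 3.44, 256_000),
    Model("Gemini 3.1 Pro Preview", 2, 12, 1_000_000),
    Model("Llama 4 Maverick", 0.15, 0.6, 1_000_000),
]


def run_cost(model: Model, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request under the per-1M-token assumption."""
    return (input_tokens * model.usd_in + output_tokens * model.usd_out) / 1_000_000


def cheapest_fits(models, input_tokens: int, output_tokens: int):
    """Drop models whose context window is too small, then sort by cost."""
    needed = input_tokens + output_tokens
    fits = [m for m in models if m.context_tokens >= needed]
    return sorted(fits, key=lambda m: run_cost(m, input_tokens, output_tokens))


if __name__ == "__main__":
    # Hypothetical workload: 300K input tokens, 20K output tokens per request.
    for m in cheapest_fits(MODELS, 300_000, 20_000):
        print(f"{m.name}: ${run_cost(m, 300_000, 20_000):.4f} per request")
```

Cost is only one input here: the ranking ignores the fit scores and use-case notes, so it is best read alongside the "Best fit" column rather than as a replacement for it.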

Built for developers who ship agents to production. See methodology for scoring and source rules.