Provisional scores: Current rankings are based on public benchmark signals, pricing/context data, and manual curation. Approved practitioner reviews will be shown separately as they are submitted.

Methodology: Model-fit scores combine public sources such as Artificial Analysis, provider pricing/context data, and manual weighting for agent use cases.

Choose the right LLM for your agent workload

Source-checked model pages that combine provider pricing, context limits, benchmark signals, and practical agent-use-case notes. Reviews remain separate until manually approved.

Tracked models: 10
Source-checked: 10
Approved reviews: 1

Leaderboard

Filter by use case, then open model pages for fit notes, caveats, and sources.


Showing 10 models

Model (provider, model ID)	Best fit	Score	Price / Context	Confidence	Dimensions
Gemini 3.1 Flash-Lite (Google, gemini-3.1-flash-lite)	High-volume classification, translation, extraction, and lightweight support agents	8.6/10	$0.25 in / $1.5 out, 1M context	Provisional, 0 practitioner reviews	TC CE LA AR CQ
GPT-5.4 mini (OpenAI, gpt-5.4-mini)	Coding assistants and subagents that need OpenAI tool support at lower cost	8.6/10	$0.75 in / $4.5 out, 400K context	Provisional, 0 practitioner reviews	TC CE LA AR CQ
Claude Opus 4.7 (Anthropic, claude-opus-4-7)	Complex software engineering tasks that need careful long-horizon reasoning	8.4/10	$5 in / $25 out, 1M context	Provisional, 0 practitioner reviews	TC CE LA AR CQ
DeepSeek V4 Pro (DeepSeek, deepseek-v4-pro)	Cost-sensitive coding and RAG agents that still need long context	8.4/10	$0.43 in / $0.87 out, 1M context	Provisional, 0 practitioner reviews	TC CE LA AR CQ
Grok 4.3 (xAI, grok-4.3)	Agents that need a 1M context window and configurable reasoning	8.4/10	$1.25 in / $2.5 out, 1M context	Provisional, 0 practitioner reviews	TC CE LA AR CQ
GPT-5.5 (OpenAI, gpt-5.5)	Hard coding and multi-step agent tasks where quality matters more than unit cost	8.2/10	$5 in / $30 out, 1.1M context	Provisional, 0 practitioner reviews	TC CE LA AR CQ
Kimi K2.6 (Moonshot AI, kimi-k2.6)	Agentic coding workflows where Kimi-specific behavior has been tested	8.2/10	$0.95 in / $4 out, 256K context	Provisional, 0 practitioner reviews	TC CE LA AR CQ
Qwen3.6 Max Preview (Alibaba Cloud, qwen3.6-max-preview)	Agent and coding evaluations where Qwen3.6 Max Preview beats cheaper Qwen tiers	8.2/10	$0.86 in / $3.44 out, 256K context	Provisional, 0 practitioner reviews	TC CE LA AR CQ
Gemini 3.1 Pro Preview (Google, gemini-3.1-pro-preview)	Long-context multimodal RAG over documents, images, video, audio, or PDFs	8/10	$2 in / $12 out, 1M context	Provisional, 0 practitioner reviews	TC CE LA AR CQ
Llama 4 Maverick (Meta via OpenRouter, meta-llama/llama-4-maverick)	Budget-sensitive RAG and general tasks where open-weight models are acceptable	7.6/10	$0.15 in / $0.6 out, 1M context	Provisional, 0 practitioner reviews	TC CE LA AR CQ
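
To compare rows on cost for a specific workload, the input and output prices can be combined with expected token volumes. The following is a minimal sketch, not part of the leaderboard itself. It assumes the listed prices are USD per 1M tokens (the table does not state the unit), and the `ModelPrice` type, the `estimate_cost` helper, and the 2M-input / 500K-output workload are hypothetical names and numbers chosen for illustration. Prices are copied from three rows of the table.

```python
from dataclasses import dataclass

# Assumption: table prices are USD per 1M tokens; the page does not state the unit.
TOKENS_PER_PRICE_UNIT = 1_000_000


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical container for one row's pricing."""
    model_id: str
    usd_in: float   # price for input tokens
    usd_out: float  # price for output tokens


def estimate_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for a given token volume."""
    total = input_tokens * price.usd_in + output_tokens * price.usd_out
    return total / TOKENS_PER_PRICE_UNIT


# Prices taken from three rows of the table above.
PRICES = [
    ModelPrice("gemini-3.1-flash-lite", 0.25, 1.5),
    ModelPrice("deepseek-v4-pro", 0.43, 0.87),
    ModelPrice("gpt-5.5", 5.0, 30.0),
]

# Hypothetical workload: 2M input tokens and 500K output tokens.
INPUT_TOKENS, OUTPUT_TOKENS = 2_000_000, 500_000

# Print models from cheapest to most expensive for this workload.
for p in sorted(PRICES, key=lambda p: estimate_cost(p, INPUT_TOKENS, OUTPUT_TOKENS)):
    print(f"{p.model_id}: ${estimate_cost(p, INPUT_TOKENS, OUTPUT_TOKENS):.2f}")
```

Because output tokens are priced higher than input tokens in every row, the ranking can change with the input/output mix, so it is worth running the estimate with volumes measured from your own agent traces rather than relying on list price alone.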

Built for developers who ship agents to production. See methodology for scoring and source rules.