Cost · Routing · Engineering

Model Routing: Using the Cheapest Model That Actually Solves the Task

Autrace Engineering · March 10, 2026 · 9 min read

Not every LLM call needs GPT-4o. If you're doing keyword classification, entity extraction, or simple Q&A, you're paying for a model that is 10-40x more expensive than necessary on every call.

The cost case

Approximate input/output costs as of early 2026:

  • GPT-4o: ~$2.50 / $10.00 per million tokens
  • Claude 3.5 Haiku: ~$0.80 / $4.00 per million tokens
  • Gemini 1.5 Flash: ~$0.075 / $0.30 per million tokens
  • Llama 3.1 8B (self-hosted): ~$0.05 / $0.05 per million tokens

For workloads where 70% of requests are simple classification or extraction, routing those to Flash or Haiku while keeping complex reasoning on GPT-4o can cut LLM spend by 60-80% with no quality degradation on complex tasks.
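To make that concrete, here is a back-of-the-envelope estimate using the prices above. The workload split and per-request token counts are hypothetical, chosen only to illustrate the arithmetic:

```python
# Back-of-the-envelope savings: route 70% of requests to Flash,
# keep 30% on GPT-4o. Prices are the per-million-token rates quoted
# above; the workload (1M requests/month, 1k input / 200 output
# tokens each) is a hypothetical example.

PRICES = {  # model -> (input, output) USD per 1M tokens
    "gpt-4o": (2.50, 10.00),
    "gemini-1.5-flash": (0.075, 0.30),
}

def cost(model, input_tokens, output_tokens):
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

REQUESTS = 1_000_000
IN_TOK, OUT_TOK = 1_000, 200

# Baseline: everything on GPT-4o.
baseline = REQUESTS * cost("gpt-4o", IN_TOK, OUT_TOK)

# Routed: 70% simple traffic on Flash, 30% stays on GPT-4o.
routed = (0.7 * REQUESTS * cost("gemini-1.5-flash", IN_TOK, OUT_TOK)
          + 0.3 * REQUESTS * cost("gpt-4o", IN_TOK, OUT_TOK))

savings = 1 - routed / baseline
print(f"baseline=${baseline:,.1f}  routed=${routed:,.1f}  saved={savings:.0%}")
```

With these numbers the routed bill lands around a third of the baseline, squarely in the 60-80% savings range; heavier output-token usage on the complex tier shifts the ratio.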

Routing rules in Autrace

# autrace-rules.yaml
routing:
  - id: route-classification
    match:
      metadata.task_type: ["classify", "extract", "summarize"]
      estimated_tokens: { max: 2000 }
    route_to:
      model: "gemini/gemini-1.5-flash"

  - id: route-complex-reasoning
    match:
      metadata.task_type: ["reason", "code", "analyze"]
    route_to:
      model: "openai/gpt-4o"

  - id: route-default
    match: "*"
    route_to:
      model: "anthropic/claude-3-haiku-20240307"

Fallback routing

routing:
  - id: ha-route
    match: "*"
    route_to:
      primary: "openai/gpt-4o"
      fallback:
        - "anthropic/claude-3-5-sonnet-20241022"
        - "google/gemini-1.5-pro"
      on_error: ["rate_limit", "timeout", "server_error"]
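The behavior this config describes, i.e. trying the primary and then each fallback in order on a retryable error, can be sketched as follows. The exception class and `call_model` helper here are hypothetical stand-ins, not part of any real provider SDK:

```python
# Ordered fallback on retryable errors (rate limit, timeout, server
# error), as described by the on_error list above. call_model and
# RetryableError are hypothetical stand-ins for illustration.

class RetryableError(Exception):
    """Stand-in for a rate_limit / timeout / server_error response."""

def call_model(model, prompt, down=()):
    if model in down:                 # simulate a provider outage
        raise RetryableError(model)
    return f"{model}: ok"

def route_with_fallback(prompt, primary, fallbacks, down=()):
    last_error = None
    for model in [primary, *fallbacks]:
        try:
            return call_model(model, prompt, down=down)
        except RetryableError as err:
            last_error = err          # retryable: try the next model
    raise last_error                  # every model in the chain failed

# Primary is down; the first fallback serves the request.
result = route_with_fallback(
    "hello",
    primary="openai/gpt-4o",
    fallbacks=["anthropic/claude-3-5-sonnet-20241022", "google/gemini-1.5-pro"],
    down={"openai/gpt-4o"},
)
print(result)
```

Non-retryable errors (for example, a 400 from a malformed request) should propagate immediately rather than burn through the fallback chain.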

Cost attribution

Every proxied request is logged with the actual model used, input token count, output token count, and estimated cost. Export to your data warehouse via the audit log export API to attribute LLM spend per team, feature, or user cohort.
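Once the logs are exported, attribution is a straightforward group-by. A small sketch, where the record shape is an assumption based on the fields listed above:

```python
# Per-team spend attribution from exported audit log records.
# The record fields (team, model, cost_usd) are an assumed shape
# based on the export fields described above.
from collections import defaultdict

logs = [
    {"team": "search", "model": "gemini/gemini-1.5-flash", "cost_usd": 0.0002},
    {"team": "search", "model": "gemini/gemini-1.5-flash", "cost_usd": 0.0001},
    {"team": "agents", "model": "openai/gpt-4o", "cost_usd": 0.0175},
]

spend = defaultdict(float)
for rec in logs:
    spend[rec["team"]] += rec["cost_usd"]

# Highest-spending teams first.
for team, total in sorted(spend.items(), key=lambda kv: -kv[1]):
    print(f"{team}: ${total:.4f}")
```

In practice you would run the same aggregation in your warehouse's SQL, keyed on whatever dimension (team, feature, user cohort) you tag requests with.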
