
Using cheap embeddings to route language model requests


What I learned

  • LLM request routing can be treated as a classification problem instead of a deep reasoning task.
  • Cheap embedding models plus distance or similarity thresholds are often sufficient to choose which model to call.
  • Routing can be implemented as nearest-neighbor search between query embeddings and route exemplars.
  • Thresholds and exemplars are key levers for controlling when to use specialized versus general models.

In my own words

When you have multiple language models or tools and need to decide which one to use for a given user request, you do not need a very smart model to make that decision. Instead, you can treat the decision as a simple classification problem: given this text, which bucket does it belong to? You can use a cheap embedding model to turn both the user query and each possible route (like “simple chat”, “deep reasoning”, or “code generation”) into vectors. Then you measure how close the query vector is to each route vector and pick the closest one, as long as it is above some similarity threshold. This is much cheaper and more predictable than asking a big model to reason about which model to call. In other words, routing is about pattern matching and similarity, not about deep cognition.

Why this matters

This matters because routing is often the hidden cost center in multi-model systems. If every routing decision uses a large, expensive model, your latency and spend will quickly balloon, especially at scale. By reframing routing as a simple classification problem solved with embeddings, you can make routing extremely fast and cheap while still leveraging specialized models where they add value. This enables architectures like “cheap by default, smart when needed” without requiring complex meta-reasoning. It also makes the system easier to test and debug: you can inspect similarity scores, adjust thresholds, and curate exemplar prompts instead of trying to interpret opaque chain-of-thought routing outputs. Overall, embedding-based routing is a practical way to build more efficient, predictable, and maintainable LLM applications.

Detailed explanation

Routing user requests between different LLMs or tools is often framed as a hard reasoning problem, but in practice it is usually just a classification problem over a small set of options. Instead of asking a powerful, expensive model to “think” about which model to call, you can predefine a set of routing targets (e.g., cheap-chat, code-model, math-model, policy-guardrail, human-escalation) and then use embeddings to classify the incoming request.

Core idea: Represent both the incoming query and each routing option as vectors in the same embedding space, then choose the option whose vector is closest to the query vector, as long as the distance (or similarity) passes a threshold. This turns routing into a nearest-neighbor lookup problem.
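As a toy sketch of that nearest-neighbor lookup, with made-up 2-D vectors standing in for real embeddings (a real system would get these from an embedding model):

```python
import numpy as np

# Toy 2-D vectors standing in for real embeddings (values are made up
# purely for illustration; a real system would use an embedding model).
route_vectors = {
    "chat": np.array([0.9, 0.1]),
    "code": np.array([0.1, 0.9]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_route(query_vec, threshold=0.8):
    scores = {name: cosine(query_vec, vec) for name, vec in route_vectors.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "fallback"

print(nearest_route(np.array([0.8, 0.2])))  # -> chat
print(nearest_route(np.array([0.5, 0.5])))  # -> fallback (ambiguous)
```

The second query sits equally far from both routes, so neither clears the threshold and the router falls back, which is exactly the behavior you want for ambiguous inputs.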

A typical workflow:

  1. Define routing classes: e.g., “simple chit-chat”, “complex reasoning”, “code generation”, “retrieval-heavy”, “sensitive / safety-critical”.
  2. For each class, create one or more representative descriptions or example prompts, and embed them using a cheap embedding model.
  3. At runtime, embed the user query with the same embedding model.
  4. Compute cosine similarity (or Euclidean distance) between the query embedding and each class embedding.
  5. Pick the class with the highest similarity, but only if it exceeds a threshold; otherwise fall back to a default route.

This approach is:

  • Cheap: Embedding models are much less expensive and faster than large reasoning models.
  • Deterministic-ish: Given the same embeddings and thresholds, routing is stable and easier to test.
  • Composable: You can add new routes by adding new class exemplars and re-running your similarity logic.

Nuances and considerations:

  • Multiple exemplars per class: Instead of a single vector per class, you can store several example prompts and use the maximum similarity or an average to represent the class. This improves robustness.
  • Threshold tuning: The similarity threshold is crucial. Too low and everything routes somewhere (even when it should default); too high and many queries fall back unnecessarily. You typically tune this using a labeled validation set.
  • Hierarchical routing: You can build a tree of routing decisions. For example, first classify “sensitive vs non-sensitive”; if non-sensitive, then classify into “chat vs code vs math”; if sensitive, route to a safety-focused model or human.
  • Fallbacks and overrides: Always have a safe default route (e.g., a general-purpose model with guardrails) and possibly manual overrides based on explicit keywords (e.g., "/code" prefix forces the code model).
  • Monitoring: Log the chosen route, similarity scores, and final outcomes so you can analyze misroutes and adjust thresholds or exemplars.
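Threshold tuning in particular lends itself to a simple sweep. A minimal sketch, assuming a small labeled validation set of (top similarity score, was the top route actually correct) pairs; the numbers here are invented for illustration:

```python
import numpy as np

# Hypothetical validation set: (top similarity score, True if the top route
# was actually correct). In practice these pairs come from labeled traffic.
validation = [
    (0.92, True), (0.88, True), (0.85, True), (0.60, False),
    (0.81, True), (0.55, False), (0.72, False), (0.78, True),
]

def routing_accuracy(threshold):
    # A decision is good if we route when the top route is correct,
    # or fall back when the top route would have been wrong.
    correct = sum((score >= threshold) == ok for score, ok in validation)
    return correct / len(validation)

# Sweep candidate thresholds and keep the best one.
best = max(np.arange(0.50, 1.00, 0.05), key=routing_accuracy)
print(f"best threshold ~ {best:.2f} (accuracy {routing_accuracy(best):.2f})")
```

On real traffic you would also weigh the two error types differently: sending a query to an expensive model unnecessarily usually costs less than sending a hard query to a model that cannot handle it.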

The key mental shift is to treat routing as a standard ML classification / retrieval problem, not as something that requires the model to introspect or reason about its own capabilities. This keeps your system simpler, cheaper, and more predictable, while still letting you take advantage of specialized models where they matter most.
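Putting the hierarchical-routing and override ideas together, here is a minimal sketch. The sensitive-term list is a stand-in for a real safety check, and `classify_topic` is a hypothetical placeholder for an embedding-based classifier like the one shown below:

```python
# Hypothetical two-level router: explicit keyword override first, then a
# sensitivity check, then topic classification among non-sensitive routes.
SENSITIVE_TERMS = ("password", "medical", "ssn")

def classify_topic(query: str) -> str:
    # Placeholder for an embedding-based classifier.
    return "code" if "def " in query or "function" in query else "chat"

def route(query: str) -> str:
    # 1. Explicit override: a "/code" prefix forces the code model.
    if query.startswith("/code"):
        return "code"
    # 2. First level: sensitive vs non-sensitive.
    if any(term in query.lower() for term in SENSITIVE_TERMS):
        return "safety_review"
    # 3. Second level: topic classification among non-sensitive routes.
    return classify_topic(query)

print(route("/code write a sort"))        # -> code (forced by override)
print(route("reset my password"))         # -> safety_review
print(route("write a function to sort"))  # -> code
```

Note the ordering: cheap, deterministic checks (overrides, safety) run before the embedding lookup, so the most critical decisions never depend on a similarity score.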

Example

Below is a minimal Python example that routes between a cheap chat model and an expensive reasoning model using embeddings and cosine similarity. This uses OpenAI-style APIs, but the pattern is general.

import numpy as np
from openai import OpenAI

client = OpenAI()

# 1. Define routing classes and exemplar texts
ROUTES = {
    "cheap_chat": [
        "Casual conversation, small talk, short answers.",
        "Simple questions that do not require deep reasoning."
    ],
    "deep_reasoning": [
        "Multi-step reasoning, complex analysis, or planning.",
        "Questions that require careful logical thinking or long explanations."
    ]
}

EMBED_MODEL = "text-embedding-3-small"


def embed(texts):
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])


# 2. Precompute route embeddings (e.g., at startup)
route_embeddings = {}
for route_name, exemplars in ROUTES.items():
    vecs = embed(exemplars)
    route_embeddings[route_name] = vecs  # shape: (n_exemplars, dim)


def cosine_similarity(a, b):
    a_norm = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return np.dot(a_norm, b_norm.T)


def choose_route(user_query, threshold=0.7):
    q_vec = embed([user_query])[0]  # shape: (dim,)

    best_route = None
    best_score = -1.0

    for route_name, vecs in route_embeddings.items():
        # Similarity to each exemplar; take the max
        sims = cosine_similarity(q_vec[None, :], vecs)[0]
        score = float(sims.max())
        if score > best_score:
            best_score = score
            best_route = route_name

    if best_score < threshold:
        return "fallback", best_score
    return best_route, best_score


def call_model(route, user_query):
    if route == "cheap_chat":
        model = "gpt-4o-mini"
    elif route == "deep_reasoning":
        model = "gpt-4.1"
    else:  # fallback
        model = "gpt-4o"

    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_query}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    user_query = input("User: ")
    route, score = choose_route(user_query)
    print(f"[Routing to: {route} (score={score:.3f})]")
    answer = call_model(route, user_query)
    print("Assistant:", answer)

This script:

  • Embeds exemplar descriptions for each route once at startup.
  • Embeds each incoming query and finds the closest route by cosine similarity.
  • Uses a threshold to decide whether to use a specific route or a generic fallback.

You can extend this by adding more routes (e.g., code, math, safety_review) and tuning the threshold based on real traffic.

Pitfalls and notes

Common pitfalls include:

  • Setting similarity thresholds without any evaluation data, which can cause over-routing to expensive models or too many fallbacks.
  • Using only a single exemplar per route, which makes routing brittle to phrasing.
  • Forgetting to re-embed exemplars after you change them.
  • Assuming embeddings capture non-textual constraints like cost budgets, latency requirements, or compliance rules; these usually still need explicit logic or metadata-based routing.

Also, do not skip monitoring: without logs of chosen routes and similarity scores, it is hard to detect misclassifications.
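For the non-textual constraints in particular, a small sketch of layering explicit budget checks on top of whatever route the embeddings picked. The route table, cost, and latency numbers are all hypothetical:

```python
# Hypothetical per-route metadata; in practice this comes from your own
# measurements or provider pricing, not from the embeddings.
ROUTE_INFO = {
    "deep_reasoning": {"cost_per_call": 0.05, "p95_latency_s": 8.0},
    "cheap_chat":     {"cost_per_call": 0.001, "p95_latency_s": 1.0},
}

def apply_constraints(route, max_cost=0.01, max_latency_s=5.0):
    info = ROUTE_INFO[route]
    # Embeddings chose the route on text alone; budgets are checked explicitly.
    if info["cost_per_call"] > max_cost or info["p95_latency_s"] > max_latency_s:
        return "cheap_chat"  # downgrade to the cheaper route
    return route

print(apply_constraints("deep_reasoning"))  # -> cheap_chat (over budget)
print(apply_constraints("cheap_chat"))      # -> cheap_chat
```

Keeping this as a separate post-processing step also makes it easy to log both the embedding's choice and the final, constraint-adjusted route.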
