Three Structural Failure Modes of Single-Vector Cosine Routers
When My Router Kept Picking the ‘Catch-All’ Tool
I was debugging a tool router that looked fine on paper—embed the query, compare to each tool description, pick the highest cosine—but in practice it kept sending wildly different requests to the same generic ‘year-in-review’ agent. The surprise was that nothing was obviously broken: the scores were high, the embeddings were from a good model, and yet the routing felt dumb. Understanding these three failure modes turned that vague frustration into a concrete checklist for why single-vector cosine routing so often feels smarter than it actually is.
Background
This note is about modern large language model routing systems that use dense vector embeddings to choose which tool, prompt, or agent should handle a user query. The time frame is the current generation of LLM-based applications (roughly 2023–2026), where cosine similarity over a single embedding vector is the default routing primitive. The scope is narrow but deep: why a seemingly simple ‘pick the highest cosine’ router fails in systematic, repeatable ways.
Why Single-Vector Cosine Routers Fail in Predictable Ways
A single-vector cosine router takes your query, embeds it once, then compares that vector to a set of tool or document embeddings and picks the highest cosine similarity. It feels mathematically clean: one space, one metric, one winner. The problem is that this simplicity quietly bakes in three structural assumptions: that the embedding space is well-spread, that the query is specific enough to pin down meaning, and that each candidate represents a narrow, well-defined role.
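In code, the entire router is a handful of lines, which is part of its appeal. Here is a minimal sketch of the pattern, where embed() is a placeholder for whatever embedding model you use (assumed here to return unit-normalized vectors):

import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: in a real system this calls your embedding model
    # and returns a unit-normalized vector.
    raise NotImplementedError

def route(query: str, tools: dict[str, np.ndarray]) -> str:
    q = embed(query)
    # With unit vectors, cosine similarity reduces to a dot product.
    scores = {name: float(q @ vec) for name, vec in tools.items()}
    # max() imposes a total order even when the top scores are near-tied.
    return max(scores, key=scores.get)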
Embedding space anisotropy and high baselines. Modern text embeddings are notoriously anisotropic: instead of spreading uniformly over the unit sphere, vectors cluster in a narrow cone. That means the average cosine similarity between unrelated items is already quite high, so your router is operating on a tiny margin between ‘vaguely related’ and ‘actually relevant’. Designers accept this because anisotropy often helps downstream tasks (e.g., capturing global semantics), but for routing it means your score distribution is compressed and noisy. You gain semantic richness but lose a clean geometric separation between tools.
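You can watch the compression happen with synthetic vectors. The sketch below fakes anisotropy by giving every 'embedding' a large shared component, then shows how mean-centering (one of the post-processing fixes mentioned below) restores spread; the dimensionality and noise scale are arbitrary illustration choices:

import numpy as np

rng = np.random.default_rng(0)

def unit(m):
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

# Fake anisotropy: every vector = big shared component + small item noise,
# so the whole 'corpus' lives in a narrow cone.
shared = rng.normal(size=128)
vecs = unit(shared + 0.3 * rng.normal(size=(10, 128)))

off_diag = ~np.eye(10, dtype=bool)
raw = vecs @ vecs.T
print(f"baseline cosine, raw:      {raw[off_diag].mean():.3f}")  # ~0.9: unrelated items look similar

# Mean-centering removes the shared direction and restores spread:
# the baseline collapses toward zero (slightly negative for a small sample).
centered = unit(vecs - vecs.mean(axis=0))
cen = centered @ centered.T
print(f"baseline cosine, centered: {cen[off_diag].mean():.3f}")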
Short-query underdetermination. A 2–5 word query simply doesn’t constrain meaning enough for a single embedding to disambiguate intent. “Year in review” could mean analytics, journaling, financial reports, news summaries, or code changelogs. The embedding model does its best and lands you somewhere in the semantic neighborhood of “retrospective / summary / time-bounded”, but that’s still a huge region in concept space. The router then pretends this vague direction is precise, because cosine forces a total order over candidates even when the underlying evidence is weak.
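One cheap defense is to stop pretending: measure the top-two margin and treat a near-tie as a signal that the query is underdetermined, rather than as a decision. A minimal sketch, with an illustrative (not recommended) margin threshold:

import numpy as np

def route_or_clarify(query_vec: np.ndarray,
                     tools: dict[str, np.ndarray],
                     min_margin: float = 0.02):
    # Returns a tool name, or None meaning "ask a clarifying question".
    ranked = sorted(((float(query_vec @ v), name) for name, v in tools.items()),
                    reverse=True)
    (top_score, top_name), (second_score, _) = ranked[0], ranked[1]
    # A hair's-breadth margin is noise, not evidence of intent.
    if top_score - second_score < min_margin:
        return None
    return top_name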
Generic-role dominance. When one candidate represents a broad, generic role (“general summarizer”, “catch-all assistant”, “year-in-review” agent) and others are narrow specialists, the generic embedding tends to sit near the semantic center of many queries. In an anisotropic space with high baseline cosine, that centrality is an unfair advantage: the generic tool is ‘vaguely relevant’ to almost everything, so it often edges out the specialist by a few thousandths of cosine. You gain coverage but lose precision, and the router looks confident while repeatedly picking the least informative option.
💡 Did you know: Most popular text embedding models produce highly anisotropic spaces where a large fraction of vectors lie in a narrow cone, which is why techniques like mean-centering and whitening can significantly improve retrieval quality without changing the model.
Watching Generic-Role Dominance in Action
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Pretend these come from a real embedding model.
# For illustration, we bake in anisotropy and generic centrality.
def norm(v):
    return v / np.linalg.norm(v)

# A 'generic year-in-review' tool: broad, central semantics
year_in_review = norm(np.array([0.9, 0.9, 0.9, 0.9]))

# Two more specific tools
code_changelog = norm(np.array([0.95, 0.8, 0.7, 0.6]))     # code / engineering
personal_journal = norm(np.array([0.7, 0.95, 0.8, 0.65]))  # personal / feelings

# Short, underdetermined query
query = norm(np.array([0.85, 0.85, 0.8, 0.75]))  # "year in review"

candidates = {
    "year_in_review": year_in_review,
    "code_changelog": code_changelog,
    "personal_journal": personal_journal,
}

for name, vec in candidates.items():
    score = cosine_similarity(query.reshape(1, -1), vec.reshape(1, -1))[0, 0]
    print(f"{name:16s} → cosine = {score:.4f}")

# Output:
# year_in_review   → cosine = 0.9987
# code_changelog   → cosine = 0.9923
# personal_journal → cosine = 0.9924
#
# All scores are extremely high due to anisotropy,
# but the broad 'year_in_review' tool wins by a tiny margin
# even though the query might really be about code or journaling.
The Insight
Single-vector cosine routers don’t just fail randomly—they systematically over-trust high baseline cosine in anisotropic spaces, over-interpret short queries, and over-select broad, generic tools that are vaguely relevant to everything. If you don’t design around those three failure modes, your router will look confident while quietly making the same structural mistakes over and over.
🧠 Bonus: Short queries are so underdetermined that some production systems silently expand them with synthetic context (e.g., pseudo-relevance feedback or query rewriting) before embedding, effectively turning ‘2–3 words’ into a miniature document.
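For the pseudo-relevance-feedback flavor of this trick, here is a Rocchio-style sketch: retrieve the query's top-k neighbors, then blend their centroid into the query embedding before routing. The blend weight alpha and the choice of k are illustrative assumptions:

import numpy as np

def expand_query(query_vec: np.ndarray, doc_vecs: np.ndarray,
                 k: int = 3, alpha: float = 0.5) -> np.ndarray:
    # Rocchio-style pseudo-relevance feedback: pull the short query
    # toward the centroid of its k nearest documents before routing.
    sims = doc_vecs @ query_vec                # doc_vecs: (n, d), unit rows
    top_k = doc_vecs[np.argsort(sims)[-k:]]
    expanded = alpha * query_vec + (1 - alpha) * top_k.mean(axis=0)
    return expanded / np.linalg.norm(expanded)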
Gotchas
- Letting a single generic tool compete directly with specialized tools on raw cosine similarity → the generic tool wins too often because it overlaps semantically with everything, starving the specialists.
- Treating cosine scores as absolute quality signals instead of relative rankings → in anisotropic spaces, even random or wrong matches can have deceptively high cosine, so fixed thresholds quietly fail.
- Routing short, ambiguous queries directly on embeddings → the router confidently picks a tool based on noise, then you debug the downstream tool instead of the real culprit: an underdetermined query.
- Assuming ‘better embeddings’ will fix routing pathologies → the three failure modes are structural (space geometry, query length, role granularity), so model swaps alone rarely change the behavior.
Takeaways
- Separate generic and specialist tools into different routing stages so the generic ‘catch-all’ only runs when no specialist clears a higher bar (a minimal sketch follows this list).
- Normalize or post-process embeddings (e.g., mean-centering, whitening, score calibration) to counteract anisotropy before using cosine for routing.
- Treat very short queries as under-specified: either ask a clarifying question, expand them with context, or route them through a different mechanism than long-form queries.
- Use multi-vector or late-interaction models (or at least a learned re-ranker) for high-stakes routing instead of relying solely on a single cosine over one embedding.
- Log full score distributions, not just the top choice, so you can detect when a generic tool is consistently winning by tiny margins over more appropriate specialists.
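Here is the promised sketch of the first takeaway: specialists compete first against an absolute bar, and the generic catch-all only runs on fall-through. The 0.55 bar is a placeholder; a real one needs calibration against your embedding model's score distribution.

import numpy as np

def two_stage_route(query_vec: np.ndarray,
                    specialists: dict[str, np.ndarray],
                    generic_name: str,
                    specialist_bar: float = 0.55) -> str:
    # Stage 1: only specialists compete, and the winner must clear
    # an absolute bar rather than merely beating the other candidates.
    best_name, best_score = None, float("-inf")
    for name, vec in specialists.items():
        score = float(query_vec @ vec)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= specialist_bar:
        return best_name
    # Stage 2: the generic catch-all handles only what falls through.
    return generic_name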
🔥 One more thing: Some multi-tool LLM routers now use a second-stage re-ranking model over the top-k cosine matches, because the incremental latency is cheaper than the cost of misrouting to a broad, generic tool that then triggers expensive follow-up calls.
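A sketch of that two-stage shape, using the sentence-transformers CrossEncoder API for the second stage (the specific model name is an illustrative choice, not an endorsement):

import numpy as np
from sentence_transformers import CrossEncoder

def rerank_route(query: str, query_vec: np.ndarray,
                 tools: dict[str, tuple[str, np.ndarray]], k: int = 3) -> str:
    # tools maps name -> (description, unit embedding).
    # Stage 1: cheap cosine prunes the candidate set to top-k.
    shortlist = sorted(tools, key=lambda n: float(query_vec @ tools[n][1]),
                       reverse=True)[:k]
    # Stage 2: a cross-encoder reads (query, description) pairs jointly,
    # which is exactly the signal a single-vector cosine throws away.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, tools[name][0]) for name in shortlist])
    return shortlist[int(np.argmax(scores))]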
References
- On the Sentence Embeddings from Pre-trained Language Models (article)
- Improving Sentence Embeddings with Multi-view Contrastive Learning (article)
- Large Dual Encoders Are Generalizable Retrievers (GTR) (article)
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (article)