Combining BM25 and embeddings instead of using semantic search everywhere
What I learned
- Not every search problem requires full semantic or embedding-only search; many are fundamentally lexical.
- BM25 is excellent for exact or near-exact matching of tokens like names, IDs, dates, and codes.
- Embeddings shine for semantic similarity and natural language queries where wording can vary widely.
- A hybrid approach (BM25 plus embeddings) can combine exactness and meaning in one retrieval pipeline.
- Treat structured identifiers and codes as lexical problems first, then layer semantic search where it truly adds value.
In my own words
The key idea is that you do not need fancy semantic search for every kind of query. Some things, like invoice numbers, user IDs, dates, and short codes, are best handled as exact text matches. Traditional search methods like BM25 are very good at this: they look at the actual words or tokens and rank documents based on how well they match. Semantic search with embeddings is powerful when you care about meaning rather than exact wording, such as when users ask natural language questions or use synonyms. Instead of choosing only one method, you can combine them. Use BM25 to handle the exact, structured parts of a query and embeddings to handle the fuzzy, meaning-based parts. This hybrid approach gives you both precision for identifiers and flexibility for natural language, without overcomplicating your system where it is not needed.
Why this matters
This distinction matters because search infrastructure is expensive and complex, and overusing semantic search can waste resources and hurt reliability. If you treat everything as a semantic problem, you might miss exact matches for critical identifiers like invoice numbers, user IDs, or product SKUs, frustrating users who expect precise results. By recognizing which parts of your data are lexical (IDs, codes, dates, exact names) and which are semantic (descriptions, FAQs, documentation), you can design simpler, faster, and cheaper systems. A hybrid BM25-plus-embeddings approach lets you keep the robustness and interpretability of classic search while still benefiting from modern semantic retrieval where it truly helps. This leads to better relevance, easier debugging, and more predictable behavior in production search and RAG systems.
Detailed explanation
In information retrieval, not every search problem is best solved with fully semantic, embedding-only search. A lot of real-world queries are actually lexical: they care about exact tokens, spelling, and structure (e.g., IDs, codes, dates, names). For these, classic term-based ranking like BM25 is extremely strong, fast, and interpretable.
BM25 is a bag-of-words ranking function used in traditional search engines (e.g., Lucene, Elasticsearch, OpenSearch). It scores documents based on term frequency (how often a query term appears in a document), inverse document frequency (how rare the term is across the corpus), and document length normalization. This makes it excellent for matching exact tokens and prioritizing documents where those tokens are most important.
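To make that scoring intuition concrete, here is a toy, self-contained sketch of BM25 over an in-memory corpus. The whitespace tokenizer and the k1/b defaults are simplifying assumptions of this sketch; real engines like Lucene add analyzers, stemming, and more careful corpus statistics.

import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Rank docs against a query with a classic BM25 formula (toy whitespace tokenizer)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher idf; repeated terms saturate via k1,
            # and long documents are penalized via b.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(toks) / avg_len))
            score += idf * norm
        scores.append(score)
    return scores

docs = [
    "Invoice INV-2025-0001 for consulting services in December",
    "General notes about consulting engagements",
]
print(bm25_scores("invoice INV-2025-0001", docs))  # first doc scores highest: exact token match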
Embeddings, on the other hand, represent text as dense vectors in a high-dimensional space. Similar texts end up close together, enabling semantic search: you can find documents that are conceptually related even if they don’t share exact words. This is powerful for natural language questions, fuzzy phrasing, and concept-level similarity.
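On the semantic side, ranking boils down to nearest-neighbor search over those vectors. In the sketch below, embed is only a placeholder for a real sentence-embedding model (it returns random vectors so the example stays self-contained), so it only illustrates the mechanics of cosine-similarity ranking, not actual semantic quality.

import numpy as np

rng = np.random.default_rng(0)

def embed(text, dim=768):
    """Placeholder for a real embedding model: returns a random vector.
    With a real model, conceptually similar texts land close together."""
    return rng.random(dim)

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = ["How do I reset my password?", "Invoice INV-2025-0001 for December"]
query_vec = embed("forgot my login credentials")
ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, embed(d)), reverse=True)
print(ranked)  # ordering is arbitrary with random vectors; a real model would rank the related doc first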
The key insight is that you don’t have to choose one or the other. You can combine BM25 and embeddings to get the best of both worlds:
- Hybrid retrieval: Use BM25 and vector search in parallel, then merge and re-rank results. For example, run a BM25 query and an embedding-based kNN query, then combine scores (e.g., weighted sum) or use a learned re-ranker; a minimal rank-fusion sketch follows this list.
- Two-stage retrieval: Use BM25 as a fast first-stage retriever to narrow down candidates, then apply embeddings (or an LLM re-ranker) to a small subset of documents for semantic refinement.
- Field-aware strategy: Use BM25 for structured or semi-structured fields (IDs, codes, dates, product names) and embeddings for unstructured text fields (descriptions, reviews, documentation).
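For the merge step, reciprocal rank fusion is one common, scale-free option: it combines ranked lists using only rank positions, so you never have to reconcile raw BM25 scores with cosine similarities. A minimal sketch, assuming each retriever returns an ordered list of document IDs:

from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked lists of doc IDs; k dampens the influence of top ranks."""
    fused = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

bm25_hits = ["doc-123", "doc-7", "doc-42"]     # from the lexical index
vector_hits = ["doc-42", "doc-123", "doc-99"]  # from the vector index
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))  # doc-123 and doc-42 rise to the top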
A practical rule of thumb: treat names, IDs, dates, codes, and other highly structured tokens as lexical problems. You want exact or near-exact matches, and BM25 (or even keyword filters) is ideal. Use embeddings where meaning matters more than exact wording: FAQs, support tickets, knowledge base articles, research papers, and general natural language queries.
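One way to act on this rule of thumb at query time is to route tokens before searching. The regex and the split below are purely illustrative assumptions (not a standard or a library API): tokens that look like identifiers go to exact keyword matching, and the rest goes to BM25 or embedding retrieval.

import re

# Illustrative heuristic: any token made of word characters/dashes that
# contains a digit (e.g. "INV-2025-0001", "user_12345") is treated as an identifier.
ID_PATTERN = re.compile(r"^(?=[\w-]*\d)[\w-]+$")

def split_query(query):
    """Split a query into identifier-like tokens and free-text remainder."""
    lexical, semantic = [], []
    for token in query.split():
        (lexical if ID_PATTERN.match(token) else semantic).append(token)
    return lexical, " ".join(semantic)

ids, free_text = split_query("December 2025 invoice INV-2025-0001")
print(ids)        # ['2025', 'INV-2025-0001'] -> exact/keyword matching
print(free_text)  # 'December invoice'        -> full-text BM25 / embeddings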
This hybrid mindset helps avoid over-engineering with full semantic stacks where a simple BM25 query would be cheaper, faster, and more reliable. It also makes debugging easier: if a user complains that a specific ID or code is not found, you can reason about the lexical pipeline separately from the semantic one.
Implementation-wise, many modern search systems (Elasticsearch, OpenSearch, Vespa, Weaviate, Qdrant, etc.) support hybrid search patterns. You can:
- Store documents in a traditional index for BM25.
- Store embeddings in a vector index.
- At query time, run both searches and combine.
The combination strategy can be as simple as a linear interpolation of normalized scores, or as advanced as a learned-to-rank model that takes BM25 score, vector similarity, and other features as input. Start simple, measure, and iterate.
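As a concrete starting point for the simple end of that spectrum, here is a sketch of min-max normalization followed by a weighted sum. The 0.6/0.4 weights are arbitrary assumptions you would tune against your own relevance judgments.

def min_max_normalize(scores):
    """Scale a dict of doc_id -> score into [0, 1]; constant scores all map to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def combine(bm25_scores, vector_scores, bm25_weight=0.6, vector_weight=0.4):
    """Linear interpolation of normalized BM25 and vector scores."""
    bm25_n = min_max_normalize(bm25_scores)
    vec_n = min_max_normalize(vector_scores)
    docs = set(bm25_n) | set(vec_n)
    fused = {d: bm25_weight * bm25_n.get(d, 0.0) + vector_weight * vec_n.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

bm25 = {"doc-123": 12.4, "doc-7": 3.1}       # raw BM25 scores (unbounded)
vectors = {"doc-123": 0.83, "doc-99": 0.91}  # raw cosine similarities (roughly 0..1)
print(combine(bm25, vectors))  # doc-123 wins because it scores on both signals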
The main takeaway: semantic search is powerful but not universally necessary. Recognizing when a problem is lexical vs semantic lets you design simpler, more robust, and more cost-effective search systems.
Example
Below is a minimal Python example using Elasticsearch to demonstrate a hybrid approach: BM25 for lexical matching and vector search for semantic similarity.
from elasticsearch import Elasticsearch
import numpy as np
es = Elasticsearch("http://localhost:9200")
index_name = "docs_hybrid"
# 1. Create index with both text and vector fields
mapping = {
    "mappings": {
        "properties": {
            "id": {"type": "keyword"},
            "title": {"type": "text"},   # BM25 by default
            "body": {"type": "text"},    # BM25 by default
            "body_vector": {             # semantic field
                "type": "dense_vector",
                "dims": 768,
                "index": True,
                "similarity": "cosine"
            }
        }
    }
}

if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)
es.indices.create(index=index_name, body=mapping)
# 2. Index a document (pretend we already computed embeddings)
body_embedding = np.random.rand(768).tolist() # replace with real model output
es.index(
    index=index_name,
    id="doc-123",
    document={
        "id": "INV-2025-0001",  # lexical ID
        "title": "Invoice for December 2025",
        "body": "This invoice covers consulting services for December 2025.",
        "body_vector": body_embedding
    }
)
es.indices.refresh(index=index_name)
# 3. Run a hybrid search
user_query = "December 2025 invoice INV-2025-0001"
query_embedding = np.random.rand(768).tolist() # again, from your embedding model
search_body = {
    "size": 10,
    "query": {
        "bool": {
            "should": [
                # Lexical BM25 on title/body/id
                {
                    "multi_match": {
                        "query": user_query,
                        "fields": ["title^2", "body", "id^3"]
                    }
                },
                # Semantic vector search on body_vector
                {
                    "script_score": {
                        "query": {"match_all": {}},
                        "script": {
                            "source": "cosineSimilarity(params.query_vector, 'body_vector') + 1.0",
                            "params": {"query_vector": query_embedding}
                        }
                    }
                }
            ]
        }
    }
}
response = es.search(index=index_name, body=search_body)
for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"], hit["_source"]["id"])
In this example:
- The id field is treated lexically (keyword) so exact invoice IDs like INV-2025-0001 are matched reliably.
- The title and body fields use BM25 for traditional text search.
- The body_vector field enables semantic similarity.
- The bool.should clause lets both lexical and semantic signals contribute to the final score, covering both exactness (IDs, dates) and meaning (natural language queries).
Pitfalls and notes
A common pitfall is trying to solve every search problem with embeddings alone, which can fail badly for exact identifiers and codes. Embedding models may not reliably preserve precise tokens like INV-2025-0001 or user_12345, leading to missed or poorly ranked results. Another mistake is ignoring field types and treating everything as unstructured text instead of separating lexical fields (IDs, dates, codes) from semantic fields (descriptions, content). Overcomplicating the stack is also a risk: adding vector databases and LLM re-rankers where BM25 alone would suffice increases cost, latency, and operational burden. Finally, poorly combining BM25 and vector scores (e.g., without normalization) can cause one signal to dominate, hiding the benefits of the other.