AI for E-Commerce Search: Why BM25 Is Not Broken — It's Just Not Enough

April 28, 2026
9 min read

AI/ML
BM25 powers e-commerce search with speed and precision, but struggles with vocabulary gaps and intent. This blog explores its limits, why dense embeddings fall short, and how SPLADE bridges the gap—improving relevance, reducing zero results, and boosting conversions.

BM25 has powered e-commerce search since the 1990s. It remains the default scoring function in Elasticsearch, Solr, and Lucene for good reasons: it is fast, interpretable, and deterministic.

However, purely lexical search has a structural ceiling. The problem isn't that BM25 is wrong; its architecture simply cannot bridge the lexical gap: the mismatch between how customers search and how products are cataloged. Closing that gap is the core job of AI solutions for e-commerce search.

In this guide, we examine the technical limits of BM25, why dense embeddings often fail at attribute precision, and how learned sparse representations like SPLADE offer a production-ready middle ground for AI for e-commerce search that improves the user experience.

How BM25 Actually Works and What It Gets Right

BM25 (Best Matching 25) is a probabilistic ranking function that scores documents based on three pillars: term importance (IDF), term frequency (TF) saturation, and document length normalization. Unlike modern AI for e-commerce search models, BM25 relies on static mathematical weights rather than learned semantic patterns.

The Production Knobs:

  • TF Saturation (k1): Controls how quickly term repetition loses impact. The default of k1 = 1.2 prevents "keyword stuffing" from overwhelming the score.

  • Length Normalization (b): Regulates the penalty for long documents. The default of b = 0.75 ensures a short product title isn't drowned out by a 2,000-word technical spec.
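Put together, the scoring function is compact. Below is a minimal, self-contained Python sketch of classic BM25 with a smoothed IDF, using the two knobs above; it is an illustration, not Lucene's exact implementation:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document against a query with classic BM25.

    corpus: list of token lists, used for IDF and average doc length.
    k1 controls TF saturation; b controls length normalization.
    """
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)              # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        tf = doc_terms.count(term)
        # TF saturates as k1 limits repeated-term impact;
        # b penalizes documents longer than the corpus average.
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_terms) / avgdl))
    return score

corpus = [
    ["red", "summer", "dress"],
    ["maternity", "gown"],
    ["summer", "dress", "floral", "cotton"],
]
# "summer dress" scores docs 0 and 2, but scores 0 for "maternity gown".
print(bm25_score(["summer", "dress"], corpus[1], corpus))  # 0.0: no term overlap
```

Note how the score is purely a function of shared tokens: any document with zero overlap gets exactly zero, which is the structural ceiling discussed below.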

| BM25 Strengths | BM25 Limitations |
| --- | --- |
| Sub-millisecond latency (inverted index) | Zero vocabulary expansion |
| Exact matching for SKUs and model numbers | Fails on synonyms (e.g., "sneaker" vs "trainer") |
| Human-interpretable scoring | Struggles with natural language intent |
| No GPU or training data required | High maintenance for synonym files |

The Vocabulary Gap: The One Problem BM25 Cannot Solve

BM25 relies on exact character matching. This "bag-of-words" constraint creates the vocabulary mismatch problem: query tokens that simply do not appear in the document. It is the primary driver for migrating toward AI for e-commerce search.

Failures of BM25 that push teams toward AI for e-commerce search:

  • Synonymy: A search for summer dress misses products listed as sundress, floral midi, or lightweight cotton dress.

  • Jargon Discrepancy: A shopper types noise canceling, but the vendor feed uses active noise reduction or ANC.

  • Syntactic Confusion: BM25 cannot distinguish shirt dress (a dress) from dress shirt (formal attire) because both share the same tokens.

| Gap Category | Example Query | Catalog Entry | BM25 Outcome |
| --- | --- | --- | --- |
| Synonymy | pregnancy dress | maternity gown | Zero Results |
| Abbreviation | ps5 | PlayStation 5 | Missed Match |
| Jargon | respirator mask | face mask | Relevance Drop |
| Concept | warm winter jacket | lined parka | Zero Results |
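These gaps are easy to reproduce: with zero token overlap, a lexical scorer has nothing to rank on. A minimal illustration:

```python
def lexical_overlap(query, title):
    """A lexical ranker can only score shared tokens; no overlap means zero score."""
    return set(query.lower().split()) & set(title.lower().split())

# Each pair describes the same product, yet shares no tokens,
# so a pure lexical ranker returns nothing for these queries.
pairs = [
    ("pregnancy dress", "maternity gown"),
    ("warm winter jacket", "lined parka"),
]
for query, title in pairs:
    print(query, "->", lexical_overlap(query, title))  # empty set each time
```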

Data from the Baymard Institute indicates that 31% of product searches return no results even when the item exists, primarily due to this lexical gap. Because searchers convert 2-3x higher than browsers, a "zero results" page is a direct revenue leak that necessitates an AI for e-commerce search strategy.

Solving this requires understanding the underlying power of Large Language Models and how they process intent rather than just characters.

The Scaling Problem of Manual Intervention

Historically, search engineers have attempted to bridge this gap using query rewriting, expansion, and manually curated synonym lists.

While effective for head queries (the most frequent searches), these methods do not scale to the long tail of user intent.

Maintaining a synonym library for a catalog of millions of SKUs is prohibitively expensive and prone to error, particularly as language evolves and new slang or brand names emerge.

Furthermore, manual expansion can lead to query drift, where adding synonyms like car for automobile might inadvertently retrieve toy cars for a user looking for a real vehicle.

Success Story:

Discover how we applied high-precision retrieval in our Enterprise AI Knowledge Assistant Platform Case Study, resulting in a 75% reduction in information discovery time.

Why Dense Embeddings Solve One Problem and Create Another

Dense retrieval (e.g., BERT, OpenAI text-embedding-3) maps text into a continuous vector space where "dog" and "puppy" sit close together. While this solves synonymy, it introduces attribute blur which is a common hurdle in AI for e-commerce search development.

The Attribute-Blur Problem

Dense models prioritize broad semantic similarity over specific tokens. In e-commerce search, this is catastrophic: a dense search for an "iPhone 15 Pro Max 256GB" may return a 128GB version as the top result because the semantic "meaning" is nearly identical. Compression discards the specificity required for technical specifications, brands, and model numbers.

We address these complexities through our Machine Learning Services, where precision is paramount in every engagement.

Operational Constraints:

  • Interpretability: Vectors are "black boxes." You cannot explain why a specific product ranked first.

  • Infrastructure: Requires specialized vector databases and HNSW indices, which are memory-intensive.

SPLADE and Learned Sparse Retrieval

SPLADE (SParse Lexical AnD Expansion) is the next evolution for AI for e-commerce search. It produces high-dimensional sparse vectors that are lexically grounded but semantically enriched. This mirrors the shift toward more complex Agentic AI Workflows, where systems expand on simple inputs to achieve a specific goal.

Architecture and Neural Expansion

SPLADE models are typically built upon a DistilBERT backbone and utilize the Masked Language Model (MLM) head. Instead of condensing the input into a single dense vector, SPLADE calculates the importance of every term in the model's vocabulary (typically ~30,000 tokens) for the given input.

The architectural process involves:

  • MLM Logit Generation: The transformer processes the input text and outputs a score (logit) for every vocabulary token at every position in the sequence.

  • Max-Pooling: The model takes the maximum logit for each vocabulary term across all input positions, ensuring that the strongest signal for a concept is preserved.

  • Sparsification and Log Saturation: A ReLU activation is applied to keep only positive weights, followed by a log(1+w) saturation function that mimics the diminishing returns of term frequency found in BM25.

The result is a vector where only a few hundred out of 30,000 dimensions are non-zero. Crucially, SPLADE performs "neural expansion," activating terms that were not present in the original text but are semantically related. A document mentioning "maternity gown" will have weights for "pregnancy," "dress," and "women" in its sparse representation, allowing it to be retrieved by those terms without manual synonym mapping, a massive win for AI for e-commerce search efficiency.
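The three pooling steps can be sketched with mock logits in pure Python. The values below are stand-ins; a real SPLADE model emits roughly 30,000-dimensional logits per input position from a DistilBERT MLM head:

```python
import math

def splade_pool(token_logits, vocab):
    """Collapse per-position MLM logits into one sparse term-weight vector.

    token_logits: one row per input position, one column per vocab term.
    """
    weights = {}
    for j, term in enumerate(vocab):
        # Max-pool: keep the strongest activation for each vocab term.
        w = max(row[j] for row in token_logits)
        # ReLU, then log(1 + w) saturation, mimicking BM25's diminishing TF returns.
        w = math.log1p(max(w, 0.0))
        if w > 0:
            weights[term] = round(w, 3)
    return weights

vocab = ["maternity", "gown", "pregnancy", "dress", "cheap"]
# Mock logits for the 2-token input "maternity gown"; the off-text columns
# ("pregnancy", "dress") carry positive scores: that is neural expansion.
logits = [
    [4.0, 0.5, 2.1, 1.3, -0.2],   # position: "maternity"
    [0.7, 3.8, 0.4, 1.9, -1.0],   # position: "gown"
]
print(splade_pool(logits, vocab))
```

The negative "cheap" logits are zeroed by the ReLU, so that dimension stays empty; "pregnancy" and "dress" survive with positive weights even though neither token appears in the input.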

Featured Service:

Explore our Generative AI Development Services to see how we build customized neural expansion models that prevent "attribute blur."

Example Query Expansion: "noise canceling headphones"

| Token | Weight |
| --- | --- |
| headphones | 2.3 |
| noise | 1.9 |
| canceling | 1.7 |
| audio | 1.2 (Expanded) |
| wireless | 0.8 (Expanded) |

SPLADE expands the query to include audio and wireless automatically, without a synonym file. On the Amazon ESCI benchmark, fine-tuned SPLADE achieved an nDCG@10 of 0.389, a 27.5% improvement over the BM25 baseline.
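Because both queries and documents become sparse term-weight maps, retrieval reduces to a dot product over shared non-zero terms, which is exactly what a standard inverted index can serve. A minimal sketch (the document vectors and weights below are illustrative, not model output):

```python
def sparse_dot(query_vec, doc_vec):
    """Score = dot product over terms present in both sparse vectors."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

query = {"headphones": 2.3, "noise": 1.9, "canceling": 1.7,
         "audio": 1.2, "wireless": 0.8}           # weights from the table above
doc_anc = {"anc": 2.0, "headphones": 2.1, "audio": 1.4, "wireless": 1.1}
doc_speaker = {"speaker": 2.5, "audio": 1.6, "bluetooth": 1.0}

# The ANC headphones win despite never containing "noise canceling",
# because the expansion terms ("audio", "wireless") overlap.
print(sparse_dot(query, doc_anc) > sparse_dot(query, doc_speaker))  # True
```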

Decision Matrix for Search Engineers (2026)

Choosing the right retrieval architecture for AI for e-commerce search depends on your catalog structure and latency budget.

| Feature | BM25 | Dense (Vector) | Sparse (SPLADE) |
| --- | --- | --- | --- |
| Exact Matching | Excellent | Poor (Blurry) | Excellent |
| Synonym Support | None (Manual) | Automatic | Automatic (Learned) |
| Interpretability | 100% | 0% | High |
| Infrastructure | Standard CPU | GPU/VDB Required | CPU Retrieval |
| Best For | SKUs, Part Numbers | Intent, Conversational | General E-Commerce |

Benchmarking Sparse Retrieval in AI for E-Commerce Search

The superiority of learned sparse retrieval is not merely theoretical; it is backed by significant empirical evidence, particularly on the Amazon ESCI (Shopping Queries) dataset. This dataset is uniquely suited for e-commerce evaluation as it contains 1.2 million query-product pairs with four levels of human-annotated relevance: Exact, Substitute, Complement, and Irrelevant.

Performance Gains on ESCI

In head-to-head comparisons, SPLADE models fine-tuned on e-commerce data consistently outperform BM25 across all major retrieval metrics. The gains are most pronounced in the ability to retrieve relevant substitutes and complements that lack exact term overlap with the query.

| Model Variant | ESCI nDCG@10 | Wayfair WANDS nDCG@10 | Home Depot nDCG@10 |
| --- | --- | --- | --- |
| BM25 (Baseline) | 0.305 | 0.329 | 0.349 |
| SPLADE (Off-the-shelf) | 0.326 (+6.8%) | 0.341 (+3.6%) | 0.391 (+12.0%) |
| SPLADE (Fine-tuned) | 0.389 (+27.5%) | 0.355 (+7.9%) | 0.384 (+10.0%) |

The data indicates that a fine-tuned SPLADE model achieves a 27.5% improvement over BM25 on the Amazon ESCI dataset.

This is not a marginal gain; in search engineering, a 27% increase in nDCG (Normalized Discounted Cumulative Gain) represents a fundamental improvement in the user experience, often correlating with double-digit increases in conversion rates.
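For reference, nDCG@10 rewards graded relevance placed near the top of the list by discounting each gain by its rank position. A minimal sketch, using an illustrative gain mapping for the four ESCI labels (an assumption for the example, not the official ESCI evaluation weights):

```python
import math

def dcg_at_k(rels, k):
    # Graded gains (2^rel - 1) discounted by log2 of rank position.
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k=10):
    # Normalize against the ideal (descending-relevance) ordering.
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Relevance grades down a ranked list (3=Exact, 2=Substitute,
# 1=Complement, 0=Irrelevant: an illustrative mapping).
ranked = [3, 0, 2, 3, 0, 1, 0, 0, 2, 0]
print(round(ndcg_at_k(ranked, 10), 3))  # 0.875
```

Because the metric is normalized per query, a 0.08 absolute lift (0.305 to 0.389) means relevant items are systematically moved several positions up the page across the whole query stream.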

The Cost of Specialization: Catastrophic Forgetting

While fine-tuning on e-commerce data yields impressive results within that domain, it leads to a phenomenon known as catastrophic forgetting or domain degradation. When a model is trained exclusively on Amazon's data, its understanding of general language can become skewed. For example, the term "Apple" might lose its meaning as a fruit and become exclusively associated with the electronics brand, and "Prime" might be interpreted only as a shipping speed rather than a mathematical concept.

On the MS MARCO general web search benchmark, a model specialized for e-commerce saw its performance drop by 18% compared to a general-purpose BM25 baseline. For search engineers, this underscores the importance of the training data mix. If a platform serves a diverse range of products (e.g., a marketplace that sells both groceries and electronics), a multi-domain training strategy that includes balanced datasets is essential to prevent the model from becoming too specialized to a single category's jargon.

The Production Standard: Hybrid Retrieval

The current gold standard for AI for e-commerce search is Hybrid Search: running BM25 and SPLADE in parallel and merging results via Reciprocal Rank Fusion (RRF). RRF uses a simple rank-based formula with a single conventional smoothing constant (k = 60) to boost products that appear in both lists, ensuring both lexical precision and semantic recall.
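The fusion step itself is only a few lines. A sketch of RRF with the conventional k = 60 constant (the SKU names are made up):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked lists of product IDs.

    Each item earns 1 / (k + rank) per list it appears in; items found
    by both retrievers accumulate score and rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            scores[pid] = scores.get(pid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["sku-42", "sku-17", "sku-99"]     # lexical ranking
splade_top = ["sku-42", "sku-03", "sku-17"]   # learned-sparse ranking
print(rrf_fuse([bm25_top, splade_top]))
```

Note that sku-17 outranks sku-03 in the fused list: appearing in both rankings beats a single higher placement, which is the behavior that makes RRF robust without score calibration.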

For startups and large-scale retailers alike, deploying these systems requires a solid DevOps strategy for startups and enterprises to manage the infrastructure of neural indices.

Where MoogleLabs Fits Into This

Moving to a fine-tuned AI for e-commerce search system is a non-trivial engineering project. It requires robust artificial intelligence solutions and a pipeline for hard-negative mining, evaluation, and MLOps services.

MoogleLabs has built production-grade machine learning services and generative AI development services for e-commerce, including:

  • Enterprise AI Search Assistants using Vespa and Redis for contextual discovery.

  • Automated Product Enrichment to fix the data quality issues that plague retrieval.

  • Agentic AI Services that serve as long-term memory for autonomous shopping assistants.

If your team is evaluating the move to AI for e-commerce search, we offer structured consulting engagements to map your architecture. Consult with our ML team.
