
SPLADE-v3: New baselines for SPLADE
Key Points
- This paper introduces SPLADE-v3, a new series of SPLADE models, detailing improvements to their training structure.
- Key training enhancements include using multiple hard negatives, leveraging ensemble cross-encoder distillation scores, and combining KL-Div and MarginMSE losses.
- SPLADE-v3 achieves statistically significant effectiveness gains over BM25 and the SPLADE++ models, notably surpassing 40 MRR@10 on MS MARCO, and shows competitive performance against cross-encoder re-rankers.
This paper introduces SPLADE-v3, a new series of SPLADE models, presenting advancements in training methodology and demonstrating improved effectiveness. The core contributions lie in refining the training pipeline for sparse neural Information Retrieval (IR) models.
The training enhancements for SPLADE-v3 include:
- Multiple Negatives Per Batch: Following Tevatron's approach, SPLADE-v3 incorporates more than one hard negative per batch. Specifically, for each query, 100 negatives are sampled: 50 from the top-50 and 50 from the top-1k results of a SPLADE++SelfDistil model. This strategy is noted to improve in-domain results.
- Better Distillation Scores: To generate more robust distillation scores, an ensemble of five cross-encoder re-rankers is employed, rather than a single model. The re-rankers include cross-encoder/ms-marco-MiniLM-L-6-v2 and four naver/trecdl22-crossencoder models (rankT53b-repro, debertav3, debertav2, electra). Two types of scores are generated:
  - Simple Ensemble Scores: These are produced by feeding each of the 500k MS MARCO training queries (paired with positive documents and 100 negatives) to the re-rankers, with scores normalized per query using min-max aggregation from ranx.
  - Rescored Version: An affine transformation is applied to the ensemble scores so that their mean and standard deviation (μ, σ) closely mimic those of the scores generated by the cross-encoder/ms-marco-MiniLM-L-6-v2 model used in previous SPLADE iterations. This empirical adjustment helps distillation, particularly for MarginMSE.
- Two Distillation Losses: The training combines two effective distillation losses: KL-Divergence (L_KL) and MarginMSE (L_MarginMSE). It was observed that MarginMSE tends to emphasize recall, while KL-Div favors precision. Combining them with empirically determined weights λ_KL and λ_MarginMSE achieves overall better results. The total loss can be expressed as L = λ_KL · L_KL + λ_MarginMSE · L_MarginMSE.
- Further Fine-Tuning: The training process for SPLADE-v3 begins from a pre-trained SPLADE++SelfDistil checkpoint, rather than from generic language models like CoCondenser or DistilBERT. This warm-start approach leads to better effectiveness, possibly due to a form of curriculum learning.
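The negative-sampling scheme described above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the 50 tail negatives come from positions 51-1000 of the first-stage ranking (the paper says "50 from the top-1k", which could also mean the full list), and the `sample_negatives` helper name is invented here.

```python
import random

def sample_negatives(top1k, k_top=50, k_rest=50, seed=0):
    """Sample 100 hard negatives for one query from a first-stage run.

    top1k: ranked list of doc ids from the SPLADE++SelfDistil retriever.
    Takes k_top docs from the top-50 and k_rest from the rest of the
    top-1k (positions 51-1000) -- one reading of the paper's setup.
    """
    rng = random.Random(seed)
    head = rng.sample(top1k[:50], min(k_top, len(top1k[:50])))
    tail = rng.sample(top1k[50:1000], min(k_rest, len(top1k[50:1000])))
    return head + tail
```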
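The two score variants (min-max ensemble and affine rescoring) can be sketched in a few lines. This is a simplified stand-in for the ranx-based pipeline: `minmax_ensemble` and `rescore` are illustrative names, and `target_mean`/`target_std` represent the MiniLM teacher statistics, whose actual values are not given here.

```python
import numpy as np

def minmax_ensemble(per_model_scores):
    """Min-max normalize each re-ranker's scores for one query, then average.

    per_model_scores: list of 1-D score arrays, one per cross-encoder,
    all over the same (query, passage) candidates.
    """
    normed = []
    for s in per_model_scores:
        s = np.asarray(s, dtype=float)
        lo, hi = s.min(), s.max()
        normed.append((s - lo) / (hi - lo) if hi > lo else np.zeros_like(s))
    return np.mean(normed, axis=0)

def rescore(scores, target_mean, target_std):
    """Affine-transform ensemble scores so their mean/std match a target
    distribution (e.g. the MiniLM teacher used in earlier SPLADE training)."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / scores.std() * target_std + target_mean
```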
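The combined distillation objective can be sketched as below, using the standard definitions of the two losses (KL-Div over softmax-normalized candidate scores, MarginMSE over positive-negative margins). The weight defaults are placeholders: the paper tunes λ_KL and λ_MarginMSE empirically, and those values are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student, teacher, w_kl=1.0, w_mse=0.05):
    """Combined KL-Div + MarginMSE distillation loss (sketch).

    student, teacher: [batch, n_candidates] score matrices where column 0
    is the positive passage and the remaining columns are hard negatives.
    w_kl / w_mse are illustrative placeholders, not the paper's values.
    """
    # KL divergence between candidate distributions (precision-oriented)
    p_t, p_s = softmax(teacher), softmax(student)
    kl = np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=1))
    # MarginMSE on positive-negative score margins (recall-oriented)
    s_margin = student[:, :1] - student[:, 1:]
    t_margin = teacher[:, :1] - teacher[:, 1:]
    mse = np.mean((s_margin - t_margin) ** 2)
    return w_kl * kl + w_mse * mse
```

When student and teacher scores agree exactly, both terms vanish, so the loss is zero.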
The primary SPLADE-v3 model (naver/splade-v3) starts from SPLADE++SelfDistil, uses the mixed KL-Div and MarginMSE losses, and samples 8 negatives per query from SPLADE++SelfDistil outputs. The evaluation framework is a meta-analysis using the RANGER toolkit across 44 diverse query sets, measuring effectiveness with nDCG*@10 (nDCG* accounts for judged documents only).
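The judged-only metric can be sketched as follows, under the common definition of nDCG* (unjudged documents are removed from the ranking before truncating at 10, rather than being treated as non-relevant). The evaluation toolkit's exact implementation may differ in details such as the gain function.

```python
import math

def ndcg_star_at_10(ranking, qrels):
    """nDCG*@10: nDCG@10 over judged documents only (sketch).

    ranking: doc ids in retrieved order.
    qrels: dict mapping judged doc id -> graded relevance.
    """
    judged = [d for d in ranking if d in qrels][:10]       # drop unjudged
    dcg = sum((2 ** qrels[d] - 1) / math.log2(i + 2)
              for i, d in enumerate(judged))
    ideal = sorted(qrels.values(), reverse=True)[:10]      # best possible order
    idcg = sum((2 ** g - 1) / math.log2(i + 2)
               for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```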
Empirical evaluations demonstrate that SPLADE-v3 yields statistically significant improvements over BM25 across most query sets, with only a few exceptions like Webis Touché-2020 and TREC-MQ. Compared to its initialization model, SPLADE++SelfDistil, SPLADE-v3 generally shows gains, with only the Quora dataset experiencing a slight decrease. When compared to cross-encoder re-rankers (MiniLM and DeBERTaV3 re-ranking the top-50 documents retrieved by SPLADE-v3), SPLADE-v3 performs comparably to MiniLM, and while generally outperformed by DeBERTaV3, it serves as a strong first-stage retriever.
In addition to the base SPLADE-v3, the paper introduces three variants:
- SPLADE-v3-DistilBERT: Starts training from DistilBERT, offering a smaller inference footprint, though with slightly lower effectiveness than the base model.
- SPLADE-v3-Lexical: Removes query expansion, reducing the number of floating-point operations (FLOPs) at retrieval time and improving efficiency. It is highly effective on in-domain tasks like MS MARCO and LoTTE but struggles with out-of-domain (BEIR) datasets.
- SPLADE-v3-Doc: Starts from CoCondenser and performs no computation for the query, effectively operating as a binary Bag-of-Words model for documents. While the least effective overall, especially in zero-shot settings, it remains competitive with state-of-the-art dense bi-encoders given its high efficiency.
Overall, SPLADE-v3 represents a significant step forward for sparse neural IR models, offering enhanced effectiveness and serving as a robust baseline that can even rival more complex re-rankers.