BAAI/bge-code-v1 · Hugging Face
Key Points
- BGE-Code-v1 is an LLM-based code embedding model designed for comprehensive retrieval across code, text, and multilingual contexts, supporting natural language queries in English and Chinese, plus 20 programming languages.
- The model demonstrates superior code retrieval, robust text retrieval comparable to specialized text embedding models, and extensive multilingual capabilities including English, Chinese, Japanese, and French.
- BGE-Code-v1 achieves state-of-the-art performance on both the CoIR and CodeRAG benchmarks, showcasing its effectiveness in various code and natural language retrieval tasks.
BAAI/bge-code-v1 is a state-of-the-art, LLM-based code embedding model developed by FlagEmbedding, designed for versatile retrieval tasks including code retrieval, text retrieval, and multilingual retrieval. The model is built upon a Qwen2 architecture, as suggested by its associated tags.
Core Capabilities:
- Superior Code Retrieval Performance: It excels at retrieving code snippets given natural language queries, supporting both English and Chinese queries across 20 programming languages.
- Robust Text Retrieval Capabilities: Despite its specialization in code, it maintains strong general text retrieval performance, comparable to dedicated text embedding models of similar scales.
- Extensive Multilingual Support: The model demonstrates proficiency in multilingual retrieval, with strong performance observed in languages such as English, Chinese, Japanese, and French.
Core Methodology:
The BGE-Code-v1 model leverages an instruction-tuned large language model architecture to generate embeddings. The key technical aspects include:
- Instruction Tuning: To enhance retrieval performance and task specificity, the model is fine-tuned to accept an explicit `instruction` (task description) alongside the query. This instruction provides context about the retrieval task, guiding the model to generate more semantically aligned embeddings; in the standard query format, the instruction is prepended to the query text. For documents (the corpus), typically no instruction is provided, or a generic one is implicitly used during training. This approach helps the model understand the relationship between queries and documents in various retrieval scenarios.
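As an illustrative sketch of instruction-conditioned queries: the helper name `format_query` and the exact template string below are assumptions, not the model card's verbatim format; the point is simply that the task description is prepended to the raw query before embedding.

```python
def format_query(task_description: str, query: str) -> str:
    # Hypothetical template: the model card defines the exact string.
    # The key idea is to prepend the retrieval instruction to the query
    # so the encoder produces task-aligned embeddings.
    return f"<instruct>{task_description}\n<query>{query}"

query = format_query(
    "Given a natural language question, retrieve relevant code snippets.",
    "How do I reverse a linked list in Python?",
)
```

Documents, by contrast, would be encoded without any instruction prefix, matching the asymmetric query/corpus treatment described above.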
- Embedding Generation (Pooling Strategy): The model processes input texts (queries or documents) through its transformer layers and derives a fixed-size embedding vector for each sequence as follows:
- Last Token Pooling: The embedding for a sequence is extracted from the hidden state of the last meaningful token (often the `[EOS]` token, or the last non-padding token). This is implemented by a `last_token_pool(last_hidden_states, attention_mask)` function. Given `last_hidden_states` of shape (batch size, sequence length, hidden dimension) and `attention_mask` of shape (batch size, sequence length):
- If left-padding is used (i.e., padding tokens are at the beginning), the hidden state at the final position, `last_hidden_states[:, -1]`, is used for every sequence.
- Otherwise (right-padding), the method computes each example's actual sequence length from the attention mask and selects the hidden state of the corresponding last non-padding token.
- L2 Normalization: The extracted embedding vectors are then L2-normalized to unit norm, e = h / ||h||₂.
This normalization is crucial for cosine similarity calculations, as it ensures that the similarity depends solely on the angle between vectors, not their magnitude.
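The pooling logic above can be sketched as follows. This is an illustrative NumPy version of the behavior described (the model's actual implementation operates on PyTorch tensors); the function name mirrors the `last_token_pool` mentioned above.

```python
import numpy as np

def last_token_pool(last_hidden_states, attention_mask):
    """Select the hidden state of each sequence's last non-padding token.

    last_hidden_states: (batch, seq_len, hidden_dim)
    attention_mask:     (batch, seq_len), 1 for real tokens, 0 for padding
    """
    batch_size = last_hidden_states.shape[0]
    # Left padding: the final position holds a real token for every sequence.
    if attention_mask[:, -1].sum() == batch_size:
        return last_hidden_states[:, -1]
    # Right padding: index each sequence at its last real-token position.
    sequence_lengths = attention_mask.sum(axis=1) - 1
    return last_hidden_states[np.arange(batch_size), sequence_lengths]

# Toy batch: sequence 0 has 2 real tokens (right-padded), sequence 1 has 3.
hidden = np.arange(12, dtype=float).reshape(2, 3, 2)
mask = np.array([[1, 1, 0], [1, 1, 1]])
pooled = last_token_pool(hidden, mask)  # shape (2, 2)
```

The left-padding shortcut works because with left padding every sequence ends at the same (final) position, so a single slice suffices.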
- Similarity Calculation: For retrieval, cosine similarity measures the semantic relatedness between a query embedding q and a document embedding d.
Given that the embeddings are already L2-normalized, the cosine similarity simplifies to the dot product: score = q · d.
In the provided examples, the scores are sometimes scaled by 100, which is a common practice for displaying similarity scores.
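A minimal sketch of this scoring step, assuming raw (unnormalized) embedding vectors as input: after L2 normalization, a single matrix product yields all query–document cosine similarities, optionally scaled by 100 for display.

```python
import numpy as np

def l2_normalize(x):
    # Divide each row vector by its L2 norm so dot products equal cosines.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

queries = l2_normalize(np.array([[1.0, 2.0, 3.0]]))
docs = l2_normalize(np.array([[1.0, 2.0, 3.0],
                              [3.0, -1.0, 0.5]]))

# (num_queries, num_docs) similarity matrix; x100 scaling for readability.
scores = (queries @ docs.T) * 100
```

Because both sides are unit-norm, no separate normalization is needed inside the scoring loop, which keeps large-scale retrieval to a single matrix multiplication.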
Performance and Evaluation:
BGE-Code-v1 achieves state-of-the-art results on prominent code retrieval benchmarks:
- CoIR (Code-oriented Information Retrieval): The model achieves an average score of 81.77, outperforming previous models like CodeXEmbed-2B (75.65), CodeXEmbed-7B (78.20), Voyage-Code-002 (56.26), and Voyage-Code-003 (78.53). It demonstrates particularly strong performance on tasks like Apps (98.08), CSN-CCR (98.30), and CodeFeedBack-MT (94.38).
- CodeRAG (Code Retrieval Augmented Generation): BGE-Code-v1 obtains an average score of 72.8, surpassing SFR (67.0), Jina-v2-code (65.4), CodeXEmbed-2B (64.6), and Voyage-Code-002 (63.7). Notable results include 100.0 on HumanEval, 99.2 on MBPP, 40.9 on DS-1000, 36.1 on ODEX, 93.1 on RepoEval, and 67.4 on SWE-bench-Lite.
The model demonstrates significant advancements in code and multilingual retrieval by combining an LLM foundation with instruction-tuning and specific pooling strategies.