BAAI/bge-code-v1 · Hugging Face

2025.05.25 · Hugging Face · by Anonymous
#LLM #code embedding #retrieval #multilingual #FlagEmbedding

Key Points

  1. BGE-Code-v1 is an LLM-based code embedding model designed for comprehensive retrieval across code, text, and multilingual contexts, supporting natural-language queries in English and Chinese plus 20 programming languages.
  2. The model demonstrates superior code retrieval, robust text retrieval comparable to specialized text embedding models, and extensive multilingual capabilities including English, Chinese, Japanese, and French.
  3. BGE-Code-v1 achieves state-of-the-art performance on both the CoIR and CodeRAG benchmarks, showcasing its effectiveness across diverse code and natural-language retrieval tasks.

BAAI/bge-code-v1 is a state-of-the-art, LLM-based code embedding model from BAAI's FlagEmbedding project, designed for versatile retrieval tasks spanning code retrieval, text retrieval, and multilingual retrieval. The model is built on a Qwen2 architecture, as indicated by its associated tags. A minimal usage sketch follows the capability list below.

Core Capabilities:

  1. Superior Code Retrieval Performance: It excels at retrieving code snippets given natural language queries, supporting both English and Chinese queries across 20 programming languages.
  2. Robust Text Retrieval Capabilities: Despite its specialization in code, it maintains strong general text retrieval performance, comparable to dedicated text embedding models of similar scales.
  3. Extensive Multilingual Support: The model demonstrates proficiency in multilingual retrieval, with strong performance observed in languages such as English, Chinese, Japanese, and French.
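The following is a minimal usage sketch, assuming the FlagEmbedding package's FlagLLMModel wrapper (the usage pattern shown on the model card); the SQL-retrieval instruction string and the example texts are illustrative, not taken from the source:

```python
# Minimal sketch, assuming FlagEmbedding's FlagLLMModel wrapper; the task
# instruction and example texts below are illustrative only.
from FlagEmbedding import FlagLLMModel

queries = ["Delete the record with ID 4 from the 'Staff' table."]
documents = [
    "DELETE FROM Staff WHERE id = 4;",
    "SELECT name FROM Staff WHERE id = 4;",
]

model = FlagLLMModel(
    "BAAI/bge-code-v1",
    query_instruction_for_retrieval=(
        "Given a question in text, retrieve SQL queries that are appropriate "
        "responses to the question."
    ),
    query_instruction_format="<instruct>{}\n<query>{}",
    use_fp16=True,
)

q_emb = model.encode_queries(queries)   # instruction is prepended to each query
d_emb = model.encode_corpus(documents)  # corpus texts are encoded as-is
scores = q_emb @ d_emb.T                # embeddings are unit-norm, so this is cosine similarity
print(scores)
```

Note that only the query side carries the instruction; the corpus side is encoded without one, matching the asymmetric prompt scheme described in the methodology below.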

Core Methodology:
The BGE-Code-v1 model leverages an instruction-tuned large language model architecture to generate embeddings. The key technical aspects include:

  • Instruction Tuning: To enhance retrieval performance and task specificity, the model is fine-tuned to accept an explicit instruction (a task_description) alongside the query. This instruction provides context about the retrieval task, guiding the model to generate more semantically aligned embeddings. The standard prompt format for queries is f'<instruct>{task_description}\n<query>{query}'. For documents (the corpus side), typically no instruction is provided, or a generic one is implicitly used during training. This approach helps the model capture the relationship between queries and documents across retrieval scenarios.
  • Embedding Generation (Pooling Strategy): The model processes input texts (queries or documents) through its transformer layers. To derive a fixed-size embedding vector for each sequence, it employs a specific pooling strategy (see the full sketch after this list):
    • Last Token Pooling: The embedding for a sequence is extracted from the hidden state of the last token (often the [EOS] token or the last non-padding token). This is implemented by last_token_pool(last_hidden_states, attention_mask).
    • Specifically, given last_hidden_states of shape (B, L, D) (batch size, sequence length, hidden dimension) and attention_mask of shape (B, L):
      • If left-padding is used (i.e., padding tokens are at the beginning), the last position's hidden state last_hidden_states[:, -1] is used.
      • Otherwise (right-padding), the actual sequence length of each example is computed as sequence_lengths = attention_mask.sum(dim=1) - 1, and the corresponding last non-padding token's hidden state is selected: last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths].
    • L2 Normalization: The extracted embedding vectors are then L2-normalized to unit norm:

\text{embeddings}_{\text{normalized}} = \frac{\text{embeddings}}{\|\text{embeddings}\|_2}

This normalization is crucial for cosine similarity calculations, as it ensures that the similarity depends solely on the angle between vectors, not their magnitude.

  • Similarity Calculation: For retrieval, cosine similarity measures the semantic relatedness between a query embedding \mathbf{e}_Q and a document embedding \mathbf{e}_D:

\text{similarity}(\mathbf{e}_Q, \mathbf{e}_D) = \frac{\mathbf{e}_Q \cdot \mathbf{e}_D}{\|\mathbf{e}_Q\|_2 \, \|\mathbf{e}_D\|_2}

Given that the embeddings are already L2-normalized, the cosine similarity simplifies to the dot product:

\text{similarity}(\mathbf{e}_Q, \mathbf{e}_D) = \mathbf{e}_Q \cdot \mathbf{e}_D

In the provided examples, scores are sometimes scaled by 100 (e.g., scores = (embeddings[:2] @ embeddings[2:].T) * 100), a common convention for displaying similarity scores; the sketch below ends with exactly this computation.
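For readers who want the lower-level path, here is a self-contained sketch of the whole pipeline described above using plain transformers: instruction-formatted query, last-token pooling, L2 normalization, and dot-product scoring. The last_token_pool logic mirrors the bullets; the max_length value, the task description, and the example texts are assumptions for illustration:

```python
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModel, AutoTokenizer

def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # If every sequence's final position is attended, the batch is left-padded,
    # so position -1 already holds each sequence's last real token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    # Right-padding: index each row at its last non-padding token.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[
        torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths
    ]

def get_detailed_instruct(task_description: str, query: str) -> str:
    # Standard query prompt: '<instruct>{task_description}\n<query>{query}'
    return f"<instruct>{task_description}\n<query>{query}"

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-code-v1")
model = AutoModel.from_pretrained("BAAI/bge-code-v1")
model.eval()

task = ("Given a question in text, retrieve SQL queries that are appropriate "
        "responses to the question.")  # illustrative task description
texts = [
    get_detailed_instruct(task, "Delete the record with ID 4 from the 'Staff' table."),
    "DELETE FROM Staff WHERE id = 4;",       # corpus side: no instruction
    "SELECT name FROM Staff WHERE id = 4;",
]

batch = tokenizer(texts, padding=True, truncation=True,
                  max_length=4096, return_tensors="pt")  # max_length is an assumption
with torch.no_grad():
    outputs = model(**batch)

embeddings = last_token_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)  # unit norm: dot product == cosine

# Query vs. documents; the *100 only rescales the scores for readability.
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```

Either padding side works here, because last_token_pool handles both branches described in the bullets above.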

Performance and Evaluation:
BGE-Code-v1 achieves state-of-the-art results on prominent code retrieval benchmarks:

  • CoIR (Code-oriented Information Retrieval): The model achieves an average score of 81.77, outperforming previous models like CodeXEmbed-2B (75.65), CodeXEmbed-7B (78.20), Voyage-Code-002 (56.26), and Voyage-Code-003 (78.53). It demonstrates particularly strong performance on tasks like Apps (98.08), CSN-CCR (98.30), and CodeFeedBack-MT (94.38).
  • CodeRAG (Code Retrieval-Augmented Generation): BGE-Code-v1 obtains an average score of 72.8, surpassing SFR (67.0), Jina-v2-code (65.4), CodeXEmbed-2B (64.6), and Voyage-Code-002 (63.7). Notable results include 100.0 on HumanEval, 99.2 on MBPP, 40.9 on DS-1000, 36.1 on ODEX, 93.1 on RepoEval, and 67.4 on SWE-bench-Lite.

By combining an LLM foundation with instruction tuning and last-token pooling, the model delivers significant advances in code and multilingual retrieval.