Implementing the BGE-M3 Model


2025.06.29
· Web · by Anonymous
#BGE-M3 #Transformer #NLP #TensorFlow #Keras

Key Points

  1. This paper presents a hands-on guide to implementing the BGE-M3 multilingual embedding model from scratch in TensorFlow-Keras, focusing on its core architecture, which is composed mainly of Dense and LayerNormalization layers.
  2. It details the step-by-step construction of the model's components, including the word, position, and token type embeddings as well as the Transformer block's Multi-Head Attention and Feed-Forward Network, adhering to the XLM-RoBERTa base structure.
  3. The complete TensorFlow implementation enables versatile inference deployment across platforms, highlighting its utility for tasks like Retrieval-Augmented Generation (RAG) and efficient multilingual search.

The paper details the implementation of the BGE-M3 (BAAI General Embedding - Multi-Linguality, Multi-Functionality, Multi-Granularity) model using TensorFlow-Keras, emphasizing its lean architecture that relies primarily on Dense layers and LayerNormalization for inference. BGE-M3 is a multilingual embedding model supporting over 70 languages, noted for its strong performance on Korean MTEB benchmarks and its utility in Retrieval-Augmented Generation (RAG) tasks.

The model's core architecture is based on the XLMRobertaModel, which is characterized by its simplicity, avoiding more recent complexities like Rotary Position Embedding (RoPE), Pre-Normalization, or linear bias removal. The paper highlights that the inference structure can be realized with just nine fundamental linear layers (Dense, Linear, MLP) and three LayerNormalizations per block, repeated across 24 Transformer blocks.
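For reference, the hyperparameters stated throughout the article can be collected into one place. A minimal sketch (the dict name is illustrative; `pad_token_id=1` is the standard RoBERTa convention and is an assumption, not stated in the text):

```python
# Hyperparameters of BGE-M3 / XLM-RoBERTa-large as described in the article.
BGE_M3_CONFIG = {
    "vocab_size": 250_002,            # word-embedding rows
    "hidden_size": 1024,              # d_model
    "max_position_embeddings": 8194,  # position-embedding rows
    "type_vocab_size": 1,             # single token type
    "num_hidden_layers": 24,          # Transformer blocks
    "num_attention_heads": 16,
    "intermediate_size": 4096,        # FFNN expansion
    "layer_norm_eps": 1e-5,
    "pad_token_id": 1,                # RoBERTa convention (assumption)
}

# Per-head depth follows from hidden_size / num_attention_heads.
head_depth = BGE_M3_CONFIG["hidden_size"] // BGE_M3_CONFIG["num_attention_heads"]
print(head_depth)  # 64
```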

The implementation consists of three main parts:

  1. Embedding Layer:
    • Word Embedding: Maps 250,002 tokens to 1024-dimensional vectors. Implemented via tf.keras.layers.Embedding.
    • Position Embedding: Encodes positional information for up to 8,194 positions into 1024-dimensional vectors. Implemented via tf.keras.layers.Embedding. Position IDs are generated dynamically using create_position_ids_from_input_ids, which computes cumulative sums of a mask derived from input_ids (where padding tokens are ignored).
    • Token Type Embedding: A single 1024-dimensional constant vector applied to all tokens, since BGE-M3 uses only one token type. Implemented via tf.keras.layers.Embedding(input_dim=1, output_dim=1024).
    • All three embeddings are summed: embedding_output = inputs_embeds + position_embeds + token_type_embeds.
    • Finally, the combined embedding output undergoes LayerNormalization with ε = 1e-5: embedding_output = LayerNorm(embedding_output).
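The dynamic position-ID step is the only non-obvious part of the embedding layer. A minimal NumPy sketch of the cumulative-sum logic (the function name mirrors the one in the text; padding_idx=1 is assumed per the RoBERTa convention):

```python
import numpy as np

def create_position_ids_from_input_ids(input_ids, padding_idx=1):
    # Real tokens get consecutive positions starting at padding_idx + 1;
    # padding tokens keep position padding_idx and contribute nothing.
    mask = (input_ids != padding_idx).astype(np.int64)
    incremental = np.cumsum(mask, axis=1) * mask
    return incremental + padding_idx

ids = np.array([[0, 284, 9, 2, 1, 1]])  # 1 = <pad>
print(create_position_ids_from_input_ids(ids))  # [[2 3 4 5 1 1]]
```

Starting real positions at padding_idx + 1 also explains why the position table has 8,194 rows rather than 8,192: two low indices are reserved.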
  2. Transformer Block: Each block is composed of Multi-Head Self-Attention (MHA) and a Feed-Forward Neural Network (FFNN), with residual connections and LayerNormalization applied after each sub-layer. The model stacks 24 such blocks.
    • Multi-Head Attention (MHA):
      • Input inputs (from the embedding layer or the previous block) is linearly transformed into Query (Q), Key (K), and Value (V) tensors using distinct tf.keras.layers.Dense layers, each with output_dim=1024.
      • Q, K, V are then split into 16 heads (num_heads=16), each with a depth of 1024 / 16 = 64. This involves reshaping and transposing: (batch_size, seq_len, d_model) → (batch_size, num_heads, seq_len, depth).
      • Scaled Dot-Product Attention: Attention scores are computed as softmax(QK^T / sqrt(d_k) + mask) · V. Here d_k is the key dimension (64), so the scores are divided by sqrt(64) = 8.0.
      • The attention mask is applied by adding a large negative value (-10000.0) to padding-token positions before the softmax, effectively zeroing out their attention weights. The extended_attention_mask is derived from attention_mask_origin as (1 - mask) × -10000.0.
      • The output from all heads is concatenated and passed through a final tf.keras.layers.Dense layer (output_dim=1024).
      • A residual connection adds the original inputs to the attention output, followed by LayerNormalization: attention_output = LayerNorm(attention_output + inputs).
    • Feed-Forward Neural Network (FFNN):
      • The output of the attention sub-layer (attention_output) is passed through an intermediate tf.keras.layers.Dense layer that expands the dimension to intermediate_size=4096.
      • A GELU (Gaussian Error Linear Unit) activation is applied in its exact erf form: GELU(x) = x × 0.5 × (1.0 + erf(x / sqrt(2.0))).
      • The result then passes through an output_dense tf.keras.layers.Dense layer, projecting back to d_model=1024.
      • A second residual connection adds attention_output to the FFNN output, followed by LayerNormalization: output = LayerNorm(layer_output + attention_output).
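The two numerical kernels inside the block, masked scaled dot-product attention and erf-based GELU, can be sketched in NumPy to check shapes and masking behavior (a simplified sketch with random tensors; function names are illustrative and the real model applies these with learned Dense weights):

```python
import numpy as np
from math import erf, sqrt

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, extended_mask):
    # q, k, v: (batch, num_heads, seq_len, depth); in BGE-M3 depth = 64, scale = 8.0
    scores = q @ k.transpose(0, 1, 3, 2) / sqrt(q.shape[-1])
    weights = softmax(scores + extended_mask, axis=-1)  # mask adds -10000.0 at pads
    return weights @ v, weights

def gelu(x):
    # exact erf-based GELU, as given in the text
    return x * 0.5 * (1.0 + np.vectorize(erf)(x / sqrt(2.0)))

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(1, 2, 4, 8)) for _ in range(3))
attention_mask = np.array([[1, 1, 1, 0]])  # last token is padding
extended_mask = (1.0 - attention_mask)[:, None, None, :] * -10000.0
out, w = scaled_dot_product_attention(q, k, v, extended_mask)
print(out.shape)         # (1, 2, 4, 8)
print(w[..., -1].max())  # ~0: no head attends to the padded position
```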
  3. Model Forward Flow and Output:
    • Input input_ids and attention_mask are processed through the embedding layer and then iteratively through 24 Transformer blocks. The output of the last block is hidden_states.
    • Dense Retrieval Output: The pooled output for dense retrieval is the first token's (CLS) hidden state: pooled_output = hidden_states[:, 0, :].
    • Multi-Vector Retrieval Output: An additional colbert_linear tf.keras.layers.Dense layer is applied to the non-CLS tokens of hidden_states (i.e., hidden_states[:, 1:]). The result is then masked by the original attention mask (excluding the CLS position) to zero out padding tokens: colbert_vecs = colbert_linear(hidden_states[:, 1:]) × attention_mask_origin[:, 1:][:, :, None].
    • The model returns a dictionary containing dense_vecs and colbert_vecs.
    • The paper also provides instructions for loading the colbert_linear weights, which are typically found as separate PyTorch files.
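Given the last block's hidden_states, the two retrieval outputs reduce to slicing, a projection, and masking. A NumPy sketch with toy dimensions (colbert_W is a random stand-in for the learned colbert_linear weight):

```python
import numpy as np

batch, seq_len, d_model = 1, 5, 8
rng = np.random.default_rng(1)
hidden_states = rng.normal(size=(batch, seq_len, d_model))
attention_mask_origin = np.array([[1, 1, 1, 1, 0]], dtype=np.float64)
colbert_W = rng.normal(size=(d_model, d_model))  # stand-in for colbert_linear

# Dense retrieval: the CLS token's hidden state.
dense_vecs = hidden_states[:, 0, :]

# Multi-vector retrieval: project non-CLS tokens, zero out padding positions.
colbert_vecs = (hidden_states[:, 1:] @ colbert_W) \
    * attention_mask_origin[:, 1:][:, :, None]

print(dense_vecs.shape)                   # (1, 8)
print(colbert_vecs.shape)                 # (1, 4, 8)
print(np.abs(colbert_vecs[0, -1]).max())  # 0.0: padded token is zeroed
```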

The paper concludes by demonstrating how to save the implemented TensorFlow-Keras model with a serving signature for deployment, enabling its use across various platforms and applications, from large-scale Hadoop/Spark jobs to mobile inference via TensorFlow Lite.