
STEM: Scaling Transformers with Embedding Modules
Key Points
- STEM (Scaling Transformers with Embedding Modules) introduces a static, token-indexed method that replaces the FFN up-projection with a layer-local embedding lookup, aiming to address the instability and overheads of fine-grained sparsity.
- This approach enhances training stability, reduces per-token FLOPs and parameter access by eliminating about one-third of FFN parameters, and enables CPU offload by prefetching embeddings.
- STEM demonstrates significant accuracy improvements on knowledge-intensive tasks, provides unique interpretability for knowledge editing, and exhibits robust capacity scaling for long-context performance.
STEM (Scaling Transformers with Embedding Modules) is a novel approach designed to enhance the parametric capacity and efficiency of Transformer models, particularly by addressing challenges inherent in fine-grained sparsity methods such as Mixture-of-Experts (MoE). It aims to provide higher capacity without a proportional increase in per-token compute, while mitigating the training instability, load-balancing difficulties, and communication overhead typically associated with MoE.
The core methodology of STEM involves a static, token-indexed modification to the Feed-Forward Network (FFN) architecture. Specifically, in a gated FFN (e.g., SwiGLU), the up-projection matrix ($W_{\text{up}}$) is replaced with a layer-local, token-indexed embedding lookup. The gate projection ($W_{\text{gate}}$) and down-projection ($W_{\text{down}}$) weights remain dense and shared across tokens. For a given layer $\ell$, input hidden state $x$, and current token ID $t$, the STEM layer computes its output as:

$$\mathrm{STEM}_\ell(x, t) = W_{\text{down}}\left(\sigma(W_{\text{gate}}\, x) \odot E_\ell[t]\right)$$

where $E_\ell \in \mathbb{R}^{V \times d_{\text{ff}}}$ is the per-layer embedding table (with $V$ being the vocabulary size and $d_{\text{ff}}$ the FFN hidden dimension), $E_\ell[t]$ is the row of $E_\ell$ corresponding to token $t$, $\sigma$ is the gate nonlinearity, and $\odot$ denotes element-wise multiplication.
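A minimal NumPy sketch of this forward pass may help make the lookup concrete; the toy dimensions, random initialization, and SiLU gate nonlinearity below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def silu(z):
    # SiLU nonlinearity, a common choice for the SwiGLU gate (an assumption here)
    return z / (1.0 + np.exp(-z))

def stem_ffn(x, token_id, W_gate, W_down, E):
    """One STEM FFN layer for a single token.

    x        : (d,)        input hidden state
    token_id : int         current token ID
    W_gate   : (d_ff, d)   dense gate projection, shared across tokens
    W_down   : (d, d_ff)   dense down projection, shared across tokens
    E        : (V, d_ff)   layer-local embedding table replacing the up-projection
    """
    gate = silu(W_gate @ x)       # context-dependent gate
    up = E[token_id]              # static, token-indexed "address" vector
    return W_down @ (gate * up)   # element-wise modulation, then down-projection

# Toy dimensions (illustrative only)
rng = np.random.default_rng(0)
d, d_ff, V = 8, 16, 32
x = rng.standard_normal(d)
W_gate = rng.standard_normal((d_ff, d)) * 0.1
W_down = rng.standard_normal((d, d_ff)) * 0.1
E = rng.standard_normal((V, d_ff)) * 0.1
y = stem_ffn(x, token_id=5, W_gate=W_gate, W_down=W_down, E=E)
```

Note that only the up-projection is replaced: the gate and down projections stay dense, so the layer remains a drop-in substitute for a gated FFN.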
This design choice is motivated by the "key-value memory" view of FFNs, where the up-projection typically generates an address vector ($W_{\text{up}}\, x$) that, after the nonlinearity, modulates retrieval from the down-projection ($W_{\text{down}}$). The gate projection ($W_{\text{gate}}$) provides context-dependent modulation. STEM replaces the context-dependent up-projection with a token-dependent embedding $E_\ell[t]$. This choice is critical; ablation studies indicated that replacing the gate projection harms performance because it needs context-adaptivity, whereas the up-projection, acting as an "address generator," benefits from the fixed, token-specific nature of the STEM embeddings.
Key insights and benefits of STEM include:
- Improved Training Stability: Unlike many MoE models which often suffer from loss spikes and training instability due to non-uniform expert routing, STEM exhibits stable training behavior even with extreme sparsity, as it avoids dynamic routing.
- Better Information Storage Capacity: By replacing the up-projection with token-specific embeddings, STEM's embedding space exhibits a significantly larger angular spread (lower pairwise cosine similarity) compared to the address vectors generated by standard FFNs. This reduced redundancy enables more precise and disentangled knowledge attribution, effectively increasing the model's capacity for storing and retrieving information.
- Knowledge Specificity & Interpretability: Each STEM embedding in every layer is tied to a specific token ID, granting "micro-experts" clear, token-level semantics. This direct knowledge attribution allows for interpretability and controllability. The model's output distribution can be systematically steered by surgically modifying the STEM embeddings for a given token ID, even while the input text remains unchanged. This demonstrates that factual knowledge is localized within these embeddings, making it modular and editable.
- Efficiency:
- FLOPs Reduction: During computation-intensive phases (training and prefill), STEM significantly reduces FLOPs. For batch size $B$, sequence length $S$, model width $d$, and FFN hidden size $d_{\text{ff}}$, the per-layer FLOPs saving is $2BSd\,d_{\text{ff}}$, the cost of the eliminated up-projection matmul. The saving fraction is $1/3$ of the gated FFN's FLOPs, since one of its three projection matrices is removed.
- Parameter Loading Cost Reduction: During memory-intensive decoding, the per-layer memory access cost is reduced by the $d \cdot d_{\text{ff}}$ parameters of the eliminated $W_{\text{up}}$ matrix, with the same one-third saving fraction as FLOPs.
- VRAM and Communication Savings: STEM offloads its large embedding tables to CPU memory, fetching only the required token embeddings asynchronously to the GPU. This eliminates the need for expensive all-to-all communication typical in expert parallelism and frees up GPU VRAM (roughly one-third of FFN parameters).
- Context-length Adaptive Parameter Usage: Since STEM employs token-indexed sparsity, the number of distinct parameters activated in a forward pass scales with the number of unique tokens ($U$) in the context window. For $L$ STEM layers, the active STEM-specific parameters number $L \cdot U \cdot d_{\text{ff}}$. As $U$ typically grows sublinearly with sequence length (Heaps' law), longer contexts engage more parameters without increasing per-token FLOPs, leading to practical test-time capacity scaling and improved long-context performance.
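The FLOPs saving fraction and the context-adaptive active-parameter count can be checked with a short calculation; the dimensions below are toy assumptions, not values from the paper:

```python
# Toy dimensions (illustrative assumptions, not from the paper)
B, S, d, d_ff = 4, 1024, 4096, 11008   # batch, sequence, width, FFN hidden size
L, V = 32, 32000                        # STEM layers, vocabulary size

# A gated (SwiGLU) FFN multiplies by three d x d_ff matrices: gate, up, down.
# STEM removes the up-projection, saving one (B*S, d) @ (d, d_ff) matmul per layer.
flops_saved_per_layer = 2 * B * S * d * d_ff
flops_total_per_layer = 3 * (2 * B * S * d * d_ff)
saving_fraction = flops_saved_per_layer / flops_total_per_layer   # one third

# Active STEM-specific parameters scale with unique tokens U, not sequence length S.
U = 700                                 # unique tokens in context (sublinear in S)
active_stem_params = L * U * d_ff
```

The same one-third fraction applies to decode-time parameter loading, since the eliminated $W_{\text{up}}$ accounts for one of the three FFN matrices.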
Knowledge Editing Mechanism: STEM's token-indexed nature enables a unique knowledge editing capability. By replacing the STEM embeddings associated with a source entity's tokens with those of a target entity, the model's output can be steered. If the source ($n_s$) and target ($n_t$) tokenization lengths differ:
- If $n_t < n_s$: Strategies include left-padding the target token sequence with special tokens to match $n_s$, or copying/repeating target tokens to fill the positions.
- If $n_t > n_s$: Strategies involve selecting a representative subset of target tokens or averaging the embeddings across the entire target span.
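These length-mismatch strategies can be sketched as below; the choice of pad token and the mean-then-repeat handling of the longer case are illustrative assumptions (one of the listed options per case), not the paper's exact recipe:

```python
import numpy as np

def align_target_embeddings(E, src_ids, tgt_ids, pad_id=0):
    """Return n_s embedding rows to write over the source entity's span.

    E       : (V, d_ff) STEM embedding table for one layer
    src_ids : source entity token IDs (length n_s)
    tgt_ids : target entity token IDs (length n_t)
    pad_id  : special token used for left-padding (illustrative choice)
    """
    n_s, n_t = len(src_ids), len(tgt_ids)
    if n_t < n_s:
        # Target shorter: left-pad with a special token to length n_s
        padded = [pad_id] * (n_s - n_t) + list(tgt_ids)
        return E[padded]
    if n_t > n_s:
        # Target longer: average embeddings across the whole target span,
        # then repeat the mean to fill the n_s source positions
        mean = E[list(tgt_ids)].mean(axis=0)
        return np.tile(mean, (n_s, 1))
    return E[list(tgt_ids)]

rng = np.random.default_rng(1)
E = rng.standard_normal((32, 16))
rows_shorter = align_target_embeddings(E, src_ids=[3, 4, 5], tgt_ids=[7, 8])
rows_longer = align_target_embeddings(E, src_ids=[3, 4], tgt_ids=[7, 8, 9])
```

In either case the returned block has exactly $n_s$ rows, so the edit overwrites the source span without touching any other token's embeddings.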
Empirically, STEM demonstrates up to 3-4% accuracy improvements over dense baselines, especially on knowledge and reasoning-heavy benchmarks, while reducing per-token FLOPs and parameter accesses.