DRoPE: Directional Rotary Position Embedding for Efficient Agent Interaction Modeling

Mu Yang
2026.01.12
arXiv · by 이호민
#DRoPE #RoPE #Autonomous Driving #Trajectory Generation #Deep Learning

Key Points

  1. Current trajectory generation methods in autonomous driving face an "impossible triangle" of accuracy, computational time, and memory efficiency: existing relative position encoding (RPE) is memory-intensive, while RoPE cannot naturally represent periodic angular information.
  2. To address this, DRoPE (Directional Rotary Position Embedding) is proposed as an adaptation of RoPE that introduces a uniform identity scalar into the 2D rotary transformation, effectively encoding relative angular information aligned with agent headings.
  3. Combined with RoPE, DRoPE efficiently models both relative positions and headings with significantly reduced $O(N)$ space complexity while maintaining high performance, as validated theoretically and empirically against state-of-the-art models.

The paper introduces Directional Rotary Position Embedding (DRoPE) to address the "impossible triangle" problem in autonomous driving trajectory generation, which involves balancing accuracy, time complexity, and memory efficiency. Current methods, including scene-centric, agent-centric, and query-centric frameworks, each have significant limitations.

Scene-centric methods, while computationally efficient, use absolute positions, leading to suboptimal accuracy for distant agents. Agent-centric methods improve accuracy by normalizing coordinates around each agent, but incur high time complexity, scaling linearly with the number of agents $N$ ($O(N)$ inference steps). Query-centric methods, often using Relative Position Embeddings (RPE), allow simultaneous inference for all agents but suffer from high space complexity, typically $O(N^2)$ for storing relative positions between all agent pairs.

The paper identifies Rotary Position Embedding (RoPE), originating from natural language processing, as a promising alternative due to its efficient $O(N)$ space complexity. RoPE encodes relative positions implicitly by embedding global positions into query-key (QK) vectors using rotary transformations, thus avoiding RPE's explicit $O(N^2)$ storage. However, the paper points out that RoPE is inherently limited in handling periodic angular information, such as agent headings, because the varying scalar parameters $\{\theta_l\}_{l=0}^{d_k-1}$ across dimensions in its 2D rotary transformation break the periodic properties crucial for angles. This means that even when $\theta_i - \theta_j \pmod{2\pi}$ is the same for two pairs of angles, RoPE's output can differ, making it unsuitable for robust angular encoding.
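This failure mode is easy to check numerically. The sketch below (illustrative NumPy code, not the paper's implementation) applies standard RoPE to the "same" heading written as $\theta$ and $\theta + 2\pi$; because the frequencies $\theta_l = 10000^{-l/d_k}$ are not integers, the per-pair rotation angles $m\theta_l$ shift by $2\pi\theta_l$, which is not a multiple of $2\pi$:

```python
import numpy as np

def rope(x, m):
    # Standard RoPE: pair l of x is rotated by m * theta_l, with
    # frequencies theta_l = 10000^{-l/d_k} varying across pairs.
    d_k = x.shape[0] // 2
    out = np.empty_like(x)
    for l in range(d_k):
        theta_l = 10000.0 ** (-l / d_k)
        c, s = np.cos(m * theta_l), np.sin(m * theta_l)
        out[2 * l] = c * x[2 * l] - s * x[2 * l + 1]
        out[2 * l + 1] = s * x[2 * l] + c * x[2 * l + 1]
    return out

x = np.random.default_rng(0).standard_normal(8)

# 0.5 and 0.5 + 2*pi denote the same physical heading, yet RoPE embeds
# them differently (only the l = 0 pair, with theta_0 = 1, agrees).
print(np.allclose(rope(x, 0.5), rope(x, 0.5 + 2 * np.pi)))  # False
```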

DRoPE is proposed as a novel adaptation of RoPE specifically designed to handle periodic angular information. The core idea of DRoPE is to unify the scalar value used in the 2D rotary transformation for angular embedding. Instead of using RoPE's $\{\theta_l\}_{l=0}^{d_k-1}$ (defined as $\theta_l = 10000^{-l/d_k}$), DRoPE uses a uniform identity scalar for all dimensions, effectively creating a simplified global angle embedding function $f_{\angle}(X, \theta) = \mathrm{BlockDiag}(R(\theta), \dots, R(\theta))\,X$. Here, $R(\theta)$ is the standard 2D rotation matrix:
$$R(\theta) = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}$$
This modification re-establishes the periodic nature of rotary transformations with respect to relative angular differences. The paper theoretically proves that for QK vectors $Q_i, K_j$ and global heading angles $\theta_i, \theta_j$, if $\bar{Q}_i = f_{\angle}(Q_i, \theta_i)$ and $\bar{K}_j = f_{\angle}(K_j, \theta_j)$, then the dot product $\langle \bar{Q}_i, \bar{K}_j \rangle$ depends solely on $Q_i, K_j$ and the periodic relative angle $\theta_i - \theta_j \pmod{2\pi}$. This follows from the rotation-matrix identities $R(\alpha)^T = R(-\alpha)$ and $R(\alpha)R(\beta) = R(\alpha+\beta)$:
$$\langle \bar{Q}_i, \bar{K}_j \rangle = Q_i^T\, \mathrm{BlockDiag}(\{R(\theta_j - \theta_i)\})\, K_j = Q_i^T\, \mathrm{BlockDiag}(\{R\big((\theta_j - \theta_i) \bmod 2\pi\big)\})\, K_j$$
By integrating DRoPE with RoPE, the model can simultaneously embed both relative spatial positions and relative headings of agents without significantly increasing computational or space complexity, maintaining $O(N)$ space complexity similar to RoPE while providing performance competitive with $O(N^2)$ RPE-based methods.
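The relative-angle property above can be verified with a small NumPy sketch (illustrative, not the authors' code): after applying the uniform rotation $f_{\angle}$, the QK dot product is unchanged when both headings are shifted by a common offset or by multiples of $2\pi$:

```python
import numpy as np

def drope(x, theta):
    # DRoPE: rotate EVERY 2D pair of x by the same angle theta,
    # i.e. BlockDiag(R(theta), ..., R(theta)) @ x.
    c, s = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[0::2] = c * x[0::2] - s * x[1::2]
    out[1::2] = s * x[0::2] + c * x[1::2]
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

s1 = drope(q, 0.3) @ drope(k, 1.1)                  # relative angle 0.8
s2 = drope(q, 0.3 + 2 * np.pi) @ drope(k, 1.1)      # same relative angle mod 2*pi
s3 = drope(q, 0.3 + 0.7) @ drope(k, 1.1 + 0.7)      # common shift: relative angle unchanged
print(np.allclose(s1, s2), np.allclose(s1, s3))  # True True
```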

The paper proposes two practical integration methods for DRoPE and RoPE within a multi-head attention module:

  1. Head-by-head integration: this approach dedicates different attention heads to positional and angular information. For a given head $h$:
    • If $h$ is an even-indexed head, it uses RoPE for positional embedding:
$$\alpha_{ij}^h = \mathrm{softmax}\!\left( \frac{\langle \hat{Q}_{A,i}^h, \hat{K}_{A,j}^h \rangle}{\sqrt{d_k}} \right),$$
where $\hat{Q}_{A,i}^h = f_{\to}(Q_{A,i}^h, \mathrm{pos}_{A,i})$ and $\hat{K}_{A,j}^h = f_{\to}(K_{A,j}^h, \mathrm{pos}_{A,j})$, with $f_{\to}(X, m) = \mathrm{BlockDiag}(R(m\theta_0), \dots, R(m\theta_{d_k-1}))\,X$ and $\theta_l = 10000^{-l/d_k}$.
    • If $h$ is an odd-indexed head, it uses DRoPE for angular embedding:
$$\alpha_{ij}^h = \mathrm{softmax}\!\left( \frac{\langle \bar{Q}_{A,i}^h, \bar{K}_{A,j}^h \rangle}{\sqrt{d_k}} \right),$$
where $\bar{Q}_{A,i}^h = f_{\angle}(Q_{A,i}^h, \theta_{A,i})$ and $\bar{K}_{A,j}^h = f_{\angle}(K_{A,j}^h, \theta_{A,j})$.
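A minimal sketch of head-by-head integration, assuming scalar 1D positions for simplicity (the paper works with 2D positions, and the shapes here are illustrative): even heads rotate QK pairs with RoPE frequencies over positions, while odd heads apply one shared heading angle to every pair:

```python
import numpy as np

def rotate_pairs(x, angles):
    # Rotate each consecutive 2D pair of x's last axis by the matching angle.
    c, s = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = c * x[..., 0::2] - s * x[..., 1::2]
    out[..., 1::2] = s * x[..., 0::2] + c * x[..., 1::2]
    return out

def head_by_head_scores(Q, K, pos, heading):
    # Q, K: (H, N, 2*d_k). Even-indexed heads apply RoPE over scalar
    # positions pos (N,); odd-indexed heads apply DRoPE over headings (N,).
    H, N, d2 = Q.shape
    d_k = d2 // 2
    freqs = 10000.0 ** (-np.arange(d_k) / d_k)  # RoPE frequencies theta_l
    scores = np.empty((H, N, N))
    for h in range(H):
        if h % 2 == 0:
            ang = pos[:, None] * freqs[None, :]             # per-pair angle m * theta_l
        else:
            ang = np.broadcast_to(heading[:, None], (N, d_k))  # uniform angle per token
        Qh, Kh = rotate_pairs(Q[h], ang), rotate_pairs(K[h], ang)
        scores[h] = Qh @ Kh.T / np.sqrt(d_k)                # softmax would follow
    return scores

rng = np.random.default_rng(0)
Q, K = rng.standard_normal((2, 4, 8)), rng.standard_normal((2, 4, 8))
pos, heading = rng.standard_normal(4), rng.standard_normal(4)
s = head_by_head_scores(Q, K, pos, heading)
# Scores depend only on relative quantities: a global position shift leaves
# RoPE heads unchanged, a 2*pi heading shift leaves DRoPE heads unchanged.
s_shift = head_by_head_scores(Q, K, pos + 3.0, heading + 2 * np.pi)
print(np.allclose(s, s_shift))  # True
```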

  2. Intra-head integration: in this method, each QK vector is decomposed into two sub-vectors within a single head, one for positional information and one for angular information.
    • $Q_{A,i}^h = [Q_{A,i}^{h,\mathrm{pos}}, Q_{A,i}^{h,\mathrm{angle}}]$ and $K_{A,j}^h = [K_{A,j}^{h,\mathrm{pos}}, K_{A,j}^{h,\mathrm{angle}}]$, where $Q_{A,i}^{h,\mathrm{pos}}, K_{A,j}^{h,\mathrm{pos}} \in \mathbb{R}^{d_{\mathrm{pos}}}$ and $Q_{A,i}^{h,\mathrm{angle}}, K_{A,j}^{h,\mathrm{angle}} \in \mathbb{R}^{d_{\mathrm{angle}}}$, satisfying $d_{\mathrm{pos}} + d_{\mathrm{angle}} = 2d_k$.
    • The attention score is computed by summing the dot products of the transformed sub-vectors:
$$\alpha_{ij}^h = \mathrm{softmax}\!\left( \frac{\langle \hat{Q}_{A,i}^{h,\mathrm{pos}}, \hat{K}_{A,j}^{h,\mathrm{pos}} \rangle + \langle \bar{Q}_{A,i}^{h,\mathrm{angle}}, \bar{K}_{A,j}^{h,\mathrm{angle}} \rangle}{\sqrt{d_k}} \right),$$
where $\hat{Q}_{A,i}^{h,\mathrm{pos}} = f_{\to}(Q_{A,i}^{h,\mathrm{pos}}, \mathrm{pos}_{A,i})$, $\hat{K}_{A,j}^{h,\mathrm{pos}} = f_{\to}(K_{A,j}^{h,\mathrm{pos}}, \mathrm{pos}_{A,j})$, $\bar{Q}_{A,i}^{h,\mathrm{angle}} = f_{\angle}(Q_{A,i}^{h,\mathrm{angle}}, \theta_{A,i})$, and $\bar{K}_{A,j}^{h,\mathrm{angle}} = f_{\angle}(K_{A,j}^{h,\mathrm{angle}}, \theta_{A,j})$.
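Intra-head integration can be sketched the same way (again a 1D-position simplification with illustrative shapes, not the paper's code): the first $d_{\mathrm{pos}}$ dimensions of each QK vector get RoPE over positions, the remaining dimensions get DRoPE over headings, and the two sub-scores are summed before softmax:

```python
import numpy as np

def rotate_pairs(x, angles):
    # Rotate each consecutive 2D pair of x's last axis by the matching angle.
    c, s = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = c * x[..., 0::2] - s * x[..., 1::2]
    out[..., 1::2] = s * x[..., 0::2] + c * x[..., 1::2]
    return out

def intra_head_scores(Q, K, pos, heading, d_pos):
    # Q, K: (N, 2*d_k). The first d_pos dims form the positional sub-vector
    # (RoPE over scalar pos), the rest the angular sub-vector (DRoPE over
    # headings); the two dot products are summed before scaling.
    N, d2 = Q.shape
    d_k = d2 // 2
    n_pos, n_ang = d_pos // 2, (d2 - d_pos) // 2
    freqs = 10000.0 ** (-np.arange(n_pos) / n_pos)
    pos_ang = pos[:, None] * freqs[None, :]
    head_ang = np.broadcast_to(heading[:, None], (N, n_ang))
    Qp, Kp = rotate_pairs(Q[:, :d_pos], pos_ang), rotate_pairs(K[:, :d_pos], pos_ang)
    Qa, Ka = rotate_pairs(Q[:, d_pos:], head_ang), rotate_pairs(K[:, d_pos:], head_ang)
    return (Qp @ Kp.T + Qa @ Ka.T) / np.sqrt(d_k)

rng = np.random.default_rng(1)
Q, K = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
pos, heading = rng.standard_normal(4), rng.standard_normal(4)
s = intra_head_scores(Q, K, pos, heading, d_pos=4)
# Only relative position and relative heading (mod 2*pi) affect the scores.
s_shift = intra_head_scores(Q, K, pos - 1.5, heading + 2 * np.pi, d_pos=4)
print(np.allclose(s, s_shift))  # True
```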

The overall model architecture follows a transformer-like design. An agent encoder processes each agent's states (position, yaw, velocity, static attributes) over time, and a map encoder processes map segment tokens. The crucial interaction modeling happens in a multi-head attention module, which can use either DRoPE-RoPE integration method to capture relative positions and headings between agents, and between agents and map elements. Finally, a decoder predicts the future control actions (acceleration and yaw rate) for the target agent from the refined agent tokens, and the model rolls these controls out through a kinematic model to obtain trajectories, easing the "impossible triangle" trade-offs.
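The control-to-trajectory rollout can be illustrated with a simple unicycle kinematic model; this is an assumption for illustration, as the paper does not specify its exact kinematic equations here:

```python
import numpy as np

def rollout(x, y, yaw, v, accel, yaw_rate, dt=0.1):
    # Integrate predicted controls (accel, yaw_rate per step) through a
    # unicycle model: an illustrative assumption, not the paper's exact model.
    traj = []
    for a, w in zip(accel, yaw_rate):
        v = v + a * dt          # speed from acceleration
        yaw = yaw + w * dt      # heading from yaw rate
        x = x + v * np.cos(yaw) * dt
        y = y + v * np.sin(yaw) * dt
        traj.append((x, y, yaw, v))
    return np.array(traj)

# Constant acceleration with zero yaw rate: a straight, speeding-up path.
traj = rollout(0.0, 0.0, 0.0, 5.0, accel=[1.0] * 10, yaw_rate=[0.0] * 10)
print(traj[-1])  # final (x, y, yaw, v); v has grown from 5.0 to 6.0, y stays 0
```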