
DRoPE: Directional Rotary Position Embedding for Efficient Agent Interaction Modeling
Key Points
- Current trajectory generation methods in autonomous driving face an "impossible triangle" of accuracy, time complexity, and memory efficiency: existing relative position encoding (RPE) is memory-intensive, while RoPE cannot naturally represent periodic angular information.
- To address this, DRoPE (Directional Rotary Position Embedding) is proposed as an adaptation of RoPE that introduces a uniform identity scalar into the 2D rotary transformation, enabling it to encode relative angular information aligned with agent headings.
- Combined with RoPE, DRoPE efficiently models both relative positions and headings with significantly reduced O(N) space complexity while maintaining high performance, as validated theoretically and empirically against state-of-the-art models.
The paper introduces Directional Rotary Position Embedding (DRoPE) to address the "impossible triangle" problem in autonomous driving trajectory generation, which involves balancing accuracy, time complexity, and memory efficiency. Current methods, including scene-centric, agent-centric, and query-centric frameworks, each have significant limitations.
Scene-centric methods, while computationally efficient, use absolute positions, leading to suboptimal accuracy for distant agents. Agent-centric methods improve accuracy by normalizing coordinates around each agent, but incur high time complexity, requiring O(N) separate inference steps for N agents. Query-centric methods, often using Relative Position Embeddings (RPE), allow simultaneous inference for all agents but suffer from high space complexity, typically O(N²) for storing relative positions between all agent pairs.
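The space-complexity gap between the two encoding strategies can be made concrete with a back-of-the-envelope count. This is an illustrative sketch (the agent count and embedding dimension below are assumed, not from the paper):

```python
# Query-centric RPE stores an embedding for every ordered agent pair: O(N^2).
# RoPE/DRoPE instead embed each token's global pose into its own QK vectors: O(N).

def rpe_pairwise_embeddings(num_agents: int, dim: int) -> int:
    """Floats needed to explicitly store relative position embeddings
    for all ordered agent pairs."""
    return num_agents * num_agents * dim

def rope_token_embeddings(num_agents: int, dim: int) -> int:
    """Floats needed when relative information is encoded implicitly
    via rotary transforms of per-token QK vectors."""
    return num_agents * dim

N, D = 128, 64  # e.g. 128 agents, 64-dim embeddings (assumed values)
print(rpe_pairwise_embeddings(N, D))  # 1048576 floats, grows quadratically
print(rope_token_embeddings(N, D))    # 8192 floats, grows linearly
```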
The paper identifies Rotary Position Embedding (RoPE), originating from natural language processing, as a promising alternative due to its efficient space complexity. RoPE encodes relative positions implicitly by embedding global positions into query-key (QK) vectors using rotary transformations, thus avoiding the explicit storage RPE requires. However, the paper points out that RoPE is inherently limited in handling periodic angular information, such as agent headings, because the varying scalar parameters across dimensions of its 2D rotary transformation break the periodic properties crucial for angles. This means that even when the relative angle is the same for two pairs of headings (e.g., headings that differ by a multiple of 2π), RoPE's output may differ, making it unsuitable for robust angular encoding.
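The failure mode is easy to demonstrate numerically. The toy sketch below (a simplified RoPE with the standard per-dimension scalars, not the paper's code) shows that two angles representing the same heading, differing by 2π, produce different RoPE embeddings in every dimension whose scalar is not 1:

```python
# RoPE rotates each 2D chunk i of a vector by theta_i * pos, with
# theta_i = 10000**(-2i/d). Rotations scaled by theta_i != 1 are not
# 2*pi-periodic in pos, so headings equal mod 2*pi embed differently.
import math

def rotate2d(x, y, angle):
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def rope_embed(vec, pos, d):
    """Apply simplified RoPE: chunk i rotated by theta_i * pos."""
    out = []
    for i in range(d // 2):
        theta_i = 10000 ** (-2 * i / d)
        out.extend(rotate2d(vec[2 * i], vec[2 * i + 1], theta_i * pos))
    return out

v = [1.0, 0.0, 1.0, 0.0]
a = rope_embed(v, 0.5, d=4)
b = rope_embed(v, 0.5 + 2 * math.pi, d=4)  # same heading, shifted by 2*pi
# Chunk 0 (theta_0 = 1) matches, but the chunk with theta_1 != 1 does not:
print(all(abs(x - y) < 1e-9 for x, y in zip(a, b)))  # False
```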
DRoPE is proposed as a novel adaptation of RoPE specifically designed to handle periodic angular information. The core idea of DRoPE is to unify the scalar value used in the 2D rotary transformation for angular embedding. Instead of RoPE's dimension-dependent scalars $\theta_i$ (defined as $\theta_i = 10000^{-2i/d}$), DRoPE uses a uniform identity scalar ($\theta_i = 1$) for all dimensions, effectively creating a simplified global angle embedding function $f(\mathbf{x}, \theta) = \mathbf{R}(\theta)\,\mathbf{x}$. Here, $\mathbf{R}(\theta)$ is the standard 2D rotary transformation matrix:

$$\mathbf{R}(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$
This modification re-establishes the periodic nature of rotary transformations with respect to relative angular differences. The paper theoretically proves that for QK vectors $\mathbf{q}, \mathbf{k}$ and global heading angles $\theta_q, \theta_k$, if $\mathbf{q}' = \mathbf{R}(\theta_q)\,\mathbf{q}$ and $\mathbf{k}' = \mathbf{R}(\theta_k)\,\mathbf{k}$, then their dot product depends solely on $\mathbf{q}$, $\mathbf{k}$, and the periodic relative angle $\theta_k - \theta_q$. This is derived from the rotary matrix properties $\mathbf{R}(\theta)^\top = \mathbf{R}(-\theta)$ and $\mathbf{R}(\alpha)\,\mathbf{R}(\beta) = \mathbf{R}(\alpha + \beta)$, leading to:

$$\mathbf{q}'^\top \mathbf{k}' = \mathbf{q}^\top \mathbf{R}(\theta_q)^\top \mathbf{R}(\theta_k)\,\mathbf{k} = \mathbf{q}^\top \mathbf{R}(\theta_k - \theta_q)\,\mathbf{k}$$
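This relative-angle property can be checked numerically. The sketch below (a toy implementation, not the paper's code) verifies that the dot product of DRoPE-rotated vectors is unchanged when both global headings are shifted by the same amount, and that it is 2π-periodic in the relative angle:

```python
# With a uniform identity scalar, the dot product of rotated QK vectors
# depends only on the relative angle theta_k - theta_q (mod 2*pi).
import math

def drope_embed(vec, angle):
    """Rotate every 2D chunk of vec by the same angle (identity scalar)."""
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for i in range(0, len(vec), 2):
        x, y = vec[i], vec[i + 1]
        out.extend((c * x - s * y, s * x + c * y))
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

s1 = dot(drope_embed(q, 0.4), drope_embed(k, 1.5))              # relative angle 1.1
s2 = dot(drope_embed(q, 0.4 + 2.0), drope_embed(k, 1.5 + 2.0))  # same relative angle
s3 = dot(drope_embed(q, 0.0), drope_embed(k, 1.1 + 2 * math.pi))  # 1.1 mod 2*pi
print(abs(s1 - s2) < 1e-9 and abs(s1 - s3) < 1e-9)  # True
```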
By integrating DRoPE with RoPE, the model can simultaneously embed both relative spatial positions and relative headings of agents without significantly increasing computational or space complexity, maintaining space complexity similar to RoPE while providing competitive performance to RPE-based methods.
The paper proposes two practical integration methods for DRoPE and RoPE within a multi-head attention module:
- Head-by-head integration: This approach dedicates different attention heads to handling positional and angular information. For a given head $h$:
- If $h$ is an even-indexed head, it uses RoPE for positional embedding: $\mathbf{q}^{(h)\prime} = f_{\text{RoPE}}(\mathbf{q}^{(h)}, \mathbf{p}_q)$ and $\mathbf{k}^{(h)\prime} = f_{\text{RoPE}}(\mathbf{k}^{(h)}, \mathbf{p}_k)$.
- If $h$ is an odd-indexed head, it uses DRoPE for angular embedding: $\mathbf{q}^{(h)\prime} = \mathbf{R}(\theta_q)\,\mathbf{q}^{(h)}$ and $\mathbf{k}^{(h)\prime} = \mathbf{R}(\theta_k)\,\mathbf{k}^{(h)}$.
- Intra-head integration: In this method, each QK vector is decomposed into two sub-vectors within a single head: one for positional information and one for angular information.
- $\mathbf{q} = [\mathbf{q}_{\text{pos}}; \mathbf{q}_{\text{ang}}]$ and $\mathbf{k} = [\mathbf{k}_{\text{pos}}; \mathbf{k}_{\text{ang}}]$, where $\mathbf{q}_{\text{pos}}, \mathbf{k}_{\text{pos}} \in \mathbb{R}^{d_1}$ and $\mathbf{q}_{\text{ang}}, \mathbf{k}_{\text{ang}} \in \mathbb{R}^{d_2}$, satisfying $d_1 + d_2 = d$.
- The attention score is computed by summing the dot products of the transformed sub-vectors:

$$\text{score} = \mathbf{q}_{\text{pos}}^{\prime\top}\mathbf{k}_{\text{pos}}^{\prime} + \mathbf{q}_{\text{ang}}^{\prime\top}\mathbf{k}_{\text{ang}}^{\prime}$$

Here, $\mathbf{q}_{\text{pos}}' = f_{\text{RoPE}}(\mathbf{q}_{\text{pos}}, \mathbf{p}_q)$, $\mathbf{k}_{\text{pos}}' = f_{\text{RoPE}}(\mathbf{k}_{\text{pos}}, \mathbf{p}_k)$, and $\mathbf{q}_{\text{ang}}' = \mathbf{R}(\theta_q)\,\mathbf{q}_{\text{ang}}$, $\mathbf{k}_{\text{ang}}' = \mathbf{R}(\theta_k)\,\mathbf{k}_{\text{ang}}$.
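The two integration strategies above can be sketched as follows. This is a simplified illustration under assumed conventions (scalar positions, small vectors, no softmax or scaling), not the paper's implementation:

```python
# Sketch of head-by-head vs intra-head DRoPE-RoPE integration.
import math

def rot2(x, y, a):
    c, s = math.cos(a), math.sin(a)
    return (c * x - s * y, s * x + c * y)

def rope(vec, pos, base=10000.0):
    """Simplified RoPE: chunk i rotated by pos * base**(-2i/d)."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        out.extend(rot2(vec[2 * i], vec[2 * i + 1], pos * base ** (-2 * i / d)))
    return out

def drope(vec, angle):
    """DRoPE: every 2D chunk rotated by the same heading angle."""
    out = []
    for i in range(0, len(vec), 2):
        out.extend(rot2(vec[i], vec[i + 1], angle))
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def head_by_head_score(h, q, k, pos_q, pos_k, th_q, th_k):
    """Even heads encode position with RoPE; odd heads encode heading with DRoPE."""
    if h % 2 == 0:
        return dot(rope(q, pos_q), rope(k, pos_k))
    return dot(drope(q, th_q), drope(k, th_k))

def intra_head_score(q, k, d_pos, pos_q, pos_k, th_q, th_k):
    """Split each QK vector into positional and angular sub-vectors, sum the scores."""
    qp, qa = q[:d_pos], q[d_pos:]
    kp, ka = k[:d_pos], k[d_pos:]
    return dot(rope(qp, pos_q), rope(kp, pos_k)) + dot(drope(qa, th_q), drope(ka, th_k))

q = [0.2, 0.5, 1.0, -0.3]
k = [0.4, -0.1, 0.8, 0.6]
```

In both variants, the heading-dependent part of the score is invariant to a global rotation of all headings, since only the relative angle survives the dot product.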
The overall model architecture follows a transformer-like design. An agent encoder processes agent states (position, yaw, velocity, static attributes) over time, while a map encoder processes map segment tokens. The crucial interaction modeling happens in a multi-head attention module, which can use either DRoPE-RoPE integration method to capture relative positions and headings between agents, and between agents and map elements. Finally, a decoder predicts the future control actions (acceleration and yaw rate) for the target agent based on the refined agent tokens, and trajectories are rolled out through a kinematic model, resolving the "impossible triangle" trade-offs.
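Rolling predicted control actions out into a trajectory can be sketched with a simple unicycle-style kinematic model. The paper only states that the decoder outputs acceleration and yaw rate; the specific integration scheme and time step below are assumptions:

```python
# Minimal kinematic rollout: integrate (acceleration, yaw_rate) actions
# into (x, y, yaw) waypoints with forward-Euler steps.
import math

def rollout(x, y, yaw, speed, actions, dt=0.1):
    """Integrate a sequence of (acceleration, yaw_rate) controls into a trajectory."""
    traj = [(x, y, yaw)]
    for accel, yaw_rate in actions:
        speed += accel * dt              # update speed from acceleration
        yaw += yaw_rate * dt             # update heading from yaw rate
        x += speed * math.cos(yaw) * dt  # advance position along heading
        y += speed * math.sin(yaw) * dt
        traj.append((x, y, yaw))
    return traj

# Three steps of constant acceleration, no turning:
traj = rollout(0.0, 0.0, 0.0, 5.0, [(1.0, 0.0)] * 3)
```

Predicting controls rather than raw positions keeps the output trajectory kinematically feasible by construction.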