
Grounding World Simulation Models in a Real-World Metropolis
Key Points
- Seoul World Model (SWM) is a novel world simulation model that grounds autoregressive video generation in real-world cityscapes by using retrieval-augmented conditioning on a vast street-view image database.
- To enable faithful and long-horizon generation, SWM introduces innovations such as cross-temporal pairing, a diverse synthetic dataset for varied trajectories, view interpolation, and a Virtual Lookahead Sink that continuously re-grounds the model.
- This allows SWM to generate spatially faithful and temporally consistent videos spanning kilometers with free-form navigation and text-prompted scenario control, outperforming existing methods for real-world urban environments.
The Seoul World Model (SWM) is a pioneering city-scale generative world model that renders real-world urban environments by grounding autoregressive video generation in a metropolis, specifically Seoul. Unlike prior world models that synthesize imagined environments, SWM focuses on producing visually faithful and temporally consistent videos of actual cityscapes over multi-kilometer trajectories, supporting diverse camera movements and text-prompted scenario variations.
The core methodology of SWM is built upon Retrieval-Augmented Generation (RAG) conditioned on a geo-indexed street-view database. For each video chunk to be generated, SWM retrieves nearby street-view images based on geographic coordinates, camera actions, and text prompts. These retrieved images serve as complementary references, guiding the generation process through two pathways:
- Geometric Referencing: The nearest reference image is warped into the target viewpoint using depth-based splatting to provide explicit spatial layout cues. This anchors the generated content to the real-world geometry.
- Semantic Referencing: The original reference images are injected into the transformer's latent sequence, allowing the model to attend to appearance details across all retrieved references.
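The retrieval step that feeds both pathways can be illustrated with a minimal nearest-neighbor lookup over a geo-indexed database. The entry format, the 50 m radius, and the `retrieve_references` helper below are hypothetical assumptions for illustration, standing in for however the actual system indexes its street-view collection:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two WGS84 points."""
    r = 6_371_000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def retrieve_references(db, query_lat, query_lon, k=3, radius_m=50.0):
    """Return up to k street-view entries within radius_m of the query pose,
    nearest first. A production system would use a spatial index, not a scan."""
    scored = [(haversine_m(query_lat, query_lon, e["lat"], e["lon"]), e) for e in db]
    scored = [s for s in scored if s[0] <= radius_m]
    scored.sort(key=lambda s: s[0])
    return [e for _, e in scored[:k]]

# Toy geo-indexed database: three street-view entries near central Seoul.
db = [
    {"id": "sv_001", "lat": 37.4979, "lon": 127.0276},
    {"id": "sv_002", "lat": 37.4981, "lon": 127.0279},
    {"id": "sv_003", "lat": 37.5100, "lon": 127.0400},  # far outside the radius
]
refs = retrieve_references(db, 37.4980, 127.0277)
print([e["id"] for e in refs])  # nearest-first; the distant entry is filtered out
```

In this sketch the nearest entry would feed the geometric pathway (depth-based warping into the target view), while the full top-k list would feed the semantic pathway as appearance references.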
SWM addresses several key challenges inherent in grounding a world model in real-world data:
- Temporal Misalignment between References and Dynamic Scene: Real-world street-view images capture specific moments, including transient elements like vehicles. If used directly as references, these dynamic objects could "leak" into the generated video, even if the target scene should not contain them. SWM tackles this with Cross-Temporal Pairing. During training, the reference street-view images are deliberately chosen from a different capture time than the target video sequence. This forces the model to learn to rely on persistent spatial structures (e.g., buildings, roads) and ignore transient objects (e.g., cars, pedestrians) in the references, ensuring that generation focuses on the scene's static geometry.
- Limited Trajectory Diversity and Data Sparsity: Real-world data, especially from vehicle-mounted captures, often provides only forward-driving trajectories and is sparse. To overcome this, SWM leverages a substantial Synthetic Dataset rendered from an Unreal Engine-based CARLA urban simulator. This dataset, covering 431,500 m², includes 10,000 synthetic videos with diverse camera trajectories: pedestrian (sidewalks, crossings), vehicle (highways, urban roads), and free-camera (arbitrary collision-free paths). This synthetic data, combined with 1.2 million real panoramic street-view images from Seoul, enables SWM to generalize beyond simple forward driving and support free-form navigation.
- Synthesis of Coherent Training Videos from Sparse Images: Real street-view keyframes are typically sparse (5-20m apart), making it challenging to create smooth training videos. SWM employs a View Interpolation Pipeline that synthesizes smooth sequences from these sparse keyframes, using an intermittent freeze-frame strategy matched to the 3D VAE's temporal stride to keep the training data temporally consistent.
- Error Accumulation in Long-Horizon Generation: Autoregressive generation inherently accumulates errors over long sequences, leading to quality degradation. Prior methods often use a static attention sink anchored to the initial frame, whose guidance weakens as the camera moves further away. SWM introduces a novel mechanism called the Virtual Lookahead Sink (VL Sink) to combat this. The VL Sink dynamically retrieves the nearest street-view image at a future location, which serves as a "virtual future destination." This retrieved image acts as a clean, error-free anchor ahead of the current generation chunk, continuously re-grounding the generation process. This dynamic re-grounding significantly stabilizes video quality over trajectories spanning hundreds of meters, preventing the typical degradation seen in long-horizon generative models.
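The VL Sink's dynamic re-grounding can be sketched as a lookahead query against the geo-indexed database: walk a fixed distance ahead along the planned trajectory, then retrieve the street-view capture nearest to that future point. The 2D trajectory representation, the 25 m lookahead, and the entry format below are illustrative assumptions, not the paper's implementation:

```python
import math

def lookahead_point(trajectory, current_idx, lookahead_m):
    """Walk forward along a polyline trajectory (2D points in meters)
    by lookahead_m meters and return the resulting point."""
    remaining = lookahead_m
    x, y = trajectory[current_idx]
    for nx, ny in trajectory[current_idx + 1:]:
        seg = math.hypot(nx - x, ny - y)
        if seg >= remaining:
            t = remaining / seg
            return (x + t * (nx - x), y + t * (ny - y))
        remaining -= seg
        x, y = nx, ny
    return (x, y)  # trajectory ends before the lookahead distance

def virtual_lookahead_sink(db, trajectory, current_idx, lookahead_m=25.0):
    """Retrieve the street-view entry nearest to the point lookahead_m ahead,
    to serve as a clean anchor for the next generation chunk."""
    px, py = lookahead_point(trajectory, current_idx, lookahead_m)
    return min(db, key=lambda e: math.hypot(e["x"] - px, e["y"] - py))

# Toy setup: a straight 100 m trajectory with street-view captures every 20 m.
traj = [(float(i), 0.0) for i in range(0, 101, 10)]
db = [{"id": f"sv_{x:03d}", "x": float(x), "y": 0.0} for x in range(0, 101, 20)]
sink = virtual_lookahead_sink(db, traj, current_idx=2, lookahead_m=25.0)
print(sink["id"])  # the capture nearest to the point 25 m ahead of (20, 0)
```

Because the anchor is always retrieved from the database rather than generated, it never carries accumulated autoregressive error, which is what lets it stabilize quality over long trajectories.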
The SWM architecture autoregressively generates video chunks conditioned on a text prompt (for scenario control), a camera trajectory, and the retrieved street-view images. The ability for text-prompted scenario control allows users to reshape familiar city scenes by injecting elements like a "massive wave" or "Godzilla" into the generated videos.
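The chunk-by-chunk generation loop described above might be organized as follows. `retrieve_fn` and `generate_chunk_fn` are hypothetical stand-ins for the retrieval module and the autoregressive video transformer; the chunk length and context handling are assumptions for illustration:

```python
def generate_video(trajectory, text_prompt, retrieve_fn, generate_chunk_fn,
                   chunk_len=4, context=None):
    """Autoregressively generate a video chunk by chunk along a trajectory.

    retrieve_fn(pose) -> reference images near that pose (hypothetical);
    generate_chunk_fn(context, refs, poses, prompt) -> (frames, new_context).
    Each chunk is conditioned on the text prompt, its trajectory segment,
    freshly retrieved references, and context carried over from prior chunks.
    """
    frames = []
    for start in range(0, len(trajectory), chunk_len):
        poses = trajectory[start:start + chunk_len]
        refs = retrieve_fn(poses[0])  # re-ground each chunk via retrieval
        chunk, context = generate_chunk_fn(context, refs, poses, text_prompt)
        frames.extend(chunk)
    return frames

# Toy stubs: poses are integers, each "frame" is a labeled string.
traj = list(range(10))
retrieve = lambda pose: [f"ref@{pose}"]
def gen(ctx, refs, poses, prompt):
    return [f"frame@{p}" for p in poses], poses[-1]
out = generate_video(traj, "sunny afternoon", retrieve, gen)
print(len(out))  # one frame per trajectory pose
```

The key design point the sketch captures is that retrieval happens per chunk rather than once up front, so the conditioning stays anchored to the local geography as the camera moves.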
SWM is trained on a combined dataset of 1.2 million real panoramic street-view images captured across Seoul and 10,000 synthetic videos from CARLA. Evaluation against recent video world models across various cities (Seoul, Busan, Ann Arbor) demonstrates that SWM outperforms existing methods in generating spatially faithful, temporally consistent, and long-horizon videos grounded in actual urban environments, while supporting diverse camera movements and text-prompted variations. This work represents an industry-academic collaboration between NAVER and KAIST, utilizing NAVER Map data.