apple/starflow · Hugging Face

2025.11.23 · Hugging Face · by Anonymous
#LLM · #Image Generation · #Video Generation · #Normalizing Flows · #Transformer

Key Points

  1. STARFlow is a novel transformer autoregressive flow architecture that merges the expressiveness of autoregressive models with the efficiency of normalizing flows for generative tasks.
  2. The model, with specialized versions for image (STARFlow) and video (STARFlow-V) generation, achieves state-of-the-art results in both text-to-image and text-to-video synthesis.
  3. It features flexible resolution support, fast sampling through Jacobi iteration, efficient FSDP-enabled training, and robust text conditioning for high-quality content creation.

STARFlow is a novel transformer autoregressive flow architecture designed for high-quality image and video generation, achieving state-of-the-art results in both text-to-image and text-to-video synthesis. It combines the expressive power of autoregressive models with the computational efficiency and exact likelihood computation capabilities of normalizing flows.

The core methodology involves operating within a latent space derived from a Variational Autoencoder (VAE). For image synthesis, STARFlow utilizes an SD-VAE, and for video generation, STARFlow-V employs a WAN2.2-VAE. The generative process maps a simple base distribution (e.g., a Gaussian) in the latent space to the complex data distribution of latent representations via a sequence of invertible transformations parameterized by a Transformer. This allows for both efficient density estimation and sampling.
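The change-of-variables machinery behind this can be sketched with a toy flow. A stack of invertible per-dimension affine layers (standing in for STARFlow's Transformer-parameterized transformations) supports both sampling and exact log-likelihood; all scale and shift values below are made-up illustrative numbers, not model parameters.

```python
import numpy as np

# Toy normalizing flow: a stack of invertible per-dimension affine layers.
# This is NOT the STARFlow architecture -- just a minimal sketch of how an
# invertible map yields both sampling and exact likelihood via the
# change-of-variables formula:
#   log p(x) = log p_base(f^{-1}(x)) + log |det J_{f^{-1}}(x)|

scales = [1.5, 0.8]   # hypothetical learned per-layer scales
shifts = [0.3, -0.1]  # hypothetical learned per-layer shifts

def forward(z):
    """Map base-distribution samples to 'data' space (sampling direction)."""
    x = z
    for a, b in zip(scales, shifts):
        x = a * x + b
    return x

def log_likelihood(x):
    """Exact log p(x): invert the flow and accumulate log|det| terms."""
    z, log_det = x, 0.0
    for a, b in reversed(list(zip(scales, shifts))):
        z = (z - b) / a
        log_det += -np.log(abs(a)) * x.size  # Jacobian term, per dimension
    # Standard-Gaussian base density.
    log_base = -0.5 * np.sum(z**2) - 0.5 * x.size * np.log(2 * np.pi)
    return log_base + log_det

rng = np.random.default_rng(0)
x = forward(rng.standard_normal(4))  # sampling: base -> data
print(log_likelihood(x))             # exact density of the generated sample
```

The same forward/inverse pair serves both directions: sampling pushes Gaussian noise through the layers, while density evaluation runs them in reverse and sums the log-determinant corrections.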

The architecture employs a "6-block deep-shallow architecture," a transformer configuration that likely balances model capacity and computational efficiency. For text conditioning, both models utilize a T5-XL text encoder to embed textual prompts into a representation that guides the generation process, with classifier-free guidance (CFG) applied during sampling.
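Classifier-free guidance itself reduces to a simple blend of two model evaluations, one with and one without the text conditioning. The sketch below shows that combination on toy prediction vectors; the function name and values are illustrative, not the STARFlow API.

```python
import numpy as np

# Classifier-free guidance (CFG), sketched on toy prediction vectors.
# At each sampling step the model is evaluated twice -- once with the text
# conditioning and once unconditionally -- and the two outputs are blended.

def cfg_combine(uncond, cond, guidance_scale):
    """guided = uncond + w * (cond - uncond); w=1 recovers the conditional model."""
    return uncond + guidance_scale * (cond - uncond)

uncond = np.array([0.1, -0.2, 0.0])    # prediction with an empty prompt
cond = np.array([0.4, 0.1, -0.3])      # prediction with the text prompt
print(cfg_combine(uncond, cond, 3.0))  # w > 1 extrapolates toward the prompt
```

Scales above 1 push samples further toward the prompt-consistent region at some cost in diversity, which is why the guidance weight is typically exposed as a sampling-time knob.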

STARFlow (Text-to-Image Generation):

  • Resolution: 256x256 pixels.
  • Parameters: 3 Billion.
  • Features: Incorporates RoPE (Rotary Positional Encoding) for positional information within the transformer and supports mixed-precision training.
  • Sampling: Allows for flexible aspect ratios and utilizes block-wise Jacobi iteration (controlled by --jacobi and --jacobi_th) for accelerated inference, significantly speeding up the generative process.
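The idea behind Jacobi-accelerated sampling can be shown on a toy causal transform: instead of inverting one position at a time, all positions are updated in parallel from the previous iterate until the update stalls. The transform below is a stand-in for STARFlow's flow blocks, not the actual model; the stopping tolerance plays a role analogous to the `--jacobi_th` threshold.

```python
import numpy as np

# Jacobi iteration for inverting a causal (autoregressive) map.
# Forward: y[t] = x[t] + 0.5 * tanh(x[t-1]) -- each output depends only on
# the current and previous inputs. Sequential inversion costs one step per
# position; Jacobi iteration updates ALL positions in parallel instead.

def forward(x):
    y = x.copy()
    y[1:] += 0.5 * np.tanh(x[:-1])
    return y

def invert_sequential(y):
    x = np.empty_like(y)
    x[0] = y[0]
    for t in range(1, len(y)):
        x[t] = y[t] - 0.5 * np.tanh(x[t - 1])
    return x

def invert_jacobi(y, tol=1e-8, max_iters=100):
    x = y.copy()  # initial guess
    for i in range(max_iters):
        x_new = y.copy()
        x_new[1:] -= 0.5 * np.tanh(x[:-1])   # all positions updated at once
        if np.max(np.abs(x_new - x)) < tol:  # stopping threshold
            return x_new, i + 1
        x = x_new
    return x, max_iters

y = forward(np.array([0.2, -0.5, 1.0, 0.3]))
x_jac, n_iters = invert_jacobi(y)
print(x_jac, n_iters)
```

Because each Jacobi sweep is a single batched evaluation, the iteration converging in fewer sweeps than there are sequence positions translates directly into faster sampling on parallel hardware.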

STARFlow-V (Text-to-Video Generation):

  • Resolution: Up to 640x480 pixels (480p).
  • Parameters: 7 Billion.
  • Temporal Aspect: Generates videos of 81 frames (~5 seconds at 16 FPS) during training, with support for variable length generation up to 481+ frames (~30 seconds) during inference.
  • Features: Employs causal attention within its transformer blocks to ensure autoregressive generation, where each frame/token depends only on previous ones, crucial for temporal consistency. It also supports FPS conditioning.
  • Sampling: Leverages autoregressive generation and Jacobi iteration for efficient video sampling, with options to specify target length (--target_length) and input images for image-to-video (TI2V) generation.
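One common way to realize causal attention over video tokens is a block-causal mask: tokens attend freely within their own frame but only to past frames across time. Whether STARFlow-V masks at exactly this granularity is an assumption here; the source states only that its transformer blocks use causal attention.

```python
import numpy as np

# Block-causal attention mask for video tokens: tokens attend bidirectionally
# within their own frame, but only causally across frames (frame f sees
# frames <= f). This granularity is an illustrative assumption, not a
# confirmed detail of STARFlow-V.

def frame_causal_mask(num_frames, tokens_per_frame):
    """Return a boolean (N, N) mask; True = attention allowed."""
    n = num_frames * tokens_per_frame
    frame_id = np.arange(n) // tokens_per_frame
    # query i may attend to key j iff j's frame is not in the future
    return frame_id[None, :] <= frame_id[:, None]

mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.astype(int))
```

Masking at frame granularity (rather than per token) keeps spatial attention within a frame unrestricted while still guaranteeing that generated frames depend only on earlier ones, which is what temporal consistency requires.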

Key Features and Enhancements:

  • High-Quality Generation: Achieves competitive FID scores against state-of-the-art diffusion models, yielding high visual fidelity and temporal consistency in videos.
  • Flexible Resolution and Aspect Ratios: Supports various output dimensions, allowing for diverse applications.
  • Efficient Training: Integrated with FSDP (Fully Sharded Data Parallel) for large-scale distributed training, enabling training of large models with high batch sizes. Gradient checkpointing is also used to reduce memory footprint.
  • Fast Sampling: The implementation of block-wise Jacobi iteration significantly accelerates inference, making generation more practical.
  • Text Conditioning: Robust text-to-image and text-to-video capabilities through advanced text encoders and classifier-free guidance.
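Of the training efficiency techniques listed above, gradient checkpointing is easy to demonstrate in isolation: activations inside a checkpointed block are discarded during the forward pass and recomputed during backward, trading compute for memory. A minimal PyTorch sketch (not the STARFlow training code):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Gradient checkpointing on a tiny stand-in block. Activations inside
# `block` are not stored during forward; they are recomputed when backward
# runs, which reduces peak memory for large models.
block = torch.nn.Sequential(
    torch.nn.Linear(8, 8),
    torch.nn.GELU(),
    torch.nn.Linear(8, 8),
)

x = torch.randn(4, 8, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # recompute in backward
y.sum().backward()
print(x.grad.shape)  # gradients flow as usual: torch.Size([4, 8])
```

In practice this is applied per transformer block, and combines naturally with FSDP, which shards parameters, gradients, and optimizer state across devices.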

In summary, generation proceeds entirely in latent space. The autoregressive flow is built from invertible transformations, which is what makes exact likelihood computation possible. During sampling, the model iteratively refines the latent representation toward the target data distribution, and Jacobi iteration provides an efficient, parallelizable way to accelerate that refinement.