
TIPS: Text-Image Pretraining with Spatial awareness
Key Points
- TIPS introduces a novel image-text pretraining framework designed to bridge the performance gap between image-text and self-supervised learning for both dense and global vision tasks.
- The method combines noisy web captions with synthetically generated, spatially-aware descriptions, using a dual embedding approach to boost both dense and global task performance.
- TIPS also incorporates self-distillation and masked image modeling into its training, leading to significantly enhanced spatial coherence and competitive off-the-shelf performance across 16 diverse vision datasets.
TIPS (Text-Image Pretraining with Spatial awareness) addresses the limitation of existing image-text representation learning models, such as CLIP, which often lack spatial awareness and are thus less directly applicable to dense understanding tasks (e.g., depth estimation, semantic segmentation) compared to self-supervised image-only pre-training methods. The paper proposes a novel general-purpose image-text model that can be effectively used off-the-shelf for both dense and global vision tasks by integrating insights from both image-text and self-supervised learning paradigms.
The core methodology of TIPS relies on two main innovations:
- Enhanced Textual Supervision with Synthetic Captions:
- Problem: Noisy web image captions often describe salient objects but lack comprehensive spatial details or relationships, limiting their utility for learning spatially-aware representations. They might also contain irrelevant metadata.
- Solution 1: Synthetic Caption Generation: TIPS leverages off-the-shelf multimodal generative models (e.g., PaliGemma) to generate synthetic, spatially rich textual descriptions for images. These synthetic captions tend to comprehensively describe visual content, including objects, attributes (e.g., color), and spatial relationships (e.g., "in front of").
- Problem with Synthetic Captions: While rich in spatial details, synthetic captions might miss fine-grained, discriminative information present in original noisy web captions (e.g., specific car model year, dealership details).
- Solution 2: Dual Image-Text Embedding: To combine the benefits of both noisy web captions and synthetic captions, TIPS modifies the Vision Transformer (ViT) architecture. It introduces an additional [CLS] token, resulting in two global image embeddings:
  - The standard [CLS] token embedding, primarily aligned with the noisy web caption.
  - A new [CLS] token embedding, primarily aligned with the synthetic caption.
- Training: Both captions are fed to the text encoder to obtain their respective text embeddings, and two separate contrastive losses are computed:
  - L_con-web: between the standard [CLS] image embedding and the web-caption text embedding, aligning the image's general representation with its noisy web caption.
  - L_con-syn: between the additional [CLS] image embedding and the synthetic-caption text embedding, aligning the image's spatially-aware representation with its synthetic caption.
- Inference: The dual embedding gives the model access to both an object-centric and a spatially-aware global image embedding, enabling flexibility depending on the downstream task. Both types of global embeddings back-propagate through the model, which also improves the dense patch embeddings.
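The dual-embedding training step can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names, the symmetric InfoNCE form, and the temperature value are assumptions.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch; row i of each array is a matched pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (batch, batch) similarity matrix
    diag = np.arange(len(logits))            # matched pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

def dual_contrastive_loss(cls_web, cls_syn, txt_web, txt_syn):
    """Dual embedding: the image encoder yields two [CLS] embeddings per image.
    The standard one is aligned with the web caption, the extra one with the
    synthetic caption; gradients from both flow into the shared backbone."""
    return contrastive_loss(cls_web, txt_web) + contrastive_loss(cls_syn, txt_syn)
```

In a real model, `cls_web`/`cls_syn` come from the two [CLS] tokens of the image encoder and `txt_web`/`txt_syn` from the text encoder applied to each caption.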
- Integrating Self-Distillation and Masked Image Modeling (MIM):
- Motivation: To further encourage spatially coherent and discriminative image features, TIPS incorporates self-supervised learning techniques adapted from DINO and iBOT.
- Teacher-Student Architecture: A teacher ViT guides the training of the student ViT (the main model). The teacher's weights are updated as an Exponential Moving Average (EMA) of the student's weights.
- Self-Distillation Loss (L_dist):
  - Several local crops are extracted from the input image and processed by the student to obtain local crop embeddings (via their [CLS] tokens).
  - The teacher processes the full image to obtain its global [CLS] token embedding.
  - The self-distillation loss enforces consistency between the student's local crop embeddings and the teacher's global embedding. Following DINO, it is a cross-entropy between prototype scores:

    L_dist = − Σ_i softmax((z_t − c) / τ_t) · log softmax(z_s^(i) / τ_s)

  where z_t and z_s^(i) are prototype scores obtained by applying projection heads h_t and h_s to the teacher's global embedding and the student's i-th local crop embedding, h_t is an EMA of h_s, τ_t and τ_s are teacher/student temperatures, and c is a centering variable. This ensures local features are consistent with the global view.
- Masked Image Modeling (MIM) Loss (L_MIM):
  - A masked version of the input image (with masked patches replaced by learnable mask tokens) is fed through the student; the resulting encoded mask tokens are the student's predictions for the masked patches.
  - The teacher processes the unmasked image to obtain its unmasked patch tokens.
  - The MIM loss encourages the student's encoded mask tokens to reconstruct the semantics of the corresponding unmasked patches as represented by the teacher. It is computed analogously to L_dist, per masked patch:

    L_MIM = − (1/|M|) Σ_{j∈M} softmax((z′_t,j − c′) / τ′_t) · log softmax(z′_s,j / τ′_s)

  where z′_t,j and z′_s,j are prototype scores from the respective projection heads for patch j, M is the set of masked patch positions, τ′_t and τ′_s are temperatures, and c′ is a centering variable.
- Total Loss: The overall training objective is a weighted sum of the two contrastive losses (L_con-web for the web caption, L_con-syn for the synthetic caption), the self-distillation loss, and the MIM loss:

  L_total = L_con-web + L_con-syn + λ_dist · L_dist + λ_MIM · L_MIM

  where λ_dist and λ_MIM are hyperparameters.
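The self-supervised losses above share one primitive: a cross-entropy between centered, sharpened teacher prototype scores and student prototype scores. A minimal NumPy sketch (illustrative only; the temperature defaults and function names are assumptions, and in practice the teacher target carries no gradient):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_ce(teacher_scores, student_scores, center,
                    teacher_temp=0.04, student_temp=0.1):
    """DINO-style cross-entropy between prototype score distributions.
    The teacher scores are centered (with `center`) and sharpened (lower
    temperature); reused per masked patch for the iBOT-style MIM loss.
    Shapes: (..., num_prototypes)."""
    target = softmax((teacher_scores - center) / teacher_temp, axis=-1)
    log_pred = np.log(softmax(student_scores / student_temp, axis=-1))
    return -(target * log_pred).sum(axis=-1).mean()

def total_loss(l_con_web, l_con_syn, l_dist, l_mim,
               lambda_dist=1.0, lambda_mim=1.0):
    """Weighted sum of the two contrastive losses, self-distillation, and
    MIM; the lambda weights are hyperparameters (values here are placeholders)."""
    return l_con_web + l_con_syn + lambda_dist * l_dist + lambda_mim * l_mim
```

Applying `distillation_ce` to [CLS]-level scores gives L_dist; applying it to the scores of each masked patch and averaging gives L_MIM.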
Scaling TIPS:
TIPS is scaled to a large architecture and dataset for enhanced image representations:
- Model: The image encoder uses a ViT-g architecture (patch size 14, SwiGLU variant) with 1.1B parameters, embedding dimension 1536, and 24 heads, making it comparable to DINOv2-g. The text encoder is a standard Transformer with 12 layers, matching the image encoder's dimensions.
- Data: Training leverages a curated subset of the WebLI dataset (10B image-text pairs). The data is filtered by image-text similarity (using a pretrained alignment model) and language (English captions). A final curation step selects images similar to those in existing curated datasets, resulting in 116M high-quality image-text pairs. Near-duplicate images from evaluation datasets are removed.
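The similarity-based filtering step might look like the following sketch. This is an assumption-laden illustration (function name, threshold value, and the use of cosine similarity are all hypothetical), not the paper's curation pipeline:

```python
import numpy as np

def curate_pairs(img_embs, txt_embs, sim_threshold=0.3):
    """Keep the indices of image-text pairs whose cosine similarity, under a
    pretrained alignment model's embeddings, exceeds a threshold.
    img_embs, txt_embs: (num_pairs, dim); row i of each is one candidate pair."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    sims = (img * txt).sum(axis=1)          # per-pair cosine similarity
    return np.flatnonzero(sims >= sim_threshold)
```

Language filtering and near-duplicate removal against evaluation sets would be applied as separate passes over the surviving pairs.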
Experimental Evaluation:
TIPS is evaluated on 8 tasks across 16 datasets, assessing off-the-shelf performance with frozen image-text representations.
- Dense Prediction Tasks: Semantic Segmentation (PASCAL VOC, ADE20k - mIoU), Monocular Depth Estimation (NYUv2, NAVI - RMSE), Surface Normal Estimation (NYUv2, NAVI - Angular RMSE). Probing strategies involve linear classifiers on spatial features or concatenated patch/global embeddings.
- Global Image Understanding Tasks: Image Classification (ImageNet-1K - KNN/linear probe accuracy), Zero-shot Classification (ImageNet-1K - top-1 accuracy by text embedding retrieval).
- Multimodal Retrieval Tasks: Fine-grained and Instance-level Retrieval (Universal Embeddings Dataset - R@1), Image-to-text (I→T) Retrieval (Flickr30K, DOCCI, COCO - R@1), Text-to-image (T→I) Retrieval (Flickr30K, DOCCI, COCO - R@1).
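The R@1 retrieval metric used in these evaluations can be computed as below; a minimal sketch assuming matched query/gallery pairs share an index (one ground-truth match per query):

```python
import numpy as np

def recall_at_1(query_embs, gallery_embs):
    """Fraction of queries whose nearest gallery item (by cosine similarity)
    is the ground-truth match at the same index. For I→T retrieval the
    queries are image embeddings and the gallery is text embeddings; for
    T→I retrieval the roles are swapped."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    nearest = (q @ g.T).argmax(axis=1)      # index of most similar gallery item
    return (nearest == np.arange(len(q))).mean()
```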
The paper demonstrates that combining synthetic captions with a dual embedding strategy and integrating both self-distillation and masked image modeling leads to significant performance improvements across both dense and global vision tasks, bridging the performance gap between image-text and self-supervised models for dense understanding.