TIPS: Text-Image Pretraining with Spatial awareness

Kevis-Kokitsi Maninis
2026.04.11
arXiv · by 이호민/AI
#Computer Vision #Image-Text Pretraining #Self-Supervised Learning #Spatial Awareness #Transformer

Key Points

  • TIPS introduces a novel image-text pretraining framework designed to bridge the performance gap between image-text and self-supervised learning for both dense and global vision tasks.
  • The method combines noisy web captions with synthetically generated, spatially-aware descriptions, using a dual embedding approach to boost both dense and global task performance.
  • TIPS also incorporates self-distillation and masked image modeling into its training, leading to significantly enhanced spatial coherence and competitive off-the-shelf performance across 16 diverse vision datasets.

TIPS (Text-Image Pretraining with Spatial awareness) addresses the limitation of existing image-text representation learning models, such as CLIP, which often lack spatial awareness and are thus less directly applicable to dense understanding tasks (e.g., depth estimation, semantic segmentation) compared to self-supervised image-only pre-training methods. The paper proposes a novel general-purpose image-text model that can be effectively used off-the-shelf for both dense and global vision tasks by integrating insights from both image-text and self-supervised learning paradigms.

The core methodology of TIPS relies on two main innovations:

  1. Enhanced Textual Supervision with Synthetic Captions:
    • Problem: Noisy web image captions often describe salient objects but lack comprehensive spatial details or relationships, limiting their utility for learning spatially-aware representations. They might also contain irrelevant metadata.
    • Solution 1: Synthetic Caption Generation: TIPS leverages off-the-shelf multimodal generative models (e.g., PaliGemma) to generate synthetic, spatially rich textual descriptions $\hat{T}$ for images $I$. These synthetic captions tend to comprehensively describe visual content, including objects, attributes (e.g., color), and spatial relationships (e.g., "in front of").
    • Problem with Synthetic Captions: While rich in spatial details, synthetic captions might miss fine-grained, discriminative information present in original noisy web captions (e.g., specific car model year, dealership details).
    • Solution 2: Dual Image-Text Embedding: To combine the benefits of both noisy web captions ($T$) and synthetic captions ($\hat{T}$), TIPS modifies the Vision Transformer (ViT) architecture. It introduces an additional [CLS] token, resulting in two global image embeddings:
      • $e_g$: The standard [CLS] token embedding, primarily aligned with the noisy web caption $T$.
      • $\hat{e}_g$: A new [CLS] token embedding, primarily aligned with the synthetic caption $\hat{T}$.
    • Training: During training, both $T$ and $\hat{T}$ are fed to the text encoder to obtain their respective embeddings, $e_t$ and $\hat{e}_t$. Two separate contrastive losses are computed:
      • $L_{CLIP}$: Between $e_g$ and $e_t$, aligning the image's general representation with its noisy web caption.
      • $\hat{L}_{CLIP}$: Between $\hat{e}_g$ and $\hat{e}_t$, aligning the image's spatially-aware representation with its synthetic caption.
    • Inference: This dual embedding allows the model to access both object-centric ($e_g$) and spatially-aware ($\hat{e}_g$) global image embeddings, enabling flexibility depending on the downstream task. Both types of global embeddings back-propagate through the model to improve the dense patch embeddings $\{e_n\}_{n=1}^N$.
  2. Integrating Self-Distillation and Masked Image Modeling (MIM):
    • Motivation: To further encourage spatially coherent and discriminative image features, TIPS incorporates self-supervised learning techniques adapted from DINO and iBOT.
    • Teacher-Student Architecture: A teacher ViT model $f_t$ guides the training of the student ViT model $f_s$ (which is the main model $f$). The teacher's weights are updated via an Exponential Moving Average (EMA) of the student's weights.
    • Self-Distillation Loss ($L_{distill}$):
      • $M$ local crops are extracted from the input image $I$. These crops are processed by the student $f_s$ to obtain $M$ local crop embeddings $\{e_{g,m}\}_{m=1}^M$ (via their [CLS] tokens).
      • The teacher $f_t$ processes the full image $I$ to obtain its global [CLS] token embedding $e_{g,t}$.
      • The self-distillation loss enforces consistency between the student's local crop embeddings and the teacher's global embedding. It is computed as:

        $$L_{distill} = - \sum_b \sum_m \text{softmax}\big((p^t_b - c)/\tau_t\big) \log\big(\text{softmax}(p^m_b/\tau_s)\big)$$

        where $p^t_b = P_t(e_{g,t})$ and $p^m_b = P_s(e_{g,m})$ are prototype scores obtained by applying projection heads ($P_t$, $P_s$) to the embeddings. $P_t$ is an EMA of $P_s$. $\tau_t$, $\tau_s$ are teacher/student temperatures, and $c$ is a centering variable. This ensures local features are consistent with the global view.
    • Masked Image Modeling (MIM) Loss ($L_{mask}$):
      • A masked version of the input image $I$ (where masked patches are replaced by mask tokens $\{m_n\}$) is fed through the student $f_s$. The encoded mask tokens are denoted $\{e^m_n\}$.
      • The teacher $f_t$ processes the unmasked image $I$ to obtain its unmasked patch tokens $\{e^t_n\}$.
      • The MIM loss encourages the student's encoded mask tokens to reconstruct the semantics of the corresponding unmasked patches as represented by the teacher. It is computed similarly to $L_{distill}$:

        $$L_{mask} = - \sum_b \sum_n \text{softmax}\big((p^t_{b,n} - c')/\tau'_t\big) \log\big(\text{softmax}(p^m_{b,n}/\tau'_s)\big)$$

        where $p^t_{b,n} = P'_t(e^t_n)$ and $p^m_{b,n} = P'_s(e^m_n)$ are prototype scores from the respective projection heads ($P'_t$, $P'_s$). $\tau'_t$, $\tau'_s$ are temperatures and $c'$ is a centering variable.
    • Total Loss: The overall training objective is a weighted sum:

        $$L_{total} = \frac{1}{2} (L_{CLIP} + \hat{L}_{CLIP}) + \alpha L_{distill} + \beta L_{mask}$$

        where $\alpha$ and $\beta$ are weighting hyperparameters.
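The training objective above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: batching, the multi-crop schedule, and the centering update are simplified, and the temperature, momentum, and weighting values are assumed defaults rather than the paper's settings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE between L2-normalized image and text embeddings;
    applied twice, once per caption type (web and synthetic)."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    loss_i2t = -np.log(softmax(logits, axis=1)[idx, idx]).mean()
    loss_t2i = -np.log(softmax(logits, axis=0)[idx, idx]).mean()
    return 0.5 * (loss_i2t + loss_t2i)

def dino_style_loss(p_teacher, p_student, center, tau_t=0.04, tau_s=0.1):
    """Cross-entropy between centered, sharpened teacher prototype scores and
    the student's prototype scores -- the shared form of L_distill (teacher
    global [CLS] vs. student local-crop [CLS]) and L_mask (patch tokens)."""
    targets = softmax((p_teacher - center) / tau_t, axis=-1)
    log_probs = np.log(softmax(p_student / tau_s, axis=-1))
    return -(targets * log_probs).sum(axis=-1).mean()

def ema_update(teacher_w, student_w, momentum=0.996):
    """Teacher parameters track the student as an exponential moving average."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

def total_loss(e_g, e_t, e_g_hat, e_t_hat,
               p_t, p_s, p_t_mask, p_s_mask,
               center, center_mask, alpha=1.0, beta=1.0):
    """L_total = (L_CLIP + L_CLIP_hat)/2 + alpha*L_distill + beta*L_mask."""
    l_clip = clip_loss(e_g, e_t)              # web-caption alignment
    l_clip_hat = clip_loss(e_g_hat, e_t_hat)  # synthetic-caption alignment
    l_distill = dino_style_loss(p_t, p_s, center)
    l_mask = dino_style_loss(p_t_mask, p_s_mask, center_mask)
    return 0.5 * (l_clip + l_clip_hat) + alpha * l_distill + beta * l_mask
```

In a real training loop the embeddings and prototype scores would come from the student and EMA teacher ViTs; here they are plain arrays so the loss arithmetic can be inspected in isolation.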

Scaling TIPS:
TIPS is scaled to a large architecture and dataset for enhanced image representations:

  • Model: The image encoder uses a ViT-g architecture (patch size 14, SwiGLU variant) with 1.1B parameters, embedding dimension 1536, and 24 heads, making it comparable to DINOv2-g. The text encoder is a standard Transformer with 12 layers, matching the image encoder's dimensions.
  • Data: Training leverages a curated subset of the WebLI dataset (10B image-text pairs). The data is filtered by image-text similarity (using a pretrained alignment model) and by language (English captions only). A final curation step selects images similar to those in existing curated datasets, resulting in 116M high-quality image-text pairs. Near-duplicate images from the evaluation datasets are removed.
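The filtering stages can be sketched as a simple pipeline. The callables (alignment scorer, language detector, dedup hash) and the threshold are placeholders for the pretrained models the paper uses, and the final curated-dataset-similarity selection step is omitted for brevity:

```python
def curate_webli(pairs, align_score, is_english, eval_hashes, image_hash,
                 score_threshold=0.3):
    """Keep (image, caption) pairs that pass the language filter, the
    image-text alignment filter, and near-duplicate removal against the
    evaluation sets. All predicates are hypothetical stand-ins."""
    kept = []
    for image, caption in pairs:
        if not is_english(caption):                        # language filter
            continue
        if align_score(image, caption) < score_threshold:  # alignment filter
            continue
        if image_hash(image) in eval_hashes:               # eval-set dedup
            continue
        kept.append((image, caption))
    return kept
```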

Experimental Evaluation:
TIPS is evaluated on 8 tasks across 16 datasets, assessing off-the-shelf performance with frozen image-text representations.

  • Dense Prediction Tasks: Semantic Segmentation (PASCAL VOC, ADE20k - mIoU), Monocular Depth Estimation (NYUv2, NAVI - RMSE), Surface Normal Estimation (NYUv2, NAVI - Angular RMSE). Probing strategies involve linear classifiers on spatial features or concatenated patch/global embeddings.
  • Global Image Understanding Tasks: Image Classification (ImageNet-1K - KNN/linear probe accuracy), Zero-shot Classification (ImageNet-1K - top-1 accuracy by text embedding retrieval).
  • Multimodal Retrieval Tasks: Fine-grained and Instance-level Retrieval (Universal Embeddings Dataset - R@1), Image-to-text (I→T) Retrieval (Flickr30K, DOCCI, COCO - R@1), Text-to-image (T→I) Retrieval (Flickr30K, DOCCI, COCO - R@1).
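The zero-shot classification protocol mentioned above (top-1 accuracy by text-embedding retrieval) amounts to nearest-neighbor search in the shared embedding space. A minimal sketch, assuming precomputed image and class-name embeddings:

```python
import numpy as np

def zero_shot_classify(image_embs, class_text_embs):
    """Zero-shot classification by retrieval: embed each class name with the
    text encoder, then assign every image the class whose text embedding has
    the highest cosine similarity to the image's global embedding."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return (img @ txt.T).argmax(axis=1)
```

With TIPS's dual embeddings, either $e_g$ or $\hat{e}_g$ could serve as the image embedding here, depending on which caption type the downstream task resembles.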

The paper demonstrates that combining synthetic captions with a dual embedding strategy and integrating both self-distillation and masked image modeling leads to significant performance improvements across both dense and global vision tasks, bridging the performance gap between image-text and self-supervised models for dense understanding.