TIPS: Text-Image Pretraining with Spatial awareness

Kevis-Kokitsi Maninis
2026.04.11
arXiv · by 이호민/AI
#Computer Vision #Image-Text Pretraining #Self-Supervised Learning #Spatial Awareness #Transformer

Key Points

  • TIPS introduces a novel image-text pretraining framework designed to bridge the performance gap between image-text and self-supervised learning for both dense and global vision tasks.
  • The method combines noisy web captions with synthetically generated, spatially-aware descriptions, using a dual embedding approach to boost both dense and global task performance.
  • TIPS also incorporates self-distillation and masked image modeling into its training, leading to significantly enhanced spatial coherence and competitive off-the-shelf performance across 16 diverse vision datasets.

TIPS (Text-Image Pretraining with Spatial awareness) addresses the limitation of existing image-text representation learning models, such as CLIP, which often lack spatial awareness and are thus less directly applicable to dense understanding tasks (e.g., depth estimation, semantic segmentation) compared to self-supervised image-only pre-training methods. The paper proposes a novel general-purpose image-text model that can be effectively used off-the-shelf for both dense and global vision tasks by integrating insights from both image-text and self-supervised learning paradigms.

The core methodology of TIPS relies on two main innovations:

  1. Enhanced Textual Supervision with Synthetic Captions:
    • Problem: Noisy web image captions often describe salient objects but lack comprehensive spatial details or relationships, limiting their utility for learning spatially-aware representations. They might also contain irrelevant metadata.
    • Solution 1: Synthetic Caption Generation: TIPS leverages off-the-shelf multimodal generative models (e.g., PaliGemma) to generate synthetic, spatially rich textual descriptions $\hat{T}$ for images $I$. These synthetic captions tend to comprehensively describe visual content, including objects, attributes (e.g., color), and spatial relationships (e.g., "in front of").
    • Problem with Synthetic Captions: While rich in spatial details, synthetic captions might miss fine-grained, discriminative information present in original noisy web captions (e.g., specific car model year, dealership details).
    • Solution 2: Dual Image-Text Embedding: To combine the benefits of both noisy web captions ($T$) and synthetic captions ($\hat{T}$), TIPS modifies the Vision Transformer (ViT) architecture. It introduces an additional [CLS] token, resulting in two global image embeddings:
      • $e_g$: The standard [CLS] token embedding, primarily aligned with the noisy web caption $T$.
      • $\hat{e}_g$: A new [CLS] token embedding, primarily aligned with the synthetic caption $\hat{T}$.
    • Training: During training, both $T$ and $\hat{T}$ are fed to the text encoder to obtain their respective embeddings, $e_t$ and $\hat{e}_t$. Two separate contrastive losses are computed:
      • $L_{CLIP}$: Between $e_g$ and $e_t$, aligning the image's general representation with its noisy web caption.
      • $\hat{L}_{CLIP}$: Between $\hat{e}_g$ and $\hat{e}_t$, aligning the image's spatially-aware representation with its synthetic caption.
    • Inference: This dual embedding allows the model to access both object-centric ($e_g$) and spatially-aware ($\hat{e}_g$) global image embeddings, enabling flexibility depending on the downstream task. Both types of global embeddings back-propagate through the model to improve the dense patch embeddings $\{e_n\}_{n=1}^N$.
  2. Integrating Self-Distillation and Masked Image Modeling (MIM):
    • Motivation: To further encourage spatially coherent and discriminative image features, TIPS incorporates self-supervised learning techniques adapted from DINO and iBOT.
    • Teacher-Student Architecture: A teacher ViT model $f_t$ guides the training of the student ViT model $f_s$ (which is the main model $f$). The teacher's weights are updated via an Exponential Moving Average (EMA) of the student's weights.
    • Self-Distillation Loss ($L_{distill}$):
      • $M$ local crops are extracted from the input image $I$. These crops are processed by the student $f_s$ to obtain $M$ local crop embeddings $\{e_{g,m}\}_{m=1}^M$ (via their [CLS] tokens).
      • The teacher $f_t$ processes the full image $I$ to obtain its global [CLS] token embedding $e_{g,t}$.
      • The self-distillation loss enforces consistency between the student's local crop embeddings and the teacher's global embedding. It is computed as:

        $$L_{distill} = - \sum_b \sum_m \text{softmax}\big((p^t_b - c)/\tau_t\big) \log\big(\text{softmax}(p^m_b/\tau_s)\big)$$

        where $p^t_b = P_t(e_{g,t})$ and $p^m_b = P_s(e_{g,m})$ are prototype scores obtained by applying projection heads ($P_t$, $P_s$) to the embeddings. $P_t$ is an EMA of $P_s$. $\tau_t$, $\tau_s$ are teacher/student temperatures, and $c$ is a centering variable. This ensures local features are consistent with the global view.
    • Masked Image Modeling (MIM) Loss ($L_{mask}$):
      • A masked version of the input image $I$ (where masked patches are replaced by mask tokens $\{m_n\}$) is fed through the student $f_s$. The encoded mask tokens are denoted $\{e^m_n\}$.
      • The teacher $f_t$ processes the unmasked image $I$ to obtain its unmasked patch tokens $\{e^t_n\}$.
      • The MIM loss encourages the student's encoded mask tokens to reconstruct the semantics of the corresponding unmasked patches as represented by the teacher. It is computed similarly to $L_{distill}$:

        $$L_{mask} = - \sum_b \sum_n \text{softmax}\big((p^t_{b,n} - c')/\tau'_t\big) \log\big(\text{softmax}(p^m_{b,n}/\tau'_s)\big)$$

        where $p^t_{b,n} = P'_t(e^t_n)$ and $p^m_{b,n} = P'_s(e^m_n)$ are prototype scores from the respective projection heads ($P'_t$, $P'_s$). $\tau'_t$, $\tau'_s$ are temperatures and $c'$ is a centering variable.
    • Total Loss: The overall training objective is a weighted sum:

        $$L_{total} = \frac{1}{2} (L_{CLIP} + \hat{L}_{CLIP}) + \alpha L_{distill} + \beta L_{mask}$$

        where $\alpha$ and $\beta$ are weighting hyperparameters.
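The training objective above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: batching, the multi-crop schedule, and the centering update are simplified, and the temperature, momentum, and weighting values are assumed defaults rather than the paper's settings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE between L2-normalized image and text embeddings;
    applied twice, once per caption type (web and synthetic)."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    loss_i2t = -np.log(softmax(logits, axis=1)[idx, idx]).mean()
    loss_t2i = -np.log(softmax(logits, axis=0)[idx, idx]).mean()
    return 0.5 * (loss_i2t + loss_t2i)

def dino_style_loss(p_teacher, p_student, center, tau_t=0.04, tau_s=0.1):
    """Cross-entropy between centered, sharpened teacher prototype scores and
    the student's prototype scores -- the shared form of L_distill (teacher
    global [CLS] vs. student local-crop [CLS]) and L_mask (patch tokens)."""
    targets = softmax((p_teacher - center) / tau_t, axis=-1)
    log_probs = np.log(softmax(p_student / tau_s, axis=-1))
    return -(targets * log_probs).sum(axis=-1).mean()

def ema_update(teacher_w, student_w, momentum=0.996):
    """Teacher parameters track the student as an exponential moving average."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

def total_loss(e_g, e_t, e_g_hat, e_t_hat,
               p_t, p_s, p_t_mask, p_s_mask,
               center, center_mask, alpha=1.0, beta=1.0):
    """L_total = (L_CLIP + L_CLIP_hat)/2 + alpha*L_distill + beta*L_mask."""
    l_clip = clip_loss(e_g, e_t)              # web-caption alignment
    l_clip_hat = clip_loss(e_g_hat, e_t_hat)  # synthetic-caption alignment
    l_distill = dino_style_loss(p_t, p_s, center)
    l_mask = dino_style_loss(p_t_mask, p_s_mask, center_mask)
    return 0.5 * (l_clip + l_clip_hat) + alpha * l_distill + beta * l_mask
```

In a real training loop the embeddings and prototype scores would come from the student and EMA teacher ViTs; here they are plain arrays so the loss arithmetic can be inspected in isolation.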

Scaling TIPS:
TIPS is scaled to a large architecture and dataset for enhanced image representations:

  • Model: The image encoder uses a ViT-g architecture (patch size 14, SwiGLU variant) with 1.1B parameters, embedding dimension 1536, and 24 heads, making it comparable to DINOv2-g. The text encoder is a standard Transformer with 12 layers, matching the image encoder's dimensions.
  • Data: Training leverages a curated subset of the WebLI dataset (10B image-text pairs). The data is filtered by image-text similarity (using a pretrained alignment model) and by language (English captions only). A final curation step selects images similar to those in existing curated datasets, resulting in 116M high-quality image-text pairs. Near-duplicate images from the evaluation datasets are removed.
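The filtering stages can be sketched as a simple pipeline. The callables (alignment scorer, language detector, dedup hash) and the threshold are placeholders for the pretrained models the paper uses, and the final curated-dataset-similarity selection step is omitted for brevity:

```python
def curate_webli(pairs, align_score, is_english, eval_hashes, image_hash,
                 score_threshold=0.3):
    """Keep (image, caption) pairs that pass the language filter, the
    image-text alignment filter, and near-duplicate removal against the
    evaluation sets. All predicates are hypothetical stand-ins."""
    kept = []
    for image, caption in pairs:
        if not is_english(caption):                        # language filter
            continue
        if align_score(image, caption) < score_threshold:  # alignment filter
            continue
        if image_hash(image) in eval_hashes:               # eval-set dedup
            continue
        kept.append((image, caption))
    return kept
```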

Experimental Evaluation:
TIPS is evaluated on 8 tasks across 16 datasets, assessing off-the-shelf performance with frozen image-text representations.

  • Dense Prediction Tasks: Semantic Segmentation (PASCAL VOC, ADE20k - mIoU), Monocular Depth Estimation (NYUv2, NAVI - RMSE), Surface Normal Estimation (NYUv2, NAVI - Angular RMSE). Probing strategies involve linear classifiers on spatial features or concatenated patch/global embeddings.
  • Global Image Understanding Tasks: Image Classification (ImageNet-1K - KNN/linear probe accuracy), Zero-shot Classification (ImageNet-1K - top-1 accuracy by text embedding retrieval).
  • Multimodal Retrieval Tasks: Fine-grained and Instance-level Retrieval (Universal Embeddings Dataset - R@1), Image-to-text (I→T) Retrieval (Flickr30K, DOCCI, COCO - R@1), Text-to-image (T→I) Retrieval (Flickr30K, DOCCI, COCO - R@1).
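The zero-shot classification protocol mentioned above (top-1 accuracy by text-embedding retrieval) amounts to nearest-neighbor search in the shared embedding space. A minimal sketch, assuming precomputed image and class-name embeddings:

```python
import numpy as np

def zero_shot_classify(image_embs, class_text_embs):
    """Zero-shot classification by retrieval: embed each class name with the
    text encoder, then assign every image the class whose text embedding has
    the highest cosine similarity to the image's global embedding."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return (img @ txt.T).argmax(axis=1)
```

With TIPS's dual embeddings, either $e_g$ or $\hat{e}_g$ could serve as the image embedding here, depending on which caption type the downstream task resembles.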

The paper demonstrates that combining synthetic captions with a dual embedding strategy and integrating both self-distillation and masked image modeling leads to significant performance improvements across both dense and global vision tasks, bridging the performance gap between image-text and self-supervised models for dense understanding.