CFM: Language-aligned Concept Foundation Model for Vision

CFM is a language-aligned concept foundation model that extracts spatially localized, visually grounded, and human-interpretable concepts from images at multiple granularities, organizes them into hierarchies, and automatically names them. This enables concept-based explanations for any downstream task the foundation model can perform, such as classification, open-vocabulary segmentation, and captioning.

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

TEVI is a framework that uses captions to guide the editing of image embeddings via sparse autoencoders, improving vision-language alignment and retrieval performance.
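
To make the idea of caption-guided editing concrete, the following is a minimal sketch of one way an image embedding could be edited in the latent space of a sparse autoencoder. It is not TEVI's actual procedure, which the summary above does not specify. The function names, the ReLU encoder, the gating rule (amplify image latents that are also active for the caption), and the `strength` parameter are all assumptions made for illustration.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    # Sparse code: ReLU of an affine map (a common SAE form, assumed here).
    return np.maximum(0.0, x @ W_enc + b_enc)

def sae_decode(z, W_dec, b_dec):
    # Linear decoder back to the embedding space.
    return z @ W_dec + b_dec

def text_conditioned_edit(image_emb, text_emb, W_enc, b_enc, W_dec, b_dec,
                          strength=1.0):
    """Hypothetical edit rule: boost image latents that the caption also activates."""
    z_img = sae_encode(image_emb, W_enc, b_enc)
    z_txt = sae_encode(text_emb, W_enc, b_enc)

    # Gate is 1 for latents inactive in the caption, 1 + strength otherwise.
    gate = 1.0 + strength * (z_txt > 0)
    z_edit = z_img * gate

    # Keep whatever the SAE fails to reconstruct, so strength=0 is a no-op.
    residual = image_emb - sae_decode(z_img, W_dec, b_dec)
    edited = sae_decode(z_edit, W_dec, b_dec) + residual

    # Return a unit-norm embedding, as is usual for cosine-similarity retrieval.
    return edited / np.linalg.norm(edited)

# Toy usage with random weights (stand-ins for a trained SAE and real embeddings).
rng = np.random.default_rng(0)
d, k = 8, 32
W_enc, b_enc = rng.normal(size=(d, k)), np.zeros(k)
W_dec, b_dec = rng.normal(size=(k, d)), np.zeros(d)
img, txt = rng.normal(size=d), rng.normal(size=d)
edited = text_conditioned_edit(img, txt, W_enc, b_enc, W_dec, b_dec, strength=1.0)
```

Adding the reconstruction residual back is a design choice in this sketch: it ensures the edit changes only the part of the embedding the autoencoder explains, and that setting `strength=0` returns the original (normalized) embedding.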