cross-modal-retrieval

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

TEVI is a framework that uses captions to guide the editing of image embeddings via sparse autoencoders, improving vision-language alignment and retrieval performance.
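
The summary above does not spell out the editing procedure, so the following is only a rough sketch of what caption-guided editing through a sparse autoencoder could look like, not TEVI's actual method. Everything in it is an assumption made for illustration: the randomly initialized (untrained) autoencoder weights, the gating rule that amplifies image latents whose features are also active for the caption, the strength parameter `alpha`, and all function names. The image and caption embeddings are random stand-ins for the outputs of a shared vision-language encoder.

```python
import numpy as np


def l2_normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def sae_encode(x, W_enc, b_enc):
    """Map an embedding to non-negative sparse-autoencoder latents (ReLU)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)


def sae_decode(z, W_dec, b_dec):
    """Map latents back to embedding space."""
    return z @ W_dec + b_dec


def edit_image_embedding(img_emb, txt_emb, W_enc, b_enc, W_dec, b_dec, alpha=0.5):
    """Hypothetical caption-guided edit (an assumption, not the paper's rule).

    Image latents whose features are also active for the caption are
    scaled by (1 + alpha). Only the decoded *change* is added to the
    original embedding, so alpha = 0 leaves the embedding untouched.
    """
    z_img = sae_encode(img_emb, W_enc, b_enc)
    z_txt = sae_encode(txt_emb, W_enc, b_enc)
    gate = 1.0 + alpha * (z_txt > 0)  # amplify caption-active features
    z_edit = z_img * gate
    delta = sae_decode(z_edit, W_dec, b_dec) - sae_decode(z_img, W_dec, b_dec)
    return l2_normalize(img_emb + delta)


# Toy setup: random, untrained weights and random stand-in embeddings.
rng = np.random.default_rng(0)
d, k = 16, 64  # embedding width, number of SAE latents (both arbitrary)
W_enc = rng.normal(scale=0.1, size=(d, k))
b_enc = np.zeros(k)
W_dec = rng.normal(scale=0.1, size=(k, d))
b_dec = np.zeros(d)

img_emb = l2_normalize(rng.normal(size=d))
txt_emb = l2_normalize(rng.normal(size=d))

edited = edit_image_embedding(img_emb, txt_emb, W_enc, b_enc, W_dec, b_dec)
unchanged = edit_image_embedding(
    img_emb, txt_emb, W_enc, b_enc, W_dec, b_dec, alpha=0.0
)
```

Adding only the decoded difference, rather than replacing the embedding with the autoencoder's reconstruction, keeps reconstruction error out of the edited vector. With untrained weights the sketch demonstrates the data flow only and says nothing about retrieval quality.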