foundation-models

Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers

ALOE transforms large-scale ViT-based foundation models such as DINOv3 and SigLIP2 into inherently interpretable B-cos variants at a fraction of the cost of training from scratch. The resulting models gain interpretability while retaining strong performance across a range of downstream tasks and datasets.

CFM: Language-aligned Concept Foundation Model for Vision

CFM is a language-aligned concept foundation model that extracts spatially localized, visually grounded, and human-interpretable concepts from images at various granularities, organizes them into hierarchies, and automatically assigns names to them. This enables concept-based explanations for any downstream task that the foundation model can perform, such as classification, open-vocabulary segmentation, and captioning.