ALOE is a method to transform large-scale ViT-based foundation models such as DINOv3 and SigLIP2 into inherently interpretable B-cos variants at a fraction of the cost of training from scratch, bringing interpretability while retaining strong performance across a range of downstream tasks and datasets.
CFM is a language-aligned concept foundation model that extracts spatially localized, visually grounded, and human-interpretable concepts at various granularities from images, organizes them into hierarchies, and automatically assigns names to them. This enables concept-based explanations for any downstream task that the foundation model can perform, such as classification, open-vocabulary segmentation, and captioning.
FaCT combines concept discovery with model-inherent attributions to construct a model that provides faithful concept traces for explaining its decisions: the contributions of pixels to concepts, and of concepts to the final decision, can be faithfully traced. We also propose a novel concept-consistency metric, C2-Score, and show that FaCT yields more consistent and interpretable concepts while retaining competitive performance.
B-cos LMs extend B-cos networks to language models, providing more faithful and human-interpretable explanations than post-hoc methods while maintaining comparable task performance.
B-cosification is a method to transform existing pre-trained models into inherently interpretable B-cos variants at a fraction of the cost of training from scratch, yielding models that are interpretable while often outperforming B-cos models trained from scratch in classification performance. We also apply B-cosification to CLIP and show that the B-cosified version remains competitive in performance while being interpretable.
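For context, the B-cos transform that B-cos networks are built on can be sketched in a few lines of plain Python. This is a minimal illustration of a single unit, assuming the standard form `|cos(x, w)|^(B-1) * (ŵ · x)` with a unit-norm weight `ŵ`; it does not show the B-cosification procedure itself, and the function name `bcos_unit` is hypothetical.

```python
import math

def bcos_unit(x, w, b=2.0):
    """Sketch of one B-cos unit: |cos(x, w)|^(b-1) * (w_hat . x).

    Assumption: w is rescaled to unit norm, and b controls how strongly
    the output is down-weighted when x and w are poorly aligned.
    """
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    x_norm = math.sqrt(sum(xi * xi for xi in x))
    if w_norm == 0.0 or x_norm == 0.0:
        return 0.0  # assumption: define the output as 0 for zero vectors
    w_hat = [wi / w_norm for wi in w]
    linear = sum(wh * xi for wh, xi in zip(w_hat, x))  # w_hat . x
    cos = linear / x_norm                              # cosine of the angle between x and w
    return abs(cos) ** (b - 1.0) * linear
```

With `b=1` the unit reduces to a plain linear unit with a unit-norm weight, which is one way to see why a conventional layer is a natural starting point for conversion.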
Discover-then-Name is an efficient task-agnostic approach to build concept bottleneck models (CBMs) by first discovering concepts learnt by the model using sparse autoencoders and then naming them automatically, yielding semantically meaningful concepts with appropriate names that help construct performant and interpretable CBMs.
Explanation-enhanced knowledge distillation (e2KD) is a method to faithfully distill teacher models into students by additionally optimizing the similarity of teacher and student explanations. We show that e2KD consistently improves accuracy and student-teacher agreement, ensures that students learn from teachers to be right for the right reasons, is robust across architectures and data amounts, and works even with pre-computed explanations.
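The idea of adding an explanation-similarity term to a distillation objective can be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes a KL-divergence distillation term on softened logits, cosine similarity as the explanation-similarity measure, and explanations given as flat lists of attribution values. The names `e2kd_style_loss`, `lam`, and `temperature` are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def cosine_similarity(a, b):
    """Cosine similarity of two flat vectors; 0.0 if either is all zeros."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def e2kd_style_loss(teacher_logits, student_logits,
                    teacher_expl, student_expl,
                    lam=1.0, temperature=1.0):
    """Distillation term plus lam * (1 - explanation similarity)."""
    kd = kl_divergence(softmax(teacher_logits, temperature),
                       softmax(student_logits, temperature))
    expl = 1.0 - cosine_similarity(teacher_expl, student_expl)
    return kd + lam * expl
```

Because the teacher's explanations enter only as fixed vectors here, they could be computed once and stored, which is consistent with the observation that pre-computed explanations suffice.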
We propose three novel evaluation schemes to better understand the faithfulness of attribution methods and the differences between them, and use these schemes to study the strengths and shortcomings of some widely used attribution methods. We extend [our work on attribution evaluation](publication/towards-better-understanding-attribution-methods/) to more attribution methods and models, and perform additional analyses.
We perform an in-depth study on model guidance with explanations by evaluating various design choices, and also explore ways to improve efficiency. We show that guidance is effective even with limited, coarse, and noisy annotations, that using the energy loss with model-inherent B-cos explanations works best, and that guidance can help improve generalization under distribution shifts.
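To make the energy loss concrete, here is a minimal sketch, assuming it is the negated fraction of positive attribution that falls inside an annotated bounding box. The function name `energy_loss`, the inclusive box convention, and the handling of an all-non-positive map are assumptions made for illustration.

```python
def energy_loss(attribution, box):
    """Negated share of positive attribution inside a bounding box.

    attribution: 2D list of per-pixel attribution values.
    box: (row_min, col_min, row_max, col_max), inclusive (assumed convention).
    Returns a value in [-1, 0]; -1 means all positive attribution is inside the box.
    """
    r0, c0, r1, c1 = box
    inside = 0.0
    total = 0.0
    for r, row in enumerate(attribution):
        for c, value in enumerate(row):
            positive = max(value, 0.0)  # only positive attribution counts
            total += positive
            if r0 <= r <= r1 and c0 <= c <= c1:
                inside += positive
    if total == 0.0:
        return 0.0  # assumption: no positive attribution gives zero loss
    return -inside / total
```

Since the loss only asks that attribution lie somewhere inside the box, rather than matching a precise mask, it is plausible that it tolerates coarse box annotations.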