explainability | Sukrut Rao

CFM: Language-aligned Concept Foundation Model for Vision

CFM is a language-aligned concept foundation model that extracts spatially localized, visually grounded, and human-interpretable concepts at various granularities from images, organizes them into hierarchies, and automatically assigns names to them. This enables concept-based explanations for any downstream task that the foundation model can perform, such as classification, open-vocabulary segmentation, and captioning.

Align Once to Explain: Feature Alignment for Scalable B-cosification of Foundational Vision Transformers

ALOE is a method to transform large-scale ViT-based foundation models such as DINOv3 and SigLIP2 into inherently interpretable B-cos variants at a fraction of the cost of training from scratch, bringing interpretability while retaining strong performance across a range of downstream tasks and datasets.

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

TEVI is a framework that uses captions to guide the editing of image embeddings via sparse autoencoders, improving vision-language alignment and retrieval performance.

FaCT: Faithful Concept Traces for Explaining Neural Network Decisions

FaCT combines concept discovery with model-inherent attributions to construct a model that provides faithful concept traces for explaining its decisions, i.e., the contributions of pixels to concepts and of concepts to the final decision can be faithfully traced. We also propose a novel concept-consistency metric, C2-Score, and show that FaCT yields more consistent and interpretable concepts while retaining competitive performance.

B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability

B-cos LMs extend B-cos networks to language models, providing more faithful and human-interpretable explanations than post-hoc methods while maintaining comparable task performance.

B-cosification: Transforming Deep Neural Networks to be Inherently Interpretable

B-cosification is a method to transform existing pre-trained models into inherently interpretable B-cos variants at a fraction of the cost of training from scratch, yielding models that are interpretable while often outperforming B-cos models trained from scratch in terms of classification performance. We also apply B-cosification to CLIP and show that the B-cosified version remains competitive in performance while being interpretable.
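
For readers unfamiliar with B-cos networks, the following is a minimal sketch of a single B-cos unit, the building block that replaces a linear unit. It assumes the standard B-cos formulation, in which the output is the unit-norm-weight linear response scaled by `|cos(x, w)|^(B-1)`. It does not show the B-cosification procedure itself, and the function name `bcos_unit` is hypothetical.

```python
import math

def bcos_unit(x, w, b=2.0):
    """Sketch of a B-cos unit: (w_hat . x) * |cos(x, w)|^(b - 1).

    Assumes the standard B-cos formulation; with b = 1 this reduces to a
    linear unit with unit-norm weights.
    """
    w_norm = math.sqrt(sum(wi * wi for wi in w))
    x_norm = math.sqrt(sum(xi * xi for xi in x))
    if w_norm == 0.0 or x_norm == 0.0:
        return 0.0  # cosine is undefined for zero vectors; return 0 in this sketch
    w_hat = [wi / w_norm for wi in w]                    # unit-norm weight
    linear = sum(wh * xi for wh, xi in zip(w_hat, x))    # w_hat . x
    cos = linear / x_norm                                # cos(x, w)
    return linear * abs(cos) ** (b - 1.0)
```

Larger values of `b` suppress outputs for inputs that are poorly aligned with the weight vector, which is the property that encourages weight-input alignment.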

Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery

Discover-then-Name is an efficient task-agnostic approach to build concept bottleneck models (CBMs) by first discovering concepts learnt by the model using sparse autoencoders and then naming them automatically, yielding semantically meaningful concepts with appropriate names that help construct performant and interpretable CBMs.
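
The naming step can be illustrated with a small sketch. It assumes that each discovered concept is represented by a direction vector living in the same embedding space as a vocabulary of text embeddings (as would be the case for a vision-language model), and it names each concept by its most cosine-similar word. The function names and the toy vectors are hypothetical and only illustrate the idea, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0.0 and nv > 0.0 else 0.0

def name_concepts(concept_vectors, vocab_embeddings):
    """Assign each concept the vocabulary word with the closest embedding.

    concept_vectors: list of vectors (assumed: one direction per concept).
    vocab_embeddings: dict mapping word -> text embedding (assumed shared space).
    """
    names = []
    for concept in concept_vectors:
        best_word = max(
            vocab_embeddings,
            key=lambda word: cosine(concept, vocab_embeddings[word]),
        )
        names.append(best_word)
    return names

# Toy usage with made-up 2-D embeddings.
toy_vocab = {"stripes": [1.0, 0.0], "sky": [0.0, 1.0]}
toy_concepts = [[1.0, 0.1], [0.0, 2.0]]
toy_names = name_concepts(toy_concepts, toy_vocab)
```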

Good Teachers Explain: Explanation-Enhanced Knowledge Distillation

Explanation-enhanced knowledge distillation (e2KD) is a method to faithfully distill teacher models into students by additionally optimizing the similarity of teacher and student explanations. We show that e2KD consistently improves accuracy and student-teacher agreement, ensures that students learn from teachers to be right for the right reasons, is robust across architectures and data amounts, and works even with pre-computed explanations.
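
As a rough illustration of "additionally optimizing the similarity of teacher and student explanations", the sketch below adds an explanation term to a standard distillation loss. The choice of KL divergence for the logit term, cosine similarity for the explanation term, the weighting `lam`, and all function names are assumptions made for this example; explanations are treated as flat lists of attribution values.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0.0 and nv > 0.0 else 0.0

def explanation_enhanced_kd_loss(teacher_logits, student_logits,
                                 teacher_expl, student_expl,
                                 lam=1.0, temperature=1.0):
    """Sketch: logit-matching KD term plus an explanation-similarity term."""
    kd_term = kl_divergence(softmax(teacher_logits, temperature),
                            softmax(student_logits, temperature))
    # Assumed explanation term: 1 - cosine similarity of the two explanations.
    expl_term = 1.0 - cosine_similarity(teacher_expl, student_expl)
    return kd_term + lam * expl_term
```

With identical logits and identical explanations both terms vanish; a student that matches the teacher's predictions but attends to different evidence is still penalized through the second term.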

Better Understanding Differences in Attribution Methods via Systematic Evaluations

We propose three novel evaluation schemes to better understand the faithfulness of attribution methods and the differences between them, and use these schemes to study the strengths and shortcomings of some widely used attribution methods. We extend [our work on attribution evaluation](publication/towards-better-understanding-attribution-methods/) to more attribution methods and models, and perform additional analyses.

Studying How to Efficiently and Effectively Guide Models with Explanations

We perform an in-depth study on model guidance with explanations by evaluating various design choices, and also explore ways to improve efficiency. We show that guidance is effective even with limited, coarse, and noisy annotations, that using the energy loss with model-inherent B-cos explanations works best, and that guidance can help improve generalization under distribution shifts.
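
To give a concrete sense of an energy-style guidance loss, here is a minimal sketch. It assumes the loss rewards the fraction of positive attribution that falls inside an annotated region (for example, a bounding-box mask), written here as one minus that fraction. The exact form used in the study may differ, and `energy_loss` is a hypothetical name.

```python
def energy_loss(attribution, mask):
    """Sketch of an energy-style loss on a 2-D attribution map.

    attribution: 2-D list of attribution values (may be negative).
    mask: 2-D list of 0/1 values marking the annotated region.
    Returns 1 - (positive attribution inside mask) / (total positive attribution).
    """
    inside = 0.0
    total = 0.0
    for attr_row, mask_row in zip(attribution, mask):
        for value, in_region in zip(attr_row, mask_row):
            positive = max(value, 0.0)  # only positive attribution counts
            total += positive
            if in_region:
                inside += positive
    if total == 0.0:
        return 0.0  # no positive attribution: nothing to penalize in this sketch
    return 1.0 - inside / total
```

Because the loss depends only on how much attribution lies inside the region, not on its exact shape, a coarse mask such as a bounding box is sufficient as an annotation.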