Interpretability

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition

February 11, 2025

Detecting Strategic Deception Using Linear Probes

February 6, 2025

Identifying functionally important features with end-to-end sparse dictionary learning

May 30, 2024

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

May 30, 2024