Interpretability Archives – Apollo Research

Interpretability

Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition

February 11, 2025

Detecting Strategic Deception Using Linear Probes

February 6, 2025

Identifying functionally important features with end-to-end sparse dictionary learning

May 30, 2024

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

May 30, 2024
