A downloadable project

This report investigates the automated identification of neurons that may correspond to a feature in a language model, starting from a dataset of maximum-activation texts and word embeddings. The method could speed up interpretability research by flagging high-potential feature neurons, and it builds on existing infrastructure such as Neuroscope. We show that it is feasible for three tasks: quantifying the semantic relatedness of the maximum-activating tokens in an existing dataset, performing basic interpretability analysis by comparing activations on synonyms, and generating prompt guidance for further human investigation. We also show that the method generalises across multiple language models, and we suggest areas for further exploration based on the results.
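As a rough illustration of the first task, the sketch below scores how semantically related a neuron's maximum-activating tokens are by averaging pairwise cosine similarity between their word embeddings. It is not taken from the notebook: the function names, the choice of mean pairwise cosine similarity as the score, and the three-dimensional toy embeddings are all assumptions made for illustration. A real run would presumably use pretrained word embeddings and tokens drawn from a maximum-activation dataset.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def relatedness_score(tokens, embeddings):
    """Mean pairwise cosine similarity over the tokens that have an embedding.

    Hypothetical scoring rule: a high score suggests the tokens share a
    meaning, so the neuron may be worth a closer look.
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # fewer than two known tokens: nothing to compare
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings, invented for this example only.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "kitten": [0.85, 0.15, 0.05],
    "tax": [0.0, 0.1, 0.9],
    "the": [0.1, 0.9, 0.1],
}

# Two hypothetical neurons: one with related top tokens, one without.
coherent = relatedness_score(["cat", "dog", "kitten"], TOY_EMBEDDINGS)
mixed = relatedness_score(["cat", "tax", "the"], TOY_EMBEDDINGS)
print(f"coherent neuron: {coherent:.3f}")
print(f"mixed neuron:    {mixed:.3f}")
```

Ranking neurons by a score like this is one plausible way to flag candidates for human review. Any threshold for "high potential" would need to be tuned against the embeddings and model actually used.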

Download

Automated Identification of Potential Feature Neurons.ipynb 108 kB
Automated Identification of Potential Feature Neurons (3).pdf 183 kB
