This project visualizes the inner workings of very small transformers (fewer than 100 parameters so far).
The embedding dimension is 2 so the internal state can be drawn as a 2D vector, and the vocabulary is small enough that all possible inputs can be drawn simultaneously.
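For a sense of scale, a sub-100-parameter budget is plausible with an embedding dimension of 2. The sizes below (vocabulary of 5, single attention head without biases, MLP hidden width of 4, untied unembedding) are illustrative assumptions, not the project's actual configuration:

```python
# Parameter count of a hypothetical minimal transformer block (assumed sizes).
V, D, H = 5, 2, 4   # vocab size, embedding dim, MLP hidden width

embed   = V * D                      # token embedding table
attn    = 4 * D * D                  # W_Q, W_K, W_V, W_O, no biases
mlp     = (D * H + H) + (H * D + D)  # two linear layers with biases
unembed = D * V                      # output projection (untied)

total = embed + attn + mlp + unembed
print(total)  # → 58, comfortably under 100
```

Even with an untied unembedding and MLP biases, the whole model fits in 58 parameters here; most of the budget goes to the MLP.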
Videos are created with the manim library (https://www.manim.community/).
I'm currently studying how completely random data is memorized, particularly in the MLP layers.
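The memorization setup can be sketched as follows: assign a purely random binary label to each token and train a tiny embedding-plus-MLP model until it fits the noise. Everything here (sizes, the binary-label task, plain SGD, the attention block being omitted) is my assumption for illustration, not the project's actual code:

```python
import math
import random

random.seed(0)
V, D, H = 8, 2, 16  # vocab size, embedding dim, MLP hidden width (all assumed)

# One completely random binary label per token: pure noise to memorize.
labels = [random.randint(0, 1) for _ in range(V)]

# Parameters: 2D token embeddings and a one-hidden-layer ReLU MLP with a scalar logit.
E  = [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(V)]
W1 = [[random.gauss(0, 0.5) for _ in range(D)] for _ in range(H)]
b1 = [0.0] * H
W2 = [random.gauss(0, 0.5) for _ in range(H)]
b2 = 0.0

def forward(tok):
    x = list(E[tok])  # snapshot so in-place updates don't alias
    h = [max(0.0, sum(W1[j][k] * x[k] for k in range(D)) + b1[j]) for j in range(H)]
    logit = sum(W2[j] * h[j] for j in range(H)) + b2
    return x, h, logit

def mean_loss():
    total = 0.0
    for t in range(V):
        p = 1 / (1 + math.exp(-forward(t)[2]))
        p = min(max(p, 1e-12), 1 - 1e-12)  # clamp away from log(0)
        total += -math.log(p if labels[t] else 1 - p)
    return total / V

lr = 0.1
loss_before = mean_loss()
for step in range(3000):
    tok = random.randrange(V)
    x, h, logit = forward(tok)
    p = 1 / (1 + math.exp(-logit))
    g = p - labels[tok]  # d(binary cross-entropy)/d(logit)
    for j in range(H):
        gh = g * W2[j] if h[j] > 0 else 0.0  # gradient at hidden pre-activation
        W2[j] -= lr * g * h[j]
        b1[j] -= lr * gh
        for k in range(D):
            E[tok][k] -= lr * gh * W1[j][k]  # uses pre-update W1
            W1[j][k]  -= lr * gh * x[k]
    b2 -= lr * g
loss_after = mean_loss()

print(loss_before, "->", loss_after)
```

Since the labels carry no structure, any drop in loss here is memorization by definition; with trainable 2D embeddings, one can watch the tokens being pushed apart in the plane as the MLP carves out regions for each label.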