Dr. Robert 'Butch' Buccigrossi and Bear
Watch The Stochastic Dawn (Episode 1)

If you have watched the news over the last two years, you have probably been told that magic has arrived. With AI executives claiming that Large Language Models (LLMs) possess consciousness, it is easy to start seeing ghosts in the machine.

But when we treat AI like a black box, we lose the ability to understand it.

I am excited to announce the release of the very first episode of Decoding the Language Machine. In this series, we are opening the box. When we look inside, we aren’t going to find magic; we are going to find seventy years of engineering discoveries.

The Stochastic Dawn

In Episode 1, we travel back to 1948. Before the internet, before the GPU, and before the microchip, Claude Shannon sat in a quiet office at Bell Labs and manually simulated a mind. He built the world’s first large language model using nothing but a book and a pencil.

Shannon was trying to solve the problem of noisy analog communication over telephone wires. To do this, he defined “information” as the “Resolution of Uncertainty”. If a message is entirely predictable, it contains zero information (zero surprise).
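
In modern terms, Shannon’s “surprise” is just the negative log probability of an event, measured in bits. Here is a minimal sketch of that idea (standard information theory, not code from the episode):

```python
import math

def surprisal_bits(p: float) -> float:
    """Shannon's 'surprise': bits of information gained by observing
    an event that had probability p."""
    return -math.log2(p)

print(surprisal_bits(1.0))    # 0.00 -> a fully predictable message carries zero information
print(surprisal_bits(0.5))    # 1.00 -> a fair coin flip carries exactly one bit
print(surprisal_bits(0.001))  # 9.97 -> rare events are highly informative
```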

By calculating the probabilities of English letters, he realized our language is highly structured and about 50% redundant. Because humans are naturally prediction machines, we can read sentences even when half the letters are missing.
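
To make that redundancy claim concrete, the sketch below computes the first-order entropy of English letter frequencies (approximate textbook values, not Shannon’s exact table). Letter frequencies alone explain only part of the redundancy; Shannon’s roughly 50% figure comes from the longer-range structure the next section explores.

```python
import math

# Approximate relative frequencies of English letters (percent).
# Illustrative textbook values, not Shannon's 1948 table.
FREQ = {
    'E': 12.7, 'T': 9.1, 'A': 8.2, 'O': 7.5, 'I': 7.0, 'N': 6.7,
    'S': 6.3, 'H': 6.1, 'R': 6.0, 'D': 4.3, 'L': 4.0, 'C': 2.8,
    'U': 2.8, 'M': 2.4, 'W': 2.4, 'F': 2.2, 'G': 2.0, 'Y': 2.0,
    'P': 1.9, 'B': 1.5, 'V': 1.0, 'K': 0.8, 'J': 0.15, 'X': 0.15,
    'Q': 0.10, 'Z': 0.07,
}

total = sum(FREQ.values())
probs = [v / total for v in FREQ.values()]

# Entropy is the average surprisal: H = -sum(p * log2(p)).
H = -sum(p * math.log2(p) for p in probs)
H_max = math.log2(26)  # entropy if all 26 letters were equally likely

print(f"H     = {H:.2f} bits/letter")      # ~4.2
print(f"H_max = {H_max:.2f} bits/letter")  # ~4.7
print(f"redundancy from frequencies alone: {1 - H / H_max:.0%}")  # ~11%
```

Frequencies alone buy you only about 11% redundancy; it is the digram, trigram, and word-level structure below that pushes English toward 50%.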

The N-Gram Experiment

To prove that structure could emerge from pure statistics, Shannon designed the “Approximations to English” experiment. He used “N-Grams” (sequences of N items) to generate text.

Because he did not have a computer, he used a book to sample from the probability distribution of the English language. Here is how the structure of language emerged:

Level 0 (Equiprobable, Random): Gibberish. Maximum entropy and maximum surprise.

Level 1 (Frequency Weighted): “OCRO”, “NMIELWIS”. ‘E’ appears 12% of the time, ‘Z’ almost never; vowels appear between consonants, making the output more pronounceable.

Level 2 (Digrams, Letter Pairs): “ON”, “ARE”, “BE”. Real words emerge from letter-pair probabilities, without the machine having a dictionary.

Level 3 (Trigrams, Three Letters): “DEMONSTURES”. Plausible, English-sounding non-words appear as the system learns local morphology.
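
If you want to replay Shannon’s experiment with a computer instead of a book and pencil, a character-level N-gram sampler in this spirit might look like the following (a sketch of the general technique, not the episode’s actual source code; book.txt is a placeholder for any plain-text book):

```python
import random
from collections import Counter, defaultdict

def build_ngram_model(text: str, n: int):
    """Count which character follows each (n-1)-character context."""
    model = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        context, nxt = text[i:i + n - 1], text[i + n - 1]
        model[context][nxt] += 1
    return model

def generate(model, n: int, length: int = 80) -> str:
    """Shannon's blind hop: sample each next character in proportion to
    how often it followed the current context in the training text."""
    out = random.choice(list(model))  # start from a random context
    for _ in range(length):
        counts = model.get(out[-(n - 1):])
        if not counts:                # dead end: restart the context
            out += random.choice(list(model))
            continue
        chars, weights = zip(*counts.items())
        out += random.choices(chars, weights=weights)[0]
    return out

# Placeholder path: point this at any plain-text book.
text = open("book.txt", encoding="utf-8").read().upper()
print(generate(build_ngram_model(text, 3), 3))  # Level 3: trigram English
```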

The First Stochastic Parrot

Shannon didn’t stop at letters. By stepping up to word-based N-Grams, his manual lookup tables generated sequences like: “The head and in frontal attack on an English writer that the character of this point is therefore…”.

It sounds profound, like the musings of a literary critic. But it is just a “Stochastic Parrot”: a blind hop from one word to the next based on probability, devoid of actual meaning.
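
The word-level version that produced that “literary critic” is the same trick with a bigger unit. A minimal sketch (again assuming a placeholder book.txt):

```python
import random
from collections import Counter, defaultdict

def word_bigram_model(text: str):
    """For each word, count which words followed it in the corpus."""
    model = defaultdict(Counter)
    words = text.split()
    for w1, w2 in zip(words, words[1:]):
        model[w1][w2] += 1
    return model

def parrot(model, start: str, length: int = 20) -> str:
    """Hop blindly from word to word, weighted by observed frequency."""
    out = [start]
    for _ in range(length):
        followers = model.get(out[-1])
        if not followers:
            break
        nxt, weights = zip(*followers.items())
        out.append(random.choices(nxt, weights=weights)[0])
    return " ".join(out)

model = word_bigram_model(open("book.txt", encoding="utf-8").read().lower())
print(parrot(model, "the"))
```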

This gives us our first major intuition about modern Artificial Intelligence: Stochastic systems have emergent properties.

If Shannon could do this with a single book, imagine what happens when you train on the entire internet. Modern Large Language Models do exactly this, but they use deep neural networks to estimate these massive distributions and vector embeddings to capture the geometric relationships between words.
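
Where Shannon’s lookup table stored raw counts, a modern LLM’s network outputs a score (a logit) for every token in its vocabulary, and a softmax turns those scores into exactly the kind of probability distribution Shannon sampled from. A toy illustration with hypothetical numbers:

```python
import math

# Hypothetical logits a model might assign to candidate next tokens
# after "The cat sat on the" (illustrative numbers, not real model output).
logits = {"mat": 5.1, "floor": 3.8, "moon": 0.2}

# Softmax: exponentiate and normalize to get a probability distribution.
z = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / z for tok, v in logits.items()}

print(probs)  # {'mat': ~0.78, 'floor': ~0.21, 'moon': ~0.006}
```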

Dive Deeper

LLMs are stochastic parrots with capabilities that emerge from human language. When we peel back the marketing hype and look at the math, we replace the magic with something even more fascinating: a mirror reflecting our own linguistic patterns.

Watch the full episode above, and if you are ready to exorcise the ghosts in the machine, don’t forget to subscribe to SkepticCTO on YouTube.

The source code and Manim animations for this episode are available under a Creative Commons License on our GitHub Repository.