Interactive Text Entropy Visualization

This is an interactive open-source visualization of the entropy of arbitrary text,
based on word-based Markov models of modern printed English.

The amount of information that you get from a message depends on your model of what you’re reading.
If it’s in your native language, you can even skip the last word in a sentence but still grasp the gist of __.
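This model-dependence is the standard information-theoretic picture: the information carried by a word is its surprisal, the negative base-2 logarithm of the probability your model assigns it. A minimal sketch (the numbers are illustrative, not drawn from the actual models):

```python
import math

def surprisal_bits(p):
    """Information content (surprisal) of an event with probability p, in bits."""
    return -math.log2(p)

# A word your model half-expects carries 1 bit;
# a word it thought was a 1-in-1024 shot carries 10 bits.
print(surprisal_bits(0.5))       # 1.0
print(surprisal_bits(1 / 1024))  # 10.0
```

A better model assigns higher probability to the words that actually occur, so the same text carries fewer bits under it.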

I built five word-based Markov models of modern formal English, zeroth-order through fourth-order, from printed text spanning 1800–2008, via the Google Books n-gram dataset. The visualization represents each word position as a stack of colored bars, arranged into a square block. Each bar in the stack represents a word the model predicted in that context; each bar's height starts at the probability the model assigned to that word, then tweens to either 0% or 100% depending on whether that word was the actual word in context. The actual word's color is faded, while all the other words' colors are vivid: a vivid image indicates a more information-rich, less predictable message.
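The core of such a model is just conditional counts: for each context of the previous n words, tally what word comes next, and normalize to get a distribution whose probabilities become the bars' initial heights. A toy sketch (not the actual codebase, which is built from the Google Books n-gram counts):

```python
from collections import Counter, defaultdict

def train_markov(tokens, order):
    """Tally (context, next-word) pairs for a word-based Markov model."""
    counts = defaultdict(Counter)
    for i in range(order, len(tokens)):
        context = tuple(tokens[i - order:i])
        counts[context][tokens[i]] += 1
    return counts

def predict(counts, context):
    """Normalize the tallies into a next-word probability distribution."""
    c = counts[tuple(context)]
    total = sum(c.values())
    return {word: n / total for word, n in c.items()}

tokens = "the cat sat on the mat and the cat ran".split()
model = train_markov(tokens, order=1)  # first-order = bigram model
print(predict(model, ["the"]))  # 'cat': 2/3, 'mat': 1/3
```

Here the context ("the",) was followed by "cat" twice and "mat" once, so the stack for a position after "the" would start with a two-thirds bar and a one-third bar.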

Note how the 5-gram visualization is a much more faded image than the 4-gram visualization, which shows that the 5-gram (fourth-order) model is a better predictor than the 4-gram model; similarly, the 4-gram model is more predictive than the 3-gram model.
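Faded versus vivid tracks the Shannon entropy of each predicted distribution: a model that concentrates probability on the right word yields low-entropy stacks that fade out, while a flat, uncertain prediction stays vivid. A sketch with illustrative distributions (not taken from the actual models):

```python
import math

def entropy_bits(dist):
    """Shannon entropy of a next-word distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

sharp = {"the": 0.9, "a": 0.1}                              # confident model
flat = {"the": 0.25, "a": 0.25, "an": 0.25, "one": 0.25}    # uncertain model
print(entropy_bits(sharp))  # ~0.469 bits: mostly faded
print(entropy_bits(flat))   # 2.0 bits: vivid
```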


Try it out yourself!

The code is available on GitHub: the statistical-model builder, the web service that serves per-word statistics, and this data visualization itself.

© 2012 Lee Butterman.