The amount of information that you get from a message depends on your model of what you’re reading.
If it’s in your native language, you can skip the last word of a sentence entirely and still grasp the gist of __.
I made five word-based Markov models of modern formal English, zeroth-order through fourth-order, from printed text published between 1800 and 2008, via the Google Books n-gram dataset.

The visualization represents each word position as a stack of colored bars, arranged in a square block. Each bar in a stack represents a word the model predicted in that context; the bar’s height starts out proportional to the probability the model assigned that word, then tweens to either 0% or 100% depending on whether it was the actual word. The actual word’s color is faded, while the colors of all the other words are vivid: a vivid image represents a more information-rich, less predictable message.
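The counting behind such a model is simple. Here is a minimal sketch (in Python; a hypothetical illustration, not the project’s actual code) of a word-level Markov model of order n−1 that estimates next-word probabilities from n-gram counts:

```python
from collections import defaultdict, Counter

class NgramModel:
    """Word-level Markov model of order n-1: predicts the next word
    from the previous n-1 words. A hypothetical sketch, not the
    project's actual implementation."""

    def __init__(self, n):
        self.n = n  # n words per n-gram: a context of n-1 words plus the prediction
        self.counts = defaultdict(Counter)

    def train(self, words):
        ctx_len = self.n - 1
        for i in range(len(words) - ctx_len):
            context = tuple(words[i:i + ctx_len])
            self.counts[context][words[i + ctx_len]] += 1

    def predict(self, context):
        """Return {word: probability} for the word following `context`."""
        ctx_len = self.n - 1
        ctx = tuple(context[len(context) - ctx_len:])  # last n-1 words; () if n == 1
        c = self.counts[ctx]
        total = sum(c.values())
        return {w: k / total for w, k in c.items()} if total else {}
```

A zeroth-order model (n = 1) ignores context and just uses overall word frequencies; a fourth-order model conditions each prediction on the previous four words, i.e. on 5-gram counts.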
Note how the 5-gram visualization (the fourth-order model) is a much more faded image than the 4-gram (third-order) one, which implies that the 5-gram model is a better predictor; similarly, the 4-gram model is more predictive than the 3-gram model.
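The fading can be made quantitative: the information content of each actual word is −log₂ of the probability the model assigned it, and averaging that surprisal over all positions scores the model. Lower means more predictable, i.e. a more faded image. A minimal sketch, with made-up probabilities:

```python
import math

def mean_surprisal(probs):
    """Average information content, in bits, of the actual words,
    given the probability the model assigned to each one.
    Lower means more predictable: a more faded image."""
    return sum(-math.log2(p) for p in probs) / len(probs)

# hypothetical per-position probabilities assigned to the actual words
print(mean_surprisal([0.5, 0.25, 0.125]))  # 2.0 bits per word
```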
Try it out yourself!
Code is available on GitHub for building the statistical model, the web service that serves word statistics, and this data visualization.
© 2012 Lee Butterman.