Generating Dickensian Gibberish

Described by Billy Pilgrim in Slaughterhouse-Five (Kurt Vonnegut, 1969) as two-foot-high green toilet plungers with a hand on top and a single green eye in the palm, the Tralfamadorians are friendly alien creatures that can see in four dimensions. Billy recounts (based on his observations while caged in a Tralfamadorian zoo):

…they could see in four dimensions. They pitied Earthlings for being able to see only three. They had many wonderful things to teach Earthlings about time.
Because future and past are the same to them, they are not greatly concerned by events (including the fact that they are responsible for the eventual destruction of the universe!), and tend to simply respond with:
So it goes.

Now suppose the only aid the Tralfamadorians have for learning to communicate with the people of Earth is a copy of Great Expectations by Charles Dickens, which somehow came floating through space and landed on their planet. Let's follow the progress of their mathematicians as they analyse this wonderful book in an effort to understand how we on Earth communicate.

Studying Letters

  • The Tralfamadorian mathematicians quickly realised that this Earth language was written with an alphabet, so after identifying the letter symbols, they began constructing random sentences using these letters and a space, with every symbol equally likely.

    These results were not particularly encouraging: the sentences produced did not display any real similarity to the Dickens text, and would thus be unlikely to be of any use in communicating on Earth.

  • They continued their analysis, and recognised that the letter symbols appeared with different frequencies in the text. For example, an 'e' appeared 9.7% of the time (that's very nearly one in every ten symbols), more often than the 2.4% for an 'm', and much more often than the tiny 0.1% for an 'x' (or once in every 1000 symbols). The complete table of letter frequencies is as follows.

    space  19.6% a  6.6% b  1.3%
    c  1.8% d  3.9% e  9.7%
    f  1.7% g  1.7% h  5.1%
    i  5.8% j  0.18% k  0.8%
    l  2.9% m  2.4% n  5.6%
    o  6.3% p  1.4% q  0.07%
    r  4.3% s  4.8% t  7.3%
    u  2.3% v  0.71% w  2.1%
    x  0.1% y  1.7% z  0.02%
    Given this, they again generated text by choosing letters randomly, but weighting the choices according to the observed letter frequencies.

    This improved the appearance of the text, but only very rarely did it generate words that matched the original.

  • Next they started to make their letter choices conditional on the previous letter, noting that, for example, a 't' is followed by an 'h' nearly 30% of the time, and is never followed by a 'g'. These pairs are known as bigrams, and the following table shows the most common of the 516 bigrams present in the text.
    th  2.2% he  2.1% in  1.5%
    an  1.4% er  1.4% nd  1.1%

    Working this way, very few genuine "words" are generated, but the structure of the generated text is definitely converging on that of the original.

  • Continuing along this path, they made their letter choices conditional on the previous two letters, thus generating trigrams. These make use of statistics such as the fact that the pair of letters 'cl' is only ever followed by the letters 'a', 'e', 'i', 'o', 'u' and 'y', with 'e' and 'o' being the most likely. The text contains 4623 distinct trigrams, and the following table shows the most common.
    the  1.22% and  0.86% ing  0.64%
    her  0.42% tha  0.36% you  0.32%
    (Notice how the percentages are falling as we go from single letters, to bigrams, to trigrams, since the number of possible combinations increases.)

    Genuine words are now beginning to appear, and it seems that the Tralfamadorians are really onto something — something that can unlock the structure of this strange and alien language called "English".

  • Moving on to quadgrams (where the previous three letters are used to determine the probability of the fourth) and quintgrams (where the previous four letters are used), this weighted probability approach generates many valid words.

    By this stage the Tralfamadorian mathematicians could generate text that matched quite well with the original, and so completed their study of letter frequency. They next turned their analytical attention to the statistical properties of complete words.
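
The letter-level steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used to produce the figures in this article: the names `clean`, `build_model` and `generate` are hypothetical, the text is assumed to be reduced to lowercase letters and single spaces, and a short quotation stands in for the full Dickens text. An order of 0 gives frequency-weighted letters, 1 conditions on the previous letter (bigrams), 2 on the previous two letters (trigrams), and so on.

```python
import random
from collections import Counter, defaultdict

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus the space


def clean(text):
    """Lowercase the text and keep only letters and single spaces (assumed convention)."""
    kept = "".join(ch if ch in ALPHABET else " " for ch in text.lower())
    return " ".join(kept.split())


def build_model(text, order):
    """Map each context of `order` letters to a Counter of the letters that follow it."""
    model = defaultdict(Counter)
    for i in range(order, len(text)):
        model[text[i - order:i]][text[i]] += 1
    return model


def generate(model, order, length, rng=random):
    """Generate `length` symbols, each weighted by what followed the current context."""
    out = rng.choice(list(model))  # start from a context seen in the text
    while len(out) < length:
        followers = model.get(out[len(out) - order:])
        if not followers:
            # Dead end: this context only occurred at the very end of the text,
            # so borrow the followers of a randomly chosen known context.
            followers = model[rng.choice(list(model))]
        letters = list(followers)
        weights = list(followers.values())
        out += rng.choices(letters, weights=weights)[0]
    return out[:length]


# A short stand-in for the full text; real statistics need far more data.
sample = clean(
    "They could see in four dimensions. They pitied Earthlings for being able "
    "to see only three. They had many wonderful things to teach Earthlings about time."
)
for order in (0, 1, 2):
    print(order, generate(build_model(sample, order), order, 60))
```

With such a small sample the higher orders mostly reproduce fragments of the input, which is one reason a long text is needed before the statistics become interesting.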

Studying Words

Just like letters, different words occur with different frequencies. Of the almost 200 000 words in the Dickens text, 586 distinct words appear more than 100 times each, whereas 4683 appear only once. Most common is the word 'the', appearing more than 8000 times (or approximately 4% of all words). The following table shows the number of appearances for the 12 most common words.

the  8145 and  7098 to  5157
of  4438 a  4049 in  3028
that  2987 was  2836 it  2671
he  2208 you  2185 my  2070
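
Counts like those in the table can be obtained with a simple tally. The sketch below is illustrative only: `word_counts` is a hypothetical helper, the rule that a word is a lowercase run of letters or apostrophes is an assumption, and the exact figures above depend on how the original analysis split the text into words. A short quotation stands in for the full text.

```python
import re
from collections import Counter


def word_counts(text):
    """Tally words; a 'word' here is a lowercase run of letters or apostrophes (assumed rule)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


sample = (
    "They pitied Earthlings for being able to see only three. "
    "They had many wonderful things to teach Earthlings about time."
)
counts = word_counts(sample)
total = sum(counts.values())                       # number of words in the sample
once = sum(1 for c in counts.values() if c == 1)   # words that appear exactly once

for word, n in counts.most_common(3):
    print(f"{word:12s} {n:3d}  {100 * n / total:.1f}%")
print("words appearing only once:", once)
```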

With this in mind, the Tralfamadorian mathematicians turned their attention to how meaning emerges from the arrangement of words.

  • First they generated text by choosing words randomly, but weighting according to the observed word frequencies, and checking the resulting sentences for any emergent meaning.

    Just as generating words from letter frequencies was insufficient for reproducing the appearance of real text, a string of random words does not read at all like a sentence.

  • However, just as for letters, some word combinations are far more likely to appear than others. For example, 'I' was followed by 'am' over 200 times, by 'was' over 500 times, and by 'had' over 600 times. Other very common pairs were 'had been', 'have been', 'in the', 'it was' and 'that I'. So next they generated random sentences according to the likelihood of word pairs.

    While this still results in very strange sentences, there is definitely some coherence, and it is often possible to assign meaning to reasonably long sections.

  • The last step was to look at groups of 3 words, such as 'I could not', 'that I was', 'it was a', 'I should have' etc., and use the probabilities of these triplets to generate sentences.

    Surprisingly, this frequently produces sentences with a coherent meaning that extends over much more than 3 words.
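
The word-level steps above follow the same pattern as the letter-level ones, with words in place of letters. The following is a minimal sketch under stated assumptions: `build_word_model` and `generate_words` are hypothetical names, and the toy sentence is invented for the example rather than taken from Dickens. An order of 0 corresponds to frequency-weighted single words, 1 to word pairs, and 2 to word triplets.

```python
import random
from collections import Counter, defaultdict


def build_word_model(words, order):
    """Map each tuple of `order` consecutive words to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for i in range(order, len(words)):
        model[tuple(words[i - order:i])][words[i]] += 1
    return model


def generate_words(model, order, count, rng=random):
    """Generate `count` words, each weighted by what followed the preceding `order` words."""
    out = list(rng.choice(list(model)))  # start from a context seen in the text
    while len(out) < count:
        followers = model.get(tuple(out[len(out) - order:]))
        if not followers:
            # Dead end: this context only occurred at the very end of the text,
            # so borrow the followers of a randomly chosen known context.
            followers = model[rng.choice(list(model))]
        candidates = list(followers)
        weights = list(followers.values())
        out.append(rng.choices(candidates, weights=weights)[0])
    return " ".join(out[:count])


# An invented toy sentence; a real experiment would use the whole book.
toy = "it was a cold night and i was not sure that i was awake and it was a long night".split()
for order in (0, 1, 2):
    print(order, generate_words(build_word_model(toy, order), order, 12))
```

Because each word is chosen using only the preceding one or two words, any coherence longer than that window arises from overlapping contexts rather than from any built-in notion of grammar.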

Things to think about…

  • The progression of this analysis illustrates the effect of the important statistical concept of weighting. How might you use dice to obtain weighted choices?
  • To what extent do the generated sentences have the same statistical properties as the full text?
  • What kind of problems may arise from using just a single text? Can you find examples in the generated sentences?
  • When we generate sentences using word pairs and word triplets, small parts of the sentence become coherent and meaning begins to emerge. On average, how long (in terms of words) do you expect these coherent sections to be in each case? Generate some sample sentences, count and tabulate the lengths of the coherent sub-sentences, and calculate the average. What do you find? Try to explain your results.