Harry Potter and the Markov chains
There is a funny thing called a Markov chain. In scientific terms, it is a sequence of random events where the probability of the next event depends only on a certain number of previous events. Markov chains do a lot of different jobs, from predicting stock prices on the exchange to predicting which word you are going to type next on your smartphone’s virtual keyboard. They are also used in the analysis of DNA sequences, data compression, cartography and spell-checking, and in recognition and generation of speech, music and texts on the computer.
Markov chains are used in many Artificial Intelligence and Machine Learning algorithms, but the idea behind them is very simple.
Let’s take, for example, the text of a book. All the words in the text follow one another. We may say that the appearance of each next word in the text is an event. Each word appears in the text with a certain probability, and this probability depends on the previous words. For example, in the book Harry Potter and the Philosopher’s Stone the probability that the word “Potter” appears after the word “Harry” is much higher than for any other word. We can analyze the text of the book and count how often each word appears after every other word. Then we can write this data into a table and use this table to generate a new text in the style of the original book which was used to build the table.
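To make the idea concrete, here is a tiny sketch in Python of how such a table of counts could be built. It is only an illustration, not the code of the program described later, and the short string stands in for a whole book:

from collections import Counter, defaultdict

text = "Harry Potter said Harry Potter was a wizard said Ron"

words = text.split()                       # naive split on whitespace
table = defaultdict(Counter)               # table[word][next_word] -> count
for current, following in zip(words, words[1:]):
    table[current][following] += 1

print(table["Harry"])                      # Counter({'Potter': 2})
print(table["said"])                       # Counter({'Harry': 1, 'Ron': 1})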
For example, if we do it for the text above, we get this table:
This is a small part of the table. The whole table has 241 rows.
If we build the table for the whole book Harry Potter and the Philosopher’s Stone, it will look like this:
This is the part where the first word is “Harry”. The whole table has 42,708 rows.
To generate a new text, we take a random first word from the table. Then we find the most probable second word based on the number of repetitions. Then we use this second word as the first one and find the most probable word that follows it. We may repeat this cycle until we get a sentence of the needed length.
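Here is a small Python sketch of that generation loop, again just an illustration with a hand-made toy table rather than the actual program:

import random
from collections import Counter

# a tiny hand-made table so the example runs on its own
table = {
    "Harry": Counter({"Potter": 10, "said": 4}),
    "Potter": Counter({"was": 3, "said": 2}),
    "said": Counter({"Harry": 5, "Ron": 3}),
    "was": Counter({"a": 6}),
}

def generate(table, length=10):
    word = random.choice(list(table))          # random first word
    sentence = [word]
    for _ in range(length - 1):
        followers = table.get(word)
        if not followers:                      # this word never had a successor
            break
        word = followers.most_common(1)[0][0]  # the most probable next word
        sentence.append(word)
    return " ".join(sentence)

print(generate(table))

Note that always taking the single most probable word makes the output repetitive: once any word repeats, the chain loops forever. In practice the next word is usually sampled in proportion to its count instead, which gives more varied text.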
Of course, such a text will not have an overall meaning. The sentences will not be linked to each other, but the style of the text will be similar to the style of the author of the original text. A particular sentence can even make sense on its own.
This algorithm is the basis of computer programs which detect plagiarism or simply find similar texts or music. When YouTube blocks your video because it finds copyrighted content in it, it uses a similar algorithm, but instead of words it works with sounds and video frames.
Virtual keyboards on smartphones, which suggest the most probable words you may want to type, do it based on the same tables of word probabilities for the selected language and the texts you typed earlier. For example, after analyzing the text you typed earlier, your phone may suggest the word “am” after “I” because you typed “I am” most often in the past, and after “I am” it suggests the words “a”, “going” or “so” because they are the most probable too. But it never suggests “apple” after “I am” because you never typed it. Try typing “I am apple” many times, though, and if your virtual keyboard is smart enough, someday it will start suggesting “apple” after your “I am”.
The more previous words we use for prediction, the better the result. Sequences of previous words used for prediction are called N-grams. If we count probabilities between two words only, these pairs of words are called bigrams; if we count probabilities between three words, their sequences are called trigrams. Generally, the bigger N, the better the prediction. For text generation it means that the bigger N is, the more beautiful the generated text becomes and the more it resembles the original text which was used for teaching the program.
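The counting sketch above generalizes naturally: instead of a single word, the key of the table becomes the tuple of the previous N-1 words. A rough Python illustration, once more not the real program:

from collections import Counter, defaultdict

def build_ngram_table(words, n):
    table = defaultdict(Counter)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])    # the previous N-1 words
        table[context][words[i + n - 1]] += 1  # the word that follows them
    return table

words = "in the beginning God created the heaven and the earth".split()
trigrams = build_ngram_table(words, n=3)
print(trigrams[("in", "the")])                 # Counter({'beginning': 1})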
In general, if this computer program analyzes a book about Harry Potter, it will generate text in the style of J. K. Rowling; if it analyzes the books of J. R. R. Tolkien, it will generate texts about Frodo Baggins; if it analyzes the Bible, it will generate epic texts about Jesus Christ and the Jews. But if we teach the program on all these books together, it will produce texts about Frodo studying at Hogwarts under the guidance of Jesus Christ and about Harry Potter fighting against the army of Sauron in the Jewish lands. :)
Ever since I read about Markov chains, I have wanted to see what text I would get if I used 6-grams or more and taught the program on my favorite books. You can find a lot of programs on the Internet which generate texts based on bigrams, because the algorithm is very simple and such a program is easy to write. But if you want to play with N-grams where N is significantly bigger than 2, it is very difficult to find a free, handy, open-source program. At least I was unable to find a single one.
The main problem with programs which can work with big N-grams is the lack of memory for keeping all the possible probabilities between all the words in the text. For example, the Bible has about 52,000 unique words. To store the probabilities for all pairs of these words in memory we need 52,000 * 52,000 * 8 = 21,632,000,000 bytes, that is, about 21 gigabytes; for trigrams we need 52,000 * 52,000 * 52,000 * 8 = 1,124,864 gigabytes! Obviously, no personal computer has that much memory.
So yesterday I finally wrote such a program. It stores only the probabilities which are bigger than zero, so it needs much less memory and allows working with N-grams where N is significantly bigger than 2 or even 6.
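Here is a rough back-of-the-envelope check of the numbers above, assuming 8 bytes per cell of a dense table, and the reason a sparse table fits into memory:

unique_words = 52_000

dense_bigram  = unique_words ** 2 * 8   # 21,632,000,000 bytes, about 21 GB
dense_trigram = unique_words ** 3 * 8   # about 1,124,864 gigabytes

# A sparse table (like the dict of Counters in the sketches above) stores an
# entry only for the word pairs that actually occur in the text. The number of
# distinct pairs cannot exceed the number of words in the text, so it is
# nowhere near 52,000 * 52,000.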
Today I played a little with the program.
First, I set N = 2 and fed the program the book Harry Potter and the Philosopher’s Stone.
Here are several sentences I got:
“Harry pleased with a bit of brandy mixed up to create a goal posts and I don’t see you think.”
“Aunt Petunia had any of keeping her hand almost at Hogwarts before pierced his cloak quickly found out everything to the large chocolate cake and Ron went off the Stone before his head toward Ron said Hermione stayed awake.”
“What came out of his shoulder to have been happening down their minds work of something like Hagrid chuckled and defenseless to lend us neither of red boils started to worry yet dad were three heads at Hogwarts Express eleven.”
It doesn’t look too beautiful, but it’s OK for such a small N.
Then I set N = 6 and fed the program the same book. This little smart-ass memorized the book completely and quoted whole phrases, but didn’t want to invent anything original.
I stuck together The Silmarillion and The Bible and got these sentences:
“Where the Esgalduin flowed down and parted Neldoreth from Region there rose in the midst of the teachers both listening to them and asking them questions.”
“All that was in the city both man and woman both young and old and ox and sheep and donkey with the edge of the sword destroying it utterly and all that is therein and the cattle of it shall you take for a prey to yourself.”
“Only be strong and very courageous to observe to do according to all that they shall teach you according to the tenor of the law which they shall teach you and according to the judgment which they shall tell you you shall do.”
“Hear my voice according to your loving kindness so I will obey the statutes of your mouth.”
“When he was come into the house behold the child was dead and laid on his bed.”
These sentences look better but contain long quotations from the original texts. It means that the program doesn’t extract a general idea and doesn’t generalize its knowledge, but simply memorizes the given texts completely. In Artificial Intelligence terms this is called “overfitting”; artificial neural networks have the same problem. If their “brain” is too big, they prefer not to generalize the input information and find new knowledge, but to memorize the whole input instead.
But the last two sentences are excellent. They look very natural, they make sense, and they are completely original, with no quotations from the source texts. It means that we need to increase the amount of text used for teaching the program, or decrease N to make the program more stupid, so that it is unable to memorize all the teaching texts. But decreasing N makes the generated sentences ugly, as we saw with the bigrams above.
Finally, I joined together all seven books about Harry Potter, The Hobbit, The Silmarillion, The Lord of the Rings and The Bible, and tried to teach the program with N = 6. Here is the result:
“With that he waved his wand vanishing the hoops and walked out of the room unseen by both Hermione and Bathilda he slipped the silver framed photograph of the unknown thief inside his jacket.”
“But the next morning they got up early for as the weather was still rainy, they wished to reach the Shire before night and it was a long time after Charlies snores filled the tent that Harry finally dozed off.”
“George made to shut the door to drown the noise but before he could do anything else Crookshanks had vanished with one swish of his long ginger tail.”
“They had ridden for some four hours from the branching of the roads when they drew near to the evil region of Nan Dungortheb the riders became enmeshed in shadows and Aredhel strayed from her companions and was lost.”
“Honor your father and mother which is the first commandment with a promise that it may be well with you and show kindness please to me and make mention of me to Pharaoh and bring me out of this house.”
“Then was the iron the clay the brass the silver and the gold and the spices and the precious oil and all the house of Israel all of them are they to whom the inhabitants of Jerusalem have said get you far from Yahweh.”
The number of completely original sentences without long quotations from the teaching texts increased significantly, and they look and read better. None of the sentences above contains a quotation longer than a few words. But there are still many sentences quoted verbatim from the teaching texts. I guess that with several gigabytes of good teaching texts the program would produce only completely original sentences, and then it would make sense to increase the order of the N-grams to improve their quality.
It seems this will be a good toy for further experiments. :)
You can play with this program too, or help to add something new to it.
Repository with the source code is here: https://github.com/jolly-fellow/n-gram-text-gen
You can download binary executables of the program here: for Linux, for Windows.
It doesn’t require installation or configuration.
To use the program, just type its name and the keys in the command line. For example,
./textgen --N 6 --input my_learning.txt --generate 50
will use the text from the my_learning.txt file for learning and generate 50 sentences based on 6-grams.
All available keys:
--help                 show the help message
--N arg                set the N-gram order (2 by default)
--generate arg         set how many sentences to generate (10 by default)
--input arg            set the input file for learning
--matrix arg           set the output file for saving the matrix
--print_stats          print the statistics data
--print_chains         print the N-gram chains
--print_dictionary     print the dictionary