Machine Learning But Funner 01

Welcome to my first ever tutorial blog thing! I figured I would start with something nice and simple. It's not technically machine learning at all, but it's fun! (I will try to stop this gratuitous incorrect use of the word 'fun' in the future.) The aim today is to use a simple technique called Markov chains to generate random-looking text from a sample text, which in this case will be rap songs. This is the same basic idea behind simple predictive-text keyboards, and the sentences produced will be similar in nature to those produced when you spam the auto-predict button: senseless rubbish. But sometimes they can be entertaining, so screw it, let's get to the good stuff.

Markov chains work by iterating through a sample text and recording, for each short run of words, which words are seen to follow it. You end up with a hash table, or in Python a dictionary (if you're not strong with data structures, think of key and value pairs), which maps each run of words to a list of all the words that can follow it. Here's an example of what I mean:

{("These","words"): ["are"], ("words","are"): ["followed"], ("are","followed"): ["by"], ("followed","by"): ["these"]}

Note that here we're using a key length of 2. The longer the key, the more sense your sentences will make, but choose one too long and you'll find the generated text too similar to the input text.
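
To make the key length idea concrete, here's a rough sketch that builds the structure for any key length. The `build_chain` function and its `key_length` parameter are names I've made up for illustration; they aren't part of the code described later in this post.

```python
from collections import defaultdict


def build_chain(words, key_length=2):
    """Map each run of `key_length` consecutive words to the words seen right after it."""
    chain = defaultdict(list)
    # Stop early enough that every key still has a following word.
    for i in range(len(words) - key_length):
        key = tuple(words[i:i + key_length])
        chain[key].append(words[i + key_length])
    return dict(chain)


words = "These words are followed by these".split()

# Key length 2 reproduces the example structure above.
print(build_chain(words, key_length=2))

# Key length 1 keys on single words, so there is less context per key.
print(build_chain(words, key_length=1))
```

On a real corpus, shorter keys collect more candidate followers each, which is why the output gets more random as the key length drops.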

I first implemented this in Swift back when I was trying to learn the language, but today I'll be writing it in Python for practicality reasons. Oh, and Python doesn't support the capitalist-fuelled monopoly Apple enforce by requiring you to use their hardware to run their language. First things first, let's scrape some text! I don't think it's necessary for me to teach you how to use HTTP requests and parse HTML data, so I'll skim over this quickly. EDIT: I did have some code in here which scraped all the lyrics of a given artist off a given lyrics site and packed them neatly into a single txt file, but running this script got my IP address banned from said lyrics site, so I'm now obliged not to share this code, or the name of the lyrics site, with you. You're all resourceful enough to figure it out!

Once you have your "corpus.txt" or "raplyrics.txt", or whatever you may have called it, you're ready to start analysing it. (Note: the larger the data set of sample lyrics you have, the more variety you'll find in your results, so make sure you have a GOOD amount of songs, formatted with as much consistency as possible.) See a sample from my text file here:

We'll start the analysis by getting the txt file as a string, and splitting it into a list of strings such that one element in the list is one word in the txt file. That way we can iterate over each word in order to build our distribution data structure as above. To do so, I wrote a split_to_words method, along with another helper function called triples, which takes the list of words and creates a list of 3-tuples from it, i.e. ["This", "random", "sentence", "with", "words"] becomes [("This", "random", "sentence"), ("random", "sentence", "with"), ("sentence", "with", "words")].
These methods aid in the creation of the data structure we will use to do the generation. Now my code looks like this:
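
A minimal sketch of these two helpers might look like the following. The `Markov` class name and the `open_file` parameter come from the `__init__` signature mentioned at the end of this post; the method bodies and the `words` attribute are assumptions for illustration, not necessarily the exact code in the repo.

```python
import io


class Markov:
    def __init__(self, open_file):
        # Keep the file handle and read the corpus once into a flat word list.
        self.open_file = open_file
        self.words = self.split_to_words()

    def split_to_words(self):
        """Return the file's contents as a list of whitespace-separated words."""
        return self.open_file.read().split()

    def triples(self):
        """Return every run of three consecutive words as a list of 3-tuples."""
        # With fewer than three words the range is empty, so the list is too.
        return [
            (self.words[i], self.words[i + 1], self.words[i + 2])
            for i in range(len(self.words) - 2)
        ]


# io.StringIO stands in for an opened lyrics file here.
markov = Markov(io.StringIO("This random sentence with words"))
print(markov.triples())
```

Splitting on whitespace keeps punctuation attached to words (so "steeze," and "steeze" count as different words), which is one reason consistent formatting in the corpus matters.
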
Now we just need to figure out how to build the data structure I've spoken about so keenly. Based on my example structure above, it's clear that it will be a dictionary, where each key is a pair of consecutive words, and each value is a list of the words found to appear after said pair. A word is added to the list once for every time it follows the pair, so words which appear more frequently after any given pair are more likely to be chosen in the generation process. That's exactly what we want, despite the fact it may use a stupid amount of memory. A smarter approach would be to store another dictionary as the value, so that {("Some","pair"): {"always": n}} is stored, where n is the number of occurrences, instead of {("Some","pair"): ["always","always","always"...]}. But I'm less concerned with this right now, and so in the spirit of every lazy writer, I leave it as an exercise to the reader.
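
For the impatient, here's one way the counting variant could be sketched. The function names `create_counted_distribution` and `pick_next` are hypothetical, and this is just one possible take on the exercise using `collections.Counter` and `random.choices` from the standard library.

```python
import random
from collections import Counter, defaultdict


def create_counted_distribution(triples):
    """Map each (word1, word2) pair to a Counter of the words that follow it."""
    distribution = defaultdict(Counter)
    for word1, word2, word3 in triples:
        distribution[(word1, word2)][word3] += 1
    return dict(distribution)


def pick_next(distribution, pair, rng=random):
    """Pick a follower of `pair`, weighted by how often it was seen.

    Raises KeyError if the pair never appeared in the corpus.
    """
    counts = distribution[pair]
    followers = list(counts)
    weights = list(counts.values())
    return rng.choices(followers, weights=weights, k=1)[0]


triples = [("Some", "pair", "always"), ("Some", "pair", "always"), ("Some", "pair", "never")]
distribution = create_counted_distribution(triples)
print(distribution[("Some", "pair")])        # "always" counted twice, "never" once
print(pick_next(distribution, ("Some", "pair")))
```

Each distinct follower is stored once with a count, so memory grows with the number of distinct followers rather than the number of occurrences.
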
So effectively all our create_distribution method will do is iterate over every triple and add the entry {(word1,word2): [word3]} to the dictionary, unless the key (word1,word2) already exists, in which case we simply append word3 to the list already there:
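
Sketched as a standalone function, that logic might look like this. In the class it would be a method working from the triples helper; taking the triples as an argument here is an assumption made to keep the example self-contained.

```python
def create_distribution(triples):
    """Map each (word1, word2) pair to the list of words seen right after it."""
    distribution = {}
    for word1, word2, word3 in triples:
        key = (word1, word2)
        if key in distribution:
            # Pair seen before: record one more possible follower.
            distribution[key].append(word3)
        else:
            # First time we see this pair: start its list of followers.
            distribution[key] = [word3]
    return distribution


# Triples for the made-up line "the cat sat on the cat bed".
triples = [
    ("the", "cat", "sat"),
    ("cat", "sat", "on"),
    ("sat", "on", "the"),
    ("on", "the", "cat"),
    ("the", "cat", "bed"),
]
distribution = create_distribution(triples)
print(distribution[("the", "cat")])  # ['sat', 'bed']
```
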
Finally we need to write a generate_song method which will pick random key-value pairs and start rapping! This method is easy enough to understand: pick two consecutive words as seeds, and for each word we want to generate, pick a random value from the list of possible next words:
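
A rough sketch of that loop, again as a standalone function. The `length` and `rng` parameters are assumptions of mine, and so is the choice to stop early when the current pair has no recorded followers (which can happen at the very end of the corpus).

```python
import random


def generate_song(distribution, length=25, rng=random):
    """Generate up to `length` words by walking the pair -> followers dictionary."""
    # Seed with a random pair of consecutive words from the corpus.
    word1, word2 = rng.choice(list(distribution))
    generated = [word1, word2]
    while len(generated) < length:
        followers = distribution.get((word1, word2))
        if not followers:
            # Dead end: this pair was never followed by anything, so stop here.
            break
        word3 = rng.choice(followers)
        generated.append(word3)
        # Slide the two-word window forward by one word.
        word1, word2 = word2, word3
    return " ".join(generated[:length])


# A tiny hand-made distribution that loops forever, just to show the call.
distribution = {
    ("I'ma", "say"): ["ma"],
    ("say", "ma"): ["I'ma"],
    ("ma", "I'ma"): ["say"],
}
print(generate_song(distribution, length=9))
```

Because frequent followers appear in their list more often, `rng.choice` picks them proportionally more often without any extra bookkeeping.
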

There we have it! Add a call to create_distribution inside __init__(self, open_file) and you're ready to create a Markov object and start getting your own rapspiration. Find the final code on GitHub, and enjoy some interesting raps generated below:

What is he wearing?
Somebody jack that fool's steeze,
If I'm a phenom I'm assassin,
I'ma kill y'all I'ma say ma ma,
sa ma ma ma,

I hear myself moaning,
Take one more toke,
And I see his fears through her tears,
Now she's wishing we were real close,

Especially when your work is being put in,
I'm aware of what I learned,
I break bread with the big dogs,
Though I know you got something on me so sick of cause.
