Captain Markov

When I first came across bots running on Markov chains on the internet, I knew I had to make my own. I decided that a Twitter bot that generates and tweets Captain's Logs in the style of the Star Trek franchise would be pretty cool, so I spent some time learning how to build the various components I would need: a web scraper for gathering the scripts, the code required to interface with Twitter, a MarkovChain class, and the data structures and UI it would rely on.
The most interesting parts of this project were the MarkovChain class and, of course, the hilarious output. The account gained some traction when Wil Wheaton, an actor on the show, followed it and retweeted several of its posts.
First let's look at some examples, and then we'll dive into how it works.

In general, a Markov chain is a statistical model that relates a word (or character) to the probability of any other word (or character) occurring after it. The core data structure the program uses to accomplish this is a Hashtable&lt;String, Vector&lt;String&gt;&gt;.
Every line of the script is broken down and fed into the Hashtable. To increase coherence, the bot uses a chain length of 3, so a line is broken down in the following manner:
Input line: “Captain’s Log, stardate 8654.5, this is an example”
Output strings: “Captain’s Log,”, “Log, stardate”, “stardate 8654.5”, “8654.5, this”, “this is”, “is an”, “an example”
That way, at a minimum, any 3 consecutive words in the output are guaranteed to have occurred at least once in the original scripts.
When these strings are fed into the Hashtable, the result is that any two words can be used as a key to retrieve a list of the word pairs that have occurred after them. Note that because the value is a Vector (essentially a resizable array) and not a Set, duplicates will exist. The number of duplicates of an entry is what weights the probability of selecting that path: a continuation that appears twice as often in the scripts is twice as likely to be chosen. And since each choice depends only on the current word pair, the model satisfies the Markov property.
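
To make this concrete, here is a minimal sketch of the breakdown-and-insert steps in Java. The class and method names (MarkovChain, toPairs, addLine) are my illustrations here, not the project's actual code:

```java
import java.util.Hashtable;
import java.util.Vector;

// Hypothetical sketch of the training step described above.
public class MarkovChain {
    private final Hashtable<String, Vector<String>> chain = new Hashtable<>();

    // "Captain's Log, stardate 8654.5" -> ["Captain's Log,", "Log, stardate", "stardate 8654.5"]
    private static Vector<String> toPairs(String line) {
        String[] words = line.split("\\s+");
        Vector<String> pairs = new Vector<>();
        for (int i = 0; i < words.length - 1; i++) {
            pairs.add(words[i] + " " + words[i + 1]);
        }
        return pairs;
    }

    // Key each pair to the overlapping pair that followed it. Duplicates
    // are kept on purpose: frequent continuations get chosen more often.
    public void addLine(String line) {
        Vector<String> pairs = toPairs(line);
        for (int i = 0; i < pairs.size() - 1; i++) {
            chain.computeIfAbsent(pairs.get(i), k -> new Vector<>())
                 .add(pairs.get(i + 1));
        }
    }
}
```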

If “Captain’s Log,” is used as the starting word pair, the generator selects randomly from the list of word pairs that have occurred after it. For example, the Vector for that “seed” might be {“Log, stardate”, “Log, supplemental”}.
It is easy to see how choosing each successive word according to how often it followed the preceding words in the source scripts can lead to output that is both coherent and humorous.
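
Generation can be sketched as a short loop over the same table, continuing the hypothetical MarkovChain class above. Because consecutive pairs overlap by one word, only the second word of each chosen pair is appended to the output:

```java
    // Walk the table from a seed pair, appending only the new word of
    // each randomly chosen continuation, until a dead end or the limit.
    public String generate(String seed, int maxWords) {
        StringBuilder out = new StringBuilder(seed);
        String key = seed;
        java.util.Random rng = new java.util.Random();
        for (int n = 0; n < maxWords && chain.containsKey(key); n++) {
            Vector<String> options = chain.get(key);
            String next = options.get(rng.nextInt(options.size()));
            out.append(' ').append(next.split("\\s+")[1]); // second word is new
            key = next;
        }
        return out.toString();
    }
```

A call like generate("Captain's Log,", 20) then walks the chain from that seed until it hits a dead end or reaches the word limit.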

I also built a GUI for customizing exactly how the scripts were parsed, which let me generate dialogue from specific characters, or even scene-change notes. I also experimented with a Treknobabble generator that built phrases using as few of the 10,000 most common English words as possible, but the results were less than satisfying: outputs usually just contained people's names and other “uncommon” words, rather than uniquely Star Trek technical terms.

To see more tweets, visit the bot's Twitter account.
To see the code, visit the GitHub page for the project.