I recently revisited one of my oldest real projects, Captain Markov, a Twitter bot that used a Markov chain to generate fake captain's logs from Star Trek. I learned a lot building it, but I've learned much more since then, so I decided to rewrite it in Kotlin and do some general cleanup along the way.

Kotlin, and the considerable experience I've gained these last few years, allowed me to shrink the size and complexity of the project greatly. A whole system of classes I had been using to scrape scripts was compressed down to a single short function with much cleaner and more reliable output. I also simplified the core MarkovChain class and fixed some long-standing issues with it.

But I wasn't satisfied to just rewrite an old project without doing anything new with it, so I decided to make it generate something else as well.

I had been reading a lot of SCP Foundation articles at the time, and noticed that each article is structurally very similar to every other. Additionally, the vocabulary and writing style are remarkably consistent. I believe it lends itself better to building a Markov model than Star Trek did.

So I found the SCP Foundation wiki's robots.txt, downloaded their sitemap, and scraped all the articles. I made some changes to the bot so it generates two types of paragraphs: "Containment Procedures" and a "Description". It had some amusing outputs, so I generated 15,000 fake SCP articles and uploaded them to my website.

Next, I downloaded the page of a random article, stripped it down, and added some JavaScript to fetch a randomly generated article from my server and display it to the user.

Articles 1–3000 were generated with a chain length of 4, and articles above that with a chain length of 3. A longer chain length generally means more coherent output, because longer stretches of text come directly from the original source material.

I also wrote one of the articles myself, but you will only see it if you cause my PHP code to throw an exception.

Check it out here

Understanding this post requires at least a cursory understanding of what Reddit is. If you don’t know what Reddit is, I highly recommend this video by CGP Grey.

When I realized that almost every subreddit links to subreddits about similar topics in an easily accessible part of the page, thanks to Reddit's REST API, I knew I had to find a way to map the connections between all of the subreddits on Reddit. Not because it would be useful, but because it would be cool.

So, I whipped up a web crawler in Java and used Gson to parse the JSON responses. Related subreddits are added to a queue and retrieved in FIFO order. A graph of nodes and edges is maintained until the crawl finishes, at which point it is exported to CSV and imported into a program called Gephi, which I used to build the following visualizations.
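The crawl's bookkeeping can be sketched roughly like this. The `fetchRelated` stub and its data are hypothetical stand-ins for the real HTTP + Gson call; only the queue/visited-set/edge-list structure reflects the approach described above:

```java
import java.util.*;

public class SubredditCrawler {
    // Hypothetical link data standing in for Reddit's API responses.
    static Map<String, List<String>> fetchStub = Map.of(
        "programming", List.of("java", "kotlin"),
        "java", List.of("programming", "kotlin"),
        "kotlin", List.of("java"));

    static List<String> fetchRelated(String sub) {
        return fetchStub.getOrDefault(sub, List.of());
    }

    public static List<String> crawl(String seed) {
        Queue<String> frontier = new ArrayDeque<>();   // FIFO order
        Set<String> visited = new HashSet<>();
        List<String> edges = new ArrayList<>();        // "source,target" CSV rows
        frontier.add(seed);
        visited.add(seed);
        while (!frontier.isEmpty()) {
            String sub = frontier.poll();
            for (String related : fetchRelated(sub)) {
                edges.add(sub + "," + related);        // edge row for Gephi import
                if (visited.add(related)) frontier.add(related);
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        System.out.println(String.join("\n", crawl("programming")));
    }
}
```

Each CSV row becomes one directed edge when imported into Gephi.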

There are nearly a million subreddits, each with an average of 10 connections, so I had to trim the data set down to a more manageable size. I chose to limit it to subreddits with more than 1,000 subscribers. This leaves some subreddits stranded, with no links to them from the central cluster, and as such they form a sort of Reddit "Oort cloud". Nodes are subreddits, edges are links between subreddits, and node size is determined by the number of subscribers to that subreddit. I ran an algorithm called OpenOrd to form the clusters, and used those clusters of subreddits with high mutual linkage to determine node color.

Then, I ran an expansion algorithm to spread out the densely packed clusters to make it easier to see what is going on.

Next, I hid the edges between nodes, again in the interest of clarity.

I did some poking around in Gephi to determine which topic each cluster was made up of, and labeled them in Photoshop. Probably the most interesting trend I found was that gay porn subreddits tended to link to LGBTQ support group subreddits, which in turn linked to self-improvement subreddits, which explains the proximity of the porn cluster to the self-improvement cluster.

In this image I enabled node labels for subreddits with more than 10,000 subscribers. All of the images in this post are high resolution (4000×4000), so if you open them in a new tab you'll be able to zoom in very far and read the labels much more easily.

The algorithm that generates these graphs is actually a sort of physics simulation, so watching it run looks very cool. Below are a few gifs of the process in action. If they aren't loading on your device, or you would like to be able to zoom in on them, click "View full resolution".

View full resolution

View full resolution
I also had the scraper gather the creation date for the subreddits, and made an animation where the output was filtered by year, in order to display the growth of reddit over time.

View full resolution

View full resolution

The final addition to this project was a map of the power moderators of Reddit. Because of the extreme number of edges in this case, I limited the scope of the scraping and visualization to only moderators of the 49 default subreddits. Each moderator got a node, and each edge meant that those moderators shared a subreddit. The more subreddits two moderators share, the higher the edge weight; the more subscribers a moderator was in charge of, the larger their node.

The 49 default subs have a total of 2,627 moderators, with 2,673,294 connections between them. The top 10 moderators on Reddit are in charge of between 43 million and 200 million users each. Again, colored clusters represent high degrees of linkage; each one is a small group of moderators who have all added each other to their respective subreddits.
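The edge-weighting scheme described above could be computed along these lines. The moderator lists here are made up for illustration; the weight of each moderator pair is just the count of subreddits they both moderate:

```java
import java.util.*;

public class ModGraph {
    // For every pair of moderators, the edge weight is the number of
    // subreddits they share (one increment per shared subreddit).
    public static Map<String, Integer> edgeWeights(Map<String, List<String>> modsBySub) {
        Map<String, Integer> weights = new TreeMap<>();
        for (List<String> mods : modsBySub.values()) {
            List<String> sorted = new ArrayList<>(mods);
            Collections.sort(sorted);                   // canonical pair order
            for (int i = 0; i < sorted.size(); i++)
                for (int j = i + 1; j < sorted.size(); j++) {
                    String key = sorted.get(i) + "--" + sorted.get(j);
                    weights.merge(key, 1, Integer::sum);
                }
        }
        return weights;
    }

    public static void main(String[] args) {
        // Hypothetical moderator rosters, not real Reddit data.
        Map<String, List<String>> modsBySub = Map.of(
            "pics", List.of("alice", "bob"),
            "funny", List.of("alice", "bob", "carol"));
        System.out.println(edgeWeights(modsBySub));
    }
}
```

Here "alice--bob" gets weight 2 because they share both subreddits, while the other pairs get weight 1.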

As a follow-up to this project, I also downloaded Wikipedia's SQL database and parsed through it to generate a similar data set. However, with over 5 million articles, each easily containing over 50 links, the data set was too large for Gephi to handle. I was unfortunately not able to come up with a satisfactory way of filtering down which articles to include, and eventually lost interest and moved on to more interesting projects.

Source code is available here

When I first came across bots running on Markov chains on the internet, I knew I had to make my own. I decided that a Twitter bot that generates and tweets Captain's Logs like those in the Star Trek franchise would be pretty cool. So I spent some time learning how to build the various components I would need: a web scraper for gathering the scripts, the code required to interface with Twitter, a MarkovChain class, and the data structures and UI it would rely on.
The most interesting part of this project was MarkovChain, and of course the hilarious output. The account gained some traction when Wil Wheaton, an actor on the show, followed it and retweeted several of its posts.
First, let's look at some examples, and then we'll dive into how it works.

In general, a Markov Chain is a statistical model that relates a word (or character) to the probability of any other word (or character) occurring after it. The core data structure of the program used to accomplish this is a Hashtable<String, Vector<String>>.
Every line of the script is broken down and fed into the Hashtable. In order to increase coherence, the bot uses a chain length of 3. To achieve this, a line is broken down in the following manner:
Input line: “Captain’s Log, stardate 8654.5, this is an example”
Output strings: “Captain’s Log,”, “Log, stardate”, “stardate 8654.5,”, “8654.5, this”, “this is”, “is an”, “an example”
That way, at a minimum, any 3 words in order in the output are guaranteed to have occurred at least once in the original scripts.
When these strings are fed into the Hashtable, the result is that any two words can be used as a key to retrieve a list of word pairs that have occurred after them. Note that because this is a Vector (basically just an Array) and not a Set, duplicates will exist. The number of duplicates of an entry affects the probability of selecting that path, thus satisfying the Markov property.
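As a sketch of that breakdown, assuming words are split on spaces, feeding the example line into the table might look like this (the class and method names here are illustrative, not the bot's actual code):

```java
import java.util.*;

public class ChainBuilder {
    // Break a line into overlapping word pairs; each pair becomes a key
    // whose Vector collects the pair that followed it (chain length 3).
    public static Hashtable<String, Vector<String>> feed(String line) {
        Hashtable<String, Vector<String>> table = new Hashtable<>();
        String[] words = line.split(" ");
        String prev = null;
        for (int i = 0; i + 1 < words.length; i++) {
            String pair = words[i] + " " + words[i + 1];
            if (prev != null)
                table.computeIfAbsent(prev, k -> new Vector<>()).add(pair);
            prev = pair;
        }
        return table;
    }

    public static void main(String[] args) {
        Hashtable<String, Vector<String>> table =
            feed("Captain's Log, stardate 8654.5, this is an example");
        System.out.println(table.get("Captain's Log,"));
    }
}
```

Because duplicate values are simply appended to the Vector rather than deduplicated, more frequent continuations naturally become more likely to be picked.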

If “Captain’s Log,” is used as the starting word pair, it will select randomly from the list of word pairs that have occurred after that. For example, the Vector for that “seed” might be {“Log, stardate”, “Log, supplemental”}.
It is easy to see how choosing each next word based on how likely it is to follow the previous words in the source data set can lead to coherent, and often humorous, output.
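The generation step itself can be sketched like this. The tiny table is hand-built for illustration (the real one is fed from the scripts), and since consecutive pairs overlap by one word, only the last word of each chosen pair is appended:

```java
import java.util.*;

public class ChainWalk {
    public static String generate(Map<String, List<String>> table,
                                  String seed, int steps, Random rng) {
        StringBuilder out = new StringBuilder(seed);
        String current = seed;
        for (int i = 0; i < steps; i++) {
            List<String> options = table.get(current);
            if (options == null) break;                 // dead end: stop early
            // Duplicates in the list make that continuation more probable.
            String next = options.get(rng.nextInt(options.size()));
            // Pairs overlap by one word, so append only the new last word.
            out.append(next.substring(next.indexOf(' ')));
            current = next;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> table = Map.of(
            "Captain's Log,", List.of("Log, stardate", "Log, supplemental"),
            "Log, stardate", List.of("stardate 8654.5,"));
        System.out.println(generate(table, "Captain's Log,", 2, new Random()));
    }
}
```

Running this repeatedly produces different walks through the table, which is exactly where the bot's variety comes from.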

I also built a GUI for customizing exactly how the scripts were parsed, to allow myself to generate dialog from specific characters, or even scene change notes. I also experimented with a Treknobabble generator, which generated phrases using as few words from the top 10,000 most common words as possible, but the results were less than satisfying. Outputs usually just contained people's names and other "uncommon" words, rather than unique Star Trek technical words.

To see more tweets, visit the bot's Twitter account
To see the code, visit the GitHub page for the project