Map of Reddit

Understanding this post requires at least a cursory understanding of what Reddit is. If you don’t know what Reddit is, I highly recommend this video by CGP Grey.

When I realized that almost every subreddit on Reddit has links to subreddits about similar topics contained in an easily accessible part of the page thanks to Reddit’s REST API, I knew I had to find a way to make a map of the connections between all of the subreddits on Reddit. Not because it would be useful, but because it would be cool.

So, I whipped up a web crawler in Java and used gson to parse the JSON responses. The related subreddits were added to a Queue and are retrieved in FIFO order. A Graph data structure of Nodes and Edges is maintained until the crawling is done, at which point it is exported to CSV format and imported into a program called Gephi, which can be used to build the following visualizations.

There are nearly a million subreddits, each one with an average of 10 connections, so I had to trim down the data set to a more manageable size. I chose to limit it to only subreddits with more than 1,000 subscribers. This leaves some subreddits stranded, with no links to them from the central cluster, and as such they form a sort of reddit “Oort cloud”. Nodes are subreddits, edges are links between subreddits, and node size is determined by number of subscribers to that subreddit. I ran an algorithm called OpenOrd to form the clusters, and used those clusters of subreddits with high mutual linkage to determine node color.

Then, I ran an expansion algorithm to spread out the densely packed clusters to make it easier to see what is going on.

Next, I hide the edges between nodes, again in the interest of clarity.

I did some poking around in Gephi to determine which category each cluster was comprised of, and labeled them in Photoshop. Probably the most interesting trend I found while poking around was that gay porn subreddits tended to link to LGBTQ support group subreddits, which in turn linked to self improvement subreddits, which explains the proximity of the porn cluster to the self improvement cluster.

In this image I enabled node labels for subreddits with more than 10,000 subscribers. All of the images in this post are high resolution (4000×4000) so if you open them in a new tab you’ll be able to zoom in very far and read the labels much easier.

The algorithm that generates these graphs is actually a sort of physics simulation, so watching it simulate the graph looks very cool. Below are a few gifs of the process in action. If they aren’t loading on your device or you would like to be able to zoom in on them, click “View full resolution”

View full resolution

View full resolution
I also had the scraper gather the creation date for the subreddits, and made an animation where the output was filtered by year, in order to display the growth of reddit over time.

View full resolution

View full resolution

The final addition to this project was a map of the power moderators of Reddit. Because of the extreme number of edges in this case, I limited the scope of the scraping and visualization to only moderators of the 49 default subreddits. Each moderator got a node, and each edge meant that those moderators shared a subreddit. The more subreddits two moderators share, the higher the edge weight. The more subscribers that moderator was in charge of, the large their node.

The 49 default subs have a total of 2627 moderators, with 2,673,294 connections between them. The top 10 moderators on Reddit are in charge of between 43 million to 200 million users each. Again, colored clusters represent high degrees of linkage. This means that that small group of moderators have all added each other to their respective subreddits.

As a followup to this project, I also downloaded Wikipedia’s SQL database and parsed through it to generate a similar data set. However, with over 5 million articles, each easily containing over 50 links, the data set was too large to be handled by Gephi. And I was unfortunately not able to come up with a satisfactory way of filtering down which articles would be included, and eventually lost interest in the project and moved on in favor of more interesting projects.

Source code is available here