This week I finished crawling an entire year (365 days) of all submissions added to Reddit with the help of Reddit's awesome API. The following table shows some main characteristics of the dataset. One thing to keep in mind is that the dataset contains only information about each submission, not the actual comments on it.
We can see from these primary statistics that a submission gets 11.9 comments on average, which seems to be a fairly high response rate for posted content. Furthermore, users tend to up-vote submissions more than they down-vote them, which leads to a positive average score. This still holds if we subtract all the automatic up-votes by the authors of a submission. On average, a submission gets around 92.1 up-votes and 53.5 down-votes.
The following table presents the top 5 submissions of the last year by score, i.e., the number of up-votes minus the number of down-votes.
| Date | Title | Comments | Up-votes | Down-votes | Score |
|---|---|---|---|---|---|
| 2012-07-29 | The Bus Knight | 1,459 | 68,672 | 47,362 | 21,310 |
| 2012-10-16 | This guy is a reporter on Fox 2 here in Detroit. His name is Charlie LeDuff. He is fucking awesome. | 1,068 | 28,449 | 11,433 | 17,016 |
| 2012-11-02 | I live in the same valley as Adam West. I decided to look him up in the phone book today. | 912 | 60,243 | 43,817 | 16,426 |
| 2012-10-02 | Airline screwed up, a friend just posted this on Facebook. | 3,474 | 103,692 | 88,850 | 14,842 |
| 2012-08-29 | I am Barack Obama, President of the United States -- AMA | 24,293 | 240,728 | 225,972 | 14,756 |
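As a small sanity check, the score of each row above can be recomputed from its up-votes and down-votes. The sketch below is only an illustration: the tuple layout, the shortened titles, and the `score` helper are my own assumptions and not part of the crawler.

```python
# Rows taken from the top-5 table: (shortened title, comments, up_votes, down_votes).
# The titles are abbreviated here for readability.
rows = [
    ("The Bus Knight", 1459, 68672, 47362),
    ("Charlie LeDuff", 1068, 28449, 11433),
    ("Adam West", 912, 60243, 43817),
    ("Airline screwed up", 3474, 103692, 88850),
    ("Barack Obama AMA", 24293, 240728, 225972),
]


def score(up_votes, down_votes):
    """Score as defined in this post: up-votes minus down-votes."""
    return up_votes - down_votes


# Ranking by score reproduces the order of the table.
ranked = sorted(rows, key=lambda r: score(r[2], r[3]), reverse=True)

# Ranking by number of comments gives a different picture.
by_comments = sorted(rows, key=lambda r: r[1], reverse=True)

for title, comments, up, down in ranked:
    print(f"{score(up, down):>6}  {comments:>6}  {title}")
```

Sorting the same rows by comments instead of score moves the Obama AMA from last place to first, so the two measures clearly rank attention differently.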
We can see that the score alone might not be a valid measure for identifying submissions that receive very high attention: apart from the Obama AMA, the top-scoring submissions drew comparatively few comments. The number of comments, or how controversial a submission is, might reveal more interesting patterns.
As the data is a time series of submissions spanning one year, one interesting statistic to look at is how many submissions were added to Reddit per day over the course of that year.
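Counting submissions per day boils down to mapping each creation timestamp to a calendar date and tallying the dates. The following is a minimal sketch under the assumption that every submission carries a Unix timestamp in UTC; the sample values and the function name `submissions_per_day` are made up for illustration and are not the real data or the original script.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical sample: creation times as Unix seconds (UTC).
created_utc = [
    1364860800,  # 2013-04-02 00:00:00 UTC
    1364864400,  # 2013-04-02 01:00:00 UTC
    1364947200,  # 2013-04-03 00:00:00 UTC
]


def submissions_per_day(timestamps):
    """Return a Counter mapping each UTC calendar date to its number of submissions."""
    days = (datetime.fromtimestamp(ts, tz=timezone.utc).date() for ts in timestamps)
    return Counter(days)


daily = submissions_per_day(created_utc)
for day, count in sorted(daily.items()):
    print(day.isoformat(), count)
```

Note that the choice of time zone decides which day a submission is assigned to, so a different convention would shift counts around midnight.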
A first impression is that the daily number of submissions rose over the last year, which suggests the increasingly prominent role Reddit plays in today's Web. Furthermore, one day (2013-04-02, with 96,570 submissions) clearly has more submissions than any other. I found that this was due not only to the large number of April Fools' submissions (the terms "april" and "first" occur very frequently in the titles) but also to the "war" between the subreddits "periwinkle" and "orangered". This was a game initiated by Reddit, which themed the platform in a Team Fortress 2 style and made every user part of one team. The goal was to collect as much karma as possible for your team, and hence users created a large number of submissions. In the end, "orangered" won.
A final observation from the figure above is that the number of submissions varies between the days of a week. There seem to be some patterns, which we want to identify with the next plot, which averages the number of submissions for each weekday.
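Averaging per weekday means grouping the daily totals by day of the week and taking the mean of each group. A rough sketch of that step, with invented daily totals and a hypothetical helper `mean_per_weekday`:

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily totals (date -> number of submissions); not the real data.
daily_counts = {
    date(2013, 4, 1): 100,  # Monday
    date(2013, 4, 2): 140,  # Tuesday
    date(2013, 4, 6): 60,   # Saturday
    date(2013, 4, 8): 120,  # Monday
    date(2013, 4, 9): 160,  # Tuesday
}

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_per_weekday(counts):
    """Average the daily totals separately for each weekday that occurs."""
    buckets = defaultdict(list)
    for day, n in counts.items():
        # date.weekday() returns 0 for Monday through 6 for Sunday.
        buckets[WEEKDAYS[day.weekday()]].append(n)
    return {name: sum(values) / len(values) for name, values in buckets.items()}


weekday_means = mean_per_weekday(daily_counts)
print(weekday_means)
```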
Interestingly, far fewer submissions are added to Reddit on the weekend (Saturday and Sunday) than during the week, with a peak on Tuesday. A question that now arises is whether there are differences in the quality of submissions and the attention they attract. One basic way to investigate this is to look at the average number of comments, up-votes, and down-votes, and the average score, that submissions posted on specific weekdays receive. The following figures illustrate these average values per submission per weekday (do not forget to click through the figures).
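The per-submission averages can be sketched the same way, this time grouping individual submissions by the weekday they were posted on. The `Submission` record, its field names, and the sample values below are assumptions for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Submission:
    day: date        # day the submission was posted
    comments: int
    up_votes: int
    down_votes: int


# Hypothetical records; not the real data.
sample = [
    Submission(date(2013, 4, 7), 20, 150, 60),  # Sunday
    Submission(date(2013, 4, 7), 10, 90, 40),   # Sunday
    Submission(date(2013, 4, 9), 8, 70, 50),    # Tuesday
]

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def average(values):
    values = list(values)
    return sum(values) / len(values)


def weekday_averages(submissions):
    """Average comments, votes, and score per submission for each weekday."""
    groups = defaultdict(list)
    for s in submissions:
        groups[DAY_NAMES[s.day.weekday()]].append(s)
    return {
        name: {
            "comments": average(s.comments for s in subs),
            "up_votes": average(s.up_votes for s in subs),
            "down_votes": average(s.down_votes for s in subs),
            "score": average(s.up_votes - s.down_votes for s in subs),
        }
        for name, subs in groups.items()
    }


averages = weekday_averages(sample)
print(averages["Sunday"])
```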
We can see that Sunday submissions on average get the most comments, the most up-votes and down-votes, and also the highest score. Interestingly, this does not seem to be the case for Saturday submissions. These observations may indicate that Sunday submissions attract the most attention on average and may also be of better quality, but we need to look deeper into the data to investigate this hypothesis. We also do not know when the comments were posted. One possible explanation is that Reddit usage is highest from Tuesday onward, and Sunday submissions are still receiving attention then. To investigate this, however, we would need the comment data for all submissions.
The statistics presented in this blog post are just some preliminary investigations that should give basic insights into some of the dynamics of Reddit over the last year and illustrate how fast Reddit has grown. They also open up many interesting research questions for the future. My goal is to investigate the data in greater detail and also to get my hands on the comment data. I would be very interested in readers' suggestions for further investigations, which I am eager to look into.