Reddit submission dataset

Description

This Reddit dataset consists of specific metadata for all submissions posted to Reddit from the beginning of Nov. 2007 to the end of July 2013. The metadata of each submission (e.g., score) were collected around 1-2 months after the initial submission (i.e., once the submission was locked for voting), since the metadata have most likely settled by then. The dataset is available in JSON format and is zipped. Concretely, the following information is available:

Furthermore, we make our manual categorization of top-level domains into content types available. Together, these two datasets make it possible to reproduce the results presented in:
Philipp Singer, Fabian Flöck, Clemens Meinhart, Elias Zeitfogel and Markus Strohmaier,
Evolution of Reddit: From the Front Page of the Internet to a Self-referential Community?,
Web-Science Track at the 23rd International World Wide Web Conference, Seoul, South Korea, 2014 [PDF]
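
Since the dataset is distributed as zipped JSON, the following Python sketch shows one way the submissions might be read without unpacking the archive first. It rests on assumptions that are not confirmed by this description: that each file in the archive is UTF-8 text holding one JSON object per line, and that a helper named iter_submissions (a hypothetical name chosen here) is a convenient entry point. If the files instead contain a single JSON array, json.load on the whole member would be needed.

```python
import io
import json
import zipfile


def iter_submissions(zip_path):
    """Yield one dict per submission from every file in a zip archive.

    Assumption: each member is UTF-8 text with one JSON object per line.
    The real layout of the archive may differ.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                # Skip directory entries.
                continue
            with archive.open(name) as raw:
                # Decode the binary member stream line by line.
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.strip()
                    if line:
                        yield json.loads(line)
```

Because the function is a generator, submissions are decoded one at a time rather than loaded into memory all at once, which matters for an archive covering several years of posts. It accepts either a file path or an open binary file object, as zipfile.ZipFile does.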

Anonymity

We have limited the metadata in the Reddit dataset to the information necessary to reproduce our scientific results. The rest of the metadata has been removed in order to preserve the anonymity of Reddit users.

Accessibility

To access the dataset, please contact Philipp Singer (philipp.singer@gesis.org). Please include a short description of the purposes for which you want to use the dataset.

Please use the dataset for scientific purposes only and follow general ethical rules. If you publish results obtained using this dataset, please cite:

Philipp Singer, Fabian Flöck, Clemens Meinhart, Elias Zeitfogel and Markus Strohmaier,
Evolution of Reddit: From the Front Page of the Internet to a Self-referential Community?,
Web-Science Track at the 23rd International World Wide Web Conference, Seoul, South Korea, 2014 [PDF]

Acknowledgements

We sincerely thank Jason Baumgartner (aka u/stuck_in_the_matrix) for conducting the data collection and providing us with initial access to the data.