In my data science workflow, I have recently started to make heavy use of Google's BigQuery, which lets you store and query large datasets in an SQL-like style. Internally, Google uses its enormous processing power to guarantee blazing-fast queries, even complex ones that operate on huge amounts of data. A certain amount of processing is free, and beyond the free quota, Google has a reasonable pricing model.
In this blog post, I want to give an example of what can be done with this functionality. Specifically, I want to demonstrate how to calculate the readability of text based on a set of different state-of-the-art readability formulas. In general, a readability score indicates how easy it is for a reader to understand a given text. As a use case, I use the Reddit comment data available in BigQuery.
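To give an impression of what such formulas look like, here is a small Python sketch of two of them, the Flesch Reading Ease and the Flesch-Kincaid grade level. Both are standard, published formulas and only need counts of words, sentences, and syllables:

```python
def flesch_reading_ease(words, sentences, syllables):
    # Standard Flesch Reading Ease formula; higher scores mean easier text
    # (roughly 90-100 = very easy, 0-30 = very difficult).
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    # Maps the same two ratios onto a U.S. school grade level.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```

For example, a text with 100 words, 5 sentences, and 130 syllables gets a Flesch Reading Ease of about 76.6, i.e., fairly easy to read.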
An implementation can be found in a GitHub gist. The most complicated part of the code is counting the number of syllables in a text, which is crucial when one wants to calculate readability formulas. I searched for quite a while to find a reasonable algorithm that captures syllables in a balanced way, i.e., one that neither ignores too many existing syllables nor counts too many non-existing ones. In the end, I adopted a Python implementation presented in a blog post.
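The gist contains the full version; as a rough illustration of the general idea, a simplified vowel-group heuristic in Python (not the exact algorithm from the gist) could look like this:

```python
import re

def count_syllables(word):
    # Heuristic: count groups of consecutive vowels (treating "y" as a
    # vowel), then drop a trailing silent "e"; every word is assumed to
    # have at least one syllable. This is an approximation, not a
    # dictionary-accurate syllabifier.
    word = word.lower()
    vowel_groups = re.findall(r"[aeiouy]+", word)
    count = len(vowel_groups)
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1  # "make" -> 1, but keep the "-le" syllable in "syllable"
    return max(count, 1)
```

Such a heuristic gets common cases right ("make" → 1, "readability" → 5) while staying cheap enough to run over millions of comments.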
SELECT
  flesch_reading_ease,
  flesch_kincaid_grade,
  smog_index,
  gunning_fog_index
FROM features(
  SELECT body FROM [fh-bigquery:reddit_comments.2015_05]
)
LIMIT 1000
You can play around with this, e.g., try to refine the readability implementation, or run further analyses on the calculated readability scores. For instance, I correlated the readability scores with the Reddit scores of the comments; no correlation was found, though (maybe that is for the better).
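As a sketch of such a post analysis, one could compute the Pearson correlation between the readability scores and the comment scores. The helper below is a plain-Python version of the Pearson coefficient; the sample values are made up purely for illustration, not actual results from the Reddit data:

```python
def pearson(xs, ys):
    # Pearson correlation coefficient: covariance of xs and ys divided
    # by the product of their standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical example values; in practice these columns would come
# from the BigQuery query result above.
readability = [75.2, 62.1, 88.4, 54.0, 70.3]
reddit_score = [12, 3, 45, 1, 8]
print(pearson(readability, reddit_score))
```

A coefficient near zero, as I observed on the actual data, indicates no linear relationship between how readable a comment is and how it is voted on.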