UDF in Google's BigQuery: An example based on calculating text readability

In my data science workflow, I have recently started to heavily utilize Google's BigQuery which allows you to store and query large data in SQL style. Internally, Google uses their enormeous processing power in order to guarantee blazing fast queriees; even if those are complex ones that operate on huge data. There is a specific amount of operations that is free, and after exceeding the free quota, Google has a reasonable pricing model.

Due to the fast calculation of queries, it is mostly a good idea to do as much calculation inside a query, so that your post-processing data-science analyses then do not need to be concerned with specifically complex operations. Recently, Google has introduced user-defined-functions (UDF) which allow you to manually specify functions inside a query in Javascript code.

In this blog post, I want to give an example of what can be done with this functionality. In detail, I want to demonstrate how to calculate the readability of text based on a set of different state-of-the-art readability formulas. In general, a readability score of a text should give you an indicator about the ease for a reader to understand the text. As a use-case, I want to use the Reddit comment data available in BigQuery.

First of all, we want to implement the readability formulas in Javascript; for the specific layout of how to set-up a UDF for BigQuery, please consult the official documentation.

An implementation can be found in a github gist. The most complicated part in the code is the calculation of the number of syllables inside a text which is crucial when one wants to calculate readability formulas. I have searched for a longer period to find a reasonable algorithm that captures syllables in a balanced way, meaning that it does not ignore too many existing syllables and that it does not count too many non-existing syllables. In the end, I adopted a python implementation presented in a blog post.

Next, having this implementation, we can go ahead and calculate the readability of Reddit comments via the following SQL query. Don't forget to copy the javascript code to the "UDF editor" tab in BigQuery. Alternatively, you can also store the javascript code directly in a Google cloud bucket and then reference it inside BigQuery.

SELECT flesch_reading_ease,flesch_kincaid_grade,smog_index,gunning_fog_index FROM features(SELECT body FROM[fh-bigquery:reddit_comments.2015_05]) LIMIT 1000

You can play around with that, e.g., maybe try to refine the readability implementation. You can also do some post analyses based on calculated readability scores. For instance, I correlated the readability scores with the Reddit score of comments; no correlation was found though (maybe better that way).

Posted in Coding, General