Introduction to Bayesian Inference: A Coin Flipping Example

Recently, I have been involved with more teaching, and part of my teaching efforts has been to provide an introduction to Bayesian inference. Personally, I have the intuition that this is best achieved by working through a very simple example: namely, the classic coin flip example. It is also often helpful to accompany the elaborations of Bayesian statistics with the methods a frequentist would use to tackle the given problems. After all, these are the things most people are familiar with.

To that end, I have prepared slides and a jupyter notebook which can be found here:

Slides: Speakerdeck

Notebook: Nbviewer
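
As a tiny taste of what the notebook covers, the conjugate Beta-Binomial update for a single coin can be written down in a few lines. This is just a sketch with made-up counts, not the notebook code:

from scipy.stats import beta

heads, tails = 7, 3            # hypothetical observed coin flips
a_prior, b_prior = 1., 1.      # flat Beta(1, 1) prior on the coin bias
posterior = beta(a_prior + heads, b_prior + tails)  # conjugate update: Beta(1 + heads, 1 + tails)

print(posterior.mean())          # posterior mean of the bias
print(posterior.interval(0.95))  # central 95% credible interval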

If you have any comments, please let me know. I am also happy to receive help with extending the current work, e.g., by collaborating on github.

Posted in General

UDF in Google's BigQuery: An example based on calculating text readability

In my data science workflow, I have recently started to heavily utilize Google's BigQuery, which allows you to store and query large data in SQL style. Internally, Google uses its enormous processing power to guarantee blazingly fast queries, even complex ones that operate on huge data. A certain amount of operations is free, and after exceeding the free quota, Google has a reasonable pricing model.

Due to the fast execution of queries, it is usually a good idea to do as much calculation as possible inside a query, so that your post-processing data science analyses do not need to be concerned with particularly complex operations. Recently, Google has introduced user-defined functions (UDFs) which allow you to manually specify functions inside a query as JavaScript code.

In this blog post, I want to give an example of what can be done with this functionality. In detail, I want to demonstrate how to calculate the readability of text based on a set of different state-of-the-art readability formulas. In general, a readability score should give you an indication of how easy it is for a reader to understand a text. As a use case, I work with the Reddit comment data available in BigQuery.

First of all, we want to implement the readability formulas in JavaScript; for the specific layout of how to set up a UDF for BigQuery, please consult the official documentation.

An implementation can be found in a github gist. The most complicated part of the code is counting the number of syllables in a text, which is crucial for most readability formulas. I searched for quite a while to find a reasonable algorithm that captures syllables in a balanced way, meaning that it neither ignores too many existing syllables nor counts too many non-existing ones. In the end, I adopted a Python implementation presented in a blog post.
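
To give a feel for what such a formula involves, independent of the UDF context, here is a rough Python sketch of the Flesch Reading Ease score with a crude vowel-group syllable counter; the JavaScript implementation in the gist handles syllable edge cases more carefully:

import re

def count_syllables(word):
    # crude heuristic: count groups of consecutive vowels, at least one per word
    return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

def flesch_reading_ease(text):
    sentences = max(1, len(re.findall(r'[.!?]+', text)))
    words = re.findall(r"[a-zA-Z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n_words = max(1, len(words))
    # standard Flesch Reading Ease formula; higher scores mean easier text
    return 206.835 - 1.015 * (n_words / float(sentences)) - 84.6 * (syllables / float(n_words))

print(flesch_reading_ease("The cat sat on the mat. It was happy."))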

Next, having this implementation, we can calculate the readability of Reddit comments via the following SQL query. Don't forget to copy the JavaScript code into the "UDF editor" tab in BigQuery. Alternatively, you can also store the JavaScript code in a Google Cloud Storage bucket and reference it inside BigQuery.


SELECT flesch_reading_ease, flesch_kincaid_grade, smog_index, gunning_fog_index
FROM features(SELECT body FROM [fh-bigquery:reddit_comments.2015_05])
LIMIT 1000

You can play around with this, e.g., by refining the readability implementation. You can also run further analyses based on the calculated readability scores. For instance, I correlated the readability scores with the Reddit score of comments; no correlation was found though (maybe that is for the better).

Posted in Coding, General

Bayesian Correlation with PyMC

Recently, I have been getting more and more interested in Bayesian techniques and specifically, I have been researching how to approach classical statistical inference problems within the Bayesian framework. As a start, I have looked into calculating Pearson correlation.

To that end, I found great resources in the blog by Rasmus Bååth, who wrote a two-part series about how to model correlation in a Bayesian way [1,2]. A very similar model has also been proposed and discussed in [3].

My main contribution here is that I show how to apply the model with the Python library PyMC.

The main code, experiments and results can be found in the form of an IPython notebook. Next, I just want to highlight the code for the main model and exemplary results.

The core model(s) can be expressed as follows:

import pymc
from pymc import Normal, Uniform, MvNormal, Exponential
from numpy.linalg import inv, det
from numpy import log, pi, dot
import numpy as np
from scipy.special import gammaln

def _model(data, robust=False):
    # priors might be adapted here to be less flat
    mu = Normal('mu', 0, 0.000001, size=2)
    sigma = Uniform('sigma', 0, 1000, size=2)
    rho = Uniform('r', -1, 1)

    # we have a further parameter (prior) for the robust case
    if robust == True:
        nu = Exponential('nu',1/29., 1)
        # we model nu as an Exponential plus one
        @pymc.deterministic
        def nuplus(nu=nu):
            return nu + 1

    @pymc.deterministic
    def precision(sigma=sigma,rho=rho):
        ss1 = float(sigma[0] * sigma[0])
        ss2 = float(sigma[1] * sigma[1])
        rss = float(rho * sigma[0] * sigma[1])
        return inv(np.mat([[ss1, rss], [rss, ss2]]))

    if robust == True:
        # log-likelihood of multivariate t-distribution
        @pymc.stochastic(observed=True)
        def mult_t(value=data.T, mu=mu, tau=precision, nu=nuplus):
            k = float(tau.shape[0])
            res = 0
            for r in value:
                delta = r - mu
                enum1 = gammaln((nu+k)/2.) + 0.5 * log(det(tau))
                denom = (k/2.)*log(nu*pi) + gammaln(nu/2.)
                enum2 = (-(nu+k)/2.) * log (1 + (1/nu)*delta.dot(tau).dot(delta.T))
                result = enum1 + enum2 - denom
                res += result[0]
            return res[0,0]

    else:
        mult_n = MvNormal('mult_n', mu=mu, tau=precision, value=data.T, observed=True)

    return locals()

def analyze(data, robust=False, plot=True):
    model = pymc.MCMC(_model(data,robust))
    model.sample(50000,25000)

    print
    if plot:
        pymc.Matplot.plot(model)

    return model

Then, let us define some data and do inference.

import numpy as np
x = np.array([525., 300., 450., 300., 400., 500., 550., 125., 300., 400., 500., 550.])
y = np.array([250., 225., 275., 350., 325., 375., 450., 400., 500., 550., 600., 525.])
data = np.array([x, y])
model = analyze(data)

The marginal posterior of the correlation rho (together with the mcmc trace and autocorrelation) looks like the following:

[Figure: marginal posterior, MCMC trace and autocorrelation for rho]

What we can see here is that the mean of rho is around 0.13 (which is similar to what a classic Pearson correlation estimation tells us). However, when we take a look at the histogram of the marginal posterior for rho, we can see that the range of plausible values for rho is pretty wide. We can characterize this with the 95% HPD (highest posterior density) interval---also called credible interval---which is [-0.42, 0.64]. Thus, with this HPD interval, we get a much more thorough view of the distribution of rho. While the mean correlation is slightly positive, we cannot rule out a negative correlation or a non-existing correlation (rho=0).
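
These summary statistics can also be read off the fitted model directly; the following is a small sketch assuming the PyMC2 API used above, where the correlation node was named 'r':

# posterior samples of rho after burn-in
rho_trace = model.trace('r')[:]
print(rho_trace.mean())

# summary statistics per node, including the 95% HPD (credible) interval
print(model.stats()['r']['95% HPD interval'])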

For further details, examples and also an experiment with the robust model, please take a look at the IPython notebook and the references.

[1] http://www.sumsar.net/blog/2013/08/bayesian-estimation-of-correlation/
[2] http://www.sumsar.net/blog/2013/08/robust-bayesian-estimation-of-correlation/
[3] Lee, Michael D., and Eric-Jan Wagenmakers. Bayesian cognitive modeling: A practical course. Cambridge University Press, 2014.

Posted in Coding, General, Statistics

HypTrails Tutorial

At this year's World Wide Web conference, colleagues and I have published and presented the following paper:

Philipp Singer, Denis Helic, Andreas Hotho and Markus Strohmaier, HypTrails: A Bayesian Approach for Comparing Hypotheses About Human Trails on the Web, 24th International World Wide Web Conference, Florence, Italy, 2015 (Best Paper Award) 
[PDF] [arXiv] [Slides] [Tutorial] [Code]

Abstract

When users interact with the Web today, they leave sequential digital trails on a massive scale. Examples of such human trails include Web navigation, sequences of online restaurant reviews, or online music play lists. Understanding the factors that drive the production of these trails can be useful for e.g., improving underlying network structures, predicting user clicks or enhancing recommendations. In this work, we present a general approach called HypTrails for comparing a set of hypotheses about human trails on the Web, where hypotheses represent beliefs about transitions between states. Our approach utilizes Markov chain models with Bayesian inference. The main idea is to incorporate hypotheses as informative Dirichlet priors and to leverage the sensitivity of Bayes factors on the prior for comparing hypotheses with each other. For eliciting Dirichlet priors from hypotheses, we present an adaption of the so-called (trial) roulette method. We demonstrate the general mechanics and applicability of HypTrails by performing experiments with (i) synthetic trails for which we control the mechanisms that have produced them and (ii) empirical trails stemming from different domains including website navigation, business reviews and online music played. Our work expands the repertoire of methods available for studying human trails on the Web.

Tutorial

In order to better communicate the main concepts and ideas of this approach, I have prepared a tutorial in the form of an IPython notebook. The tutorial also goes into detail about Bayesian inference, Markov chain models and Dirichlet distributions, which are all concepts that HypTrails utilizes. It is based on the Python implementation of HypTrails.

I hope that this tutorial is helpful to people who are interested in the approach and its Python implementation. I am also happy to hear any feedback and suggestions for improvement.

 

Posted in General

Influential papers for my PhD studies

As I am reaching the final stage of my PhD studies, I have been reflecting on which papers have influenced my work the most and which have amazed me the most. Thus, I want to devote this short blog post to honoring the five most influential papers. Some of them might be generally valuable to others, while some are very specific to certain problem settings of my thesis. Additionally, I also want to express my gratitude to the authors of all other articles that I have read and cited throughout my PhD studies.

  • Clauset, A., Shalizi C. R., and Newman M. E. J.,
    Power-law distributions in empirical data,
    SIAM Rev., 51.4, 661-703, 2009

Even though I do not utilize any power law fitting in my final cumulative thesis, I have been using the methods proposed by Clauset et al. in various experiments throughout the course of my PhD studies. It has amazed me how the authors were able to communicate a great deal of technical detail in an intuitive way. For instance, on page 3 the article presents a general recipe for analyzing power law distributed data that -- without knowing the technical details -- already gives the reader a way to understand the problem and process. Apart from the description of how to analyze power law distributions, this article has changed my way of thinking about how to approach problem settings. For instance, the literature has frequently argued (quite shockingly) that distributions follow a power law simply because a straight line fits the log-log plot of the data. While this is indeed one common property of power law distributions, so much is missing from this simple analysis, and as Clauset et al. point out, such results cannot be trusted; one should resort to more statistically sound techniques such as the ones proposed in their paper. For example, a further question to be answered is whether other candidate distributions might fit the data better than the power law. In a nutshell, this paper has taught me that the most simplistic approach or answer often is not enough and that one has to think more deeply about what one's analysis actually means. For my thesis, this refers, e.g., to the fact that likelihood ratios of nested models are not a sufficient way of judging whether one model fits better than another. I have blogged about power law distributions in two different blog posts (Power law fitting in Python and Bayesian power law fitting) where the concepts described in this article are relevant.

  • Chierichetti, F., Kumar, R., Raghavan, P., and Sarlos, T. ,
    Are web users really markovian?,
    21st International Conference on World Wide Web, 2012

In 2012, I attended my first WWW conference in Lyon, and it has been highly influential to my studies and ways of researching. I attended the presentation of the listed paper and was instantly interested in the question of whether human navigational behavior on the Web is Markovian or whether memory effects play a more significant role than partly supposed in related work. As our research group in Graz had been studying human navigational trails for some time, I came home from the conference with the ambition to learn more about this problem. After studying the mentioned paper and related works, I decided to devote some time of my PhD studies to the issue of detecting memory effects (Markov chain orders) in human trail data. As a result, I have developed a framework that implements a set of statistical inference methods for detecting the appropriate Markov chain order given human trail data. The corresponding work and framework are now a fundamental part of my thesis; all motivated and stimulated by this article.

  • Kass, R. E., and Raftery, A. E.,
    Bayes factors,
    Journal of the american statistical association, 90(430), 773-795, 1995

The concept of Bayesian model selection and the corresponding Bayes factors has been crucial for several aspects of my thesis. Thus, this seminal work on Bayes factors has been highly valuable for my understanding of Bayesian model selection using Bayes factors as well as their advantages and disadvantages. Kass and Raftery manage to present their ideas and concepts in an intuitive way, coupled with several (partly advanced) examples of how to use Bayes factors.

  • Gabriel, K. R., and Neumann, J.,
    A Markov chain model for daily rainfall occurrence at Tel Aviv,
    Quarterly Journal of the Royal Meteorological Society 88(375), 90-95, 1962

I love reading older papers. This one was actually published more than 50 years ago. Sometimes, I feel like they are written more clearly than modern papers (even though some might disagree). This article studies the appropriate Markov chain order given rainfall data from Tel Aviv. This is similar to what I have been doing throughout my thesis in a different context. The work by Gabriel and Neumann was based on data of much smaller size and complexity though, as most of the calculations had to be done by hand. This really demonstrates the advancements we have made over the last decades in terms of computational capabilities. In detail, this paper has been valuable to me in several ways. First of all, it presents the concept of calculating log-likelihoods and likelihood ratio tests for Markov chain models in an intuitive way. Second, I have used the data (presented via transition tables) as well as the corresponding results to test the technical soundness of my methods. Also, I have used the data and results for one of my blog posts where I explain how one can determine the appropriate Markov chain order.

  • West, R., and Leskovec, J.,
    Human wayfinding in information networks,
    21st International Conference on World Wide Web, 2012

This article presents experiments on human navigational data derived from the Wikipedia game Wikispeedia. Actually, in my studies I have analyzed similar datasets (i.e., Wikigame and Wikispeedia) on several occasions. Thus, this paper has been highly relevant related work for my studies. Apart from giving me several incentives for my own work, this paper is a great example of how to present scientific work in an intuitive way.

Honorable mention:

  • C. Davidson-Pilon,
    Probabilistic Programming & Bayesian Methods for Hackers,
    Github, 2014

Apart from the list presented above, I want to mention this "book" about probabilistic programming. In detail, it is a freely available book hosted on Github that gives insights into Bayesian statistics in Python using the PyMC library. The content of this book has been really helpful to me as I have focused on Bayesian statistics on several occasions during my PhD studies. Apart from the high-quality content, the book is novel in several ways. First of all, it is produced collaboratively. Even though C. Davidson-Pilon is the main author, everyone can contribute to the project via Github. Furthermore, the book consists of actual IPython notebooks, which allows you to interactively see, use and test the Python code presented in the book.

Posted in General

Handling huge matrices in Python

Everyone who does scientific computing in Python has to handle matrices at least occasionally. The go-to library for using matrices and performing calculations on them is Numpy. However, sometimes your matrices grow so large that you can no longer store them in memory. This blog post provides insights into how to handle such a scenario.

The most prominent solution, and the one I would suggest first, is to use Scipy's sparse matrices. Scipy is a package that builds upon Numpy and provides further mechanisms like sparse matrices, which only store those elements of a matrix that are different from zero. When confronted with large matrices, it is very common that they are very sparse, and hence using sparse matrices directly can already help a lot. In addition to the benefit of lower memory usage, Scipy's calculation methods are also tuned to sparse calculations, which usually speeds up your code a lot as well (Scipy uses a lot of C code, which also helps). Scipy offers a variety of different types of sparse matrices that each have their own internal representation and benefits; for an overview please consult [1].
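
As a small illustration (with made-up dimensions and density), the following sketch shows how little memory a sparse matrix needs compared to its dense counterpart and that the dot product stays sparse-aware:

import numpy as np
from scipy.sparse import csr_matrix

n = 100000
nnz = 200000
rows = np.random.randint(0, n, size=nnz)
cols = np.random.randint(0, n, size=nnz)
vals = np.random.rand(nnz).astype('float32')
a = csr_matrix((vals, (rows, cols)), shape=(n, n))  # duplicate coordinates get summed

print(a.nnz)  # number of stored (non-zero) elements
print(a.data.nbytes + a.indices.nbytes + a.indptr.nbytes)  # a few MB, vs. ~40 GB for the dense float32 array

b = a.dot(a.T)  # sparse-aware dot product, the result is again a sparse matrix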

Alas, sometimes this is not enough. For example, I recently had to calculate the dot product of a matrix with dimensions 360,000 x 360,000 with the transpose of itself. If this matrix were very sparse, sparse matrices would already solve the problem. However, in my case it was not that sparse at all, and the final output needed more than 100GB of memory even though I used float32 as dtype -- as a side note, altering the dtype [2] is also an approach to reduce memory usage at the expense of value precision. Even the server I was using could not handle such memory usage.

So as a second approach, I suggest using an HDF5 store, e.g., in the form of PyTables. PyTables allows you to store Numpy arrays on disk, and you can then partially access the array on disk directly from your Python code. For example, instead of holding everything in memory, I would first store the array to disk and then slice rows from disk and work with them. For a detailed tutorial on PyTables please refer to [3]. I now want to give an example of how we can perform our dot product calculation and store the output in an HDF5 store.

from scipy.sparse import csr_matrix, rand
import tables as tb
a = rand(2000,2000, format='csr') #imagine that many values are stored in this matrix and that sparsity is low
b = a.T
l, m, n = a.shape[0], a.shape[1], b.shape[1]

f = tb.open_file('dot.h5', 'w')
filters = tb.Filters(complevel=5, complib='blosc')
out = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, n), filters=filters)

bl = 1000 #this is the number of rows we calculate in each loop iteration
#this may not be the most efficient value
#look into buffersize usage in PyTables and adapt the buffersize of the
#carray accordingly to specifically improve fetching performance

b = b.tocsc() #we slice b on columns, csc improves performance

#this can also be changed to slice on rows instead of columns
for i in range(0, l, bl):
  out[:,i:min(i+bl, l)] = (a.dot(b[:,i:min(i+bl, l)])).toarray()

f.close()

When our calculation is finished we can quite easily access the data:

h5 = tb.open_file('dot.h5', 'r')
a = h5.root.data
row = a[0,:] #only one row gets loaded into memory
print row

In this case we only slice one row of the HDF5-stored matrix and hence, only this single row gets loaded into memory. If we want to perform further calculations on this matrix, we could either load it completely into memory (which defeats the purpose, as we want to circumvent memory issues), or perform the calculations on smaller slices and then piece the results together (again in an HDF5 store or a numpy/sparse matrix).
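
As a small illustration of this "slice, compute, piece together" pattern, the following sketch (assuming the dot.h5 file created above) computes row sums without ever loading the full matrix:

import numpy as np
import tables as tb

h5 = tb.open_file('dot.h5', 'r')
a = h5.root.data
row_sums = np.zeros(a.shape[0], dtype=np.float32)

bl = 1000
for i in range(0, a.shape[0], bl):
  block = a[i:min(i + bl, a.shape[0]), :] #only this block of rows is loaded into memory
  row_sums[i:i + block.shape[0]] = block.sum(axis=1)

h5.close()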

One downside is that PyTables cannot work with sparse matrices directly, which is why we have to use toarray() in order to densify the sliced calculation and store it in the HDF5 CArray. However, there is also a way to work with sparse matrices in PyTables by using EArrays (which have no pre-defined shape) and storing the data and indices of the sparse matrix:

f = tb.open_file('dot.h5', 'w')
filters = tb.Filters(complevel=5, complib='blosc')
out_data = f.create_earray(f.root, 'data', tb.Float32Atom(), shape=(0,), filters=filters)
out_ri = f.create_earray(f.root, 'ri', tb.Float32Atom(),shape=(0,), filters=filters)
out_ci = f.create_earray(f.root, 'ci', tb.Float32Atom(), shape=(0,), filters=filters)
for i in range(0, l, bl):
  res = a.dot(b[:,i:min(i+bl, l)])
  vals = res.data
  ri, ci = res.nonzero()
  out_data.append(vals)
  out_ri.append(ri)
  out_ci.append(ci + i) #the column indices are local to the slice, so offset them by the block start

Finally, we can rebuild the sparse matrix in the following way:

h5 = tb.open_file('dot.h5', 'r')
a = csr_matrix((h5.root.data[:], (h5.root.ri[:], h5.root.ci[:])), shape=(l,n))

We need to remember or store the shape parameters though, as you can see in the above example. Still, the problem remains if we cannot store the final sparse matrix in memory. Hence, it is more elegant to store the column indices together with the indptr in the following way:

import numpy as np
f = tb.open_file('dot.h5', 'w')
filters = tb.Filters(complevel=5, complib='blosc')
out_data = f.create_earray(f.root, 'data', tb.Float32Atom(), shape=(0,), filters=filters)
out_indices = f.create_earray(f.root, 'indices', tb.Int32Atom(),shape=(0,), filters=filters)
out_indptr = f.create_earray(f.root, 'indptr', tb.Int32Atom(), shape=(0,), filters=filters)
out_indptr.append(np.array([0])) #this is needed as a first indptr
max_indptr = 0
for i in range(0, l, bl):
  res = a[i:min(i+bl, l),:].dot(b)
  out_data.append(res.data)
  indices = res.indices
  indptr = res.indptr
  out_indices.append(indices)
  out_indptr.append(max_indptr+indptr[1:])
  max_indptr += indices.shape[0] #keep a running total of stored elements so the indptr offsets stay cumulative

Again, we can generate the complete sparse matrix in the following way:

h5 = tb.open_file('dot.h5', 'r')
a = csr_matrix((h5.root.data[:], h5.root.indices[:], h5.root.indptr[:]), shape=(l,n))

Now, the beauty is that we can also slice the matrix without ever using dense Numpy arrays:

#let's get the row with index 3
a = h5.root
b = csr_matrix((a.data[a.indptr[3]:a.indptr[3+1]], a.indices[a.indptr[3]:a.indptr[3+1]], np.array([0,len(a.indices[a.indptr[3]:a.indptr[3+1]])])), shape=(1,n))
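
A small helper that generalizes this slicing to an arbitrary row index might look as follows; this is just a sketch reusing the imports and the column count n from above:

def get_row(root, i, n):
  #return row i of the HDF5-backed CSR matrix as a 1 x n sparse matrix
  start, stop = root.indptr[i], root.indptr[i+1]
  return csr_matrix((root.data[start:stop], root.indices[start:stop],
                     np.array([0, stop - start])), shape=(1, n))

row = get_row(h5.root, 3, n)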

Please let me know if you have any questions or find any mistakes. I am also happy to hear about further solutions that aim to tackle this issue.

[1] http://docs.scipy.org/doc/scipy/reference/sparse.html
[2] http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
[3] http://pytables.github.io/usersguide/

Posted in General

Determining Power Law parameter(s) using Bayesian modeling with PyMC

In a previous post I talked about fitting the power law function to empirical data. Recently, I have become highly interested in Bayesian modeling and probabilistic programming. I am currently re-reading the excellent, freely available book "Probabilistic Programming and Bayesian Methods for Hackers" which provides a thorough tutorial on using the PyMC Python library. In order to get familiar with the framework, I had the idea of modeling the parameters of a power law function given empirical data using Bayesian inference.

I make the code and explanations available via an IPython notebook.

The core model is defined as follows:

import numpy as np
import pymc as mc
from scipy.special import zeta

def _model(data, discrete=True, xmin=1.):
    alpha = mc.Exponential('alpha', 1. / 1.5)

    @mc.stochastic(observed=True)
    def custom_stochastic(value=data, alpha=alpha, xmin=xmin, discrete=discrete):
        value = value[value >= xmin]
        if discrete == True:
            return np.sum(np.log(value**-alpha / zeta(alpha,xmin)))
        else:
            return np.sum(np.log((alpha-1) * xmin**(alpha-1) * value**-alpha))

    return locals()

This can then be analyzed via:

def analyze(data, discrete=True, xmin=1.):
    model = mc.MCMC(_model(data,discrete,xmin)) 
    model.sample(5000) 

    print
    print(model.stats()['alpha']['mean'])

    mc.Matplot.plot(model) 
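
As a quick sanity check, one can feed the model synthetic data; a small sketch (assuming the functions above) using numpy's Zipf sampler, which draws from a discrete power law, so the posterior mean of alpha should end up close to the chosen exponent:

import numpy as np

np.random.seed(42)
data = np.random.zipf(2.4, size=5000).astype(float) #synthetic discrete power-law samples with exponent 2.4
analyze(data, discrete=True, xmin=1.)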

For detailed results and plots, please refer to the notebook.

Next, I want to incorporate the process of also determining the appropriate x_min value. If anyone has a hint or idea of how to solve this, please let me know. Furthermore, I want to analyze the goodness-of-fit and also compare the power law distribution to other candidate distributions like the Weibull distribution. This is a crucial step needed for gaining insights into the underlying processes of the data. I previously blogged about this issue.

Posted in General

Statistical test for randomness in categorical data sequences

Previously, I have worked a lot with sequences consisting of categorical data, i.e., sequences of categories where the set of categories is finite. As a prerequisite for my further modeling of such data, I was interested in first applying a statistical test that gives me an idea of whether the data has been produced in a random fashion or based on some regularities.

My first idea was to use autocorrelation and subsequently apply something like a Ljung-Box test, which could give me insights into the overall randomness of a sequence. However, autocorrelation builds upon Pearson's product-moment correlation coefficient for correlating lagged sequences (time series). Hence, this approach was not directly applicable to my data as my variables (categories) have no ordinal level of measurement -- e.g., category 5 cannot be interpreted as being "higher" than category 2.

Through my search for a suitable method [1,2] (thanks a lot to the helpful stackexchange users) I stumbled upon the so-called runs test, also called the Wald–Wolfowitz test [3,4]. This is a non-parametric test in which the null hypothesis stating that the sequence was produced in a random fashion is tested against the alternative hypothesis stating the opposite -- i.e., that it was not produced randomly. In detail, the null hypothesis gets rejected in case of a systematic or clustered arrangement of elements in a sequence. The test is called runs test as it works with runs in a series -- e.g., the sequence "ABBC" has the three runs "A", "BB" and "C". Nonetheless, this test is only designed to work with dichotomous (binary) observations, which is not the case for my sequences.
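
For reference, the classic dichotomous version is easy to write down; the following is a minimal Python sketch of the Wald–Wolfowitz runs test using the normal approximation (this is not my multi-categorical implementation, just an illustration):

import numpy as np
from itertools import groupby
from scipy.stats import norm

def wald_wolfowitz(seq):
    #two-sided runs test for a dichotomous sequence (normal approximation)
    values = list(set(seq))
    n1 = sum(1 for x in seq if x == values[0])
    n2 = len(seq) - n1
    runs = len(list(groupby(seq))) #number of runs, e.g. "ABBC" has 3 runs
    mean = 2. * n1 * n2 / (n1 + n2) + 1
    var = (2. * n1 * n2 * (2. * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1.))
    z = (runs - mean) / np.sqrt(var)
    return z, 2 * (1 - norm.cdf(abs(z)))

#a perfectly alternating sequence is "too regular" and gets flagged as non-random
print(wald_wolfowitz(list("ABABABABAB")))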

So what now? If this were the end, this blog post would be pretty useless, right? Of course, there is a solution. In 1985, O'Brien and Dyck developed an adaptation of the runs test that works with a linear combination of the weighted variances of run lengths. This can be extended to also work with multi-categorical sequences (an example can be found in [2]). The statistic is evaluated by looking at the right tail of the chi-square distribution.

As there was no implementation of this method available, I decided to quickly implement it in Python. The code is available online [6]. The input can be any multi-categorical sequence. For example, inputting the sequence ["S", "S", "S", "F", "S", "F", "F", "F", "F"] results in a p-value of 0.12, which would lead to not rejecting the null hypothesis of randomness (depending on the significance level, of course). I also recalculated the example provided by O'Brien [5] in Section 2 and ended up with the same p-value of 0.086, which would reject the null hypothesis.

One final thing I have to note though is that the method is not defined for several cases. For example, there have to be more than one run length for an element, more than one run, and the number of occurrences minus the number of runs of an element has to exceed one. For details about these limitations please refer to the code or to [5]. Hence, I would only recommend performing the test on somewhat longer sequences with more runs. I have included several warnings and exceptions in the code which should make it easier to catch these issues.

If anyone has further ideas of how to approach this problem of detecting randomness in multi-categorical sequences, please let me/us know via the comments.

[1] http://stats.stackexchange.com/questions/100838/autocorrelation-of-discrete-time-series
[2] http://stats.stackexchange.com/questions/73084/analysis-of-temporal-patterns/73170#73170
[3] J. V. Bradley. Distribution-free statistical tests. 1968.
[4] A. Wald and J. Wolfowitz. On a test whether two samples are from the same population. The Annals of Mathematical Statistics, 11(2):147-162, 1940.
[5] P. C. O'Brien and P. J. Dyck. A runs test based on run lengths. Biometrics, pages 237-244, 1985.
[6] https://github.com/psinger/RunsTest

Posted in Coding, General, Research, Statistics

Statistical Significance Tests on Correlation Coefficients

Recently, I had to determine whether two calculated correlation coefficients are statistically significantly different from each other. Basically, there exist two types of scenarios: (i) you want to compare two dependent correlations or (ii) you want to compare two independent correlations. I now want to cover both cases and present methods for determining statistical significance.

Two independent correlations

This use case applies when you have two correlations that come from different samples and are independent of each other. An example would be that you want to know whether height and weight are correlated in the same way for two distinct social groups. The following figure illustrates such a case:

[Figure: two independent correlations, X–Y and A–B]

We could think of X and Y as representing height and weight for the first social group, and A and B as representing them for the second social group. So we have no overlapping correlations and hence no dependent correlations, and we can focus on an independent significance test.

Two dependent (overlapping) correlations

The -- in my view -- much more interesting case is when you want to determine statistical significance between two dependent correlations. To give you an example, I want to describe a use case I was presented with previously. I was working on calculating semantic relatedness scores between concepts. However, it is very difficult to find a good way of evaluating the scores you produce. But there is one widely used evaluation dataset called WordSimilarity-353. This gold standard consists of 353 word pairs with corresponding relatedness scores that have been judged by humans. Hence, you take the same word pairs, calculate semantic relatedness scores using your method and finally compute the correlation coefficient between both vectors, which represents the accuracy of your method. However, there exists a large array of well-performing methods. One of the best ones, for example, is called ESA [1], which ends up with a Spearman rank correlation of 0.75. Now, suppose your method achieves a correlation score of 0.76 and you want to judge whether this is a statistically significant improvement. This is a legitimate question, especially due to the small gold standard dataset. As both methods calculate correlation coefficients against the same gold standard, we have to cope with dependent correlations. This is visualized in the following figure:

[Figure: two dependent (overlapping) correlations sharing the variable X]

We could think of X as being the WordSimilarity-353 gold standard, Y as being our results and Z as being those of the ESA method. We are interested in whether the correlation XY (representing the accuracy of our method) is statistically significantly different from XZ (representing the accuracy of the ESA method). However, for calculating this, we also need to know the correlation coefficient between Y and Z.

Methods

In the past, I have come across two ways of calculating statistical significance for the abovementioned cases. The first one is described in detail in the book Statistical Methods for Psychology [2]. It presents solutions for both the independent and the dependent case. For the independent case, one basically uses Fisher's z-transformation for correlation coefficients [3] and then tests the null hypothesis that p1 - p2 = 0. The dependent case is a bit more complicated. Yet, the book presents one method by Steiger [4] which incorporates a term describing how the two tests are themselves correlated. There exists a working implementation of both methods in the form of an R package.
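
To illustrate the independent case, here is a minimal Python sketch of the Fisher z approach (a generic sketch with made-up inputs, not the R package or my script mentioned below):

import numpy as np
from scipy.stats import norm

def independent_corr_test(r1, n1, r2, n2):
    #two-sided test for the difference of two independent Pearson correlations
    z1, z2 = np.arctanh(r1), np.arctanh(r2)      #Fisher z-transformation of both coefficients
    se = np.sqrt(1. / (n1 - 3) + 1. / (n2 - 3))  #standard error of the difference
    z = (z1 - z2) / se
    return z, 2 * (1 - norm.cdf(abs(z)))

#e.g., r=0.5 from 100 pairs vs. r=0.3 from 80 pairs
print(independent_corr_test(0.5, 100, 0.3, 80))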

The second method is by G. Zou [5] and also covers both the dependent and the independent case. The advantage of this method is its acknowledgement of the asymmetry of sampling distributions for single correlations, and it only requires confidence intervals. The results are confidence intervals for the difference, where one can reject the null hypothesis of no difference if the interval does not include zero. There is R code available for this method.

As I am mainly working with Python, I tackled the lack of a Python implementation of all these methods. Hence, I put a working Python script online that builds upon the abovementioned citations and R code. I hope this will help someone, and if questions come up, please feel free to ask them here or on the Github page. One needs to note though that even though we can compare two correlation coefficients, this does not necessarily mean that it is a good idea, and it may depend strongly on the use case. For a short discussion of this topic, I want to refer to a blog post.

[1] E. Gabrilovich and S. Markovitch, "Computing semantic relatedness using Wikipedia-based explicit semantic analysis," in Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007, pp. 1606-1611.
[2] D. C. Howell, Statistical Methods for Psychology, Cengage Learning, 2011.
[3] R. A. Fisher, "On the probable error of a coefficient of correlation deduced from a small sample," Metron, vol. 1, pp. 3-32, 1921.
[4] J. H. Steiger, "Tests for comparing elements of a correlation matrix," Psychological Bulletin, vol. 87, no. 2, p. 245, 1980.
[5] G. Y. Zou, "Toward using confidence intervals to compare correlations," Psychological Methods, vol. 12, no. 4, pp. 399-413, 2007. doi:10.1037/1082-989x.12.4.399
Posted in General

The popularity of subreddits and domains on Reddit

In a previous blog post I introduced a Reddit dataset I crawled, which includes submission data for one complete year (2012-04-24 - 2014-04-23). I showed that the number of submissions each day was steadily rising, but that, interestingly, more submissions are added to Reddit during the week. However, weekend submissions seem to get more attention from Redditors on average. Motivated by comments on that blog post and some further ideas, I now extend the analysis of this dataset. Throughout this blog post I want to focus on the different subreddits on Reddit as well as on the different high-level domains that submissions link to.

To begin with, let's start with a basic figure showing the number of distinct submissions added to Reddit each week of the year at hand:

[Figure: Number of distinct submissions each week]

We can see a steady rise in the content added to Reddit. Furthermore, submissions on Reddit can be added to a vast variety of subreddits -- which are basically custom-made subforums and represent some kind of community-focused content aggregators. Hence, it is interesting to observe how submissions are spread over distinct subreddits. We will investigate this by illustrating the number of distinct subreddits that submissions get added to each week:

[Figure: Number of distinct subreddits posted to each week]

Again, this figure shows us the steady rise of Reddit and the growing diversity of subreddits (note a similar figure over a longer period of time in a great blog post by Randal Olson).

On Reddit, it is possible to make two types of submissions: (a) self posts and (b) link posts. The first is typically a plain text post added to a subreddit. The second is a hyperlink to an external website. Both types of submissions can be up- and down-voted and commented on. Overall, the dataset at hand for one year consists of a total of 21,288,540 submissions, of which 7,295,667 are self (text) posts and 13,992,873 are link posts. This reveals a dominance of link posts on Reddit, while still around every third submission is a self post. One needs to note though that several subreddits have their own rules and may only allow certain types of submissions, which may influence the corresponding results depending on their general popularity on Reddit. Now the question comes up whether this behavior has changed over time. Hence, we plot the percentage of self posts (based on the complete number of submissions) added to Reddit for each week:

[Figure: Number of self posts each week, normalized by the total number of submissions each week]

This is quite interesting, as we can see that self (text) submissions become more important over the course of time. A frequently stated assumption about Reddit is that it is becoming more and more of an image or video board for funny posts (also hypothesized in the blog post by Randal Olson), not least because Redditors only gain karma for external link submissions. However, this figure shows a slightly different trend as it points out that the fraction of self posts is rising over time (even though just slightly).

As pointed out, links to external websites also play a crucial role on Reddit. As more and more submissions are added to Reddit, the diversity of distinct domains linked to also rises over time, as the next figure shows:

[Figure: Number of distinct domains linked to each week]

We now know that the diversity of subreddits and high-level domains linked to on Reddit rises over time. But what are the distinct subreddits and domains Redditors mostly use? We start by visualizing the top 30 subreddits users added posts to in the course of the year of interest:

[Figure: Top subreddits usage]

The usage percentages are calculated by normalizing by the total number of submissions. We can see that the top two subreddits cover funny content that often lands on the front page of Reddit. One needs to note that /r/funny allows both external link submissions and self text posts, but is highly dominated by the former. In contrast, /r/AdviceAnimals only allows image submissions that are hosted on external websites, while /r/AskReddit only allows text submissions. This already somewhat reflects the trend we discovered above. However, Reddit does not seem to be dominated by only one or a few subreddits that get most of the attention. The top subreddit "only" makes up nearly 5% of all submissions, and there is also a wide variety of other subreddits that get decent attention. Next, we produce a similar figure for the top domains being linked to on Reddit (again normalized, top 20):

[Figure: Top domain usage]

In this figure we can see a much larger dominance by a few domains that get the vast majority of attention in Reddit submissions. Imgur has established itself as the number one domain on Reddit, making up nearly 40% of all Reddit link submissions. This is followed by YouTube (~13%), and then there is a large gap to Reddit itself. This distribution seems to have a much longer tail. In the future, I want to explicitly investigate the distributions behind these two figures (subreddits and domains). For example, it would be interesting to see whether these distributions follow power law functions or other functions that are known to describe heavy-tailed distributions well. Based on this, we might be better able to explain the processes behind them.

But are the most often used subreddits and domains also the ones with the highest quality content? In the following, we show the top subreddits based on the average score of submissions posted to them. We only consider subreddits that have at least 10,000 submissions over the course of the year at hand, to reduce noise as much as possible.

[Figure: Top subreddits based on their average submission score (minimum of 10,000 submissions for each subreddit)]

The clear winner in this figure is /r/4chan, which is a subreddit dedicated to content about 4chan. However, in this subreddit it is not allowed to post direct links to 4chan itself. Second place is held by /r/gifs and third by /r/gonewild. One needs to note though that it is not allowed to downvote submissions in /r/gonewild. Furthermore, we now want to conduct the same analysis on the domains of Reddit submissions (this time setting the threshold for the minimum number of submissions per domain to 1,000):

[Figure: Top domains based on their average score (minimum of 1,000 submissions for each domain)]

Not surprisingly, the most often used domain (i.e., imgur) is not among the top domains regarding average score. Still, domains covering images or gifs are at the top of this list (e.g., forgifs, imgflip or minus). But sports and technology news are also present near the top (e.g., sbnation or bgr). The most interesting aspect, however, is the presence of deviantart. On deviantart, independent artists post and share their artwork, and you can find many submissions on Reddit that link to such content in various subreddits. Redditors seem to especially like such content. If we raise the threshold to 10,000 to cover only those domains linked to really often on Reddit, we end up with the following results:

[Figure: Top domains based on their average score (minimum of 10,000 submissions for each domain)]

Similar to the subreddit winner, a hosting service that is mostly used for linking to gifs on Reddit wins this race (i.e., minus). Interestingly, the average imgur link receives a score between 80 and 100, which is surprisingly high if we consider the large number of submissions to this domain. Also, "quality" textual content like articles on Wikipedia seems to receive high scores on Reddit.

In this blog post I assumed that the score of submissions is a great way to evaluate the popularity and quality of a submission. As this is a very important concept on Reddit, we could also use the number of comments as another way to measure it. I will cover this in a future post about the data, where I also want to find similarities or dissimilarities between these two ways of measuring popularity on Reddit (i.e., score and number of comments).

Posted in General