Bayesian Correlation with PyMC

Recently, I have become more and more interested in Bayesian techniques and, specifically, in how to approach classical statistical inference problems within the Bayesian framework. As a start, I have looked into estimating the Pearson correlation coefficient.

To that end, I found excellent resources on the blog of Rasmus Bååth, who wrote a two-part series about how to model correlation in a Bayesian way [1,2]. A very similar model has also been proposed and discussed in [3].

My main contribution here is that I show how to apply the model with the Python library PyMC.

The main code, experiments, and results can be found in an IPython notebook. Below, I highlight the code for the main model and some exemplary results.

The core model(s) can be expressed as follows:

import pymc
from pymc import Normal, Uniform, MvNormal, Exponential
from numpy.linalg import inv, det
from numpy import log, pi
import numpy as np
from scipy.special import gammaln

def _model(data, robust=False):
    # priors; they are rather flat and might be adapted here
    mu = Normal('mu', 0, 0.000001, size=2)
    sigma = Uniform('sigma', 0, 1000, size=2)
    rho = Uniform('r', -1, 1)  # note: the trace name of rho is 'r'

    # the robust case has one further parameter: the degrees of freedom nu,
    # modeled as an Exponential plus one (so that nu > 1)
    if robust:
        nu = Exponential('nu', 1/29., value=1)

        @pymc.deterministic
        def nuplus(nu=nu):
            return nu + 1

    # precision matrix, i.e., the inverse of the covariance matrix
    # built from sigma and rho
    @pymc.deterministic
    def precision(sigma=sigma, rho=rho):
        ss1 = float(sigma[0] * sigma[0])
        ss2 = float(sigma[1] * sigma[1])
        rss = float(rho * sigma[0] * sigma[1])
        return inv(np.mat([[ss1, rss], [rss, ss2]]))

    if robust:
        # observed node: log-likelihood of the multivariate t-distribution
        @pymc.stochastic(observed=True)
        def mult_t(value=data.T, mu=mu, tau=precision, nu=nuplus):
            k = float(tau.shape[0])
            res = 0.
            for r in value:
                delta = r - mu
                quad = float(delta.dot(tau).dot(delta.T))
                enum1 = gammaln((nu + k) / 2.) + 0.5 * log(det(tau))
                denom = (k / 2.) * log(nu * pi) + gammaln(nu / 2.)
                enum2 = (-(nu + k) / 2.) * log(1 + quad / nu)
                res += enum1 + enum2 - denom
            return res
    else:
        # observed node: multivariate normal likelihood
        mult_n = MvNormal('mult_n', mu=mu, tau=precision, value=data.T, observed=True)

    # PyMC collects the model's nodes from the returned dictionary
    return locals()

def analyze(data, robust=False, plot=True):
    model = pymc.MCMC(_model(data, robust))
    # 50000 samples, of which the first 25000 are discarded as burn-in
    model.sample(50000, 25000)

    print
    if plot:
        pymc.Matplot.plot(model)

    return model

Then, let us define some data and do inference.

import numpy as np
x = np.array([525., 300., 450., 300., 400., 500., 550., 125., 300., 400., 500., 550.])
y = np.array([250., 225., 275., 350., 325., 375., 450., 400., 500., 550., 600., 525.])
# 2 x n array; _model expects one variable per row
data = np.array([x, y])
model = analyze(data)
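
Before interpreting the estimates, a quick convergence check is worthwhile. A minimal sketch using PyMC's Geweke diagnostic ('r' is the label given to rho in _model above):

import pymc

# Geweke z-scores along the rho trace; values within +/- 2
# indicate that the chain has likely converged
scores = pymc.geweke(model.trace('r')[:])
pymc.Matplot.geweke_plot(scores, 'rho')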

The marginal posterior of the correlation rho (together with the MCMC trace and autocorrelation) looks as follows:

[Figure: MCMC trace, autocorrelation, and posterior histogram of rho]

What we can see here is that the mean of rho is around 0.13, which is similar to what a classical Pearson correlation estimate tells us. However, the histogram of the marginal posterior shows that the distribution of plausible values for rho is rather wide. We can characterize this with the 95% HPD (highest posterior density) interval, also called credible interval, which is [-0.42, 0.64]. The HPD interval thus gives a much more complete picture of our uncertainty about rho than a single point estimate. While the mean correlation is slightly positive, we cannot rule out a negative correlation or no correlation at all (rho = 0).
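
These numbers can also be read off programmatically from the trace. A minimal sketch, assuming PyMC's hpd utility and SciPy's pearsonr for the classical comparison:

import pymc
from scipy.stats import pearsonr

trace_r = model.trace('r')[:]        # posterior samples of rho
print trace_r.mean()                 # posterior mean, around 0.13 here
print pymc.utils.hpd(trace_r, 0.05)  # 95% HPD interval, around [-0.42, 0.64]
print pearsonr(x, y)[0]              # classical point estimate, for comparison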

For further details, more examples, and an experiment with the robust model, please take a look at the IPython notebook and the references.
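
As a quick teaser, the robust variant is invoked through the same analyze function; only the likelihood changes from the multivariate normal to the multivariate t defined above:

# Student-t likelihood instead of the normal one; everything else is identical
model_robust = analyze(data, robust=True)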

[1] http://www.sumsar.net/blog/2013/08/bayesian-estimation-of-correlation/
[2] http://www.sumsar.net/blog/2013/08/robust-bayesian-estimation-of-correlation/
[3] Lee, Michael D., and Eric-Jan Wagenmakers. Bayesian cognitive modeling: A practical course. Cambridge University Press, 2014.
