Recently, I have become increasingly interested in Bayesian techniques, and specifically in how to approach classical statistical inference problems within the Bayesian framework. As a starting point, I have looked into calculating the Pearson correlation.

To that end, the blog of Rasmus Bååth has been a great resource; he wrote a two-part series on how to model correlation in a Bayesian way [1,2]. A very similar model has also been proposed and discussed in [3].

My main contribution here is that I show how to apply the model with the Python library PyMC.

The main code, experiments and results can be found in an IPython notebook. Here, I just want to highlight the code for the main model and some exemplary results.

The core model(s) can be expressed as follows:

import pymc
import numpy as np
from pymc import Normal, Uniform, MvNormal, Exponential
from numpy.linalg import inv, det
from numpy import log, pi
from scipy.special import gammaln

def _model(data, robust=False):
    # priors might be adapted here to be less flat
    mu = Normal('mu', 0, 0.000001, size=2)
    sigma = Uniform('sigma', 0, 1000, size=2)
    rho = Uniform('r', -1, 1)

    # we have a further parameter (prior) for the robust case
    if robust == True:
        nu = Exponential('nu', 1/29., 1)

        # we model nu as an Exponential plus one
        @pymc.deterministic
        def nuplus(nu=nu):
            return nu + 1

    @pymc.deterministic
    def precision(sigma=sigma, rho=rho):
        ss1 = float(sigma[0] * sigma[0])
        ss2 = float(sigma[1] * sigma[1])
        rss = float(rho * sigma[0] * sigma[1])
        return inv(np.mat([[ss1, rss], [rss, ss2]]))

    if robust == True:
        # log-likelihood of the multivariate t-distribution
        @pymc.stochastic(observed=True)
        def mult_t(value=data.T, mu=mu, tau=precision, nu=nuplus):
            k = float(tau.shape[0])
            res = 0
            for r in value:
                delta = r - mu
                enum1 = gammaln((nu + k) / 2.) + 0.5 * log(det(tau))
                denom = (k / 2.) * log(nu * pi) + gammaln(nu / 2.)
                enum2 = (-(nu + k) / 2.) * log(1 + (1 / nu) * delta.dot(tau).dot(delta.T))
                result = enum1 + enum2 - denom
                res += result[0]
            return res[0, 0]
    else:
        mult_n = MvNormal('mult_n', mu=mu, tau=precision, value=data.T, observed=True)

    return locals()

def analyze(data, robust=False, plot=True):
    model = pymc.MCMC(_model(data, robust))
    model.sample(50000, 25000)
    print()
    if plot:
        pymc.Matplot.plot(model)
    return model
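As a sanity check of the multivariate t log-likelihood used in the robust case, here is a standalone NumPy/SciPy sketch that evaluates the same density, parameterized by the precision matrix tau, and compares it against scipy.stats.multivariate_t (available in SciPy >= 1.6). The concrete values for sigma, rho, mu and nu below are made up purely for illustration:

```python
import numpy as np
from numpy.linalg import inv, slogdet
from scipy.special import gammaln
from scipy.stats import multivariate_t

def mvt_logpdf(x, mu, tau, nu):
    """Log-density of a multivariate t-distribution, parameterized
    by the precision matrix tau (inverse of the scale matrix)."""
    k = tau.shape[0]
    delta = x - mu
    quad = float(delta @ tau @ delta)
    return (gammaln((nu + k) / 2.0) - gammaln(nu / 2.0)
            - (k / 2.0) * np.log(nu * np.pi) + 0.5 * slogdet(tau)[1]
            - ((nu + k) / 2.0) * np.log(1.0 + quad / nu))

# build the covariance matrix from sigma and rho, exactly as in the model
sigma = np.array([30.0, 20.0])   # made-up standard deviations
rho = 0.4                        # made-up correlation
cov = np.array([[sigma[0] ** 2, rho * sigma[0] * sigma[1]],
                [rho * sigma[0] * sigma[1], sigma[1] ** 2]])
tau = inv(cov)                   # precision matrix

x_obs = np.array([410.0, 420.0])
mu = np.array([400.0, 400.0])
nu = 5.0

print(mvt_logpdf(x_obs, mu, tau, nu))
print(multivariate_t(loc=mu, shape=cov, df=nu).logpdf(x_obs))  # should match
```

Both lines should print the same number, which gives some confidence that the hand-written log-likelihood in the model above is correct.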

Then, let us define some data and do inference.

import numpy as np

x = np.array([525., 300., 450., 300., 400., 500., 550., 125., 300., 400., 500., 550.])
y = np.array([250., 225., 275., 350., 325., 375., 450., 400., 500., 550., 600., 525.])
data = np.array([x, y])

model = analyze(data)

The marginal posterior of the correlation rho (together with the MCMC trace and autocorrelation plot) looks like the following:

What we can see here is that the mean of rho is around 0.13 (similar to what a classical Pearson correlation estimation tells us). However, when we take a look at the histogram of the marginal posterior for rho, we can see that the range of plausible values for rho is quite wide. We can characterize this with the 95% HPD (highest posterior density) interval---also called credible interval---which is [-0.42, 0.64]. Thus, with this HPD interval, we get a much more thorough view of the distribution of rho. While the mean correlation is slightly positive, we cannot rule out a negative correlation or a non-existing correlation (rho=0).
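For reference, the classical Pearson point estimate for the same data can be computed directly with NumPy. Unlike the posterior, it is a single number with no distributional information attached:

```python
import numpy as np

x = np.array([525., 300., 450., 300., 400., 500., 550., 125., 300., 400., 500., 550.])
y = np.array([250., 225., 275., 350., 325., 375., 450., 400., 500., 550., 600., 525.])

r = np.corrcoef(x, y)[0, 1]  # classical Pearson correlation coefficient
print(r)                     # a small positive value, in the same ballpark as the posterior mean
```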

For further details, examples and also an experiment with the robust model, please take a look at the IPython notebook and the references.

[1] http://www.sumsar.net/blog/2013/08/bayesian-estimation-of-correlation/

[2] http://www.sumsar.net/blog/2013/08/robust-bayesian-estimation-of-correlation/

[3] Lee, Michael D., and Eric-Jan Wagenmakers. *Bayesian cognitive modeling: A practical course*. Cambridge University Press, 2014.

Hi,

I stumbled across your blog while looking for a python version of Rasmus Bååth code. I was wondering how do you access the 95% CI intervals. I'd like to print them to the screen.

Thanks!

Via model.stats()['r']['95% HPD interval']. You can check out the linked notebook that has code for that: http://nbviewer.jupyter.org/github/psinger/notebooks/blob/master/bayesian_correlation_pymc.ipynb
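In case you want to compute such an interval yourself from a raw trace of posterior samples (e.g. model.trace('r')[:] in PyMC2), one simple approach is to search for the narrowest interval containing 95% of the samples. Here is a NumPy sketch; the hpd_interval name and the normal toy data are just for illustration:

```python
import numpy as np

def hpd_interval(samples, cred=0.95):
    """Narrowest interval containing `cred` mass of the samples."""
    s = np.sort(np.asarray(samples))
    n = len(s)
    m = int(np.ceil(cred * n))            # number of samples inside the interval
    widths = s[m - 1:] - s[: n - m + 1]   # width of every candidate interval
    i = np.argmin(widths)                 # pick the narrowest one
    return s[i], s[i + m - 1]

# toy data: for a standard normal, the 95% HPD is roughly (-1.96, 1.96)
rng = np.random.default_rng(42)
draws = rng.normal(0.0, 1.0, size=100_000)
lo, hi = hpd_interval(draws)
print(lo, hi)
```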

Hi Philipp,

thanks for your fantastic work, this is very helpful! I am not very familiar with PyMC yet, how could one give two independent priors on mu and sigma? I.e. instead of

mu = Normal('mu', 0, 0.000001, size=2)

sigma = Uniform('sigma', 0, 1000, size=2)

I would like to give mu1, mu2, sigma1, sigma2, but then I don't know how to combine them into one mu and one sigma.

You can simply define them separately as mu1, mu2, sigma1 and sigma2 and then use these inside the precision and mult_t functions. So, for example, in precision you pass sigma1 and sigma2 and access them instead of sigma[0] and sigma[1]. Alternatively, so-called containers in PyMC2 should also work: https://pymc-devs.github.io/pymc/modelbuilding.html#containers

Hi Philipp,

thanks for providing your code. I'm in the process of porting your code to PyMC3, but get quite different results for the robust model. Unfortunately, I could not find out why this happens. I'd appreciate if you could have a look at http://stackoverflow.com/questions/40126809/robust-bayesian-correlation-with-pymc3, maybe you have an idea.

Thanks!

Hi Sebastian, sorry for the late response. I am also having trouble sampling from the robust model. Either I end up with errors, or I have traces not really converging. Have you found a solution in the meantime? I will also play around a bit.

Can't find any issue. Did you ask on the PyMC3 GitHub issue tracker? I think that would be the best place to go.