Recently, I have become increasingly interested in Bayesian techniques; specifically, I have been researching how to approach classical statistical inference problems within the Bayesian framework. As a start, I looked into calculating the Pearson correlation coefficient.
To that end, I found a great resource in the blog of Rasmus Bååth, who wrote a two-part series about how to model correlation in a Bayesian way [1,2]. A very similar model has also been proposed and discussed by Lee and Wagenmakers [3].
My main contribution here is that I show how to apply the model with the Python library PyMC.
The main code, experiments and results can be found in the form of an IPython notebook. Below, I just want to highlight the code for the main model and some exemplary results.
The core model(s) can be expressed as follows:
```python
import pymc
from pymc import Normal, Uniform, MvNormal, Exponential
from numpy.linalg import inv, det
from numpy import log, pi
import numpy as np
from scipy.special import gammaln

def _model(data, robust=False):
    # priors might be adapted here to be less flat
    mu = Normal('mu', 0, 0.000001, size=2)
    sigma = Uniform('sigma', 0, 1000, size=2)
    rho = Uniform('r', -1, 1)

    # we have a further parameter (prior) for the robust case
    if robust:
        nu = Exponential('nu', 1/29., 1)

        # we model the degrees of freedom as an Exponential plus one
        @pymc.deterministic
        def nuplus(nu=nu):
            return nu + 1

    # build the precision matrix (inverse covariance) from sigma and rho
    @pymc.deterministic
    def precision(sigma=sigma, rho=rho):
        ss1 = float(sigma[0] * sigma[0])
        ss2 = float(sigma[1] * sigma[1])
        rss = float(rho * sigma[0] * sigma[1])
        return inv(np.mat([[ss1, rss], [rss, ss2]]))

    if robust:
        # log-likelihood of the multivariate t-distribution
        @pymc.stochastic(observed=True)
        def mult_t(value=data.T, mu=mu, tau=precision, nu=nuplus):
            k = float(tau.shape[0])
            res = 0
            for r in value:
                delta = r - mu
                enum1 = gammaln((nu + k) / 2.) + 0.5 * log(det(tau))
                denom = (k / 2.) * log(nu * pi) + gammaln(nu / 2.)
                enum2 = (-(nu + k) / 2.) * log(1 + (1 / nu) * delta.dot(tau).dot(delta.T))
                res += enum1 + enum2 - denom
            return res[0, 0]
    else:
        mult_n = MvNormal('mult_n', mu=mu, tau=precision, value=data.T, observed=True)

    return locals()

def analyze(data, robust=False, plot=True):
    model = pymc.MCMC(_model(data, robust))
    model.sample(50000, 25000)
    if plot:
        pymc.Matplot.plot(model)
    return model
```
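To make the `precision` deterministic above more concrete: the model places priors on the two standard deviations and the correlation, assembles the corresponding covariance matrix, and inverts it, since the multivariate normal in PyMC is parameterized by the precision (inverse covariance) matrix. A minimal NumPy sketch of that construction, with illustrative values standing in for prior draws:

```python
import numpy as np

# illustrative values standing in for draws from the priors
sigma = np.array([2.0, 3.0])  # standard deviations of the two variables
rho = 0.5                     # correlation between them

# covariance matrix assembled from sigma and rho, mirroring the
# `precision` deterministic in the model
cov = np.array([[sigma[0] ** 2,             rho * sigma[0] * sigma[1]],
                [rho * sigma[0] * sigma[1], sigma[1] ** 2]])

# the precision matrix is its inverse
prec = np.linalg.inv(cov)

# sanity check: cov times prec should be the identity
print(np.allclose(cov.dot(prec), np.eye(2)))  # True
```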
Then, let us define some data and run the inference.
```python
import numpy as np

x = np.array([525., 300., 450., 300., 400., 500., 550., 125., 300., 400., 500., 550.])
y = np.array([250., 225., 275., 350., 325., 375., 450., 400., 500., 550., 600., 525.])
data = np.array([x, y])

model = analyze(data)
```
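As a frequentist baseline to compare the posterior against (this is not part of the model, just a sanity check), the classical point estimate on the same data can be obtained with SciPy:

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([525., 300., 450., 300., 400., 500., 550., 125., 300., 400., 500., 550.])
y = np.array([250., 225., 275., 350., 325., 375., 450., 400., 500., 550., 600., 525.])

# classical Pearson correlation coefficient and two-sided p-value
r, p = pearsonr(x, y)
print(round(r, 3))  # 0.175
```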
The marginal posterior of the correlation rho (together with the MCMC trace and the autocorrelation plot) looks like the following:
What we can see here is that the mean of rho is around 0.13, which is similar to what a classical Pearson correlation estimate tells us. However, when we take a look at the histogram of the marginal posterior for rho, we can see that the distribution of plausible values for rho is quite wide. We can characterize this with the 95% HPD (highest posterior density) interval---also called the credible interval---which is [-0.42, 0.64]. The HPD interval thus gives us a much more complete view of the uncertainty about rho: while the mean correlation is slightly positive, we cannot rule out a negative correlation or a non-existing correlation (rho=0).
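PyMC reports the HPD interval as part of its summary statistics, but the idea is simple enough to compute directly from posterior samples: sort them and find the narrowest window containing 95% of the mass. A small generic sketch (the function name `hpd_interval` is my own, not a PyMC API):

```python
import numpy as np

def hpd_interval(samples, cred_mass=0.95):
    """Narrowest interval containing `cred_mass` of the samples."""
    sorted_s = np.sort(samples)
    n = len(sorted_s)
    window = int(np.ceil(cred_mass * n))
    # widths of all candidate intervals that contain `window` samples
    widths = sorted_s[window - 1:] - sorted_s[:n - window + 1]
    best = np.argmin(widths)
    return sorted_s[best], sorted_s[best + window - 1]

# e.g. for a standard normal, the 95% HPD should be roughly [-1.96, 1.96]
rng = np.random.default_rng(0)
lo, hi = hpd_interval(rng.normal(size=100000))
print(lo, hi)  # both endpoints close to -1.96 and 1.96
```

Applied to the trace of rho, this recovers the [-0.42, 0.64] interval reported above.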
For further details, examples and also an experiment with the robust model, please take a look at the IPython notebook and the references.
[3] Lee, Michael D., and Eric-Jan Wagenmakers. Bayesian Cognitive Modeling: A Practical Course. Cambridge University Press, 2014.