Handling huge matrices in Python

Everyone who does scientific computing in Python has to handle matrices at least occasionally. The go-to library for representing matrices and performing calculations on them is Numpy. However, sometimes your matrices grow so large that you can no longer store them in memory. This blog post provides some insights into how to handle such a scenario.

The most prominent solution, and the one I would suggest first, is to use Scipy's sparse matrices. Scipy is a package that builds upon Numpy and provides further mechanisms like sparse matrices, which only store those elements of a matrix whose value differs from zero. Large matrices are very often sparse in practice, so switching to sparse matrices alone can already help a lot. In addition to the lower memory usage, Scipy's calculation routines are tuned for sparse data, which usually also speeds up your code considerably (Scipy uses a lot of C code, which helps as well). Scipy offers a variety of sparse matrix types that each have their own internal representation and benefits; for an overview please consult [1].
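As a quick illustration of the memory savings (a minimal sketch; the matrix size and density are chosen arbitrarily), we can compare a sparse matrix against the footprint of its dense counterpart:

import numpy as np
from scipy.sparse import rand

# a 10,000 x 10,000 matrix with 1% non-zero entries
a = rand(10000, 10000, density=0.01, format='csr', dtype=np.float32)

dense_bytes = a.shape[0] * a.shape[1] * 4 # float32 takes 4 bytes per entry
sparse_bytes = a.data.nbytes + a.indices.nbytes + a.indptr.nbytes
print(f'dense: {dense_bytes / 1e6:.0f} MB, sparse: {sparse_bytes / 1e6:.0f} MB')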

Alas, sometimes this is not enough. For example, I recently had to calculate the dot product of a matrix with dimensions 360,000 x 360,000 with its own transpose. If this matrix were very sparse, sparse matrices alone would already solve the problem. However, in my case it was not sparse at all, and the final output needed well over 100GB of memory even though I used float32 as dtype. As a side note, choosing a smaller dtype [2] is itself an approach to reduce memory usage, at the expense of value precision. Even the server I was using could not handle such memory usage.
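To get a sense of scale, here is a back-of-the-envelope calculation of what a fully dense 360,000 x 360,000 result would occupy for different dtypes:

rows = cols = 360000
for name, nbytes in [('float64', 8), ('float32', 4), ('float16', 2)]:
    gb = rows * cols * nbytes / 1e9
    print(f'{name}: {gb:.0f} GB') # float64: 1037 GB, float32: 518 GB, float16: 259 GB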

So as a second approach, I suggest using an HDF5 store, e.g., in the form of PyTables. PyTables allows you to store Numpy arrays on disk and then access parts of the array directly from disk in your Python code. For example, instead of holding everything in memory, I would first store the array to disk and then slice rows from disk and work with those. For a detailed tutorial on PyTables please refer to [3]. Let me now give an example of how we can perform our dot product calculation and store the output in an HDF5 store.

from scipy.sparse import csr_matrix, rand
import tables as tb

a = rand(2000, 2000, format='csr') # imagine that this matrix is huge and not very sparse
b = a.T
l, m, n = a.shape[0], a.shape[1], b.shape[1]

f = tb.open_file('dot.h5', 'w')
filters = tb.Filters(complevel=5, complib='blosc')
out = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, n), filters=filters)

bl = 1000 # number of columns we calculate in each loop iteration
# this may not be the most efficient value; look into buffersize usage
# in PyTables and adapt the buffersize of the CArray accordingly to
# specifically improve fetching performance

b = b.tocsc() # we slice b on columns, CSC improves performance

# this can also be changed to slice on rows instead of columns
for i in range(0, n, bl):
    out[:, i:min(i+bl, n)] = a.dot(b[:, i:min(i+bl, n)]).toarray()
f.close()


When our calculation is finished we can quite easily access the data:

h5 = tb.open_file('dot.h5', 'r')
a = h5.root.data
row = a[0, :] # only this one row gets loaded into memory
print(row)

In this case we only slice one row of the HDF5-stored matrix, and hence only this single row gets loaded into memory. If we want to perform further calculations on this matrix, we could load it into memory completely (which will not work, as the whole point is to circumvent memory issues), or we could again perform the calculations on smaller slices and piece the results together (in an HDF5 store or in a Numpy/sparse matrix).
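As a minimal sketch of this slice-wise approach (reusing the 'dot.h5' file and the block size from above), here is how one could, for example, compute all row sums without ever loading the full matrix:

import numpy as np
import tables as tb

h5 = tb.open_file('dot.h5', 'r')
a = h5.root.data
bl = 1000 # rows per block; tune this to your memory budget

row_sums = np.empty(a.shape[0], dtype=np.float32)
for i in range(0, a.shape[0], bl):
    block = a[i:min(i+bl, a.shape[0]), :] # only this block is loaded from disk
    row_sums[i:i+block.shape[0]] = block.sum(axis=1)
h5.close()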

One drawback is that PyTables cannot work with sparse matrices directly, which is why we have to call toarray() to densify each sliced result before storing it in the HDF5 CArray. However, there is also a way to keep the data sparse in PyTables by using EArrays (arrays that can be extended along one dimension) and storing the data and indices of the sparse matrix:

f = tb.open_file('dot.h5', 'w')
filters = tb.Filters(complevel=5, complib='blosc')
out_data = f.create_earray(f.root, 'data', tb.Float32Atom(), shape=(0,), filters=filters)
out_ri = f.create_earray(f.root, 'ri', tb.Int32Atom(), shape=(0,), filters=filters) # indices are integers, not floats
out_ci = f.create_earray(f.root, 'ci', tb.Int32Atom(), shape=(0,), filters=filters)
for i in range(0, n, bl):
    res = a.dot(b[:, i:min(i+bl, n)]).tocoo() # COO gives aligned data/row/col arrays
    out_data.append(res.data)
    out_ri.append(res.row)
    out_ci.append(res.col + i) # shift the local column indices by the slice offset
f.close()

Finally, we can rebuild the sparse matrix as follows:

h5 = tb.open_file('dot.h5', 'r')
a = csr_matrix((h5.root.data[:], (h5.root.ri[:], h5.root.ci[:])), shape=(l,n))

As you can see in the example above, we need to remember or store the shape parameters, though. Still, the problem remains if we cannot store the final sparse matrix in memory. Hence, it is more elegant to store the column indices together with the indptr array of the CSR format, the following way:

import numpy as np

f = tb.open_file('dot.h5', 'w')
filters = tb.Filters(complevel=5, complib='blosc')
out_data = f.create_earray(f.root, 'data', tb.Float32Atom(), shape=(0,), filters=filters)
out_indices = f.create_earray(f.root, 'indices', tb.Int32Atom(), shape=(0,), filters=filters)
out_indptr = f.create_earray(f.root, 'indptr', tb.Int32Atom(), shape=(0,), filters=filters) # use Int64Atom if the total non-zero count can exceed 2**31
out_indptr.append(np.array([0])) # the CSR format needs a leading 0 in indptr
max_indptr = 0
for i in range(0, l, bl):
    res = a[i:min(i+bl, l), :].dot(b)
    out_data.append(res.data)
    out_indices.append(res.indices)
    # each block's indptr is local, so offset it by the number of
    # non-zeros accumulated so far (and drop each block's leading 0)
    out_indptr.append(res.indptr[1:] + max_indptr)
    max_indptr += res.indices.shape[0]
f.close()

Again, we can reconstruct the complete sparse matrix:

h5 = tb.open_file('dot.h5', 'r')
a = csr_matrix((h5.root.data[:], h5.root.indices[:], h5.root.indptr[:]), shape=(l,n))

Now, the beauty is that we can also slice the matrix without ever using dense Numpy arrays:

# let's get the row with index 3
a = h5.root
start, stop = a.indptr[3], a.indptr[4]
b = csr_matrix((a.data[start:stop], a.indices[start:stop],
                np.array([0, stop - start])), shape=(1, n))
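Under the same assumptions (the open 'dot.h5' handle and the shape parameter n from above), this pattern generalizes to a small hypothetical helper that slices any contiguous row range while staying sparse:

def slice_rows(root, r0, r1, n):
    # return rows [r0, r1) of the on-disk CSR matrix as a csr_matrix;
    # n (the number of columns) must be passed in since it is not stored
    indptr = root.indptr[r0:r1+1]
    start, stop = indptr[0], indptr[-1]
    return csr_matrix((root.data[start:stop], root.indices[start:stop],
                       indptr - start), shape=(r1 - r0, n)) # re-base indptr to 0

rows = slice_rows(h5.root, 10, 20, n) # rows 10 to 19, still sparse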

Please let me know if you have any questions or find any mistakes. I am also happy to hear about further solutions that tackle this issue.

[1] http://docs.scipy.org/doc/scipy/reference/sparse.html
[2] http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
[3] http://pytables.github.io/usersguide/
