Everyone who does scientific computing in Python has to handle matrices at least sometimes. The go-to library for working with matrices and performing calculations on them is Numpy. However, sometimes your matrices grow so large that you can no longer store them in memory. This blog post provides insights into how to handle such a scenario.

The most prominent solution, and the one I would suggest first, is to use Scipy's sparse matrices. Scipy is a package that builds upon Numpy and provides further mechanisms like sparse matrices, which are matrices that only store elements with a value different from zero. Large matrices are very often sparse in practice, so using sparse matrices directly can already help a lot. In addition to the lower memory usage, Scipy's calculation methods are tuned to sparse operations, which usually also speeds up your code a lot (Scipy's heavy use of C code helps as well). Scipy offers a variety of sparse matrix types that each have their own internal representation and benefits; for an overview please consult [1].
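To get a feeling for the savings, here is a minimal sketch comparing the memory footprint of a dense array and its sparse counterpart (the sizes and density are made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# a 2000 x 2000 matrix where only ~0.1% of the entries are nonzero
dense = np.zeros((2000, 2000), dtype=np.float64)
idx = np.random.randint(0, 2000, size=(2, 4000))
dense[idx[0], idx[1]] = 1.0

# the sparse version stores only the nonzero values plus two index arrays
sparse = csr_matrix(dense)
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(dense.nbytes, sparse_bytes)  # the sparse matrix needs a tiny fraction of the memory
```

The more zeros the matrix contains, the bigger the gap becomes; for a truly dense matrix the sparse format would actually cost more due to the index arrays.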

Alas, sometimes this is not enough. For example, I recently had to calculate the dot product of a matrix with dimensions of 360,000 by 360,000 with the transpose of itself. If this matrix were very sparse, sparse matrices alone would solve the problem. However, in my case it was not sparse at all, and the final output needed more than 100GB of memory even though I used float32 as dtype. As a side note, altering the dtype [2] is another approach to reduce memory usage, at the expense of value precision. Even the server I was using could not handle such memory usage.
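As a quick sketch of the dtype trick mentioned above: casting an array from float64 (Numpy's default) to float32 halves its memory footprint, at the cost of roughly half the significant digits.

```python
import numpy as np

a = np.ones((1000, 1000))            # float64 by default
b = a.astype(np.float32)             # halves the memory, loses precision

print(a.nbytes)  # 8,000,000 bytes
print(b.nbytes)  # 4,000,000 bytes
```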

So as a second approach, I suggest using an HDF5 store, e.g., in the form of PyTables. PyTables allows you to store Numpy arrays on disk and then access parts of the array directly from your Python code. For example, instead of holding everything in memory, I would first store the array to disk and then slice rows from disk and work with them. For a detailed tutorial on PyTables please refer to [3]. I now want to give an example of how we can perform our dot calculation and store the output in an HDF5 store.

from scipy.sparse import csr_matrix, rand
import tables as tb

a = rand(2000, 2000, format='csr')
# imagine that many values are stored in this matrix and that sparsity is low
b = a.T
l, m, n = a.shape[0], a.shape[1], b.shape[1]

f = tb.open_file('dot.h5', 'w')
filters = tb.Filters(complevel=5, complib='blosc')
out = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, n), filters=filters)

bl = 1000  # this is the number of columns we calculate each loop
# this may not be the most efficient value
# look into buffersize usage in PyTables and adapt the buffersize of the
# carray accordingly to improve fetching performance specifically

b = b.tocsc()  # we slice b on columns, csc improves performance
# this can also be changed to slice on rows instead of columns
for i in range(0, n, bl):
    out[:, i:min(i + bl, n)] = a.dot(b[:, i:min(i + bl, n)]).toarray()
f.close()

When our calculation is finished we can quite easily access the data:

h5 = tb.open_file('dot.h5', 'r')
a = h5.root.data
row = a[0, :]  # only one row gets loaded into memory
print(row)

In this case we only slice one row of the HDF5-stored matrix and hence, only this single row gets loaded into memory. If we want to perform any further calculations on this matrix, we could load it completely into memory (which will not work, as we want to circumvent memory issues), or perform the calculations on smaller slices and then piece the results together (again in an HDF5 store or a Numpy/sparse matrix).
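As a sketch of the slice-and-piece-together idea: a column sum over the stored matrix can be computed block-wise, so that only a handful of rows are ever in memory at once. The file name and array sizes here are made up; a small stand-in array is written first so the snippet is self-contained.

```python
import numpy as np
import tables as tb

# build a small stand-in store (in practice this would be the big matrix on disk)
with tb.open_file('blocksum.h5', 'w') as f:
    arr = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(100, 50))
    arr[:] = np.ones((100, 50), dtype=np.float32)

bl = 10  # rows per block
with tb.open_file('blocksum.h5', 'r') as h5:
    a = h5.root.data
    col_sums = np.zeros(a.shape[1])
    for i in range(0, a.shape[0], bl):
        col_sums += a[i:i + bl, :].sum(axis=0)  # only bl rows in memory at a time

print(col_sums[:5])  # every column of the all-ones matrix sums to 100.0
```

The same pattern works for any reduction or row-wise transformation; only the block size needs tuning.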

One drawback is that PyTables cannot work with sparse matrices directly, which is why we have to use toarray() to make the sliced calculation dense before storing it in the HDF5 CArray. However, there is also a way to work with sparse matrices in PyTables by using EArrays (which have no pre-defined shape) and storing the data and indices of the sparse matrix:

f = tb.open_file('dot.h5', 'w')
filters = tb.Filters(complevel=5, complib='blosc')
out_data = f.create_earray(f.root, 'data', tb.Float32Atom(), shape=(0,), filters=filters)
out_ri = f.create_earray(f.root, 'ri', tb.Int32Atom(), shape=(0,), filters=filters)
out_ci = f.create_earray(f.root, 'ci', tb.Int32Atom(), shape=(0,), filters=filters)
# note the integer atoms for the indices; float atoms cannot represent large indices exactly

for i in range(0, n, bl):
    res = a.dot(b[:, i:min(i + bl, n)])
    vals = res.data
    ri, ci = res.nonzero()
    out_data.append(vals)
    out_ri.append(ri)
    out_ci.append(ci + i)  # offset the block-local column indices into the full matrix
f.close()

Finally, we can rebuild the sparse matrix the following way:

h5 = tb.open_file('dot.h5', 'r')
a = csr_matrix((h5.root.data[:], (h5.root.ri[:], h5.root.ci[:])), shape=(l, n))

As you can see in the above example, we need to remember or store the shape parameters. Also, the problem remains if we cannot hold the final sparse matrix in memory. Hence, it is more elegant to store the column indices together with the indptr in the following way:

import numpy as np

f = tb.open_file('dot.h5', 'w')
filters = tb.Filters(complevel=5, complib='blosc')
out_data = f.create_earray(f.root, 'data', tb.Float32Atom(), shape=(0,), filters=filters)
out_indices = f.create_earray(f.root, 'indices', tb.Int32Atom(), shape=(0,), filters=filters)
out_indptr = f.create_earray(f.root, 'indptr', tb.Int32Atom(), shape=(0,), filters=filters)
out_indptr.append(np.array([0]))  # this is needed as a first indptr

max_indptr = 0
for i in range(0, l, bl):
    res = a[i:min(i + bl, l), :].dot(b)
    out_data.append(res.data)
    indices = res.indices
    indptr = res.indptr
    out_indices.append(indices)
    out_indptr.append(max_indptr + indptr[1:])
    max_indptr += indices.shape[0]
f.close()

Again, we can generate the complete sparse matrix the following way:

h5 = tb.open_file('dot.h5', 'r')
a = csr_matrix((h5.root.data[:], h5.root.indices[:], h5.root.indptr[:]), shape=(l, n))

Now, the beauty is that we can also slice the matrix without ever using dense Numpy arrays:

# let's get the third row
a = h5.root
b = csr_matrix((a.data[a.indptr[3]:a.indptr[3 + 1]],
                a.indices[a.indptr[3]:a.indptr[3 + 1]],
                np.array([0, len(a.indices[a.indptr[3]:a.indptr[3 + 1]])])),
               shape=(1, n))

Please let me know in case of any questions or mistakes found. I am also happy to hear further solutions that aim to tackle this issue.

[1] http://docs.scipy.org/doc/scipy/reference/sparse.html

[2] http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html

[3] http://pytables.github.io/usersguide/

So awesome, thanks for posting, exactly what I was looking for.

Glad to hear it helped 🙂

Do you know if you can transpose a table?

I think there are several ways to achieve this that depend on the use case. For example, you could transpose the data before you write it to the HDF5 table if you always want to access it in transposed form later. Or, you could simply transpose the data you slice later on by just saying h5.root.data[:].T or equivalent. Finally, you could also create two tables, one in transposed form and one that is not, if disk space is not an issue.

If only your clever hacks were built into PyTables or h5... pull request?

Well, maybe I could spend some time thinking about how to integrate this into PyTables, but if anyone wants to go ahead, I would be happy 😉

I have a matrix of size 57600 by 57600 and I have to pass it completely to a function. Can I use PyTables for this?

Yes, I don't see a problem with that.

Hi,

Thank you for your great post! Really helpful. I am not sure I follow the last part though.

How would you modify the following line to select row 3 and 4?

b = csr_matrix((a.data[a.indptr[3]:a.indptr[3 + 1]],
                a.indices[a.indptr[3]:a.indptr[3 + 1]],
                np.array([0, len(a.indices[a.indptr[3]:a.indptr[3 + 1]])])),
               shape=(1, n))

Many thanks.

Thanks! You can just replace 3 with whatever row you want. To get rows 3 and 4, slice data and indices with a.indptr[3]:a.indptr[5]; note that you then also need an indptr array with one entry per selected row plus the leading zero, and shape=(2, n).
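To spell that out, here is a sketch of slicing rows 3 and 4 from the stored CSR components. A small in-memory matrix stands in for h5.root so the snippet is self-contained; the names mirror the ones in the post.

```python
import numpy as np
from scipy.sparse import csr_matrix, rand

m = rand(10, 8, density=0.5, format='csr')  # stand-in for the matrix stored in dot.h5
data, indices, indptr = m.data, m.indices, m.indptr
n = m.shape[1]

start, stop = 3, 5  # rows 3 and 4 (stop is exclusive)
row_slice = slice(indptr[start], indptr[stop])
sub = csr_matrix((data[row_slice],
                  indices[row_slice],
                  indptr[start:stop + 1] - indptr[start]),  # rebase indptr to start at 0
                 shape=(stop - start, n))

# same result as slicing the full matrix directly
assert (sub.toarray() == m[start:stop].toarray()).all()
```

Rebasing the indptr is the step that generalizes the single-row case: for one row, indptr[start:stop + 1] - indptr[start] is exactly the [0, nnz] array used in the post.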

Hi,

Suppose I want to calculate a simple row sum in a matrix using nested for loops, but the matrix is a sparse matrix. Is there any difference in using the loops while using the sparse matrix?

Hi - I don't think I really get the question. Could you try to elaborate?

Is there any way to store a huge Python dictionary on hard disk? I want to store a list of dictionary objects. The list size could be as large as 231,000.

You could check out https://pypi.python.org/pypi/shove. Or simply use a database or something similar.

I want to calculate the SVD of a very large sparse matrix (on the order of 500,000 by 200,000). Is there a way to handle it in memory? I've tried to use Python's csc_matrix and the Spark extension, but I always run into memory problems and the process is killed with exit code 137. Any idea would be great.