Speeding up calculation of Pearson correlation for matrices with missing data


In the DepMap portal, we found ourselves wanting to precompute Pearson correlation coefficients across a large number of profiles.

This generally took the form of two matrices $X$ and $Y$, where we wanted to compute the correlation of every column in $X$ against every column in $Y$.

The R language has a very convenient cor function which does exactly this. Looking around in numpy's documentation, there is a promising function named numpy.ma.corrcoef; however, it does not compute the correlation between the columns of x and y, but instead computes the R equivalent of cor(rbind(x,y)), which is not what we want.
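For example (a small illustrative check with made-up shapes, not part of the benchmarks below), numpy.ma.corrcoef stacks its two inputs into one set of variables rather than cross-correlating them:

import numpy as np

u = np.random.uniform(size=(50, 30))  # 50 observations x 30 columns
v = np.random.uniform(size=(50, 10))  # 50 observations x 10 columns

# corrcoef treats rows as variables, so we pass the transposes; the two inputs
# get stacked into one set of 40 variables instead of being cross-correlated
print(np.ma.corrcoef(u.T, v.T).shape)  # (40, 40), not the (30, 10) matrix we want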

Computing Pearson correlations between a set of vector pairs seemed like a common use case, so I was surprised that I could not find a Python function which supported this kind of batch computation.

One can try to naively implement such a function by computing scipy.stats.pearsonr for each pair of columns:

import scipy.stats
import numpy as np

def naive_pearson_cor(X, Y):
    result = np.zeros(shape=(X.shape[1], Y.shape[1]))
    for i in range(X.shape[1]):
        for j in range(Y.shape[1]):
            r, _ = scipy.stats.pearsonr(X[:,i], Y[:,j])
            result[i,j] = r
    return result

x = np.random.uniform(size=(500,500))
y = np.random.uniform(size=(500,100))

%%timeit
naive_pearson_cor(x,y)
2.48 s ± 50.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now, that runtime is not so good, considering this was only 500 columns against 100 columns. We need to compute correlations that are more on the order of 10k against 10k!

I did, however, find several Stack Overflow posts describing how to implement an efficient matrix correlation function.

The general theme of computing the correlation matrix efficiently is:

  1. Use vectorized numpy operations whenever possible.
  2. Reuse computation that is common to each column in $X$ and $Y$.

Concretely, the Pearson correlation coefficient $r_{xy}$ from $n$ paired samples $\left\{(x_{1},y_{1}),\ldots ,(x_{n},y_{n})\right\}$ can be computed as:

$$r_{xy} =\frac{\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum ^n _{i=1}(x_i - \bar{x})^2} \sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2}}$$

If we rewrite this as:

$$a=\sum ^n _{i=1}(x_i - \bar{x})(y_i - \bar{y})$$

$$b=\sqrt{\sum ^n _{i=1}(x_i - \bar{x})^2}$$

$$c=\sqrt{\sum ^n _{i=1}(y_i - \bar{y})^2}$$

and

$$r_{xy} =\frac{a}{b c}$$

We can compute $b$ once for each column in $X$ and $c$ once for each column in $Y$. The numerator $a$ is the only value which depends on both a column from $X$ and a column from $Y$. As a result, we can save a lot of time by not recalculating $b$ and $c$ for each pair.
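In matrix form, if $\tilde{X}$ and $\tilde{Y}$ denote $X$ and $Y$ with their column means subtracted, and $b$ and $c$ are now the vectors collecting those per-column values, the whole correlation matrix can be written as

$$R = \frac{\tilde{X}^\top \tilde{Y}}{b\,c^\top}$$

where the division is elementwise: the numerator is a single matrix multiplication and the denominator is an outer product, which is exactly the shape of the implementation below.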

Concretely, the implementation I came up with looks like the following:

def np_pearson_cor(x, y):
    # center each column
    xv = x - x.mean(axis=0)
    yv = y - y.mean(axis=0)
    # per-column sums of squares (the b and c terms, before the square root)
    xvss = (xv * xv).sum(axis=0)
    yvss = (yv * yv).sum(axis=0)
    result = np.matmul(xv.transpose(), yv) / np.sqrt(np.outer(xvss, yvss))
    # bound the values to -1 to 1 in the event of precision issues
    return np.maximum(np.minimum(result, 1.0), -1.0)

%%timeit
np_pearson_cor(x,y)
1.88 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
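As a quick sanity check (not part of the timings above), the vectorized version should agree with the naive one to within floating-point error:

# the two implementations should produce the same matrix
assert np.allclose(naive_pearson_cor(x, y), np_pearson_cor(x, y))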

That was so much faster! Let's make our test data much larger to get a better sense of the timing.

y = np.random.uniform(size=(500,20000))

%%timeit
np_pearson_cor(x,y)
266 ms ± 1.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now, that's great, but our matrices have missing values (represented as NaNs in $X$ and $Y$). That changes our implementation, because $b$ and $c$ now depend on which values are missing in either vector for every combination of columns. We can no longer use our trick of computing these once per column.

However, in our specific case of DepMap datasets, we have relatively few holes in our matrices. Moreover, some columns have holes because the matrix was constructed by merging data from different experiments. As a result, when we do have holes, they are often consistent across columns.

This means we may be able to preprocess the matrices, splitting them into distinct dense matrices based on columns which share the same patterns of missing values.

My implementation of group_cols_with_same_mask identifies the columns which can be used to construct distinct dense matrices:

def group_cols_with_same_mask(x):
    "returns a sequence of tuples (mask, columns) where columns are the column indices in x which all have the mask"
    per_mask = {}
    for i in range(x.shape[1]):
        o_mask = np.isfinite(x[:,i])
        # take the binary vector o_mask and convert it to a compact
        # sequence of bytes which we can use as a dict key
        o_mask_b = np.packbits(o_mask).tobytes()
        if o_mask_b not in per_mask:
            per_mask[o_mask_b] = [o_mask, []]
        per_mask[o_mask_b][1].append(i)
    return per_mask.values()
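To make the grouping concrete, here is a small made-up example (the matrix and its values are purely illustrative):

# columns 0 and 2 are missing a value in the same row; column 1 is complete
small = np.array([[1.0,    2.0, 3.0],
                  [np.nan, 5.0, np.nan],
                  [7.0,    8.0, 9.0]])

for mask, columns in group_cols_with_same_mask(small):
    print(mask, columns)
# [ True False  True] [0, 2]
# [ True  True  True] [1]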

Now we can use group_cols_with_same_mask to construct dense matrices, and for each one, call np_pearson_cor.

def fast_cor_with_missing(x,y):
    # preallocate storage for the result
    result = np.zeros(shape=(x.shape[1], y.shape[1]))

    x_groups = group_cols_with_same_mask(x)
    y_groups = group_cols_with_same_mask(y)
    for x_mask, x_columns in x_groups:
        for y_mask, y_columns in y_groups:
            combined_mask = x_mask & y_mask

            # not sure if this is the fastest way to slice out the relevant subset
            x_without_holes = x[:, x_columns][combined_mask, :]
            y_without_holes = y[:, y_columns][combined_mask, :]

            c = np_pearson_cor(x_without_holes, y_without_holes)

            # update result with these correlations
            result[np.ix_(x_columns, y_columns)] = c

    return result
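The np.ix_ call at the end builds an open mesh from the row and column indices, so each block of correlations is scattered into the right (possibly non-contiguous) positions of the result matrix. A tiny illustration:

# place a 2x2 block into rows [0, 2] and columns [1, 3] of a 3x4 matrix
target = np.zeros((3, 4))
target[np.ix_([0, 2], [1, 3])] = np.array([[10.0, 20.0],
                                           [30.0, 40.0]])
print(target)
# [[ 0. 10.  0. 20.]
#  [ 0.  0.  0.  0.]
#  [ 0. 30.  0. 40.]]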

Now, we are adding overhead with this preprocessing step. We can time the dense case again to see how much it adds.

%%timeit
fast_cor_with_missing(x,y)
589 ms ± 9.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It's significant, but it allows us to introduce holes and still keep performance comparable to the dense form.

Now, let's move on to measuring the execution time for a case where we have holes.

def add_holes(x, count=100):
    # punch `count` NaN holes into a copy of x at random positions
    x = x.copy()
    i_v = np.random.choice(range(x.shape[0]), count)
    j_v = np.random.choice(range(x.shape[1]), count)
    for i in range(count):
        x[i_v[i], j_v[i]] = np.nan
    return x

x_with_nans = add_holes(x)
y_with_nans = add_holes(y)

%%timeit
fast_cor_with_missing(x_with_nans,y_with_nans)
18.6 s ± 232 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
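Why the jump? With randomly placed holes, nearly every column that contains a NaN gets its own mask, so we end up with on the order of a hundred groups on each side, and therefore on the order of 100 x 100 calls to np_pearson_cor on small blocks of columns instead of a single large call. A rough check (exact counts vary from run to run):

print(len(group_cols_with_same_mask(x_with_nans)),
      len(group_cols_with_same_mask(y_with_nans)))
# on the order of a hundred groups each, depending on where the random holes land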

Much slower, but still much faster than computing each correlation pairwise.
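As a final spot check (the masked_pairwise_cor helper below is a verification sketch I'm introducing here, not part of the pipeline), the grouped implementation should agree with a slow reference that drops missing rows separately for each pair of columns:

def masked_pairwise_cor(x, y):
    # slow reference: for each pair of columns, keep only the rows where both
    # values are present, then call scipy.stats.pearsonr on the remaining rows
    result = np.zeros(shape=(x.shape[1], y.shape[1]))
    for i in range(x.shape[1]):
        for j in range(y.shape[1]):
            keep = np.isfinite(x[:, i]) & np.isfinite(y[:, j])
            r, _ = scipy.stats.pearsonr(x[keep, i], y[keep, j])
            result[i, j] = r
    return result

# compare on a small slice so the slow reference finishes quickly
assert np.allclose(
    fast_cor_with_missing(x_with_nans[:, :20], y_with_nans[:, :20]),
    masked_pairwise_cor(x_with_nans[:, :20], y_with_nans[:, :20]),
)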

(Thanks to Josh Dempster and Andrew Boghassian for improvements to this post.)
