Evaluating gene relatedness in CRISPR


Although the main goal of DepMap is identifying therapeutic opportunities in cancer, it can also be used to investigate questions of basic cell biology, such as gene function. One common approach to learning the function of an unknown gene is “guilt by association,” with the hypothesis that genes of similar function are essential in similar contexts. One way to evaluate the success of this strategy is to check whether genes with known relationships are highly correlated in DepMap dependency data. One of the strongest examples is the TSC1-TSC2 paralog pair, whose gene effects are highly correlated:

[figure: TSC1 vs. TSC2 gene effect correlation across cell lines]

You could take this a step further and suggest that recovery of known gene relationships is one of the metrics we should use to assess how good a dependency dataset is. However, there are some significant pitfalls to this type of metric, which I'll show below.

(This analysis is based on the DepMap internal 22Q4 release because I had the data from that in a convenient form, but the results should be consistent regardless of the version of the data.)

One thing I won't discuss much is the chromosome arm effect, where genes on the same chromosome arm tend to be correlated with each other. We suspect this effect is due to uncorrected copy number effects. While it could confound a gene relationship analysis by making collocated genes correlated, the effect size is generally weak and most pronounced for genes with no genuine variation in viability effect. Here, I'm focusing on the strongest correlations identifiable in the data, where the chromosome arm effect is unlikely to have much influence.

First, the forms of the CRISPR data I'll consider. I'll include Chronos gene effects, since that's the default version of the data (the version used here does not have the chromosome arm correction we introduced in 23Q2). I also include the naive gene effect scores, which are just median fold changes across guides and replicates for each screen; I averaged overlapping screens of the same cell line across libraries. This Naive gene effect suffers from significant batch effects, so we made a second version, NaiveScaled, with every cell line's gene effects rescaled so that the median of common essentials is -1. A simple linear rescaling does not completely correct quality effects, so we also made a fancier version where we used a spline regression to remove bias. For this, we model the gene effect of each cell line $c$ as $\xi_c =f_c(\bar{X})$, where $\xi_c$ is the estimated value of the gene effect $X_c$, $\bar{X}$ is the mean value of $X$ across all cell lines, and $f_c$ is a regularized spline model. The corrected NaiveSpline is $X - \xi + \bar{X}$. Since this is a useful technique for detrending nonlinear biases, the python code is below:

from patsy import dmatrix
from statsmodels.api import GLM
import numpy as np
import pandas as pd

# you'll need to download ScreenNaiveGeneScore.csv from the DepMap release
naive = pd.read_csv("ScreenNaiveGeneScore.csv", index_col=0)
means = naive.mean()
# the beauty of spline regression is that you can transform your exogenous variable into a
# spline basis (a matrix with one row per observation and columns determined by the number
# of knots and the degree of your splines). Then spline regression becomes a linear
# regression problem.
spline_basis = dmatrix("cr(x, df=4)", {"x": means})

def get_spline_corrected(y):
    pred = GLM(y, spline_basis).fit().predict(spline_basis)
    return y - pred + means

# note that this is uglier but usually much more efficient than using pandas' native `apply`
spline_corrected = np.stack([get_spline_corrected(naive.loc[ind])
                            for ind in naive.index])
spline_corrected = pd.DataFrame(spline_corrected, index=naive.index, 
                                columns=naive.columns)

Here's an example of what the gene effects for a really high dropout cell line (ACH-002744) and a low dropout line (ACH-000500) look like with each version of this data. Gene effects for each gene in that cell line are on the y-axis, with the x-axis being the mean across all cell lines. Note the nonlinearity of the trend lines in Naive, which makes the spline-based correction clearly the best at removing this artifact. Note also that the high dropout line is much noisier than the low dropout line. We see a general trend in CRISPR data where lines with strong common essential dropout tend also to have greater noise in nonessentials, perhaps because they are cycling faster.

[figure: per-gene effects vs. mean gene effect for ACH-002744 (high dropout) and ACH-000500 (low dropout), under each version of the data]

To measure recovery of gene relationships, we need a set of known pairs. I'll use three sources: protein-protein interactions (from InWeb), CORUM complex co-membership, and paralogs (from BioMart sequence similarity). After filtering for genes present in all lines in the DepMap CRISPR dataset, we found 1.15 million pairs, or about 0.3% of all possible gene pairs. This is a very generous set, and of course not all of these pairs are really functionally related; even those that are may not have correlated gene effect profiles, for many plausible reasons. On the flip side, it's extremely difficult to define a large set of pairs that we know aren't functionally related, and even harder to do so without being biased in some way. For example, we might say that we're confident that olfactory receptors and tyrosine kinases aren't functionally related, but these are two groups of genes with specific patterns of dependency not representative of the genome as a whole.
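As a sketch, assembling a known-pair set from sources like these might look as follows. The file formats for InWeb, CORUM, and BioMart vary, so the function names, column names, and input shapes here are hypothetical; the point is just to show the pair-set construction and filtering:

```python
import itertools
import pandas as pd

def pairs_from_edges(df, a_col, b_col):
    # unordered gene pairs from an edge list, e.g. InWeb PPIs or
    # BioMart paralog pairs; drops self-pairs
    return set(frozenset(p) for p in zip(df[a_col], df[b_col]) if p[0] != p[1])

def pairs_from_complexes(complexes):
    # all co-membership pairs from a list of gene sets, e.g. CORUM complexes
    return set(frozenset(p) for members in complexes
               for p in itertools.combinations(sorted(set(members)), 2))

def restrict_to_screened(pairs, screened_genes):
    # keep only pairs where both genes are present in the CRISPR dataset
    screened = set(screened_genes)
    return {p for p in pairs if p <= screened}
```

Taking the union of the three sources and then restricting to screened genes yields the final known-pair set.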

Given these problems, probably the most robust way to measure known pair recovery is rank enrichment. We correlate all gene pairs, take the absolute values, and look for enrichment of known gene pairs in the top X% of all pairs. For example, if 4% of known gene pairs rank in the top 1% of all pair correlations, that's a fourfold enrichment. Here are the results, checking for enrichment at thresholds from 0.2% to 10%.

[figure: fold enrichment of known pairs at thresholds from 0.2% to 10%, for each version of the data]
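For concreteness, a minimal sketch of this rank-enrichment calculation, assuming a cell-line-by-gene gene effect DataFrame and an iterable of known pairs (the function and argument names are illustrative, not from the DepMap pipeline):

```python
import numpy as np
import pandas as pd

def pair_enrichment(gene_effect, known_pairs, threshold=0.01):
    # fold enrichment of known pairs among the top `threshold` fraction
    # of all absolute pairwise gene correlations
    corr = gene_effect.corr().abs()
    genes = corr.columns
    iu = np.triu_indices(len(genes), k=1)  # each unordered pair once
    pair_corrs = pd.Series(
        corr.values[iu],
        index=pd.MultiIndex.from_arrays([genes[iu[0]], genes[iu[1]]]),
    )
    cutoff = pair_corrs.quantile(1 - threshold)
    top = set(map(frozenset, pair_corrs[pair_corrs >= cutoff].index))
    known = set(map(frozenset, known_pairs))
    recovered = len(known & top) / len(known)
    return recovered / threshold  # 1.0 means no enrichment over chance
```

At genome scale the dense correlation matrix is large (~18,000 × 18,000 floats), so in practice you might compute it in blocks, but the ranking logic is the same.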

Plain Naive gene effect is the version of the data with the greatest bias, yet it does by far the best by this measure of related gene recovery, while NaiveSpline (with the best bias correction) performs worst on this metric. The core problem is that known related pairs of genes, and particularly pairs that can be recovered from CRISPR, are highly biased towards common essentials. Because all essential genes will be more or less depleted as a group in better or worse screens, screen quality bias causes a generic correlation between all pairs of common essentials. This automatically enriches for genes of known paired function, because common essential genes are highly overrepresented among them. Take a look at how many known pairs are “recovered” (rank in the top 1% of all correlations) with Naive gene effect at different levels of essentiality:

[figure: known pairs recovered in the top 1% of Naive gene effect correlations, binned by mean essentiality]

Naive appears to be successful because it recovers all pairs with strong mean essentiality. We can try to correct for this by ranking known pairs only among pairs of genes with similar essentiality, i.e. similar mean gene effect. Specifically, I'll bin every gene by its mean effect (using bins [-10, -1.75, -1.5, -1.25, -1, -.75, -.5, -.25, 0, .15, 10]). Suppose our known related pair of interest is genes A and B, with A having a mean gene effect of -1.6 in some version of the data and B having a mean of -2. Rather than comparing the correlation of A with B to the correlations of all possible pairs, I'll choose a subset: all the genes with means between -1.75 and -1.5, correlated with all the genes with means between -10 and -1.75. I'll then evaluate how the A-to-B correlation ranks within this smaller set of correlations. By controlling for the mean essentiality of genes, this reduces the influence of screen quality bias. Accordingly, it reduces the enrichment of known pairs for all methods, but most dramatically for Naive:

[figure: enrichment of known pairs after restricting the background to mean-effect bins, for each version of the data]
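The bin-restricted ranking described above can be sketched like this, assuming the same cell-line-by-gene gene effect DataFrame as before (function names are illustrative; the actual analysis ranks every known pair this way rather than one pair at a time):

```python
import numpy as np
import pandas as pd

# the mean gene effect bins from the text
BINS = [-10, -1.75, -1.5, -1.25, -1, -0.75, -0.5, -0.25, 0, 0.15, 10]

def binned_pair_rank(gene_effect, gene_a, gene_b, bins=BINS):
    # rank |corr(A, B)| against the background of correlations between
    # all genes in A's mean-effect bin and all genes in B's bin
    means = gene_effect.mean()
    bin_id = pd.cut(means, bins, labels=False)
    in_a = means.index[bin_id == bin_id[gene_a]]
    in_b = means.index[bin_id == bin_id[gene_b]]
    corr = gene_effect[in_a.union(in_b)].corr().abs()
    sub = corr.loc[in_a, in_b]
    # drop self-correlations, which arise when A and B share a bin
    mask = np.array(in_a)[:, None] != np.array(in_b)[None, :]
    background = sub.values[mask]
    target = corr.loc[gene_a, gene_b]
    return (background < target).mean()  # fraction of background A-B beats
```

A rank near 1.0 means the pair's correlation is near the top of its bin-matched background, which is the quantity compared across data versions above.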

This seems like a fairer way to measure recovery. But note that Naive still beats NaiveScaled and NaiveSpline, suggesting that screen quality bias is still influencing the results. This could be because of the discretization of the bins, or some other aspect of the bias that isn't captured in the means, or even real biology: perhaps cells that have more or less common essential dropout are also genuinely more or less dependent on certain functions. This would be extremely difficult to disentangle from screening artifacts.

What I can say is that in the past we've tried various approaches: changing how we select known related genes, the gene pairs that make up the background, and how we measure recovery of known related genes. We've found that these choices, all individually defensible, strongly influence our conclusions, which is a problem for getting definitive answers about which version of the data performs best. This doesn't mean that having two highly correlated genes isn't a strong indicator of a functional relationship, assuming your data has been corrected for quality. Rather, it means that correlation between known related genes isn't a reliable measure of your dataset's overall quality, and definitely should not be optimized for when deciding how to process a CRISPR dataset.
