Assessing Confidence in Achilles Gene Profiles

One of the most common questions we get at DepMap is whether to trust a particular gene's CRISPR knockout profile in Achilles. We can understand why researchers are wary. In a field where billions have been spent chasing shadows, scientists are justifiably skeptical.

I wish I could say that all the untrustworthy genetic dependency results have been left behind in the bad old days of RNAi, but this is untrue. Even CRISPR has off-target effects. In Achilles, we remove those guides that seem like clear outliers relative to other guides targeting the same gene before we run CERES. Nonetheless, we know some guides still exhibit off-target effects. For example, they may be strongly depleted in cell lines where the target gene is unexpressed.

When someone is interested in a particular gene's dependency profile, there are a number of markers we suggest they look at: its predictability, top correlates, consistency with other pancancer profiles, and agreement among its guides. However this is an ad hoc, labor intensive process, and the results are often murky. For this reason, we've decided to calculate a confidence score for each gene effect profile in CRISPR Achilles in addition to providing the individual metrics. We hope these scores will be helpful to the community as they decide which genes to pursue.

Guide Consistency

Consistency in the pattern of viability effects across cell lines for the different guides targeting a given gene is an important indicator that the gene's profile is reliable, but it's not obvious how to measure consistency. Correlation measures seem like a natural choice, but consider a gene with no viability phenotype: We expect correlation to be poor as the variation across screens will be driven by random noise, but the guides might still be consistently near zero and therefore indicate a reliable result. On the other hand, a mean squared error measure will look good for genes where the guides have similar means, even if one guide has a few cell lines that look depleted and the other guides do not.

Since no single measure corresponds to our intuitive sense of guide consistency in all cases, we calculate multiple measures of guide agreement and train a random forest classifier to predict whether two guides target the same gene. Its confidence that they do will be used as the guide consistency score. Conveniently, this is a number between 0 and 1.

import numpy as np
import pandas as pd
from scipy.stats import spearmanr, pearsonr

def get_metrics(A, B):
    computes a bunch of agreement measures between vectors A and B, suitable for seeing
    whether two measurements of gene effect are in agreement
    assert all(A.index == B.index), "misaligned indices between the two series"
    mask = A.notnull() & B.notnull()
    a = A[mask]
    b = B[mask]
    a = a.loc[b.index]
    meanA = a.mean()
    meanB = b.mean()
    mean_se = (meanA - meanB)**2
    mean_mean = .5*(meanA + meanB)
    corr = a.corr(b)
    spearman = spearmanr(a, b)[0]
    varA = a.var()
    varB = b.var()
    var_se = np.abs(varA - varB)
    var_mean = .5*(varA + varB)
    centered_SE = ((a - meanA - b + meanB) **2).mean()
    status_agreement = ((a < -.5) == (b < -.5)).mean()
    return {
        'mean_se': mean_se,
        'mean_mean': mean_mean,
        'corr': corr,
        'spearman': spearman,
        'var_se': var_se,
        'var_mean': var_mean,
        'centered_SE': centered_SE,
        'max_mean': max(meanA, meanB),
        'min_mean': min(meanA, meanB),
        'max_var': max(varA, varB),
        'min_var': min(varA, varB),
        'max_max': max(max(a), max(b)),
        'min_min': min(min(a), min(b)),
        'status_agreement': status_agreement
from taigapy import default_tc as tc
achilles_lfc = tc.get(name='public-20q2-075d', file='Achilles_logfold_change')

achilles_guide_map = tc.get(name='public-20q2-075d', file='Achilles_guide_map')

achilles_replicate_map = tc.get(name='public-20q2-075d', file='Achilles_replicate_map')\
# collapse replicates to cell lines by median
achilles_lfc_cell = achilles_lfc.groupby(achilles_replicate_map.DepMap_ID, axis=1).median()

To train the random forest, we'll first generate the agreement metrics for 5000 pairs of guides targeting the same gene, and then 5000 pairs targeting two random genes. The random forest will be trained to use these metrics to discriminate the first group from the second.

First, the pairs targeting the same gene:

features = []
is_same = pd.Series()
nsamples = 5000
genes = np.random.choice(achilles_guide_map.gene.dropna().unique(), size=nsamples, replace=False)
for i, gene in enumerate(genes):
    if not i%1000:
        print('%1.0f%% done' % (i * 100./len(genes)))
    guides = achilles_guide_map.query('gene == "%s"' % gene).sgrna.tolist()
    if len(guides) < 2:
    out = get_metrics(achilles_lfc_cell.loc[guides[0]], achilles_lfc_cell.loc[guides[1]])
    out.update({'gene': gene, 'guide1': guides[0], 'guide2': guides[1]})
    is_same[gene] = 1

Now the pairs targeting different genes. We'll reuse the genes to avoid introducing unnecessary bias:

for i, gene in enumerate(genes):
    if not i%1000:
        print('%1.0f%% done' % (i * 100./len(genes)))
    guides = achilles_guide_map.query('gene == "%s"' % gene).sgrna.tolist()
    if len(guides) < 2:
    gene2 = gene
    while gene2 == gene or (gene + ' x ' + gene2) in is_same:
        gene2 = np.random.choice(genes)
        guide2 = np.random.choice(achilles_guide_map.query('gene == %r' % gene2).sgrna)

    guide1 = guides[0]
    # avoid the (unlikely) case that the two genes happen to share a guide and 
    # we've chosen that guide for both genes
    while guide1 == guide2:
        guide1 = np.random.choice(guides)
    # a human readable name for the pair of chosen genes
    pseudogene = gene + ' x ' + gene2
    out = get_metrics(achilles_lfc_cell.loc[guide1], achilles_lfc_cell.loc[guide2])
    out.update({'gene': pseudogene, 'guide1': guide1, 'guide2': guide2})
    is_same[pseudogene] = 0
features = pd.DataFrame(features).set_index('gene')

Now, to train the random forest:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(20, min_impurity_decrease=.00001, min_samples_leaf=100, max_depth=6

We visualize the results for the (in-sample) separation below. Remember that our goal is not really to accurately predict whether two guides target the same gene, but only to generate a measure of their agreement. That said, if you test the out-of-sample performance you will find it is similar.

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline['guide1', 'guide2'], axis=1).values, is_same.values)
in_sample = model.predict_proba(
    features.drop(['guide1', 'guide2'], axis=1).values, 
)[:, 1]

sns.kdeplot(in_sample[is_same==1], label="Same Gene", bw=.01)
sns.kdeplot(in_sample[is_same==0], label='Different Genes', bw=.01)
_ = plt.xlabel("Guide Consistency Score")


Finally, we'll calculate the agreement score for all possible pairs of guides targeting the same gene.

gene_indexed = achilles_guide_map.dropna().set_index('gene').sgrna.drop_duplicates()
agreement = []
features = []
for i, gene in enumerate(gene_indexed.index.unique()):
    if not i % 1000 and i > 0:
        print('%i genes complete' % i)
    guides = gene_indexed.loc[[gene]]
    if len(guides) < 2:
    for j, guide1 in enumerate(guides[:-1]):
        for guide2 in guides[j+1:]:
            assert guide1 != guide2, "%r %s %s" % (guides, guide1, guide2)
            agreement.append({'gene': gene, 'guide1': guide1, 'guide2': guide2})
features = pd.DataFrame(features)
agreement = pd.DataFrame(agreement)
agreement['agreement'] = model.predict_proba(features.values, )[:, 1]
sns.kdeplot(agreement['agreement'].values, bw=.01)
plt.xlabel("Guide Consistency Score All Achilles Genes")


We'll summarize guide consistency in two ways: the maximum agreement between any pair of guides, and the mean agreement over all possible pairs.

max_guide_consistency = agreement.groupby("gene")['agreement'].max()
mean_guide_consistency = agreement.groupby("gene")['agreement'].mean()

Another, simpler guide property that could determine how much confidence we have in the gene score is the number of guides uniquely aligned to that gene.

unique_guides = achilles_guide_map.groupby('sgrna').n_alignments.sum().loc[lambda x: x==1].index

nunique = achilles_guide_map[

Consistency Between Achilles, Score and RNAi data

Perhaps the most powerful signal that a gene profile is reliable comes from agreement between independent experiments, especially if one of those experiments uses a different perturbation technology altogether. Thus, we're interested in generating consistency scores between gene effects as measured in three different datasets: Project Achilles CRISPR, Project Score, and the DEMETER 2 combined RNAi dataset.

Just as with guide efficacy, no one measure of consistency is appropriate for all genes. We'll take a similar approach to combining measures. First, we'll take a gene and calculate various measures of consistency between its gene effect in two different datasets. Then, we'll calculate the same measures of consistency between a pair of random genes in the two datasets. Repeating this 5000 times gives us enough examples to train a random forest to predict whether a pair of gene effect profiles in two different datasets represent the same gene, or different genes. For any gene, the random forest's confidence that its gene effect profiles in two different datasets represent the same gene will be our consistency score between the datasets.

def train_gene_consistency(A, B, nsamples=5000):
    overlap_genes = sorted(set(A.columns) & set(B.columns))
    genes = np.random.choice(overlap_genes, nsamples, replace=False)
    is_same = pd.Series()
    features = {}                     
    for i, gene in enumerate(genes):
        if not i % 1000 or i == nsamples - 1:
            print("%1.f done" % (i*100./nsamples))
        a = A[gene].dropna()
        b = B[gene].dropna()
        overlap_lines = sorted(set(a.index) & set(b.index))
        features[gene] = get_metrics(a[overlap_lines], b[overlap_lines])
        is_same[gene] = 1
        if i == 0:
            gene2 = genes[-1]
            gene2 = genes[i-1]
        b = B[gene2].dropna()
        overlap_lines = sorted(set(a.index) & set(b.index))
        pseudogene = '%s x %s' % (gene, gene2)
        features[pseudogene] = get_metrics(a[overlap_lines], b[overlap_lines])
        is_same[pseudogene] = 0
    features = pd.DataFrame(features).T.loc[is_same.index]
    model = RandomForestClassifier(20, 
                              ), is_same.values)
    return model, features, is_same
gene_effect_achilles = tc.get(name='public-20q2-075d', file='Achilles_gene_effect')
gene_effect_score = tc.get(name='depmap-a0ab', file='SCORE_gene_effect')
gene_effect_rnai = tc.get(name='depmap-a0ab', file='D2_gene_effect')
model_rnai, features_rnai, is_same_rnai = train_gene_consistency(gene_effect_achilles,
model_score, features_score, is_same_score = train_gene_consistency(gene_effect_achilles,

We'll visualize the consistency distributions as before. First, RNAi:

in_sample = model_rnai.predict_proba(
)[:, 1]

sns.kdeplot(in_sample[is_same_rnai==1], label="Same Gene", bw=.01)
sns.kdeplot(in_sample[is_same_rnai==0], label='Different Genes', bw=.01)
_ = plt.xlabel("RNAi/Achilles Consistency")


RNAi and CRISPR gene scores often look quite different. Many genes which are selectively essential in RNAi are common essentials in CRISPR. These differences are starkly reflected in the large body of profiles for which the random forest has no idea if they represent the same gene or not, producing the peak at 0.5. Thus, while good agreement between RNAi and CRISPR is a very strong positive signal, lack of agreement is not necessarily meaningful.

Next, agreement between Project Score and Project Achilles:

in_sample = model_score.predict_proba(
)[:, 1]

sns.kdeplot(in_sample[is_same_score==1], label="Same Gene", bw=.01)
sns.kdeplot(in_sample[is_same_score==0], label='Different Genes', bw=.01)
_ = plt.xlabel("Score/Achilles Consistency")


These two datasets have much better concordance, and for most genes it is relatively easy for a random forest to identify if two profiles belong to the same gene or not.

We now evaluate consistency for all overlapping genes:

def evaluate_consistency(A, B, model):
    features = {}
    overlap_genes = sorted(set(A.columns) & set(B.columns))
    for i, gene in enumerate(overlap_genes):
        if not i % 1000 and i > 0:
            print("%i genes done" % i)
        a = A[gene].dropna()
        b = B[gene].dropna()
        overlap_lines = sorted(set(a.index) & set(b.index))
        features[gene] = get_metrics(a[overlap_lines], b[overlap_lines])
    features = pd.DataFrame(features).T
    agreement = model.predict_proba(features.values)[:, 1]
    return pd.Series(agreement, index=features.index)
consistency_rnai = evaluate_consistency(gene_effect_rnai, gene_effect_achilles, model_rnai)
consistency_score = evaluate_consistency(gene_effect_score, gene_effect_achilles, model_score)


If the pattern of a gene’s dependency across cell lines is predictable from the underlying molecular features of the cell lines, that's a good indication that real biology is behind the observed effect – assuming the predictor isn't just using data artifacts. We will shortly introduce gene predictability to the DepMap portal. Its results are useful to consider in deciding how strongly to trust a gene's dependency results.

predictions_full = tc.get(name='predictability-d5b9', file='ensemble-regression-complete')
predictions = predictions_full\
        .query('feature_set == "Unbiased"')\
        .query("dataset == 'Avana'")\
        .query("sample_set == 'pan-cancer'")\

    predictability = predictions['pearson']

Some genes are predictable using confounding variables, which does not increase our confidence in their profile. We'll therefore create a column that indicates whether the most important feature in the predictive model is a confounder:

confounded = predictions.feature0.apply(lambda s: s.split('_')[-1] == "Confounders")

A “sparse” predictive model that makes heavy use of a few features is often easier to interpret than one which incorporates many features in some obscure way. In the extreme case, if a gene's dependency is entirely predicted by a single feature, we could call that feature the biomarker. Finding such a biomarker is another good signal that the gene's profile is real. Therefore, we'll include the feature importance of the top feature for the predictive model to assess confidence.

top_importance = predictions['feature0_importance']

Gene Distribution Properties

In their Project DRIVE paper, McDonald et al. introduced the normality likelihood ratio test (normLRT) score for the profile of gene knockout effects across cell lines. This score is the ratio of the likelihood of the data assuming it was generated by a skewed-t distribution vs a normal distribution, i.e. a measure of the non-normality of the gene's profile. It is unclear to us why this measure should be more successful at enriching for interesting selective dependencies than simpler metrics like plain skewness or variance, but in practice we've found that NormLRT is superior. Thus, we'll use this property as well to assign gene confidence.

gene_summary = tc.get(name="summary-table-0720", file='Target-Discovery-20Q2-internal')
    lambda x: "%s (%s)" % (x['symbol'], x['entrez_id']), axis=1),
normLRT = gene_summary['Avana_LRT']

Putting Features Together

We'll assemble the features in a single dataframe, and impute missing values to the median.

gene_confidence_features_unfilled = pd.DataFrame({
    'guide_consistency_mean': mean_guide_consistency,
    'guide_consistency_max': max_guide_consistency,
    'unique_guides': nunique,
    'score_consistency': consistency_score,
    'rnai_consistency': consistency_rnai,
    'normLRT': normLRT,
    'predictability': predictability,
    'top_feature_importance': top_importance,
    'top_feature_confounder': confounded
gene_confidence_features_unfilled['unique_guides'].fillna(0, inplace=True)

gene_confidence_features = gene_confidence_features_unfilled.copy()
for f in gene_confidence_features:
    gene_confidence_features[f].fillna(gene_confidence_features[f].median(), inplace=True)

Creating Training Labels

At the end of the day, we want a single number that tells us how much to trust an Achilles gene effect profile. However, it's not obvious how to weigh each of our confidence features, nor what the interactions between features might be. To solve this, we'll use the same strategy we employed to combine different measures of consistency into a single measure: train a model to distinguish good and bad examples.

Unlike our consistency measures, it's not so obvious what constitutes a bad gene or, especially, a good gene. We'll identify bad genes in various ways, some orthogonal to the data we supply the model and some not. We'll train our model to distinguish these genes from the rest of genes, under the assumption that most gene profiles are good. (This is justified by the strong agreement we found with the Project Score data.)

First, expression data:

expression = tc.get(name='public-20q2-075d', file='CCLE_expression')
expression = expression.reindex(index=gene_effect_achilles.index, 

unexpressed = expression < .01

depleted = gene_effect_achilles < -.5

unexpressed_depleted_sum = (unexpressed & depleted).sum()

Genes that are depleted in at least two cell lines that don't express them are suspicious.

unexpressed_depleted_genes = unexpressed_depleted_sum.loc[lambda x: x > 1].index

In DepMap, we use a set of known essentials which have been derived from orthogonal sources (not DepMap data) in Blomen et al. and Hart et al.. If the median score for such a gene is greater than -0.25, that also is an indication that the gene's profile is bad (this time because the gene fails to score when it should).

common_essentials = tc.get(name='public-20q2-075d', file='common_essentials')['gene']
common_essentials = sorted(set(common_essentials) & set(gene_effect_achilles.columns))

essentials_not_depleted = gene_effect_achilles[common_essentials].median()\
        .loc[lambda x: x > -.25].index

essentials_not_depleted = gene_effect_achilles[common_essentials].median()\
        .loc[lambda x: x > -.25].index

Most genes show good agreement with Project Score. Those that don't, and aren't rescued by RNAi agreement, we'll assume are bad.

not_consistent = sorted(
    set(consistency_score.loc[lambda x: x < .2].index) 
    & set(consistency_rnai.loc[lambda x: x < .5].index)

These measures aren't perfect. A gene that is supposed to be common essential might actually not be, or there could be errors in expression data. There are some “bad” genes which have countervailing evidence that they're good - for example, strong agreement with RNAi. We'll want to exclude these from our list of reference bad genes, to ensure we're only using genes we're convinced are bad.

Genes with good RNAi agreement:

rnai_agree = consistency_rnai.loc[lambda x: x > .7].index

Genes that are supposed to be nonessential and indeed don't show much depletion or enrichment:

nonessentials = tc.get(name='public-20q2-075d', file='nonessentials')['gene']
nonessentials = sorted(set(nonessentials) & set(gene_effect_achilles.columns))

good_nonessentials = (
    (gene_effect_achilles[nonessentials].max() < .6)
    & (gene_effect_achilles[nonessentials].min() > -.6)
).loc[lambda x: x].index

Genes that are supposed to be essential and indeed are depleted in most lines:

good_essentials = gene_effect_achilles[common_essentials].quantile(.9).loc[lambda x: x<-.25].index

Genes that are predictable using only features known to be related to the gene itself, and not relying on confounders:

related_predictable = predictions_full\
                        .query("feature_set == 'Related'")\
                        .query("dataset == 'Avana'")\
                        .query('sample_set == "pan-cancer"')\
                        .query('pearson > .4')
related_predictable = related_predictable[related_predictable.feature0.apply(
        lambda s: not s.endswith("Confounders")

And genes with at least four unique guides and very high consistency between guides:

consistent_guides = sorted(
    set(mean_guide_consistency.loc[lambda x: x > .9].index)
    & set(nunique.loc[lambda x: x >= 4].index)

Now, we'll assemble our lists of “good” and “bad” genes, such that the number of genes in each class is balanced. Note that we restrict the number of unexpressed depleted genes that we include as bad genes, so that they don't completely overwhelm other types of bad gene.

bad_genes = (
      set(np.random.choice(unexpressed_depleted_genes, 200, replace=False)) 
    | set(essentials_not_depleted)
    | set(not_consistent)

rescued_genes = (
    | set(good_nonessentials)
    | set(good_essentials)
    | set(related_predictable)
    | set(consistent_guides)

good_genes = sorted(set(gene_effect_achilles.columns) - set(bad_genes))

bad_genes = sorted(bad_genes - rescued_genes)

good_genes = list(np.random.choice(good_genes, size=len(bad_genes), replace=False))

Training a Model

With both features and labels in hand, we can train a model to distinguish these two. We'll use logistic regression so we can directly examine feature coefficients.

from sklearn.linear_model import LogisticRegression
confidence_model = LogisticRegression()[good_genes + bad_genes].values, 
          [1]*len(good_genes) + [0]*len(bad_genes)


Let's see how it does in sample. Given the low number of features (9) and a linear decision boundary, overfitting is not a major concern.

gene_confidence = pd.Series(confidence_model.predict_proba(gene_confidence_features.values)[:, 1],

sns.kdeplot(gene_confidence[good_genes], label="'Good' Genes", bw=.02)
sns.kdeplot(gene_confidence[bad_genes], label="'Bad' Genes", bw=.02)
sns.kdeplot(gene_confidence.drop(good_genes).drop(bad_genes), label="Other Genes")
plt.xlabel("Gene Confidence Score")


Let's consider the genes with the highest confidence. We find many familiar results, which indeed indicates that the model matches our intuitions about which genes we trust the most.

PAX8 (7849)       0.997523
HNF1B (6928)      0.995980
IRF4 (3662)       0.995862
NRAS (4893)       0.995521
BRAF (673)        0.995281
TP63 (8626)       0.994320
CCND2 (894)       0.994204
CTNNB1 (1499)     0.993539
SOX10 (6663)      0.993215
MYB (4602)        0.992541
DUSP4 (1846)      0.992265
SMARCA2 (6595)    0.992252
KRAS (3845)       0.991734
TP53 (7157)       0.990908
POU2AF1 (5450)    0.990762
TCF7L2 (6934)     0.989099
EBF1 (1879)       0.989049
TTC7A (57217)     0.988746
CDKN1A (1026)     0.988145
MYCN (4613)       0.987962
dtype: float64

Now consider one of the genes with the lowest confidence scores. On my run, the lowest confidence gene was RPL21. Due to the stochasticity in how we created our training sets, RPL21 may not be the lowest scoring gene in some runs, but it should always score badly.

badgene = 'RPL21 (6144)'

RPL21 (6144)    0.070075
dtype: float64

Examining its features and how they rank:

    "Feature_Value": gene_confidence_features_unfilled.loc[badgene],
    "Feature_Quantile": gene_confidence_features_unfilled.rank(pct=True).loc[badgene]
                       Feature_Value  Feature_Quantile
guide_consistency_mean      0.383238          0.032274
guide_consistency_max        0.82528          0.161757
unique_guides                      0          0.004802
score_consistency          0.0239811          0.000283
rnai_consistency                 NaN               NaN
normLRT                      47.7278          0.398915
predictability              0.236654          0.664963
top_feature_importance     0.0947605          0.946438
top_feature_confounder          True          0.967830

This gene has very poor guide consistency and equally poor consistency with Score gene effects. The likely reason is that all its guides are highly promiscuous, with exact matches to over twenty places in the genome (although almost none of these are in other exons).

Finally, let's look at how the model weights the different features we chose (normalized by standard deviation):

pd.Series(confidence_model.coef_[0], index=gene_confidence_features.columns)*gene_confidence_features.std()
guide_consistency_mean    0.311866
guide_consistency_max    -0.129073
unique_guides             0.081634
score_consistency         0.732798
rnai_consistency          0.338249
normLRT                   0.113466
predictability            0.213900
top_feature_importance    0.005192
top_feature_confounder   -0.028312
dtype: float64

The most important measures are consistency with SCORE data, guide consistency, and consistancy with RNAi, followed by overall predictability. Other measures contributed relatively little to the model. Intriguingly, while mean guide consistency is a strong positive signal, max guide consistency is negative. If we consider a gene for which a single pair of guides are in good agreement and the rest are not, the model tells us we should have low confidence in that gene.

Notes and Caveats

It's important to note a few things about these gene confidence scores. First, the actual distribution is not very meaningful. We can change the median confidence score very easily by changing the class balance of “good” and “bad” genes. With that in mind, we can't say that a confidence score of 0.6 means you should have 60% confidence that the result reflects real biology, or that the false discovery rate is 40%. What we can say from the distribution of confidence scores is that confidence below 0.5 is a bad sign, and we would recommend caution in following up any genes in this category.

Second, these confidence scores are not validated against an independent ground truth. We incorporated our own judgments about which genes are good or bad, in some cases using exactly the features we're supplying to the model. Thus, to some extent the gene confidence scores simply encode our prior intuitions, but hopefully in a robust and generalizable way.

That said, the high enrichment of known oncogenes and high-confidence lineage genes among the highest scores is a good indication that gene confidence scores are reflecting real and useful information.

comments powered by Disqus