Assessing Confidence in Achilles Gene Profiles


One of the most common questions we get at DepMap is whether to trust a particular gene's CRISPR knockout profile in Achilles. The motivation behind this question is perfectly understandable: in a field where billions have been spent chasing shadows, researchers once bitten have become twice shy.

I wish I could say that untrustworthy genetic dependency results were left behind in the bad old days of RNAi, but this is not the case. Even CRISPR has off-target reagents. In Achilles, before we run CERES we remove guides that are clear outliers relative to the other guides targeting the same gene. Nonetheless, we know some guides still exhibit off-target effects; for example, they may be strongly depleted in cell lines where the target gene is not expressed.

When someone is interested in a particular gene, there are a number of markers we can suggest they look at: its predictability, top correlates, consistency with other pan-cancer profiles, and (for the more computationally savvy) agreement among its guides. However, this is an ad hoc, labor-intensive process, and the results are often murky. For this reason, we've decided to calculate a confidence score for each gene effect profile in CRISPR Achilles. We hope these scores will help the community decide which genes to pursue.

Guide Consistency

Guide consistency is an important indicator that a gene's profile is reliable, but it's not obvious how to measure consistency. Correlation measures seem like a natural choice, but consider a gene with no viability phenotype: We expect correlation to be poor as the variation across screens will be driven by random noise, but the guides might still be consistently near zero and therefore indicate a reliable result. On the other hand, a mean squared error measure will look good for genes where the guides have similar means, even if one guide has a few cell lines that look depleted and the rest do not.
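To make this concrete, here is a small synthetic illustration (hypothetical numbers, not Achilles data) of the two failure modes described above:

import numpy as np

rng = np.random.RandomState(0)

# Case 1: a gene with no viability phenotype. Both guides hover near zero, so they
# agree in the sense we care about, yet their correlation is ~0 because the
# remaining variation is just noise.
guide_a = rng.normal(0, .05, 500)
guide_b = rng.normal(0, .05, 500)
print(np.corrcoef(guide_a, guide_b)[0, 1])   # near zero
print(((guide_a - guide_b)**2).mean())       # tiny

# Case 2: two guides with similar means, but one shows depletion in a handful of
# lines. The mean squared error still looks small, even though the guides
# disagree exactly where it matters.
guide_c = np.zeros(500)
guide_d = np.zeros(500)
guide_d[:10] = -1.
print(((guide_c - guide_d)**2).mean())       # still small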

Since no single measure corresponds to our intuitive sense of guide consistency in all cases, we calculate multiple measures of guide agreement and train a random forest classifier to predict whether two guides target the same gene. Its confidence that they do will be used as the guide consistency score. Conveniently, this is a number between 0 and 1.

import numpy as np
import pandas as pd
from scipy.stats import spearmanr, pearsonr

def get_metrics(A, B):
    '''
    computes a bunch of agreement measures between vectors A and B, suitable for seeing
    whether two measurements of gene effect are in agreement
    '''
    assert all(A.index == B.index), "misaligned indices between the two series"
    mask = A.notnull() & B.notnull()
    a = A[mask]
    b = B[mask]
    a = a.loc[b.index]
    meanA = a.mean()
    meanB = b.mean()
    mean_se = (meanA - meanB)**2
    mean_mean = .5*(meanA + meanB)
    corr = a.corr(b)
    spearman = spearmanr(a, b)[0]
    varA = a.var()
    varB = b.var()
    var_se = np.abs(varA - varB)
    var_mean = .5*(varA + varB)
    centered_SE = ((a - meanA - b + meanB) **2).mean()
    status_agreement = ((a < -.5) == (b < -.5)).mean()
    return {
        'mean_se': mean_se,
        'mean_mean': mean_mean,
        'corr': corr,
        'spearman': spearman,
        'var_se': var_se,
        'var_mean': var_mean,
        'centered_SE': centered_SE,
        'max_mean': max(meanA, meanB),
        'min_mean': min(meanA, meanB),
        'max_var': max(varA, varB),
        'min_var': min(varA, varB),
        'max_max': max(max(a), max(b)),
        'min_min': min(min(a), min(b)),
        'status_agreement': status_agreement
    }
# Note: the taiga pulls below can be replaced with code that loads the corresponding files locally,
# assuming the latest release has been saved to a specified directory.

from taigapy import default_tc as tc
achilles_lfc = tc.get(name='public-20q2-075d', file='Achilles_logfold_change')

achilles_guide_map = tc.get(name='public-20q2-075d', file='Achilles_guide_map')

achilles_replicate_map = tc.get(name='public-20q2-075d', file='Achilles_replicate_map')\
                            .set_index("replicate_ID")
# collapse replicates to cell lines by median
achilles_lfc_cell = achilles_lfc.groupby(achilles_replicate_map.DepMap_ID, axis=1).median()

To train the random forest, we'll first generate the agreement metrics for 5000 pairs of guides targeting the same gene, then for 5000 pairs of guides targeting two different, randomly chosen genes. The random forest will be trained to use these metrics to discriminate the first group from the second.

First, the pairs targeting the same gene:

features = []
is_same = pd.Series()
nsamples = 5000
genes = np.random.choice(achilles_guide_map.gene.dropna().unique(), size=nsamples, 
            replace=False)
for i, gene in enumerate(genes):
    if not i%1000:
        print('%1.0f%% done' % (i * 100./len(genes)))
    guides = achilles_guide_map.query('gene == "%s"' % gene).sgrna.tolist()
    if len(guides) < 2:
        continue     
    np.random.shuffle(guides)
    
    out = get_metrics(achilles_lfc_cell.loc[guides[0]], achilles_lfc_cell.loc[guides[1]])
    out.update({'gene': gene, 'guide1': guides[0], 'guide2': guides[1]})
    features.append(out)
    is_same[gene] = 1

Now the pairs targeting different genes. We'll reuse the genes to avoid introducing unnecessary bias:

for i, gene in enumerate(genes):
    if not i%1000:
        print('%1.0f%% done' % (i * 100./len(genes)))
    guides = achilles_guide_map.query('gene == "%s"' % gene).sgrna.tolist()
    if len(guides) < 2:
        continue
        
    gene2 = gene
    while gene2 == gene or (gene + ' x ' + gene2) in is_same:
        gene2 = np.random.choice(genes)
        guide2 = np.random.choice(achilles_guide_map.query('gene == %r' % gene2).sgrna)

    guide1 = guides[0]
    # avoid the (unlikely) case that the two genes happen to share a guide and 
    # we've chosen that guide for both genes
    while guide1 == guide2:
        guide1 = np.random.choice(guides)
    np.random.shuffle(guides)
    # a human readable name for the pair of chosen genes
    pseudogene = gene + ' x ' + gene2
    
    out = get_metrics(achilles_lfc_cell.loc[guide1], achilles_lfc_cell.loc[guide2])
    out.update({'gene': pseudogene, 'guide1': guide1, 'guide2': guide2})
    features.append(out)
    is_same[pseudogene] = 0
    
features = pd.DataFrame(features).set_index('gene')

Now, to train the random forest:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(20, min_impurity_decrease=.00001, min_samples_leaf=100,
                               max_depth=6)

We visualize the (in-sample) separation below. However, remember that our goal is not really to predict accurately whether two guides target the same gene, but only to generate a measure of their agreement. That said, if you test the out-of-sample separation you should find it is similar (see the sketch after the figure below).

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

model.fit(features.drop(['guide1', 'guide2'], axis=1).values, is_same.values)
in_sample = model.predict_proba(
    features.drop(['guide1', 'guide2'], axis=1).values, 
)[:, 1]

sns.kdeplot(in_sample[is_same==1], label="Same Gene", bw=.01)
sns.kdeplot(in_sample[is_same==0], label='Different Genes', bw=.01)
_ = plt.xlabel("Guide Consistency Score")

[Figure: density of in-sample guide consistency scores for same-gene versus different-gene guide pairs]
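To verify that out-of-sample claim yourself, one quick option (not part of the pipeline above) is scikit-learn's cross_val_predict, which yields held-out probabilities for every pair:

from sklearn.model_selection import cross_val_predict

X = features.drop(['guide1', 'guide2'], axis=1).values
# five-fold cross-validated probabilities using the same model settings
out_of_sample = cross_val_predict(model, X, is_same.values, cv=5,
                                  method='predict_proba')[:, 1]

sns.kdeplot(out_of_sample[is_same == 1], label="Same Gene", bw=.01)
sns.kdeplot(out_of_sample[is_same == 0], label='Different Genes', bw=.01)
_ = plt.xlabel("Guide Consistency Score (out of sample)")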

Finally, we'll calculate the agreement score for all possible pairs of guides targeting the same gene.

gene_indexed = achilles_guide_map.dropna().set_index('gene').sgrna.drop_duplicates()
agreement = []
features = []
for i, gene in enumerate(gene_indexed.index.unique()):
    if not i % 1000 and i > 0:
        print('%i genes complete' % i)
    guides = gene_indexed.loc[[gene]]
    if len(guides) < 2:
        continue
    for j, guide1 in enumerate(guides[:-1]):
        for guide2 in guides[j+1:]:
            assert guide1 != guide2, "%r %s %s" % (guides, guide1, guide2)
            features.append(get_metrics(
                achilles_lfc_cell.loc[guide1], 
                achilles_lfc_cell.loc[guide2]
            ))
            agreement.append({'gene': gene, 'guide1': guide1, 'guide2': guide2})
features = pd.DataFrame(features)
agreement = pd.DataFrame(agreement)
agreement['agreement'] = model.predict_proba(features.values, )[:, 1]
sns.kdeplot(agreement['agreement'].values, bw=.01)
plt.xlabel("Guide Consistency Score (All Achilles Genes)")

[Figure: density of guide consistency scores across all Achilles genes]

We'll summarize guide consistency in two ways: the maximum agreement between any pair of guides, and the mean agreement over all possible pairs.

max_guide_consistency = agreement.groupby("gene")['agreement'].max()
mean_guide_consistency = agreement.groupby("gene")['agreement'].mean()

Another, simpler guide property that could determine how much confidence we have in the gene score is the number of guides uniquely aligned to that gene.

unique_guides = achilles_guide_map.groupby('sgrna').n_alignments.sum().loc[lambda x: x==1].index

nunique = achilles_guide_map[
    achilles_guide_map.sgrna.isin(unique_guides)
].groupby('gene').n_alignments.count()

Consistency Between Achilles, Project Score, and RNAi Data

Perhaps the most powerful signal that a gene profile is reliable comes from agreement between independent experiments, especially if one of those experiments uses a different perturbation technology altogether. We're therefore interested in generating consistency scores for the gene effects as measured in three different datasets: Project Achilles CRISPR, Project Score, and the DEMETER 2 combined RNAi dataset. (Because of its name, Project Score is unfortunately easy to confuse with other scores of various kinds.)

Just as with guide consistency, no one measure of agreement between datasets is appropriate for all genes, so we'll use the same approach we did there:

def train_gene_consistency(A, B, nsamples=5000):
    overlap_genes = sorted(set(A.columns) & set(B.columns))
    genes = np.random.choice(overlap_genes, nsamples, replace=False)
    np.random.shuffle(genes)
    is_same = pd.Series()
    features = {}                     
    
    for i, gene in enumerate(genes):
        if not i % 1000 or i == nsamples - 1:
            print("%1.f done" % (i*100./nsamples))
        a = A[gene].dropna()
        b = B[gene].dropna()
        overlap_lines = sorted(set(a.index) & set(b.index))
        features[gene] = get_metrics(a[overlap_lines], b[overlap_lines])
        is_same[gene] = 1
        
        if i == 0:
            gene2 = genes[-1]
        else:
            gene2 = genes[i-1]
        b = B[gene2].dropna()
        overlap_lines = sorted(set(a.index) & set(b.index))
        pseudogene = '%s x %s' % (gene, gene2)
        features[pseudogene] = get_metrics(a[overlap_lines], b[overlap_lines])
        is_same[pseudogene] = 0
        
    features = pd.DataFrame(features).T.loc[is_same.index]
    model = RandomForestClassifier(20,
                                   min_impurity_decrease=.00001,
                                   min_samples_leaf=100,
                                   max_depth=6)
    model.fit(features.values, is_same.values)
    return model, features, is_same
gene_effect_achilles = tc.get(name='public-20q2-075d', file='Achilles_gene_effect')
gene_effect_score = tc.get(name='depmap-a0ab', file='SCORE_gene_effect')
gene_effect_rnai = tc.get(name='depmap-a0ab', file='D2_gene_effect')
model_rnai, features_rnai, is_same_rnai = train_gene_consistency(gene_effect_achilles,
                                                                   gene_effect_rnai) 
model_score, features_score, is_same_score = train_gene_consistency(gene_effect_achilles,
                                                                   gene_effect_score) 

We'll visualize the consistency distributions as before. First, RNAi:

in_sample = model_rnai.predict_proba(
    features_rnai.values, 
)[:, 1]

sns.kdeplot(in_sample[is_same_rnai==1], label="Same Gene", bw=.01)
sns.kdeplot(in_sample[is_same_rnai==0], label='Different Genes', bw=.01)
_ = plt.xlabel("RNAi/Achilles Consistency")

[Figure: density of RNAi/Achilles consistency scores for same-gene versus different-gene pairs]

RNAi and CRISPR gene scores often look quite different. Many genes that are selectively essential in RNAi are common essentials in CRISPR. These differences are starkly reflected in the large body of profiles for which the random forest cannot tell whether they represent the same gene, producing the peak at 0.5. Thus, while good agreement between RNAi and CRISPR is a very strong positive signal, lack of agreement is not necessarily meaningful.

Next, agreement between Project Score and Project Achilles:

in_sample = model_score.predict_proba(
    features_score.values, 
)[:, 1]

sns.kdeplot(in_sample[is_same_score==1], label="Same Gene", bw=.01)
sns.kdeplot(in_sample[is_same_score==0], label='Different Genes', bw=.01)
_ = plt.xlabel("Score/Achilles Consistency")

[Figure: density of Score/Achilles consistency scores for same-gene versus different-gene pairs]

These two datasets show much better concordance, and for most genes it is relatively easy to tell whether two profiles belong to the same gene.

We now evaluate consistency for all overlapping genes:

def evaluate_consistency(A, B, model):
    features = {}
    overlap_genes = sorted(set(A.columns) & set(B.columns))
    for i, gene in enumerate(overlap_genes):
        if not i % 1000 and i > 0:
            print("%i genes done" % i)
        a = A[gene].dropna()
        b = B[gene].dropna()
        overlap_lines = sorted(set(a.index) & set(b.index))
        features[gene] = get_metrics(a[overlap_lines], b[overlap_lines])
    features = pd.DataFrame(features).T
    agreement = model.predict_proba(features.values)[:, 1]
    return pd.Series(agreement, index=features.index)
consistency_rnai = evaluate_consistency(gene_effect_rnai, gene_effect_achilles, model_rnai)
consistency_score = evaluate_consistency(gene_effect_score, gene_effect_achilles, model_score)

Predictability

If a perturbation is predictable, that's a good indication that real biology underlies the observed effect, assuming the predictor isn't just exploiting data artifacts. We've recently introduced gene predictability to the portal, and its results are useful to consider when deciding how much to trust a gene's dependency results.

predictions_full = tc.get(name='predictability-d5b9', file='ensemble-regression-complete')
predictions = predictions_full\
        .query('feature_set == "Unbiased"')\
        .query("dataset == 'Avana'")\
        .query("sample_set == 'pan-cancer'")\
        .set_index('gene')
predictability = predictions['pearson']

Some genes are predictable using confounding variables, which does not increase our confidence in their profile. We'll therefore create a column that indicates whether the most important feature in the predictive model is a confounder:

confounded = predictions.feature0.apply(lambda s: s.split('_')[-1] == "Confounders")

A “sparse” predictive model that makes heavy use of a few features is easier to interpret than one that incorporates many features in some obscure way. In the extreme case, if a gene's dependency is entirely predicted by a single feature, we could call that feature the biomarker. Finding a biomarker is another good signal that the gene's profile is real. Therefore, we'll also include the importance of the top feature in each gene's predictive model when assessing confidence.

top_importance = predictions['feature0_importance']

Gene Distribution Properties

In their Project DRIVE paper, McDonald et al. introduced the normal likelihood ratio test (normLRT) score for the profile of gene knockout effects across cell lines. This score is based on the ratio of the likelihood of the data under a skewed-t distribution versus a normal distribution, i.e. it is a measure of the non-normality of the gene's profile. It is unclear to us why this measure should be more successful at enriching for interesting selective dependencies than simpler metrics like skewness or variance, but in practice we've found normLRT to be superior. Thus, we'll use this property as well in assigning gene confidence.
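For intuition, here is a rough sketch of such a likelihood-ratio score. This is not how the Avana_LRT column loaded below was computed: scipy does not ship a skewed-t distribution, so this sketch substitutes a symmetric Student's t as a stand-in.

import numpy as np
from scipy import stats

def norm_lrt_sketch(gene_effects):
    # 2 * (log-likelihood under a fitted t distribution
    #      minus log-likelihood under a fitted normal)
    x = np.asarray(gene_effects, dtype=float)
    x = x[~np.isnan(x)]
    ll_t = stats.t.logpdf(x, *stats.t.fit(x)).sum()
    ll_norm = stats.norm.logpdf(x, *stats.norm.fit(x)).sum()
    return 2 * (ll_t - ll_norm)

# e.g. norm_lrt_sketch(gene_effect_achilles['KRAS (3845)'])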

gene_summary = tc.get(name="summary-table-0720", file='Target-Discovery-20Q2-internal')
gene_summary.set_index(gene_summary.apply(
    lambda x: "%s (%s)" % (x['symbol'], x['entrez_id']), axis=1),
                       inplace=True
                      )
normLRT = gene_summary['Avana_LRT']

Putting Features Together

We'll assemble the features in a single dataframe, and impute missing values to the median.

gene_confidence_features_unfilled = pd.DataFrame({
    'guide_consistency_mean': mean_guide_consistency,
    'guide_consistency_max': max_guide_consistency,
    'unique_guides': nunique,
    'score_consistency': consistency_score,
    'rnai_consistency': consistency_rnai,
    'normLRT': normLRT,
    'predictability': predictability,
    'top_feature_importance': top_importance,
    'top_feature_confounder': confounded
}).loc[gene_effect_achilles.columns]

gene_confidence_features = gene_confidence_features_unfilled.copy()
for f in gene_confidence_features:
    gene_confidence_features[f].fillna(gene_confidence_features[f].median(), inplace=True)

Creating Training Labels

At the end of the day, we want a single number that tells us how much to trust an Achilles gene effect profile. However, it's not obvious how to weigh each of our confidence features, nor what the interactions between features might be. To solve this, we'll use the same strategy we employed to combine different measures of consistency into a single measure: train a model to distinguish good and bad examples.

Unlike our consistency measures, it's not so obvious what constitutes a bad gene or, especially, a good gene. We'll identify bad genes in various ways, some orthogonal to the data we supply the model and some not. We'll train our model to distinguish these genes from the rest of genes, under the assumption that most gene profiles are good. (This is justified by the strong agreement we found with the Project Score data.)

First, expression data:

expression = tc.get(name='public-20q2-075d', file='CCLE_expression')
expression = expression.reindex(index=gene_effect_achilles.index, 
                                columns=gene_effect_achilles.columns)

unexpressed = expression < .01

depleted = gene_effect_achilles < -.5

unexpressed_depleted_sum = (unexpressed & depleted).sum()

Genes that are depleted in at least two cell lines that don't express them are suspicious.

unexpressed_depleted_genes = unexpressed_depleted_sum.loc[lambda x: x > 1].index

In DepMap, we use a set of known essentials which have been derived from orthogonal sources (not DepMap data). If the median score for such a gene is greater than -0.25, that also is an indication that the gene's profile is bad (this time because the gene fails to score when it should).

common_essentials = tc.get(name='public-20q2-075d', file='common_essentials')['gene']
common_essentials = sorted(set(common_essentials) & set(gene_effect_achilles.columns))

essentials_not_depleted = gene_effect_achilles[common_essentials].median()\
        .loc[lambda x: x > -.25].index


Most genes show good agreement with Project Score. Those that don't, and aren't rescued by RNAi agreement, we'll assume are bad.

not_consistent = sorted(
    set(consistency_score.loc[lambda x: x < .2].index) 
    & set(consistency_rnai.loc[lambda x: x < .5].index)
)

Finally, some genes are predicted by confounders. This does not necessarily mean a gene's profile is bad. For a strong common essential, we expect that most of its variation will be explained by the quality of the screen, but nonetheless it will be clearly a common essential. However, if we see that a gene's profile is explained by screen media, it may give a false impression of being an interesting selective dependency.

media_confounded = predictions_full\
                        .query("feature_set == 'Unbiased'")\
                        .query("dataset == 'Avana'")\
                        .query('sample_set == "pan-cancer"')

media_confounded = media_confounded[
    media_confounded.apply(lambda x: 
        x['feature0'].endswith('media_Confounders') 
     or x['feature1'].endswith("media_Confounders"),
                          axis=1)
].gene

These measures aren't perfect. A gene that is supposed to be common essential might actually not be, or there could be errors in expression data. There are some “bad” genes which have countervailing evidence that they're good - for example, strong agreement with RNAi. We'll want to exclude these from our list of reference bad genes, to ensure we're only using genes we're convinced are bad.

Genes with good RNAi agreement:

rnai_agree = consistency_rnai.loc[lambda x: x > .7].index

Genes that are supposed to be nonessential and indeed don't show much depletion or enrichment:

nonessentials = tc.get(name='public-20q2-075d', file='nonessentials')['gene']
nonessentials = sorted(set(nonessentials) & set(gene_effect_achilles.columns))

good_nonessentials = (
    (gene_effect_achilles[nonessentials].max() < .6)
    & (gene_effect_achilles[nonessentials].min() > -.6)
).loc[lambda x: x].index

Genes that are supposed to be essential and indeed are depleted in most lines:

good_essentials = gene_effect_achilles[common_essentials].quantile(.9).loc[lambda x: x<-.25].index

Genes that are predictable using only features known to be related, and not relying on confounders:

related_predictable = predictions_full\
                        .query("feature_set == 'Related'")\
                        .query("dataset == 'Avana'")\
                        .query('sample_set == "pan-cancer"')\
                        .query('pearson > .4')
related_predictable = related_predictable[related_predictable.feature0.apply(
        lambda s: not s.endswith("Confounders")
)].gene

And genes with at least four unique guides and very high consistency between guides:

consistent_guides = sorted(
    set(mean_guide_consistency.loc[lambda x: x > .9].index)
    & set(nunique.loc[lambda x: x >= 4].index)
)

Now, we'll assemble our lists of “good” and “bad” genes, such that the number of genes in each class is balanced. Note that we restrict the number of unexpressed depleted genes that we include as bad genes, so that they don't completely overwhelm other types of bad gene.

bad_genes = (
    set(np.random.choice(unexpressed_depleted_genes, 200, replace=False))
    | set(essentials_not_depleted)
    | set(not_consistent)
    | set(media_confounded)
)

rescued_genes = (
    set(rnai_agree)
    | set(good_nonessentials)
    | set(good_essentials)
    | set(related_predictable)
    | set(consistent_guides)
)

good_genes = sorted(set(gene_effect_achilles.columns) - set(bad_genes))

bad_genes = sorted(bad_genes - rescued_genes)

good_genes = list(np.random.choice(good_genes, size=len(bad_genes), replace=False))

Training a Model

With both features and labels in hand, we can train a model to distinguish these two. We'll use logistic regression so we can directly examine feature coefficients.

from sklearn.linear_model import LogisticRegression
confidence_model = LogisticRegression()
# RandomForestClassifier(
#     100, min_impurity_decrease=.00001, min_samples_leaf=5, max_depth=6
# )
confidence_model.fit(gene_confidence_features.loc[good_genes + bad_genes].values, 
          [1]*len(good_genes) + [0]*len(bad_genes)
         )

Let's see how it does in sample. Given the low number of features (9) and simple model, overfitting is not a major concern.

gene_confidence = pd.Series(
    confidence_model.predict_proba(gene_confidence_features.values)[:, 1],
    index=gene_confidence_features.index
)

sns.kdeplot(gene_confidence[good_genes], label="'Good' Genes", bw=.02)
sns.kdeplot(gene_confidence[bad_genes], label="'Bad' Genes", bw=.02)
sns.kdeplot(gene_confidence.drop(good_genes).drop(bad_genes), label="Other Genes")
plt.xlabel("Gene Confidence Score")

[Figure: density of gene confidence scores for 'good', 'bad', and other genes]
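As a quick check on the overfitting claim above (not part of the original analysis), cross-validated performance for the same features and labels can be computed with scikit-learn:

from sklearn.model_selection import cross_val_score

X_conf = gene_confidence_features.loc[good_genes + bad_genes].values
y_conf = [1]*len(good_genes) + [0]*len(bad_genes)
cv_auc = cross_val_score(LogisticRegression(), X_conf, y_conf, cv=5, scoring='roc_auc')
print("mean cross-validated ROC AUC: %.3f" % cv_auc.mean())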

Let's consider the genes with the highest confidence. We find many familiar results, which indeed indicates that the model matches our intuitions about which genes we trust the most.

gene_confidence.drop(good_genes).drop(bad_genes).sort_values(ascending=False)[:20]
PAX8 (7849)       0.986695
HNF1B (6928)      0.983822
NRAS (4893)       0.983791
BRAF (673)        0.982439
CCND2 (894)       0.981985
MYB (4602)        0.981482
FOXA1 (3169)      0.980081
TP63 (8626)       0.979571
KRAS (3845)       0.979081
CTNNB1 (1499)     0.978836
DUSP4 (1846)      0.978219
SMARCA2 (6595)    0.978166
SOX10 (6663)      0.976077
IRF4 (3662)       0.974058
ZEB2 (9839)       0.973902
CCND3 (896)       0.972568
MDM2 (4193)       0.972512
TCF7L2 (6934)     0.972327
TTC7A (57217)     0.971921
FGFR1 (2260)      0.970606
dtype: float64

And finally, here is how the model weights the different features we chose (coefficients scaled by each feature's standard deviation):

pd.Series(confidence_model.coef_[0], index=gene_confidence_features.columns)*gene_confidence_features.std()
guide_consistency_mean    0.257402
guide_consistency_max    -0.106985
unique_guides             0.029300
score_consistency         0.723465
rnai_consistency          0.368621
normLRT                   0.052350
predictability            0.137485
top_feature_importance    0.002610
top_feature_confounder   -0.095015
dtype: float64

The most important measures are consistency with Project Score data, guide consistency, and consistency with RNAi, followed by overall predictability. The other measures contributed relatively little to the model. Intriguingly, while mean guide consistency is a strong positive signal, max guide consistency carries a negative weight, meaning that a single pair of guides in strong agreement that disagrees with the rest does not indicate a good gene.

Notes and Caveats

It's important to note a few things about these gene confidence scores. First, the actual distribution is not very meaningful. We can easily change the median confidence score by changing the class balance of “good” and “bad” genes. With that in mind, we can't say that a confidence score of 0.6 means you should have 60% confidence that the result reflects real biology, or that the false discovery rate is 40%. What we can say from the distribution of confidence scores is that confidence below 0.5 is a bad sign, and we would recommend caution in following up any genes in this category.

Second, these confidence scores are not validated against an independent ground truth. We incorporated our own judgments about which genes are good or bad, in some cases using exactly the features we're supplying to the model. Thus, the gene confidence scores to some extent are simply encoding our prior intuitions, albeit hopefully in a robust and generalizable way.

That said, the high enrichment of known oncogenes and high-confidence lineage genes among the highest scores is a good indication that gene confidence scores are reflecting real and useful information.
