1 A unique sgRNA is a guide that targets only one gene
The problem with multi-gene targeting guides
CERES, the mathematical model used in the Achilles pipeline to generate gene effect scores, doesn't handle multi-gene targeting guides well.
CERES tries to combine logfold change data (difference in the abundance of each guide between the beginning and end of the experiment) with copy number data to create gene effect scores. Gene effect scores are an estimate of how much cell death occurs when a gene is knocked out in a cell line - this estimate accounts for things like copy number variation and guide efficacy. More negative scores indicate lower viability, and more positive scores indicate outgrowth.
This works quite well in most cases. However, the CERES algorithm doesn't handle the ambiguity caused by genes sharing many guides. We expect genes that share many guides to have similar gene effect scores (since these are calculated from the guide logfold changes), but we find that in these cases CERES often assigns one gene a very high gene effect score and the other a very low one. CERES is basically saying, "I can explain the logfold change values for the guides that target both gene A and gene B by adding together their gene effect scores."
Note: you can learn more about CERES here: https://www.nature.com/articles/ng.3984
Let's look at an example (taken from depmap.org)
CDK11A and CDK11B share 3 guides (and CDK11A has one additional unique guide)
So what is CERES doing in this example? If these two genes share all but one of their guides, why do they have such drastically different gene effect scores?
CERES explains the logfold change values of the shared guides on the left as the sum of the CDK11A and CDK11B gene effect scores. It assigns CDK11A a very low gene effect score and CDK11B a very high one so that the two scores add up to the observed logfold change.
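To make the additive explanation concrete, here is a minimal sketch of a toy model. It is not CERES (which also models copy number and guide efficacy); it is plain least squares in which each guide's logfold change is the sum of the scores of the genes it targets. The logfold change values and the function name are made up for illustration.

```python
def fit_two_gene_scores(guide_gene, lfc):
    """Least-squares scores for two genes via the 2x2 normal equations.

    guide_gene: list of (targets_gene_a, targets_gene_b) flags, one per guide.
    lfc: list of logfold changes, one per guide.
    """
    s_aa = sum(a * a for a, _ in guide_gene)
    s_ab = sum(a * b for a, b in guide_gene)
    s_bb = sum(b * b for _, b in guide_gene)
    t_a = sum(a * y for (a, _), y in zip(guide_gene, lfc))
    t_b = sum(b * y for (_, b), y in zip(guide_gene, lfc))
    det = s_aa * s_bb - s_ab * s_ab
    if det == 0:
        # Every guide hits both genes: only the sum of the scores is constrained.
        raise ValueError("individual gene scores are not identifiable")
    gene_a = (s_bb * t_a - s_ab * t_b) / det
    gene_b = (s_aa * t_b - s_ab * t_a) / det
    return gene_a, gene_b


# Three guides target both CDK11A and CDK11B; the fourth targets CDK11A only.
guide_gene = [(1, 1), (1, 1), (1, 1), (1, 0)]
# Hypothetical logfold changes (not DepMap data).
lfc = [-1.0, -1.0, -1.0, -2.5]

cdk11a, cdk11b = fit_two_gene_scores(guide_gene, lfc)
print(f"CDK11A {cdk11a:.2f}, CDK11B {cdk11b:.2f}")  # CDK11A -2.50, CDK11B 1.50
```

In this toy setting the shared guides only constrain the sum of the two scores (here -1.0). The split between the genes is decided entirely by the single unique guide, so CDK11B ends up with a strongly positive score even though every guide that targets it is depleted.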
There are many examples like this where we are very confident that the gene effect scores for genes targeted by many multi-gene targeting guides are inaccurate
Below we show that the overall quality of our dataset improves when we drop these genes
We compare the original gene effect scores from our 19q3 release (labeled 19q3) with the same dataset after dropping the genes that have no unique guides (labeled 19q3_improved)
In the above plot we see that the improved dataset contains fewer unexpressed false positives than regular 19q3. A gene is an unexpressed false positive if it is among the 15% most depleted genes but is not expressed; we expect unexpressed genes not to be vital for survival and thus to have gene effect scores near zero. This means that many of the genes we are dropping are unexpressed false positives, as expected, since we believe CERES assigns inaccurate gene effect scores to these genes.
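The definition of an unexpressed false positive can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: it assumes the 15% cutoff is applied within a single cell line, and the function name, gene names, scores, and expression flags are all hypothetical.

```python
def unexpressed_false_positives(gene_effect, expressed, percent=15):
    """Return unexpressed genes among the `percent` most depleted genes.

    gene_effect: dict of gene -> gene effect score for one cell line.
    expressed: dict of gene -> True if the gene is expressed in that line.
    """
    ranked = sorted(gene_effect, key=gene_effect.get)  # most negative first
    n_top = len(ranked) * percent // 100
    return [gene for gene in ranked[:n_top] if not expressed[gene]]


# Twenty hypothetical genes with scores from -2.0 upward in steps of 0.1.
gene_effect = {f"G{i:02d}": -2.0 + 0.1 * i for i in range(20)}
expressed = {gene: True for gene in gene_effect}
expressed["G01"] = False  # strongly depleted but not expressed
expressed["G10"] = False  # not expressed, but not strongly depleted either

false_positives = unexpressed_false_positives(gene_effect, expressed)
print(false_positives)  # ['G01']
```

Only G01 is flagged: G10 is unexpressed too, but it falls outside the 15% most depleted genes (the three lowest scores out of twenty).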
In the above plot, we see that biomarker Pearson correlation (how well known biomarkers correlate with the gene effect scores of their respective genes) doesn't change. An example of a known biomarker-dependency relationship is high expression of MYCN and dependency on MYCN.
Lastly, in the plot above we see that our ability to recover known functional relationships between pairs of genes increases slightly with the improved dataset. Known relationships were defined using paralogs, human core complexes from CORUM, and protein-protein interaction (PPI) data. Correctly identified known pairs were defined as those with a Pearson correlation with FDR < 0.25 relative to a null distribution of Pearson correlations generated using unrelated pairs of genes.
Starting with the 20Q3 public release, all genes with no unique guides will be dropped.
A list of these genes can be found here
- We are confident that many of these genes have inaccurate gene effect scores and that removing them increases the quality of the Achilles dataset
- These genes and guides will be retained in the logfold change matrix and in the raw readcounts matrix
- There will be ~400 genes dropped (out of ~18,000)
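The rule "drop all genes with no unique guides" can be sketched as below. This is an illustration, not the Achilles pipeline's implementation: the function name, the guide identifiers, and the shape of the guide map (a dict from guide to the set of genes it targets) are assumptions.

```python
def genes_without_unique_guides(guide_to_genes):
    """Return genes that are targeted only by multi-gene targeting guides."""
    all_genes = set()
    genes_with_unique_guide = set()
    for genes in guide_to_genes.values():
        genes = set(genes)
        all_genes |= genes
        if len(genes) == 1:
            # A unique guide targets exactly one gene.
            genes_with_unique_guide |= genes
    return sorted(all_genes - genes_with_unique_guide)


# Hypothetical guide map mirroring the CDK11A / CDK11B example.
guide_to_genes = {
    "sg1": {"CDK11A", "CDK11B"},
    "sg2": {"CDK11A", "CDK11B"},
    "sg3": {"CDK11A", "CDK11B"},
    "sg4": {"CDK11A"},
    "sg5": {"MYCN"},
}

to_drop = genes_without_unique_guides(guide_to_genes)
print(to_drop)  # ['CDK11B']
```

With this map, CDK11A is kept because sg4 targets it alone, while CDK11B would be dropped from the gene effect matrix. Its guides would still appear in the logfold change and raw readcounts matrices.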