Library permutation leads to overly optimistic p-values in CRISPR screens

Although p-values have suffered some reputational damage recently, scientists still feel distinctly uncomfortable without them. Perhaps that's why so many methods for analyzing read count data from pooled CRISPR screens estimate statistical significance as part of their pipeline. Examples include MAGeCK-RRA, MAGeCK-MLE, STARS and gCrisprTools. Each of these uses the same basic framework:

1. Calculate some estimate of effect for each gene
2. Repeat the calculation with the gene target labels for the reagents permuted
3. Calculate p-values using the permuted effect sizes as the null distribution

Let's call p-values generated this way “library permutation p-values.”
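
To make the framework concrete, here is a minimal sketch of a library permutation test. It is not the implementation used by any of the tools named above: the function name is hypothetical, the effect estimate is assumed to be the mean guide log fold change per gene, the test is one-sided for depletion, and the null is pooled across all genes for simplicity.

```python
import numpy as np

def library_permutation_pvalues(guide_lfc, guide_genes, n_perm=1000, seed=0):
    """Toy library-permutation p-values (one-sided, testing for depletion)."""
    rng = np.random.default_rng(seed)
    guide_lfc = np.asarray(guide_lfc, dtype=float)
    genes, gene_idx = np.unique(guide_genes, return_inverse=True)
    guides_per_gene = np.bincount(gene_idx)

    def gene_means(idx):
        # Mean guide log fold change per gene under a given labeling.
        sums = np.bincount(idx, weights=guide_lfc, minlength=len(genes))
        return sums / guides_per_gene

    # Step 1: effect estimate with the real guide-to-gene labels.
    observed = gene_means(gene_idx)

    # Step 2: the same estimate with the gene labels permuted across guides.
    null = np.stack([gene_means(rng.permutation(gene_idx)) for _ in range(n_perm)])

    # Step 3: p-value = fraction of null scores at least as depleted as observed.
    # Pooling the null across genes is a simplifying assumption of this sketch.
    null_sorted = np.sort(null.ravel())
    n_as_extreme = np.searchsorted(null_sorted, observed, side="right")
    pvals = (n_as_extreme + 1) / (null_sorted.size + 1)
    return dict(zip(genes, pvals))
```

Note what the null contains: gene scores built from guides that have nothing to do with each other. Anything that all of a gene's guides share, real or artifactual, will look extreme against it.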

The Problem

Permutation testing is a well-respected method of generating empirical null distributions, and there aren't many options for permutation in a CRISPR screen besides gene assignments. But let's consider what these p-values actually mean. The null hypothesis being tested here is implicitly that all reagents for the gene of interest are entirely off-target to random genes.

Although random off-target effects are indeed a concern in CRISPR screens, clearing this extremely low bar by no means implies that you have controlled false discovery, because CRISPR abounds in artifacts that are shared across guides targeting a gene. Consider some examples:

The Copy Number (Cutting Toxicity) Effect

The granddaddy of CRISPR KO artifacts, the copy number effect is a consequence of indiscriminate Cas9 cutting toxicity. The more copies of a CRISPR sgRNA target sequence there are in a cell, the more cutting toxicity we see regardless of whether the target is nonessential or even intergenic. The problem for library permutation is obvious: a highly amplified nonessential gene will register as a highly significant “hit,” despite being the product of a pure technical artifact.

“Aha,” you say, “but isn't that a solved problem? Didn't you, the Cancer Data Science team, produce the first cutting toxicity correction method and prove it worked?”

Sadly, if by “worked” you mean “removed all bias due to cutting toxicity,” the answer is no. The cutting toxicity effect is much more complicated than it sounds, which is a topic for its own post. For now, it's clear to us that the major published CN correction methods leave a residual bias. This bias is shared across guides targeting a gene, so it won't be controlled for by a null hypothesis involving permuting guide assignments.
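
A small simulation illustrates the point. Everything here is synthetic: all genes are simulated as truly nonessential, and the size of the residual bias (-2.0), the noise level and the library dimensions are invented values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, guides_per_gene = 500, 4

# Every gene is truly nonessential: guide scores are pure noise around zero.
lfc = rng.normal(0.0, 0.5, size=(n_genes, guides_per_gene))

# Hypothetical residual cutting-toxicity bias for one amplified gene,
# shared by all of its guides.
amplified = 0
lfc[amplified] -= 2.0
observed = lfc[amplified].mean()

# Null from permuted labels: mean score of random guide sets of the same size.
flat = lfc.ravel()
null = np.array([
    rng.choice(flat, size=guides_per_gene, replace=False).mean()
    for _ in range(5000)
])

# One-sided empirical p-value for depletion.
p_value = (np.sum(null <= observed) + 1) / (null.size + 1)
print(f"observed mean = {observed:.2f}, permutation p = {p_value:.4g}")
```

The amplified gene has no biological effect in this simulation, yet its score sits far outside the permuted null because the bias moves all of its guides together.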

The Copy Buffering Effect

Confusingly, for essential genes there exists a second copy number effect pointing in the opposite direction to the cutting toxicity effect. This is probably due to the stochastic nature of Cas9-induced mutations. The more copies of a gene you have, the more likely that at least one is still producing a functional protein after Cas9 cutting and repair. Thus, for essential genes, increasing copy number can actually decrease the observed depletion of their guides:

Caption: The copy buffering effect. For essential genes (left tail), more copies of a gene lead to increased survival after targeting with CRISPR and thus a positive correlation between the gene's CERES gene effect and its own copy number.
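
The intuition behind copy buffering can be sketched with a toy probability model. It assumes each copy of the gene is disrupted independently with the same probability, which is a simplification rather than a claim about real Cas9 editing; the function name and the 0.8 knockout probability are hypothetical.

```python
def prob_functional_copy_remains(copy_number, p_knockout):
    """Probability that at least one copy still makes functional protein,
    assuming each copy is independently disrupted with probability p_knockout."""
    if copy_number < 1:
        raise ValueError("copy_number must be at least 1")
    if not 0.0 <= p_knockout <= 1.0:
        raise ValueError("p_knockout must be between 0 and 1")
    # All copies must be disrupted for the cell to lose the gene entirely.
    return 1.0 - p_knockout ** copy_number

# Illustrative only: the escape probability rises with copy number.
for n in (1, 2, 4, 8):
    print(n, round(prob_functional_copy_remains(n, 0.8), 3))
```

Under these assumptions, the chance that a cell escapes full knockout grows with every extra copy, which is the direction of the effect shown in the figure.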

Shared Promiscuous Guides

Certain gene families share so much of their sequence that designing sgRNAs that perfectly match only one member of the family is difficult or impossible:

Caption: An example gene with no unique guides in the Avana library. It is impossible to separate the true gene KO effect of NBPF12 from the similar genes also being knocked out by the same reagents, or from the general toxicity caused by cutting multiple sites in the genome.

This gives rise to off-target effects that are shared across the gene's guides and so aren't controlled by library permutation p-values.

And Others

These examples are sources of bias we know about, but there are likely other sources of artifacts shared across CRISPR guides that we still haven't identified. For example, Kosicki et al. found that CRISPR-induced double-strand breaks can lead to deletions of up to several kilobases. This could produce an adjacency-based off-target effect, where guides targeting one gene cause loss of function in neighboring genes. With our understanding of CRISPR editing systems still in its early stages, we should assume undiscovered biases such as this are lurking in the background.

Better Methods for Controlling False Discovery

There is a method for generating p-values that gets around these known and unknown biases: use the scores of a set of negative control genes, rather than permutations, as the null distribution. For example, if a set of genes is unexpressed in the cell line of interest, we have a strong prior that any effects seen from targeting these genes are purely technical. This idea is used by BAGEL, by the Sanger Institute for calling hits, and by us in the Achilles project for reporting dependency probabilities.
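
As a minimal sketch of the idea, the snippet below computes one-sided empirical p-values for depletion against a set of negative control genes. It is not the procedure used by BAGEL, the Sanger Institute or Achilles; the function name, the gene names and the scores are all hypothetical.

```python
import numpy as np

def negative_control_pvalues(gene_scores, control_genes):
    """One-sided empirical p-values (depletion) against negative control genes.

    gene_scores: dict mapping gene name to its gene-level score.
    control_genes: names of genes assumed to have no true effect,
        e.g. genes unexpressed in the cell line.
    """
    controls = np.sort([gene_scores[g] for g in control_genes])
    pvals = {}
    for gene, score in gene_scores.items():
        # Number of control genes scoring at least as depleted as this gene.
        n_as_extreme = np.searchsorted(controls, score, side="right")
        pvals[gene] = (n_as_extreme + 1) / (controls.size + 1)
    return pvals

# Invented scores for illustration only.
scores = {"ESS1": -1.2, "GENE_X": -0.2, "CTRL1": -0.1, "CTRL2": 0.05, "CTRL3": -0.3}
p = negative_control_pvalues(scores, ["CTRL1", "CTRL2", "CTRL3"])
print(p["ESS1"], p["GENE_X"])
```

Because the control genes were targeted with real guides in the same screen, whatever shared artifacts affect them are already baked into the null. The smallest achievable p-value is 1 / (number of controls + 1), so the control set needs to be reasonably large.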

Identifying the right negative control gene set can be tricky for some experimental designs. However, when it can be done, this approach is far more rigorous than permuting library assignments because the experimentally observed scores of the control genes implicitly capture the range of possible artifacts. The only downside is that your p-values will be monotonically related to effect size, turning the ubiquitous volcano plot of results into an unappealing squiggle. And for truly strong, stand-out results the difference between methods does not matter much. But in cases where false discovery rates really matter, we strongly recommend using negative control genes as your null distribution.