The MON artifact - Cancer Data Science Blog

Project Achilles runs genome-wide CRISPR knockout (KO) screens on more than 150 cancer cell lines annually. Before data are released, they pass through a series of automated QC checks, including replicate correlation, signal strength, and analysis of in-line controls. Users of this dataset at the Broad Institute found that the MON cell line had higher gene effect scores than expected for a substantial subset of genes. This narrative describes the analysis steps used to understand that observation.

Understanding the artifact

We believe high gene effect scores* are unexpected when the target gene is unexpressed

*Gene effect scores estimate how much knocking out a given gene affects the proliferation of a cell line. Log-fold change data (the change in abundance of each guide between the beginning and end of the experiment) are combined with copy number data to create gene effect scores. These scores account for things like copy number variation and guide efficacy. More negative scores indicate lower viability, and more positive scores indicate outgrowth.
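To make the log-fold change step concrete, here is a minimal sketch. It covers only the guide-level log-fold change, not the full gene effect model, which also uses copy number data and guide efficacy. The function name, the reads-per-million normalization, the base-2 logarithm, and the pseudocount are all assumptions made for illustration; the guide names and counts are toy values.

```python
import math


def log_fold_change(initial_counts, final_counts, pseudocount=1.0):
    """Per-guide log2 fold change between two read-count dictionaries.

    Assumption: counts are normalized to reads per million and a
    pseudocount is added before taking the ratio. The real pipeline
    may normalize differently.
    """
    initial_total = sum(initial_counts.values())
    final_total = sum(final_counts.values())
    lfc = {}
    for guide, initial in initial_counts.items():
        # Normalize by sequencing depth so the two samples are comparable.
        initial_rpm = initial / initial_total * 1e6
        final_rpm = final_counts.get(guide, 0) / final_total * 1e6
        lfc[guide] = math.log2(
            (final_rpm + pseudocount) / (initial_rpm + pseudocount)
        )
    return lfc


# Toy counts: guide_1 drops out, guide_2 becomes more abundant.
initial = {"guide_1": 500, "guide_2": 500}
final = {"guide_1": 100, "guide_2": 900}
lfc = log_fold_change(initial, final)
```

A negative value means the guide became rarer over the experiment (lower viability), and a positive value means it became more abundant (outgrowth).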

This behavior was seen in the MON cell line in a subset of genes

[Figure: expression of the target gene vs. gene effect score across cell lines, for DPF3 and TBX18]

The two plots above compare expression of the target gene with its gene effect score across cell lines. DPF3 and TBX18 are not expressed in MON, yet knocking them out appears to cause the cells to outgrow.

We also saw that MON had some cell line buddies that had unusual outgrowth in these genes as well

Starting from genes that others brought to our attention, we identified eight cell lines that, like MON, showed unexpected outgrowth when those genes were targeted for CRISPR KO. Initially, these cell lines didn't seem to have anything striking in common (different lineages, different Cas9 activities, different media conditions, etc.).

In looking at screening information we found that these cell lines were screened at similar times

In the above plot, each replicate is represented as a black dot, and replicates are arranged in the order in which the CRISPR screens were performed. The blue dots represent replicates of the 8 cell lines identified above. We see that this artifact flared up in March 2018 and again in January 2019.

When looking at these time points, we found a cell line - CII - that had one replicate completed at a normal time and another replicate completed while this artifact was prevalent

Above, we see that one CII replicate was completed outside the affected period and so should not be affected by this artifact. We decided to compare the normal and suspicious CII replicates to better understand the artifact.

[Figure: gene-level LFC of the suspicious CII replicate vs. the normal CII replicate (x-axis)]

In the above plot, each point is a gene. We see that the suspicious CII replicate has a cloud of genes with log-fold change (LFC) values that are much higher than those of its normal counterpart (x-axis).

We identified 1,266 genes with an LFC value at least 0.5 greater in the suspicious replicate than in the normal replicate. These genes are referred to as the CII-identified genes.
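The selection rule above can be sketched in a few lines. This is an illustration under assumptions, not the code used for the analysis: the function name and the dictionary layout (gene name mapped to gene-level LFC) are hypothetical, and the values are toy data. Only the 0.5 threshold comes from the text.

```python
def shifted_genes(normal_lfc, suspicious_lfc, min_shift=0.5):
    """Genes whose LFC is at least `min_shift` higher in the suspicious replicate.

    Both arguments map gene name -> gene-level LFC (hypothetical layout).
    Only genes measured in both replicates are compared.
    """
    shared = normal_lfc.keys() & suspicious_lfc.keys()
    return sorted(
        gene
        for gene in shared
        if suspicious_lfc[gene] - normal_lfc[gene] >= min_shift
    )


# Toy values: only GENE_A is shifted upward by 0.5 or more.
normal = {"GENE_A": -0.1, "GENE_B": 0.0, "GENE_C": -1.2}
suspicious = {"GENE_A": 0.9, "GENE_B": 0.2, "GENE_C": -1.1}
affected = shifted_genes(normal, suspicious)
```

Comparing two replicates of the same cell line is what makes this difference meaningful: any real biology should be shared, so a large one-sided shift points to something technical.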

We next wanted to see what these genes would look like if we compared our suspicious CII replicate with one of the MON replicates (which we know is suspicious)

[Figure: LFC of a MON replicate vs. the suspicious CII replicate]

In the above plot we see that the CII-identified genes agree pretty well between the MON replicate and the suspicious CII replicate. This is odd because they're different cell lines! Their only common quality is time of sequencing.

We now have a group of 1,266 genes that seem to be affected by this artifact - below we check what these genes look like in our other cell lines

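[Figure: distributions of gene effect for the CII-identified genes in the 8 identified cell lines and averaged across all cell lines]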

Above we have the distributions of gene effect for the CII-identified genes in our 8 identified cell lines, as well as the average distribution across all cell lines. The distributions for our 8 cell lines are strongly right-shifted compared to the average.

[Figure: mean unscaled gene effect of the CII-identified genes for each cell line, in screening order]

Above, the x-axis is again the order in which the cell lines were screened, and the y-axis is the mean unscaled gene effect of the CII-identified genes. Our 8 original cell lines are labeled. We see that there is another group of cell lines that show this artifact, although not as severely.
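The per-cell-line summary plotted above amounts to averaging gene effect over the affected genes. A minimal sketch, assuming a nested dictionary of cell line to gene to unscaled gene effect (a hypothetical layout) and toy cell line and gene names:

```python
from statistics import mean


def mean_effect_of_genes(gene_effect, genes):
    """Mean gene effect over `genes` for each cell line.

    gene_effect: {cell_line: {gene: score}} (hypothetical layout).
    Genes that are missing or None for a cell line are skipped.
    """
    summary = {}
    for cell_line, scores in gene_effect.items():
        values = [
            scores[g] for g in genes if scores.get(g) is not None
        ]
        if values:
            summary[cell_line] = mean(values)
    return summary


# Toy data: LINE_1 shows outgrowth in the affected genes, LINE_2 does not.
gene_effect = {
    "LINE_1": {"GENE_A": 1.2, "GENE_B": 0.8, "GENE_C": -1.0},
    "LINE_2": {"GENE_A": 0.0, "GENE_B": -0.2, "GENE_C": -0.9},
}
affected_genes = ["GENE_A", "GENE_B"]
means = mean_effect_of_genes(gene_effect, affected_genes)
```

Plotting these means against screening order is what exposes the batches in which the artifact appears.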

Finally, we wanted to check which gene sets were affected by this artifact

Below, we see that our CII-identified genes are enriched for pathways dealing with chromosome and chromatin. This is strange because we would expect an artifact to have no biological significance, and this one seems to affect a biologically relevant subset of genes.
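One common way to quantify this kind of enrichment is a one-sided hypergeometric test on the overlap between a gene list and a gene set. The post does not say which enrichment method was used, so the following is only a sketch of that general technique, with a hypothetical function name and toy numbers.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_in_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) under random selection without replacement.

    n_universe: total number of genes screened
    n_in_set:   genes in the pathway / gene set
    n_selected: genes in the list being tested
    n_overlap:  genes that are in both
    """
    upper = min(n_in_set, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    # Exact integer arithmetic until the final division.
    return tail / total


# Toy example: 20 genes screened, 5 in the gene set,
# 6 in the tested list, 4 of which fall in the gene set.
p_value = hypergeom_enrichment_p(20, 5, 6, 4)
```

A small p-value says the overlap is larger than random selection would produce, which is the surprising result here: a purely technical artifact would not normally single out a coherent pathway.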

Conclusion and next steps

With the help of Broad's Genetic Perturbation Platform (GPP - https://www.broadinstitute.org/genetic-perturbation-platform) we were able to identify the artifact as a PCR contamination with a targeted library that had been designed and used by someone working in the same lab space (explaining why the affected genes were so enriched for chromosome and chromatin pathways). There are of course laboratory measures in place to avoid such PCR contaminations for exactly this reason, but evidently such a contamination nevertheless occurred on this occasion and affected a couple of the weekly batches of PCR. The Venn diagram below shows the (large!) overlap between the targeted library and the CII-identified genes.

[Figure: Venn diagram of the overlap between the targeted library and the CII-identified genes]

Next steps: All the cell lines in the mid- and high-suspicion groups will be resequenced to remove this artifact.

Our original suspicious 8 cell lines have been dropped from the dataset, but are in the process of being resequenced.
Lines in the mid-suspicion group (mean unscaled gene effect of affected genes between 0.1 and 1.0) have the gene scores for the genes in the contaminating targeted library set to NA until new screening data are available.
Lab and analysis QC measures will be reviewed to best avoid and rapidly detect any such incidents in the future.
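The interim handling described above can be sketched as a simple triage step. The 0.1 and 1.0 bounds for the mid-suspicion group come from the text. Treating any line above 1.0 as high-suspicion and dropping it is an assumption made for illustration (the post only says the original 8 lines were dropped), as are the function name, data layout, and toy values.

```python
def triage_cell_lines(gene_effect, mean_affected, library_genes,
                      low=0.1, high=1.0):
    """Drop or mask cell lines according to their artifact score.

    gene_effect:   {cell_line: {gene: score}} (hypothetical layout)
    mean_affected: {cell_line: mean unscaled gene effect of affected genes}
    library_genes: genes in the contaminating targeted library
    """
    library_genes = set(library_genes)
    cleaned = {}
    for cell_line, scores in gene_effect.items():
        score = mean_affected.get(cell_line, 0.0)
        if score > high:
            # Assumed high-suspicion rule: remove the line entirely.
            continue
        if score >= low:
            # Mid-suspicion: set library genes to NA (None), keep the rest.
            cleaned[cell_line] = {
                gene: (None if gene in library_genes else value)
                for gene, value in scores.items()
            }
        else:
            cleaned[cell_line] = dict(scores)
    return cleaned


# Toy data: one line per suspicion group.
toy_effect = {
    "HIGH_LINE": {"GENE_A": 1.5, "GENE_C": -0.8},
    "MID_LINE": {"GENE_A": 0.4, "GENE_C": -0.9},
    "CLEAN_LINE": {"GENE_A": 0.0, "GENE_C": -1.0},
}
toy_means = {"HIGH_LINE": 1.4, "MID_LINE": 0.3, "CLEAN_LINE": 0.02}
cleaned = triage_cell_lines(toy_effect, toy_means, library_genes={"GENE_A"})
```

Masking only the library genes keeps the unaffected part of each mid-suspicion screen usable while the resequencing is pending.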

Thanks to the members of GPP for their help in understanding and correcting this artifact!