Working with MIX-Seq data


We recently published a new method called MIX-Seq that can be used to simultaneously measure the transcriptional response of hundreds of cancer cell lines to a perturbation. We do this by pooling the cancer cell lines, treating the pool with a drug or genetic perturbation, and then sequencing the pool using single-cell RNA-seq. We can resolve the cell line identity of each cell based on its single nucleotide polymorphism (SNP) profile.

The MIX-Seq paper includes single-cell RNA-seq data representing 13 drugs and one gene with pools of 25 to 100 cell lines and a variety of time points between 3 and 24 hours post perturbation. Here we’ll highlight some simple analyses we can do with these data using R and Seurat. For examples of additional analyses shown in our paper, check out the MIX-Seq GitHub.

Load MIX-Seq data

Data from the MIX-Seq paper can be downloaded from figshare. Each data folder (compressed as a zip file) on figshare represents a single experimental condition and contains the following files (the first three are standard outputs from the 10x Cell Ranger pipline):

  1. barcodes.tsv: table of identified cell barcodes
  2. genes.tsv: table of gene information (Ensemble ID and HGNC symbol)
  3. matrix.mtx: read counts matrix stored as matrix market format
  4. classifications.csv: table of cell metadata

Here is a function that loads a data folder into R as a Seurat object:

library(Seurat)
library(Matrix)
library(tidyverse)

load_mixseq_data <- function(expt_dir,condition) {
  counts <- Matrix::readMM(file.path(expt_dir, 'matrix.mtx'))
  rownames(counts) <- readr::read_tsv(file.path(expt_dir, 'genes.tsv'), col_names = F)$X2
  counts <- counts[!duplicated(rownames(counts)),]
  colnames(counts) <- readr::read_tsv(file.path(expt_dir, 'barcodes.tsv'), col_names = F)$X1
  classifications <- readr::read_csv(file.path(expt_dir, 'classifications.csv')) %>%
    dplyr::mutate(condition = condition) %>% tibble::column_to_rownames("barcode") %>%
    as.data.frame()
  sc_data <- Seurat::CreateSeuratObject(counts,meta.data = classifications)
}

Let’s look at the transcriptional response of a pool of cancer cell lines to treatment with the MDM2 inhibitor Idasanutlin. To do this, we’ll load Idasanutlin treated cells from MIX-Seq experiment 1. We’ll also load DMSO treated cells from the same experiment and time point for comparison.

nutlin_data <- load_mixseq_data("./Idasanutlin_24hr_expt1","Idasanutlin")
dmso_data <- load_mixseq_data("./DMSO_24hr_expt1","DMSO")
sc_data <- merge(nutlin_data,dmso_data)

The MIX-Seq data includes doublets and low-quality cells, so we need to filter these out before doing our analysis. Conveniently, the results from the MIX-Seq quality control pipeline are in the ‘cell_quality’ metadata column.

sc_data <- sc_data[,sc_data@meta.data$cell_quality == "normal"]

Generate a UMAP embedding

First, we’ll generate a UMAP embedding to visualize the response of different cell lines to Idasanutlin treatment. We can do this with the standard Seurat pipeline:

npcs <- length(unique(sc_data$singlet_ID)) * 2
sc_data <- Seurat::NormalizeData(sc_data,verbose = F)
sc_data <- Seurat::ScaleData(sc_data,verbose = F)
sc_data <- Seurat::FindVariableFeatures(sc_data,nfeatures = 5000,verbose = F)
sc_data <- Seurat::RunPCA(sc_data,features = VariableFeatures(sc_data),npcs = npcs,verbose = F)
sc_data <- Seurat::RunUMAP(sc_data,dims = 1:npcs,n.neighbors = 10,verbose = F)

Let’s plot the UMAP embedding and color it with the ‘singlet_ID’ metadata column. This column indicates the most likely cell line for each cell based on its SNP profile.

Seurat::DimPlot(sc_data,group.by = "singlet_ID",label = T,repel = T) + guides(color = F)

We can see that cells from the same cell line cluster nicely in the UMAP embedding. However, a few cell lines, like RCC10RGB_KIDNEY, have two clusters instead of one. To explain this, we can color the UMAP embedding by condition instead of singlet_ID.

Seurat::DimPlot(sc_data,group.by = "condition")

We can see that the two clusters correspond to the two experimental conditions. So why do only some cell lines have a transcriptional response to Idasanutlin? To understand this, we need to know how Idasanutlin works.

Idasanutlin blocks the interaction of MDM2 with TP53, which can upregulate TP53 and its tumor suppressor functions. But many cancer cell lines have inactivating mutations in TP53, which prevent them from responding to Idasanutlin. Let’s add a metadata column indicating TP53 mutation status and color the UMAP embedding by condition and TP53 status.

sc_data <- Seurat::AddMetaData(sc_data,ifelse(sc_data$singlet_ID %in%
   c('LNCAPCLONEFGC_PROSTATE','DKMG_CENTRAL_NERVOUS_SYSTEM',
   'NCIH226_LUNG','RCC10RGB_KIDNEY','SNU1079_BILIARY_TRACT',
   'CCFSTTG1_CENTRAL_NERVOUS_SYSTEM','COV434_OVARY'),"TP53_WT","TP53_MUT"),"TP53_status")
sc_data <- Seurat::AddMetaData(sc_data,str_c(sc_data$condition,"  ",sc_data$TP53_status)
                               ,"condition_TP53_status")
Seurat::DimPlot(sc_data,group.by = "condition_TP53_status")

We can see that only TP53 wild type cell lines have a significant transcriptional response to Idasanutlin.

Perform cell-cycle analysis

One advantage of single-cell RNA-seq is that we can quantify a drug’s impact on cell-cycle phase. To do this, we will use a list of cell-cycle marker genes from Tirosh et al, 2015.

sc_data <- Seurat::CellCycleScoring(sc_data, s.features = Seurat::cc.genes$s.genes,
                                    g2m.features = Seurat::cc.genes$g2m.genes)
sc_data <- Seurat::AddMetaData(sc_data,factor(plyr::revalue(sc_data$Phase,
                               c("G1"="G0/G1", "G2M"="G2/M")),
                               levels = c("G0/G1","S","G2/M")),col.name = "Phase")
Seurat::DimPlot(sc_data,group.by = "Phase")

When we color the UMAP embedding by the cell-cycle phase, we can see that much of the transcriptional heterogeneity in the TP53 mutant cell lines is driven by cell-cycle.

Now, let’s look at the fraction of cells from each cell line in each cell-cycle phase. We’ll split the plot by ‘condition’ and ‘TP53_status’ to see if there are differences.

sc_data@meta.data %>% ggplot(aes(x=singlet_ID,fill=Phase)) +
  geom_bar(position = "fill",size = 3, width = .8) +
  facet_grid(vars(condition),vars(TP53_status),scale = "free_x") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(x = "", y = "Cell fraction")

As expected, given the known mechanism of the drug, Idasanutlin treatment significantly increases the fraction of cells in G0/G1 for the TP53 wild type cell lines. This increase indicates that these cells have stopped cycling and suggests that they are undergoing TP53 triggered apoptosis.

Run differential expression analysis

Next, let’s do a differential expression analysis to see which genes are up- and down-regulated in response to Idasanutlin treatment. To do this, we will create ‘pseudo-bulk’ profiles by summing read counts across cells for each cell line and condition and then run the limma-trend pipeline as recommended by Crowell et al., 2019. This strategy has less power than a true single-cell differential expression analysis, but it works well in our case because we have many cell lines that serve as independent samples.

  1. Create pseudo-bulk counts data
library(limma)
library(edgeR)

group_sum <- function(group) {
  Matrix::rowSums(Seurat::GetAssayData(sc_data,slot = 'counts')[,group$ID,drop = FALSE])}
summed_counts <- sc_data@meta.data %>% tibble::rownames_to_column(var = 'ID') %>%
  dplyr::group_split(singlet_ID,condition) %>% plyr::laply(.fun = group_sum)
summed_metadata <- sc_data@meta.data %>% dplyr::group_by(singlet_ID,condition) %>%
  dplyr::summarise(TP53_status = first(TP53_status)) %>% dplyr::ungroup()
  1. Convert the counts data to log(CPM + 1)
log_summed_counts <- edgeR::DGEList(t(summed_counts)) %>%
  edgeR::calcNormFactors(method = 'TMMwsp') %>%
  edgeR::cpm(log = TRUE, prior.count = 1)
  1. Specify the design matrix and contrasts
design <- model.matrix(~0 + condition + singlet_ID, data = summed_metadata)
cm <- limma::makeContrasts(
  contrasts = list(treat = "conditionIdasanutlin - conditionDMSO"),
  levels = make.names(colnames(design))
)
  1. Run limma
fit <- limma::lmFit(log_summed_counts, design)
fit2 <- limma::contrasts.fit(fit, contrasts = cm)
fit3 <- limma::eBayes(fit2,trend = TRUE)
diff_expr <- limma::topTable(fit3, number = Inf, genelist = rownames(log_summed_counts))

Now we can plot the differential expression results as a volcano.

library(ggrepel)

diff_expr %>% ggplot(aes(logFC,-log10(P.Value))) + geom_point(aes(color = P.Value < .05)) +
  geom_text_repel(data = diff_expr %>% filter(abs(logFC) > 1.1),aes(label = ID)) +
  guides(color = F) + ggtitle("Idasanutlin - DMSO")

Several genes in the TP53 effector pathway (e.g. MDM2, CDKN1A, and GDF15) have the largest logFC, which makes sense given Idasanutlin’s ability to upregulate TP53. But are these genes only differentially expressed in TP53 wild type cell lines? To answer this question, let’s plot the logFC of the top differentially expressed genes in each cell line as a heatmap and redo our differential expression separating the cell lines by TP53 mutation status.

We’ll make the heatmap with the pheatmap package.

library(pheatmap)

cell_line_metadata <- summed_metadata %>% dplyr::group_by(singlet_ID) %>%
  dplyr::summarise(TP53_status = first(TP53_status)) %>% dplyr::ungroup()
n_cls <- nrow(cell_line_metadata)
logfc_summed_counts <- log_summed_counts[,(1:n_cls)*2] - log_summed_counts[,(1:n_cls)*2 -1]
colnames(logfc_summed_counts) <- cell_line_metadata$singlet_ID
top_diff_genes <- diff_expr %>% dplyr::arrange(-abs(logFC)) %>% head(25) %>% .[["ID"]]
pheatmap::pheatmap(logfc_summed_counts[top_diff_genes,],treeheight_row = 0,treeheight_col = 0,
                   annotation_col   = cell_line_metadata %>%
                   tibble::column_to_rownames('singlet_ID') %>% as.data.frame())

And we’ll use the same limma code for the differential expression analysis.

library(cowplot)

run_limma_trend <- function(log_summed_counts,summed_metadata) {
  design <- model.matrix(~0 + condition + singlet_ID, data = summed_metadata)
  cm <- limma::makeContrasts(
    contrasts = list(treat = "conditionIdasanutlin - conditionDMSO"),
    levels = make.names(colnames(design))
  )
  fit <- limma::lmFit(log_summed_counts, design)
  fit2 <- limma::contrasts.fit(fit, contrasts = cm)
  fit3 <- limma::eBayes(fit2,trend = TRUE)
  diff_expr <- limma::topTable(fit3, number = Inf, genelist = rownames(log_summed_counts))
}
diff_expr_tp53_wt <- run_limma_trend(log_summed_counts[,summed_metadata$TP53_status == "TP53_WT"],
                                     filter(summed_metadata,TP53_status == "TP53_WT"))
diff_expr_tp53_mut <- run_limma_trend(log_summed_counts[,summed_metadata$TP53_status == "TP53_MUT"],
                                     filter(summed_metadata,TP53_status == "TP53_MUT"))
tp53_wt_volcano <- diff_expr_tp53_wt %>% ggplot(aes(logFC,-log10(P.Value))) +
  geom_point(aes(color = P.Value < .05)) +
  geom_text_repel(data = diff_expr_tp53_wt %>% filter(abs(logFC) > 3.6),aes(label = ID)) +
  guides(color = F) + xlim(-5,5) + ggtitle("TP53 wild type")
tp53_mut_volcano <- diff_expr_tp53_mut %>% ggplot(aes(logFC,-log10(P.Value))) +
  geom_point(aes(color = P.Value < .05)) +
  geom_text_repel(data = diff_expr_tp53_mut %>% filter(abs(logFC) > .8),aes(label = ID)) +
  guides(color = F) + xlim(-5,5) + ggtitle("TP53 mutatant")

cowplot::plot_grid(tp53_wt_volcano,tp53_mut_volcano)

Looking at the heatmap and volcanoes it’s clear that the TP53 effector pathway genes are strongly (LogFC > 4) upregulated in the TP53 wild type cell lines, but a few TP53 effector genes, such as GDF15, also exhibit weak (LogFC ~1), but statistically significant upregulation in the TP53 mutant cell lines.

Get viability-related and viability-independent components

What differences in short-term transcriptional responses distinguish drug-sensitive cell lines from drug-insensitive cell lines? This is the key question that the analysis above addresses. Idasanutlin sensitive (i.e. TP53 wild type) cell lines strongly upregulate TP53 effector pathway genes and strongly downregulate cell-cycle genes. In contrast, Idasanutlin insensitive (i.e. TP53 mutant) cell lines don’t exhibit a strong transcriptional response. But how can we answer this question for drugs that lack a clear biomarker, like TP53 mutation status, and don’t exhibit an all-or-nothing response?

In the MIX-Seq paper, we propose a linear modeling approach to relate short-term transcriptional responses to long-term drug sensitivity data generated with the PRISM assay. Specifically, we will decompose the transcriptional response of each gene into two components: a viability-independent response characterizing the response of completely insensitive cell lines, and a viability-related response characterizing the difference in response between sensitive and insensitive cell lines.

To demonstrate this method, let’s use the Trametinib data from MIX-Seq experiment 3. This dataset serves as a good example since it includes a larger number (~100) of cell lines, and since cell lines exhibit a continuous range of Trametinib sensitivity.

trametinib_data <- load_mixseq_data("./Trametinib_24hr_expt3","Trametinib")
dmso_data <- load_mixseq_data("./DMSO_24hr_expt3","DMSO")
sc_data <- merge(trametinib_data,dmso_data)
sc_data <- sc_data[,sc_data@meta.data$cell_quality == "normal"]

To start, we need to get the sensitivity data for Trametinib. Fortunately, this and many other fascinating cancer data sets can be downloaded from depmap.org!

Let’s download the PRISM Repurposing 19Q4 ‘secondary-screen-dose-response-curve-parameters.csv’ file, load it into R, and combine it with summed_metadata. We’ll define sensitivity as one minus the area under the dose-response curve (auc).

trametinib_sensitivity <- read_csv("./secondary-screen-dose-response-curve-parameters.csv") %>%
  dplyr::filter(screen_id == "MTS010",name == "trametinib")
summed_metadata <- sc_data@meta.data %>% dplyr::group_by(singlet_ID,condition) %>%
  dplyr::summarise() %>% dplyr::ungroup() %>%
  dplyr::left_join(trametinib_sensitivity,by = c("singlet_ID" = "ccle_name")) %>%
  dplyr::filter(!is.na(auc)) %>% dplyr::mutate(sensitivity = (1-auc))

The code to get the viability-related and viability-independent components is very similar to the differential expression code above.

  1. Create pseudo-bulk counts data
summed_counts <- sc_data@meta.data %>% tibble::rownames_to_column(var = 'ID') %>%
  dplyr::filter(singlet_ID %in% summed_metadata$singlet_ID) %>%
  dplyr::group_split(singlet_ID,condition) %>% plyr::laply(.fun = group_sum)
  1. Convert the counts data to log(CPM + 1)
log_summed_counts <- edgeR::DGEList(t(summed_counts)) %>%
  edgeR::calcNormFactors(method = 'TMMwsp') %>%
  edgeR::cpm(log = TRUE, prior.count = 1)
  1. Specify the design matrix and contrasts

Here we add an additional covariate ‘is_treat*sensitivity’ and specify two contrasts instead of one.

summed_metadata <- dplyr::mutate(summed_metadata,
                                is_treat = ifelse(condition == "Trametinib",1,0))
design <- model.matrix(~0 + condition + singlet_ID + I(is_treat*sensitivity),
                       data = summed_metadata)
colnames(design)[ncol(design)] <- 'is_treat_sens'
cm <- limma::makeContrasts(
  contrasts = list(independent = "conditionTrametinib - conditionDMSO",
                   related = "is_treat_sens"),
  levels = make.names(colnames(design))
)
  1. Run limma

Here we extract the gene logFC values for the two contrasts which are the independent and related components.

fit <- limma::lmFit(log_summed_counts, design)
fit2 <- limma::contrasts.fit(fit, contrasts = cm)
fit3 <- limma::eBayes(fit2,trend = TRUE)
independent <- limma::topTable(fit3, coef = "conditionTrametinib - conditionDMSO",
                               number = Inf, genelist = rownames(log_summed_counts))
related <- limma::topTable(fit3, coef = "is_treat_sens", number = Inf,
                           genelist = rownames(log_summed_counts))

Now lets plot the viability-independent response as a volcano and show the logFC of the top viability-independent genes in each cell line as a heatmap.

independent %>% ggplot(aes(logFC,-log10(P.Value))) + geom_point(aes(color = P.Value < .05)) +
  geom_text_repel(data = independent %>% filter(abs(logFC) > 2),aes(label = ID)) +
  guides(color = F) + ggtitle("Viability-independent")

cell_line_metadata <- summed_metadata %>% dplyr::group_by(singlet_ID) %>%
  dplyr::summarise(sensitivity = first(sensitivity)) %>% dplyr::ungroup()
n_cls <- nrow(cell_line_metadata)
logfc_summed_counts <- log_summed_counts[,(1:n_cls)*2] - log_summed_counts[,(1:n_cls)*2 -1]
colnames(logfc_summed_counts) <- cell_line_metadata$singlet_ID
cell_line_metadata <- cell_line_metadata %>% arrange(sensitivity)
top_idependent_genes <- independent %>% dplyr::arrange(-abs(logFC)) %>% head(25) %>% .[["ID"]]
pheatmap(logfc_summed_counts[top_idependent_genes,cell_line_metadata$singlet_ID],
         treeheight_row = 0,treeheight_col = 0,cluster_cols = F,show_colnames = F,
         annotation_col = cell_line_metadata %>%
         tibble::column_to_rownames('singlet_ID') %>% as.data.frame())

Looking at the heatmap, we can see that the logFC of these genes doesn’t correlate with Trametinib sensitivity.

Now lets plot the viability-related response as a volcano and make a similar heatmap.

related %>% ggplot(aes(logFC,-log10(P.Value))) + geom_point(aes(color = P.Value < .05)) +
  geom_text_repel(data = related %>% filter(abs(logFC) > 2.9),aes(label = ID)) +
  guides(color = F) + ggtitle("Viability-related")

top_related_genes <- related %>% dplyr::arrange(-abs(logFC)) %>% head(25) %>% .[["ID"]]
pheatmap::pheatmap(logfc_summed_counts [top_related_genes,cell_line_metadata$singlet_ID],
                   treeheight_row = 0,treeheight_col = 0,
                   cluster_cols = F,show_colnames = F,
                   annotation_col   = cell_line_metadata %>%
                   tibble::column_to_rownames('singlet_ID') %>% as.data.frame())

As expected, the logFC of these genes correlates with Trametinib sensitivity. For example, the expression of the cell-cycle regulator E2F is negatively correlated with Trametinib sensitivity.

comments powered by Disqus