zl程序教程

您现在的位置是:首页 >  其它

当前栏目

【无标题】Non-parametric test for difference in mean

for in test Non Difference Mean
2023-09-14 09:09:48 时间
Non-parametric test for difference in mean
Christoph Hafemeister
2021-06-14
NOTE: This document was generated with sctransform version 0.3.2.9007

Introduction
With this vignette we introduce the non-parametric differential expression test for sparse non-negative data as seen in single-cell RNA sequencing.

The test is a model-free randomization test where the observed difference in mean between two groups is compared against a null distribution that is obtained by random shuffling of the group labels.

Given the observed difference and the null distribution, empirical p-values are calculated: emp_pval = (b + 1) / (R + 1) where b is the number of times the absolute difference in mean from a random permutation is at least as large as the absolute value of the observed difference in mean, R is the number of random permutations. This is an upper bound of the real empirical p-value that would be obtained by enumerating all possible group label permutations.

Additionally, we approximate the empirical null distribution with a normal distribution and turn the observed difference in mean into a z-score and then into a p-value. Finally, all p-values (for the tested genes) are adjusted using the Benjamini & Hochberg method (fdr).

The log2FC values in the output are log2(mean1 / mean2).

The current implementation only supports sparse matrices of class dgCMatrix from the Matrix package and has been optimized for speed (see benchmarking results below).

Load some data
We use the publicly available “10k PBMCs from a Healthy Donor (v3 chemistry)” data (11,769 cells) from 10x Genomics available at https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3

The data is the same as the demo dataset used in Azimuth. We load the results obtained from Azimuth (Azimuth version: 0.3.2; Seurat version: 4.0.0; Reference version: 1.0.0; default parameters in web-app).

We then keep only those cell types that have at least 5 cells with a prediction and mapping score > 0.66 and further remove all genes that have not been detected in at least 5 cells.

counts <- Seurat::Read10X_h5("~/Projects/data_warehouse/raw_public_10x/pbmc_10k_v3_filtered_feature_bc_matrix.h5")
predictions <- read.delim("~/Projects/data_warehouse/raw_public_10x/pbmc_10k_v3_azimuth_pred.tsv", 
    row.names = 1)

tab <- table(predictions$predicted.celltype.l2, predictions$predicted.celltype.l2.score > 
    0.66 & predictions$mapping.score > 0.66)
keep_types <- rownames(tab)[tab[, 2] >= 5]
keep_cells <- rownames(predictions)[predictions$predicted.celltype.l2 %in% keep_types]

counts <- counts[, keep_cells]
counts <- counts[rowSums(counts > 0) >= 5, ]
predictions <- predictions[keep_cells, ]

cell_types <- factor(predictions$predicted.celltype.l2, levels = names(sort(table(predictions$predicted.celltype.l2), 
    decreasing = TRUE)))
We now have a count matrix of 19088 genes and 11729 cells with the following cell type labels:

data.frame(table(cell_types))
cell_types	Freq
CD14 Mono	3469
CD4 TCM	2335
B naive	1062
MAIT	587
CD4 Naive	565
NK	551
CD8 TEM	397
CD16 Mono	385
CD8 Naive	365
Eryth	319
B intermediate	269
B memory	264
gdT	236
Platelet	179
CD4 TEM	143
CD8 TCM	140
cDC2	135
Treg	103
pDC	88
HSPC	28
NK_CD56bright	27
NK Proliferating	24
ILC	23
Plasmablast	19
cDC1	16
Motivation
Here we illustrate the concept of the test using CD14 Monocytes as group 1 and all remaining cells as group 2. We will show two example genes: HINT1 (not differentially expressed), and CD14.

goi <- c("HINT1", "CD14")
df <- melt(t(as.matrix(counts[goi, , drop = FALSE])), varnames = c("cell", "gene"), 
    value.name = "counts")
df$cell_type <- factor(c("rest", "CD14 Mono")[(cell_types == "CD14 Mono") + 1])
# calculate the (geometric) mean per group
df_sum <- group_by(df, gene, cell_type) %>% summarise(mean = expm1(mean(log1p(counts))), 
    mid = median(range(counts)), .groups = "drop")
# and the difference of means
df_diff <- group_by(df_sum, gene) %>% summarise(diff_mean = mean[1] - mean[2], label = sprintf("Difference in mean: %1.2g\nlog2 fold-change: %1.2g", 
    diff_mean, log2(mean[1]/mean[2])), x = max(mid), y = Inf, .groups = "drop")
p1 <- ggplot(df, aes(counts, y = ..density.., fill = cell_type)) + geom_histogram(binwidth = 1, 
    position = "identity", alpha = 0.4) + geom_vline(data = df_sum, aes(xintercept = mean, 
    color = cell_type)) + geom_label(data = df_diff, aes(x, y, label = label), vjust = 1, 
    inherit.aes = FALSE, size = 3) + facet_wrap(~gene, scales = "free") + xlab("Gene counts") + 
    ylab("Proportion of cells") + ggtitle("Observed data and differences in geometric mean")
plot(p1)


The plot above shows the UMI counts per gene per group. Also shown is the difference in mean (mean1 - mean2) and the log2 fold-change (log2(mean1 / mean2)). To find out whether the observed difference in mean is significant we look at the null distribution of difference in mean, i.e. we shuffle the labels (here we use 99 repetitions) and calculate the difference in mean.

# calculate null distribution of difference in mean for each gene
grp <- factor(c("rest", "CD14 Mono")[(cell_types == "CD14 Mono") + 1])
tmp_counts <- counts[goi, , drop = FALSE]
R <- 99
diff_mean_null <- sapply(1:R, function(i) {
    mean_r <- sctransform:::row_gmean_grouped_dgcmatrix(matrix = tmp_counts, group = grp, 
        eps = 1, shuffle = TRUE)
    mean_r[, 1] - mean_r[, 2]
})
df_null <- melt(diff_mean_null, varnames = c("gene", "iteration"), value.name = "diff_mean")

p2 <- ggplot(df_null, aes(diff_mean)) + geom_histogram(bins = 33) + facet_wrap(~gene, 
    scales = "free") + xlab("Difference in geometric mean") + ylab("Count") + ggtitle("Null distribution of differences in geometric mean")
plot(p2)


The null distribution of ‘difference in mean’ shown above indicates what values to expect if the null is true (no difference in mean between the two groups). We can use the distribution to obtain an empirical p-value by asking how often the absolute value of the null distribution is larger or equal to the observed difference in mean. We use the absolute value since this is a two-tailed test, and use a pseudo-count in nominator and denominator when turning the observed frequencies into p-values (see Phipson and Smyth (2010) and Hartwig (2013) for discussions).


# given the null distribution, get empirical p-value, fit a gaussian and get
# approximated p-value
df_res <- left_join(df_null, df_diff, by = "gene") %>% group_by(gene) %>% summarise(emp_pval = (sum((abs(diff_mean.x) - 
    abs(diff_mean.y)) >= 0) + 1)/(R + 1), sds = sqrt(sum(diff_mean.x^2)/(R - 1)), 
    zscore = (diff_mean.y[1] - mean(diff_mean.x))/sds, pval = 2 * pnorm(-abs(zscore)), 
    min_r = min(diff_mean.x), max_r = max(diff_mean.x), mean_r = mean(diff_mean.x), 
    observed = diff_mean.y[1], .groups = "drop")
df_fit <- group_by(df_res, gene) %>% summarise(x = seq(from = min(min_r, observed), 
    to = max(max_r, observed), length.out = 333), y = dnorm(x = x, mean = mean_r, 
    sd = sds), .groups = "drop")
df_anno <- group_by(df_res, gene) %>% summarise(x = max(max_r, observed), y = Inf, 
    label = sprintf("Empirical p-value: %1.2g\nApprox. p-value: %1.2g", emp_pval, 
        pval))

p3 <- ggplot(df_null, aes(diff_mean, y = ..density..)) + geom_histogram(bins = 33, 
    aes(fill = "gray70")) + geom_line(data = df_fit, aes(x = x, y = y, linetype = "1")) + 
    geom_vline(data = df_res, aes(xintercept = observed, linetype = "2"), show_guide = FALSE) + 
    geom_label(data = df_anno, aes(x, y, label = label), hjust = 1, vjust = 1, size = 3) + 
    facet_wrap(~gene, scales = "free") + xlab("Difference in geometric mean") + ylab("Distribution density") + 
    ggtitle("Using the null distribution to obtain p-values") + scale_fill_manual(name = "", 
    values = "gray70", labels = sprintf("null distribution", R)) + scale_linetype_manual(name = "", 
    values = c(1, 2), labels = c("Approximated null", "Observed difference\nin mean"))
plot(p3)


The lowest possible empirical p-value is 1/(R+1) whith R being the number of random permutation used. However, the gaussian approximation of the null distribution allows us to calculate z-scores and consequently p-values that are lower than that. While the approximation using a gaussian might not be exact, especially for genes with very low detection rate or when cell numbers are very low, it generally agrees well with the empirical data.

Example 1: DE of CD14 Mono vs CD16 Mono (one vs one)
First, we will take the count matrix and fit a model using sctransform::vst, and in a second step obtain corrected counts (with the sequencing depth effect removed). Then compare the two groups.

vst_out <- vst(umi = counts, method = "qpoisson", residual_type = "none", return_cell_attr = TRUE, 
    verbosity = 0)
counts_corrected <- correct_counts(x = vst_out, umi = counts, verbosity = 0)
By default sctransform::diff_mean_test applies some moderate pre-filtering and tests only genes with:

absolute log2-fold-change of at least log2(1.2) (0.2630344) AND
mean value of at least 0.05 in at least one of the tested groups AND
at least 5 non-zero observations in the group with higher mean
Here we disable the first filter, but require a mean of at least 0.1 in at least one of the groups. We show results as a volcano plot and highlight the top DE genes (based on p-value or log-fold-change).

bm <- bench::mark(de_res <- diff_mean_test(y = counts_corrected, group_labels = cell_types, 
    compare = c("CD14 Mono", "CD16 Mono"), log2FC_th = 0, mean_th = 0.1), max_iterations = 1, 
    filter_gc = FALSE)
#> Non-parametric DE test for count data
#> Using geometric mean and 99 random permutations
#> Input: 19088 genes, 11729 cells; 25 groups
#> Comparing CD14 Mono (group1, N = 3469) to CD16 Mono (group2, N = 385)
#> Keeping 6650 genes after initial filtering
#> Non-parametric DE test for count data
#> Using geometric mean and 99 random permutations
#> Input: 19088 genes, 11729 cells; 25 groups
#> Comparing CD14 Mono (group1, N = 3469) to CD16 Mono (group2, N = 385)
#> Keeping 6650 genes after initial filtering
knitr::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details")
Benchmarking details
mem_alloc	gc.sec	n_itr	n_gc	total_time
359MB	0.4265963	1	1	2.34s
The runtime for the tests was 2.34s. In total, 6650 genes were tested. 949 genes showed a mean that was significantly different between groups (FDR <= 0.05 based on empirical p-values) AND had an absolute log2 fold-change of at least 1 (i.e. abs(log2(mean1/mean2)) >= 1).

top_markers <- arrange(de_res, sign(log2FC), -abs(log2FC)) %>% group_by(sign(log2FC)) %>% 
    filter(rank(-abs(zscore), ties.method = "first") <= 4 | rank(-abs(log2FC), ties.method = "first") <= 
        4) %>% ungroup() %>% select(gene, mean1, mean2, log2FC, emp_pval_adj, pval_adj, 
    zscore)

p1 <- ggplot(de_res, aes(log2FC, pmax(-0.5, log10(abs(zscore))))) + geom_point(aes(color = emp_pval_adj < 
    0.05 & pval_adj < 0.05)) + geom_point(data = top_markers, color = "deeppink") + 
    geom_text_repel(data = top_markers, mapping = aes(label = gene)) + theme(legend.position = "bottom") + 
    ylab("Zscore [log10 of absolute value, clipped at -0.5]") + xlab("log2 fold-change (log2(mean1 / mean2))")

show(p1)


Top markers per cell type

filter(top_markers, log2FC < 0) %>% DT::datatable(rownames = FALSE, options = list(paging = FALSE, 
    searching = FALSE)) %>% DT::formatRound(2:7, digits = 2)
gene	mean1	mean2	log2FC	emp_pval_adj	pval_adj	zscore
LYPD2	0.00	0.54	-9.57	0.02	0.00	-33.61
FCGR3A	0.10	13.72	-7.17	0.02	0.00	-205.90
MEG3	0.00	0.19	-7.16	0.02	0.00	-15.02
LINC02345	0.00	0.33	-7.11	0.02	0.00	-34.55
CDKN1C	0.04	4.83	-7.07	0.02	0.00	-125.77
RHOC	0.12	3.43	-4.86	0.02	0.00	-91.97
LST1	4.86	20.53	-2.08	0.02	0.00	-79.96
Showing 1 to 7 of 7 entries
filter(top_markers, log2FC > 0) %>% DT::datatable(rownames = FALSE, options = list(paging = FALSE, 
    searching = FALSE)) %>% DT::formatRound(2:7, digits = 2)
gene	mean1	mean2	log2FC	emp_pval_adj	pval_adj	zscore
PADI2	0.12	0.00		0.02	0.00	7.93
ASGR2	0.22	0.00	6.94	0.02	0.00	8.79
S100A12	11.73	0.12	6.57	0.02	0.00	20.15
CRISPLD2	0.31	0.00	6.45	0.02	0.00	12.62
MS4A6A	4.41	0.29	3.93	0.02	0.00	23.90
LYZ	89.96	6.75	3.74	0.02	0.00	26.14
MNDA	8.90	1.66	2.43	0.02	0.00	25.88
FOS	47.86	14.20	1.75	0.02	0.00	25.92
Showing 1 to 8 of 8 entries
Example 2: Top markers for all cell types (each vs rest)
Here we use the test to find genes that are high for each cell type compared to the rest. This is the default behavior of the test function. To speed things up, we use fewer random permutations (49) and test only the 222 genes with highest log2 fold-change.

bm <- bench::mark(de_res <- diff_mean_test(y = counts_corrected, group_labels = cell_types, 
    R = 49, only_pos = TRUE, only_top_n = 222, verbosity = 0))
knitr::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details")
Benchmarking details
mem_alloc	gc.sec	n_itr	n_gc	total_time
974MB	0	1	0	5.24s
Show one plot per cell type and highlight the top 4 markers with respect to p-value and the top 4 markers with respect to log2FC.

top_markers <- group_by(de_res, group1) %>% filter(rank(-zscore, ties.method = "first") <= 
    4 | rank(-log2FC, ties.method = "first") <= 4) %>% select(group1, gene, mean1, 
    mean2, log2FC, zscore, emp_pval_adj)

p1 <- ggplot(de_res, aes(pmin(log2FC, 10), pmin(log10(zscore), 4))) + geom_point(aes(color = emp_pval_adj < 
    0.05)) + geom_point(data = top_markers, color = "deeppink") + geom_text_repel(data = top_markers, 
    mapping = aes(label = gene)) + facet_wrap(~group1, ncol = 3) + theme(legend.position = "bottom")
show(p1)


Table of top markers per cell type

DT::datatable(top_markers, rownames = FALSE) %>% DT::formatRound(3:7, digits = 2)
Show 
10
 entriesSearch:
group1	gene	mean1	mean2	log2FC	zscore	emp_pval_adj
CD14 Mono	S100A9	125.05	1.21	6.69	411.42	0.02
CD14 Mono	S100A8	78.52	0.76	6.69	361.87	0.02
CD14 Mono	VCAN	9.82	0.17	5.85	196.41	0.02
CD14 Mono	CLEC4D	0.25	0.00	7.47	52.10	0.02
CD14 Mono	CLEC4E	0.67	0.00	7.32	84.16	0.02
CD14 Mono	LYZ	89.96	1.47	5.94	309.09	0.02
CD14 Mono	SERPINB2	0.06	0.00	7.82	23.36	0.02
CD14 Mono	SERPINB10	0.05	0.00	7.71	27.57	0.02
CD4 TCM	TNFRSF4	0.19	0.02	3.40	32.68	0.02
CD4 TCM	IL7R	5.16	0.70	2.89	84.56	0.02
Showing 1 to 10 of 164 entriesPrevious12345…17Next
Example 3: Unique markers for all cell types (all vs all)
Since the test is pretty fast we can run all pairwise comparisons to try to find genes that are cell type specific across many individual comparisons.

bm <- bench::mark(de_res <- diff_mean_test(y = counts_corrected, group_labels = cell_types, 
    compare = "all_vs_all", R = 49, mean_th = 0.1, log2FC_th = 0, verbosity = 0))
knitr::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details")
Benchmarking details
mem_alloc	gc.sec	n_itr	n_gc	total_time
71.9GB	0.789407	1	118	2.49m
There are 300 unique cell type pairs in the output. On average 5334 genes were tested in each comparison. Below we list the top genes per cell type ranked by fold change and the number of times it was a cell type marker (absolute log2FC >= 0 and p-value <= 0.05). We also show the minimum log2 fold-change and the maximum p-value across all comparisons.

log2fc_th <- 0  #log2(1.2)
p_th <- 0.05
fn <- function(fc, pval, fc_th = 0, p_th = Inf) {
    worst <- which.min(fc)
    wp <- pval[worst]
    if (fc[worst] < 0) {
        wp <- 1 - wp
    }
    data.frame(n = sum(fc >= fc_th & pval <= p_th), log2FC = fc[worst], pval = wp)
}
tmp1 <- group_by(de_res, gene, group1) %>% summarise(fn(log2FC, pval))

tmp1 <- group_by(de_res, gene, group1) %>% summarise(n = sum(log2FC >= log2fc_th), 
    n_sig = sum(log2FC >= log2fc_th & pval <= p_th), log2FC = min(log2FC), pval = max(pval, 
        log2FC < 0))

tmp2 <- group_by(de_res, gene, group2) %>% summarise(n = sum(log2FC <= -log2fc_th), 
    n_sig = sum(log2FC <= -log2fc_th & pval <= p_th), log2FC = min(-log2FC), pval = max(pval, 
        log2FC < 0))

top_markers <- full_join(tmp1, tmp2, by = c(group1 = "group2", gene = "gene")) %>% 
    rename(cell_type = group1) %>% mutate(comparisons = mapply(sum, n.x, n.y, na.rm = TRUE), 
    n_sig = mapply(sum, n_sig.x, n_sig.y, na.rm = TRUE), log2FC = pmin(log2FC.x, 
        log2FC.y, na.rm = TRUE), pval = pmax(pval.x, pval.y, na.rm = TRUE)) %>% arrange(cell_type, 
    -comparisons, -log2FC * n_sig, pval) %>% group_by(cell_type) %>% slice_head(n = 4) %>% 
    select(cell_type, gene, comparisons, log2FC, pval)

DT::datatable(top_markers, rownames = FALSE, caption = "Cell type specific markers") %>% 
    DT::formatRound(4:5, digits = 2)
Show 
10
 entriesSearch:
Cell type specific markers
cell_type	gene	comparisons	log2FC	pval
CD14 Mono	CLEC4D	24	4.82	0.01
CD14 Mono	MARC1	24	4.50	0.04
CD14 Mono	AC245128.3	24	4.34	0.03
CD14 Mono	S100A9	24	4.32	0.00
CD4 TCM	CD5	24	0.56	0.03
CD4 TCM	SLC2A3	24	0.26	0.09
CD4 TCM	MORC2	24	0.26	0.67
CD4 TCM	TAGAP	24	0.18	0.60
B naive	AL139020.1	24	3.35	0.12
B naive	IL4R	24	2.75	0.00
Showing 1 to 10 of 100 entriesPrevious12345…10Next
Heatmap of top markers

mat <- sctransform:::row_gmean_grouped_dgcmatrix(matrix = counts_corrected[unique(top_markers$gene), 
    ], group = cell_types, eps = 1, shuffle = FALSE)
mat <- t(scale(t(mat)))

p1 <- melt(mat, varnames = c("gene", "celltype")) %>% mutate(celltype = factor(celltype, 
    levels = rev(colnames(mat)))) %>% ggplot(aes(gene, celltype, fill = value)) + 
    geom_tile(colour = "gray66") + scale_fill_gradient2(low = "white", mid = "white", 
    high = "black", name = "Geometric mean of each gene per cell type, scaled per gene") + 
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + labs(x = NULL, 
    y = NULL) + scale_x_discrete(expand = c(0, 0)) + scale_y_discrete(expand = c(0, 
    0)) + theme(legend.position = "top", axis.ticks = element_blank())
show(p1)


Example 4: CD4 vs CD8 T cells (some vs some)
We can also test one or more groups against one or more other groups. As an example we test CD4 vs CD8 cells (in each group we combine Naive, TCM, TEM subgroups). The cell type identities for the comparison are passed as a list of length two to the compare argument of the diff_mean_test function.

bm <- bench::mark(de_res <- diff_mean_test(y = counts_corrected, group_labels = cell_types, 
    compare = list(c("CD4 Naive", "CD4 TCM", "CD4 TEM"), c("CD8 Naive", "CD8 TCM", 
        "CD8 TEM")), log2FC_th = 0, mean_th = 0.1), max_iterations = 1, filter_gc = FALSE)
#> Non-parametric DE test for count data
#> Using geometric mean and 99 random permutations
#> Input: 19088 genes, 11729 cells; 25 groups
#> Comparing group1 (group1, N = 3043) to group2 (group2, N = 902)
#> Keeping 4467 genes after initial filtering
#> Non-parametric DE test for count data
#> Using geometric mean and 99 random permutations
#> Input: 19088 genes, 11729 cells; 25 groups
#> Comparing group1 (group1, N = 3043) to group2 (group2, N = 902)
#> Keeping 4467 genes after initial filtering
knitr::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details")
Benchmarking details
mem_alloc	gc.sec	n_itr	n_gc	total_time
304MB	0	1	0	1.56s
top_markers <- arrange(de_res, sign(log2FC), -abs(log2FC)) %>% group_by(sign(log2FC)) %>% 
    filter(rank(-abs(zscore), ties.method = "first") <= 4 | rank(-abs(log2FC), ties.method = "first") <= 
        4) %>% ungroup() %>% select(gene, mean1, mean2, log2FC, emp_pval_adj, pval_adj, 
    zscore)

p1 <- ggplot(de_res, aes(log2FC, pmax(-0.5, log10(abs(zscore))))) + geom_point(aes(color = emp_pval_adj < 
    0.05 & pval_adj < 0.05)) + geom_point(data = top_markers, color = "deeppink") + 
    geom_text_repel(data = top_markers, mapping = aes(label = gene)) + theme(legend.position = "bottom") + 
    ylab("Zscore [log10 of absolute value, clipped at -0.5]") + xlab("log2 fold-change (log2(mean1 / mean2))")

show(p1)


Top markers per group

filter(top_markers, log2FC < 0) %>% DT::datatable(rownames = FALSE, options = list(paging = FALSE, 
    searching = FALSE), caption = "Higher in CD8") %>% DT::formatRound(2:7, digits = 2)
Higher in CD8
gene	mean1	mean2	log2FC	emp_pval_adj	pval_adj	zscore
LINC02446	0.00	0.29	-6.81	0.04	0.00	-28.71
FCRL6	0.00	0.11	-6.74	0.04	0.00	-16.75
GZMH	0.01	0.47	-6.10	0.04	0.00	-26.46
CD8B	0.03	1.92	-5.90	0.04	0.00	-76.13
CD8A	0.03	1.76	-5.77	0.04	0.00	-65.05
NKG7	0.09	1.86	-4.34	0.04	0.00	-47.47
CCL5	0.28	3.33	-3.57	0.04	0.00	-52.21
Showing 1 to 7 of 7 entries
filter(top_markers, log2FC > 0) %>% DT::datatable(rownames = FALSE, options = list(paging = FALSE, 
    searching = FALSE), caption = "Higher in CD4") %>% DT::formatRound(2:7, digits = 2)
Higher in CD4
gene	mean1	mean2	log2FC	emp_pval_adj	pval_adj	zscore
CD4	0.35	0.02	3.95	0.04	0.00	18.33
TSHZ2	0.29	0.02	3.94	0.04	0.00	14.87
CD40LG	0.45	0.03	3.82	0.04	0.00	18.83
TNFRSF4	0.15	0.01	3.68	0.04	0.00	10.60
LTB	7.32	2.84	1.36	0.04	0.00	24.27
Showing 1 to 5 of 5 entries
Example 5: Using the test with Seurat
The functionality outlined in the examples above can also be applied to Seurat objects. We demonstrate this with an example below where we perform unsupervised clustering followed by cluster marker identification (as in example 2).

library("Seurat")
#> Attaching SeuratObject
s <- CreateSeuratObject(counts = counts)
# run sctransform
s <- SCTransform(s, verbose = FALSE, method = "qpoisson")
# 'standard' workflow
s <- RunPCA(s, verbose = FALSE)
s <- RunUMAP(s, dims = 1:30, verbose = FALSE)
s <- FindNeighbors(s, dims = 1:30, verbose = FALSE)
s <- FindClusters(s, verbose = FALSE)
DimPlot(s, label = TRUE) + NoLegend() + ggtitle("Unsupervised clustering")


The cells have been grouped into 22 clusters. We are now going to use sctransform::diff_mean_test to find the top cluster markers. For this we access the normalized counts that are in the SCT assay and the cluster labels.

bm <- bench::mark(de_res <- diff_mean_test(y = GetAssayData(s, assay = "SCT", slot = "counts"), 
    group_labels = s$seurat_clusters, R = 49, only_pos = TRUE, only_top_n = 222, 
    verbosity = 0))
knitr::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details")
Benchmarking details
mem_alloc	gc.sec	n_itr	n_gc	total_time
856MB	0	1	0	4.75s
top_markers <- group_by(de_res, group1) %>% filter(rank(-zscore, ties.method = "first") <= 
    4 | rank(-log2FC, ties.method = "first") <= 4) %>% select(group1, gene, mean1, 
    mean2, log2FC, zscore, emp_pval_adj)

p1 <- ggplot(de_res, aes(pmin(log2FC, 10), pmin(log10(zscore), 4))) + geom_point(aes(color = emp_pval_adj < 
    0.05)) + geom_point(data = top_markers, color = "deeppink") + geom_text_repel(data = top_markers, 
    mapping = aes(label = gene)) + facet_wrap(~group1, ncol = 4) + theme(legend.position = "bottom")
show(p1)


Table of top markers per cluster

DT::datatable(top_markers, rownames = FALSE) %>% DT::formatRound(3:7, digits = 2)
Show 
10
 entriesSearch:
group1	gene	mean1	mean2	log2FC	zscore	emp_pval_adj
0	S100A9	185.57	2.74	6.08	463.24	0.02
0	S100A12	19.55	0.51	5.25	264.37	0.02
0	S100A8	135.89	1.79	6.24	518.87	0.02
0	LYZ	103.77	3.13	5.05	278.81	0.02
1	TNFRSF4	0.26	0.02	3.76	39.55	0.02
1	AC133644.2	0.08	0.01	3.45	21.64	0.02
1	IL7R	7.32	0.77	3.25	118.92	0.02
1	LTB	9.57	1.93	2.31	77.39	0.02
1	NEFL	0.07	0.00	4.67	25.44	0.02
1	LDHB	5.27	1.27	2.06	77.39	0.02
Showing 1 to 10 of 141 entriesPrevious12345…15Next
Seurat heatmap
marker_genes <- group_by(de_res, group1) %>% filter(rank(-log2FC, ties.method = "first") <= 
    4) %>% pull(gene)
# make sure all markers are in the scale.data slot (by default only the highly
# variable genes are there)
s <- GetResidual(s, features = marker_genes, verbose = FALSE)
DoHeatmap(s, features = marker_genes, slot = "scale.data")


Example 6: Conserved markers after integration
Here we show how to use the test to identify the conserved (or consistent) cluster markers after multi-sample integration. This is different from the examples above because additionally to a cluster label the cells also have a sample label and we run the tests per sample and combine the results.

First, follow the Seurat integration vignette to obtain an integrated object.

library("Seurat")
library("SeuratData")
#> Registered S3 method overwritten by 'cli':
#>   method     from         
#>   print.boxx spatstat.geom
#> ── Installed datasets ───────────────────────────────────── SeuratData v0.2.1 ──
#> ✓ bmcite   0.3.0                        ✓ pbmc3k   3.1.4
#> ✓ ifnb     3.1.0                        ✓ ssHippo  3.1.4
#> ✓ panc8    3.0.2                        ✓ stxBrain 0.1.1
#> ────────────────────────────────────── Key ─────────────────────────────────────
#> ✓ Dataset loaded successfully
#> > Dataset built with a newer version of Seurat than installed
#> ❓ Unknown version of Seurat installed
LoadData("ifnb")
#> An object of class Seurat 
#> 14053 features across 13999 samples within 1 assay 
#> Active assay: RNA (14053 features, 0 variable features)
ifnb.list <- SplitObject(ifnb, split.by = "stim")
ifnb.list <- lapply(X = ifnb.list, FUN = SCTransform, method = "qpoisson", verbose = FALSE)
features <- SelectIntegrationFeatures(object.list = ifnb.list, nfeatures = 3000, 
    verbose = FALSE)
ifnb.list <- PrepSCTIntegration(object.list = ifnb.list, anchor.features = features, 
    verbose = FALSE)
immune.anchors <- FindIntegrationAnchors(object.list = ifnb.list, normalization.method = "SCT", 
    anchor.features = features, verbose = FALSE)
immune.combined.sct <- IntegrateData(anchorset = immune.anchors, normalization.method = "SCT", 
    verbose = FALSE)
Note that in this example both samples have been sequenced to similar depth. When working with data where there is a stark discrepancy between sequencing depth, one should make sure that the corrected counts used for differential expression testing were generated using the same ‘target’ depth. Since this is not supported by Seurat::SCTransform it requires running sctransform::correct() with identical cell meta data for all samples and setting the as_is parameter to TRUE. However, in this example we can proceed without any extra steps.

In an unsupervised analysis one would now perform dimensionality reduction and clustering. To keep it simple, we skip these steps and directly use the cell type labels that are provided with the original seurat object.

Now we identify cluster marker genes (each vs rest) that are conserved, i.e. consistent across conditions.

bm <- bench::mark(de_res <- diff_mean_test_conserved(y = GetAssayData(immune.combined.sct, 
    assay = "SCT", slot = "counts"), group_labels = immune.combined.sct$seurat_annotations, 
    sample_labels = immune.combined.sct$stim, only_pos = TRUE, verbosity = 0))
knitr::kable(data.frame(bm[, 5:9]), caption = "Benchmarking details")
Benchmarking details
mem_alloc	gc.sec	n_itr	n_gc	total_time
1.33GB	0.7549027	1	10	13.2s
The function we used calls diff_mean_test repeatedly (once per sample given this balanced design) and aggregates the results per group and gene. The output table has the following columns

group1 Cell type label of the frist group of cells
group2 Label of the second group of cells; currently fixed to ‘rest’
gene Gene name (from rownames of input matrix)
n_tests Number of tests this gene participated in for this group (max 2 in this example)
log2FC_min,median,max Summary statistics for log2FC across the tests
mean1,2_median Median of group mean across the tests
pval_max Maximum of p-values across tests
de_tests Number of tests that showed this gene having a log2FC going in the same direction as log2FC_median and having a p-value <= pval_th
The output is ordered by group1, -de_tests, -abs(log2FC_median), pval_max

Show the top marker genes

top_markers <- filter(de_res, de_tests == 2, p.adjust(pval_max) <= 0.001) %>% group_by(group1) %>% 
    filter(rank(-log2FC_median, ties.method = "first") <= 6)
DT::datatable(top_markers, rownames = FALSE, caption = "Conserved cell type specific markers") %>% 
    DT::formatRound(5:9, digits = 2) %>% DT::formatSignif(10)
Show 
10
 entriesSearch:
Conserved cell type specific markers
group1	group2	gene	n_tests	log2FC_min	log2FC_median	log2FC_max	mean1_median	mean2_median	pval_max	de_tests
B	rest	AL928768.3	2	6.16	6.56	6.96	0.11	0.00	2.6e-77	2
B	rest	TNFRSF13B	2	5.98	6.13	6.27	0.20	0.00	1.2e-223	2
B	rest	CD79A	2	5.65	5.81	5.98	1.55	0.03	0.0	2
B	rest	IGLL5	2	5.40	5.62	5.84	0.24	0.00	6.8e-116	2
B	rest	FCRLA	2	4.97	5.50	6.03	0.11	0.00	3.6e-120	2
B	rest	MS4A1	2	5.29	5.30	5.31	0.67	0.02	0.0	2
B Activated	rest	AC006129.4	2	6.61	7.32	8.03	0.18	0.00	1.7e-263	2
B Activated	rest	TVP23A	2	5.98	6.27	6.57	0.34	0.00	4.9e-153	2
B Activated	rest	PYCR1	2	5.34	6.14	6.94	0.26	0.00	2.9e-243	2
B Activated	rest	CCND1	2	5.77	5.87	5.97	0.05	0.00	1.4e-49	2
Showing 1 to 10 of 78 entriesPrevious12345…8Next
Show a heatmap of the marker genes. The mean expression per gene is scaled separately for each sample (CTRL and STIM) but they are shown interleaved to highlight that the differential expression is conserved across samples.

celltypes <- immune.combined.sct$seurat_annotations
samples <- immune.combined.sct$stim
immune.combined.sct$type_by_stim <- factor(x = paste(celltypes, samples, sep = " "), 
    levels = paste(rep(levels(top_markers$group1), each = length(unique(samples))), 
        unique(samples)))
mat <- sctransform:::row_gmean_grouped_dgcmatrix(matrix = GetAssayData(immune.combined.sct, 
    assay = "SCT", slot = "counts")[top_markers$gene, ], group = factor(immune.combined.sct$type_by_stim), 
    eps = 1, shuffle = FALSE)
# scale samples separately
sel <- grepl(" CTRL$", colnames(mat))
mat[, sel] <- t(scale(t(mat[, sel])))
sel <- grepl(" STIM$", colnames(mat))
mat[, sel] <- t(scale(t(mat[, sel])))
mat <- mat[order(apply(mat, 1, which.max)), ]

p1 <- melt(mat, varnames = c("gene", "celltype")) %>% mutate(celltype = factor(celltype, 
    levels = rev(colnames(mat)))) %>% ggplot(aes(gene, celltype, fill = value)) + 
    geom_tile(colour = "gray66") + scale_fill_gradient2(low = "white", mid = "white", 
    high = "black", name = "Geometric mean of each gene per cell type and sample, scaled per gene and sample") + 
    theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) + labs(x = NULL, 
    y = NULL) + scale_x_discrete(expand = c(0, 0)) + scale_y_discrete(expand = c(0, 
    0)) + theme(legend.position = "top", axis.ticks = element_blank())
show(p1)


To identify differences between stimulated and control conditions within a celltype, use the test as shown in the examples above. For example, below we look for genes differentially expressed between conditions within B cells.

de_res <- diff_mean_test(y = GetAssayData(immune.combined.sct, assay = "SCT", slot = "counts"), 
    group_labels = immune.combined.sct$type_by_stim, compare = c("B CTRL", "B STIM"), 
    verbosity = 0)
top_markers1 <- arrange(de_res, zscore) %>% head(10)
top_markers2 <- arrange(de_res, -zscore) %>% head(10)
rbind(top_markers1, top_markers2) %>% select(gene, mean1, mean2, log2FC, pval_adj) %>% 
    DT::datatable(rownames = FALSE, caption = "Top 20 B cell DE genes between control (group 1) and stimulated (group 2)") %>% 
    DT::formatRound(2:4, digits = 2) %>% DT::formatSignif(5)
Show 
10
 entriesSearch:
Top 20 B cell DE genes between control (group 1) and stimulated (group 2)
gene	mean1	mean2	log2FC	pval_adj
ISG15	0.24	13.24	-5.76	2.4e-179
ISG20	1.09	11.87	-3.44	5.4e-152
IFIT3	0.04	4.17	-6.63	1.1e-140
IFI6	0.06	4.05	-6.04	3.9e-128
TNFSF10	0.01	1.77	-7.02	2.2e-123
IFIT2	0.03	1.79	-5.85	2.5e-110
IFIT1	0.03	2.68	-6.66	8.5e-105
LY6E	0.13	2.60	-4.27	1.4e-101
MX1	0.11	2.77	-4.61	7.4e-96
PLSCR1	0.09	1.57	-4.16	2.6e-80
Showing 1 to 10 of 20 entriesPrevious12Next
Session info and runtime
Session info