zl程序教程

您现在的位置是:首页 >  其他

当前栏目

单细胞各个亚群特异性高表达基因的数据库注释(包括GO,KEGG,ReactomePA)

2023-02-18 16:30:34 时间

现在单细胞技术的大行其道,让我们的表达量矩阵里面的列也变成了成百上千甚至过万了,首先它们属于不同的样品,如果是10x技术的单细胞转录组,每个样品可以有5到8千的单细胞列,其次呢我们通常是做多个单细胞样品,详见:2个分组的单细胞项目标准分析,这样的话不同样品也可以有表型分组,比如处理组和对照组。

拿到了一个单细胞表达量矩阵,默认需要进行: 单细胞聚类分群注释 ,如果你对单细胞数据分析还没有基础认知,可以看基础10讲:

找各个单细胞亚群特异性高表达量基因

主流的方法是FindAllMarkers函数,但是它比较耗费计算机资源和时间,也可以采用cosg方法,省时省力。

我们以大家熟知的pbmc3k数据集为例,大家先安装这个数据集对应的包,并且对它进行降维聚类分群,参考前面的例子:人人都能学会的单细胞聚类分群注释 ,而且每个亚群找高表达量基因,都存储为Rdata文件。标准代码是:

library(SeuratData) #加载seurat数据集  
getOption('timeout')
options(timeout=10000)
#InstallData("pbmc3k")  
data("pbmc3k")  
sce <- pbmc3k.final  
library(Seurat)
table(Idents(sce))
DimPlot(sce,label = T)

sce.markers <- FindAllMarkers(object = sce, only.pos = TRUE, 
                              min.pct = 0.25, 
                              thresh.use = 0.25)
library(dplyr) 
top3 <- sce.markers %>% group_by(cluster) %>% top_n(3, avg_log2FC)
DoHeatmap(sce ,top3$gene,size=3)

会得到热图,展现每个单细胞亚群特异性高表达量基因的前3个,它其实就是一种差异分析了,这个时候它对比的是每个单细胞亚群和所有的其它细胞。

如果是使用COSG包的cosg函数,是另外的格式的输出各个单细胞亚群的top基因:

  input_sce = sce
  table(Idents(input_sce))
  pro = 'cosg_seurat_clusters'
  
  if(T){
    
    library(COSG)
    marker_cosg <- cosg(
      input_sce,
      groups='all',
      assay='RNA',
      slot='data',
      mu=1,
      n_genes_user=100)
    
    save(marker_cosg,file = paste0(pro,'_marker_cosg.Rdata'))
    head(marker_cosg)
    
    ## Top10 genes
    library(dplyr)  
    top_10 <- unique(as.character(apply(marker_cosg$names,2,head,10)))
 
    sce.Scale <- ScaleData(input_sce ,features =  top_10  )  
    
    DoHeatmap(  sce.Scale , top_10 , 
   
    ## Top3 genes
    top_3 <- unique(as.character(apply(marker_cosg$names,2,head,3)))
     
  }
  

然后数据库注释(包括GO,KEGG,ReactomePA)

我这里自己写了一个简单的函数, com_go_kegg_ReactomePA_human ,可以根据输入的基因名字列表去批量做数据库注释(包括GO,KEGG,ReactomePA),这个函数 com_go_kegg_ReactomePA_human 的代码如下所示:



com_go_kegg_ReactomePA_human <- function(symbols_list ,pro){ 
  library(clusterProfiler)
  library(org.Hs.eg.db)
  library(ReactomePA)
  library(ggplot2)
  library(stringr)
  
  # 首先全部的symbol 需要转为 entrezID
  gcSample = lapply(symbols_list, function(y){ 
    y=as.character(na.omit(select(org.Hs.eg.db,
                                  keys = y,
                                  columns = 'ENTREZID',
                                  keytype = 'SYMBOL')[,2])
    )
    y
  })
  gcSample
  
  # 第1个注释是 KEGG 
  xx <- compareCluster(gcSample, fun="enrichKEGG",
                       organism="hsa", pvalueCutoff=0.05)
  dotplot(xx)  + 
    scale_y_discrete(labels=function(x) str_wrap(x, width=50)) 
  ggsave(paste0(pro,'_kegg.pdf'),width = 10,height = 8)
  
  # 第2个注释是 ReactomePA 
  xx <- compareCluster(gcSample, fun="enrichPathway",
                       organism = "human",
                  pvalueCutoff=0.05)
  dotplot(xx)  + 
    scale_y_discrete(labels=function(x) str_wrap(x, width=50)) 
  ggsave(paste0(pro,'_ReactomePA.pdf'),width = 10,height = 8)
  
  # 然后是GO数据库的BP,CC,MF的独立注释
  # Run full GO enrichment test for BP 
  formula_res <- compareCluster(
    gcSample, 
    fun="enrichGO", 
    OrgDb="org.Hs.eg.db",
    ont     = "BP",
    pAdjustMethod = "BH",
    pvalueCutoff  = 0.05,
    qvalueCutoff  = 0.05
  )
  
  # Run GO enrichment test and merge terms 
  # that are close to each other to remove result redundancy
  lineage1_ego <- simplify(
    formula_res, 
    cutoff=0.5, 
    by="p.adjust", 
    select_fun=min
  ) 
  pdf(paste0(pro,'_GO_BP_cluster_simplified.pdf') ,width = 15,height = 8)
  print(dotplot(lineage1_ego, showCategory=5) + 
          scale_y_discrete(labels=function(x) str_wrap(x, width=50)) )
  dev.off() 
  write.csv(lineage1_ego@compareClusterResult, 
            file=paste0(pro,'_GO_BP_cluster_simplified.csv'))
  # Run full GO enrichment test for CC 
  formula_res <- compareCluster(
    gcSample, 
    fun="enrichGO", 
    OrgDb="org.Hs.eg.db",
    ont     = "CC",
    pAdjustMethod = "BH",
    pvalueCutoff  = 0.05,
    qvalueCutoff  = 0.05
  )
  
  # Run GO enrichment test and merge terms 
  # that are close to each other to remove result redundancy
  lineage1_ego <- simplify(
    formula_res, 
    cutoff=0.5, 
    by="p.adjust", 
    select_fun=min
  ) 
  pdf(paste0(pro,'_GO_CC_cluster_simplified.pdf') ,width = 15,height = 8)
  print(dotplot(lineage1_ego, showCategory=5) + 
          scale_y_discrete(labels=function(x) str_wrap(x, width=50)) )
  dev.off() 
  write.csv(lineage1_ego@compareClusterResult, 
            file=paste0(pro,'_GO_CC_cluster_simplified.csv'))
  
  # Run full GO enrichment test for MF 
  formula_res <- compareCluster(
    gcSample, 
    fun="enrichGO", 
    OrgDb="org.Hs.eg.db",
    ont     = "MF",
    pAdjustMethod = "BH",
    pvalueCutoff  = 0.05,
    qvalueCutoff  = 0.05
  )
  
  # Run GO enrichment test and merge terms 
  # that are close to each other to remove result redundancy
  lineage1_ego <- simplify(
    formula_res, 
    cutoff=0.5, 
    by="p.adjust", 
    select_fun=min
  ) 
  pdf(paste0(pro,'_GO_MF_cluster_simplified.pdf') ,width = 15,height = 8)
  print(dotplot(lineage1_ego, showCategory=5) + 
          scale_y_discrete(labels=function(x) str_wrap(x, width=50)) )
  dev.off() 
  write.csv(lineage1_ego@compareClusterResult, 
            file=paste0(pro,'_GO_MF_cluster_simplified.csv'))
  
}

这样的话,无论是什么样的项目,使用起来都非常简洁:

symbols_list <-  as.list(as.data.frame(apply(marker_cosg$names,2,head,100)))
source('com_go_kegg_ReactomePA_human.R')
com_go_kegg_ReactomePA_human(symbols_list, pro='pbmc' )

可以简单的看看结果,首先是kegg数据库的注释,可以得到结果的可解释性还是蛮强的:

kegg数据库的注释

然后是ReactomePA数据库的注释结果,也是很合理啦:

ReactomePA数据库

后面还有GO数据库的CC,BP,MF的图,我就不一一展示了,因为这个PBMC非常出名,但凡是有 这方面生物学背景的,都很容易理解这些单细胞各个亚群特异性高表达基因的数据库注释(包括GO,KEGG,ReactomePA),可以作为你单细胞分群准确性的一个辅助证明。