您现在的位置是：首页 > 其他

当前栏目

单细胞各个亚群特异性高表达基因的数据库注释（包括GO,KEGG,ReactomePA）

2023-02-18 16:30:34 时间

现在单细胞技术的大行其道，让我们的表达量矩阵里面的列也变成了成百上千甚至过万了，首先它们属于不同的样品，如果是10x技术的单细胞转录组，每个样品可以有5到8千的单细胞列，其次呢我们通常是做多个单细胞样品，详见：2个分组的单细胞项目标准分析，这样的话不同样品也可以有表型分组，比如处理组和对照组。

拿到了一个单细胞表达量矩阵，默认需要进行： 单细胞聚类分群注释 ，如果你对单细胞数据分析还没有基础认知，可以看基础10讲：

找各个单细胞亚群特异性高表达量基因

主流的方法是FindAllMarkers函数，但是它比较耗费计算机资源和时间，也可以采用cosg方法，省时省力。

我们以大家熟知的pbmc3k数据集为例，大家先安装这个数据集对应的包，并且对它进行降维聚类分群，参考前面的例子：人人都能学会的单细胞聚类分群注释 ，而且每个亚群找高表达量基因，都存储为Rdata文件。标准代码是：

library(SeuratData) #加载seurat数据集  
getOption('timeout')
options(timeout=10000)
#InstallData("pbmc3k")  
data("pbmc3k")  
sce <- pbmc3k.final  
library(Seurat)
table(Idents(sce))
DimPlot(sce,label = T)

sce.markers <- FindAllMarkers(object = sce, only.pos = TRUE, 
                              min.pct = 0.25, 
                              thresh.use = 0.25)
library(dplyr) 
top3 <- sce.markers %>% group_by(cluster) %>% top_n(3, avg_log2FC)
DoHeatmap(sce ,top3$gene,size=3)

会得到热图，展现每个单细胞亚群特异性高表达量基因的前3个，它其实就是一种差异分析了，这个时候它对比的是每个单细胞亚群和所有的其它细胞。

如果是使用COSG包的cosg函数，是另外的格式的输出各个单细胞亚群的top基因：

  input_sce = sce
  table(Idents(input_sce))
  pro = 'cosg_seurat_clusters'
  
  if(T){
    
    library(COSG)
    marker_cosg <- cosg(
      input_sce,
      groups='all',
      assay='RNA',
      slot='data',
      mu=1,
      n_genes_user=100)
    
    save(marker_cosg,file = paste0(pro,'_marker_cosg.Rdata'))
    head(marker_cosg)
    
    ## Top10 genes
    library(dplyr)  
    top_10 <- unique(as.character(apply(marker_cosg$names,2,head,10)))
 
    sce.Scale <- ScaleData(input_sce ,features =  top_10  )  
    
    DoHeatmap(  sce.Scale , top_10 , 
   
    ## Top3 genes
    top_3 <- unique(as.character(apply(marker_cosg$names,2,head,3)))
     
  }

然后数据库注释（包括GO,KEGG,ReactomePA）

我这里自己写了一个简单的函数， com_go_kegg_ReactomePA_human ，可以根据输入的基因名字列表去批量做数据库注释（包括GO,KEGG,ReactomePA），这个函数 com_go_kegg_ReactomePA_human 的代码如下所示：



com_go_kegg_ReactomePA_human <- function(symbols_list ,pro){ 
  library(clusterProfiler)
  library(org.Hs.eg.db)
  library(ReactomePA)
  library(ggplot2)
  library(stringr)
  
  # 首先全部的symbol 需要转为 entrezID
  gcSample = lapply(symbols_list, function(y){ 
    y=as.character(na.omit(select(org.Hs.eg.db,
                                  keys = y,
                                  columns = 'ENTREZID',
                                  keytype = 'SYMBOL')[,2])
    )
    y
  })
  gcSample
  
  # 第1个注释是 KEGG 
  xx <- compareCluster(gcSample, fun="enrichKEGG",
                       organism="hsa", pvalueCutoff=0.05)
  dotplot(xx)  + 
    scale_y_discrete(labels=function(x) str_wrap(x, width=50)) 
  ggsave(paste0(pro,'_kegg.pdf'),width = 10,height = 8)
  
  # 第2个注释是 ReactomePA 
  xx <- compareCluster(gcSample, fun="enrichPathway",
                       organism = "human",
                  pvalueCutoff=0.05)
  dotplot(xx)  + 
    scale_y_discrete(labels=function(x) str_wrap(x, width=50)) 
  ggsave(paste0(pro,'_ReactomePA.pdf'),width = 10,height = 8)
  
  # 然后是GO数据库的BP,CC,MF的独立注释
  # Run full GO enrichment test for BP 
  formula_res <- compareCluster(
    gcSample, 
    fun="enrichGO", 
    OrgDb="org.Hs.eg.db",
    ont     = "BP",
    pAdjustMethod = "BH",
    pvalueCutoff  = 0.05,
    qvalueCutoff  = 0.05
  )
  
  # Run GO enrichment test and merge terms 
  # that are close to each other to remove result redundancy
  lineage1_ego <- simplify(
    formula_res, 
    cutoff=0.5, 
    by="p.adjust", 
    select_fun=min
  ) 
  pdf(paste0(pro,'_GO_BP_cluster_simplified.pdf') ,width = 15,height = 8)
  print(dotplot(lineage1_ego, showCategory=5) + 
          scale_y_discrete(labels=function(x) str_wrap(x, width=50)) )
  dev.off() 
  write.csv(lineage1_ego@compareClusterResult, 
            file=paste0(pro,'_GO_BP_cluster_simplified.csv'))
  # Run full GO enrichment test for CC 
  formula_res <- compareCluster(
    gcSample, 
    fun="enrichGO", 
    OrgDb="org.Hs.eg.db",
    ont     = "CC",
    pAdjustMethod = "BH",
    pvalueCutoff  = 0.05,
    qvalueCutoff  = 0.05
  )
  
  # Run GO enrichment test and merge terms 
  # that are close to each other to remove result redundancy
  lineage1_ego <- simplify(
    formula_res, 
    cutoff=0.5, 
    by="p.adjust", 
    select_fun=min
  ) 
  pdf(paste0(pro,'_GO_CC_cluster_simplified.pdf') ,width = 15,height = 8)
  print(dotplot(lineage1_ego, showCategory=5) + 
          scale_y_discrete(labels=function(x) str_wrap(x, width=50)) )
  dev.off() 
  write.csv(lineage1_ego@compareClusterResult, 
            file=paste0(pro,'_GO_CC_cluster_simplified.csv'))
  
  # Run full GO enrichment test for MF 
  formula_res <- compareCluster(
    gcSample, 
    fun="enrichGO", 
    OrgDb="org.Hs.eg.db",
    ont     = "MF",
    pAdjustMethod = "BH",
    pvalueCutoff  = 0.05,
    qvalueCutoff  = 0.05
  )
  
  # Run GO enrichment test and merge terms 
  # that are close to each other to remove result redundancy
  lineage1_ego <- simplify(
    formula_res, 
    cutoff=0.5, 
    by="p.adjust", 
    select_fun=min
  ) 
  pdf(paste0(pro,'_GO_MF_cluster_simplified.pdf') ,width = 15,height = 8)
  print(dotplot(lineage1_ego, showCategory=5) + 
          scale_y_discrete(labels=function(x) str_wrap(x, width=50)) )
  dev.off() 
  write.csv(lineage1_ego@compareClusterResult, 
            file=paste0(pro,'_GO_MF_cluster_simplified.csv'))
  
}

这样的话，无论是什么样的项目，使用起来都非常简洁：

symbols_list <-  as.list(as.data.frame(apply(marker_cosg$names,2,head,100)))
source('com_go_kegg_ReactomePA_human.R')
com_go_kegg_ReactomePA_human(symbols_list, pro='pbmc' )

可以简单的看看结果，首先是kegg数据库的注释，可以得到结果的可解释性还是蛮强的：

kegg数据库的注释

然后是ReactomePA数据库的注释结果，也是很合理啦：

ReactomePA数据库

后面还有GO数据库的CC,BP,MF的图，我就不一一展示了，因为这个PBMC非常出名，但凡是有这方面生物学背景的，都很容易理解这些单细胞各个亚群特异性高表达基因的数据库注释（包括GO,KEGG,ReactomePA），可以作为你单细胞分群准确性的一个辅助证明。

猜你喜欢

将博客搬至CSDN
痞子衡嵌入式：知名半导体MCU大厂软件开发C代码规范
痞子衡嵌入式：语音处理工具pzh-speech诞生记（4）- 音频录播实现(PyAudio)
痞子衡嵌入式：语音处理工具pzh-speech诞生记（2）- 界面构建(wxFormBuilder3.8.0)
痞子衡嵌入式：ARM Cortex-M内核那些事（3.2）- 为Security而生(M23/33/35P)
痞子衡嵌入式：了解i.MXRT1xxx系列里负责动态分配ITCM/DTCM/OCRAM大小的FlexRAM模块
痞子衡嵌入式：恩智浦i.MX RTxxx系列MCU启动那些事（8）- 从Serial(1-bit SPI) NOR恢复启动
痞子衡嵌入式：恩智浦i.MX RTxxx系列MCU启动那些事（4）- OTP及其烧写方法
痞子衡嵌入式：恩智浦i.MX RTxxx系列MCU启动那些事（3）- Serial ISP模式(blhost)
痞子衡嵌入式：恩智浦i.MX RTxxx系列MCU启动那些事（2）- Boot配置(ISP_Pin/OTP)
痞子衡嵌入式：恩智浦i.MX RTxxx系列MCU启动那些事（1）- Boot简介
痞子衡嵌入式：恩智浦机器视觉模块OpenMV-RT那些事（1）- 初体验
痞子衡嵌入式：嵌入式从业者应知应会知识点 - 索引
痞子衡嵌入式：飞思卡尔Kinetis系列MCU启动那些事（10）- KBOOT特性(可靠升级)
痞子衡嵌入式：串行EEPROM接口事实标准及SPI EEPROM简介
痞子衡嵌入式：恩智浦i.MX RT1xxx系列MCU硬件那些事（2.2）- 在串行NOR Flash XIP调试原理
痞子衡嵌入式：恩智浦i.MX RT1xxx系列MCU硬件那些事（2.1）- 玩转板载OpenSDA,Freelink调试器
痞子衡嵌入式：史上最强i.MX RT学习资源汇总(持续更新中...)
痞子衡嵌入式：终于可以放开聊一聊i.MXRT1170这颗划时代MCU了
痞子衡嵌入式：MCUBootUtility v2.0来袭，i.MXRT1010哪里跑

zl程序教程

当前栏目

单细胞各个亚群特异性高表达基因的数据库注释（包括GO,KEGG,ReactomePA）

找各个单细胞亚群特异性高表达量基因

然后数据库注释（包括GO,KEGG,ReactomePA）

相关文章