单细胞ATAC实战04: 联合scRNA-seq数据给细胞注释
数据 实战 联合 注释 单细胞 04 细胞 Seq
2023-06-13 09:18:25 时间
from pathlib import Path
import warnings
import numpy as np
import pandas as pd
import scanpy as sc
import snapatac2 as snap
import scvi
import bioquest as bq
import sckit as sk
基因组注释文件
gff_file="~/DataHub/Genomics/GENCODE/hg38.v43.chr_patch_hapl_scaff.basic.annotation.gff3.gz"
简单合并scRNA-seq和scATAC-seq
adata = snap.read_dataset("PBMC.h5ads")
- gene activity matrix
query = snap.pp.make_gene_matrix(adata, gff_file=gff_file)
query.obs_names = adata.obs_names
- scRNA-seq数据
从网站下载的已经注释好的h5ad文件作为reference
reference = sc.read("10x-Multiome-Pbmc10k-RNA.h5ad")
- 简单合并scATAC-seq和scRNA-seq数据
data = reference.concatenate(query, batch_categories=["reference", "query"])
- 标准流程处理
data.layers["counts"] = data.X.copy()
#过滤
sc.pp.filter_genes(data, min_cells=5)
#文库大小
sc.pp.normalize_total(data, target_sum=1e4)
#log转化
sc.pp.log1p(data)
# 前5000高变基因
sc.pp.highly_variable_genes(
data,
n_top_genes = 5000,
flavor="seurat_v3",
layer="counts",
batch_key="batch",
subset=True
)
Data integration
- First we setup the scvi-tools to pretrain the model.
scvi.model.SCVI.setup_anndata(data, layer="counts", batch_key="batch")
vae = scvi.model.SCVI(
data,
n_layers=2,
n_latent=30,
gene_likelihood="nb",
dispersion="gene-batch",
)
- 训练vae模型
vae.train(max_epochs=1000, early_stopping=True,use_gpu=True)
- Let’s plot the training history and make sure the model has converged.
converged 即1000个epochs前,模型的损失不在大幅度减小
ax = vae.history['elbo_train'][1:].plot()
vae.history['elbo_validation'].plot(ax=ax)
data.obs["celltype_scanvi"] = 'Unknown'
ref_idx = data.obs['batch'] == "reference"
data.obs["celltype_scanvi"][ref_idx] = data.obs['cell_type'][ref_idx]
- 训练lvae模型
lvae = scvi.model.SCANVI.from_scvi_model(
vae,
adata=data,
labels_key="celltype_scanvi",
unlabeled_category="Unknown",
)
lvae.train(max_epochs=1000, n_samples_per_label=100,use_gpu=True)
lvae.history['elbo_train'][1:].plot()
细胞注释
- We now can perform the label transfer/prediction and obtain the joint embedding of reference and query data.
data.obs["C_scANVI"] = lvae.predict(data)
data.obsm["X_scANVI"] = lvae.get_latent_representation(data)
sc.pp.neighbors(data, use_rep="X_scANVI")
sc.tl.umap(data)
sc.pl.umap(data, color=['C_scANVI', "batch"], wspace=0.45)
obs = data.obs
obs = obs[obs['batch'] == 'query']
obs.index = list(map(lambda x: x.split("-query")[0], obs.index))
- Save the predicted cell type labels back to the original cell by bin matrix.
atac=adata.to_adata()
adata.close()
atac.obs['cell_type'] = obs.loc[atac.obs.index]['C_scANVI']
- refine the cell type labels at the leiden cluster level, and save the result.
根据leiden cluster手动注释细胞类型
from collections import Counter
cell_type_labels = np.unique(atac.obs['cell_type'])
count_table = {}
for cl, ct in zip(atac.obs['leiden'], atac.obs['cell_type']):
if cl in count_table:
count_table[cl].append(ct)
else:
count_table[cl] = [ct]
mat = []
for cl, counts in count_table.items():
c = Counter(counts)
c = np.array([c[ct] for ct in cell_type_labels])
c = c / c.sum()
mat.append(c)
import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt
df_cm = pd.DataFrame(
mat,
index = count_table.keys(),
columns = cell_type_labels,
)
sn.clustermap(df_cm, annot=True)
sk.label_helper(len(atac.obs.leiden.unique())-1)
new_cluster_names ='''
0,CD14 Mono
1,MAIT
2,CD4 Naive
3,NK
4,Memory B
5,CD14 Mono
6,CD14 Mono
7,Intermediate B
8,cDC
9,Naive B
10,Naive B
11,Treg
12,cDC
13,Treg
14,CD14 Mono
15,pDC
'''
sk.labeled(atac,cluster_names=new_cluster_names,reference_key='leiden')
sc.pl.umap(atac, color=['leiden', "CellType"], wspace=0.45,legend_loc="on data",legend_fontoutline=3)
atac.write_h5ad("atac_annot.h5ad",compression='lzf')
adata.close()
相关文章
- 历时一年,《深度学习与交通大数据实战》正式出版!
- 关于大数据平台,这有一套完整的方法论,你确定不收藏?[通俗易懂]
- 05 tp6 的数据添加 助手函数、 save、insert、strict、replace、insertGetId、insertAll《ThinkPHP6 入门到电商实战》
- 程序员快来学习缓存层场景实战数据收集—技术选型思路及整体方案
- 画图搞懂Kafka的高可用方案-ISR机制如何保证写入数据时主从的数据同步
- SPL 实现电力高频时序数据实时存储统计
- Spark入门实战系列–4.Spark运行架构详解大数据
- MongoDB for Java详解大数据
- 基于leveldb的Nosql数据库SSDB详解大数据
- 实战Kafka ACL机制详解大数据
- Oracle数据安全:借助包体加密保护数据(oracle包体加密)
- 实战教程:Oracle修改列数据的方法(oracle修改列的值)
- 利用Oracle统计表数据的方法(oracle统计表数据)
- MySQL管理之二进制数据操作实战(mysql二进制数据)
- 使用 Python 处理 JSON 格式的数据
- Oracle数据库中数据默认值设置与应用详解(oracle数据默认值)
- MySQL分区:实现高效数据存储(mysql分区的作用)
- Oracle数据导出实战:从零开始入门(oracle导出实例)
- 数据今日mssql数据查询实战(mssql 查询当日)
- Oracle数据库全库导出一步搞定(oracle全库数据导出)
- 数据深入Redis增删改查秒杀实战(用Redis实现添加删除)
- 清华大学Redis实战让数据变得更加安全(清华大学redis实战)
- MySQL轻松将一行数据转换成多行(mysql一行转换多行)
- 的数据Redis存储多种数据类型的缓存方案(redis能缓存哪些类型)
- 用存储过程、GetRows()、抽取10万条数据的速度测试
- php判断文件上传类型及过滤不安全数据的方法