转录组实战02: 计算非冗余外显子长度之和
计算非冗余外显子长度
安装gtftools(http://www.genemine.org/gtftools.php)
micromamba activate RNA
micromamba install -c bioconda gtftools
gtftools -l gene_length.txt ~/DataHub/Genomics/GENCODE/release_42/HS.gencode.v42.annotation.gtf
可以看到gtftools给出了4种基因长度,也给出了计算的方法,第四种方法也叫非冗余外显子长度,我们接下来使用这个长度计算TPM
Calculate gene lengths. Since a gene may have multiple isoforms, there are multiple ways to calculate gene length based on literature. Three simple ways are considering the mean, median and maximum of the lengths of isoforms as the length of the gene. A fourth way is to calculate the length of merged exons of all isoforms (i.e. non-overlapping exonic length). So, in total, four different types of gene lengths(the mean, median and max of lengths of isoforms of a gene, and the length of merged exons of isoforms of a gene) are provided.
gene mean median longest_isoform merged
ENSG00000287252.3 1546 1323 2751 3266
ENSG00000242268.3 1709 1709 2708 2750
ENSG00000270112.5 1793 1818 4685 10685
ENSG00000280143.1 5326 5326 5326 5326
ENSG00000146083.12 1239 816 4122 5627
ENSG00000158486.15 5952 3404 12567 15023
获取基因类型、基因名等信息
gtftools -g gene_info.txt ~/DataHub/Genomics/GENCODE/release_42/HS.gencode.v42.annotation.gtf
压缩保存
压缩保存以上两个文件,之后就不用在计算了,以逗号分隔方便之后python读取。
import pandas as pd
g1 = pd.read_csv("gene_info.txt",sep='\t',index_col=4,header=None)
g1.columns = ["Chr","Start","Stop","Strand","Symbol","GeneType"]
g2 = pd.read_csv("gene_length.txt",sep='\t',index_col=0,usecols=["gene","merged"])
g = pd.merge(g1,g2,left_index=True,right_index=True)
g.insert(0,column="Ensembl",value=g.index)
g.rename(columns={"merged":"Length"},inplace=True)
g.sort_values("Chr",inplace=True)
g.to_csv("hg38_gene_info_v42.csv.gz",index=False)
给count添加基因信息
修改下“转录组实战01: 从数据下载到定量fastp+STAR“中的代码,合并count数据饿时候给gene添加信息,这样比较像测序公司给的表达量数据。
# 现在的目录是~/Project/Human_16_Asthma_Bulk
from pathlib import Path
import pandas as pd
import datatable as dt
dir="alignment/STAR"
count_list = []
tpm_list = []
paths = Path(dir).glob("*ReadsPerGene.out.tab")
for x,y in enumerate(list(paths)):
if x < 1:
Ensembl = pd.read_csv(y,usecols=[0],sep='\t',skiprows=4,header=None)
Ensembl.rename(columns={0:'Ensembl'},inplace=True)
_count_df = pd.read_csv(y,usecols=[1],sep='\t',skiprows=4,header=None)
count_list.append(_count_df.rename(columns={1:y.name.split('Reads')[0]}))
count_df = pd.concat(count_list,axis=1)
count_df.insert(0,column='Ensembl',value=Ensembl)
count = pd.merge(gene_info,count_df,on="Ensembl")
dt.Frame(count).to_csv('star_count.csv.gz')
新的表达矩阵如下所示
Ensembl Chr Start Stop Strand Symbol GeneType Length SRR1039516 SRR1039522 SRR1039508 SRR1039523 SRR1039511 SRR1039510 SRR1039514 SRR1039521 SRR1039512 SRR1039513 SRR1039519 SRR1039515 SRR1039520 SRR1039509 SRR1039518 SRR1039>
ENSG00000130762.15 1 3454664 3481113 + ARHGEF16 protein_coding 5324 2 0 1 0 1 1 1 0 2 0 0 0 1 0 1 1
ENSG00000117472.10 1 46175072 46185962 + TSPAN1 protein_coding 2370 16 2 1 2 2 4 11 1 11 1 17 11 1 0 8 9
ENSG00000227857.2 1 46134530 46139081 + ENSG00000227857 lncRNA 339 0 1 2 0 0 2 0 0 2 0 0 1 1 0 0 0
ENSG00000233114.2 1 46104949 46105175 + ENSG00000233114 processed_pseudogene 226 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
相关文章
- TwoSampleMR实战教程之计算并解读MR结果
- 密集计算场景下的 JNI 实战
- 联邦学习助力人工智能新模型进化(附:金融隐私计算实战项目)
- 计算时间差工具类(TypeScript/JavaScript)
- GBDT的原理_gbdt怎么计算特征重要性
- PQ实战案例拆解 | 汇总多股票交易数据,计算5日均线的操作与算法优化 - 2
- 《前端图形学实战》几何学在前端边界计算中的应用和原理分析
- MySQL计算时间差的实现(mysql两个时间相减)
- 《代码英雄》第一季(6):揭秘云计算
- Live预告| 清微智能CTO《可重构计算芯片的技术原理及实现难点》
- C语言与MySQL,让计算世界更精彩(c语言mysql)
- Oracle技术实现精准年龄计算,智能应用时代来临(oracle年龄计算)
- MySQL中如何计算中位数(mysql中位数如何计算)
- 数计算Oracle中两时间差秒数的实现方法(oracle两时间相差秒)
- 用Redis实现经纬度距离计算(redis计算经纬度距离)
- Redis统计网站UV新技术的使用(redis 计算uv)
- JavaScript特有方法计算二进制中1的个数split方法