Learning Python: counting spliced-alignment reads in a BAM file with the pysam module

Date: 2023-06-13 09:16:32

When I view a BAM file in IGV, every alignment has a CIGAR field. What does it mean?

I found an explanation here:

https://sites.google.com/site/bioinformaticsremarks/bioinfo/sam-bam-format/what-is-a-cigar

So a read with a spliced alignment has an N operation in its CIGAR string, and counting such reads only requires checking the CIGAR string for N.
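
To make the idea concrete, here is a minimal sketch that splits a CIGAR string into its operations and checks for N. The function names parse_cigar and is_spliced are made up for this illustration and are not part of pysam.

```python
import re

# One CIGAR element is a length followed by a single operation letter.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string such as '50M1000N50M' into (length, op) pairs."""
    return [(int(length), op) for length, op in CIGAR_ELEMENT.findall(cigar)]


def is_spliced(cigar):
    """Return True if the alignment skips a stretch of the reference (an N operation)."""
    return any(op == "N" for _, op in parse_cigar(cigar))


print(parse_cigar("50M1000N50M"))  # [(50, 'M'), (1000, 'N'), (50, 'M')]
print(is_spliced("50M1000N50M"), is_spliced("100M"))  # True False
```

For a simple yes/no answer, testing whether 'N' occurs in the string is enough; parsing becomes useful when the length of the skipped region (the intron) is also of interest.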

The pysam module can count all reads in a given interval, and it can also report properties of each individual read.

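import pysam

# Open the BAM file for reading; fetching by region requires an index (for example a .bai file).
bamfile = pysam.AlignmentFile("../barkeRTD/output.split.bam/B1/chr1H_part_1.bam",'rb')
# Iterator over the reads overlapping this interval (pysam uses 0-based, half-open coordinates).
reads = bamfile.fetch("chr1H_part_1",102778300,102779978)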

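reads is an iterable object, so each read can be visited in turn. Every read exposes many attributes, for example is_read1, is_forward, is_reverse and cigarstring, which are the ones used in the script further down.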

There is a lot to explore here.
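
As one small example of exploring these attributes, the sketch below tallies total, reverse-strand and spliced reads. It relies on two attributes of pysam reads, is_reverse and cigartuples (where the N operation has the numeric code 3). The function name summarize_reads and the demo objects are hypothetical stand-ins so that the code runs without a BAM file.

```python
from collections import Counter
from types import SimpleNamespace

CIGAR_REF_SKIP = 3  # numeric code used for the N operation in cigartuples


def summarize_reads(reads):
    """Tally total, reverse-strand and spliced reads from read-like objects."""
    counts = Counter()
    for read in reads:
        counts["total"] += 1
        if read.is_reverse:
            counts["reverse"] += 1
        # cigartuples is None when a read has no alignment
        ops = read.cigartuples or []
        if any(op == CIGAR_REF_SKIP for op, _ in ops):
            counts["spliced"] += 1
    return counts


# Hypothetical stand-ins for pysam reads; these are not real data.
demo_reads = [
    SimpleNamespace(is_reverse=False, cigartuples=[(0, 50), (3, 1000), (0, 50)]),
    SimpleNamespace(is_reverse=True, cigartuples=[(0, 100)]),
    SimpleNamespace(is_reverse=True, cigartuples=None),
]
print(summarize_reads(demo_reads))
```

With a real file, the iterator returned by bamfile.fetch(...) could be passed in place of demo_reads.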

Counting spliced-alignment reads within each gene interval, using a GTF file

import argparse
import pysam


import pandas as pd
#from multiprocessing import Pool

parser = argparse.ArgumentParser(description="Stat read orientation")

parser.add_argument('-g','--gtf',required=True,help="input gtf path")
parser.add_argument('-b','--bam',required=True,help="input bam path")
parser.add_argument('-o','--csv',required=True,help="output csv file")

args = parser.parse_args()

print("Let's go!")

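# Keep seqname, source, feature, start, end, strand and attributes;
# the score (column 5) and frame (column 7) columns are not needed.
selected_col = [0,1,2,3,4,6,8]
col_names = ['chromosome','source','feature','start','end','strand','gene_name']

df = pd.read_table(args.gtf,sep="\t",comment="#",header=None,usecols=selected_col,names=col_names)
# Remove the literal "ID=" text from the attribute column.
df['gene_name'] = df["gene_name"].str.replace("ID=","",regex=False)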
#chromo = set(df['chromosome'].tolist())


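# The BAM path is assumed to look like <...>/<sample>/<chromosome>.bam,
# e.g. B1/chr1H_part_1.bam gives chromosome "chr1H_part_1" and sample "B1".
chromo_name = args.bam.split("/")[-1].split(".")[0]
Sam = args.bam.split("/")[-2]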

new_df = df.loc[df['chromosome'] == chromo_name]

bamfile = pysam.AlignmentFile(args.bam,'rb')

output_df = {'Sample':[],
             'chromosome':[],
             'gene_name':[],
             'start':[],
             'end':[],
             'strand':[],
             'positive':[],
             'negative':[],
             'positive_spliced':[],
             'negative_spliced':[]}

for i,j in new_df.iterrows():
    positive = 0
    negative = 0
    negative_spliced = 0
    positive_spliced = 0
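    # GTF coordinates are 1-based and inclusive, while fetch() expects
    # 0-based, half-open coordinates, hence start - 1.
    for read in bamfile.fetch(chromo_name,j.start - 1,j.end):
        # Count each pair once via read 1 and skip unmapped reads.
        if not read.is_read1 or read.is_unmapped:
            continue
        # cigarstring can be None, so fall back to an empty string.
        spliced = 'N' in (read.cigarstring or '')
        if read.is_forward:
            positive += 1
            if spliced:
                positive_spliced += 1
        else:
            negative += 1
            if spliced:
                negative_spliced += 1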
    output_df['chromosome'].append(j.chromosome)
    output_df['gene_name'].append(j.gene_name)
    output_df['start'].append(j.start)
    output_df['end'].append(j.end)
    output_df['strand'].append(j.strand)
    output_df['positive'].append(positive)
    output_df['negative'].append(negative)
    output_df['negative_spliced'].append(negative_spliced)
    output_df['positive_spliced'].append(positive_spliced)
    output_df['Sample'].append(Sam)

bamfile.close()
pd.DataFrame(output_df).to_csv(args.csv,index=False)

print("Congratulations!")
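
The script infers the chromosome name and the sample name from the BAM path rather than taking them as arguments. The sketch below isolates that step so the convention is easy to check; the function name names_from_bam_path is made up for this illustration, and the example path is the one used earlier in this post.

```python
def names_from_bam_path(bam_path):
    """Mirror the script's path handling and return (chromosome, sample)."""
    parts = bam_path.split("/")
    chromosome = parts[-1].split(".")[0]  # file name up to the first dot
    sample = parts[-2]                    # name of the parent directory
    return chromosome, sample


print(names_from_bam_path("../barkeRTD/output.split.bam/B1/chr1H_part_1.bam"))
# ('chr1H_part_1', 'B1')
```

One consequence of this design: a bare file name with no directory part has no parts[-2], so the lookup raises IndexError. The BAM path therefore has to include the sample directory.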

Only alignments of read 1 are counted here, including the spliced ones.

For paired-end sequencing data, pysam counts each pair as two reads, read 1 and read 2.
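
Whether an alignment is read 1 or read 2, and which strand it is on, is stored in the FLAG field of the SAM/BAM record; attributes such as is_read1 and is_reverse reflect those bits. The sketch below decodes the relevant bits by hand. The function name describe_flag is made up for this illustration.

```python
# FLAG bits defined by the SAM format
FLAG_PAIRED = 0x1
FLAG_REVERSE = 0x10
FLAG_READ1 = 0x40
FLAG_READ2 = 0x80


def describe_flag(flag):
    """Decode the FLAG bits that matter for the read 1 / strand counts."""
    return {
        "paired": bool(flag & FLAG_PAIRED),
        "read1": bool(flag & FLAG_READ1),
        "read2": bool(flag & FLAG_READ2),
        "reverse": bool(flag & FLAG_REVERSE),
    }


# 99 and 147 are common FLAG values for a properly paired read 1 and read 2.
for flag in (99, 147):
    print(flag, describe_flag(flag))
```

Restricting the counts to read 1 means each sequenced fragment contributes at most one read to the totals.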

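How to run the script

 python stat_spliced_junction_read_orientation.py -g input.gtf -b input.bam -o output.csv

Note that the script takes the sample name from the BAM file's parent directory and the chromosome name from the file name, so the path given to -b needs to include that directory, for example B1/chr1H_part_1.bam.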

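Final result: a CSV file with one row per GTF record on the chromosome and the columns Sample, chromosome, gene_name, start, end, strand, positive, negative, positive_spliced and negative_spliced.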
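
As a possible follow-up, the fraction of counted reads that are spliced can be computed per row of that CSV. This is only a sketch: the rows in demo_csv are invented to match the column layout and are not real results, and spliced_fraction is a hypothetical helper name.

```python
import csv
import io

# Hypothetical rows in the column layout the script writes; not real results.
demo_csv = """Sample,chromosome,gene_name,start,end,strand,positive,negative,positive_spliced,negative_spliced
B1,chr1H_part_1,geneA,100,900,+,40,10,30,5
B1,chr1H_part_1,geneB,1500,2500,-,0,0,0,0
"""


def spliced_fraction(row):
    """Fraction of counted read-1 alignments that are spliced, or None if there are none."""
    total = int(row["positive"]) + int(row["negative"])
    spliced = int(row["positive_spliced"]) + int(row["negative_spliced"])
    return spliced / total if total else None


for row in csv.DictReader(io.StringIO(demo_csv)):
    print(row["gene_name"], spliced_fraction(row))
```

To use it on real output, open the file written with -o and pass it to csv.DictReader instead of the in-memory demo string.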
