Learning Python: fixing the error raised when reading a GTF file with the read_gtf function from the pyRanges module

Date: 2023-06-13 09:16:32

The pyRanges documentation:

https://biocore-ntnu.github.io/pyranges/loadingcreating-pyranges.html


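In my own GTF file, ID and the string that follows it are joined by an equals sign, whereas a GTF file usually separates the two with a space. The function that pyRanges defines to split the attribute string therefore uses a space as the separator. The fix is to change the reading code slightly so that it also accepts an equals sign as a separator.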
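To make the difference concrete, here is a minimal sketch of the split applied to a single key/value pair. The helper name `split_key_value` is hypothetical and is not part of pyRanges; it only isolates the regular expression used in the modified function below.

```python
import re


def split_key_value(pair):
    """Hypothetical helper: split one attribute on the first space or '='."""
    # Drop the double quotes that GTF puts around values, then split once.
    return re.split(' |=', pair.replace('"', ''), 1)


gtf_style = split_key_value('gene_id "STRG.1"')  # key and value separated by a space
gff_style = split_key_value('ID=STRG.1')         # key and value joined by '='

print(gtf_style)  # ['gene_id', 'STRG.1']
print(gff_style)  # ['ID', 'STRG.1']
```

Because the split is limited to one occurrence, a value that itself contains a space or an equals sign stays intact.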

First, define the function that splits the last column (the attribute column):

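import re

import pandas as pd


def to_rows(anno):
    # Fail early if the first attribute value is not a string.
    try:
        for first in anno.head(1):
            first.replace('"', '').replace(";", "").split()
    except AttributeError:
        raise Exception("Invalid attribute string: {l}. If the file is in GFF3 format, use pr.read_gff3 instead.".format(l=first))

    rowdicts = []
    for l in anno:
        # Split on ";" (optionally followed by a space) and skip empty pieces,
        # such as the one left behind by a trailing ";".
        pairs = [kv.strip() for kv in re.split('; |;', l) if kv.strip()]
        # Split each pair into key and value on the first space or "=".
        rowdicts.append({kk[0]: kk[-1]
                         for kk in [re.split(' |=', kv.replace('""', '"NA"').replace('"', ''), 1)
                                    for kv in pairs]})

    return pd.DataFrame.from_dict(rowdicts).set_index(anno.index)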

Next, the function that reads the GTF file:

def read_gtf_full(f, as_df=False, nrows=None, skiprows=0):

    dtypes = {
        "Chromosome": "category",
        "Feature": "category",
        "Strand": "category"
    }

    names = "Chromosome Source Feature Start End Score Strand Frame Attribute".split()

    # Read the tab-separated file in chunks, skipping "#" comment lines.
    df_iter = pd.read_csv(
        f,
        sep="\t",
        header=None,
        names=names,
        dtype=dtypes,
        chunksize=int(1e5),
        skiprows=skiprows,
        nrows=nrows,
        comment="#")

    dfs = []
    for df in df_iter:
        # Expand the Attribute column into one column per key.
        extra = to_rows(df.Attribute)
        df = df.drop("Attribute", axis=1)
        ndf = pd.concat([df, extra], axis=1, sort=False)
        dfs.append(ndf)

    df = pd.concat(dfs, sort=False)
    # GTF start coordinates are 1-based; shift them to 0-based.
    df.loc[:, "Start"] = df.Start - 1

    if not as_df:
        return PyRanges(df)
    else:
        return df

Read the GTF file:

import pyranges as pr
from pyranges import PyRanges
read_gtf_full("example02.gtf")

Contents of example02.gtf (the columns are tab-separated in the actual file):

##gff-version 3
# gffread v0.12.7
# gffread -E --keep-genes /mnt/shared/scratch/wguo/barkeRTD/stringtie/B1/Stringtie_B1.gtf -o 00.newgtf/B1/Stringtie_B1_new.gtf
chr1H_part_1 StringTie gene 72141 73256 . + . ID=STRG.1
chr1H_part_1 StringTie transcript 72141 73256 1000 + . ID=STRG.1.1;Parent=STRG.1
chr1H_part_1 StringTie exon 72141 72399 1000 + . Parent=STRG.1.1
chr1H_part_1 StringTie exon 72822 73256 1000 + . Parent=STRG.1.1
chr1H_part_1 StringTie gene 102332 103882 . + . ID=STRG.2
chr1H_part_1 StringTie transcript 102332 103882 1000 + . ID=STRG.2.1;Parent=STRG.2
chr1H_part_1 StringTie exon 102332 103882 1000 + . Parent=STRG.2.1
chr1H_part_1 StringTie transcript 102332 103750 1000 + . ID=STRG.2.2;Parent=STRG.2
chr1H_part_1 StringTie exon 102332 103533 1000 + . Parent=STRG.2.2
chr1H_part_1 StringTie exon 103640 103750 1000 + . Parent=STRG.2.2
chr1H_part_1 StringTie gene 104391 108013 . - . ID=STRG.3
chr1H_part_1 StringTie transcript 104391 108013 1000 - . ID=STRG.3.4;Parent=STRG.3