bioinformatics - 基于变异和人类参考构建DNA序列

Question

1000 基因组计划为我们提供了有关数千人 DNA 序列相对于人类参考 DNA 序列“变异”的信息。变体以VCF文件
格式存储。基本上，对于该项目中的每个人，我们都可以从 VCF 文件中获取他/她的 DNA 变异信息，例如，变异的类型（例如插入/删除和 SNP）以及变异相对于参考的位置。参考采用 FASTA 格式。通过结合 VCF 文件中一个人的变异信息和 FASTA 文件中的人类参考，我想为那个人构建 DNA 序列。

我的问题是：是否已经存在一些工具可以很好地执行任务，或者我必须自己编写脚本。

score 3 · Accepted Answer

来自 VCFtools的 perl 脚本vcf-consensus似乎与您正在寻找的内容接近：

vcf-consensus  
Apply VCF variants to a fasta file to create consensus sequence.

Usage: cat ref.fa | vcf-consensus [OPTIONS] in.vcf.gz > out.fa
Options:
   -h, -?, --help         This help message.
   -H, --haplotype <int>  Apply only variants for the given haplotype (1,2)
   -s, --sample <name>    If not given, all variants are applied
Examples:
   samtools faidx ref.fa 8:11870-11890 | vcf-consensus in.vcf.gz > out.fa

来自参考 fasta 和变体调用文件的新 fasta 序列问题的答案？在 Biostar 上发布也可能有所帮助。

score 2 · Accepted Answer

您可以使用 bcftools ( https://github.com/samtools/bcftools ) 来执行此任务：

bcftools consensus <file.vcf> \
  --fasta-ref <file> \
  --iupac-codes \
  --output <file> \
  --sample <name>

要安装 bcftools：

git clone --branch=develop git://github.com/samtools/bcftools.git
git clone --branch=develop git://github.com/samtools/htslib.git
cd htslib && make && cd ..
cd bcftools && make && cd ..
sudo cp bcftools/bcftools /usr/local/bin/

您还可以将 bcftools 共识与 samtools faidx ( http://www.htslib.org/ ) 结合起来，从 fasta 文件中提取特定的时间间隔。有关更多信息，请参见 bcftools 共识：

About:   Create consensus sequence by applying VCF variants to a reference
         fasta file.
Usage:   bcftools consensus [OPTIONS] <file.vcf>
Options:
    -f, --fasta-ref <file>     reference sequence in fasta format
    -H, --haplotype <1|2>      apply variants for the given haplotype
    -i, --iupac-codes          output variants in the form of IUPAC ambiguity codes
    -m, --mask <file>          replace regions with N
    -o, --output <file>        write output to a file [standard output]
    -c, --chain <file>         write a chain file for liftover
    -s, --sample <name>        apply variants of the given sample
Examples:
   # Get the consensus for one region. The fasta header lines are then expected
   # in the form ">chr:from-to".
   samtools faidx ref.fa 8:11870-11890 | bcftools consensus in.vcf.gz > out.fa

score 1 · Accepted Answer

任何人仍然来到这个页面，如果你有一个 fasta 参考基因组和一个 bam 文件，你想通过更改 SNP 和 N 将其转换为参考文件，你可以使用 samtools、bcftools 和 vcfutils.pl （ps对于初学者：samtools 和 bcftools 都可以在计算集群或 Linux 中编译，如果是这样，只需在软件名称前添加每个位置；vcfutils 已经是 bcftools 的 perl 脚本）

samtools mpileup -d8000 -q 20 -Q 10 -uf  REFERENCE.fasta Your_File.bam | bcftools call -c | vcfutils.pl vcf2fq  > OUTPUT.fastq

d, --max-depth == -q, -min-MQ 要使用的比对的最小映射质量 == -Q, --min-BQ 要考虑的碱基的最小碱基质量 ==（您可以使用当然不同的值，请参阅http://www.htslib.org/doc/samtools.html）

这会生成一种看起来像 fastq 但不是的奇怪格式，因此您无法使用转换器对其进行转换，但您可以使用以下 sed 命令，我为此输出专门编写了该命令：

sed -i -e '/^+/,/^\@/{/^+/!{/^\@/!d}}; /^+/ d; s/@/>/g' OUTPUT.fastq

最后，确保将您的新 fasta 文件与您的参考文件进行比较，以确保一切正常。

编辑小心使用 SED 命令，它可能会在质量评分不同的情况下删除您的一些阅读

bioinformatics - 基于变异和人类参考构建DNA序列

3 回答 3

Related

Reference