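Questions tagged "vcftools" - Stack Overflow中文网

0 votes

2 answers

1067 views

python - Slow Python code for parsing a VCF file and inserting into a database

I have the following code for parsing a VCF (Variant Call Format) file:

Python code:

This is the sample file I pass to the script:

Sample VCF file:

Output in my Postgres table, sampletable:

My Python code runs slowly. It inserts about 1,000 records in 5 minutes, and I have more than 5 million records.

I am looking for help optimizing the Python code so that it inserts faster. Please advise.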
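Since the original script is not shown, the following is only a minimal sketch of one common way to speed up bulk loading: send rows to Postgres in large batches instead of one INSERT per record. It assumes a psycopg2 cursor (which provides `copy_from`) and a table whose columns are named `chrom`, `pos`, `id`, `ref`, `alt`; those column names, the helper names, and the batch size are all hypothetical.

```python
import io


def vcf_rows(lines):
    """Yield (chrom, pos, id, ref, alt) tuples from VCF data lines."""
    for line in lines:
        # Skip blank lines and header lines ("##meta" and "#CHROM").
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, vid, ref, alt = line.rstrip("\n").split("\t")[:5]
        yield chrom, int(pos), vid, ref, alt


def _flush(cursor, batch, table, columns):
    """Send one batch of tab-separated rows through COPY."""
    buf = io.StringIO("\n".join(batch) + "\n")
    # Assumption: cursor is a psycopg2 cursor, which exposes copy_from().
    cursor.copy_from(buf, table, sep="\t", columns=columns)
    return len(batch)


def copy_rows(cursor, rows, table="sampletable", batch_size=50_000):
    """Load rows in batches; returns the number of rows sent."""
    columns = ("chrom", "pos", "id", "ref", "alt")  # hypothetical column names
    total = 0
    batch = []
    for row in rows:
        batch.append("\t".join(map(str, row)))
        if len(batch) >= batch_size:
            total += _flush(cursor, batch, table, columns)
            batch = []
    if batch:
        total += _flush(cursor, batch, table, columns)
    return total
```

With a real connection this might be called as `copy_rows(conn.cursor(), vcf_rows(open("sample.vcf")))` followed by a single `conn.commit()`. Committing once at the end, rather than after every row, is usually a large part of the speedup.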

2019-12-05T17:32:41.473

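0 votes

0 answers

29 views

apache-kafka - Apache Kafka and NiFi for processing VCF files

I am trying to parse VCF files using Apache Kafka and NiFi. The VCF files are available on the file system, and I have to fetch and parse them to extract various attributes. Is it advisable to use Apache Kafka and/or NiFi for this?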

apache-kafka apache-nifi vcftools

2020-03-24T06:42:56.567

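0 votes

0 answers

283 views

vcf-variant-call-format - How to filter a VCF file using a list of CHR or contig IDs?

I need to subset/filter a SNP VCF file by a long list of non-consecutive contig IDs that appear in the CHR column. My VCF file currently contains 13,971 contigs, and I want to keep a specific set of 7,748 contigs along with everything associated with them (all variants, genotype information, etc.).

My contig list looks like this:

dDocent_Contig_1

dDocent_Contig_100

dDocent_Contig_10000 etc.

I am considering the following command: vcftools --vcf TotalRawSNPs.vcf --chr dDocent_Contig_1 --chr dDocent_Contig_100 (etc...) --recode --recode-INFO-all --out FinalRawSNPs

I have previously listed each contig ID individually with the --chr flag. As far as I can tell, I cannot give the --chr flag a text file of contig IDs to keep, which would be ideal. Listing every contig individually would make for a huge command line.

I have seen options for filtering by a list of individuals, but no clear option for filtering by CHR/contig ID alone. Is there a more efficient way to filter my VCF file by CHR/contig?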
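One possible workaround, sketched here in Python rather than with vcftools itself, is to stream the uncompressed VCF and keep only the data lines whose first column is in the contig list. The function names and the file names in the usage note are hypothetical; the sketch keeps all header lines unchanged, so `##contig` entries for dropped contigs remain in the output.

```python
def read_contig_list(lines):
    """Return the set of contig IDs, one per non-empty line."""
    return {line.strip() for line in lines if line.strip()}


def filter_vcf_by_contigs(vcf_lines, keep_contigs):
    """Yield header lines plus data lines whose CHROM is in keep_contigs."""
    keep = set(keep_contigs)
    for line in vcf_lines:
        # Header lines start with "#"; data lines start with the CHROM value.
        if line.startswith("#") or line.split("\t", 1)[0] in keep:
            yield line
```

For example, assuming a list file named `contigs_to_keep.txt`:

```python
with open("contigs_to_keep.txt") as ids, \
        open("TotalRawSNPs.vcf") as src, \
        open("FinalRawSNPs.vcf", "w") as dst:
    dst.writelines(filter_vcf_by_contigs(src, read_contig_list(ids)))
```

A set lookup per line keeps the cost independent of how many contigs are in the list.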

vcf-variant-call-format vcftools

2020-06-14T01:00:38.700

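0 votes

1 answer

45 views

compiler-errors - bcftools make fails on Windows 10 with many vcfmerge errors

I want to compile the latest bcftools from GitHub, but I get the errors below.

The [installation instructions] https://raw.githubusercontent.com/samtools/bcftools/develop/INSTALL mention:

In addition, I did the following:

BCFTOOLS_PLUGINS=/path/to/bcftools/plugins (added to the Makefile) and these third-party libraries (downloaded and added to the path):
zlib, gsl, and libperl

Errors when running the make command: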

compiler-errors vcftools bcftools

2020-08-31T15:02:46.547

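0 votes

1 answer

1281 views

bioinformatics - Merging multiple VCF files into one large VCF file

I have a list of VCF files from specific ethnic groups, such as American Indian, Chinese, European, etc.

Under each ethnic group I have more than 100 files.

Currently, I have computed VARIANT QC metrics such as call_rate, n_het, etc. for one file, as shown in the Hail tutorial (see the image below).

Image here

However, I would now like to create one file per ethnic group and then compute the VARIANT_QC metrics.

I have already looked at this post and this post, but I don't think they solve my problem.

How can I do this across all the files under a specific ethnic group?

Can you help me with this?

Is there a Hail/Python/R/other-tools way to do this?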
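One option that might fit is to group the files by ethnic group and build one `bcftools merge` command per group. The sketch below only constructs the commands and does not run them. It assumes that each group has its own directory, that the files are bgzip-compressed and indexed, and that each file holds different samples (which is the case `bcftools merge` is meant for); the directory layout, helper names, and output names are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_parent(paths):
    """Group file paths by the name of their parent directory."""
    groups = defaultdict(list)
    for p in paths:
        groups[PurePosixPath(p).parent.name].append(str(p))
    return {g: sorted(files) for g, files in sorted(groups.items())}


def merge_command(group, list_file):
    """Build a bcftools merge command for one group.

    list_file is a text file with one VCF path per line.
    """
    return [
        "bcftools", "merge",
        "--file-list", list_file,
        "--output-type", "z",              # compressed VCF output
        "--output", f"{group}.merged.vcf.gz",
    ]
```

Each group's sorted paths would be written to its list file, and the command could then be run with `subprocess.run(cmd, check=True)`. The merged per-group file can afterwards be imported into Hail for the QC step.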

bioinformatics vcftools bcftools hail vcf-variant-call-format

2020-09-08T13:53:15.170

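0 votes

1 answer

51 views

linux - How to know whether a Unix command executed successfully

I recently tried a bcftools command on the command line (to process variant files).

The command I tried was as follows:

The command ran, but the merge took a very long time, so I left the system running overnight.

However, when I woke up, I saw that the system had shut down because the battery had run out.

How can I tell whether my last command in a Unix/Linux terminal succeeded?

The command shown above is just an example. You could even explain it to me using a simple gzip operation.

Although I do see final.vcf.gz, I cannot say the command succeeded, because this file starts being generated as soon as the command begins running. So I cannot rely on that.

Can anyone help?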
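A common pattern for this situation, sketched here in Python under the assumption that the command writes its result to standard output, is to check the exit status and to write to a temporary name that is renamed only after a zero exit status. The `.partial` suffix and the function name are hypothetical choices.

```python
import os
import subprocess


def run_to_file(cmd, final_path):
    """Run cmd, capturing stdout; publish final_path only on exit status 0."""
    tmp_path = final_path + ".partial"
    with open(tmp_path, "wb") as out:
        result = subprocess.run(cmd, stdout=out)
    if result.returncode == 0:
        # The rename happens only after the command has finished successfully.
        os.replace(tmp_path, final_path)
        return True
    os.remove(tmp_path)
    return False
```

For a simple gzip operation, a call might look like `run_to_file(["gzip", "-c", "final.vcf"], "final.vcf.gz")`. If the machine loses power mid-run, none of the code after `subprocess.run` executes, so only the `.partial` file is left behind and the final name never appears. In an interactive shell, the equivalent check is the exit status of the last command, available as `$?`.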

linux shell unix vcftools bcftools

2020-09-10T00:10:16.343

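0 votes

1 answer

112 views

bash - Loop within a loop with vcftools in bash

I am trying to use the vcftools package to calculate Weir and Cockerham's Fst. I want to loop over two pairs of populations first, and then loop over all the variants from the 1000 Genomes Project for those populations; each chromosome is in a separate VCF file. For example, for pop1 vs pop2 and for pop3 vs pop4, calculate Fst for chromosomes 1-10. Each population file, e.g. LWKfile, contains the list of individuals belonging to that population.

I tried:

However, this does not loop over all the files and seems to get stuck on chromosome 10. Is there a more efficient way to do this in bash? I am worried that a loop within a loop will be too slow.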
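Because the attempted script is not shown, the following is only a sketch of the nested iteration itself, written in Python instead of bash. It builds one vcftools command per (population pair, chromosome) combination using the `--gzvcf`, `--weir-fst-pop`, and `--out` options. The VCF file name pattern and the output naming are hypothetical, and the commands are only built, not run.

```python
from itertools import product


def fst_commands(pop_pairs, chromosomes, vcf_pattern="chr{chrom}.vcf.gz"):
    """Build one vcftools Fst command per (population pair, chromosome)."""
    commands = []
    # product() replaces the loop-within-a-loop: every pair meets every chromosome.
    for (pop_a, pop_b), chrom in product(pop_pairs, chromosomes):
        commands.append([
            "vcftools",
            "--gzvcf", vcf_pattern.format(chrom=chrom),
            "--weir-fst-pop", pop_a,
            "--weir-fst-pop", pop_b,
            "--out", f"{pop_a}_vs_{pop_b}_chr{chrom}",
        ])
    return commands
```

For two pairs and chromosomes 1 to 10, `fst_commands([("pop1", "pop2"), ("pop3", "pop4")], range(1, 11))` yields 20 commands. The nesting adds negligible overhead; the run time is dominated by vcftools reading each VCF file, and the independent commands could be run in parallel.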

bash vcf-variant-call-format vcftools

2020-12-31T12:56:46.980

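0 votes

1 answer

121 views

vcftools - How to filter by read depth with vcftools?

I am trying to build a workflow to analyze my scRNA-seq data. I am using GATK together with samtools, vcftools, and bcftools. I want to filter my .vcf file so that it removes all entries with fewer than 10 reads. It looks like vcftools can be used for this. My code is:

vcftools --vcf "$fn" --out "$fn"_dp10 --minDP 10 --recode --recode-INFO-all

However, it does not filter anything out. Any ideas?

Kind regards, Kora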
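A likely explanation, going by the vcftools manual, is that `--minDP` acts on the per-genotype DP value in the FORMAT column and marks failing genotypes as missing rather than removing whole sites. If the goal is to drop sites by their total depth, one alternative is to filter on the `DP` key of the INFO column. The sketch below does that in plain Python and assumes the file carries an INFO `DP` entry; sites without one are dropped, which is a choice made for this example.

```python
def info_depth(info_field):
    """Return the integer DP value from an INFO string, or None if absent."""
    for item in info_field.split(";"):
        if item.startswith("DP="):
            return int(item[3:])
    return None


def filter_min_info_dp(vcf_lines, min_dp=10):
    """Yield header lines and data lines whose INFO DP is at least min_dp."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        depth = info_depth(fields[7])  # INFO is the 8th VCF column
        # Sites with no INFO DP are dropped (an assumption of this sketch).
        if depth is not None and depth >= min_dp:
            yield line
```

Within vcftools itself, `--min-meanDP` filters sites by mean depth across individuals, which may be closer to the intended behavior than `--minDP`.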

vcftools

2021-01-25T08:31:33.207

0 votes

1 answer

260 views

c++ - How to extract genotype information for each sample as a string from a VCF file using htslib?

I am using htslib for extracting all the information contained in a VCF file in C++.

Currently, thanks to the VCF specification and the documentation in the file vcf.h, I have successfully extracted all the metadata information in the header (Meta-Information Lines), and most of the information contained in each row of the body of the file (Data Lines).

However, I don't know how to extract the genotype information (sample columns).

I am using example files from the 1000G project. Here is an example of two rows of the file, showing the FORMAT field and two samples (the file has more than 1000 samples per row, and I would like to extract the data for all of them):

I know that this is a heavy task that will take some computation time. I have extracted the names of each column (HG00096, HG00077...), but I don't know how to extract the information of each sample, whether as a full string (e.g., "0|0:0.050:-0.48,-0.48,-0.48"), as a set (array, map, vector...) of key-value pairs (e.g., [("GT", "0|0"), ("DS", "0.050"), ("GL", "-0.48,-0.48,-0.48")]), or simply as an array of values (e.g., ["0|0", "0.050", "-0.48,-0.48,-0.48"]). I would like to do this for each sample.

I have been reading the documentation in the vcf.h file and I think that the function bcf_get_genotypes(hdr,line,dst,ndst) may be suitable for this, but I am not sure how to use it to extract the values as strings. I also think that this information may be stored behind the 'p' pointer of 'bcf_fmt_t', but I am not certain: it just contains an array of uint8_t values, and I don't know whether a string (or char array) can be extracted from it in the way I want.

Is there a way to do what I am trying to do?
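To make the target data shape concrete, here is a small sketch that produces the key-value form described above from one text VCF data line. It is plain-text parsing in Python, not htslib and not C++, so it illustrates the desired output rather than the htslib API; the function name is hypothetical.

```python
def sample_fields(data_line, sample_names):
    """Map each sample name to a dict of FORMAT key -> value string."""
    fields = data_line.rstrip("\n").split("\t")
    keys = fields[8].split(":")  # FORMAT is the 9th column, e.g. "GT:DS:GL"
    # Sample columns start at the 10th column, in header order.
    return {
        name: dict(zip(keys, value.split(":")))
        for name, value in zip(sample_names, fields[9:])
    }
```

The full per-sample string is simply the raw column value, and the array form is `value.split(":")`. Using `zip` also tolerates samples whose trailing FORMAT fields are omitted, which the VCF specification permits.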

c++ vcf-vcard vcf-variant-call-format vcftools htslib

2021-02-06T23:36:26.483

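0 votes

1 answer

43 views

r - Genetic differentiation in R, the genetic_diff() function

I am trying to run the function genetic_diff(): myDiff <- genetic_diff(vcf, pops = pop, method = 'nei')

But I get the following message: "if(class(x) != "vcfR"){ stop(paste("Expecting an object of class vcfR, instead received", class(vcf)))"

What can I do?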

r vcftools

2021-03-11T16:13:19.233
