Disclosure: I work full-time on the Hail project.
Hi! Welcome to programming and bioinformatics!
amirouche correctly identifies that you need some kind of "streaming" or "line-by-line" algorithm to handle data that is too large to fit in your machine's RAM. Unfortunately, if you are limited to Python without libraries, you have to manually chunk the file and handle the VCF parsing yourself.
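For concreteness, here is a minimal sketch of that streaming approach in pure Python. It is an illustration under simplifying assumptions (a gzip-compressed VCF, GT as the first FORMAT field), not production-grade VCF parsing:

import gzip

# Stream the VCF one line at a time so only a single record is ever in RAM.
n_called, n_hom, n_het = {}, {}, {}
samples = []

with gzip.open("YOUR_FILE.vcf.gz", "rt") as vcf:
    for line in vcf:
        if line.startswith("##"):
            continue  # skip meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the 9 fixed VCF columns
            for s in samples:
                n_called[s] = n_hom[s] = n_het[s] = 0
            continue
        for s, entry in zip(samples, fields[9:]):
            gt = entry.split(":")[0].replace("|", "/")  # assumes GT comes first
            if "." in gt:
                continue  # missing genotype
            n_called[s] += 1
            if len(set(gt.split("/"))) == 1:
                n_hom[s] += 1
            else:
                n_het[s] += 1

for s in samples:
    if n_called[s]:
        print(s, n_hom[s] / n_called[s], n_het[s] / n_called[s])

Even then, you pay the cost of parsing every byte of the VCF on every pass, which is part of why the approach below is attractive.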
The Hail project is a free, open-source tool for scientists with genetic data ranging from too big to fit in RAM all the way up to too big to fit on one machine (i.e., tens of terabytes of compressed VCF data). Hail can take advantage of all the cores on one machine or all the cores in a cloud of machines. Hail runs on Mac OS X and most flavors of GNU/Linux. Hail exposes a statistical-genetics domain-specific language that makes your question much shorter to express.
The Simplest Answer
The most faithful translation of your Python code to Hail is:
/path/to/hail importvcf -f YOUR_FILE.vcf.gz \
annotatesamples expr -c \
'sa.nCalled = gs.filter(g => g.isCalled).count(),
sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
sa.nHet = gs.filter(g => g.isHet).count()' \
annotatesamples expr -c \
'sa.pHom = sa.nHom / sa.nCalled,
sa.pHet = sa.nHet / sa.nCalled' \
exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
I ran the above command on my dual-core laptop on a 2.0GB file:
# ls -alh profile225.vcf.bgz
-rw-r--r-- 1 dking 1594166068 2.0G Aug 25 15:43 profile225.vcf.bgz
# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz \
annotatesamples expr -c \
'sa.nCalled = gs.filter(g => g.isCalled).count(),
sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
sa.nHet = gs.filter(g => g.isHet).count()' \
annotatesamples expr -c \
'sa.pHom = sa.nHom / sa.nCalled,
sa.pHet = sa.nHet / sa.nCalled' \
exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:=======================================================> (63 + 2) / 65]hail: info: Coerced sorted dataset
hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),
sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 1:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.pHom = sa.nHom / sa.nCalled,
sa.pHet = sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: while importing:
file:/Users/dking/projects/hail-data/profile225.vcf.bgz import clean
hail: info: timing:
importvcf: 34.211s
annotatesamples expr: 6m52.4s
annotatesamples expr: 21.399ms
exportsamples: 121.786ms
total: 7m26.8s
# head sampleInfo.tsv
sample pHomRef pHet nHom nHet nCalled
HG00096 9.49219e-01 5.07815e-02 212325 11359 223684
HG00097 9.28419e-01 7.15807e-02 214035 16502 230537
HG00099 9.27182e-01 7.28184e-02 211619 16620 228239
HG00100 9.19605e-01 8.03948e-02 214554 18757 233311
HG00101 9.28714e-01 7.12865e-02 214283 16448 230731
HG00102 9.24274e-01 7.57260e-02 212095 17377 229472
HG00103 9.36543e-01 6.34566e-02 209944 14225 224169
HG00105 9.29944e-01 7.00564e-02 214153 16133 230286
HG00106 9.25831e-01 7.41687e-02 213805 17128 230933
Wow! Seven minutes for 2GB is too slow! Unfortunately, that's because VCF isn't a great format for data analysis!
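As an aside, the exported sampleInfo.tsv is easy to pull back into plain Python for downstream work. A minimal sketch, assuming the tab-separated layout shown in the head output above (the 0.075 cutoff is just an arbitrary example):

import csv

# Load the exported sample table; it is tab-separated with a header row.
with open("sampleInfo.tsv", newline="") as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

# For example, flag samples with relatively high heterozygosity.
for row in rows:
    if float(row["pHet"]) > 0.075:
        print(row["sample"], row["pHet"])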
Optimizing the Storage Format
Let's convert to Hail's optimized storage format, a VDS, and re-run the command:
# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz write -o profile225.vds
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:========================================================>(64 + 1) / 65]hail: info: Coerced sorted dataset
hail: info: running: write -o profile225.vds
[Stage 1:> (0 + 4) / 65]
[Stage 1:========================================================>(64 + 1) / 65]
# ../hail/build/install/hail/bin/hail read -i profile225.vds \
annotatesamples expr -c \
'sa.nCalled = gs.filter(g => g.isCalled).count(),
sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
sa.nHet = gs.filter(g => g.isHet).count()' \
annotatesamples expr -c \
'sa.pHom = sa.nHom / sa.nCalled,
sa.pHet = sa.nHet / sa.nCalled' \
exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: running: read -i profile225.vds
[Stage 1:> (0 + 0) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================> (3 + 1) / 4]hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),
sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 2:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.pHom = sa.nHom / sa.nCalled,
sa.pHet = sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: timing:
read: 2.969s
annotatesamples expr: 1m20.4s
annotatesamples expr: 21.868ms
exportsamples: 151.829ms
total: 1m23.5s
About five times faster (7m26.8s down to 1m23.5s)! As far as larger scale goes, running the same command on Google cloud, on a VDS representing the full VCF of the 1000 Genomes Project (2,535 whole genomes, roughly 315GB compressed), took 3m42s using 328 worker cores.
Using a Hail Built-in
Hail also has a sampleqc command that computes most of what you want (and more!):
../hail/build/install/hail/bin/hail read -i profile225.vds \
sampleqc \
annotatesamples expr -c \
'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
sa.myqc.pHet= sa.qc.nHet / sa.qc.nCalled' \
exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleInfo.tsv
hail: info: running: read -i profile225.vds
[Stage 0:> (0 + 0) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================> (3 + 1) / 4]hail: info: running: sampleqc
[Stage 2:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
sa.myqc.pHet= sa.qc.nHet / sa.qc.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleInfo.tsv
hail: info: timing:
read: 2.928s
sampleqc: 1m27.0s
annotatesamples expr: 229.653ms
exportsamples: 353.942ms
total: 1m30.5s
Installing Hail
Installing Hail is pretty easy, and we have docs to help you. Need more help? You can get real-time support in the Hail users chat room or, if you prefer forums, on the Hail discourse (both are linked from the home page; unfortunately I don't have enough reputation to create real links).
The Near Future
In the very near future (less than one month from today), the Hail team will complete a Python API, which will allow you to express the first snippet as:
result = (importvcf("YOUR_FILE.vcf.gz")
    .annotatesamples('''sa.nCalled = gs.filter(g => g.isCalled).count(),
                        sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
                        sa.nHet = gs.filter(g => g.isHet).count()''')
    .annotatesamples('''sa.pHom = sa.nHom / sa.nCalled,
                        sa.pHet = sa.nHet / sa.nCalled'''))

for x in result.sampleannotations:
    print("Sample " + str(x) +
          " nCalled " + str(x.nCalled) +
          " nHom " + str(x.nHom) +
          " nHet " + str(x.nHet) +
          " percent Hom " + str(x.pHom * 100) +
          " percent Het " + str(x.pHet * 100))

result.sampleannotations.write("sampleInfo.tsv")
EDIT: Added the output of head on the tsv file.
EDIT2: The latest version of Hail no longer requires bi-allelic variants for sampleqc.
EDIT3: Added a note about scaling to the cloud with hundreds of cores.