python - Concurrent DNS queries (Python3 concurrent.futures) consume too much RAM (40GB+)

Question

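I have a list of 30 million strings, and I want to run a DNS query against each of them using Python. I do not understand how this operation can use so much memory. I assumed the threads would exit once their jobs finished, and there is also a 1-second timeout ({'dns_request_timeout': 1}).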

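Here is a snapshot of the machine's resources while the script was running (screenshot not included):

My code is as follows: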

# -*- coding: utf-8 -*-
import dns.resolver
import concurrent.futures
from pprint import pprint
import json


bucket = json.load(open('30_million_strings.json','r'))


def _dns_query(target, **kwargs):
    global bucket
    resolv = dns.resolver.Resolver()
    resolv.timeout = kwargs['function']['dns_request_timeout']
    try:
        resolv.query(target + '.com', kwargs['function']['query_type'])
        with open('out.txt', 'a') as f:
            f.write(target + '\n')
    except Exception:
        pass


def run(**kwargs):
    global bucket
    temp_locals = locals()
    pprint({k: v for k, v in temp_locals.items()})

    with concurrent.futures.ThreadPoolExecutor(max_workers=kwargs['concurrency']['threads']) as executor:
        future_to_element = dict()

        for element in bucket:
            future = executor.submit(kwargs['function']['name'], element, **kwargs)
            future_to_element[future] = element

        for future in concurrent.futures.as_completed(future_to_element):
            result = future_to_element[future]


run(function={'name': _dns_query, 'dns_request_timeout': 1, 'query_type': 'MX'},
    concurrency={'threads': 15})

score 0 · Accepted Answer

Try this:

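import json
import concurrent.futures
import dns.resolver


def sure_ok(future):
    # future.result() re-raises any exception from the query,
    # so failed lookups are skipped silently.
    try:
        with open('out.txt', 'a') as f:
            f.write(str(future.result()[0]) + '\n')
    except Exception:
        pass


with concurrent.futures.ThreadPoolExecutor(max_workers=2500) as executor:
    for element in json.load(open('30_million_strings.json', 'r')):
        resolv = dns.resolver.Resolver()
        resolv.timeout = 1
        future = executor.submit(resolv.query, element + '.com', 'MX')
        # No reference to the future is kept; the callback handles the result.
        future.add_done_callback(sure_ok)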

Remove global bucket: it is redundant, because the functions only read bucket and never rebind it.

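Remove the dictionary that holds references to the 30+ million futures; it is redundant too, and every entry keeps its future alive until the whole run ends.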
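To get a feel for what those references cost, the following is a rough measurement sketch, not a precise accounting. It submits a small batch of tasks to a pool whose single worker is deliberately blocked, keeps the futures in a dictionary as the question's code does, and uses tracemalloc to estimate the memory held per pending task. The function name measure_pending_cost and the batch size are made up for this illustration, and the figure will vary with the Python version and platform.

```python
import concurrent.futures
import threading
import tracemalloc


def measure_pending_cost(n_tasks=10_000):
    """Return the approximate bytes allocated per submitted, unfinished task."""
    gate = threading.Event()  # keeps the single worker busy so tasks pile up

    def blocked_task(_):
        gate.wait()

    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        # Same pattern as future_to_element in the question: one dict entry
        # per future, all of them alive at the same time.
        futures = {executor.submit(blocked_task, i): i for i in range(n_tasks)}
        after, _ = tracemalloc.get_traced_memory()
        gate.set()  # release the worker so the pool can drain and shut down
    tracemalloc.stop()
    return (after - before) / n_tasks


if __name__ == "__main__":
    per_task = measure_pending_cost()
    print(f"~{per_task:.0f} bytes per pending task")
    print(f"~{per_task * 30_000_000 / 1e9:.1f} GB extrapolated to 30 million tasks")
```

The extrapolation is only linear guesswork, but it shows why holding every future at once scales badly with 30 million inputs.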

You may also not be using a recent enough version of concurrent.futures:

https://github.com/python/cpython/commit/5cbca0235b8da07c9454bcaa94f12d59c2df0ad2
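One further point, going beyond what the answer states: executor.submit never blocks, because ThreadPoolExecutor queues work in an unbounded queue, so a loop over 30 million items still creates 30 million pending work items even when no dictionary is kept. A minimal sketch of one way to cap that, assuming a semaphore-based limit is acceptable, is shown below. The names run_bounded, max_in_flight and fake_lookup are hypothetical; fake_lookup merely stands in for a real DNS call so that the example runs without network access.

```python
import concurrent.futures
import threading


def run_bounded(func, items, max_workers=15, max_in_flight=1000):
    """Call func(item) for every item with at most max_in_flight tasks pending.

    Returns the number of calls that finished without raising.
    """
    slots = threading.BoundedSemaphore(max_in_flight)
    lock = threading.Lock()
    succeeded = 0

    def on_done(future):
        nonlocal succeeded
        try:
            if future.exception() is None:
                with lock:
                    succeeded += 1
        finally:
            slots.release()  # free a slot so the submitting loop can continue

    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        for item in items:  # items may be a lazy iterator
            slots.acquire()  # blocks while max_in_flight tasks are pending
            executor.submit(func, item).add_done_callback(on_done)
    return succeeded


def fake_lookup(name):
    # Stand-in for a DNS query: names ending in "0" "fail to resolve".
    if name.endswith("0"):
        raise LookupError(name)
    return name + ".com"


if __name__ == "__main__":
    names = (f"host{i}" for i in range(100))
    print(run_bounded(fake_lookup, names, max_workers=4, max_in_flight=8))
```

For the real workload, func would be a small wrapper around the resolver call from the answer, and items could be a generator, so that neither the inputs nor the futures all have to be in memory at once.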
