python - Efficient way to iterate over a large amount of data in Python

Question

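I'm attempting a dictionary attack on a sha512 hash. I know the hashed plaintext consists of two words, all lowercase, separated by a space. The words come from a known dictionary (02-dictionary.txt), which contains 172,820 words. Currently, my code is as follows: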

import hashlib
import sys
import time

def crack_hash(word, target):
    dict_hash = hashlib.sha512(word.encode())
    if dict_hash.hexdigest() == target:
        return (True, word)
    else:
        return (False, None)

if __name__ == "__main__":
    target_hash = sys.argv[1].strip()
    
    fp = open("02-dictionary.txt", "r")

    words = []
    start_time = time.time()
    for word in fp:
        words.append(word)
    fp.close()

    for word1 in words:
        for word2 in words:
            big_word = word1.strip() + " " + word2.strip()
            print(big_word)
            soln_found, soln_word = crack_hash(big_word.strip(), target_hash)
            if soln_found:
                print('Solution found')
                print("The word was:", soln_word)
                break

    end_time = time.time()
    total_time = end_time - start_time
    print("Time taken:", round(total_time, 5), "seconds")

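However, when I run this code, the program is extremely slow. I know Python isn't the most efficient language, but I suspect the problem stems more from my choice of data structure. Is there a more efficient one? I tried the array module, but the documentation makes it look like it is designed for more primitive types (ints, floats, shorts, bools, chars, etc.) rather than more complex types like strings (or lists of characters). What is the best way to improve this code? In about an hour of runtime, I only got through roughly 1% of all possible word combinations.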

score 1 · Accepted Answer

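The problem is that you are computing 172,820² ≈ 2.99 × 10¹⁰ (roughly 2³⁵) hashes. That is a lot of work. I made some changes to get a few optimizations in pure Python, but I suspect the overhead of the hashlib calls is still substantial. I think doing all of this in native code would give a much more significant speedup.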
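As a quick check of the size of the search space, here is a minimal sketch that counts the candidate phrases. It assumes the 172,820-word dictionary size stated in the question and that every ordered pair of words, including a word paired with itself, is a valid candidate.

```python
import math

DICTIONARY_SIZE = 172_820  # word count stated in the question

# Every ordered pair (word1, word2) is a candidate, including word1 == word2.
candidate_pairs = DICTIONARY_SIZE ** 2

print(f"{candidate_pairs:,} candidate phrases")
print(f"about 2^{math.log2(candidate_pairs):.1f}")
```

Each candidate costs one SHA-512 computation, so any per-hash overhead is multiplied by this count.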

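The optimizations are:

Precompute the dictionary words as bytes objects.
Precompute the partial hash state for the first part of the phrase (the first word plus the space).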
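The second optimization works because hashlib objects can be fed data incrementally with update() and duplicated with copy(). A minimal sketch of that idea in isolation, where the two sample words are hypothetical and chosen only for illustration:

```python
import hashlib

# Hypothetical sample words, used only for illustration.
first, second = b"hello", b"world"

# Hash the shared prefix (first word plus the separating space) once.
prefix_state = hashlib.sha512(first + b" ")

# Copy the partially fed state, then feed in the second word.
candidate = prefix_state.copy()
candidate.update(second)

# The result matches hashing the whole phrase in one go.
direct = hashlib.sha512(first + b" " + second)
assert candidate.digest() == direct.digest()

# The original state is untouched, so it can be reused for the next second word.
assert prefix_state.digest() == hashlib.sha512(first + b" ").digest()
```

Copying the state means the first word and the space are hashed once per outer iteration instead of once per pair.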

import hashlib
import sys
import time


def try_all(words, target_hash):
    for word1 in words:
        hash_prefix = hashlib.sha512(word1 + b' ')
        for word2 in words:
            prefix_copy = hash_prefix.copy()
            prefix_copy.update(word2)
            # print(big_word)
            if prefix_copy.digest() == target_hash:
                print('Solution found')
                big_word = (word1 + b' ' + word2).decode('utf8')
                print(f'The word was: {big_word}')
                return


def read_all_words(filename):
    with open(filename, "rt") as f:
        return [line.strip().encode('utf-8') for line in f]


def get_test_hash(words):
    phrase = words[-2] + b' ' + words[-1]  # pick target towards end
    return hashlib.sha512(phrase).digest()


if __name__ == "__main__":
    words = read_all_words("02-dictionary.txt")
    TESTING = True
    if TESTING:
        words = words[:5000]  # reduce the size of the word list for testing only
        target_hash = get_test_hash(words)
    else:
        target_hash = bytes.fromhex(sys.argv[1].strip())
    start_time = time.time()
    try_all(words, target_hash)
    end_time = time.time()
    total_time = end_time - start_time
    print(f"Time taken: {round(total_time, 5)} seconds")
    print(f'{total_time / pow(len(words), 2)} seconds per hash')

On my laptop this runs at about 1.1 × 10⁻⁶ seconds per hash, so trying all the word pairs in the dictionary would take a little under 10 hours of CPU time.
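The estimate can be reproduced with a few lines of arithmetic. This sketch assumes the per-hash time quoted above (which will differ on other machines) and the 172,820-word dictionary size from the question, and it models the worst case where the match is the last pair tried.

```python
SECONDS_PER_HASH = 1.1e-6   # figure quoted in the answer; machine dependent
DICTIONARY_SIZE = 172_820   # word count stated in the question

# Worst case: every ordered pair of words is hashed before the match is found.
total_seconds = SECONDS_PER_HASH * DICTIONARY_SIZE ** 2
total_hours = total_seconds / 3600

print(f"worst case: about {total_hours:.1f} hours")
```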
