To find lines that contain any of the given keywords in Python, you could use a regular expression:

    import re

    def fgrep(words, lines):
        # note: allow a partial match e.g., 'b c' matches 'ab cd'
        # Python 3: the built-in filter is lazy (on Python 2 use itertools.ifilter)
        return filter(re.compile("|".join(map(re.escape, words))).search, lines)

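To illustrate the partial-match behavior mentioned in the comment, here is a minimal self-contained sketch for Python 3, where the built-in `filter` stands in for `ifilter`. The sample keywords and lines are made up for illustration and are not part of the original answer.

```python
import re

def fgrep(words, lines):
    """Lazily yield the lines that contain any of the given words."""
    # re.escape makes each keyword match literally; "|" joins them as alternatives
    pattern = re.compile("|".join(map(re.escape, words)))
    return filter(pattern.search, lines)

# hypothetical sample data
keywords = ["b c", "kiwi", "a.b"]
lines = ["ab cd\n", "apple\n", "kiwi fruit\n", "axb\n", "a.b\n"]

matched = list(fgrep(keywords, lines))
print(matched)
```

Because every keyword is escaped, `a.b` matches only the literal text `a.b` and not `axb`, while `b c` matches inside `ab cd`.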
To turn it into a command-line script:

    import sys

    def main():
        with open(sys.argv[1]) as kwfile:  # read keywords from given file
            # one keyword per line
            keywords = [line.strip() for line in kwfile if line.strip()]
        if not keywords:
            sys.exit("no keywords are given")

        if len(sys.argv) > 2:  # read lines to match from given file
            with open(sys.argv[2]) as file:
                sys.stdout.writelines(fgrep(keywords, file))
        else:  # read lines from stdin
            sys.stdout.writelines(fgrep(keywords, sys.stdin))

    main()

Example (the first argument is the keyword file, the second is the file to search):

    $ python fgrep.py a b > fruitfound.txt

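There are more efficient algorithms, e.g., the Aho-Corasick algorithm, but filtering millions of lines takes less than a second on my machine, so it might be good enough (grep is several times faster). Surprisingly, acora, which is based on the Aho-Corasick algorithm, is slower for the data I tried.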
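Timings like these depend heavily on the data and the machine, so it is worth measuring on your own input. The following is only a rough sketch of how one might do that: the helper names `make_lines` and `time_fgrep`, the random test data, and the keyword list are all assumptions made for illustration, and the numbers it prints say nothing about the results quoted above.

```python
import random
import re
import string
import time

def fgrep(words, lines):
    pattern = re.compile("|".join(map(re.escape, words)))
    return filter(pattern.search, lines)

def make_lines(n, length=40, seed=0):
    """Build n pseudo-random lines (hypothetical test data)."""
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + " "
    return ["".join(rng.choices(alphabet, k=length)) + "\n" for _ in range(n)]

def time_fgrep(keywords, lines):
    """Return (number of matching lines, elapsed seconds)."""
    start = time.perf_counter()
    count = sum(1 for _ in fgrep(keywords, lines))  # consume the lazy filter
    return count, time.perf_counter() - start

lines = make_lines(100_000)
keywords = ["apple", "kiwi", "fig"]  # made-up keywords
count, elapsed = time_fgrep(keywords, lines)
print(f"{count} of {len(lines)} lines matched in {elapsed:.3f}s")
```

To compare against another implementation, such as an Aho-Corasick library, the same harness could time a function with the same `(keywords, lines)` interface on identical input.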