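Regex is an appropriate tool in this case.
To find "cat" and ".cat" but not "catalogue":
Pattern: r'\bcat\b'
\b matches at a word boundary, that is, between a word character and a non-word character (or the start or end of the string).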
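A quick way to check the word-boundary behavior is the minimal sketch below. The sample strings are made up for illustration and are not from the question.

```python
import re

pattern = re.compile(r'\bcat\b')

# Hypothetical sample strings, chosen only to illustrate word boundaries.
samples = ["cat", "a cat.", ".cat", "catalogue", "concat"]

# Keep only the strings in which "cat" occurs as a whole word.
matches = [s for s in samples if pattern.search(s)]
print(matches)
```

"catalogue" and "concat" are skipped because the letters around "cat" are word characters, so there is no boundary on that side.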
To let the user search all texts for two words at the same time ("cat" OR "dog"):
Pattern: r'\bcat\b|\bdog\b'
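The alternation can be tried out on a single string first. This is a small sketch with an invented sample sentence; the grouped form r'\b(?:cat|dog)\b' is shown only as an equivalent spelling, not as something the answer requires.

```python
import re

pattern = re.compile(r'\bcat\b|\bdog\b')
# Equivalent spelling: one pair of boundaries around a non-capturing group.
grouped = re.compile(r'\b(?:cat|dog)\b')

# Hypothetical sample text for illustration.
text = "The cat chased the dog; the dogma catalogue was ignored."

found = pattern.findall(text)
found_grouped = grouped.findall(text)
print(found)
```

Both patterns report the whole words only; "dogma" and "catalogue" do not match.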
To print "filename: <words that are found in it>":
#!/usr/bin/env python3
import os
import re
import sys


def fgrep(words, filenames, encoding='utf-8', case_insensitive=False):
    # Build one alternation of whole-word patterns, e.g. r"\bcat\b|\bdog\b".
    findwords = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words),
                           flags=re.I if case_insensitive else 0).findall
    for name in filenames:
        with open(name, 'rb') as file:
            text = file.read().decode(encoding)
        found_words = set(findwords(text))
        yield name, found_words


def main():
    words = sys.argv[1].split(",")  # comma-separated words to search for
    filenames = sys.argv[2:]  # the rest is filenames
    for filename, found_words in fgrep(words, filenames):
        print("%s: %s" % (os.path.basename(filename), ",".join(found_words)))


if __name__ == "__main__":
    main()
Example:
$ python findwords.py 'cat,dog' /path/to/*.txt
Alternatives
To avoid reading the whole file into memory:
import codecs
...
with codecs.open(name, encoding=encoding) as file:
    found_words = set(w for line in file for w in findwords(line))
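For reference, here is how the line-by-line fragment might fit into a complete generator. This is a sketch under assumptions: the name fgrep_lines is hypothetical, it assumes Python 3 and uses the built-in open() with an encoding argument in place of codecs.open, and the demo file contents are invented.

```python
import os
import re
import tempfile


def fgrep_lines(words, filenames, encoding='utf-8', case_insensitive=False):
    """Yield (filename, set of found words), reading each file line by line."""
    findwords = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words),
                           flags=re.I if case_insensitive else 0).findall
    for name in filenames:
        with open(name, encoding=encoding) as file:
            # Iterating over the file yields one decoded line at a time,
            # so the whole file is never held in memory at once.
            found_words = set(w for line in file for w in findwords(line))
        yield name, found_words


# Demo on a throwaway file (hypothetical contents).
with tempfile.NamedTemporaryFile('w', suffix='.txt', encoding='utf-8',
                                 delete=False) as tmp:
    tmp.write("A cat sat.\nThe catalogue lists one dog.\n")
path = tmp.name
try:
    result = dict(fgrep_lines(["cat", "dog"], [path]))
finally:
    os.remove(path)
print(result[path])
```

Because the search words cannot contain a newline, scanning line by line finds the same words as scanning the whole text.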
You could also print the found words in context, e.g., print the lines with the words highlighted:
from colorama import init  # pip install colorama
init(strip=not sys.stdout.isatty())  # strip colors if stdout is redirected
from termcolor import colored  # pip install termcolor
highlight = lambda s: colored(s, on_color='on_red', attrs=['bold', 'reverse'])
...
regex = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words),
                   flags=re.I if case_insensitive else 0)
for line in file:
    if regex.search(line):  # line contains words
        line = regex.sub(lambda m: highlight(m.group()), line)
        yield line
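The same substitution idea can be tried without third-party packages. The sketch below swaps in raw ANSI escape codes for colorama and termcolor, which is a substitute technique and assumes an ANSI-capable terminal; the helper name highlight_lines and the sample lines are hypothetical.

```python
import re

# ANSI escape sequences: bold + reverse video on, then reset.
# Assumption: the output terminal understands ANSI codes.
BOLD_REVERSE = "\033[1;7m"
RESET = "\033[0m"


def highlight(s):
    """Wrap s in ANSI codes so it stands out on the terminal."""
    return BOLD_REVERSE + s + RESET


def highlight_lines(words, lines, case_insensitive=False):
    """Yield only the lines that contain a word, with each match highlighted."""
    regex = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words),
                       flags=re.I if case_insensitive else 0)
    for line in lines:
        if regex.search(line):  # line contains at least one of the words
            yield regex.sub(lambda m: highlight(m.group()), line)


# Hypothetical input lines for illustration.
lines = ["a cat and a dog\n", "the catalogue\n"]
out = list(highlight_lines(["cat", "dog"], lines))
for line in out:
    print(line, end="")
```

Unlike the colorama version, this sketch does not strip the codes when stdout is redirected to a file.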