python - Getting certain words and phrases from a text file in Python

Question

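I have this code that goes through a text file, reads it line by line, and splits each line into individual words. That all works, but my text file contains certain words and phrases that start and end with '-', for example "-foo-" or "-foo bar-". Right now the code splits the latter into "-foo" and "bar-". I understand why this happens.

My plan is to grab the instances that start and end with '-', store them in a separate list, have the user change each of those phrases to something new, and then put them back into the list. How do I tell the code to grab a whole phrase when it consists of two separate words?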

def madLibIt(text_file):
    listOfWords = []  # collects every word found in the file
    for eachLine in text_file:  # go through each line and split it into
        # separate words
        listOfWords.extend(eachLine.split())
    print(listOfWords)

score 2 · Accepted Answer

Calling str.split() without a separator splits the text on whitespace, so '-' is never treated as a delimiter.
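A quick check of this behaviour (the sample sentence is made up for illustration):

```python
line = 'This is a -foo bar- example'
words = line.split()  # no separator given, so the split happens on whitespace
print(words)  # ['This', 'is', 'a', '-foo', 'bar-', 'example']
```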

You can use re.findall() with the pattern (-.+?-):

import re

matches = re.findall(r'(-.+?-)', 'This is a -string- with a -foo bar-')
print(matches)  # ['-string-', '-foo bar-']
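To go from finding the phrases to swapping them out, one possible sketch uses re.sub() with a callback. The helper name fill_placeholders and the replacements dictionary are assumptions made for illustration; in the real program the new values would come from the user (for example via input()).

```python
import re

# Same idea as the pattern above, but the group captures only the text
# between the dashes.
PLACEHOLDER = re.compile(r'-(.+?)-')

def fill_placeholders(text, replacements):
    """Replace each -phrase- with replacements[phrase].

    Phrases missing from the dictionary are left unchanged.
    (Hypothetical helper, not part of the original question.)
    """
    def swap(match):
        return replacements.get(match.group(1), match.group(0))
    return PLACEHOLDER.sub(swap, text)

text = 'This is a -string- with a -foo bar-'
found = PLACEHOLDER.findall(text)
print(found)  # ['string', 'foo bar']

result = fill_placeholders(text, {'string': 'sentence', 'foo bar': 'surprise'})
print(result)  # This is a sentence with a surprise
```

Note that this pattern treats any pair of hyphens as delimiters, so ordinary hyphenated text such as "well-known -foo-" would produce the match "-known -". If the input can contain such text, a stricter pattern is probably needed.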

score 1 · Accepted Answer

This regular expression captures both the dash-delimited phrases and the ordinary words.

import re

s = 'This is a string with -parts like this- and -normal- parts -as well-'

print(re.findall(r'((?:-\w[\w\s]*\w-)|(?:\b\w+\b))', s))

# Output:
# ['This', 'is', 'a', 'string', 'with', '-parts like this-', 'and', '-normal-', 'parts', '-as well-']
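Applied to the line-by-line loop from the question, the same pattern might be used as follows. This is a sketch only: the function name tokenize is hypothetical, and io.StringIO stands in for an opened text file so the example is self-contained.

```python
import io
import re

# Dash-delimited phrase first, plain word second (same alternatives as above,
# written without the capturing group so findall returns whole matches).
TOKEN = re.compile(r'-\w[\w\s]*\w-|\b\w+\b')

def tokenize(text_file):
    """Collect words and -dash phrases- from every line of a file-like object."""
    tokens = []
    for line in text_file:
        tokens.extend(TOKEN.findall(line))
    return tokens

# io.StringIO is an assumption standing in for open('some_file.txt').
sample = io.StringIO('The -adjective- fox\njumped over -a noun-\n')
tokens = tokenize(sample)
print(tokens)  # ['The', '-adjective-', 'fox', 'jumped', 'over', '-a noun-']
```

Two limitations are worth keeping in mind: punctuation is dropped from the result, and the first alternative needs at least two word characters between the dashes, so "-a-" would come back as the plain word "a".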
