python - How do I split a text file into its words in Python?

Question

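I'm very new to Python and have never worked with text before... I have 100 text files, each with about 100 to 150 lines of unstructured text describing a patient's condition. I read one file in Python with the following code: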

with open("C:\\...\\...\\...\\record-13.txt") as f:
    content = f.readlines()
    print (content)

Now I can split each line of this file into its words, for example:

a = content[0].split()
print (a)

But I don't know how to split the whole file into words. Would a loop (while or for) help with this?

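Thanks for your help, everyone. Your answers helped me write this (in my files the words are separated by spaces, so I think that is the delimiter!):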

with open("C:\\...\\...\\...\\record-13.txt") as f:
    lines = f.readlines()
    for line in lines:
        words = line.split()
        for word in words:
            print(word)

This simply splits the words line by line and prints one word per line.

score 9 · Accepted Answer

It depends on how you define words, or what you regard as the delimiters.
Note that str.split in Python takes an optional delimiter argument (sep), so you can pass one like this:

# content[0] is the first line; split it on whitespace, then on commas
for chunk in content[0].split():
    for word in chunk.split(','):
        print(word)

Unfortunately, str.split accepts only one delimiter at a time, so you may need multi-level splitting like this:

for lines in content[0].split():
    for split0 in lines.split(' '):
        for split1 in split0.split(','):
            for split2 in split1.split('.'):
                for split3 in split2.split('?'):
                    for split4 in split3.split('!'):
                        for word in split4.split(':'): 
                            if word != "":
                                print(word)

Looks ugly, right? Fortunately, we can use iteration instead:

delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
words = content
for delimiter in delimiters:
    new_words = []
    for word in words:
        new_words += word.split(delimiter)
    words = new_words

Edit: or we can simply use the regular expression module:

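import re
delimiters = ['\n', ' ', ',', '.', '?', '!', ':', 'and_what_else_you_need']
# Escape each delimiter so '.' and '?' are matched literally, and join the lines
# because re.split expects a single string, not the list returned by readlines().
pattern = '|'.join(map(re.escape, delimiters))
words = re.split(pattern, ''.join(content))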
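
As a rough sketch of the same regex idea on an in-memory sample: a character class lists the delimiters without escaping each one, and the empty strings that re.split leaves at the edges are filtered out. The names split_words and sample are made up for illustration, and the delimiter set (whitespace plus , . ? ! :) is only an assumption about what counts as a separator in your files.

```python
import re

def split_words(text):
    """Split text on whitespace and , . ? ! : and drop empty pieces."""
    # Inside [...] these punctuation characters are literal; \s covers spaces and newlines.
    pieces = re.split(r"[\s,.?!:]+", text)
    return [piece for piece in pieces if piece]

# Hypothetical sample text, not taken from the real record files.
sample = "Patient is stable. No fever, no cough!\nFollow-up: two weeks?"
print(split_words(sample))
```

Note that the hyphen is not in the delimiter set here, so a token like "Follow-up" stays in one piece; add it to the class if you want it split.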

score 8 · Accepted Answer

with open("C:\\...\\...\\...\\record-13.txt") as f:
    for line in f:
        for word in line.split():
            print(word)

Or, this gives you a flat list of words:

with open("C:\\...\\...\\...\\record-13.txt") as f:
    words = [word for line in f for word in line.split()]

Or, this gives you a list of lines, where each line is itself a list of words:

with open("C:\\...\\...\\...\\record-13.txt") as f:
    words = [line.split() for line in f]
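
To make the difference between the last two variants concrete, here is a small sketch that runs without a real file. It assumes io.StringIO as a stand-in for the opened file, and the names fake_file, flat and nested are only illustrative.

```python
import io

# io.StringIO stands in for an open text file so the example runs without a real path.
fake_file = io.StringIO("fever and cough\nno known allergies\n")

# Flat list: every word from every line, in order.
flat = [word for line in fake_file for word in line.split()]

fake_file.seek(0)  # rewind before iterating a second time

# Nested list: one inner list of words per line.
nested = [line.split() for line in fake_file]

print(flat)
print(nested)
```

The flat form is handy for counting or searching words; the nested form keeps track of which line each word came from.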

score 5 · Accepted Answer

I would use the Natural Language Toolkit (NLTK), because the split() approach does not handle punctuation well.

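import nltk

# word_tokenize needs NLTK's tokenizer data to be downloaded once beforehand.
with open("C:\\...\\...\\...\\record-13.txt") as f:
    for line in f:
        words = nltk.word_tokenize(line)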
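
To see the punctuation problem being described, here is a standard-library sketch. It is not what NLTK does internally; stripping string.punctuation from each token is just a rough stand-in I'm assuming for illustration, and the sample sentence is made up.

```python
import string

# Hypothetical sample line, not taken from the real record files.
line = "Patient reports pain, nausea and fatigue."

# Plain split() leaves punctuation glued to the neighbouring word.
raw = line.split()

# Strip leading/trailing punctuation from each token, then drop any empty leftovers.
cleaned = [token.strip(string.punctuation) for token in raw]
cleaned = [token for token in cleaned if token]

print(raw)
print(cleaned)
```

This only trims punctuation at the edges of a token, so cases such as contractions or abbreviations are still better served by a real tokenizer.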

score 4 · Accepted Answer

I'm surprised nobody has suggested using a generator. Here's how I would do it:

def words(stringIterable):
    #upcast the argument to an iterator, if it's an iterator already, it stays the same
    lineStream = iter(stringIterable)
    for line in lineStream: #enumerate the lines
        for word in line.split(): #further break them down
            yield word

Now, this can be used on a simple list of sentences that you may already have in memory:

listOfLines = ['hi there', 'how are you']
for word in words(listOfLines):
    print(word)

But it works just as well on a file, without reading the whole file into memory:

with open('words.py', 'r') as myself:
    for word in words(myself):
        print(word)
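
A small sketch of why the lazy behaviour matters: with a generator you can stop early and only the lines needed so far are read. The generator is restated here so the snippet runs on its own; io.StringIO as a fake file and the name first_four are assumptions for illustration.

```python
import io
from itertools import islice

def words(string_iterable):
    """Yield whitespace-separated words from any iterable of strings."""
    for line in string_iterable:
        for word in line.split():
            yield word

# io.StringIO stands in for an open text file.
fake_file = io.StringIO("first line here\nsecond line here\nthird line here\n")

# Take only the first four words; the generator stops pulling lines after that.
first_four = list(islice(words(fake_file), 4))
print(first_four)
```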

score 2 · Accepted Answer

The most flexible approach is to use a list comprehension to build a list of words:

with open("C:\\...\\...\\...\\record-13.txt") as f:
    words = [word
             for line in f
             for word in line.split()]

# Do what you want with the words list

You can then iterate over it, feed it into a collections.Counter, or do anything else you like with it.
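
For instance, counting word frequencies could look roughly like this. It is a sketch that uses io.StringIO in place of the real file, and the sample text and the name counts are made up for illustration.

```python
import io
from collections import Counter

# io.StringIO stands in for an open text file.
fake_file = io.StringIO("pain in chest\nchest pain at night\n")

# Same list comprehension as above, applied to the stand-in file.
words = [word for line in fake_file for word in line.split()]

# Counter maps each distinct word to the number of times it occurs.
counts = Counter(words)
print(counts.most_common(2))
```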
