python - Text mining with Python

Question

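I have 900 ".txt" and ".htm" files in total. Each file has 4 paragraphs, and each document states the reason the company was delisted. I only need to extract that reason from every file. The reason for the suspension usually comes after words such as "because" and "as". How can I use Python to mine the reasons from all the documents? I am new to Python. Any help would be appreciated.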

score 1 · Accepted Answer

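If you know that the suspension reason follows specific words, you can do this with a regular expression. I put together some sample code in a few minutes. As a beginner, start from the code below and look up whatever you don't know yet.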

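import os
import re

directory = "path/to/documents"  # placeholder: replace with the folder holding your files
possible_suspensions = {}  # filename -> list of matched phrases

for filename in os.listdir(directory):
    # listdir() returns bare names, so join each one with the directory
    path = os.path.join(directory, filename)
    with open(path, "r") as file:
        contents = file.read()
    # non-capturing group, so findall() returns the whole matched phrase
    possible_suspensions[filename] = re.findall(r'(?:because of)[\w, ]*', contents)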
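
To handle more than one trigger phrase, the same idea can be wrapped in a small function. This is only a sketch: the function name `extract_reasons`, the trigger list, and the sample sentence are assumptions made for illustration, and the rule "a reason ends at the next period, semicolon, or line break" is a guess about how the documents are written. A very short trigger such as "as" is left out because it would probably match far too much text.

```python
import re

# Assumed trigger phrases; adjust them to the wording in your own documents.
TRIGGERS = ("because of", "because", "due to")

def extract_reasons(text, triggers=TRIGGERS):
    """Return the text that follows each trigger phrase, up to the end of the clause."""
    # Try longer phrases first so "because of" is preferred over "because".
    ordered = sorted(triggers, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # Capture everything after the trigger until a period, semicolon, or newline.
    pattern = re.compile(r"\b(?:" + alternation + r")\b\s+([^.;\n]+)", re.IGNORECASE)
    return [m.group(1).strip() for m in pattern.finditer(text)]

# Hypothetical sample text, not taken from the real files.
sample = ("Trading was suspended because of unpaid listing fees. "
          "The stock was delisted due to a merger.")
reasons = extract_reasons(sample)
print(reasons)
```

Calling `extract_reasons(contents)` inside the file loop would give one list of candidate reasons per document, which you can then check by hand.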

score 0 · Accepted Answer

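If the documents are plain text files without HTML tags, a basic regular expression will do.

If you want to parse the HTML content, which may allow a more structured way of extracting the reason, take a look at BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/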
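
For the ".htm" files, the tags need to be stripped before a regular expression is applied. BeautifulSoup is a third-party package (its usual idiom is `BeautifulSoup(html, "html.parser").get_text()`), so the sketch below uses the standard library's `html.parser` instead as a dependency-free stand-in. The class name `TextExtractor`, the helper `html_to_text`, and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only visible, non-empty text.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the visible text of an HTML string, pieces joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Hypothetical sample markup, not taken from the real files.
sample_html = "<html><body><p>Delisted because of <b>insolvency</b>.</p></body></html>"
plain = html_to_text(sample_html)
print(plain)
```

The resulting plain text can then be searched with the same regular expressions used for the ".txt" files.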

Example regular expression: (?<=This is)(.*)(?=sentence)
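
The pattern above combines a lookbehind, `(?<=This is)`, with a lookahead, `(?=sentence)`, so only the text between the two markers is captured. A minimal sketch of how it behaves in Python's `re` module, using a made-up input string:

```python
import re

# Lookbehind and lookahead assert the surrounding words without consuming them.
pattern = re.compile(r"(?<=This is)(.*)(?=sentence)")

# Hypothetical input chosen to match the example pattern.
match = pattern.search("This is a sample sentence")
captured = match.group(1) if match else None
print(repr(captured))  # the captured text keeps its surrounding spaces
```

Swapping "This is" and "sentence" for the words that bracket the delisting reason in your documents gives the same effect. Python requires the lookbehind to have a fixed width, so it must be a literal phrase rather than a variable-length pattern.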

Try your regular expressions online here, with the Python flavor selected: https://regex101.com/
