python - Extracting words from a txt file using Python

Question

I want to extract all the words that appear between single quotes in a text file. The text file looks like this:

u'MMA': 10,
=u'acrylic'= : 19,
== u'acting lessons': 2,
=u'aerobic': 141,
=u'alto': 2= 4,
=u&#= 39;art therapy': 4,
=u'ballet': 939,
=u'ballroom'= ;: 234,
= =u'banjo': 38,

Ideally, my output would look like this:

MMA,
acrylic,
acting lessons,
...

From browsing other posts, it seems I should use some combination of NLTK and regex in Python to do this. I tried the following:

import re

file = open('artsplus_categories.txt', 'r').readlines()

for line in file:
    list = re.search('^''$', file)

file.close()

and got the following error:

  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

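I think the error may be caused by the way I am looking for the pattern. My logic is that I search for everything inside '....'.

What is going wrong in re.py?

Thanks!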

--------------------------------

Following Ashwini's comment:

import re

file = open('artsplus_categories.txt', 'r').readlines()

for line in file:
    list = re.search('^''$', line)

print list

#file.close()

But the output contains nothing:

Samuel-Finegolds-MacBook-Pro:~ samuelfinegold$ /var/folders/jv/9_sy0bn10mbdft1bk9t14qz40000gn/T/Cleanup\ At\ Startup/artsplus_categories_clean-393952531.278.py.command ; exit;
None
logout

@Rasco: this is the error I get:

File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer
logout

I am using this code:

file2 = open('artsplus_categories.txt', 'r').readlines()
list = re.findall("'[^']*'", file2)
for x in list:
    print (x)

score 2 · Accepted Answer

Try this code sample:

import re

file =  """u'MMA': 10,
        =u'acrylic'= : 19,
        == u'acting lessons': 2,
        =u'aerobic': 141,
        =u'alto': 2= 4,
        =u&#= 39;art therapy': 4,
        =u'ballet': 939,
        =u'ballroom'= ;: 234,
        = =u'banjo': 38,"""

list = re.findall("'[^']*'", file)
for x in list:
    print (x)

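It prints the correct values. Keep in mind that one of the values in your sample has no proper opening quote, so the match breaks at that point.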
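
To run the same pattern against the file rather than an inline string, the file has to be read as a single string with read(); readlines() returns a list, which is what triggers the TypeError in the question. Below is a minimal sketch, not part of the original answer. The helper names extract_quoted and extract_from_file are hypothetical, and a capture group is added so that the surrounding quotes are dropped from the results. It uses print() as a function, as in the sample above.

```python
import re

def extract_quoted(text):
    """Return every substring enclosed in a pair of single quotes."""
    # The capture group makes findall return the text without the quotes.
    return re.findall(r"'([^']*)'", text)

def extract_from_file(path):
    """Read the whole file as one string, then extract the quoted values."""
    with open(path) as f:
        return extract_quoted(f.read())

# Small inline sample so the sketch runs without the real file.
sample = "u'MMA': 10,\nu'acrylic': 19,\nu'acting lessons': 2,"
for word in extract_quoted(sample):
    print(word)
```

With the real data, the call would be extract_from_file('artsplus_categories.txt'). Because [^'] also matches newlines, a line with an unbalanced quote pairs up with the next line's opening quote, which is the breakage mentioned above.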

score 2 · Accepted Answer

Instead of passing line to the regex, you are actually passing the whole list (file) to it. You should pass line to re.search, not file.

for line in file:
    lis = re.search('^''$', line) # line not file

Don't use list or file as variable names. Both are built-in names in Python 2, and assigning to them shadows the built-ins.
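
The two points above can be seen in a short sketch, written for Python 3. The names lines and matches are illustrative and not from the original post, and the list stands in for what open(...).readlines() would return. Passing the list reproduces the TypeError from the question; passing one line at a time works.

```python
import re

# Stand-in for open(...).readlines(), which returns a list of strings.
lines = ["u'MMA': 10,\n", "=u'acrylic'= : 19,\n", "== u'acting lessons': 2,\n"]

# Passing the list itself reproduces the error from the question.
try:
    re.findall(r"'[^']*'", lines)
except TypeError as exc:
    print("TypeError: %s" % exc)

# Passing one line (a string) at a time works.
matches = []
for line in lines:
    matches.extend(re.findall(r"'([^']*)'", line))
print(matches)
```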

Update:

with open('artsplus_categories.txt') as f:
    for line in f:
        print re.search(r"'(.*)'", line).group(1)
...         
MMA
acrylic
acting lessons
aerobic
alto
art therapy
ballet
ballroom
banjo
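
One caveat about the loop above: re.search returns None when a line has no pair of single quotes (for example, the malformed art therapy line in the question's sample), and calling .group(1) on None raises AttributeError. Below is a defensive sketch, assuming one value per line. The helper name first_quoted and the constant QUOTED are hypothetical. It uses [^']* instead of .* so the match stops at the first closing quote.

```python
import re

QUOTED = re.compile(r"'([^']*)'")

def first_quoted(line):
    """Return the first single-quoted value on the line, or None."""
    match = QUOTED.search(line)
    return match.group(1) if match else None

# Three lines taken from the question's sample; the middle one is malformed.
for line in ["u'MMA': 10,", "=u&#= 39;art therapy': 4,", "=u'ballet': 939,"]:
    value = first_quoted(line)
    if value is not None:  # skip lines without a quoted pair
        print(value)
```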
