python - Removing POS tags from a txt file

Question

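I have the following txt file, in which every word carries a POS (part-of-speech) tag.

Needless/jj to/to say/vb ,/, I/ppss was/bedz furious/jj at/in this/dt unparalleled/jj intrusion/nn upon/in free/jj enterprise/nn ./. How/wrb dared/vbn they/ppss

Is there a way to read the file without the POS tags, so that the result would be:

Needless to say, I was furious at this unparalleled intrusion upon free enterprise. How dared they

So basically I want to remove each / together with the tag that follows it.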

words = re.findall('\w+',open(input_file).read())

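The code above removes the / but the abbreviations such as jj and ppss still show up. So how do I remove the / together with whatever characters follow it?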

score 4 · Accepted Answer

Is this good enough?

>>> import re
>>> s = 'Needless/jj to/to say/vb ,/, I/ppss was/bedz furious/jj at/in this/dt unparalleled/jj intrusion/nn upon/in free/jj enterprise/nn ./.'
>>> re.sub(r'/[^\s]+','',s)
'Needless to say , I was furious at this unparalleled intrusion upon free enterprise .'

This simply removes any text that starts with a / and runs up to the next whitespace character.
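Since the question is about a file rather than a string literal, here is a minimal sketch of the same substitution applied to file contents. The names strip_tags and strip_tags_from_file, and the utf-8 default, are assumptions made for illustration and are not part of the original answer; \S is used as the equivalent of [^\s].

```python
import re

# A slash followed by a run of non-whitespace characters (the tag).
TAG_RE = re.compile(r'/\S+')

def strip_tags(text):
    """Return text with every /tag suffix removed."""
    return TAG_RE.sub('', text)

def strip_tags_from_file(path, encoding='utf-8'):
    # path and encoding are placeholders; adjust them to the real file.
    with open(path, encoding=encoding) as f:
        return strip_tags(f.read())

print(strip_tags('Needless/jj to/to say/vb ,/, I/ppss'))
# prints: Needless to say , I
```

Note that the pattern removes everything from the first slash in a token onward, so a word that itself contains a slash would be cut short.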

score 1 · Accepted Answer

As Wooble suggested, you can do this with two nested splits inside a generator expression:

s = 'Needless/jj to/to say/vb ,/, I/ppss was/bedz furious/jj at/in this/dt unparalleled/jj intrusion/nn upon/in free/jj enterprise/nn ./.'
print(" ".join(word.split("/")[0] for word in s.split()))

Output:

Needless to say , I was furious at this unparalleled intrusion upon free enterprise .

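s.split() splits the sentence into individual tokens. word.split("/") separates each word (or punctuation mark) from its part-of-speech tag. word.split("/")[0] keeps only the word and discards the POS tag. " ".join() combines the resulting words into a single string.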
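One caveat with the steps above: word.split("/")[0] keeps only the part before the first slash, so a token such as fur/ious/jj would come out as fur. A possible variant, sketched here under the assumption that the tag is always the part after the last slash, uses rsplit with a maximum of one split. The function name strip_last_tag is hypothetical.

```python
def strip_last_tag(sentence):
    # Split on whitespace, then cut each token at its LAST slash only,
    # so slashes inside the word itself are preserved.
    return " ".join(token.rsplit("/", 1)[0] for token in sentence.split())

print(strip_last_tag("I/ppss was/bedz fur/ious/jj ./."))
# prints: I was fur/ious .
```

Tokens that contain no slash at all pass through unchanged, because rsplit then returns a one-element list.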

score 0 · Accepted Answer

This code takes Wooble's comment into account, as well as the fact that, afaiu, you need to process a list of strings:

li = [ ('//Needless/jj to/to say/vb ,/, '
        'I/ppss was/bedz fur/ious/jj at/in this/dt '
        'unparalleled/jj intrusion/nn upon/in '
        'free/jj enterprise/nn ./. '
        'How/wrb dared/vbn they/ppss'),
       '/Before/jj to/to say/vb ,/, /I/ppss am/bedz h/a/p/p/y/jj']

import re

def clean(s, r=re.compile(r'(?<![\s/])/[^\s/]+(?![\S/])')):
    # Remove a /tag only when it ends a token and is not preceded by whitespace or another slash.
    return r.sub('', s)

x = map(clean, li)

print('\n\n'.join(x))

Result

//Needless to say , I was fur/ious at this unparalleled intrusion upon free enterprise . How dared they

/Before to say , /I am h/a/p/p/y
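To make the result above easier to follow, the same pattern can be written out with re.VERBOSE so that each part carries a comment. This is only an annotated restatement of the regex from this answer, not a different method; the names TAG_RE and clean_verbose are made up for the sketch.

```python
import re

TAG_RE = re.compile(r"""
    (?<![\s/])   # previous character is not whitespace and not another slash
    /            # the slash that introduces the tag
    [^\s/]+      # the tag itself: no whitespace and no further slashes
    (?![\S/])    # next character is whitespace, or the string ends here
""", re.VERBOSE)

def clean_verbose(s):
    # Only the final /tag of each token matches; slashes at the start of a
    # word or inside it are left alone.
    return TAG_RE.sub('', s)

print(clean_verbose('/Before/jj to/to say/vb ,/, /I/ppss am/bedz h/a/p/p/y/jj'))
# prints: /Before to say , /I am h/a/p/p/y
```

Because / already counts as a non-whitespace character, the final lookahead behaves the same as (?!\S).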
