
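I am working on an NLP project in which I have been given a dataset of POS-tagged sentences. The format of the dataset (example sentences are provided below) is

('word', 'pos_tag')

unless the word contains a single quote (clitics such as 're, 's, n't, and the closing quotation mark ''), in which case the format is

("word", "pos_tag")

The code snippet I use to process this dataset is as follows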

import re

def corpus_reader(filepath):
    # cond1 | cond2: single-quoted words, then double-quoted words
    pattern = '\(\'(\w+)\', |(?<=\").*?\", '
    sentences = []
    with open(filepath) as f:
        corpus = f.readlines()

    for line in corpus:
        temp = re.findall(pattern, line)
        sentences.append(temp)

    return sentences

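The pattern consists of two alternatives to detect, cond1|cond2.

cond1 is meant to match and extract all the words in the corpus.

cond2 is meant to match '', n't, 's and 're, which are enclosed in double quotes as I mentioned earlier, but this second condition does not work.

The desired result is a list of all the POS-tagged tokens.

Can someone provide the correct regex pattern to detect the cases I mentioned?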

Below are example sentences to parse that contain 're, n't, 's and ''

[('We', 'PRP'), ("'re", 'VBP'), ('talking', 'VBG'), ('about', 'IN'), ('years', 'NNS'), ('ago', 'IN'), ('before', 'IN'), ('anyone', 'NN'), ('heard', 'VBD'), ('of', 'IN'), ('asbestos', 'NN'), ('having', 'VBG'), ('any', 'DT'), ('questionable', 'JJ'), ('properties', 'NNS'), ('.', '.')]

[('``', '``'), ('We', 'PRP'), ('have', 'VBP'), ('no', 'DT'), ('useful', 'JJ'), ('information', 'NN'), ('on', 'IN'), ('whether', 'IN'), ('users', 'NNS'), ('are', 'VBP'), ('at', 'IN'), ('risk', 'NN'), (',', ','), ("''", "''"), ('said', 'VBD'), ('*T*-1', '-NONE-'), ('James', 'NNP'), ('A.', 'NNP'), ('Talcott', 'NNP'), ('of', 'IN'), ('Boston', 'NNP'), ("'s", 'POS'), ('Dana-Farber', 'NNP'), ('Cancer', 'NNP'), ('Institute', 'NNP'), ('.', '.')]

[('The', 'DT'), ('US', 'NNP'), ('is', 'VBZ'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('few', 'JJ'), ('industrialized', 'VBN'), ('nations', 'NNS'), ('that', 'WDT'), ('*T*-7', '-NONE-'), ('does', 'VBZ'), ("n't", 'RB'), ('have', 'VB'), ('a', 'DT'), ('higher', 'JJR'), ('standard', 'NN'), ('of', 'IN'), ('regulation', 'NN'), ('for', 'IN'), ('the', 'DT'), ('smooth', 'JJ'), (',', ','), ('needle-like', 'JJ'), ('fibers', 'NNS'), ('such', 'JJ'), ('as', 'IN'), ('crocidolite', 'NN'), ('that', 'WDT'), ('*T*-1', '-NONE-'), ('are', 'VBP'), ('classified', 'VBN'), ('*-5', '-NONE-'), ('as', 'IN'), ('amphobiles', 'NNS'), (',', ','), ('according', 'VBG'), ('to', 'TO'), ('Brooke', 'NNP'), ('T.', 'NNP'), ('Mossman', 'NNP'), (',', ','), ('a', 'DT'), ('professor', 'NN'), ('of', 'IN'), ('pathology', 'NN'), ('at', 'IN'), ('the', 'DT'), ('University', 'NNP'), ('of', 'IN'), ('Vermont', 'NNP'), ('College', 'NNP'), ('of', 'IN'), ('Medicine', 'NNP'), ('.', '.')]

[('``', '``'), ('What', 'WP'), ('*T*-14', '-NONE-'), ('matters', 'VBZ'), ('is', 'VBZ'), ('what', 'WP'), ('advertisers', 'NNS'), ('are', 'VBP'), ('paying', 'VBG'), ('*T*-15', '-NONE-'), ('per', 'IN'), ('page', 'NN'), (',', ','), ('and', 'CC'), ('in', 'IN'), ('that', 'DT'), ('department', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('doing', 'VBG'), ('fine', 'RB'), ('this', 'DT'), ('fall', 'NN'), (',', ','), ("''", "''"), ('said', 'VBD'), ('*T*-1', '-NONE-'), ('Mr.', 'NNP'), ('Spoon', 'NNP'), ('.', '.')]

Thanks in advance for all answers and attempts to help.


1 Answer


I would use:

(               # start of capture group 1
  (?<=\(')      # first alternative: positive lookbehind: ('
  [^']*         # zero or more characters other than '
  (?=',)        # positive lookahead: ',
|               # start of second alternative:
  (?<=\(")      # positive lookbehind: ("
  [^"]*         # zero or more characters other than "
  (?=",)        # positive lookahead: ",
)
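
In Python, the pattern above can be plugged into the question's corpus_reader if it is compiled with re.VERBOSE, so that the whitespace and comments are ignored. Since the pattern has exactly one capture group, re.findall returns a flat list of strings. This is a minimal sketch, assuming each line of the file holds one sentence in the format shown in the question; the helper name extract_words and the constant WORD_PATTERN are my own, for illustration only:

```python
import re

# The regex from this answer, compiled in verbose mode so the layout
# and the inline comments are ignored by the regex engine.
WORD_PATTERN = re.compile(r"""
    (               # capture group 1
      (?<=\(')      # preceded by ('
      [^']*         # zero or more characters other than '
      (?=',)        # followed by ',
    |
      (?<=\(")      # preceded by ("
      [^"]*         # zero or more characters other than "
      (?=",)        # followed by ",
    )
""", re.VERBOSE)


def extract_words(line):
    """Return the word of every (word, tag) pair found in one line."""
    return WORD_PATTERN.findall(line)


def corpus_reader(filepath):
    """Read a file with one tagged sentence per line (assumed layout)."""
    with open(filepath) as f:
        return [extract_words(line) for line in f]


if __name__ == "__main__":
    sample = """[('We', 'PRP'), ("'re", 'VBP'), ('talking', 'VBG'), ('.', '.')]"""
    print(extract_words(sample))  # ['We', "'re", 'talking', '.']
```

Note that this extracts only the words, not the tags, because the lookbehinds require an opening parenthesis directly before the quote.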

See the regex demo.
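
As for why the second alternative in the question's pattern seems not to work: it does match, but when a pattern contains exactly one capture group, re.findall returns only the contents of that group, and the group sits in the first alternative. Each time the second alternative matches, the group is empty, so an empty string appears where 're was expected. The first alternative also uses \w+, which skips tokens such as '.' or 'Dana-Farber'. A small demonstration on a shortened sample line (the function name original_findall is mine, for illustration):

```python
import re

# The regex from the question, written as a raw string (same regex).
ORIGINAL = r'''\('(\w+)', |(?<=").*?", '''


def original_findall(line):
    """Apply the question's pattern exactly as corpus_reader does."""
    return re.findall(ORIGINAL, line)


if __name__ == "__main__":
    sample = """[('We', 'PRP'), ("'re", 'VBP'), ('talking', 'VBG'), ('.', '.')]"""
    # "'re" comes back as an empty string, and '.' is missing entirely.
    print(original_findall(sample))  # ['We', '', 'talking']
```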

Answered 2020-04-11T13:01:35.890