
I want to do data augmentation for a sentiment analysis task by replacing words with their synonyms from WordNet, but the replacement is random.

import random

import nltk
from nltk.corpus import wordnet

# pos_df, normalize() and tokenize() are defined elsewhere in my project
sentences = []
for index, row in pos_df.iterrows():
    text = normalize(row['text'])
    words = tokenize(text)
    output = []
    # Identify the parts of speech
    tagged = nltk.pos_tag(words)

    for i in range(len(words)):
        replacements = []
        tag = tagged[i][1]

        # Do not attempt to replace proper nouns or determiners
        if tag not in ('NNP', 'DT'):
            # The tagger returns strings like NNP, VBP etc.
            # but the WordNet synset names have tags like .n.
            # So we take the first character of NNP and lower-case it, i.e. n,
            # then we check whether the synset name contains .n. or not.
            # Note: JJ* tags give 'j', which WordNet never uses (adjectives
            # are .a. or .s.), so adjectives are left unchanged here.
            word_type = tag[0].lower()

            # Only replace nouns with nouns, verbs with verbs etc.
            for syn in wordnet.synsets(words[i]):
                # Test with `in`: str.find() returns -1 on a miss, which is truthy
                if "." + word_type + "." in syn.name():
                    # extract the word only
                    replacements.append(syn.name().split(".")[0])

        if replacements:
            # Choose a random replacement
            replacement = random.choice(replacements)
            print(replacement)
            output.append(replacement)
        else:
            # If no replacement could be found, then just use the
            # original word
            output.append(words[i])

    sentences.append([" ".join(output), 'positive'])
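If the randomness is the concern, one alternative is to enumerate every single-word substitution instead of sampling one per word. The following is only a rough sketch under assumptions: `wordnet_pos`, `wordnet_synonyms` and `all_replacements` are hypothetical helper names, not part of nltk, and the lookup function is passed in as a parameter so that a different synonym source can be swapped in.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Penn Treebank tag prefix -> WordNet POS letter.
# Adjectives are 'a' in WordNet, not 'j'.
PENN_TO_WORDNET = {"N": "n", "V": "v", "J": "a", "R": "r"}


def wordnet_pos(penn_tag: str) -> Optional[str]:
    """Map a Penn Treebank tag to a WordNet POS letter, or None to skip the word."""
    if penn_tag in ("NNP", "NNPS", "DT"):  # leave proper nouns and determiners alone
        return None
    return PENN_TO_WORDNET.get(penn_tag[:1])


def wordnet_synonyms(word: str, pos: str) -> List[str]:
    """Look up synonyms in WordNet (needs nltk and its 'wordnet' corpus)."""
    from nltk.corpus import wordnet  # imported lazily so the rest runs without nltk

    names = {
        lemma.replace("_", " ")
        for syn in wordnet.synsets(word, pos=pos)
        for lemma in syn.lemma_names()
    }
    names.discard(word)
    return sorted(names)  # sorted so the output order is reproducible


def all_replacements(
    tagged: List[Tuple[str, str]],
    get_synonyms: Callable[[str, str], List[str]] = wordnet_synonyms,
) -> Iterator[str]:
    """Yield one new sentence per (position, synonym) pair, in a fixed order."""
    words = [w for w, _ in tagged]
    for i, (word, tag) in enumerate(tagged):
        pos = wordnet_pos(tag)
        if pos is None:
            continue
        for synonym in get_synonyms(word, pos):
            yield " ".join(words[:i] + [synonym] + words[i + 1:])
```

The output of `nltk.pos_tag(words)` can be passed in as `tagged`. Each yielded sentence differs from the input in exactly one word, so a single review can produce many augmented samples rather than one random one.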

1 Answer


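I am working on a similar project: generating new sentences from a given input without changing the context of the input text. While working on it, I came across a data augmentation technique that seems to work well for the augmentation part. It is called EDA (Easy Data Augmentation) and comes from a paper; see https://github.com/jasonwei20/eda_nlp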
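To give an idea of what it does: besides synonym replacement, EDA also uses random insertion, random swap and random deletion. Below is a simplified sketch of the two operations that need no thesaurus. This is my own illustration, not the code from the repository; the function names and the explicit `rng` parameter are assumptions made here to keep the example reproducible.

```python
import random
from typing import List


def random_swap(words: List[str], n: int, rng: random.Random) -> List[str]:
    """Swap two randomly chosen positions, n times."""
    words = list(words)  # work on a copy, leave the input untouched
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words: List[str], p: float, rng: random.Random) -> List[str]:
    """Drop each word with probability p, but keep at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # fixed seed so runs are repeatable
words = "this movie was really good".split()
print(" ".join(random_swap(words, 1, rng)))
print(" ".join(random_deletion(words, 0.2, rng)))
```

Both functions keep the label of the sentence unchanged, so the results can be appended to the training set with the same 'positive' tag.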

Hope this helps.

answered 2019-05-09T15:28:27.713