I have a handful of tweets to process, and I am trying to find the messages that indicate some harm to a person. How can I achieve this with NLP?
I bought my son a toy gun
I shot my neighbor with a gun
I don't like this gun
I would love to own this gun
This gun is a very good buy
Feel like shooting myself with a gun
Of the sentences above, the 2nd and 6th are the ones I want to find.
If the problem is limited to guns and shooting, then you could use a dependency parser (such as the Stanford Parser) to find verbs and their (prepositional) objects, starting from the verb and following its dependents in the parse tree. For example, in both 2 and 6 the pattern would be "shoot, with, gun".
You could then use a list of (near-)synonyms of "shoot" ("kill", "murder", "wound", etc.) and of "gun" ("weapon", "rifle", etc.) and check whether they occur in this pattern (verb - preposition - noun) in each sentence.
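One possible way to seed those synonym lists (a sketch of mine, not something the answer prescribes) is to collect lemma names from WordNet synsets via NLTK. The resulting lists are noisy, since "shoot" also covers photography and plant growth, so they need manual pruning before use:

from nltk.corpus import wordnet as wn

def near_synonyms(word, pos):
    # Collect lemma names from every WordNet synset of `word` with the given part of speech.
    lemmas = set()
    for synset in wn.synsets(word, pos=pos):
        for name in synset.lemma_names():
            lemmas.add(name.replace('_', ' ').lower())
    return lemmas

# Candidate lists for the verb-preposition-noun check; prune by hand before using them.
harm_verbs = near_synonyms('shoot', wn.VERB) | near_synonyms('kill', wn.VERB)
weapon_nouns = near_synonyms('gun', wn.NOUN) | near_synonyms('rifle', wn.NOUN)
print(sorted(harm_verbs))
print(sorted(weapon_nouns))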
There will be other ways of expressing the same idea, e.g. "I bought a gun to shoot my neighbor", where the dependency relations are different, and you would need to detect those kinds of dependencies as well.
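To make the verb-preposition-noun check concrete, here is a minimal sketch using spaCy's dependency parser (my substitution; the same pattern can be read off a Stanford parse just as well). HARM_VERBS and WEAPON_NOUNS stand in for the synonym lists above, and only the "with a gun" construction is handled, not the alternative phrasings just mentioned:

import spacy

HARM_VERBS = {"shoot", "kill", "murder", "wound"}    # placeholder synonym list
WEAPON_NOUNS = {"gun", "weapon", "rifle"}            # placeholder synonym list

nlp = spacy.load("en_core_web_sm")

def looks_harmful(sentence):
    doc = nlp(sentence)
    for token in doc:
        # A weapon noun used as the object of a preposition ("... with a gun")
        if token.lemma_ in WEAPON_NOUNS and token.dep_ == "pobj":
            prep = token.head          # the preposition, e.g. "with"
            verb = prep.head           # the word the preposition attaches to
            # ... governed by one of the harm verbs ("shot", "shooting", ...)
            if verb.lemma_ in HARM_VERBS:
                return True
    return False

for s in ["I shot my neighbor with a gun", "I bought my son a toy gun"]:
    print(s, "->", looks_harmful(s))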
Everything vpekar suggested is good. Here is some Python code that will at least parse the sentences and check whether they contain a verb from a user-defined set of harm words. Note: most "harm words" have several senses, many of which have nothing to do with harm; this approach does not attempt to disambiguate word sense.
(This code assumes you have NLTK and Stanford CoreNLP.)
import os
import subprocess
from xml.dom import minidom
from nltk.corpus import wordnet as wn

def StanfordCoreNLP_Plain(inFile):
    #Create the startup info so the java program runs in the background (for windows computers)
    startupinfo = None
    if os.name == 'nt':
        startupinfo = subprocess.STARTUPINFO()
        startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    #Execute the stanford parser from the command line
    cmd = ['java', '-Xmx1g', '-cp', 'stanford-corenlp-1.3.5.jar;stanford-corenlp-1.3.5-models.jar;xom.jar;joda-time.jar', 'edu.stanford.nlp.pipeline.StanfordCoreNLP', '-annotators', 'tokenize,ssplit,pos', '-file', inFile]
    output = subprocess.Popen(cmd, stdout=subprocess.PIPE, startupinfo=startupinfo).communicate()
    #CoreNLP writes <input file name>.xml to the working directory
    outFile = open(inFile[(str(inFile).rfind('\\')) + 1:] + '.xml')
    xmldoc = minidom.parse(outFile)
    itemlist = xmldoc.getElementsByTagName('sentence')
    Document = []
    #Get the data out of the xml document and into python lists
    for item in itemlist:
        SentNum = item.getAttribute('id')
        sentList = []
        tokens = item.getElementsByTagName('token')
        for d in tokens:
            word = d.getElementsByTagName('word')[0].firstChild.data
            pos = d.getElementsByTagName('POS')[0].firstChild.data
            sentList.append([str(pos.strip()), str(word.strip())])
        Document.append(sentList)
    return Document

def FindHarmSentence(Document):
    #Loop through sentences in the document. Look for verbs in the Harm Words Set.
    VerbTags = ['VBN', 'VB', 'VBZ', 'VBD', 'VBG', 'VBP', 'V']
    HarmWords = ("shoot", "kill")
    ReturnSentences = []
    for Sentence in Document:
        for word in Sentence:
            if word[0] in VerbTags:
                try:
                    #Reduce the verb to its base form before checking the harm word set
                    wordRoot = wn.morphy(word[1], wn.VERB)
                    if wordRoot in HarmWords:
                        print("This message could indicate harm:", Sentence)
                        ReturnSentences.append(Sentence)
                except: pass
    return ReturnSentences

#Assuming your input is a string, we need to put the strings in some file.
Sentences = "I bought my son a toy gun. I shot my neighbor with a gun. I don't like this gun. I would love to own this gun. This gun is a very good buy. Feel like shooting myself with a gun."
ProcessFile = "ProcFile.txt"
OpenProcessFile = open(ProcessFile, 'w')
OpenProcessFile.write(Sentences)
OpenProcessFile.close()

#Sentence split, tokenize, and part of speech tag the data using Stanford Core NLP
Document = StanfordCoreNLP_Plain(ProcessFile)

#Find sentences in the document with harm words
HarmSentences = FindHarmSentence(Document)
This outputs the following:
This message could indicate harm: [['PRP', 'I'], ['VBD', 'shot'], ['PRP$', 'my'], ['NN', 'neighbor'], ['IN', 'with'], ['DT', 'a'], ['NN', 'gun'], ['.', '.']]
This message could indicate harm: [['NNP', 'Feel'], ['IN', 'like'], ['VBG', 'shooting'], ['PRP', 'myself'], ['IN', 'with'], ['DT', 'a'], ['NN', 'gun'], ['.', '.']]