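My question concerns the post-processing of part-of-speech-tagged and parsed natural language sentences. Specifically, I am writing a component of a Lisp post-processor that takes a sentence parse tree (for example, one produced by the Stanford Parser) as input, extracts from that tree the phrase structure rules invoked to generate the parse, and then produces a table of rules and rule counts. An example of input and output follows:
(1) Sentence: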
John said that he knows who Mary likes
(2) Parser output:
(ROOT
  (S
    (NP (NNP John))
    (VP (VBD said)
      (SBAR (IN that)
        (S
          (NP (PRP he))
          (VP (VBZ knows)
            (SBAR
              (WHNP (WP who))
              (S
                (NP (NNP Mary))
                (VP (VBZ likes))))))))))
(3) Output of my Lisp post-processor for this parse tree:
(S --> NP VP) 3
(NP --> NNP) 2
(VP --> VBZ) 1
(WHNP --> WP) 1
(SBAR --> WHNP S) 1
(VP --> VBZ SBAR) 1
(NP --> PRP) 1
(SBAR --> IN S) 1
(VP --> VBD SBAR) 1
(ROOT --> S) 1
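Note that sentence (1) contains no punctuation. That is intentional. I have been unable to parse punctuation in Lisp, precisely because some punctuation marks (the comma, for example) are reserved for special purposes. But parsing sentences without punctuation changes the distribution of the parse rules, as well as the symbols those rules contain, as the following shows: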
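To make the difficulty concrete, here is a minimal sketch of how the standard reader reacts to the two punctuation forms, and how the same characters read once they are escaped. `try-read` is a hypothetical helper introduced only for this illustration, and the behavior noted in the comments is what I would expect from a conforming implementation with the standard readtable:

```lisp
;; TRY-READ is a hypothetical helper, not part of my post-processor.
(defun try-read (text)
  "Return the form read from TEXT, or :READ-ERROR if the reader signals an error."
  (handler-case (values (read-from-string text))
    (error () :read-error)))

(try-read "(, ,)")      ; comma outside a backquote: expected to signal an error
(try-read "(. .)")      ; lone dot at the start of a list: expected to signal an error
(try-read "(\\, \\,)")  ; backslash-escaped comma reads as a symbol named ","
(try-read "(|.| |.|)")  ; bar-quoted dot reads as a symbol named "."
```

Escaping works, but it means editing or preprocessing the parser output before Lisp ever reads it.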
(4) Input sentence:
I said no and then I did it anyway
(5) Parser output:
(ROOT
  (S
    (NP (PRP I))
    (VP (VBD said)
      (ADVP (RB no)
        (CC and)
        (RB then))
      (SBAR
        (S
          (NP (PRP I))
          (VP (VBD did)
            (NP (PRP it))
            (ADVP (RB anyway))))))))
(6) Input sentence (with punctuation):
I said no, and then I did it anyway.
(7) Parser output:
(ROOT
  (S
    (S
      (NP (PRP I))
      (VP (VBD said)
        (INTJ (UH no))))
    (, ,)
    (CC and)
    (S
      (ADVP (RB then))
      (NP (PRP I))
      (VP (VBD did)
        (NP (PRP it))
        (ADVP (RB anyway))))
    (. .)))
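Note how including punctuation completely rearranges the parse tree and also involves different POS tags, which in turn implies that different grammar rules were invoked to produce it. So including punctuation matters, at least for my application.
What I need is a way to include punctuation in the rules, so that a table like (3) can contain rules such as the following:
(8) Desired rule: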
S --> S , CC S .
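For the particular application I am writing, rules like (8) are actually required.
But I find this difficult to do in Lisp: in (7), for example, we see the forms (, ,) and (. .), both of which are problematic for Lisp to handle.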
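One possible way around the reader, sketched here under the assumption that the parser output is available as a string rather than as a quoted Lisp form, is to tokenize the text by hand so that `,` and `.` are just ordinary tokens. All function names below (`tokenize-parse`, `parse-tokens`, `parse-tree-from-string`, `preterminal-p`, `tree-rules`) are hypothetical and not part of my existing code. Note that `intern` keeps the case of each token, so words come out as symbols like `|said|`, while the upper-case tags match the symbols the reader would produce.

```lisp
;; Hypothetical helpers; a sketch, not a tested replacement for my code.
(defun tokenize-parse (text)
  "Split TEXT into the strings \"(\" and \")\" plus whitespace-delimited words."
  (let ((tokens '())
        (word (make-string-output-stream)))
    (flet ((flush-word ()
             ;; GET-OUTPUT-STREAM-STRING also clears the stream for the next word.
             (let ((w (get-output-stream-string word)))
               (when (plusp (length w))
                 (push w tokens)))))
      (loop for ch across text
            do (cond ((member ch '(#\( #\)))
                      (flush-word)
                      (push (string ch) tokens))
                     ((member ch '(#\Space #\Tab #\Newline #\Return))
                      (flush-word))
                     (t (write-char ch word))))
      (flush-word))
    (nreverse tokens)))

(defun parse-tokens (tokens)
  "Build one tree from TOKENS. Returns the tree and the unread tokens."
  (let ((token (pop tokens)))
    (if (string= token "(")
        (let ((items '()))
          (loop until (or (null tokens) (string= (first tokens) ")"))
                do (multiple-value-bind (item remaining) (parse-tokens tokens)
                     (push item items)
                     (setf tokens remaining)))
          (pop tokens)                  ; drop the closing ")"
          (values (nreverse items) tokens))
        ;; Any other token, punctuation included, becomes a symbol.
        (values (intern token) tokens))))

(defun parse-tree-from-string (text)
  "Return the nested-list tree described by TEXT."
  (values (parse-tokens (tokenize-parse text))))

(defun preterminal-p (node)
  "True for (TAG word) pairs such as (NNP John) or (, ,)."
  (and (consp node) (= (length node) 2) (atom (second node))))

(defun tree-rules (tree)
  "Collect one (LHS --> RHS...) rule per phrasal node, skipping (TAG word) pairs."
  (if (or (atom tree) (preterminal-p tree))
      '()
      (cons (append (list (first tree) '-->) (mapcar #'first (rest tree)))
            (mapcan #'tree-rules (rest tree)))))

(defparameter *tree*
  (parse-tree-from-string
   "(ROOT (S (S (NP (PRP I)) (VP (VBD said) (INTJ (UH no)))) (, ,) (CC and)
      (S (ADVP (RB then)) (NP (PRP I)) (VP (VBD did) (NP (PRP it)) (ADVP (RB anyway))))
      (. .)))"))

;; The second rule collected is the expansion of the top S node.
(format t "~A~%" (second (tree-rules *tree*)))  ; prints (S --> S , CC S .)
```

The trade-off is that the tree has to arrive as text, for example read from the parser's output file, instead of being pasted into the source as a quoted list.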
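I have included my relevant Lisp code below. Please note that I am a neophyte Lisp hacker, so my code is not particularly pretty or efficient. If someone could suggest how to modify the code below so that I can parse (7) and produce a table like (3) that includes a rule like (8), I would greatly appreciate it.
Here is my Lisp code relevant to this task: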
(defun WRITE-RULES-AND-COUNTS-SORTED (sent)
  (multiple-value-bind (rules-list counts-list)
      (COUNT-RULES-OCCURRENCES sent)
    ;; Bind the sorted (rule . count) alist locally rather than in a global.
    (let ((comblist (sort (pairlis rules-list counts-list) #'> :key #'cdr)))
      (format t "~%")
      ;; Print each rule, then its count aligned at column 26.
      (dolist (pair comblist)
        (format t "~A~26T~A~%" (car pair) (cdr pair)))
      (format t "~%"))))
(defun COUNT-RULES-OCCURRENCES (sent)
  (let* ((original-rules-list (EXTRACT-GRAMMAR sent))
         (de-duplicated-list (remove-duplicates original-rules-list :test #'equalp))
         (count-list nil))
    ;; For each distinct rule, count how often it occurs in the full list.
    (dolist (i de-duplicated-list)
      (push (count i original-rules-list :test #'equalp) count-list))
    (setf count-list (nreverse count-list))
    (values de-duplicated-list count-list)))
(defun EXTRACT-GRAMMAR (sent &optional (rules-stack nil))
  (cond ((null sent)
         NIL)
        ;; A one-element list holding a (TAG word) pair yields no rule.
        ((and (= (length sent) 1)
              (listp (first sent))
              (= (length (first sent)) 2)
              (symbolp (first (first sent)))
              (symbolp (second (first sent))))
         NIL)
        ;; A bare (TAG word) pair yields no rule.
        ((and (symbolp (first sent))
              (symbolp (second sent))
              (= 2 (length sent)))
         NIL)
        ;; SENT starts with a category label: record its rule, then recurse.
        ((symbolp (first sent))
         (push (EXTRACT-GRAMMAR-RULE sent) rules-stack)
         (append rules-stack (EXTRACT-GRAMMAR (rest sent))))
        ;; SENT is a list of sibling subtrees.
        ((listp (first sent))
         (cond ((not (and (listp (first sent))
                          (= (length (first sent)) 2)
                          (symbolp (first (first sent)))
                          (symbolp (second (first sent)))))
                (push (EXTRACT-GRAMMAR-RULE (first sent)) rules-stack)
                (append rules-stack
                        (EXTRACT-GRAMMAR (rest (first sent)))
                        (EXTRACT-GRAMMAR (rest sent))))
               (t (append rules-stack (EXTRACT-GRAMMAR (rest sent))))))))
(defun EXTRACT-GRAMMAR-RULE (sentence-or-phrase)
  ;; Build (LHS --> child-label ...) from a phrasal node.
  (append (list (first sentence-or-phrase))
          '(-->)
          (mapcar #'first (rest sentence-or-phrase))))
The code is invoked as follows (using (1) as input and producing (3) as output):
(WRITE-RULES-AND-COUNTS-SORTED
 '(ROOT
    (S
      (NP (NNP John))
      (VP (VBD said)
        (SBAR (IN that)
          (S
            (NP (PRP he))
            (VP (VBZ knows)
              (SBAR
                (WHNP (WP who))
                (S
                  (NP (NNP Mary))
                  (VP (VBZ likes)))))))))))
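Since I admitted the counting step is not efficient, here is a sketch of a single-pass alternative that tallies rules in a hash table keyed with `equal`. `count-rules` and `print-rule-table` are hypothetical names, and the input is assumed to be a plain list of rules such as the one EXTRACT-GRAMMAR returns:

```lisp
;; Hypothetical single-pass counter; a sketch, not my current code.
(defun count-rules (rules)
  "Return an alist of (RULE . COUNT) pairs sorted by descending count."
  (let ((table (make-hash-table :test #'equal)))
    ;; EQUAL compares rule lists element by element, so repeated rules
    ;; land in the same hash-table entry.
    (dolist (rule rules)
      (incf (gethash rule table 0)))
    (let ((pairs '()))
      (maphash (lambda (rule count) (push (cons rule count) pairs)) table)
      (sort pairs #'> :key #'cdr))))

(defun print-rule-table (rules)
  "Print each rule, then its count aligned at column 26."
  (dolist (pair (count-rules rules))
    (format t "~A~26T~A~%" (car pair) (cdr pair))))

(print-rule-table '((S --> NP VP) (NP --> NNP) (S --> NP VP)))
```

The order of rules with equal counts is not guaranteed, because hash-table iteration order is unspecified and `sort` is not stable.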