
I am trying to use a linear SVM and a K Neighbors Classifier for word sense disambiguation (WSD). Here is a snippet of the data I am using for training:

<corpus lang="English">

<lexelt item="activate.v">


<instance id="activate.v.bnc.00024693" docsrc="BNC">
<answer instance="activate.v.bnc.00024693" senseid="38201"/>
<context>
Do you know what it is ,  and where I can get one ?  We suspect you had seen the Terrex Autospade ,  which is made by Wolf Tools .  It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly ,  you should n't have to bend your back during general digging ,  although it wo n't lift out the soil and put in a barrow if you need to move it !  If gardening tends to give you backache ,  remember to take plenty of rest periods during the day ,  and never try to lift more than you can easily cope with .  
</context>
</instance>


<instance id="activate.v.bnc.00044852" docsrc="BNC">
<answer instance="activate.v.bnc.00044852" senseid="38201"/>
<answer instance="activate.v.bnc.00044852" senseid="38202"/>
<context>
For neurophysiologists and neuropsychologists ,  the way forward in understanding perception has been to correlate these dimensions of experience with ,  firstly ,  the material properties of the experienced object or event  ( usually regarded as the stimulus )  and ,  secondly ,  the patterns of discharges in the sensory system .  Qualitative Aspects of Experience The quality or modality of the experience depends less upon the quality of energy reaching the nervous system than upon which parts of the sensory system are <head>activated</head> : stimulation of the retinal receptors causes an experience of light ; stimulation of the receptors in the inner ear gives rise to the experience of sound ; and so on . Muller 's  nineteenth - century  doctrine of specific energies  formalized the ordinary observation that different sense organs are sensitive to different physical properties of the world and that when they are stimulated ,  sensations specific to those organs are experienced .  It was proposed that there are endings  ( or receptors )  within the nervous system which are attuned to specific types of energy ,  For example ,  retinal receptors in the eye respond to light energy ,  cochlear endings in the ear to vibrations in the air ,  and so on .  
</context>
</instance>
.....

The difference between the training data and the test data is that the test data has no "answer" tags. I have built a dictionary that stores, for each instance, the words neighbouring the "head" word within a window size of 10. When an instance has more than one answer, I only consider the first one. I have also built a set recording all the vocabulary in the training file, so that I can compute a vector for each instance. For example, if the total vocabulary is [a,b,c,d,e] and an instance contains the words [a,a,d,d,e], then the resulting vector for that instance is [2,0,0,2,1] (a short sketch of this counting step follows the dictionary excerpt below). Here is part of the dictionary I built for each word:

{
    "activate.v": {
        "activate.v.bnc.00024693": {
            "instanceId": "activate.v.bnc.00024693", 
            "senseId": "38201", 
            "vocab": {
                "although": 1, 
                "back": 1, 
                "bend": 1, 
                "bicycl": 1, 
                "correct": 1, 
                "dig": 1, 
                "general": 1, 
                "handlebar": 1, 
                "hefti": 1, 
                "lever": 1, 
                "nt": 2, 
                "quit": 1, 
                "rear": 1, 
                "spade": 1, 
                "sprung": 1, 
                "step": 1, 
                "type": 1, 
                "use": 1, 
                "wo": 1
            }
        }, 
        "activate.v.bnc.00044852": {
            "instanceId": "activate.v.bnc.00044852", 
            "senseId": "38201", 
            "vocab": {
                "caus": 1, 
                "ear": 1, 
                "energi": 1, 
                "experi": 1, 
                "inner": 1, 
                "light": 1, 
                "nervous": 1, 
                "part": 1, 
                "qualiti": 1, 
                "reach": 1, 
                "receptor": 2, 
                "retin": 1, 
                "sensori": 1, 
                "stimul": 2, 
                "system": 2, 
                "upon": 2
            }
        }, 
        ......
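
For concreteness, a minimal sketch of that counting step (the variable names here are just for illustration, not from my actual code):

total_vocab = ["a", "b", "c", "d", "e"]       # the full training vocabulary
instance_vocab = {"a": 2, "d": 2, "e": 1}     # word counts for one instance

# One component per vocabulary word; 0 when the word does not occur in the instance.
vector = [instance_vocab.get(word, 0) for word in total_vocab]
print(vector)   # [2, 0, 0, 2, 1]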

Now I just need to feed input to the K Neighbors Classifier and the Linear SVM from scikit-learn to train the classifiers. But I'm just not sure how to build the feature vectors and labels for each one. My understanding is that the label should be a tuple of the instance tag and the senseid tag from the "answer". But I'm not sure about the feature vectors. Should I group together all the vectors of the same word whose "answer" has the same instance tag and senseid tag? But there are around 100 target words and hundreds of instances for each, so how should I handle that?

Also, this vector is just one feature, and I will need to add more features later, such as synsets, hypernyms, hyponyms, etc. How should I go about that?

Thanks in advance!


2 Answers


The next step is to implement a multidimensional linear classifier.

Unfortunately I don't have access to this database, so this is somewhat theoretical. I can suggest the following approach:

Coerce all the data into a single CSV file, like this:

SenseId,Word,Text,IsHyponim,Properties,Attribute1,Attribute2, ...
30821,"BNC","For neurophysiologists and ...","Hyponym sometype",1,1
30822,"BNC","Do you know what it is ...","Antonym type",0,1
...

Then you can use the sklearn tools:

import pandas as pd
df = pd.read_csv('file.csv')

from sklearn.feature_extraction import DictVectorizer
enc = DictVectorizer()
X_train_categ = enc.fit_transform(df[['Properties']].to_dict('records'))

from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(min_df=5)  # throw out all terms present in fewer than 5 documents - typos and so on
v = vec.fit_transform(df['Text'])

# Join everything together as one sparse matrix; only the numeric attribute columns
# go in directly, the text and categorical features are already vectorized above.
from scipy.sparse import csr_matrix, hstack
train = hstack((csr_matrix(df.loc[:, 'Attribute1':].values), X_train_categ, v))
y = df['SenseId']

# here you have a matrix with really huge dimensionality - tens of thousands of columns
# you may use Ridge regression to deal with it:
from sklearn.linear_model import Ridge
r = Ridge(random_state=241, alpha=1.0)

# prepare the test data in the same way as the training data
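
Since predicting a sense id is a classification task, here is a minimal sketch of the remaining fit/predict steps, using RidgeClassifier (the estimator the link below describes) rather than plain Ridge. The file name 'test.csv' and the 'Attribute1' column slice are assumptions about how the test data is laid out:

# Hedged sketch: fit on the training matrix built above, then transform the test
# data with the vectorizers that were already fitted on the training data.
from sklearn.linear_model import RidgeClassifier

clf = RidgeClassifier(alpha=1.0, random_state=241)
clf.fit(train, y)

df_test = pd.read_csv('test.csv')   # assumed test file with the same columns (minus answers)
X_test_categ = enc.transform(df_test[['Properties']].to_dict('records'))
v_test = vec.transform(df_test['Text'])
test = hstack((csr_matrix(df_test.loc[:, 'Attribute1':].values), X_test_categ, v_test))

predicted_senses = clf.predict(test)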

More details: Ridge Classifier.

Other techniques for dealing with high dimensionality.

A code example of text classification with sparse feature matrices.

Answered 2016-06-08T12:47:22.787

A machine learning problem is an optimisation task: there is no predefined one-size-fits-all algorithm; instead, you feel your way toward the best result with different methods, parameters and data preprocessing. So you are absolutely right to start from the simplest task - a single word and a few senses.

"But I'm just not sure how to build the feature vectors and labels for each one."

You can simply take those counts as the vector components. Enumerate the vocabulary words and, for each text, write down how many times each word appears; if a word is absent, put a zero. I have slightly modified your example to clarify the idea:

vocab_38201= {
            "although": 1, 
            "back": 1, 
            "bend": 1, 
            "bicycl": 1, 
            "correct": 1, 
            "dig": 1, 
            "general": 1, 
            "handlebar": 1, 
            "hefti": 1, 
            "lever": 1, 
            "nt": 2, 
            "quit": 1, 
            "rear": 1, 
            "spade": 1, 
            "sprung": 1, 
            "step": 1, 
            "type": 1, 
            "use": 1, 
            "wo": 1
        }

vocab_38202 = {
            "caus": 1, 
            "ear": 1, 
            "energi": 1, 
            "experi": 1, 
            "inner": 1, 
            "light": 1, 
            "nervous": 1, 
            "part": 1, 
            "qualiti": 1, 
            "reach": 1, 
            "receptor": 2, 
            "retin": 1, 
            "sensori": 1, 
            "stimul": 2, 
            "system": 2, 
            "upon": 2,
            "wo": 1     ### added so they have at least one common word
        }

Let's convert these into feature vectors: enumerate all the words and record how many times each word occurs in a vocabulary.

from collections import defaultdict

words = []  # global enumeration of every word seen so far

def get_components(vect_dict):
    """Map a {word: count} dict to {word_index: count}, extending `words` with new words."""
    vect_components = defaultdict(int)
    for word, num in vect_dict.items():
        try:
            ind = words.index(word)
        except ValueError:
            ind = len(words)
            words.append(word)
        vect_components[ind] += num
    return vect_components


# Build the component dicts for both example senses.
vect_comps_38201 = get_components(vocab_38201)
vect_comps_38202 = get_components(vocab_38202)

Let's have a look:

>>> print(vect_comps_38201)
defaultdict(<class 'int'>, {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 2, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1})

>>> print(vect_comps_38202)
defaultdict(<class 'int'>, {32: 1, 33: 2, 34: 1, 7: 1, 19: 2, 20: 2, 21: 1, 22: 1, 23: 1, 24: 1, 25: 1, 26: 1, 27: 2, 28: 1, 29: 1, 30: 1, 31: 1})

>>> vect_38201=[vect_comps_38201.get(i,0) for i in range(len(words))]
>>> print(vect_38201)
[1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

>>> vect_38202=[vect_comps_38202.get(i,0) for i in range(len(words))]
>>> print(vect_38202)
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1]

These vect_38201 and vect_38202 are the vectors you can use to fit a model:

from sklearn.svm import SVC
X = [vect_38201, vect_38202]
y = [38201, 38202]
clf = SVC()
clf.fit(X, y)
clf.predict([[0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 2, 1]])

Output:

array([38202])

Of course this is a very simplified example, just to show the concept.
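
Since the question specifically mentions a linear SVM and a K Neighbors Classifier, the same X and y can be fed to those scikit-learn estimators as well; a minimal sketch (n_neighbors=1 is chosen only because this toy set has two samples):

from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier

# Same toy data as above: one count vector per instance, one sense id per vector.
X = [vect_38201, vect_38202]
y = [38201, 38202]

svm_clf = LinearSVC()
svm_clf.fit(X, y)

# n_neighbors=1 only because there are just two training samples; tune it on real data.
knn_clf = KNeighborsClassifier(n_neighbors=1)
knn_clf.fit(X, y)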

What can you do to improve it?

  1. Normalize the vector coordinates.

  2. Use the excellent Tf-Idf vectorizer tool to extract features from the text (a short sketch of points 1 and 2 follows this list).

  3. Add more data.
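
A hedged sketch of the first two suggestions; the `contexts` list of raw context strings is a hypothetical stand-in for the texts extracted from the training file:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Hypothetical raw context strings, one per instance (stand-ins for the real corpus texts).
contexts = [
    "you step on to activate it used correctly ...",
    "which parts of the sensory system are activated ...",
]

# Point 2: let TfidfVectorizer build and weight the vocabulary for you.
vec = TfidfVectorizer()
X_tfidf = vec.fit_transform(contexts)

# Point 1: L2-normalize hand-built count vectors (TfidfVectorizer already L2-normalizes its output).
X_counts_normalized = normalize([vect_38201, vect_38202], norm='l2')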

Good luck!

Answered 2016-06-06T17:13:25.010