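I want to classify text data using an SVM classifier model in RapidMiner. The classification will be multi-class. Since my data is text, how can an SVM be used for this classification? As far as I know, an SVM works with numeric data only.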
2 Answers
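The missing piece you are looking for is called a "word vector". Basically, you have to create a new example set in which each attribute represents one word. For a given example (i.e. a document), the (numerical) value of that attribute indicates the "importance" of the word for that document.
A naive approach is to use the count of the word within the document, but typically you should use TF-IDF (term frequency-inverse document frequency), which also takes the whole document corpus into account.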
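To make the difference between raw counts and TF-IDF concrete, here is a minimal Python sketch outside RapidMiner. The function name `tf_idf_vectors` and the toy documents are made up for illustration, and the formula is the plain textbook variant (relative term frequency times the log of inverse document frequency); RapidMiner's own weighting and normalization may differ.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF weight per word and document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        row = []
        for word in vocabulary:
            tf = counts[word] / total                 # relative term frequency
            idf = math.log(n_docs / doc_freq[word])   # rarer words weigh more
            row.append(tf * idf)
        vectors.append(row)
    return vocabulary, vectors


# Hypothetical toy corpus: each document becomes one numeric row.
docs = ["svm classifies text", "svm needs numeric data", "text becomes numeric vectors"]
vocab, vecs = tf_idf_vectors(docs)
for doc, row in zip(docs, vecs):
    weights = {w: round(v, 3) for w, v in zip(vocab, row) if v > 0}
    print(doc, "->", weights)
```

A word that occurs in every document gets weight 0 under this formula, which is exactly the "take the whole corpus into account" effect: such a word cannot help to tell documents apart.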
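To do this in RapidMiner, you have to install the Text Mining Extension and use operators such as "Process Documents from Data" or "Process Documents from Files". Keep in mind that text mining requires further preprocessing steps, such as tokenizing, removing stop words (common words that appear in nearly all documents and are therefore not very helpful), and stemming (so that "word" and "words" are treated the same).
Here is a small example: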
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.009">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.009" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document" width="90" x="45" y="75">
        <parameter key="text" value="I want to classify text data using classifier model SVM with Rapidminer tool. Classification would be of multilable type. Since my data is of text type, how SVM can be used for this classification. I know that SVM works with numeric data only."/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="165">
        <parameter key="text" value="The missing piece you are looking for is called &quot;word vector&quot;. Basically you have to create a new example set for which the attributes will represent the words. For a given example (i.e. a document) the (numerical) value for this attribute will show the &quot;importance&quot; of this word for this document. A naive approach would be to use the count of the word within the document, but typically you should use TD-IDF (term frequency–inverse document frequency) which will take the whole document corpus into account as well. To do this in RapidMiner you have to install the text mining extension and use operators like &quot;Process Documents from Data&quot; or &quot;Process Documents from Files&quot;. Keep in mind that for text mining you will need to conduct more preprocessing steps like creating tokens, removing stop words (common words which you can find in nearly all documents and which are therefore not very helpful) and use the stem of the words (so &quot;word&quot; and &quot;words&quot; will be treated equally). Here is a small example:"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="112" name="Process Documents" width="90" x="179" y="75">
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="30"/>
          <operator activated="true" class="text:stem_porter" compatibility="5.3.000" expanded="true" height="60" name="Stem (Porter)" width="90" x="313" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
          <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 2"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
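For readers who want to see what the three inner operators (Tokenize, Filter Stopwords, Stem) do to a document, here is a rough Python sketch of the same chain. It is not RapidMiner's implementation: the stop word list is a tiny made-up one, and `naive_stem` is a crude suffix stripper standing in for the real Porter stemmer, so the results will not match RapidMiner's output exactly.

```python
import re

# Tiny illustrative stop word list (an assumption; RapidMiner ships its own list).
STOP_WORDS = {"a", "an", "and", "are", "be", "for", "i", "is", "of", "the", "to", "will"}


def tokenize(text):
    """Lowercase the text and split it on every run of non-letter characters."""
    return [t for t in re.split(r"[^A-Za-z]+", text.lower()) if t]


def naive_stem(token):
    """Crude stand-in for Porter stemming: strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    tokens = tokenize(text)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


print(preprocess("The word and the words will be treated equally"))
```

After this step "word" and "words" end up as the same token, so they share a single attribute in the resulting word vector.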
By the way, there are also some very good tutorials on text mining with RapidMiner on YouTube.
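This question may be rather old, but there may be more people like me who are just experimenting with RapidMiner and hoping to solve exactly the same problem.
I think the first part, processing the text with RapidMiner's Text Mining Extension, was explained correctly by maerch some time ago. But considering kailash's comment, the main problem seems to be the incompatibility between the binominal SVM model and the polynominal input/label set.
The actual input to the SVM model is handled by adding the meta operator "Polynominal by Binominal Classification" as a wrapper around the SVM. It repeatedly regroups the input classes (in a way you can choose with the "classification strategies" parameter) so that there are always exactly two groups to feed to the SVM, until a combined result can be derived. The resulting model is then able to handle multiple classes.
The following process snippet shows the SVM with the Poly2Bi wrapper (default parameters):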
<process expanded="true">
  <operator activated="true" class="polynomial_by_binomial_classification" compatibility="5.3.015" expanded="true" height="76" name="Polynominal by Binominal Classification" width="90" x="112" y="120">
    <parameter key="classification_strategies" value="1 against all"/>
    <parameter key="random_code_multiplicator" value="2.0"/>
    <parameter key="use_local_random_seed" value="false"/>
    <parameter key="local_random_seed" value="1992"/>
    <process expanded="true">
      <operator activated="true" class="support_vector_machine_linear" compatibility="5.3.015" expanded="true" height="76" name="SVM (Linear)" width="90" x="179" y="210">
        <parameter key="kernel_cache" value="200"/>
        <parameter key="C" value="0.0"/>
        <parameter key="convergence_epsilon" value="0.001"/>
        <parameter key="max_iterations" value="100000"/>
        <parameter key="scale" value="true"/>
        <parameter key="L_pos" value="1.0"/>
        <parameter key="L_neg" value="1.0"/>
        <parameter key="epsilon" value="0.0"/>
        <parameter key="epsilon_plus" value="0.0"/>
        <parameter key="epsilon_minus" value="0.0"/>
        <parameter key="balance_cost" value="false"/>
        <parameter key="quadratic_loss_pos" value="false"/>
        <parameter key="quadratic_loss_neg" value="false"/>
      </operator>
      <connect from_port="training set" to_op="SVM (Linear)" to_port="training set"/>
      <connect from_op="SVM (Linear)" from_port="model" to_port="model"/>
      <portSpacing port="source_training set" spacing="0"/>
      <portSpacing port="sink_model" spacing="0"/>
    </process>
  </operator>
  <connect from_port="training" to_op="Polynominal by Binominal Classification" to_port="training set"/>
  <connect from_op="Polynominal by Binominal Classification" from_port="model" to_port="model"/>
  <portSpacing port="source_training" spacing="0"/>
  <portSpacing port="sink_model" spacing="0"/>
  <portSpacing port="sink_through 1" spacing="0"/>
</process>
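The "1 against all" strategy set in the snippet above can be sketched in a few lines of Python. This only illustrates the idea and is not how RapidMiner implements the wrapper: the binary learner is a simple perceptron standing in for the SVM, and the function names, the vocabulary and the toy data are all made up.

```python
def train_perceptron(rows, targets, epochs=100):
    """Binary linear learner (a stand-in for the SVM); targets are +1 or -1."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, targets):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified: shift the hyperplane toward x
                weights = [w + y * v for w, v in zip(weights, x)]
                bias += y
    return weights, bias


def train_one_against_all(rows, labels):
    """Train one binary model per class: that class (+1) against all others (-1)."""
    models = {}
    for cls in sorted(set(labels)):
        targets = [1 if label == cls else -1 for label in labels]
        models[cls] = train_perceptron(rows, targets)
    return models


def predict(models, x):
    """Combine the binary models: pick the class with the highest score."""
    def score(model):
        weights, bias = model
        return sum(w * v for w, v in zip(weights, x)) + bias
    return max(models, key=lambda cls: score(models[cls]))


# Toy word-count vectors over the hypothetical vocabulary [goal, vote, stock].
rows = [[3, 0, 0], [2, 0, 1], [0, 3, 0], [0, 2, 1], [0, 0, 3], [1, 0, 2]]
labels = ["sports", "sports", "politics", "politics", "finance", "finance"]
models = train_one_against_all(rows, labels)
print(predict(models, [4, 0, 0]))
```

Each binary learner only ever sees two groups, which is why a learner limited to binominal labels can still end up producing a multi-class model.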
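Note that (at least) version 5.3.015 of RapidMiner complains when the Poly2Bi operator is used this way inside the training area of a validation operator and there is a Performance operator in the testing area. The Performance operator reports: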
Label and prediction must be of the same type but are polynominal and nominal, respectively.
However, in the RapidMiner forum it was pointed out that this appears to be a useless warning that you can ignore. In my case, the process also ran fine.