In my news page project, I have a database table "news" with the following structure:
- id: [integer] unique number identifying the news entry, e.g.: *1983*
- title: [string] title of the text, e.g.: *New Life in America No Longer Means a New Name*
- topic: [string] category which should be chosen by the classifier, e.g.: *Sports*

In addition, there is a table "bayes" that holds the word frequency information:
- word: [string] a word for which the frequencies are given, e.g.: *real estate*
- topic: [string] same content as the "topic" field above, e.g.: *Economics*
- count: [integer] number of occurrences of "word" in "topic" (incremented when new documents are assigned to "topic"), e.g.: *100*
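Now I want my PHP script to classify all news entries and assign one of several possible categories (topics) to each of them.
Is this a correct implementation? Can you improve it?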

    <?php
    include 'mysqlLogin.php';
    $get1 = "SELECT id, title FROM ".$prefix."news WHERE topic = '' LIMIT 0, 150";
    $get2 = mysql_abfrage($get1);
    // pTOPICS BEGIN
    $pTopics1 = "SELECT topic, SUM(count) AS count FROM ".$prefix."bayes WHERE topic != '' GROUP BY topic";
    $pTopics2 = mysql_abfrage($pTopics1);
    $pTopics = array();
    while ($pTopics3 = mysql_fetch_assoc($pTopics2)) {
        $pTopics[$pTopics3['topic']] = $pTopics3['count'];
    }
    // pTOPICS END
    // pWORDS BEGIN
    $pWords1 = "SELECT word, topic, count FROM ".$prefix."bayes";
    $pWords2 = mysql_abfrage($pWords1);
    $pWords = array();
    while ($pWords3 = mysql_fetch_assoc($pWords2)) {
        if (!isset($pWords[$pWords3['topic']])) {
            $pWords[$pWords3['topic']] = array();
        }
        $pWords[$pWords3['topic']][$pWords3['word']] = $pWords3['count'];
    }
    // pWORDS END
    while ($get3 = mysql_fetch_assoc($get2)) {
        $pTextInTopics = array();
        $tokens = tokenizer($get3['title']);
        foreach ($pTopics as $topic=>$documentsInTopic) {
            if (!isset($pTextInTopics[$topic])) { $pTextInTopics[$topic] = 1; }
            foreach ($tokens as $token) {
                echo '....'.$token;
                if (isset($pWords[$topic][$token])) {
                    $pTextInTopics[$topic] *= $pWords[$topic][$token]/array_sum($pWords[$topic]);
                }
            }
            $pTextInTopics[$topic] *= $pTopics[$topic]/array_sum($pTopics); // #documentsInTopic / #allDocuments
        }
        asort($pTextInTopics); // pick topic with lowest value
        if ($chosenTopic = each($pTextInTopics)) {
            echo '<p>The text belongs to topic '.$chosenTopic['key'].' with a likelihood of '.$chosenTopic['value'].'</p>';
        }
    }
    ?>

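The training is done manually and is not included in this code. If the text "You can make money by selling real estate" is assigned to the category/topic "Economics", then all of its words (you, can, make, ...) are inserted into the table "bayes" with "Economics" as the topic and 1 as the initial count. If a word is already present in combination with the same topic, its count is incremented instead.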
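The training step described above could look roughly like the following. This is only a minimal sketch, not the actual training code: the nested array `$bayes[$topic][$word]` is a hypothetical in-memory stand-in for the "bayes" table, `simpleTokenizer()` is an assumed ASCII-only replacement for the `tokenizer()` function that the script calls but does not show, and the second training sentence is made up purely to demonstrate the increment.

```php
<?php
// Hypothetical stand-in for tokenizer(): lowercases the text and splits it
// on every run of characters that is not an ASCII letter or digit.
function simpleTokenizer(string $text): array
{
    $parts = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

// Hypothetical in-memory model of the "bayes" table:
// $bayes[$topic][$word] holds the count for that word/topic pair.
function trainDocument(array &$bayes, string $text, string $topic): void
{
    foreach (simpleTokenizer($text) as $word) {
        if (!isset($bayes[$topic][$word])) {
            $bayes[$topic][$word] = 1;   // new word/topic pair: initial count of 1
        } else {
            $bayes[$topic][$word]++;     // pair already known: increment the count
        }
    }
}

$bayes = [];
trainDocument($bayes, 'You can make money by selling real estate', 'Economics');
trainDocument($bayes, 'Banks make money', 'Economics'); // made-up second document
print_r($bayes['Economics']);
```

After both calls, "make" and "money" each have a count of 2 for "Economics", while every other word has a count of 1. Against the real database, the same insert-or-increment logic would be applied to rows of the "bayes" table.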
Sample learning data:

    word        topic        count
    kaczynski   Politics     1
    sony        Technology   1
    bank        Economics    1
    phone       Technology   1
    sony        Economics    3
    ericsson    Technology   2

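Sample output / result:

    Text title: Phone test Sony Ericsson Aspen - sensitive Winberry
    Politics
    ....phone ....test ....sony ....ericsson ....aspen ....sensitive ....winberry
    Technology
    ....phone FOUND ....test ....sony FOUND ....ericsson FOUND ....aspen ....sensitive ....winberry
    Economics
    ....phone ....test ....sony FOUND ....ericsson ....aspen ....sensitive ....winberry

Result: The text belongs to topic Technology with a likelihood of 0.013888888888889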
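The reported value can be reproduced by hand from the sample learning data. The arithmetic below assumes the six sample rows are the complete content of the "bayes" table and that tokens without an entry for a topic are simply skipped, as the `isset` check in the inner loop does. The per-topic totals are then 1 for Politics, 1 + 1 + 2 = 4 for Technology and 1 + 3 = 4 for Economics, which gives 9 overall. The name "score" is only a label for the value the script stores in `$pTextInTopics`.

```latex
\begin{align*}
\text{score}(\text{Politics})   &= \frac{1}{9} \approx 0.1111 \\
\text{score}(\text{Technology}) &= \frac{1}{4}\cdot\frac{1}{4}\cdot\frac{2}{4}\cdot\frac{4}{9} = \frac{1}{72} \approx 0.0139 \\
\text{score}(\text{Economics})  &= \frac{3}{4}\cdot\frac{4}{9} = \frac{1}{3} \approx 0.3333
\end{align*}
```

Technology has the smallest of the three values, and that is the entry the script reports, because `asort` sorts in ascending order and the first element is taken.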
Thank you very much!