我很困惑如何过滤某些文档的单词。我必须一一检查文件。例如来自 tb_tokens :
======================================================================
| tokens_id | tokens_word | tokens_freq| sentence_id | document_id |
======================================================================
| 1 | A | 1 | 0 | 1 |
| 2 | B | 1 | 0 | 1 |
| 3 | C | 1 | 1 | 1 |
| 4 | D | 1 | 0 | 2 |
| ... | | | | |
======================================================================
我必须删除列表中出现的所有单词以及常见单词,例如“and”、“the”等。记录在表 tb_stopword 中的列表,然后删除出现在记录列表中的大多数文档中大量出现的单词在 tb_term 表中。
函数 cekStopWord :
function cekStopWord ($word) {
$query = mysql_query("SELECT stoplist_word FROM tb_stopword where stoplist_word = '$word' ");
$row = mysql_fetch_row($query);
if($row > 0) {
return true;
} else {
return false;
}
}
以及第二个过程的类似功能(删除大多数文档中大量出现的单词)
function cekTerm ($word) {
$query = mysql_query("SELECT term_word FROM tb_term where term_word = '$word' ");
我很困惑如何处理每个文件。我试图通过 doc_id 调用,但它不起作用。这是我的代码:
//$doc_id is a variable that save array of document_id
$query = mysql_query('SELECT tokens_word, sentence_id, document_id FROM tb_tokens WHERE document_id IN (' . implode(",", $doc_id) . ')') or die(mysql_error());
while ($row = mysql_fetch_array($query)) {
$word[$row['document_id']][$row['sentence_id']] = $row['tokens_word'];
}
foreach ($word as $doc_id => $words){
$cekStopWord = cekStopWord($words);
$cekTerm = cekTerm($words);
if((preg_match("/^[A-Z, 0-9]/", $words))&& (!$cekStopWord) && (!$cekTerm) ){
$q = mysql_query("INSERT INTO tb_tagging VALUES ('','$words','','$sentence_id','$doc_id') ");
以及如何在数组中使用 preg_match ?太感谢了 :)