I have 2 tables. The first is tb_sentence:
================================
|id|doc_id|sentence_id|sentence|
================================
| 1| 1 | 0 | AB |
| 2| 1 | 1 | CD |
| 3| 2 | 0 | EF |
| 4| 2 | 1 | GH |
| 5| 2 | 2 | IJ |
| 6| 2 | 3 | KL |
================================
First, I count the number of sentences for each doc_id and save the counts in a variable, $total_sentence. The value of $total_sentence is therefore Array ( [0] => 2 [1] => 4 ).
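The counting step above could be sketched as follows. This is only an illustration: the rows are hard-coded from tb_sentence so that it runs without a database, and keying the result by doc_id (rather than by position 0, 1 as in my variable) is an assumption that makes the later per-document lookup simpler.

```php
<?php
// Rows copied from the tb_sentence table above (hard-coded for illustration).
$sentences = [
    ['doc_id' => 1, 'sentence_id' => 0],
    ['doc_id' => 1, 'sentence_id' => 1],
    ['doc_id' => 2, 'sentence_id' => 0],
    ['doc_id' => 2, 'sentence_id' => 1],
    ['doc_id' => 2, 'sentence_id' => 2],
    ['doc_id' => 2, 'sentence_id' => 3],
];

// Count how many rows (sentences) each doc_id has.
// The resulting array is keyed by doc_id, not by position.
$total_sentence = array_count_values(array_column($sentences, 'doc_id'));

print_r($total_sentence);
```

With this layout, $total_sentence[2] gives the sentence count for doc_id = 2 directly.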
The second table is tb_stem:
============================
|id|stem|doc_id|sentence_id|
============================
|1 | B | 1 | 0 |
|2 | A | 1 | 1 |
|3 | C | 2 | 0 |
|4 | A | 2 | 1 |
|5 | E | 2 | 2 |
|6 | C | 2 | 3 |
|7 | D | 2 | 4 |
|8 | G | 2 | 5 |
|9 | A | 2 | 6 |
============================
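Second, I need to group the stem values within each doc_id, then count the number of distinct sentence_id values that contain each resulting stem ($token). The idea is to divide the total number of documents by the number of documents that contain the stem. The code: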
$query1 = mysql_query("SELECT DISTINCT(stem) AS unique FROM `tb_stem` group by stem,doc_id ");
while ($row = mysql_fetch_array($query1)) {
    $token = $row['unique']; //the result $token must be : ABACDEG
}
$query2 = mysql_query("SELECT stem, COUNT( DISTINCT sentence_id ) AS ndw FROM `tb_stem` WHERE stem = '$token' GROUP BY stem, doc_id");
while ($row = mysql_fetch_array($query2)) {
    $ndw = $row['ndw']; //the result must be : 1122111
}
$idf = log($total_sentence / $ndw)+1; //$total_sentence for doc_id = 1 must be divide $ndw with the doc_id = 2, etc
However, the results are not separated by document, as the table below shows:
============================
|id|word|doc_id| ndw |idf |
============================
|1 | A | | | |
|2 | B | | | |
|3 | C | | | |
|4 | D | | | |
|5 | E | | | |
|6 | G | | | |
============================
The result must be:
============================
|id|word|doc_id| ndw |idf |
============================
|1 | A | 1 | | |
|2 | B | 1 | | |
|3 | A | 2 | | |
|4 | C | 2 | | |
|5 | D | 2 | | |
|6 | E | 2 | | |
|7 | G | 2 | | |
============================
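For illustration, the expected table above could be produced by a sketch like the one below. It is not a tested solution against my database: the rows are hard-coded from tb_stem, $total_sentence is assumed to be keyed by doc_id, and the names $seen and $rows are made up for this example. The grouping it performs in PHP should correspond to a single query such as SELECT doc_id, stem, COUNT(DISTINCT sentence_id) AS ndw FROM tb_stem GROUP BY doc_id, stem ORDER BY doc_id, stem.

```php
<?php
// Rows copied from tb_stem above, as [stem, doc_id, sentence_id].
$stems = [
    ['B', 1, 0], ['A', 1, 1],
    ['C', 2, 0], ['A', 2, 1], ['E', 2, 2], ['C', 2, 3],
    ['D', 2, 4], ['G', 2, 5], ['A', 2, 6],
];

// Sentence counts per doc_id, taken from tb_sentence (assumed keyed by doc_id).
$total_sentence = [1 => 2, 2 => 4];

// Collect the distinct sentence_ids for every (doc_id, stem) pair.
$seen = [];
foreach ($stems as [$stem, $doc_id, $sentence_id]) {
    $seen[$doc_id][$stem][$sentence_id] = true;
}

// Build one output row per (doc_id, stem), ordered by doc_id and then by stem.
$rows = [];
ksort($seen);
foreach ($seen as $doc_id => $by_stem) {
    ksort($by_stem);
    foreach ($by_stem as $stem => $sentence_ids) {
        $ndw = count($sentence_ids); // number of distinct sentences containing the stem
        $rows[] = [
            'word'   => $stem,
            'doc_id' => $doc_id,
            'ndw'    => $ndw,
            // Same formula as in my code: natural log of the ratio, plus 1.
            'idf'    => log($total_sentence[$doc_id] / $ndw) + 1,
        ];
    }
}

// Print the rows in the same column order as the expected table.
foreach ($rows as $i => $r) {
    printf("%d | %s | %d | %d | %.4f\n", $i + 1, $r['word'], $r['doc_id'], $r['ndw'], $r['idf']);
}
```

The key difference from my code is that every stem is kept together with its doc_id, so the sentence count used in the division always belongs to the same document.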
Please help me, thank you :)
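The formula for idf is idf = log(N/df), where N is the number of documents and df is the number of documents in which the term (t) appears. Each sentence is treated as a document. Here is an example of the idf calculation.
Document: Do you read poetry while flying. Many people find it relaxing to read on long flights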
=================================================
| Term | Document1(D1)| D2| df | idf |
=================================================
| find | 0 | 1 | 1 |log(2/1)|
| fly | 1 | 1 | 2 |log(2/2)|
| long | 0 | 1 | 1 |log(2/1)|
| people | 0 | 1 | 1 |log(2/1)|
| poetry | 1 | 0 | 1 |log(2/1)|
| read | 1 | 1 | 2 |log(2/2)|
| relax | 0 | 1 | 1 |log(2/1)|
=================================================
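The worked example in the table above can be checked with a small sketch. It assumes stop-word removal and stemming have already been done, so the term lists are simply copied from the table; $docs, $df and $idf are names chosen for this illustration.

```php
<?php
// Stemmed terms of each sentence, copied from the table above.
// Each sentence is treated as one document (D1, D2).
$docs = [
    'D1' => ['read', 'poetry', 'fly'],
    'D2' => ['people', 'find', 'relax', 'read', 'long', 'fly'],
];
$N = count($docs); // total number of documents

// df: in how many documents each term appears (a term counts once per document).
$df = [];
foreach ($docs as $terms) {
    foreach (array_unique($terms) as $term) {
        $df[$term] = ($df[$term] ?? 0) + 1;
    }
}
ksort($df);

// idf = log(N / df), using the natural logarithm.
$idf = [];
foreach ($df as $term => $count) {
    $idf[$term] = log($N / $count);
}

foreach ($idf as $term => $value) {
    printf("%-7s df=%d idf=%.4f\n", $term, $df[$term], $value);
}
```

Terms that occur in both sentences (fly, read) get log(2/2) = 0, while the others get log(2/1), which matches the table.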