I have 2 tables:
tb_sentence:
================================
|id|doc_id|sentence_id|sentence|
================================
| 1|  1   |   0       |    AB  |
| 2|  1   |   1       |    CD  |
| 3|  2   |   0       |    EF  |
| 4|  2   |   1       |    GH  |
| 5|  2   |   2       |    IJ  |
| 6|  2   |   3       |    KL  |
================================
First, I count the number of sentences for each document (doc_id) and store the counts in a variable $total_sentence. So the value of $total_sentence is Array ( [0] => 2 [1] => 4 )
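The counting step can be sketched in PHP as follows. This is a minimal illustration, not the original code: the $tb_sentence array is a hypothetical in-memory copy of the table above, and the result is keyed by doc_id instead of 0, 1, ... so that later lookups per document are unambiguous.

```php
<?php
// Hypothetical in-memory copy of tb_sentence: [id, doc_id, sentence_id, sentence].
$tb_sentence = [
    [1, 1, 0, 'AB'],
    [2, 1, 1, 'CD'],
    [3, 2, 0, 'EF'],
    [4, 2, 1, 'GH'],
    [5, 2, 2, 'IJ'],
    [6, 2, 3, 'KL'],
];

// Count sentences per doc_id. On the database side this corresponds roughly to:
//   SELECT doc_id, COUNT(*) FROM tb_sentence GROUP BY doc_id
$total_sentence = [];
foreach ($tb_sentence as [$id, $doc_id, $sentence_id, $sentence]) {
    $total_sentence[$doc_id] = ($total_sentence[$doc_id] ?? 0) + 1;
}

print_r($total_sentence); // keys are doc_id values, values are sentence counts
```

With the sample rows this gives 2 for doc_id 1 and 4 for doc_id 2, the same counts as the array shown above.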
The second table is tb_stem:
============================
|id|stem|doc_id|sentence_id|
============================
|1 | B  |  1   |     0     |
|2 | A  |  1   |     1     |
|3 | C  |  2   |     0     |
|4 | A  |  2   |     1     |
|5 | E  |  2   |     2     |
|6 | C  |  2   |     3     |
|7 | D  |  2   |     4     |
|8 | G  |  2   |     5     |
|9 | A  |  2   |     6     |
============================
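Second, I need to group the stems by doc_id, then count the number of sentence_id values in which each resulting stem ($token) appears. The idea is to divide the total number of documents by the number of documents that contain the stem. The code: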
$query1 = mysql_query("SELECT DISTINCT(stem) AS `unique` FROM `tb_stem` GROUP BY stem, doc_id");
while ($row = mysql_fetch_array($query1)) {
    $token = $row['unique']; // expected $token values, in order: A B A C D E G
}
$query2 = mysql_query("SELECT stem, COUNT( DISTINCT sentence_id ) AS ndw FROM `tb_stem` WHERE stem = '$token' GROUP BY stem, doc_id");
while ($row = mysql_fetch_array($query2)) {
    $ndw = $row['ndw']; // expected $ndw values, in order: 1 1 2 2 1 1 1
}
$idf = log($total_sentence / $ndw) + 1; // $total_sentence for doc_id = 1 must be divided by the $ndw values of doc_id = 1, likewise for doc_id = 2, etc.
But the results are not separated per document, as the table below shows:
============================
|id|word|doc_id|  ndw |idf |
============================
|1 | A  |      |      |    |
|2 | B  |      |      |    |
|3 | C  |      |      |    |
|4 | D  |      |      |    |
|5 | E  |      |      |    |
|6 | G  |      |      |    |
============================
The result must be:
============================
|id|word|doc_id|  ndw |idf |
============================
|1 | A  |   1  |      |    |
|2 | B  |   1  |      |    |
|3 | A  |   2  |      |    |
|4 | C  |   2  |      |    |
|5 | D  |   2  |      |    |
|6 | E  |   2  |      |    |
|7 | G  |   2  |      |    |
============================
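One possible way to get one row per (doc_id, word) is to group on both columns at once, which on the database side would look roughly like SELECT doc_id, stem, COUNT(DISTINCT sentence_id) AS ndw FROM tb_stem GROUP BY doc_id, stem. The sketch below shows the same grouping in plain PHP. It is only an illustration under assumptions: $tb_stem is a hypothetical in-memory copy of the table, $total_sentence is assumed to be keyed by doc_id, and the idf line reuses the log(...) + 1 formula from the code above.

```php
<?php
// Hypothetical in-memory copy of tb_stem: [stem, doc_id, sentence_id].
$tb_stem = [
    ['B', 1, 0], ['A', 1, 1],
    ['C', 2, 0], ['A', 2, 1], ['E', 2, 2], ['C', 2, 3],
    ['D', 2, 4], ['G', 2, 5], ['A', 2, 6],
];
// Assumed sentence totals per doc_id, as counted from tb_sentence.
$total_sentence = [1 => 2, 2 => 4];

// Collect the distinct sentence_ids for every (doc_id, stem) pair.
$sentences = [];
foreach ($tb_stem as [$stem, $doc_id, $sentence_id]) {
    $sentences[$doc_id][$stem][$sentence_id] = true;
}

// Build one result row per (doc_id, stem), ordered by doc_id and then stem.
$rows = [];
ksort($sentences);
foreach ($sentences as $doc_id => $stems) {
    ksort($stems);
    foreach ($stems as $stem => $ids) {
        $ndw = count($ids); // number of distinct sentences containing the stem
        $idf = log($total_sentence[$doc_id] / $ndw) + 1; // natural log, per document
        $rows[] = ['word' => $stem, 'doc_id' => $doc_id, 'ndw' => $ndw, 'idf' => $idf];
    }
}

foreach ($rows as $i => $r) {
    printf("%d | %s | %d | %d | %.3f\n", $i + 1, $r['word'], $r['doc_id'], $r['ndw'], $r['idf']);
}
```

With the sample data this produces seven rows in the order A, B for doc_id 1 and A, C, D, E, G for doc_id 2, with ndw values 1 1 2 2 1 1 1. Because each row carries its own doc_id, the matching $total_sentence entry can be looked up directly instead of relying on a single scalar.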
Please help me, thank you :)
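The formula for idf is idf = log(N/df), where N is the number of documents and df is the number of documents in which the term (t) appears. Each sentence is treated as a document. Here is an example idf calculation. Document: Do you read poetry while flying. Many people find it relaxing to read on long flights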
=================================================
|     Term     | Document1(D1)| D2| df |   idf  |
=================================================
|     find     |     0        | 1 |  1 |log(2/1)|
|     fly      |     1        | 1 |  2 |log(2/2)|
|     long     |     0        | 1 |  1 |log(2/1)|
|    people    |     0        | 1 |  1 |log(2/1)|
|    poetry    |     1        | 0 |  1 |log(2/1)|
|     read     |     1        | 1 |  2 |log(2/2)|
|    relax     |     0        | 1 |  1 |log(2/1)|
=================================================
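The table above can be reproduced with a short PHP sketch. The term lists per document are copied from the table (stemming and stop-word removal are assumed to have happened already), and the names $docs, $df and $idf are chosen here for illustration only.

```php
<?php
// Stemmed terms per document, taken from the table above (D1 and D2).
$docs = [
    'D1' => ['fly', 'poetry', 'read'],
    'D2' => ['find', 'fly', 'long', 'people', 'read', 'relax'],
];

$n = count($docs); // N: number of documents (each sentence counts as one document)

// df: number of documents that contain each term.
$df = [];
foreach ($docs as $terms) {
    foreach (array_unique($terms) as $term) {
        $df[$term] = ($df[$term] ?? 0) + 1;
    }
}
ksort($df);

// idf = log(N / df); PHP's log() is the natural logarithm.
$idf = [];
foreach ($df as $term => $count) {
    $idf[$term] = log($n / $count);
}

foreach ($idf as $term => $value) {
    printf("%-7s df=%d idf=%.4f\n", $term, $df[$term], $value);
}
```

Terms that occur in both sentences (fly, read) get log(2/2) = 0, and the rest get log(2/1), matching the idf column of the table.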