php - 将文档总数除以包含词干的文档数

Question

我有 2 张桌子：

tb_sentence：

================================
|id|doc_id|sentence_id|sentence|
================================
| 1|  1   |   0       |    AB  |
| 2|  1   |   1       |    CD  |
| 3|  2   |   0       |    EF  |
| 4|  2   |   1       |    GH  |
| 5|  2   |   2       |    IJ  |
| 6|  2   |   3       |    KL  |
================================

首先，我计算每个句子的数量document_id并将它们保存在一个变量$total_sentence中。所以$total_sentence变量的值为Array ( [0] => 2 [1] => 4 )

第二张表是tb_stem：

============================
|id|stem|doc_id|sentence_id|
============================
|1 | B  |  1   |     0     |
|2 | A  |  1   |     1     |
|3 | C  |  2   |     0     |
|4 | A  |  2   |     1     |
|5 | E  |  2   |     2     |
|6 | C  |  2   |     3     |
|7 | D  |  2   |     4     |
|8 | G  |  2   |     5     |
|9 | A  |  2   |     6     |
============================

其次，我需要对stem每个中的数据进行分组doc_id，然后计算由（）sentence_id之前的结果组成的数量。$token该概念是将文档总数除以包含词干的文档数。编码：

$query1 = mysql_query("SELECT DISTINCT(stem) AS unique FROM `tb_stem` group by stem,doc_id ");
while ($row = mysql_fetch_array($query1)) {
    $token = $row['unique']; //the result $token must be : ABACDEG
}

$query2 = mysql_query("SELECT stem, COUNT( DISTINCT sentence_id ) AS ndw FROM `tb_stem` WHERE stem = '$token' GROUP BY stem, doc_id");
    while ($row = mysql_fetch_array($query2)) {
        $ndw = $row['ndw']; //the result must be : 1122111
}

$idf = log($total_sentence / $ndw)+1; //$total_sentence for doc_id = 1 must be divide $ndw with the doc_id = 2, etc

但结果在不同文档之间并不分开，如下表所示：

============================
|id|word|doc_id|  ndw |idf |
============================
|1 | A  |      |      |    |
|2 | B  |      |      |    |
|3 | C  |      |      |    |
|4 | D  |      |      |    |
|5 | E  |      |      |    |
|6 | G  |      |      |    |
============================

结果必须是：

 ============================
|id|word|doc_id|  ndw |idf |
============================
|1 | A  |   1  |      |    |
|2 | B  |   1  |      |    |
|3 | A  |   2  |      |    |
|4 | C  |   2  |      |    |
|5 | D  |   2  |      |    |
|6 | E  |   2  |      |    |
|7 | G  |   2  |      |    |
============================

请帮助我，谢谢:)

idf 的公式是idf = log(N/df)其中N是文档df的数量，是出现术语 (t) 的文档的数量。每个句子都被视为一个文档。这是 idf 计算的示例：文档：Do you read poetry while flying. Many people find it relaxing to read on long flights

=================================================
|     Term     | Document1(D1)| D2| df |   idf  |
=================================================
|     find     |     0        | 1 |  1 |log(2/1)|
|     fly      |     1        | 1 |  2 |log(2/2)|
|     long     |     0        | 1 |  1 |log(2/1)|
|    people    |     0        | 1 |  1 |log(2/1)|
|    poetry    |     1        | 0 |  1 |log(2/1)|
|     read     |     1        | 1 |  2 |log(2/2)|
|    relax     |     0        | 1 |  1 |log(2/1)|
=================================================

score 2 · Accepted Answer

此查询将为您提供您正在寻找的表：

SELECT t1.doc_id, t2.token as word, t2.token_freq as df, 
log(t1.docs/t2.token_freq) as idf
FROM 
(SELECT doc_id,count(sentence_id) as docs from tb_sentence group by doc_id) as t1,
(SELECT DISTINCT(stem) as token, doc_id, COUNT(sentence_id) as token_freq 
      FROM tb_stem GROUP BY doc_id, token) as t2
WHERE t1.doc_id = t2.doc_id

注意：原始查询中的唯一性是 MySQL 中的保留字，会给您错误。

php - 将文档总数除以包含词干的文档数

1 回答 1

Related

Reference