1

我有一个包含近 100,000 条评论的数据库,我想检测最常用的词(使用停用词来避免常用词)。

我只想这样做一次,然后使用一些最流行的词来标记包含它们的评论。

你能帮我用查询和 PHP 代码来做到这一点吗?谢谢!

4

1 回答 1

3

The easiest approach I guess would be:

  • Create two new tables: keywords (id, word) and keywords_comments (keyword_id, comment_id, count)
    • keywords saves an unique id and the keyword you found in a text
    • keywords_comments stores one row for each connection between each comment that contains that keyword. In count you wil save the number of times this keyword occurred in the comment. The two columns keyword_id + comment_id together form a unique or directly the primary key.
  • Retrieve all comments from the database
  • Parse through all comments and split by non-characters (or other boundaries)
  • Write these entries to your tables

Example

You have the following two comments:

Hello, how are you?!

Wow, hello. My name is Stefan.

Now you would iterate over both of them and split them by non-characters. This would result in the following lowercase words for each text: - First text: hello, how, are, you - Second text: wow, hello, my, name, is, stefan

As soon as you have parsed one of this text, you can already insert it into the database again. I guess you do not want to load 100.000 comments to RAM.

So it would go this:

  • Parse first text an get the keywords above
  • Write each keyword into the tabke keywords if it is not there yet
  • Set a reference from the keyword to the comment (keywords_comments) and set the count correctly (in our example each word occurs only once in each text, you have to count that).
  • Parse second text
  • …</li>

Minor improvement

A very easy improvement you probably have to use for 100.000 comments, is to use a counting variable or add a new field has_been_analyzed to each comment. Then you can read them comment by comment from the database.

I usually use counting variables when I read data chunkwise and know that the data cannot not change from the direction I am starting (i.e. it will stay consistent up to the point I currently am). Then I do something like:

SELECT * FROM table ORDER BY created ASC LIMIT 0, 100
SELECT * FROM table ORDER BY created ASC LIMIT 100, 100
SELECT * FROM table ORDER BY created ASC LIMIT 200, 100
…

Consider that this only works if we know for sure that there are no dates to be added at a place we think we already read. E.g. using DESC would not work, as there could be data inserted. Then the whole offset would break and we would read one article twice and never read the new article.

If you cannot make sure that the outside counting variable stays consistent, you can add a new field analyzed which you set to true as soon as you have read the comment. Then you can always see which comments have already been read and which not. An SQL query would then look like this:

SELECT * FROM table WHERE analyzed = 0 LIMIT 100 /* Reading chunks of 100 */

This works as long as you do not parallelize the workload (with multiple clients or threads). Otherwise you would have to make sure that reading + setting true is atomar (synchronized).

于 2012-06-29T19:02:51.567 回答