我有一个包含近 100,000 条评论的数据库,我想检测最常用的词(使用停用词来避免常用词)。
我只想这样做一次,然后使用一些最流行的词来标记包含它们的评论。
你能帮我用查询和 PHP 代码来做到这一点吗?谢谢!
The easiest approach I guess would be:
keywords
(id, word) and keywords_comments
(keyword_id, comment_id, count)
keywords
saves an unique id and the keyword you found in a textkeywords_comments
stores one row for each connection between each comment that contains that keyword. In count
you wil save the number of times this keyword occurred in the comment. The two columns keyword_id + comment_id together form a unique or directly the primary key.You have the following two comments:
Hello, how are you?!
Wow, hello. My name is Stefan.
Now you would iterate over both of them and split them by non-characters. This would result in the following lowercase words for each text: - First text: hello, how, are, you - Second text: wow, hello, my, name, is, stefan
As soon as you have parsed one of this text, you can already insert it into the database again. I guess you do not want to load 100.000 comments to RAM.
So it would go this:
keywords
if it is not there yetkeywords_comments
) and set the count correctly (in our example each word occurs only once in each text, you have to count that).A very easy improvement you probably have to use for 100.000 comments, is to use a counting variable or add a new field has_been_analyzed to each comment. Then you can read them comment by comment from the database.
I usually use counting variables when I read data chunkwise and know that the data cannot not change from the direction I am starting (i.e. it will stay consistent up to the point I currently am). Then I do something like:
SELECT * FROM table ORDER BY created ASC LIMIT 0, 100
SELECT * FROM table ORDER BY created ASC LIMIT 100, 100
SELECT * FROM table ORDER BY created ASC LIMIT 200, 100
…
Consider that this only works if we know for sure that there are no dates to be added at a place we think we already read. E.g. using DESC
would not work, as there could be data inserted. Then the whole offset would break and we would read one article twice and never read the new article.
If you cannot make sure that the outside counting variable stays consistent, you can add a new field analyzed which you set to true as soon as you have read the comment. Then you can always see which comments have already been read and which not. An SQL query would then look like this:
SELECT * FROM table WHERE analyzed = 0 LIMIT 100 /* Reading chunks of 100 */
This works as long as you do not parallelize the workload (with multiple clients or threads). Otherwise you would have to make sure that reading + setting true is atomar (synchronized).