sql - How to count occurrences of words from one table in the comments of another table

Question

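I'm trying to accomplish a task in Google's BigQuery that may require logic I'm not sure SQL can handle natively.

I have 2 tables:

The first table has a single column, where each row is a lowercase word
The second table is a database of comments (containing data such as who made the comment, the comment itself, a timestamp, etc.)

I want to sort the comments in the second table by the number of occurrences of the words from the first table.

Here is a basic example of what I want to do, written in Python and using letters instead of words... but you get the idea: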

words = ['a','b','c','d','e']

comments = ['this is the first sentence', 'this is another comment', 'look another sentence, which is also a comment', 'nope', 'no', 'run']

wordcount = {}

for comment in comments:
    for word in words:
        if word in comment:
            if comment in wordcount:
                wordcount[comment] += 1
            else:
                wordcount[comment] = 1

print(sorted(wordcount.items(), key = lambda k: k[1], reverse=True))

Output:

[('look another sentence, which is also a comment', 3), ('this is another comment', 3), ('this is the first sentence', 2), ('nope', 1)]

So far, the best approach I have seen is to generate a SQL query like this:

SELECT
    COUNT(*)
FROM
    table
WHERE
    comment_col like '%word1%'
    OR comment_col like '%word2%'
    OR ...

But there are over 2,000 words... and that feels wrong. Any tips?

score 3 · Accepted Answer

The following is for BigQuery Standard SQL:

#standardSQL
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON STRPOS(comment, word) > 0 
GROUP BY comment
-- ORDER BY cnt DESC
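
To make it easier to see what this join computes, here is a minimal Python sketch that mirrors it on the question's sample data. Each (comment, word) pair in which the word is a substring of the comment becomes one row, and the GROUP BY counts the rows per comment. The function name `count_substring_matches` is made up for illustration and is not part of BigQuery.

```python
from collections import Counter

words = ['a', 'b', 'c', 'd', 'e']
comments = [
    'this is the first sentence',
    'this is another comment',
    'look another sentence, which is also a comment',
    'nope',
    'no',
    'run',
]

def count_substring_matches(comments, words):
    """Emulate JOIN ... ON STRPOS(comment, word) > 0 followed by GROUP BY comment."""
    # One "row" per (comment, word) pair where the word occurs inside the comment.
    joined_rows = [(c, w) for c in comments for w in words if w in c]
    # COUNT(word) per comment.
    return Counter(c for c, _ in joined_rows)

counts = count_substring_matches(comments, words)
# most_common() plays the role of ORDER BY cnt DESC.
for comment, cnt in counts.most_common():
    print(cnt, comment)
```

Because this is an inner join, comments that match no word (here 'no' and 'run') produce no rows and therefore do not appear in the result at all, just as in the Python loop from the question.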

As an alternative, you can use a regular expression if you prefer:

#standardSQL
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON REGEXP_CONTAINS(comment, word)
GROUP BY comment
-- ORDER BY cnt DESC
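
One thing to keep in mind with the regex variant: REGEXP_CONTAINS interprets its second argument as a regular expression, so a word containing metacharacters such as `.` can match more than its literal text. The sketch below illustrates the effect with Python's `re` module as a stand-in (BigQuery uses the RE2 engine, so this is an analogy rather than the same implementation). The word `a.b` and the comment `axb` are hypothetical values chosen for the demonstration.

```python
import re

word = "a.b"     # hypothetical entry in the words table
comment = "axb"  # does not contain the literal text "a.b"

# Treated as a pattern, "." matches any character, so this matches.
matches_as_pattern = re.search(word, comment) is not None

# Escaping the word first forces a literal comparison, so this does not match.
matches_as_literal = re.search(re.escape(word), comment) is not None

print(matches_as_pattern, matches_as_literal)
```

For words made up only of letters, as described in the question, the pattern and the literal text behave the same way, and the STRPOS version is the simpler choice.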

You can test and play with the above using the dummy example from the question:

#standardSQL
WITH words AS (
  SELECT word
  FROM UNNEST(['a','b','c','d','e']) word
),
comments AS (
  SELECT comment 
  FROM UNNEST(['this is the first sentence', 'this is another comment', 'look another sentence, which is also a comment', 'nope', 'no', 'run']) comment
)
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON STRPOS(comment, word) > 0 
GROUP BY comment
ORDER BY cnt DESC

Update:

Are there any quick suggestions for matching whole words only?

Wrap each word in \b word boundaries:

#standardSQL
WITH words AS (
  SELECT word
  FROM UNNEST(['a','no','is','d','e']) word
),
comments AS (
  SELECT comment 
  FROM UNNEST(['this is the first sentence', 'this is another comment', 'look another sentence, which is also a comment', 'nope', 'no', 'run']) comment
)
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON REGEXP_CONTAINS(comment, CONCAT(r'\b', word, r'\b')) 
GROUP BY comment
ORDER BY cnt DESC
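
As a rough cross-check of the word-boundary query, the following Python sketch applies the same idea to the same sample data. It uses Python's `re` module rather than BigQuery's regex engine, and the call to `re.escape` is an addition that the SQL above does not have; the function name `count_whole_word_matches` is invented for this example.

```python
import re
from collections import Counter

words = ['a', 'no', 'is', 'd', 'e']
comments = [
    'this is the first sentence',
    'this is another comment',
    'look another sentence, which is also a comment',
    'nope',
    'no',
    'run',
]

def count_whole_word_matches(comments, words):
    """Count, per comment, how many words occur as whole words."""
    counts = Counter()
    for comment in comments:
        for word in words:
            # \b on both sides requires a word boundary around the match.
            pattern = r'\b' + re.escape(word) + r'\b'
            if re.search(pattern, comment):
                counts[comment] += 1
    return counts

whole_word_counts = count_whole_word_matches(comments, words)
for comment, cnt in whole_word_counts.most_common():
    print(cnt, comment)
```

With word boundaries, 'no' no longer matches inside 'nope' or 'another', and 'is' no longer matches inside 'this'.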

score 1 · Accepted Answer

If I understand correctly, I think you need a query like this:

select comment, count(*) cnt
from comments
join words
  on comment like '% ' + word + ' %'   --this checks for `... word ..`; a word between spaces
  or comment like word + ' %'          --this checks for `word ..`; a word at the start of comment
  or comment like '% ' + word          --this checks for `.. word`; a word at the end of comment
  or comment = word                    --this checks for `word`; whole comment is the word
group by comment
order by count(*) desc

SQL Server Fiddle demo as an example.
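
The four LIKE conditions above treat a word as "whole" only when it is delimited by spaces or by the start or end of the comment. Here is a small Python sketch of that rule, useful for seeing where it differs from a word-boundary regex. The helper name `like_whole_word` is hypothetical, and the sketch assumes case-sensitive comparison and words that contain no LIKE wildcards (`%` or `_`); the actual behavior in SQL Server depends on the collation in use.

```python
def like_whole_word(comment, word):
    """Emulate the four space-delimited LIKE / equality checks."""
    return (
        f" {word} " in comment              # '% word %': word between spaces
        or comment.startswith(word + " ")   # 'word %': word at the start
        or comment.endswith(" " + word)     # '% word': word at the end
        or comment == word                  # whole comment is the word
    )

# A word at the end of the comment is found.
print(like_whole_word('this is the first sentence', 'sentence'))
# A word followed by a comma is missed, because only spaces count as delimiters.
print(like_whole_word('look another sentence, which is also a comment', 'sentence'))
```

Note that the query uses `+` for string concatenation, which is SQL Server syntax, so it would need adapting (for example with CONCAT) before running in BigQuery.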
