
I have a table of documents with the following schema:

CREATE TABLE Frequency (
  docid VARCHAR(255),
  term VARCHAR(255),
  count int,
PRIMARY KEY(docid, term));

To find the raw similarity scores between all documents, I would use:

SELECT a.docid, b.docid, SUM(a.count * b.count)
FROM Frequency a, Frequency b
WHERE a.term = b.term
GROUP BY a.docid, b.docid;

I'm not sure why this works, but on test data it does compute D * D^T, where D^T is the transpose of D.

I now need to compute the query/text-string similarity for a search string such as "congress gun laws".

I believe this involves a UNION and a GROUP BY, but all of my query attempts have failed, for example:

SELECT *
FROM Frequency a, Frequency b, Frequency c
Where a.term = b.term 
UNION
SELECT  a.docid, 'congress' as term, 1 as count 
UNION
SELECT  b.docid , 'gun' as term, 1 as count
UNION 
SELECT  c.docid , 'laws' as term, 1 as count 
Group by docid;

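I'm new to this kind of SQL and would appreciate help understanding what I'm doing.

Please explain why the first query works and how to approach the second one.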

2 Answers


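Put simply, what we want to do here is add the query as new tuples to the table, then compare this extended table against the original one using the matrix-transpose operation you mentioned above. You need to "label" these new keywords (here with the docid 'q') so that you can use the label as a condition in the query. So this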

SELECT b.docid, b.term, SUM(a.count * b.count) 
FROM (SELECT * FROM Frequency
      UNION
      SELECT  'q' as docid, 'congress' as term, 1 as count 
      UNION
      SELECT  'q' as docid, 'gun' as term, 1 as count
      UNION 
      SELECT  'q' as docid, 'laws' as term, 1 as count 
     ) a, Frequency b
WHERE a.term = b.term 
AND a.docid = 'q'
GROUP BY b.docid, b.term
ORDER BY SUM(a.count * b.count);

will give you a list of the docids that contain those terms, along with their respective similarity scores.
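To make the idea concrete, here is a minimal sketch that runs essentially the same query against an in-memory SQLite database. The sample rows and the function name `score_documents` are hypothetical, invented for illustration. As a variation on the query above, it groups by `b.docid` only, so each document gets a single total score: the dot product of the query's term vector with that document's term vector, which is the 'q' row of D * D^T.

```python
import sqlite3

# Hypothetical sample data, not taken from the question.
SAMPLE_ROWS = [
    ("d1", "congress", 2), ("d1", "gun", 1),
    ("d2", "gun", 3), ("d2", "laws", 1),
    ("d3", "weather", 5),
]

# Same structure as the query above, but grouped per document
# (variation: GROUP BY b.docid only, highest score first).
SCORE_SQL = """
SELECT b.docid, SUM(a.count * b.count) AS score
FROM (SELECT * FROM Frequency
      UNION
      SELECT 'q' AS docid, 'congress' AS term, 1 AS count
      UNION
      SELECT 'q' AS docid, 'gun' AS term, 1 AS count
      UNION
      SELECT 'q' AS docid, 'laws' AS term, 1 AS count
     ) a, Frequency b
WHERE a.term = b.term
  AND a.docid = 'q'
GROUP BY b.docid
ORDER BY score DESC, b.docid;
"""


def score_documents(rows):
    """Load rows into a fresh Frequency table and score them against the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE Frequency (
               docid VARCHAR(255),
               term VARCHAR(255),
               count INT,
               PRIMARY KEY (docid, term))"""
    )
    conn.executemany("INSERT INTO Frequency VALUES (?, ?, ?)", rows)
    result = conn.execute(SCORE_SQL).fetchall()
    conn.close()
    return result


print(score_documents(SAMPLE_ROWS))
```

With this data, d1 scores 2*1 + 1*1 = 3 and d2 scores 3*1 + 1*1 = 4; d3 shares no term with the query, so it does not appear in the result at all.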

answered 2013-05-21T09:54:51.927

Your question and comments are incomprehensible.

However, the following query shows the occurrence counts of the three terms for every document that contains all three of them:

SELECT a.docid,
       a.count,
       b.count,
       c.count
FROM Frequency AS a
JOIN Frequency AS b ON a.docid = b.docid
JOIN Frequency AS c ON b.docid = c.docid
WHERE a.term = 'congress'
  AND b.term = 'gun'
  AND c.term = 'laws'
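
As a quick check of what this join returns, the following sketch runs it against an in-memory SQLite database. The sample rows and the function name `docs_with_all_terms` are hypothetical, made up for illustration.

```python
import sqlite3

# Hypothetical sample data: d1 has all three terms, d2 lacks 'congress'.
JOIN_SAMPLE_ROWS = [
    ("d1", "congress", 2), ("d1", "gun", 1), ("d1", "laws", 4),
    ("d2", "gun", 3), ("d2", "laws", 1),
]

# The three-way self-join from above, unchanged apart from layout.
ALL_TERMS_SQL = """
SELECT a.docid, a.count, b.count, c.count
FROM Frequency AS a
JOIN Frequency AS b ON a.docid = b.docid
JOIN Frequency AS c ON b.docid = c.docid
WHERE a.term = 'congress'
  AND b.term = 'gun'
  AND c.term = 'laws'
"""


def docs_with_all_terms(rows):
    """Return (docid, congress, gun, laws) counts for documents with all three terms."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE Frequency (
               docid VARCHAR(255),
               term VARCHAR(255),
               count INT,
               PRIMARY KEY (docid, term))"""
    )
    conn.executemany("INSERT INTO Frequency VALUES (?, ?, ?)", rows)
    result = conn.execute(ALL_TERMS_SQL).fetchall()
    conn.close()
    return result


print(docs_with_all_terms(JOIN_SAMPLE_ROWS))
```

Note that this has "all terms must match" semantics: d2 is dropped because it has no 'congress' row, whereas a summed similarity score would still rank it on 'gun' and 'laws'.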
answered 2013-05-18T19:03:25.920