
I have 3 tables. Only the first two matter here (the third is included for context):

author {id, name}
authorship {id, id1, id2}
paper {id, title}

authorship links authors to papers: authorship.id1 refers to author.id and authorship.id2 refers to paper.id.


For every pair of co-authors I want to compute a weight:

w=1 - union_of_common_papers/intersection_of_common_papers

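So, with help from stackoverflow, I built an SQL script that returns every pair of co-authors together with the union and intersection counts of their papers. I will process the data in Java afterwards. The script is: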

SELECT DISTINCT a1.name, a2.name, (
  SELECT  concat(count(a.id2), ',', count(DISTINCT a.id2)) 
  FROM authorship a 
  WHERE a.id1=a1.id or a.id1=a2.id) as weight
FROM authorship au1 
INNER JOIN authorship au2 ON au1.id2 = au2.id2 AND au1.id1 <> au2.id1 
INNER JOIN author a1 ON au1.id1 = a1.id 
INNER JOIN author a2 ON au2.id1 = a2.id;


| name            | name                | weight  |
| Kurt            | Michael             | 161,157 |
| Kurt            | Miron               | 138,134 |
| Kurt            | Manish              | 19,18   |
| Roy             | Gregory             | 21,20   |
| Roy             | Richard             | 74,71   |

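The weight column holds two numbers a,b: b (the DISTINCT count) is the size of the union of the two authors' papers, and a-b is the number of papers they have in common (the intersection).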
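To make the meaning of the two numbers concrete, here is a minimal Python sketch that splits one weight value into its intersection and union. The helper name `split_weight` is hypothetical, and the arithmetic assumes authorship holds no duplicate (author, paper) rows, so that count(a.id2) equals the two authors' paper counts added together.

```python
def split_weight(weight: str) -> tuple[int, int]:
    """Return (intersection, union) from an 'a,b' weight string.

    a = count(a.id2): authorship rows of both authors, i.e. |A| + |B|
    b = count(DISTINCT a.id2): distinct papers, i.e. |A union B|
    Assumes no duplicate (author, paper) rows in authorship.
    """
    total_rows, distinct_papers = (int(part) for part in weight.split(","))
    # Inclusion-exclusion: |A intersect B| = |A| + |B| - |A union B|
    return total_rows - distinct_papers, distinct_papers


# First row of the sample output (Kurt / Michael)
intersection, union = split_weight("161,157")
print(intersection, union)
```

For the Kurt / Michael row this gives 4 common papers out of 157 distinct ones.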


  (SELECT  concat(count(a.id2), ',', count(DISTINCT a.id2)) 
  FROM authorship a 
  WHERE a.id1=a1.id or a.id1=a2.id) as weight

Without this subquery, all records (1M+) are returned in under 2 minutes. With it, 50 records take more than 1.5 minutes.

I am using MySQL on Linux from the command line.


  • author has ~130,000 records
  • authorship has ~1,300,000 records
  • the query should return ~1,200,000 records


| id | select_type        | table | type   | possible_keys       | key       | key_len | ref          | rows    | Extra           |
|  1 | PRIMARY            | a1    | ALL    | PRIMARY             | NULL      | NULL    | NULL         |  124768 | Using temporary |
|  1 | PRIMARY            | au1   | ref    | NewIndex1,NewIndex2 | NewIndex1 | 5       | dblp.a1.ID   |       4 | Using where     |
|  1 | PRIMARY            | au2   | ref    | NewIndex1,NewIndex2 | NewIndex2 | 5       | dblp.au1.id2 |       1 | Using where     |
|  1 | PRIMARY            | a2    | eq_ref | PRIMARY             | PRIMARY   | 4       | dblp.au2.id1 |       1 |                 |
|  2 | DEPENDENT SUBQUERY | a     | ALL    | NewIndex1           | NULL      | NULL    | NULL         | 1268557 | Using where     |
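
The last row of the plan is the expensive one: a DEPENDENT SUBQUERY is re-evaluated for the rows of the outer query, and type ALL with no key means each evaluation scans the whole authorship table (about 1.27M rows). As a rough back-of-the-envelope sketch, assuming the measured rate of 50 rows per 1.5 minutes stays constant across the full result set (an assumption, not a measurement), the complete query would take on the order of weeks:

```python
# Figures quoted in the question
rows_timed = 50
seconds_for_timed_rows = 1.5 * 60  # "more than 1.5 minutes", so a lower bound
expected_result_rows = 1_200_000

# Linear extrapolation: assumes every output row costs the same as the first 50
seconds_total = expected_result_rows / rows_timed * seconds_for_timed_rows
days_total = seconds_total / 86_400  # seconds per day

print(f"{days_total:.1f} days")
```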

1 Answer





SELECT a1.name, a2.name,
       COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end) as CommonPapers,
       COUNT(distinct au1.id2) + COUNT(distinct au2.id2) - COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end) as TotalPapers
FROM authorship au1 INNER JOIN
     authorship au2
     ON au1.id2 = au2.id2 AND au1.id1 <> au2.id1 INNER JOIN
     author a1
     ON au1.id1 = a1.id INNER JOIN
     author a2
     ON au2.id1 = a2.id
group by a1.name, a2.name;


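Because of the initial inner join, the query above computes the intersection correctly but not the total: the join keeps only the papers the two authors share. One way around this is a full outer join, but MySQL does not support that. We can get the same result with additional subqueries that count each author's papers: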

SELECT a1.name, a2.name,
       COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end) as CommonPapers,
       -- union = papers of author 1 + papers of author 2 - common papers
       (ap1.NumPapers + ap2.NumPapers - COUNT(distinct case when au1.id2 = au2.id2 then au1.id2 end)
       ) as TotalPapers
FROM authorship au1 INNER JOIN
     authorship au2
     ON au1.id2 = au2.id2 AND au1.id1 <> au2.id1 INNER JOIN
     author a1
     ON au1.id1 = a1.id INNER JOIN
     author a2
     ON au2.id1 = a2.id inner join
     -- papers per author, computed once instead of once per output row
     (select au.id1, count(*) as numpapers
      from authorship au
      group by au.id1
     ) ap1
     on ap1.id1 = au1.id1 inner join
     (select au.id1, count(*) as numpapers
      from authorship au
      group by au.id1
     ) ap2
     on ap2.id1 = au2.id1
group by a1.name, a2.name;
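
The idea behind this query (count each author's papers once, then derive the union as papers of author 1 plus papers of author 2 minus their common papers) can be checked on toy data. The sketch below is an assumption-laden illustration, not the answerer's test: it uses Python's built-in `sqlite3` with an in-memory database instead of MySQL, three made-up authors, and a slightly simplified form of the query with the per-author subqueries grouped by id1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE authorship (id INTEGER PRIMARY KEY, id1 INTEGER, id2 INTEGER);
INSERT INTO author VALUES (1, 'Kurt'), (2, 'Michael'), (3, 'Roy');
-- Toy data: Kurt wrote papers 10, 11, 12; Michael 10, 11, 13; Roy 12
INSERT INTO authorship (id1, id2) VALUES
  (1, 10), (1, 11), (1, 12),
  (2, 10), (2, 11), (2, 13),
  (3, 12);
""")

QUERY = """
SELECT a1.name, a2.name,
       COUNT(DISTINCT au1.id2) AS CommonPapers,
       ap1.numpapers + ap2.numpapers - COUNT(DISTINCT au1.id2) AS TotalPapers
FROM authorship au1
JOIN authorship au2 ON au1.id2 = au2.id2 AND au1.id1 <> au2.id1
JOIN author a1 ON au1.id1 = a1.id
JOIN author a2 ON au2.id1 = a2.id
JOIN (SELECT id1, COUNT(*) AS numpapers FROM authorship GROUP BY id1) ap1
     ON ap1.id1 = au1.id1
JOIN (SELECT id1, COUNT(*) AS numpapers FROM authorship GROUP BY id1) ap2
     ON ap2.id1 = au2.id1
GROUP BY a1.name, a2.name
ORDER BY a1.name, a2.name
"""

# One row per ordered pair of co-authors: (name1, name2, common, union)
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Kurt and Michael share papers 10 and 11, so their row shows 2 common papers and 3 + 3 - 2 = 4 in total. Michael and Roy share nothing and do not appear, since the inner join only produces pairs that have at least one paper in common.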
answered 2013-06-03T21:32:47.303