
I have a database with the following tables:

books          (primary key: bookID)
characterNames (foreign key: books.bookID) 
locations      (foreign key: books.bookID)

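The text positions of character names and locations are stored in the corresponding tables.
I am writing a Python script using psycopg2 that finds all occurrences of given character names and locations in books. I only want the occurrences in books where both the character name and the location are found.
Here is the solution I already have for searching one location and one character: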

WITH b AS (  
    SELECT bookid  
    FROM   characternames  
    WHERE  name = 'XXX'  
    GROUP  BY 1  
    INTERSECT  
    SELECT bookid  
    FROM   locations  
    WHERE  locname = 'YYY'  
    GROUP  BY 1  
    )  
SELECT bookid, position, 'char' AS what  
FROM   b  
JOIN   characternames USING (bookid)  
WHERE  name = 'XXX'  
UNION  ALL  
SELECT bookid, position, 'loc' AS what  
FROM   b  
JOIN   locations USING (bookid)  
WHERE  locname = 'YYY'  
ORDER  BY bookid, position;  

The CTE 'b' contains all bookids in which both the character name 'XXX' and the location 'YYY' appear.

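Now I would also like to search for 2 locations and one name (or 2 names and one location, respectively). This is simple if all searched entities have to occur in a book, but what about this:
Search for: Tim, Al, Toolshop. Result: books that include
(Tim, Al, Toolshop) or
(Tim, Al) or
(Tim, Toolshop) or
(Al, Toolshop)

The problem could repeat for 4, 5, 6... conditions.
I thought about INTERSECTing more subqueries, but that would not work.
Instead, I would UNION the bookids found, group them, and select the bookids that occur more than once: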

WITH b AS (  
    SELECT bookid, count(bookid) AS occurrences  
    FROM  
        (SELECT DISTINCT bookid  
        FROM characterNames  
        WHERE name='XXX'  
        UNION  
        SELECT DISTINCT bookid  
        FROM characterNames  
        WHERE name='YYY'  
        UNION  
        SELECT DISTINCT bookid  
        FROM locations  
        WHERE locname='ZZZ'  
        GROUP BY bookid)  
    WHERE occurrences>1)  

I think this could work. I cannot test it at the moment, but is it the best way to do this?


1 Answer


The idea to use a count for the general case is sound. A few adjustments to the syntax, though:

WITH b AS (  
   SELECT bookid
   FROM  (
      SELECT DISTINCT bookid  
      FROM   characterNames  
      WHERE  name='XXX'  

      UNION ALL  
      SELECT DISTINCT bookid  
      FROM   characterNames  
      WHERE  name='YYY'  

      UNION ALL
      SELECT DISTINCT bookid  
      FROM   locations  
      WHERE  locname='ZZZ'  
      ) x
   GROUP  BY bookid
   HAVING count(*) > 1
   )
SELECT bookid, position, 'char' AS what
FROM   b
JOIN   characternames USING (bookid)
WHERE  name IN ('XXX', 'YYY')

UNION  ALL
SELECT bookid, position, 'loc' AS what
FROM   b
JOIN   locations USING (bookid)
WHERE  locname = 'ZZZ'
ORDER  BY bookid, position;

Notes

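  • Use UNION ALL (not UNION) to preserve duplicates between the subqueries. In this case you want them so that you can count them.

  • The subqueries are supposed to produce distinct values. That works with DISTINCT the way you have it. You may want to try GROUP BY 1 instead and see whether it performs better (I don't expect it to).

  • The GROUP BY has to go outside the subquery. Where you had it, it would only apply to the last subquery, and it makes no sense there because you already have DISTINCT bookid.

  • The check for more than one hit per book has to go into a HAVING clause:

     HAVING count(*) > 1
    

    You cannot use aggregate values in a WHERE clause.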
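
The notes above can be checked on a toy dataset. What follows is only a minimal sketch: it assumes Python's built-in sqlite3 module as a stand-in for PostgreSQL and psycopg2, and the sample rows (books 1 to 4) are made up for illustration. It runs the counting query once with UNION ALL and once with plain UNION to show why the duplicates matter.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the PostgreSQL one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characternames (bookid INT, name TEXT, position INT);
    CREATE TABLE locations      (bookid INT, locname TEXT, position INT);
    -- Made-up sample data:
    --   book 1: Tim, Al, Toolshop   book 2: Tim, Toolshop
    --   book 3: Al only             book 4: Tim, Al
    INSERT INTO characternames VALUES
        (1, 'Tim', 10), (1, 'Tim', 55), (1, 'Al', 20),
        (2, 'Tim', 5),
        (3, 'Al', 7),
        (4, 'Tim', 3), (4, 'Al', 9);
    INSERT INTO locations VALUES
        (1, 'Toolshop', 30), (2, 'Toolshop', 8), (2, 'Toolshop', 40);
""")

# One row per (search term, book) pair, then count the rows per book.
AT_LEAST_TWO = """
    SELECT bookid
    FROM (
        SELECT DISTINCT bookid FROM characternames WHERE name = ?
        UNION ALL
        SELECT DISTINCT bookid FROM characternames WHERE name = ?
        UNION ALL
        SELECT DISTINCT bookid FROM locations WHERE locname = ?
    ) x
    GROUP BY bookid
    HAVING count(*) > 1
    ORDER BY bookid
"""
terms = ("Tim", "Al", "Toolshop")

matching_books = [row[0] for row in conn.execute(AT_LEAST_TWO, terms)]

# Plain UNION removes the duplicates, so every book is counted exactly once.
plain_union = AT_LEAST_TWO.replace("UNION ALL", "UNION")
with_plain_union = [row[0] for row in conn.execute(plain_union, terms)]

print(matching_books)    # books with at least two of the three terms
print(with_plain_union)  # empty: no bookid survives HAVING count(*) > 1
```

With plain UNION every bookid appears once in the derived table, so no group can ever reach a count above 1.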


Combining conditions on one table

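You cannot simply combine multiple conditions on one table. How would you count the number of hits? But there is a somewhat more sophisticated way. It may or may not improve performance; you will have to test (with EXPLAIN ANALYZE). Both queries require at least two index scans on the table characterNames. At the very least it shortens the syntax.

Note how I count the number of hits for characterNames and how I switched to sum(hits) in the outer SELECT: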

WITH b AS (  
   SELECT bookid
   FROM  (
      SELECT bookid
           , max((name='XXX')::int)
           + max((name='YYY')::int) AS hits
      FROM   characterNames  
      WHERE  (name='XXX'
           OR name='YYY')
      GROUP  BY bookid

      UNION ALL
      SELECT DISTINCT bookid, 1 AS hits  
      FROM   locations  
      WHERE  locname='ZZZ'  
      ) x
   GROUP  BY bookid
   HAVING sum(hits) > 1
   )
...

Casting a boolean to integer gives 0 for FALSE and 1 for TRUE. That helps.
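
To see the per-book hit count in isolation, here is a small sketch under the same caveat as any toy example: it uses Python's built-in sqlite3 module rather than PostgreSQL, the sample rows are invented, and it writes the cast as CAST(... AS INTEGER) because SQLite does not accept the `::int` shorthand (PostgreSQL accepts both spellings).

```python
import sqlite3

# Assumption: in-memory SQLite as a stand-in for PostgreSQL; rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE characternames (bookid INT, name TEXT, position INT);
    INSERT INTO characternames VALUES
        (1, 'Tim', 10), (1, 'Tim', 55), (1, 'Al', 20),
        (2, 'Tim', 5),
        (3, 'Bob', 7);
""")

# max() over the 0/1 flags is 1 if the name occurs at least once in the book,
# no matter how often it occurs, so adding the two max() values counts
# how many of the two distinct names were found.
HITS_PER_BOOK = """
    SELECT bookid
         , max(CAST(name = :first  AS INTEGER))
         + max(CAST(name = :second AS INTEGER)) AS hits
    FROM   characternames
    WHERE  name = :first OR name = :second
    GROUP  BY bookid
    ORDER  BY bookid
"""
hits = conn.execute(HITS_PER_BOOK, {"first": "Tim", "second": "Al"}).fetchall()
print(hits)  # one (bookid, hits) row per book containing at least one name
```

Book 1 mentions 'Tim' twice but still contributes only 1 for that name, which is why max() is used rather than sum().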


Faster with EXISTS

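While riding my bike to the office, this kept kicking around in the back of my head. I have reason to believe this query might be even faster. Please give it a try: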

WITH b AS (  
   SELECT bookid, c_hits, l_hits
   FROM  (
      SELECT bookid

           , (EXISTS (
               SELECT *
               FROM   characterNames c
               WHERE  c.bookid = b.bookid
               AND    c.name = 'XXX'))::int
           + (EXISTS (
               SELECT *
               FROM   characterNames c
               WHERE  c.bookid = b.bookid
               AND    c.name = 'YYY'))::int AS c_hits

           , (EXISTS (
               SELECT *
               FROM   locations l
               WHERE  l.bookid = b.bookid
               AND    l.locname='ZZZ'))::int AS l_hits
      FROM   books b  
      ) sub
   -- output column aliases are not visible in WHERE at the same query level,
   -- hence the wrapping subquery
   WHERE  c_hits + l_hits > 1
   )
SELECT c.bookid, c.position, 'char' AS what
FROM   b
JOIN   characternames c USING (bookid)
WHERE  b.c_hits > 0
AND    c.name IN ('XXX', 'YYY')

UNION  ALL
SELECT l.bookid, l.position, 'loc' AS what
FROM   b
JOIN   locations l USING (bookid)
WHERE  b.l_hits > 0
AND    l.locname = 'ZZZ'
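ORDER  BY 1,2,3;

  • An EXISTS semi-join can stop executing at the first match. Since we are only interested in an all-or-nothing answer in the CTE, this could possibly do the job much faster.

  • This way we also do not need to aggregate (no GROUP BY necessary).

  • I also remember whether any characters or locations were found, and only revisit tables with actual matches.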
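
Since the question mentions that the search may grow to 4, 5, 6... terms, the EXISTS pattern can be generated rather than typed out. The following is a sketch only: `build_book_filter` and its parameters are hypothetical names, the default `%s` placeholder is chosen to match psycopg2's parameter style, and the demonstration runs against Python's built-in sqlite3 module (placeholder `?`) with invented rows instead of a real PostgreSQL database. The filter is applied in an outer query because output column aliases cannot be referenced in WHERE at the same level.

```python
import sqlite3


def build_book_filter(n_names, n_locations, min_hits=2, placeholder="%s"):
    """Build SQL returning (bookid, c_hits, l_hits) for books that match
    at least min_hits search terms. Hypothetical helper, not a library API.
    Parameters must be passed as: all names first, then all locations."""

    def exists_sum(table, column, count):
        # One EXISTS per search term; each contributes 0 or 1.
        if count == 0:
            return "0"
        term = (f"CAST(EXISTS (SELECT 1 FROM {table} t "
                f"WHERE t.bookid = b.bookid AND t.{column} = {placeholder}) "
                "AS INTEGER)")
        return " + ".join([term] * count)

    return (
        "SELECT bookid, c_hits, l_hits FROM ("
        f" SELECT b.bookid,"
        f" {exists_sum('characternames', 'name', n_names)} AS c_hits,"
        f" {exists_sum('locations', 'locname', n_locations)} AS l_hits"
        " FROM books b) sub"
        f" WHERE c_hits + l_hits >= {int(min_hits)}"
        " ORDER BY bookid"
    )


# Assumption: in-memory SQLite with made-up rows stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books          (bookid INTEGER PRIMARY KEY);
    CREATE TABLE characternames (bookid INT, name TEXT, position INT);
    CREATE TABLE locations      (bookid INT, locname TEXT, position INT);
    INSERT INTO books VALUES (1), (2), (3), (4);
    INSERT INTO characternames VALUES
        (1, 'Tim', 10), (1, 'Al', 20), (2, 'Tim', 5),
        (3, 'Al', 7), (4, 'Tim', 3), (4, 'Al', 9);
    INSERT INTO locations VALUES (1, 'Toolshop', 30), (2, 'Toolshop', 8);
""")

names = ["Tim", "Al"]
locations = ["Toolshop"]
sql = build_book_filter(len(names), len(locations), placeholder="?")
rows = conn.execute(sql, names + locations).fetchall()
print(rows)  # (bookid, c_hits, l_hits) for books with at least two hits
```

With psycopg2 the call would presumably look like `cur.execute(build_book_filter(len(names), len(locations)), names + locations)`; only the search values travel as parameters, while table and column names stay fixed inside the helper.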

Answered 2012-04-23T01:18:25.290