sql - Reporting queries: best way to join multiple fact tables?

Question

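I'm working on a reporting system that lets users arbitrarily query a set of fact tables, constraining on multiple dimension tables for each fact table. I've written a query-builder class that automatically assembles all the correct joins and subqueries based on the constraint parameters, and everything works as designed.

But I have a feeling that I'm not generating the most efficient queries. On a set of tables with a few million records, these queries take about 10 seconds to run, and I'd like to get them down to under a second. I suspect that if I could get rid of the subqueries, the result would be much more efficient.

Rather than show you my actual schema (which is much more complicated), I'll show you an analogous example that illustrates the point without having to explain my entire application and data model.

Imagine I have a database of concert information, with artists and venues. Users can arbitrarily tag both artists and venues. So the schema looks like this: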

concert
  id
  artist_id
  venue_id
  date

artist
  id
  name

venue
  id
  name

tag
  id
  name

artist_tag
  artist_id
  tag_id

venue_tag
  venue_id
  tag_id

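Pretty simple.

Now suppose I want to query the database for all concerts happening within one month of today, for all artists tagged 'techno' and 'trombone', performing at venues tagged 'cheap-beer' and 'great-mosh-pits'.

The best query I've been able to come up with looks like this: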

SELECT
  concert.id AS concert_id,
  concert.date AS concert_date,
  artist.id AS artist_id,
  artist.name AS artist_name,
  venue.id AS venue_id,
  venue.name AS venue_name
FROM
  concert
INNER JOIN artist
  ON artist.id = concert.artist_id
INNER JOIN venue
  ON venue.id = concert.venue_id
WHERE (
  artist.id IN (
    -- one artist_tag row per required tag: a single row cannot carry both
    SELECT at_a.artist_id
    FROM artist_tag AS at_a
    INNER JOIN tag AS a ON (
      a.id = at_a.tag_id
      AND
      a.name = 'techno'
    ) INNER JOIN artist_tag AS at_b ON (
      at_b.artist_id = at_a.artist_id
    ) INNER JOIN tag AS b ON (
      b.id = at_b.tag_id
      AND
      b.name = 'trombone'
    )
  )
  AND
  venue.id IN (
    -- same pattern for the venue: one venue_tag row per required tag
    SELECT vt_a.venue_id
    FROM venue_tag AS vt_a
    INNER JOIN tag AS a ON (
      a.id = vt_a.tag_id
      AND
      a.name = 'cheap-beer'
    ) INNER JOIN venue_tag AS vt_b ON (
      vt_b.venue_id = vt_a.venue_id
    ) INNER JOIN tag AS b ON (
      b.id = vt_b.tag_id
      AND
      b.name = 'great-mosh-pits'
    )
  )
  AND
  concert.date BETWEEN NOW() AND (NOW() + INTERVAL 1 MONTH)
)

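The query works, but I really don't like having those multiple subqueries. If I could express the same logic purely with JOINs, I have a feeling performance would improve drastically.

In a perfect world, I'd be using a real OLAP server. But my customers will be deploying to MySQL or MSSQL or Postgres, and I can't guarantee that a compatible OLAP engine will be available. So I'm stuck using an ordinary RDBMS with a star schema.

Don't get too hung up on the details of this example (my real application has nothing to do with music, but it has multiple fact tables with relationships analogous to the ones shown here). In this model, the 'artist_tag' and 'venue_tag' tables function as fact tables, and everything else is a dimension.

It's important to note that, in this example, the query is much simpler to write if I only allow the user to constrain against a single artist tag or venue tag value. It only gets really tricky when I allow the queries to include AND logic, requiring multiple distinct tags.

So, my question is: what are the best techniques you know of for writing efficient queries against multiple fact tables?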

score 2 · Accepted Answer

My approach is a bit more generic, putting the filter parameters in tables and then using GROUP BY, HAVING and COUNT to filter the results. I've used this basic approach several times for some very sophisticated 'searching' and it works very well (for me grin).

I also don't join on the artist and venue dimension tables initially. I'd get the results as IDs (needing only artist_tag and venue_tag), then join the results to the artist and venue tables to get those dimension values. (Basically, search for the entity IDs in a subquery, then get the dimension values you need in an outer query. Keeping them separate should improve things...)

DECLARE @artist_filter TABLE (
  tag_id INT
)

DECLARE @venue_filter TABLE (
  tag_id INT
)

INSERT INTO @artist_filter
SELECT id FROM tag
WHERE name IN ('techno','trombone')

INSERT INTO @venue_filter
SELECT id FROM tag
WHERE name IN ('cheap-beer','great-mosh-pits')


SELECT
  concert.id AS concert_id,
  concert.date AS concert_date,
  concert.artist_id AS artist_id,
  concert.venue_id AS venue_id
FROM
  concert
INNER JOIN
  artist_tag
    ON artist_tag.artist_id = concert.artist_id
INNER JOIN
  @artist_filter AS [artist_filter]
    ON [artist_filter].tag_id = artist_tag.tag_id
INNER JOIN
  venue_tag
    ON venue_tag.venue_id = concert.venue_id
INNER JOIN
  @venue_filter AS [venue_filter]
    ON [venue_filter].tag_id = venue_tag.tag_id
WHERE
  -- T-SQL date arithmetic, to match the table variables above
  concert.date BETWEEN GETDATE() AND DATEADD(MONTH, 1, GETDATE())
GROUP BY
  concert.id,
  concert.date,
  concert.artist_id,
  concert.venue_id
HAVING
  -- a concert qualifies only if every tag in each filter was matched
  COUNT(DISTINCT [artist_filter].tag_id) = (SELECT COUNT(*) FROM @artist_filter)
  AND
  COUNT(DISTINCT [venue_filter].tag_id)  = (SELECT COUNT(*) FROM @venue_filter)

(I'm on a netbook and suffering for it, so I'll leave out the outer query getting the artist and venue names from the artist and venue tables grin)
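
For what it's worth, here is a minimal sketch of the same GROUP BY / HAVING COUNT idea, including the outer query for the names, written as Python against an in-memory SQLite database so it can be run as-is. The SQLite dialect, the ISO-string dates, the find_concerts helper, the tag-name IN lists (standing in for the table variables) and the sample rows are all assumptions for illustration, not part of the answer above.

```python
import sqlite3


def find_concerts(conn, artist_tags, venue_tags, start, end):
    """Concerts in [start, end] whose artist has ALL artist_tags
    and whose venue has ALL venue_tags (hypothetical helper)."""
    a_marks = ",".join("?" * len(artist_tags))
    v_marks = ",".join("?" * len(venue_tags))
    sql = f"""
        SELECT c.id, c.date, a.name, v.name
        FROM (
            -- inner query: work on IDs only, no dimension tables yet
            SELECT concert.id AS concert_id
            FROM concert
            JOIN artist_tag ON artist_tag.artist_id = concert.artist_id
            JOIN tag AS atag ON atag.id = artist_tag.tag_id
                            AND atag.name IN ({a_marks})
            JOIN venue_tag ON venue_tag.venue_id = concert.venue_id
            JOIN tag AS vtag ON vtag.id = venue_tag.tag_id
                            AND vtag.name IN ({v_marks})
            WHERE concert.date BETWEEN ? AND ?
            GROUP BY concert.id
            HAVING COUNT(DISTINCT atag.id) = ?
               AND COUNT(DISTINCT vtag.id) = ?
        ) AS hits
        -- outer query: fetch the dimension values for the surviving IDs
        JOIN concert AS c ON c.id = hits.concert_id
        JOIN artist AS a ON a.id = c.artist_id
        JOIN venue AS v ON v.id = c.venue_id
        ORDER BY c.id
    """
    params = [*artist_tags, *venue_tags, start, end,
              len(set(artist_tags)), len(set(venue_tags))]
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE venue (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE concert (id INTEGER PRIMARY KEY, artist_id INT,
                          venue_id INT, date TEXT);
    CREATE TABLE artist_tag (artist_id INT, tag_id INT);
    CREATE TABLE venue_tag (venue_id INT, tag_id INT);
    INSERT INTO artist VALUES (1, 'Brass Machine'), (2, 'Plain Techno');
    INSERT INTO venue VALUES (1, 'The Pit'), (2, 'Wine Bar');
    INSERT INTO tag VALUES (1, 'techno'), (2, 'trombone'),
                           (3, 'cheap-beer'), (4, 'great-mosh-pits');
    INSERT INTO artist_tag VALUES (1, 1), (1, 2), (2, 1);
    INSERT INTO venue_tag VALUES (1, 3), (1, 4), (2, 3);
    INSERT INTO concert VALUES (10, 1, 1, '2024-06-10'),
                               (11, 2, 1, '2024-06-11'),
                               (12, 1, 2, '2024-06-12'),
                               (13, 1, 1, '2024-09-01');
""")

rows = find_concerts(conn, ["techno", "trombone"],
                     ["cheap-beer", "great-mosh-pits"],
                     "2024-06-01", "2024-07-01")
print(rows)  # [(10, '2024-06-10', 'Brass Machine', 'The Pit')]
```

Concert 11 drops out because its artist lacks 'trombone', concert 12 because its venue lacks 'great-mosh-pits', and concert 13 because it falls outside the date window.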

EDIT
Note:

Another option would be to filter the artist_tag and venue_tag tables in sub-queries/derived-tables. Whether this is worth it depends on how influential the join on the concert table is. My assumption here is that there are MANY artists and venues, but once filtered on the concert table (itself filtered by the dates) the number of artists/venues decreases dramatically.

Also, there is often a need/desire to deal with the case where NO artist_tags and/or venue_tags are specified. From experience it is better to deal with this programmatically. That is, use IF statements and queries specially suited to those cases. A single SQL query CAN be written to handle it, but it is much slower than the programmatic alternative. Equally, writing similar queries several times may look messy and degrade maintainability, but the extra complexity needed to make this a single query is often harder to maintain.
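
One way to read "deal with this programmatically" is to let the query builder leave a filter out entirely when no tags were given for it. The sketch below assumes that reading; build_query, run, the SQLite dialect and the sample rows are hypothetical and only meant to show the branching.

```python
import sqlite3


def build_query(artist_tag_ids, venue_tag_ids):
    """Return (sql, params), adding a tag filter only when one was requested."""
    joins, join_params, having, having_params = [], [], [], []
    for table, key, ids in (("artist_tag", "artist_id", artist_tag_ids),
                            ("venue_tag", "venue_id", venue_tag_ids)):
        if not ids:
            continue  # no tags given: skip this join and its HAVING test
        marks = ",".join("?" * len(ids))
        joins.append(
            f"JOIN {table} ON {table}.{key} = concert.{key} "
            f"AND {table}.tag_id IN ({marks})"
        )
        join_params.extend(ids)
        having.append(f"COUNT(DISTINCT {table}.tag_id) = ?")
        having_params.append(len(set(ids)))
    sql = "SELECT concert.id FROM concert " + " ".join(joins)
    if having:
        sql += " GROUP BY concert.id HAVING " + " AND ".join(having)
    sql += " ORDER BY concert.id"
    # placeholders appear in the SQL text as: join values first, counts last
    return sql, join_params + having_params


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE concert (id INTEGER PRIMARY KEY, artist_id INT, venue_id INT);
    CREATE TABLE artist_tag (artist_id INT, tag_id INT);
    CREATE TABLE venue_tag (venue_id INT, tag_id INT);
    INSERT INTO concert VALUES (10, 1, 1), (11, 2, 1), (12, 1, 2);
    INSERT INTO artist_tag VALUES (1, 100), (1, 200), (2, 100);
    INSERT INTO venue_tag VALUES (1, 1), (1, 2), (2, 1);
""")


def run(artist_tag_ids, venue_tag_ids):
    sql, params = build_query(artist_tag_ids, venue_tag_ids)
    return [row[0] for row in conn.execute(sql, params)]


print(run([], []))              # [10, 11, 12]
print(run([100, 200], []))      # [10, 12]
print(run([], [1, 2]))          # [10, 11]
print(run([100, 200], [1, 2]))  # [10]
```

Each branch produces a query with no dead joins in it, which is the point of handling the empty case outside the SQL.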

EDIT

Another similar layout could be...
- Filter concert by artist as sub_query/derived_table
- Filter results by venue as sub_query/derived_table
- Join results on dimension tables to get names, etc

(Cascaded filtering)

SELECT
   <blah>
FROM
  (
    SELECT
      <blah>
    FROM
      (
        SELECT
          <blah>
        FROM
          concert
        INNER JOIN
          artist_tag
        INNER JOIN
          artist_filter
        WHERE
        GROUP BY
        HAVING
      )
    INNER JOIN
      venue_tag
    INNER JOIN
      venue_filter
    GROUP BY
    HAVING
  )
INNER JOIN
  artist
INNER JOIN
  venue

By cascading the filtering, each subsequent filter has a reduced set to work on. This MAY reduce the work done by the GROUP BY - HAVING section of the query. For two levels of filtering I would guess this is unlikely to be dramatic.
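
Filled in, the cascaded skeleton might look something like the sketch below. It is written for SQLite and run from Python so it is self-contained; the derived-table aliases (by_artist, by_venue), the temp filter tables standing in for the table variables, the named date parameters and the sample rows are all assumptions for illustration.

```python
import sqlite3

CASCADED_SQL = """
SELECT by_venue.concert_id, artist.name, venue.name
FROM (
    SELECT by_artist.concert_id AS concert_id,
           by_artist.artist_id  AS artist_id,
           by_artist.venue_id   AS venue_id
    FROM (
        -- level 1: concerts whose artist carries every requested artist tag
        SELECT concert.id AS concert_id,
               concert.artist_id AS artist_id,
               concert.venue_id AS venue_id
        FROM concert
        JOIN artist_tag ON artist_tag.artist_id = concert.artist_id
        JOIN artist_filter ON artist_filter.tag_id = artist_tag.tag_id
        WHERE concert.date BETWEEN :start AND :end
        GROUP BY concert.id, concert.artist_id, concert.venue_id
        HAVING COUNT(DISTINCT artist_filter.tag_id)
               = (SELECT COUNT(*) FROM artist_filter)
    ) AS by_artist
    -- level 2: venue filtering only sees the survivors of level 1
    JOIN venue_tag ON venue_tag.venue_id = by_artist.venue_id
    JOIN venue_filter ON venue_filter.tag_id = venue_tag.tag_id
    GROUP BY by_artist.concert_id, by_artist.artist_id, by_artist.venue_id
    HAVING COUNT(DISTINCT venue_filter.tag_id)
           = (SELECT COUNT(*) FROM venue_filter)
) AS by_venue
-- level 3: only now touch the dimension tables
JOIN artist ON artist.id = by_venue.artist_id
JOIN venue ON venue.id = by_venue.venue_id
ORDER BY by_venue.concert_id
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE venue (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE concert (id INTEGER PRIMARY KEY, artist_id INT,
                          venue_id INT, date TEXT);
    CREATE TABLE artist_tag (artist_id INT, tag_id INT);
    CREATE TABLE venue_tag (venue_id INT, tag_id INT);
    INSERT INTO artist VALUES (1, 'Brass Machine'), (2, 'Plain Techno');
    INSERT INTO venue VALUES (1, 'The Pit'), (2, 'Wine Bar');
    INSERT INTO tag VALUES (1, 'techno'), (2, 'trombone'),
                           (3, 'cheap-beer'), (4, 'great-mosh-pits');
    INSERT INTO artist_tag VALUES (1, 1), (1, 2), (2, 1);
    INSERT INTO venue_tag VALUES (1, 3), (1, 4), (2, 3);
    INSERT INTO concert VALUES (10, 1, 1, '2024-06-10'),
                               (11, 2, 1, '2024-06-11'),
                               (12, 1, 2, '2024-06-12'),
                               (13, 1, 1, '2024-09-01');
    -- filter tables, playing the role of @artist_filter / @venue_filter
    CREATE TEMP TABLE artist_filter AS
        SELECT id AS tag_id FROM tag WHERE name IN ('techno', 'trombone');
    CREATE TEMP TABLE venue_filter AS
        SELECT id AS tag_id FROM tag
        WHERE name IN ('cheap-beer', 'great-mosh-pits');
""")

cascaded_rows = conn.execute(
    CASCADED_SQL, {"start": "2024-06-01", "end": "2024-07-01"}
).fetchall()
print(cascaded_rows)  # [(10, 'Brass Machine', 'The Pit')]
```

In this toy data, level 1 passes concerts 10 and 12 on to level 2, so the venue GROUP BY never sees concert 11 at all.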

The original may still be more performant, as it benefits from additional filtering in a different manner. In your example:
- There may be many artists in your date range, but few which meet at least one criterion
- There may be many venues in your date range, but few which meet at least one criterion
- Before the GROUP BY, however, all concerts are eliminated where...
---> the artist meets NONE of the criteria
---> AND/OR the venue meets NONE of the criteria

Where you are searching by many criteria, this filtering degrades. It also degrades where venues and/or artists share a lot of tags.

So when would I use the original, or when would I use the Cascaded version?
- Original : Few search criteria and venues/artists are dissimilar from each other
- Cascaded : Lots of search criteria or venues/artists tend to be similar

score 1 · Accepted Answer

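Denormalize the model. Include the tag names in the venue and artist tables. That way you avoid the many-to-many relationships and you have a simple star schema.

With this denormalization applied, the WHERE clause only needs to check this extra tag_name field in the two tables (artist and venue).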

score 0 · Accepted Answer

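This situation is not technically multiple fact tables. You have many-to-many relationships between venues and tags, and between artists and tags.

I think MatBailie has provided some interesting examples above, but I feel this can be much simpler if you handle the parameters in your application in a helpful way.

Apart from the user-generated queries on the fact table, you first need two static queries to provide the parameter options to the user. One of them is a list of the tags that apply to venues, the other is the tags that apply to artists.

Tags that apply to venues: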

SELECT DISTINCT tag_id, tag.name as VenueTagName
FROM venue_tag 
INNER JOIN tag 
ON venue_tag.tag_id = tag.id

Tags that apply to artists:

SELECT DISTINCT tag_id, tag.name as ArtistTagName
FROM artist_tag 
INNER JOIN tag 
ON artist_tag.tag_id = tag.id

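Those two queries drive some drop-downs or other parameter selection controls. In a reporting system, you should try to avoid passing string variables around. In your application, you present the string names of the variables to the user, but pass the integer IDs back to the database.

For example, when the user selects tags, you take the tag.id values and feed them into your query (the (1,2) and (100,200) bits below):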

SELECT
  concert.id AS concert_id,
  concert.date AS concert_date,
  artist.id AS artist_id,
  artist.name AS artist_name,
  venue.id AS venue_id,
  venue.name AS venue_name
FROM
  concert
INNER JOIN artist
    ON artist.id = concert.artist_id
INNER JOIN artist_tag
    ON artist.id = artist_tag.artist_id
INNER JOIN venue
    ON venue.id = concert.venue_id
INNER JOIN venue_tag
    ON venue.id = venue_tag.venue_id
WHERE venue_tag.tag_id IN (1, 2) -- Assumes that the IDs 1 and 2 map to 'cheap-beer' and 'great-mosh-pits'
AND   artist_tag.tag_id IN (100, 200) -- Assumes that the IDs 100 and 200 map to 'techno' and 'trombone'. Sounds like a wild night of drunken moshing to brass band techno!
AND   concert.date BETWEEN NOW() AND (NOW() + INTERVAL 1 MONTH)
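
As a rough sketch of that flow (options query, then integer IDs bound as parameters), the following runs against an in-memory SQLite database. The helper names tag_options and concerts_by_tag_ids, the extra 'jazz' tag and the sample rows are assumptions made for the example; the date condition is left out to keep it short.

```python
import sqlite3


def tag_options(conn, link_table):
    """(tag_id, name) pairs actually in use in artist_tag or venue_tag."""
    if link_table not in ("artist_tag", "venue_tag"):
        raise ValueError(link_table)  # table names cannot be bound, so whitelist
    sql = f"""
        SELECT DISTINCT {link_table}.tag_id, tag.name
        FROM {link_table}
        JOIN tag ON {link_table}.tag_id = tag.id
        ORDER BY {link_table}.tag_id
    """
    return conn.execute(sql).fetchall()


def concerts_by_tag_ids(conn, artist_tag_ids, venue_tag_ids):
    """Bind the selected integer IDs as parameters rather than tag strings."""
    a_marks = ",".join("?" * len(artist_tag_ids))
    v_marks = ",".join("?" * len(venue_tag_ids))
    sql = f"""
        SELECT DISTINCT concert.id
        FROM concert
        JOIN artist_tag ON artist_tag.artist_id = concert.artist_id
        JOIN venue_tag ON venue_tag.venue_id = concert.venue_id
        WHERE venue_tag.tag_id IN ({v_marks})
          AND artist_tag.tag_id IN ({a_marks})
        ORDER BY concert.id
    """
    params = [*venue_tag_ids, *artist_tag_ids]
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE concert (id INTEGER PRIMARY KEY, artist_id INT, venue_id INT);
    CREATE TABLE artist_tag (artist_id INT, tag_id INT);
    CREATE TABLE venue_tag (venue_id INT, tag_id INT);
    INSERT INTO tag VALUES (1, 'cheap-beer'), (2, 'great-mosh-pits'),
                           (100, 'techno'), (200, 'trombone'), (300, 'jazz');
    INSERT INTO concert VALUES (10, 1, 1), (11, 2, 1), (12, 1, 2);
    INSERT INTO artist_tag VALUES (1, 100), (1, 200), (2, 100);
    INSERT INTO venue_tag VALUES (1, 1), (1, 2), (2, 1);
""")

venue_options = tag_options(conn, "venue_tag")
artist_options = tag_options(conn, "artist_tag")
print(venue_options)   # [(1, 'cheap-beer'), (2, 'great-mosh-pits')]
print(artist_options)  # [(100, 'techno'), (200, 'trombone')]

matches = concerts_by_tag_ids(conn, [100, 200], [1, 2])
print(matches)  # [10, 11, 12]
```

One thing the sketch makes visible: IN matches a concert when any of the listed tags is present, so concerts 11 and 12 come back even though their artist or venue carries only one of the two tags, and DISTINCT is needed to collapse the one-row-per-tag-pair duplicates. Requiring all tags would still need a COUNT(DISTINCT ...) check in a HAVING clause, as in the GROUP BY answer earlier in the thread.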
