sql - Join directly or store

Question

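I have a table A containing entries that I process regularly, and I store the results in table B. Now I want to determine, for each entry in A, its latest processing date in B.

My current implementation joins the two tables and retrieves the latest date. An alternative, though possibly less flexible, approach would be to store the date directly in table A.

I can think of pros and cons for both approaches (performance, scalability, ...), but I have not been in this situation before and would like to know whether someone on Stack Overflow has faced something similar and can recommend one of the two for a specific reason.

Below is a quick schema design.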

Table A
id, some-data, [possibly-here-last-process-date]

Table B
fk-for-A, data, date

Thanks

score 2 · Accepted Answer

Based on your description, it sounds like Table B is your historical (or archive) table and it's populated by batch.

I would leave Table A alone and just introduce an index on Table B covering the foreign key to A and the date. If the historical table is big, introduce an auto-increment PK for Table B and have a separate table that maps the B PK id to the A PK id.
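A minimal sketch of the "leave Table A alone and index B" option, using Python's built-in sqlite3 so it runs anywhere. The table and column names (a, b, a_id, processed_at) and the sample rows are assumptions made up for illustration; the question only gives a rough schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, some_data TEXT);
    CREATE TABLE b (
        a_id INTEGER NOT NULL REFERENCES a(id),
        data TEXT,
        processed_at TEXT NOT NULL  -- ISO-8601 text sorts chronologically
    );
    -- Composite index so the latest date per a_id can be read from the index.
    CREATE INDEX idx_b_a_id_processed_at ON b (a_id, processed_at);
""")
conn.executemany("INSERT INTO a (id, some_data) VALUES (?, ?)",
                 [(1, "first"), (2, "second"), (3, "never processed")])
conn.executemany("INSERT INTO b (a_id, data, processed_at) VALUES (?, ?, ?)",
                 [(1, "r1", "2024-01-01"), (1, "r2", "2024-03-15"),
                  (2, "r3", "2024-02-10")])

# LEFT JOIN keeps entries of a that have never been processed (NULL date).
latest = conn.execute("""
    SELECT a.id, MAX(b.processed_at) AS last_processed
    FROM a LEFT JOIN b ON b.a_id = a.id
    GROUP BY a.id
    ORDER BY a.id
""").fetchall()
print(latest)  # [(1, '2024-03-15'), (2, '2024-02-10'), (3, None)]
```

Nothing is written to Table A here, so there is no second copy of the date to keep in sync.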

I'm not a fan of UPDATE on a warehouse table, that's why I didn't recommend a CURRENT_IND, but that's an alternative.

score 1 · Accepted Answer

This is a fairly typical question; there are lots of reasonable answers, but there is only one correct approach (in my opinion).

You're basically asking "should I denormalize my schema?". I believe that you should denormalize your schema only if you really, really have to. The way you know you have to is because you can prove that - under current or anticipated circumstances - you have a performance problem with real-life queries.

On modern hardware, with a well-tuned database, finding the latest record in table B by doing a join is almost certainly not going to have a noticeable performance impact unless you have HUGE amounts of data.

So, my recommendation: create a test system, populate the two tables with twice as much data as the system will ever need, and run the queries you have on the production environment. Check the query plans, and see if you can optimize the queries and/or indexing. If you really can't make it work, de-normalize the table.
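As a rough sketch of that kind of test, the snippet below fills an in-memory SQLite database with synthetic rows, prints the query plan, and times the query. The row counts, the table and column names, and the integer "day number" used for processed_at are all assumptions for illustration; a real test should use your own engine, schema, and about twice your forecast volume.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, some_data TEXT);
    CREATE TABLE b (a_id INTEGER NOT NULL, data TEXT, processed_at INTEGER NOT NULL);
    CREATE INDEX idx_b_a_id_processed_at ON b (a_id, processed_at);
""")

N_A, ROWS_PER_A = 1000, 50  # hypothetical volumes, not taken from the question
conn.executemany("INSERT INTO a (id, some_data) VALUES (?, ?)",
                 ((i, f"item {i}") for i in range(N_A)))
conn.executemany("INSERT INTO b (a_id, data, processed_at) VALUES (?, ?, ?)",
                 ((i, "result", day) for i in range(N_A) for day in range(ROWS_PER_A)))

query = """
    SELECT a.id,
           (SELECT MAX(processed_at) FROM b WHERE b.a_id = a.id) AS last_processed
    FROM a
"""

# Inspect the plan first: the last column of each row describes one plan step.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    print(step[-1])

# Then time the real query.
start = time.perf_counter()
rows = conn.execute(query).fetchall()
elapsed = time.perf_counter() - start
print(f"{len(rows)} rows in {elapsed:.4f}s")
```

If the plan shows a full scan of b instead of a search on the index, that is the thing to fix before considering denormalization.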

Whilst this may seem like a lot of work, denormalization is a big deal - in my experience, on a moderately complex system, denormalized data schemas are at the heart of a lot of stupid bugs. It makes introducing new developers harder, it means additional complexity at the application level, and the extra code means more maintenance. In your case, if the code which updates table A fails, you will be producing bogus results without ever knowing about it; an undetected bug could affect lots of data.
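If you do end up denormalizing, a periodic consistency check can catch the silent-drift case described above. This is only a sketch under assumed names (a.last_processed as the stored copy, b.processed_at as the source of truth), again using sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, some_data TEXT, last_processed TEXT);
    CREATE TABLE b (a_id INTEGER NOT NULL, data TEXT, processed_at TEXT NOT NULL);
    INSERT INTO a VALUES (1, 'ok', '2024-03-15'), (2, 'stale', '2024-01-01');
    INSERT INTO b VALUES (1, 'r1', '2024-03-15'),
                         (2, 'r2', '2024-01-01'),
                         (2, 'r3', '2024-02-10');  -- the copy in a was never updated
""")

# Compare the stored copy against the value recomputed from b.
stale = conn.execute("""
    SELECT a.id, a.last_processed, m.actual
    FROM a
    LEFT JOIN (SELECT a_id, MAX(processed_at) AS actual FROM b GROUP BY a_id) AS m
           ON m.a_id = a.id
    WHERE a.last_processed IS NOT m.actual   -- IS NOT also compares NULLs
    ORDER BY a.id
""").fetchall()
print(stale)  # [(2, '2024-01-01', '2024-02-10')]
```

Any row returned means the update of table A was missed at some point for that entry.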

score 0 · Accepted Answer

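We faced a similar situation in our project tracking system, where the latest status of a project is stored in the projects table (cols: project_id, description, etc.) and the history of the project is stored in the project_history table (cols: project_id, update_id, description, etc.). Whenever there is a new update to a project, we need to find the latest update number and add 1 to it to get the sequence number for the next update. We could have done this by grouping the project_history table on the project_id column and taking MAX(update_id), but that would be costly given the number of project updates (a few hundred thousand) and the frequency of updates. So we decided to store the value in the projects table itself, in a max_update_id column, and keep updating it whenever a project gets a new update. HTH.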
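A sketch of how such a counter might be maintained, assuming the counter bump and the history insert happen in one transaction (the answer does not say how it was actually implemented). The projects, project_history, and max_update_id names come from the answer; the add_update helper and the sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE projects (
        project_id INTEGER PRIMARY KEY,
        description TEXT,
        max_update_id INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE project_history (
        project_id INTEGER NOT NULL REFERENCES projects(project_id),
        update_id INTEGER NOT NULL,
        description TEXT,
        PRIMARY KEY (project_id, update_id)
    );
    INSERT INTO projects (project_id, description) VALUES (1, 'demo project');
""")

def add_update(conn, project_id, description):
    """Bump the stored counter and insert the history row in one transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute(
            "UPDATE projects SET max_update_id = max_update_id + 1 WHERE project_id = ?",
            (project_id,))
        (update_id,) = conn.execute(
            "SELECT max_update_id FROM projects WHERE project_id = ?",
            (project_id,)).fetchone()
        conn.execute(
            "INSERT INTO project_history (project_id, update_id, description) "
            "VALUES (?, ?, ?)",
            (project_id, update_id, description))
    return update_id

first = add_update(conn, 1, "kickoff")
second = add_update(conn, 1, "design review")
print(first, second)  # 1 2
```

Keeping both writes in the same transaction means the counter and the history cannot disagree after a failed insert.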

score -1 · Accepted Answer

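If I understand correctly, you have one table in which each row is a parameter, and another table that records the history of each parameter's value as a time series. If that is right, I currently have the same situation in one of the products I am building. My parameter table holds a list of measures (29K records), and the historical parameter-value table has a value for each parameter every hour, so that table currently has 4M rows. At any given point in time, there will be far more requests for the latest value than for the history, so I do store the latest value in the parameter table in addition to it being the last record in the parameter-value table. While this may look like data duplication, it makes perfect sense from a performance standpoint because: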

To get a listing of all parameters and their CURRENT VALUE, I do not have to make a join and more importantly
I do not have to get the latest value for each parameter from such a huge table

So yes, I would in your case most definitely store the latest value in the parent table and update it every time new data comes in. It will be a little slower for writing new data but a hell of a lot faster for reads.
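One way to keep such a copy up to date is a trigger on the history table, so the application cannot forget the second write. This is a sketch only: the answer does not say how the update is done, and the table names, column names, and trigger below are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parameters (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        latest_value REAL,
        latest_at TEXT
    );
    CREATE TABLE parameter_values (
        parameter_id INTEGER NOT NULL REFERENCES parameters(id),
        recorded_at TEXT NOT NULL,
        value REAL NOT NULL
    );
    -- Copy each new reading into the parent row unless it is older than the stored one.
    CREATE TRIGGER trg_latest_value AFTER INSERT ON parameter_values
    BEGIN
        UPDATE parameters
        SET latest_value = NEW.value, latest_at = NEW.recorded_at
        WHERE id = NEW.parameter_id
          AND (latest_at IS NULL OR NEW.recorded_at >= latest_at);
    END;
    INSERT INTO parameters (id, name) VALUES (1, 'temperature');
""")
conn.executemany(
    "INSERT INTO parameter_values (parameter_id, recorded_at, value) VALUES (?, ?, ?)",
    [(1, "2024-05-01T10:00", 20.5),
     (1, "2024-05-01T11:00", 21.0),
     (1, "2024-05-01T09:00", 19.0)])  # late-arriving row must not overwrite the latest

# Reading the current value needs no join and no scan of the history table.
current = conn.execute(
    "SELECT name, latest_value, latest_at FROM parameters").fetchall()
print(current)  # [('temperature', 21.0, '2024-05-01T11:00')]
```

The cost is the extra UPDATE on every insert, which matches the "a little slower for writes, much faster for reads" trade-off described above.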
