Say, for instance, that I have a Cloud environment and a Client environment, and I want to sync a large amount of data from the cloud to the client. Say I have a DB table in the cloud named Files, and I want an identical table to exist in the client environment.

Now let's assume a few things:

  1. The Files table is very big.
  2. The data in any row of Files can be updated at any time, and each row has a last-update column.
  3. I want to fetch only the deltas and make sure the table is identical in both environments.

My solution:

  1. I do a full sync first, returning all the entries to the client.
  2. I keep the LastSync time in the client environment and keep syncing deltas from the LastSync time onwards.
  3. I do both the full sync and the delta syncs using paging: the client fires a first request to get the count of results for the delta, then as many further requests as that count and the page size imply (a client-side sketch of the whole flow follows the queries below).

For example, the count:

SELECT COUNT(*) FROM files WHERE last_update > @LastSyncTime

The page fetching:

SELECT col1, col2..
FROM files 
WHERE last_update > @LastSyncTime
ORDER BY files.id
LIMIT @LIMIT 
OFFSET @OFFSET
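
Put together, the client-side flow would look roughly like the sketch below. This is a minimal sketch, assuming Python with a DB-API driver such as mysql-connector-python; the table and column names follow the question, and upsert_into_client_db is a hypothetical helper:

    import mysql.connector  # assumed driver; any DB-API connector looks the same

    PAGE_SIZE = 1000  # illustrative

    def sync_delta(conn, last_sync_time):
        """One delta sync: a count request, then as many page requests as needed."""
        cur = conn.cursor()

        # First request: the count of results for this delta.
        cur.execute("SELECT COUNT(*) FROM files WHERE last_update > %s",
                    (last_sync_time,))
        (total,) = cur.fetchone()

        newest = last_sync_time
        # Page requests, as many as the count and the page size imply.
        for offset in range(0, total, PAGE_SIZE):
            cur.execute("SELECT id, col1, col2, last_update FROM files"
                        " WHERE last_update > %s"
                        " ORDER BY id LIMIT %s OFFSET %s",
                        (last_sync_time, PAGE_SIZE, offset))
            for row in cur.fetchall():
                upsert_into_client_db(row)     # hypothetical client-side helper
                newest = max(newest, row[3])   # row[3] is last_update

        # Stored as LastSyncTime for the next run.
        return newest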

My problem:

What if, for example, the first fetch (the count fetch) takes some time (a few minutes, say), and during that time more entries are updated or added, so that they now fall inside the last-update window?

For example:

  • The count fetch returns 100 entries for last_update > 1000 seconds.
  • 1 entry is updated while the count is being fetched.
  • Now last_update > 1000 seconds matches 101 entries.
  • The page fetches, ordered by id, only get 100 of the 101 entries.
  • 1 entry is missed and never synced to the client.

I have tried two other options:

  • Syncing with a from-to date range on last_update.
  • Ordering by last_update instead of the id column.

I see issues in both options.
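
For concreteness, the from-to variant would look something like the sketch below (a hedged illustration in Python; the window bounds and helper names are made up), with comments on the kind of pitfalls each option runs into:

    def fetch_window(cur, window_from, window_to):
        # Option 1: bound the delta on both sides, so the result set cannot
        # grow while it is being paged. Pitfall: a transaction that commits
        # after window_to was chosen, but carries a last_update timestamp
        # inside the window (long transaction, clock skew), is skipped for good.
        cur.execute("SELECT id, col1, col2 FROM files"
                    " WHERE last_update > %s AND last_update <= %s"
                    " ORDER BY id",
                    (window_from, window_to))
        return cur.fetchall()

    # Option 2: ORDER BY last_update instead of id. Pitfall: last_update is
    # not unique, so a LIMIT/OFFSET page boundary can fall inside a run of
    # identical timestamps, and a row re-updated mid-pagination jumps to the
    # end of the ordering, shifting other rows across page boundaries.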

3 Answers

  • Don't use OFFSET and LIMIT; it goes from OK to slow to slower. Instead, keep track of "where you left off" via last_update, so that each pass can be efficient.

  • Since there can be duplicate datetimes, be flexible about how many rows to do at a time.

  • Run it continually. Don't use cron, except as a "keep-alive".

  • There is no need for an initial copy; this code does it for you.

  • It is vital to have INDEX(last_update).

Here is the code:

-- Initialize.  Note: This subtract is consistent with the later compare. 
SELECT @left_off := MIN(last_update) - INTERVAL 1 DAY
    FROM tbl;

Loop:

    -- Get the ending timestamp:
    SELECT @cutoff := last_update FROM tbl
         WHERE last_update > @left_off
         ORDER BY last_update
         LIMIT 1  OFFSET 100;   -- assuming you decide to do 100 at a time
    -- if no result, sleep for a while, then restart

    -- Get all the rows through that timestamp
    -- This might be more than 100 rows
    SELECT * FROM tbl
        WHERE last_update > @left_off
          AND last_update <= @cutoff
        ORDER BY last_update;
    -- and transfer them

    -- prep for next iteration
    SET @left_off := @cutoff;

Goto Loop

The SELECT @cutoff will be fast; it is a brief scan of 100 consecutive rows in the index.

The SELECT * does the heavy lifting, and takes time proportional to the number of rows involved, with none of the extra overhead of OFFSET. Reading 100 rows takes roughly 1 second (assuming a spinning disk and non-cached data).

Rather than getting COUNT(*) up front, I would start by getting MAX(last_update), since the rest of the code is based on last_update. That query is "instantaneous", since it only has to probe the end of the index. But I claim you don't even need that!

A possible bug: if rows can be deleted from the "source", how do you recognize that?
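
Wrapped in a client-side driver, the loop might look like the following; a minimal sketch assuming Python and a DB-API cursor, following the SQL above (the batch size, the sleep interval, and the transfer helper are illustrative):

    import time

    BATCH = 100       # rows per chunk, as in the SQL above
    PAUSE_SECS = 10   # illustrative keep-alive pause

    def replicate_forever(cur):
        # Initialize (assumes tbl is non-empty); the subtract is consistent
        # with the '>' compare below.
        cur.execute("SELECT MIN(last_update) - INTERVAL 1 DAY FROM tbl")
        (left_off,) = cur.fetchone()

        while True:
            # Ending timestamp: a short scan of BATCH consecutive index rows.
            cur.execute("SELECT last_update FROM tbl"
                        " WHERE last_update > %s"
                        " ORDER BY last_update LIMIT 1 OFFSET %s",
                        (left_off, BATCH))
            row = cur.fetchone()
            if row is None:
                time.sleep(PAUSE_SECS)   # fewer than BATCH new rows yet
                continue
            cutoff = row[0]

            # All rows through that timestamp; this can be more than BATCH
            # rows, since last_update values may repeat.
            cur.execute("SELECT * FROM tbl"
                        " WHERE last_update > %s AND last_update <= %s"
                        " ORDER BY last_update",
                        (left_off, cutoff))
            transfer(cur.fetchall())     # hypothetical: ship the chunk to the client

            left_off = cutoff            # prep for the next iteration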

Answered 2018-06-21T05:22:03.323

Your approach is piled high with workarounds; you are heading down the wrong path.

Start thinking in terms of database replication: it abstracts away all of these workarounds and gives you the tools to solve this kind of problem.

An excellent recent article on MySQL group replication: https://www.digitalocean.com/community/tutorials/how-to-configure-mysql-group-replication-on-ubuntu-16-04
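
For a rough idea of what the setup involves, the tutorial's per-member my.cnf settings boil down to something like this abbreviated sketch (the UUID and addresses are placeholders; see the article for the full walkthrough):

    # /etc/mysql/my.cnf (excerpt, one group member)
    [mysqld]
    server_id                               = 1
    gtid_mode                               = ON
    enforce_gtid_consistency                = ON
    binlog_format                           = ROW
    binlog_checksum                         = NONE
    log_bin                                 = mysql-bin
    log_slave_updates                       = ON
    transaction_write_set_extraction        = XXHASH64
    loose-group_replication_group_name      = "00000000-0000-0000-0000-000000000000"
    loose-group_replication_start_on_boot   = OFF
    loose-group_replication_local_address   = "203.0.113.1:33061"
    loose-group_replication_group_seeds     = "203.0.113.1:33061,203.0.113.2:33061"
    loose-group_replication_bootstrap_group = OFF

Each member then installs the plugin (INSTALL PLUGIN group_replication SONAME 'group_replication.so';) and runs START GROUP_REPLICATION;, with the first member bootstrapping the group.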

Answered 2018-06-21T19:38:44.933

Depending on the size of the data, and whether it is "public" or can be shared among multiple clients, it can help to split it up. For example, create daily "delta" full data sets and cache them. That way, the database does not have to be queried over and over for the data every client needs on first load.

  1. Minimize access to the big table (cache off-site if the data does not change at all).
  2. Offload and cache data that is commonly queried, so that you reduce the number of SQL queries.
  3. Creating an index over last_update and id should help speed up fetching the delta rows live from the database.

Possible solution:

  1. The database creates a delta/full set every hour / every x occurrences, whenever some new items exist.

  2. On its first fetch, the client gets the "daily delta / hourly delta" from the cache.

  3. The client fetches everything since the last delta's "newest item" directly from the database (see the sketch after this list).
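
A client-side sketch of that flow, assuming Python; the cache endpoint and the apply_*/newest_* helpers are made-up names, and step 3 assumes the index from the list above (e.g. CREATE INDEX idx_last_update_id ON files (last_update, id)):

    import requests  # assumed HTTP client; the cached sets live behind a cache/CDN

    CACHE_URL = "https://cache.example.com/files"  # hypothetical endpoint

    def first_load(cur, client_db):
        # Steps 1 and 2: the bulk of the data comes from the cache, not the DB.
        apply_full_set(client_db, requests.get(CACHE_URL + "/full").json())
        for delta in requests.get(CACHE_URL + "/deltas").json():
            apply_delta(client_db, delta)

        # Step 3: live tail straight from the database, from the newest
        # cached item onwards; this is the only query the database serves.
        newest = newest_cached_item_time(client_db)
        cur.execute("SELECT id, col1, col2, last_update FROM files"
                    " WHERE last_update > %s ORDER BY last_update, id",
                    (newest,))
        for row in cur.fetchall():
            apply_row(client_db, row)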

Answered 2018-06-18T07:36:53.383