1

我有以下需要有效解决的情况,

我正在从客户端设备到服务器进行文件同步。有时,由于服务器的某些问题,无法将来自一台设备的文件从服务器获取到另一台设备。我需要确保服务器中的所有文件都使用单独的线程同步到所有客户端设备。我使用 C++ 进行开发,使用 libcurl 进行客户端到服务器的通信。

在客户端设备中,我们有一个 SQLite 数据库中下载文件的条目。同样在服务器中,我们在服务器数据库(MySQL)中也有类似的更新。我需要列出来自客户端设备的所有可用文件并将其发送到服务器,并且必须将其与从服务器数据库中获取的列表进行比较以找出丢失的文件。

我粗略估计,对于 100 万个文件列表(带完整路径的文件名),它的大小约为 85 MB。压缩后,它的大小可达 10 MB。因此,将整个文件列表(即使经过压缩)从客户端传输到服务器并不是一个好主意。我计划为此实施布隆过滤器,如下所示,

  1. 从客户端数据库中获取文件列表并将其转换为布隆过滤器数据结构。
  2. 只需将Bloom数据结构从客户端单独传输到服务器。
  3. 从服务器端数据库中获取文件列表,并将其与从客户端接收到的 Bloom 数据结构进行比较,找出丢失的文件。

请注意,从客户端启动的上述过程应定期在线程中处理,例如每 1 小时左右。

布隆过滤器的问题是误报率,即使它非常低。我不想错过一个文件。还有其他更好的方法吗?

4

2 回答 2

2

As you've noticed, this isn't a problem for which Bloom Filters are appropriate. With a Bloom Filter, when you get a hit you must then check the authoritative source to differentiate between a false positive and a true positive - they're useful in situations where most queries against the filter will be expected to give a negative result, which is the opposite to your case.

What you could do is have each side build a partial Prefix Tree in memory of all the filenames known to that side. It wouldn't be a full prefix tree - once you number of filenames below a node drops below a certain level you'd just include the full list of those filenames in that node. You then synchronise those prefix trees using a recursive algorithm starting at the root of the trees:

  1. Each side creates a hash of all the sorted, concatenated filenames below the current node.
  2. If the hashes are equal then this node and all descendents are synchronised - return.
  3. If there are no child nodes, send the (short) list of filenames at this terminal node from one side to the other to synchronise and return.
  4. Otherwise, recursively synchronise the child nodes and return.

The hash should be at least 128 bits, and make sure that when you concatenate the filenames for the hash you do so in a reversible manner (ie. seperate them with a character that can't appear in filenames like \0, or prefix each one with its length).

于 2012-11-01T07:13:03.657 回答
0

在文件/路径名压缩中,我发现前缀-后缀压缩甚至比通用(bz2)压缩效果更好。合并后,文件名列表可以减少更多。

诀窍是使用转义码(例如<32)来指示前一行的常用字符数,然后使用常规字符作为唯一部分,最后(可选)编码字符串末尾的常用字符数。

于 2012-11-01T07:22:00.693 回答