caching - HBase scan performance

Question

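I am performing a range scan that returns 500,000 records. If I set scan.setCaching(100000), it takes less than a second, but without scan.setCaching(100000) it takes nearly 38 seconds.

If I set scan.setBlockCache(false), what happens to scan.setCaching(100000)? Will the rows still be cached?

I dropped the OS cache after the first scan, but the time to scan the records did not change. Why?

So how can I check read performance?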

score 22 · Accepted Answer

Scan.setCaching is a misnomer. It should really be called something like Scan.setPrefetch. setCaching actually specifies how many rows are transmitted per RPC from the regionserver. If you use setCaching(1), then every time you call next() you pay the cost of a round trip to the regionserver. The downside of setting it to a larger number is that you pay for extra memory in the client and you may fetch rows that you won't use, for example if you stop scanning after reaching a certain number of rows or after you've found a specific value.
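To make the round-trip cost concrete, here is a minimal sketch that does not use the HBase client at all. It is a toy model in which a hypothetical FakeServer hands out rows in batches and counts the requests; the class and method names are invented for illustration, and it ignores the extra calls a real scanner makes to open and close itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a scanner that fetches rows from a server in batches. */
class ScanCachingSketch {

    /** Hypothetical stand-in for a regionserver; counts the batch requests it serves. */
    static class FakeServer {
        final int totalRows;
        int rpcCount = 0;

        FakeServer(int totalRows) {
            this.totalRows = totalRows;
        }

        /** Returns up to 'limit' row indices starting at 'start'; counts as one round trip. */
        List<Integer> fetch(int start, int limit) {
            rpcCount++;
            int end = Math.min(start + limit, totalRows);
            List<Integer> batch = new ArrayList<>();
            for (int i = start; i < end; i++) {
                batch.add(i);
            }
            return batch;
        }
    }

    /** Scans every row using the given batch size and returns the number of round trips. */
    static int roundTrips(int totalRows, int caching) {
        FakeServer server = new FakeServer(totalRows);
        int next = 0;
        while (next < totalRows) {
            List<Integer> batch = server.fetch(next, caching);
            next += batch.size();
        }
        return server.rpcCount;
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(500_000, 1));        // one round trip per row
        System.out.println(roundTrips(500_000, 100_000));  // five round trips in total
    }
}
```

In this model, the 500,000-row scan from the question needs 500,000 round trips with a batch size of 1 but only 5 with a batch size of 100,000, which is the kind of gap that would explain the difference in timings.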

Scan.setBlockCache means something entirely different, as Chandra pointed out (in the HBase client API the method is actually named Scan.setCacheBlocks). Passing false instructs the regionserver not to pull any data read by this Scan into the HBase BlockCache, which is a separate pool of memory from the MemStore. Note that MemStores are used for writing and the BlockCache is used for reading, and these two pieces of memory are completely separate. HBase currently does not use the BlockCache as a write-back cache. You can control the size of the block cache with the hfile.block.cache.size config setting in hbase-site.xml. Similarly, you can control the total pool size of the MemStore via the hbase.regionserver.global.memstore.size setting.
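As a rough illustration, the two settings named above might appear in hbase-site.xml as follows. Both are expressed as a fraction of the regionserver heap; the value 0.4 is only a placeholder for the example, so check the defaults and limits documented for your HBase version before changing either one.

```xml
<configuration>
  <!-- Fraction of regionserver heap given to the block cache (read path). -->
  <property>
    <name>hfile.block.cache.size</name>
    <value>0.4</value>
  </property>
  <!-- Fraction of regionserver heap shared by all MemStores (write path). -->
  <property>
    <name>hbase.regionserver.global.memstore.size</name>
    <value>0.4</value>
  </property>
</configuration>
```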

You might want to use setBlockCache(false) if you are doing a full table scan and you don't want to evict your current working set from the block cache. Otherwise, if you are scanning data that is used frequently, it is probably better to leave setBlockCache alone, since block caching is on by default.
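The eviction effect can be sketched with a toy cache. The example below assumes a plain LRU cache built on java.util.LinkedHashMap, which is simpler than HBase's actual block cache, and every name in it (LruCache, readBlock, hotBlocksSurviving) is hypothetical. It warms a small working set, runs a large scan, and counts how many of the hot blocks are still cached afterwards.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy LRU cache standing in for the regionserver block cache. */
class BlockCacheSketch {

    static class LruCache extends LinkedHashMap<String, byte[]> {
        private static final long serialVersionUID = 1L;
        private final int capacity;

        LruCache(int capacity) {
            super(16, 0.75f, true); // access order: least recently used entry is eldest
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
            return size() > capacity; // evict once the cache grows past its capacity
        }
    }

    /** Reads one block; the cache is populated only when cacheBlocks is true. */
    static void readBlock(LruCache cache, String blockId, boolean cacheBlocks) {
        if (cache.get(blockId) != null) {
            return; // cache hit
        }
        byte[] data = new byte[0]; // pretend this was read from disk
        if (cacheBlocks) {
            cache.put(blockId, data);
        }
    }

    /** Warms a hot working set, runs a large scan, and counts hot blocks still cached. */
    static int hotBlocksSurviving(boolean cacheBlocksForScan) {
        LruCache cache = new LruCache(100);
        for (int i = 0; i < 100; i++) {
            readBlock(cache, "hot-" + i, true);
        }
        for (int i = 0; i < 10_000; i++) {
            readBlock(cache, "scan-" + i, cacheBlocksForScan);
        }
        int surviving = 0;
        for (int i = 0; i < 100; i++) {
            if (cache.containsKey("hot-" + i)) {
                surviving++;
            }
        }
        return surviving;
    }

    public static void main(String[] args) {
        System.out.println(hotBlocksSurviving(true));  // scan pushes out the hot blocks
        System.out.println(hotBlocksSurviving(false)); // hot blocks stay cached
    }
}
```

With caching enabled for the scan, none of the 100 hot blocks survive in this model; with it disabled, all 100 do.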

score 6 · Accepted Answer

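HBase has two kinds of in-memory structures: the memory store and the block cache.
The memory store is implemented as the MemStore, and the cache used for reads is the block cache.
When a data block is read from HDFS, it is cached in the BlockCache. Subsequent reads of neighboring data are then served from the BlockCache alone.
So when you manually set scan.setBlockCache(false), it stops caching the blocks read from HDFS for that scan.
scan.setCaching(100000) is a client-side optimization related to the scanner, so it remains unaffected.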
