
I have been reading about real-time processing with Hadoop and came across this: http://www.scaleoutsoftware.com/hserver/

From the documentation, it looks like they implement an in-memory data grid on top of the Hadoop worker/slave nodes. I have a couple of questions:

  1. As I understand it, if I have 100 GB of data, I will need at least 100 GB of RAM across all nodes in the cluster just for the data, plus additional RAM for the TaskTracker and DataNode daemons, plus additional RAM for the hServer service running on all of those nodes. Is my understanding correct?

  2. The software claims it enables real-time data processing by improving Hadoop's latency problems. Is that because it lets us write data to the in-memory grid instead of HDFS?

I am new to big data technologies. Apologies if some of these questions are naive.


1 Answer


[Full disclosure: I work at ScaleOut Software, the company which created ScaleOut hServer.]

  1. In-memory data grids create a replica of every object to ensure high availability in case of failures. The aggregate amount of memory required is the memory used to store the objects plus the memory used to store the object replicas. In your example, you will need 200 GB of total memory: 100 GB for objects and 100 GB for replicas. For example, in a four-server cluster, each server needs 50 GB of memory available to the ScaleOut hServer service.
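The sizing arithmetic above can be sketched as a small helper (the one-replica policy and the four-server figures come from the answer; the function name is just illustrative):

```python
def grid_memory_per_server(dataset_gb, replicas, servers):
    """Grid memory = object data plus replica copies, split evenly across servers."""
    total_gb = dataset_gb * (1 + replicas)
    return total_gb / servers

# 100 GB of objects + 1 replica per object, on a 4-server cluster:
per_server = grid_memory_per_server(100, replicas=1, servers=4)
print(per_server)  # 50.0 GB per server
```

This excludes the extra RAM for the TaskTracker and DataNode daemons mentioned in the question, which must be budgeted separately on each node.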

  2. With the current release, ScaleOut hServer takes the first step in enabling real-time analytics by speeding up data access. It does this in two ways, which are implemented using different input/output formats. The first mode of operation uses the grid as a cache for HDFS, and the second uses the grid as the primary storage for a data set, providing support for fast-changing, memory-based data. Accessing data using an in-memory data grid reduces latency by eliminating disk I/O and minimizing network overhead. Also, caching HDFS data provides an additional performance boost by storing keys and values generated by the record reader instead of raw HDFS files in the grid.

Answered 2013-07-11T23:20:58.613