We have a requirement to log an event in a DynamoDB table whenever an ad is served to the end user. There are more than 250 writes per second into this table.

We want to aggregate this data and move it to Redshift for analytics.

I suppose the DynamoDB stream will emit a record for every insert made in the table. How can I group the DynamoDB stream into batches and then process those batches? Are there any best practices around this kind of use case?

I have been reading about Apache Spark, and it seems we could do this kind of aggregation with it. However, Spark Streaming does not read DynamoDB streams directly.

Any help or pointers would be appreciated.

Thanks

2 Answers

DynamoDB Streams has two interfaces: the low-level API and the Kinesis Adapter. Apache Spark has a Kinesis integration, so you can use the two together. If you are wondering which DynamoDB Streams interface to use, AWS recommends the Kinesis Adapter.

Here is how to use the Kinesis Adapter for DynamoDB.
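
To make the batching concrete, here is a minimal sketch of the Spark side reading a Kinesis-style stream in micro-batches via the spark-streaming-kinesis-asl integration. The application name, stream name, endpoint, and the toy aggregation are assumptions, not code from the linked guide; wiring this to a DynamoDB stream through the Kinesis Adapter is what that guide covers.

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream;
    import org.apache.spark.SparkConf;
    import org.apache.spark.storage.StorageLevel;
    import org.apache.spark.streaming.Duration;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kinesis.KinesisUtils;

    public class AdEventBatcher {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("ad-event-batcher");
            // Each micro-batch bundles 60 seconds of stream records.
            JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(60_000));

            // Placeholder stream/endpoint values.
            KinesisUtils.createStream(ssc, "ad-event-batcher", "ad-events",
                    "kinesis.us-east-1.amazonaws.com", "us-east-1",
                    InitialPositionInStream.LATEST, new Duration(60_000),
                    StorageLevel.MEMORY_AND_DISK_2())
                .map(String::new)     // each Kinesis record arrives as byte[]
                .countByValue()       // toy per-batch aggregation
                .print();

            ssc.start();
            ssc.awaitTermination();
        }
    }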

A few more things to consider:

  • Instead of Apache Spark, take a look at Apache Flink. It is a stream-first solution (Spark implements streaming as micro-batches) with lower latency, higher throughput, and more powerful streaming operators, and it supports cyclic processing. It also has a Kinesis adapter (a sketch follows this list).

  • You may not need DynamoDB Streams at all to get the data into Redshift: you can load it with the Redshift COPY command, which can read directly from a DynamoDB table (see the example below).
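
For the Flink option, a minimal sketch using the flink-connector-kinesis consumer; the stream name, region, and job wiring are placeholder assumptions:

    import java.util.Properties;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
    import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

    public class FlinkAdEventJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty(ConsumerConfigConstants.AWS_REGION, "us-east-1");
            props.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");

            // Record-at-a-time source: no micro-batches, unlike Spark Streaming.
            DataStream<String> events = env.addSource(
                new FlinkKinesisConsumer<>("ad-events", new SimpleStringSchema(), props));
            events.print();

            env.execute("flink-ad-event-job");
        }
    }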

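To illustrate the COPY route from the second bullet: Redshift can load straight from a DynamoDB table, so a scheduled job can move the data without streams. Here is a minimal sketch issuing the statement over JDBC; the cluster URL, credentials, table names, IAM role, and READRATIO value are all placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RedshiftCopyFromDynamoDb {
        public static void main(String[] args) throws Exception {
            // Requires the Redshift JDBC driver on the classpath.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/analytics",
                     "admin", "secret");
                 Statement stmt = conn.createStatement()) {
                // READRATIO caps the share of the table's read capacity COPY may consume.
                stmt.execute("COPY ad_events FROM 'dynamodb://AdEvents' "
                    + "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
                    + "READRATIO 50");
            }
        }
    }
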
answered 2017-07-31T08:19:53.337

Amazon EMR provides an implementation of this connector as part of emr-hadoop-ddb.jar, which contains the DynamoDBItemWritable class. Using this class, you can implement your own DynamoDBInputFormat as shown below.

    import static java.util.Objects.requireNonNull;
    import java.io.IOException;
    import java.io.Serializable;
    import java.util.stream.IntStream;
    import org.apache.hadoop.mapred.*;

    public class DynamoDbInputFormat implements InputFormat, Serializable {
        // Job property telling the format how many parallel-scan segments to use.
        private static final String NUMBER_OF_SPLITS = "dynamodb.splits";

        @Override
        public InputSplit[] getSplits(final JobConf job, final int numSplits) throws IOException {
            final int splits = Integer.parseInt(requireNonNull(job.get(NUMBER_OF_SPLITS), NUMBER_OF_SPLITS + " must be non-null"));
            // One input split per DynamoDB parallel-scan segment.
            return IntStream.range(0, splits)
                .mapToObj(segmentNumber -> new DynamoDbSplit(segmentNumber, splits))
                .toArray(InputSplit[]::new);
        }

        @Override
        public RecordReader getRecordReader(InputSplit split, JobConf job, Reporter reporter) {
            throw new UnsupportedOperationException("segment-scanning reader not shown");
        }
    }
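
The snippet references a custom DynamoDbSplit class and still needs a RecordReader that scans the corresponding segment; neither is shown in the answer. Once those exist, wiring the format into a job is just a matter of setting the split-count property the class reads (a hypothetical sketch; the property name simply mirrors the constant assumed above):

    import org.apache.hadoop.mapred.JobConf;

    public class JobSetup {
        public static void main(String[] args) {
            JobConf job = new JobConf();
            job.setInt("dynamodb.splits", 8);   // 8 parallel-scan segments -> 8 splits
            job.setInputFormat(DynamoDbInputFormat.class);
        }
    }
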
answered 2017-07-31T08:33:57.223