json - Distributed processing of JSON in Hadoop

Question

I want to process a ~300 GB JSON file in Hadoop. As far as I understand, a JSON document is a single string with data nested inside it. If I parse that string with Google's GSON, won't Hadoop have to put the entire load on a single node, since the JSON is not logically divisible from its point of view?

How do I partition the file so that it can be processed in parallel on different nodes? (I can work out the logical partitions by looking at the data.) Do I have to break the file up before I load it onto HDFS? Is it absolutely necessary that the JSON is parsed by one machine (or node) at least once?

score 1 · Accepted Answer

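Assuming you know that the JSON can be broken into logically independent components, you can do this simply by writing your own InputFormat.

Conceptually, you can think of each logically divisible JSON component as one "line" of data, where each component contains the smallest amount of information that can be acted on independently.

You will then need to write a class that extends FileInputFormat and returns each of these JSON components as a record.

public class JSONInputFormat extends FileInputFormat<Text, JSONComponent> {...}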
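A minimal sketch of the record-extraction logic such an InputFormat needs, written in plain Java so it runs without Hadoop on the classpath. The class name JsonComponentScanner and its next() method are hypothetical, not part of Hadoop or of the answer above; in a real job, a custom RecordReader's nextKeyValue() could delegate to something like this. It assumes each component is a {...} object sitting directly inside one top-level array, and it ignores split boundaries, so the InputFormat would either have to treat the file as non-splittable or resynchronise to the start of a component itself.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper: pulls one top-level {...} component at a time from a character stream.
class JsonComponentScanner {
    private final Reader in;

    JsonComponentScanner(Reader in) {
        this.in = in;
    }

    /** Returns the next complete {...} object as text, or null when the input is exhausted. */
    String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        int depth = 0;            // nesting depth of '{' inside the current component
        boolean inString = false; // true while inside a JSON string literal
        boolean escaped = false;  // true right after a backslash inside a string
        int ch;
        while ((ch = in.read()) != -1) {
            char c = (char) ch;
            if (depth == 0) {
                // Between components: skip '[', ']', ',' and whitespace until an object starts.
                if (c != '{') continue;
                depth = 1;
                sb.append(c);
                continue;
            }
            sb.append(c);
            if (inString) {
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"') inString = false;
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return sb.toString(); // braces balanced: one full component
            }
        }
        return null; // end of input (a truncated trailing component is dropped)
    }

    public static void main(String[] args) throws IOException {
        String json = "[{\"id\":1,\"tags\":[\"a}\"]}, {\"id\":2}]";
        JsonComponentScanner scanner = new JsonComponentScanner(new StringReader(json));
        for (String c = scanner.next(); c != null; c = scanner.next()) {
            System.out.println(c);
        }
    }
}
```

Only one component is held in memory at a time, so the size of the whole file does not matter as long as each component is small. Braces and brackets inside string values are ignored by the depth counter, which is why the scanner tracks string and escape state.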

score 0 · Accepted Answer

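You may find this JSON SerDe useful. It allows Hive to read and write data in JSON format. If it works for you, processing your JSON data with Hive will be much more convenient, because you won't have to write a custom InputFormat that reads your JSON data and creates the splits.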

score 0 · Accepted Answer

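If you can logically divide your giant JSON into parts, do so, and save those parts as separate lines in a file (or as records in a sequence file). If you then feed this new file to Hadoop MapReduce, the mappers will be able to process the records in parallel.

So yes, the JSON has to be parsed by one machine at least once. This preprocessing phase does not need to run in Hadoop; a simple script can do the job. Use a streaming API to avoid loading a large amount of data into memory.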
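The preprocessing step could look roughly like the following sketch, which copies a top-level JSON array to one element per line without ever holding more than a single character in memory. The class name JsonArrayToLines and the method split are made up for illustration, and the sketch assumes the input is well-formed JSON whose outermost value is an array; a real script would more likely use a streaming parser from a JSON library.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical preprocessing tool: top-level JSON array in, one element per line out.
class JsonArrayToLines {

    /** Streams the array from in to out and returns the number of lines (records) written. */
    static long split(Reader in, Writer out) throws IOException {
        int depth = 0;              // 1 means "directly inside the outer array"
        boolean inString = false;   // true while inside a JSON string literal
        boolean escaped = false;    // true right after a backslash inside a string
        boolean recordOpen = false; // true once the current record has some output
        long records = 0;
        int ch;
        while ((ch = in.read()) != -1) {
            char c = (char) ch;
            if (inString) {
                out.write(c); // string contents are copied verbatim
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"') inString = false;
                continue;
            }
            if (Character.isWhitespace(c)) continue; // insignificant outside strings
            if (c == '[' || c == '{') {
                depth++;
                if (depth == 1) continue; // drop the outer array's opening bracket
            } else if (c == ']' || c == '}') {
                depth--;
                if (depth == 0) { // outer array closed: finish the last record
                    if (recordOpen) {
                        out.write('\n');
                        records++;
                        recordOpen = false;
                    }
                    continue;
                }
            } else if (c == ',' && depth == 1) {
                out.write('\n'); // a comma between elements ends the current record
                records++;
                recordOpen = false;
                continue;
            } else if (c == '"') {
                inString = true;
            }
            out.write(c);
            recordOpen = true;
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        long n = split(new StringReader("[ {\"a\": 1},\n  {\"b\": \"x,]\"} ]"), out);
        System.out.print(out);
        System.out.println(n + " records");
    }
}
```

For a file on disk, the same method can be given a buffered reader and writer (for example from java.nio.file.Files.newBufferedReader and newBufferedWriter). Because whitespace outside strings is dropped and raw newlines cannot occur inside JSON strings, every output line holds exactly one element, which is the shape line-oriented Hadoop input expects.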
