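I am new to Hadoop, and I have been asked to create a custom InputFormat class to read JSON data. I searched online and learned how to write a custom InputFormat class that reads data from a file, but I am stuck on parsing the JSON data. My JSON data looks like this: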

[
    {
        "_count": 30,
        "_start": 0,
        "_total": 180,
        "values": [
            {
                "attachment": {
                    "contentDomain": "techcarnival2013.eventbrite.com",
                    "contentUrl": "http://techcarnival2013.eventbrite.com/",
                    "imageUrl": "http://ebmedia.eventbrite.com/s3-s3/static/images/django/logos/eb_home_tm-trans-fb.png",
                    "summary": "Get to know a few thousand of Silicon Valley's best and brightest while enjoying unparalleled access to Candlestick Park,\u00a0games, food, music and more. We'll have carnival games you haven't played since you were ten, giant inflatable obstacle...",
                    "title": "Tech Carnival @ Candlestick Park"
                },
                "comments": {
                    "_total": 0
                },
                "creationTimestamp": 1373908436000,
                "creator": {
                    "firstName": "Clayton",
                    "headline": "Director of Operations",
                    "secondname": {
                        "name": "myname"
                    },
                    "lastName": "K.",
                    "pictureUrl": "http://m.c.lnkd.licdn.com/mpr/mprx/0_R7Vm6_RqBDHaHCDzJHRA6hsNcwOfECjzMeaA6heqHeo0v6ovBWoCe8pVJiYrd5pJVu4KdbnQQ3Lj"
                },
                "likes": {
                    "_total": 0
                },
                "relationToViewer": {
                    "availableActions": {
                        "_total": 7,
                        "values": [
                            {
                                "code": "add-comment"
                            },
                            {
                                "code": "categorize-as-job"
                            },
                            {
                                "code": "categorize-as-promotion"
                            },
                            {
                                "code": "flag-as-inappropriate"
                            },
                            {
                                "code": "follow"
                            },
                            {
                                "code": "like"
                            },
                            {
                                "code": "reply-privately"
                            }
                        ]
                    },
                    "isFollowing": false,
                    "isLiked": false
                },
                "summary": "Network with 4,000+ from the tech community, including folks from DFJ, Google, LinkedIn, Square, Uber, Y Combinator, 500 Startups, etc. $10 ticket gets you all-you-can-ride access to the pop-up Tech Carnival, will be the biggest Wednesday night of the tech summer.",
                "title": "Tech Event @ Candlestick Park on Wednesday, July 17th! Come play carnival games with ~4,000 of the Bay area's best and brightest!"
            },
            {
                "attachment": {
                    "contentDomain": "lifebeyondnumbers.com",
                    "contentUrl": "http://bit.ly/10VTqMu",
                    "imageUrl": "http://lifebeyondnumbers.com/wp-content/uploads/2013/07/lurnq_Online_Courses.jpg",
                    "summary": "LurnQ offers a platform for learning and teaching that is free for everyone. It caters to a diverse online audience and is relevant to everyone in general. The key segment that we address now is of life long learners.",
                    "title": "LurnQ - making lifelong learning clutter free, fun and a social..."
                },
                "comments": {
                    "_total": 0
                },
                "creationTimestamp": 1373883177000,
                "creator": {
                    "firstName": "Syed",
                    "headline": "Founder and CEO at QubiqSquare",
                    "lastName": "Muksit",
                    "pictureUrl": "http://m.c.lnkd.licdn.com/mpr/mprx/0_Y5gdzlRCbQBTqIa-pXYnz-01b6KinDO-pFWnz-ZCZLk1WWdt-_SLUt2uWmrpzo0OxQxcVv6pRjbE"
                },
                "likes": {
                    "_total": 0
                },
                "relationToViewer": {
                    "availableActions": {
                        "_total": 7,
                        "values": [
                            {
                                "code": "add-comment"
                            },
                            {
                                "code": "categorize-as-job"
                            },
                            {
                                "code": "categorize-as-promotion"
                            },
                            {
                                "code": "flag-as-inappropriate"
                            },
                            {
                                "code": "follow"
                            },
                            {
                                "code": "like"
                            },
                            {
                                "code": "reply-privately"
                            }
                        ]
                    },
                    "isFollowing": false,
                    "isLiked": false
                },
                "summary": "LurnQ offers a platform for learning and teaching that is free for everyone. It caters to a diverse online audience and is relevant to everyone in general. The key segment that we address now is of life long learners.",
                "title": "There is so much to learn and most of the times, we don\u2019t even know that this-and-that good stuff exists.  http://bit.ly/10VTqMu"
            },
            {
                "attachment": {
                    "contentDomain": "techcarnival2013.eventbrite.com",
                    "contentUrl": "http://techcarnival2013.eventbrite.com/",
                    "imageUrl": "http://ebmedia.eventbrite.com/s3-s3/static/images/django/logos/eb_home_tm-trans-fb.png",
                    "summary": "Get to know a few thousand of Silicon Valley's best and brightest while enjoying unparalleled access to Candlestick Park,\u00a0games, food, music and more. We'll have carnival games you haven't played since you were ten, giant inflatable obstacle...",
                    "title": "Tech Carnival @ Candlestick Park"
                },
                "comments": {
                    "_total": 0
                },
                "creationTimestamp": 1373654758000,
                "creator": {
                    "firstName": "Clayton",
                    "headline": "Director of Operations",
                    "lastName": "K.",
                    "pictureUrl": "http://m.c.lnkd.licdn.com/mpr/mprx/0_R7Vm6_RqBDHaHCDzJHRA6hsNcwOfECjzMeaA6heqHeo0v6ovBWoCe8pVJiYrd5pJVu4KdbnQQ3Lj"
                },
                "likes": {
                    "_total": 0
                },
                "relationToViewer": {
                    "availableActions": {
                        "_total": 7,
                        "values": [
                            {
                                "code": "add-comment"
                            },
                            {
                                "code": "categorize-as-job"
                            },
                            {
                                "code": "categorize-as-promotion"
                            },
                            {
                                "code": "flag-as-inappropriate"
                            },
                            {
                                "code": "follow"
                            },
                            {
                                "code": "like"
                            },
                            {
                                "code": "reply-privately"
                            }
                        ]
                    },
                    "isFollowing": false,
                    "isLiked": false
                },
                "summary": "Network with 4,000+ from the tech community, including folks from DFJ, Google, LinkedIn, Square, Uber, Y Combinator, 500 Startups, etc. $10 ticket gets you all-you-can-ride access to the pop-up Tech Carnival, will be the biggest Wednesday night of the tech summer.",
                "title": "Tech Event @ Candlestick Park on Wednesday, July 17th! Come play carnival games with ~4,000 of the Bay area's best and brightest!"
            }
..........
........ so on

]

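So I am confused about how to read the JSON objects in my custom InputFormat class. Any ideas on how to parse this? I want to read the individual JSON objects in the JSON array, that is, read a well-formed JSON string and pass that string to the mapper, where I will use a JSON parser to build my own key-value pairs. Any help with this? Thanks in advance.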

1 Answer

If your question is along the lines of what Magham Ravi commented, then that answer is good.

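However, if you have a single file containing all of the JSON data shown above, you may want to read the whole file, retrieve it as a string from the value part (the BytesWritable value) in the map function, and feed it to your JSON parser inside the same map() function.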

Take a look at WholeFileInputFormat.
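The decoding step inside map() can be sketched as follows. This is a minimal sketch that uses no Hadoop classes: a plain byte array and a length stand in for the BytesWritable value, and the class and method names (WholeFileValue, toJsonText) are made up for illustration. With a real BytesWritable you would pass value.getBytes() and value.getLength(); the length matters because the backing array can be larger than the valid data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Stand-in for the decoding step inside map(); hypothetical helper, no Hadoop types. */
final class WholeFileValue {

    /**
     * Decodes only the first {@code length} bytes of {@code buffer} as UTF-8.
     * With Hadoop's BytesWritable these would be value.getBytes() and value.getLength().
     */
    static String toJsonText(byte[] buffer, int length) {
        return new String(buffer, 0, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = "[{\"_count\": 30}]".getBytes(StandardCharsets.UTF_8);
        // Simulate a backing array that is longer than the valid data.
        byte[] padded = Arrays.copyOf(data, data.length + 8);
        String json = toJsonText(padded, data.length);
        // Hand this string to the JSON parser of your choice.
        System.out.println(json);
    }
}
```

Decoding the whole backing array instead of the valid prefix would append NUL characters to the text, which many JSON parsers reject.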

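Also, if you have multiple JSON objects in a single file and want each JSON object delivered as the value in the mapper, you can use something like XMLInputFormat, which has a start tag and an end tag defined. For JSON, you need a unique start tag and end tag that mark exactly where the single JSON data object you want begins and ends. If you want the whole JSON object above returned as the value, using start-tag = "[{" and end-tag = "}]" may not help, because you have many nested objects that will confuse the input format.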
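To make the nesting problem concrete, here is a rough sketch of start-tag/end-tag scanning over a plain string. It is not the actual XMLInputFormat code; TagScanner, extract, and the `<rec>` markers are hypothetical names chosen for this example. It shows the scan working with unique markers and cutting a record short when the markers are "[{" and "}]".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical start-tag/end-tag scanner, in the spirit of a tag-based input format. */
final class TagScanner {

    /** Returns every span from a startTag to the next endTag, tags included. */
    static List<String> extract(String text, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(startTag, from);
            if (start < 0) {
                break; // no further record
            }
            int end = text.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record: stop scanning
            }
            end += endTag.length();
            records.add(text.substring(start, end));
            from = end;
        }
        return records;
    }

    public static void main(String[] args) {
        // Markers that never occur inside a record: two complete records come back.
        String tagged = "<rec>{\"a\":1}</rec><rec>{\"b\":2}</rec>";
        System.out.println(extract(tagged, "<rec>", "</rec>"));

        // "}]" also closes the nested array, so the record ends too early.
        String nested = "[{\"values\":[{\"code\":\"like\"}],\"_total\":1}]";
        System.out.println(extract(nested, "[{", "}]"));
    }
}
```

In the second call the scan stops at the "}]" that closes the nested "values" array, so the mapper would receive a truncated, unparseable fragment. That is the confusion described above.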

If you cannot achieve the above by any of these means, try building your own customTextInputFormat that overrides the LineReader used by TextInputFormat.

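In the LineReader class you will find these two settings. (I may be a bit out of date, so please check whether this can now be set through a configuration property. I know CDH has made it configurable, in which case you do not need to override anything.)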

private static final byte CR = '\r';
private static final byte LF = '\n';

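You can drop CR and change LF to point to "]\n[" (since a single byte cannot hold three characters, the delimiter has to become a byte sequence that the reader matches). Each of your independent JSON data items would then take the form shown below, or perhaps you know better what the layout looks like: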

[
...JSON 1
]
[
...JSON 2
]
[
...JSON N
]

(Note: there is a \n between ] and [, and it marks the boundary between separate JSON data objects.)
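The "]\n[" boundary idea can be tried outside Hadoop first. The sketch below is only an illustration under stated assumptions: it uses java.util.Scanner on a Reader rather than a modified LineReader, the class and method names (BracketDelimitedReader, readRecords) are invented for this example, and it assumes "]\n[" never appears inside a record.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

/** Hypothetical record splitter that treats "]\n[" as the record boundary. */
final class BracketDelimitedReader {

    static final String DELIMITER = "]\n[";

    /** Splits the input on the delimiter and restores the brackets it swallowed. */
    static List<String> readRecords(Reader in) {
        List<String> records = new ArrayList<>();
        try (Scanner scanner = new Scanner(in)) {
            // Quote the delimiter so "[" and "]" are matched literally, not as regex syntax.
            scanner.useDelimiter(Pattern.quote(DELIMITER));
            boolean first = true;
            while (scanner.hasNext()) {
                String chunk = scanner.next();
                // The delimiter consumed the "]" closing this record and the "["
                // opening the next one, so put them back where they belong.
                String record = (first ? "" : "[") + chunk + (scanner.hasNext() ? "]" : "");
                records.add(record.trim());
                first = false;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "[\n{\"id\":1}\n]\n[\n{\"id\":2}\n]\n[\n{\"id\":3}\n]\n";
        for (String record : readRecords(new StringReader(input))) {
            System.out.println("record: " + record.replace("\n", " "));
        }
    }
}
```

Each returned string is a complete JSON array that can go straight to a parser in the mapper. If your Hadoop version supports the textinputformat.record.delimiter property, setting it to this same string may give a similar split without a custom reader, but check that against your distribution, and note that the brackets consumed by the delimiter would still need to be restored.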

Hope this makes sense.

answered 2013-09-03T19:06:36.357