
I'm trying to use Apache Drill (for the first time) on a JSON file that looks like this:

{
    "Key1": {
      "htmltags": "<htmltag attr1='bravo' /><htmltag attr2='delta' /><htmltag attr3='charlie' />"
    },
    "Key2": {
      "htmltags": "<htmltag attr1='kilo' /><htmltag attr2='lima' /><htmltag attr3='mike' />"
    },
    "Key3": {
      "htmltags": "<htmltag attr1='november' /><htmltag attr2='foxtrot' /><htmltag attr3='sierra' />"
    }
}

My initial query was the hello world of Drill, SELECT * FROM DataFile.json, and it returned the columns Key1, Key2, and Key3. There was only one row, and it contained the entry "<htmltag attr1='bravo' /><htmltag attr2='delta' /><htmltag attr3='charlie' />" (i.e., only the entry Key1.htmltags).

I have two questions:

  1. Why was there only one row returned, when there were three differently valued entries for each key?
  2. After using the KVGEN/FLATTEN functions to get at my strings inside "htmltags" above, is there a way to drill further into (analyse and extract data from) the HTML tags?

2 Answers


Unfortunately, Drill does not seem to be the right tool for this job (v1.1.0 from Homebrew at the time of writing).

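  1. There appears to be a bug in the system, which is why there is only one row despite the multiple columns. I have filed a report: https://issues.apache.org/jira/browse/DRILL-4102
  2. I went through the documentation again, and there is no tool to analyse HTML or XML natively. It comes down to string manipulation, which is not a task I enjoy.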

So I am going to use XML parsers, DOM tree crawlers, and the like, plus bash string functions with awk/tee, to extract the targeted tag strings.
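For anyone taking the same route, here is a minimal sketch of the parser approach in Python rather than bash. It assumes the file has exactly the shape shown in the question. The names AttrCollector and extract_rows are made up for illustration, and the sample data is inlined so the snippet runs on its own.

```python
import json
from html.parser import HTMLParser

# Sample data copied from the question (inlined so the sketch is self-contained).
DATA = """
{
    "Key1": {
      "htmltags": "<htmltag attr1='bravo' /><htmltag attr2='delta' /><htmltag attr3='charlie' />"
    },
    "Key2": {
      "htmltags": "<htmltag attr1='kilo' /><htmltag attr2='lima' /><htmltag attr3='mike' />"
    },
    "Key3": {
      "htmltags": "<htmltag attr1='november' /><htmltag attr2='foxtrot' /><htmltag attr3='sierra' />"
    }
}
"""


class AttrCollector(HTMLParser):
    """Hypothetical helper: collects (attribute, value) pairs from every tag."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <htmltag ... />.
        self.pairs.extend(attrs)


def extract_rows(raw_json):
    """Return one (key, attribute, value) tuple per attribute found."""
    rows = []
    for key, entry in json.loads(raw_json).items():
        parser = AttrCollector()
        parser.feed(entry["htmltags"])
        parser.close()
        rows.extend((key, name, value) for name, value in parser.pairs)
    return rows


rows = extract_rows(DATA)
for row in rows:
    print(row)
```

Each attribute becomes its own row, which is roughly the flat shape I was hoping to get out of a query.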

answered 2015-11-17T12:45:02.977

The JSON does not seem to be well-formed. The objects are not clearly identified by name/value pairs, nor is it a clear array.

Once that is resolved, the value of htmltags will have to be processed with string functions such as locate, substring, position, etc. (see https://drill.apache.org/docs/string-manipulation/).
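To illustrate the locate-then-substring idea, here is a rough sketch in Python rather than Drill SQL, so none of the names below are Drill functions. The helper attr_value is hypothetical, and it assumes every value is wrapped in single quotes as in the question.

```python
def attr_value(tags, attr):
    """Return the single-quoted value of `attr` in `tags`, or None if absent."""
    marker = attr + "='"
    start = tags.find(marker)        # locate the attribute name
    if start == -1:
        return None
    start += len(marker)             # position of the first value character
    end = tags.find("'", start)      # locate the closing quote
    if end == -1:
        return None
    return tags[start:end]           # take the substring between the quotes


sample = "<htmltag attr1='bravo' /><htmltag attr2='delta' /><htmltag attr3='charlie' />"
print(attr_value(sample, "attr2"))
```

This kind of matching is fragile: it would also match an attribute whose name merely ends in the one requested, and it breaks on double-quoted values.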

The best option would probably be to have htmltags as an array rather than a string.
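A possible preprocessing step, sketched in Python, would rewrite the file so that each key becomes its own record with htmltags as an array of name/value objects. The field names key, name, and value and the helpers TagCollector and restructure are my own choices for illustration. I have not verified how Drill handles the result, so treat the one-object-per-line output as an assumption about what is convenient to query.

```python
import json
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: turns each attribute into a name/value object."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <htmltag ... />.
        self.items.extend({"name": n, "value": v} for n, v in attrs)


def restructure(document):
    """Return one record per top-level key, with htmltags as a list."""
    records = []
    for key, entry in document.items():
        collector = TagCollector()
        collector.feed(entry["htmltags"])
        collector.close()
        records.append({"key": key, "htmltags": collector.items})
    return records


# First entry from the question, inlined to keep the sketch short.
document = {
    "Key1": {
        "htmltags": "<htmltag attr1='bravo' /><htmltag attr2='delta' /><htmltag attr3='charlie' />"
    }
}

records = restructure(document)
# One JSON object per line.
print("\n".join(json.dumps(record) for record in records))
```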

answered 2015-11-16T17:31:02.570