2

使用 Spark 将 AsTextFile 保存到 S3 后,然后像 Hadoop 一样格式化。像这样的桶中的文件格式。

在此处输入图像描述

格式为 YEAR/MONTH/DATE/TIMESTAMP,数据部分为 0000 文件为 json 格式。

我配置钻取并指向我的存储桶名称

{
  "type": "file",
  "enabled": true,
  "connection": "s3://com.giaosudau.win-bid",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": "json"
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json"
    },
    "avro": {
      "type": "avro"
    }
  }
}

我选择存储没问题。但不知道选择和获取日期或时间或文件的数据。

在此处输入图像描述

我也尝试了一些事情,比如选择一个文件但没有成功:

在此处输入图像描述

文件样本在这里

{"auctionId":"xx","bidRequestString":{"id":"xx","timestamp":"2015-07-29T08:31:00.413Z","isTest":false,"url":"http://www.222.3232.com/","userAgent":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","protocolVersion":"Google Protocol Buffer","exchange":"adx","provider":"Google","location":{"dma":-1,"metro":-1,"timezoneOffsetMinutes":-1},"segments":{"AdxDetectedVerticals":["23:0.571243","355:0.409098","380:0.339647","474:0.415936","540:0.079871"]},"userIds":{"prov":"bGZqaWdsa2dmbmRubWxtamdoaGs","xchg":"bGZqaWdsa2dmbmRubWxtamdoaGs"},"imp":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"spots":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"site":{"id":"502","publisher":{"id":"502"},"page":"http://www.blog.34343.com/"},"device":{"ua":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","ip":"109.103.101.0","geo":{"zip":" 2600"},"language":"en","ext":{"geo_criteria_id":1000142}},"user":{"id":"bGZqaWdsa2dmbmRubWxtamdoaGs","ext":{"cookie_age_seconds":2080089}}},"bidResponseCreative":{"itemId":"fsknxfwe34235235","campaignId":"332","htmlSnippet":"\u003ciframe frameborder=0 scrolling=no width=\"300\" height=\"250\" src=\"//sv.brand-display.com/adedge/api/bd/serving/simple/frame?aukey=34343\u0026_=%CACHEBUSTER%\u0026winning_price=%WINNING_PRICE%\u0026google_click_url=%CLICK_URL_ESC%\u0026encrypt_value=%ENCRYPT_VALUE%\"\u003e\u003c/iframe\u003e","name":"Expandable Web","formatCode":"billboard","bd":"","status":1,"deleted":false,"destinationUrl":"https://ex.sg","tagging":[1,2,3],"expandingDirection":14,"bdUrl":"","format":{"code":"billboard","name":"Billboard","publisher":"default","type":"standard","width":120,"height":600,"expanded_width":0,"expanded_height":0,"collapsed_width":0,"collapsed_height":0,"aspratio":3.76,"expandable":true,"expand_first":true}},"bidResponseCreativeId":"1","bidResponseCreativeName":"","bidResponseData":{"bids":[{"creative":0,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":0},{"creative":1,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":1}]},"bidResponseMeta":"null","bidWinMeta":"null","biddingAgentName":"iface.http","biddingFullAccount":"23848834:strategy","biddingMainAccount":"1212312","biddingMaxPrice":"6071USD/1M","biddingRequestFormatType":"datacratic","biddingSubAccount":"strategy","impIndex":"1","impressionId":"199","pricePriority":"1.000000","rawWinPrice":"100USD/1M","timestamp":"2015-Jul-29 08:31:00.53899","userIds":{"prov":"123232","xchg":"9849839"},"winPrice":"100USD/1M"}
{"auctionId":"343","bidRequestString":{"id":"344","timestamp":"2015-07-28T08:31:00.413Z","isTest":false,"url":"http://www.xx.xx.com/","userAgent":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","protocolVersion":"Google Protocol Buffer","exchange":"adx","provider":"Google","location":{"dma":-1,"metro":-1,"timezoneOffsetMinutes":-1},"segments":{"AdxDetectedVerticals":["23:0.571243","355:0.409098","380:0.339647","474:0.415936","540:0.079871"]},"userIds":{"prov":"bGZqaWdsa2dmbmRubWxtamdoaGs","xchg":"bGZqaWdsa2dmbmRubWxtamdoaGs"},"imp":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"spots":[{"id":"99","banner":{"w":[220,300],"h":[600,450],"id":"99","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["220x600","300x450"],"position":0},{"id":"199","banner":{"w":120,"h":600,"id":"199","pos":0,"topframe":0},"pmp":{"ext":{"adgroup_id":"17490739393"}},"formats":["120x600"],"position":0}],"site":{"id":"502","publisher":{"id":"502"},"page":"http://www.blog.343.com/"},"device":{"ua":"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7pre) Gecko/20070815 Firefox/2.0.0.6 Navigator/9.0b3","ip":"109.103.101.0","geo":{"zip":" 2600"},"language":"en","ext":{"geo_criteria_id":1000142}},"user":{"id":"bGZqaWdsa2dmbmRubWxtamdoaGs","ext":{"cookie_age_seconds":2080089}}},"bidResponseCreative":{"itemId":"fsknxfwe34235235","campaignId":"342342342","htmlSnippet":"\u003ciframe frameborder=0 scrolling=no width=\"300\" height=\"250\" src=\"//sv.brand-display.com/adedge/api/bd/serving/simple/frame?aukey=xxx\u0026_=%CACHEBUSTER%\u0026winning_price=%WINNING_PRICE%\u0026google_click_url=%CLICK_URL_ESC%\u0026encrypt_value=%ENCRYPT_VALUE%\"\u003e\u003c/iframe\u003e","name":"Expandable Web","formatCode":"billboard","bd":"","status":1,"deleted":false,"destinationUrl":"https://ex.sg","tagging":[1,2,3],"expandingDirection":14,"bdUrl":"","format":{"code":"billboard","name":"Billboard","publisher":"default","type":"standard","width":120,"height":600,"expanded_width":0,"expanded_height":0,"collapsed_width":0,"collapsed_height":0,"aspratio":3.76,"expandable":true,"expand_first":true}},"bidResponseCreativeId":"1","bidResponseCreativeName":"","bidResponseData":{"bids":[{"creative":0,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":0},{"creative":1,"ext":null,"price":"6071USD/1M","priority":1.0,"spotIndex":1}]},"bidResponseMeta":"null","bidWinMeta":"null","biddingAgentName":"iface.http","biddingFullAccount":"829838828928932:strategy","biddingMainAccount":"3434","biddingMaxPrice":"6071USD/1M","biddingRequestFormatType":"datacratic","biddingSubAccount":"strategy","impIndex":"1","impressionId":"199","pricePriority":"1.000000","rawWinPrice":"100USD/1M","timestamp":"2015-Jul-29 08:31:00.53899","userIds":{"prov":"cc","xchg":"cc"},"winPrice":"100USD/1M"}

如何从中获取数据?

4

1 回答 1

2

以下是一些使用您的数据和本地文件系统而不是 s3 的示例。我认为您的文件没有任何扩展名,但如果不是这样,如果您的文件具有 .json 扩展名,您可以像这样查询它们(使用表别名 t 来解决歧义):

use dfs.`root`;
+-------+---------------------------------------+
|  ok   |                summary                |
+-------+---------------------------------------+
| true  | Default schema changed to [dfs.root]  |
+-------+---------------------------------------+
1 row selected (0.079 seconds)
select t.auctionID from dfs.`/Users/khahn/drill/apache-drill-1.1.0/part-00000` t;
+------------+
| auctionId  |
+------------+
| xx         |
| 343        |
+------------+
2 rows selected (0.093 seconds)

如果您的文件没有扩展名,请修改存储插件配置,将 defaultInputFormat 设置为“json”;否则,Drill 假定文件默认为 Parquet 格式。例如,我在我的根工作区中设置了 defaultInputFormat:

{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": "json"
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
. . .

当您查询没有扩展名的文件时,您需要指定定义 defaultInputFormat 的工作区,在此示例中为 root:

select t.auctionID from dfs.`root`.`/Users/khahn/drill/apache-drill-1.1.0/part-00000` t;
+------------+
| auctionId  |
+------------+
| xx         |
| 343        |
+------------+
2 rows selected (0.095 seconds)

请参阅有关处理类型差异的文档:https ://drill.apache.org/docs/json-data-model/#handling-type-differences

于 2015-07-30T22:22:39.483 回答