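The Apache Drill feature list mentions that it can query data from Google Cloud Storage, but I can't find any information on how to do that. I have it working fine with S3, but I suspect I'm missing something very simple for Google Cloud Storage.

Does anyone have an example storage plugin configuration for Google Cloud Storage?

Thanks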

2 Answers

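This is quite an old question, so I imagine you either found a solution or moved on with your life, but for anyone looking for a solution that does not use Dataproc, here is one: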

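  1. Add the JAR file from the GCP connector to the jars/3rdparty directory.
  2. Add the following to the core-site.xml file in the conf directory (change the upper-case values such as YOUR_PROJECT_ID to your own details):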
  <property>
    <name>fs.gs.project.id</name>
    <value>YOUR_PROJECT_ID</value>
    <description>
      Optional. Google Cloud Project ID with access to GCS buckets.
      Required only for list buckets and create bucket operations.
    </description>
  </property>
  <property>
    <name>fs.gs.auth.service.account.private.key.id</name>
    <value>YOUR_PRIVATE_KEY_ID</value>
  </property>
  <property>
    <name>fs.gs.auth.service.account.private.key</name>
    <value>-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n</value>
  </property>
  <property>
    <name>fs.gs.auth.service.account.email</name>
    <value>YOUR_SERVICE_ACCOUNT_EMAIL</value>
    <description>
      The email address is associated with the service account used for GCS
      access when fs.gs.auth.service.account.enable is true. Required
      when authentication key specified in the Configuration file (Method 1)
      or a PKCS12 certificate (Method 3) is being used.
    </description>
  </property>
  <property>
    <name>fs.gs.working.dir</name>
    <value>/</value>
    <description>
      The directory relative gs: uris resolve in inside of the default bucket.
    </description>
  </property>
  <property>
    <name>fs.gs.implicit.dir.repair.enable</name>
    <value>true</value>
    <description>
      Whether or not to create objects for the parent directories of objects
      with / in their path e.g. creating gs://bucket/foo/ upon deleting or
      renaming gs://bucket/foo/bar.
    </description>
  </property>
  <property>
    <name>fs.gs.glob.flatlist.enable</name>
    <value>true</value>
    <description>
      Whether or not to prepopulate potential glob matches in a single list
      request to minimize calls to GCS in nested glob cases.
    </description>
  </property>
  <property>
    <name>fs.gs.copy.with.rewrite.enable</name>
    <value>true</value>
    <description>
      Whether or not to perform copy operation using Rewrite requests. Allows
      to copy files between different locations and storage classes.
    </description>
  </property>
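
The credential placeholders above usually come from a service account JSON key file downloaded from the GCP console. The following is a minimal sketch, not part of the original answer, that renders the four credential properties from such a key. It assumes the standard key-file fields `project_id`, `private_key_id`, `private_key` and `client_email`; the helper name `gcs_properties` and the sample values are hypothetical.

```python
from xml.sax.saxutils import escape

# Property names are taken from the core-site.xml snippet above; the
# right-hand side names the matching field of a service account key file.
KEY_FIELDS = {
    "fs.gs.project.id": "project_id",
    "fs.gs.auth.service.account.private.key.id": "private_key_id",
    "fs.gs.auth.service.account.private.key": "private_key",
    "fs.gs.auth.service.account.email": "client_email",
}


def gcs_properties(key: dict) -> str:
    """Render <property> elements for the credentials in a parsed key file."""
    blocks = []
    for name, field in KEY_FIELDS.items():
        # Assumption: mirror the snippet above, which writes line breaks in
        # the private key as the two characters "\n" rather than real newlines.
        value = escape(key[field].replace("\n", "\\n"))
        blocks.append(
            "  <property>\n"
            f"    <name>{name}</name>\n"
            f"    <value>{value}</value>\n"
            "  </property>"
        )
    return "\n".join(blocks)


# Placeholder values only; a real key would be loaded with json.load().
sample_key = {
    "project_id": "YOUR_PROJECT_ID",
    "private_key_id": "YOUR_PRIVATE_KEY_ID",
    "private_key": "-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n",
    "client_email": "YOUR_SERVICE_ACCOUNT_EMAIL",
}
print(gcs_properties(sample_key))
```

The printed elements can be pasted inside the `<configuration>` element of core-site.xml next to the non-credential properties. Treat the generated file like the key itself, since it contains the private key in plain text.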

Start Apache Drill.

Add a custom storage plugin to Drill.

You're good to go.
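
The "add a custom storage plugin" step is not spelled out here, so the following is a minimal sketch of one way to do it: build a file-type plugin configuration that points at a `gs://` bucket and post it to Drill's REST API. The plugin name `gcs`, the bucket name, the trimmed list of formats, and the assumption that the Drill web server listens on `http://localhost:8047` without authentication are all illustrative, not taken from the answer.

```python
import json
import urllib.request


def gcs_plugin_payload(name: str, bucket: str) -> dict:
    """Build the request body for a file-type storage plugin on a GCS bucket."""
    return {
        "name": name,
        "config": {
            "type": "file",
            "enabled": True,
            "connection": f"gs://{bucket}",
            "workspaces": {
                "root": {
                    "location": "/",
                    "writable": False,
                    "defaultInputFormat": None,
                },
            },
            # A deliberately small set of formats; extend as needed.
            "formats": {
                "parquet": {"type": "parquet"},
                "json": {"type": "json", "extensions": ["json"]},
                "csv": {"type": "text", "extensions": ["csv"], "delimiter": ","},
            },
        },
    }


def register_plugin(payload: dict, drill_url: str = "http://localhost:8047") -> int:
    """POST the plugin definition to /storage/<name>.json; returns the HTTP status."""
    request = urllib.request.Request(
        f"{drill_url}/storage/{payload['name']}.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


payload = gcs_plugin_payload("gcs", "myBucketName")
print(json.dumps(payload["config"], indent=2))
# register_plugin(payload)  # uncomment once a drillbit is running
```

The printed `config` object can also be pasted into the Storage tab of the Drill web console instead of calling the REST endpoint.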

The solution comes from here, where I go into more detail about the data exploration work we did with Apache Drill.

Answered 2020-03-18T16:51:52.750

I managed to query Parquet data in Google Cloud Storage (GCS) using Apache Drill (1.6.0) running on a Google Dataproc cluster. To set that up, I took the following steps:

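  1. Install Drill and make the GCS connector accessible (this can be used as an initialization script for Dataproc; note that it has not really been tested and relies on a local ZooKeeper instance):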

    #!/bin/sh
    set -x -e
    BASEDIR="/opt/apache-drill-1.6.0"
    mkdir -p ${BASEDIR}
    cd ${BASEDIR}
    wget http://apache.mesi.com.ar/drill/drill-1.6.0/apache-drill-1.6.0.tar.gz
    tar -xzvf apache-drill-1.6.0.tar.gz
    mv apache-drill-1.6.0/* .
    rm -rf apache-drill-1.6.0 apache-drill-1.6.0.tar.gz
    
    ln -s /usr/lib/hadoop/lib/gcs-connector-1.4.5-hadoop2.jar ${BASEDIR}/jars/gcs-connector-1.4.5-hadoop2.jar
    mv ${BASEDIR}/conf/core-site.xml ${BASEDIR}/conf/core-site.xml.old
    ln -s /etc/hadoop/conf/core-site.xml ${BASEDIR}/conf/core-site.xml
    
    ${BASEDIR}/bin/drillbit.sh start
    
    set +x +e
    
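  2. Connect to the Drill console, create a new storage plugin (call it, say, gcs), and use the following configuration (note that I copied most of it from the s3 configuration and made a few minor changes):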

    {
      "type": "file",
      "enabled": true,
      "connection": "gs://myBucketName",
      "config": null,
      "workspaces": {
        "root": {
          "location": "/",
          "writable": false,
          "defaultInputFormat": null
        },
        "tmp": {
          "location": "/tmp",
          "writable": true,
          "defaultInputFormat": null
        }
      },
      "formats": {
        "psv": {
          "type": "text",
          "extensions": [
            "tbl"
          ],
          "delimiter": "|"
        },
        "csv": {
          "type": "text",
          "extensions": [
            "csv"
          ],
          "delimiter": ","
        },
        "tsv": {
          "type": "text",
          "extensions": [
            "tsv"
          ],
          "delimiter": "\t"
        },
        "parquet": {
          "type": "parquet"
        },
        "json": {
          "type": "json",
          "extensions": [
            "json"
          ]
        },
        "avro": {
          "type": "avro"
        },
        "sequencefile": {
          "type": "sequencefile",
          "extensions": [
            "seq"
          ]
        },
        "csvh": {
          "type": "text",
          "extensions": [
            "csvh"
          ],
          "extractHeader": true,
          "delimiter": ","
        }
      }
    }
    
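  3. Query using the following syntax (note the backticks; the first identifier must match the name you gave the storage plugin):

    select * from gcs.`root`.`path/to/data/*` limit 10;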
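
The same kind of query can also be submitted programmatically. This is a minimal sketch, assuming the plugin was registered under the name `gcs` and that Drill's REST API is reachable at `http://localhost:8047` without authentication; the helper names `build_query` and `run_query` are hypothetical.

```python
import json
import urllib.request


def build_query(plugin: str, workspace: str, path: str, limit: int = 10) -> str:
    """Compose a SELECT with each identifier wrapped in backticks.

    The backticks matter because the path contains characters such as
    '/' and '*' that are not valid in a bare SQL identifier.
    """
    return f"SELECT * FROM `{plugin}`.`{workspace}`.`{path}` LIMIT {limit}"


def run_query(sql: str, drill_url: str = "http://localhost:8047") -> list:
    """POST the statement to /query.json and return the result rows."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    request = urllib.request.Request(
        f"{drill_url}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["rows"]


sql = build_query("gcs", "root", "path/to/data/*")
print(sql)
# rows = run_query(sql)  # uncomment once a drillbit is running
```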
    
Answered 2016-07-12T17:17:41.633