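The Apache Drill feature list mentions that it can query data from Google Cloud Storage, but I can't find any information on how to do that. I have it working fine with S3, but I suspect I'm missing something very simple for Google Cloud Storage.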




2 Answers


This is a fairly old question, so I imagine you either found a solution or moved on with your life, but for anyone looking for a way to do this without Dataproc, here is one:

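  1. Add the JAR file from the GCP connectors to the jars/3rdparty directory.
  2. Add the following to the core-site.xml file in the conf directory (change the upper-case values such as YOUR_PROJECT_ID to your own details):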
      Optional. Google Cloud Project ID with access to GCS buckets.
      Required only for list buckets and create bucket operations.

        <value>-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n</value>

      The email address is associated with the service account used for GCS
      access when fs.gs.auth.service.account.enable is true. Required
      when authentication key specified in the Configuration file (Method 1)
      or a PKCS12 certificate (Method 3) is being used.

      The directory relative gs: uris resolve in inside of the default bucket.

      Whether or not to create objects for the parent directories of objects
      with / in their path e.g. creating gs://bucket/foo/ upon deleting or
      renaming gs://bucket/foo/bar.

      Whether or not to prepopulate potential glob matches in a single list
      request to minimize calls to GCS in nested glob cases.

      Whether or not to perform copy operation using Rewrite requests. Allows
      to copy files between different locations and storage classes.

  3. Start Apache Drill.

  4. Add a custom storage plugin to Drill.
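
For step 2, a core-site.xml along the following lines is one way to wire up the connector. This is a minimal sketch, not the file from the original answer: the property names are taken from the GCS connector's configuration documentation and matched to the descriptions above as an assumption, `write_gcs_core_site` and `DRILL_HOME` are hypothetical names, and the upper-case values are placeholders. Authentication settings differ between connector releases, so check the documentation for the version you install.

```shell
# Sketch: write a core-site.xml for the GCS connector into a conf directory.
# Property names are assumed from the GCS connector documentation; the
# upper-case values are placeholders, not working credentials.
# Note: this overwrites any existing core-site.xml in that directory.
write_gcs_core_site() {
  conf_dir="$1"
  mkdir -p "$conf_dir"
  # The quoted 'EOF' keeps the \n sequences in the key literal.
  cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>YOUR_PROJECT_ID</value>
  </property>
  <property>
    <name>fs.gs.auth.service.account.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.gs.auth.service.account.email</name>
    <value>YOUR_SERVICE_ACCOUNT_EMAIL</value>
  </property>
  <property>
    <name>fs.gs.auth.service.account.private.key.id</name>
    <value>YOUR_PRIVATE_KEY_ID</value>
  </property>
  <property>
    <name>fs.gs.auth.service.account.private.key</name>
    <value>-----BEGIN PRIVATE KEY-----\nYOUR_PRIVATE_KEY\n-----END PRIVATE KEY-----\n</value>
  </property>
</configuration>
EOF
}

# Usage (DRILL_HOME is a hypothetical variable pointing at the Drill install):
#   write_gcs_core_site "$DRILL_HOME/conf"
```

The values for the email, key ID, and private key can be copied from the JSON key file downloaded for the service account. Restart Drill after changing the file so the new settings are picked up.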


The solution comes from here, where I go into more detail about the data exploration work we did with Apache Drill.

answered 2020-03-18T16:51:52.750

I managed to query Parquet data in Google Cloud Storage (GCS) using Apache Drill (1.6.0) running on a Google Dataproc cluster. To set this up, I took the following steps:

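  1. Install Drill and make the GCS connector accessible (this can be used as an initialization script for Dataproc; note that it has not really been tested and relies on a local ZooKeeper instance):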

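    set -x -e
    # BASEDIR is the Drill install directory; it must be set before running.
    : "${BASEDIR:?set BASEDIR to the Drill install directory}"
    mkdir -p "${BASEDIR}"
    cd "${BASEDIR}"
    # Download and unpack Drill 1.6.0 directly into BASEDIR.
    wget http://apache.mesi.com.ar/drill/drill-1.6.0/apache-drill-1.6.0.tar.gz
    tar -xzvf apache-drill-1.6.0.tar.gz
    mv apache-drill-1.6.0/* .
    rm -rf apache-drill-1.6.0 apache-drill-1.6.0.tar.gz
    # Expose the GCS connector that ships with Dataproc to Drill.
    ln -s /usr/lib/hadoop/lib/gcs-connector-1.4.5-hadoop2.jar "${BASEDIR}/jars/gcs-connector-1.4.5-hadoop2.jar"
    # Reuse the cluster's Hadoop core-site.xml, keeping Drill's own copy as a backup.
    mv "${BASEDIR}/conf/core-site.xml" "${BASEDIR}/conf/core-site.xml.old"
    ln -s /etc/hadoop/conf/core-site.xml "${BASEDIR}/conf/core-site.xml"
    # drillbit.sh lives in Drill's bin directory, which is not on PATH by default.
    "${BASEDIR}/bin/drillbit.sh" start
    set +x +e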
  2. Connect to the Drill console, create a new storage plugin (call it gcs, for example), and use the following configuration (note that I copied most of it from the s3 configuration and made a few small changes):

      {
        "type": "file",
        "enabled": true,
        "connection": "gs://myBucketName",
        "config": null,
        "workspaces": {
          "root": {
            "location": "/",
            "writable": false,
            "defaultInputFormat": null
          },
          "tmp": {
            "location": "/tmp",
            "writable": true,
            "defaultInputFormat": null
          }
        },
        "formats": {
          "psv": {
            "type": "text",
            "extensions": ["tbl"],
            "delimiter": "|"
          },
          "csv": {
            "type": "text",
            "extensions": ["csv"],
            "delimiter": ","
          },
          "tsv": {
            "type": "text",
            "extensions": ["tsv"],
            "delimiter": "\t"
          },
          "parquet": {
            "type": "parquet"
          },
          "json": {
            "type": "json",
            "extensions": ["json"]
          },
          "avro": {
            "type": "avro"
          },
          "sequencefile": {
            "type": "sequencefile",
            "extensions": ["seq"]
          },
          "csvh": {
            "type": "text",
            "extensions": ["csvh"],
            "extractHeader": true,
            "delimiter": ","
          }
        }
      }
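  3. Query using the following syntax (note the backticks; the first identifier must match the name of the storage plugin, gcs in this example):

    select * from gcs.`root`.`path/to/data/*` limit 10;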
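
The same kind of query can also be submitted from a script through Drill's REST API instead of the console. The following is a rough sketch under a few assumptions: Drill's web interface is listening on its default port 8047 on localhost, the storage plugin was saved under the name gcs, and `DRILL_URL`, `query_payload`, and `run_query` are hypothetical names introduced here for illustration.

```shell
# Sketch: run a Drill query over the REST API instead of the console.
# DRILL_URL and the helper names are hypothetical; 8047 is Drill's default web port.
DRILL_URL="${DRILL_URL:-http://localhost:8047}"

# Build the JSON body for POST /query.json. No JSON escaping is done here,
# so the SQL must not contain double quotes or backslashes.
query_payload() {
  printf '{"queryType": "SQL", "query": "%s"}' "$1"
}

# Submit a SQL statement and print Drill's JSON response.
run_query() {
  curl -sS -X POST -H 'Content-Type: application/json' \
    -d "$(query_payload "$1")" "$DRILL_URL/query.json"
}

# Usage, assuming the plugin was saved under the name gcs
# (single quotes keep the shell from interpreting the backticks):
#   run_query 'select * from gcs.`root`.`path/to/data/*` limit 10'
```

Keeping the payload builder separate from the `curl` call makes it easy to inspect the request body before sending anything to a running Drillbit.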

answered 2016-07-12T17:17:41.633