distributed - How to configure Drill to use all the nodes for a query (by creating multiple fragments)

Question

I am running Drill (1.3) on two nodes, say:

192.xxx.xxx.xxx
192.yyy.yyy.yyy

I ran a query (from 192.xxx.xxx.xxx) against a CSV file (1000 million records):

select count(*) from dfs.`home/impadmin/BiggerBoy.csv`

I also tried a join query (from 192.xxx.xxx.xxx) across Hive and Oracle:

select * from hive.testDB.`catalog_sales` x inner join oracle.ILABUSER.`customer_address` y on y.CA_ADDRESS_SK = x.CS_BILL_ADDR_SK group by  y.CA_CITY limit 100

Every time, the Drill UI showed:

Query Profile
STATE: COMPLETED

FOREMAN: 192.xxx.xxx.xxx

TOTAL FRAGMENTS: 1

Why is the other node not used? What is the benefit of using multiple nodes in this case?

Does Drill take care of this by itself, or do I need to configure something?

If anybody has been able to get multiple fragments, please share your use case.

score 0 · Accepted Answer

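Assuming you are using a distributed file system: I gather from this post that the local file system plugin (dfs) does not work with multiple Drillbits. Although the cited post mainly addresses a problem with writing, it sounds like it applies to your problem with reading as well.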

To configure Drill to use multiple nodes, see the subsections under Installing Drill in Distributed Mode.
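Before looking at fragment counts, it may help to confirm that both Drillbits have actually joined the same cluster. A minimal check, assuming the standard `sys.drillbits` system table is available in your version:

```sql
-- List the Drillbits currently registered in the cluster.
-- With the two-node setup from the question, two rows would be expected;
-- a single row would suggest the second node never joined.
SELECT * FROM sys.drillbits;
```

If only the foreman shows up here, no query can be distributed to the other node, whatever the plan looks like.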

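Query distribution depends on query complexity. When the planner builds the query plan, it breaks the plan into multiple major fragments, and there is typically some distribution among them. On a single node, multiple minor fragments can run on that same node; for example, on a 32-core machine you can run up to 23 minor fragments, approximately 75%. With multiple nodes, for example 4 nodes, each node might run 23 minor fragments for the same query.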
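The per-node limit on minor fragments is governed by planner options. The sketch below shows one way to inspect and adjust them; the value 23 is only the figure from the example above, not a recommendation, and the right setting depends on your hardware.

```sql
-- Show the current parallelization width options
-- (planner.width.max_per_node and planner.width.max_per_query).
SELECT * FROM sys.options WHERE name LIKE 'planner.width%';

-- Cap the number of minor fragments per node for this session only.
-- 23 is the illustrative value for a 32-core machine used above.
ALTER SESSION SET `planner.width.max_per_node` = 23;
```

These options only set an upper bound; whether a query is actually split still depends on the plan and on how splittable the input is.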

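If you have a root fragment, which runs on the foreman node, Drill cannot split it. The distribution of leaf fragments depends on the query and is limited by the number of splittable inputs. If you have a single file that is not splittable, the query plan uses a single leaf. Intermediate fragments in the plan, if there are any, can be distributed. I could not find detailed information on how the distribution of single leaves and intermediate fragments is limited to one node.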

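In the query profile, when you click the root fragment, you see only a single minor fragment, and every fragment listed has the same hostname as the foreman. If you click one of the other major fragments in the query profile, you see the different hostnames that the query was distributed to.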
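The fragment structure can also be inspected before running the query. As a rough sketch, using the CSV query from the question, `EXPLAIN PLAN FOR` prints the physical plan:

```sql
-- Print the physical plan without executing the query.
-- Each operator line is prefixed with an id such as 00-01 or 01-02;
-- the first pair of digits is the major fragment number.
EXPLAIN PLAN FOR
SELECT COUNT(*) FROM dfs.`home/impadmin/BiggerBoy.csv`;
```

If every operator id starts with 00, the plan has a single major fragment and contains no exchange operators, which would be consistent with the TOTAL FRAGMENTS: 1 shown in the profile.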
