hive - Improving Hive performance

Question

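I am working with a database (2.5 GB) in which some tables have only 40 rows and others have 9 million rows. Any query I run against the large tables takes a long time. I would like to get results faster.

A small query on a table with only 90 rows: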

hive> select count(*) from cidade; 
Time taken: 50.172 seconds

hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>

<property>
<name>dfs.block.size</name>
<value>131072</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
</configuration>

Do these settings affect Hive performance? dfs.replication=3 dfs.block.size=131072

Can I set it from the Hive prompt, like this:

hive>set dfs.replication=5

Does this value persist only for that particular session?

Or is it better to change it in the .xml file?

score 4 · Accepted Answer

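The important point is that select count(*) causes Hive to launch a MapReduce job.

You might expect it to be very fast, like a MySQL query.

But even for the simplest MapReduce job in Hadoop, the total time includes submitting the job to the JobTracker, assigning tasks to the TaskTrackers, and so on. So the total time is at least a few tens of seconds.

Try select count(*) on a large table. The time will not increase much.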

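So you need to understand that Hive and Hadoop are built for processing big data: the fixed startup cost matters little on large tables, but it dominates on small ones.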
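The reasoning in this answer can be illustrated with a rough cost model: a fixed job-startup overhead plus a per-row scan cost. This is only a minimal sketch. The 50-second overhead is taken loosely from the timing shown in the question, and the scan rate is a purely hypothetical figure, not a measurement.

```python
# Rough cost model for a Hive query that launches a MapReduce job.
# Both constants are illustrative assumptions, not measured values.
JOB_OVERHEAD_SECONDS = 50.0      # assumed fixed cost: job submission, task scheduling
ROWS_PER_SECOND = 1_000_000.0    # hypothetical scan throughput


def estimated_seconds(row_count: int) -> float:
    """Estimate total query time as fixed overhead plus scan time."""
    return JOB_OVERHEAD_SECONDS + row_count / ROWS_PER_SECOND


small = estimated_seconds(90)          # tiny table from the question
large = estimated_seconds(9_000_000)   # large table from the question
print(f"small table: {small:.2f} s")
print(f"large table: {large:.2f} s")
```

Under these assumed numbers the 9-million-row table comes out at about 59 seconds against about 50 seconds for the tiny one, which is the "time will not increase much" effect described above.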

score 3 · Accepted Answer

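dfs.replication should not affect the run time of Hive queries. It is a property exposed from hdfs-site.xml that determines how many HDFS nodes each data block is replicated to. dfs.replication=3 means that each data block resides on 3 nodes (in total). As a result, it is not something that is session specific.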
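To make the effect of the replication factor concrete, here is a small arithmetic sketch. It assumes the 2.5 GB figure from the question is the logical data size and ignores metadata and partially filled blocks. The function name is made up for illustration.

```python
# Raw HDFS storage consumed for a given logical data size and replication factor.
def raw_storage_gb(logical_gb: float, replication: int) -> float:
    """Each block is stored on `replication` nodes, so raw usage scales linearly."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return logical_gb * replication


print(raw_storage_gb(2.5, 3))  # 7.5
print(raw_storage_gb(2.5, 5))  # 12.5
```

Raising the factor from 3 to 5 changes how much disk the data occupies across the cluster, not how much data a query has to scan.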
