
I have a 5-node Hortonworks cluster (version 2.4.2) on which I have installed HAWQ 2.0.0.

The 5 nodes are: edge, master (NameNode), node1 (DataNode 1), node2 (DataNode 2), node3 (DataNode 3).

I followed this link to install HAWQ on HDP - http://hdb.docs.pivotal.io/hdb/install/install-ambari.html

The HAWQ components are installed on the following nodes:

HAWQ master - node1, HAWQ standby master - node2

HAWQ segments - node1, node2, node3

During installation, the HAWQ master, HAWQ standby master, and HAWQ segments were installed successfully, but the basic HAWQ test run by the HAWQ installer from Ambari failed:

Below is the output of the operations the installer executed:

2016-06-30 00:24:22,513 - --- Check state of HAWQ cluster ---
2016-06-30 00:24:22,513 - Executing hawq status check...
2016-06-30 00:24:22,514 - Command executed: su - gpadmin -c "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null node1.localdomain \"source /usr/local/hawq/greenplum_path.sh && hawq state -d /data/hawq/master \" "
2016-06-30 00:24:23,343 - Output of command:
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:--HAWQ instance status summary
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:------------------------------------------------------
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:--   Master instance                                = Active
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:--   Master standby                                 = node2.localdomain
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:--   Standby master state                           = Standby host passive
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:--   Total segment instance count from config file  = 3
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:------------------------------------------------------ 
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:--   Segment Status                                    
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:------------------------------------------------------ 
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:--   Total segments count from catalog      = 1
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:--   Total segment valid (at master)        = 0
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:--   Total segment failures (at master)     = 3
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:--   Total number of postmaster.pid files missing   = 0
20160630:00:24:23:032731 hawq_state:node1:gpadmin-[INFO]:--   Total number of postmaster.pid files found     = 3


2016-06-30 00:24:23,344 - --- Check if HAWQ can write and query from a table ---
2016-06-30 00:24:23,344 - Dropping ambari_hawq_test table if exists
2016-06-30 00:24:23,344 - Command executed: su - gpadmin -c "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null node1.localdomain \"export PGPORT=5432 && source /usr/local/hawq/greenplum_path.sh && psql -d template1 -c \\\"DROP  TABLE IF EXISTS ambari_hawq_test;\\\" \" "
2016-06-30 00:24:23,436 - Output:
DROP TABLE

2016-06-30 00:24:23,436 - Creating table ambari_hawq_test
2016-06-30 00:24:23,436 - Command executed: su - gpadmin -c "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null node1.localdomain \"export PGPORT=5432 && source /usr/local/hawq/greenplum_path.sh && psql -d template1 -c \\\"CREATE  TABLE ambari_hawq_test (col1 int) DISTRIBUTED RANDOMLY;\\\" \" "
2016-06-30 00:24:23,693 - Output:
CREATE TABLE

2016-06-30 00:24:23,693 - Inserting data to table ambari_hawq_test
2016-06-30 00:24:23,693 - Command executed: su - gpadmin -c "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null node1.localdomain \"export PGPORT=5432 && source /usr/local/hawq/greenplum_path.sh && psql -d template1 -c \\\"INSERT INTO  ambari_hawq_test SELECT * FROM generate_series(1,10);\\\" \" "

--- As seen above, the DROP and CREATE TABLE statements executed, but the INSERT did not succeed.

So I ran the insert manually on the HAWQ master node, i.e. node1.

These are the steps executed manually:

[root@node1 ~]# su - gpadmin
[gpadmin@node1 ~]$ psql
psql (8.4.20, server 8.2.15)
WARNING: psql version 8.4, server version 8.2.
         Some psql features might not work.
Type "help" for help.

gpadmin=#
gpadmin=# \c gpadmin
psql (8.4.20, server 8.2.15)
WARNING: psql version 8.4, server version 8.2.
         Some psql features might not work.
You are now connected to database "gpadmin".
gpadmin=# create table test(name varchar);
gpadmin=# insert into test values('vikash');

-- The insert above hung for a long time and then failed with this error:

ERROR: failed to acquire resource from resource manager, resource request timed out due to no available cluster (pquery.c:804)
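With standalone resource management, the master's built-in resource manager can only grant resources from segments that have registered as up, so this error suggests it currently sees none. A minimal sketch for double-checking that from node1, reusing the install prefix and master data directory from the installer command above (the pg_log location under the master data directory is my assumption, mirroring the segment log path shown further below):

# Same state check the installer ran; "Total segment failures (at master)" is the number to watch
su - gpadmin -c "source /usr/local/hawq/greenplum_path.sh && hawq state -d /data/hawq/master"
# Assumed log location: CSV logs under the master data directory's pg_log, like the segment logs below
grep -i "segment" /data/hawq/master/pg_log/hawq-*.csv | tail -n 20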

Also, the HAWQ segment log on node1 shows the following:

[root@node1 ambari-agent]# tail -f /data/hawq/segment/pg_log/hawq-2016-06-30_045853.csv
2016-06-30 05:10:24.522688 EDT,,,p248618,th-1357371264,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 192.168.122.1",,,,,,,0,,"network_utils.c",210,
2016-06-30 05:10:54.603726 EDT,,,p248618,th-1357371264,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 127.0.0.1",,,,,,,0,,"network_utils.c",210,
2016-06-30 05:10:54.603769 EDT,,,p248618,th-1357371264,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 2.10.1.71",,,,,,,0,,"network_utils.c",210,
2016-06-30 05:10:54.603778 EDT,,,p248618,th-1357371264,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 192.168.122.1",,,,,,,0,,"network_utils.c",210,
2016-06-30 05:11:24.625919 EDT,,,p248618,th-1357371264,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 127.0.0.1",,,,,,,0,,"network_utils.c",210,
2016-06-30 05:11:24.626088 EDT,,,p248618,th-1357371264,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 2.10.1.71",,,,,,,0,,"network_utils.c",210,
2016-06-30 05:11:24.626129 EDT,,,p248618,th-1357371264,,,,0,,,seg-10000,,,,,"LOG","00000","Resource manager discovered local host IPv4 address 192.168.122.1",,,,,,,0,,"network_utils.c",210,

I also tried checking "gp_segment_configuration":

gpadmin=# select * from gp_segment_configuration
gpadmin-# ;
 registration_order | role | status | port  |     hostname      |  address  |            description
--------------------+------+--------+-------+-------------------+-----------+------------------------------------
                 -1 | s    | u      |  5432 | node2.localdomain | 2.10.1.72 |
                  0 | m    | u      |  5432 | node1             | node1     |
                  1 | p    | d      | 40000 | node1.localdomain | 2.10.1.71 | resource manager process was reset
(3 rows)

Note: In hawq-site.xml, the resource management type was selected as "STANDALONE" from the dropdown, not "YARN".
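For reference, I believe the STANDALONE choice in Ambari corresponds to the hawq_global_rm_type property being set to none (rather than yarn) in hawq-site.xml; treating the property name and file location as assumptions, it can be checked on the master with something like:

# Property name (hawq_global_rm_type) and config path are assumptions based on the /usr/local/hawq install prefix used above
grep -A 1 "hawq_global_rm_type" /usr/local/hawq/etc/hawq-site.xml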

Does anyone have any clue what is going wrong here? Thanks in advance!


2 Answers


I have run into a problem like this before. In that environment every segment host shared a common IP address, so please check whether your segment nodes have any IP address in common. HAWQ 2.0.0 treats segments with the same IP address as a single node; that is why you have 3 segment nodes but only one segment registered in gp_segment_configuration. Remove the duplicate IP address and try again.
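As a quick check along these lines (a sketch, assuming passwordless ssh to the segment hosts named in the question), list the IPv4 addresses on every segment host and look for any address, other than 127.0.0.1, that appears on more than one of them:

# Compare the IPv4 addresses configured on each segment host; an address shared
# across hosts (other than 127.0.0.1) is what makes HAWQ 2.0.0 treat them as one node.
for h in node1 node2 node3; do
  echo "== $h =="
  ssh "$h" "ifconfig | grep 'inet addr'"
done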

This issue has been fixed in the latest HAWQ code.

Answered 2016-07-01T03:02:05.283

Thank you all for the replies.

The underlying OS is CentOS, running on vCloud. As suggested, I checked the IP configuration of all 3 data nodes that host the 3 segments. The nodes were not using the same NIC (IP). However, on further investigation with ifconfig I found that, in addition to "eth1" and "lo", there was another interface configured: "virbr0".

This "virbr0" had the same IP address on all segment nodes, and that was the cause of the problem. After I removed it from all nodes, the insert query worked.

Below is the ifconfig output; to resolve the issue, "virbr0" was removed from all segment nodes (a sketch of the removal commands follows the output).

eth1      Link encap:Ethernet  HWaddr 00:50:56:01:31:26
          inet addr:2.10.1.74  Bcast:2.10.3.255  Mask:255.255.252.0
          inet6 addr: fe80::250:56ff:fe01:3126/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:426157 errors:0 dropped:0 overruns:0 frame:0
          TX packets:259592 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:361465764 (344.7 MiB)  TX bytes:216951933 (206.9 MiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:6 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:416 (416.0 b)  TX bytes:416 (416.0 b)

virbr0    Link encap:Ethernet  HWaddr 52:54:00:DC:EE:00
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
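For context, 192.168.122.1 on virbr0 is the address libvirt assigns to its "default" NAT bridge on a stock CentOS install, so every host ends up with the same one; it also matches the address the segment resource manager kept discovering in the log above. A hedged sketch of how it can be removed (run as root on each segment node; adjust if your virbr0 comes from something other than libvirt):

# Stop and disable libvirt's "default" NAT network, which owns virbr0
virsh net-destroy default
virsh net-autostart default --disable
# Take the bridge down if it is still present
ifconfig virbr0 down
# Optionally keep libvirtd from recreating it on the next boot
service libvirtd stop
chkconfig libvirtd off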

Answered 2016-07-01T07:27:38.683