postgresql - PostgreSQL 9.1 streaming replication restore_command: special meaning of exit code 255?

Question

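I have a PostgreSQL 9.1.3 streaming replication setup on Ubuntu 10.04.2 LTS (primary and standby). Replication was initialized with a streaming base backup (pg_basebackup). The restore_command script tries to fetch the required WAL archive from a remote archive location using rsync.

When the restore_command script fails with an exit code other than 255, everything works as described in the documentation:

At startup, the standby begins by restoring all WAL available in the archive location, calling restore_command. Once it reaches the end of WAL available there and restore_command fails, it tries to restore any WAL available in the pg_xlog directory. If that fails, and streaming replication has been configured, the standby tries to connect to the primary server and start streaming WAL from the last valid record found in archive or pg_xlog. If that fails or streaming replication is not configured, or if the connection is later disconnected, the standby goes back to step 1 and tries to restore the file from the archive again. This loop of retries from the archive, pg_xlog, and via streaming replication goes on until the server is stopped or failover is triggered by a trigger file.

However, when the restore_command script fails with exit code 255 (because the script returns the exit code of the failed rsync call), the server process dies with the following error: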

2012-05-09 23:21:30 CEST - @  LOG:  database system was interrupted; last known up at     2012-05-09 23:21:25 CEST
2012-05-09 23:21:30 CEST - @  LOG:  entering standby mode
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(601) [Receiver=3.0.7]
2012-05-09 23:21:30 CEST - @  FATAL:  could not restore file "00000001000000000000003D" from archive: return code 65280
2012-05-09 23:21:30 CEST - @  LOG:  startup process (PID 8184) exited with exit code 1
2012-05-09 23:21:30 CEST - @  LOG:  aborting startup due to startup process failure

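So my question is: is this a bug, is there a special meaning of exit code 255 that is missing from the otherwise excellent documentation, or am I missing something else here?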

score 2 · Accepted Answer

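On the primary server, you have WAL files in the pg_xlog/ directory. While the WAL files are there, PostgreSQL is able to deliver them to the standby when needed.

Typically, you also have a local WAL archive location. Once PostgreSQL has moved the files there, they can no longer be delivered to the standby online, and the standby expects them to come from the WAL archive location via restore_command.

If the WAL archive location is set up differently on the primary and on the standby, the standby has no way to reach those files for a while, and you end up with a gap.

In your case, this might mean that:

00000001000000000000003D has already been archived by the primary PostgreSQL;
the restore_command on the standby cannot see it in the configured source location.

You can consider copying the missing WAL files from the primary to the standby manually, using scp or rsync. It may also be worth checking your WAL locations and making sure both servers point to the same place.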

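Edit: grepping for restore_command in the sources, only access/transam/xlog.c references it. Almost at the end of the RestoreArchivedFile function (line 3115 in the 9.1.3 sources), there is a check whether restore_command exited normally or received a signal.

In the first case, the message is classified as DEBUG2. If restore_command received a signal other than SIGTERM (and, I guess, was not able to handle it properly), a FATAL error is reported. The same holds for all exit codes greater than 125.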
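To make the rule concrete, here is a minimal shell sketch of it. The function names are invented for this example, and the layout of the raw status (exit code in bits 8 to 15, signal number in the low 7 bits) is an assumption that holds on Linux. It also shows why the "return code 65280" in the question's log corresponds to exit code 255, since 65280 = 255 * 256.

```shell
#!/bin/sh
# Split a raw wait status into "exit N" or "signal N".
# Assumption: Linux-style encoding (exit code in bits 8-15, signal in bits 0-6).
decode_wait_status() {
    status=$1
    sig=$(( status & 127 ))
    code=$(( (status >> 8) & 255 ))
    if [ "$sig" -ne 0 ]; then
        echo "signal $sig"
    else
        echo "exit $code"
    fi
}

# Approximate the check described above for a failed (non-zero) status:
# SIGTERM is handled separately, other signals and exit codes > 125 are
# reported as FATAL, everything else only as DEBUG2.
classify_restore_result() {
    decoded=$(decode_wait_status "$1")
    kind=${decoded%% *}
    value=${decoded##* }
    if [ "$kind" = "signal" ]; then
        if [ "$value" -eq 15 ]; then
            echo "SHUTDOWN"
        else
            echo "FATAL"
        fi
    elif [ "$value" -gt 125 ]; then
        echo "FATAL"
    else
        echo "DEBUG2"
    fi
}

decode_wait_status 65280        # prints: exit 255
classify_restore_result 65280   # prints: FATAL
classify_restore_result 256     # exit code 1, prints: DEBUG2
```

This is only a model of the behavior, not the server's code; the labels SHUTDOWN, FATAL and DEBUG2 are plain strings chosen to match the wording of this answer.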

I cannot tell you why, though.
I suggest asking on the hackers list.

score 0 · Accepted Answer

This looks like an rsync problem I ran into temporarily when using NFS (with rpcbind/rstatd on port 837):

$ rsync -avz /var/backup/* backup@storage:/data/backups
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]

This fixed it for me:

service rpcbind stop

score 0 · Accepted Answer

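I ran into the same problem while creating a hot standby (Postgres 9.5). Streaming itself was working (I seeded the standby via pg_basebackup, using the same credentials that would later be used in the standby's recovery.conf).

After the base backup completed, I set up the following recovery.conf: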

standby_mode = 'on'
primary_conninfo = 'host=ip.of.master port=5432 user=pgstandby password=password'
recovery_target_timeline = 'latest'
restore_command = 'sftp -q user@ip.of.wal.archive.host:data/master_wal_archive/%f "%p"'
trigger_file = '/srv/pgsql/9.5/data/trigger'

Starting the server produced:

2016-03-08 12:34:58.981 UTC  (/)LOG:  database system was interrupted; last known up at 2016-03-08 12:26:10 UTC
Couldn't read packet: Connection reset by peer
2016-03-08 12:34:59.525 UTC  (/)FATAL:  could not restore file "00000002.history" from archive: child process exited with exit code 255
2016-03-08 12:34:59.526 UTC  (/)LOG:  startup process (PID 26636) exited with exit code 1
2016-03-08 12:34:59.526 UTC  (/)LOG:  aborting startup due to startup process failure

If I removed the restore_command line from recovery.conf, the standby started fine and began streaming WAL from the master.

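I eventually traced the problem to not having added the standby postgres user's public key to the authorized_keys file on the WAL archive host. I had also forgotten to add the WAL archive host's server fingerprint to the standby postgres user's known_hosts file.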

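These two mistakes (I assume) caused the sftp restore_command to exit with code 255. As tscho said, the Postgres documentation suggests that if restore_command exits with any non-zero value, Postgres will go on to try streaming from the master rather than refuse to start. In practice, that does not seem to be the case if the exit code is above a certain number (perhaps 125, as vyegorov's grepping of the source suggests?).

Once I fixed these two SSH problems, the standby started normally with the restore_command in recovery.conf.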
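If aborting on a transient connection failure is not wanted, one possible workaround is to wrap the fetch command. The following is a sketch under assumptions, not something the answers here tested: `run_clamped` is a hypothetical helper, and the paths in the comments are placeholders. It folds only status 255 (the status ssh uses for its own connection errors) into 1 and leaves 126, 127 and signal-related statuses untouched, so a missing command or a killed child still stops recovery.

```shell
#!/bin/sh
# Hypothetical helper: run a command and turn exit status 255 into 1,
# so the caller sees an ordinary "file not available" style failure.
# All other statuses are passed through unchanged.
run_clamped() {
    rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 255 ]; then
        # Keep a trace of the real status; it often points at an SSH problem.
        echo "fetch command exited with 255; reporting 1 instead" >&2
        return 1
    fi
    return "$rc"
}

# Inside a wrapper script it could be used like this (placeholder paths):
#   run_clamped rsync -a "archivehost:/path/to/wal_archive/$1" "$2"
# with recovery.conf pointing at the wrapper:
#   restore_command = '/path/to/restore_wal.sh %f "%p"'
```

The trade-off is that a misconfiguration like the missing SSH key above would then no longer stop the standby at startup; it would only show up in the log line written to stderr.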

score 0 · Accepted Answer

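Here is the comment explaining why this behavior was chosen for high exit statuses of the command process, along with the current code that implements it.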

    /*
     * Remember, we rollforward UNTIL the restore fails so failure here is
     * just part of the process... that makes it difficult to determine
     * whether the restore failed because there isn't an archive to restore,
     * or because the administrator has specified the restore program
     * incorrectly.  We have to assume the former.
     *
     * However, if the failure was due to any sort of signal, it's best to
     * punt and abort recovery.  (If we "return false" here, upper levels will
     * assume that recovery is complete and start up the database!) It's
     * essential to abort on child SIGINT and SIGQUIT, because per spec
     * system() ignores SIGINT and SIGQUIT while waiting; if we see one of
     * those it's a good bet we should have gotten it too.
     *
     * On SIGTERM, assume we have received a fast shutdown request, and exit
     * cleanly. It's pure chance whether we receive the SIGTERM first, or the
     * child process. If we receive it first, the signal handler will call
     * proc_exit, otherwise we do it here. If we or the child process received
     * SIGTERM for any other reason than a fast shutdown request, postmaster
     * will perform an immediate shutdown when it sees us exiting
     * unexpectedly.
     *
     * Per the Single Unix Spec, shells report exit status > 128 when a called
     * command died on a signal.  Also, 126 and 127 are used to report
     * problems such as an unfindable command; treat those as fatal errors
     * too.
     */
    if (WIFSIGNALED(rc) && WTERMSIG(rc) == SIGTERM)
        proc_exit(1);

    signaled = WIFSIGNALED(rc) || WEXITSTATUS(rc) > 125;

    ereport(signaled ? FATAL : DEBUG2,
            (errmsg("could not restore file \"%s\" from archive: %s",
                    xlogfname, wait_result_to_str(rc))));
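The shell statuses the comment refers to can be observed directly. This is a small sketch; `status_of` is a helper invented for this example, and `no_such_command_xyz` is assumed not to exist on the system.

```shell
#!/bin/sh
# Run a command string in a child shell and print the status that the
# calling shell observes for it.
status_of() {
    rc=0
    sh -c "$1" >/dev/null 2>&1 || rc=$?
    echo "$rc"
}

status_of 'exit 3'                # ordinary failure: prints 3
status_of 'no_such_command_xyz'   # command not found: prints 127
status_of 'kill -TERM $$'         # child killed by SIGTERM (15): prints 143
```

Statuses 127 and 143 are both above 125, so under the check shown above either one would be reported as FATAL rather than as an ordinary restore failure.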
