51

This problem is killing the stability of my production servers.

To recap, the basic idea is that my node server sometimes slows down intermittently, sometimes resulting in Gateway Timeouts. As best I can tell from my logs, something is blocking the node thread (meaning incoming requests are not accepted), but I cannot for the life of me figure out what.

The problem ranges in severity. Sometimes requests that should take less than 100 ms take about 10 seconds to complete; sometimes they are never accepted by the node server at all. In short, it is as though some random task starts working and blocks the node thread for a period of time, slowing down (or even blocking) incoming requests. The one thing I can say for sure is that the symptom that needs fixing is the Gateway Timeout.

The issue comes and goes without warning. I haven't been able to correlate it with CPU usage, RAM usage, uptime, or any other relevant statistic. I've seen the server handle a large load just fine and then hit this error under a small load, so it doesn't even seem load-related. It's not unusual to see the error around 1 am PST, the lightest-load time of the day! Restarting the node app does seem to make the problem go away for a while, but that doesn't tell me much. I do wonder whether it might be a bug in node.js... not very comforting, considering it's killing my production servers.

  • The first thing I did was make sure I had upgraded node.js to the latest version (0.8.12), as well as all of my modules (here they are). Of course, I also have plenty of error catchers in place. I'm not doing anything funky like printing lots of output to the console or writing to lots of files.
  • At first I thought outbound HTTP requests were blocking the incoming sockets, because the express middleware wasn't even picking up the inbound requests, but I gave up on that theory because it looks like the node thread itself becomes busy.
  • Next, I went through all my code with JSHint and fixed every single warning, including a few accidental globals (forgetting to write "var"), but this didn't help.
  • After that, I assumed I might be running out of memory. But my heap snapshots via nodetime now look pretty good (described below).
  • Still thinking that memory might be an issue, I took a look at garbage collection. I enabled the --nouse-idle-notification flag and did some more code optimization to NULL objects when they weren't needed.
  • Still convinced memory was the culprit, I added the --expose-gc flag and executed a gc(); command every minute (see the sketch after this list). This didn't change anything, except to occasionally make requests a bit slower.
  • In a desperate attempt, I set up the "cluster" module to use 2 workers and automatically restart them every 30 minutes. Still no luck.
  • I increased the ulimit to over 10,000 and kept a close eye on open files. There seem to be fewer than 300 open files (or sockets) per node.js app, so increasing the ulimit had no impact.
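
Here is roughly what that gc() setup from the list looked like (a sketch; app.js just stands in for my entry point, and the one-minute interval is the one mentioned above):

    // Start node with the GC exposed and idle notifications disabled:
    //   node --expose-gc --nouse-idle-notification app.js

    // global.gc is only defined when node is started with --expose-gc.
    if (typeof global.gc === 'function') {
      setInterval(function () {
        global.gc(); // force a full collection once a minute
      }, 60 * 1000);
    }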

I've been logging my server with nodetime, and here's the gist of it:

  • CentOS 5.2 running on the Amazon Cloud (m1.large instance)
  • Always more than 5000 MB of free memory
  • Always less than 150 MB of heap size
  • CPU usage consistently below 60%

I've also checked my MongoDB servers, which show <5% CPU usage and no requests taking longer than 100 ms to complete, so I highly doubt there's a bottleneck there.

I've wrapped (almost) all my code using Q-promises (see code sample), and of course have avoided Sync() calls like the plague. I've tried to replicate the issue on my testing server (OSX), but have had little luck. Of course, this may be just because the production servers are being used by so many people in so many unpredictable ways that I simply cannot replicate via stress tests...
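
The code sample referenced above isn't reproduced here, but the Q wrapping looks roughly like this (a sketch; the function name and the MongoDB call are purely illustrative):

    var Q = require('q');

    // Illustrative only: wrap a callback-style MongoDB lookup in a Q promise
    // so it can be chained with .then() instead of nested callbacks.
    function findUserById(collection, id) {
      var deferred = Q.defer();
      collection.findOne({ _id: id }, function (err, user) {
        if (err) {
          deferred.reject(err);
        } else {
          deferred.resolve(user);
        }
      });
      return deferred.promise;
    }

    // findUserById(users, someId)
    //   .then(function (user) { /* ... */ })
    //   .fail(function (err) { /* ... */ });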


7 Answers

15

Many months after I first asked this question, I found the answer.

In a nutshell, the problem was that I was not piping a big asset when transferring it from one server to another. In other words, I was downloading an image from one server and then uploading it to an S3 bucket. Instead of streaming the download into the upload, I downloaded the whole file into memory and then uploaded it.

I'm not sure why this did not show up as a memory spike, or elsewhere in my statistics.
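
The fix doesn't depend on any particular S3 client, but as a rough sketch (using knox here only as an example of a client that accepts a stream; the credentials and paths are placeholders), it amounts to piping the download response straight into the upload instead of buffering it:

    var http = require('http');
    var url = require('url');
    var knox = require('knox'); // example S3 client; any streaming-capable client works

    var s3 = knox.createClient({ key: 'KEY', secret: 'SECRET', bucket: 'my-bucket' });

    // Buffered version (the bug): read the whole image into memory, then upload it.
    // Streamed version (the fix): pipe the incoming response directly into the S3 PUT.
    function copyImageToS3(imageUrl, s3Path, callback) {
      http.get(url.parse(imageUrl), function (res) {
        var headers = {
          'Content-Length': res.headers['content-length'],
          'Content-Type': res.headers['content-type']
        };
        s3.putStream(res, s3Path, headers, callback);
      }).on('error', callback);
    }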

answered 2013-03-28T22:06:08.600
12

My guess is Mongoose. If you are storing large payloads in Mongo, Mongoose can be pretty slow due to how it builds the Mongoose objects. See https://github.com/LearnBoost/mongoose/issues/950 for more details on the problem. If this is the problem, you wouldn't see it in Mongo itself, since the query returns quickly, but object instantiation could take 75x the query time.

Try setting up timers (using process.hrtime()) before and after the Mongoose objects are being created to see if that might be the problem. If it is, I would switch to using the node Mongo driver directly instead of going through Mongoose.
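
A rough sketch of that timing (the Post model and query are hypothetical; substitute whatever query returns your large payloads):

    // Time the query plus Mongoose document construction.
    var start = process.hrtime();
    Post.find({ author: authorId }, function (err, posts) {
      var diff = process.hrtime(start);
      var ms = diff[0] * 1000 + diff[1] / 1e6;
      console.log('query + Mongoose object construction took ' + ms.toFixed(1) + ' ms');
    });

    // For comparison, .lean() returns plain objects and skips document construction:
    // Post.find({ author: authorId }).lean().exec(function (err, plainPosts) { /* ... */ });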

answered 2012-10-17T20:02:45.323
4

You are heavily leaking memory; try setting every object to null as soon as you don't need it anymore! Read this.

More information about hunting down memory leaks can be found here.

Pay special attention to multiple references to the same object, and check whether you have circular references; those are a pain to debug, but fixing them will help you a lot.

Try invoking the garbage collector manually every minute or so (I don't know if you can do this in node.js, as I'm more of a C++ and PHP coder). From my years of experience working with C++, I can tell you that the most likely cause of an application slowing down over time is memory leaks: find them and plug them, and you'll be OK!

Also, assuming you're not caching and/or processing images, audio, or video in memory or anything like that, a 150 MB heap is a lot! That could be hundreds of thousands or even millions of small objects.

You don't have to be running out of memory for your application to slow down... just searching for free memory with that many objects already allocated is a huge job for the memory allocator; it takes a lot of time to allocate each new object, and as you leak more and more memory, that time only increases.
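
One cheap way to check for this over time (a sketch, not node-specific expertise) is to log the process's memory usage periodically and see whether the heap keeps growing between slowdowns:

    // Log heap usage once a minute; a heap that only ever grows suggests a leak.
    setInterval(function () {
      var mem = process.memoryUsage();
      console.log('rss=' + Math.round(mem.rss / 1048576) + ' MB, ' +
                  'heapUsed=' + Math.round(mem.heapUsed / 1048576) + ' MB, ' +
                  'heapTotal=' + Math.round(mem.heapTotal / 1048576) + ' MB');
    }, 60 * 1000);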

answered 2012-10-14T22:48:49.367
1

Is "--nouse-idle-connection" a mistake? do you really mean "--nouse_idle_notification".

I think it may be a GC issue caused by too many tiny objects. Node runs as a single process, so watching the busiest CPU core matters much more than watching the overall load. When your program is slow, you can run "gdb node pid" and then "bt" to see what node is busy doing.

answered 2012-10-18T02:13:50.060
1

What I'd do is set up a parallel node instance on the same server with some kind of echo service and test that one. If it runs fine, you've narrowed your problem down to your program code (and not a scheduler/OS-level problem). Then, step by step, include your modules and test again. Certainly this is a lot of work and takes a long time, and I don't know if it is doable on your system.
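
Something as small as this would do for the echo service (a sketch; the port is arbitrary):

    var http = require('http');

    // Minimal echo service: if this stays responsive while the main app stalls,
    // the problem is in the application code, not the OS or scheduler.
    http.createServer(function (req, res) {
      res.writeHead(200, { 'Content-Type': 'text/plain' });
      req.pipe(res); // echo the request body (empty for GETs) back to the client
    }).listen(3001);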

answered 2012-10-19T07:30:45.557
1

If you need to get this working now, you can go the NASA redundancy route:

Bring up a second copy of your production servers, and put a proxy in front of them which routes each request to both stacks and returns the first response. I don't recommend this as a perfect long-term solution, but it should significantly reduce issues in production now, and help you gather log data that you could replay to recreate the issues on non-production servers.

Obviously, this is straightforward for read requests, but more complex for commands which write to the db.
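
A bare-bones sketch of such a proxy (the backend addresses are hypothetical, and a real setup would need care around request bodies and non-idempotent writes, as noted above):

    var http = require('http');

    // Hypothetical addresses: the existing stack and its mirror.
    var backends = [
      { host: '10.0.0.1', port: 8080 },
      { host: '10.0.0.2', port: 8080 }
    ];

    http.createServer(function (clientReq, clientRes) {
      var answered = false;
      backends.forEach(function (backend) {
        var proxyReq = http.request({
          host: backend.host,
          port: backend.port,
          method: clientReq.method,
          path: clientReq.url,
          headers: clientReq.headers
        }, function (proxyRes) {
          if (answered) { proxyRes.resume(); return; } // discard the slower answer
          answered = true; // first response wins
          clientRes.writeHead(proxyRes.statusCode, proxyRes.headers);
          proxyRes.pipe(clientRes);
        });
        proxyReq.on('error', function () { /* ignore; the other stack may still answer */ });
        clientReq.pipe(proxyReq);
      });
    }).listen(80);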

answered 2012-10-24T00:43:07.937
0

We had a similar problem with our Node.js server. It didn't scale well for weeks, and we had tried almost everything, just as you have. Our problem was the implicit backlog value, which is set very low for highly concurrent environments.

http://nodejs.org/api/http.html#http_server_listen_port_hostname_backlog_callback

Setting the backlog to a significantly higher value (e.g. 10000), as well as tuning networking in our kernel (/etc/sysctl.conf on Linux) as described in the manual section, helped a lot. Since then we haven't had any timeouts in our Node.js server.
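
For illustration, the backlog is the third argument to listen(), per the docs linked above (the port, host, and value here are just examples; on Linux the effective backlog is also capped by the kernel's net.core.somaxconn, typically one of the values tuned in /etc/sysctl.conf):

    var http = require('http');

    var server = http.createServer(app); // app = your existing request handler

    // The third argument is the TCP backlog: how many connections the kernel will
    // queue before they are accepted. The default is low; when that queue overflows
    // under a burst, clients see timeouts rather than errors.
    server.listen(8080, '0.0.0.0', 10000, function () {
      console.log('listening, backlog set to 10000');
    });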

answered 2012-10-27T11:27:35.123