
We monitor our MongoDB connection count using this:

http://godoc.org/labix.org/v2/mgo#GetStats

However, we have been facing a strange connection leak where the connectionCount creeps up steadily, by one more open connection every 10 seconds, regardless of whether there are any requests. I can spin up a server on localhost, leave it idle, and the connectionCount will still creep up. It eventually reaches a few thousand, at which point it kills the app/db and we have to restart the app.

This might not be enough information to debug the problem. Does anyone have ideas, or experience with connection leaks like this? How did you debug them? What are some ways I can debug this?

We have tried a few things: we scanned our code base for any code that could open a connection and added counters and debugging statements there, and so far we have found no leak. It is almost as if there is a leak in a library somewhere.

This bug is in a branch that we have been working on, which has a few hundred commits. We diffed it against master and couldn't find why this branch leaks connections.

As an example, here is the dataset I am referring to:

Clusters:      1   
MasterConns:   9936      <-- creeps up 1 per 10 seconds
SlaveConns:    -7359     <-- why is this negative?
SentOps:       42091780   
ReceivedOps:   38684525   
ReceivedDocs:  39466143   
SocketsAlive:  78        <-- what is the difference between the socket count and the master conns count?
SocketsInUse:  1231   
SocketRefs:    1231

MasterConns is the number that creeps up by one every 10 seconds. I am not entirely sure what the other numbers mean.


1 Answer


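MasterConns can't tell you whether there is a leak, because it never decreases. The field indicates the number of connections made since the last statistics reset, not the number of sockets currently in use. The latter is indicated by the SocketsAlive field.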
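To make the counter-versus-gauge distinction concrete, here is a minimal sketch. It assumes a local `statsSnapshot` type (hypothetical, it only mirrors two of the fields printed in the question) and made-up sample values, so it runs without a database. In a real program the numbers would presumably come from periodic calls to `mgo.GetStats()` with statistics collection enabled.

```go
package main

import "fmt"

// statsSnapshot is a hypothetical local type that mirrors two of the fields
// shown in the question. It is not the mgo type itself.
type statsSnapshot struct {
	MasterConns  int // cumulative: connections made since the last stats reset
	SocketsAlive int // gauge: sockets that are open right now
}

// aliveGrowth returns how much the SocketsAlive gauge changed between samples.
func aliveGrowth(before, after statsSnapshot) int {
	return after.SocketsAlive - before.SocketsAlive
}

// aliveKeepsRising is a rough heuristic (an assumption for illustration, not
// an mgo rule): it returns true only if SocketsAlive grew between every pair
// of consecutive samples. MasterConns is deliberately ignored, because a
// cumulative counter rises even when the pool is healthy.
func aliveKeepsRising(samples []statsSnapshot) bool {
	if len(samples) < 2 {
		return false
	}
	for i := 1; i < len(samples); i++ {
		if aliveGrowth(samples[i-1], samples[i]) <= 0 {
			return false
		}
	}
	return true
}

func main() {
	// Made-up samples: MasterConns rises in both, SocketsAlive only in one.
	steady := []statsSnapshot{
		{MasterConns: 100, SocketsAlive: 78},
		{MasterConns: 101, SocketsAlive: 78},
		{MasterConns: 102, SocketsAlive: 78},
	}
	growing := []statsSnapshot{
		{MasterConns: 100, SocketsAlive: 78},
		{MasterConns: 101, SocketsAlive: 80},
		{MasterConns: 102, SocketsAlive: 85},
	}
	fmt.Println("steady pool keeps rising:", aliveKeepsRising(steady))
	fmt.Println("growing pool keeps rising:", aliveKeepsRising(growing))
}
```

In both sample sets MasterConns goes up, but only the second one shows the number of open sockets growing, which is the signal worth alerting on.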

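To give you some additional reassurance on the subject, every single test in the mgo suite is wrapped in logic that ensures the statistics show sane values after the test finishes, so that potential leaks don't go unnoticed. That is the main reason this statistics collection system was introduced.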

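The reason you see this number increasing every 10 seconds or so is the internal activity that learns the status of the cluster. That said, this behavior was changed recently so that it no longer establishes new connections and instead picks existing sockets from the pool, so I believe you are not using the latest release.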

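A negative SlaveConns looks like a bug. There is a small edge case in the statistics collection for connections made, because we can't tell whether a given server is a master or a slave until we have talked to it, so there may be an uncovered path. If you still see that behavior after you upgrade, please report the issue and we'll be glad to look at it.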

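SocketsInUse is the number of sockets that are still referenced by one or more sessions, whether or not they are alive (that is, whether or not the connection is established). SocketsAlive is, again, the number of real TCP connections. The difference between the two indicates that a number of sessions were not closed. This may be okay if they are still held in memory by the application and will eventually be closed, or it may be a leak if the application missed a session.Close call.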
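As a rough illustration of that gap and of the missed-Close scenario, the sketch below uses `fakePool` and `fakeSession`, hypothetical stand-ins that only count references so the example runs offline. They are not mgo types. The first printed value applies the subtraction to the SocketsInUse and SocketsAlive figures from the question.

```go
package main

import "fmt"

// fakePool and fakeSession are hypothetical stand-ins for a driver's socket
// pool and session. They only count references; no network is involved.
type fakePool struct {
	inUse int // sessions handed out and not yet closed
}

type fakeSession struct {
	pool   *fakePool
	closed bool
}

// acquire hands out a session and counts it as in use.
func (p *fakePool) acquire() *fakeSession {
	p.inUse++
	return &fakeSession{pool: p}
}

// Close releases the session. Calling it more than once is harmless.
func (s *fakeSession) Close() {
	if s.closed {
		return
	}
	s.closed = true
	s.pool.inUse--
}

// unclosedGap returns SocketsInUse minus SocketsAlive, the difference the
// answer describes as a hint that sessions were not closed.
func unclosedGap(socketsInUse, socketsAlive int) int {
	return socketsInUse - socketsAlive
}

// handleLeaky acquires a session and never closes it.
func handleLeaky(p *fakePool) {
	_ = p.acquire()
}

// handleClean pairs the acquire with a deferred Close, so the session is
// released on every return path.
func handleClean(p *fakePool) {
	s := p.acquire()
	defer s.Close()
	// ... work with s would go here ...
}

func main() {
	// SocketsInUse and SocketsAlive as reported in the question.
	fmt.Println("gap from the question:", unclosedGap(1231, 78))

	leaky := &fakePool{}
	clean := &fakePool{}
	for i := 0; i < 3; i++ {
		handleLeaky(leaky)
		handleClean(clean)
	}
	fmt.Println("in use after leaky handlers:", leaky.inUse)
	fmt.Println("in use after clean handlers:", clean.inUse)
}
```

With a real driver, the equivalent habit is to defer the session's Close right after obtaining it, so that early returns and error paths cannot skip it.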

answered 2013-10-18T18:32:01.487