I have an issue with MongoDB becoming unresponsive and refusing connections in production.
In testing, I'm able to consistently reproduce the "lock out" by sending lot of requests to my app.
Here is an example output from mongostat:
0 0 0 0 0 1 0 1.3g 1.8g 177m 0 0 0 0|0 0|0 62b 1k 18 19:40:11
0 0 0 0 0 1 0 1.3g 1.8g 177m 0 0 0 0|0 0|0 62b 1k 18 19:40:12
--------------LOAD TEST BEGINS-----------------------
1 56 0 0 1 58 0 1.3g 1.81g 177m 0 0 0 0|0 0|0 10k 302k 18 19:40:13
10 116 0 0 2 127 0 1.3g 1.81g 177m 0 0 0 0|0 1|0 32k 263k 19 19:40:14
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
206 431 0 0 2 638 0 1.3g 1.82g 177m 0 1.1 0 0|0 0|0 206k 339k 22 19:40:15
331 649 0 0 2 983 0 1.3g 1.81g 177m 0 1 0 0|0 0|0 317k 521k 21 19:40:16
367 723 0 0 10 1090 0 1.3g 1.81g 177m 0 0.9 0 0|1 1|1 297k 1m 20 19:40:17
500 961 0 0 11 1461 0 1.3g 1.81g 177m 0 1.2 0 0|0 0|0 390k 1m 19 19:40:18
543 1073 0 0 10 1618 0 1.3g 1.82g 177m 0 1.2 0 0|0 2|0 433k 1m 21 19:40:19
461 909 0 0 11 1370 0 1.3g 1.81g 177m 0 1.1 0 0|0 2|0 365k 1m 20 19:40:20
347 685 0 0 6 1033 0 1.3g 1.81g 177m 0 0.8 0 0|0 1|0 411k 962k 21 19:40:21
482 938 0 0 9 1421 0 1.3g 1.81g 177m 0 1.1 0 0|0 0|0 492k 1m 20 19:40:22
444 877 0 0 10 1322 0 1.3g 1.81g 177m 0 1.1 0 0|0 0|1 479k 1m 22 19:40:23
528 1019 0 0 9 1549 0 1.3g 1.81g 177m 0 1.2 0 0|0 0|0 446k 1m 21 19:40:24
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
520 1033 0 0 10 1554 0 1.3g 1.81g 177m 0 1.2 0 0|0 1|0 418k 1m 22 19:40:25
448 878 0 0 10 1326 0 1.3g 1.81g 177m 0 1.2 0 0|0 1|0 349k 1m 22 19:40:26
503 979 0 0 8 1483 0 1.3g 1.82g 177m 0 1.3 0 0|0 1|0 397k 1m 21 19:40:27
379 743 0 0 10 1123 0 1.3g 1.81g 177m 0 1 0 0|0 0|0 315k 1m 22 19:40:28
456 896 0 0 7 1354 0 1.3g 1.81g 177m 0 1.1 0 0|0 0|0 436k 1m 22 19:40:29
371 715 0 0 10 1086 0 1.3g 1.81g 175m 0 0.9 0 0|0 0|0 299k 1m 20 19:40:30
505 993 0 0 7 1499 0 1.3g 1.81g 175m 0 1.2 0 0|0 2|0 394k 1m 21 19:40:31
493 1023 0 0 10 1519 0 1.3g 1.82g 175m 0 1.1 0 0|0 1|0 392k 1m 22 19:40:32
426 868 0 0 11 1295 0 1.3g 1.81g 170m 0 1 0 0|0 1|0 363k 1m 21 19:40:33
352 772 0 0 7 1125 0 1.3g 1.81g 163m 19 4.4 0 0|1 0|1 368k 1m 22 19:40:34
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
330 656 0 0 5 985 0 1.3g 1.81g 159m 4 1.5 0 0|0 0|0 325k 876k 19 19:40:35
398 817 0 0 8 1218 0 1.3g 1.81g 155m 9 1.3 0 0|0 1|0 372k 1m 21 19:40:36
464 1015 0 0 8 1479 0 1.3g 1.81g 155m 2 1.2 0 0|0 1|0 381k 1m 21 19:40:37
457 915 0 0 11 1372 0 1.3g 1.81g 155m 0 1.1 0 0|0 0|0 356k 1m 22 19:40:38
458 960 0 0 7 1419 0 1.3g 1.83g 155m 0 1.1 0 0|0 0|0 376k 1m 22 19:40:39
391 760 0 0 11 1153 0 1.3g 1.81g 155m 0 1 0 0|0 1|0 321k 1m 21 19:40:40
315 639 0 0 5 956 0 1.3g 1.81g 155m 0 0.8 0 0|0 0|0 325k 877k 21 19:40:41
303 696 0 0 7 998 0 1.3g 1.81g 155m 0 0.8 0 0|0 0|0 279k 1m 21 19:40:42
388 772 0 0 8 1163 0 1.3g 1.81g 155m 0 1 0 0|0 0|0 308k 1m 20 19:40:43
273 623 0 0 7 895 0 1.3g 1.81g 155m 0 0.7 0 0|0 2|0 232k 1m 20 19:40:44
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
397 796 0 0 7 1194 0 1.3g 1.81g 155m 0 0.9 0 0|0 0|0 311k 1m 19 19:40:45
347 754 0 0 4 1103 0 1.3g 1.81g 155m 0 0.9 0 0|0 0|0 329k 771k 20 19:40:46
336 700 0 0 8 1042 0 1.3g 1.82g 160m 20 0.8 0 2|2 3|2 370k 1m 23 19:40:47
272 525 0 0 5 805 0 1.3g 1.79g 167m 188 0.6 0 0|0 0|0 306k 818k 18 19:40:48
103 246 0 0 1 350 0 1.3g 1.78g 166m 0 0.4 0 0|0 0|0 136k 234k 15 19:40:49
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:50
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:51
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:52
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:53
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:54
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:55
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:56
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:57
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:58
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:40:59
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:00
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:01
0 3 0 0 0 4 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 775b 4k 15 19:41:02
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:03
0 0 0 0 0 1 1 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:04
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:05
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:06
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:07
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:08
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:09
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:10
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:11
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:12
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:13
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:14
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:15
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:16
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:17
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:18
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:19
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:20
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:21
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:22
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:23
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:24
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:25
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:26
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:27
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:28
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:29
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:30
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:31
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:32
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:33
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:34
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:35
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:36
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:37
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:38
0 0 0 0 0 1 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 62b 1k 15 19:41:39
0 28 0 0 0 29 0 1.3g 1.78g 166m 0 0 0 0|0 0|0 5k 27k 15 19:41:40
86 234 0 0 1 322 0 1.3g 1.79g 166m 0 0.2 0 0|0 1|0 145k 217k 17 19:41:41
216 445 0 0 5 663 0 1.3g 1.79g 167m 0 0.5 0 0|0 0|0 220k 807k 16 19:41:42
302 612 0 0 6 916 0 1.3g 1.79g 167m 0 0.6 0 0|0 0|0 236k 950k 16 19:41:43
113 247 0 0 1 362 0 1.3g 1.79g 167m 0 0.2 0 0|0 0|0 95k 192k 17 19:41:44
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
38 63 0 0 2 103 0 1.3g 1.79g 167m 0 0 0 0|0 0|0 26k 289k 16 19:41:45
--------------LOAD TEST ENDS-----------------------
0 0 0 0 0 2 0 1.3g 1.79g 167m 0 0 0 0|0 0|0 124b 2k 16 19:41:46
0 0 0 0 0 2 0 1.3g 1.79g 167m 0 0 0 0|0 0|0 124b 2k 16 19:41:47
0 0 0 0 0 2 0 1.3g 1.79g 167m 0 0 0 0|0 0|0 124b 2k 16 19:41:48
0 0 0 0 0 2 0 1.3g 1.79g 167m 0 0 0 0|0 0|0 124b 2k 16 19:41:49
As you can see, in the middle of the load test, the reads and writes drop to pretty much nothing. During that time, I'm unable to connect locally to MongoDB via mongo or mongostat (connection refused). As you can also see, the connections is pretty low and there are 16000 connections in the max connections for mongo.
I'm able to reproduce this consistently. However, each time the "lock out" is a different length. I've observed 51 seconds, 20 seconds, 28 seconds. In each instance, I'm still connected to mongo via mongostat (and receiving output), but new connections will be refused.
user@myhostname:~# mongostat
couldn't connect to [127.0.0.1] couldn't connect to server 127.0.0.1
What could be the issue? I'm running MongoDB v1.8.2.
As an aside, I'm able to reproduce a similar issue, where the commands don't drop off entirely, but are low in volume and fluctuate wildly.