I am running out of arguments in discussions with our admins, so I hope you can help me with the following issue.

We see strange behaviour in our self-implemented Windows services: they freeze randomly. Sometimes they keep working for weeks, and sometimes they freeze several times in one week. I am fairly sure the cause is not bad code or unhandled exceptions. In my opinion this is some kind of Windows administration / rights management problem combined with coincidental timing.

But let's start with some background:

  • All windows services are running on one server.
  • All windows services are executed by the same windows user.
  • The server is a virtual machine. (VMWare, Windows Server 2008 R2) (I know...)
  • The services are implemented using VB.Net with .Net 4.0. (I know... Was not my decision ;-))
  • We have 2 different kinds of services (called A, B).
  • Both kinds of services read files from a directory and write some information to a database. It's probably not important what exactly they do, because this is a fairly standard task.
  • Each kind of service exists in 3 variants, which are copies of each other but use different SQL servers to store data (called 1, 2, 3).
  • At irregular intervals one or two of the six services seem to freeze.
  • In the Windows service manager the frozen services are shown as "running". Querying them via PowerShell also reports them as running.
  • There is no visible pattern in which services freeze. Sometimes, for example, service A variant 2 is frozen, whereas variants 1 and 3 are working fine. Important: the same code is behind these 3 variants.
  • Each service writes one log file per day. The log of a frozen service shows no exception or error. The service has simply stopped doing its work.
  • I cannot find any relevant information in the Windows event log.
  • Restarting the frozen services always helps. Sometimes a simple restart is not possible and you have to stop the service first and then start it again. In that case you see "error 1061: service cannot accept control messages at this time". This also occurs irregularly.

Because I could not see any logged errors, I installed DebugDiag on the server, added crash rules for the services in question, and may have found something interesting. Here is an extract of the DebugDiag log:

[12.06.2017 01:04:05]
  Thread created. New thread - System ID: 17372
[12.06.2017 01:04:29]
  Thread exited. Exiting thread - System ID: 7152. Exit code - 0x00000000
[12.06.2017 06:55:25]
  Thread created. New thread - System ID: 13252
  Thread exited. Exiting thread - System ID: 31012. Exit code - 0x00000000
  C:\Windows\System32\wship6.dll Unloaded from 0xfcee0000
  C:\Windows\System32\wshtcpip.dll Unloaded from 0xfc650000
  C:\Windows\System32\fwpuclnt.dll Unloaded from 0xfb1c0000
  C:\Windows\system32\security.dll Unloaded from 0x6f9e0000
  Thread exited. Exiting thread - System ID: 25912. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 17372. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 27412. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 13252. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 31768. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 27540. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 12252. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 29336. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 5620. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 8248. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 4340. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 18056. Exit code - 0x00000000
  Thread exited. Exiting thread - System ID: 34164. Exit code - 0x00000000
  Process exited. Exit code - 0x00000000

The last sign of life of the service (let's say it was service A variant 2), which had frozen again at this time, was at 01:04:29, when one thread exited. At 06:55:25 one of our admins restarted the service because he saw that it seemed to be frozen. DebugDiag wrote no dump, so I again assume that the service did not crash.

What struck me as strange was that wship6.dll, wshtcpip.dll, fwpuclnt.dll and security.dll were unloaded while restarting the service, because I had not seen this before. I restarted another variant of service A, which was not frozen, several times. I saw the same entries, but only on the first restart. On later stops and starts, the log did not show the libraries being unloaded.

So, after all this information, my questions:

  • Can you tell me roughly what these Windows libraries do?
  • Is there any hint that the server could have problems with user rights management / group policies? I know that we had issues with group policies in the past. The local rights of the user that runs the services were overwritten by some invalid global group policies. That's at least what I understood. I am a developer and don't do admin tasks.
  • What else could I check to make sure there really is no problem with the code, or to help our admins solve this annoying issue?

Edit 16.06.2017: Last night a different Windows service stopped working with the same behaviour. Some variants of this service are frozen and some are still working. This time, however, the log does not show the mentioned DLLs being unloaded when the service was restarted, so the initial suspicion about the unloaded DLLs may not help with further diagnostics. One interesting fact: this service stopped working at the same time of day as the first service. Maybe there is a problem with the VM backups or something similar? I guess a regularly scheduled task is causing the problem. Do you have any hints?

Edit 19.06.2017: I think we have found something interesting. The freezing services all have one .Net component in common: a FileSystemWatcher. This has never been a problem in the past, because we extended the .Net FileSystemWatcher with a self-reconnecting feature. The file server that contains the path monitored by our FileSystemWatcher is backed up every night. Our reconnect feature checks every second whether this network path is unavailable. If it is, the FileSystemWatcher is reconnected once the path is available again. The hosting server, which manages all our virtual servers, was upgraded a few days ago.

So we have the following suspicion: Let's assume our Windows service checks the network path at time t_1000 and t_2000. The virtual server backup disconnects the virtual file server, which contains the network path monitored by the FileSystemWatcher, at time t_1200 and reconnects the path at t_1500. In this case our reconnect feature cannot work properly, because at t_1000 and t_2000 the network path was available. The FileSystemWatcher has nevertheless lost its connection and no longer reacts to incoming files in that network path.

This was not a problem before, because the reconnect triggered by our backup software took some milliseconds longer due to the slower hardware used before the upgrade. So our reconnect feature worked fine.
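The suspected race can be illustrated with a small simulation. This is only a sketch in Python (the services themselves are VB.Net). The function name `poll_sees_outage` and the 2100 ms end of the "slow" outage are made up for illustration. The 1000 ms check interval and the t_1200 to t_1500 outage are the values from the scenario above.

```python
def poll_sees_outage(poll_times, outage_start, outage_end):
    """Return True if at least one availability check falls inside the
    outage window [outage_start, outage_end), all times in milliseconds."""
    return any(outage_start <= t < outage_end for t in poll_times)


# Availability checks once per second: t_0, t_1000, t_2000, ...
polls = range(0, 5001, 1000)

# Short outage between two checks (t_1200 to t_1500): no check ever sees
# the path as unavailable, so the reconnect logic is never triggered.
print("short outage noticed:", poll_sees_outage(polls, 1200, 1500))

# Hypothetical longer outage (t_1200 to t_2100) that spans the check at
# t_2000: the check sees the path as unavailable and a reconnect follows.
print("long outage noticed:", poll_sees_outage(polls, 1200, 2100))
```

Any outage that fits completely between two checks stays invisible to a timer-based check, no matter how short the check interval is made. A shorter interval only narrows the window.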

So what can we do?

  • Option 1: Contact the vendor of our backup software. Maybe this is a bug in its software?
  • Option 2: Never use a FileSystemWatcher again, because we are always working on network paths.
  • Option 3: Maybe there is a way to improve the FileSystemWatcher even further? Can a FileSystemWatcher catch events like this itself, so we don't have to rely on our timer-based reconnect feature? What do you think?

Many thanks in advance.

1 Answer

Here is our solution, for anyone who is interested:

The vendor of the backup software is aware of this problem but is not willing to fix it. So we decided to create a new virtual machine that serves as a file server for our needs. This new file server will not be backed up via snapshot.

I did not find a way to further improve our FileSystemWatcher, so I guess this is our only option for solving the problem.

Answered 2017-07-10T06:31:40.867