We run a large number of Python test scripts on a group of test stations. The test scripts interface with hardware units on these stations, so we're limited to running one test script at a time per station (we can't virtualize everything). We built a tool that assigns tests to stations and reports the results. This lets us queue up thousands of tests and let them run overnight, or for any length of time.
Occasionally, test stations drop out of the cluster. When I remotely log into one of them, I get a black screen, then the station reboots, and upon logging in I'm notified that Windows XP had a "serious error". The Event Log contains a record of this error, which states Category: (102) and Event ID: 1003.
Previously, we found that this was caused by the creation of hundreds of temporary Firefox profiles. Our tests use Selenium WebDriver to automate website interactions, and each time we started a new browser, a temporary Firefox profile was created. We added a step to the cleanup between tests that deletes these temporary Firefox profiles, but stations still drop out sometimes, always with the same serious error and the same record in the Event Log.
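For context, the cleanup step is along these lines. This is a simplified sketch, not our exact code: the function name `remove_temp_profiles` and the `webdriver-py-profilecopy` marker directory are assumptions for illustration, and the marker may differ depending on the Selenium version in use.

```python
import os
import shutil
import tempfile


def remove_temp_profiles(temp_root=None, marker="webdriver-py-profilecopy"):
    """Delete temp directories that contain a Selenium Firefox profile copy.

    `marker` is an assumed subdirectory name; adjust it to whatever the
    installed Selenium version actually creates.
    Returns the names of the directories that were removed.
    """
    root = temp_root or tempfile.gettempdir()
    removed = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        # Only touch directories that hold a profile copy; leave the rest alone.
        if os.path.isdir(path) and os.path.isdir(os.path.join(path, marker)):
            # ignore_errors: a browser that has not fully exited may still
            # hold files open, so a partial delete must not abort the cleanup.
            shutil.rmtree(path, ignore_errors=True)
            if not os.path.exists(path):
                removed.append(name)
    return removed
```

It runs between tests, after the browser has been closed, so in principle no profile directories should accumulate from one test to the next.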
I would like to find the root cause of this problem, but I don't know how to go about doing this. I've tried searching for information about how to read event log entries, but I haven't turned up anything that helps. I'm open to any suggestions for ways to go about debugging this issue.