
We run a bunch of Python test scripts on a group of test stations. The test scripts interface with hardware units on these test stations, so we're stuck running one test script at a time per station (we can't virtualize everything). We built a tool to assign tests to different stations and report test results. This allows us to queue up thousands of tests and let them run overnight, or for any length of time.

Occasionally, test stations drop out of the cluster. When I remotely log into one, I get a black screen, then it reboots, and upon logging in I'm notified that Windows XP had a "serious error". The Event Log contains a record of this error, which states Category: (102) and Event ID: 1003.

Previously, we found that this was caused by the creation of hundreds of temporary Firefox profiles. Our tests use Selenium WebDriver to automate website interactions, and each time we started a new browser, a temporary Firefox profile was created. We added a step in the cleanup between tests that deletes these temporary Firefox profiles, but we're still finding that stations drop out sometimes, always with this serious error and the same record in the Event Log.
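For reference, the profile cleanup step looks roughly like the sketch below. This is a simplified illustration, not our exact code: the function name `remove_temp_profiles` and the default name pattern `*webdriver*` are placeholders, and the real directory names depend on the Selenium version in use.

```python
import glob
import os
import shutil
import tempfile


def remove_temp_profiles(temp_dir=None, patterns=("*webdriver*",)):
    """Delete directories in temp_dir whose names match any pattern.

    The default pattern is a placeholder (an assumption); adjust it to
    the profile directory names actually seen on the station.
    Returns the list of directories that were removed.
    """
    temp_dir = temp_dir or tempfile.gettempdir()
    removed = []
    for pattern in patterns:
        for path in glob.glob(os.path.join(temp_dir, pattern)):
            # Only touch directories; leave ordinary temp files alone.
            if os.path.isdir(path):
                shutil.rmtree(path, ignore_errors=True)
                if not os.path.exists(path):
                    removed.append(path)
    return removed
```

It runs once between tests, after the browser has been closed, so that no live session still holds files inside the profile directory.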

I would like to find the root cause of this problem, but I don't know how to go about doing this. I've tried searching for information about how to read event log entries, but I haven't turned up anything that helps. I'm open to any suggestions for ways to go about debugging this issue.


1 Answer


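I've run into a similar problem with Firefox before. On the rare occasions when we managed to catch a machine in the act, it simply wasn't closing its browser sessions, and eventually it blue-screened. Apparently this is a bug in WebDriver, Firefox, or XP (which we were also using). We worked around it by aggressively killing every Firefox process between each individual test. That worked for us, and since you aren't running tests in parallel, it should work for you too. By aggressive I mean taking an axe to it: the equivalent of killall -9 firefox, because those sessions don't respond.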
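On Windows, a rough equivalent of that forced kill can be sketched in Python as below. This is a minimal sketch, not the exact code we used: it assumes the `taskkill` command is available on the station (it ships with XP Professional) and that the browser's image name is `firefox.exe`. The helper names `build_kill_command` and `kill_all_firefox` are made up for illustration.

```python
import subprocess


def build_kill_command(image_name="firefox.exe"):
    """Build a taskkill command line for the given image name.

    /F forces termination, /T also ends child processes, and
    /IM selects processes by image name.
    """
    return ["taskkill", "/F", "/T", "/IM", image_name]


def kill_all_firefox(run=subprocess.call):
    """Force-kill every Firefox process and return the exit code.

    The `run` parameter exists only so the call can be replaced in
    tests; by default the command is executed with subprocess.call.
    A non-zero exit code is expected when no Firefox process is
    running, so callers should not treat it as a failure.
    """
    return run(build_kill_command())
```

Calling `kill_all_firefox()` in the cleanup between tests, before deleting any temporary profiles, ensures that no hung session survives into the next test.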

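As for the root cause? Certain versions of Firefox didn't show the problem, but we never really managed to debug it properly. Debugging is very hard because the issue doesn't reproduce in short test runs, and once it does occur, the machine crashes hard.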

Answered 2014-01-18T00:03:38.387