I've been using Selenium as a scraper/crawler because I need each page's content after its JavaScript has been evaluated. I have five EC2 machines, each running Selenium and a couple of instances of the scraper I wrote.
However, I'm noticing some really odd behavior. After several hours, Selenium shuts down on all the machines at around the same time. Since I start Selenium and the scrapers at the same time on all servers, this leads me to believe that there's some issue in Selenium that only shows up after it has been running for a long time.
Here's Selenium's log:
14:34:58.628 INFO - RemoteWebDriver instances should connect to: http://127.0.0.1:4444/wd/hub
14:34:58.629 INFO - Version Jetty/5.1.x
14:34:58.630 INFO - Started HttpContext[/selenium-server/driver,/selenium-server/driver]
14:34:58.631 INFO - Started HttpContext[/selenium-server,/selenium-server]
14:34:58.631 INFO - Started HttpContext[/,/]
14:34:58.753 INFO - Started org.openqa.jetty.jetty.servlet.ServletHandler@6a669053
14:34:58.753 INFO - Started HttpContext[/wd,/wd]
14:34:58.764 INFO - Started SocketListener on 0.0.0.0:4444
14:34:58.765 INFO - Started org.openqa.jetty.jetty.Server@2ef36617
21:24:41.031 INFO - Shutting down...
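To put a number on how long the server stayed up, the first and last timestamps of that run can be subtracted. This is only a sketch: `uptime` is a helper name made up for illustration, and it assumes each log line starts with an `HH:MM:SS.mmm` timestamp and that the server ran for less than 24 hours.

```python
from datetime import datetime, timedelta

LOG_TIME_FORMAT = "%H:%M:%S.%f"


def uptime(start_line, stop_line):
    """Return the time elapsed between two Selenium log lines.

    Assumes each line begins with an HH:MM:SS.mmm timestamp and that
    the server ran for less than 24 hours.
    """
    start = datetime.strptime(start_line.split()[0], LOG_TIME_FORMAT)
    stop = datetime.strptime(stop_line.split()[0], LOG_TIME_FORMAT)
    if stop < start:  # the shutdown was logged after midnight
        stop += timedelta(days=1)
    return stop - start


started = "14:34:58.765 INFO - Started org.openqa.jetty.jetty.Server@2ef36617"
stopped = "21:24:41.031 INFO - Shutting down..."
print(uptime(started, stopped))
```

For the log above this comes to roughly 6 hours 50 minutes between the server starting and shutting down.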
Another interesting thing I noticed: on each cluster, I always have at least one scraper instance that fails with this error:
  File "SiteScraper.py", line 238, in _add_rendered_html
    self.browser.get(url)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 168, in get
    self.execute(Command.GET, {'url': url})
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 156, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 147, in check_response
    raise exception_class(message, screen, stacktrace)
WebDriverException: Message: u'Modal dialog present'
I think this means that Selenium or Firefox (the browser I'm using with WebDriver) is popping up a modal dialog after running for a certain period of time.
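To check that guess, one option would be to wrap the `get` call so the dialog's text is logged before it is dismissed. This is a rough sketch rather than something verified on the cluster: `get_and_report_dialog` is a made-up helper, it assumes a Selenium version where `browser.switch_to.alert` is available (older releases used `switch_to_alert()`), and the message text it matches on is simply taken from the traceback above, so it may differ in other versions.

```python
import logging

try:
    from selenium.common.exceptions import WebDriverException
except ImportError:
    # Fallback so the sketch can be run without Selenium installed.
    WebDriverException = Exception

log = logging.getLogger("scraper")


def get_and_report_dialog(browser, url):
    """Load url; if a modal dialog blocks it, log its text and dismiss it.

    Returns the dialog text when a dialog was found, otherwise None.
    """
    try:
        browser.get(url)
        return None
    except WebDriverException as exc:
        if "Modal dialog present" not in str(exc):
            raise  # some other WebDriver failure; don't hide it
        alert = browser.switch_to.alert
        text = alert.text
        log.warning("Modal dialog while loading %s: %r", url, text)
        alert.dismiss()
        return text
```

Calling this in place of `self.browser.get(url)` in `_add_rendered_html` should at least show what the dialog says, which might indicate whether it comes from the page being scraped or from Firefox itself.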
Has anyone had a similar problem, or any insight into how to fix this?