| title | date | author | type | categories | draft |
|---|---|---|---|---|---|
| Selenium Remote Driver Timeout | 2019-01-18T17:30:17+01:00 | James McDonald | post | | true |
Investigating an issue with a Selenium grid revealed some interesting shenanigans. We were experiencing a problem where some (working) tests failed and the Selenium grid was stuck with browsers apparently busy and jobs in the queue. Sometimes the grid itself would become unresponsive.
After a bunch of investigation I managed to track down the source: the test
suite was setting the Selenium client's read_timeout to 15 seconds. Doesn't
sound so bad, right? So here's where it all goes bork...
The test job runs 8 tests in parallel, and more than one job can run at the same time, so the number of concurrent clients hitting the grid is some multiple of 8.
The interesting stuff starts when the 15 second timer is exceeded. The client
immediately gives up, marks the test as failed because of ReadTimeout and
goes on to the next test. But Selenium doesn't know about that, so the job
stays in the grid's queue. That wouldn't be too bad in itself, but
unfortunately that's not the end of it. When the job gets allocated a browser
instance, it runs normally. Then, as far as I can tell, the browser instance
sits and waits politely. Presumably it expects some client thread to come along
and pick up the result, but the client is long gone. So it sits. And waits.
Until the browserTimeout reaper comes along and stabs it.
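
To make that timeline concrete, here is a minimal sketch of a single session request. It is a toy model of the behaviour described above, not the grid's actual internals: `run_job`, `Outcome` and all the numbers are hypothetical, and it assumes an abandoned request still gets a browser that is held until `browser_timeout` expires.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    client_timed_out: bool      # did the client report ReadTimeout?
    browser_held_seconds: int   # how long the browser slot stays occupied
    wasted_seconds: int         # occupied time that no client benefits from


def run_job(queue_wait, test_duration, read_timeout, browser_timeout):
    """Toy timeline for one session request; all times in seconds."""
    if queue_wait < read_timeout:
        # The client is still listening when the browser is allocated.
        return Outcome(False, test_duration, 0)
    # The client gave up while the request was queued. The grid still
    # allocates a browser, which then idles until the reaper kills it.
    return Outcome(True, browser_timeout, browser_timeout)
```

In this model, a request that waits 40 seconds in the queue against a 15 second read timeout fails on the client side and still ties up a browser for the full `browser_timeout`.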
Remember the client that went and started on the next test? That one might get stuck in the queue too. And another, and another. And more from all the other impatient threads running their own tests. Quickly, the browser pool is saturated with stuck browsers waiting for clients that have wandered off. Add a couple of hundred of these and you can jam up the whole grid queue to the point where the grid service no longer responds at all.
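
The pile-up is easier to see in a small simulation. This is a sketch under stated assumptions, not a model of the real grid: a FIFO queue, one-second ticks, every test taking the same time, and made-up values for the pool size, test duration and `browser_timeout`. The `simulate` function and its parameters are hypothetical names.

```python
from collections import deque


def simulate(read_timeout, browsers=4, workers=8, test_duration=20,
             browser_timeout=60, horizon=600):
    """Tick-based toy model of a grid queue fed by client threads.

    Returns (passed, failed, queue_length) after `horizon` seconds.
    """
    queue = deque()                 # FIFO of session requests
    free_at = [0] * browsers        # time at which each browser is free again
    busy_until = [None] * workers   # end time of the test a worker is running
    waiting = [None] * workers      # the request a worker is waiting on
    passed = failed = 0

    def submit(worker, now):
        request = {"worker": worker, "submitted": now, "abandoned": False}
        queue.append(request)
        waiting[worker] = request

    for w in range(workers):
        submit(w, 0)

    for now in range(horizon + 1):
        for w in range(workers):
            # A finished test frees the worker to request its next session.
            if busy_until[w] is not None and busy_until[w] <= now:
                busy_until[w] = None
                passed += 1
                submit(w, now)
            # A worker that waited too long gives up and moves on,
            # but its old request stays in the queue.
            request = waiting[w]
            if request is not None and now - request["submitted"] >= read_timeout:
                request["abandoned"] = True
                failed += 1
                submit(w, now)
        for b in range(browsers):
            if free_at[b] <= now and queue:
                request = queue.popleft()
                if request["abandoned"]:
                    # Nobody collects this session; the reaper frees it later.
                    free_at[b] = now + browser_timeout
                else:
                    w = request["worker"]
                    waiting[w] = None
                    busy_until[w] = now + test_duration
                    free_at[b] = now + test_duration
    return passed, failed, len(queue)
```

Comparing `simulate(read_timeout=15)` with `simulate(read_timeout=120)` shows the effect: with the short timeout the head of the queue fills with abandoned requests, browsers spend their time on sessions nobody wants, and the queue keeps growing, while the patient clients never time out.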
As an aside, it seems like the browsers get very upset by this situation. Chrome in particular chews up multiple gigabytes of memory whilst apparently doing nothing until these jobs are finished. I'm not sure it's related, because browsers do love them some RAMs at the best of times.
There might be several solutions to this, but I went for the simplest one. We increased the timeout to 2 minutes (the default appears to be 1 minute, which would probably also be fine). The nice, patient test clients leave plenty of time for requests to be handled, get the responses they're looking for, and nobody jams up anybody's queues. Lovely.
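
For a rough idea of how much patience is "plenty", a back-of-envelope check might look like the sketch below. It assumes every client asks for a browser at once, a FIFO queue and equal-length tests; the helper names, the pool size and the test duration are hypothetical, and only the 8 parallel tests and the 15 second and 2 minute timeouts come from our setup.

```python
import math


def worst_case_queue_wait(clients, browsers, test_duration):
    """Longest wait for a browser, in seconds, if all clients ask at once."""
    # Clients are served in waves of `browsers`; the last wave waits
    # for every wave ahead of it to finish.
    waves_ahead = math.ceil(clients / browsers) - 1
    return waves_ahead * test_duration


def timeout_is_safe(read_timeout, clients, browsers, test_duration):
    """True if the read timeout outlasts the worst-case queue wait."""
    return read_timeout > worst_case_queue_wait(clients, browsers, test_duration)
```

With 8 clients, 4 browsers and 20 second tests, the worst-case wait in this model is 20 seconds, so a 15 second timeout is too short and 2 minutes leaves a comfortable margin. Real tests vary in length, so treat this as a lower bound rather than a guarantee.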