Add selenium remote timeout

2019-01-18 18:06:03 +01:00
parent fa62dfe28f
commit ed1548e943
1 changed files with 49 additions and 0 deletions
@@ -0,0 +1,49 @@
 ---
 title: "Selenium Remote Driver Timeout"
 date: 2019-01-18T17:30:17+01:00
 author: James McDonald
 type: post
 categories:
  - Tech
 draft: true
 ---
 Investigating an issue with a Selenium grid revealed some interesting
 shenanigans. We were experiencing a problem where some (working) tests failed
 and the Selenium grid was stuck with browsers apparently busy and jobs in the
 queue. Sometimes the grid itself would become unresponsive.

 After a bunch of investigation I managed to track down the source: the test
 suite was setting the Selenium client's `read_timeout` to 15 seconds. Doesn't
 sound so bad, right? So here's where it all goes bork...

 The test job runs 8 tests in parallel, and more than one job can run at the
 same time, so the grid can see 16, 24 or more concurrent requests.

 The interesting stuff starts when the 15 second timer is exceeded. The client
 immediately gives up, marks the test as failed because of `ReadTimeout` and
 goes on to the next test. But Selenium doesn't know about that, so the job
 stays in the grid's queue. That wouldn't be too bad in itself, but
 unfortunately that's not the end of it. When the job gets allocated a browser
 instance, it runs normally. Then, as far as I can tell, the browser instance
 sits and waits politely. Presumably it expects some client thread to come along
 and pick up the result, but the client is long gone. So it sits. And waits.
 Until the `browserTimeout` reaper comes along and stabs it.
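
The "client gives up, server carries on regardless" part can be reproduced without Selenium at all. The sketch below is only an analogy built on Python's standard library: the thread pool stands in for the grid, and `run_job`, `run_test` and the `"ReadTimeout"` string are made-up names for illustration, not anything from the Selenium client.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_job(name, work_seconds, finished):
    """Stand-in for a browser running a test; it never learns who is waiting."""
    time.sleep(work_seconds)
    finished.append(name)
    return "passed"


def run_test(read_timeout, work_seconds):
    """Submit one job and wait at most read_timeout seconds for its result."""
    finished = []
    with ThreadPoolExecutor(max_workers=1) as grid:
        future = grid.submit(run_job, "test-1", work_seconds, finished)
        try:
            outcome = future.result(timeout=read_timeout)
        except FutureTimeout:
            # The client marks the test as failed and moves on...
            outcome = "ReadTimeout"
    # ...but leaving the `with` block waits for the worker, which shows that
    # the job still ran to completion after the client stopped listening.
    return outcome, finished
```

Calling `run_test(read_timeout=0.1, work_seconds=0.5)` reports a timeout to the caller even though the job ends up in `finished`, which is roughly the mismatch described above: the work is done, but nobody is left to collect it.
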

 Remember the client that went and started on the next test? That one might get
 stuck in the queue too. And another, and another. And more from all the other
 impatient threads running their own tests. Quickly, the browser pool is
 saturated with stuck browsers waiting for clients that have wandered off. Add a
 couple of hundred of these and you can jam up the whole grid queue to the point
 where the grid service no longer responds at all.

 As an aside, it seems like the browsers get very upset by this situation. Chrome
 in particular chews up multiple gigabytes whilst apparently doing nothing until
 these jobs are finished. I'm not entirely sure it's related, because
 browsers do love them some RAMs at the best of times.

 There might be several solutions to this, but I went for the simplest one. We
 increased the timeout to 2 minutes (the default appears to be 1 minute, which
 would probably also be fine). The nice, patient test clients leave plenty of
 time for requests to be handled, get the responses they're looking for, and
 nobody jams up anybody's queues. Lovely.
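
To get a feel for why the longer timeout helps, here is a toy, second-by-second model of the behaviour described in this post. It is a sketch under assumptions, not a description of how the grid is actually implemented: the 8 parallel clients and the 15 and 120 second timeouts come from the story above, while the number of browsers, the job duration, the `browser_timeout` value and every name in the code are invented for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    client: int
    submitted: int           # tick (second) at which the request was sent
    abandoned: bool = False  # set once the client has timed out on it


@dataclass
class Result:
    passed: int
    failed: int
    peak_queue: int


def simulate(read_timeout, browsers=4, clients=8, tests_per_client=5,
             run_seconds=10, browser_timeout=60):
    queue = deque()   # jobs waiting for a browser
    busy = []         # (job, finish_tick) for each occupied browser
    waiting = {}      # client id -> job that client is still waiting on
    remaining = [tests_per_client] * clients
    passed = failed = peak_queue = tick = 0

    while any(remaining) or waiting or queue or busy:
        # Impatient clients fail the test and move on; the job stays put.
        for client, job in list(waiting.items()):
            if tick - job.submitted >= read_timeout:
                job.abandoned = True
                failed += 1
                del waiting[client]

        still_busy = []
        for job, finish in busy:
            if tick < finish:
                still_busy.append((job, finish))   # still running
            elif not job.abandoned:
                passed += 1                        # result collected, browser freed
                del waiting[job.client]
            elif tick < finish + browser_timeout:
                still_busy.append((job, finish))   # stuck until the reaper arrives
        busy = still_busy

        # Every idle client with tests left submits its next one.
        for client in range(clients):
            if client not in waiting and remaining[client] > 0:
                remaining[client] -= 1
                job = Job(client, tick)
                waiting[client] = job
                queue.append(job)

        # Free browsers take jobs from the head of the queue, abandoned or not.
        while queue and len(busy) < browsers:
            busy.append((queue.popleft(), tick + run_seconds))

        peak_queue = max(peak_queue, len(queue))
        tick += 1

    return Result(passed, failed, peak_queue)
```

Comparing `simulate(read_timeout=15)` with `simulate(read_timeout=120)` under these made-up numbers, the impatient clients fail most of their tests and leave a much longer queue behind them, while the patient ones pass everything. The real grid is certainly messier than this, but the shape of the problem is the same.
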