From ed1548e94389987b1a3d935536c15fd7000fe666 Mon Sep 17 00:00:00 2001
From: James McDonald
Date: Fri, 18 Jan 2019 18:06:03 +0100
Subject: [PATCH] Add selenium remote timeout

---
 content/posts/selenium-remote-timeout.md | 49 ++++++++++++++++++++++++
 1 file changed, 49 insertions(+)
 create mode 100644 content/posts/selenium-remote-timeout.md

diff --git a/content/posts/selenium-remote-timeout.md b/content/posts/selenium-remote-timeout.md
new file mode 100644
index 0000000..953268e
--- /dev/null
+++ b/content/posts/selenium-remote-timeout.md
@@ -0,0 +1,49 @@
+---
+title: "Selenium Remote Driver Timeout"
+date: 2019-01-18T17:30:17+01:00
+author: James McDonald
+type: post
+categories:
+ - Tech
+draft: true
+---
+
+Investigating an issue with a Selenium grid revealed some interesting
+shenanigans. We were experiencing a problem where some (working) tests failed
+and the Selenium grid was stuck with browsers apparently busy and jobs in the
+queue. Sometimes the grid itself would become unresponsive.
+
+After a bunch of investigation I managed to track down the source: the test
+suite was setting the Selenium client's `read_timeout` to 15 seconds. Doesn't
+sound so bad, right? So here's where it all goes bork...
+
+The test job runs 8 tests in parallel, and it's possible for more than one job
+to be run at the same time, so more multiples of 8.
+
+The interesting stuff starts when the 15 second timer is exceeded. The client
+immediately gives up, marks the test as failed because of `ReadTimeout` and
+goes on to the next test. But Selenium doesn't know about that, so the job
+stays in the grid's queue. That wouldn't be too bad in itself, but
+unfortunately that's not the end of it. When the job gets allocated a browser
+instance, it runs normally. Then, as far as I can tell, the browser instance
+sits and waits politely. Presumably it expects some client thread to come along
+and pick up the result, but the client is long gone. So it sits. And waits.
+Until the `browserTimeout` reaper comes along and stabs it.
+
+Remember the client that went and started on the next test? That one might get
+stuck in the queue too. And another, and another. And more from all the other
+impatient threads running their own tests. Quickly, the browser pool is
+saturated with stuck browsers waiting for clients that have wandered off. Add a
+couple of hundred of these and you can jam up the whole grid queue to the point
+where the grid service no longer responds at all.
+
+As an aside, it seems like the browsers get very upset by this situation. Chrome
+in particular chews up multiple gigabytes whilst apparently doing nothing until
+these jobs are finished. I'm not entirely sure it's related, because
+browsers do love them some RAMs at the best of times.
+
+There might be several solutions to this, but I went for the simplest one. We
+increased the timeout to 2 minutes (the default appears to be 1 minute, which
+would probably also be fine). The nice, patient test clients leave plenty of
+time for requests to be handled, get the responses they're looking for, and
+nobody jams up anybody's queues. Lovely.
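
The pile-up described above can be sketched as a toy simulation. This is not Selenium code and it talks to no grid; it is a deliberately simplified model, and everything in it is an assumption made for illustration: the function name `simulate`, the one-second tick, the fixed 20-second test duration, the 60-second browser timeout, and the rule that a client gives up only while its job is still waiting for a browser. Only the 15-second and 2-minute timeouts and the 8 parallel tests come from the post.

```python
from collections import deque


def simulate(read_timeout, clients=16, browsers=8, tests_per_client=5,
             test_seconds=20, browser_timeout=60):
    """Toy grid model; returns (passed, failed, seconds_until_grid_idle).

    All parameters are hypothetical knobs for this sketch, not Selenium settings.
    """
    todo = [tests_per_client] * clients  # tests each client still has to submit
    waiting = {}                         # client -> its job while it waits in the queue
    attached = set()                     # clients whose job is running on a browser
    queue = deque()                      # FIFO of (client, job) waiting for a browser
    busy = []                            # (release_time, client or None) per browser in use
    passed = failed = 0
    t = 0
    while True:
        # 1. Free every browser whose session has ended.
        still_busy = []
        for release_time, client in busy:
            if release_time > t:
                still_busy.append((release_time, client))
            elif client is not None:
                passed += 1              # a client was still there to collect the result
                attached.discard(client)
        busy = still_busy

        # 2. Impatient clients give up, but their job stays in the queue.
        for client, job in list(waiting.items()):
            if t - job["submitted"] >= read_timeout:
                job["abandoned"] = True
                failed += 1
                del waiting[client]

        # 3. Every idle client submits its next test.
        for client in range(clients):
            if todo[client] and client not in waiting and client not in attached:
                todo[client] -= 1
                job = {"submitted": t, "abandoned": False}
                waiting[client] = job
                queue.append((client, job))

        # 4. Hand free browsers to queued jobs, abandoned or not.
        while queue and len(busy) < browsers:
            client, job = queue.popleft()
            if job["abandoned"]:
                # Nobody collects the result, so the browser idles until reaped.
                busy.append((t + test_seconds + browser_timeout, None))
            else:
                del waiting[client]
                attached.add(client)
                busy.append((t + test_seconds, client))

        if not queue and not busy and not any(todo):
            return passed, failed, t
        t += 1


if __name__ == "__main__":
    for timeout in (15, 120):
        print(timeout, simulate(read_timeout=timeout))
```

Running it with the two timeouts side by side should show the short timeout turning one slow batch into a cascade of failures, with the pool tied up long after the clients have finished, while the patient clients simply wait their turn. The exact numbers depend entirely on the made-up parameters, so treat them as a shape, not a measurement.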