| title | date | author | type | categories | draft |
|---|---|---|---|---|---|
| Selenium Remote Driver Timeout | 2019-01-18T17:30:17+01:00 | James McDonald | post | Tech | true |

Investigating an issue with a Selenium grid revealed some interesting shenanigans. We were experiencing a problem where some (working) tests failed and the Selenium grid was stuck with browsers apparently busy and jobs in the queue. Sometimes the grid itself would become unresponsive.

After a bunch of investigation I managed to track down the source: the test suite was setting the Selenium client's read_timeout to 15 seconds. Doesn't sound so bad, right? So here's where it all goes bork...

The test job runs 8 tests in parallel, and it's possible for more than one job to run at the same time, so the number of simultaneous clients can be any multiple of 8: 16, 24 and so on.

The interesting stuff starts when the 15 second timer is exceeded. The client immediately gives up, marks the test as failed because of ReadTimeout and goes on to the next test. But Selenium doesn't know about that, so the job stays in the grid's queue. That wouldn't be too bad in itself, but unfortunately that's not the end of it. When the job gets allocated a browser instance, it runs normally. Then, as far as I can tell, the browser instance sits and waits politely. Presumably it expects some client thread to come along and pick up the result, but the client is long gone. So it sits. And waits. Until the browserTimeout reaper comes along and stabs it.
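To make that failure mode concrete, here's a small stand-alone sketch. It doesn't touch Selenium at all: a thread pool plays the part of the grid, and `grid_job`, `run_with_read_timeout` and the sub-second timings are made-up stand-ins for the real request and the 15 second timeout. The point it tries to show is just that giving up on the client side does nothing to cancel the work on the other side.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

completed = []  # jobs the pretend grid finished, whether or not anyone was waiting


def grid_job(name, duration):
    """Hypothetical stand-in for a queued request; it can't tell the client left."""
    time.sleep(duration)
    completed.append(name)
    return name


def run_with_read_timeout(pool, name, duration, read_timeout):
    """Submit a job and wait at most read_timeout seconds for its result."""
    future = pool.submit(grid_job, name, duration)
    try:
        return "passed", future.result(timeout=read_timeout)
    except FutureTimeout:
        # The client gives up here, but the job keeps running in the pool.
        return "failed: ReadTimeout", None


pool = ThreadPoolExecutor(max_workers=1)  # a "grid" with a single browser
# The job needs 0.3 s, but the impatient client only waits 0.05 s.
status, _ = run_with_read_timeout(pool, "test-1", duration=0.3, read_timeout=0.05)
pool.shutdown(wait=True)  # let the orphaned job run to completion
print(status, completed)
```

The client reports a failure, yet `completed` still ends up containing `test-1`: the work was done for nobody, and the single worker was busy the whole time.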

Remember the client that went and started on the next test? That one might get stuck in the queue too. And another, and another. And more from all the other impatient threads running their own tests. Quickly, the browser pool is saturated with stuck browsers waiting for clients that have wandered off. Add a couple of hundred of these and you can jam up the whole grid queue to the point where the grid service no longer responds at all.
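A back-of-the-envelope sketch of why the pool fills up so fast. It assumes the worst case, where every request times out, and it assumes each orphaned request ties up a browser for a fixed `hold_time_s` once allocated. The 120 second hold time is a hypothetical figure (run time plus whatever `browserTimeout` is set to), and `browsers_needed` is a made-up helper. Only the 8 threads and the 15 second timeout come from the real setup.

```python
def browsers_needed(client_threads, read_timeout_s, hold_time_s):
    """Steady-state browser demand if every request is abandoned.

    Each thread abandons one request every read_timeout_s seconds, and each
    abandoned request later occupies a browser for hold_time_s seconds.
    Demand is the abandonment rate multiplied by the hold time.
    """
    return client_threads * hold_time_s / read_timeout_s


# Hypothetical hold time of 120 s; thread counts are multiples of 8.
one_job = browsers_needed(client_threads=8, read_timeout_s=15, hold_time_s=120)
two_jobs = browsers_needed(client_threads=16, read_timeout_s=15, hold_time_s=120)
print(one_job, two_jobs)  # 64.0 128.0
```

Under those assumptions a single job wants 64 browsers and two concurrent jobs want 128, so any grid with fewer browsers than that only ever sees its queue grow.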

As an aside, it seems like the browsers get very upset by this situation. Chrome in particular chews up multiple gigabytes whilst apparently doing nothing until these jobs are finished. I'm not entirely sure it's related, because browsers do love them some RAMs at the best of times.

There might be several solutions to this, but I went for the simplest one. We increased the client's read_timeout to 2 minutes (the default appears to be 1 minute, which would probably also be fine). The nice, patient test clients leave plenty of time for requests to be handled, get the responses they're looking for, and nobody jams up anybody's queues. Lovely.