blog/content/posts/selenium-remote-timeout.md

---
title: "Selenium Remote Driver Timeout"
date: 2019-01-18T17:30:17+01:00
author: James McDonald
type: post
categories:
  - Tech
draft: true
---

Investigating an issue with a Selenium grid revealed some interesting
shenanigans. We were seeing otherwise working tests fail, while the Selenium
grid was stuck with browsers apparently busy and jobs sitting in the queue.
Sometimes the grid itself would become unresponsive.

After a bunch of investigation I managed to track down the source: the test
suite was setting the Selenium client's `read_timeout` to 15 seconds. Doesn't
sound so bad, right? So here's where it all goes bork...

The test job runs 8 tests in parallel, and more than one job can run at the
same time, so the grid may be handling 16, 24 or more tests at once.

The interesting stuff starts when the 15 second timer is exceeded. The client
immediately gives up, marks the test as failed because of `ReadTimeout` and
goes on to the next test. But Selenium doesn't know about that, so the job
stays in the grid's queue. That wouldn't be too bad in itself, but
unfortunately that's not the end of it. When the job gets allocated a browser
instance, it runs normally. Then, as far as I can tell, the browser instance
sits and waits politely. Presumably it expects some client thread to come along
and pick up the result, but the client is long gone. So it sits. And waits.
Until the `browserTimeout` reaper comes along and stabs it.
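
To make that concrete, here's a minimal sketch of the same pattern using nothing but Python's standard library. It is not Selenium code: the thread pool stands in for a grid with a single browser, `browser_session` and `run_test` are hypothetical names, and the timings are scaled down from seconds to fractions of a second. It only shows that a client-side timeout abandons the wait without cancelling the work.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

READ_TIMEOUT = 0.05  # stands in for the client's 15 second read_timeout
JOB_DURATION = 0.2   # stands in for a job that takes longer than that


def browser_session(log):
    """Hypothetical stand-in for the work the grid does for one test."""
    time.sleep(JOB_DURATION)
    log.append("session finished")
    return "result nobody collects"


def run_test(grid, log):
    """Submit one job and wait at most READ_TIMEOUT for its response."""
    job = grid.submit(browser_session, log)
    try:
        job.result(timeout=READ_TIMEOUT)
        return "passed", job
    except FutureTimeout:
        # The client gives up here, but nothing tells the grid to drop the job.
        return "failed: read timeout", job


log = []
with ThreadPoolExecutor(max_workers=1) as grid:  # a "grid" with one browser
    status, job = run_test(grid, log)
    still_running = not job.done()  # checked right after the client gave up
# Leaving the with-block waits for the abandoned job to run to completion.

print(status, still_running, log)
```

The test is reported as failed while the job is still occupying the only worker, and the job then finishes with no one left to read its result. That's the same shape as the grid problem, minus the reaper.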

Remember the client that went and started on the next test? That one might get
stuck in the queue too. And another, and another. And more from all the other
impatient threads running their own tests. Quickly, the browser pool is
saturated with stuck browsers waiting for clients that have wandered off. Add a
couple of hundred of these and you can jam up the whole grid queue to the point
where the grid service no longer responds at all.

As an aside, it seems like the browsers get very upset by this situation. Chrome
in particular chews up multiple gigabytes whilst apparently doing nothing until
these jobs are finished. I'm not sure it's related, though, because browsers do
love them some RAMs at the best of times.

There might be several solutions to this, but I went for the simplest one. We
increased the timeout to 2 minutes (the default appears to be 1 minute, which
would probably also be fine). The nice, patient test clients leave plenty of
time for requests to be handled, get the responses they're looking for, and
nobody jams up anybody's queues. Lovely.
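
If you want to see the pile-up and the fix side by side, here's a toy simulation in plain Python. It's a rough model of the behaviour described above, not a measurement of a real grid, and every number in it is an assumption: 16 clients (two jobs of 8), 4 browsers, 10 second tests, and a reaper that frees an orphaned browser 60 seconds after its job finishes. The `simulate` function and its parameters are hypothetical names invented for this sketch.

```python
from collections import deque


def simulate(read_timeout, clients=16, browsers=4, test_seconds=10,
             reaper_seconds=60, duration=600):
    """Step a toy grid one second at a time and count what happens."""
    queue = deque()              # pending requests, oldest first
    busy_until = [0] * browsers  # time at which each browser frees up
    waiting = {}                 # client id -> its live request
    running = {}                 # client id -> time its test finishes
    passed = failed = max_queue = 0

    def submit(client, now):
        request = {"client": client, "at": now, "abandoned": False}
        waiting[client] = request
        queue.append(request)

    for client in range(clients):
        submit(client, 0)

    for now in range(duration):
        # Clients whose test has finished start their next test.
        for client, done_at in list(running.items()):
            if done_at <= now:
                del running[client]
                passed += 1
                submit(client, now)
        # Impatient clients give up; the old request stays in the queue.
        for client, request in list(waiting.items()):
            if now - request["at"] >= read_timeout:
                request["abandoned"] = True
                failed += 1
                submit(client, now)
        # Each free browser takes the oldest queued request.
        for i in range(browsers):
            if busy_until[i] <= now and queue:
                request = queue.popleft()
                if request["abandoned"]:
                    # Runs normally, then waits for the reaper (assumed model).
                    busy_until[i] = now + test_seconds + reaper_seconds
                else:
                    del waiting[request["client"]]
                    running[request["client"]] = now + test_seconds
                    busy_until[i] = now + test_seconds
        max_queue = max(max_queue, len(queue))
    return {"passed": passed, "failed": failed, "max_queue": max_queue}


impatient = simulate(read_timeout=15)
patient = simulate(read_timeout=120)
print("15s timeout: ", impatient)
print("120s timeout:", patient)
```

In this model the longest honest wait for a browser is about 30 seconds, so the 15 second clients start abandoning requests almost immediately and the orphans hog the browsers, while the 120 second clients never time out at all. The exact counts depend entirely on the made-up parameters, so treat the output as an illustration of the mechanism and nothing more.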