From ed1548e94389987b1a3d935536c15fd7000fe666 Mon Sep 17 00:00:00 2001
From: James McDonald <james@jamesmcdonald.com>
Date: Fri, 18 Jan 2019 18:06:03 +0100
Subject: [PATCH] Add selenium remote timeout

---
 content/posts/selenium-remote-timeout.md | 49 ++++++++++++++++++++++++
 1 file changed, 49 insertions(+)
 create mode 100644 content/posts/selenium-remote-timeout.md

diff --git a/content/posts/selenium-remote-timeout.md b/content/posts/selenium-remote-timeout.md
new file mode 100644
index 0000000..953268e
--- /dev/null
+++ b/content/posts/selenium-remote-timeout.md
@@ -0,0 +1,49 @@
+---
+title: "Selenium Remote Driver Timeout"
+date: 2019-01-18T17:30:17+01:00
+author: James McDonald
+type: post
+categories:
+  - Tech
+draft: true
+---
+
+Investigating an issue with a Selenium grid revealed some interesting
+shenanigans. We were experiencing a problem where some (working) tests failed
+and the Selenium grid was stuck with browsers apparently busy and jobs in the
+queue. Sometimes the grid itself would become unresponsive.
+
+After a bunch of investigation I managed to track down the source: the test
+suite was setting the Selenium client's `read_timeout` to 15 seconds. Doesn't
+sound so bad, right? So here's where it all goes bork...
+
+The test job runs 8 tests in parallel, and more than one job can run at the
+same time, so the number of concurrent tests is some multiple of 8.
+
+The interesting stuff starts when the 15-second timeout is exceeded. The client
+immediately gives up, marks the test as failed because of `ReadTimeout` and
+goes on to the next test. But Selenium doesn't know about that, so the job
+stays in the grid's queue. That wouldn't be too bad in itself, but
+unfortunately that's not the end of it. When the job gets allocated a browser
+instance, it runs normally. Then, as far as I can tell, the browser instance
+sits and waits politely. Presumably it expects some client thread to come along
+and pick up the result, but the client is long gone. So it sits. And waits.
+Until the `browserTimeout` reaper comes along and stabs it.
+
+Remember the client that went and started on the next test? That one might get
+stuck in the queue too. And another, and another. And more from all the other
+impatient threads running their own tests. Quickly, the browser pool is
+saturated with stuck browsers waiting for clients that have wandered off. Add a
+couple of hundred of these and you can jam up the whole grid queue to the point
+where the grid service no longer responds at all.
+
+As an aside, it seems like the browsers get very upset by this situation. Chrome
+in particular chews up multiple gigabytes whilst apparently doing nothing until
+these jobs are finished. I'm not entirely sure it's related, because
+browsers do love them some RAMs at the best of times.
+
+There might be several solutions to this, but I went for the simplest one. We
+increased the timeout to 2 minutes (the default appears to be 1 minute, which
+would probably also be fine). The nice, patient test clients leave plenty of
+time for requests to be handled, get the responses they're looking for, and
+nobody jams up anybody's queues. Lovely.