Saturday 6 June 2015

Auto DB disconnect is a folly ... and what actually occurs??

For many, many years I have heard the excuse (and it is an excuse): "PHP closes the connection to the DB when the script ends... we don't need mysql_close()." This is put forward by programmers who have got away with it for a long time... yes, including me, before you all 'pass on by'... I am as guilty as the next one.

So, dear friends, let me explain what the nagging voice in the back of my mind kept telling me, without actually telling me, until we got hit by this particular issue... and hit hard.
Oh... and if you don't use MySQL, fear not, this affects you too... so please read on.

When a connection is made to a database, or ANY service across a TCP network, what actually happens is that a socket is opened. To all intents and purposes a socket is a file. It is used as a means of communication. Essentially (and sorry if you are a sysadmin and already know this... this is to help you out with your programmers), a socket is a place that buffers some data for your network stack to pass to its paired socket on a different machine... or even the same machine.
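If you want to see that for yourself, here is a tiny sketch (the host and port are placeholders, nothing to do with our setup) showing that PHP hands you back a resource you treat exactly like a file handle:

    <?php
    // A minimal sketch: a TCP connection really is just a file handle.
    // The host and port are placeholders for illustration only.
    $errno  = 0;
    $errstr = '';
    $sock = fsockopen('example.com', 80, $errno, $errstr, 5); // 5 second connect timeout

    if ($sock === false) {
        die("Could not open socket: [$errno] $errstr\n");
    }

    // You read and write the socket with the same functions you use on files.
    fwrite($sock, "HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n");
    echo fgets($sock);

    // Give the file handle back straight away, rather than waiting for a timeout.
    fclose($sock);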

So, with a DB connection to a different machine, your code sets up a socket (yes... it really does) on the node the code runs on, and the DB sets up a paired socket on its machine. That's one file handle gone on each server.
This socket will be torn down automatically after it has had no traffic for the default time, or after a close() call has been made... on Linux machines the default time is 1 minute.
Most scripting languages (yes, that's you, PHP) have a default timeout of 1 minute too... which is handy... not.
So when your script 'ends' without calling a close(), your local socket stays open for 1 minute, waiting for any stray packets to come back, and the server-side socket does likewise, waiting for any stray traffic over TCP.
Once the client side times out, it will send (thanks, TCP, for doing what I should have done) a close to its server-side socket, which will then close off... when it is sure that it has received that last packet... or after 1 minute.
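To make the 'call close() yourself' point concrete, here is a rough sketch using the old mysql_* API this post keeps mentioning (the hostname, credentials and database are made up; with mysqli or PDO the idea is the same, via mysqli_close() or dropping the PDO object):

    <?php
    // A sketch with the old mysql_* extension. The host, user, password and
    // database names below are placeholders, not real credentials.
    $link = mysql_connect('db.example.internal', 'app_user', 'secret');
    if ($link === false) {
        die('Connect failed: ' . mysql_error());
    }
    mysql_select_db('app_db', $link);

    $result = mysql_query('SELECT 1', $link);
    // ... use $result ...

    // The important bit: tear the connection down ourselves, right now,
    // instead of leaving both sockets lingering until the timeouts fire.
    mysql_close($link);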

What does this mean?

This means that every time your script does not issue an explicit close, a file handle on the client and another on the server are tied up for an extra minute, maybe two.
Yeah... I hear you say... so what? Well... if you are running a simple website, this may make no difference. But what if your website grows, or a lot of people suddenly hit it... then you may find it crashing... for exactly this reason.

In these days of cloud computing, a VM (virtual machine) generally has around 24k socket files per instance (yes, AWS/DO et al., I am looking at you). This enables them to spread the load on the underlying bare-metal machine. So, let's say 15k people hit your site in a given minute. That's 15k sockets taken up... no problem. Scripts and DB work fine. Then one day you spike... you've been running for years... and now you spike to 30k visitors in 10 seconds... and down you go. (Yes, this happened to us... it's why I am writing this.) What has happened is that the 0.005 ms connect and query you are doing is taking a minute to tear down. It doesn't matter that the query took a fraction of a second; it's the closing of the socket that killed you, as those sockets sit there waiting to make sure no stray packets turn up.
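If you prefer the back-of-envelope maths to the war story, here it is as a few lines of PHP (the numbers are the illustrative ones above, not measurements from our boxes):

    <?php
    // Rough arithmetic for the spike described above. Numbers are illustrative.
    $linger_seconds   = 60;          // roughly how long an un-closed socket hangs around
    $socket_budget    = 24000;       // approximate socket/file-handle budget on the VM
    $requests_per_sec = 30000 / 10;  // the spike: 30k visitors in 10 seconds

    // Each request's socket lives on for ~$linger_seconds after the script 'ends',
    // so the lingering sockets pile up:
    $concurrent_sockets = $requests_per_sec * $linger_seconds;

    printf("Sockets in use at once: %d (budget: %d)\n", $concurrent_sockets, $socket_budget);
    // Prints: Sockets in use at once: 180000 (budget: 24000)

The connect takes milliseconds; the teardown takes a minute, and the minute is what you pay for.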
Yes, web nodes help at the client side... but they won't help at the DB level... the more web nodes you open up, the fewer sockets are available on your single database server. If you are reading only, that's cool... you can put a proxy in and split the reads. If you are writing in a single-master scenario... you are royally st*ffed, because your master is about to die, and you won't have a spare socket to ssh into it to help out.
Yes, you can now go to a dedicated instance and have the normal 65k sockets (or thereabouts)... this helps, right up until 70k people want to write some details to your master in 1 minute.

We actually got hit because we were running mysql-proxy, on the same machine as the database, to get some logging out and to be able to expand our DB access. This made the scenario worse, because we had a socket going into the proxy, a socket out of the proxy, and one into the DB... all on the same machine. So for every 15k connections we were using 45k sockets on the database machine.

So, lesson learned: we now have dedicated instances and separate proxy servers, with 65k file handles on both the proxy and the DB. We are also laboriously going through and issuing a mysql_close() on the DB access object, and closing all other connections to TCP services. This has certainly helped; it will not solve the issue of a few hundred thousand people hitting within the same minute... but it's a lot better than dying when only 20k people connect to your servers.
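For what it's worth, this is roughly the pattern we are retrofitting... a thin wrapper (the class name and details here are hypothetical, not our actual DB access object) so that the close happens as soon as the work is done, with a destructor as a safety net for the code paths we miss:

    <?php
    // Hypothetical sketch of a wrapper that guarantees the close.
    class DbHandle
    {
        private $link;

        public function __construct($host, $user, $pass, $db)
        {
            $this->link = mysql_connect($host, $user, $pass);
            if ($this->link === false) {
                throw new RuntimeException('Connect failed: ' . mysql_error());
            }
            mysql_select_db($db, $this->link);
        }

        public function query($sql)
        {
            return mysql_query($sql, $this->link);
        }

        public function close()
        {
            if ($this->link) {
                mysql_close($this->link);  // hand the socket back immediately
                $this->link = null;
            }
        }

        public function __destruct()
        {
            $this->close();                // safety net if close() was never called
        }
    }

The point is to call close() the moment the last query is done, not to lean on the destructor... the destructor is only there to catch the paths we forget.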

I hope this helps any of you that have the same issue (and you may not realise this is the issue until you examine life under the hood). And please remember... a timeout or exit of a script does not close your connection to a TCP service... TCP does that for you, if not told otherwise, as part of its timeout process... which is a lot longer than you would like it to be in a high-pressure environment.