Saturday, 6 June 2015

Auto DB disconnect is a folly ... and what actually occurs??

For many many years I have heard the excuse (and it is an excuse) "PHP closes the connection to the DB when the script ends... we don't need mysql_close()". This is put forward by programmers who have got away with this for a long time... yes.. including me before you all 'pass on by'... I am as guilty as the next one.

So, dear friends, let me explain what the nagging voice in the back of my mind kept telling me, without actually telling me, until we got hit by this particular issue... and hit hard.
Oh.. and if you don't use MySql, fear not, this affects you too.. so please read on.

When a connection is made to a Database, or ANY service across a TCP network, what actually happens is a socket is opened. To all intents and purposes a socket is a file. It is used as a means of communication. Essentially (and sorry if you are a sysadmin and already know this... this is to help you out with your programmers) a socket is a place that stores some data for your network to pass to its paired socket on a different machine.. or even the same machine.

So, with a DB connection, to a different machine, your code sets up a socket (yes.. it really does) on the node the code runs on, and the DB sets up a paired socket on its machine. Thats one file handle gone on each server.
This socket will be set up with an automatic kill after it has had no traffic for the default time, or after a close() call has been made.. on Linux machines the default time is 1 minute.
Most scripting languages (yes thats you PHP) have a default timeout of 1 minute too... which is handy... not.
So when your script 'ends', without calling a close(), your local socket stays open for 1 minute, waiting for any erroneous packets to come back, and the server side socket does likewise, waiting for any erroneous traffic over TCP.
Once the client side times out, it will send (thanks TCP for doing what I should have done) a close to its server side socket, so it will then close off... when it is sure that it has received that last packet... or after 1 minute.

What does this mean?

This means that every time your script does not issue a specific close, you have a file handle on the client and the server taken up for an extra minute to maybe 2 minutes.
Yea.. I hear you say.. so what? Well.. if you are running a simple website, this may make no difference. But what if your website grows, or you suddenly have a lot of people hit it .... then you may find it crashing... for exactly this reason.

In these days of cloud computing, a VM (virtual machine) generally has 24k socket files per instance (yes AWS/DO et all.. now I am looking at you). This enables them to spread the load on the underlying bare metal machine. So, lets say 15k people hit your site in a given minute. Thats 15k sockets taken up.. no problem. Scripts and DB works fine. Then one day you spike.... yoou've been running for years.. and now you spike to 30k visitors in 10 seconds... and down you go. (Yes.. this happened with us... its why I am writing this). What has happened is that the 0.005 ms connect and query you are doing is taking 1m to tear down. Don't care that it took a fraction of a second, its the closing of the socket that killed you, as they are sitting there waiting to make sure no stray packets are there.
Yes.. web nodes help at the client side.. but wont help at the DB level... the more webnodes you open up, the less sockets become available on your single database server. If you are reading only, thats cool... you can put a proxy in, and split the reads. If you are writing in a single master scenario... you are royally st*ffed, because your master is about to die, and you wont have a spare socket to ssh into it, to help out.
Yes.. you can now go to a dedicated instance, and have the normal 65k sockets (or thereabouts).. this has helped until 70k people want to write some details to your master in 1 minute.

We actually got hit because we were running mysql-proxy, on the same machine as the database, to get some logging out, and to be able to expand our DB access. What this did was make the scenario worse, because we had a socket going to proxy, and a socket out of proxy and one into the DB .. all on the same machine. So for every 15k connections we were using 45k sockets on the database machine.

So lesson learned, we now have dedicated instances and separate proxy servers, with 65k file handles on the proxy and the DB. We are also laboriously going through issuing a mysql_close() on the DB access object, and also all other connections to TCP services. This has certainly helped, but will not solve the issue of a few hundred thousand hitting within the same minute... but its a lot better than dying when you only have 20k people connect to your servers.

I hope this helps any of you that have the same issue (and you may not realise this is the issue until you examine life under the hood.. and please remember.. a timeout or exit of a script does not close your connection to a TCP service.. TCP does that for you, if not told otherwise, as part of its timeout process.. which is a lot longer than you would like it to be in a high pressure environment.

Saturday, 12 July 2014

The Power of 4 (or the Higgs Boson of the MySql World)

One of the dangers of reading your logs, is that you may trip up over something you don't quite understand that leads you into months of searching and hypothesising on the whys and wherefores and fixes.

Picture the scene. About 18 months to 2 Years ago I was involved in tracking down as many long running queries, for a client, as possible. Trawling the slow logs, trawling the error logs to show what was not working because query xyz was holding things up. It wasn't particularly fun, but it was vital to the business.

One day up popped this error .. or something very similar to it:
         SQLSTATE[HY000] [2003] Can't connect to MySQL server on 'n.n.n.n (4)

Hadn't seen one of those before. Well, thought I, I'll wait till some downtime on Friday and take a look to see what happened. At the time this had already passed about 4-5 hours earlier, so the moment was passed.
Friday came, and I settled down to taking a look. I got hold of our friendly sysadmins to see what the connection rate to the server was, what the load looked like, what the network usage was... all that kind of good stuff. Nothing out of the ordinary.
Next I checked the slow logs on the server in question.. nothing that seemed likely.. in fact it happened on a quiet day. My interest was piqued. This actually happened on a web slave, that also acted as a slave Db to the clients extranet. It represented a 'missed' connection. No one had complained internally about not being able to connect, so, it could easily represent a missed sale (fickle people clients.. they just move on over to the next site and we've lost a sale or 40).

So, doing the normal checks drew a blank. I trawled the web, and Oracles support site. We weren't alone, but it wasn't a major issue either... and no obvious answers from anyone.

Well, that ended there. It was the weekend after all, and we had only one instance in the last year or two, and a few hundred thousand connections every day, so I just thought 'chalk it down to experience' and move on.

Life continued. About 10 days later, (definitely not the following week), up popped 2 in short succession, on the other of the pair of web slaves. Interesting thought I. Without waiting till Friday this time, I jumped straight on it. I gathered the usual culprits of information I recorded the time, I aksed around for people moaning about not getting onto the site. I looked at slow queries, network load, machine load (bear in mind this machine was a nice shiny new one with mega amounts of memory). Nada, Rien, Nothing.
At this stage I was logging slave lag as well as slow queries, for other reasons, so checked against that. No lag. I looked at crons that were running on the master... there was one that finished 2-3 seconds earlier, but nothing at the time. The slaves were up to date. The connections were well below the maximum. There was absolutely nothing that showed anything was wrong... except the (4) in the logs.
After a fruitless couple of hours, and with plenty else to be getting on with, I just logged the details I did have, and moved on.

Over the next couple of months we started getting more and more. As opposed to once in a blue moon, or once every week, we started getting 2 to 3 per day. They were definitely on the increase. I had a larger spreadsheet showing slaves, times, and all kinds of cool stuff. I also, by now, had an IT Director asking what was causing them? Like I actually knew? I mean.. if I knew, I would have suggested a raft of fixes. I hadn't, so I didn't know did I? I did suggest (in teh full knowledge that ths would produce a race condition), that we could simply check for a (4) error on connect, wait a nano second or two and try again... it met with a stony silence... the kind I would have given in his position frankly.

In a fit of pique (another one... I'm prone to those when something bugs me) I downloaded the mysql server and client source. I ran grep over the server source... nothing. I ran grep over the client library... and there it was .. O N E occurrence in the connect code. Phew! I would now know what causes it.
On went the programmers hat, out came Geany ( I know)... and up came the code. It became apparent, quite quickly, that the demon (4) error was an error that came midway of a connect. It was NOT a timeout as such. The connect process goes something similar to:
      Client: "Hi server ... you there?"
                                                                                             Server: "Yep... what do you want?"
      Client: "Can you log this guy in please?"
                                                                                             Server: "..."
      Client: "Hey .. did you hear me?"
                                                                                             Server: "...."
      Client: "(4)"

Gee.. thanks client library... thats a really useful explanation. So , in reality, I now knew it wasn't a timeout, it was simply a lack of response from the server, midway through a connection handshake. So, it was obviously not a programming issue... more a network issue. I knew just what the reaction from the sysadmins would be.. and I wasn't wrong... but as far as I was concerned the ball was definitely in their park now. Phew again.

The following morning I was met by a barage of information showing beyond all reasonable doubt, that it was nothing within the realms of the sysadmins workload. There was nothing they could do to identify what had caused this to happen at this particular time, and nothing they could tweak to make it go away either.
I was also met by 3 more (4) errors. Ok.. this was getting silly. They were on the increase, and each one could represent a missed customer.
I logged the times, and went back to look at my list, now longer than I would have liked, to see what was going on. I saw, to my horror, that apart from one or two of the 70-80 of these little devils, they all happened on the same slaves, and they all happened within 2-3 seconds of a cron job finishing on the master. This is within the time it took for the replication of that crons data to the slaves.

I went to the cron. Now, this particular cron, runs hourly. Its one that Inserts/Updates around 20k rows of data. This actually depends on the sales levels. The company had just started a particularly aggressive expansion.. still going on as I write this. The cron used to write each statement seperately. This had casued locking on a table a few months earlier, during the day. I had suggested a 'fix' whereby we use the 'multirow insert/update', to reduce the lock time (it was MyIsam if anyone is interested). I had completely forgotton that, as the fix had gone in a few weeks prior to the first occurrence of the (4) issue.

Thinking about this, I came up with the slightly crazy idea that al that was happeneing was the network was getting flooded with large chunks of data. not so flooded as to raise cacti alarms, but flooded enough to stop a single connect handshake to happen in the internally set time .
I put this forward to the sysadmins. They did the decent thing, thought about it, and came back with "we believe that to be extremely unlikely". well... for me it was the only sane conclusion. It covered the fact that it wasn't predictable, it covered the fact that it coincided with an increase in trade (that paid my salary), and every single one happened within a second or two of that cron finishing. Why was that then sysadmin? No answer.
It was getting messy. I never like going head to head with sysadmins... we need them. But in this case I had no other option. All other avenues were tried. My suggestion was to make the blocks of data slightly smaller, so that some space could be created for connection handshakes. So, reduce the blocks of rows from 100 to 50.
Nothing happened. Then we started seeing a real increase in the occurrence. I was asked to look again. I stood my ground. I stayed with the  theory that we needed to lower the block size.
I annoyed the IT director (he would never admit it) but it was the only theory anyone was putting forward that held any water. So .. my fix of the fix went in.

It has to be said, that the following morning there was a parting shot by the db, where it produced a (4) error 2 hours after the fix went in. But that was the last anyone has seen of it. That was over a year ago now, so one can only conclude, inconclusively, that I was right, and this was caused primarily by the inability of the connection handshake to make its way through a stream of data.

So my friends, if you are unfortunate enough to find one of these... save yourself a bunch of time, and look at the actual data passing up and down your network when it happens. It probably wont be enough to trigger an alarm, but it could easliy be enough to stop one little tiny relevaant tcp packet being acknowledged by your server, and your client library throwing a hissy fit.... resulting in the MySql swear word : (4) !