Monday 7 September 2020

The Case for a MongoDB single sharded cluster
(or why PSA/PSS is suboptimal)

Recently I have bumped into a few clients who have hit issues with MongoDB due to growth. Without fail these have come down to a couple of problems, all centred on a Primary/Secondary/Arbiter (PSA) or Primary/Secondary/Secondary (PSS) setup, and on trying to squeeze more performance out of that setup.

Now, I understand that these were, almost exclusively, set up by front end devs (nothing wrong with that), but the issue lies with assuming that MongoDB is a relational database. We all feel warm and cuddly about a relational DB, its Master/Slave (Primary/Secondary) approach, and the fact that you write to a master (primary) and read from a slave (secondary). So Mongo gets set up in exactly the same way, and all is wonderful… up until we have squeezed that last ounce of performance out of it, and things still don't work well.

“Isn't MongoDB one of the fastest, most efficient DBs on the market? So why is my site dying?”

Let's look at why PSA is not a great idea

First off... if you are using MongoDB on a site that you know for certain will never, in the lifetime of you, your kids and your grandkids, need more than one copy of the data on a single machine... please ignore this part of the post and carry on setting up as PSA… or PSS.

Ok... so let's look at a typical scenario. It has to be said that most issues surround the use of memory/cache by WiredTiger (WT), MongoDB's default storage engine. WT is the main reason why MongoDB is fast… blindingly fast… if used properly.

Sample setup of PSS/PSA
By default, WT will use the larger of 50% of (RAM - 1 GB) or 256 MB for its cache. What it does with this is:
1. Load as many indexes as it can into memory.
2. Assign memory to connections. You should never need more than 5k connections; if you do, something is wrong with your application.
3. Load as much recently used data into memory as it can, with whatever is left over.

The remaining memory, less whatever the OS is taking up, will be used by MongoDB for dealing with query results (aggregation, sorting, etc.) and for communicating with your application (this is important).

So let's assume you have gone large on this and have a box with 128 GB of memory. WT will take 63.5 GB. This means that all your indexes will need to fit into 63.5 GB at most, preferably less. Any space left in the cache will be used for the most recently used data. MongoDB itself, and the OS, will use the remaining 64.5 GB.
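A minimal sketch of that arithmetic, assuming the default cache formula above:

    def default_wt_cache_gb(ram_gb: float) -> float:
        # WiredTiger's default cache ceiling: the larger of 50% of (RAM - 1 GB) or 256 MB.
        return max(0.5 * (ram_gb - 1.0), 0.25)

    ram_gb = 128
    wt_cache_gb = default_wt_cache_gb(ram_gb)    # 63.5 GB for indexes and hot data
    everything_else_gb = ram_gb - wt_cache_gb    # 64.5 GB for the OS, sorts, pipelines, connections
    print(wt_cache_gb, everything_else_gb)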

You now check your usage and see that indexes are taking no more than 23 GB. That's great... plenty of room for expansion, plenty of room for data... all is going swimmingly well. Phew!

Then one day, it all starts slowing up. The Primary keeps swapping and needs restarting, queries that were fine are now hitting the slow logs, and all manner of weird stuff is occurring. No obvious reason for it. You know the data has increased, but you have made no physical changes to the setup, no COLLSCANs are showing up, and the IXSCANs look no different from before. So what has changed?

The chances are that your indexes have grown, the amount of data has grown as well (obviously), and essentially nothing actually 'fits' any more. This leads to WT having to evict old data, and then also evict the least-used indexes, in order to re-read the ones that are needed. This can be confirmed by looking at the server stats, or at a monitoring system (Atlas, Datadog, etc.) that shows them.
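If you want to check this by hand, the WiredTiger cache counters in serverStatus tell the story. A minimal sketch, assuming pymongo and a placeholder host:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder host
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

    configured = cache["maximum bytes configured"]
    in_cache = cache["bytes currently in the cache"]
    print(f"cache used: {in_cache / configured:.0%} of {configured / 2**30:.1f} GiB")

    # Eviction counters climbing quickly are a sign that data and indexes no longer fit.
    for key in ("unmodified pages evicted", "modified pages evicted"):
        print(key, cache.get(key))

A cache sitting permanently near its ceiling, with the eviction counters racing upwards, is exactly the 'nothing fits any more' situation described above.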

So what happens now?

In no particular order, I have seen the following, and none of these provide a permanent fix for the situation...

Add to the cache
More memory can be taken from the OS and added to the cache used by the mongod. This lets the mongod (data) unit hold more in memory, but reduces the amount available to the rest of the process for sorts and pipelines.
It is a valid 'fix', but only a temporary one, until the data grows some more... and we are back to square one.
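For reference, the ceiling is normally set with storage.wiredTiger.engineConfig.cacheSizeGB in mongod.conf; it can also be nudged at runtime. A sketch, assuming pymongo and a purely illustrative size:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # placeholder host

    # Raise the WiredTiger cache ceiling at runtime; the value is a WiredTiger
    # configuration string, and the size here is illustrative only.
    client.admin.command({
        "setParameter": 1,
        "wiredTigerEngineRuntimeConfig": "cache_size=80G",
    })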

Increase the machine size
If you are in the cloud, this is relatively simple; if not, it involves provisioning new hardware, either on premises or through your hosting service, in which case it is not a 'quick fix'. It also only buys you time, at extra cost.

Neither actually solves the problem of a growing business. Sure, you now have extra memory, and probably extra disk space, but you have not actually solved the issue for the long term.

Split the data into different databases
This comes from the mistaken idea that, because some clients' data is accessed more than others', splitting them apart will reduce index and data usage, allowing your more important data to stay in memory and reducing the actual memory usage.
What this actually does is slow the loading of indexes and data, because nothing is loaded into the cache until the database is used. However, typically, all your databases will be used at some point by the system (if they aren't... why are they there?).
Next, it increases your file handle usage at the OS level. Each index is opened and read... and the file handle is kept open so that refreshed data can be read in more quickly when changes are made. So for every index there is an open file.

At the OS level this can be bad too.
Let's say you have, to keep it simple, one database with 10 collections, and 15 clients. Each collection has 8 indexes, so for this one database you have 80 index files open. You then split that into one database per client. Now you have 15 copies of the above, so 80 × 15, or 1,200 index files open.
This is a small example. However, if you look at this MongoDB Jira ticket I found, you will see that some people struggle because they have, for example, 280 databases, 47,891 collections and 270,078 indexes.
Generally a maximum of around 65,000 file handles is allowed... so you can see the issue. With 1 database this user would have 1 database, 171 collections and 965 indexes, or similar. A whole different story.
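The same back-of-the-envelope arithmetic, as a sketch:

    def open_index_files(databases: int, collections_per_db: int, indexes_per_collection: int) -> int:
        # Rough count of index files kept open: one per index.
        return databases * collections_per_db * indexes_per_collection

    print(open_index_files(1, 10, 8))    # one shared database: 80 open index files
    print(open_index_files(15, 10, 8))   # one database per client: 1200 open index files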

Split the collections into multiple collections
This is similar in effect to splitting into multiple databases. I have seen it on more than one occasion with stats-type data: due to issues with the database, the collections were split into 'daily' or 'weekly' collections.
All this does is increase the complexity, multiply the number of indexes by 7, 14... whatever, and leave you with a collection management problem, with no real gain.
Further, it leads to having monthly/yearly aggregation collections, as it is no longer possible to use a single collection to get that data. So storage and index usage increase further, which is exactly what we are trying to reduce.

Multiple Clusters
Finally, because all of the above have failed to lead to a stable system, the decision is made to have multiple clusters, with sets of data completely separated from each other.

This does solve the memory issue, as the data really is split completely, with each cluster treating the data it holds as though it were the total data set.

However, this causes issues elsewhere. You now need to execute queries against the correct cluster for each dataset, you lose the ability to have a central source for all your company's data, and you need to query more than one cluster to get a single result across the company.

Think of simple questions like:
How many clients do we have?
Which client had the highest activity last week?
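Even the first of those becomes a fan-out across clusters. A sketch of what the application ends up doing, assuming pymongo and made-up host, database and field names:

    from pymongo import MongoClient

    # One client per cluster... and every "simple" question has to visit all of them.
    clusters = [
        MongoClient("mongodb://cluster-a:27017"),   # placeholder hosts
        MongoClient("mongodb://cluster-b:27017"),
    ]

    client_ids = set()
    for cluster in clusters:
        # "app.orders" and "clientId" are illustrative names.
        client_ids.update(cluster["app"]["orders"].distinct("clientId"))

    print("clients across the company:", len(client_ids))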

Add to this that the pattern leads to multiple clusters all doing the same thing, but with added maintenance and complexity for your sysadmins to deal with... multiple backups... all manner of costs for no particular gain.

Ok, so how do we solve this from the get-go?
What we should do is set MongoDB up as a single sharded cluster. This involves the following units, and please bear with me.

Mongod: the data/shard nodes.
Arbiter: at least one arbiter.
Mongos: the query nodes... these are called by the application, and call the data nodes and config servers directly... see below.
Config server(s): ideally a cluster of 3 for larger systems; if small, you could get away with 1 or 2... but what happens if they die?

MongoDB single sharded cluster
Arbiters
These do what they always do: keep the primaries and secondaries in order, and help MongoDB work out which shard server should be a primary.

Config Servers
These hold the metadata, including how the data is distributed across the shard servers... your current data nodes.

Shard/Data nodes
These hold your actual data. Coming from a PSA/PSS setup, you will start with one set of these, which can itself be replicated.

Mongos nodes
These route the queries. They take information from the config servers, send the 'data gathering' part of the query to the relevant shard nodes, and deal with any aggregation, sorting and extra work locally on the mongos.
Importantly, the mongos can reside on the same machines as your application/web servers, if they are big enough. There is no data on these, only memory requirements.
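From the application's point of view, the only change is that it connects to the mongos routers rather than to the replica set directly. A sketch, assuming pymongo and placeholder hostnames:

    from pymongo import MongoClient

    # Point the application at the (load balanced) mongos routers, not at the shard members.
    client = MongoClient("mongodb://mongos-1:27017,mongos-2:27017/")  # placeholder hosts

    db = client["app"]                         # illustrative database name
    db["orders"].find_one({"clientId": 42})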

Extra work required.
Each collection will need a 'shard' key, which should be as unique as possible, and NOT the primary key. Further, the most-used indexes should have the shard key as the first field in the index... it just makes life easier for data retrieval.
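A sketch of the commands involved, assuming pymongo; the database, collection and shard key field names are illustrative:

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-1:27017")    # connect via a mongos (placeholder host)

    # Enable sharding for the database, then shard the collection on a compound key.
    client.admin.command("enableSharding", "app")
    client.admin.command(
        "shardCollection", "app.orders",
        key={"clientId": 1, "createdAt": 1},             # shard key fields lead the main indexes
    )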

What does this achieve?

Separation of Concerns

From the outset, we have reduced the workload on the data node(s) from performing both data retrieval and aggregation to just performing data retrieval, with only very basic aggregation and/or sorting done on the relevant node.

The mongos nodes perform any aggregation/sorting locally, once they have received all the data from the data node(s).
This means that the data nodes are now limited to just data retrieval, freeing more memory for that task.
Further, it means that you have a choice. If the application slows up due to more users, or more aggregations, you can simply add more load balanced mongos units.
If your data is now the cause of any issues, you simply add more shard nodes, splitting the data down again. This is 'horizontal sharding' at its best.

Separation of Concerns and Expand your Cluster
Expand your Cluster where it is needed
As your data grows, and the system starts to become overloaded, you simply add more shard/data nodes.

For each new shard node, MongoDB will automatically rebalance your data across the nodes. So, if you are in single-shard (PSA) mode, adding a new data node will halve the data on the current node (effectively doubling the available memory), more than likely giving you the space you need without increasing the individual node size.

Remember, if you run each shard as a replica set, you will need to add a server per member; so if you have a Primary and a Secondary, you will need to add two or more new servers, so that the new shard also has a Primary and a Secondary.
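Adding the new shard is then a single command against a mongos, after which the balancer migrates data onto it in the background. A sketch, assuming pymongo, with an illustrative replica set name and hosts:

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-1:27017")    # connect via a mongos (placeholder host)

    # Add the new shard (here a two-member replica set); the balancer then
    # moves data onto it in the background.
    client.admin.command("addShard", "shard2rs/shard2-a:27018,shard2-b:27018")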

If your user base grows, or more time is spent on aggregations, you add more load balanced mongos units, as mentioned above.

This setup actually solves all the issues mentioned in the first part of this post. If you have split your databases by client, then you should really merge them back into one, with the client identifier as a field in each collection.

You wanted Speed?
Well, now you can have it. The Primaries can now all be relatively small units, and can be set up as 'in memory' databases.
This depends on the total size of your data and indexes per node, which should now be a lot smaller.
Having done this with a 15-20 node cluster in the past, it made a huge difference to performance. Ensuring that all the data and indexes... per node... fit into the node's memory on the Primaries increased write throughput. We were also able to direct 'important quick queries' back to the primaries, to make them even faster.

Saturday 6 June 2015

Auto DB disconnect is a folly ... and what actually occurs??

For many, many years I have heard the excuse (and it is an excuse): "PHP closes the connection to the DB when the script ends... we don't need mysql_close()". This is put forward by programmers who have got away with it for a long time... yes... including me, before you all 'pass on by'... I am as guilty as the next one.

So, dear friends, let me explain what the nagging voice in the back of my mind kept telling me, without actually telling me, until we got hit by this particular issue... and hit hard.
Oh... and if you don't use MySQL, fear not, this affects you too... so please read on.

When a connection is made to a database, or to ANY service across a TCP network, what actually happens is that a socket is opened. To all intents and purposes a socket is a file. It is used as a means of communication. Essentially (and sorry if you are a sysadmin and already know this... this is to help you out with your programmers), a socket is a place that stores some data for your network stack to pass to its paired socket on a different machine... or even the same machine.

So, with a DB connection to a different machine, your code sets up a socket (yes... it really does) on the node the code runs on, and the DB sets up a paired socket on its machine. That's one file handle gone on each server.
This socket will be set up to be killed automatically after it has had no traffic for the default time, or after a close() call has been made... on Linux machines the default time is 1 minute.
Most scripting languages (yes, that's you, PHP) have a default timeout of 1 minute too... which is handy... not.
So when your script 'ends' without calling a close(), your local socket stays open for 1 minute, waiting for any stray packets to come back, and the server-side socket does likewise, waiting for any stray traffic over TCP.
Once the client side times out, it will send (thanks, TCP, for doing what I should have done) a close to its server-side socket, which will then close off... when it is sure that it has received that last packet... or after 1 minute.

What does this mean?

This means that every time your script does not issue an explicit close, you have a file handle on the client and on the server taken up for an extra minute, maybe two.
Yeah... I hear you say... so what? Well... if you are running a simple website, this may make no difference. But what if your website grows, or a lot of people suddenly hit it... then you may find it crashing... for exactly this reason.

In these days of cloud computing, a VM (virtual machine) generally has 24k socket files per instance (yes, AWS/DO et al., now I am looking at you). This enables the providers to spread the load on the underlying bare metal machine. So, let's say 15k people hit your site in a given minute. That's 15k sockets taken up... no problem. Scripts and DB work fine. Then one day you spike... you've been running for years... and now you spike to 30k visitors in 10 seconds... and down you go. (Yes, this happened to us... it's why I am writing this.) What has happened is that the 0.005 ms connect and query you are doing is taking a minute to tear down. It doesn't matter that it took a fraction of a second; it's the closing of the sockets that killed you, as they sit there waiting to make sure no stray packets are left.
Yes... web nodes help on the client side... but won't help at the DB level... the more web nodes you open up, the fewer sockets remain available on your single database server. If you are only reading, that's cool... you can put a proxy in and split the reads. If you are writing in a single-master scenario... you are royally st*ffed, because your master is about to die, and you won't have a spare socket to ssh into it to help out.
Yes... you can now move to a dedicated instance and have the usual 65k sockets (or thereabouts)... this helps, until 70k people want to write some details to your master in 1 minute.

We actually got hit because we were running mysql-proxy on the same machine as the database, to get some logging out and to be able to expand our DB access. What this did was make the scenario worse, because we had a socket going into the proxy, a socket out of the proxy, and one into the DB... all on the same machine. So for every 15k connections we were using 45k sockets on the database machine.

So, lesson learned: we now have dedicated instances and separate proxy servers, with 65k file handles on both the proxy and the DB. We are also laboriously going through the code, issuing a mysql_close() on the DB access object, and doing the same for all other connections to TCP services. This has certainly helped; it will not save us from a few hundred thousand people hitting within the same minute... but it's a lot better than dying when only 20k people connect to your servers.
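The fix itself is trivial; for illustration, here is the same pattern sketched in Python (the connection details are placeholders, and the principle is identical to the PHP mysql_close()/mysqli_close() calls mentioned above):

    import mysql.connector  # assumes the mysql-connector-python package

    conn = mysql.connector.connect(
        host="db.example.com", user="app", password="secret", database="shop",  # placeholders
    )
    try:
        cur = conn.cursor()
        cur.execute("SELECT COUNT(*) FROM orders")
        print(cur.fetchone())
    finally:
        # Close explicitly: the socket is torn down now, not after the TCP/script
        # timeouts this post is about.
        conn.close()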

I hope this helps any of you who have the same issue (and you may not realise this is the issue until you examine life under the hood). Please remember: a timeout or exit of a script does not immediately close your connection to a TCP service... TCP does that for you, if not told otherwise, as part of its timeout process... which is a lot longer than you would like it to be in a high-pressure environment.