Saturday 12 July 2014

The Power of 4 (or the Higgs Boson of the MySql World)

One of the dangers of reading your logs, is that you may trip up over something you don't quite understand that leads you into months of searching and hypothesising on the whys and wherefores and fixes.

Picture the scene. About 18 months to 2 Years ago I was involved in tracking down as many long running queries, for a client, as possible. Trawling the slow logs, trawling the error logs to show what was not working because query xyz was holding things up. It wasn't particularly fun, but it was vital to the business.

One day up popped this error .. or something very similar to it:
       
         SQLSTATE[HY000] [2003] Can't connect to MySQL server on 'n.n.n.n (4)

Hadn't seen one of those before. Well, thought I, I'll wait till some downtime on Friday and take a look to see what happened. At the time this had already passed about 4-5 hours earlier, so the moment was passed.
Friday came, and I settled down to taking a look. I got hold of our friendly sysadmins to see what the connection rate to the server was, what the load looked like, what the network usage was... all that kind of good stuff. Nothing out of the ordinary.
Next I checked the slow logs on the server in question.. nothing that seemed likely.. in fact it happened on a quiet day. My interest was piqued. This actually happened on a web slave, that also acted as a slave Db to the clients extranet. It represented a 'missed' connection. No one had complained internally about not being able to connect, so, it could easily represent a missed sale (fickle people clients.. they just move on over to the next site and we've lost a sale or 40).

So, doing the normal checks drew a blank. I trawled the web, and Oracles support site. We weren't alone, but it wasn't a major issue either... and no obvious answers from anyone.

Well, that ended there. It was the weekend after all, and we had only one instance in the last year or two, and a few hundred thousand connections every day, so I just thought 'chalk it down to experience' and move on.

Life continued. About 10 days later, (definitely not the following week), up popped 2 in short succession, on the other of the pair of web slaves. Interesting thought I. Without waiting till Friday this time, I jumped straight on it. I gathered the usual culprits of information I recorded the time, I aksed around for people moaning about not getting onto the site. I looked at slow queries, network load, machine load (bear in mind this machine was a nice shiny new one with mega amounts of memory). Nada, Rien, Nothing.
At this stage I was logging slave lag as well as slow queries, for other reasons, so checked against that. No lag. I looked at crons that were running on the master... there was one that finished 2-3 seconds earlier, but nothing at the time. The slaves were up to date. The connections were well below the maximum. There was absolutely nothing that showed anything was wrong... except the (4) in the logs.
After a fruitless couple of hours, and with plenty else to be getting on with, I just logged the details I did have, and moved on.

Over the next couple of months we started getting more and more. As opposed to once in a blue moon, or once every week, we started getting 2 to 3 per day. They were definitely on the increase. I had a larger spreadsheet showing slaves, times, and all kinds of cool stuff. I also, by now, had an IT Director asking what was causing them? Like I actually knew? I mean.. if I knew, I would have suggested a raft of fixes. I hadn't, so I didn't know did I? I did suggest (in teh full knowledge that ths would produce a race condition), that we could simply check for a (4) error on connect, wait a nano second or two and try again... it met with a stony silence... the kind I would have given in his position frankly.

In a fit of pique (another one... I'm prone to those when something bugs me) I downloaded the mysql server and client source. I ran grep over the server source... nothing. I ran grep over the client library... and there it was .. O N E occurrence in the connect code. Phew! I would now know what causes it.
On went the programmers hat, out came Geany ( I know)... and up came the code. It became apparent, quite quickly, that the demon (4) error was an error that came midway of a connect. It was NOT a timeout as such. The connect process goes something similar to:
     
      Client: "Hi server ... you there?"
                                                                                             Server: "Yep... what do you want?"
      Client: "Can you log this guy in please?"
                                                                                             Server: "..."
      Client: "Hey .. did you hear me?"
                                                                                             Server: "...."
      Client: "(4)"

Gee.. thanks client library... thats a really useful explanation. So , in reality, I now knew it wasn't a timeout, it was simply a lack of response from the server, midway through a connection handshake. So, it was obviously not a programming issue... more a network issue. I knew just what the reaction from the sysadmins would be.. and I wasn't wrong... but as far as I was concerned the ball was definitely in their park now. Phew again.

The following morning I was met by a barage of information showing beyond all reasonable doubt, that it was nothing within the realms of the sysadmins workload. There was nothing they could do to identify what had caused this to happen at this particular time, and nothing they could tweak to make it go away either.
I was also met by 3 more (4) errors. Ok.. this was getting silly. They were on the increase, and each one could represent a missed customer.
I logged the times, and went back to look at my list, now longer than I would have liked, to see what was going on. I saw, to my horror, that apart from one or two of the 70-80 of these little devils, they all happened on the same slaves, and they all happened within 2-3 seconds of a cron job finishing on the master. This is within the time it took for the replication of that crons data to the slaves.

I went to the cron. Now, this particular cron, runs hourly. Its one that Inserts/Updates around 20k rows of data. This actually depends on the sales levels. The company had just started a particularly aggressive expansion.. still going on as I write this. The cron used to write each statement seperately. This had casued locking on a table a few months earlier, during the day. I had suggested a 'fix' whereby we use the 'multirow insert/update', to reduce the lock time (it was MyIsam if anyone is interested). I had completely forgotton that, as the fix had gone in a few weeks prior to the first occurrence of the (4) issue.

Thinking about this, I came up with the slightly crazy idea that al that was happeneing was the network was getting flooded with large chunks of data. not so flooded as to raise cacti alarms, but flooded enough to stop a single connect handshake to happen in the internally set time .
I put this forward to the sysadmins. They did the decent thing, thought about it, and came back with "we believe that to be extremely unlikely". well... for me it was the only sane conclusion. It covered the fact that it wasn't predictable, it covered the fact that it coincided with an increase in trade (that paid my salary), and every single one happened within a second or two of that cron finishing. Why was that then sysadmin? No answer.
It was getting messy. I never like going head to head with sysadmins... we need them. But in this case I had no other option. All other avenues were tried. My suggestion was to make the blocks of data slightly smaller, so that some space could be created for connection handshakes. So, reduce the blocks of rows from 100 to 50.
Nothing happened. Then we started seeing a real increase in the occurrence. I was asked to look again. I stood my ground. I stayed with the  theory that we needed to lower the block size.
I annoyed the IT director (he would never admit it) but it was the only theory anyone was putting forward that held any water. So .. my fix of the fix went in.

It has to be said, that the following morning there was a parting shot by the db, where it produced a (4) error 2 hours after the fix went in. But that was the last anyone has seen of it. That was over a year ago now, so one can only conclude, inconclusively, that I was right, and this was caused primarily by the inability of the connection handshake to make its way through a stream of data.

So my friends, if you are unfortunate enough to find one of these... save yourself a bunch of time, and look at the actual data passing up and down your network when it happens. It probably wont be enough to trigger an alarm, but it could easliy be enough to stop one little tiny relevaant tcp packet being acknowledged by your server, and your client library throwing a hissy fit.... resulting in the MySql swear word : (4) !


Saturday 5 July 2014

How do you log a problem like MySql?

Well, following on from my blog on MongoDB and ELK I thought I would share my exploration for another client, logging MySql. Still using ELK, we are solving 2 issues in the MySql relational world:

1. How not to have to use Splunk and its slightly 'higher than free' licensing mechanism.
2. How to log every single query that goes through your db server, while:
           a) Not affecting current throughput and performance.
           b) Generating the right information to find connection issues, slow queries, and so unclogging your systems before they get clogged.

The backdrop to all this was with a previous client, where as the business increased, so did the throughput on the servers. One day (night actually) something horrible happened, and things ground to a halt (well... nearly). As a result I was given the wonderful task of trawling the slow logs to 'find the reason'. This became a daily task, and it wasn't fun.
Soon we automated the process, getting data into a seperate DB that we could generate graphs from. This was great... we trimmed about 4.5 million queries off the slow logs (anything over 1s) per year. But it still didn't show anything that went through the system under 1s. Enough of these can throw the system out, and you wont see the reason why.
We tried a lot of things, trying to solve the issue, while satisfying 2(a) above... none worked. Whatever we did involved a few points of breakage, or involved io write on the DB servers, and other services while we read the output.
Then, one day , I discovered a little used or known about MySql tool called mysql-proxy, written by Jan Kneschke. This is basically a small lua app that sits between client and server, passing information between the two. There is a lot you can do with this, but what I needed was something that can pick up the information on the way to the server, and particularly, pick up the query/execute times on the way back. Proxy allows you to write your own lua scripts, that use the hooks it provides, to pick up this information. It also requires a lua apr piece of glue to get microtime granularity... so thanks to Pete Rodding for his lua-apr library.
We tried it, it was good. I even got it writing the output to a RabbitMQ server, that then had workers picking that up. We used it in non-production servers to prove the point.

Now, I thought.. I have a new client. I have recently used the ELK stack to log MongoDB. I can take the proxy code I wrote, change the output method slightly (it was already outputting Json documents), make a simple curl call to ElasticSearch, and presto! Putting Kibana up against the results, and we should have 1-2s delayed information on connections/queries/execute times ... all manner of good things.
(Update.. great idea.. lousy implementation. Curl works, but slows the process down considerably. So writing to a log file, allowing logstash to update ES, and logrotate to keep the log file to a minimum every hour, is the way to go for now.)

 We have been in production with these for 8 months now (updated since my original blog post). We 
Master Queries
Web Slaves
 We have slaves split between two groups (web and stats). On the left we have the master details, on the right we have some of teh slave details. The topline gives a histogram of the number of queries in a 10 minute interval. By recording this information the savings made have been mind boggling. We have saved 60m queries per day (yes... 60m). We have been able to identify connections in the code that did nothing (except take up sockets), we have idenitied queries that were being replicated when the data was only alive for a matter of minutes (think sessions).  

Lockout and Server splits
We have another histogram based on the total lockout time of the queries on each slave. High is bad, and needs further investigation.

Query Detail Table
Then we have a histogram of the connections per 10 minutes on the slaves. This is after the code we found that produced about 25k connections per slave per 10 minutes. The 'smoking gun' I was partially hired to discover. Job Done! It was also taken at a slow period on a sunday, so no surprises that not a lot was happening.
Finally we have a table of queries from one of the groups. Actually removed any identifying information for hackers, but in reality we also record Server / Client IPs, Server threads.. and all kinds of interesting information.

If you want to try this yourself, the code required, along with instructions can be found on GitHub here. As with all good things in life, its free.

Without duplicating the instructions on the GitHub repo,  I shall explain the information you get, and the basic requirements from an ELK perspective. However,it is worth mentioning here, that, in this scenario, we installed a copy of mysql-proxy on each DB server, listening on Port 3306. The Servers themselves listen on port 4040. In this way, any application will go through the proxy (and our logging system) with no changes at all... that was cool. For those using TCP replication by statement, that will also go through the proxy.. so you will need to change that if your proxy deals with more than one server, otherwise your data will become slightly messy.
But if it does only go to one server... you will automatically be recording that too... Yippee.


Output from Mysql Server

Server                                 Server ip:port connected
client_connections              Number of Client connections when this call was made
User                                    User name running the query
proxyName                         Proxy ip:port
server_version                     Server Version number
Client                                  Client ip:port making the connection
Thread                                Server thread that ran the query (useful for seeing queues of queries)
QueryLen                           Length of the query
Query                                 The actual query (see below)
QueryType                         The MySql query type
timeSent                             The time query was received from the client (microtime)
queryTime                          The actual execute time of the query
responseTime                     Time taken to process the query
timeReceived                      Time information received back from the server
lockoutTime                       (timeReceived-timeSent-responseTime). So queuetime, network lag etc.
current                                 Number of queries proxy is dealing with at present.

All the above is fed into a Json document and fed into ElasticSearch. Along with a timestamp, and a ttl. You will probably have many thousands of these per second/minute/hour depending on the size of your system. TTL will save storage.
All times are in microtime resolution, which is really useful. Having come across some dynamic sql that is in excess of 100k in size, I decided to limit the length of the query. It defaults to 1k, plus the last 30 bytes, but this is configurable within the lua code. If you are happy storing H U G E queries...
feel free to play.

This data is written to an output file, and Logstash reads that and writes to the ES server.

 Logstash Configuration

The configuration file for Logstash looks like this:
input {
      file {
        codec => json
        discover_interval => 5
        path => ["/var/log/mysql/bitonproxy.out"]
        start_position => "beginning"
    }
 }
filter {
  date{
    match => ["timestamp", ISO8601]
    target => "@timestamp"
    remove_field => ["timestamp"]
    remove_field => ["host"]
    remove_field => ["path"]
    remove_field => ["@version"]
    remove_field => ["message"]
  }
}
output {
  elasticsearch {
    index => "mysql"
    index_type => "query_data"
    host => "ES-Server"
    protocol => "http"
    bind_port => 9200
    manage_template => true
    codec => json
 }
 stdout {codec=> json  }
}

Replace the ES-SERVER with your server definition. Due to the logrotate causing interim havoc with your og file, sometimes the formatting gets messed up. Tis produces a document with a 'message' element, and not the main elements you are looking for. The aobve will remove the majority of these for you. You will end up losing one or two statemets this way, but essentialy who cares? You have 99.995% of all queries going through your system. 

ElasticSearch Mappings

The mappings, included in the Github repo, are here also. These are important, as it enables ES to index the relevant fields as they are written, alowing for faster response times. Searching on unindexed fields doesnt work well. This has to be done prior to you writing to the ES index, but does not require the index to exist. It will create it if it isn't there. If you haven't got ES installed, or Kibana, please go to go to my previous blog for instructions.

curl -XPOST http://[MY_ES_SERVER]:9200/mysql/ -d '{
    "settings" : {
        "number_of_shards" : 8,
        "number_of_replicas" : 1,
        "index.analysis.analyzer.default.type": "standard",
        "index.refresh_interval": "5s"
    },
    "mappings" : {
     "_default_":{
            "_ttl" : {"enabled":true, "default":"1d"},
        "properties" : {
                "@timestamp":{"type":"date"},
                "Server":{"type":"string"},
                "client_connections":{"type":"long"},
                "User":{"type":"string"},
                "proxyName":{"type":"string"},
                "server_version":{"type":"string"},
                "Client":{"type":"string"},
                "Thread":{"type":"string"},
                "QueryLen":{"type":"long"},
                "Query":{"type":"string"},
                "QueryType":{"type":"long"},
                "timeSent":{"type":"double"},
                "queryTime":{"type":"double"},
                "responseTime":{"type":"double"},
                "timeReceived":{"type":"double"},
                "lockoutTime":{"type":"double"},
                "current":{"type":"long"}
            }
         },
        "query_data" : {
            "_ttl" : {"enabled":true, "default":"1d"},
            "properties" : {
                "@timestamp":{"type":"date"},
                "Server":{"type":"string"},
                "client_connections":{"type":"long"},
                "User":{"type":"string"},
                "proxyName":{"type":"string"},
                "server_version":{"type":"string"},
                "Client":{"type":"string"},
                "Thread":{"type":"string"},
                "QueryLen":{"type":"long"},
                "Query":{"type":"string"},
                "QueryType":{"type":"long"},
                "timeSent":{"type":"double"},
                "queryTime":{"type":"double"},
                "responseTime":{"type":"double"},
                "timeReceived":{"type":"double"},
                "lockoutTime":{"type":"double"},
                "current":{"type":"long"}
            }
        }
    }
}'


You will need to replace the server in the curl call with the ip of your ES cluster, and maybe the index name (I have called it... imaginatively... mysql). That is all. 

The other major benefits of this are:

1. You will no longer need to record slow queries at the server side, you now have the ability to monitor queries of any length of time using Kibanas Filter.
2. You can use the search results to query the Sql/connection from any other data, such as table/column names, using a search engine, which is designed for this purpose. So, finding heavy use tables/columns will be made a lot easier.


As an example, within the first 3 weeks of being in Beta with this system, it has been repsonsible for highlighting 3 queries, which, when rationalised, have saved between 40-60m (million... get that?) queries per day, and bringing a 25k (that is thousand) connections per 10minutes down to a maximum of 60 per 10 minutes. Yes .. these were dumb queries for the size of site, but were never going to show up in the slow query logs, or the cacti graphs. The connections did show up in cacti, but there was nothing to say this was out of the ordinary.. until they were tied down to an old query, running on a cron, that was never used, but went into a huge loop.
This meant we have reduced the number of slaves in use by just over a third. For no actual software cost.. and we haven't even started on the actual problem queries.

Hope this has been useful, please feel free to quote, feed back, ask for help...

Later...








Sunday 15 June 2014

An adventure with an ELK .. or how to replace Splunk

ELK - Elasticsearch/Logstash/Kibana

Well heres a thing. You are reading the ramblings of someone who has kept away from the realms of FB, Twitter, and Blogs in General. The dog has been prolific, but I have kept away for no good reason, except I didn't see the point.

Finally I get it. I really get it. I came across a product stack that would really really help, I tried installing , but found the docs missing in the areas that were really really necessary. They just didn't result in what I needed.. a working product.

So, reader, let me explain the scenario (and for those that just need instructions on how to install the ELK stack.. just scroll on down.. I won't be offended.. I am first an foremost a developer who can't RTFM):

I have a client (nameless for the time being, due to business reasons). It is a startup. I have built a largish database setup as it uses data.. lots of it. Strange and magical data. It gets queried in strange ways, it will only get stranger and bigger. They were using MongoDB, but it had been set up by someone who didn't see the beauty of this product. So it is now set up in a fully extensible way, on a number of machines, as 3 shards of 3 replicas each shard. Thats actually on 7 machines, but we can easily expand out to thousands.

The issue? Once up and working we had an issue. MongoDB comes with monitoring software (mms) its cool.. it works... but it misses the salient points of how the databases are split, what the queries are, the time it takes to execute queries... and importantly... how the heck do I visualise this?

I had been using Splunk for a couple of years. Its good... it gives you data, not necessarily live kickass data, but it does what it says on the tin.... and costs a few k dollars/sterling/Euros (Eeek... euros?? what a dumb name that is.. but I digress) per year to run.. per server... not good.
I trawled the web, I looked at various options, even juggling around with some code I wrote for another client in the MySql world... but nothing there in the short term (did I mention it is a startup with the need for speed?).

Queries by Shard (Fig 1)
I stumbled on Kibana... it mentioned logstash... I had briefly used this a year or so ago in the Mysql world.. but it failed. Now, though, its owned by the good folk at Elasticsearch. We are already using ES, the graphs looked great, the data would fit straight in (json logs -> json ES -> json reading Kibana). Whooppeee! Issue solved. I mean who wouldn't want an live output like this:


Query throughput by server (Fig 2)
Well.. not quite. To say I had a few major hiccups with the docs is an understatement, but with help from the amazing  Steve Mayzak of Elasticsearch, I finally got it working for us. As a result I was asked to document the process... so here we are.

Log Enquiry (Fig 3)

Briefly then:
1. Apologies for those that follow the Seattle boys.. I am a Linux Open Source  kindda guy. For those that like their fruit cidery, this will probably work with a  few tweaks, as Steve followed the right type of OS.
2. Our setup, as I mentioned, involves 7 machines running MongoDB, plus 3 running Elasticsearch, spread over 2 data centres. We also have multiple points of entry to both systems, so we have failover in case of a whole data centre being taken out.
3. (2) Does not preclude the single use machine, it just has relevance to those who try to understand some of the manoevres here.

On with the show:
You need to plan a bit first:
1. Which Machine(s) are you going to put ES on? You should have, even with a single node ES unit, a seperate 'sentinel' (no data) access unit. You will need master/slave/ shards at some point.. plan for it now. You do all your connections through the sentinel, it worries about getting the data for you. Make life easy for yourself. (If you are already a Mongo user, its the same as a mongos unit).
2. Which machine are you going to put Kibana (the graph end) on?
3. You will need Java. ES uses Lucene which uses Java. You will need Logstash, which uses Java. Essentially... you need Java. None of the programming/script building you do is Java, just the tools use it.

Elasticsearch (ES):
1. Install java. You may have to replace the version number..
        sh>sudo apt-get install openjdk-7-jre
2. Repeat on all systems using ES, logstash, and probably Kibana for good
    measure.
3. Check http://www.elasticsearch.org/download for the latest version of ES
4. sh>wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-[versionnumber].deb (current 1.2.1)
5. sudo dpkg -i elasticsearch-[versionnumber].deb
6. sudo service elasticsearch start
Thats gets you a node running. If you are building a cluster, you will need to change some bits in teh /etc/elasticsearch.yaml file. If you are not you will also need to change some items. The basic list is here... there is plenty of well documented sections on this elsehere.. so I wont bore you.
The essential elements are here:
            cluster.name: thisCluster
            node.name: thisNode
            node.data: true            (false if the sentinel)
            node.master: true        (always true, so it can be voted a master if the
                                                  current master dies)
            index.number_of_shards: 5 (the default)
            index.number_of_replicas: 2 (or however many masters+slaves you have)
            the paths for data/configs/ and logs
            network.bind_host: thisIP
            network.publish_host: thisIP (the Network address other nodes will use
                                                 to talk to this one).
            http.port: 9200               (default)
            transport.tcp.port:9300 (default)
            discovery.zen.minimum_master_nodes: 1
            discovery.zen.ping.multicast.enabled: false (forces Unicast)
7. Be careful. If installing a cluster, it wont work if one node is a different
    version of ES.
8. If you need an answer as regards replicas the ES way... can I suggest looking here
9. sh> sudo service elasticsearch restart  (with your .yaml changes) 
10. Make sure you have ES set up on each node you want to run it, and also any sentinel machines.

LogStash - The bit that reads the mongo logs and dumps to ES

1. Install Java... you have installed java haven't you?
2. Make sure you have a java tmp folder, that is fully writable/readable
     sh>mkdir /home/[user]/javatmp
3. Set the following environment valriables:
    sh>JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64/"
    sh>source /etc/environment
    sh>_JAVA_OPTIONS=-Djava.io.tmpdir=/home/[user]/javatmp
4. Set them in the profile options (debian ubuntu... others roll as per your
    version)
    sudo nano /etc/profile.d/logstash.sh
       export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
       export _JAVA_OPTIONS=-Djava.io.tmpdir=/home/[user]/javatmp
    sh> sudo chmod 0755 /etc/profile.d/logstash.sh
5. From youor home directory
        sudo wget https://download.elasticsearch.org/logstash/logstash/logstash-1.4.1.tar.gz
6. tar -xvf logstash-1.4.1.tar.gz
7. Edit/Create mongoqry.conf - this bit is important as it tells ES/Lucene how to handle your data, which in turn will help with Kibana using it correctly.
       input {
        file {
          discover_interval => 10
          add_field => { host => "your_host"}
          add_field => { shard => "your_shard"}
          path => ["/var/log/mongodb/your_mongo_log.log"]
          start_position => "beginning"
          tags => ["your_host", "mongo","other_tags"]
          type => "your_host_log_type"          <-- This helps in Kibana to identify these entries
      }
   }
   filter {
     grok { pattern => ["(?m)%{GREEDYDATA} \[conn%{NUMBER:mongoConnection}\] %{WORD:mongoCommand} %{NOTSPACE:mongoDatabase} %{WORD}: \{ %{GREEDYDATA:mongoStatement} \} %{GREEDYDATA} %{NUMBER:mongoElapsedTime:int}ms"] }
    grok { pattern => [" cursorid:%{NUMBER:mongoCursorId}"] }
    grok { pattern => [" ntoreturn:%{NUMBER:mongoNumberToReturn:int}"] }
    grok { pattern => [" ntoskip:%{NUMBER:mongoNumberToSkip:int}"] }
    grok { pattern => [" nscanned:%{NUMBER:mongoNumberScanned:int}"] }
    grok { pattern => [" scanAndOrder:%{NUMBER:mongoScanAndOrder:int}"] }
    grok { pattern => [" idhack:%{NUMBER:mongoIdHack:int}"] }
    grok { pattern => [" nmoved:%{NUMBER:mongoNumberMoved:int}"] }
    grok { pattern => [" nupdated:%{NUMBER:mongoNumberUpdated:int}"] }
    grok { pattern => [" keyUpdates:%{NUMBER:mongoKeyUpdates:int}"] }
    grok { pattern => [" numYields: %{NUMBER:mongoNumYields:int}"] }
    grok { pattern => [" locks\(micros\) r:%{NUMBER:mongoReadLocks:int}"] }
    grok { pattern => [" locks\(micros\) w:%{NUMBER:mongoWriteLocks:int}"] }
    grok { pattern => [" nreturned:%{NUMBER:mongoNumberReturned:int}"] }
    grok { pattern => [" reslen:%{NUMBER:mongoResultLength:int}"] }
  }
  output {
    elasticsearch {
      index => "logstash"
      host => "your_es_sentinel"
      protocol => "http"
      bind_port => 9200
      manage_template => true
   }
   stdout {codec=> json  }
 } 

This will output to ES as well as to screen. At this point I would like to give thanks to the people at techblog.holidaycheck.com for the filter command, which I unashamedly lifted from their git repo. Thanks guys, saved a ton of work.

8. Repeat on all other mongo units, changing the inputs to teh server of your choice, but keep the outputs.

9. Logstash even has teh decency to stay with a rolled file. So if you swap your main log file every hour or so, it will stay with the original filename, acting similarly to 'tail -f ' so yoou miss nothing.. good old logstash.

10 Finally, start her up. You may want to do this inside a 'screen' so you can leave it running:
     sh>sudo bin/logstash -f mongoqry.conf
                            or to Daemonise
     sh>sudo bin/logstash -f mongoqry.conf &

If you get  This: (LoadError) Could not load FFI Provider: (NotImplementedError) FFI not available: null
      The /javatmp is not executable... make sure it is executeable
   Can also be seen if the timestamp filter is out of date... maybe


Right.. still with me? Well done. Finally... yes indeedy.. finally you need Kibana and a webserver, if you don't have one on the machine kibana will be installed on.

Kibana - The Logging output and pretty stuff
1. On your chosen machine for Kibana, install a web server (nginx/apache ..  
    whatever).
2. Ensure your webserer can listen on the port you want to connect to (443 for
    https is advisable).
3. Ensure the servers doc folder is pointing to where kibana will be.
4. Install Kibana
    sh> wget https://download.elasticsearch.org/kibana/kibana/kibana-3.1.0.tar.gz
    sh> tar -xvf kibana-3.1.0.tar.gz
5. In the kibana default directory will be a file called config.js. In this the  
    following will need changing:
      elasticsearch: Make sure the port number is :9200.
                             If your sentinel is on the same machine, you will be Ok,
                                 otherwise you will need to change the destination of ES as
                                 well.
      kibana-index: A name for your kibana index. Your kibana dashboards also
                            stored in ES... which is kindda useful.
6. Navigate to your host...and you should have the Kibana welcome screen.

Done! From here you can build panels, graphs, tables, get your log files in... all kinds of cool stuff... but you should be up and running at this point..... let me know if you aren't.

A couple of notes:
1. Mongodb produces a lot of logs. Putting a ttl (Time to Live) on  the record of 24 hrs, allows ES to automatically drop old records. Make this a sane number for your throughput. In other areas I have either expanded that to 2-3 days, or reduced to 6 hours, and added aggregation.

I am sure there are others.... please feel free to add at will. Hope this has been useful, enjoy the insights it will give you to your data.