Over Christmas I had the good fortune to work on the continuous delivery goal at work. Much of that work involves deploying NodeJS-based APIs to wrap existing functionality and provide much-needed flexibility on the server-side.

This was to be a foray for the team from purely development work into having an operational capability. It meant examining what was required to ensure the NodeJS webservers and the infrastructure were robust and ready for high loads.

NodeJS in production - Fault tolerance notes

To give context, much of the existing company infrastructure is self-hosted in a datacentre. This means the jump to AWS is occurring simultaneously with the pivot to continuous delivery. The net result will hopefully be decoupling from current dependencies and the start of a foundation for an SOA/microservice architectural pattern.

The uncaught-exception issue

Coming from a background like PHP, a developer needs to think about fault-tolerance in a different manner when programming with Node. Because the Node process is itself the webserver, errors can exist without being spotted immediately, and they can crash your server.

Consider this trivial example:

  var http = require('http');

  try {
    //Make a nice webserver
    http.createServer(function (req, res) {
      res.end("Hello world");
    }).listen(3000);

    //Create some random error, asynchronously
    setTimeout(function () {
      throw "AAAHH!!!";
    }, 3000);
  }
  catch (e) {
    //This will never run: by the time the error is thrown,
    //the try block has long since exited
  }

What’s important to realise here is that asynchronous errors are not caught by try-catch blocks; they require entirely different error-handling methods. The upshot is that an error can bubble up, go unhandled, and fatally crash the webserver. Nor is this entirely under the control of the programmer, since such exceptions may be caused by malfunctioning middleware.
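
To sketch what those different methods look like: asynchronous Node APIs report failures through error-first callbacks (and, for streams and servers, 'error' events), so the error must be handled where the result arrives, not in a surrounding try-catch:

  var fs = require('fs');

  //Error-first callback: the failure arrives as an argument, not an exception,
  //so a try-catch wrapped around this call would never see it
  fs.readFile('/no/such/file', function (err, data) {
    if (err) {
      console.error('Read failed:', err);
      return;
    }
    console.log(data.toString());
  });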

The obvious answer to this is to handle the uncaught-exception event and simply carry on as if nothing ever happened. The danger is that this leaves the application in an unknown and potentially erratic state. Because the webserver keeps running between requests (unlike, say, a LAMP stack), it maintains state; the likely result would be ongoing unpredictable behaviour and possibly a far more difficult-to-trace error down the track.
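
The safer compromise, as a minimal sketch, is to use the event only to log the failure and exit cleanly, leaving the restart to a supervisor:

  process.on('uncaughtException', function (err) {
    //Log the error, then exit rather than carry on in an unknown state;
    //a supervisor (or the cluster master below) restarts the process
    console.error('Uncaught exception:', err);
    process.exit(1);
  });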

Standard Unix supervision isn’t sufficient under load

For us, simple Unix supervision was not sufficiently responsive to prevent some noticeably dropped connections under load during our stress-tests; such mechanisms (we were using Amazon’s default Elastic Beanstalk supervisory daemon) use polling and don’t react immediately, leaving poorly-timed requests to hang or be denied by upstream load-balancers.

Solution: Clustering

The NodeJS cluster API allows a process to fork copies of itself and self-monitor in an event-driven manner. This has a significant benefit over a traditional process-polling approach: it responds far faster. Rather than rolling our own we relied on recluster, which has the added benefit of exponential backoff when restarting child processes, a sanity-measure to prevent infinite respawn-loops.

This is something of an aside for the cluster module, however; its design intention was not limited to supervision. Because it forks the application, it is possible to fork once per CPU core and make use of the machine’s full processing power. Given NodeJS’s single-threaded, event-loop-based nature, this affords a quick and easy way to increase the webserver’s capacity several-fold. It isn’t a replacement for a thread system or a system of complete concurrency management, but it isn’t intended to be either.
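
As a minimal sketch of both ideas, using the core cluster module that recluster wraps (recluster adds the backoff logic on top of the same 'exit' event):

  var cluster = require('cluster');
  var http = require('http');
  var os = require('os');

  if (cluster.isMaster) {
    //Fork one worker per CPU core
    os.cpus().forEach(function () {
      cluster.fork();
    });

    //Event-driven supervision: respawn as soon as a worker dies.
    //recluster adds exponential backoff here to avoid infinite respawn-loops.
    cluster.on('exit', function (worker) {
      console.log('Worker ' + worker.process.pid + ' died; forking a replacement');
      cluster.fork();
    });
  } else {
    //Each worker runs its own copy of the webserver on a shared port
    http.createServer(function (req, res) {
      res.end('Hello from worker ' + process.pid);
    }).listen(3000);
  }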

Security

Within the node ecosystem, security is very much left up to the user. Frameworks are small and unopinionated, and the user is free to achieve the end-result however they wish. For the uninitiated, however, this means running with sharp objects: it’s very easy to ship functioning, performant and very unsafe code.

Unlike Rails from the Ruby community, Symfony, CodeIgniter or Laravel from PHP, or Microsoft’s IIS, there are few built-in protections with Node’s Express or Koa. XSS-protection libraries, ORMs and input-sanitisation libraries need to be added in, while SSL and transport-layer security concerns are probably better outsourced to the provider where possible, because of their non-trivial implementation and maintenance requirements.

Helmet: A simple piece of Express middleware which bundles half a dozen smaller specialised security libraries into one place and applies them in an easy-to-use manner. Notably it bundles CSP support and a number of other sanitisation and XSS protections. Using CSP means, among other things, banning inline JavaScript from pages and preventing unsafe methods such as eval() and passing strings to setTimeout(). This is already best-practice and, if acceptable, means an entire swathe of XSS vulnerabilities go away.
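
Wiring it in is only a few lines; a minimal sketch (the exact CSP option names vary between Helmet versions, so treat the directives below as an assumption to check against the current docs):

  var express = require('express');
  var helmet = require('helmet');

  var app = express();

  //Apply helmet's default bundle of security headers
  app.use(helmet());

  //A restrictive CSP: same-origin only, no inline scripts, no eval
  app.use(helmet.contentSecurityPolicy({
    directives: {
      defaultSrc: ["'self'"],
      scriptSrc: ["'self'"]
    }
  }));

  app.get('/', function (req, res) {
    res.send('Hello world');
  });

  app.listen(3000);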

SSL: An excellent example of something a user is free to do, but may not want to. Handling SSL in a truly secure manner is a non-trivial affair: you need to know what kinds of encryption ought to be supported, and to react to vulnerabilities in the various library implementations when they occur (such as the infamous POODLE attack fairly recently, and Heartbleed before that).

Given this, it was our preference to hand the responsibility over to our service provider; AWS is quite capable of terminating SSL on the ELB, providing a managed form of SSL on demand. Static front-end assets, too, can be placed on S3 and secured to prevent MITM attacks.

Authorization and Keys

The generally accepted approach to authorizing access is, first and most preferably, to ensure build-agents and resources are themselves implicitly authorized, rather than having to rely upon keys.

However, this isn’t always possible, so store passkeys and sensitive information in environment variables set as part of the build configuration, rather than hardcoded within any files. Repositories, particularly public ones, are vulnerable to simple scanners looking for password signatures en masse. When set by the macro-environment, environment variables are trivially accessible to Node via the process.env hash.
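
A minimal sketch, where DB_PASSWORD is a hypothetical variable name set in the build configuration:

  //Read the secret from the environment rather than hardcoding it;
  //fail fast if the build configuration forgot to set it
  var dbPassword = process.env.DB_PASSWORD;

  if (!dbPassword) {
    throw new Error('DB_PASSWORD is not set');
  }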