Running Node.js in Production
Like every company working with products on the web, one of the languages we use at Pluralsight is JavaScript. While nearly every product engineering team uses JavaScript in some form, we also have around 12 of 33 teams using Node.js. Other popular languages at the company include C# and Python. This post focuses specifically on Node.js and our experience deploying the runtime in production over the past few years. Generally speaking, I would give these recommendations and advice to any team developing production systems.
Like any tool or language, Node can frame the way one thinks about a problem in expanding ways, but before choosing a technology one should start with a problem statement. Node may be a good fit if some of the following apply to your problem domain.
- Has many small events that need to be processed with business logic in subsecond time frames
- Need to iterate on the problem space quickly
- Want to take advantage of open source libraries and APIs
- Will interact with multiple databases and network protocols
- Want to render HTML
- Need a backend for a mobile or single page app
- Using JSON as a serialization format
Node was an answer to running a process per thread behind Apache or NGINX, or sometimes trying to embed logic directly into NGINX. Perhaps classifying Node.js as an application event router is helpful: Node excels at the layer above standard OSI layer 7 load balancers. Unfortunately the word application is overloaded here.
A sample of problem spaces where Node.js may not be a good candidate includes:
- Compilers
- Physics simulators
- Guidance systems
- Databases
- Layer 4 load balancers
Before jumping into the challenges and benefits of running Node in production, let us take a look at some common misunderstandings.
Common Misunderstandings
Node is Single threaded
The answer to this is more complicated than in some other languages and depends on which part of the stack one is talking about: V8, libuv, or the system APIs. The key is to keep CPU time on the event loop to a minimum, a few milliseconds, before scheduling work onto the threadpool. I/O and database time should be the bottlenecks. Code in JavaScript land should act as a coordinator for memory, disk and network. If response times become an issue, look at tasks like compression, encryption, regex and even serialization for offloading to a worker. Here is a section of “Don’t Block the Event Loop” from nodejs.org.
Node uses a small number of threads to handle many clients. In Node there are two types of threads: one Event Loop (aka the main loop, main thread, event thread, etc.), and a pool of k Workers in a Worker Pool (aka the threadpool).
If a thread is taking a long time to execute a callback (Event Loop) or a task (Worker), we call it “blocked”. While a thread is blocked working on behalf of one client, it cannot handle requests from any other clients. This provides two motivations for blocking neither the Event Loop nor the Worker Pool:
Performance: If you regularly perform heavyweight activity on either type of thread, the throughput (requests/second) of your server will suffer. Security: If it is possible that for certain input one of your threads might block, a malicious client could submit this “evil input”, make your threads block, and keep them from working on other clients. This would be a Denial of Service attack.
Full article Don’t Block the Event Loop
Node is JavaScript and JavaScript is slow
Node.js is a JavaScript runtime, and there are many others: for other examples look at the runtimes inside Firefox, Edge and Safari. Along with a JavaScript JIT, each browser has a set of APIs for interacting with the operating system. Chrome and Node share V8.
Much of Node is written in C or C++. V8, libuv, and the system APIs all run in native land. With the new N-API for native modules, expect more code to be written in C++ or even Rust where passing data across the language boundary makes sense. See neon-binding for Rust examples. If your problem requires compute optimization, make sure to profile first.
libuv is the asynchronous event loop; other projects like .NET Core’s Kestrel have used it in the past (Kestrel is now based on managed sockets). V8 is a JavaScript engine built at Google and written in C++. For many systems the bottlenecks will not be on the CPU but somewhere in the overall architecture of networking and databases. Remember to consider developer time and iteration speed as well before swapping to lower level runtimes.
Node is for toy projects
Perhaps some of this perception comes from the fact that servers are incredibly simple to get started with in Node. A few dozen lines of code, deploy to a service like Now, and your project is available to the world, but this is not all Node offers. Node is used to serve frontends and backends for some of the most active sites and applications in the world, like Walmart and Netflix. Through Electron, Node now also backs popular desktop applications such as Slack and VS Code. All of the necessary tooling is in place, including packaging, testing, monitoring, tracing, profiling, and CI/CD, to deploy a reliable, maintainable system.
That being said, Node is not without its challenges. The main sticking points for developers coming from other languages tend to revolve around a few related ideas: ubiquitous use of callbacks, passing functions as arguments, and async I/O. Here are some problems we have seen at Pluralsight and suggestions for avoiding them.
Production Challenges
Async all the things
There are too many ways to write asynchronous code in Node. If I could pick just one area for teams, and for the community at large, to focus on, it would be this. Streams, callbacks, promises, async libraries like co.wrap with generators, and RxJS are just a few of the ways to deal with asynchronous code, and to make matters worse many of the concepts overlap. Thinking about this combined with error handling can be overwhelming. If the questions below each example cannot be answered in a few minutes, consider some workshops on these subjects. The time will more than pay for itself in development speed and a reduction in bugs. Here are the minimum patterns I would verify understanding and testing strategies against.
Setting up listeners to events is often the “main” of a Node program.
const http = require('http')

const server = http.createServer((req, res) => {
  res.setHeader('Content-Type', 'text/html')
  res.setHeader('X-Foo', 'bar')
  res.writeHead(200, { 'Content-Type': 'text/plain' })
  res.end('ok')
})

server.listen(8080)
- What type of thing is req?
- What type of thing is res?
- What happens on the network when res.writeHead and res.end are called?
- How would one test the function passed as the first argument?
- How would this function handle errors?
Wrapping nested callbacks in a promise is incredibly common and can smooth out interacting with all kinds of complicated control flows. Do not rely solely on tooling like util.promisify(), as the underlying async code may not have the (err, data) => {} interface.
function wrap_with_promise(interval, verify, timeout = 1000) {
  return new Promise((resolve, reject) => {
    setInterval(() => {
      verify((err, data) => {
        if (err) reject(err)
        if (data === "what_we_want") {
          resolve(data)
        }
      })
    }, interval)
    setTimeout(() => {
      reject("timeout")
    }, timeout)
  })
}

async function foo() {
  const verify = (cb) => {
    // cb(null, "what_we_want")
    cb(null, "not_what_we_want")
  }
  const result = await wrap_with_promise(10, verify)
  console.log(result)
}

foo()
- What is wrap_with_promise() for?
- Are there any bugs?
- What changes if the line cb(null, "what_we_want") is uncommented?
- This program can be written more clearly with await and a while loop, can you see how?
- How would you test this function?
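One possible shape of the await-and-while rewrite hinted at above. This is a sketch, not the only answer; the sleep helper and the poll_with_promise name are mine, not part of the original example:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// poll verify() until it reports the value we want or the deadline passes
async function poll_with_promise(interval, verify, timeout = 1000) {
  const deadline = Date.now() + timeout
  while (Date.now() < deadline) {
    // adapt the (err, data) callback to a promise for this one call
    const data = await new Promise((resolve, reject) =>
      verify((err, value) => (err ? reject(err) : resolve(value)))
    )
    if (data === 'what_we_want') return data
    await sleep(interval)
  }
  throw new Error('timeout')
}
```

Unlike the setInterval version, nothing keeps firing after the promise settles, and the timeout can no longer race a success.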
As a final example, consider forking and joining async code. Take a look at the same problem in other languages and compare; the solutions are often much more complicated.
function fork_and_join(data) {
  const promise1 = async_thing_1(data[0])
  const promise2 = async_thing_2(data[1])
  return Promise.all([promise1, promise2])
}
- Why do we not use await before async_thing_1(data[0])?
- What happens if promise1 has significantly higher latency than promise2?
- Is there a way to timeout?
- How would one do this without promises? (this is a fun exercise and will improve understanding of the event loop)
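One common answer to the timeout question is Promise.race. A sketch, with with_timeout being a hypothetical helper name of mine:

```javascript
// reject if the wrapped promise does not settle within ms milliseconds
function with_timeout(promise, ms) {
  const timer = new Promise((resolve, reject) =>
    setTimeout(() => reject(new Error('timeout')), ms)
  )
  return Promise.race([promise, timer])
}

// usage: fail the whole fork/join if either branch is too slow
// with_timeout(fork_and_join(data), 500)
```

Note the losing timer still fires; clear it if stray callbacks matter in your setup.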
Streams are another area where developers get tripped up, but they are not as common to interact with directly. If large payload sizes or high throughput rates are needed for your project, spend a couple of days on this concept, building tests at the appropriate loads. It will be interesting to see how async iteration influences this space, and some experiments in Node v10 are looking good so far.
async function* readLines(path) {
  let file = await fileOpen(path)
  try {
    while (!file.EOF) {
      yield await file.readLine()
    }
  } finally {
    await file.close()
  }
}
For more information on this concept see the TC39 Async Iteration proposal.
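Since fileOpen above is pseudocode, here is a self-contained sketch of the same pattern: an async generator driven by for await...of (unflagged in Node v10). The counter source is a stand-in I invented for an async read:

```javascript
// a stand-in async source: yields numbers as if read asynchronously
async function* counter(limit) {
  for (let i = 1; i <= limit; i++) {
    yield await Promise.resolve(i) // stand-in for an async read
  }
}

// for await...of drives the generator and awaits each yielded value
async function sum(limit) {
  let total = 0
  for await (const n of counter(limit)) {
    total += n
  }
  return total
}

sum(4).then(console.log) // prints 10
```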
After a team has taken the time to understand the various asynchronous strategies, decide on a few standard patterns. Consider anything that does not follow these patterns tech debt, and advocate for cycles to clean up that out-of-date code.
DNS resolution
If you are seeing occasional spikes in latency and the process calls external resources without direct IP addresses, DNS may be the culprit. In the past Node has not cached DNS, and it may even account for the majority of your request latency. Be cautious with DNS caching on dynamically scaling services, and if you are using helper libraries, verify that they respect the TTL.
For more information on this topic see
Error Handling
Async code and error handling in many languages is riddled with traps, and JavaScript is no exception. Be on the lookout for code like the following. Anything that throws in do_other_async() will be lost, or even worse crash your process. The telltale sign is any async function that is not awaited or whose return is not captured in a variable.
async function event_handler(event) {
  const result_1 = await do_some_async(event)
  do_other_async(event) // danger danger!!!
  return result_1
}
My general advice is to avoid throwing errors in JavaScript completely and instead treat errors like values. In other words, catch any errors and return them on the same code path as a valid result. This was an advantage of the callback style with the error as the first argument, sometimes called errbacks or error-first callbacks. The community movement away from errors as values is a regression in my opinion. Errors are data, why throw them away?
function some_callback(err, data) {
  // the error is a value and we can use regular control flow
  // instead of nesting more try catch statements
  if (err) {
    // deal with errors
  }
  // use data
}
Here is an example of how to integrate this idea into code using async await. If step1 and step2 also follow the pattern, try catch can be avoided and will only be needed on the edges, wherever the code interacts with external systems.
async function errors_as_values_function(data) {
  // run the first step and return early if error
  const [error1, result1] = await step1(data)
  if (error1) return [error1, null]
  // run the second step and return early if error
  const [error2, result2] = await step2(result1)
  if (error2) return [error2, null]
  // possibly transform the return payload otherwise return
  return [null, result2]
}

// using the above function will follow the same pattern
// here in the context of an api server it could look something like
const server = http.createServer(async (req, res) => {
  const [error, result] = await errors_as_values_function(req)
  if (error) {
    res.writeHead(500, { 'Content-Type': 'application/json' })
    res.end(JSON.stringify({ error: error.toString() }))
  } else {
    res.writeHead(200, { 'Content-Type': 'application/json' })
    res.end(JSON.stringify(result))
  }
})
Each async step returns a two-element list with the error as the first element.
Related to this problem is a lack of top level error handling. Make sure to have something like the following at the entrypoint of any server code, unless you want the process to crash on uncaught errors and promise rejections. Remember to collect error counts somewhere with monitoring and alarms!
process.on('unhandledRejection', (reason, p) => {
  log_or_metric('unhandledRejection', reason)
})

process.on('uncaughtException', err => {
  log_or_metric('uncaughtException', err)
})
Memory leaks
While not a common problem at Pluralsight, memory leaks have happened. At our request rates they tend to max out the heap over tens of hours. While not an excuse to leave a leak in place, one mitigation is to restart, drain or kill processes at regular intervals. To find the source of the problem, heap profiling tools may help.
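A sketch of the restart-at-intervals mitigation: watch heap usage and start draining before the process hits the ceiling. The limit value and the heap_over_limit name are illustrative assumptions, not a recommendation:

```javascript
// true when the V8 heap has grown past the given byte limit
function heap_over_limit(limitBytes) {
  return process.memoryUsage().heapUsed > limitBytes
}

// poll every 30s; .unref() keeps the timer from holding the process open
setInterval(() => {
  if (heap_over_limit(1.2 * 1024 ** 3)) {
    console.error('heap over limit, draining for restart')
    // stop accepting new work here, then let the process manager replace us
    process.exitCode = 1
  }
}, 30000).unref()
```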
Socket leaks
While it is trivial to create many connections in Node, what is not as clear is cleaning them up. When working with Node it is important to monitor open sockets to shared resources like databases, message queues and system APIs; consider using connection pooling or sharing a connection across events. A handy tool for these kinds of problems is lsof.
Runtime type validation
There are a few ways to work with types in JavaScript; TypeScript and Flow are what come to most developers’ minds these days. Just as important, however, is runtime validation. At a minimum, programs should validate types when working across service and message boundaries, and even better, verify across any I/O. An example from the excellent joi library:
const schema = Joi.object().keys({
  username: Joi.string().alphanum().min(3).max(30).required(),
  password: Joi.string().regex(/^[a-zA-Z0-9]{3,30}$/),
  access_token: [Joi.string(), Joi.number()],
  birthyear: Joi.number().integer().min(1900).max(2013),
  email: Joi.string().email({ minDomainAtoms: 2 })
}).with('username', 'birthyear').without('password', 'access_token');

// Return result.
const result = Joi.validate({ username: 'abc', birthyear: 1994 }, schema);
// result.error === null -> valid

// You can also pass a callback which will be called synchronously with the validation result.
Joi.validate({ username: 'abc', birthyear: 1994 }, schema, function (err, value) { });  // err === null -> valid
Out of memory
Out of memory errors tend to catch developers off guard in Node for a few reasons. The main one to keep in mind is that V8 uses a default heap size of around 1.5 GB. This problem often shows up when running database bootstrap processes or when transmitting artifacts like images or compressed files. If the process is serving a busy API and I/O is not yet saturated, take a look at streams to keep memory use low. If turning the heap size up is an option, see --v8-options and --max_old_space_size for more information.
These problems are often missed in development or staging environments because the test cases are not large enough. Validate any changes to the heap size against your latency distribution requirements, as it can have garbage collection impacts. Some improvements have been made recently to garbage collection, and concurrent marking is enabled by default in Chrome 64 and Node.js v10. To learn more about the new GC see concurrent-marking.
Security
The recent [email protected] event was a good reminder that supply chain injection will always be a vector to pay attention to. Most of the package managers are vulnerable to the same type of attack, however NPM makes an extra enticing target because Node encourages many small dependencies and the community is gigantic. The jury is still out on the best approach to fixing this problem, but a good start would be for NPM to require 2FA for all writes to a module.
To help defend against code exploits, look into npm audit as part of your build process.
Project structure
At Pluralsight we use the concept of Bounded Contexts, and we encourage each context to use a single git repository. There are many tools to help manage large Node projects, however consider carefully whether the added complexity is needed given the accessibility of NPM. Here is a project layout that has scaled without the need for additional tooling.
Context Name (root)
- libraries
  - lib_1
    - src
      - index.js
      - index.test.js
    - package.json
  - lib_2
    - src
      - index.js
      - index.test.js
    - package.json
- services
  - service_1
    - src
      - index.js
      - index.test.js
    - package.json
  - service_2
    - src
      - index.js
      - index.test.js
    - package.json
- Readme.md
Each folder in services is an independently deployable project tied to a CI pipeline. Any code that needs to be shared between services becomes a library, e.g. models, logging, and API wrappers. Libraries are deployed to NPM and then added as a dependency of a service. Breaking changes in libraries can be incrementally rolled out to services. Tests live in the same folder as the code.
These are a few of the challenging parts of working with Node, but they can be overcome with a bit of practice, and the tooling is getting better every day. Now on to some of the positive points of working with Node.
Benefits
Async by default
Node has been async from the beginning, and the libraries and drivers have been written to take advantage of the event loop. Other languages and frameworks that have had async support added after the fact tend to run into problems with various dependencies using synchronous calls.
LTS
Node is a few releases into the LTS line now, which has been wonderful for working in an enterprise environment. At Pluralsight we ask that each service be on an updated LTS release. This, combined with the V8 commitment to Node and TC39 integrating new JavaScript proposals, gives Node a solid path for future evolution.
Compile to javascript
Pluralsight uses Babel and TypeScript in a few configurations on many of our projects. Some of the initial configuration can be frustrating, but the benefits of writing against a common syntax and transpiling to the various browser and runtime targets save time in the long run. Kudos to Microsoft on the great work around TypeScript; the ability to add a lightweight type system is a boon to larger projects. This also contributes to the evolution path, as new features can get feedback quickly from the community. A point of caution here is to use the type systems without removing the dynamic nature of JavaScript. If your code and project structure start to look like C# or Java, with generics syntax on many functions, maybe you are using the wrong runtime.
NPM
The number of projects on NPM is staggering, maybe larger than the next handful of package managers combined. While quantity is not always the best indicator, I have yet to find a problem space where even a low quality module does not exist to help me get started, and one of the beauties of OSS is contributing back to make improvements. The improvements to NPM itself are also a positive for Node. Getting a module started, tested and published is straightforward and makes collaboration simple. Additions to the CLI like cache, ci, prune, shrinkwrap and audit are polishing some rough edges in production deployments. In many ways NPM is a leader, and there is some great competition and collaboration among other JavaScript package managers and even other runtimes.
Building & Artifact deployment
Developing locally and setting up CI/CD pipelines is well known territory now. Much of the industry tooling includes pipelines and documentation for running Node on cloud instances, containers, and serverless platforms.
Serverless
While the term “serverless” may not be the best choice, there is a growing trend toward deploying code to automatically scheduled orchestration layers. Node’s fast start time and event based architecture make it a natural fit for these environments. We are probably not too far out from global edge deployments becoming standard practice. Here are some generally applicable tips for operating on the various cloud providers:
- Keep the artifact size small
- Pay attention to cold start times and measure the latency distribution
- Keep in mind that requests can be executed multiple times
- Look out for shared account rate limits
- Take advantage of their ephemeral nature and scaling to zero
- Develop with the TTLs in mind, typically between 60 and 300 seconds
See Cloudflare Workers and Fly.io for what the future of edge functions could look like.
Testing
At this point, tooling around testing in the Node community is strong. Mocha and Jest are both great to work with, and many projects now come with tests built in. The npm install -> npm test workflow often works without needing to think about platform or compile targets.
Conclusion
While Node has had a rocky governance history, its reach and productivity are hard to deny. The LTS process has increased confidence, and virtually every major technology company is contributing to the project. Overcoming the shift to asynchronous thinking can take some effort but is worth the time; embrace callbacks and the dynamic types to see the benefits. The need for a lightweight event based runtime is growing with the move to ephemeral compute scheduling, and Node is a solid choice for that problem space.