Jesse's Software Engineering Blog
CPU Bound Workloads with NodeJS: Processor Parallelism
As discussed in the Understanding the NodeJS Event Loop post, NodeJS is a great platform for high I/O workloads. For any program or service that is expected to be I/O bound, NodeJS is arguably one of the best tools for the job due to its asynchronous event model. The single threaded, event loop architecture allows for I/O parallelism via the OS as well as high throughput on I/O requests, making for a very robust, scalable architecture. But what about CPU workloads? Does the NodeJS single threaded architecture prevent it from being a candidate for CPU bound tasks?
Parallelism
Because NodeJS is single threaded, it is not possible to run more than one CPU bound task at a time in a single Node process. However, on machines with multiple cores, multiple Node processes can be run, one process per core, allowing CPU workloads to run in parallel across different cores in different processes, producing parallelized execution similar to that of a multithreaded language.
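As a minimal sketch of that idea, Node's built-in os module reports the core count, and one worker process can be forked per core. Here ./worker.js is a hypothetical script holding the CPU bound task:

// spawn.js -- start one Node process per CPU core.
var os = require('os');
var fork = require('child_process').fork;

var numCores = os.cpus().length;
for (var i = 0; i < numCores; i++) {
  // Each fork() is a full Node process with its own memory and event loop.
  fork('./worker.js');
}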
In a multithreaded environment, a program is limited by the number of cores as well. Running a multithreaded application on a single core machine gives no benefit (for blocking or CPU bound tasks), since the threads are not running in parallel on a single core; they are interleaving, taking turns executing via OS time slicing. That is, whether a thread is running blocking I/O or a CPU bound task, only one thread will be executing at a time, per core.
Furthermore, each thread consumes memory for its own stack, and the OS spends time and resources creating, scheduling, and context switching between threads. The single threaded, non blocking I/O NodeJS architecture will typically outperform a multithreaded application running async I/O in threads due to the overhead of thread management. Adding an asynchronous call to a Node process simply allocates a bit of heap space for the callback, as opposed to a process trying to manage a collection of threads running async operations. Also, inactive threads still consume memory and scheduler time while they wait to be executed (when blocked by another thread), overhead which Node does not incur while waiting for asynchronous I/O operations to complete.
An important distinction between threads and processes is that threads run in the same context as their parent process and have access to the same memory space, while processes have their own execution contexts and memory spaces. Processes in Node communicate via Inter Process Communication (IPC), while threads can access the same memory (variables, data structures, etc.) as the parent process, which is a much faster way to communicate than IPC. However, since multiple threads can access the same memory, multithreaded languages must offer locking mechanisms so threads do not manipulate shared data concurrently. This raises code complexity significantly and can lead to deadlocks and heisenbugs, which can be extremely difficult and time consuming to debug and fix.
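As a small illustration of Node's IPC model (the file names here are my own), messages are serialized and copied between processes rather than shared:

// parent.js
var fork = require('child_process').fork;
var child = fork('./child.js');

child.on('message', function (msg) {
  console.log('parent received:', msg); // => { squared: 25 }
  child.disconnect(); // close the IPC channel so both processes can exit
});
child.send({ value: 5 });

// child.js
process.on('message', function (msg) {
  // msg is a copy of the parent's object, not a shared reference
  process.send({ squared: msg.value * msg.value });
});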
From a software engineering perspective, due to the high cost and overhead of thread management on the OS, it is more efficient to have fewer threads performing larger tasks than many small, granular threads, which restricts software design decisions. It is also more performant to have threads live throughout the life cycle of the application, as opposed to continually terminating and restarting them, due to the CPU load the OS incurs continuously creating and managing new threads, which further limits how granular a worker thread can be.
A multi processed, asynchronous, non-blocking I/O architecture offers many benefits over a multithreaded architecture. The code is cleaner and easier to maintain and debug, since no data locking mechanisms are required and race conditions and deadlocks over shared memory are designed out. I/O tasks can be run in parallel outside the context of the processes. Both multiprocessing and multithreading are bound by the number of CPU cores, but multiprocessing adds no thread management overhead to the OS. Outside of higher IPC overhead, slower process startup time vs. that of a thread, and slower cross process communication, there are few negative performance consequences to multiprocess parallelism.
Various programming languages handle parallel execution differently. PHP is a single threaded language, offering no asynchronous I/O and no built in modules to help with multiprocessing. Python offers both a threading and a multiprocessing module, but like NodeJS, only I/O tasks run in parallel under threads; due to the global interpreter lock (GIL), CPU work must be parallelized into a new process / core. Ruby's standard interpreter also has a global interpreter lock, preventing multiple threads from running in parallel even with multiple cores, but there are implementations such as JRuby that make Ruby truly multithreaded by executing Ruby code on the JVM. C# and Java offer both true multithreading and asynchronous non-blocking I/O programming paradigms.
Compilation
V8 is an open source JavaScript Just In Time (JIT) compiler written in C++, created at Google, and the engine behind JavaScript in both NodeJS and Google Chrome. V8 compiles JavaScript into native machine code (no interpretation, no bytecode) on the fly, as the code is used. There is no pre-compilation step, and code that is not executed is not compiled. V8 focuses on compiling code quickly rather than spending time maximizing the efficiency of the compiled code; therefore, the native code is not highly optimized. While pre-compiling code prior to execution, similar to the Java Virtual Machine (JVM), allows for a more thorough optimization pass, V8 compilation to native code does not incur significant overhead, and pre-compilation is not an option when running JavaScript code in a web browser, hence not offered by V8.
While V8 is performant, NodeJS code still does not compare to the efficiency of precompiled code. I have tried to research exactly what makes the JVM's compilers more efficient than V8, but it is difficult to find detailed, understandable answers. The JVM's JIT and ahead-of-time compilers like gcc are not as time sensitive in their compilation processes, allowing the compiler to spend far more time optimizing the generated code. This lack of a pre-compilation step with an emphasis on efficient code, along with the lack of static typing, which gives compilers type information to optimize against, likely contributes to Node's poor performance vs. compiled languages like Java and C++ in CPU bound tasks.
Clustering
NodeJS offers both a Cluster module and a Child Process module. The Cluster module is a wrapper around the Child Process module and is useful when a service needs to share a port or a socket (e.g. a web server), whereas the Child Process module is useful for spawning workers across cores. Both modules can be used for CPU bound workloads.
To demonstrate a CPU bound workload being parallelized across multiple processes without blocking the main Node thread, let's take a look at a naive solution to the closest pair of points problem.
Given an array of n points on a graph, find the closest pair of points in the array.
The brute force approach to this problem is to run a nested for loop comparing each point to every other point, with a running time of O(n^2).
// brute(points, 0, points.length);
function brute(points, start, end) {
  var min = -1;
  // Compare every point in [start, end) against every other point.
  for (var i = start; i < end; i++) {
    for (var j = 0; j < points.length; j++) {
      if (i === j) { continue; }
      var x = points[i].x - points[j].x;
      var y = points[i].y - points[j].y;
      var dist = Math.sqrt(Math.pow(x, 2) + Math.pow(y, 2));
      // Track the smallest distance seen so far (-1 means "none yet").
      if (min === -1 || dist < min) {
        min = dist;
      }
    }
  }
  return min;
}
Running this with varying amounts of data:
# 1,000 data points
TIME 0.081088227 s

# 10,000 data points
TIME 0.588222528 s

# 100,000 data points
TIME 50.941021642 s
The brute force algorithm only handles ~100,000 data points in under a minute (on my laptop). But what if we break the problem out into different processes, with each process taking a subset of the array and comparing it to the other data points:
# 1,000 data points, 2 processes
TIME 0.079384816 s

# 1,000 data points, 7 processes
TIME 0.128712077 s

# 10,000 data points, 2 processes
TIME 0.32888444 s

# 10,000 data points, 7 processes
TIME 0.25184114 s

# 100,000 data points, 2 processes
TIME 25.241465283 s

# 100,000 data points, 7 processes
TIME 15.012750336 s
A couple of interesting observations. The obvious one: splitting the algorithm up and running it in parallel across multiple processes significantly sped up its execution. Interestingly, 7x the number of processes does not equate to 1/7 of the execution time, and the difference between two processes and seven processes on a small data set is minimal (on the 1,000 point set, the process overhead actually made seven workers slower than two). Originally, when I was designing this test, I tried running it on a binary search, but since that algorithm is so efficient, I wasn't able to get demonstrable performance gains. Prior to using a parallelized approach, it is important to ensure that algorithms are as efficient as possible. However, as algorithm complexity increases and data sizes start to swell, leveraging multiple cores can provide significant performance gains.
Code
I put together some examples of how to manage multi processed parallel execution with NodeJS.
The first example uses the cluster module directly, in-line in a single script. By default, forking a process with cluster will execute the same script from the beginning, just in another process; hence, it is required to check whether the current process is the master (parent) process to avoid continually forking new processes. The fork() call on the cluster object is what spawns the new process, and the send() call passes data into the worker process via IPC. In this case we need to pass in the offsets of where each worker should start and stop its search on the data set.
When a worker is spawned, the process flow will hit the else portion of the script, which simply listens for the message event, executes the algorithm, and returns data back to the parent process. Meanwhile, the parent process is listening to the worker process via the message and exit event listeners on the cluster object. Node's event driven architecture makes the IPC data sharing between processes simple and intuitive.
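Pieced together, a condensed sketch of that single file approach might look like this. NUM_WORKERS is arbitrary here, and brute() plus a deterministically generated points array are assumed to be defined earlier in the same script, since each forked worker re-runs it:

// closest.js -- a condensed sketch of the single file cluster approach.
var cluster = require('cluster');
var NUM_WORKERS = 4; // assumption: tune to the machine's core count

if (cluster.isMaster) {
  var results = [];
  var chunk = Math.ceil(points.length / NUM_WORKERS);
  for (var i = 0; i < NUM_WORKERS; i++) {
    var worker = cluster.fork(); // re-executes this script in a new process
    // hand this worker its slice of the search space over IPC
    worker.send({ start: i * chunk, end: Math.min((i + 1) * chunk, points.length) });
    worker.on('message', function (msg) {
      results.push(msg.min);
      if (results.length === NUM_WORKERS) {
        console.log('closest pair distance:', Math.min.apply(null, results));
        cluster.disconnect(); // allow all workers to exit
      }
    });
  }
} else {
  // Worker: wait for its offsets, run the CPU bound search, report back.
  process.on('message', function (msg) {
    process.send({ min: brute(points, msg.start, msg.end) });
  });
}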
This code demonstrates a NON BLOCKING, PARALLELIZED, CPU BOUND WORKLOAD in NodeJS! Not only is NodeJS running the CPU bound tasks in parallel via multiple processes, but the main NodeJS thread is not blocked and could continue to take in new requests while the CPU work is being performed. Once the CPU workload is complete, the main thread can pick up where it left off and continue execution. This is a very powerful design pattern in Node, and one of the most useful features for highly concurrent Node backend systems.
While the first code example wasn't overly complex, having everything in a single file would become very difficult to manage in a large scale production code base. The next example demonstrates using the child process module directly. The benefit is that there is no longer a need for the isMaster check required by the cluster module, as the workers can be forked from a different file, allowing for encapsulation of the worker logic. Without the cluster object, the event listeners need to be handled per worker process. An easy way to manage the workers' event listeners is to wrap each worker in a Promise and use Promise.all on the batch of Promises so that all the workers execute in parallel. Once the parallelized CPU work is complete, the parent process's thread can continue execution with the result set. Again, this approach does not block the event loop, and the parent Node thread can handle I/O requests while the CPU tasks are running in parallel.
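A sketch of what that might look like, assuming brute() has been pulled out into its own module (./brute.js); the file names and chunking scheme are illustrative:

// main.js -- fork workers from a separate file, wrap each in a Promise,
// and wait on the whole batch with Promise.all.
var fork = require('child_process').fork;
var NUM_WORKERS = 4;

function runWorker(points, start, end) {
  return new Promise(function (resolve, reject) {
    var worker = fork('./worker.js'); // separate file: no isMaster check needed
    worker.on('message', function (msg) { resolve(msg.min); });
    worker.on('error', reject);
    // Note: the points array is copied over IPC, which is slow for large
    // data sets; a real system might have workers load the data themselves.
    worker.send({ points: points, start: start, end: end });
  });
}

var chunk = Math.ceil(points.length / NUM_WORKERS);
var jobs = [];
for (var i = 0; i < NUM_WORKERS; i++) {
  jobs.push(runWorker(points, i * chunk, Math.min((i + 1) * chunk, points.length)));
}

Promise.all(jobs).then(function (mins) {
  // The event loop stayed free for I/O while the workers crunched numbers.
  console.log('closest pair distance:', Math.min.apply(null, mins));
});

// worker.js -- encapsulated worker logic
var brute = require('./brute'); // hypothetical module exporting brute()
process.on('message', function (msg) {
  process.send({ min: brute(msg.points, msg.start, msg.end) });
  process.exit(0);
});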
Backend NodeJS scripting can be difficult to test, so a further step of encapsulation can be taken by breaking out process control and the algorithm into testable, modular pieces of code and implementing a transpiler, like Babel or TypeScript, for a more robust code base.
A final architectural example would encompass a bit more control logic, breaking an algorithm into smaller pieces and ensuring only a single process is running on each core, while starting new processes, or passing in new data, as iterations of the algorithm complete (a process pool). The previous examples wait for all workers to finish before moving on, but the ability to selectively move on, or to move on with partial data sets, is essential in larger systems. For that example you can hire me as a consultant, or buy me a beer; a bare-bones sketch of the control flow is below.
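In this sketch, buildTaskQueue() and handleResult() are hypothetical placeholders, and ./worker.js is the same style of worker as above:

// pool.js -- a bare-bones process pool: keep at most POOL_SIZE workers
// busy, handing each one a new chunk of work as soon as it reports back.
var fork = require('child_process').fork;
var os = require('os');

var POOL_SIZE = os.cpus().length - 1; // leave a core free for the main thread
var queue = buildTaskQueue(); // hypothetical: an array of { start, end } chunks

function dispatch(worker) {
  if (queue.length === 0) {
    worker.kill(); // no work left; shut this worker down
    return;
  }
  worker.send(queue.shift());
}

function spawnWorker() {
  var worker = fork('./worker.js');
  worker.on('message', function (msg) {
    handleResult(msg); // hypothetical: aggregate partial results as they arrive
    dispatch(worker); // this worker is free again; give it the next chunk
  });
  dispatch(worker);
}

for (var i = 0; i < POOL_SIZE; i++) {
  spawnWorker();
}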
While this example is fairly trivial, hopefully it highlights both the benefit of using a parallelized approach and the potential challenges of breaking algorithms or tasks across multiple processes. It's also important to limit the number of processes to less than the number of cores (preferably cores - 1), as multiple processes on a single core will interleave rather than truly run in parallel. Another point of interest: if a parent process dies, its workers will die as well, preventing zombie Node pids from overwhelming the OS.
Conclusion
In this post I gave some background on the multiprocessing paradigm and a brief overview of NodeJS's V8 compiler. Through the cluster and child process module examples, I demonstrated how to run CPU bound workloads in parallel across multiple cores while keeping the main NodeJS thread unblocked. While NodeJS still cannot compete with compiled languages such as Java or C++ on raw performance, having tools like clustering, in addition to JavaScript's native asynchronous non-blocking I/O architecture, makes NodeJS an extremely powerful tool, both in the browser and in backend software, and it should be considered for your next engineering project.