Guide to Multi-processing Network Server Models

As someone who’s been writing high-performance networking code for a number of years now (my doctoral dissertation was on the topic of a Cache Server for Distributed Applications Adapted to Multicore Systems), I see many tutorials on the subject that omit any discussion of the fundamentals of network server models. This article is therefore intended as a hopefully useful overview and comparison of network server models, with the goal of taking some of the mystery out of writing high-performance networking code.

It is aimed at “system programmers”, i.e., back-end developers who work with the low-level details of their applications, implementing network server code. This will usually be done in C++ or C, though nowadays most modern languages and frameworks offer decent low-level functionality, with varying levels of efficiency.

I’ll take it as common knowledge that, since it’s easier to scale CPUs by adding cores than by making individual cores faster, it’s only natural to adapt the software to use these cores as best it can. Thus, the question becomes how to partition software among threads (or processes) which can be executed in parallel on multiple CPUs.

I’ll also take for granted that the reader is aware that “concurrency” basically means “multitasking”, i.e., several instances of code (whether the same code or different, it doesn’t matter) active at the same time. Concurrency can be achieved on a single CPU, and prior to the multicore era, it usually was: the operating system quickly switches between multiple processes or threads on the one CPU. This is how old, single-CPU systems managed to run many applications at the same time, in a way that the user perceived as simultaneous execution, although it really wasn’t. Parallelism, on the other hand, means that code is literally being executed at the same time, by multiple CPUs or CPU cores.

Partitioning an Application (into multiple processes or threads)

For the purpose of this discussion, it largely doesn’t matter whether we are talking about threads or full processes. Modern operating systems (with the notable exception of Windows) make processes almost as lightweight as threads (or, in some cases, the reverse: threads have gained features which make them as heavyweight as processes). Nowadays, the major difference between processes and threads is in the capabilities for cross-process or cross-thread communication and data sharing. Where the distinction between processes and threads is important, I will make an appropriate note; otherwise, it’s safe to consider the words “thread” and “process” in this section to be interchangeable.
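To make that difference concrete, here’s a minimal C sketch (compile with -pthread) showing that threads share memory directly, while a forked process only modifies its own copy:

```c
#include <pthread.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int counter = 0; /* shared by threads, but copied on fork() */

static void *thread_body(void *arg) {
    (void)arg;
    counter++; /* visible to the whole process */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, thread_body, NULL);
    pthread_join(t, NULL);
    printf("after thread: %d\n", counter); /* 1: threads share memory */

    if (fork() == 0) {
        counter++; /* increments only the child's private copy */
        _exit(0);
    }
    wait(NULL);
    printf("after child: %d\n", counter);  /* still 1 in the parent */
    return 0;
}
```

Processes that do need to share data must use explicit mechanisms such as pipes, sockets, or shared memory.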

Common Network Application Tasks and Network Server Models

This article deals specifically with network server code, which necessarily implements the following three tasks:

  • Task #1: Establishing (and tearing down) of network connections
  • Task #2: Network communication (IO)
  • Task #3: Useful work; i.e., the payload or the reason why the application exists

There are several general network server models for partitioning these tasks across processes; namely:

  • MP: Multi-Process
  • SPED: Single Process, Event-Driven
  • SEDA: Staged Event-Driven Architecture
  • AMPED: Asymmetric Multi-Process Event-Driven
  • SYMPED: SYmmetric Multi-Process Event-Driven

These are the network server model names used in the academic community, and synonyms for at least some of them can be found “in the wild”. (The names themselves are, of course, less important – the real value is in how to reason about what’s going on in the code.)

Each of these network server models is further described in the sections that follow.

The Multi-Process (MP) Model

The MP network server model is the one that everyone used to learn first, especially when learning about multithreading. In the MP model, there is a “master” process which accepts connections (Task #1). Once a connection is established, the master process creates a new process and passes the connection socket to it, so there is one process per connection. This new process then usually works with the connection in a simple, sequential, lock-step way: it reads something from it (Task #2), then does some computation (Task #3), then writes something to it (Task #2 again).
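As a minimal sketch of the MP model, assuming POSIX sockets and fork() on port 8080 (error handling is trimmed for brevity, and the trivial echo in handle_connection() stands in for Tasks #2 and #3):

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stands in for Tasks #2 and #3: read a request, compute, write a reply. */
static void handle_connection(int conn_fd) {
    char buf[4096];
    ssize_t n = read(conn_fd, buf, sizeof(buf)); /* Task #2 */
    if (n > 0)
        write(conn_fd, buf, (size_t)n);          /* Task #3 + #2: echo back */
}

int main(void) {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, SOMAXCONN);

    for (;;) {
        /* Task #1: the master process accepts the connection... */
        int conn_fd = accept(listen_fd, NULL, NULL);
        if (conn_fd < 0)
            continue;
        if (fork() == 0) {
            /* ...and the child owns it: one process per connection. */
            close(listen_fd);
            handle_connection(conn_fd);
            close(conn_fd);
            _exit(0);
        }
        close(conn_fd); /* the parent no longer needs the connected socket */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ; /* reap any finished children to avoid zombies */
    }
}
```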

The MP model is very simple to implement, and actually works extremely well as long as the total number of processes remains fairly low. How low? The answer really depends on what Tasks #2 and #3 entail. As a rule of thumb, let’s say the number of processes or threads should not exceed about twice the number of CPU cores. Once there are too many processes active at the same time, the operating system tends to spend way too much time thrashing (i.e., juggling the processes or threads around on the available CPU cores) and such applications generally end up spending almost all of their CPU time in “sys” (or kernel) code, doing little actually useful work.
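As an illustration, the rule-of-thumb cap can be computed at startup. This sketch uses sysconf() (widely available on Linux and the BSDs, though not strictly standard POSIX); the 2× factor is just the heuristic above, not a hard limit:

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long cores = sysconf(_SC_NPROCESSORS_ONLN); /* online CPU cores */
    if (cores < 1)
        cores = 1;                 /* fall back if the count is unavailable */
    long max_workers = cores * 2;  /* rule-of-thumb ceiling from above */
    printf("cores=%ld, suggested max workers=%ld\n", cores, max_workers);
    return 0;
}
```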

Pros: Very simple to implement, works very well as long as the number of connections is small.

Cons: Tends to overburden the operating system if the number of processes grows too large, and may have latency jitter as network IO waits until the payload (computation) phase is over.

The Single Process Event-Driven (SPED) Model

The SPED network server model was made famous by some relatively recent high-profile network server applications, such as Nginx. Basically, it does all three tasks in the same process, multiplexing between them. To be efficient, it requires some fairly advanced kernel functionality, like epoll (Linux) or kqueue (BSD and macOS). In this model, the code is driven by incoming connections and data “events”, and implements an “event loop” which looks like this (sketched in code after the list):

  • Ask the operating system if there are any new network “events” (such as new connections or incoming data)
  • If there are new connections available, establish them (Task #1)
  • If there is data available, read it (Task #2) and act upon it (Task #3)
  • Repeat until the server exits
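Here is a minimal sketch of that loop using Linux epoll, again on port 8080; nonblocking-socket setup and error handling are trimmed, and the trivial echo in on_readable() stands in for Tasks #2 and #3:

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Stands in for Tasks #2 and #3: read available data and act on it. */
static void on_readable(int fd) {
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf)); /* Task #2 */
    if (n <= 0)
        close(fd);                 /* peer closed or error: drop it */
    else
        write(fd, buf, (size_t)n); /* Task #3 + #2: trivial echo */
}

int main(void) {
    /* Set up a listening socket, as in the MP example. */
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listen_fd, SOMAXCONN);

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        struct epoll_event events[64];
        /* Ask the OS which descriptors actually have pending events */
        int n = epoll_wait(ep, events, 64, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                /* Task #1: establish the new connection and watch it */
                int conn = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                epoll_ctl(ep, EPOLL_CTL_ADD, conn, &cev);
            } else {
                on_readable(fd); /* Tasks #2 and #3 on a ready connection */
            }
        }
    }
}
```

Note that epoll_wait() returns only the descriptors that actually have pending events, which is what keeps the per-iteration system call overhead low.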

All of this is done in a single process, and it can be done extremely efficiently because it completely avoids context-switching between processes, which usually kills the performance in the MP model. The only context switches here come from system calls, and those are minimized by only acting on the specific connections which have some events attached to them. This model can handle tens of thousands of connections concurrently, as long as the payload work (Task #3) isn’t overly complicated or resource intensive.

There are, however, two major downsides to this approach:

  1. Since all three tasks are done sequentially in a single loop iteration, the payload work (Task #3) is done synchronously with everything else. If it takes a long time to compute a response to the data received from a client, everything else stops while this is being done, introducing potentially huge fluctuations in latency.
  2. Only a single CPU core is used. This has the benefit, again, of keeping the number of context switches required from the operating system to an absolute minimum, which increases overall performance, but has the significant downside that any other available CPU cores are doing nothing at all.

It is for these reasons that more advanced models are required.

Pros: Can be highly performant and easy on the operating system (i.e., requires minimal OS intervention). Only requires a single CPU core.

Cons: Only utilizes a single CPU core (regardless of how many are available). If the payload work is not uniform, results in non-uniform latency of responses.
