Moving forward by backing off: Back pressure’s crucial role in nRoute performance - Part 1
At nToggle, we help ad buying platforms find the best advertising opportunities, in the form of bid requests, for their businesses out of the millions of requests our systems route every second. To do this, we built nRoute, a traffic-shaping platform that applies algorithm-generated rules in real time.
An example rule may require 50% of traffic to be from Android mobile devices or may require 75% of traffic from a country like the United States. We call the characteristics of a rule (e.g., device type, country) toggles. Since there are multiple toggles, we named ourselves nToggle. Clever, right?
Not only does nRoute route high levels of bid request throughput, but it also routes quickly. In fact, we strive for a 20ms 99th percentile round trip latency for each routed request. In this blog article, I want to share a concept central to nRoute's design that helped us reach our performance goals.
nToggle in the Ad Buying Ecosystem
Before jumping into the challenges we faced, it is helpful to first understand how nToggle fits into the ad buying ecosystem. nToggle operates in the real-time bidding (RTB) space, where numerous bidders participate in an auction to determine who will show an advertisement on the web page or mobile app you just opened. Exchanges conduct the auctions, receive bids, and select the winner in anywhere from 100-200ms. The bidders are often referred to as demand side platforms (DSPs) and represent the interests of advertisers seeking to target specific users with relevant advertising. The diagram below visualizes the bidding process.
nToggle enters the picture in between the exchange and the DSP in order to sift through the traffic using nRoute. Since the lifetime of an auction is no more than several hundred milliseconds, it is crucial that nRoute maintains low latency to avoid taking away decisioning time from DSPs. Given there are some five million auctions happening around the world at once, nRoute must also support high levels of throughput to enable DSPs to see the best available bid requests. Here is the same diagram updated to show nToggle forwarding a toggled bid request:
Now that you have a working knowledge of the nRoute problem domain, let's turn to a major challenge we solved to reach our performance goals.
A Promising Start
The nRoute prototypes performed great in benchmarks. According to the benchmarks, handling tens of thousands of bid requests within our latency constraints seemed plausible. The team celebrated our early successes, until we realized one caveat: the benchmarks were wrong! Put more accurately, the benchmarks measured one latency while we were interested in another. We had fallen prey to a common latency measurement mistake known as coordinated omission.
Coordinated omission is a term I first learned from JVM performance expert Gil Tene. When measuring latency, it is common to start the clock when a request is actually transmitted rather than when it should have been sent. This seemingly small semantic difference yields significant measurement differences when the system being measured fails to cope with the incoming request rate. We will come back to this topic later.
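To make the distinction concrete, here is a small, hypothetical Scala sketch (all of the numbers are invented): a blocking load generator intends to issue one request per millisecond, but a single 100ms stall delays every request queued behind it. Clocking each request from when it was actually sent hides the stall; clocking from when it should have been sent exposes it.

```scala
// Hypothetical illustration of coordinated omission (numbers invented).
// A blocking load generator intends to send one request per millisecond.
// The first request stalls the system for 100ms, so every later request
// waits in line before it is actually sent.
final case class Observation(naiveMs: Long, correctedMs: Long)

val intendedSendTimesMs: Seq[Long] = 0L until 10L // one request per ms
val serviceTimesMs: Seq[Long]      = 100L +: Seq.fill(9)(1L)

val (observations, _) =
  serviceTimesMs.zip(intendedSendTimesMs).foldLeft((Vector.empty[Observation], 0L)) {
    case ((acc, prevFinish), (serviceMs, dueMs)) =>
      val actualSend = math.max(dueMs, prevFinish) // generator was blocked until now
      val finish     = actualSend + serviceMs
      val obs = Observation(
        naiveMs     = finish - actualSend, // clocked from when it WAS sent
        correctedMs = finish - dueMs       // clocked from when it SHOULD have been sent
      )
      (acc :+ obs, finish)
  }
```

The naive measurement reports a single slow request and nine fast ones, while the corrected measurement shows that every request in this window spent roughly 100ms end to end, which is what a real client would have experienced.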
Once we corrected the measurement method, we rode the metaphorical rollercoaster that you too have likely experienced in your life as a software engineer. One moment, everything works and you're on top of the world; the next, you are frantically trying to figure out why nothing works. At low bid request volumes, throughput and latency results with the corrected benchmark were great. Through extensive testing, we discovered that once the bid request rate exceeded a seemingly arbitrary threshold, nRoute throughput went off a cliff as shown below.
In a perfect world with infinite hardware, every bid request is processed by nToggle as shown by the dashed, ideal performance line. Unfortunately, actual performance was far from ideal and fell below contractual limits per node. Repeated tests, JVM tuning, and profiling all failed to produce a meaningful change in actual performance. What do you do when you feel like you have exhausted all other options?
It's the Network!
As a software person, it's often convenient to throw your hands up and exclaim, "It's the network!" However, network throughput tests with iperf suggested that 40,000 bid requests per second did not come close to saturating network capacity. After ruling out the network, we took a step back to re-evaluate the problem and realized we had overlooked a critical puzzle piece.
The missing puzzle piece comes from answering a simple question. At your grocery store, when customers arrive at the registers faster than all the cashiers combined can check them out, what happens? Don't overthink it! A line forms. Similarly, when nRoute receives bid requests faster than it can route them, the bid requests queue up. This insight appears obvious when stated, but is harder to see when you are looking at a system with numerous moving parts and a deadline of yesterday!
Let's revisit the throughput problem from a queueing perspective. As the incoming bid request rate increases and exceeds the nRoute processing rate, requests queue up. Since nRoute attempts to route each bid request it receives, nRoute enters a deadly spiral. A growing number of queued bid requests results in more frequent and longer-lasting garbage collection cycles, which in turn cause additional bid requests to queue up. Finally, we had an intuition that explained the performance problem. But how should we solve it?
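The spiral shows up even in a trivial model. The sketch below uses made-up rates (50,000 requests arriving per second against a 40,000/s service rate) and ignores garbage collection entirely, yet the backlog still grows without bound; in a real JVM, GC pressure only lowers the effective service rate and steepens the curve.

```scala
// Toy queueing model with invented numbers: arrivals exceed the service
// rate, so each simulated second the backlog grows by the difference.
val arrivalsPerSec    = 50000
val serviceRatePerSec = 40000

// Backlog at the end of each of 10 simulated seconds.
val backlog: Seq[Int] = (1 to 10).scanLeft(0) { (queued, _) =>
  math.max(0, queued + arrivalsPerSec - serviceRatePerSec)
}
// The queue never drains: it grows by 10,000 requests every second for as
// long as the arrival rate stays above the service rate.
```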
Introducing Back Pressure
Since nRoute sits between an exchange and a DSP, nRoute must achieve its performance goals while also defending against unexpected downstream behavior changes. For example, imagine a downstream DSP experiences a dramatic increase in response latency due to an unexpected infrastructure outage. nRoute should continue to route the most relevant traffic while still adhering to throughput and latency goals. By making nRoute robust to downstream changes, nToggle maximizes the value delivered.
Fitting a solution to these constraints revolves around challenging a central assumption of nRoute's design: treat all requests equally. In order to improve throughput, we had to accept that not all requests can be fully processed. To cope with higher bid request rates, nRoute must reject work when overloaded. Rejecting work when overloaded is commonly referred to as applying back pressure.
There are multiple ways to apply back pressure, several of which Martin Thompson, another notable JVM performance expert, explores in his excellent blog post. The challenge we faced when applying back pressure is that exchanges penalize bidders that fail to reply to incoming bid requests. Exchanges penalty box a bidder by reducing the transmitted bid request rate until the bidder begins responding to all bid requests. Since nToggle's mission is to help DSPs find the best possible traffic, we knew we had to find a way to avoid being penalty boxed on behalf of DSPs.
To make the situation more complex, the bid requests are transmitted via HTTP v1.x. Since HTTP v1.x does not support multiplexing, it is common to open additional TCP connections to achieve a constant throughput rate. This policy results in thousands of connections being opened at high rates, precisely when nRoute is already under stress due to a heavy workload. Rejecting TCP connections as a means of applying back pressure is out of the question due to penalty boxing.
Time for a Different Approach
To account for the constraints we've covered, we found that a workable approach is to apply a timeout to all request processing. The instant a bid request is received by nRoute, it is recorded as the start of processing. At each logical step of processing, if the amount of time passed exceeds a configured expiry value, then the processing is short-circuited. Short-circuiting in this context means that nRoute responds with a message indicating that no bid will be placed in this auction.
The idea can be extended to have multiple short-circuit points in a processing pipeline with multiple logical steps. Here is a diagram showing a fictitious, multi-step processing pipeline with several timeout scenarios to better visualize the process.
This diagram depicts a pipeline composed of generically named steps: stage 1, stage 2, and stage 3. At the end of each step, a decision can be made to either continue or short-circuit request processing depending upon the cumulative time spent processing a request. The expiries are an offset from the start of processing. For example, when stage 1 completes, if request processing started more than 3ms ago, then processing should be short-circuited and no further processing should occur.
To better understand how the expiries are applied, the table below walks through the three pictured request processing scenarios.
These examples show it is possible to deconstruct a critical path processing pipeline into multiple decision points to evaluate if a request should be processed further. As you might have already concluded, short-circuiting sooner yields fewer wasted cycles.
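Those decision points can be sketched in a few lines of Scala. The stage names and cutoff values below mirror the illustrative diagram, not nRoute's actual configuration:

```scala
// Cumulative expiry offsets, in ms from the start of processing. A request
// continues past a stage only if its total elapsed time is within that
// stage's cutoff; otherwise it is short-circuited into an immediate no-bid.
// Values are illustrative, not nRoute's real configuration.
val expiriesMs: Seq[(String, Long)] =
  Seq("stage 1" -> 3L, "stage 2" -> 8L, "stage 3" -> 15L)

// Given the elapsed time observed after each stage, return the stage at
// which the request is short-circuited, or None if it is fully routed.
def shortCircuitAt(elapsedAfterStageMs: Seq[Long]): Option[String] =
  expiriesMs.zip(elapsedAfterStageMs).collectFirst {
    case ((stage, cutoffMs), elapsedMs) if elapsedMs > cutoffMs => stage
  }
```

A request that clears every cutoff is routed; one that is already past 3ms when stage 1 completes does no further work, which is exactly the "fewer wasted cycles" property the earlier decision points buy.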
To apply back pressure effectively in our use case, we record the start-of-processing timestamp within the application networking stack and add another decision point at the beginning of the pipeline. Since the networking layer communicates with the request processing pipeline via inter-thread communication, there is latency between receipt of a request and the start of processing. As the throughput level increases, this thread hop takes increasingly longer, and eventually requests are short-circuited before they enter the pipeline. This accomplishes the goal of short-circuiting requests as early as possible in the processing pipeline.
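Under assumed names and an invented budget, that early decision point might look like the sketch below: the network layer stamps the request at receipt, the stamp travels with the request across the thread hop, and the worker checks it before doing any routing work.

```scala
// Hypothetical sketch of the intake decision point. The receipt timestamp
// is captured in the networking layer and carried with the request across
// the thread hop; the 1ms expiry budget here is invented for illustration.
final case class Timestamped[A](receivedAtNanos: Long, payload: A)

val intakeExpiryNanos: Long = 1000000L // 1ms budget before pipeline entry

// True if the request is still fresh enough to enter the pipeline; false
// means it should be short-circuited to an immediate no-bid response.
def admit[A](req: Timestamped[A], nowNanos: Long): Boolean =
  nowNanos - req.receivedAtNanos <= intakeExpiryNanos
```

Because the gate compares a clock delta carried with each request, it tightens automatically under load: the longer the inter-thread queue grows, the staler each request is on arrival, and the more requests are rejected before any routing work is spent on them.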
Using processing time as a bounding factor provides a flexible gating mechanism because it adjusts as the system's throughput and latency capabilities change. Tying system performance to latency matches performance goals with business goals, which ensures that all system tuning translates directly to business impact.
After tuning our request processing expiries to apply back pressure by short-circuiting requests as early as possible, the throughput graphs dramatically changed for the better. Here is how our performance now looks:
Instead of breaking down under load, our software now more gracefully handles higher levels of throughput by rejecting work. This leads to a graph where processing throughput levels out and remains constant as the incoming bid request rate increases. Since performance is central to our value proposition, we continually revisit and rework for improved latency and throughput.
Applying back pressure to incoming bid requests ensures nRoute provides robust, low latency, high throughput communication between exchanges and DSPs. In future blog posts, we will dive into another back pressure mechanism essential to nRoute’s stability and performance. To keep you on the edge of your seat, ponder the following question. How else could you apply back pressure to DSPs when there is an unexpected downstream change in network connectivity?
Coming back to the topic of coordinated omission, one resource to consider when exploring the subject is a book I co-authored in 2016 called, Scala High Performance Programming. My previous work experiences and working on nRoute served as a motivator to explore how to use Scala to write performant software. Among other topics, the book discusses back pressure, coordinated omission, and how the functional programming paradigm can produce fast software. If you're interested, check it out!