How does one build an Internet-scale router using data center switches and a bit of SDN grease? One solution is what Spotify built with their open-source SIR (SDN Internet Router).
Before we go any further, let us address the why. Why would one want to do this? The price-performance gap between a data center switch and an Internet router is on the order of 10x: data center switches based on merchant silicon can offer three times the density of high-end routers for a third of the price, which works out to roughly a 9x difference in cost per port.
For this reason, replacing expensive high-end routers with programmable data center switches using merchant silicon has been an emerging trend in the industry for the past few years. It is important to realize that data center switches, despite being called switches, are also very capable routers. In fact, all the widely adopted merchant silicon ASICs produced in at least the last five years exhibit exactly the same performance characteristics whether they are switching (making an L2 decision) or routing (making an L3 decision): same latency, same throughput.
So what continues to separate data center switches from high-end routers? There are two differences of interest in this context:
- The number of routes the hardware can support: current-generation merchant silicon can typically absorb on the order of 30-200k IPv4 prefixes, compared to 1-2M for a high-end router. This gap has been closing, slowly but surely, with each new generation of ASIC, thanks to Moore’s Law.
- Advanced software routing features (e.g. MPLS TE) and software routing scale, both of which are typically not found on data center switches.
The second issue is “just” a software problem, so it can be addressed. Arista recently announced a new round of improvements to EOS, allowing the software to scale to more than a million routes. But ultimately if only 30k routes can be programmed in hardware, what good does this do?
The Internet today comprises over 600k routes, but as David Barroso at Spotify posited, not all of them are useful to everybody. “When you travel… do you carry an Atlas? Or do you carry a local map?”, he asked. “So why are you carrying an Atlas with your router?”
Indeed, a streaming service like Spotify is likely to serve a lot of content to the so-called “eyeball networks”: large ISPs (Internet Service Providers) servicing end users at home, at the office, or on their mobile phones. It is unlikely to stream much, if anything, to the plethora of enterprise networks or to networks run for other websites.
The challenge then becomes: if not all the routes are useful, which ones are? David drew inspiration from a tech talk given at NANOG 61 by Elisa Jasinska, who was at Netflix at the time, and Paolo Lucente, author of the open-source traffic monitoring and accounting tool called pmacct. The latter is a key piece of the puzzle.
pmacct can cross-reference traffic samples collected via sFlow with BGP prefixes, which enables SIR to determine the top N prefixes based on actual bandwidth usage. These will be programmed in hardware, and a default route, provided by a transit provider, will be used as a fallback path for all traffic falling outside of the top N prefixes. Of course, the top N prefixes change over time, so they have to be periodically or continuously recomputed.
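The selection step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not SIR's code or pmacct's API: the function names (`match_prefix`, `top_n_prefixes`, `diff_routes`), the sample format, and the data are all assumptions made up for this example, and in the real system pmacct does the sample-to-prefix correlation itself.

```python
import ipaddress
from collections import Counter

def match_prefix(dst_ip, prefixes):
    """Return the longest BGP prefix containing dst_ip, or None."""
    addr = ipaddress.ip_address(dst_ip)
    best = None
    for net in prefixes:
        if addr in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return best

def top_n_prefixes(samples, prefixes, n):
    """samples: iterable of (dst_ip, byte_count). Returns the n busiest prefixes."""
    usage = Counter()
    for dst_ip, nbytes in samples:
        net = match_prefix(dst_ip, prefixes)
        if net is not None:
            usage[net] += nbytes
    return [net for net, _ in usage.most_common(n)]

def diff_routes(installed, wanted):
    """Routes to add and routes to withdraw when the top-N set changes."""
    installed, wanted = set(installed), set(wanted)
    return wanted - installed, installed - wanted

# Made-up data for illustration only (documentation address ranges).
bgp_prefixes = [ipaddress.ip_network(p) for p in
                ("192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24")]
samples = [("192.0.2.10", 5000), ("198.51.100.7", 1200),
           ("192.0.2.99", 3000), ("203.0.113.5", 400)]

top = top_n_prefixes(samples, bgp_prefixes, n=2)
previously_installed = [ipaddress.ip_network("203.0.113.0/24")]
to_add, to_remove = diff_routes(previously_installed, top)
```

Recomputing `top` on a schedule and applying only the difference keeps the number of hardware updates small when the busiest prefixes change slowly.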
The initial results presented by David at the SDN meetup in Stockholm, hosted by Spotify, were more than encouraging. With N=1000, almost 90% of the traffic could take the best, specific routes, leaving only 10% to the default route. With 5000 prefixes, over 95% of the traffic was covered.
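To make the coverage figures concrete, here is a minimal sketch of how such a percentage could be computed from per-prefix byte counts. The `coverage` function and the numbers are hypothetical and chosen only to show the arithmetic; they are not Spotify's data.

```python
def coverage(bytes_per_prefix, n):
    """Fraction of total bytes carried by the n busiest prefixes."""
    total = sum(bytes_per_prefix.values())
    if total == 0:
        return 0.0
    top = sorted(bytes_per_prefix.values(), reverse=True)[:n]
    return sum(top) / total

# Hypothetical, skewed traffic distribution across four prefixes.
usage = {"prefix-a": 70, "prefix-b": 20, "prefix-c": 7, "prefix-d": 3}

for n in (1, 2, 4):
    print(n, coverage(usage, n))
```

The more skewed the traffic distribution, the smaller N can be for a given coverage target, which is why a content network that mostly talks to a handful of large ISPs benefits so much from this approach.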
In a recent blog post, Spotify shared some numbers from their actual deployment at various IXPs (Internet eXchange Points), showing that only between 10 and 27% of the routes received from their peers suffice to cover well over 99% of the traffic in a steady state.
| IXP | Routes installed | Routes not installed |
The last piece of the puzzle is finding a way to program only the chosen routes in the hardware. Most high-end routers support SRD, Selective Route Download, a mechanism to select which routes make it to the hardware (to be more precise, SRD acts as a filter between the RIB and the FIB).
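Conceptually, the filter between the RIB and the FIB can be sketched as follows. This is a toy model under stated assumptions, not how EOS or any router implements SRD: `select_for_fib` and `lookup` are hypothetical helpers, next hops are plain strings, and only an IPv4 default route is considered.

```python
import ipaddress

def select_for_fib(rib, selected):
    """Keep only RIB routes whose prefix is selected, plus any default route."""
    default = ipaddress.ip_network("0.0.0.0/0")
    return {prefix: next_hop for prefix, next_hop in rib.items()
            if prefix in selected or prefix == default}

def lookup(fib, dst_ip):
    """Longest-prefix match against the FIB. Returns the next hop or None."""
    addr = ipaddress.ip_address(dst_ip)
    matches = [p for p in fib if addr in p]
    if not matches:
        return None
    return fib[max(matches, key=lambda p: p.prefixlen)]

net = ipaddress.ip_network
# The RIB holds everything learned via BGP (made-up routes and next hops).
rib = {
    net("0.0.0.0/0"): "transit",
    net("192.0.2.0/24"): "peer-a",
    net("198.51.100.0/24"): "peer-b",
}
# Only the selected prefixes (and the default) reach the hardware table.
fib = select_for_fib(rib, selected={net("192.0.2.0/24")})
```

With this FIB, a packet to `192.0.2.1` follows the specific route via `peer-a`, while a packet to `198.51.100.1` falls back to the default route via `transit`, even though the RIB still knows a better path for it.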
EOS did not support SRD at the time, but thankfully this was not a deterrent for David. While our BGP team was working on implementing the feature, David simply used BIRD instead. Because EOS is an open GNU/Linux distribution based on Fedora, it was easy for David to deploy an alternative BGP stack to fill in the gap.
Alas, as David recalls in his blog post, “We encountered some problems where the routes we tried to install did not get picked up properly by the hardware.” A then-recent change to the mechanism synchronizing the routing table between the Linux kernel and Sysdb contained an overly restrictive check that filtered out the routes of “type BIRD”. Thankfully, this regression was easy enough to fix, and after changing a few lines of code, I was able to send David a patch to unblock him again. EOS now natively supports SRD and you can read more about using SRD on EOS here.
Spotify allowed David to contribute not only the design to the open-source community, but also the entire implementation. You can read detailed instructions on how to deploy SIR as an EOS extension. We also have an #arista channel on Slack at network.toCode(), where you can ask questions about turning your data center switch into an SDN Internet router.
As a quick plug, it was a lot of fun for me to work with David at Spotify on enabling him to create his SDN Internet Router. We work this way in tandem with most of the big cloud providers and Internet companies such as Spotify. If you are interested in solving problems with creative solutions like SIR, Arista is hiring in San Francisco, Santa Clara, Bangalore, Vancouver, Nashua, and Dublin. Just drop me your resume at <tsuna at arista dot com>.