Content delivered over the Internet continues to grow exponentially. Content providers have spent significant CapEx on their infrastructure, typically peering with multiple providers to give their customers the best possible experience. This classically calls for BGP peering with these providers, using one of them as a transit provider that supplies a default route. Because many views of the global Internet routing table show approximately 580,000 IPv4 prefixes and 20,000 IPv6 prefixes (December 2015), large, expensive routers have traditionally been used in this role: traditional deployments took all the routes in the control plane (RIB) and programmed them into the hardware forwarding plane (FIB).
Many content providers have taken a fresh look at this legacy deployment model, analyzed their traffic flows, and found it to be very inefficient. Various publicly available data shows that only 30,000 prefixes are necessary for 99% of the traffic that content providers deliver to a given region. The remaining traffic, which is relatively insignificant, can use the default route through the transit provider. Content providers such as Spotify have publicly shared that these 30,000 prefixes are typical and well within the hardware routing table size on Arista switches.
Arista Networks provides a flexible feature that content providers use to reduce their CapEx by deploying lower-cost layer 3 switches in place of expensive routers. This frees up capital that can be used in other areas of their business. BGP Selective Route Download (SRD) gives a content provider the ability to select specific BGP prefixes and program only those prefixes into the hardware, while the full BGP table is maintained in the control plane (RIB). Additionally, because of the open architecture of Arista’s Extensible Operating System (EOS), many of the tools used to analyze your traffic can be installed directly on an Arista switch to create a packaged solution for both analytics and route optimization. No other networking vendor provides this level of flexibility.
One extremely useful tool that service providers have been using since its inception over ten years ago is pmacct (http://www.pmacct.net). The pmacct project provides a set of passive network monitoring tools to measure, account, classify, aggregate and export IPv4 and IPv6 traffic. Pmacct also supports a passive BGP router through an implementation of Quagga, so that sampled traffic can be correlated with each prefix. Traffic is sampled on an Arista switch by enabling sflow sampling, and the samples are sent to the pmacct sflow accounting process. Pmacct then provides IP accounting by correlating each sflow sample with its BGP prefix, giving a network operator full visibility of the prefixes that really matter.
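The correlation step can be illustrated with a minimal Python sketch. This is not pmacct's implementation; it only shows the idea of attributing each sampled destination address to the most specific matching BGP prefix. The prefixes, sample tuples, and function names are hypothetical and use documentation address ranges.

```python
import ipaddress
from collections import defaultdict


def longest_match(dst_ip, prefixes):
    """Return the most specific prefix that contains dst_ip, or None."""
    addr = ipaddress.ip_address(dst_ip)
    candidates = [p for p in prefixes
                  if p.version == addr.version and addr in p]
    return max(candidates, key=lambda p: p.prefixlen, default=None)


def bytes_per_prefix(samples, prefixes):
    """Sum sampled bytes per matching prefix; unmatched samples are skipped."""
    totals = defaultdict(int)
    for dst_ip, nbytes in samples:
        match = longest_match(dst_ip, prefixes)
        if match is not None:
            totals[str(match)] += nbytes
    return dict(totals)


# Hypothetical BGP prefixes and sampled (destination, bytes) pairs.
PREFIXES = [ipaddress.ip_network(p) for p in
            ("192.0.2.0/24", "192.0.2.128/25", "198.51.100.0/24")]
SAMPLES = [("192.0.2.10", 1500), ("192.0.2.200", 900),
           ("198.51.100.7", 64), ("203.0.113.1", 40)]

totals = bytes_per_prefix(SAMPLES, PREFIXES)
```

A linear scan over the prefix list is fine for illustration, but a full Internet table would call for a trie or similar structure.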
The correlated data is saved in a SQLite database, which can be queried at any point for the top ‘N’ prefixes. Once these prefixes are queried, they can easily be added to a prefix list on an Arista switch through Arista’s eAPI, so that they are programmatically installed in the hardware forwarding table (FIB). This simple approach allows the control plane to maintain a full view of the Internet routing table while installing only the route prefixes that matter.
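A top ‘N’ query of this kind might look like the following sketch. The table name `acct` and the columns `net_dst`, `mask_dst`, and `bytes` are assumptions made for illustration; the actual schema depends on how pmacct is configured. An in-memory database with made-up rows keeps the example self-contained.

```python
import sqlite3


def top_prefixes(conn, n):
    """Return the n destination prefixes with the highest byte totals."""
    rows = conn.execute(
        "SELECT net_dst, mask_dst, SUM(bytes) AS total "
        "FROM acct GROUP BY net_dst, mask_dst "
        "ORDER BY total DESC, net_dst ASC LIMIT ?",
        (n,),
    ).fetchall()
    return ["%s/%d" % (net, mask) for net, mask, _ in rows]


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acct (net_dst TEXT, mask_dst INTEGER, bytes INTEGER)")
conn.executemany("INSERT INTO acct VALUES (?, ?, ?)", [
    ("192.0.2.0", 24, 1500),
    ("198.51.100.0", 24, 64),
    ("192.0.2.0", 24, 700),
    ("203.0.113.0", 25, 900),
])

top_two = top_prefixes(conn, 2)
```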
There are three approaches with regards to implementing pmacct and Selective Route Download (SRD) on Arista platforms:
- Install pmacct and run a SQLite database locally and natively on an Arista switch
- Create a VM with pmacct, SQLite and supporting tools and run this on an Arista switch
- Run pmacct, SQLite and supporting tools on a server
This application note will focus on the first approach.
An Arista switch with the Solid State Drive (SSD) option is required because of the significant read/write load generated by the SQLite database and the BGP snapshot files that pmacct uses. EOS 4.15.4 or later is also needed, both for SRD functionality and for the ability to configure the BGP peer TCP port. Pmacct must be compiled from source with SQLite support. For convenience, an RPM is available that includes pmacct and custom Python scripts for querying the database and building the prefix-lists for SRD.
The basic logic of using pmacct on an Arista switch is outlined in the diagram below:
The Arista switch is configured to send sflow samples to the pmacct sfacct agent, which is bound to a loopback address on the switch and listening on port 9999. Additionally, pmacct is configured as a passive iBGP peer with the switch itself. Because the switch routing process is listening on port 179, we have to configure the pmacct routing process to listen on another port. The configuration file provided in the SRDTOOLS RPM is configured to listen on port 1179. Every hour, pmacct takes a BGP route snapshot, which is used to correlate the sflow datagrams with the appropriate prefix.
A custom script (srdtool.py) can be used to query the top ‘n’ prefixes with the option to automatically create the prefix-list used for SRD. This makes it extremely easy to deploy in production. Additionally, srdtool.py supports the option to not change the switch configuration but to display the prefix-list to stdout so it can be piped to other applications or scripts if needed. Another tool as part of the srdtool package is srdpurge.py. This allows a simple way to purge out old database entries and BGP snapshots.
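The source of srdtool.py is not shown here, so the following is only a sketch of how a configuration update could be expressed as an eAPI request. eAPI is a JSON-RPC interface whose `runCmds` method takes a list of CLI commands; the helper name, the request id, and the example commands are hypothetical. The sketch only builds the request body, which would then be sent as an HTTP POST to the switch's `/command-api` endpoint with valid credentials.

```python
import json


def build_eapi_request(cmds, request_id="srd-update"):
    """Build a JSON-RPC body asking eAPI to run the given CLI commands."""
    return {
        "jsonrpc": "2.0",
        "method": "runCmds",
        "params": {"version": 1, "cmds": list(cmds), "format": "json"},
        "id": request_id,
    }


# Hypothetical commands; the prefix is a documentation range.
cmds = [
    "enable",
    "configure",
    "ip prefix-list PLIST1",
    "seq 1 permit 192.0.2.0/24",
]

payload = build_eapi_request(cmds)
body = json.dumps(payload)  # the string that would be POSTed to the switch
```

Printing the generated commands instead of sending them mirrors the query-only mode described above, where the prefix-list goes to stdout for other tools to consume.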
The srdtool.py script requires a route map named FIBIN that matches on multiple prefix-lists. Since most content providers have a transit provider that supplies a default route, the default route is the first match. The second match is our prefix-list of the specific prefixes queried from the database. (Note that this applies only to the FIB. Typically you would also create route-maps to filter the routes accepted from a peer, such as the default route from a provider.)
ip prefix-list ALLOWDEFAULT
seq 10 permit 0.0.0.0/0
ip prefix-list PLIST1
seq 1 permit 184.108.40.206/15
seq 2 permit 220.127.116.11/27
seq 3 permit 18.104.22.168/28
seq 4 permit 22.214.171.124/15
seq 5 permit 126.96.36.199/27
! ###Truncated for brevity ###
route-map FIBIN permit 5
match ip address prefix-list ALLOWDEFAULT
route-map FIBIN permit 10
match ip address prefix-list PLIST1
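Generating a prefix-list in the format above from a ranked list of prefixes is straightforward. The sketch below is not taken from srdtool.py; the function name is hypothetical and the prefixes are documentation ranges. It validates each entry with Python's `ipaddress` module, which rejects prefixes that have host bits set.

```python
import ipaddress


def render_prefix_list(name, prefixes):
    """Render config lines for a prefix-list, one sequence number per prefix."""
    lines = ["ip prefix-list %s" % name]
    for seq, prefix in enumerate(prefixes, start=1):
        # Raises ValueError for malformed prefixes or prefixes with host bits set.
        network = ipaddress.ip_network(prefix)
        lines.append("seq %d permit %s" % (seq, network))
    return lines


config_lines = render_prefix_list("PLIST1", ["192.0.2.0/24", "198.51.100.0/25"])
print("\n".join(config_lines))
```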
The route-map is bound to FIB selection under the ‘router bgp <AS>’ configuration using the command ‘bgp route install-map <routemap>’.
router bgp 65000
bgp route install-map FIBIN
neighbor 127.0.0.9 remote-as 65000
neighbor 127.0.0.9 description "pmacct"
neighbor 127.0.0.9 transport remote-port 1179
To run pmacct natively on EOS, the pmacct BGP process must listen on a port other than 179, and the switch must be told to connect to that port. The latest EFT version of EOS added the optional ‘remote-port’ parameter to the BGP neighbor configuration for this purpose.
The final configuration requirement is sflow sampling.
sflow sample 16384
sflow destination 127.0.0.9 9999
sflow source-interface Loopback0
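With a sampling rate of 16384, each sample stands in for roughly 16,384 packets, so traffic volume is estimated by scaling the sampled byte counts. The sketch below shows only this arithmetic; the function name and frame lengths are made up, and whether the stored counters are already scaled depends on the pmacct configuration.

```python
SAMPLING_RATE = 16384  # matches 'sflow sample 16384' above


def estimate_bytes(sampled_frame_lengths, sampling_rate=SAMPLING_RATE):
    """Scale the bytes seen in samples up to an estimate of total bytes."""
    return sum(sampled_frame_lengths) * sampling_rate


# Three hypothetical sampled frames: 1500 + 1500 + 64 = 3064 bytes sampled.
estimate = estimate_bytes([1500, 1500, 64])
```

Because every prefix is scaled by the same factor, the ranking of the top prefixes is the same with or without scaling.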
Operationally, the srdtool.py script is very simple. It queries top ‘n’ prefixes that you specify through the command line. It will look to see which prefix-list is active, either PLIST1 or PLIST2. If one is actively used, it will create the other with the new queried list of prefixes. It will then point the route-map to match on the new prefix list and then delete the old prefix-list. Therefore, the active prefix list will always be either PLIST1 or PLIST2. If neither prefix list exists, it will create PLIST1 for the initial execution.
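The alternation between PLIST1 and PLIST2 can be sketched as follows. This is a reading of the behavior described above and not the srdtool.py source; the function names are hypothetical, the route-map sequence number 10 is taken from the example configuration, and the exact commands a real script issues may differ.

```python
def next_prefix_list(active):
    """Pick the list to build: the one that is not currently active."""
    return "PLIST2" if active == "PLIST1" else "PLIST1"


def swap_commands(active, prefixes, route_map="FIBIN", seq=10):
    """Return (new_list_name, commands) in the order described above."""
    new = next_prefix_list(active)
    # 1. Create the inactive list with the newly queried prefixes.
    cmds = ["ip prefix-list %s" % new]
    cmds += ["seq %d permit %s" % (i, p)
             for i, p in enumerate(prefixes, start=1)]
    # 2. Point the route-map at the new list.
    cmds += ["route-map %s permit %d" % (route_map, seq),
             "match ip address prefix-list %s" % new]
    # 3. Delete the old list, if there was one.
    if active is not None:
        cmds.append("no ip prefix-list %s" % active)
    return new, cmds


new_name, cmds = swap_commands("PLIST1", ["192.0.2.0/24"])
first_name, first_cmds = swap_commands(None, ["192.0.2.0/24"])
```

Building the new list before touching the route-map means the route-map always references a populated list during the change.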
General Installation Steps
1. Upgrade to at least 4.15.4F of EOS.
2. Install the SRDTOOLS.RPM – https://aristanetworks.egnyte.com/dl/5nZAoBIsCa
- Install the RPM by using the command ‘extension SRDTOOLS-1.0.1-1.i386.rpm’. This will install pmacct support files in their appropriate locations and the srdtools in /mnt/drive/SRD.
- Make this RPM install on subsequent reboots using the command, ‘copy installed-extensions boot-extensions’
3. Create ALLOWDEFAULT prefix-list, PLIST1 (can be blank initially) and route-map FIBIN
4. Configure BGP peers, including pmacct as an iBGP peer.
5. Configure sflow to forward to pmacct sfacct process
6. Start pmacct
sudo immortalize --log=/var/log/pmacct.log --daemonize /usr/sbin/sfacctd -f /etc/pmacct.conf
After initial installation, an event handler can be created to automatically restart pmacct after switch reboot.
action bash sudo immortalize --log=/var/log/pmacct.log --daemonize /usr/sbin/sfacctd -f /etc/pmacct.conf
SRD In Action
Let’s take a brief look at using the SRDTOOLS package on an Arista 7280 with three BGP peers, each sending full Internet routes into the RIB. In this example, the Arista 7280 will have 1.7 million routes in the RIB, but only 30,000 prefixes in the FIB, with the following topology:
In this example, Provider-A is a transit provider and it supplies a default route. Full routes are received between all three providers, but initially the default is used as this is the only one that is installed in the FIB by means of our ALLOWDEFAULT prefix-list which is bound to the FIBIN route-map. After a few hours, our database will have prefixes that are actively being used. At that time, the srdtool.py script can be executed (either scheduled or on demand) to query the database and create the prefix-list.
[admin@7280 SRD]$ ./srdtool.py --help
Usage: srdtool.py [options] arg1 arg2
-h, --help show this help message and exit
-V, --version The version
-u, --update Update configuration with new prefix list
-q, --query Query Database and only print to stdout
-d Verbose logging
-p prefixes, --prefixes=prefixes
Number of Prefixes
Run srdtool.py with the -u option to update the configuration and we’ll query for 30,000 prefixes.
[admin@7280 SRD]$ ./srdtool.py -p 30000 -u
Querying for top 30000 prefixes
Query complete at: 2015-12-22T19:31:51
Adding top 30000 prefixes, please wait.
Current active prefix list is PLIST1, creating PLIST2
Changing route-map to point to PLIST2
Completed prefix-list creation and route-map change.
Now we can look at the switch and see how many routes are in the RIB and FIB. Notice that we’re receiving ~570,000 prefixes from each provider. Since the pmacct BGP peer is passive, we should never see any prefixes advertised.
To see which routes are installed because of our prefix-list, use the ‘show ip bgp installed’ command.
(…Truncated for brevity)
We can pipe this output to the Unix wc command to get a full count, which shows that all 30,000 prefixes are installed in the FIB.
7280#show ip bgp installed | grep "^ \* >" | wc -l
7280#show ip route | grep "^ B" | wc -l
Additionally, the ‘show ip route summary’ command provides what routes have been programmed into the FIB.
Although we only have 30,000 prefixes in the FIB, what about the RIB? The ‘show ip bgp’ command lists all prefixes learned from our BGP neighbors. Notice in the snippet below that all routes are shown with route status codes that show which ones are installed.
Again, we can pipe this to wc to get a full count.
7280#show ip bgp | grep "^ \*" | wc -l
This means we have 1.7 million routes in our RIB but only 30,000 in our FIB. Working as designed!
Purging Old Data
Since pmacct takes a snapshot of the BGP table every hour (as configured) and the database will continue to grow, it’s important to purge old data over time to reduce the number of files and the disk space used. Many content providers have found, after using pmacct and analyzing their traffic patterns, that the top ‘n’ prefixes remain fairly consistent. Therefore, keeping significantly old BGP table snapshots and database entries is usually not necessary. To easily purge the old data, use the srdpurge.py script. It takes the age in hours beyond which records are purged as an argument.
[admin@7280 SRD]$ ./srdpurge.py --help
Usage: srdpurge.py [options] arg1 arg2
-h, --help show this help message and exit
-V, --version The version
-t PURGETIME, --time=PURGETIME
purge data older than ‘x’ hours
[admin@7280 SRD]$ ./srdpurge.py -t 12
File bgp-172_254_254_1-2015_12_22T06_00_01.txt is old. Deleting
File bgp-172_254_254_1-2015_12_22T08_00_01.txt is old. Deleting
File bgp-172_254_254_1-2015_12_22T07_00_01.txt is old. Deleting
Purging old entries in database
Ideally, this would be executed automatically in EOS using the scheduler. For example, you can execute it every 24 hours to purge data that is a week old (168 hours).
schedule pmacct-purge interval 1440 max-log-files 2 command bash /mnt/drive/SRD/srdpurge.py -t 168
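The age check behind a purge like this can be sketched from the snapshot filenames shown in the output above, which end in a timestamp of the form `2015_12_22T06_00_01`. This is an illustration only and not the srdpurge.py source; the function names and the reference time are hypothetical, and the sketch only selects files without deleting anything.

```python
from datetime import datetime, timedelta


def snapshot_time(filename):
    """Parse the trailing timestamp of a name like bgp-<peer>-<stamp>.txt."""
    stem = filename.rsplit(".", 1)[0]
    stamp = stem.rsplit("-", 1)[1]
    return datetime.strptime(stamp, "%Y_%m_%dT%H_%M_%S")


def stale_snapshots(filenames, now, max_age_hours):
    """Return the filenames whose timestamp is older than max_age_hours."""
    limit = timedelta(hours=max_age_hours)
    return [f for f in filenames if now - snapshot_time(f) > limit]


# Hypothetical reference time; filenames follow the format shown above.
now = datetime(2015, 12, 22, 19, 0, 0)
names = [
    "bgp-172_254_254_1-2015_12_22T06_00_01.txt",
    "bgp-172_254_254_1-2015_12_22T08_00_01.txt",
]
old = stale_snapshots(names, now, 12)
```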
An additional script in the package is maskcount.py. This simple script queries the SQLite database for all prefixes and counts the number of prefixes with a specific mask.
[admin@7280 SRD]$ ./maskcount.py
Querying for IPv4 prefix counts. Please wait...
This is useful for gathering basic data on the number of prefixes per mask length. It allows network engineers to see which mask length has the largest number of prefixes and then optimize the FIB by leveraging, for example, the LEM (Large Exact Match) table. In the above example output, we see that the /24 mask has the largest number of entries. To move those routes into the LEM table and free up additional LPM memory, the ‘ip hardware fib optimize prefix-length’ configuration parameter can be used.
7280(config)#ip hardware fib optimize prefix-length ?
12 Prefix length 12
16 Prefix length 16
20 Prefix length 20
24 Prefix length 24
28 Prefix length 28
32 Prefix length 32
Optimize the FIB by explicitly moving /24 and /32 routes to the LEM table.
7280(config)#ip hardware fib optimize prefix-length 32 24
Now we can see that the /32 and /24 routes are in the LEM table which gives us more space in the LPM tables.
7280(config)#show platform arad ip route summary
Total number of VRFs: 1
Total number of routes: 30017
Total number of route-paths: 12335
Total number of lem-routes: 17682
Total number of /24 routes in lem: 17680
Total number of /32 routes in lem: 2
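The per-mask tally, and the resulting split between LEM and LPM entries, can be approximated offline with a short sketch. This is not the maskcount.py source; the function names and prefixes are hypothetical, and the split is simple arithmetic that assumes every route of an optimized length lands in the LEM table.

```python
import ipaddress
from collections import Counter


def mask_counts(prefixes):
    """Count prefixes per mask length."""
    return Counter(ipaddress.ip_network(p).prefixlen for p in prefixes)


def lem_lpm_split(counts, lem_lengths):
    """Return (routes with an optimized length, all remaining routes)."""
    lem = sum(n for length, n in counts.items() if length in lem_lengths)
    return lem, sum(counts.values()) - lem


# Hypothetical prefixes from documentation ranges.
prefixes = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/25", "192.0.2.1/32"]
counts = mask_counts(prefixes)
lem, lpm = lem_lpm_split(counts, {24, 32})
```

Applied to the summary above, the same arithmetic gives 17,680 + 2 = 17,682 LEM routes out of 30,017, leaving 12,335 for the LPM table.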
For more details on how to optimize the FIB, please contact your local Arista Systems Engineer.
BGP Selective Route Download provides a robust mechanism for populating hardware route tables with only the prefixes that really matter. By leveraging open source tools such as pmacct and some basic scripts, content providers can build a robust solution at a significant cost savings. Additionally, other open source projects such as sir (https://github.com/dbarrosop/sir) provide packages that can be installed natively on an Arista switch to provide a graphical interface for traffic analysis coupled with the power of SRD.
As more demands come upon content providers, cost-saving approaches that use hardware efficiently are a requisite for next generation provider networks. Arista’s Extensible Operating System empowers customers by letting them use tools to enable flexible approaches to network engineering.
For more information on SRD and how you can leverage this feature, please contact your local Arista Systems Engineer.