The OpenConfig working group is tackling a number of challenging problems that have hindered multi-vendor network programmability:
- Creating vendor-independent models to represent all the aspects of a network element;
- Making these models programmatically accessible and modifiable;
- Changing from a pull model to a push model, with subscriptions and update streaming.
We are very excited about this effort and we believe it has a good chance of succeeding as it is driven by some of the biggest cloud and service provider operators. For the past year, we have been working closely with members of the working group and in particular with Anees Shaikh and Joshua George, from Google.
To understand where OpenConfig is coming from, one has to take a step back and look at how the largest networks on the planet are operated today. To meet uptime SLAs north of four 9’s (less than an hour of downtime per year), network operations have to be largely automated to eliminate the single largest source of downtime: human error. This means automating not only monitoring, but also incident detection and mitigation (including responding to DDoS attacks), bandwidth allocation and traffic steering, deployment of network OS image upgrades and configuration changes, capacity planning, draining traffic during planned and unplanned maintenance, etc.
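As a quick check on the "four 9’s" figure, here is a minimal sketch that computes the yearly downtime budget for a given number of nines. The function name `downtimeBudget` and the 365-day year are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// downtimeBudget returns the downtime allowed per year by an availability
// with the given number of nines (2 -> 99%, 4 -> 99.99%).
// Assumption: a 365-day year; integer arithmetic keeps the result exact.
func downtimeBudget(nines int) time.Duration {
	budget := 365 * 24 * time.Hour
	for i := 0; i < nines; i++ {
		budget /= 10
	}
	return budget
}

func main() {
	for n := 2; n <= 5; n++ {
		fmt.Printf("%d nines: %v per year\n", n, downtimeBudget(n))
	}
}
```

For four nines this works out to roughly 52 and a half minutes per year, which leaves very little room for manual mistakes.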
These automated systems need constant feedback from all the network elements as well as programmatic access to their configuration in order to make changes. Large networks are always multi-vendor, and thus their operators maintain collections of scripts and systems to collect data from vendor-specific data sources (such as custom SNMP MIBs or YANG models, but also maybe the output of CLI commands!) and act on this data in a vendor-specific way (e.g. to push a config change).
This means that for each vendor and each flavor of network OS, the operator has to build a translation layer to turn the vendor-dependent state into a vendor-independent representation that automated systems can act on. The operator then has to build the reverse translation as well, from the vendor-independent intent emitted by automated systems into vendor-dependent API calls, when there is an API at all.
One example of this sort of large-scale, pre-OpenConfig automation system is Microsoft’s Statesman, described in their SIGCOMM’14 paper. Statesman fully automates various aspects of Azure’s network, for tasks ranging from dealing with device failures (complete or partial, such as elevated rates of FCS errors on optical links) to rolling out new network OS images and steering traffic based on real-time demand.
A significant amount of engineering time has to be spent building custom versions of Statesman’s Monitor and Updater components for each flavor of each network OS deployed in Azure’s network.
Arista makes it easier to implement such translation layers for its devices by having a single EOS image to run across all devices as well as various APIs such as eAPI, the JSON-RPC interface to the CLI, or EOS SDK for on-box programming. But what if this translation layer could be eliminated altogether? This is one of the promises of OpenConfig: a collection of standard YANG models to interact with any network element.
In addition to solving this important problem, OpenConfig also promises access to the network element’s state via a modern API. While the OpenConfig working group does not currently define any transport protocol or serialization format, we have seen three main contenders emerge in the industry:
- NETCONF: the traditional SSH + XML combination used in some vendor implementations;
- RESTCONF: a newer alternative using HTTPS and allowing the use of a JSON data encoding, perhaps more suitable for modern tools and scripting;
- gRPC: Google’s open-source protobuf-based RPC framework built on top of HTTP/2, similar to Apache Thrift.
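To give a feel for the RESTCONF option, here is a minimal sketch that builds a request for the state of one interface. It assumes the conventions of the RESTCONF specification (RFC 8040): a `/restconf` API root, `module:container` path segments, list keys written as `list=key`, and the `application/yang-data+json` media type. The host name and the helper functions `interfaceURL` and `newInterfaceRequest` are hypothetical, and a given device may expose a different root or a different set of models.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// interfaceURL builds the RESTCONF data URL for one entry of the
// openconfig-interfaces "interface" list. The key is percent-encoded so
// that names containing "/" (e.g. "Ethernet1/1") stay in one path segment.
// Assumption: the device serves RESTCONF under the "/restconf" root.
func interfaceURL(host, ifName string) string {
	return "https://" + host +
		"/restconf/data/openconfig-interfaces:interfaces/interface=" +
		url.PathEscape(ifName)
}

// newInterfaceRequest prepares (but does not send) a GET asking for JSON.
func newInterfaceRequest(host, ifName string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, interfaceURL(host, ifName), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/yang-data+json")
	return req, nil
}

func main() {
	req, err := newInterfaceRequest("switch.example.net", "Ethernet1/1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```

The same path works against any device that implements the model, which is precisely what removes the need for a per-vendor translation layer.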
All of these protocols support some degree of streaming event notifications, with the gRPC transport providing a full-blown pub-sub interface thanks to the bidirectional streaming enabled by HTTP/2. This is crucial for scalability and performance. Constantly polling the device to scrape all its state is expensive for both the device and the collector. Instead, the collector needs to be able to subscribe to the device’s state, so it can be notified as soon as that state changes. This also enables the collector to react significantly faster, typically a few milliseconds after the value of interest has changed, rather than discovering the change at the next collection interval, which is several seconds later at best and often minutes away.
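The difference between polling and subscribing can be sketched with a toy in-memory state store. This is only an illustration of the pattern, not how any particular device or transport implements it; the `Store` and `Update` types, the buffer size, and the drop-on-full policy are all assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// Update is one state change delivered to a subscriber.
type Update struct {
	Path  string
	Value string
}

// Store is a toy state database that pushes changes to subscribers.
type Store struct {
	mu    sync.Mutex
	state map[string]string
	subs  map[string][]chan Update
}

func NewStore() *Store {
	return &Store{
		state: make(map[string]string),
		subs:  make(map[string][]chan Update),
	}
}

// Subscribe returns a buffered channel that receives later changes to path.
func (s *Store) Subscribe(path string) <-chan Update {
	ch := make(chan Update, 16)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.subs[path] = append(s.subs[path], ch)
	return ch
}

// Set records a value and notifies subscribers only if it actually changed.
func (s *Store) Set(path, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if old, ok := s.state[path]; ok && old == value {
		return // unchanged: nothing to send, unlike a poll that re-reads it
	}
	s.state[path] = value
	for _, ch := range s.subs[path] {
		select {
		case ch <- Update{Path: path, Value: value}:
		default:
			// Subscriber is too slow; this sketch drops the update
			// rather than block the writer.
		}
	}
}

func main() {
	const path = "/interfaces/interface[name=Ethernet1]/state/oper-status"
	s := NewStore()
	updates := s.Subscribe(path)

	s.Set(path, "UP")
	s.Set(path, "UP") // suppressed: the value did not change
	s.Set(path, "DOWN")

	for i := 0; i < 2; i++ {
		u := <-updates
		fmt.Printf("%s = %s\n", u.Path, u.Value)
	}
}
```

The collector does no work while nothing changes and hears about a transition as soon as it is written, whereas a poller pays the full cost of reading the state on every interval whether or not anything moved.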
Since the EOS architecture already consists of a stateful publish/subscribe database, with SysDB at its core, the EOS design is particularly well suited to the new requirements of OpenConfig.
A number of our top cloud customers – the so-called “cloud titans” – have told us they have a mandate to “kill SNMP” and automate operations to the point that network operators essentially never have to touch the CLI ever again. They rallied behind the OpenConfig working group as they see it as the only endeavor in the industry that is customer-driven and has received commitment from all the major vendors.
We started working on a RESTCONF implementation last summer and have since also built support for the gRPC transport. We’re hoping to tackle NETCONF next. With our YANG infrastructure in place, the bulk of the effort that remains, besides the NETCONF protocol stack, is to implement support for the YANG models as they are published by the OpenConfig working group. Our implementation is in pure Go and leverages the goyang library contributed to the OpenConfig project by Paul Borman from Google. PS: if you’re interested in building the next-gen network programmability framework in Go, yes, we’re hiring.