ZTPServer – Benchmarking the Web Server Gateway Interface

 
 

Introduction

ZTPServer provides a bootstrap environment for Arista EOS-based products. It is written mostly in Python and leverages standard protocols like DHCP (for boot functions), HTTP (for bi-directional transport), and XMPP/syslog (for logging). Most of the configuration files are YAML-based [ documentation ]. We will benchmark the performance of the ZTPServer using FunkLoad, which will simulate EOS nodes being provisioned.

Objective

The purpose of this post is to evaluate the performance of the ZTPServer with different profiles and modes of operation to determine its ability to scale. We’ll also demonstrate how you can leverage the EOS+ CS team to meet the needs of high-scale deployments. The ZTPServer can be run in two modes:

  • Stand-alone Mode
    • This will run ZTPServer as a single thread and can be started from the terminal by typing $ ztps. This mode is fine for basic testing, lab environments, and small demos. We will not focus on benchmarking this mode since it’s only meant for small-scale testing.
  • Web Server Gateway Interface
    • This mode runs the ZTPServer as a proper Python Web Server Gateway Interface (WSGI) application, leveraging built-in mechanisms in mod_wsgi or similar WSGI adapters to provide process management, threading, logging, and other services. We’ll focus our testing here since this is the mode used for production deployments, where scale comes into consideration; a quick sketch of the WSGI contract follows this list. [ For more information see readthedocs ]
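
To ground the terminology, here is a minimal, self-contained sketch of the WSGI contract itself, using only the Python standard library (this is a generic illustration, not ZTPServer code): the server layer calls application(environ, start_response) once per request, while process management and threading live entirely in that server layer.

from wsgiref.simple_server import make_server

# Any WSGI adapter (mod_wsgi, gunicorn, uWSGI, or the stdlib server below)
# invokes this callable once per request; concurrency and process management
# are the server layer's job, not the application's.
def application(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello from a WSGI application\n']

if __name__ == '__main__':
    # stdlib reference server -- fine for experimentation only
    make_server('', 8080, application).serve_forever()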

Considerations

It’s important to realize that the ZTPServer is not a simple TFTP server that provisions hosts with static files based on a node ID (it can do this in its sleep, but where’s the challenge?). When a typical node is provisioned, there is a sequence of HTTP transactions that build the node’s configuration in a modular way. The ZTPServer may also be used to perform actions like Topology Validation and Resource Allocation, which can require a fair amount of analysis (depending upon your pattern, definition, and resource pool size). You may also want to install a specific EOS version during provisioning. This use case requires you to consider whether you want to host the SWI on the ZTPServer or on some other purpose-built server. This decision can impact bandwidth when you begin to scale to many dozens of nodes being provisioned simultaneously.

Benchmark Testing

Testing with Funkload

The sequence of HTTP transactions within each test case simulates all of the messages required to provision a single node. By using FunkLoad, we can iterate through varying levels of load to observe how quickly the ZTPServer can process our requests. Each profile was tested using 1, 5, 10, 20, 30, 40 and 50 concurrent users. This type of load far exceeds the typical provisioning load we see in practice: there’s generally a low probability that dozens of nodes all power up at the very same instant and have the same DHCP exchange duration prior to querying the ZTPServer, but we’ll stress the ZTPServer to establish a set of bounds.

During a bench run, FunkLoad spins up new virtual users until the desired number of concurrent users is reached, then captures data for a cycle duration (30s here). Note too that within each test, each user fires off a new request after a random pause of between 750ms and 1000ms. To learn more about FunkLoad benching concepts, see this article.
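
For reference, those knobs map onto a FunkLoad .conf file along these lines (the [bench] option names come from FunkLoad; the values mirror the tests described here):

[bench]
# virtual users ramped up across cycles
cycles = 1:5:10:20:30:40:50
# seconds of data captured per cycle
duration = 30
# random pause between requests: 750ms to 1000ms
sleep_time_min = 0.75
sleep_time_max = 1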

Benching Profiles

Let’s put the ZTPServer through its paces by running a few tests. We’ll consider three different profiles:

  • A. Provision Static Nodes (existing node directory)
  • B. Use Neighbordb to Dynamically Provision Nodes (without SWI download)
  • C. EOS+ CS Magic

Profile A: Provision Static Nodes (existing node directory)

This is technically the easiest provisioning deployment. For this test we preload the ZTPServer with sixteen static node directories (001122334455 through 001122334470), each with its own pattern, definition, and startup-config; a typical layout is sketched below. In this type of scenario, the node progresses through the protocol flow shown in the test code that follows.
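
The tree below is an illustrative layout for the preloaded nodes (the data root path is the typical ZTPServer default, but check your own deployment):

/usr/share/ztpserver/nodes/
    001122334455/
        definition
        pattern
        startup-config
    001122334456/
        ...
    001122334470/
        ...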

# Excerpt from a FunkLoad test case; the enclosing module needs the usual
# imports: FunkLoadTestCase, Data (from funkload.utils), and randrange
# (from random).
def only_static(self):
    server_url = self.server_url
    # begin test ---------------------------------------------
    nb_time = self.conf_getInt('test_simple', 'nb_time')
    for i in range(nb_time):
        # pick one of the sixteen preloaded nodes (001122334455-001122334470)
        rand = randrange(0, 15 + 1) + 55
        mac = "0011223344%s" % rand
        post_data = '''{"neighbors": {"Management1":
                    [{"device": "myNeighbord", "port": "000c.295a.4f00"}]},
                    "version": "4.14", "systemmac": "%s",
                    "model": "vEOS", "serialnumber": ""}''' % mac

        self.get(server_url + '/bootstrap', description='Get bootstrap script')
        self.get(server_url + '/bootstrap/config', description='Get bootstrap config')
        self.post(server_url + "/nodes", params=Data('application/json', post_data),
                  description="POST /nodes:%s" % post_data)
        self.get(server_url + '/nodes/%s' % mac, description='Get node definition')
        self.get(server_url + '/actions/replace_config', description='Get action')
        self.get(server_url + '/nodes/%s/startup-config' % mac, description='Get template')
        # end of test -----------------------------------------------
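
If you’d like to run a similar bench yourself, FunkLoad benches are launched with fl-run-bench; the module and class names below are placeholders for wherever the test case above lives:

$ fl-run-bench test_ztps.py ZtpsTestCase.only_static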

Results

The graphs below from FunkLoad illustrate the performance of the ZTPServer under this testing setup at each level of concurrent users. As you can see, the overall load on the ZTPServer is minimal.

[Figure: Successful tests per second (Profile A)]

[Figure: ZTPServer CPU utilization (Profile A)]

[Figure: ZTPServer memory utilization (Profile A)]

Profile B: Use Neighbordb to Dynamically Provision Nodes (without SWI download)

This profile uses neighbordb to dynamically provision nodes. This particular case is a real-world example that includes five allocate() functions within the definition to perform dynamic resource allocation, plus a handful of templates to be applied to the node in order to create a startup-config. The allocate() function has an impact on overall performance, as you will see in the results below. This is because resource allocation is file-based: every time the allocate() function is called, the resource pool file must be read, analyzed, and written back. Since the file is locked during the read/write to prevent corruption, delay is introduced when a significant number of nodes are being provisioned simultaneously (and all try to allocate a resource from the same pool); a simplified sketch of this pattern follows. In this test, each resource pool has over 500 entries in order to accommodate the large number of nodes that get provisioned. We’ll skip downloading the SWI here because it would simply saturate the client node’s network interface and consume the lion’s share of the client’s memory.
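
To make the contention concrete, here is a deliberately simplified sketch of file-based allocation (illustrative Python with a made-up "key: owner" pool format, not ZTPServer’s actual implementation): every caller holds an exclusive lock while it reads, scans, and rewrites the entire pool, so concurrent allocators queue up behind one another.

import fcntl

def allocate(pool_path, node_id):
    """Claim (or re-claim) an entry from a flat-file resource pool."""
    with open(pool_path, 'r+') as pool:
        fcntl.flock(pool, fcntl.LOCK_EX)   # every other allocator now waits
        try:
            entries = dict(line.split(':', 1)
                           for line in pool.read().splitlines() if line.strip())
            # idempotent: reuse a previous assignment for this node
            for key, owner in entries.items():
                if owner.strip() == node_id:
                    return key
            # otherwise claim the first free entry and rewrite the whole file
            for key, owner in entries.items():
                if owner.strip() == 'None':
                    entries[key] = ' %s' % node_id
                    pool.seek(0)
                    pool.truncate()
                    pool.write('\n'.join('%s:%s' % kv for kv in entries.items()))
                    return key
            raise RuntimeError('resource pool exhausted')
        finally:
            fcntl.flock(pool, fcntl.LOCK_UN)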

# Excerpt from a FunkLoad test case; as above, it relies on FunkLoadTestCase,
# Data (from funkload.utils), and randrange (from random).
def no_swi(self):
    server_url = self.server_url
    # begin test ---------------------------------------------
    nb_time = self.conf_getInt('test_simple', 'nb_time')
    for i in range(nb_time):
        # pick one of 541 system MACs (100000000000-100000000540) that
        # neighbordb will match dynamically
        rand = randrange(0, 540 + 1)
        mac = str(100000000000 + rand)
        post_data = '''{"neighbors": {"Management1":
                    [{"device": "myNeighbord", "port": "000c.295a.4f00"}]},
                    "version": "4.14", "systemmac": "%s",
                    "model": "vEOS", "serialnumber": ""}''' % mac

        self.get(server_url + '/bootstrap', description='Get bootstrap script')
        self.get(server_url + '/bootstrap/config', description='Get bootstrap config')
        self.post(server_url + "/nodes", params=Data('application/json', post_data),
                  description="POST /nodes:%s" % post_data)

        self.get(server_url + '/nodes/%s' % mac, description='Get node definition')
        self.get(server_url + '/actions/add_config', description='Get action')
        self.get(server_url + '/files/templates/ma1.template', description='Get template')
        self.get(server_url + '/files/templates/login.template', description='Get template')
        self.get(server_url + '/files/templates/bgpauto.template', description='Get template')
        self.get(server_url + '/files/templates/mlag.template', description='Get template')
        self.get(server_url + '/files/templates/torbase.template', description='Get template')
        self.get(server_url + '/files/templates/xmpp.template', description='Get template')
        self.get(server_url + '/files/templates/ztpprep.template', description='Get template')
        self.get(server_url + '/actions/copy_file', description='Get action')
        self.get(server_url + '/files/automate/ztpprep', description='Get file')
        # end of test -----------------------------------------------

Results

The graphs below from FunkLoad illustrate the performance of the ZTPServer under this testing setup at each level of concurrent users. As you can see, the overall rate of processing has decreased, since we are provisioning each node dynamically and building the startup-config on the fly using a significant amount of resource pool allocation. Note, however, that none of the tests fail, which implies that all of the nodes would be successfully provisioned (i.e., no HTTP timeouts).

[Figure: Successful tests per second (Profile B)]

Notice how the successful tests per second (STPS) figure has decreased from the first profile. Provisioning takes a bit longer since we’re waiting a little longer for each definition response.

[Figure: ZTPServer CPU utilization (Profile B)]

[Figure: ZTPServer memory utilization (Profile B)]

Profile C: EOS+ CS Magic

As discussed above, there are inherent delays when bombarding a resource pool file with parallel read and write requests. So what can you do about it? Here’s just a little something whisked together to show the overall flexibility of the ZTPServer (note this is just a demo, not officially supported). Here, we augment the ZTPServer code to add an allocate_fromDB() function which consults a SQLite database instead of flat resource pool files. This significantly improves resource allocation, as shown below; a hypothetical sketch of what such a function could look like follows. For fun, we’ll kick the concurrent users up to 100.
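
For illustration, an allocate_fromDB() might look something like this sketch (the resources table schema and the function signature are inventions for this example; the actual customization isn’t published): SQLite arbitrates concurrent allocations with a short transaction instead of locking and rewriting an entire flat file.

import sqlite3

def allocate_fromDB(db_path, pool, node_id):
    """Claim (or re-claim) an entry from a SQLite-backed resource pool.

    Assumes a table: resources(pool TEXT, key TEXT, node_id TEXT NULL).
    """
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # one transaction per allocation
            # idempotent: reuse a previous assignment for this node
            row = conn.execute(
                'SELECT key FROM resources WHERE pool=? AND node_id=?',
                (pool, node_id)).fetchone()
            if row:
                return row[0]
            # otherwise claim the first free entry
            row = conn.execute(
                'SELECT key FROM resources WHERE pool=? AND node_id IS NULL '
                'LIMIT 1', (pool,)).fetchone()
            if row is None:
                raise RuntimeError('resource pool %s exhausted' % pool)
            conn.execute(
                'UPDATE resources SET node_id=? WHERE pool=? AND key=?',
                (node_id, pool, row[0]))
            return row[0]
    finally:
        conn.close()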

Results

Notice the overall improved performance. This helps show that there is an incredible amount of flexibility in the ZTPServer, and the EOS+ CS team is here to help address any need. The average response time for /nodes/ (getting the definition) at 50 concurrent users went from an average of 20s with the community version of ZTPServer to 19ms. As we saw above, the community version can accommodate hundreds of provisioning requests and is a great solution for the majority of deployments, but there are always ways to customize and enhance the ZTPServer to suit your needs.

[Figure: Successful tests per second (Profile C)]

[Figure: ZTPServer CPU utilization (Profile C)]

[Figure: ZTPServer memory utilization (Profile C)]
