Ansible vs. Nornir: Speed Challenge

by Patrick Ogenstad
November 05, 2019

When talking about Nornir and Ansible, speed is one of the topics that come up from time to time. A common argument for Nornir is that it performs better when working with either many hosts or lots of data. For some who hear this, it isn’t entirely clear what we mean. This article will look at some numbers. Recently I came across a quote by Kelsey Hightower that stuck with me.

“You haven’t mastered a tool until you understand when it should not be used.”

Let’s see if any of that can be applied here.

How we got here

A few years ago, I tended to solve a lot of my problems with Ansible. As the tool could work with several devices in parallel, it fits with a lot of the things I needed to solve. One day a client asked me to collect data from some IOS XR devices. They needed to create a graph of some information that they couldn’t access using SNMP. At the time when we did this, the only way to collect the data was by issuing show commands through the CLI and parsing the information.

The output from the “show dhcp ipv4 proxy binding” command looked like this:

                                        Lease
 MAC Address    IP Address     State    Remaining  Interface           VRF      Sublabel
-------------- -------------- --------- --------- ------------------- --------- ----------
2cb0.5d00.000a 10.248.159.182 BOUND      8691     BE1.201             default   0x11664
20d5.bf00.000b 10.48.93.39    BOUND      10315    BE1.1530            cust-a    0x1853b
a4b1.e900.000c 10.200.185.166 BOUND      10617    BE1.1502            default   0x1cf76
3091.8f00.000d 10.200.185.165 DELETING   N/A      BE1.1526            default   0x10606
0006.1900.000e 10.184.88.53   OFFER_SENT 27       BE1.1534            cust-b    0xa98d
0026.f200.000f 10.200.185.170 BOUND      10794    BE1.1546            default   0x1bb34
0006.1900.0010 10.184.88.24   OFFER_SENT 54       BE1.1535            cust-b    0x44d0
0002.9b00.0011 10.48.90.0     BOUND      10796    BE1.1543            cust-a    0x1c5cc

For each device we needed to count the number of subscribers in each VRF. So, for this sample output, we would have two subscribers in the BOUND state in the VRF called cust-a. As this was a few years ago I don’t remember the exact numbers but I think there were about 90 of these devices and the number of subscribers in various VRFs could range from 1000 - 30 000 per device.

I created an ntc-template and my idea was to collect the data with an Ansible playbook that parsed the data, then sent it into InfluxDB so that it could be visualized using Grafana. The exact timestamps wouldn’t matter that much since the goal was just to see how many customers should normally be BOUND to each VRF and if there was an issue the on-call staff would be able to see if a current situation was normal or not. For that reason, I just normalized the timestamp and scheduled the Ansible playbook to run every five minutes. After the first run, it all fell apart as it turned out that the playbook wasn’t able to finish within five minutes.

At that time Nornir didn’t exist and to solve the problem at hand I wrote my own tool to collect the data. For an example of what such a tool could look like today, you can find some inspiration here

One could argue that I was stupid all along. Of course, Ansible isn’t the correct tool to gather data. On the other hand, when working with network equipment even just for config tasks I found myself often using the register option in an Ansible playbook to store some variable state from network devices.

What about generating templates

A scenario that is perhaps more common is to generate configuration from templates for network devices. I decided to compare how Ansible compares to Nornir for this task and how the inventory size impacts the speed. The inventory was one I generated using the Faker library, you can find the script for this in the code section.

The templates used were very basic, the one for Ansible looked like this:

OS: {{ os }}
Management: {{ mgmt_v4 }} {{ subnet_mask }}

While this is the one for Nornir:

OS: {{ host.platform }}
Management: {{ host.mgmt_v4 }} {{ host.subnet_mask }}

The Ansible playbook is also very simple:

---
- hosts: all
  connection: local
  gather_facts: no
  tasks:
    - name: Generate template
      template:
        src: "ansible-base.j2"
        dest: "output/ansible/{{ inventory_hostname }}.cfg"

The first test was using an inventory with 1000 devices. When starting the Ansible Playbook on my Macbook Pro, it was as if it just started a session of high intensive training, a few seconds in the internal fans are giving their all. The playbook ran for several minutes.

To do the same within Nornir the code looks like this:

import sys

from nornir import InitNornir
from nornir.plugins.tasks import files, text
from nornir.plugins.functions.text import print_result


def generate(task):
    template = task.run(
        task=text.template_file,
        name="Render",
        template="nornir-base.j2",
        path="templates",
    )
    task.run(
        task=files.write_file,
        name="Write",
        filename=f"output/nornir/{task.host}.cfg",
        content=template.result,
    )


def main(inventory_size):
    nornir = InitNornir(
        inventory={"options": {"host_file": f"nornir-inventory-{inventory_size}.yaml"}},
        dry_run=False
    )
    result = nornir.run(task=generate)
    print_result(result)


if __name__ == "__main__":
    inventory_size = int(sys.argv[1])
    main(inventory_size)

Nornir completed this task in just over two seconds. As I was running this on my laptop both of these examples, of course, had to compete with other processes that were running such as Spotify and whatnot. In order to make the comparison a bit fairer, I set up a machine at DigitalOcean with dedicated CPUs and ran a few more tests there.

I generated different sized inventories based on 100, 1000, 5000 and 10000 hosts and then looked at how many seconds it took to run the tasks using Ansible or Nornir.

It should also be noted that all of this happens without touching the network, it’s just the time it takes to prepare the data before going out to the network. You will however see the same behavior when gathering data or facts from the network.

Benchmark

To test this I setup a General Purpose Droplet at DigitalOcean with 8 GB of RAM and 2 CPUs. The system was running Debian 10 with Python 3.7.3, Ansible 2.9.0 and Nornir 2.3.0.

I ran the tests with four different inventory sizes.

time ansible-playbook -i ansible-inventory-100.yaml ansible-run.yaml
real    0m18.217s
user    0m26.760s
sys    0m8.380s

time ansible-playbook -i ansible-inventory-1000.yaml ansible-run.yaml
real    3m1.560s
user    4m29.763s
sys    1m28.898s

time ansible-playbook -i ansible-inventory-5000.yaml ansible-run.yaml
real    17m47.708s
user    23m44.082s
sys    11m30.803s

time ansible-playbook -i ansible-inventory-10000.yaml ansible-run.yaml
real    41m22.106s
user    47m53.756s
sys    34m6.729s

time python3 nornir-run.py 100
real    0m0.621s
user    0m0.438s
sys    0m0.072s

time python3 nornir-run.py 1000
real    0m2.005s
user    0m1.630s
sys    0m0.353s

time python3 nornir-run.py 5000
real    0m8.906s
user    0m7.321s
sys    0m1.662s

time python3 nornir-run.py 10000
real    0m17.217s
user    0m15.033s
sys    0m3.345s

I could have done more tests with larger inventories too, but I was paying by the hour. One thing to highlight here is that whe you add twice as many hosts to the Nornir run it more or less doubles in time. Using Ansible doubling the amount of hosts or data will more than double the time.

Looking at the above output in a diagram is very telling. Benchmark It may appear that the data for Nornir is missing in the first two tests, it just looks that way because the bar would be so small. Even the other bars are hard to see. It’s easier if we remove the data for the Ansible tests.

When running the tests I used the default forks in Ansible and num_workers, however for this scenario / problem it doesn’t really matter since all of the processing of the data is done on the host regardless if it’s Nornir or Ansible being used.

I don’t have 10 000 devices, is this relevant for me?

The above is just an easy to illustrate an example, remember I hit a ceiling when using around 90 devices and the issue there was the amount of data. The main problem seems to be that Ansible is serializing and deserializing JSON data between every task and internally within the core, at some point this becomes problematic.

It could be that you are collecting some facts from the network and want to use those facts later on in the Ansible playbook.

Is speed that important?

This, of course, depends on your situation. In some scenarios, it could be absolutely critical. A lot of time it doesn’t matter that much, especially when we are talking about background tasks. A big pain point for me is if I have to wait for a computer to complete something, I always try to minimize that time. Having said that I wouldn’t say that speed would be my main argument for using Nornir. Likewise, there’re a lot of scenarios with Ansible where the numbers above are completely irrelevant.

Conclusion and code

I hope that I’ve shed some light on the discussion about speed when working with data within Ansible or Nornir. Again the findings here might not be relevant for your situation, but it’s something to keep in mind. If you want to take a look at the code used in this article look at the networklore-demos repository within the python/nornir/ansible-nornir-speed directory.