Networklore

Managing network devices with Golang using Netrasp

2021-04-12T00:00:00+00:00

Have you let the Gophers into your network yet? With Netrasp, you can let them roam wild. Netrasp is a Go package that connects to network devices with SSH to allow you to send commands and configuration to them. The rasping sound as your network gets screenscraped would come from Netrasp. For people coming from the Python world, you could compare Netrasp to Netmiko.

Getting started with Netrasp

A simple program using Netrasp could look like this.

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/networklore/netrasp/pkg/netrasp"
)


func main() {
	device, err := netrasp.New("router1",
		netrasp.WithUsernamePassword("my_user", "my_password"),
		netrasp.WithDriver("ios"),
	)

	if err != nil {
		log.Fatalf("unable to initialize device: %v", err)
	}

	err = device.Dial(context.Background())
	if err != nil {
		log.Fatalf("unable to connect: %v", err)
	}
	defer device.Close(context.Background())

	output, err := device.Run(context.Background(), "show version")
	if err != nil {
		log.Fatalf("unable to run command: %v", err)
	}
	fmt.Println(output)
}

Configuring devices with Netrasp

If you instead wanted to send configuration commands to a device, a code snippet could look like this.

config := []string{
    "ip access-list extended SOME-TRAFFIC",
    " permit tcp any any eq 22",
    " permit tcp any any eq 443",
}
_, err = device.Configure(context.Background(), config)
if err != nil {
    log.Fatalf("unable to configure device: %v", err)
}

_, err = device.Run(context.Background(), "write memory")
if err != nil {
    log.Fatalf("unable to save config: %v", err)
}

Connection options

If the default connection options don’t work, you might need to change them. For example, when I connect to an older device with the above program, I get this output:

2021/03/09 07:54:51 unable to connect: unable to open connection: unable to establish connection: ssh: handshake failed: ssh: no common algorithm for client to server cipher; client offered: [aes128-gcm@openssh.com chacha20-poly1305@openssh.com aes128-ctr aes192-ctr aes256-ctr], server offered: [aes128-cbc 3des-cbc aes192-cbc aes256-cbc]
exit status 1

The problem here is that the network device doesn’t support modern crypto algorithms, and we need to tell Netrasp to downgrade the security using the WithSSHCipher option.

device, err := netrasp.New("router1",
    netrasp.WithUsernamePassword("my_user", "my_password"),
    netrasp.WithDriver("ios"),
    // Offer an older CBC cipher that the legacy device supports.
    netrasp.WithSSHCipher("aes128-cbc"),
)

Another such option is WithInsecureIgnoreHostKey(), which disables validation of the network device’s public SSH key against a known_hosts file. By default, Netrasp validates keys that exist in /etc/ssh/ssh_known_hosts or in the user’s home directory (~/.ssh/known_hosts). For now, there’s no option to add new, unknown keys with Netrasp.

Netrasp device support

The initial release of Netrasp comes with support for Cisco IOS, Cisco NXOS, and Cisco ASA. As you’ve seen above, you specify the platform using the WithDriver option, currently choosing one of “asa”, “ios”, or “nxos”.

It should be reasonably simple to add support for additional drivers in the future.

Room for improvement

With the initial release, Netrasp mostly cares about whether the underlying SSH transport is working and whether it can find the network devices’ prompts after running a command. It doesn’t care about the syntax of commands or configuration and leaves that up to the user. The following code shows what this might look like.

config := []string{
    "ip access-list extended SOME-TRAFFIC",
    " permit tcp any any eq 22",
    " permit tcp any any 80",
    " permit tcp any any eq 443",
}
output, err := device.Configure(context.Background(), config)
if err != nil {
    log.Fatalf("unable to configure device: %v", err)
}
fmt.Println(output)

Here I’ve made a syntax error and missed the “eq” in front of port 80, so the configuration is invalid. It’s still possible to see these kinds of mistakes. Printing the output will show us this:

 permit tcp any any 80
                    ^
% Invalid input detected at '^' marker.

The current version of Netrasp doesn’t treat this as an error and just continues entering the rest of the commands. I’m considering making this behavior configurable in the future, along with how the result of the configuration is reported. Perhaps the Configure method would also return a slice with all of the sent commands along with any output. That way, it would be easier to know what config should be reverted if needed.
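Until something like that exists, the caller has to inspect the returned output. The check itself is language-agnostic, so here is a minimal sketch of the idea in Python rather than Go. It is not part of Netrasp; the function name find_errors is hypothetical, and it assumes that IOS-style error lines start with a “%” character, as in the output above.

```python
def find_errors(output):
    """Return the lines of device output that look like CLI errors.

    Assumption: the device marks rejected input with a line starting
    with "%", which is what the IOS sample output above shows.
    """
    return [
        line.strip()
        for line in output.splitlines()
        if line.lstrip().startswith("%")
    ]


# The device output shown above, pasted in as test data.
SAMPLE = """\
 permit tcp any any 80
                    ^
% Invalid input detected at '^' marker.
"""

errors = find_errors(SAMPLE)
if errors:
    print(f"device rejected {len(errors)} command(s): {errors}")
```

The same loop over the lines of the output string would work in Go after calling Configure.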

Another issue is that Netrasp doesn’t keep track of the current prompt of the devices. This could be problematic if you were to run the Enable() method while you already reside in the device’s privileged mode.

Netrasp and Gornir

While you can, of course, run Netrasp as any other Go package as done above, another way of using it would be to integrate it with Gornir. That way, you would get a similar experience as when using Netmiko and Nornir in Python.

Future development of Netrasp

Any future development of Netrasp will, to a large extent, be driven by interest from the community. If there’s little or no interest, I probably won’t spend much more time on it. So, take it for a test drive and see what you think. A word of warning: as indicated in the project’s readme file, this is an early version, and there’s a good chance that some parts of the API will change before it’s settled and an initial real version is released. After that, Netrasp will follow semver.

Project details

Netrasp is available on GitHub: https://github.com/networklore/netrasp

Please let me know what you think!

Test coverage of Python packages in Cisco NSO

2019-11-19T00:00:00+00:00

In most of the Python projects I work with, Pytest is used to test the code, and Coverage is used to check which lines the tests execute. For this to work, Coverage must take part in the execution of the Python code. While this isn’t a problem for most projects, working with NSO poses a challenge since the actual Python code for each NSO package gets executed in a separate Python virtual machine. The goal of this article is to show you how you can overcome this obstacle and gain some insight into your test coverage for your NSO Python packages.

How does Coverage work

To make things simpler to understand, we’re going to start by looking at how Coverage works outside of NSO. Start by installing the package:

pip install coverage

To test this, we are going to use a simple Python application we call demo.py.

import sys


def show_output(choice):
    if choice == "hello":
        print("Hello to you too!")
    elif choice == "goodbye":
        print("I wish you goodbye!")
    else:
        print("I don't know what to make of that")


if __name__ == "__main__":
    show_output(sys.argv[1])

Here we have an application that takes at least one argument and prints one of three things depending on what the parameter is.

ᐅ python demo.py hello
Hello to you too!
ᐅ

Running the same thing through Coverage:

ᐅ coverage run demo.py hello
Hello to you too!
ᐅ

The output looks the same, except that Coverage now creates a .coverage file. We then use Coverage to read and parse that file.

ᐅ coverage report -m
Name      Stmts   Miss  Cover   Missing
---------------------------------------
demo.py       9      3    67%   7-10
ᐅ

In this scenario, we don’t have any tests. We only see that 67% of the file got executed. We also see that lines 7 to 10 are missing from the report, and as such, we can’t know if they work. If we were writing real tests for this program, we might want to add tests to cover the missing lines.

It’s also worth pointing out that while we have a code coverage of 67%, we still haven’t written any tests. All we know is that those 67% didn’t cause anything to crash. So, code coverage in itself doesn’t guarantee that much. It can, however, help you see which parts of your code get used within the tests. It can also reveal lines or entire functions that might be dead branches and should be removed from your codebase.
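To make the difference between executing code and testing it concrete, here is a minimal sketch of what tests for demo.py could look like. The file name test_demo.py and the helper captured() are my own inventions for this example, and show_output() is copied in so that the snippet stands on its own. In a real project you would import it from demo.py instead.

```python
import contextlib
import io


def show_output(choice):
    """Copy of the function from demo.py so the example is self-contained."""
    if choice == "hello":
        print("Hello to you too!")
    elif choice == "goodbye":
        print("I wish you goodbye!")
    else:
        print("I don't know what to make of that")


def captured(choice):
    """Run show_output() and return what it printed, minus the newline."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        show_output(choice)
    return buffer.getvalue().strip()


# One test per branch, so every line of show_output() is both
# executed and checked against an expected result.
def test_hello():
    assert captured("hello") == "Hello to you too!"


def test_goodbye():
    assert captured("goodbye") == "I wish you goodbye!"


def test_unknown():
    assert captured("banana") == "I don't know what to make of that"
```

Running the tests through Coverage, for example with coverage run -m pytest test_demo.py, would then report on lines that were actually verified rather than merely executed.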

NSO Test packages

I have created two packages in NSO. The packages themselves won’t do anything exciting; their purpose is to have something in place to show how to use Coverage from within NSO. So, instead of having something that would require network access and a specific NED, these are generic actions that anyone can run.

The packages are:

calc: A simple calculator
greeter: Greets the user with a different message depending on the time of day

The code for the packages lives in the networklore-demos repository on GitHub: https://github.com/networklore/networklore-demos

Demo NSO packages

The NSO action for the calculator can be triggered like this:

admin@ncs> request calc calc number-a 12 operation multiplication number-b 45
message 12 * 45 =
value 540.0
[ok][2019-11-18 18:54:03]
admin@ncs> request calc calc number-a 66 operation division number-b 6
message 66 / 6 =
value 11.0
[ok][2019-11-18 18:54:32]
admin@ncs>

The relevant Python code for the action looks like this:

class CalcAction(Action):
    """calc action."""

    @Action.action
    def cb_action(self, uinfo, name, keypath, user_input, output):
        """cb_action."""
        first = user_input.number_a
        second = user_input.number_b
        value = False
        if user_input.operation == "addition":
            message = f"{first} + {second} ="
            value = first + second
        elif user_input.operation == "subtraction":
            message = f"{first} - {second} ="
            value = first - second
        elif user_input.operation == "multiplication":
            message = f"{first} * {second} ="
            value = first * second
        elif user_input.operation == "division":
            try:
                message = f"{first} / {second} ="
                value = round(first / second, 2)
            except ZeroDivisionError:
                message = "You have to pay extra for that operation"

        output.message = message
        if value is not False:
            output.value = value

The greeter package is even simpler. You can run it like this:

admin@ncs> request greeter greet
message Good evening!
[ok][2019-11-18 19:00:37]
admin@ncs>

The code behind the action:

def time_of_day(hour):
    """Return pleasant time of day."""

    if hour < 5:
        return "night"
    elif hour >= 5 and hour < 12:
        return "morning"
    elif hour == 12:
        return "noon"
    elif hour > 12 and hour < 18:
        return "afternoon"
    elif hour >= 18 and hour < 19:
        return "dinner time"
    elif hour >= 19:
        return "evening"


class GreetAction(Action):
    """Greet action."""

    @Action.action
    def cb_action(self, uinfo, name, keypath, user_input, output):
        """cb_action."""

        hour = datetime.datetime.now().hour
        period = time_of_day(hour)
        if period == "dinner time" or period == "noon":
            output.message = "You must be getting hungry!"
        else:
            output.message = f"Good {period}!"

Both of these packages are obviously quite silly and wouldn’t be terribly useful in a real setup. We only want to use them for demonstration purposes.

An entry point for Coverage

As we saw above, Coverage needs to take part in the execution of the code it is analyzing. When you create a new package in NSO, a simple test using Lux gets created as a starting point. There’s no option to use Coverage from there. For my part, I prefer to write NSO tests using Pytest and communicate with the server over NETCONF. However, even if Coverage can see the Python code that the tests execute, it doesn’t help us with our NSO packages. When we connect to NSO, the server, in turn, handles the code through Python virtual machines.

We can see these VMs from the NSO box.

root@50f85dd023c6:/nso/run# ps -x | grep python
  186 ?        Ssl    0:00 python -u /opt/ncs/current/src/ncs/pyapi/ncs_pyvm/startup.py -l info -f ./logs/ncs-python-vm -i greeter
  203 ?        Ssl    0:00 python -u /opt/ncs/current/src/ncs/pyapi/ncs_pyvm/startup.py -l info -f ./logs/ncs-python-vm -i calc
root@50f85dd023c6:/nso/run#

The startup.py file itself gets started by an ncs-start-python-vm shell script that looks like this:

#!/bin/sh

pypath="${NCS_DIR}/src/ncs/pyapi"

# Make sure everyone finds the NCS Python libraries at startup
if [ "x$PYTHONPATH" != "x" ]; then
    PYTHONPATH=${pypath}:$PYTHONPATH
else
    PYTHONPATH=${pypath}
fi
export PYTHONPATH

main="${pypath}/ncs_pyvm/startup.py"

echo "Starting ${main} $*"
exec python -u ${main} $*

So, this is the place where we need to insert Coverage. The documentation for NSO specifically says that you shouldn’t modify this file since it can get wiped during an upgrade. The correct way is to make a copy of the file and tell NSO to use our modified copy.

The standard startup script starts Python in unbuffered mode (-u), so we should do the same. This doesn’t seem to be an option with Coverage, but we can get the same effect by setting the environment variable PYTHONUNBUFFERED to x. A modified startup script for Coverage can look like this:

#!/bin/sh

pypath="${NCS_DIR}/src/ncs/pyapi"

# Make sure everyone finds the NCS Python libraries at startup
if [ "x$PYTHONPATH" != "x" ]; then
    PYTHONPATH=${pypath}:$PYTHONPATH
else
    PYTHONPATH=${pypath}
fi
export PYTHONPATH

main="${pypath}/ncs_pyvm/startup.py"

echo "Starting ${main} $*"
export PYTHONUNBUFFERED=x
exec coverage run --parallel-mode ${main} $*

By default, Coverage writes its findings to a .coverage file in the current directory. However, the above startup script is used to start multiple Python VMs, and it would be unpredictable for them to write to the same file. To avoid that issue, Coverage is started with the --parallel-mode setting, which creates a separate file for each process.

According to the documentation, we would refer to our start script by adding a section like this to the ncs.conf configuration file.

<python-vm>
  <start-command>
     /nso/run/bin/start-python-coverage-vm.sh
  </start-command>
</python-vm>

However, it seems that changing the configuration in this way causes NSO not to send in any arguments to the shell script, i.e., the $* part which is needed in order to tell the startup script which package to start. An example of what this data might look like: -l info -f ./logs/ncs-python-vm -i calc

A workaround was to instead set the python-vm start-command from NSO, or if this is a throwaway test container it should be fine to modify the original script.

admin@ncs% set python-vm start-command /nso/run/bin/start-python-coverage-vm.sh
[ok][2019-11-19 18:57:21]

[edit]
admin@ncs% commit
Commit complete.
[ok][2019-11-19 18:57:23]

[edit]
admin@ncs%

After the changes are committed, you need to reload the packages or restart NSO.

Now when we start NSO, all of the Python VMs get executed by Coverage, and when we run our test suite, we can see which parts of our code get hit.

We can verify that the Python VMs are started correctly using Coverage:

root@231b8fe85843:/nso/run# ps -x | grep python
  237 ?        Ssl    0:00 /usr/bin/python3 /usr/local/bin/coverage run --parallel-mode /opt/ncs/current/src/ncs/pyapi/ncs_pyvm/startup.py -l info -f ./logs/ncs-python-vm -i greeter
  238 ?        Ssl    0:00 /usr/bin/python3 /usr/local/bin/coverage run --parallel-mode /opt/ncs/current/src/ncs/pyapi/ncs_pyvm/startup.py -l info -f ./logs/ncs-python-vm -i calc
root@231b8fe85843:/nso/run#

Running tests

At this point, you can run your tests just as you would typically do, be it with Lux, Pytest, or through some other means. Since this article isn’t about any specific testing framework, I’m just going to connect to the CLI and run a few actions. In a real test scenario, the output from the commands or some other condition would get verified, but today, we only care about which part of the code gets hit.

admin@ncs> request greeter greet
message Good evening!
[ok][2019-11-19 19:05:23]
admin@ncs> request calc calc number-a 5 operation multiplication number-b 62
message 5 * 62 =
value 310.0
[ok][2019-11-19 19:05:34]
admin@ncs> request calc calc number-a 1812 operation division number-b 5
message 1812 / 5 =
value 362.4
[ok][2019-11-19 19:05:43]
admin@ncs> exit

The Coverage files are still not created at this time:

root@231b8fe85843:/nso/run# ls -la .cov*
ls: cannot access '.cov*': No such file or directory
root@231b8fe85843:/nso/run#

This is because the Python VMs are still running, so Coverage is still waiting for things to happen. The files will only be created when we shut down NSO:

root@231b8fe85843:/nso/run# ncs --stop
root@231b8fe85843:/nso/run# ls -la .cov*
-rw-r--r-- 1 root root 12762 Nov 19 19:06 .coverage.231b8fe85843.237.722345
-rw-r--r-- 1 root root 12895 Nov 19 19:06 .coverage.231b8fe85843.238.879368
root@231b8fe85843:/nso/run#

At this stage, we have one Coverage file for each of our NSO Python packages. We can merge them all into one file before looking at the result.

root@231b8fe85843:/nso/run# coverage combine
root@231b8fe85843:/nso/run# ls -la .cov*
-rw-r--r-- 1 root root 13119 Nov 19 19:07 .coverage
root@231b8fe85843:/nso/run#

By default, we see Coverage data for all Python packages, including the ones from the NSO Pyapi as well as any third-party package you are using. As all of the code we are interested in lives within the ./state folder, we can filter what to include before generating a report.

root@231b8fe85843:/nso/run# coverage report -m --include=./state/*
Name                                                             Stmts   Miss  Cover   Missing
----------------------------------------------------------------------------------------------
state/packages-in-use.cur/1/calc/python/calc/__init__.py             0      0   100%
state/packages-in-use.cur/1/calc/python/calc/calc.py                31      6    81%   16-17, 19-20, 28-29
state/packages-in-use.cur/1/greeter/python/greeter/__init__.py       0      0   100%
state/packages-in-use.cur/1/greeter/python/greeter/greeter.py       29      6    79%   12, 14, 16, 18, 20, 35
----------------------------------------------------------------------------------------------
TOTAL                                                               60     12    80%
root@231b8fe85843:/nso/run#

In this test run, the complete code coverage is 80%. We also see that the code coverage for the calc package is higher than for the greeter package, and which lines in each Python file our test suite misses. I realize that the greeter app is problematic as it is dependent on the time of day. You could, however, create a unit test outside of NSO to test the time_of_day() function, and if desired, you can combine the .coverage file you get from that run with the one above to get a complete picture.
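A unit test like that doesn’t need NSO at all, since time_of_day() is a plain function. Below is a minimal sketch. The function is copied from the greeter package so the example runs on its own, while the EXPECTED table and the test name are my own and just pick hours around each boundary.

```python
def time_of_day(hour):
    """Return pleasant time of day (copied from the greeter package)."""
    if hour < 5:
        return "night"
    elif hour >= 5 and hour < 12:
        return "morning"
    elif hour == 12:
        return "noon"
    elif hour > 12 and hour < 18:
        return "afternoon"
    elif hour >= 18 and hour < 19:
        return "dinner time"
    elif hour >= 19:
        return "evening"


# Hours on both sides of every boundary in the function above.
EXPECTED = {
    0: "night",
    4: "night",
    5: "morning",
    11: "morning",
    12: "noon",
    13: "afternoon",
    17: "afternoon",
    18: "dinner time",
    19: "evening",
    23: "evening",
}


def test_time_of_day():
    for hour, period in EXPECTED.items():
        assert time_of_day(hour) == period, f"hour {hour}"
```

Because the hour is passed in as an argument, the test gives the same result regardless of when it runs, which is exactly what the action itself can’t offer.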

Running in a pipeline

You wouldn’t want to replace the startup command in your production environment. Examine the nso-docker repository for some information regarding how you can create different containers for your development and production environments.

Environment and code

For this test, I was running NSO 4.7.5. The code for the packages and the Coverage startup script are up on GitHub at: https://github.com/networklore/networklore-demos

Specifically, the files for this article are in the coverage folder.

Conclusion

You should now be able to add Coverage data to your CI pipeline when testing your NSO Python packages. Keep in mind, though, that in the article, we also saw that we had quite a high coverage without validating anything. So, we saw that nothing crashed, but there still might be bugs lurking there, i.e., coverage in itself might not always mean that much.

I hope you found this useful!

Ansible vs. Nornir: Speed Challenge

2019-11-05T00:00:00+00:00

When talking about Nornir and Ansible, speed is one of the topics that come up from time to time. A common argument for Nornir is that it performs better when working with either many hosts or lots of data. For some who hear this, it isn’t entirely clear what we mean. This article will look at some numbers. Recently I came across a quote by Kelsey Hightower that stuck with me.

“You haven’t mastered a tool until you understand when it should not be used.”

Let’s see if any of that can be applied here.

How we got here

A few years ago, I tended to solve a lot of my problems with Ansible. As the tool could work with several devices in parallel, it fit a lot of the things I needed to solve. One day a client asked me to collect data from some IOS XR devices. They needed to create a graph of some information that they couldn’t access using SNMP. At the time, the only way to collect the data was by issuing show commands through the CLI and parsing the information.

The output from the “show dhcp ipv4 proxy binding” command looked like this:

                                        Lease
 MAC Address    IP Address     State    Remaining  Interface           VRF      Sublabel
-------------- -------------- --------- --------- ------------------- --------- ----------
2cb0.5d00.000a 10.248.159.182 BOUND      8691     BE1.201             default   0x11664
20d5.bf00.000b 10.48.93.39    BOUND      10315    BE1.1530            cust-a    0x1853b
a4b1.e900.000c 10.200.185.166 BOUND      10617    BE1.1502            default   0x1cf76
3091.8f00.000d 10.200.185.165 DELETING   N/A      BE1.1526            default   0x10606
0006.1900.000e 10.184.88.53   OFFER_SENT 27       BE1.1534            cust-b    0xa98d
0026.f200.000f 10.200.185.170 BOUND      10794    BE1.1546            default   0x1bb34
0006.1900.0010 10.184.88.24   OFFER_SENT 54       BE1.1535            cust-b    0x44d0
0002.9b00.0011 10.48.90.0     BOUND      10796    BE1.1543            cust-a    0x1c5cc

For each device, we needed to count the number of subscribers in each VRF. So, for this sample output, we would have two subscribers in the BOUND state in the VRF called cust-a. As this was a few years ago, I don’t remember the exact numbers, but I think there were about 90 of these devices, and the number of subscribers in various VRFs could range from 1,000 to 30,000 per device.
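To illustrate the counting, here is a small sketch in Python that works on the sample output above. Note that this is not the ntc-template approach I describe next; it simply splits each line on whitespace, and the function name count_bound_per_vrf is made up for this example. It assumes every data row has exactly seven columns and starts with a MAC address.

```python
import re
from collections import Counter

# The sample output from above, pasted in as test data.
SAMPLE_OUTPUT = """\
                                        Lease
 MAC Address    IP Address     State    Remaining  Interface           VRF      Sublabel
-------------- -------------- --------- --------- ------------------- --------- ----------
2cb0.5d00.000a 10.248.159.182 BOUND      8691     BE1.201             default   0x11664
20d5.bf00.000b 10.48.93.39    BOUND      10315    BE1.1530            cust-a    0x1853b
a4b1.e900.000c 10.200.185.166 BOUND      10617    BE1.1502            default   0x1cf76
3091.8f00.000d 10.200.185.165 DELETING   N/A      BE1.1526            default   0x10606
0006.1900.000e 10.184.88.53   OFFER_SENT 27       BE1.1534            cust-b    0xa98d
0026.f200.000f 10.200.185.170 BOUND      10794    BE1.1546            default   0x1bb34
0006.1900.0010 10.184.88.24   OFFER_SENT 54       BE1.1535            cust-b    0x44d0
0002.9b00.0011 10.48.90.0     BOUND      10796    BE1.1543            cust-a    0x1c5cc
"""

# Data rows start with a MAC address; headers and the dashed ruler do not.
MAC_PATTERN = re.compile(r"^[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}$")


def count_bound_per_vrf(output):
    """Count subscribers in the BOUND state for each VRF."""
    counts = Counter()
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 7 or not MAC_PATTERN.match(fields[0]):
            continue  # skip headers, rulers and anything unexpected
        state, vrf = fields[2], fields[5]
        if state == "BOUND":
            counts[vrf] += 1
    return counts


bound = count_bound_per_vrf(SAMPLE_OUTPUT)
print(dict(bound))  # {'default': 3, 'cust-a': 2}
```

The VRF cust-b doesn’t show up in the result since its two subscribers are still in the OFFER_SENT state.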

I created an ntc-template and my idea was to collect the data with an Ansible playbook that parsed the data, then sent it into InfluxDB so that it could be visualized using Grafana. The exact timestamps wouldn’t matter that much since the goal was just to see how many customers should normally be BOUND to each VRF and if there was an issue the on-call staff would be able to see if a current situation was normal or not. For that reason, I just normalized the timestamp and scheduled the Ansible playbook to run every five minutes. After the first run, it all fell apart as it turned out that the playbook wasn’t able to finish within five minutes.

At that time Nornir didn’t exist and to solve the problem at hand I wrote my own tool to collect the data. For an example of what such a tool could look like today, you can find some inspiration here

One could argue that I was stupid all along. Of course, Ansible isn’t the correct tool to gather data. On the other hand, when working with network equipment even just for config tasks I found myself often using the register option in an Ansible playbook to store some variable state from network devices.

What about generating templates

A scenario that is perhaps more common is to generate configuration from templates for network devices. I decided to see how Ansible compares to Nornir for this task and how the inventory size impacts the speed. I generated the inventory using the Faker library; you can find the script for this in the code section.

The templates used were very basic. The one for Ansible looked like this:

OS: {{ os }}
Management: {{ mgmt_v4 }} {{ subnet_mask }}

While this is the one for Nornir:

OS: {{ host.platform }}
Management: {{ host.mgmt_v4 }} {{ host.subnet_mask }}

The Ansible playbook is also very simple:

---
- hosts: all
  connection: local
  gather_facts: no
  tasks:
    - name: Generate template
      template:
        src: "ansible-base.j2"
        dest: "output/ansible/{{ inventory_hostname }}.cfg"

The first test used an inventory with 1000 devices. When I started the Ansible playbook on my MacBook Pro, it was as if the laptop had just begun a high-intensity training session; a few seconds in, the internal fans were giving their all. The playbook ran for several minutes.

To do the same within Nornir the code looks like this:

import sys

from nornir import InitNornir
from nornir.plugins.tasks import files, text
from nornir.plugins.functions.text import print_result


def generate(task):
    template = task.run(
        task=text.template_file,
        name="Render",
        template="nornir-base.j2",
        path="templates",
    )
    task.run(
        task=files.write_file,
        name="Write",
        filename=f"output/nornir/{task.host}.cfg",
        content=template.result,
    )


def main(inventory_size):
    nornir = InitNornir(
        inventory={"options": {"host_file": f"nornir-inventory-{inventory_size}.yaml"}},
        dry_run=False
    )
    result = nornir.run(task=generate)
    print_result(result)


if __name__ == "__main__":
    inventory_size = int(sys.argv[1])
    main(inventory_size)

Nornir completed this task in just over two seconds. As I was running this on my laptop both of these examples, of course, had to compete with other processes that were running such as Spotify and whatnot. In order to make the comparison a bit fairer, I set up a machine at DigitalOcean with dedicated CPUs and ran a few more tests there.

I generated different sized inventories based on 100, 1000, 5000 and 10000 hosts and then looked at how many seconds it took to run the tasks using Ansible or Nornir.

It should also be noted that all of this happens without touching the network, it’s just the time it takes to prepare the data before going out to the network. You will however see the same behavior when gathering data or facts from the network.

Benchmark

To test this, I set up a General Purpose Droplet at DigitalOcean with 8 GB of RAM and 2 CPUs. The system was running Debian 10 with Python 3.7.3, Ansible 2.9.0 and Nornir 2.3.0.

I ran the tests with four different inventory sizes.

time ansible-playbook -i ansible-inventory-100.yaml ansible-run.yaml
real    0m18.217s
user    0m26.760s
sys    0m8.380s

time ansible-playbook -i ansible-inventory-1000.yaml ansible-run.yaml
real    3m1.560s
user    4m29.763s
sys    1m28.898s

time ansible-playbook -i ansible-inventory-5000.yaml ansible-run.yaml
real    17m47.708s
user    23m44.082s
sys    11m30.803s

time ansible-playbook -i ansible-inventory-10000.yaml ansible-run.yaml
real    41m22.106s
user    47m53.756s
sys    34m6.729s

time python3 nornir-run.py 100
real    0m0.621s
user    0m0.438s
sys    0m0.072s

time python3 nornir-run.py 1000
real    0m2.005s
user    0m1.630s
sys    0m0.353s

time python3 nornir-run.py 5000
real    0m8.906s
user    0m7.321s
sys    0m1.662s

time python3 nornir-run.py 10000
real    0m17.217s
user    0m15.033s
sys    0m3.345s

I could have done more tests with larger inventories too, but I was paying by the hour. One thing to highlight here is that when you add twice as many hosts to the Nornir run, it more or less doubles in time. With Ansible, doubling the number of hosts or the amount of data will more than double the time.
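The scaling is easy to check with a bit of arithmetic. The sketch below takes the wall-clock (“real”) times from the runs above, converted to seconds, and compares the step from 5000 to 10000 hosts. The helper names growth and ms_per_host are just something I made up for this calculation.

```python
# Wall-clock ("real") times from the runs above, converted to seconds.
ANSIBLE = {100: 18.217, 1000: 181.560, 5000: 1067.708, 10000: 2482.106}
NORNIR = {100: 0.621, 1000: 2.005, 5000: 8.906, 10000: 17.217}


def growth(times, small, large):
    """How many times longer the larger inventory took than the smaller."""
    return times[large] / times[small]


def ms_per_host(times):
    """Average milliseconds spent per host for each inventory size."""
    return {hosts: 1000 * seconds / hosts for hosts, seconds in times.items()}


ansible_growth = growth(ANSIBLE, 5000, 10000)
nornir_growth = growth(NORNIR, 5000, 10000)
print(f"5000 -> 10000 hosts: Ansible x{ansible_growth:.2f}, Nornir x{nornir_growth:.2f}")

ansible_ms = ms_per_host(ANSIBLE)
nornir_ms = ms_per_host(NORNIR)
for hosts in ANSIBLE:
    print(f"{hosts:>6} hosts: Ansible {ansible_ms[hosts]:.1f} ms/host, "
          f"Nornir {nornir_ms[hosts]:.1f} ms/host")
```

Doubling the inventory makes the Nornir run take a little less than twice as long, while the Ansible run takes well over twice as long.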

Looking at the above output in a diagram is very telling. It may appear that the data for Nornir is missing in the first two tests, it just looks that way because the bar would be so small. Even the other bars are hard to see. It’s easier if we remove the data for the Ansible tests.

When running the tests, I used the default values for forks in Ansible and num_workers in Nornir. For this scenario, however, it doesn’t really matter since all of the processing of the data is done on the machine running the tool, regardless of whether it’s Nornir or Ansible being used.

I don’t have 10 000 devices, is this relevant for me?

The above is just an easy-to-illustrate example. Remember, I hit a ceiling when using around 90 devices, and the issue there was the amount of data. The main problem seems to be that Ansible is serializing and deserializing JSON data between every task and internally within the core; at some point this becomes problematic.

It could be that you are collecting some facts from the network and want to use those facts later on in the Ansible playbook.

Is speed that important?

This, of course, depends on your situation. In some scenarios, it could be absolutely critical. A lot of the time it doesn’t matter that much, especially when we are talking about background tasks. A big pain point for me is having to wait for a computer to complete something, so I always try to minimize that time. Having said that, I wouldn’t say that speed would be my main argument for using Nornir. Likewise, there are a lot of scenarios with Ansible where the numbers above are completely irrelevant.

Conclusion and code

I hope that I’ve shed some light on the discussion about speed when working with data within Ansible or Nornir. Again the findings here might not be relevant for your situation, but it’s something to keep in mind. If you want to take a look at the code used in this article look at the networklore-demos repository within the python/nornir/ansible-nornir-speed directory.

Introducing Nornir - The Python automation framework

2018-05-05T00:00:00+00:00

Nornir is a new automation framework written in Python and intended to be consumed directly from Python. You could describe it as the automation framework for Pythonistas. This might strike you as something wonderful, or it could trigger your spider-sense. Writing code? Isn’t that just for programmers?

Note, regarding names

Initially Nornir was called Brigade, but we changed the name due to a naming conflict with another tool.

Stop! Why do we need yet another automation tool?

There are a lot of tools around these days. You have Ansible, Salt, Chef, Puppet. Then there are things like StackStorm, Automatron and so on. When does it stop? How can you choose when this space is getting so crowded?

Nornir is a bit different compared to other tools in that you write your own Python code to control the automation. For example, Ansible is written in Python but uses its own DSL, which you use to describe what you want to have done.

What’s in it for me?

In Nornir we don’t use any DSL; instead, you write everything in Python. There are a few benefits that motivate this choice.

A DSL can get complex when you need “advanced” features
Code is easier to troubleshoot and debug, you can leverage existing Python logging and debugging tools
Sometimes you end up writing code anyway
You can leverage your existing IDE for code completion, linting and such
Easy integration into your existing Python code

I am a non-coder, is Nornir really for me?

For a lot of people, the idea of having to write your own code can be a deal-breaker. I think a large part of this is psychological and people tend to fool themselves. Regardless of which tool you choose, there will be a learning curve. A DSL that implements enough control structures and variable manipulation to be Turing complete is a programming language even if it’s not labelled as such. You don’t think about the fact that you are learning programming concepts, you just do.

A side benefit of using Nornir is that it will also teach you Python, which I think will be a lot more useful to you compared to mastering a DSL. To see some examples of how you can create tools using Nornir, we’ve set up the nornir-tools repo.

What can you do with Nornir?

To put it simple Nornir works with generic collections of data and execute tasks based on that data. Generally, this means that you have a number of hosts and groups along with some data associated with each element, this is the inventory. Then you have a number of tasks that you wish to run against the hosts or a subset of the hosts in your inventory. The tasks that you want to execute are regular Python functions.

A task can be something very simple or as complex as you want it to be. One of the included task plugins, used to read JSON files, looks like this:

import json
from nornir.core.task import Result

def load_json(task, file):
    with open(file, "r") as f:
        data = json.loads(f.read())

    return Result(host=task.host, result=data)

Basically, it’s a function which takes a task object as its first argument, followed by the arguments needed for the task. A Result object is then returned to the Nornir task handler. Nornir 1.0.0 was just released and it ships with a small number of task plugins. While this number will naturally grow, you will also find yourself writing your own plugins; as you can see from the example above, it can be quite simple.

Getting started with Nornir

Nornir is available through PyPI and you install it like any other Python package.

pip install nornir

Before running Nornir you need an inventory containing your hosts and groups. The input to the Nornir inventory is one or two Python dictionaries: one for hosts and an optional one for groups. The first release of Nornir includes three inventory plugins to make this easier for you.

SimpleInventory - Where you create your inventory in the standard Nornir format
AnsibleInventory - Which reads an Ansible ini or yaml inventory file
NSOTInventory - Which connects to the NSOT api to build the inventory

If you’re just starting out, the SimpleInventory is a good choice. It is also the default if you don’t specify anything; it will look for a file called hosts.yaml in the current directory and an optional groups.yaml. Here’s an example of a basic inventory of two Cisco devices I have at home.

hosts.yaml

---
og-sw-01:
  groups: ['home_network']

og-ap-01:
  groups: ['home_network']

groups.yaml

---
home_network:
  nornir_username: patrick
  nornir_password: ReallyS3cret!
  nornir_nos: ios
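The reason the credentials can live in groups.yaml is that a host picks up data from the groups it belongs to. The following is a simplified model of that lookup using plain dictionaries. The resolve helper is hypothetical and written only for this illustration; it is not part of Nornir, and the real inventory classes handle more cases.

```python
# The two inventory files above, expressed as Python dictionaries.
hosts = {
    "og-sw-01": {"groups": ["home_network"]},
    "og-ap-01": {"groups": ["home_network"]},
}
groups = {
    "home_network": {
        "nornir_username": "patrick",
        "nornir_password": "ReallyS3cret!",
        "nornir_nos": "ios",
    },
}


def resolve(host_name, key, hosts, groups):
    """Look for key on the host first, then on each of its groups in order."""
    host = hosts[host_name]
    if key in host:
        return host[key]
    for group_name in host.get("groups", []):
        if key in groups.get(group_name, {}):
            return groups[group_name][key]
    raise KeyError(key)


print(resolve("og-sw-01", "nornir_nos", hosts, groups))
```

With a lookup like this, a value set directly on a host would win over the group value, which is handy for the one device that needs a different username.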

Then to run Nornir I create a short runner in Python.

run-nornir.py

from nornir.core import InitNornir
from nornir.plugins.tasks.networking import napalm_get
from nornir.plugins.functions.text import print_result

nr = InitNornir()

result = nr.run(
             napalm_get,
             getters=['get_facts'])

print_result(result)

To take it from the top, we start by importing InitNornir, which is a simple way to initialize Nornir; in this example we just accept all of the default settings. The next import is for napalm_get, which uses the Napalm library to collect information from network devices. The last import, print_result, is a function that helps us print the output to screen.

The first line of actual code “nr = InitNornir()” initializes Nornir and stores it in the nr variable. Next we run the Napalm getter against the hosts in our inventory and store the result in the result variable. Finally, the output is printed to screen.

Who are the people behind Nornir?

Nornir was initially created by David Barroso, who also brought us Napalm. Currently, the other members of the project are Kirk Byers as well as myself, Patrick Ogenstad. While we’re all part of the networking world Nornir isn’t by any means limited to that domain. The current setup is more a result of the fact that we already knew each other.

More information

For more information about this project visit the Nornir documentation or look at the Nornir source code. You can also hear us talking about Nornir on the Software Gone Wild Episode #90 podcast, please note that during the recording the project was still called Brigade. Just mentally replace Brigade with Nornir as you listen and you should be fine. :)

Extending Ansible action plugins for Cisco IOS

2017-10-30T00:00:00+00:00

It started out as a question. If you are using several networking modules in a playbook, do you really have to repeat the same credentials on every task? Just like the last few articles about Ansible, this one came to life after answering questions in a chat room. The short answer is: no, you don’t have to include all of the required parameters for every task, you can use an action plugin to work around that.

Great! So what’s an action plugin?

What were we trying to do?

Looking at the playbook below, we define a cli variable for the credentials and then use the provider parameter on each of the ios modules we want to use.

---
-  hosts: all
   connection: local
   gather_facts: false
   vars:
     cli:
       username: admin
       password: Password1

   tasks:
     - name: Facts
       ios_facts:
         provider: '{{ cli }}'

     - name: Baseline
       ios_config:
         provider: '{{ cli }}'
         lines:
          - 'no ip http server'
          - 'no ip http secure-server'
          
     - name: VTY
       ios_config:
         provider: '{{ cli }}'
         parents: 'line vty 5 15'
         lines: 'transport preferred ssh'

The question was:

“If I have fifty tasks in my playbook do I really have to specify the same provider for each and every task?”

While it’s only one extra line it’s hard to argue against the fact that it’s a bit redundant, especially since we’re using a persistent connection in the background. One way to solve this could be to set the environment variables ANSIBLE_NET_USERNAME and ANSIBLE_NET_PASSWORD. At the moment this would trigger a deprecation warning, since those variables are tied to the old username and password parameters which are being phased out. That could perhaps be considered a bug since the variables should get assigned to provider.username and provider.password instead. While using environment variables would work, you might be using Ansible Vault or some other secrets store and not want to export the secrets as environment variables, or have to care about how they are cleared after the playbook completes. Another way to solve this issue is to look at action plugins.

Action plugins in Ansible

Perhaps the most relevant question, if you haven’t heard about action plugins, is: do I have to care about them? For a lot of people the answer is no. Action plugins will mostly be of interest to developers writing their own Ansible modules. However, they might also be relevant for people who just want to understand how things work. Before digging into action plugins it can be helpful to start with regular Ansible modules, and a good beginning would be to compare the code of the slack module to that of the template module. They both start in the same way with the variables DOCUMENTATION and EXAMPLES, which are in fact used to generate the documentation for Ansible. Further down in the module code you see that the slack module is quite easy to understand, if you know some basic Python that is. The template module can be a bit harder to grasp. Below the documentation there’s no code whatsoever. What’s up with that? Magic? If you stop to think about it, the slack module doesn’t really need any additional information; everything it needs to work is sent in as parameters from the playbook. The template module on the other hand needs access to all of the variables within the current Ansible run.

What happens is that Ansible searches for an action plugin with the same name as the module. With the case of the template module all the logic is placed in its action plugin.

Looking in the source code for all of the action plugins we can see that the networking modules generally have two action plugins. For Cisco IOS there’s ios.py and ios_config.py; the ios_config module uses the ios_config.py file. The other ios networking modules such as ios_facts, ios_command and ios_logging just use the prefix ios and then load the ios.py action plugin. The reason that ios_config needs a separate one has to do with things like templates and creation of backups of the configuration.
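The lookup described above can be modelled in a few lines of Python. This is a simplified sketch based only on the description in this article; find_action_plugin is a hypothetical helper, not Ansible's loader code, and the real task executor applies more rules than this. A return value of None here just means "no dedicated action plugin in this model".

```python
def find_action_plugin(module_name, available_plugins):
    """Pick an action plugin name for a module, or None if there is none.

    Simplified model: an exact match wins, otherwise fall back to the
    prefix before the first underscore (ios_facts -> ios).
    """
    if module_name in available_plugins:
        return module_name
    prefix = module_name.split("_", 1)[0]
    if prefix in available_plugins:
        return prefix
    return None


# Plugin names taken from the article; the set itself is illustrative.
plugins = {"ios", "ios_config", "template"}
for module in ("ios_config", "ios_facts", "ios_command", "template", "slack"):
    print(module, "->", find_action_plugin(module, plugins))
```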

Each action plugin uses the ActionModule class and triggers the run() function when a module is called. In the source code we can see that the ios_config plugin inherits from the ios action plugin. Looking at the ios.py code we see that it sets the username and password based on the values of the keys username and password within the provider parameter, or if those aren’t set it looks in the self._play_context.connection_user and self._play_context.password variables. If we could overwrite that part we’d be good to go.
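As a rough sketch of that fallback order, assuming nothing beyond what is described above: the resolve_credentials function and the SimpleNamespace stand-in for self._play_context are made up for illustration and are not the code from ios.py.

```python
from types import SimpleNamespace


def resolve_credentials(provider, play_context):
    """Prefer values in provider, fall back to the play context."""
    provider = provider or {}
    username = provider.get("username") or play_context.connection_user
    password = provider.get("password") or play_context.password
    return username, password


# Stand-in for self._play_context; only the two attributes used here.
context = SimpleNamespace(connection_user="fallback_user", password="fallback_pw")

# Provider values win when they are present.
print(resolve_credentials({"username": "admin", "password": "Password1"}, context))
# Without a provider the play context is used.
print(resolve_credentials({}, context))
```

This is why overwriting the two play context attributes is enough: when a task has no provider, those are the values that end up being used.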

Using your own action plugins

When creating your own action plugins, Ansible needs to be aware of the fact that they exist. To do this you can either place them in a directory called action_plugins at the base of your playbook, or point to the directory which contains your action plugins from your ansible.cfg file.

[defaults]

action_plugins = /opt/ansible/plugins/action

Extending the action plugin for ios_facts

First we update the playbook so it reflects what we want to have:

---
-  hosts: all
   connection: local
   gather_facts: false
   vars:
     cli:
       username: admin
       password: Password1

   tasks:
     - name: Facts
       ios_facts:

     - name: Baseline
       ios_config:
         lines:
          - 'no ip http server'
          - 'no ip http secure-server'

     - name: VTY
       ios_config:
         parents: 'line vty 5 15'
         lines: 'transport preferred ssh'

Running the playbook now will return the good old “unable to open shell” error. Nice.

The first goal is just to avoid typing the username and password so we’ll hardcode it within the action plugin.

/opt/ansible/plugins/action/ios.py:

from ansible.plugins.action.ios import ActionModule as _ActionModule

class ActionModule(_ActionModule):

    def run(self, tmp=None, task_vars=None):
        self._play_context.connection_user = 'admin'
        self._play_context.password = 'Password1'
        result = super(ActionModule, self).run(tmp, task_vars)
        return result

The idea is that the above would set our two variables and then just call the run function from the object we inherited from. Let’s test this!

It would have been nice if that worked. When I saw this I was struggling to find out what was going on. Some import error, the ActionModule doesn’t exist in the ansible.plugins.action.ios namespace? Looking in the code I can see that it clearly does. I tried the import in another way.

import ansible.plugins.action.ios

class ActionModule(ansible.plugins.action.ios.ActionModule):

    def run(self, tmp=None, task_vars=None):
        self._play_context.connection_user = 'admin'
        self._play_context.password = 'Password1'
        result = super(ActionModule, self).run(tmp, task_vars)
        return result

AttributeError on the module object? What’s going on? :) Testing with another file when instead importing ansible.module_utils.ios I validated that it wasn’t some weird naming issue where it wasn’t possible to load any other file named ios.py. There seemed to be some override happening for action plugins. Ansible used to do this for modules in the past where the from ansible.module_utils.basic import * which was required in all modules wasn’t actually interpreted as Python code but was instead a placeholder. I didn’t go searching through the code to see if the same thing was happening here, instead I modified my code and tried to import the original action plugin in using the Python imp library instead.

import imp
ansible_path = imp.find_module('ansible')[1]
plugin_file = 'plugins/action/ios.py'
src = '{0}/{1}'.format(ansible_path, plugin_file)
ios = imp.load_source('ios', src)


class ActionModule(ios.ActionModule):

    def run(self, tmp=None, task_vars=None):
        self._play_context.connection_user = 'admin'
        self._play_context.password = 'Password1'
        result = super(ActionModule, self).run(tmp, task_vars)
        return result

This time it works better! The ios_config tasks are still failing, but that’s because we haven’t created an action plugin for that module yet.
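One caveat for anyone reading this with a recent Python: the imp module has been deprecated for a long time and was removed in Python 3.12. A rough equivalent of imp.load_source() can be written with importlib. The sketch below is an assumption about how you might port the workaround, not something tested against any particular Ansible release, and it uses a throwaway file in place of the real plugin so that it can run anywhere.

```python
import importlib.util
import os
import tempfile


def load_source(name, path):
    """Load a Python file as a module object, similar to imp.load_source()."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Demonstration with a throwaway file standing in for the real plugin file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ios.py")
    with open(path, "w") as handle:
        handle.write("class ActionModule(object):\n    name = 'parent'\n")
    ios = load_source("ios", path)


# Inherit from the class found in the loaded file, as in the plugin above.
class ActionModule(ios.ActionModule):
    pass


print(ActionModule.name)
```

To locate the installed package, importlib.util.find_spec('ansible').submodule_search_locations plays roughly the role that imp.find_module('ansible')[1] does above.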

What about the username and password?

A big glaring problem here though is that I hardcoded the username and password within the action plugin. How can we access the variables we defined in the playbook? Or use something from Ansible vault? If you look at the arguments to the run() function you might notice the task_vars argument which actually contains everything we need.

Final action modules

ios.py:

import imp
ansible_path = imp.find_module('ansible')[1]
plugin_file = 'plugins/action/ios.py'
src = '{0}/{1}'.format(ansible_path, plugin_file)
ios = imp.load_source('ios', src)


class ActionModule(ios.ActionModule):

    def run(self, tmp=None, task_vars=None):
        if task_vars.get('cli'):
            if task_vars['cli'].get('username'):
                username = task_vars['cli']['username']
                self._play_context.connection_user = username
            if task_vars['cli'].get('password'):
                self._play_context.password = task_vars['cli']['password']
        result = super(ActionModule, self).run(tmp, task_vars)
        return result

ios_config.py:

import imp
ansible_path = imp.find_module('ansible')[1]
plugin_file = 'plugins/action/ios_config.py'
src = '{0}/{1}'.format(ansible_path, plugin_file)
iosconfig = imp.load_source('ios_config', src)


class ActionModule(iosconfig.ActionModule):

    def run(self, tmp=None, task_vars=None):
        if task_vars.get('cli'):
            if task_vars['cli'].get('username'):
                username = task_vars['cli']['username']
                self._play_context.connection_user = username
            if task_vars['cli'].get('password'):
                self._play_context.password = task_vars['cli']['password']
        result = super(ActionModule, self).run(tmp, task_vars)
        return result

Now everything works. Mission accomplished!
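The two run() functions are identical, so the credential handling could be pulled out into one function. The sketch below shows one way to do that; apply_cli_credentials is a hypothetical helper, and where to put it so that both plugins can import it is left open. It also guards against task_vars being None, which is the default value in the run() signature.

```python
from types import SimpleNamespace


def apply_cli_credentials(play_context, task_vars, var_name="cli"):
    """Copy username/password from task_vars[var_name] onto the play context."""
    cli = (task_vars or {}).get(var_name) or {}
    if cli.get("username"):
        play_context.connection_user = cli["username"]
    if cli.get("password"):
        play_context.password = cli["password"]
    return play_context


# Stand-in for self._play_context so the helper can be tried outside Ansible.
context = SimpleNamespace(connection_user=None, password=None)
apply_cli_credentials(context, {"cli": {"username": "admin", "password": "Password1"}})
print(context.connection_user)
```

Inside a plugin, the body of run() would then shrink to a call to apply_cli_credentials(self._play_context, task_vars) followed by the super() call.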

Conclusion

For my part I don’t know if I use so many tasks in each playbook that adding the provider argument to each one is that much of a hassle. Still, I like the fact that I can work around it if I want to. This article also shows you how you can work with action plugins and create your own ones. Perhaps you like Ansible but aren’t too fond of Jinja and want to make a mako_template module. It might be that you want to create something for the Napalm or ntc-ansible modules. As long as you create your own modules and don’t want to extend existing action plugins you won’t have to worry about the import workaround I implemented above, and you should be able to import the parent ActionModule from another plugin, or ActionBase from ansible.plugins.action.

A final word of warning, the above action plugins were written for Ansible 2.4, they might or might not work for future versions of Ansible.

How to save IOS configurations with Ansible

2017-10-23T00:00:00+00:00

At the outset, a 1200 word article about saving configuration sounds strange. It would perhaps be perfectly normal if the topic was Vi and not Ansible; however, there’s a reason for this and it’s simply speed and idempotency. Saving the configuration in the “wrong” way can take quite a lot of time, and one reason for network automation is to accomplish tasks faster and constantly search for ways to improve your processes. This article assumes that you are running Ansible 2.4, but it should work in a similar way regardless.

What is the wrong way to save

First of all, it might not be obvious that changes made by the ios_config module aren’t saved by default. The module does however allow you to save the configuration using the save_when parameter. You could say that it’s a bit harsh to say that it’s wrong to use that option, but I don’t like it very much.

In Ansible 2.2 the save parameter was introduced, it was a boolean option which allowed you to choose if the configuration should be saved or not. While this allowed users to start to save their configurations there was a small problem with this. The save parameter wasn’t idempotent.

- name: Configure SNMP location
  ios_config:
    provider: "{{ cli }}"
    lines: "snmp-server location TEST"
    save: true

Running the above task multiple times wouldn’t change the configuration, it would however always save the configuration and set the changed key to True. So it would look like something changed on the device. Ansible 2.4 deprecated the save parameter and instead introduced save_when which you can set to never (default), always (same as the old save) and modified.

Using “save_when: modified”

Imagine a playbook which looks like the one below, it would have three tasks where each one would trigger a change and save the configuration.

- name: Set SNMP location 1
  ios_config:
    lines: "snmp-server location TEST1"
    save_when: modified

- name: Set SNMP location 2
  ios_config:
    lines: "snmp-server location TEST2"
    save_when: modified

- name: Set SNMP location 3
  ios_config:
    lines: "snmp-server location TEST3"
    save_when: modified

Obviously a playbook like this doesn’t make any sense, the point is that the configuration would be saved on each task. On some devices this wouldn’t matter that much, however it’s not that uncommon that it takes several seconds to write the configuration to flash. On older devices we could be talking about up to a minute so the above playbook could take close to three minutes to complete. If the device is slow enough to take longer than 10 seconds for it to save the configuration the task will fail due to the default timeout of ten seconds. To avoid an issue like that you might have to do something like this instead:

- name: Set SNMP location 1
  ios_config:
    lines: "snmp-server location TEST1"
    timeout: 40
    save_when: modified

- name: Set SNMP location 2
  ios_config:
    lines: "snmp-server location TEST2"
    timeout: 40
    save_when: modified

- name: Set SNMP location 3
  ios_config:
    lines: "snmp-server location TEST3"
    timeout: 40
    save_when: modified

Based on the above it should be quite clear that it’s probably a good idea not to have a save statement for every instance of ios_config. Of course another issue with this is that the ios_config module has an option to save the configuration, but other ios modules released with Ansible 2.4 such as ios_interface, ios_logging and ios_system don’t.

When is the configuration saved?

In Ansible 2.4 the modified argument will save the configuration if the startup configuration is different than the running configuration. So you could opt for only using the save_when parameter on the last command as in:

- name: Set SNMP community
  ios_config:
    lines: "snmp-server community networklore"

- name: Set SNMP contact
  ios_config:
    lines: "snmp-server contact SUPPORT"

- name: Set SNMP location
  ios_config:
    lines: "snmp-server location TEST"
    timeout: 40
    save_when: modified

In theory this would only save the configuration if it actually needed to be saved. However on Cisco IOS there exists a number of scenarios where the running configuration could be changed in a dynamic way which either doesn’t appear in the startup configuration or it isn’t relevant from a change management perspective. The most typical example is the ntp period which can look like this in the running configuration:

ntp clock-period 17174051

The counter will change over time, and running a playbook with save_when: modified could trigger a change even if the configuration hasn’t actually changed. In order to work around this you can use the diff_ignore_lines parameter to ios_config and send in a list of lines to ignore. Even if you find all of the lines you want to ignore when trying to determine if a save is needed or not, you might still run into issues with certificates. Let’s compare how a self-signed certificate is shown in the running configuration compared to the startup configuration.

R1# show running-config | begin crypto pki certificate
crypto pki certificate chain TP-self-signed-963592824
 certificate self-signed 01
  3082032E 30820216 A0030201 02020101 300D0609 2A864886 F70D0101 05050030
  30312E30 2C060355 04031325 494F532D 53656C66 2D536967 6E65642D 43657274
  69666963 6174652D 39363335 39323832 34301E17 0D313731 30323331 34323832
[...]
 058D6639 A7F9CB5B F3BBA7FF 093E745B B50DB469 66492830 060E4C07 47B72D0C
  CB47AEA7 F0F17C7B 15D10F88 D705CB8B 6F2BE8EA 91E9073E 504F231D AFC5F830
  67E4C681 B912CF40 8023B729 19AFAF8E 0F1E
  	quit

R1# show startup-config | begin crypto pki certificate
crypto pki certificate chain TP-self-signed-963592824
 certificate self-signed 01 nvram:IOS-Self-Sig#1.cer

The output from the show running-config command above is abbreviated for readability, but just by looking at the short snippet it should be obvious that it isn’t practical to compare the two line by line and do a diff; in this case diff_ignore_lines doesn’t offer much help. What the ios_config module does is to run show running-config followed by show startup-config, subtract the diff_ignore_lines and calculate a sha1 hash of the resulting configurations. If the hashes differ, Ansible sends a copy running-config startup-config command.

The result could be that the playbook isn’t idempotent and also takes longer time to complete as we have to wait for the configuration to be saved to startup.
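The comparison can be sketched in plain Python to show why a line such as ntp clock-period matters. This is a simplified model of the steps described above, not the module's actual code; config_digest and needs_save are hypothetical names, and treating the ignore entries as regular expressions is an assumption of this sketch.

```python
import hashlib
import re


def config_digest(config, ignore_patterns=()):
    """Hash a configuration after dropping lines that match any ignore pattern."""
    kept = []
    for line in config.splitlines():
        if any(re.search(pattern, line) for pattern in ignore_patterns):
            continue
        kept.append(line)
    return hashlib.sha1("\n".join(kept).encode("utf-8")).hexdigest()


def needs_save(running, startup, ignore_patterns=()):
    """Return True when the two configurations differ after filtering."""
    return config_digest(running, ignore_patterns) != config_digest(startup, ignore_patterns)


# Two configurations that only differ in the dynamic NTP counter.
running = "hostname R1\nntp clock-period 17174051\n"
startup = "hostname R1\nntp clock-period 17174042\n"

print(needs_save(running, startup))
print(needs_save(running, startup, ignore_patterns=[r"^ntp clock-period"]))
```

A certificate that is stored inline in one configuration and as a file reference in the other spans many lines, which is why a short ignore list does not help much there.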

Saving with a handler

A solution to this problem is to notify a handler. A handler lets you run a specific task if another one has changed. You can have several tasks which notify the same handler and it will still only run once.

---
-  hosts: all
   connection: local
   gather_facts: no

   tasks: 
    - name: Set SNMP community
      ios_config:
        lines: "snmp-server community networklore"
      notify: "save ios"

    - name: Set SNMP contact
      ios_config:
        lines: "snmp-server contact SUPPORT"
      notify: "save ios"

    - name: Set SNMP location
      ios_config:
        lines: "snmp-server location TEST"
      notify: "save ios"

   handlers:
     - name: save ios
       ios_command:
         commands: "write mem"
         timeout: 40
       when: not ansible_check_mode

This playbook would only save the configuration if any of the config tasks resulted in a changed status. The reason to add the when condition is so that the handler wouldn’t trigger a save if the playbook was run in check mode.
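If the handler behaviour feels abstract, the following toy model in Python shows the two properties that matter here: a handler runs at most once no matter how many tasks notify it, and it is skipped in check mode. It is only an illustration of the semantics described above; run_play and the task dictionaries are made up and have nothing to do with how Ansible is implemented.

```python
def run_play(tasks, handlers, check_mode=False):
    """Run tasks in order, then run each notified handler at most once."""
    notified = []
    for task in tasks:
        changed = task["action"]()
        if changed and task.get("notify") and task["notify"] not in notified:
            notified.append(task["notify"])

    executed = []
    for name in notified:
        if check_mode:
            # Mirrors the `when: not ansible_check_mode` condition above.
            continue
        handlers[name]()
        executed.append(name)
    return executed


saves = []
# Each action returns True when it "changed" something on the device.
tasks = [
    {"name": "Set SNMP community", "action": lambda: True, "notify": "save ios"},
    {"name": "Set SNMP contact", "action": lambda: False, "notify": "save ios"},
    {"name": "Set SNMP location", "action": lambda: True, "notify": "save ios"},
]
handlers = {"save ios": lambda: saves.append("write mem")}

print(run_play(tasks, handlers))
print(saves)
```

Two of the three tasks report a change, yet the save is only issued once at the end.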

Roles and larger playbooks

If you are using a larger structure with several roles you don’t want to trigger a handler local to each of the roles, as that would just trigger a save action for each role. Instead you can create a specific role to save the configuration and add that role as a dependency in main.yml under the meta folder, as in:

---
dependencies:
  - role: ios_wr_mem

Summary

I hope this will help you to make sure your playbooks remain idempotent as well as perhaps shave off some time for your runs. Hopefully you learned something more about Ansible too.

How to kill your network with Ansible

2017-10-03T00:00:00+00:00

Aside from being a user, I write about Ansible and try to help others to understand how it works. A few days ago I was answering questions from other Ansible users. Someone was having trouble figuring out why the ios_config module didn’t apply his template correctly. I explained what was wrong with the template, afterwards I thought about the issue some more and realized that the error could potentially be really dangerous. As in a game-over-level-event for your employment dangerous.

A scenario

Let’s imagine you are a network administrator who has just been given the task of rolling out a new ip network to all your branch offices. You are running Ansible 2.4 (this might work differently in later versions once they are released). The new addressing details have already been prepopulated in the IPAM system, and the dynamic inventory script will pull that information for you. What you need to do is to expand the template by adding the new networks and push out the change. Also, the guy who used to manage the network had entered “wormhole” as the description for all of the interfaces pointing to the wide area network. While you’re adding those ip networks, can you please change the description to “WAN”? The added network will be used by a new digital signage solution.

You start by reviewing the current template which looks like this:

interface FastEthernet0
 description Wormhole exit
 ip address {{ wan_ip }} {{ wan_mask }}

interface FastEthernet1
 description POS
 ip address {{ pos_ip }} {{ pos_mask }}

interface FastEthernet2
 description OFFICE
 ip address {{ office_ip }} {{ office_mask }}

Logging into the router in Wisconsin you see that it is configured as follows:

interface FastEthernet0
 description Wormhole exit
 ip address 172.29.58.161 255.255.255.224

interface FastEthernet1
 description POS
 ip address 10.17.80.1 255.255.255.0

interface FastEthernet2
 description OFFICE
 ip address 10.17.81.1 255.255.255.0

interface FastEthernet3
 no ip address

This seems simple enough, the new network will be connected to FastEthernet3. You fire up your editor and update the template. The new file ends up like this:

interface FastEthernet0
 description WAN
 ip address {{ wan_ip }} {{ wan_mask }}

interface FastEthernet1
 description POS
 ip address {{ pos_ip }} {{ pos_mask }}

interface FastEthernet2
 description OFFICE
 ip address {{ office_ip }} {{ office_mask }}

interface FastEthernet3
description SIGNAGE
ip address {{ signage_ip }} {{ signage_mask }}

Sweet, an updated template ready to use. You commit the new template to your git repo and fire up the terminal.

ansible-playbook network-baseline.yml

Welcome to trouble

After giving yourself a mental highfive you stop to savor the moment. The joy lasts up until about the time when someone asks:

“Why did Wisconsin just drop off the map?”

“That’s strange, you only added a new network. Right?” That is when you get that nagging feeling in your stomach. The one that isn’t related to the Wisconsin office. Instead it’s the question of whether you limited the run to a single office or just killed all of your branch offices.

What just happened?

Before going into details of what went wrong, it’s important to realize that these things can happen. You can make errors, or there could be bugs in the software. What’s really important is that you test what you do in a safe environment first. A good idea in this case would have been to use the check mode in Ansible (-C) along with the verbose flag (-v). That way you would be able to see what configuration would have been sent to the device without actually changing anything. Another vital point is that you should never run something like this across your entire network if you aren’t certain what will happen. Use the --limit option and start with a few devices.

Using the verbose option we can gain some insight as to what went wrong.

So it looks like the actual config that gets sent to the poor device is:

interface FastEthernet0
description WAN
description SIGNAGE
ip address 10.17.82.1 255.255.255.0

That’s nice, the playbook reconfigured the WAN interface and gave it the ip address which should have been assigned to FastEthernet3. From this point you could call your therapist, Red Hat support or perhaps your lawyer. Or you could read on to figure out why this happened.

How Ansible parses templates for network devices

Like the rest of Ansible, the networking modules use the Jinja2 templating engine. However, it works a bit differently with networking as opposed to when you are templating a configuration for Nginx or some other service. What Ansible does is to parse the running configuration of the device and based on this it decides what needs to be applied to the device. For example, nothing is changed on FastEthernet1 and FastEthernet2, so Ansible doesn’t try to change anything on those interfaces.

Given a template, Ansible will only apply the configuration which isn’t already on the device. The thing is that Ansible doesn’t actually understand the configuration or what it does. Instead it tries to parse the configuration based on a set of predefined rules. If we start with the description on the WAN interface, that is obviously a thing that needs to be changed. However, we can’t only add the description command as it needs to be configured under the interface. Since the description line is indented under interface FastEthernet0, Ansible treats the interface line as a parent for that section. Even though the interface line exists in the configuration, Ansible will send that command before the description command. This is why the updates sent out by Ansible start with:

interface FastEthernet0
 description WAN

The rendering of the later part of the template would look like this:

interface FastEthernet3
description SIGNAGE
ip address 10.17.82.1 255.255.255.0

Since there’s no indentation, Ansible won’t realize that interface FastEthernet3 is a parent command to description and ip address. Instead it will just assume that those commands are global config snippets, and since there’s no line for description or ip address directly under global config they will be included in the list of commands sent to the device.
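To see how indentation alone can produce this result, here is a small sketch of an indentation-based parser and diff in Python. It is a deliberately simplified model written for this article: parse and commands_to_send are hypothetical helpers, not Ansible's own parsing code, and they ignore many things the real modules handle. The two configurations are the Wisconsin router and the rendered template from the scenario above.

```python
def parse(config_text):
    """Return (parents, line) pairs, deriving parents from indentation only."""
    entries = []
    stack = []  # (indent, line) for each currently open parent
    for raw in config_text.splitlines():
        if not raw.strip():
            continue
        indent = len(raw) - len(raw.lstrip(" "))
        line = raw.strip()
        # Close every section indented as much as or more than this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        entries.append((tuple(text for _, text in stack), line))
        stack.append((indent, line))
    return entries


def commands_to_send(candidate, running):
    """List the commands missing from running, each preceded by its parents."""
    have = set(parse(running))
    commands = []
    last_path = ()
    for parents, line in parse(candidate):
        if (parents, line) in have:
            continue
        # Send only the parents that the previously sent command did not open.
        for depth, parent in enumerate(parents):
            if last_path[:depth + 1] != parents[:depth + 1]:
                commands.append(parent)
        commands.append(line)
        last_path = parents + (line,)
    return commands


RUNNING = """\
interface FastEthernet0
 description Wormhole exit
 ip address 172.29.58.161 255.255.255.224
interface FastEthernet1
 description POS
 ip address 10.17.80.1 255.255.255.0
interface FastEthernet2
 description OFFICE
 ip address 10.17.81.1 255.255.255.0
interface FastEthernet3
 no ip address
"""

# The rendered template, including the two unindented lines at the end.
CANDIDATE = """\
interface FastEthernet0
 description WAN
 ip address 172.29.58.161 255.255.255.224
interface FastEthernet1
 description POS
 ip address 10.17.80.1 255.255.255.0
interface FastEthernet2
 description OFFICE
 ip address 10.17.81.1 255.255.255.0
interface FastEthernet3
description SIGNAGE
ip address 10.17.82.1 255.255.255.0
"""

for command in commands_to_send(CANDIDATE, RUNNING):
    print(command)
```

Because the last two lines have no parent in this model, nothing tells the device to enter interface FastEthernet3 before they are sent. Adding a single leading space to each of them makes the sketch emit interface FastEthernet3 as their parent.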

Normally we’d get an error if we entered a description and ip address directly under global config.

WIS-RTR-01(config)#description SIGNAGE
                   ^
% Invalid input detected at '^' marker.

WIS-RTR-01(config)#ip address 10.17.82.1 255.255.255.0
                           ^
% Invalid input detected at '^' marker.

WIS-RTR-01(config)#

However, because we also changed the description of FastEthernet0 the ssh session will still be in the config-if context. Since we’re not sending an exit command to return to global config (i.e. go from (config-if)# to (config)#), the wrong ip address is applied to the FastEthernet0 interface. The final configuration will be:

interface FastEthernet0
 description SIGNAGE
 ip address 10.17.82.1 255.255.255.0

interface FastEthernet1
 description POS
 ip address 10.17.80.1 255.255.255.0

interface FastEthernet2
 description OFFICE
 ip address 10.17.81.1 255.255.255.0

interface FastEthernet3
 no ip address

Whoops.

What to keep in mind

I think it’s safe to say that for the above scenario to play out like it did a bit of unfortunate luck would have to be involved. The point I would like to make is that it can be easy for something similar to happen. In this case it was two missing space characters along with another unrelated description change. Even if it’s not as disastrous as this you might end up applying configuration where you didn’t want it.

To reiterate: make sure you test and validate what you are doing! Use the dry runs and look at the result so that you see what’s going on.

A different approach

If the above example sounds scary to you, keep in mind that you don’t have to use templates in this way. You can also use the lines and parents parameters with ios_config. You might also want to take a look at the NAPALM library, specifically the napalm_install_config module for Ansible. As mentioned earlier, the core Ansible networking modules parse the running configuration and try to figure out what configuration is missing from the device in order to decide which commands to send. NAPALM on the other hand is completely oblivious to the configurations on the device and instead leaves the decision of what to apply up to the device. For an IOS device NAPALM would copy the entire rendered template to the file system of the device, evaluate if a change is needed and then merge it to the running configuration (or replace the entire configuration if you wanted to).

Conclusion

To close this off, I hope that this article helps you understand how Ansible works with templates when applied to network devices, and why it’s really important to use the correct indentation. More importantly, the real takeaway: make sure you always test what you do before you push out config changes.

Finally, I hope you don’t kill any networks in Wisconsin or anywhere else for that matter. :)

Fighting CLI cowboys with Napalm - An Introduction

2017-02-06T00:00:00+00:00

A lot of people who aren’t familiar with Napalm tend to laugh nervously when you suggest they use it in their network. The name Napalm is partly based on getting that perfect acronym and partly a desire to incinerate the old way of doing things and move to network automation. This article is about explaining what Napalm is and what lies behind the acronym.

What is Napalm Automation?

Napalm is an open source Python library which makes it easy to configure and gather information from network devices through a unified API. As an example if you were to use Napalm and wanted to list the BGP neighbors of three devices running IOS XR, Junos and Arista EOS you could use the get_bgp_neighbors() function and the returned data would be in the same format regardless of the operating system running on the device.

Likewise when configuring devices, Napalm allows you to use the same function without caring about what type of device you are using. First you create the configuration you want to apply to the device; this can be a configuration snippet loaded using the load_merge_candidate() function, or a full configuration using the load_replace_candidate() function. After the configuration is loaded to the device you can use compare_config() to see if the change is actually needed and either use commit_config() or discard_config() to apply or back out of the configuration change.

Origin story

Napalm started its life at Spotify where David Barroso wanted a better way to handle configuration changes in their network. After doing some mockups he talked with Elisa Jasinska about the project and soon after Napalm became a collaborative effort. As time went on more and more people got involved and started to contribute code. Now it has grown into a community effort. While some vendors have shown an interest in helping out with the network drivers, Napalm remains an independent project.

How Napalm is used

Since Napalm is a Python package, developers can install it using pip with “pip install napalm”. You might have to install some other dependencies; check the install docs to be sure. Using Napalm in this way lets you write your own scripts to automate your network.

If you’re using Ansible, you might have heard that there’s a repo for napalm-ansible modules. Napalm is also integrated with SaltStack, so it’s easy to use Napalm and Salt together.

Napalm in action - Getting BGP Neighbors

As stated earlier you can use Napalm as part of an automation framework. But for simplicity we’ll take a look at using it directly from Python. This will require some basic knowledge about Python, but even if you don’t know Python you will probably be able to see what’s going on. Here is an example of how you can use Napalm to gather information about BGP neighbors.

import json
from napalm.base import get_network_driver
driver = get_network_driver('iosxr')
dev = driver(hostname='r1', username='admin',
             password='admin')
dev.open()
bgp_neighbors = dev.get_bgp_neighbors()
dev.close()
print(json.dumps(bgp_neighbors, sort_keys=True, indent=4))

Running the script prints output like this:

{
    "global": {
        "peers": {
            "10.255.255.2": {
                "address_family": {
                    "ipv4": {
                        "accepted_prefixes": 0,
                        "received_prefixes": 0,
                        "sent_prefixes": 1
                    }
                },
                "description": "",
                "is_enabled": true,
                "is_up": true,
                "local_as": 65900,
                "remote_as": 65900,
                "remote_id": "10.255.255.2",
                "uptime": 372
            },
            "10.255.255.3": {
                "address_family": {
                    "ipv4": {
                        "accepted_prefixes": 0,
                        "received_prefixes": 0,
                        "sent_prefixes": 1
                    }
                },
                "description": "",
                "is_enabled": true,
                "is_up": true,
                "local_as": 65900,
                "remote_as": 65900,
                "remote_id": "10.255.255.3",
                "uptime": 372
            }
        },
        "router_id": "10.255.255.1"
    }
}

The above example is using IOS XR, but Napalm will return data in the same format regardless of the platform.
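Because the structure is the same everywhere, a single helper can work on the result no matter which driver produced it. The sketch below is my own illustration and not part of Napalm: summarize_peers is a made-up name, and the sample dictionary is a trimmed copy of the output above so that the code runs without a device.

```python
def summarize_peers(bgp_neighbors):
    """Flatten the get_bgp_neighbors() structure into one row per peer."""
    rows = []
    for vrf, vrf_data in sorted(bgp_neighbors.items()):
        for peer, info in sorted(vrf_data["peers"].items()):
            # Add up the received prefixes over all address families.
            received = sum(
                af["received_prefixes"]
                for af in info["address_family"].values()
            )
            rows.append({
                "vrf": vrf,
                "peer": peer,
                "is_up": info["is_up"],
                "received_prefixes": received,
            })
    return rows


def make_peer(remote_id):
    """Build one peer entry shaped like the output above (sample data only)."""
    return {
        "address_family": {
            "ipv4": {"accepted_prefixes": 0, "received_prefixes": 0, "sent_prefixes": 1}
        },
        "description": "",
        "is_enabled": True,
        "is_up": True,
        "local_as": 65900,
        "remote_as": 65900,
        "remote_id": remote_id,
        "uptime": 372,
    }


sample = {
    "global": {
        "peers": {
            "10.255.255.2": make_peer("10.255.255.2"),
            "10.255.255.3": make_peer("10.255.255.3"),
        },
        "router_id": "10.255.255.1",
    }
}

for row in summarize_peers(sample):
    print(row)
```

In a real script you would pass in the value returned by dev.get_bgp_neighbors() instead of the sample.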

Napalm in action - Configuring devices

Moving on to configuration, Napalm currently works with configuration files in raw text format. Typically you create these files by passing values through a templating language such as Jinja2. A lot can be said about generating configuration using templates and I won’t go into any of that here. Instead we just have a simple file containing an access-list for an XR device.

ACL_SAMPLE.cfg

no ipv4 access-list ACCESS_OUT
ipv4 access-list ACCESS_OUT
 10 permit tcp any any eq domain
 20 remark udp any any eq dns
 30 permit tcp any any eq www
 40 remark tcp any any eq https

The config snippet starts by deleting the access-list and then recreates it; the reason for this is to remove potential extra lines in the acl. To configure a device with this access-list we can use the code below and rely on the underlying napalm-iosxr driver to figure out what needs to be done on the device.

from napalm.base import get_network_driver
driver = get_network_driver('iosxr')
dev = driver(hostname='r1', username='admin',
             password='admin')
dev.open()
dev.load_merge_candidate(filename='ACL_SAMPLE.cfg')
dev.commit_config()
dev.close()

While this allows us to push configuration to the devices, Napalm also lets you see if a configuration change is actually needed. A modified script could look like this:

from napalm.base import get_network_driver
driver = get_network_driver('iosxr')
dev = driver(hostname='r1', username='admin',
             password='admin')
dev.open()
dev.load_merge_candidate(filename='ACL_SAMPLE.cfg')
diffs = dev.compare_config()
if len(diffs) > 0:
    print(diffs)
    dev.commit_config()
else:
    print('No changes needed')
    dev.discard_config()

dev.close()

Using this check we can validate if a change is actually needed before applying it, or use it to validate that the configuration on the device is what we expect it to be.
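If you only want the validation part, the same calls can be wrapped in a helper that never commits. This is a sketch under my own naming: config_in_sync is not a Napalm function, and FakeDevice is a stand-in that only exists so the example runs without a router. With a real device you would pass in the opened driver object instead.

```python
def config_in_sync(dev, filename):
    """Return True when merging `filename` would not change the device."""
    dev.load_merge_candidate(filename=filename)
    try:
        diff = dev.compare_config()
    finally:
        # Always drop the candidate; this helper never commits anything.
        dev.discard_config()
    return len(diff) == 0


class FakeDevice:
    """Stand-in for an opened Napalm driver, for illustration only."""

    def __init__(self, diff):
        self._diff = diff
        self.discarded = False

    def load_merge_candidate(self, filename=None):
        self.filename = filename

    def compare_config(self):
        return self._diff

    def discard_config(self):
        self.discarded = True


print(config_in_sync(FakeDevice(""), "ACL_SAMPLE.cfg"))
print(config_in_sync(FakeDevice("+ 30 permit tcp any any eq www"), "ACL_SAMPLE.cfg"))
```

A helper like this could run on a schedule to report devices that have drifted from the expected configuration.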

Another option is to use the load_replace_candidate function instead of load_merge_candidate. Using the replace option would let you replace the entire configuration on the target device.

What network devices does Napalm support?

As of this writing the list of supported network devices looks like this:

Arista EOS
Cisco IOS
Cisco IOS-XR
Cisco NX-OS
Fortinet FortiOS
IBM
Juniper Junos
Mikrotik RouterOS
Palo Alto PAN-OS
Pluribus
VyOS

New devices are also being worked on. A thing to keep in mind though is that Napalm is developed by its community, so the people writing code will focus on the features that they need. As a result, not all features of Napalm will work on all of the devices. For more information on this you can look at the support matrix in the documentation.

Network device configuration

As you saw above, the configuration sent to an IOS XR device looks like XR config; if you were to target an Arista box you would use EOS config. The unified part of configuration in Napalm is how you apply the configuration to the devices. It is, however, quite easy to choose different Jinja templates as you target different device types, which makes the device type more transparent.
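As a rough sketch of picking a template per device type, the example below keys the templates on the Napalm driver name. To keep it runnable with only the standard library I have swapped Jinja2 for Python’s string.Template. The render_acl function and the template texts are made up for illustration, and the EOS syntax in particular has not been checked against a device.

```python
from string import Template

# One template per driver name; the contents are illustrative only.
TEMPLATES = {
    "iosxr": Template("ipv4 access-list $name\n 10 permit tcp any any eq www\n"),
    "eos": Template("ip access-list $name\n 10 permit tcp any any eq www\n"),
}


def render_acl(driver_name, name):
    """Render the access-list template that belongs to the given driver."""
    try:
        template = TEMPLATES[driver_name]
    except KeyError:
        raise ValueError("no template for driver %r" % driver_name)
    return template.substitute(name=name)


print(render_acl("iosxr", "ACCESS_OUT"))
print(render_acl("eos", "ACCESS_OUT"))
```

The rendered text could then be written to a file and loaded with load_merge_candidate(), exactly as in the earlier examples.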

OpenConfig and Yang

Currently we need to use raw text when configuring the devices. However, OpenConfig and Yang support is a work in progress so things might change in the future. Though this is an interesting development, don’t wait for everything to be in place before you start to move away from the CLI. Start testing Napalm and other tools sooner rather than later.

What’s in a name?

So what does the acronym stand for? NAPALM is Network Automation and Programmability Abstraction Layer with Multivendor support. I think it fits quite nicely.

More information

As with many other Python projects the documentation for Napalm uses Read the Docs and the Napalm source code is hosted on GitHub.

How to use Ansible ios_config to configure devices

2016-09-11T22:15:21+00:00

A lot of new networking modules were released as part of Ansible 2.1. The Cisco IOS, IOS XR, NXOS, Junos and Arista EOS platforms got three common modules, the platform_config, platform_command and platform_template. The command and template modules more or less explain themselves. The config modules have some more tricks to them and I’ve gotten a few questions about how they work. In this article I’m going to focus on the ios_config module and show how you can use it to configure Cisco IOS devices. Future versions of Ansible will add more parameters; this article is for Ansible 2.1.

Prerequisite

As mentioned above these modules were released in Ansible 2.1 so you must have at least that version installed. This article also assumes that you have some basic knowledge about Ansible.

To save some space it’s assumed that the username, password and host parameters are used in the examples below. I.e.:

- name: Configure device
  ios_config:
    host: "{{ inventory_hostname }}"
    username: "{{ device_ssh_username }}"
    password: "{{ device_ssh_password }}"

Parameters of ios_config

You might want to start by just reviewing the documentation.

ansible-doc ios_config

The parameters you will want to focus on are:

after - What to do after the config commands
before - What to do before the config commands
lines - The lines in the configuration
match - How to match or compare the config lines
parents - Define parents in a hierarchy of config objects
replace - How to perform replace operations

Using lines (or commands)

First off “commands” is an alias for “lines”, so you might see examples using both variants. Just remember that it’s the same thing.

According to the documentation (as it is now) the lines parameter is described like this:

The ordered set of commands that should be configured in the section. The commands must be the exact same commands as found in the device running-config. Be sure to note the configuration command syntax as some commands are automatically modified by the device config parser.

Basically it’s for defining the configuration lines you want in your device config. At first it might not be clear what is meant by the exact same commands. So let’s look at an example.

- name: Create an access-list
  ios_config:
    lines: access-list 180 permit tcp any any eq 80

The above task would work, and would add the line “access-list 180 permit tcp any any eq 80” to the configuration. However, it would not be idempotent, meaning that each time you run the same task it would register as a change. This is because when you add this line to the configuration IOS will swap out port 80 and replace it with www. So the correct way to write the above task and make it idempotent would be like this:

- name: Create an access-list
  ios_config:
    lines: access-list 180 permit tcp any any eq www

The lines parameter accepts a single configuration line as above or a list like this:

- name: Configure service
  ios_config:
    lines:
      - no service pad
      - service timestamps debug uptime
      - service timestamps log uptime
      - service password-encryption
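Returning to the access-list example, the idempotency problem comes down to a plain text comparison. The sketch below is not the module’s actual code; missing_lines is a made-up helper that mimics the idea of sending only the lines that cannot be found in the running configuration.

```python
def missing_lines(desired, running_config):
    """Return the desired lines that do not appear verbatim in the config."""
    running = {line.strip() for line in running_config.splitlines()}
    return [line for line in desired if line.strip() not in running]


# What IOS stores after either variant of the command has been entered.
running = "access-list 180 permit tcp any any eq www\n"

# Written with "80": never found in the config, so it is sent on every run.
print(missing_lines(["access-list 180 permit tcp any any eq 80"], running))

# Written with "www": found in the config, so there is nothing to send.
print(missing_lines(["access-list 180 permit tcp any any eq www"], running))
```

The first call returns the line again and again, which is why the task keeps reporting a change.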

Using after

The after parameter is simply a list of configuration commands to run after the desired lines have been applied. An example where this could be useful is when configuring SNMPv3.

As part of the configuration you might want to add these lines:

snmp-server group SNMPv3 v3 priv
snmp-server user snmpv3 SNMPv3 v3 auth sha AUTHPW123 priv aes 128 Pr1vPW123

The problem is that the second command never shows up in the configuration, yet it’s still needed to create the user. If you want to configure this with Ansible in an idempotent way you couldn’t have the play look like this:

- name: Configure SNMPv3
  ios_config:
    lines:
      - snmp-server group SNMPv3 v3 priv
      - snmp-server user snmpv3 SNMPv3 v3 auth sha AUTHPW123 priv aes 128 Pr1vPW123

That task would register as a change each time you ran your playbook. As a workaround you could use the after parameter so that the snmp-server user was only added if the group was missing from the configuration.

- name: Configure SNMPv3
  ios_config:
    lines:
      - snmp-server group SNMPv3 v3 priv
    after:
      - snmp-server user snmpv3 SNMPv3 v3 auth sha AUTHPW123 priv aes 128 Pr1vPW123
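The behaviour of after can be pictured like this. Again this is only my own sketch of the idea, with a made-up build_commands function, and not the real implementation: the after commands ride along only when at least one of the lines is missing.

```python
def build_commands(running_config, lines, after=()):
    """Return the commands to send: missing lines first, then `after`."""
    running = {line.strip() for line in running_config.splitlines()}
    missing = [line for line in lines if line not in running]
    if not missing:
        return []          # nothing to change, so `after` is skipped too
    return missing + list(after)


group = "snmp-server group SNMPv3 v3 priv"
user = "snmp-server user snmpv3 SNMPv3 v3 auth sha AUTHPW123 priv aes 128 Pr1vPW123"

# Fresh device: the group is missing, so both commands are sent.
print(build_commands("hostname r1\n", [group], after=[user]))

# Second run: the group is in the config, so nothing is sent.
print(build_commands("hostname r1\n" + group + "\n", [group], after=[user]))
```

One consequence of this workaround is that the user command is not sent again if someone removes the user but leaves the group in place.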

Using parents

Some of the configuration in an IOS device is structured so that all the configuration under a specific item is indented. For example, the configuration related to an interface.

interface GigabitEthernet0/1
 description Uplink
 ip address 192.168.0.1 255.255.255.0

In the above case “interface GigabitEthernet0/1” would be the parent to description and ip address parameters. Using the ios_config module you could use a task looking like this:

- name: Configure Uplink
  ios_config:
    parents: "interface GigabitEthernet0/1"
    lines:
      - description Uplink
      - ip address 192.168.0.1 255.255.255.0

You can use multiple parents for hierarchical configuration items. In IOS you typically see this if you are setting up policy-maps, for example if you are setting up something like zone based firewalls or dot1x your config might look like this:

policy-map type control subscriber POLICY_MAB
 event session-started match-all
  10 class always do-until-failure
   10 authenticate using mab aaa authc-list ISE priority 20

Setting this up in Ansible (with the xyz_config modules) would look like this.

- name: Configure PM POLICY_MAB
  ios_config:
    parents:
      - policy-map type control subscriber POLICY_MAB
      - event session-started match-all
      - 10 class always do-until-failure
    lines:
      - 10 authenticate using mab aaa authc-list ISE priority 20
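One way to check that a parents and lines pair describes the block you had in mind is to render it back into indented text. The helper below, to_config_block, is a small illustration of my own and not something Ansible provides. Each parent gets one more space of indentation than the previous one, and the lines sit one level below the last parent.

```python
def to_config_block(parents, lines):
    """Render parents and lines as an indented IOS-style config block."""
    out = []
    for depth, parent in enumerate(parents):
        out.append(" " * depth + parent)
    for line in lines:
        out.append(" " * len(parents) + line)
    return "\n".join(out)


parents = [
    "policy-map type control subscriber POLICY_MAB",
    "event session-started match-all",
    "10 class always do-until-failure",
]
lines = ["10 authenticate using mab aaa authc-list ISE priority 20"]

print(to_config_block(parents, lines))
```

The printed text has the same layout as the policy-map configuration shown before the task.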

Using before

At times you need to do something before the configuration is applied. An example might be if you want to remove an access-list before reapplying it.

- name: Configure TEST-ACL
  ios_config:
    parents: "ip access-list extended TEST-ACL"
    lines:
      - permit tcp any any eq smtp
      - permit tcp any any eq www
    before: "no ip access-list extended TEST-ACL"

If the access-list needed to be changed it would be removed first and then recreated.

Using match and replace

The match parameter currently accepts three values.

line: Match everything line by line
strict: Match with respect to the positions of the lines
exact: The lines must exactly match the config and no other lines are allowed

This is mostly self-explanatory, though a tricky part can be the exact statement as that one is context related. If you are using parents the configuration will be matched within that section. I.e. if your parent is “ip access-list extended TEST-ACL” the configuration the module matches your lines against will only be the ones under that access-list. However, if you don’t have any parents the match will be against the entire configuration.

The replace parameter accepts two values.

line: Replace the missing lines
block: Replace all the lines

To illustrate the difference between the options, let’s say we have an access-list which looks like this:

ip access-list extended TEST-ACL
 permit tcp any any eq www

Then we have an Ansible task looking like this.

- name: Configure TEST-ACL
  ios_config:
    parents: "ip access-list extended TEST-ACL"
    lines:
      - permit tcp any any eq smtp
      - permit tcp any any eq www
    before: "no ip access-list extended TEST-ACL"
    match: exact
    replace: line

Our end goal is to ensure that our access-list looks exactly as we desire. When we run this play we want an acl which allows smtp and http. However what we will end up with is an acl which looks like this:

ip access-list extended TEST-ACL
 permit tcp any any eq smtp

The reason is that the line “permit tcp any any eq www” already exists in the access-list and since we use “replace: line” we only add the missing lines. Since we use the before parameter to remove the access-list we don’t reach the desired result. So what we need to do is to change the replace parameter to block. That way we will remove the acl and then recreate all of the desired entries.

So combining the match and replace parameters works well with indented configuration items where we have parents.
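The difference between the two replace values can be replayed with a small simulation. This is a sketch of the behaviour described above, not the module’s code. The plan and simulate functions are made up, and simulate only understands the few commands used in this example.

```python
PARENT = "ip access-list extended TEST-ACL"


def plan(section, parent, lines, before, replace):
    """Commands for a `match: exact` task; `section` is the current ACL body."""
    if section == lines:
        return []                      # exact match, nothing to do
    if replace == "block":
        body = list(lines)             # resend every desired line
    else:
        body = [l for l in lines if l not in section]   # replace: line
    return list(before) + [parent] + body


def simulate(section, commands, parent):
    """Replay the commands and return the resulting ACL body."""
    acl = list(section)
    for cmd in commands:
        if cmd == "no " + parent:
            acl = []                   # the whole access-list is removed
        elif cmd != parent and cmd not in acl:
            acl.append(cmd)
    return acl


existing = ["permit tcp any any eq www"]
wanted = ["permit tcp any any eq smtp", "permit tcp any any eq www"]
before = ["no " + PARENT]

for mode in ("line", "block"):
    commands = plan(existing, PARENT, wanted, before, mode)
    print(mode, simulate(existing, commands, PARENT))
```

With line the simulated access-list ends up with only the smtp entry, and with block it ends up with both entries.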

Let’s look at another example. We want to make sure that our snmp settings are exactly as we desire, and if someone has configured extra traps or other settings we don’t want, we just want to wipe them out with the “default snmp-server” command and only reapply the wanted config.

At first you might want to create a task looking like this:

- name: Configure SNMP
  ios_config:
    lines:
      - snmp-server community cisco RO
      - snmp-server location STH
      - snmp-server contact NORTH
      - snmp-server enable traps snmp linkdown linkup
      - snmp-server host 172.29.52.11 version 2c public
    before: "default snmp-server"
    match: exact
    replace: block

The problem here is that the match will be done against the entire configuration so it will always differ. A current workaround to this would be to use the config parameter to match against a pre-filtered config file. Otherwise just keep in mind that the match is done differently depending on the context of your config lines.
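To show why a pre-filtered configuration helps, here is a minimal sketch. The filter_config helper and the sample running configuration are made up, and the example does not cover how the filtered text is handed to the config parameter. It also assumes the device lists the snmp-server lines in the same order as the task, which you would need to verify.

```python
def filter_config(running_config, prefix):
    """Keep only the top-level lines that start with the given prefix."""
    return [line for line in running_config.splitlines() if line.startswith(prefix)]


desired = [
    "snmp-server community cisco RO",
    "snmp-server location STH",
    "snmp-server contact NORTH",
    "snmp-server enable traps snmp linkdown linkup",
    "snmp-server host 172.29.52.11 version 2c public",
]

running = "\n".join(
    ["hostname r1", "no service pad"]
    + desired
    + ["interface GigabitEthernet0/1", " description Uplink"]
)

# An exact match against everything can never succeed ...
print(desired == running.splitlines())

# ... but against only the snmp-server lines it can.
print(desired == filter_config(running, "snmp-server"))
```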

Using combinations

In some scenarios it’s much easier to just use the different template modules to configure your devices. But at times you want the configuration to be applied in a certain order and you might want to include commands which don’t show up in the configuration after you have reached the desired state.

One such scenario might be if you want to replace an access-list without breaking your own connection to the device.

- name: ACL - ACL-IN
  ios_config:
    parents: ["ip access-list extended ACL-IN"]
    commands:
      - permit tcp any any eq 22
      - permit tcp any any eq www
      - permit udp any any eq snmp
    match: exact
    replace: block
    before:
      - interface GigabitEthernet0/1
      - no ip access-group ACL-IN in
      - no ip access-list extended ACL-IN
    after:
      - interface GigabitEthernet0/1
      - ip access-group ACL-IN in
  notify: save configuration

This task would first check if the access-list looks exactly as we want it to. If something has changed Ansible would remove the acl from the interface, delete the access-list and then recreate it and reapply it to the interface. In this example we also notify a handler to save the configuration at the end of the playbook; this will only happen if a change has actually been made.

You could also expand this and read the access-list from a file using the lookup function, or make the configuration conditional on the output from ios_command or another similar module.

Final words

Now you should be familiar with how you can use the ios_config and other related modules. But remember, the _config modules are just one way you can use Ansible to configure your network devices. You should also get familiar with the _template modules and look at third party modules such as Napalm.

Regarding automation exceptions

2016-07-21T08:20:40+00:00

It can get quite exciting when you start to think about network automation and what it can do for you and your network. Once you’ve automated everything you can instead focus on deep work to evolve your business. However this daydream can soon fade away as you start to think about the things you can’t automate, or at least don’t know how to do. Ivan Pepelnjak wrote a piece about automating the exceptions. The post is based on a discussion he had with Rok Papež and his ideas about handling exceptions in an automated way.

While the strategy presented is great, I think it overlooks some of the exceptions that can arise. The post also doesn’t highlight how limitations of the configuration management tools were solved.

The problem

The underlying question is: when implementing some form of network automation, how should you handle exceptions? In this case we are talking about exceptions in the configuration. When looking at a new solution for network automation you might find something which allows you to automate almost everything you want to do. However, there might also be some odd cases or some legacy things which don’t completely fit into the new workflow you will be using. In Rok’s case the solution was to “store the non-automated part of the device configuration (configuring the pilot services) in an extra field in their configuration management database, and append it to the configuration generated from device/service data model through standardized templates”.

As Ivan also pointed out, in the best of worlds there should be no exceptions. You can look at this from a few different points of view. The most convenient would be that you gain so much from automating the standard parts that you don’t care about the few manual steps you need to take. You could also argue that you don’t have to automate everything at once.

Taking a step back, the underlying problem might be that the automation solution doesn’t allow for exceptions or that it takes too much time to implement those exceptions in the config management tool. It can also be that engineers are happy enough implementing new standard services using the config management solution, but they don’t have the skills to add these new exceptions to the config management workflow.

Regardless of which tool you choose to use there will always be things it can’t do. The tools I would stay away from are those that don’t allow you to handle those issues outside of the tool. A simple example would be the standard Ansible inventory which is just a flat text file; it works well enough for testing but isn’t very flexible. In this case Ansible allows you to use a dynamic inventory which you can create yourself, have total control over and connect to your real inventory tool.
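A dynamic inventory is just an executable that prints JSON when Ansible calls it with --list (or --host and a hostname). Here is a minimal sketch in Python. The host names, the role groups and the DEVICES dictionary are hypothetical; in a real script that dictionary would be replaced by a query against your own inventory system.

```python
import json
import sys

# Hypothetical data; a real script would fetch this from your inventory tool.
DEVICES = {
    "sw1.example.net": {"site": "sth", "role": "access"},
    "sw2.example.net": {"site": "sth", "role": "access"},
    "r1.example.net": {"site": "sth", "role": "core"},
}


def build_inventory(devices):
    """Group hosts by role and expose their attributes as host variables."""
    inventory = {"_meta": {"hostvars": {}}}
    for host, attrs in sorted(devices.items()):
        group = inventory.setdefault(attrs["role"], {"hosts": []})
        group["hosts"].append(host)
        inventory["_meta"]["hostvars"][host] = dict(attrs)
    return inventory


def main(argv):
    """Return the JSON text Ansible expects for the given arguments."""
    if "--list" in argv:
        return json.dumps(build_inventory(DEVICES), indent=2)
    # For --host an empty dict is enough, since the host variables
    # are already included under _meta in the --list output.
    return json.dumps({})


if __name__ == "__main__":
    print(main(sys.argv[1:]))
```

If the script is saved as an executable file, it can be passed to ansible-playbook with the -i option in place of the flat text inventory.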

Another thing to consider in terms of network automation is that even if you have a solution to handle all of the configuration exceptions, there will always be other kinds of exceptions. Some of these might be impossible to plan for in advance; likewise it might not be possible to solve them with your current automation tool.

A real world scenario

So it was the evening before a bank holiday and I had just finished the barbecue dinner, when the phone rang.

“Hi, we have an issue. Do you have a moment?”

It was a client and their issue was something which they previously referred to as a “code purple”. A lot of their customers had lost their internet connection. At this point I eyed the wine glass in front of me, it turned out to be half empty.

This customer has a config management solution which provisions and configures all of their switches. It owns the config and makes sure that every switch is in sync with the desired configuration template. If someone were to log in and change some lines the config management tool would revert this.

What happened though was that there was a bug in the config management tool. The way the bug worked was that it constantly removed a specific vlan from each switch. As that vlan wasn’t used that bug in itself shouldn’t have mattered. As it turned out it really did matter. For a lot of the switches this wasn’t a problem, however on some switch models it triggered a bug in the switch os.

The bug in the switches made it look like arp had stopped working on all the customer facing ports. In reality the switch cut four bytes from each packet, but the end result was that nothing worked. Earlier in the day the switch vendor had confirmed this as a bug and promised to deliver a patch quickly. However, before that they had come up with a workaround: the bug condition was fixed if you removed the access vlan from the customer ports and reapplied the same configuration (this wasn’t the same vlan as the one which was removed earlier).

Now the problem for my client was that their config management solution couldn’t do this. That tool only ensured that the config was correct; it couldn’t make changes in the way which was needed. So they wanted me to write a script to do this instead. The configuration needed to be removed and reapplied on somewhere between 30,000 and 35,000 ports. They knew which switches were impacted and had a list of these, however they didn’t have a list of all the impacted ports since not all ports were configured the same way. Some of the ports were exceptions with non-standard config. The configuration management could handle exceptions in that case but the information about which ports were configured in that way was locked within the config management tool and not easily accessible.

The script needed to do two things: first log in to each switch and figure out which ports needed to be reconfigured, and then reconfigure only those ports.
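The script from that evening isn’t reproduced here, but the two steps can be sketched roughly like this. Everything in the example is an assumption made for illustration: the IOS-style syntax (I’m not naming the vendor), the rule that every port with an access vlan needs the workaround, and the function names.

```python
def ports_to_fix(config_text):
    """Return {interface: vlan} for every port with an access vlan configured."""
    ports = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if raw.startswith("interface "):
            current = raw.split(None, 1)[1]
        elif current and line.startswith("switchport access vlan "):
            ports[current] = line.rsplit(None, 1)[1]
        elif raw and not raw.startswith(" "):
            current = None   # a new top-level line ends the interface section
    return ports


def workaround_commands(ports):
    """Remove and reapply the access vlan on each selected port."""
    commands = []
    for interface, vlan in ports.items():
        commands += [
            "interface " + interface,
            "no switchport access vlan " + vlan,
            "switchport access vlan " + vlan,
        ]
    return commands


sample_config = """interface GigabitEthernet0/1
 switchport access vlan 100
!
interface GigabitEthernet0/2
 switchport mode trunk
!
interface GigabitEthernet0/3
 switchport access vlan 200
"""

for command in workaround_commands(ports_to_fix(sample_config)):
    print(command)
```

Fetching the configuration and sending the commands would be handled by whichever connection library you use; the sketch leaves that part out.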

Unfortunately I didn’t drink any more wine that evening, but I fixed the client’s issue.

Bottom line

Granted the above scenario is extreme and you will probably never face that. However that’s the point, the client didn’t think they would either. It wasn’t something which they could have planned for. When purchasing the network config tool they never asked if it would be possible to do something like this. Why would they have?

It was a one-off fluke incident which will never happen again. Still, other strange issues will happen. Hey, it’s networking.

The important point is that you shouldn’t only rely on your automation tool to help you. Just like Ivan and Rok discussed you will need to look at what’s possible and then have someone who has the creativity and skill to work around the limitations of your tools.

So the point I was missing in Ivan’s article was that the real solution to many situations like these is the people who can think of how to work around issues and have the skills to solve them. Extra points to your organization if that is in fact people as opposed to a single person.