Unit Testing in Python – Part 2 – Coverage

This is the second part in a larger series about unit testing in Python. If you missed the first one, you can find it here. This time I’ll be exploring how to see the code coverage of my tests.

It’s all well and good to have tests written and passing, but it’d be nice to see how much of my code is being executed by those tests. To achieve this I’m going to use the coverage package. I’ll continue to use the same Python example from Part 1 (GitHub).

I will be using the coverage CLI (installed via `pip install coverage`). To generate the coverage reports, I run two commands.

coverage run --branch apiwrapper_test.py

This first command creates a .coverage file, which isn’t very human-readable. I like to pass the `--branch` parameter so that I can see which parts of my conditionals are being executed. If you have multiple test files, you will need to run the command multiple times with the `-a` option to append to the existing results.
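
For example, with a second test file (otherwrapper_test.py here is a hypothetical name), the sequence would be:

coverage run --branch apiwrapper_test.py
coverage run --branch -a otherwrapper_test.py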

coverage html

This second command converts the .coverage file into a nice HTML page that I can use to visualize the quality of my tests. I can now point my browser at the index.html in the `htmlcov` folder and see a nice summary of all packages tested:

[Screenshot: Coverage report summary]

When I click on apiwrapper.py I can see the lines I missed:

[Screenshot: Coverage for apiwrapper.py: 90%]

Here you can see that I missed a case where the request was successful and returned a 201. I can simply add that test, rerun my two commands and refresh this page to see that I’ve addressed the issue.
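
As a sketch, the added test could look something like this (it reuses the mock setup from Part 1; the method name is my own):

def test_add_user_success(self, mock):
    wrapper = ApiWrapper(API_URL)

    # Exercise the successful branch: the API returns 201 with a JSON body.
    mock.post(API_URL + '/users', text='{"id": "1234"}',
              status_code=201)
    self.assertEqual('1234', wrapper.add_user('The', 'User'))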

I hope you enjoyed this look at Python’s code coverage capabilities. Join me next time as I explore how Visual Studio Code makes it very easy to run Python unit tests.

Unit Testing in Python: Part 1 – Requests

Welcome to a blog series about unit testing in Python. In this first post, I’d like to explore some basics of how to test REST API calls.

I’ve recently written several large Python classes to call a REST API. When I first started, the script was very small and I could easily run it to test all the situations. It quickly grew and it was obvious that I needed some more sophisticated testing. After some reading, I found I could solve this problem with two Python libraries: unittest and requests-mock.

Python has a very good unit testing framework in unittest. However, a script that makes many external requests can be very difficult to test because you don’t want to make a real call to a live server during a unit test. Luckily, I always use the requests library for my HTTP(S) calls in Python, and there is a great library called requests-mock that will capture all requests to a certain URL and return some JSON (or whatever) I specify. I’ll use an example to show how easy this is. (For the full source code, please see my GitHub.)

I’ll start with a simple class with one method to create a user.

import requests

class ApiWrapper(object):
    """Class that wraps some REST API"""

    def __init__(self, api_url: str):
        self._api_url = api_url

    def add_user(self, first_name: str, last_name: str) -> str:
        """Adds a new user to the system and returns its ID"""
        params = {'first_name': first_name,
                  'last_name': last_name}
        response = requests.post(self._api_url + '/users',
                                 json=params)
        if response.status_code != 201:
            raise RuntimeError

        return response.json()['id']

I need to test the logic in the add_user method but I don’t want to hit the real REST API during my unit testing. You can see below that this is quite simple.

I first add the @requests_mock.mock() decorator to the class so that each test method will be passed a mock object.

@requests_mock.mock()
class ApiWrapperTest(unittest.TestCase):

In this test, I want to test the case where the POST fails and the case where it succeeds and returns valid JSON. In the first case I set up the mock to return a 401 and test that add_user raises a RuntimeError. In the second case I tell the mock to return valid JSON and test that add_user returns the correct ID.

def test_add_user(self, mock):
    wrapper = ApiWrapper(API_URL)

    mock.post(API_URL + '/users', status_code=401)
    with self.assertRaises(RuntimeError):
        wrapper.add_user('The', 'User')

    mock.post(API_URL + '/users', text='{"id": "1234"}',
              status_code=201)
    self.assertEqual('1234',
                     wrapper.add_user('The', 'User'))

Lastly, I add the main method so that I can easily run this from the command line by executing the test file (e.g. python apiwrapper_test.py).

if __name__ == '__main__':
    unittest.main()

Here is a look at the file in its entirety.

import unittest
import requests_mock
from apiwrapper import ApiWrapper

API_URL = 'http://example.com'

@requests_mock.mock()
class ApiWrapperTest(unittest.TestCase):
    """Tests ApiWrapper"""

    def test_add_user(self, mock):
        wrapper = ApiWrapper(API_URL)

        mock.post(API_URL + '/users', status_code=401)
        with self.assertRaises(RuntimeError):
            wrapper.add_user('The', 'User')

        mock.post(API_URL + '/users', text='{"id": "1234"}',
                  status_code=201)
        self.assertEqual('1234',
                         wrapper.add_user('The', 'User'))

if __name__ == '__main__':
    unittest.main()

I hope you enjoyed this look at Python’s unit testing capabilities. Join me next time as I explore how to see the code coverage of the unit tests.

Docker – An instant cluster on your PC

I’ve recently started working with Docker Toolbox at work. The original goal was to learn Docker, and since we run Windows 7 machines, this seemed like the way to go. It has since transformed into something far more useful.

I’m fully aware that Docker’s main purpose is not really the one I’m using it for, but I’ve still found value. I’m often in need of many Linux servers to test the deployment of some tool that we’ll be rolling out. Sometimes it’s a well-known tool like Redis; other times it’s me needing to learn the intricacies of Ansible. Whatever the reason, it is a real pain in the behind to get a set of VMs from our central team. I don’t really need a ton of CPUs, RAM, or disk space; they will sit idle most of the day. A public cloud could work here but I’m not allowed to use it. I really just need something with sshd and some basic tooling.

Enter Docker. After learning the ropes, which only took a few days, I was able to create a series of containers with sshd and the same ssh keys (insecure but convenient). Aside from the ease of setup, there were several other benefits.
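
As a rough sketch of what that looks like (my-sshd-image is a hypothetical name; it assumes an image with sshd installed and the shared keys baked in), a three-node "cluster" is just:

docker run -d --name node1 -p 2201:22 my-sshd-image
docker run -d --name node2 -p 2202:22 my-sshd-image
docker run -d --name node3 -p 2203:22 my-sshd-image

Since Docker Toolbox runs the containers inside a VirtualBox VM, I then ssh to the VM’s address (docker-machine ip default) on the mapped ports.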

I dread it every time I can’t run something locally. I normally edit my code/scripts on my local Windows PC. This is simply more convenient; I have all my tools set up just the way I want them. I could edit the files remotely as I’m quite comfortable in vim but it is just inconvenient enough for me to do something about it. With Docker containers, I can mount a directory from my Windows box onto the container and have it act like a native directory. This has shortened my feedback loop (code, run, test) to the point where it’s the same as working locally.
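
A minimal sketch of such a mount (the paths and image name are hypothetical; Docker Toolbox shares directories under C:\Users into the VM as /c/Users by default):

docker run -d --name dev -v /c/Users/me/project:/project my-sshd-image

Anything I edit on the Windows side shows up immediately inside the container under /project.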

I’ve been known to make mistakes; that’s how I learn. But some mistakes are easier to recover from than others. Blowing up something on a centrally provided VM is a time-consuming process to fix. I could either revert stuff manually (ugh) or have the central team re-image the box (double ugh), both terrible options. The consequence of it being hard to recover is that I’m less likely to try things on those machines. This problem goes away with Docker; it takes me seconds to blow away the container and start a new one. This enables me to have more freedom in exploring new technologies.
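
Concretely, resetting one of the hypothetical nodes from above is just two commands:

docker rm -f node1
docker run -d --name node1 -p 2201:22 my-sshd-image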

Overall, I’m very happy with my current use of Docker and I look forward to discovering more unintended benefits from this approach.

P.S. I know I could have used Vagrant, but it takes too much memory per Linux instance.

Elephants In The Cloud

I’ve been working with Hadoop for the last 2 years. The pace of change in the industry has been incredible. I attended Hadoop Summit San Jose in both 2015 and 2016 and I noticed a few trends.

BI tools are still important

In 2015, BI tools were very prevalent in the Community Showcase. This year they were still there, but it felt like fewer were in attendance. They remain an important part of the Hadoop ecosystem, as visualizing the data is still so powerful.

ETL is still hard

I didn’t see a huge change in the number of ETL tools on display this year. I’m still not convinced that this is the way forward. I feel like we are trading traditional ETL tools (DataStage, Ab Initio, etc.) for new ones that run in Hadoop. In the end, they all suffer from the same weakness: lock-in. In order to make ETL flows simple and easy to use, most tools have built some UI that forces me to either keep using their tool or rewrite my flows in another one.

Hadoop on the cloud is now a viable option

Through all the innovations in Big Data, the thing that no one fixed was how difficult it is to set up Hadoop. For years we’ve been told to keep the data close to the compute to optimize the processing. This has meant large on-prem clusters running on physical storage. That’s started to change.

Offerings from Google, Microsoft, Altiscale, and VMware make running Hadoop on the cloud a real choice. Having set up Hadoop at RBC (https://www.youtube.com/watch?v=sIRT_IuTr7M), I know how much work it is. I would highly recommend that anyone who can justify the use of an external cloud provider take a good look at the various Hadoop-on-the-cloud offerings.

There are several levels of commitment for running Hadoop in the cloud. It’s the same trade-off that all cloud offerings present: control vs. convenience. The sweet spot for me was having the data in cloud storage (e.g. Amazon, Azure, Google) and then instantiating a cluster to run just the workload you need.