Docker – An instant cluster on your PC

I’ve recently started working with Docker Toolbox at work. The original goal was to learn Docker and as we run Windows 7 machines this seemed like the way to go. It has since transformed into something far more useful.

I’m fully aware that Docker’s main purpose is not really the one I’m using it for, but I’ve still found value. I’m often in need of many Linux servers to test the deployment of some tool that we’ll be rolling out. Sometimes it’s a well-known tool like Redis, other times it’s me needing to learn the intricacies of Ansible. Whatever the reason, it is a real pain in the behind to get a set of VMs from our central team. I don’t really need a ton of CPUs, RAM, or disk space; the machines would sit idle most of the day. A public cloud could work here but I’m not allowed to use it. I really just need something with sshd and some basic tooling.

Enter Docker. After learning the ropes, which only took a few days, I was able to create a series of containers with sshd and the same ssh keys (insecure but convenient). Aside from the ease of setup there were several other benefits.
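A minimal sketch of what such an image might look like — this is not my exact setup, and the key filename is hypothetical:

```dockerfile
# Hypothetical sshd-enabled image: SSH server plus some basic tooling.
FROM ubuntu:16.04

RUN apt-get update && apt-get install -y openssh-server vim curl \
    && mkdir /var/run/sshd \
    && mkdir -p /root/.ssh

# Bake the same public key into every container (insecure but convenient).
COPY id_rsa.pub /root/.ssh/authorized_keys

EXPOSE 22
CMD ["/usr/sbin/sshd", "-D"]
```

Build it once, then `docker run` as many "servers" from it as the test calls for.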

I dread it every time I can’t run something locally. I normally edit my code/scripts on my local Windows PC. This is simply more convenient; I have all my tools set up just the way I want them. I could edit the files remotely as I’m quite comfortable in vim but it is just inconvenient enough for me to do something about it. With Docker containers, I can mount a directory from my Windows box onto the container and have it act like a native directory. This has shortened my feedback loop (code, run, test) to the point where it’s the same as working locally.
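As a sketch of that mount, with hypothetical container and image names: Docker Toolbox shares `C:\Users` into its VM as `/c/Users`, so a folder under there can be bind-mounted straight into a container.

```shell
# Hypothetical names throughout; assumes an "sshd-node" image built earlier.
# Under Docker Toolbox, C:\Users on Windows appears as /c/Users in the VM.
docker run -d --name node1 \
    -v /c/Users/me/projects/mytool:/opt/mytool \
    -p 2201:22 \
    sshd-node

# Files edited on the Windows side are immediately visible in the container:
ssh -p 2201 root@$(docker-machine ip) ls /opt/mytool
```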

I’ve been known to make mistakes; that’s how I learn. But some mistakes are easier to recover from than others. Blowing up something on a centrally provided VM is a time consuming process to fix. I could either revert stuff manually (ugh) or I could have the central team re-image the box (double ugh), both terrible options. The consequence of it being hard to recover is that I’m less likely to try things on those machines. This problem goes away with Docker; it takes me seconds to blow away the container and start a new one. This enables me to have more freedom in exploring new technologies.
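The recovery story above is literally two commands (again with hypothetical names):

```shell
# Blow away the broken container and start a clean one from the same image.
docker rm -f node1
docker run -d --name node1 -p 2201:22 sshd-node
```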

Overall, I’m very happy with my current use of Docker and I look forward to discovering more unintended benefits from this approach.

P.S. I know I could have used Vagrant but it takes too much memory per Linux instance.

Elephants In The Cloud

I’ve been working with Hadoop for the last 2 years. The pace of change in the industry has been incredible. I attended Hadoop Summit San Jose in both 2015 and 2016 and I noticed a few trends.

BI tools are still important

In 2015 BI tools were very prevalent in the Community Showcase. This year they were still present, but there seemed to be fewer of them. They remain an important part of the Hadoop ecosystem, as visualizing the data is still so powerful.

ETL is still hard

I didn’t see a huge change in the number of ETL tools on display this year. I’m still not convinced that this is the way forward. I feel like we are trading traditional ETL (DataStage, Ab Initio, etc.) for new tools that run in Hadoop. In the end, they all suffer the same weakness: lock-in. In order to make the ETL flows simple and easy to use, most tools have created some UI that forces me to either stick with their tool or rewrite the flows in another one.

Hadoop on the cloud is now a viable option

Through all the innovations in Big Data, the thing that no one fixed was how difficult it is to set up Hadoop. For years we’ve been told to keep the data close to the compute to optimize the processing. This has meant large on-prem clusters running on physical storage. That’s started to change.

Offerings from Google, Microsoft, Altiscale, and VMware make running Hadoop on the cloud a real choice. Having set up Hadoop at RBC, I know how much work it is. I would highly recommend anyone who can justify the use of an external cloud provider have a good look at the various Hadoop on the cloud offerings.

There are several levels of commitment for running Hadoop in the cloud. It’s the same trade-off that all cloud computing offers: control vs. convenience. The sweet spot for me was keeping the data in cloud storage (e.g., Amazon, Azure, Google) and then instantiating a cluster to run just the workload you need.
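As a concrete sketch of that sweet spot, here is what an ephemeral-cluster workflow might look like on Google Cloud Dataproc — one provider picked purely for illustration, with hypothetical bucket, cluster, and job names:

```shell
# Data lives permanently in cloud storage, not on the cluster.
gsutil cp events.csv gs://my-bucket/input/

# Spin up a cluster only when there is work to do...
gcloud dataproc clusters create temp-cluster --region=us-central1

# ...run the workload against the data sitting in storage...
gcloud dataproc jobs submit spark --cluster=temp-cluster --region=us-central1 \
    --class=com.example.Aggregate --jars=gs://my-bucket/jobs/aggregate.jar

# ...and tear the cluster down again, paying only for the time it ran.
gcloud dataproc clusters delete temp-cluster --region=us-central1 --quiet
```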