I’ve been working with Hadoop for the last two years, and the pace of change in the industry has been incredible. I attended Hadoop Summit San Jose in both 2015 and 2016, and I noticed a few trends.
BI tools are still important
In 2015, BI tools were very prevalent in the Community Showcase. They were back this year, but noticeably fewer of them were in attendance. They remain an important part of the Hadoop ecosystem, because visualizing the data is still one of the most powerful ways to get value from it.
ETL is still hard
I didn’t see a huge change in the number of ETL tools on display this year, and I’m still not convinced this is the way forward. It feels like we are trading traditional ETL tools (DataStage, Ab Initio, etc.) for new ones that run on Hadoop. In the end, they all suffer the same weakness: lock-in. To make ETL flows simple and easy to build, most tools provide a proprietary UI, which means I either keep using that tool or rewrite my flows in another one.
Hadoop on the cloud is now a viable option
Through all the innovation in Big Data, the one thing no one fixed was how difficult it is to set up Hadoop. For years we’ve been told to keep the data close to the compute to optimize processing. This has meant large on-premises clusters running on physical storage. That has started to change.
Offerings from Google, Microsoft, Altiscale, and VMware make running Hadoop on the cloud a real choice. Having set up Hadoop at RBC (https://www.youtube.com/watch?v=sIRT_IuTr7M), I know how much work it is. I would highly recommend that anyone who can justify using an external cloud provider take a good look at the various Hadoop-on-the-cloud offerings.
There are several levels of commitment for running Hadoop in the cloud, and the trade-off is the same one all cloud offerings present: control vs. convenience. The sweet spot for me was keeping the data in cloud storage (e.g. Amazon, Azure, Google) and then instantiating a cluster to run just the workload you need.
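To make that sweet-spot pattern concrete, here is a minimal sketch using the AWS EMR CLI: it launches a short-lived cluster that runs a single Spark step against data already sitting in S3 and then terminates itself. The bucket, jar, and class names are hypothetical placeholders, not from any real deployment, and equivalent flows exist on Google Dataproc and Azure HDInsight.

```shell
# Launch a transient EMR cluster for one Spark job.
# --auto-terminate shuts the cluster down when the step finishes,
# so you pay for compute only while the workload runs; the data
# stays in S3 between runs. All s3:// paths below are placeholders.
aws emr create-cluster \
  --name "transient-etl" \
  --release-label emr-5.0.0 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-type m4.xlarge \
  --instance-count 3 \
  --auto-terminate \
  --steps 'Type=Spark,Name=NightlyAggregation,ActionOnFailure=TERMINATE_CLUSTER,Args=[--class,com.example.Aggregate,s3://my-bucket/jobs/aggregate.jar,s3://my-bucket/input/,s3://my-bucket/output/]'
```

The design choice here is exactly the control-vs-convenience trade-off above: you give up a long-lived, tuned cluster in exchange for paying only per workload and never administering idle nodes.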