I recently had an opportunity to work on a technology consulting project for one of our customers, a services company that provides business analytics solutions. Among other things, the scope of the project included recommending an “analytical platform”, so that the customer could move from their existing custom application, which had limited analytical flexibility, to a more generic platform that would let them deliver the full spectrum of analytic solutions.
We started by trying to understand the current state of business analytics and how it might shape up in the near future, meaning the next 2-4 years. This was important because the platform should address not only current needs but also near-future requirements. After reviewing various leading recent research publications on the subject, it was clear that more and more companies are adopting analytics not only to outperform the competition, but also to avoid risk and to make decisions based on data rather than intuition.
As companies become analytics driven, they will want to use every bit of data available to them. This data can come in all shapes, sizes and speeds: it might be structured, semi-structured or unstructured, and it might arrive as database dumps, text files, audio or video files, sensor data, or web server logs, all in varying sizes. Much of it can become available very quickly, because virtually every business today is conducted digitally, i.e. involving digital computing in some form or another. Thus, the recommended platform should
- Enable storing of any and all sorts of data
- Enable easy conversion of the raw data into the right insights to aid better business decisions.
The latter need, transforming raw data into the right insights by applying various analytical techniques, can only be met by the right talent, which is in very short supply in the market at present but is critical to the success of any analytics project. So the key takeaway from this exercise, our “analytics outlook” as it were, was that the platform should be an enabler on two fronts: producing the right insights from raw data, and producing the right talent, which is hard to find today, from the existing pool of talent.
With that understanding, we started by looking at the options readily available to us, and could not help but think of the platforms we are familiar with, such as Oracle, SQL Server, Teradata etc., which have become ubiquitous and for which skills are easy to find. We quickly figured out that these traditional relational database systems cannot deliver the kind of analytics that businesses now demand; they were simply not built for that purpose. They are “stone age” systems when measured against “new age” analytic requirements, including but not limited to prescriptive, predictive and the good old descriptive analytics. Add to that list text analytics, exploratory analytics, machine learning etc., which have started to become the norm of late.
Looking at how traditional database management systems are built, most of them are SMP, i.e. Symmetric Multi-Processing, systems, designed to share resources such as memory and I/O among the various processes. This kind of design has practical limitations when it comes to scaling for both “data and performance”, as
- Such systems cannot work with very large volumes of data requiring serious processing horsepower.
- Neither can they support ad-hoc, on-demand, interactive analytics requiring rapid iterations.
We concluded that there is a need for systems designed with a radically different approach from that of the traditional systems.
New Age Systems:
The technological breakthroughs of the last decade have produced systems built to address the shortcomings of the traditional systems. Specifically, MPP, i.e. Massively Parallel Processing, systems with a shared-nothing architecture do a great job of leveraging the hardware to build systems that can scale up and scale out for both data and performance. A few notable solutions built on this idea are Apache Hadoop, IBM Netezza, Teradata Aster, EMC Greenplum, Actian Matrix and SAP HANA, among others.
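The shared-nothing idea can be illustrated with a minimal, single-process sketch: rows are hash-partitioned across independent “nodes”, each node aggregates only its own slice of the data with no shared state, and a coordinator merges the partial results. The node count, data and function names here are purely hypothetical, for illustration only; a real MPP system distributes the partitions across physical machines.

```python
# Minimal sketch of shared-nothing MPP aggregation: hash-partition rows
# across "nodes", aggregate locally on each node, then merge the partials.
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size

def partition(rows, key):
    """Hash-distribute rows so each node owns a disjoint slice of the data."""
    nodes = [[] for _ in range(NUM_NODES)]
    for row in rows:
        nodes[hash(row[key]) % NUM_NODES].append(row)
    return nodes

def local_aggregate(rows, key, value):
    """Each node sums its own partition independently (no shared memory)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return totals

def mpp_sum(rows, key, value):
    """Coordinator: merge the per-node partial aggregates."""
    merged = defaultdict(int)
    for node_rows in partition(rows, key):
        for k, v in local_aggregate(node_rows, key, value).items():
            merged[k] += v
    return dict(merged)

sales = [{"region": "east", "amount": 10},
         {"region": "west", "amount": 5},
         {"region": "east", "amount": 7}]
print(sorted(mpp_sum(sales, "region", "amount").items()))
# [('east', 17), ('west', 5)]
```

Because no node ever touches another node’s partition, adding nodes adds both storage and processing capacity, which is exactly the scale-out property described above.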
Of these leading solutions, some are built on a proprietary custom hardware/software combination and shipped as a plug-and-play appliance. IBM Netezza is an example of this category, as is Teradata to some extent, though it is not strictly categorized as an appliance. Others are purely software solutions that run on commodity hardware yet deliver equally good, and in some instances better, performance. The solutions offered as appliances are typically priced higher, and maintenance could be an issue in the long run. That left us to evaluate software-only solutions that leverage commodity hardware; as hardware prices continue to fall and the same dollar buys more storage and processing power, this model was a clear winner. Among the purely software solutions, the choice was between a system such as Hadoop and one such as Actian Matrix or SAP HANA.
Evaluating Hadoop yielded some great insights, especially given how much development is going on in this ecosystem: many projects are being promoted to top-level status, and Hadoop continues to grow stronger and make a firmer case for its place in the analytics world. We understand that, as of now, Hadoop cannot meet requirements such as low-latency, high-speed, ad-hoc querying of data, or advanced analytical functions with rapid iterations, although projects such as Spark and Shark, which leverage in-memory capabilities, are making advances in that direction. It is still a niche system requiring hardcore technical talent to write map-reduce programs in languages such as Java or Python, and such talent is very difficult to find in the market. We have to agree that Hadoop has a place in the current analytics ecosystem, because it works best for batch processing and for handling the variety of data arriving at great velocity and in ever greater volume every day.
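To make concrete why map-reduce favors batch work and demands programming skill, here is a minimal, single-process sketch of the model a Hadoop job follows: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase folds each group. The function names and sample data are hypothetical; a real Hadoop job spreads these phases across a cluster and reads from HDFS.

```python
# Single-process sketch of the map/shuffle/reduce model behind Hadoop jobs,
# shown as the classic word count. Illustration only, not a Hadoop program.
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reduce: fold all the values emitted for one key into a total."""
    return (word, sum(counts))

def word_count(lines):
    groups = defaultdict(list)              # shuffle: group values by key
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)
    return dict(reduce_phase(w, c) for w, c in groups.items())

logs = ["error in batch load", "batch load complete"]
print(word_count(logs))
# {'error': 1, 'in': 1, 'batch': 2, 'load': 2, 'complete': 1}
```

Even this toy shows the trade-off: every question must be recast as map and reduce functions over a full pass of the data, which suits scheduled batch jobs far better than interactive, iterative querying.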
The other software-only systems, such as Actian Matrix or SAP HANA, work well when it comes to scaling up for performance or data volumes, as they
- Are built to address the requirements of ad-hoc querying
- Support low-latency query results
- Can iterate rapidly through analytical models
- Run advanced analytics such as predictive, prescriptive and text analytics
- Offer features such as in-memory capability with in-database analytics, which truly deliver on the promise of real-time analytics where time is the currency
A key feature of these systems is columnar data storage, which greatly speeds retrieval and querying, and which lends itself to various compression techniques that reduce disk and network I/O. These systems, like Hadoop, are also built for high availability and disaster recovery.
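A toy sketch can show both halves of the columnar claim: an analytical query touching one column reads only that column, and repetitive columns compress well, here with simple run-length encoding as a stand-in for the richer schemes real engines use. The table, column names and encoder are hypothetical, chosen only to illustrate the layout.

```python
# Toy sketch of columnar storage: store each column contiguously so a query
# scans only what it needs, and compress repetitive columns with RLE.

rows = [("east", 2013, 10), ("east", 2013, 7), ("west", 2014, 5)]

# A row store keeps whole tuples together; a column store keeps each
# column's values together instead.
columns = {"region": [r[0] for r in rows],
           "year":   [r[1] for r in rows],
           "amount": [r[2] for r in rows]}

def rle_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded

# SELECT SUM(amount): scan one column instead of every full row.
print(sum(columns["amount"]))          # 22
print(rle_encode(columns["region"]))   # [['east', 2], ['west', 1]]
```

The I/O saving compounds: a compressed single column is a small fraction of the full table, which is why columnar engines can answer wide analytical scans so quickly.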
Our evaluation yielded an interesting outcome: Hadoop is a great MPP system built in the modern distributed computing era, but it cannot address all the requirements of new age analytics. Systems like Actian Matrix or SAP HANA therefore have their own place and are indispensable for businesses of our time that wish to benefit from being analytics-driven organizations. So both are required, Hadoop as well as Matrix/HANA. That necessitates a mechanism through which the two systems can interact seamlessly; otherwise they could end up as silos, causing great pain. In other words, the two systems should integrate so well that they seem like one single system.
Having personally worked with Actian Matrix and seen it integrate not only with Hadoop but also with other external systems, we made it our final choice!