Confused about choosing the best technology to suit your Big Data needs?

There is a lot of technology out there, and it can leave anyone, especially a novice, confused and lead to wrong decisions in a Big Data strategy. Should I use the cloud or stay on premise? Should I jump to storing data in Hadoop right away? What about NoSQL, Cassandra or MongoDB? What are the differentiators? In this post, I will try to address these questions and list the key elements to consider when making technology choices.

In my previous blog post, Understanding the data world, I talked about the different data sets available and how they can be classified: structured, unstructured and semi-structured. In this post, I want to talk about how best to store different data sets and the things to keep in mind while making your choice.

Let's say you have captured your data and now want to store it in the most cost-effective yet efficient way.

Firstly, data has to reside on ‘metal’ somewhere, and where that metal physically resides is in your control. You can decide to store your data

  • on premise, on bare metal that you own or
  • in the Cloud (Hybrid, Public or Private)

You will have to make this decision keeping in mind compliance and regulations, the kind of accessibility required, the cost of on premise versus cloud, whether you have the skills to maintain your own servers, and so on.

It is possible to go very deep into both cloud and on-premise choices, but I like keeping my posts (reasonably) short. So here, I am only going to talk about the on-premise storage options that one might want to consider.

On-premise data storage

Let’s say you are a small organisation dealing with some customer data. You are very likely dealing with a lot of structured content in the range of some gigabytes. You can consider storing it in a traditional database such as MySQL or Oracle, which is sufficient to meet your needs.

Let’s say that your new product went viral, you now have lots of users, and the user information you intend to store spans terabytes. The natural first move is to “scale up”, i.e. add more resources to your database servers, hoping that this solves the storage and query efficiency problem. But as your product keeps gaining popularity, you will hit a limit at some point where you start fighting with indexing speed and query speed. Moreover, however far you scale up, your database system remains a single point of failure. That is when the data you have accumulated, or need to accumulate, starts falling into the Big Data category. Traditional RDBMSs were not designed to scale out, to handle high-velocity data, or to be distributed efficiently. They tend to be the best fit for OLTP and OLAP systems.

A traditional RDBMS is no longer your friend at this point. So what should you do?

NoSQL

NoSQL (Not Only SQL) is a more scalable and efficient option for storing Big Data. Over the last decade, companies have come up with alternative designs to address the limitations of traditional RDBMS technology, and these designs have resulted in the NoSQL world.

How is NoSQL different from traditional RDBMS?

  • The first and foremost difference is that most NoSQL systems do not fully support the ACID (Atomicity, Consistency, Isolation, Durability) properties the way relational databases do. They may, however, offer eventual consistency while complying with the ‘AID’ part immediately. There are NoSQL systems out there that brand themselves as ACID compliant, so you might want to research that while making your choice. Oracle published an article back in 2012 that helps demystify the eventual consistency philosophy.
  • NoSQL allows you to store structured, semi-structured and unstructured data. Traditional RDBMSs are designed around storing structured data.
  • Most projects that we deal with today tend to be agile, and because of that agile nature it has become more and more difficult to define a schema upfront and stick to it once designed. NoSQL, unlike traditional RDBMSs, offers a way to have a flexible schema design.
  • Traditional RDBMSs have indexes on columns. NoSQL databases, on the other hand, usually support auto-sharding, meaning that they natively and automatically spread data across an arbitrary number of servers, without requiring the application to even be aware of the composition of the server pool.
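To make the auto-sharding point a little more concrete, here is a minimal sketch of how a key can be mapped to a server with a stable hash. It is only an illustration under my own assumptions: the server names and the `shard_for` helper are hypothetical, and real NoSQL systems typically use more elaborate schemes (consistent hashing or range partitioning, for example) so that adding a server does not reshuffle most of the data.

```python
import hashlib


def shard_for(key, servers):
    """Pick a server for a key using a stable hash of the key (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Interpret the hex digest as an integer and map it onto the server list.
    return servers[int(digest, 16) % len(servers)]


# Hypothetical server pool; the application only ever calls shard_for().
servers = ["node-a", "node-b", "node-c"]

placement = {user: shard_for(user, servers) for user in ["alice", "bob", "carol", "dave"]}
for user, node in placement.items():
    print(user, "->", node)
```

The same key always lands on the same server, which is what lets the database route reads and writes without the application knowing how the pool is composed.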

In this post I am not going to delve deep into the NoSQL topic, but I would like to say that different NoSQL vendors have their own unique offerings, and you might want to compare them to see which option suits you best before making a pick.

Key/Value Databases: Redis, Memcached
Columnar Databases: HBase, Hypertable, Cassandra
Document Databases: MongoDB, CouchDB
Graph Databases: InfoGrid, Neo4j
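As a rough illustration of how these four families differ, the sketch below shapes one made-up customer order the way each family tends to model it. It uses plain Python data structures only; the order, field names and values are hypothetical, and none of this is the API of any of the products listed above.

```python
import json

# Key/value: an opaque value looked up by a single key.
key_value = {"order:1001": json.dumps({"customer": "alice", "total": 42.5})}

# Document: a nested, self-describing record.
document = {
    "_id": 1001,
    "customer": {"name": "alice", "city": "Oslo"},
    "items": [{"sku": "A1", "qty": 2}],
    "total": 42.5,
}

# Flexible schema: a second document may carry different fields than the first.
documents = [document, {"_id": 1002, "customer": {"name": "bob"}, "coupon": "SPRING"}]

# Columnar (wide-column): row key -> column family -> column -> value.
wide_column = {
    "order:1001": {
        "customer": {"name": "alice", "city": "Oslo"},
        "totals": {"amount": 42.5},
    }
}

# Graph: nodes plus labelled edges between them.
graph = {
    "nodes": {"alice": {"type": "customer"}, "order-1001": {"type": "order"}},
    "edges": [("alice", "PLACED", "order-1001")],
}

print(json.loads(key_value["order:1001"])["total"])
print(wide_column["order:1001"]["totals"]["amount"])
```

Which shape fits best depends mostly on how you intend to query the data: by key, by nested fields, by column ranges, or by relationships.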

So when do you use NoSQL?

NoSQL is not a replacement for the traditional RDBMS, so managing crucial (transactional) business information is not a good use case for it. If you want to store and analyse data that is not mission critical, then NoSQL is your friend.

NoSQL is about real time processing and interactive access to data. NoSQL use cases often entail end user interactivity, like in web applications, but more broadly they are about reading and writing data very quickly while scaling very efficiently.

Do you need Big Data to justify a NoSQL implementation? Not really. But in my practical experience, people generally tend to use NoSQL when they need to deal with high volumes.

To give a concrete example: in some cases we need a system to be highly available for a high volume of writes. Such a use case is critical for e-commerce. For an extremely large volume of orders, NoSQL is a possible solution.

I would say NoSQL is for small “big data”. So what do you do if your data spans petabytes? Would NoSQL still scale? The answer is no. Chances are that if your data spans petabytes, you are trying to tap into unstructured data generated by humans and machines. In such a case you will need to consider an alternative solution, Hadoop for example.

Hadoop

About 50% of big data projects use some sort of NoSQL, while 5% of big data projects have Hadoop in them. So when do you use Hadoop instead of NoSQL?

  • Hadoop is for real volumes spanning terabytes and petabytes.
  • Native Hadoop does not offer transactional integrity.
  • You would use Hadoop in a ‘write once and read many’ setup.
  • Natively, Hadoop is best suited for batch processing. You can use libraries that can help in streaming ingestion.
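To give a feel for what batch processing means here, the sketch below imitates the map, shuffle and reduce stages of a word count in plain Python. It runs in a single process and is not Hadoop's API; the function names and the sample log lines are made up for illustration. On a real cluster the same stages would be spread over many machines reading data that was written once and is read many times.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


# Hypothetical input that was written once and is now only read.
log_lines = ["error disk full", "warning disk slow", "error network down"]

mapped = [pair for line in log_lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```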

Hadoop is designed for scalability on commodity hardware. Unlike databases, it does not conform to the CAP theorem strictly.

Transactional data is not a good fit for Hadoop. A lot of businesses think that Hadoop is a replacement for the relational database, but it is not. It is a complementary technology.

With Hadoop you have to select the libraries or abstractions to work with, which can include SQL-like layers such as Hive, NoSQL stores such as HBase, and others. The key is that you have to understand what those abstractions are and how difficult they are to work with.

Then you need to find, load and clean all your source data, query it, and present and visualize it. So choose Hadoop when you have a huge amount of data, many terabytes or even petabytes, and when that data is unstructured or possibly semi-structured. It’s not great for structured data. You should also be willing to invest in consulting and/or training for your staff.

Criterion | RDBMS | NoSQL | Hadoop
Scale | not scalable | easily scalable | easily scalable
Performance | fast reads | fast reads and writes | good for batch processing
Hardware | high-end hardware | commodity hardware | commodity hardware
Consistency | ACID | eventually consistent | eventually consistent
Data type | structured only | structured, unstructured and semi-structured | structured, unstructured and semi-structured
Use case | interactive | interactive | large-scale data analytics
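The rules of thumb from this post can be written down as a small helper. This is only a sketch of my own reasoning, not a formal decision procedure: the `suggest_store` function is hypothetical, and the volume thresholds are illustrative assumptions that you should adapt to your own situation.

```python
# Illustrative thresholds (assumptions, not hard limits).
NOSQL_FROM_TB = 1.0       # roughly where data "spans terabytes"
HADOOP_FROM_TB = 1000.0   # roughly where data "spans petabytes"


def suggest_store(volume_tb, transactional, structured):
    """Return a first candidate technology for an on-premise data store."""
    if transactional:
        # Crucial transactional data stays in a traditional RDBMS.
        return "RDBMS"
    if volume_tb >= HADOOP_FROM_TB:
        # Petabyte-scale, typically unstructured, batch-oriented data.
        return "Hadoop"
    if volume_tb >= NOSQL_FROM_TB or not structured:
        # High volumes or flexible data that needs interactive access.
        return "NoSQL"
    # Small, structured data sets are fine in a traditional database.
    return "RDBMS"


print(suggest_store(0.05, transactional=True, structured=True))
print(suggest_store(5, transactional=False, structured=True))
print(suggest_store(2000, transactional=False, structured=False))
```

Treat the result as a starting point for evaluation rather than a verdict; compliance, skills and cost still have to be weighed separately.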

Conclusion

Think about what your data is like and what you want to do with it. If you decide to store it on premise, you can either go with a traditional RDBMS or evaluate a NoSQL solution, depending on the current and predicted volume. Use Hadoop when you have lots of unstructured and semi-structured data that is non-transactional and suitable for batch processing. Statistics indicate that more than two thirds of the NoSQL and Hadoop implementations that have happened so far tend to be in the cloud. So watch out for my next blog post about data in the cloud.

Author: Naveen Keshava

I am a Big Data enthusiast and love to keep myself updated on the latest trends in Big Data and how it is impacting the way we see and do business.
