There is a lot of technology out there, and it can leave anyone, especially a novice, confused and lead to wrong decisions in a Big Data strategy. Should I use the cloud or stay on premise? Should I jump to storing data in Hadoop right away? What about NoSQL? Cassandra, MongoDB? What are the differentiators? In this post, I will try to address these questions and list the key elements to consider when making technology choices.

In my previous blog post, Understanding the data world, I talked about the different data sets available and how they can be classified: structured, unstructured and semi-structured. In this post, I want to talk about how best to store different data sets and the things to keep in mind while making your choice.

Let’s say you have captured your data. Now you want to store it in a way that is both cost-effective and efficient.

First, data has to reside on ‘metal’ somewhere, and where that metal physically resides is in your control. You can decide to store your data

on premise, on bare metal that you own or
in the Cloud (Hybrid, Public or Private)

You will have to make this decision keeping in mind compliance and regulations, what sort of accessibility is required, the cost of on-premise versus cloud, whether you have the skills to maintain your own servers, and so on.

It is possible to go very deep into both the cloud and on-premise choices. But I like keeping my posts short (at least reasonably). So here, I am only going to talk about the on-premise storage options that one might want to consider.

On-premise data storage

Let’s say you are a small organisation dealing with some customer data. You are very likely dealing with a lot of structured content in the range of some gigabytes. You can consider storing it in a traditional database like MySQL or Oracle, which is sufficient to meet your needs.

Now let’s say your new product goes viral, you have lots of users, and the user information you intend to store spans terabytes. The natural first reaction is to “scale up”, i.e. add more resources to your database server in the hope that it solves the storage and query efficiency problem. But as your product continues to gain popularity, you will hit a limit where you start fighting with indexing speed and query speed. Moreover, as you scale up, your database server remains a single point of failure. That is when the data you have accumulated, or need to accumulate, starts falling into the Big Data category. Traditional RDBMSs were not designed to scale out, to handle high-velocity data, or to be distributed efficiently. They tend to be the best fit for OLTP and OLAP systems.

Traditional RDBMS is no longer your friend now. So what should you do in that case?

NoSQL

NoSQL (Not Only SQL) is a more scalable and efficient option for storing Big Data. Over the last decade, companies have come up with alternative database designs to address the limitations of RDBMS technology, and the result is the NoSQL world.

How is NoSQL different from traditional RDBMS?

The first and foremost difference is that most NoSQL systems do not fully support the ACID (Atomicity, Consistency, Isolation, Durability) properties the way relational databases do. They may, however, offer eventual consistency and comply with the ‘AID’ part immediately. There are NoSQL systems out there that brand themselves as ACID compliant, so you might want to research that while making your choice. Oracle published an article back in 2012 that helps demystify the eventual consistency philosophy.
NoSQL allows you to store semi-structured, structured and unstructured data. Traditional RDBMSs only support storing structured data.
Most projects that we deal with today tend to be agile, and because of that it has become more and more difficult to define a schema upfront and stick to it once designed. NoSQL, unlike traditional RDBMSs, offers a flexible schema design.
Traditional RDBMSs rely on indexes on columns to keep queries fast. NoSQL databases, on the other hand, usually support auto-sharding, meaning that they natively and automatically spread data across an arbitrary number of servers, without requiring the application to even be aware of the composition of the server pool.
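
To make the auto-sharding point a little more concrete, here is a minimal sketch of how a key can be mapped to a server with a stable hash. The server names and the shard_for helper are made up for illustration, and real NoSQL systems typically use more sophisticated schemes (consistent hashing or range partitioning) so that adding a server does not move most of the keys.

```python
import hashlib

def shard_for(key, servers):
    """Pick a server for a key using a stable hash (simple modulo placement)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]

# Hypothetical server pool; the application only ever talks to "the cluster".
servers = ["node-a", "node-b", "node-c"]

# Group a handful of made-up user keys by the server they land on.
placement = {}
for user_id in ["user:1", "user:2", "user:3", "user:4", "user:5", "user:6"]:
    placement.setdefault(shard_for(user_id, servers), []).append(user_id)

for server, keys in sorted(placement.items()):
    print(server, keys)
```

The same key always lands on the same server, which is what lets the database route reads and writes without the application knowing the layout.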

In this post I am not going to delve deep into the NoSQL topic. But I would like to say that different NoSQL vendors have their own unique offerings, and you might want to compare them to see which option suits you best before making a pick.

Key/Value Databases	Redis, Memcached
Columnar Databases	HBase, Hypertable, Cassandra
Document Databases	MongoDB, CouchDB
Graph Databases	InfoGrid, Neo4j
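
As a rough illustration of how these four families differ, the sketch below expresses one made-up customer and order in each model using plain Python structures. The names and values are hypothetical, and this is not how any particular product stores data internally; it only shows the shape of the data in each model.

```python
import json

# Key/value: an opaque value looked up by a single key.
key_value = {"customer:42": json.dumps({"name": "Asha", "city": "Pune"})}

# Document: a nested, self-describing record; fields can vary per document.
document = {
    "_id": 42,
    "name": "Asha",
    "city": "Pune",
    "orders": [{"order_id": 1001, "total": 59.90}],
}

# Columnar (wide-column): row key -> column family -> column -> value.
wide_column = {
    "42": {
        "profile": {"name": "Asha", "city": "Pune"},
        "orders": {"1001:total": "59.90"},
    }
}

# Graph: nodes plus labelled edges between them.
nodes = {
    "c42": {"type": "customer", "name": "Asha"},
    "o1001": {"type": "order", "total": 59.90},
}
edges = [("c42", "PLACED", "o1001")]

# The same question ("what did customer 42 order?") is answered differently.
print(document["orders"][0]["order_id"])
print([dst for src, rel, dst in edges if src == "c42" and rel == "PLACED"])
```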

So when do you use NoSQL?

NoSQL is not a replacement for the traditional RDBMS, so managing crucial transactional business information is not a good use case for it. If you want to store and analyse data that is not mission critical, then NoSQL is your friend.

NoSQL is about real time processing and interactive access to data. NoSQL use cases often entail end user interactivity, like in web applications, but more broadly they are about reading and writing data very quickly while scaling very efficiently.

Do you need Big Data to justify a NoSQL implementation? Not really. But in my practical experience, people generally tend to use NoSQL when they need to deal with high volumes.

To give a concrete example: in some cases we need a system to be highly available for a high volume of writes. Such a use case is critical for e-commerce. For an extremely large volume of orders, NoSQL is a possible solution.

I would say NoSQL is for small “big data”. So what do you do if your data spans petabytes? Would NoSQL still scale? In my view, the answer is no. Chances are that if your data spans petabytes, you are trying to tap into unstructured data generated by humans and machines. In that case you will need to consider an alternative solution, Hadoop for example.

Hadoop

About 50% of big data projects use some sort of NoSQL, while 5% of big data projects have Hadoop in them. So when do you use Hadoop instead of NoSQL?

Hadoop is for real volumes spanning terabytes and petabytes.
Native Hadoop does not offer transactional integrity.
You would use Hadoop in a ‘write once, read many’ setup.
Natively, Hadoop is best suited for batch processing. You can use libraries that help with streaming ingestion.
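
To give a feel for the batch processing model, here is a minimal sketch of the map, shuffle and reduce steps behind a classic word count, written in plain Python. It runs on a single machine and does not use Hadoop at all; the function names and the input lines are made up for illustration. On a real cluster the framework would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (word, 1) pairs for one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values seen for one key."""
    return key, sum(values)

# Hypothetical input: lines of a file that is written once and then only read.
lines = ["big data is big", "data is data"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Note that the job reads the whole input and produces a complete result at the end, which is why this style suits batch analytics rather than interactive lookups.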

Hadoop is designed for scalability on commodity hardware. Unlike databases, it does not conform strictly to the CAP theorem.

Transactional data is not a good fit for Hadoop. A lot of businesses think that Hadoop is a replacement for the relational database, but it’s not. It is a complementary technology.

With Hadoop you have to select the libraries or abstractions to work with, which can include a SQL-like layer such as Hive, a NoSQL store such as HBase, and others. The key is that you have to understand what those abstractions are and how difficult they are to work with.

Then you need to find, load and clean all your source data, query it, and present and visualise it. So choose Hadoop when you have a huge amount of data, many terabytes or even petabytes, and that data is unstructured or possibly semi-structured. It’s not great for structured data. You should also be willing to invest in consulting and/or training for your staff.

	RDBMS	NoSQL	Hadoop
Scale	not scalable	scalable easily	scalable easily
Performance	fast reads	fast reads and writes	good for batch processing
Hardware	high end hardware	commodity hardware	commodity hardware
Consistency	ACID	eventually consistent	eventually consistent
Data Type	Structured only	Structured, unstructured and semi structured	Structured, unstructured and semi structured
Use case	interactive	interactive	large scale data analytics

Conclusion

Think about what your data is like and what you want to do with it. If you decide to store it on premise, you can either go with a traditional RDBMS or evaluate a NoSQL solution, depending on the current and predicted volume. Use Hadoop when you have lots of unstructured and semi-structured data that is non-transactional and suitable for batch processing. Statistics indicate that more than two thirds of the NoSQL and Hadoop implementations so far tend to be in the cloud. So watch out for my next blog post about data in the cloud.
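
The rules of thumb from this post can be sketched as a small helper. The suggest_store function and its volume cut-offs (about a terabyte and about a petabyte) are my own illustrative assumptions, not hard limits, and a real decision would also weigh compliance, cost and skills.

```python
def suggest_store(volume_tb, structured, transactional):
    """Rough rule of thumb from this post; thresholds are illustrative only."""
    if transactional:
        # Crucial transactional data stays in a relational database.
        return "RDBMS"
    if volume_tb >= 1000:
        # Assumed cut-off: around a petabyte and beyond.
        return "Hadoop"
    if volume_tb >= 1 or not structured:
        # Assumed cut-off: terabytes, or data without a fixed schema.
        return "NoSQL"
    return "RDBMS"

# Hypothetical workloads.
print(suggest_store(0.05, structured=True, transactional=True))
print(suggest_store(5, structured=True, transactional=False))
print(suggest_store(2000, structured=False, transactional=False))
```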

Many of us are already aware of the different types of data that are out there: structured, unstructured and semi-structured. As my first blog post, and in an attempt to build on it and write more in the data and analytics space, here is a brief description of the different data sets.

Structured data –

This is data that can be visualised and stored in rows and columns. Although it is the oldest form of data, around for as long as humans have had to deal with data, it contributes less than about 10% of the data available today.

So, where do you store structured data? Simple: in a traditional relational database management system (RDBMS). MySQL is free to download and suits such data well, although there are many commercial RDBMS vendors out there too, such as Oracle, IBM and Microsoft, to name a few. And let’s not forget the simple yet useful MS Excel and MS Access.
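
As a small illustration of rows and columns, here is a sketch using SQLite, which ships with Python, as a stand-in for MySQL or any other RDBMS. The table and the customer values are made up for the example.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Every row has the same columns, fixed by the schema above.
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Asha", "Pune"), ("Ben", "Leeds")],
)

rows = conn.execute(
    "SELECT name FROM customers WHERE city = ?", ("Pune",)
).fetchall()
print(rows)
conn.close()
```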

Unstructured data –

Data that does not fit the description above falls into the realm of unstructured data. It constitutes about 90% of the data available out there, and the surprising (or shocking) thing is that 99% of that data was created in the last 5 years! This data generally tends to be the sort of data generated by humans (social media, for example). The other kind is data generated by machines (satellites, for example).

So, this is essentially voluminous data. How do you store it? And, more importantly, how do you make use of it to derive meaningful and actionable insights? Hadoop, an open-source Apache project (originally inspired by papers Google published on its file system and MapReduce), is by far the best way to store such “Big” data efficiently. I will talk more about Hadoop in my other blog posts 🙂 The market prediction is that Hadoop will be a 50 billion dollar business by 2020! However, security and administration issues in the Hadoop architecture and the shortage of skilled labour are still areas of concern that directly impact the Hadoop market. That said, if the Hadoop market is sustained, it could be a great time for organisations that specialise in “Hadoop cluster maintenance” to spring up and take a big share of the booming Hadoop market.

Semi-Structured data –

There is a very fine line of demarcation between semi-structured and unstructured data. As mentioned earlier, most of the data generated by humans tends to be unstructured in nature. Semi-structured data, it is believed, makes up about 8 to 10% of the data available out there. Semi-structured data is data that is inherently unstructured but still conforms to some sort of structure. So what sort of structure are we talking about? This sort of data has tags and hierarchies that define the semantics of the underlying data, hence adding an element of structure to it. XML and JSON are good examples of this kind of data. In fact, my study suggests that these are the only two known formats of semi-structured data available so far.

Typically, NoSQL databases are good for storing such data. I can think of MongoDB and Cassandra. The area of NoSQL is in itself quite deep, and getting into the details of it is not my aim, at least in this blog.
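
To show what “tags and hierarchies” look like in practice, here is a minimal sketch that parses the same made-up record in both JSON and XML using Python’s standard library. The field names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in two semi-structured formats.
json_text = '{"user": {"name": "Asha", "tags": ["data", "analytics"]}}'
xml_text = (
    "<user><name>Asha</name>"
    "<tags><tag>data</tag><tag>analytics</tag></tags></user>"
)

record = json.loads(json_text)
root = ET.fromstring(xml_text)

# No schema was declared upfront, yet the keys and tags tell us what each value means.
print(record["user"]["tags"])
print([tag.text for tag in root.iter("tag")])
```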

As I finish this blog post, I want to leave you with a thought. Google BigQuery, as many of you might be aware, falls under the realm of “structured data store in the cloud”. But Google recently announced support for JSON objects in addition to CSV. So does that mean BigQuery is starting to move towards unstructured data storage? Or is it just that the notion of semi-structured data should be abolished altogether going forward?

Appreciate your comments and thoughts and thanks for reading my first blog post!
