Many of us are already aware of the different types of data out there: structured, unstructured and semi-structured. For my first blog post, and as a starting point for writing more in the data and analytics space, here is a brief description of each type.
Structured data –
This is data that can be visualised and stored in rows and columns. Although it is the oldest form of data, in use for as long as humans have had to deal with data, it is estimated to account for only about 10% of the data available today.
So, where do you store structured data? Simple: in a traditional relational database management system (RDBMS). MySQL is free to download and suits such data well, and there are many commercial RDBMS vendors out there too: Oracle, IBM and Microsoft, to name a few. And let's not forget the simple yet useful MS Excel and MS Access.
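To make the "rows and columns" idea concrete, here is a minimal sketch using SQLite, which ships with Python, in place of MySQL. The `customers` table and its contents are made up purely for illustration.

```python
import sqlite3

# In-memory database, so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# The schema fixes the columns and their types up front.
# (The "customers" table and its data are hypothetical.)
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)

# Every row has to fit the same set of columns.
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Chennai"), (2, "Ben", "London"), (3, "Chloe", "London")],
)
conn.commit()

# Because the structure is known in advance, filtering is a short query.
london_customers = conn.execute(
    "SELECT name FROM customers WHERE city = ? ORDER BY name", ("London",)
).fetchall()
print(london_customers)
```

The same statements should carry over to MySQL or another RDBMS with little change, although the connection code and some SQL details differ between products.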
Unstructured data –
Data that does not fit the description above falls into the realm of unstructured data. It is said to constitute about 90% of the data available out there, and the surprising (or shocking) thing is that, by some estimates, 99% of that data was created in the last 5 years! Much of it is generated by humans (social media, for example). The rest is generated by machines (satellites, for example).
So, this is essentially voluminous data. How do you store it? And, more importantly, how do you make use of it to derive meaningful and actionable insights? Hadoop, an open-source technology (inspired by papers Google published on its distributed file system and MapReduce), is by far the most popular way I know of to store such “Big” data efficiently. I will talk more about Hadoop in my other blog posts 🙂 The market prediction is that Hadoop will be a 50 billion dollar business by 2020! However, security and administration issues in the Hadoop architecture and the shortage of skilled labour are still areas of concern that directly affect the Hadoop market. That said, if the Hadoop market sustains itself, it could be a great time for organisations that specialise in “Hadoop cluster maintenance” to spring up and get a big share of the booming market.
Semi-Structured data –
There is a very fine line of demarcation between semi-structured and unstructured data. As mentioned earlier, most of the data generated by humans tends to be unstructured in nature. Semi-structured data, it is believed, makes up about 8 to 10% of the data available out there. It is data that is inherently unstructured but still conforms to some sort of structure. So what sort of structure are we talking about? This sort of data has tags and hierarchies that define the semantics of the underlying data, which adds an element of structure to it. XML and JSON are good examples of this kind of data. They are the two formats I have come across most often, although others, such as YAML and HTML, fit the same description.
NoSQL databases are typically a good fit for storing such data. MongoDB and Cassandra come to mind. The area of NoSQL is in itself quite deep, and getting into its details is not my aim, at least in this post.
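As a small illustration of "tags and hierarchies", here is a sketch that parses a JSON document with Python's standard `json` module. The records and the `city_of` helper are hypothetical; the point is only that each record describes itself through its keys, while no two records are forced to share the same fields.

```python
import json

# Two made-up records: same kind of entity, different sets of fields.
records = json.loads("""
[
  {"name": "Asha", "email": "asha@example.com", "tags": ["blogger", "analyst"]},
  {"name": "Ben", "address": {"city": "London", "country": "UK"}}
]
""")

def city_of(record):
    """Walk the hierarchy, tolerating records that lack an address."""
    return record.get("address", {}).get("city")

# The keys tell us what each value means; missing fields simply yield None.
cities = [city_of(record) for record in records]
print(cities)
```

Unlike a relational table, nothing here rejects a record for having extra or missing fields, which is why the code reading the data has to handle the gaps itself.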
As I finish this blog post, I want to leave you with a thought. Google BigQuery, as many of you might be aware, falls under the realm of “structured data store in the cloud”. But Google recently announced support for JSON objects in addition to CSV. So does that mean BigQuery is starting to move towards unstructured data storage? Or should the notion of semi-structured data be abolished altogether going forward?
I appreciate your comments and thoughts, and thanks for reading my first blog post!
Very well written Mr. Naveen. Very informative and thoughtful.
Keep going.
Good luck.
Soujanya.SV