Author: Jeremiah Evans, Sr. Application Developer
In our previous article, we discussed the “3 V’s” of Big Data: Volume, Velocity, and Variety. We encouraged you to ask yourself, “What questions am I not asking because there’s too much data to sift through?” While capacity isn’t the only consideration when it comes to big data, it is important to understand how you’re going to manage your growing storage needs.
As you begin to track more data from more sources, you have to find a place to put it all. That place needs to be able to grow with your data, maybe even dynamically as you reach capacity thresholds. But more than that, you still need to be able to access this data.
The Hadoop platform offers numerous techniques and tools to store, manage and query your data. More than that, it grows with you and lets you add business value at every step in the process. In this article we will discuss eating the Big Data elephant in manageable pieces and how Zirous can help you navigate from where you are to where you want to be.
Where did I put my keys…?
We’ve all been there. You’ve got to leave for work in 5 minutes and can’t find your keys. You know they’re there, but they aren’t where you expected them to be, and it will take too long to comb through the entire house. The same thing happens with data: if you’re asked to deliver a presentation and you can’t back it up with findings because the dataset is too large, you’ll feel as though you’re searching the whole house for those keys! You know the data is in there, but you don’t have any way to find it. Maybe you have a fully developed data warehouse, but the information you need is coming from an outside source. Between architecting the data structure for the new data and running complex joins on large data sets, you’re spending weeks or even months to ask one question that might not give you the answer you need.
Storage is not the issue when it comes to volume. Storage is cheap, and it’s getting cheaper all the time. The challenge is maintaining consistent, reliable, timely access to the data you store.
There are many strategies in the traditional database and file storage world for storing data, expanding storage capacity, and finding individual elements of data through robust key indices. But what about the data your database hasn’t been architected to find? What about when you want to start mining your data for new combinations your warehouse isn’t set up to search for? Or when you want to join multiple large data sets together and a traditional database takes hours – or even days – to run the query?
Know Your Data
Enterprise data warehouses are good at what they are built for, but they often can’t answer the new questions you need to ask. The existing operational reports you have work well with your warehouse, but what happens when you need a new report around data you’ve never looked at before? Or when you’re making connections between data sets you’ve never connected before?
Unlike a traditional data warehouse, which requires that data structures be architected before putting data into it, Hadoop embraces “schema on read.” Structured or unstructured, internal or external, new or old, the Hadoop environment doesn’t need to know what the data looks like to store it. That means you can begin mining your data the day you get your platform.
No waiting around for an architecture that may or may not match your future needs (or even your current ones!). Build the architecture as you need it to bring value to your research. If your architecture needs to change, change it! The underlying data is still the same, allowing you to look at it through multiple lenses until you find the one with the answers you need.
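To make “schema on read” concrete, here is a minimal sketch in plain Python. It is not Hadoop code; the sample records, field names, and the read_with_schema helper are all made up for illustration. The idea it shows is the one described above: the raw data is stored untouched, and each question brings its own structure at read time.

```python
import json

# Hypothetical raw events, stored exactly as they arrived.
# No table definition or schema was declared before storing them.
RAW_EVENTS = [
    '{"user": "a17", "action": "login"}',
    '{"user": "b42", "action": "purchase", "amount": "19.99"}',
    '{"user": "a17", "action": "purchase", "amount": "5.00"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select fields and convert their types.

    `schema` maps a field name to a converter function. Fields missing
    from a record come back as None instead of causing an error.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: convert(record[name]) if name in record else None
            for name, convert in schema.items()
        }


# Lens 1: who did what?
activity_schema = {"user": str, "action": str}
activity = list(read_with_schema(RAW_EVENTS, activity_schema))

# Lens 2: how much revenue came in? Same raw data, different structure.
revenue_schema = {"user": str, "amount": float}
revenue = [
    row for row in read_with_schema(RAW_EVENTS, revenue_schema)
    if row["amount"] is not None
]
total_revenue = sum(row["amount"] for row in revenue)
```

Neither lens required changing the stored records. If a third question comes along next month, it only needs a third schema.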
Get the Most Out of ALL Your Data
You have the data. You know where it is. You know the question you want to ask, but you can’t get your answer. Why not?
More than one of our clients faces this challenge. Their data sets are not astronomical, but the queries they want to run join the data together in ways that cripple a typical database. Whether it’s running aggregations over large quantities of data, or joining together multiple large data sets to get trend information related to multiple key factors, plenty of “standard data” questions can’t be asked in a “standard data way.” The more complex the relations in your database, the harder it is to bring them together quickly and efficiently. Hadoop offers a toolset to run these queries in a way that takes minutes instead of days.
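One reason this kind of query can speed up is that Hadoop’s MapReduce model splits the work into pieces that run independently and then combines the results. The sketch below imitates that pattern in plain Python on a single machine. It is an illustration only, not a Hadoop API: the partitions, customer IDs, regions, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical order data, already split into partitions the way a
# cluster would spread it across machines: (customer_id, amount).
ORDER_PARTITIONS = [
    [("c1", 20.0), ("c2", 5.0)],
    [("c1", 7.5), ("c3", 12.0)],
]

# Hypothetical small lookup table to join against.
CUSTOMER_REGION = {"c1": "Midwest", "c2": "South", "c3": "Midwest"}


def map_partition(orders):
    """Map step: join each order to its region and emit (region, amount).

    Each partition is processed without looking at any other partition,
    which is what lets a cluster run this step in parallel.
    """
    return [(CUSTOMER_REGION[customer], amount) for customer, amount in orders]


def reduce_pairs(pairs):
    """Reduce step: sum the amounts for each region."""
    totals = defaultdict(float)
    for region, amount in pairs:
        totals[region] += amount
    return dict(totals)


# Here the partitions are mapped one after another; on a cluster they
# would be mapped at the same time on different machines.
mapped = [pair for part in ORDER_PARTITIONS for pair in map_partition(part)]
region_totals = reduce_pairs(mapped)
```

Because no partition depends on another during the map step, adding machines lets more partitions be processed at once, which is where the time savings on large joins and aggregations come from.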
Your data is growing every day – it’s time to start implementing a strategy that lets you manage that growth effectively. Be sure to stay tuned for our next Big Data “3 V” article, where we will take a look at the Velocity of data and how our phased approach to Big Data implementation can help you get from where you are to where you want to be. Data is changing, and you can change with it – one piece at a time. Bon appétit!