Eating the Elephant: Putting it All Together

Author: Jeremiah Evans, Senior Application Developer

So far in this series we have introduced the “3 V’s” of Big Data and gone over Volume, Velocity, and Variety. I encouraged you to ask yourself several questions:

  • “What questions am I not asking because there’s too much data to sift through?”
  • “What decisions could I be making better if I was seeing the current state of the system, and not the state from yesterday, or a week ago?”
  • “What data am I missing because it’s not tracked when it doesn’t fit into my existing data models?”

It’s clear that to maintain and improve their competitive advantage, companies need a Modern Data Architecture. We’ve seen how what has traditionally been thought of as “big data” has evolved to encompass techniques and technologies that can benefit any company and prepare it for the future. Let’s take a look at what a project to build such an architecture might look like.

The Client Example

Our client is in the manufacturing space, with a nationwide procurement and distribution footprint. Years ago they created an in-house system to track inventory of raw materials and completed product, as well as upcoming orders. This system is a series of spreadsheets that a team puts together based on various sources from suppliers and order data, including price and location of raw materials, and location of product deliveries. Based on this report, decisions are made on purchasing of raw materials and which plants fulfill which orders.

When this system was first created, the calculations were run every day. As the company grew, however, the calculations became complex and large enough that gathering all the data now takes almost a week. In addition, the mechanics of these spreadsheets are understood only by the one or two people who were involved in their creation.

Our client wants to modernize their data architecture, automate their data ingestion, make better use of their existing data warehouse and order-pay systems, and bring in additional third party data as necessary to help make decisions that will increase profit margins while keeping their costs and employee benefits competitive.

Phase 1

After meeting with the client, the product owner identified Reporting Visibility and Ingest Automation as the highest priorities. Once the data analytics cluster was installed, the team began to work through these tasks.

The development team began addressing Reporting Visibility by defining a location to upload the finished spreadsheet to the cluster, and placing dashboards on top of that data. By doing this, the predictions from previous iterations were made available in a way they hadn’t been previously, and business users and executives were able to see new charts and data views provided by the dynamic dashboarding tool.

At the same time, other members of the team worked with the client’s data acquisition SMEs to begin pulling raw data into the platform. The sources were a combination of flat file exports and API endpoints, so the decision was made to leverage a streaming toolkit. Using the data flow platform, the team created a series of connections that called the available APIs and read flat files dropped on a shared network mount. These streams read the data in and applied the basic cleanup tasks that had previously been performed by hand by these SMEs. Once the data was cleaned up, it was landed in a temporary location in the cluster where it was readily accessible for use in the master spreadsheet.
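As a rough illustration of the ingest and cleanup step described above, here is a minimal Python sketch. It is not the client’s actual data flow: the field names (material, location, unit_price), the sample rows, and the helper functions are all hypothetical, and a real deployment would run inside the data flow platform rather than a standalone script. The point is simply that flat file rows and API records can pass through the same cleanup function before landing.

```python
import csv
import io
from typing import Iterable, Iterator


def clean_record(raw: dict) -> dict:
    """Apply basic cleanup: trim whitespace, normalize case, parse the price."""
    return {
        "material": raw["material"].strip().lower(),
        "location": raw["location"].strip().upper(),
        # Strip currency symbols and thousands separators before parsing.
        "unit_price": float(raw["unit_price"].replace("$", "").replace(",", "")),
    }


def read_flat_file(lines: Iterable[str]) -> Iterator[dict]:
    """Yield cleaned records from a CSV export, one row at a time."""
    for row in csv.DictReader(lines):
        yield clean_record(row)


def read_api_payload(payload: list) -> Iterator[dict]:
    """Yield cleaned records from an already-decoded API response."""
    for item in payload:
        # Convert every value to text so both sources share one cleanup path.
        yield clean_record({key: str(value) for key, value in item.items()})


# Hypothetical sample inputs standing in for a file drop and an API call.
flat_file = io.StringIO('material,location,unit_price\n Steel Coil ,oh,"$1,250.00"\n')
api_payload = [{"material": "Resin", "location": "tx", "unit_price": 3.4}]

# "Landing" here is just a list; in practice this would be a cluster location.
landed = list(read_flat_file(flat_file)) + list(read_api_payload(api_payload))
for record in landed:
    print(record)
```

Routing both sources through one cleanup function keeps the rules that the SMEs used to apply by hand in a single place, which makes them easier to review and change.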

Once this phase was complete, data was typically in place to make new decisions every couple of days, instead of once a week.

Phase 2

Having seen the flexibility of the dashboarding tool, the business users had a list of new views they wanted to see, as well as historical data. At the same time, the product owner decided that the other top priority was to begin automating the master spreadsheet. Working with the principal analytics SME, the team was able to identify the core functions of the spreadsheet and began specifying data transformations upstream.

Because the decision had been made to stream in all data, including the batch flat files, streaming applications could be developed for every source. A series of staging tables was created to hold the output of these streaming apps, allowing for a testing phase in which the streaming results and the traditional spreadsheet results could be compared for accuracy.
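The comparison between the streaming results and the traditional results might look something like the following Python sketch. Everything here is an assumption made for illustration: the result sets are modeled as plain dictionaries keyed by plant, the plant names and totals are invented, and the tolerance is arbitrary. In practice the comparison would more likely be a query across the staging tables.

```python
import math


def compare_results(spreadsheet: dict, streaming: dict, rel_tol: float = 1e-6) -> dict:
    """Report keys and values that differ between two result sets."""
    report = {
        "missing_in_streaming": [],
        "extra_in_streaming": [],
        "value_mismatch": [],
    }
    for key, expected in spreadsheet.items():
        if key not in streaming:
            report["missing_in_streaming"].append(key)
        elif not math.isclose(expected, streaming[key], rel_tol=rel_tol):
            # Record both values so an analyst can see the size of the gap.
            report["value_mismatch"].append((key, expected, streaming[key]))
    report["extra_in_streaming"] = sorted(set(streaming) - set(spreadsheet))
    return report


# Hypothetical totals from the manual spreadsheet and the staging tables.
spreadsheet_totals = {"plant_a": 1200.0, "plant_b": 815.5, "plant_c": 430.0}
streaming_totals = {"plant_a": 1200.0, "plant_b": 815.0, "plant_d": 99.0}

report = compare_results(spreadsheet_totals, streaming_totals)
for category, items in report.items():
    print(category, items)
```

Comparing with a tolerance rather than strict equality avoids flagging harmless floating-point differences between a spreadsheet and a streaming job, while still surfacing real discrepancies and missing rows.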

While the speed to final analysis did not significantly improve in this phase, the analysts were able to begin developing other reports off of these intermediate tables and identified several previously unseen gaps where raw materials were arriving too soon, taking up warehouse space and risking damage and loss.

Phase 3

In this phase, the development team finished replicating the spreadsheet logic in a more efficient streaming fashion, replacing the old spreadsheet completely. After validation against the manual process, the team switched over to the new process, which improved performance significantly. While some data was still only updated on a daily or weekly basis, data flowed from source systems to the reporting table in near-real-time, allowing for predictions to be updated live as new data streamed into the system.

Phase 4

With the existing process fully automated, the product owner identified key reporting abilities asked for by the business unit, along with additional external data to help make the predictive models better.

The team began to develop data streams to connect with traffic, weather, and fuel price data. This additional information helped to better estimate transport costs for materials and products, and to route supplies more efficiently. The new external data was streamed into the same core processes and joined with existing data to inform the client’s real-time decision making. With the increased amount of historical data, the team was also able to track the accuracy of predictions more closely, and added that accuracy as an input to future models.
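To make the idea of joining external data and tracking prediction accuracy concrete, here is a small Python sketch. It is a deliberately simplified stand-in, not the client’s model: the Shipment fields, the regional fuel prices, the miles-per-gallon figure, and the cost formula are all hypothetical, and mean absolute error is just one possible accuracy measure.

```python
from dataclasses import dataclass


@dataclass
class Shipment:
    shipment_id: str
    region: str
    distance_miles: float


def estimate_transport_cost(shipment: Shipment, fuel_price_by_region: dict,
                            miles_per_gallon: float = 6.0) -> float:
    """Join a shipment to its regional fuel price and estimate fuel cost."""
    price_per_gallon = fuel_price_by_region[shipment.region]
    return shipment.distance_miles / miles_per_gallon * price_per_gallon


def mean_absolute_error(predicted: dict, actual: dict) -> float:
    """Average absolute gap between predictions and actuals for shared keys."""
    shared = predicted.keys() & actual.keys()
    if not shared:
        raise ValueError("no overlapping shipments to compare")
    return sum(abs(predicted[key] - actual[key]) for key in shared) / len(shared)


# Hypothetical internal data (shipments) and external data (fuel prices).
shipments = [Shipment("S1", "midwest", 300.0), Shipment("S2", "south", 120.0)]
fuel_prices = {"midwest": 4.0, "south": 3.5}

predicted_costs = {
    s.shipment_id: estimate_transport_cost(s, fuel_prices) for s in shipments
}

# Hypothetical actual costs, known only after the shipments complete.
actual_costs = {"S1": 210.0, "S2": 65.0}
error = mean_absolute_error(predicted_costs, actual_costs)
print(predicted_costs, error)
```

Storing the error alongside each round of predictions is one way the accuracy history could later be fed back into the models as an input.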

Could this be you?

This example use case draws on what we have seen with our clients and the industry as a whole. Gartner research predicts that within 5 years, the amount of data being processed will grow 800%, with 93% of that data being unstructured. By 2020, 70% of organizations will adopt data streaming to enable real-time, on-the-fly analytics in their modern data architectures. Will your analysts be busy manually processing data, or building and refining real-time models? Your company needs a modern data architecture, and our phased approach can help you make that a reality.