Understanding Modern Data Concepts for Organizations — Part 2: Core concepts for your Big Data

Nishant Deshpande
6 min read · May 11, 2021

Series Home

In Article 1, we saw the three types of data. We’re going to dig a little deeper into big data (data type 3), and come up with some core concepts to use when building your big data infrastructure.

What about Operational Data (data type 1)?

Operational data resides in an RDBMS. There isn’t much to think about here for most organizations. The applications that keep the lights on are built on RDBMSs, and there is no reason to rethink that. (CRM systems, HR systems, customer-facing applications, and ERP/MES systems are all examples of such applications.)

The conversation without big data

First, I want to motivate in more depth why big data (type 3) is important. So let’s go back to the example of forecasting sales at Amazon from Article 1.

Here is a conversation between the hypothetical Amazon apparel products manager (PM), who is responsible for giving the forecast, the head of data engineering (DE), and the analyst or data scientist (DS) who is building models for the forecast.

DS> After talking with PM, I’ve pulled together the most important things that can influence sales of a product.

Here is what we would like to have for every product for our sales by zip code forecast model.

  • Sales history for 3 years.
  • Price history for 3 years.
  • Whether the product is eligible for Amazon Prime, and when it became eligible.
  • User ratings and reviews for the last 3 years.

DE> We have sales history in our warehouse. We also have reviews — we know when they were submitted.

But price history — we don’t keep price changes in the warehouse, only sales. And sales don’t necessarily reflect the advertised price, because discounts might have been applied. Also, if there were no sales during some time period, we have no record of the advertised price at all.

Same for user ratings. We just update the rating when new ratings come in. We don’t keep a history of rating changes.

PM> That’s Ok! The models will work with what DE has, right?

DS> Yes… but we have found from research that some products are very sensitive to price changes. And prices change quite often, so….

PM> Ok! You guys rock! So — what can we do? I’m sure you guys can do this.

DE> Ok let me think about it.

One week later…

DE> Ok. We can keep a record of those changes going forward. But it will mean we need to make changes to our warehouse database schema and increase capacity, possibly by 3x given the number of changes that happen. The database and infrastructure team is pretty stretched right now with the latest security patch upgrades and the migration to the new blade servers.

But if I get the exact types of changes you want to track, I can work out the schema changes and capacity increases we will need and get cost estimates for that as well.

DS> Ok. Also, we had some other ideas on tracking the number of product styles and colors… those can be important, but are not as high priority.

DE> Ok. I’d prefer to get everything we need spec’ed out once, so we can make all the changes in one go. Also, if we start adding too much more data, we will have to change our sharding, and that requires approval from the infrastructure VP, and he always wants any additional warehouse work justified in terms of incremental revenue.

PM> Ok! You guys are awesome! I’ll set up another meeting to discuss this in a week. Meantime, I’m going to use my old spreadsheet for the next few months’ forecast. Thanks!

Breaking down the problems

  • A traditional (type 2) data warehouse is being used as the primary way to store historical data that will feed the sales forecast.
  • It looks like the DE wants to store all the changes in a way that entails significant effort from the infrastructure team. This is a bottleneck (the sketch after this list makes the underlying issue concrete).
  • DE is making it clear they won’t like iterative changes. They want all the “required” changes up front. This means either too little of what might be beneficial, or too much, which creates unnecessary costs.
  • “Justify it with incremental revenue” is a way to say “go away” to most ideas, because most ideas don’t come with revenue projections. The barrier to making good ideas happen should be lower than that, and that is enabled by having the capability to try things — in this case, the right data infrastructure.
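To make the price-history problem concrete, here is a minimal sketch in Python (all names are hypothetical, and a plain JSON-lines file stands in for whatever storage you actually use). It contrasts the warehouse-style update-in-place, where the old price is destroyed on every change, with an append-only change log, where history becomes a simple filter over events.

```python
import json
import time

# Warehouse-style: mutate in place. The old value is gone forever.
catalog = {"B00X1234": {"price": 19.99, "rating": 4.3}}

def set_price_in_place(product_id, price):
    catalog[product_id]["price"] = price  # no record of what it used to be

# Event-log style: append every change. History is a query, not a migration.
def record_change(log_path, product_id, field, old, new):
    event = {"ts": time.time(), "product_id": product_id,
             "field": field, "old": old, "new": new}
    with open(log_path, "a") as f:  # append-only, cheap, schema-free
        f.write(json.dumps(event) + "\n")

def price_history(log_path, product_id):
    with open(log_path) as f:
        events = [json.loads(line) for line in f]
    return [e for e in events
            if e["product_id"] == product_id and e["field"] == "price"]
```

Nothing here needs a schema change or a DBA ticket: the day DS decides that style and color changes matter too, they can start appending events with a new field value, and the history accumulates from that day on.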

The library vs the government office

Libraries are my favorite places — they have been since I could read. On the other hand, going to government offices to get information gets my heart rate up, and not in a good way. Why? And what does this have to do with big data and our problem above?

Libraries are organized with a few simple rules. Books are ordered by author’s last name, grouped by genre. It is easy to add and remove books. It is self-serve, and if the users are co-operative, it lets lots of them get what they want without increasing support staff.

Compare that to the government office. You go to the front desk. You try to ask for information. You need to ask for it in their language. Then they might go into the back, you have no idea what they are doing. And they might emerge with some of the information you want.

You get the idea. One is a system with a few easy to understand rules and universal access. The only requirement for something to be there is that it is a book (not loose sheets of paper), and it has an author, and people want to read it.

The other is controlled by a gatekeeper, and there might only be a couple of them, so there are long lines of people waiting to get their precious time.

You want your big data to be the library, not the government office.

(Aside: there are of course some valid reasons for the differences between libraries and government kept information, but increasingly less with technology. Check out https://opendatainitiative.github.io/).

Core concepts for Big Data

These core concepts are a good way to measure any big data infrastructure.

  1. Storing new data should be easy. No need for a lot of resources from the database/infrastructure team to do this.
  2. Storing new data should be cheap. Costs should increase proportionally with the amount of new data to be stored. And storage costs today are cheap.
  3. Storing new data should be flexible. You shouldn’t need to anticipate everything in advance; iterative is better (the sketch after this list shows what this looks like in practice).
  4. Accessing the data should be easy, i.e., no queues in front of the front desk to get access to the data.
  5. Specialized requirements (fast response times, complex analysis) can be handled by specialized software that uses this data, either directly or by copying parts of it. Don’t compromise on 1 to 4 above for this.
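As a rough illustration of concepts 1 to 3, here is a minimal Python sketch (hypothetical names; a local directory stands in for an object store like S3 or GCS). New datasets land as raw JSON lines under date partitions, so storing a brand-new kind of data is one function call rather than a schema migration plus an infrastructure ticket.

```python
import datetime
import json
import pathlib

LAKE = pathlib.Path("datalake")  # in production: an object store (S3, GCS, ...)

def store(dataset, record):
    """Append a record under dataset/date=YYYY-MM-DD/.
    No upfront schema, no DBA ticket; cost grows linearly with bytes written."""
    day = datetime.date.today().isoformat()
    partition = LAKE / dataset / f"date={day}"
    partition.mkdir(parents=True, exist_ok=True)
    with open(partition / "events.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

# Day 1: the team starts tracking price changes.
store("price_changes", {"product_id": "B00X1234", "old": 19.99, "new": 17.99})

# Month 2: style/color changes turn out to matter. Just start writing them;
# nothing had to be re-sharded or approved by the infrastructure VP.
store("style_changes", {"product_id": "B00X1234", "colors_added": ["teal"]})
```

For a sense of “cheap”: 100 million change events a day at roughly 200 bytes each is about 20 GB/day, or around 7 TB after a year; at typical 2021 object-storage list prices (around $0.02 per GB-month), keeping that accumulated year costs on the order of $150 a month. Concept 5 then says: if an analyst needs fast interactive queries, load the relevant slice into a specialized engine, and leave the cheap, flexible landing zone untouched.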

The reality is messy

Unfortunately, if it were that easy, I wouldn’t be writing this.

There are many trade-offs. In particular:

  1. How much is your service or competitive advantage determined by your ability to continually innovate with new data and new ideas? That determines how much flexibility vs. convenience you will need.
  2. Your organizational setup and skill set.
  3. Your size, and whether you can leverage the cloud if you don’t have enough scale to do without it.

Wrapping up

We have gotten deeper into a conceptual framework for looking at Big Data. The core concepts listed are not all a must, but they let you make decisions knowing the trade-offs.

In subsequent articles, we will look at some typical data platforms, seeing how well they perform on the core concepts as well as on other dimensions, like the organizational skill set required and flexibility vs. risk. We will also start to look at the Cloud — possibly the biggest single factor in technology infrastructure decisions today.

Questions? Feel free to leave comments here or reach out to me directly.
