About the Author

Dr. Michael Watson, one of the industry’s foremost experts on supply chain network design and advanced analytics, is a columnist and subject matter expert (SME) for Supply Chain Digest.

Dr. Watson, of Northwestern University, was the lead author of the just-released book Supply Chain Network Design, co-authored with Sara Lewis, Peter Cacioppi, and Jay Jayaraman, all of IBM. (See Supply Chain Network Design – the Book.)

Prior to his current role at Northwestern, Watson was a key manager in IBM's network optimization group. In addition to his roles at IBM and now at Northwestern, Watson is director of The Optimization and Analytics Group.

By Dr. Michael Watson

August 26, 2014

Just Because People are Talking about Big Data Doesn’t Mean it is Clean Data

When Completing a Supply Chain Study, Plan on Spending a lot of Time Cleaning and Validating Data

All the talk and hype around Big Data has put pressure on supply chain managers to do more with the data they have. But what often frustrates managers is that the data they have is not clean enough, or not in the right format, to answer the questions they want answered.

I was recently reminded that the problems we face in answering supply chain questions pop up in other areas as well:

First, in a recent Freakonomics podcast (around the 18-minute mark), Steven Levitt, who is also a partner in a consulting firm that answers questions with data (not unlike supply chain questions), made two interesting points:

	1.	Companies that embrace data will dominate those that don’t. (Nothing new with this idea, but worth reminding ourselves)
	2.	Companies just don’t have the data they need to answer important questions. He mentioned that his firm will spend three to six man-months to just put together a data set they can use for a basic analysis. The problem, he says, is that “the data are held in 27 different data sets that have different identifiers.”

The last point is the one we see when we pull together supply chain data. We’ll have different demand files for different divisions, different transportation data from different parts of the business, and different production data. And nothing will match up. Creating a coherent data set will take work: the three to six man-months is more than you will spend cleaning supply chain data, but not much more.
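To make the identifier problem concrete, here is a minimal Python sketch of one way to line up demand files from two divisions that write the same product IDs differently, and to flag IDs that have no match in the production data. It is only an illustration, not a method from the podcast or from any particular project: the product IDs, field names, and quantities are all invented, and real data will need matching rules specific to each source.

```python
# Illustrative only: every ID, field name, and quantity below is hypothetical.

def normalize_id(raw_id):
    """Upper-case the ID and drop spaces and dashes so variants compare equal."""
    return raw_id.upper().replace("-", "").replace(" ", "")

def combine_demand(*division_files):
    """Sum units by normalized product ID across any number of demand files."""
    totals = {}
    for records in division_files:
        for rec in records:
            key = normalize_id(rec["product_id"])
            totals[key] = totals.get(key, 0) + rec["units"]
    return totals

def unmatched_ids(demand_totals, production_ids):
    """Return the demand IDs that have no counterpart in the production data."""
    known = {normalize_id(p) for p in production_ids}
    return sorted(k for k in demand_totals if k not in known)

# Two divisions record the same product with different formatting.
division_a = [
    {"product_id": "ab-0012", "units": 100},
    {"product_id": "AB-0040", "units": 50},
]
division_b = [
    {"product_id": " AB0012 ", "units": 25},
    {"product_id": "XY999", "units": 10},
]
production = ["AB0012", "AB0040"]

totals = combine_demand(division_a, division_b)
missing = unmatched_ids(totals, production)

print(totals)   # demand summed under one consistent ID per product
print(missing)  # IDs that still need a manual look
```

The list of unmatched IDs is the useful part: simple rules resolve the easy cases, and whatever is left over is the manual work that consumes the man-months.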

Second, the New York Times published an article called “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights.” The article discusses the importance of data cleaning (or “data wrangling” or “data janitor work”). Like the Steven Levitt quote, the article reminds us of the strategic value of data and of how difficult it is to clean. I liked the following sentence because it highlights that you have work to do before you get good answers:

“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,” said Jeffrey Heer, a professor of computer science at the University of Washington.

The above podcast and article remind us that the problems we have with cleaning supply chain data are the same ones faced by other people in different industries. In my experience, when doing a supply chain study, you should plan on spending about 60-70% of the total time developing a clean data set.

Final Thoughts:

You could read this article as a depressing reminder that we haven’t come all that far in our access to clean data. However, I have a different outlook: the need to answer questions with data won’t go away, and access to new data sets won’t go away. Instead of worrying about the difficulty of getting clean data, build skills on your team so you can create clean data sets and come up with new insights faster than the competition.

Recent Feedback

Great post, Mike. The old guideline that 80% of ad-hoc analytic time goes into data collection and cleaning still holds true, but in part I think that's because we demand more of our data gathering than we did 10-15 years ago. Today we use more data and more sources of data, and we are generally looking for clearer and deeper insight.

It does pose an interesting question though as to how you can dramatically improve your analytic productivity. If 80% of the time goes into data cleaning, it's not by doing faster analytics.

One-off projects are necessarily inefficient if we must start from unclean, uncollated data sources every time. Planning out a pipeline of analytic work, taking advantage of the substantial overlaps in data requirements between projects, and continuously harmonizing and cleaning these sources into one platform can be dramatically more productive. My personal experience says that you can flip this so that less than 20% of your time is spent doing the janitorial work. That's a huge productivity increase and gets analysts doing what they are really good at: analyzing.

Andrew Gibson
Partner
Crabtree Analytics
Aug 27, 2014

I couldn't agree more. I've spent forty years in transportation and logistics. My main area of responsibility at this point is managing data for our own customers as well as data related to tenders for transportation.

My first step is always to cleanse the data, keep what's important and remove the unnecessary. Over the years I've put together various data files that enable quick lookups to clean the important data...specifically around destinations and Postal Codes.

The biggest hurdle I find is the variation in the spellings of cities. Even within one set of data from one client, I may find as many as four or five different spellings for one destination. To that end, I've created a file that looks at all the variations I have and returns the destination spelling the way I require, so it will run through various rating engines with compatibility. If I come across a new variation, I add it for future reference. I have a comparable file for Postal Codes. They are both constantly evolving as requirements dictate, but the need to update becomes less and less over time.

I used to treat each file individually as it came in but determined that many of the data issues across various sources were comparable. I now spend about a quarter of the time previously spent trying to cleanse data.

Dale Hamelin
Manager - Transportation
DB Schenker of Canada
Sep 4, 2014
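The spelling lookup described in the comment above can be sketched in a few lines of Python. This is an assumption about how such a file might work, not the commenter's actual tool: the variant table, the city names, and the function names are all hypothetical, and in practice the table would live in a maintained file rather than in the code.

```python
# Illustrative only: the variant table and all names here are hypothetical.

CITY_VARIANTS = {
    "montreal": "Montreal",
    "montréal": "Montreal",
    "mtl": "Montreal",
    "toronto": "Toronto",
    "toronto, on": "Toronto",
}

def canonical_city(raw, variants):
    """Return the required spelling, or None if the variant is not yet known."""
    key = " ".join(raw.lower().split())  # lower-case and collapse whitespace
    return variants.get(key)

def clean_destinations(rows, variants):
    """Split raw destinations into cleaned spellings and unknown variants."""
    cleaned, unknown = [], []
    for raw in rows:
        city = canonical_city(raw, variants)
        if city is None:
            unknown.append(raw)
        else:
            cleaned.append(city)
    return cleaned, unknown

variants = dict(CITY_VARIANTS)
cleaned, unknown = clean_destinations(["  MONTREAL", "Mtl", "Trono"], variants)

# A new variation is added once, then resolves automatically from then on.
variants["trono"] = "Toronto"
cleaned_again, unknown_again = clean_destinations(["Trono"], variants)
```

Returning the unknown spellings separately, rather than guessing at them, is what lets the table grow over time and the manual effort shrink.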