The myth that AI or Cognitive Analytics will replace data scientists: There is no easy button


First, let me say I would love to have an “easy button” for predictive analytics. Unfortunately, predictive analytics is hard work that requires deep subject matter expertise in both the business problem and data science.  I’m not trying to be provocative with this post.  I just want to share my viewpoint from many years of experience.

————-

There are a couple of myths that I see more and more these days.  Like many myths, they seem plausible on the surface, but experienced data scientists know that the reality is more nuanced (and sadly requires more work).

Myths:

  • Deep learning (or Cognitive Analytics) is an easy button.  You can throw massive amounts of data at it and the algorithm will deliver a near-optimal model.
  • Big data is always better than small data.  More rows of data always result in a significantly better model than fewer rows of data.

Both of these myths lead some (lately, it seems, many) people to conclude that data scientists will eventually become superfluous.  With enough data and advanced algorithms, maybe we don’t need these expensive data scientists…

In my experience, the two most important phases of a predictive/prescriptive analytics implementation are:

  1. Data preparation: getting the right universe of possible variables, transforming these variables into useful model inputs (feature creation) and finalizing a parsimonious set of variables (feature selection)
  2. Deployment: making the predictive (and/or optimization) model part of an operational process (making micro-level decisions)
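To make the feature creation and feature selection steps concrete, here is a minimal sketch in Python. The column names and toy data are hypothetical, invented purely for illustration; the point is the pattern: derive domain-driven features from raw fields, then keep a parsimonious subset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical claims data -- raw fields rarely enter a model directly.
rng = np.random.default_rng(0)
claims = pd.DataFrame({
    "claim_amount": rng.gamma(2.0, 5000.0, 500),
    "medical_bills": rng.gamma(2.0, 3000.0, 500),
    "days_to_report": rng.integers(0, 60, 500),
    "is_fraud": rng.integers(0, 2, 500),
})

# Feature creation: domain-driven ratios and flags.
claims["bill_to_claim_ratio"] = claims["medical_bills"] / claims["claim_amount"]
claims["late_report"] = (claims["days_to_report"] > 30).astype(int)

# Feature selection: keep a parsimonious set of inputs.
features = ["claim_amount", "medical_bills", "days_to_report",
            "bill_to_claim_ratio", "late_report"]
selector = SelectKBest(f_classif, k=3).fit(claims[features], claims["is_fraud"])
selected = [f for f, keep in zip(features, selector.get_support()) if keep]
print(selected)
```

In a real engagement, the created features and the selection criteria would come from subject matter expertise, not from a generic statistical filter alone.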

Note that I didn’t say anything about picking the “right” predictive model.  There are circumstances where the model type makes a big difference but in general data preparation and deployment are much, much more important.

The Data Scientist Role in Data Preparation

Can you imagine trying to implement a predictive insurance fraud detection solution without deeply understanding the insurance industry and fraud?  You might think that you could rely on the SIU (the insurance team that investigates fraud) to give you all the information your model will need.  I can tell you from personal experience that you would be sadly mistaken.  The SIU relies on tribal knowledge, intuition, and years of personal experience to guide them when detecting fraud.  Trying to extract the important variables to detect fraud from the SIU can take weeks or even months.  In the past, I’ve asked to go through the training that SIU team members get just to start this process.  As a data scientist, it’s critical to understand every detail of both how fraud is committed and what breadcrumbs are left for you to follow.  Some examples for personal injury fraud are:

  • Text mining of unstructured text (claim notes, police reports, etc.) can reveal the level of injury and damage to the car; when compared to medical bills, discrepancies can be found
  • Developing a fraud neighborhood using graph theory can determine if a claim/provider is surrounded by fraud or not
  • Entity analytics can determine if a claimant was trying to hide their identity across multiple claims (and possibly part of a crime ring)
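The “fraud neighborhood” idea above can be sketched with a small graph. This is a toy illustration, not the production approach: the node names, edge semantics (claims linked to shared providers and phone numbers), and the two-hop radius are all assumptions made for the example.

```python
import networkx as nx

# Hypothetical claim/provider graph: edges link claims to the providers
# and phone numbers they share (shared identifiers often connect a ring).
G = nx.Graph()
G.add_edges_from([
    ("claim_1", "provider_A"), ("claim_2", "provider_A"),
    ("claim_2", "phone_555"),  ("claim_3", "phone_555"),
    ("claim_4", "provider_B"),
])
known_fraud = {"claim_1", "claim_3"}

def fraud_neighborhood_score(graph, node, radius=2):
    """Fraction of nodes within `radius` hops that are known fraud."""
    neighborhood = set(nx.ego_graph(graph, node, radius=radius).nodes) - {node}
    if not neighborhood:
        return 0.0
    return len(neighborhood & known_fraud) / len(neighborhood)

print(fraud_neighborhood_score(G, "claim_2"))  # surrounded by known fraud
print(fraud_neighborhood_score(G, "claim_4"))  # isolated claim
```

The score becomes a created feature for the fraud model: a claim embedded in a neighborhood of confirmed fraud scores high, while an isolated claim scores zero. None of this structure is visible in the raw claim rows.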

All of these created features are key predictors for a fraud model.  None of them are present in the raw data.  Sadly, no spray-and-pray approach of throwing massive amounts of data at a deep learning (or ensemble machine learning) algorithm will ever uncover these patterns.

Example: IoT Driven Predictive Maintenance

IoT-connected devices like your car, manufacturing equipment and heavy machinery (to name a few) generate massive amounts of “low information density” data.  The IoT sensor data is represented as massive time series that fall into the “Big Data” category.  I’ve seen people try to analyze this data directly (the spray-and-pray method) with massive Spark clusters and advanced algorithms, with very weak results.  Again, the reality is that you must have a deep understanding of the business problem.  In many cases, you need to understand how parts are engineered, how they are used and how they fail.  With this understanding, you can start to perform feature engineering on the low information density data and transform it into high information density metrics (or anomalies) that can be fed into a predictive model.  For example, vibration time series data can be analyzed with FFT to determine the intensity of vibration in specific frequency ranges.  The frequency ranges that should be examined are, again, driven by subject matter expertise.  Only after this feature engineering process can you generate a metric that will predict failures with vibration data.
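The FFT-based feature engineering described above can be sketched as follows. The fault frequency (120 Hz), sampling rate, and band limits are hypothetical values chosen for illustration; in practice they would come from the engineering team’s knowledge of the component.

```python
import numpy as np

# Toy vibration signal: a 120 Hz tone (a hypothetical fault frequency)
# buried in noise, sampled at 1 kHz for 2 seconds.
fs = 1000.0
t = np.arange(0, 2.0, 1 / fs)
rng = np.random.default_rng(1)
signal = 0.5 * np.sin(2 * np.pi * 120 * t) + rng.normal(0, 1.0, t.size)

# FFT: move from the time domain to the frequency domain.
spectrum = np.abs(np.fft.rfft(signal)) ** 2
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

def band_energy(freqs, spectrum, lo, hi):
    """Total spectral energy in a band chosen by subject matter experts."""
    mask = (freqs >= lo) & (freqs < hi)
    return spectrum[mask].sum()

# The engineered feature: energy near the suspected fault frequency,
# normalized by total energy, ready to feed a predictive model.
feature = band_energy(freqs, spectrum, 110, 130) / spectrum.sum()
print(round(feature, 3))
```

The raw time series is low information density; the band-energy ratio is a single high information density number that a predictive maintenance model can actually use.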

Hopefully, I’ve made the case that data scientists are crucial to the predictive solution implementation process.  AI and cognitive analytics are very important tools in the data scientist’s tool chest, but the data scientist is still needed to bridge the gap with subject matter understanding.

If you are interested in feature creation check out this book (as a start):

Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists


Author: scottmutchler

Scott Mutchler is the Chief Data Scientist at QueBIT Consulting. Scott has a long track record of delivering enterprise-class Advanced Analytics solutions that drive massive customer value. Scott is equally proficient at building high-performing teams, where every team member focuses on their individual strengths. This approach has resulted in a team that is passionate about Advanced Analytics. That passion for Advanced Analytics is reflected in every client engagement. Scott spent the first 17 years of his career building enterprise solutions as a DBA, software developer, and enterprise architect. When Scott discovered his true passion was for Advanced Analytics, he moved into Advanced Analytics leadership roles where he was able to drive hundreds of millions of dollars in incremental revenues and cost savings through the application of Advanced Analytics to the most challenging business problems. His strong IT background turned out to be a huge asset in building integrated Advanced Analytics solutions. Recently, Scott was the Predictive Analytics WW Industrial/Retail Sector Lead for IBM. In this role, he worked with IBM SPSS clients worldwide. He architected Advanced Analytics solutions for clients at some of the world’s largest retailers and manufacturers. Scott received his Masters from Virginia Tech in Geochemistry. He lives in Colorado and enjoys an outdoors lifestyle, playing guitar and travelling.
