First, let me say I would love to have an “easy button” for predictive analytics. Unfortunately, predictive analytics is hard work that requires deep subject matter expertise in both the business problem and data science. I’m not trying to be provocative with this post. I just want to share a viewpoint shaped by many years of experience.
There are a couple of myths that I see more and more these days. Like many myths, they seem plausible on the surface, but experienced data scientists know that the reality is more nuanced (and sadly requires more work).
- Deep learning (or Cognitive Analytics) is an easy button. You can throw massive amounts of data at it and the algorithm will deliver a near-optimal model.
- Big data is always better than small data. More rows of data always result in a significantly better model than fewer rows of data.
Both of these myths lead some (lately it seems many) people to conclude that data scientists will eventually become superfluous. With enough data and advanced algorithms maybe we don’t need these expensive data scientists…
In my experience, the two most important phases of a predictive/prescriptive analytics implementation are:
- Data preparation: getting the right universe of possible variables, transforming these variables into useful model inputs (feature creation) and finalizing a parsimonious set of variables (feature selection)
- Deployment: making the predictive (and/or optimization) model part of an operational process (making micro-level decisions)
Note that I didn’t say anything about picking the “right” predictive model. There are circumstances where the model type makes a big difference but in general data preparation and deployment are much, much more important.
The Data Scientist Role in Data Preparation
Can you imagine trying to implement a predictive insurance fraud detection solution without deeply understanding the insurance industry and fraud? You might think that you could rely on the SIU (insurance team that investigates fraud) to give you all the information your model will need. I can tell you from personal experience that you would be sadly mistaken. The SIU relies on tribal knowledge, intuition and years of personal experience to guide them when detecting fraud. Trying to extract the important variables to detect fraud from the SIU can take weeks or even months. In the past, I’ve asked to go through the training that SIU team members get just to start this process. As a data scientist, it’s critical to understand every detail of both how fraud is committed and what bread crumbs are left for you to follow. Some examples for personal injury fraud are:
- Text mining of unstructured text (claim notes, police reports, etc.) can reveal the level of injury and damage to the car; when these are compared to the medical bills, discrepancies can be found
- Developing a fraud neighborhood using graph theory can determine if a claim/provider is surrounded by fraud or not
- Entity analytics can determine if a claimant was trying to hide their identity across multiple claims (and is possibly part of a crime ring)
All of these created features are key predictors for a fraud model. None of them are present in the raw data. Sadly, no spray-and-pray approach to throwing massive amounts of data at a deep learning (or ensemble machine learning) algorithm will ever uncover these patterns.
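To make the “fraud neighborhood” idea a little more concrete, here is a minimal sketch in Python. It assumes claims and providers are linked in a simple undirected graph and scores a node by the share of its direct neighbors that are already confirmed fraud. The node names, the links and the function names are all hypothetical, and a real implementation would likely look beyond direct neighbors and weight the links.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def fraud_neighborhood_score(graph, known_fraud, node):
    """Fraction of a node's direct neighbors that are known fraud."""
    neighbors = graph.get(node, set())
    if not neighbors:
        return 0.0
    return len(neighbors & known_fraud) / len(neighbors)


# Hypothetical links between claims and the providers that billed them.
edges = [
    ("claim_1", "provider_A"),
    ("claim_2", "provider_A"),
    ("claim_3", "provider_A"),
    ("claim_3", "provider_B"),
    ("claim_4", "provider_B"),
]
# Hypothetical set of claims already confirmed as fraud.
known_fraud = {"claim_1", "claim_2"}

graph = build_graph(edges)
score_a = fraud_neighborhood_score(graph, known_fraud, "provider_A")
score_b = fraud_neighborhood_score(graph, known_fraud, "provider_B")
print(f"provider_A: {score_a:.2f}, provider_B: {score_b:.2f}")
```

The resulting score is a created feature: it does not exist in any raw claim record, but it can be fed to the fraud model as one more input column.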
Example: IoT Driven Predictive Maintenance
IoT-connected devices like your car, manufacturing equipment and heavy machinery (to name a few) generate massive amounts of “low information density” data. The IoT sensor data is represented as massive time series that fall into the “Big Data” category. I’ve seen people try to analyze this data directly (the spray-and-pray method) with massive Spark clusters and advanced algorithms, with very weak results. Again, the reality is that you must have a deep understanding of the business problem. In many cases, you need to understand how parts are engineered, how they are used and how they fail. With this understanding, you can start to perform feature engineering on the low information density data and transform it into high information metrics (or anomalies) that can be fed into a predictive model. For example, vibration time series data can be analyzed with a fast Fourier transform (FFT) to determine the intensity of vibration in specific frequency ranges. Which frequency ranges to examine is, again, driven by subject matter expertise. Only after this feature engineering process can you generate a metric that will predict failures from vibration data.
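As a rough illustration of the vibration example, the sketch below computes the energy of a signal inside a chosen frequency band. To keep it self-contained it uses a small textbook radix-2 FFT written in plain Python (which requires a power-of-two signal length); in practice you would more likely call a library FFT routine. The sample rate, the synthetic signal and the two bands are made-up values for illustration only. In a real project the bands would come from the engineers who know how the part fails.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def band_energy(signal, sample_rate, f_low, f_high):
    """Sum of squared FFT magnitudes for bins between f_low and f_high (Hz)."""
    n = len(signal)
    spectrum = fft(signal)
    energy = 0.0
    for k in range(n // 2 + 1):  # non-negative frequencies only
        freq = k * sample_rate / n
        if f_low <= freq <= f_high:
            energy += abs(spectrum[k]) ** 2
    return energy


# Hypothetical sensor: one second sampled at 1024 Hz, with a strong 50 Hz
# component and a weaker 200 Hz component.
sample_rate = 1024
n_samples = 1024
signal = [
    math.sin(2 * math.pi * 50 * t / sample_rate)
    + 0.2 * math.sin(2 * math.pi * 200 * t / sample_rate)
    for t in range(n_samples)
]

# Hypothetical bands that a subject matter expert might ask you to watch.
low_band = band_energy(signal, sample_rate, 40, 60)
high_band = band_energy(signal, sample_rate, 190, 210)
print(f"40-60 Hz energy: {low_band:.1f}, 190-210 Hz energy: {high_band:.1f}")
```

Tracking a band energy like this over time turns millions of raw samples into a handful of high information metrics per machine, which is the kind of input a predictive model can actually use.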
Hopefully, I’ve made the case that data scientists are crucial to implementing a predictive solution. AI and cognitive analytics are very important tools in the data scientist’s tool chest, but the data scientist is still needed to bridge the gap with subject matter understanding.
If you are interested in feature creation check out this book (as a start):