The myth that AI or Cognitive Analytics will replace data scientists: There is no easy button


First, let me say I would love to have an “easy button” for predictive analytics. Unfortunately, predictive analytics is hard work that requires deep subject matter expertise in both the business problem and data science.  I’m not trying to be provocative with this post.  I just want to share a viewpoint formed over many years of experience.


There are a couple of myths that I see more and more these days.  Like many myths they seem plausible on the surface, but experienced data scientists know that the reality is more nuanced (and sadly requires more work).


  • Deep learning (or Cognitive Analytics) is an easy button.  You can throw massive amounts of data at the algorithm and it will deliver a near optimal model.
  • Big data is always better than small data.  More rows of data always result in a significantly better model than fewer rows of data.

Both of these myths lead some (lately it seems many) people to conclude that data scientists will eventually become superfluous.  With enough data and advanced algorithms maybe we don’t need these expensive data scientists…

In my experience, the two most important phases of a predictive/prescriptive analytics implementation are:

  1. Data preparation: getting the right universe of possible variables, transforming these variables into useful model inputs (feature creation) and finalizing a parsimonious set of variables (feature selection)
  2. Deployment: making the predictive (and/or optimization) model part of an operational process (making micro-level decisions)

Note that I didn’t say anything about picking the “right” predictive model.  There are circumstances where the model type makes a big difference but in general data preparation and deployment are much, much more important.

The Data Scientist Role in Data Preparation

Can you imagine trying to implement a predictive insurance fraud detection solution without deeply understanding the insurance industry and fraud?  You might think that you could rely on the SIU (insurance team that investigates fraud) to give you all the information your model will need.  I can tell you from personal experience that you would be sadly mistaken.  The SIU relies on tribal knowledge, intuition and years of personal experience to guide them when detecting fraud.  Trying to extract the important variables to detect fraud from the SIU can take weeks or even months.  In the past, I’ve asked to go through the training that SIU team members get just to start this process. As a data scientist, it’s critical to understand every detail of both how fraud is committed and what bread crumbs are left for you to follow.  Some examples for personal injury fraud are:

  • Text mining of unstructured text (claim notes, police reports, etc.) can reveal the level of injury and damage to the car; when these are compared to the medical bills, discrepancies can be found
  • Developing a fraud neighborhood using graph theory can determine if a claim/provider is surrounded by fraud or not
  • Entity analytics can determine if a claimant was trying to hide their identity across multiple claims (and is possibly part of a crime ring)

All of these created features are key predictors for a fraud model.  None of them are present in the raw data.  Sadly, no spray-and-pray approach to throwing massive amounts of data at a deep learning (or ensemble machine learning) algorithm will ever uncover these patterns.
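To make the fraud neighborhood idea concrete, here is a minimal sketch in Python.  It is not the method from any production solution; the node names, the links and the choice of two hops are all assumptions made up for illustration.  The feature is simply the share of nodes near a claim that are already known to be fraudulent.

```python
from collections import defaultdict

def fraud_neighborhood_rate(edges, known_fraud, node, hops=2):
    """Share of nodes within `hops` links of `node` that are known fraud."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)

    # Breadth-first search out to the requested number of hops.
    seen = {node}
    frontier = {node}
    for _ in range(hops):
        frontier = {n for f in frontier for n in graph[f]} - seen
        seen |= frontier

    neighbors = seen - {node}
    if not neighbors:
        return 0.0
    return len(neighbors & known_fraud) / len(neighbors)

# Hypothetical links: a claim is connected to the provider that billed it.
edges = [
    ("claim_1", "provider_A"),
    ("claim_2", "provider_A"),
    ("claim_3", "provider_A"),
    ("claim_4", "provider_B"),
    ("claim_5", "provider_B"),
]
known_fraud = {"claim_2", "claim_3"}

rate_1 = fraud_neighborhood_rate(edges, known_fraud, "claim_1")
rate_4 = fraud_neighborhood_rate(edges, known_fraud, "claim_4")
print(rate_1, rate_4)
```

In this toy data claim_1 shares a provider with two known fraud claims, so its rate is high, while claim_4 has a clean neighborhood.  Which entities to link (providers, attorneys, addresses, phone numbers) is exactly the kind of decision that takes subject matter expertise.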

Example: IoT Driven Predictive Maintenance

IoT-connected devices like your car, manufacturing equipment and heavy machinery (to name a few) generate massive amounts of “low information density” data.  The IoT sensor data is represented as massive time series that fall into the “Big Data” category.  I’ve seen people try to analyze this data directly (the spray-and-pray method) with massive Spark clusters and advanced algorithms, with very weak results.  Again, the reality is that you must have a deep understanding of the business problem.  In many cases, you need to understand how parts are engineered, how they are used and how they fail.  With this understanding, you can start to perform feature engineering on the low information density data and transform it into high information metrics (or anomalies) that can be fed into a predictive model.  For example, vibration time series data can be analyzed with an FFT to determine the intensity of vibration in specific frequency ranges.  The frequency ranges to examine are, again, driven by subject matter expertise.  Only after this feature engineering process can you generate a metric that will predict failures from vibration data.
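Here is a rough Python sketch of that vibration feature, under stated assumptions: the signal is synthetic, and the sample rate and frequency bands are made up rather than taken from any real machine.  To stay dependency free it uses a plain discrete Fourier transform, which gives the same values an FFT computes faster; on real sensor volumes you would use a library FFT such as numpy.fft.rfft.

```python
import cmath
import math

def band_energy(signal, sample_rate_hz, low_hz, high_hz):
    """Sum of squared DFT magnitudes for frequencies in [low_hz, high_hz]."""
    n = len(signal)
    energy = 0.0
    # For a real-valued signal only bins up to the Nyquist frequency
    # carry distinct information.
    for k in range(n // 2 + 1):
        freq_hz = k * sample_rate_hz / n
        if low_hz <= freq_hz <= high_hz:
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)
            )
            energy += abs(coeff) ** 2
    return energy

# Synthetic one-second reading: a strong 50 Hz tone plus a weak 120 Hz tone.
sample_rate_hz = 400
signal = [
    math.sin(2 * math.pi * 50 * t / sample_rate_hz)
    + 0.1 * math.sin(2 * math.pi * 120 * t / sample_rate_hz)
    for t in range(sample_rate_hz)
]

# Hypothetical bands of interest; in practice an engineer picks these.
energy_40_60 = band_energy(signal, sample_rate_hz, 40, 60)
energy_110_130 = band_energy(signal, sample_rate_hz, 110, 130)
print(energy_40_60, energy_110_130)
```

Each band energy becomes one model input per time window, which turns thousands of raw readings into a handful of high information metrics.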

Hopefully, I’ve made the case that data scientists are crucial to the predictive solution implementation process.  AI and cognitive analytics are very important tools in the data scientist’s tool chest, but the data scientist is still needed to bridge the gap with subject matter understanding.

If you are interested in feature creation check out this book (as a start):

Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists


Confessions of a Data Scientist: Why I quit social media and still cut my own grass

I hate to have to read a long blog post before I get to the payoff.  So here is mine:

  • Social media fosters a short attention span by triggering a dopamine (reward) response in our brains.  Innovation in Data Science (and many other technical fields) requires deep thought.  Mindless tasks like mowing the grass allow you to enter into deep thought and tap the purely creative center of your brain.  Revolutionary Data Science is as much about creativity as it is about having an analytical mind.

Over a year ago, I realized I had greatly reduced the number of technical books I was reading.  I usually read 1-2 books a month and truly enjoyed learning new analytical techniques with the driving purpose of creating something unique… something no one else had ever thought of (even if it was derivative of other work).  It’s how I’m wired I guess.

Instead of learning, I was spending 2-3 hours a day mindlessly scrolling through Facebook looking at cat videos and political hate speech.  I was also posting family pictures of our adventures and eagerly awaiting the dopamine reward of getting likes from friends and family.  My attention span was getting shorter and shorter by the day.  I was also fundamentally unhappy because I always felt that others were having more adventures than me.  I guess I wasn’t alone in this feeling.

Now I’ve been fortunate enough to have tremendous success in Data Science.  I’ve built analytical solutions for Fortune 500 companies that have driven hundreds of millions of dollars of incremental revenue.  I’m not bragging; I’m incredibly grateful to be one of the seemingly few data scientists who had leadership that would listen to my ideas and give me a long rope to implement them.  I’m thinking specifically of you, Kevin Freeland.

As I thought of the times when I was happiest, I realized that I spent a lot of time thinking… especially during my leisure time.  I took books on Data Mining, Oracle, etc. to the beach.  I spent hours on long drives thinking of new ways to solve business problems with analytics.  Specifically, I thought about the exact moment I came up with a new way to assort products in retail stores.  I was mowing my grass.  I had a big yard.  The mindless repetition of mowing allowed my mind to access the creative center of my brain.  Apparently it did for Joe Walsh also.  Ironically, I live not far from where he crashed a riding mower when he thought up “Rocky Mountain Way”!

Everyone is different and you should take this advice with a huge bag of rock salt but here goes:

  • Try leaving social media for 1 month
  • Take long drives
  • Mow your own grass
  • Block 2-3 hours a day to do nothing… no work, no TV, no social media, no phone.


Make time to


MICAD: A new algorithm/R package for anomaly detection


Anomaly detection algorithms are core to many fraud and security applications/business solutions.  Identifying cases where specific values are outside norms can be useful in outlier detection (as a precursor to predictive modeling) and in identifying cases of interest when labeled data is not available for supervised learning models.  For example, an insurance company might run anomaly detection against a claims database in the hopes of identifying potentially fraudulent (anomalous) claims.  If the medical bills for personal injury claims are anomalously high (given the other characteristics of the claim), then those cases can be further reviewed (by a claims adjuster).  Finally, these newly human-labeled claims could be used in a supervised model to predict fraud.

The Most Common Approach to Anomaly Detection

Probably the most common approach to identifying anomalies is a case-wise comparison (value by value) to peer group averages.  For example, we can take a personal injury claim and compare its medical bill total to the average medical bill total of its peer claims.  If our claim has one or more extreme variable values when compared to the cluster distribution (for the same variable), it can be considered an outlier.  Here’s some pseudocode for a naive modeling algorithm based on this approach:

cluster cases;
for each cluster
  for each variable
    calculate variable averages;
    calculate standard deviation;

for each case
  score case using cluster model to determine appropriate cluster;
  for each variable
    if abs(case variable - cluster average for variable) > 4.0 * cluster stddev for variable
      anomaly score += variable weight;

High scores indicate anomalies.  Supplying variable weights to the above algorithm allows you to tune the overall score (such that [subjectively] important variables contribute more heavily to the overall score total).  Once a supervised model is built, these weights can be tuned using the variable importance measure of the model.
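For readers who prefer something runnable, here is a small Python sketch of the scoring half of that pseudocode.  It assumes the clustering step has already assigned each case to a cluster; the claim values, the cluster name and the weights are invented for illustration.

```python
from statistics import mean, pstdev

def fit_cluster_stats(cases, clusters):
    """Per-cluster (mean, standard deviation) for every variable."""
    grouped = {}
    for case, cluster in zip(cases, clusters):
        grouped.setdefault(cluster, []).append(case)
    stats = {}
    for cluster, members in grouped.items():
        stats[cluster] = {}
        for var in members[0]:
            values = [m[var] for m in members]
            stats[cluster][var] = (mean(values), pstdev(values))
    return stats

def anomaly_score(case, cluster, stats, weights, threshold=4.0):
    """Add a variable's weight when it sits far from its cluster average."""
    score = 0.0
    for var, (avg, sd) in stats[cluster].items():
        if abs(case[var] - avg) > threshold * sd:
            score += weights[var]
    return score

# Hypothetical peer group of soft tissue injury claims.
peers = [
    {"medical_bills": 1000, "days_treated": 10},
    {"medical_bills": 1200, "days_treated": 12},
    {"medical_bills": 900, "days_treated": 9},
    {"medical_bills": 1100, "days_treated": 11},
    {"medical_bills": 1000, "days_treated": 10},
]
stats = fit_cluster_stats(peers, ["soft_tissue"] * len(peers))
weights = {"medical_bills": 2.0, "days_treated": 1.0}

typical = {"medical_bills": 1050, "days_treated": 10}
suspicious = {"medical_bills": 5000, "days_treated": 11}
typical_score = anomaly_score(typical, "soft_tissue", stats, weights)
suspicious_score = anomaly_score(suspicious, "soft_tissue", stats, weights)
print(typical_score, suspicious_score)
```

Only the medical bill total of the second claim is more than four standard deviations from its peer average, so only that weight is added to its score.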

Again, this approach is pretty naive, and it raises several challenges:

  • How do we handle nominal/unordered (factor) variables?
  • What if the data distributions are strongly skewed?
  • What about variable interactions?  Might outlier values be perfectly predictable (and normal) if we included variable interactions?


MICAD is an attempt to improve upon the above naive approach.  The simplest explanation of MICAD is:

Multiple imputation comparison anomaly detection (MICAD) is an algorithm that compares the imputed (or predicted) value of each variable to the actual value. If the predicted value != the actual value, the anomaly score is incremented by the variable weight.

Imputation of values is done using RandomForest (or similar predictive model).  The predictors are the remaining variables in the case.  For example, using the Iris data set we can impute the Sepal Length using the Species, Petal Length, Petal Width and Sepal Width.

Here is the pseudocode for MICAD:

# data preparation
for each variable
  if (type of variable is numeric)
    convert variable to quartile;

# model building
for each variable
   build randomForest classification model to predict variable using remaining variables;
   store randomForest model;

# model scoring
for each variable
   retrieve randomForest model;
   score randomForest model for all cases;
   if (predicted class != actual class)
     anomaly score += variable weight
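
The same three steps can be sketched in plain Python.  This is not the MICAD package and not its API: to keep the example dependency free I swapped the randomForest models for a simple stand-in classifier (a majority vote among the training rows that match the most predictor values), and the Iris-like measurements are made up.

```python
from bisect import bisect_right
from collections import Counter
from statistics import quantiles

def fit_quartiles(rows, numeric_vars):
    """Learn quartile cut points; return a function that bins a row (0-3)."""
    cuts = {v: quantiles([r[v] for r in rows], n=4) for v in numeric_vars}

    def convert(row):
        return {v: bisect_right(cuts[v], x) if v in cuts else x
                for v, x in row.items()}
    return convert

def predict_class(train, target, case):
    """Stand-in for randomForest: vote among the best-matching rows."""
    def matches(row):
        return sum(row[v] == case[v] for v in case if v != target)

    best = max(matches(r) for r in train)
    votes = Counter(r[target] for r in train if matches(r) == best)
    return votes.most_common(1)[0][0]

def micad_score(train, case, weights):
    """Add a variable's weight whenever its predicted class is wrong."""
    return sum(w for var, w in weights.items()
               if predict_class(train, var, case) != case[var])

# Made-up measurements in millimetres, loosely shaped like Iris.
names = ("species", "petal_length", "petal_width", "sepal_length")
raw = [
    ("setosa", 13, 1, 48), ("setosa", 14, 2, 49),
    ("setosa", 15, 3, 50), ("setosa", 16, 4, 51),
    ("versicolor", 44, 13, 58), ("versicolor", 45, 14, 59),
    ("versicolor", 46, 15, 60), ("versicolor", 47, 16, 61),
]
rows = [dict(zip(names, r)) for r in raw]

convert = fit_quartiles(rows, names[1:])
train = [convert(r) for r in rows]
weights = {name: 1.0 for name in names}

normal = convert(dict(zip(names, ("setosa", 14, 2, 49))))
odd = convert(dict(zip(names, ("setosa", 46, 2, 49))))  # petal far too long
normal_score = micad_score(train, normal, weights)
odd_score = micad_score(train, odd, weights)
print(normal_score, odd_score)
```

The odd flower looks like a setosa in every respect except petal length, so the remaining variables predict a short petal, the prediction disagrees with the actual quartile, and that one weight is added to the score.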

An Example Using an Appended Iris Data Set

Downloading & Installing MICAD


Loading the Appended Iris Data Set


Building the MICAD S3 Model

micad.model <- micad(x=iris_anomaly[iris_anomaly$ANOMALY==0,], 

We build the model while excluding the anomaly records.  We do this because Iris is a small data set and a few anomaly records will have a large impact on the models being built.  In production data sets, the effects of a few anomaly records will [likely] not have such a large impact on the models.

The weights are driven by subject matter expertise initially.  Once a supervised model can be built, the weights could be adjusted using that model’s variable importances.
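One simple way to do that adjustment, sketched in Python, is to rescale the importances so the weights average 1.0 across variables.  The rescaling rule and the importance values below are assumptions for illustration, not something the MICAD package prescribes.

```python
def weights_from_importance(importance):
    """Rescale importances so the weights average 1.0 across variables."""
    total = sum(importance.values())
    if total <= 0:
        raise ValueError("importances must sum to a positive number")
    n = len(importance)
    return {var: n * value / total for var, value in importance.items()}

# Hypothetical importances taken from a later supervised fraud model.
importance = {"medical_bills": 0.5, "days_treated": 0.3, "claimant_age": 0.2}
weights = weights_from_importance(importance)
print(weights)
```

Keeping the average weight at 1.0 means scores stay roughly comparable with the ones produced by the initial, expert-chosen weights.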

Scoring the Iris Anomaly Data Set

predict(micad.model, iris_anomaly)

The output is:


The last 4 cases are labeled anomaly = 1.  The appended A$_SCORE column reveals high aggregate scores for the anomaly cases.