MICAD: A new algorithm/R package for anomaly detection

Overview

Anomaly detection algorithms are core to many fraud and security applications and business solutions.  Identifying cases where specific values fall outside norms is useful for outlier detection (as a precursor to predictive modeling) and for identifying cases of interest when labeled data is not available for supervised learning models.  For example, an insurance company might run anomaly detection against a claims database in the hope of identifying potentially fraudulent (anomalous) claims.  If the medical bills for a personal injury claim are anomalously high (given the other characteristics of the claim), the claim can be sent to a claims adjuster for further review.  Finally, these newly human-labeled claims could be used to train a supervised model to predict fraud.

The Most Common Approach to Anomaly Detection

Probably the most common approach to identifying anomalies is a case-wise comparison (value by value) to peer group averages.  For example, we could take a personal injury claim and compare its medical bill total to the average medical bill total of its peer claims (its cluster).  If our claim has one or more extreme variable values when compared to the cluster distribution (for the same variable), it can be considered an outlier.  Here’s some pseudocode for a naive modeling algorithm based on this approach:

cluster cases;
for each cluster
  for each variable
    calculate variable average;
    calculate variable standard deviation;
  endfor;
endfor;

for each case
  score case using cluster model to determine appropriate cluster;
  for each variable
    if abs(case variable - cluster average for variable) > 4.0 * cluster stddev for variable
      anomaly score += variable weight;
    endif;
  endfor;
endfor;

High scores indicate anomalies.  Supplying variable weights to the above algorithm allows you to tune the overall score (such that [subjectively] important variables contribute more heavily to the overall score total).  Once a supervised model is built, these weights can be tuned using the variable importance measures of that model.
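The naive approach can be sketched in R roughly as follows. This is only an illustration on the built-in iris measurements: the use of k-means with three clusters, the equal weights of 10, and the variable names are assumptions made for the example, not something prescribed by the approach.

```r
# Naive cluster-based anomaly scoring on the built-in iris measurements.
# Assumptions: k-means with 3 clusters, equal weights, and the 4-sd threshold
# from the pseudocode. None of these choices come from the MICAD package.
set.seed(42)

x <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
weights <- c(Sepal.Length = 10, Sepal.Width = 10,
             Petal.Length = 10, Petal.Width = 10)

# Cluster the cases (variables are standardized first)
km <- kmeans(scale(x), centers = 3, nstart = 10)
cluster <- km$cluster

# Per-cluster mean and standard deviation of every variable,
# expanded so each case lines up with its own cluster's statistics
cluster_mean <- apply(x, 2, function(v) ave(v, cluster, FUN = mean))
cluster_sd   <- apply(x, 2, function(v) ave(v, cluster, FUN = sd))

# Flag values more than 4 cluster standard deviations from the cluster mean
is_extreme <- abs(as.matrix(x) - cluster_mean) > 4.0 * cluster_sd

# Sum the weights of the flagged variables for each case
anomaly_score <- drop((is_extreme + 0) %*% weights[colnames(x)])

head(sort(anomaly_score, decreasing = TRUE))
```

On clean data such as iris, few or no values are likely to exceed the 4 standard deviation threshold, so most scores will be zero; lowering the threshold makes the rule more sensitive.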

Again, this approach is pretty naive, and it raises several challenges:

  • How do we handle nominal/unordered (factor) variables?
  • What if the data distributions are strongly skewed?
  • What about variable interactions?  Might outlier values be perfectly predictable (and normal) if we included variable interactions?
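The skewness concern can be illustrated with a small simulation. The data below are simulated from a log-normal distribution purely for illustration (they are not real claims), and the helper `count_extreme` is a hypothetical name introduced for this sketch.

```r
# Illustration with simulated (not real) data: a right-skewed variable
set.seed(1)
bills <- rlnorm(10000, meanlog = 8, sdlog = 1)  # hypothetical medical bill totals

# Count values more than k standard deviations from the mean
count_extreme <- function(v, k = 4) sum(abs(v - mean(v)) > k * sd(v))

flagged_raw <- count_extreme(bills)       # rule applied to the skewed values
flagged_log <- count_extreme(log(bills))  # same rule after a log transform

c(raw = flagged_raw, log = flagged_log)
```

Every value here comes from the same distribution, yet the rule flags far more cases on the raw scale than on the log scale, which suggests that a mean and standard deviation threshold mostly picks up the long tail when the data are strongly skewed.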

MICAD

MICAD is an attempt to improve upon the above naive approach.  The simplest explanation of MICAD is:

Multiple imputation comparison anomaly detection (MICAD) is an algorithm that compares the imputed (or predicted) value of each variable to the actual value. If the predicted value != the actual value, the anomaly score is incremented by the variable weight.

Imputation of values is done using a random forest (or a similar predictive model).  The predictors are the remaining variables in the case.  For example, using the Iris data set, we can impute Sepal Length from Species, Petal Length, Petal Width and Sepal Width.
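As a rough sketch of the imputation step on its own, the following uses the randomForest package from CRAN (assumed to be installed) and the built-in iris data. It treats Sepal.Length as a numeric target for simplicity; the object names and `ntree = 200` are choices made for this example, not taken from the MICAD package.

```r
library(randomForest)  # CRAN package; assumed to be installed

set.seed(42)

# Predict Sepal.Length from the four remaining iris columns
rf <- randomForest(Sepal.Length ~ ., data = iris, ntree = 200)

# With no new data supplied, predict() returns out-of-bag predictions:
# each case is predicted only by trees that did not see it during training
imputed <- predict(rf)

head(data.frame(actual = iris$Sepal.Length, imputed = round(imputed, 2)))
```

Cases where the actual and imputed values disagree strongly are the ones the remaining variables fail to explain.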

Here is the pseudocode for MICAD:

# data preparation
for each variable
  if (type of variable is numeric)
    convert variable to quartile bin;
  endif;
endfor;

# model building
for each variable
  build randomForest classification model to predict variable using remaining variables;
  store randomForest model;
endfor;

# model scoring
for each variable
  retrieve randomForest model;
  score randomForest model for all cases;
  for each case
    if (predicted class != actual class)
      anomaly score += variable weight;
    endif;
  endfor;
endfor;
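The three steps of the pseudocode might look something like this in R. This is a minimal sketch written from the pseudocode, not the implementation inside the micad package; it assumes the randomForest package from CRAN is installed, uses the built-in iris data, and introduces `to_quartile`, `binned` and `models` as hypothetical names.

```r
library(randomForest)  # CRAN package; assumed to be installed

set.seed(42)
vars    <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
weights <- c(10, 10, 10, 10, 20)
names(weights) <- vars

# Data preparation: replace each numeric variable with its quartile bin
to_quartile <- function(v) {
  breaks <- unique(quantile(v, probs = seq(0, 1, 0.25), na.rm = TRUE))
  cut(v, breaks = breaks, include.lowest = TRUE)
}
binned <- iris[, vars]
is_num <- sapply(binned, is.numeric)
binned[is_num] <- lapply(binned[is_num], to_quartile)

# Model building: one classification forest per variable,
# using the remaining variables as predictors
models <- lapply(vars, function(v) {
  randomForest(x = binned[, setdiff(vars, v)], y = binned[[v]], ntree = 200)
})
names(models) <- vars

# Model scoring: add a variable's weight whenever the predicted class
# differs from the actual class
anomaly_score <- numeric(nrow(binned))
for (v in vars) {
  predicted <- predict(models[[v]], newdata = binned[, setdiff(vars, v)])
  mismatch  <- as.character(predicted) != as.character(binned[[v]])
  anomaly_score <- anomaly_score + weights[[v]] * mismatch
}

table(anomaly_score)
```

Note that this sketch scores the same rows it was trained on, which tends to understate mismatches; in practice the models would be scored against new or held-out cases.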

An Example Using an Appended Iris Data Set

Downloading & Installing MICAD

install.packages("devtools")
library("devtools")
install_github("smutchler/micad/micad")
library("micad")

Loading the Appended Iris Data Set

data(iris_anomaly)

Building the MICAD S3 Model

micad.model <- micad(x=iris_anomaly[iris_anomaly$ANOMALY==0,], 
  vars=c("SEPAL_LENGTH","SEPAL_WIDTH",
         "PETAL_LENGTH","PETAL_WIDTH",
         "SPECIES"),
  weights=c(10,10,10,10,20))
 
print(micad.model)

We build the model while excluding the anomaly records.  We do this because Iris is a small data set, so a few anomaly records would have a large impact on the models being built.  In production data sets, the effects of a few anomaly records will [likely] not have such a large impact on the models.

The weights are initially driven by subject matter expertise.  Once a supervised model can be built, the weights could be adjusted using the importance of each variable in that model.
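One possible way to turn variable importances into weights is sketched below. It assumes the randomForest package from CRAN is installed, and it uses Species in the built-in iris data as a stand-in for a real labeled outcome (such as a confirmed fraud flag); the total of 40 is an arbitrary choice for the example.

```r
library(randomForest)  # CRAN package; assumed to be installed

set.seed(42)

# Stand-in supervised model: Species plays the role of the labeled outcome
rf <- randomForest(Species ~ ., data = iris, ntree = 200)

# Mean decrease in Gini impurity for each predictor
imp <- importance(rf)[, "MeanDecreaseGini"]

# Rescale so the weights sum to a chosen total (40 is arbitrary)
weights <- round(40 * imp / sum(imp), 1)
weights
```

The resulting vector could then be passed as the weights for the corresponding variables, with more important predictors contributing more to the anomaly score.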

Scoring the Iris Anomaly Data Set

scored.data <- predict(micad.model, iris_anomaly)
tail(scored.data)

The output is:

[Screenshot of the tail(scored.data) output; image not available]

The last 4 cases are labeled ANOMALY = 1.  The appended A$_SCORE column reveals high aggregate scores for the anomaly cases.


Author: scottmutchler

Scott Mutchler is the Vice President of Advanced Analytics Services at QueBIT Consulting. Scott has a long track record of delivering enterprise class Advanced Analytics solutions that drive massive customer value. Scott is equally proficient at building high-performing teams, where every team member focuses on their individual strengths. This approach has resulted in a team that is passionate about Advanced Analytics. That passion for Advanced Analytics is reflected in every client engagement. Scott spent the first 17 years of his career building enterprise solutions as a DBA, software developer, and enterprise architect. When Scott discovered his true passion was for Advanced Analytics, he moved into Advanced Analytics leadership roles where he was able to drive hundreds of millions of dollars in incremental revenues and cost savings through the application of Advanced Analytics to the most challenging business problems. His strong IT background turned out to be a huge asset in building integrated Advanced Analytics solutions. Recently Scott was the Predictive Analytics WW Industrial/Retail Sector Lead for IBM. In this role, he worked with IBM SPSS clients world-wide. He architected Advanced Analytics solutions for clients at some of the world’s largest retailers and manufacturers. Scott received his Masters from Virginia Tech in Geochemistry. He lives in Colorado and enjoys an outdoors lifestyle, playing guitar and travelling.
