Anomaly detection algorithms are core to many fraud and security applications and business solutions. Identifying cases where specific values fall outside the norm is useful for outlier detection (as a precursor to predictive modeling) and for finding cases of interest when labeled data is not available for supervised learning. For example, an insurance company might run anomaly detection against a claims database in the hope of identifying potentially fraudulent (anomalous) claims. If the medical bills for a personal injury claim are anomalously high given the other characteristics of the claim, the case can be reviewed further by a claims adjuster. Finally, these newly human-labeled claims can be used to train a supervised model that predicts fraud.

Probably the most common approach to identifying anomalies is a case-wise, value-by-value comparison to peer group averages. For example, we can take a personal injury claim and compare its medical bill total to the average medical bill total of its peer claims. If our claim has one or more extreme variable values when compared to the cluster distribution for the same variable, it can be considered an outlier. Here’s some pseudocode for a naive modeling algorithm based on this approach:

*cluster cases;*

* for each cluster*

* for each variable*

* calculate variable averages;*

* calculate standard deviation;*

* endfor;*

* endfor;*

*for each case*

* score case using cluster model to determine appropriate cluster;*

* for each variable*

* if abs(case variable - cluster average for variable) > 4.0 * cluster stddev for variable*

* anomaly score += variable weight;*

* endif;*

* endfor;*

* endfor;*

High scores indicate anomalies. Supplying variable weights to the above algorithm lets you tune the overall score so that (subjectively) important variables contribute more heavily to the total. Once a supervised model is built, these weights can be tuned using the variable importance measure of the model.
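The naive approach can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: k-means stands in for whatever clustering method is used, and the function name `naive_anomaly_score`, the number of clusters, the threshold of 4.0, and the equal default weights are all illustrative choices.

```r
# Naive cluster-based anomaly scoring (illustrative sketch, base R only).
# k, threshold, and weights are assumptions chosen for the example.
naive_anomaly_score <- function(x, k = 3, threshold = 4.0,
                                weights = rep(1, ncol(x))) {
  x <- as.matrix(x)
  # Cluster cases on standardized values so no variable dominates
  clusters <- kmeans(scale(x), centers = k, nstart = 10)$cluster

  score <- numeric(nrow(x))
  for (cl in unique(clusters)) {
    rows <- which(clusters == cl)
    members <- x[rows, , drop = FALSE]
    centre <- colMeans(members)
    spread <- apply(members, 2, sd)

    # Distance of each value from its cluster mean, in cluster standard deviations
    z <- sweep(abs(sweep(members, 2, centre)), 2, spread, "/")
    flagged <- z > threshold
    flagged[is.na(flagged)] <- FALSE  # singleton or constant clusters give NA

    # Add the weight of every flagged variable to the case's score
    score[rows] <- drop((flagged * 1) %*% weights)
  }
  score
}

set.seed(42)
scores <- naive_anomaly_score(iris[, 1:4])
table(scores)
```

Note that a sufficiently extreme case can end up in a cluster of its own, where no standard deviation can be computed and nothing is flagged. That is one more weakness of the naive approach.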

Again, this approach is pretty naive, and it raises several challenges:

- How do we handle nominal/unordered (factor) variables?
- What if the data distributions are strongly skewed?
- What about variable interactions? Might outlier values be perfectly predictable (and normal) if we included variable interactions?
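The skew problem is easy to see on synthetic data. The sketch below draws right-skewed, log-normal "claim amounts" (the distribution and its parameters are assumptions made purely for illustration) and applies the four-standard-deviation rule.

```r
set.seed(3)
# Synthetic, strongly right-skewed amounts (illustrative parameters)
claims <- rlnorm(10000, meanlog = 8, sdlog = 1.2)

# Standardize against the overall mean and standard deviation
z <- (claims - mean(claims)) / sd(claims)

# Count values flagged on each side by the 4-standard-deviation rule
flag_counts <- c(high = sum(z > 4), low = sum(z < -4))
flag_counts
```

Because the amounts are positive and the standard deviation is large relative to the mean, no value can ever sit four standard deviations below the mean, so the rule only ever flags the long right tail.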

MICAD is an attempt to improve upon the above naive approach. The simplest explanation of MICAD is:

*Multiple imputation comparison anomaly detection (MICAD) is an algorithm that compares the imputed (or predicted) value of each variable to the actual value. If the predicted value != the actual value, the anomaly score is incremented by the variable weight.*

Imputation of values is done using randomForest (or a similar predictive model). The predictors are the remaining variables in the case. For example, using the Iris data set we can impute the Sepal Length using the Species, Petal Length, Petal Width and Sepal Width.
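As a rough illustration of this imputation step, the sketch below fits a regression forest to the built-in `iris` data using the `randomForest` package, which is assumed to be installed. It is not the code used inside the micad package, and `ntree = 200` is an arbitrary choice.

```r
library(randomForest)

set.seed(1)
# Predict Sepal.Length from Species, Petal.Length, Petal.Width and Sepal.Width
fit <- randomForest(Sepal.Length ~ ., data = iris, ntree = 200)

# predict() without newdata returns out-of-bag predictions, so each case
# is imputed only by trees that did not see it during training
imputed <- predict(fit)

head(data.frame(actual = iris$Sepal.Length, imputed = round(imputed, 2)))
```

Cases where the imputed and actual values disagree strongly are the ones MICAD treats as suspicious.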

Here is the pseudocode for MICAD:

*# data preparation*

* for each variable*

* if (type of variable is numeric)*

* convert variable to quartile;*

* endif;*

* endfor;*

*# model building*

* for each variable*

* build randomForest classification model to predict variable using remaining variables;*

* store randomForest model;*

* endfor;*

*# model scoring*

* for each variable*

* retrieve randomForest model;*

* score randomForest model for all cases;*

* for each case*

* if (predicted class != actual class)*

* anomaly score for case += variable weight;*

* endif;*

* endfor;*

* endfor;*
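The three stages of the pseudocode can be sketched in R as follows. This is a minimal illustration, not the implementation in the micad package: the function names `to_quartiles` and `micad_sketch`, the equal default weights, `ntree = 100`, and the hand-made anomalous row are all assumptions. It requires the `randomForest` package and assumes every column has at least two distinct values after binning.

```r
library(randomForest)

# Data preparation: replace each numeric column with its quartile bin
to_quartiles <- function(df) {
  for (v in names(df)) {
    if (is.numeric(df[[v]])) {
      breaks <- unique(quantile(df[[v]], probs = seq(0, 1, 0.25)))
      df[[v]] <- droplevels(cut(df[[v]], breaks = breaks, include.lowest = TRUE))
    }
  }
  df
}

micad_sketch <- function(train, score_data,
                         weights = setNames(rep(1, ncol(train)), names(train)),
                         ntree = 100) {
  score <- numeric(nrow(score_data))
  for (v in names(train)) {
    predictors <- setdiff(names(train), v)
    # Model building: classify variable v from all remaining variables
    model <- randomForest(x = train[, predictors, drop = FALSE],
                          y = droplevels(train[[v]]), ntree = ntree)
    # Model scoring: a predicted class that differs from the actual class
    # adds the variable's weight to the case's anomaly score
    predicted <- predict(model, newdata = score_data[, predictors, drop = FALSE])
    mismatch <- as.character(predicted) != as.character(score_data[[v]])
    score <- score + weights[[v]] * mismatch
  }
  score
}

# Hypothetical anomaly: a setosa label on virginica-sized measurements
odd <- data.frame(Sepal.Length = 7.7, Sepal.Width = 3.0,
                  Petal.Length = 6.1, Petal.Width = 2.3,
                  Species = factor("setosa", levels = levels(iris$Species)))
all_cases <- to_quartiles(rbind(iris, odd))

set.seed(7)
# Train on the original 150 rows only, then score every case
scores <- micad_sketch(train = all_cases[1:150, ], score_data = all_cases)
tail(scores, 1)
```

Binning the combined data before splitting keeps the factor levels identical between the training and scoring sets, which `predict()` requires for factor predictors.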

install.packages("devtools")

library("devtools")

install_github("smutchler/micad/micad")

library("micad")

data(iris_anomaly)

We build the model while excluding the anomaly records. We do this because Iris is a small data set, and a few anomaly records would have a large impact on the models being built. In production data sets, a few anomaly records will likely not have such a large effect on the models.

The weights are initially driven by subject matter expertise. Once a supervised model can be built, the weights can be adjusted using the variable importance of each variable.
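One possible way to turn variable importances into weights is sketched below. The helper `weights_from_importance`, the label column name `anomaly`, and the choice to normalize the weights to sum to 1 are assumptions for illustration. The demo uses a stand-in label derived from `Species`, because no real fraud labels are available here.

```r
library(randomForest)

# labelled: data frame of predictor columns plus a label column (name assumed)
weights_from_importance <- function(labelled, label = "anomaly", ntree = 500) {
  predictors <- setdiff(names(labelled), label)
  model <- randomForest(x = labelled[, predictors, drop = FALSE],
                        y = factor(labelled[[label]]), ntree = ntree)
  imp <- importance(model, type = 2)[, 1]  # mean decrease in Gini impurity
  imp / sum(imp)                           # normalize so the weights sum to 1
}

set.seed(11)
demo <- iris[, 1:4]
# Stand-in label for illustration only; real labels would come from human review
demo$anomaly <- as.integer(iris$Species == "virginica")
w <- weights_from_importance(demo)
round(w, 2)
```

The resulting named vector can be passed as the per-variable weights when scoring, replacing the initial expert-chosen values.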

The output is:

The last 4 cases are labeled anomaly = 1. The appended A$_SCORE column reveals high aggregate scores for the anomaly cases.
