and <a href="https://dl.acm.org/citation.cfm?id=2133363">Isolation-Based Anomaly Detection</a> [2].
We design and implement a distributed iForest on Spark, which is trained via model-wise parallelism and predicts on a new Dataset via data-wise parallelism.
It is implemented in the following steps:
1. Sampling data from a Dataset. Data instances are sampled and grouped for each iTree.
As indicated in the paper, the number of samples used to construct each tree is usually not very large (the default value is 256).
Thus we can construct a sampled pair RDD, where each row's key is a tree index and its value is the group of sampled data instances for that tree.
1. Training and constructing each iTree in parallel via a map operation, then collecting all iTrees to construct an iForest model.
1. Predicting on a new Dataset in parallel via a map operation with the collected iForest model, as sketched below.
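A minimal Scala sketch of this scheme is shown below. It is not the library's actual internals: `df`, `newDf`, `spark`, `numTrees`, `maxSamples`, `maxDepth`, and the helpers `ITree`, `buildITree`, and `anomalyScore` are assumed/hypothetical names used only for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.ml.linalg.Vector
import scala.util.Random

// Assumptions: `df` and `newDf` are DataFrames with a "features" vector column,
// `spark` is the SparkSession, and `ITree`, `buildITree`, `anomalyScore` stand in
// for the single-tree structure, its training routine, and the scoring routine.

// Step 1: sample data instances and group them per tree index into a pair RDD.
val featureRdd: RDD[Vector] = df.select("features").rdd.map(_.getAs[Vector](0))
val fraction = maxSamples.toDouble / featureRdd.count()
val sampled: RDD[(Int, Array[Vector])] = featureRdd
  .flatMap { v =>
    // Bernoulli-style assignment of each instance to a random subset of trees.
    (0 until numTrees).filter(_ => Random.nextDouble() < fraction).map(t => (t, v))
  }
  .groupByKey()
  .mapValues(_.toArray)

// Step 2: train each iTree in parallel, then collect them into one iForest model.
val forest: Array[ITree] = sampled
  .map { case (_, samples) => buildITree(samples, maxDepth) }
  .collect()

// Step 3: broadcast the collected forest and score a new Dataset in parallel.
val bcForest = spark.sparkContext.broadcast(forest)
val scores: RDD[Double] = newDf.select("features").rdd
  .map(row => anomalyScore(bcForest.value, row.getAs[Vector](0)))
```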
## Usage
Spark iForest is designed and implemented to be easy to use. Its usage is similar to the iForest implementation in sklearn [3].
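For example, a minimal usage sketch in Scala (the class name `org.apache.spark.ml.iforest.IForest` and the setter names are assumed to mirror the parameters listed below; adapt them to the actual API):

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.iforest.IForest  // assumed package/class name

// Assemble raw numeric columns into the vector column expected by featuresCol.
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2", "x3"))
  .setOutputCol("features")
val data = assembler.transform(rawDf)  // rawDf: an input DataFrame (assumed)

// Configure the estimator; setter names are assumed to follow the parameters below.
val iForest = new IForest()
  .setNumTrees(100)
  .setMaxSamples(256)
  .setContamination(0.1)
  .setBootstrap(false)
  .setSeed(123456L)

// Fit on the training data, then score/label the same (or a new) Dataset.
val model = iForest.fit(data)
val predictions = model.transform(data)
predictions.show(10)
```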
*Parameters:*
- *numTrees:* The number of trees in the iForest model (>0).
- *maxSamples:* The number of samples to draw from data to train each tree (>0).
If maxSamples <= 1, the algorithm will draw maxSamples * totalSamples samples.
If maxSamples > 1, the algorithm will draw maxSamples samples.
The total memory is about maxSamples * numTrees * 4 + maxSamples * 8 bytes.
- *maxFeatures:* The number of features to draw from data to train each tree (>0).
If maxFeatures <= 1, the algorithm will draw maxFeatures * totalFeatures features.
If maxFeatures > 1, the algorithm will draw maxFeatures features.
- *maxDepth:* The height limit used in constructing a tree (>0).
The default value will be about log2(numSamples).
- *contamination:* The proportion of outliers in the data set; the value should be in (0, 1).
It is only used in the prediction phase to convert anomaly scores into predicted labels.
To enhance performance, the anomaly score threshold is calculated by approxQuantile.
Note that this is an approximate quantile computation; if you want an exact answer,
you can extract "$anomalyScoreCol" to select your anomalies.
- *bootstrap:* If true, individual trees are fit on random subsets of the training data sampled with replacement.
If false, sampling without replacement is performed.
- *seed:* The seed used by the random number generator.
- *featuresCol:* features column name, default "features".