Commit 5f54b9b9 authored by titicaca

add some benchmark results, fix typos

parent 1dd58689
@@ -11,7 +11,7 @@ and <a href="https://dl.acm.org/citation.cfm?id=2133363">Isolation-Based Anomaly
We design and implement a distributed iForest on Spark, which is trained via model-wise parallelism, and predicts a new Dataset via data-wise parallelism.
It is implemented in the following steps:
1. Sampling data from a Dataset. Data instances are sampled and grouped for each iTree.
As indicated in the paper, the number of samples for constructing each tree is usually not very large (default value 256).
Thus we can construct a sampled paired RDD, where each row key is a tree index and each row value is a group of sampled data instances for that tree.
1. Training and constructing each iTree in parallel via a map operation, and collecting all iTrees to construct an iForest model.
1. Predicting a new Dataset in parallel via a map operation with the collected iForest model (a minimal sketch of these steps follows the list).
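To make the scheme concrete, here is a minimal sketch of the three steps in plain Spark (spark-shell style, so `sc` is assumed to be available). It is an illustration only, not this library's implementation: `ITree`, `buildITree`, `pathLength` and the toy data are simplified stand-ins.

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

sealed trait ITree extends Serializable
case class Leaf(size: Int) extends ITree
case class Node(attr: Int, split: Double, left: ITree, right: ITree) extends ITree

// Build one isolation tree by recursively splitting on a random attribute and value.
def buildITree(data: Array[Array[Double]], depth: Int, maxDepth: Int, rnd: Random): ITree =
  if (depth >= maxDepth || data.length <= 1) Leaf(data.length)
  else {
    val attr = rnd.nextInt(data(0).length)
    val (lo, hi) = (data.map(_(attr)).min, data.map(_(attr)).max)
    if (lo == hi) Leaf(data.length)
    else {
      val split = lo + rnd.nextDouble() * (hi - lo)
      val (l, r) = data.partition(_(attr) < split)
      Node(attr, split, buildITree(l, depth + 1, maxDepth, rnd), buildITree(r, depth + 1, maxDepth, rnd))
    }
  }

// Path length of an instance in a tree, with the usual average-path adjustment c(n) at leaves.
def avgPath(n: Double): Double =
  if (n > 1) 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n else 0.0
def pathLength(x: Array[Double], t: ITree, depth: Int = 0): Double = t match {
  case Leaf(size)          => depth + avgPath(size)
  case Node(attr, s, l, r) => if (x(attr) < s) pathLength(x, l, depth + 1) else pathLength(x, r, depth + 1)
}

val numTrees = 100
val maxSamples = 256
val maxDepth = math.ceil(math.log(maxSamples.toDouble) / math.log(2.0)).toInt
val data: RDD[Array[Double]] = sc.parallelize(Seq.fill(10000)(Array.fill(3)(Random.nextDouble())))

// Step 1: sample one small group of instances per tree and key it by the tree index
// (one takeSample job per tree -- simple, but good enough for a sketch).
val sampled: RDD[(Int, Array[Array[Double]])] = sc.parallelize(
  (0 until numTrees).map(i => i -> data.takeSample(withReplacement = false, maxSamples, seed = i)),
  numTrees)

// Step 2: model-wise parallelism -- build each iTree via map, then collect the forest.
val forest: Array[ITree] = sampled.map { case (i, s) => buildITree(s, 0, maxDepth, new Random(i)) }.collect()

// Step 3: data-wise parallelism -- broadcast the forest and score every instance via map.
val bcForest = sc.broadcast(forest)
val scores: RDD[Double] = data.map { x =>
  val meanPath = bcForest.value.map(t => pathLength(x, t)).sum / bcForest.value.length
  math.pow(2.0, -meanPath / avgPath(maxSamples))   // anomaly score in (0, 1)
}
scores.take(5).foreach(println)
```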
@@ -98,7 +98,56 @@ println(s"The model's auc: ${binaryMetrics.areaUnderROC()}")
```
## Benchmark
#### Environment
Hardware Setup:
- CPU: Intel(R) Xeon(R) E5-2620 v2 @ 2.1GHz
- RAM: 128 GB

Software Setup:
- Spark Version: v2.2.0
- Sklearn Version: v0.19.1
#### Accuracy Performance
The following table shows the testing AUC results for the original paper [1], spark-iforest and sklearn-iforest.

| Dataset | #Samples | Anomaly-Rate | Dimension | Original Paper | Spark-iForest | Sklearn-iForest |
| ----------|:---------:| -----------:| ---------:| --------------:| -------------:| ---------------:|
| breastw | 683 | 35% | 9 | 0.98 | 0.96 | 0.94 |
| shuttle | 49097 | 7% | 9 | 1.00 | 0.89 | 0.95 |
| http | 567498 | 0.4% | 3 | 1.00 | 0.99 | 0.99 |
| ionosphere| 351 | 36% | 32 | 0.85 | 0.65 | 0.71 |
| satellite | 6435 | 33% | 36 | 0.71 | 0.60 | 0.68 |
#### Time Performance
The following table compares the time cost of sklearn-iforest and spark-iforest.
Here we use *http*, the largest dataset above, for testing; a sketch of this configuration follows the parameter note below.

| time cost (s) | sklearn | spark (4 cores) |
|-------------:| -----------:| ---------------:|
| training | 335 | 34 |
| prediction | 300 | 86 |
* Model Parameters: numTrees = 100, maxSamples = 256
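For reference, the timing setup above can be approximated with a sketch like the following. It assumes the `IForest` estimator and import path from the usage section earlier in this README, and `data/http_features.parquet` is a hypothetical path standing in for the prepared *http* data (a DataFrame with a `features` vector column).

```scala
// Sketch only: applies the benchmark parameters (numTrees = 100, maxSamples = 256)
// and times the two phases reported in the table. The data path below is hypothetical.
import org.apache.spark.ml.iforest.IForest
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("iforest-benchmark").getOrCreate()

// Stand-in for the http dataset prepared as in the usage section
// (a DataFrame with a `features` vector column).
val dataset = spark.read.parquet("data/http_features.parquet")

val iForest = new IForest()
  .setNumTrees(100)     // benchmark setting
  .setMaxSamples(256)   // benchmark setting

// Simple wall-clock timer for the two phases reported above.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
  result
}

val model = timed("training")(iForest.fit(dataset))
timed("prediction") {
  model.transform(dataset).count()   // count() forces the lazy transformation to run
}
```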
#### Scalability Performance
The following table shows the scalability of the spark-iforest model. The testing dataset is still *http*.
The memory is set to 1 GB per executor on Spark, and the number of cores ranges from 1 to 4 (one way to configure such a run is sketched below).

| time cost (s) | 1 core | 2 cores | 3 cores | 4 cores |
|-------------:| -----------:| ------------:| ------------:| ------------:|
| training | 74 | 52 | 40 | 34 |
| prediction | 272 | 157 | 117 | 86 |
* Model Parameters: numTrees = 100, maxSamples = 256
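A minimal configuration sketch for runs like the ones above is shown below; `local[n]` is used here purely for illustration, and on a cluster the equivalent knobs would typically be spark-submit's `--executor-memory` and `--executor-cores` options.

```scala
// Illustrative configuration for the scalability runs: 1 GB of executor memory
// and a core count varied from 1 to 4. local[n] is only an example master URL.
import org.apache.spark.sql.SparkSession

val cores = 4   // vary from 1 to 4 to reproduce the columns above
val spark = SparkSession.builder()
  .appName("iforest-scalability")
  .master(s"local[$cores]")                 // number of cores for this run
  .config("spark.executor.memory", "1g")    // 1 GB per executor, as stated above
  .getOrCreate()
```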
## Requirements
......
@@ -178,15 +178,15 @@ class IForestSuite extends SparkFunSuite with MLlibTestSparkContext with Default
)
}
}
//TODO figure out why it doesn't work
// val iforest = new IForest()
// testEstimatorAndModelReadWrite(
// iforest,
// dataset,
// IForestSuite.allParamSettings,
// IForestSuite.allParamSettings,
// checkModelData
// )
val iforest = new IForest()
testEstimatorAndModelReadWrite(
iforest,
dataset,
IForestSuite.allParamSettings,
IForestSuite.allParamSettings,
checkModelData
)
}
test("boundary case") {
......