Commit 0bb94696 authored by Jeremy Yen

Clear all docs page

parent c75e33e6
# Deflect Documentation
This is the public repository for the documentation of the Deflect project - a
Distributed Denial of Service (DDoS) mitigation service created to neutralize
cyberattacks against independent media and human rights defenders.
Our goal is to create a community-driven, technical response to the censorship
of online voices caused by DDoS attacks.
* Go to the [home page]( for an index of the most important resources.
* Read the [instructions to contribute]( to this documentation.
* Read the [instructions to translate]( this documentation.
## Translations
To extract strings and upload them to Transifex, use the `` script.
To fetch strings from Transifex, use ``.
Previous (unreliable) attempts at automating this:
* ``: we used to run this in a CI job.
* ``: we tried some `txgh` project, didn't like it. Also tried just running
`transifex --fetch && git add -u && git commit && git push` in a cron script on a server,
but we didn't like that either, or something.
In practice, these cron script / webhook Rube Goldberg contraptions were
breaking all the time. We don't change these strings very often, and since we
have to be in contact with translators anyway, we might as well just do these
things by hand.
@@ -20,14 +20,11 @@
{% endif %}
<li class="nav-item">
<a class="nav-link" target="_blank" href="">{{ _("Go to main site") }}</a>
<li class="nav-item">
<a class="nav-link" href="">{{ _("Sign In") }}</a>
<li class="nav-item nav-support-us">
<a class="nav-link" target="_blank" href="{{ pathto('support_us') }}">{{ _("Support Us") }}</a>
<a class="nav-link" target="_blank" href="">{{ _("Go to main site") }}</a>
</div><!--/.navbar-collapse -->
About Deflect
.. toctree::
:maxdepth: 1
Deflect is a website security service built and operated by `eQualitie
<>`__. Deflect is powered by open source software, a global
infrastructure and a dedicated team. In continuous operation since 2011,
Deflect protects hundreds of websites, serving over a million daily readers.
Our unique offering gives you the option of joining the Deflect service or
running the software yourself, on premises. Deflect is built on open source
software and has always made its work available under an open licence.
• Enterprise DDoS mitigation
• Secure Wordpress hosting
• Web traffic analytics
• Multilingual control panel
• Multilingual support desk
• Encrypted connections
• Transparent pricing
Deflect is offered free of charge to qualifying individuals and organizations.
Profits generated by the commercial offering are channelled to supporting
hundreds of non-profits, independent media and civil society organizations
around the world.
Deflect is a proven anti-censorship technology, achieving enterprise-scale goals
on a non-profit budget that is a fraction of that of comparable commercial efforts. Deflect has
never refused service to qualifying organizations nor encouraged anyone to leave
for attracting too many attacks. Our infrastructure has withstood malicious traffic
in excess of 100Gbps. Deflect-protected websites served over 74 million unique
readers in 2018, representing roughly 2% of the world’s population connected to the Internet.
About Deflect Approach
Antispam recommendations
Spam is a bummer, as we’re sure most of you agree. Here are some antispam tools
we’ve personally used. This first one is great.
`Anti-spam <>`_
* super simple, no configuration
* integrates seamlessly with any theme
* free
So far no cons. Crazy, right?
Another strategy is to use 2 plugins together. For example, we’ve had excellent
results using Antispam Bee and Spam Free WordPress. Antispam Bee alone
sometimes misses spam posted by a spambot and since a blog can receive
thousands of these per month, even a small percentage can mean quite a lot of
spam removal to deal with. By adding Spam Free WordPress into the mix, you can
pretty much eliminate automated (spambot) comment spam. Unfortunately, this
plugin fails to catch some of the manual spam added by real people, which is
where Antispam Bee shines, since it uses Project Honey Pot, which
publishes a list of the top URLs, domains, and keywords being promoted by
comment spammers. Project Honey Pot also publishes a list of the top IP
addresses being used by comment spammers.
`Antispam Bee <>`_
* uses `Project Honeypot <>`_
* integrates seamlessly with any theme
* free
* doesn’t always work against automated (spambots) comment spam
* support seems to be non-existent or only in German
`Spam Free WordPress <>`_
* blocks 100% of automated (spambots) comment spam
* free
* may need some work to fit in with your theme
* can be tricked by human spammers (actual people paid to add spam manually)
`Akismet <>`_
* free for personal use
* integrates seamlessly with any theme
* doesn’t categorize as accurately as you’d hope (`false positives as high as
10% <>`_)
* can cost a lot if you receive lots of comments
`Banjax <>`__ is responsible for
early stage filtering, challenging and banning of bots, identified via
regular expression (regex) matching, in conjunction with the Swabber
Banjax allows for per site configuration of either SHA-inverse
proof-of-work or Captcha challenges. These challenges are served as a
simple method to detect bot requests that have not been intercepted by
one of the system configured regexes.
The :doc:`Banjax <banjax>` module is integrated into the ATS proxy server
working as a filter which intercepts and analyses HTTP requests before
any content is served. This tool has a number of functional operations.
It allows for the use of regular expression (regex) banning, SHA-inverse
challenging and generation of Captchas. Additionally, it provides
whitelisting capabilities to prevent banning of allowed bots as well as
legitimate automated requests. The three Banjax functionalities are
enabled as filters in banjax.conf. Additionally, Banjax can gather and
submit detailed information on each request to Botbanger for further
analysis of the requester behavior.
Regex detection
In the majority of large-scale attacks, clear and distinct patterns can
be found in each attacker's requests, allowing BotnetDBP to be fed with
regular expressions to find these patterns and ban any bot whose
requests show up with them at a given rate. The regex filter is
basically imitating Fail2ban's capabilities but with a greater level of
efficiency needed for DDoS defence. This is done before serving any such
requests and has proven crucial when particularly weak origins are under
attack by a significantly large network of bots.
The regex filters are defined as a series of rules indicating the
circumstance within which a given request should be banned. The
parameters supported are:
- Rule - is the human readable name for a given regex to be banned.
- Regex - contains the regular expression string that a request will be
tested against.
- Interval - is the window of time between requests that Banjax should
- Hits\_per\_interval - is the number of hits for a given time window
that Banjax should consider before banning a given IP address. The
actual banning is calculated as 1/1000 x hits\_per\_interval. This
means that for a given interval, Banjax considers the number of
requests per millisecond and when it crosses the allowed threshold, a
ban is implemented.
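The rule parameters above can be illustrated with a minimal Python sketch. It is not Banjax's implementation: the `Rule` and `RegexBanner` names, the sliding-window counting, and the choice of seconds as the interval unit are all assumptions made for illustration.

```python
import re
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Rule:
    """Hypothetical mirror of the rule parameters described above."""
    rule: str                # human-readable name
    regex: str               # pattern tested against each request line
    interval: float          # length of the time window, in seconds (assumed unit)
    hits_per_interval: int   # matches tolerated inside one window


class RegexBanner:
    """Bans an IP once it exceeds a rule's hit allowance within the window."""

    def __init__(self, rules):
        self.rules = [(r, re.compile(r.regex)) for r in rules]
        self.hits = defaultdict(deque)   # (rule name, ip) -> timestamps of matches
        self.banned = set()

    def check(self, ip, request_line, now):
        """Return True if the request is allowed, False if the IP is banned."""
        if ip in self.banned:
            return False
        for rule, pattern in self.rules:
            if not pattern.search(request_line):
                continue
            window = self.hits[(rule.rule, ip)]
            window.append(now)
            # Drop matches that fell out of the time window.
            while window and now - window[0] > rule.interval:
                window.popleft()
            if len(window) > rule.hits_per_interval:
                self.banned.add(ip)
                return False
        return True


rules = [Rule("wp-login flood", r"POST /wp-login\.php", interval=10.0, hits_per_interval=3)]
banner = RegexBanner(rules)
results = [banner.check("203.0.113.7", "POST /wp-login.php HTTP/1.1", now=t) for t in range(5)]
print(results)
```

In this sketch the fourth matching request inside the ten-second window triggers the ban, and every later request from that IP is refused without running the regexes again.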
Challenging the request
The caching proxy does not offer an effective measure to prevent
cache-busting attacks. In this type of attack, each bot requests a new,
unique resource from the network. Because each request is different from
the previous one, there is no copy of such a resource in the caching
proxy so all the requests reach the origin, effectively amplifying the
DDoS attack by the number of proxy servers. This strategy essentially
turns the Deflect network against the very people we are trying to help.
To prevent this, BotnetDBP can be configured to serve challenges to each
computer requesting the content, allowing only those which solve the
challenge successfully to proceed (automatic detection of cache-busting
has been implemented and will be deployed shortly). The Challenger
filter supports two methods for HTTP requests: SHA-inverse challenge and
Captcha challenge. These functions are primarily intended to mitigate
cache-busting attacks but also serve to ensure the legitimacy of the
request and to provide a mechanism to slow traffic during a heavy load.
- The SHA-inverse challenge asks the user's browser for an inverse image
of a partial SHA256 value via JavaScript. This ensures that traffic is
originating from a legitimate browser rather than from a
pre-programmed Bot. This challenge is seamless for the user as it
occurs as part of the interaction between their browser and the
- By contrast, the Captcha is presented directly to the user, who must
correctly solve the visual challenge and submit their answer before
gaining access to the site's content.
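The idea behind a SHA-inverse challenge can be sketched as a small proof-of-work exercise. The following Python example is an illustration only, not the Banjax code: the challenge format, the use of leading zero bits as the "partial" target, and the function names are assumptions. In production the solving step would run as JavaScript in the visitor's browser.

```python
import hashlib
import secrets


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: bytes, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash is enough to check a claimed solution."""
    digest = hashlib.sha256(challenge + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: bytes, difficulty: int) -> int:
    """Client side: brute-force search for a nonce that meets the target."""
    nonce = 0
    while not verify(challenge, nonce, difficulty):
        nonce += 1
    return nonce


challenge = secrets.token_bytes(16)   # random per-visitor value issued by the server
difficulty = 12                       # roughly 2**12 hashes expected on average
nonce = solve(challenge, difficulty)
print(nonce, verify(challenge, nonce, difficulty))
```

The asymmetry is the point of the design: solving costs the client thousands of hashes, while checking costs the server one, so raising the difficulty slows bots far more than it loads the proxy.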
Banjax users can configure Challenger by the following parameters:
- the number of times a requester can fail a challenge before they are
- the difficulty level (the time taken for the browser to solve the
SHA256 inverse problem)
- the length of time that a solved challenge can grant access to the
Additionally, there is a key which ensures that the cookies passed to
and from the user have not been tampered with or manipulated and the
tool can also be configured for multiple hostnames, allowing for
different challenges to be set per host.
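One common way to make a cookie tamper-evident with a server-side key is to attach an HMAC. The sketch below shows that general technique in Python; the cookie layout (`ip|expiry|tag`), the key, and the function names are hypothetical and are not taken from Banjax.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-long-random-key"   # hypothetical server-side key


def issue_cookie(client_ip: str, expires_at: int) -> str:
    """Build a cookie value whose contents are bound to the key by an HMAC."""
    payload = f"{client_ip}|{expires_at}"
    tag = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{tag}"


def cookie_is_valid(cookie: str, client_ip: str, now: int) -> bool:
    """Accept only untampered, unexpired cookies issued to this IP."""
    try:
        ip, expires_at, tag = cookie.rsplit("|", 2)
        expiry = int(expires_at)
    except ValueError:
        return False
    expected = hmac.new(SECRET_KEY, f"{ip}|{expires_at}".encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the position of a mismatch through timing.
    untampered = hmac.compare_digest(tag.encode(), expected.encode())
    return untampered and ip == client_ip and now < expiry


cookie = issue_cookie("203.0.113.7", expires_at=1000)
print(cookie_is_valid(cookie, "203.0.113.7", now=500))
```

Because the expiry time is part of the signed payload, a visitor cannot extend the period for which a solved challenge grants access without invalidating the signature.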
White Listing
The Banjax tool can also be configured to allow HTTP requests for
specific IP addresses via the White Lister in order to interact with the
server without interference.
.. |Banjax| image:: img/GitHub-Mark-120px-plus.png
Banjax authentication
When your website is behind Deflect, requests for a new page will come
from our caching servers. This means that they may be several minutes
old and may not have the very latest updates. This is not ideal for when
you are editing the website and need to see updates immediately. Deflect
provides a special way to authenticate yourself to the system and access
your website **without caching**. We call this Banjax authentication.
After you have created the password in the
:doc:`Dashboard <dashboard_walkthrough/step2_admin_credentials>`, the login
page to your website (e.g. /wp-admin, /login, /administrator, etc.) will
appear like this:
.. figure:: img/Banjax_authentication.png
:alt: Banjax Authentication
Banjax Authentication
Only those in possession of the authentication password will be able to
proceed. This has the side effect of protecting your website's editorial
login from password brute-force attacks.
Baskerville is a complete pipeline that receives incoming web logs as input, either from a Kafka topic,
from a locally saved raw log file, or from log files saved to an Elasticsearch
instance. It processes these logs in batches, forming request sets by
grouping them by requested host and requesting IP. It subsequently extracts
features for these request sets and predicts whether they are malicious or
benign using a model that was trained offline on previously observed and
labelled data. Baskerville saves all the data and results to a Postgres database, and publishes metrics on its processing
(e.g. number of logs processed, percentage predicted malicious, percentage
predicted benign, processing speed etc) that can be consumed by Prometheus and
visualised using a Grafana dashboard.
.. figure:: img/baskerville_schematic.png
:alt: Baskerville Schematic
:figwidth: 60%
:align: center
Baskerville Schematic
As well as an engine that consumes and processes web logs, a set of offline
analysis tools have been developed for use in conjunction with Baskerville.
These tools may be accessed directly, or via two Jupyter notebooks,
which walk the user through the machine learning tools and the investigations tools,
respectively. The machine learning notebook comprises tools for training,
evaluating, and updating the model used in the Baskerville engine.
The investigations notebook comprises tools for processing, analysing,
and visualising past attacks, for reporting purposes.
.. figure:: img/baskerville_pipeline.jpg
:alt: Baskerville Pipeline
:figwidth: 70%
:align: center
Baskerville Pipeline
A brief overview of the current state of the Baskerville project is
:doc:`here <baskerville_readme>`,
and the full in-depth documentation is available
:doc:`here <baskerville_readme_indepth>`.
You can also download some presentation slides `here <_static/bothunting_with_baskerville.pdf>`__.
# Baskerville
Baskerville is a network traffic anomaly detector, for use in identifying and
characterising malicious IP behaviour. It additionally comprises a selection of
offline tools to investigate and learn from past web server logs.
The in-depth Baskerville documentation can be found
## Contents
- [Overview](#overview)
- [Baskerville Engine](#baskerville-engine)
- [Live](#baskerville-live)
- [Manual](#baskerville-manual)
- [Baskerville Storage](#baskerville-storage)
- [Offline Analysis](#offline-analysis)
- [Requirements](#requirements)
- [Installation](#installation)
- [Configuration](#configuration)
- [Running](#running)
- [In Depth](
- [Testing](#testing)
- [To Do](#to-do)
## Overview
Baskerville is the component of the Deflect analysis engine that is used to
decide whether IPs connecting to Deflect hosts are authentic normal
connections, or malicious bots. In order to make this assessment, Baskerville
groups incoming requests into *request sets* by requested host and requesting IP.
For each request set, a selection of *features* are
computed. These are properties of the requests within the request set (e.g.
average path depth, number of unique queries, HTML to image ratio...) that are
intended to help differentiate normal request sets from bot request sets. A supervised
*novelty detector*, trained offline on the feature vectors of a set of normal
request sets, is used to predict whether new request sets are normal or suspicious.
Additionally, a set of offline analysis tools exist to cluster, compare, and
visualise groups of request sets based on the feature values. The request sets,
their features, trained models, and details of suspected attacks and attributes,
are all saved to a Baskerville database.
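To make the notions of request set and feature concrete, here is a minimal pure-Python sketch. It does not use Spark and is not Baskerville code: the log record layout, the feature names, and the exact feature definitions are assumptions chosen to match the examples mentioned above.

```python
from collections import defaultdict
from urllib.parse import urlsplit

logs = [  # hypothetical, already-parsed log records
    {"host": "example.org", "ip": "203.0.113.7", "url": "/news/2019/story.html?id=1", "content_type": "text/html"},
    {"host": "example.org", "ip": "203.0.113.7", "url": "/img/logo.png", "content_type": "image/png"},
    {"host": "example.org", "ip": "203.0.113.7", "url": "/news/2019/story.html?id=2", "content_type": "text/html"},
    {"host": "example.org", "ip": "198.51.100.9", "url": "/", "content_type": "text/html"},
]


def build_request_sets(records):
    """Group log records by (requested host, requesting IP)."""
    request_sets = defaultdict(list)
    for record in records:
        request_sets[(record["host"], record["ip"])].append(record)
    return request_sets


def extract_features(requests):
    """Compute three of the example features named above for one request set."""
    parts = [urlsplit(r["url"]) for r in requests]
    depths = [len([seg for seg in p.path.split("/") if seg]) for p in parts]
    html = sum(r["content_type"].startswith("text/html") for r in requests)
    images = sum(r["content_type"].startswith("image/") for r in requests)
    return {
        "path_depth_average": sum(depths) / len(depths),
        "unique_query_total": len({p.query for p in parts if p.query}),
        # Guard against division by zero when no images were requested.
        "html_to_image_ratio": html / images if images else float(html),
    }


features = {key: extract_features(reqs) for key, reqs in build_request_sets(logs).items()}
for key, values in features.items():
    print(key, values)
```

Each resulting dictionary is the kind of feature vector that a trained model would then score as normal or suspicious.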
Put simply, the Baskerville *engine* is the workhorse for consuming,
processing, and saving the output from input web logs. This engine can be run as
Baskerville *live*, which enables the real-time identification and
banning of suspicious IPs, or as Baskerville *manual*, which conducts this same
analysis for log files saved locally or in an elasticsearch database.
There is additionally an *offline analysis* library for use with Baskerville,
intended for a) developing the supervised model for use in the Baskerville engine,
and b) reporting on attacks and botnets. Both utilise the Baskerville *storage*,
which is the database referenced above.
### Baskerville Engine
In depth documentation
The main Baskerville engine consumes web logs and uses these to compute
request sets (i.e. the groups of requests made by each IP-host pair) and extract
the request set features. It applies a trained novelty detection algorithm to
predict whether each request set is normal or anomalous. It saves the request set
features and predictions to the Baskerville storage database. It additionally cross-references
incoming IP addresses with attacks logged in the database, to determine if a
label (known malicious or known benign) can be applied to the request set.
As we would like to be able to make predictions
about whether the request set is benign or malicious while the runtime (and thus
potentially each request set) is ongoing, we divide each request set up into *subsets*.
Subsets have a fixed two-minute length, and the request set
features (and prediction) are updated at the end of each subset using a feature-specific
update method (discussed
For nonlinear features, the feature value will be dependent on the subset
length, so for this reason, logs are processed in two-minute subsets even when
not being consumed live. This is also discussed in depth in the feature document above.
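As an illustration of a feature-specific update, the sketch below folds successive subsets into a running request set. It is a simplified assumption of how such an update could look, not the method used in Baskerville: the `RequestSetState` layout and the two features shown are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestSetState:
    """Running state of one request set across subsets (hypothetical layout)."""
    request_total: int = 0
    path_depth_average: float = 0.0


def update_with_subset(state, subset_total, subset_depth_average):
    """Fold one two-minute subset into the request set's running features."""
    new_total = state.request_total + subset_total
    if new_total == 0:
        return state
    # A mean must be re-weighted by request counts; simply averaging the
    # two means would over-weight small subsets.
    new_average = (
        state.path_depth_average * state.request_total
        + subset_depth_average * subset_total
    ) / new_total
    return RequestSetState(new_total, new_average)


state = RequestSetState()
for subset_total, subset_average in [(10, 2.0), (30, 4.0)]:
    state = update_with_subset(state, subset_total, subset_average)
print(state)
```

A count can be updated by simple addition and a mean by a weighted average, but features that are not linear in the requests need their own rule, which is why the update is defined per feature.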
The Baskerville engine utilises [Apache Spark](; an
analytics engine designed for the purpose of large-scale data processing. The
decision to use Spark in Baskerville was made to ensure that the engine can
achieve a high enough level of efficiency to consume and process web logs in real
time, and thus run continuously as part of the Deflect ecosystem.
__Summary of Components__
- [**Initialize:**](
Set up necessary components for engine to run.
- [**Create Runtime:**](
Create a record in the Runtimes table of the Baskerville storage, to indicate that
Baskerville has been run.
- [**Get Data:**](
Receive web logs, and load them into a Spark dataframe.
- [**Get Window:**](
Select data from current time bucket window.
- [**Preprocessing:**](
Handle missing values, filter columns, add calculation columns etc.
- [**Group-by:**](
Formulate request sets from log data via grouping based on host-IP pairs.
- [**Feature Calculation:**](
Add additional calculation columns, and extract the features of each request set.
- [**Label or Predict:**](
Apply a trained model to classify the request set as suspicious or benign.
Cross reference the IP with known malicious IPs to see if a label can be applied.
- [**Save:**](
Save the analysis results to Baskerville storage.
#### Live
In depth documentation
The live version of Baskerville is designed to consume (ATS) logs from a Kafka topic
in predefined intervals (`time bucket` set to 120 seconds by default),
while a runtime is ongoing. This will be integrated into the online
Deflect analysis engine, and receive logs directly from ATS.
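The time bucket can be pictured as rounding each log timestamp down to a fixed-length window. The following sketch assumes the 120-second default mentioned above; the function names are hypothetical and the real engine performs this step on Spark dataframes rather than Python lists.

```python
from collections import defaultdict
from datetime import datetime, timezone

TIME_BUCKET_SECONDS = 120  # default bucket length mentioned above


def bucket_start(timestamp: datetime, bucket_seconds: int = TIME_BUCKET_SECONDS) -> datetime:
    """Round a timestamp down to the start of its time bucket."""
    epoch = int(timestamp.timestamp())
    return datetime.fromtimestamp(epoch - epoch % bucket_seconds, tz=timezone.utc)


def split_into_buckets(timestamps):
    """Group timestamps by the bucket they fall into."""
    buckets = defaultdict(list)
    for ts in timestamps:
        buckets[bucket_start(ts)].append(ts)
    return buckets


stamps = [
    datetime(2019, 1, 1, 12, 0, 30, tzinfo=timezone.utc),
    datetime(2019, 1, 1, 12, 1, 59, tzinfo=timezone.utc),
    datetime(2019, 1, 1, 12, 2, 0, tzinfo=timezone.utc),
]
buckets = split_into_buckets(stamps)
for start, members in sorted(buckets.items()):
    print(start.isoformat(), len(members))
```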
As logs are supplied to the Baskerville engine and processed, various metrics are
produced, e.g. the number of incoming request sets, the average feature values for
these request sets, and the predictions and/or labels (normal/anomalous) associated
with these request sets. These metrics are exported to Prometheus, which publishes
them for consumption by Grafana and other subscribers.
Grafana is a metrics visualization web application that can be configured to
display several dashboards with charts, raise alerts when a metric crosses a
user-defined threshold, and send notifications by email or other means. Within
Baskerville, under data/metrics, there is a dashboard that can be imported into
Grafana and that presents the statistics of the Baskerville engine in a
customisable manner. It is intended to be the principal visualisation and
alerting tool for incoming Deflect traffic, displaying metrics in graphical form.
Prometheus is the metric storage and aggregator that provides Grafana with the chart data.
#### Manual
In depth documentation
The manual version of Baskerville consumes logs from locally saved raw log files or
from an elasticsearch database, and conducts the processing steps enumerated in the
Baskerville Engine.
The processing of old logs can be carried out either by supplying raw log files,
or by providing a time period, batch length, and (optionally) host,
in which case the logs will be pulled from elasticsearch in batches of the
specified length. These details are provided in the Baskerville
engine configuration file, and the type of run is determined by the commandline
argument 'rawlog' or 'es' when calling the Baskerville main function.
To label the request sets as benign or malicious, the Attacks and Attributes tables
in Baskerville storage must be filled, either via syncing with a
MISP database of attacks, or by directly inputting records of past attacks and
the attributes associated with those attacks.
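The cross-referencing step can be sketched as follows. This is a simplified assumption, not the Baskerville labelling code: the row layouts standing in for the Attacks and Attributes tables are hypothetical, and the sketch only assigns a malicious label, leaving everything else unlabelled.

```python
from datetime import datetime

# Hypothetical stand-ins for rows of the Attacks and Attributes tables.
attacks = [
    {"id": 1, "start": datetime(2019, 3, 1, 10, 0), "stop": datetime(2019, 3, 1, 11, 0)},
]
attributes = [
    {"attack_id": 1, "ip": "203.0.113.7"},
]


def label_request_set(request_set, attacks, attributes):
    """Return 'malicious' if the IP is implicated in an attack that overlaps
    the request set's time span, otherwise None (no label can be applied)."""
    attack_ids = {a["attack_id"] for a in attributes if a["ip"] == request_set["ip"]}
    for attack in attacks:
        overlaps = (
            request_set["start"] <= attack["stop"]
            and request_set["stop"] >= attack["start"]
        )
        if attack["id"] in attack_ids and overlaps:
            return "malicious"
    return None


during_attack = {"ip": "203.0.113.7", "start": datetime(2019, 3, 1, 10, 30), "stop": datetime(2019, 3, 1, 10, 32)}
after_attack = {"ip": "203.0.113.7", "start": datetime(2019, 3, 2, 9, 0), "stop": datetime(2019, 3, 2, 9, 2)}
print(label_request_set(during_attack, attacks, attributes))
print(label_request_set(after_attack, attacks, attributes))
```

Requiring a time overlap as well as an IP match avoids labelling traffic from an address that was only involved in an attack on a different day.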
### Baskerville Storage
In depth documentation
The Baskerville storage is a database containing all the data output from the
Baskerville engine, as well as the trained models and records of attacks utilized
by the Baskerville engine for prediction and labelling, respectively.
[__Summary of Components__](
- **Runtimes:** Details of the occasions on which Baskerville has been run.
- **Request sets:** Data on requests made by an IP-host pair (features,
prediction, label etc).
- **Subsets:** The host-IP pair request data for each subset time bucket.
- **Models:** Different versions of the trained novelty detectors
(when they were created, which features they use, their accuracy etc).
- **Attacks:** Details of known attacks, optionally synced with a MISP database using
the offline tools.
- **Attributes:** IPs implicated in the incidents listed in the Attacks table.
- **Model Training Sets Link:** Table linking models with the request sets they
were trained on.
- **Requestset Attack Link:** Table linking request sets with the attacks they
were involved in.
- **Attribute Attack Link:** Table linking attributes with the attacks they were
involved in.
### Offline Analysis
In depth documentation
The offline component of Baskerville comprises a selection of tools to conduct
model development (for use in the Baskerville engine) and analysis (for use in investigations).
A supervised binary classifier may be trained based on the labelled request sets,
newly trained models may be used to make predictions for existing request sets,
clustering based on request set feature values may be conducted to identify
similar requesting IP behaviour and characterise botnets, and the results of
all of the above may be visualized.
These offline tools can be accessed by
running the offline main script. Alternatively, two notebooks ("investigations"
and "machine learning") guide the user through the process of using these tools
to investigate network traffic, or train a new classifier for use in Baskerville,
respectively.
__Summary of Components__
- [**Investigations**](
- **Misp Sync:** Copy the attack data stored in a MISP database to the Baskerville
storage Attacks and Attributes tables.
- **Labelling**: Label already processed request sets as malicious or benign,
based on e.g. cross-referencing with the Attacks table.
- **Data Exploration:** Visualise mean feature statistics over time and across
different groups.
- **Clustering:** Group request sets based on their features to investigate botnets.
- **Visualisation:** Produce figures to aid model development and investigations.
- [**Machine Learning**](
- Investigations tools above, and...
- **Training:** Train novelty detection classifier on labelled request sets.
- **Model Evaluation:** Assess the model accuracy with training size, and across
different attacks / reference periods.
- **Prediction:** Classify request sets as malicious/benign using newly trained models.
## Requirements
- Python >= 3.6,
- Postgres 10,
- Java 8 needs to be in place (and in PATH) for [Spark](
(Pyspark version 2.3+) to work,
- The required packages in `requirements.txt`,
- Tests need `pytest`, `mock` and `spark-testing-base`,
- Access to the [esretriever](
repository (to get logs from elasticsearch),
- Access to the [Deflect analytics ecosystem](
repository (to run Baskerville online services).
## Installation
In the root Baskerville directory, run:
    pip install -e . --process-dependency-links

Note that Baskerville uses Python 3.6. The above command should be modified to
`pip3.6 install -e . --process-dependency-links` if this is not the default version
of Python in your environment.
To use Baskerville live, you will need to have Postgres, Kafka, and Zookeeper running.
There is a docker file for these services
The [prometheus.yml](
should have the following job listed for the Baskerville exporter to run:
    - job_name: 'Baskerville_exporter'
      static_configs:
        - targets:
            - 'my-local-ip:8998'
To use Baskerville manually or the Baskerville offline analysis tools,
only a Postgres database is required.
For both the live and manual use of Baskerville, it is possible to export metrics to
Prometheus, and visualise these in a Grafana dashboard. Both Prometheus and Grafana
are also included in the docker file
## Configuration
The run settings for the Baskerville engine are contained in the configuration file
`baskerville/conf/baskerville.yaml`. The example configuration file should be
renamed to `baskerville.yaml`, and edited as detailed
The main components of the configuration file are:
- `DatabaseConfig`: mysql or postgres config
- `ElasticConfig`: elasticsearch config
- `MispConfig` : misp database config
- `EngineConfig`: baskerville-specific config, including the following:
- `ManualEsConfig`: details for pulling logs from elasticsearch
- `ManualRawlogConfig`: details for getting logs from a local file
- `SimulationConfig`: the config for running in online simulation mode
- `MetricsConfig`: the config for metrics exporting
- `DataConfig`: details of the format of the incoming web logs
- `KafkaConfig` : kafka and zookeeper configs