Effective Anomaly Detection with Scarce Training Data
Authors: William Robertson, Federico Maggi, Christopher Kruegel, Giovanni Vigna
Proceedings of the Network and Distributed System Security Symposium (NDSS)
Conference Paper
Abstract
Learning-based anomaly detection has proven to be an effective black-box technique for detecting unknown attacks. However, the effectiveness of this technique crucially depends upon both the quality and the completeness of the training data. Unfortunately, in most cases, the traffic to the system (e.g., a web application or daemon process) protected by an anomaly detector is not uniformly distributed. Therefore, some components (e.g., authentication, payments, or content publishing) might not be exercised enough to train an anomaly detection system in a reasonable time frame. This is of particular importance in real-world settings, where anomaly detection systems are deployed with little or no manual configuration, and they are expected to automatically learn the normal behavior of a system to detect or block attacks. In this work, we first demonstrate that the features utilized to train a learning-based detector can be semantically grouped, and that features of the same group tend to induce similar models. Therefore, we propose addressing local training data deficiencies by exploiting clustering techniques to construct a knowledge base of well-trained models that can be utilized in case of undertraining. Our approach, which is independent of the particular type of anomaly detector employed, is validated using the realistic case of a learning-based system protecting a pool of web servers running several web applications such as blogs, forums, or Web services. We run our experiments on a real-world data set containing over 58 million HTTP requests to more than 36,000 distinct web application components. The results show that by using the proposed solution, it is possible to achieve effective attack detection even with scarce training data.
@InProceedings{ robertson_longtail_2010,
abstract = {Learning-based anomaly detection has proven to be an
effective black-box technique for detecting unknown
attacks. However, the effectiveness of this technique
crucially depends upon both the quality and the
completeness of the training data. Unfortunately, in most
cases, the traffic to the system (e.g., a web application
or daemon process) protected by an anomaly detector is not
uniformly distributed. Therefore, some components (e.g.,
authentication, payments, or content publishing) might not
be exercised enough to train an anomaly detection system in
a reasonable time frame. This is of particular importance
in real-world settings, where anomaly detection systems are
deployed with little or no manual configuration, and they
are expected to automatically learn the normal behavior of
a system to detect or block attacks. In this work, we first
demonstrate that the features utilized to train a
learning-based detector can be semantically grouped, and
that features of the same group tend to induce similar
models. Therefore, we propose addressing local training
data deficiencies by exploiting clustering techniques to
construct a knowledge base of well-trained models that can
be utilized in case of undertraining. Our approach, which
is independent of the particular type of anomaly detector
employed, is validated using the realistic case of a
learning-based system protecting a pool of web servers
running several web applications such as blogs, forums, or
Web services. We run our experiments on a real-world data
set containing over 58 million HTTP requests to more than
36,000 distinct web application components. The results
show that by using the proposed solution, it is possible to
achieve effective attack detection even with scarce
training data.},
author = {Robertson, William and Maggi, Federico and Kruegel,
Christopher and Vigna, Giovanni},
booktitle = {Proceedings of the Network and Distributed System Security
Symposium (NDSS)},
date = {2010-03-01},
doi = {10.1.1.183.3323},
file = {files/papers/conference-papers/robertson_longtail_2010.pdf},
publisher = {The Internet Society},
shorttitle = {LongTail},
title = {Effective Anomaly Detection with Scarce Training Data}}