Effective Anomaly Detection with Scarce Training Data

Authors: William Robertson, Federico Maggi, Christopher Kruegel, Giovanni Vigna

March 2010

Proceedings of the Network and Distributed System Security Symposium (NDSS)

Conference Paper

Abstract

Learning-based anomaly detection has proven to be an effective black-box technique for detecting unknown attacks. However, the effectiveness of this technique crucially depends upon both the quality and the completeness of the training data. Unfortunately, in most cases, the traffic to the system (e.g., a web application or daemon process) protected by an anomaly detector is not uniformly distributed. Therefore, some components (e.g., authentication, payments, or content publishing) might not be exercised enough to train an anomaly detection system in a reasonable time frame. This is of particular importance in real-world settings, where anomaly detection systems are deployed with little or no manual configuration, and they are expected to automatically learn the normal behavior of a system to detect or block attacks. In this work, we first demonstrate that the features utilized to train a learning-based detector can be semantically grouped, and that features of the same group tend to induce similar models. Therefore, we propose addressing local training data deficiencies by exploiting clustering techniques to construct a knowledge base of well-trained models that can be utilized in case of undertraining. Our approach, which is independent of the particular type of anomaly detector employed, is validated using the realistic case of a learning-based system protecting a pool of web servers running several web applications such as blogs, forums, or Web services. We run our experiments on a real-world data set containing over 58 million HTTP requests to more than 36,000 distinct web application components. The results show that by using the proposed solution, it is possible to achieve effective attack detection even with scarce training data.

@InProceedings{	  robertson_longtail_2010,
  abstract	= {Learning-based anomaly detection has proven to be an
		  effective black-box technique for detecting unknown
		  attacks. However, the effectiveness of this technique
		  crucially depends upon both the quality and the
		  completeness of the training data. Unfortunately, in most
		  cases, the traffic to the system (e.g., a web application
		  or daemon process) protected by an anomaly detector is not
		  uniformly distributed. Therefore, some components (e.g.,
		  authentication, payments, or content publishing) might not
		  be exercised enough to train an anomaly detection system in
		  a reasonable time frame. This is of particular importance
		  in real-world settings, where anomaly detection systems are
		  deployed with little or no manual configuration, and they
		  are expected to automatically learn the normal behavior of
		  a system to detect or block attacks. In this work, we first
		  demonstrate that the features utilized to train a
		  learning-based detector can be semantically grouped, and
		  that features of the same group tend to induce similar
		  models. Therefore, we propose addressing local training
		  data deficiencies by exploiting clustering techniques to
		  construct a knowledge base of well-trained models that can
		  be utilized in case of undertraining. Our approach, which
		  is independent of the particular type of anomaly detector
		  employed, is validated using the realistic case of a
		  learning-based system protecting a pool of web servers
		  running several web applications such as blogs, forums, or
		  Web services. We run our experiments on a real-world data
		  set containing over 58 million HTTP requests to more than
		  36,000 distinct web application components. The results
		  show that by using the proposed solution, it is possible to
		  achieve effective attack detection even with scarce
		  training data.},
  author	= {Robertson, William and Maggi, Federico and Kruegel,
		  Christopher and Vigna, Giovanni},
  booktitle	= {Proceedings of the Network and Distributed System Security
		  Symposium (NDSS)},
  date		= {2010-03-01},
  doi		= {10.1.1.183.3323},
  file		= {files/papers/conference-papers/robertson_longtail_2010.pdf},
  publisher	= {The Internet Society},
  shorttitle	= {LongTail},
  title		= {Effective Anomaly Detection with Scarce Training Data}
}
