72. BEYOND BLACKLISTS: LEARNING TO DETECT MALICIOUS WEB SITES FROM SUSPICIOUS URLS
Department: Computer Science & Engineering
Research Institute Affiliation: Center for Networked Systems (CNS)
Faculty Advisor(s):
Lawrence Saul | Stefan Savage | Geoffrey M. Voelker
Primary Student
Name: Justin Tung Ma
Email: jtma@ucsd.edu
Phone: 858-534-8173
Grad Year: 2010
Abstract
Malicious Web sites are a cornerstone of Internet criminal activities. They host a variety of unwanted content ranging from spam-advertised products, to phishing sites, to dangerous "drive-by" exploits that infect a visitor's machine with malware. As a result, there has been broad interest in developing systems to prevent the end user from visiting such sites. In this paper, we describe an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs. These methods are able to learn highly predictive models by extracting and automatically analyzing tens of thousands of features potentially indicative of suspicious URLs. The resulting classifiers obtain 95-99% accuracy, detecting large numbers of malicious Web sites from their URLs, with only modest numbers of false positives.
This approach is complementary both to blacklisting, which is fundamentally reactive and cannot predict the status of previously unseen URLs, and to systems based on evaluating site content and behavior, which require visiting potentially dangerous sites. Further, we show that with appropriate classifiers it is feasible to automatically sift through comprehensive feature sets (i.e., without requiring domain expertise) and identify those features that are most critical to identification.
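To make the idea of lexical URL features and automatic feature sifting concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple bag-of-words tokenization of the hostname and path, and it uses L1-regularized logistic regression (trained by proximal gradient descent) as one example of a classifier that drives uninformative feature weights to zero; the abstract does not say which classifiers or tokenization were used. Host-based properties are omitted because they would require network lookups. The function names, the toy URLs, and the hyperparameters are all hypothetical.

```python
import ipaddress
import math
import re
from urllib.parse import urlsplit


def lexical_features(url):
    """Binary bag-of-words features from the URL string alone (no network access).

    The token scheme here is an illustrative assumption.
    """
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    feats = {}
    try:
        ipaddress.ip_address(host)           # raw IP address instead of a domain name
        feats["host:is_ip"] = 1.0
    except ValueError:
        for tok in host.split("."):
            if tok:
                feats["host:" + tok] = 1.0
    for tok in re.split(r"[/?.=&_-]+", parts.path + "?" + parts.query):
        if tok:
            feats["path:" + tok.lower()] = 1.0
    return feats


def soft_threshold(w, t):
    """Proximal step for the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def train_l1_logistic(samples, labels, l1=0.05, lr=0.5, epochs=200):
    """Batch proximal gradient descent; returns (sparse weight dict, bias)."""
    weights, bias, n = {}, 0.0, len(samples)
    for _ in range(epochs):
        grad, grad_bias = {}, 0.0
        for x, y in zip(samples, labels):
            z = bias + sum(weights.get(k, 0.0) * v for k, v in x.items())
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_bias += err / n
            for k, v in x.items():
                grad[k] = grad.get(k, 0.0) + err * v / n
        bias -= lr * grad_bias               # the bias is not penalized
        for k, g in grad.items():
            w = soft_threshold(weights.get(k, 0.0) - lr * g, lr * l1)
            if w == 0.0:
                weights.pop(k, None)         # keep only the surviving features
            else:
                weights[k] = w
    return weights, bias


def score(weights, bias, feats):
    """Estimated probability that a URL is malicious under the learned model."""
    z = bias + sum(weights.get(k, 0.0) * v for k, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Hypothetical toy data on reserved example domains; 1 = malicious, 0 = benign.
    toy = [
        ("http://paypal.com.login-update.test/secure/signin.php", 1),
        ("http://192.0.2.7/banking/login.php?acct=1", 1),
        ("http://www.example.com/index.html", 0),
        ("http://docs.example.org/guide/intro.html", 0),
    ]
    X = [lexical_features(u) for u, _ in toy]
    y = [label for _, label in toy]
    weights, bias = train_l1_logistic(X, y)
    for name, w in sorted(weights.items(), key=lambda kv: -abs(kv[1])):
        print(f"{name:24s} {w:+.3f}")
```

Running the script prints whichever features keep a nonzero weight, ordered by magnitude. With only four toy URLs the ranking is illustrative at best; the point is that the model, not a domain expert, decides which tokens matter.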