Discovery of Web Robot Sessions Based on their Navigational Patterns
Web robots are software programs that automatically traverse the hyperlink structure of the World Wide Web in order to locate and retrieve information. There are many reasons why it is important to identify visits by Web robots and distinguish them from other users. First of all, e-commerce retailer...
Saved in:
Published in: | Data mining and knowledge discovery Vol. 6; no. 1; p. 9 |
---|---|
Main Authors: | , |
Format: | Journal Article |
Language: | English |
Published: |
New York
Springer Nature B.V
01-01-2002
|
Subjects: | |
Online Access: | Get full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Web robots are software programs that automatically traverse the hyperlink structure of the World Wide Web in order to locate and retrieve information. There are many reasons why it is important to identify visits by Web robots and distinguish them from other users. First of all, e-commerce retailers are particularly concerned about the unauthorized deployment of robots for gathering business intelligence at their Web sites. In addition, Web robots tend to consume considerable network bandwidth at the expense of other users. Sessions due to Web robots also make it more difficult to perform clickstream analysis effectively on the Web data. Conventional techniques for detecting Web robots are often based on identifying the IP address and user agent of the Web clients. While these techniques are applicable to many well-known robots, they may not be sufficient to detect camouflaged and previously unknown robots. In this paper, we propose an alternative approach that uses the navigational patterns in the click-stream data to determine if it is due to a robot. Experimental results on our Computer Science department Web server logs show that highly accurate classification models can be built using this approach. We also show that these models are able to discover many camouflaged and previously unidentified robots. |
---|---|
ISSN: | 1384-5810 1573-756X |
DOI: | 10.1023/A:1013228602957 |