Do Not Crawl in the DUST: Different URLs with Similar Text
Bar-Yossef, Ziv, Idit Keidar, and Uri Schonfeld. WWW 2007.
Abstract: We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in web sites, as web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or web server logs, without examining page contents. Verifying these rules via sampling requires fetching few actual web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
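
To make the notion of a URL-transforming rule concrete, the following is a minimal sketch, assuming rules take the form of simple substring substitutions; the abstract itself does not specify the rule format. It is not DustBuster (it discovers no rules). It only illustrates how rules that are already known could group URLs from a log into candidate duplicates without fetching any page. The rules, the example URLs, and the names `canonicalize` and `group_candidates` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical substitution rules: (substring to find, replacement).
# Illustrative only; these are not taken from the paper.
RULES = [
    ("/index.html", "/"),
    ("http://www.", "http://"),
]


def canonicalize(url, rules=RULES):
    """Apply each rule in order, replacing only the first match of each."""
    for old, new in rules:
        url = url.replace(old, new, 1)
    return url


def group_candidates(urls, rules=RULES):
    """Group URLs that map to the same form under the rules.

    Only groups with more than one member are returned, since those
    are the candidate duplicates.
    """
    groups = defaultdict(list)
    for url in urls:
        groups[canonicalize(url, rules)].append(url)
    return {canon: members for canon, members in groups.items()
            if len(members) > 1}


# Made-up URLs standing in for entries from a crawl or server log.
log_urls = [
    "http://www.example.com/index.html",
    "http://example.com/",
    "http://example.com/news/index.html",
    "http://example.com/about",
]
candidates = group_candidates(log_urls)
```

Groups produced this way are only candidates: as the abstract notes, rules still need to be verified by sampling and fetching a few actual pages, because a substitution that preserves content on one site may not do so on another.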