SALS-SIG Research Seminar

Web document clustering using suffix trees


Speaker: Oren Zamir, University of Washington, USA
Date: 10th September 1998
Time: 11:30am
Place: Seminar Room 357, Building E6A, Macquarie University

Abstract:

Users of Web search engines are often forced to sift through the long ordered list of document "snippets" returned by the engines. The IR community has explored document clustering as an alternative method of organizing retrieval results, but clustering has yet to be deployed on the major search engines.

The talk articulates the unique requirements of Web document clustering and reports on the first evaluation of clustering methods in this domain. A key requirement is that the methods create their clusters based on the short "snippets" returned by Web search engines. Surprisingly, we find that clusters based on snippets are almost as good as clusters created using the full text of Web documents.

To satisfy the stringent requirements of the Web domain, we introduce an incremental, linear time (in the number of documents) algorithm called Suffix Tree Clustering (STC), which creates clusters based on phrases shared between documents. We show that STC is faster than standard clustering methods in this domain, and argue that Web document clustering via STC is both feasible and potentially beneficial.
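As a rough illustration of clustering by shared phrases, the Python sketch below groups snippets by the word sequences they have in common. It is not the STC algorithm from the talk: a plain dictionary of word n-grams stands in for the suffix tree, so it does not have STC's linear-time behaviour, and the function name, parameters, and sample snippets are all hypothetical.

```python
from collections import defaultdict


def shared_phrase_clusters(snippets, min_words=2, max_words=3, min_docs=2):
    """Map each shared phrase to the set of snippet indices containing it.

    Simplified stand-in for a suffix tree: every word n-gram of length
    min_words..max_words is indexed in a dictionary. Tokenisation is a
    plain lowercase whitespace split (an assumption; punctuation is not handled).
    """
    phrase_docs = defaultdict(set)
    # Snippets are processed one at a time, so the index can grow
    # incrementally as results arrive.
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for start in range(len(words)):
            for length in range(min_words, max_words + 1):
                if start + length > len(words):
                    break
                phrase = " ".join(words[start:start + length])
                phrase_docs[phrase].add(doc_id)
    # Keep only phrases that occur in at least min_docs snippets.
    return {p: docs for p, docs in phrase_docs.items() if len(docs) >= min_docs}


# Hypothetical snippets, for illustration only.
snippets = [
    "suffix tree clustering of web documents",
    "web documents and suffix tree construction",
    "cooking pasta at home",
]
clusters = shared_phrase_clusters(snippets)
for phrase in sorted(clusters):
    print(phrase, sorted(clusters[phrase]))
```

Each surviving phrase acts as a candidate cluster label, and the snippets that contain it form the cluster. The third snippet shares no phrase with the others and so is left unclustered.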


Enquiries: sals@mri.mq.edu.au

Last modified: August, 1998