A bag of paths model for measuring structural similarity in Web documents

Authors:
Sachindra Joshi;Neeraj Agrawal;Raghu Krishnapuram;Sumit Negi
Affiliations:
Indian Institute of Technology, Hauz Khas, New Delhi;Indian Institute of Technology, Hauz Khas, New Delhi;Indian Institute of Technology, Hauz Khas, New Delhi;Indian Institute of Technology, Hauz Khas, New Delhi
Venue:
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2003

Abstract

Structural information (such as layout and look-and-feel) has been extensively used in the literature for extraction of interesting or relevant data, efficient storage, and query optimization. Traditionally, tree models (such as DOM trees) have been used to represent structural information, especially in the case of HTML and XML documents. However, computing structural similarity between documents with the tree model is expensive. In this paper, we propose an alternative scheme for representing the structural information of documents based on the paths contained in the corresponding tree model. Since the model includes partial information about parents, children, and siblings, it allows us to define a new family of structural similarity measures that are meaningful and at the same time computationally simple. Our experimental results based on the SIGMOD XML data set as well as HTML document collections from ibm.com, dell.com, and amazon.com show that the representation is powerful enough to produce good clusters of structurally similar pages.
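
The path-based idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify which paths are kept or which similarity measures are used, so the choice of root-to-node tag paths, the multiset Jaccard measure, and the names `bag_of_paths` and `path_similarity` are all assumptions made for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def bag_of_paths(xml_text: str) -> Counter:
    """Count every root-to-node tag path in the parsed tree (assumed path definition)."""
    root = ET.fromstring(xml_text)
    bag = Counter()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        bag[path] += 1
        # Extend the current path with each child's tag.
        for child in node:
            stack.append((child, f"{path}/{child.tag}"))
    return bag


def path_similarity(a: Counter, b: Counter) -> float:
    """Multiset Jaccard (an assumed measure): shared path counts over combined path counts."""
    union = sum((a | b).values())
    if union == 0:
        return 0.0
    return sum((a & b).values()) / union


# Hypothetical toy documents: two table-based pages and one paragraph-based page.
table_page_a = "<html><body><table><tr/><tr/></table></body></html>"
table_page_b = "<html><body><table><tr/></table></body></html>"
text_page = "<html><body><p/><p/></body></html>"

bag_a = bag_of_paths(table_page_a)
bag_b = bag_of_paths(table_page_b)
bag_c = bag_of_paths(text_page)

sim_tables = path_similarity(bag_a, bag_b)
sim_mixed = path_similarity(bag_a, bag_c)
print(sim_tables, sim_mixed)
```

Building each bag takes one pass over the tree, and comparing two bags only touches their distinct paths, which is the kind of saving over tree-to-tree matching that the abstract points to. In this toy setup the two table-based pages score higher than the table page against the paragraph page.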