Structural information (such as layout and look-and-feel) has been used extensively in the literature for extracting interesting or relevant data, for efficient storage, and for query optimization. Traditionally, tree models (such as DOM trees) have been used to represent structural information, especially for HTML and XML documents. However, computing structural similarity between documents on the tree model is expensive. In this paper, we propose an alternative scheme for representing the structural information of documents, based on the paths contained in the corresponding tree model. Since the model includes partial information about parents, children, and siblings, it allows us to define a new family of structural similarity measures that are meaningful and at the same time computationally simple. Our experimental results, based on the SIGMOD XML data set as well as HTML document collections from ibm.com, dell.com, and amazon.com, show that the representation is powerful enough to produce good clusters of structurally similar pages.
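
As a rough illustration of the path-based idea, the sketch below represents each document as the set of root-to-node tag paths in its tree and compares two documents by the overlap of those sets. It is not the paper's measure: the abstract does not give the definitions, so the Jaccard ratio here is an assumed stand-in, the function names `tag_paths` and `path_similarity` are hypothetical, and sibling information is ignored for brevity.

```python
import xml.etree.ElementTree as ET


def tag_paths(xml_text):
    """Return the set of root-to-node tag paths found in an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()
    # Iterative depth-first walk; each stack entry carries the path so far.
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def path_similarity(xml_a, xml_b):
    """Jaccard overlap of the two path sets (an assumed, illustrative measure)."""
    a, b = tag_paths(xml_a), tag_paths(xml_b)
    # Every parsed document has at least its root path, so the union is non-empty.
    return len(a & b) / len(a | b)


# Hypothetical toy documents: doc_a and doc_b share a template, doc_c does not.
doc_a = "<page><head><title/></head><body><div><p/></div></body></page>"
doc_b = "<page><head><title/></head><body><div><p/><p/></div></body></page>"
doc_c = "<page><body><table><tr/></table></body></page>"

print(path_similarity(doc_a, doc_b))
print(path_similarity(doc_a, doc_c))
```

Because each document is reduced to a set of strings, a comparison costs set intersection and union rather than a tree edit distance computation, which is the kind of saving the abstract attributes to the path representation. Repeated elements collapse into one path here, so two pages generated from the same template with different numbers of rows score as identical.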