Web entities are often associated with many attributes that describe them, and extracting these attributes is essential to Web entity data extraction. This paper proposes a novel approach that uses duplicated attribute-value pairs. First, we construct an initial seed set of attributes, including their names and enumerable values, together with a training set of Web pages from the target website. Second, we locate the position of each attribute by matching values found in the pages of the site against the values in the seed set. Third, we choose the position with the highest supportiveness as the extraction path, which we then use to extract other attribute-value pairs from pages that share the same template. Finally, we conduct an extensive experimental study on a large real data set to demonstrate the effectiveness of our extraction approach.
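The seed-matching and path-selection steps can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each page has already been flattened into a mapping from a tag path to the text found there, and it assumes "supportiveness" is simply the number of training pages on which a path holds a seed value. The names `best_paths` and `extract`, the sample paths, and the sample data are all hypothetical.

```python
from collections import Counter

def best_paths(seed, pages):
    """Pick, per attribute, the path whose text most often matches a seed value.

    seed:  {attribute name: set of known enumerable values}
    pages: list of {tag path: text} mappings (assumed page representation)
    """
    support = {attr: Counter() for attr in seed}
    for page in pages:
        for path, text in page.items():
            for attr, values in seed.items():
                if text in values:
                    # Assumed support measure: count of matching pages.
                    support[attr][path] += 1
    return {
        attr: counts.most_common(1)[0][0]
        for attr, counts in support.items()
        if counts  # skip attributes that never matched
    }

def extract(paths, page):
    """Read attribute values from a page that shares the same template."""
    return {attr: page[path] for attr, path in paths.items() if path in page}

# Hypothetical seed set and training pages from one site.
seed = {"color": {"red", "blue", "black"}}
pages = [
    {"/html/body/div[1]/span": "red", "/html/body/div[2]/span": "Acme phone"},
    {"/html/body/div[1]/span": "blue", "/html/body/div[2]/span": "red"},
    {"/html/body/div[1]/span": "black", "/html/body/div[2]/span": "Other phone"},
]

paths = best_paths(seed, pages)
# "green" is not in the seed set, but the learned path still extracts it.
new_page = {"/html/body/div[1]/span": "green", "/html/body/div[2]/span": "New phone"}
result = extract(paths, new_page)
```

In this sketch the second page contains a stray match ("red" under `div[2]`), but `div[1]` is supported by three pages and so is chosen. Once the path is fixed, values absent from the seed set can be extracted from any page built on the same template.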