Most of today's web content is designed for human consumption, which makes it difficult for software tools to access it readily. Even web content that is generated automatically from back-end databases is usually presented without its original structural information. In this paper, we present an automated information extraction algorithm that extracts the relevant attribute-value pairs from product descriptions across different sites. To locate the data of interest on a web page, we use a notion called structural-semantic entropy, which measures the density of occurrence of relevant information on the DOM tree representation of the page. Our approach requires less manual labor and is insensitive to changes in web-page format. Experimental results on a large number of real-life web page collections are encouraging and confirm the feasibility of the approach. It has been successfully applied to detect false drug advertisements on the web, owing to its ability to associate the attributes of records with their respective values.
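The abstract does not spell out how structural-semantic entropy is computed, so the following is only a minimal sketch under a stated assumption: each leaf of the DOM tree carries a semantic role label (for example "name" or "price"), and the score of a node is the Shannon entropy of the role distribution over its descendant leaves. The `Node` class, the role labels, and the function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Simplified DOM node; `role` is a hypothetical semantic label on leaves."""
    tag: str
    role: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def role_counts(node: Node) -> Counter:
    """Count semantic roles over all labelled leaves beneath `node`."""
    if not node.children:
        return Counter([node.role]) if node.role else Counter()
    total: Counter = Counter()
    for child in node.children:
        total += role_counts(child)
    return total


def structural_semantic_entropy(node: Node) -> float:
    """Assumed definition: Shannon entropy (in bits) of the role distribution."""
    counts = role_counts(node)
    n = sum(counts.values())
    if n == 0:
        return 0.0  # no relevant information under this node
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# A product block whose leaves cover three different roles.
product = Node("div", children=[
    Node("span", role="name"),
    Node("span", role="price"),
    Node("span", role="dosage"),
])

# A navigation block with no labelled leaves.
nav = Node("ul", children=[Node("li"), Node("li")])

page = Node("body", children=[nav, product])

# Pick the direct child of the page with the highest score.
best = max(page.children, key=structural_semantic_entropy)
```

Under this assumed reading, a subtree whose leaves cover many distinct roles scores higher than a navigation menu or a block of uniform text, so `best` is the product block. Because the score depends on the labelled leaves rather than on fixed tag paths, it is largely unaffected by changes to the surrounding page layout.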