Identifying syntactic differences between two programs
Software—Practice & Experience
A hierarchical approach to wrapper induction
Proceedings of the third annual conference on Autonomous Agents
Record-boundary discovery in Web documents
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Generating finite-state transducers for semi-structured data extraction from the Web
Information Systems - Special issue on semistructured data
Wrapper induction: efficiency and expressiveness
Artificial Intelligence - Special issue on Intelligent internet systems
IEPAD: information extraction based on pattern discovery
Proceedings of the 10th international conference on World Wide Web
A flexible learning system for wrapping tables and lists in HTML documents
Proceedings of the 11th international conference on World Wide Web
Machine Learning
Adaptive View Validation: A First Step Towards Automatic View Detection
ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Learning the Common Structure of Data
Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence
Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence
A Comparative Study of Information Extraction Strategies
CICLing '02 Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing
XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources
ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Table extraction using conditional random fields
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Wrapper induction for information extraction
Wrapper induction for information extraction
Accurately and reliably extracting data from the Web: a machine learning approach
Intelligent exploration of the web
Mining data records in Web pages
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Using the structure of Web sites for automatic segmentation of tables
SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
OLERA: Semisupervised Web-Data Extraction with Visual Support
IEEE Intelligent Systems
Web data extraction based on partial tree alignment
WWW '05 Proceedings of the 14th international conference on World Wide Web
Active learning with strong and weak views: a case study on wrapper induction
IJCAI'03 Proceedings of the 18th international joint conference on Artificial intelligence
On Finding Templates on Web Collections
World Wide Web
Information extraction for search engines using fast heuristic techniques
Data & Knowledge Engineering
Harvesting relational tables from lists on the web
Proceedings of the VLDB Endowment
Facilitating wrapper generation with page analysis
ISI'09 Proceedings of the 2009 IEEE international conference on Intelligence and security informatics
Tag tree template for Web information and schema extraction
Expert Systems with Applications: An International Journal
Harvesting relational tables from lists on the web
The VLDB Journal — The International Journal on Very Large Data Bases
Extracting product descriptions from polish e-commerce websites using classification and clustering
ISMIS'11 Proceedings of the 19th international conference on Foundations of intelligent systems
News information extraction based on adaptive weighting using unsupervised Bayesian algorithm
WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
Towards a unified solution: data record region detection and segmentation
Proceedings of the 20th ACM international conference on Information and knowledge management
Towards Comparative Mining of Web Document Objects with NFA: WebOMiner System
International Journal of Data Warehousing and Mining
Robust detection of semi-structured web records using a DOM structure-knowledge-driven model
ACM Transactions on the Web (TWEB)
Hi-index | 0.00 |
This paper studies structured data extraction from Web pages. Existing approaches to data extraction include wrapper induction and automated methods. In this paper, we propose an instance-based learning method, which performs extraction by comparing each new instance to be extracted with labeled instances. The key advantage of our method is that it does not require an initial set of labeled pages to learn extraction rules as in wrapper induction. Instead, the algorithm is able to start extraction from a single labeled instance. Only when a new instance cannot be extracted does it need labeling. This avoids unnecessary page labeling, which solves a major problem with inductive learning (or wrapper induction), i.e., the set of labeled instances may not be representative of all other instances. The instance-based approach is very natural because structured data on the Web usually follow some fixed templates. Pages of the same template usually can be extracted based on a single page instance of the template. A novel technique is proposed to match a new instance with a manually labeled instance and in the process to extract the required data items from the new instance. The technique is also very efficient. Experimental results based on 1,200 pages from 24 diverse Web sites demonstrate the effectiveness of the method. It also outperforms the state-of-the-art existing systems significantly.