Several studies have concentrated on generating wrappers for web data sources. Because wrappers can be easily described as grammars, the grammatical inference heritage could play a significant role in this research field. Recent results have identified a new subclass of regular languages, called prefix mark-up languages, that abstracts the structures usually found in the HTML pages of large web sites. This class has been proven to be identifiable in the limit, and a PTIME unsupervised learning algorithm for it has previously been developed. Unfortunately, many real-life web pages do not fall into this class of languages. In this article we analyze the roots of the problem and propose a technique that transforms pages so as to bring them into the class of prefix mark-up languages. In this way, we obtain a practical solution without giving up the formal background defined within the grammatical inference framework. We report on experiments conducted on real-life web pages to evaluate the approach; the results demonstrate the effectiveness of the presented techniques.