Automatic information extraction from large websites

Authors:
Valter Crescenzi; Giansalvatore Mecca
Affiliations:
Università di Roma Tre; Università della Basilicata, Potenza, Italy
Venue:
Journal of the ACM (JACM)
Year:
2004

Citing 28
Cited 46

Inference of regular grammars via skeletons

IEEE Transactions on Systems, Man and Cybernetics
A survey of theoretical research on typed complex database objects

Databases
Cut and paste

PODS '97 Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Wrapper generation for semi-structured Internet sources

ACM SIGMOD Record
NoDoSE—a tool for semi-automatically extracting structured and semistructured data from text documents

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
One-unambiguous regular languages

Information and Computation
Information extraction from HTML: application of a general machine learning approach

AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
A hierarchical approach to wrapper induction

Proceedings of the third annual conference on Autonomous Agents
Generating finite-state transducers for semi-structured data extraction from the Web

Information Systems - Special issue on semistructured data
Grammars have exceptions

Information Systems - Special issue on semistructured data
Learning Information Extraction Rules for Semi-Structured and Free Text

Machine Learning - Special issue on natural language learning
Extracting semi-structured data through examples

Proceedings of the eighth international conference on Information and knowledge management
Inference of Reversible Languages

Journal of the ACM (JACM)
Conceptual-model-based data extraction from multiple-record Web pages

Data & Knowledge Engineering
Wrapper induction: efficiency and expressiveness

Artificial Intelligence - Special issue on Intelligent internet systems
RoadRunner: automatic data extraction from data-intensive web sites

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Wrapper verification

World Wide Web
Hierarchical Wrapper Induction for Semistructured Information Sources

Autonomous Agents and Multi-Agent Systems
Virtual Database Technology

ICDE '98 Proceedings of the Fourteenth International Conference on Data Engineering
Information Extraction in Structured Documents Using Tree Automata Induction

PKDD '02 Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery
Visual Web Information Extraction with Lixto

Proceedings of the 27th International Conference on Very Large Data Bases
RoadRunner: Towards Automatic Data Extraction from Large Web Sites

Proceedings of the 27th International Conference on Very Large Data Bases
Jedi: Extracting and Synthesizing Information from the Web

COOPIS '98 Proceedings of the 3rd IFCIS International Conference on Cooperative Information Systems
Learning the Common Structure of Data

Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence
Inductive Inference, DFAs, and Computational Complexity

AII '89 Proceedings of the International Workshop on Analogical and Inductive Inference
Identification of function distinguishable languages

Theoretical Computer Science
XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources

ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Wrapper maintenance: a machine learning approach

Journal of Artificial Intelligence Research

Documentum ECI self-repairing wrappers: performance analysis

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
From Wrapping to Knowledge

IEEE Transactions on Knowledge and Data Engineering
Using Google distance to weight approximate ontology matches

Proceedings of the 16th international conference on World Wide Web
Protection Techniques from Information Extraction

WI '06 Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
Information Extraction from Web Pages Using Presentation Regularities and Domain Knowledge

World Wide Web
A redundancy-based method for the extraction of relation instances from the Web

International Journal of Human-Computer Studies
Flint: Google-basing the Web

EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
A Workflow-Based Approach for Creating Complex Web Wrappers

WISE '08 Proceedings of the 9th international conference on Web Information Systems Engineering
WRAPPER INFERENCE FOR AMBIGUOUS WEB PAGES

Applied Artificial Intelligence
USING GRAMMATICAL INFERENCE TECHNIQUES TO LEARN ONTOLOGIES THAT DESCRIBE THE STRUCTURE OF DOMAIN INSTANCES

Applied Artificial Intelligence
An unsupervised method for joint information extraction and feature mining across different Web sites

Data & Knowledge Engineering
Tuning up FOIL for extracting information from the web

International Journal of Computer Applications in Technology
Automated construction of web accessibility models from transaction click-streams

Proceedings of the 18th international conference on World wide web
Personal News RSS Feeds Generation Using Existing News Feeds

ICWE '09 Proceedings of the 9th International Conference on Web Engineering
Algorithm for Extracting Loosely Structured Data Records Through Digging Strict Patterns

World Wide Web
Fixing weakly annotated web data using relational models

ICWE'07 Proceedings of the 7th international conference on Web engineering
NowOnWeb: news search and summarization

EUROCAST'07 Proceedings of the 11th international conference on Computer aided systems theory
Creating a dead poets society: extracting a social network of historical persons from the web

ISWC'07/ASWC'07 Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference
Web data extraction system based on label library

FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 7
No Code Required: Giving Users Tools to Transform the Web

No Code Required: Giving Users Tools to Transform the Web
From information to knowledge: harvesting entities and relationships from web sources

Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Using latent-structure to detect objects on the web

Proceedings of the 13th International Workshop on the Web and Databases
From the web of data to a world of action

Web Semantics: Science, Services and Agents on the World Wide Web
Otium: A web based planner for tourism and leisure

Expert Systems with Applications: An International Journal
Building Mashups by Demonstration

ACM Transactions on the Web (TWEB)
A language specification tool for model-based parsing

IDEAL'11 Proceedings of the 12th international conference on Intelligent data engineering and automated learning
Automatic web information extraction based on rules

WISE'11 Proceedings of the 12th international conference on Web information system engineering
Extracting and summarizing hot item features across different auction web sites

PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Decomposition-Based optimization of reload strategies in the world wide web

WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering
Chapter 6: web data extraction for service creation

Search Computing
The HiLeX system for semantic information extraction

Transactions on Large-Scale Data- and Knowledge-Centered Systems V
DIADEM: domain-centric, intelligent, automated data extraction methodology

Proceedings of the 21st international conference companion on World Wide Web
Automatically learning gazetteers from the deep web

Proceedings of the 21st international conference companion on World Wide Web
Computationally effective algorithm for information extraction and online review mining

Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
Ontology-assisted automatic precise information extractor for visually impaired inhabitants

Artificial Intelligence Review
Turn the page: automated traversal of paginated websites

ICWE'12 Proceedings of the 12th international conference on Web Engineering
TEX: An efficient and effective unsupervised Web information extractor

Knowledge-Based Systems
An unsupervised technique to extract information from semi-structured web pages

WISE'12 Proceedings of the 13th international conference on Web Information Systems Engineering
Towards discovering ontological models from big RDF data

ER'12 Proceedings of the 2012 international conference on Advances in Conceptual Modeling
Towards discovering conceptual models behind web sites

ER'12 Proceedings of the 31st international conference on Conceptual Modeling
Unsupervised wrapper induction using linked data

Proceedings of the seventh international conference on Knowledge capture
Rhea: automatic filtering for unstructured cloud storage

nsdi'13 Proceedings of the 10th USENIX conference on Networked Systems Design and Implementation
A framework for learning web wrappers from the crowd

Proceedings of the 22nd international conference on World Wide Web
Discovering implicit schemas in JSON data

ICWE'13 Proceedings of the 13th international conference on Web Engineering
Effects of Terms Recognition Mistakes on Requests Processing for Interactive Information Retrieval

International Journal of Information Retrieval Research
Synthesizing union tables from the web

IJCAI'13 Proceedings of the Twenty-Third international joint conference on Artificial Intelligence

Abstract

Information extraction from websites is now a relevant problem, and the extraction is usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent possible, in order to allow for large-scale extraction tasks even in the presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature.

We present a novel approach to information extraction from websites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised (that is, fully automatic) wrapper generation algorithms. However, due to some unrealistic assumptions on the input, these algorithms are not practically applicable to Web information extraction tasks.

The main contributions of the article lie in the definition of a class of regular languages, called the prefix mark-up languages, that abstract the structures usually found in HTML pages, and in the definition of a polynomial-time unsupervised learning algorithm for this class. The article shows that, unlike other known classes, prefix mark-up languages and the associated algorithm can be practically used for information extraction purposes.

A system based on the techniques described in the article has been implemented in a working prototype. We present some experimental results on known websites, and discuss opportunities and limitations of the proposed approach.
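
The general idea of unsupervised wrapper generation can be illustrated with a toy sketch. This is not the algorithm defined in the article, and it does not implement prefix mark-up languages. It is a deliberately simplified illustration that assumes two sample pages produced from the same template, with exactly the same sequence of tags and no optional or repeated parts. Under that assumption, tokens that agree across the pages are treated as template, and text tokens that differ are treated as data fields. All names (`tokenize`, `infer_template`, `extract`, `FIELD`) are hypothetical and chosen for this example only.

```python
import re

# A token is either a complete tag or a run of text between tags.
TOKEN = re.compile(r"<[^>]+>|[^<]+")

# Sentinel marking a data field in the inferred template (hypothetical name).
FIELD = object()


def tokenize(html):
    """Split HTML into tag tokens and non-empty, whitespace-trimmed text tokens."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def infer_template(page_a, page_b):
    """Generalise two same-shape pages into a template.

    Simplifying assumption (not from the article): both pages have the same
    number of tokens with identical tags in identical positions, so no
    optionals or repetitions have to be discovered.
    """
    a, b = tokenize(page_a), tokenize(page_b)
    if len(a) != len(b):
        raise ValueError("this sketch only handles pages of identical shape")
    template = []
    for x, y in zip(a, b):
        if x == y:
            template.append(x)          # shared token: part of the template
        elif not x.startswith("<") and not y.startswith("<"):
            template.append(FIELD)      # differing text: a data field
        else:
            raise ValueError("tag mismatch: outside the scope of this sketch")
    return template


def extract(template, page):
    """Apply the template to a new page and return the field values in order."""
    tokens = tokenize(page)
    if len(tokens) != len(template):
        raise ValueError("page does not match the template")
    values = []
    for expected, token in zip(template, tokens):
        if expected is FIELD:
            values.append(token)
        elif expected != token:
            raise ValueError("page does not match the template")
    return values


if __name__ == "__main__":
    page1 = "<html><b>Title:</b> Dune <i>Herbert</i></html>"
    page2 = "<html><b>Title:</b> Emma <i>Austen</i></html>"
    page3 = "<html><b>Title:</b> Ulysses <i>Joyce</i></html>"
    wrapper = infer_template(page1, page2)
    print(extract(wrapper, page3))
```

The sketch needs no labelled examples, which is what makes it unsupervised, but it fails as soon as pages contain optional elements or lists of varying length. Handling such structures in polynomial time is the kind of problem the article's language class is meant to address.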