TEX: An efficient and effective unsupervised Web information extractor

Authors:
Hassan A. Sleiman;Rafael Corchuelo
Affiliations:
Universidad de Sevilla, ETSI Informática. Avda. de la Reina Mercedes, s/n, Sevilla E-41012, Spain
Venue:
Knowledge-Based Systems
Year:
2013

Abstract

The World Wide Web is an immense information resource. Web information extraction is the task of transforming human-friendly Web information into structured information that automated business processes can consume. In this article, we propose an unsupervised information extractor that works on two or more web documents generated by the same server-side template. It finds and removes token sequences shared amongst these web documents until it finds the relevant information that should be extracted from them. The technique is completely unsupervised and does not require maintenance; it works on malformed web documents and does not require the relevant information to be formatted using repetitive patterns. Our complexity analysis reveals that our proposal is computationally tractable, and our empirical study on real-world web documents demonstrates that it runs very fast and achieves very high precision and recall.
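
The idea of removing shared token sequences can be illustrated with a minimal sketch. It is not the TEX algorithm itself, whose details the abstract does not give. It only assumes that template tokens appear identically in every document, so that whatever remains after repeatedly stripping the longest shared run is a candidate for extraction. The function names (tokenize, find, longest_shared, extract), the regular-expression tokenizer, and the min_len parameter are all hypothetical choices made for this illustration.

```python
import re


def tokenize(html):
    """Split an HTML string into tag tokens and non-empty text tokens."""
    parts = re.split(r"(<[^>]+>)", html)
    return [p.strip() for p in parts if p.strip()]


def find(seq, pat):
    """Return the index of the first occurrence of list pat in list seq, or -1."""
    n = len(pat)
    for i in range(len(seq) - n + 1):
        if seq[i:i + n] == pat:
            return i
    return -1


def longest_shared(docs, min_len):
    """Find the longest contiguous token run of docs[0] present in every document."""
    base = docs[0]
    for size in range(min(len(d) for d in docs), min_len - 1, -1):
        for start in range(len(base) - size + 1):
            pat = base[start:start + size]
            positions = [find(d, pat) for d in docs]
            if all(p >= 0 for p in positions):
                return pat, positions
    return None, None


def extract(docs, min_len=1):
    """Recursively strip shared runs; return the leftover fragments per document."""
    if any(not d for d in docs):
        # Nothing can be shared with an empty fragment; keep what is left.
        return [[d] if d else [] for d in docs]
    pat, positions = longest_shared(docs, min_len)
    if pat is None:
        return [[d] for d in docs]
    # Split every document around the shared run and recurse on both sides.
    left = extract([d[:p] for d, p in zip(docs, positions)], min_len)
    right = extract([d[p + len(pat):] for d, p in zip(docs, positions)], min_len)
    return [l + r for l, r in zip(left, right)]


# Two made-up pages that follow the same template.
pages = [
    "<html><body><h1>Book A</h1><p>Price: <b>10</b></p></body></html>",
    "<html><body><h1>Book B</h1><p>Price: <b>25</b></p></body></html>",
]
print(extract([tokenize(p) for p in pages]))
# [[['Book A'], ['10']], [['Book B'], ['25']]]
```

This brute-force search is far from efficient and is meant only to make the principle concrete. It also shows one limitation of the naive version: a value that happens to be identical in every input document is indistinguishable from template text and is stripped along with it.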