The World Wide Web is an immense information resource. Web information extraction is the task of transforming human-friendly Web information into structured information that automated business processes can consume. In this article, we propose an unsupervised information extractor that works on two or more web documents generated by the same server-side template. It finds and removes token sequences shared amongst these web documents until it finds the relevant information that should be extracted from them. The technique is completely unsupervised and does not require maintenance; it can work on malformed web documents, and it does not require the relevant information to be formatted using repetitive patterns. Our complexity analysis reveals that our proposal is computationally tractable, and our empirical study on real-world web documents demonstrates that it runs very fast and achieves very high precision and recall.
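The core idea of discarding token sequences shared by documents from the same template can be illustrated with a minimal Python sketch. This is not the algorithm proposed in the article: it handles only two documents, makes a single pass, and substitutes the standard library's `difflib.SequenceMatcher` for the shared-sequence search. The names `tokenize` and `unshared_fragments`, the regular expression, and the two sample pages are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(html):
    """Split a document into tag tokens and whitespace-separated text tokens.

    A regular expression is used instead of an HTML parser, so malformed
    markup does not raise errors (an illustrative choice, not the paper's).
    """
    return re.findall(r"<[^>]+>|[^<\s]+", html)


def unshared_fragments(doc_a, doc_b):
    """Return the token runs of each document that the other does not share.

    Shared runs are treated as template text and dropped; whatever remains
    is a candidate for the relevant information.
    """
    a, b = tokenize(doc_a), tokenize(doc_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    frags_a, frags_b = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # shared token sequence: assumed to come from the template
        if i2 > i1:
            frags_a.append(" ".join(a[i1:i2]))
        if j2 > j1:
            frags_b.append(" ".join(b[j1:j2]))
    return frags_a, frags_b


# Two hypothetical pages rendered by the same template.
page_a = ("<html><body><h1>Book</h1><p>Title: <b>Dune</b></p>"
          "<p>Price: <i>9.99</i></p></body></html>")
page_b = ("<html><body><h1>Book</h1><p>Title: <b>Emma</b></p>"
          "<p>Price: <i>4.50</i></p></body></html>")

data_a, data_b = unshared_fragments(page_a, page_b)
print(data_a)  # ['Dune', '9.99']
print(data_b)  # ['Emma', '4.50']
```

In this toy case the template tokens (tags and the labels `Title:` and `Price:`) are shared and removed, leaving only the values that differ between the pages. A value that happens to be identical in both pages would be discarded as template text, which is one reason a realistic extractor would work on more than two documents.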