Information extraction from HTML: application of a general machine learning approach

Authors:
Dayne Freitag
Venue:
AAAI '98/IAAI '98 Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence
Year:
1998

Abstract

Because the World Wide Web consists primarily of text, information extraction is central to any effort that would use the Web as a resource for knowledge discovery. We show how information extraction can be cast as a standard machine learning problem, and argue for the suitability of relational learning in solving it. The implementation of a general-purpose relational learner for information extraction, SRV, is described. In contrast with earlier learning systems for information extraction, SRV makes no assumptions about document structure and the kinds of information available for use in learning extraction patterns. Instead, structural and other information is supplied as input in the form of an extensible token-oriented feature set. We demonstrate the effectiveness of this approach by adapting SRV for use in learning extraction rules for a domain consisting of university course and research project pages sampled from the Web. Making SRV Web-ready only involves adding several simple HTML-specific features to its basic feature set.
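To make the abstract's framing concrete, the following is a minimal sketch of how extraction might be set up as a classification problem over a token-oriented feature set. It is not SRV's implementation: the names `Token`, `tokenize_html`, `FEATURES`, `feature_vector`, and `candidate_fragments`, as well as the particular features chosen, are assumptions made for illustration, and the tag handling is deliberately simplified (void elements such as `<br>` are not treated specially). The point it shows is that HTML awareness can be added as a few extra token features, leaving the rest of the setup unchanged.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Token:
    """A word token plus the HTML elements open around it (hypothetical type)."""
    text: str
    enclosing_tags: Tuple[str, ...]


def tokenize_html(html: str) -> List[Token]:
    """Split HTML into word tokens, recording the tags that enclose each one.

    Simplified on purpose: no special handling of void elements,
    comments, or character entities.
    """
    tokens: List[Token] = []
    open_tags: List[str] = []
    for match in re.finditer(r"<(/?)(\w+)[^>]*>|([^\s<]+)", html):
        closing, tag, word = match.groups()
        if word is not None:
            tokens.append(Token(word, tuple(open_tags)))
        elif closing:
            name = tag.lower()
            if name in open_tags:
                # Close the most recently opened matching element.
                idx = len(open_tags) - 1 - open_tags[::-1].index(name)
                del open_tags[idx:]
        else:
            open_tags.append(tag.lower())
    return tokens


# Illustrative feature set: two generic features, two HTML-specific ones.
# These names are assumptions, not a list taken from the paper.
FEATURES: Dict[str, Callable[[Token], bool]] = {
    "capitalized": lambda t: t.text[:1].isupper(),
    "numeric": lambda t: t.text.isdigit(),
    "in_title": lambda t: "title" in t.enclosing_tags,
    "in_h1": lambda t: "h1" in t.enclosing_tags,
}


def feature_vector(token: Token) -> Dict[str, bool]:
    """Evaluate every registered feature on one token."""
    return {name: test(token) for name, test in FEATURES.items()}


def candidate_fragments(tokens: List[Token], max_len: int) -> List[Tuple[int, int]]:
    """Enumerate (start, end) spans of up to max_len tokens.

    Each span is one instance that a learned classifier would accept
    or reject as a filler for the target field.
    """
    n = len(tokens)
    return [(i, j) for i in range(n) for j in range(i + 1, min(i + max_len, n) + 1)]


if __name__ == "__main__":
    page = "<html><title>CS 101</title><body><h1>Intro</h1> meets at 9</body></html>"
    toks = tokenize_html(page)
    for tok in toks:
        print(tok.text, feature_vector(tok))
    print(len(candidate_fragments(toks, max_len=2)), "candidate fragments")
```

Under this sketch, extending the learner to a new document format amounts to registering more entries in `FEATURES`; the fragment enumeration and whatever classifier consumes the feature vectors stay the same.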