Information extraction for search engines using fast heuristic techniques

Authors:
Jer Lang Hong; Eu-Gene Siew; Simon Egerton
Affiliations:
School of Information Technology, Monash University, 68000 Ampang, Selangor, Malaysia (all three authors)
Venue:
Data & Knowledge Engineering
Year:
2010

Abstract

We study the structured records of web pages and the problems associated with extracting and aligning them. Current automatic wrappers are complicated because they must both locate the relevant data region using visual cues and apply complicated algorithms to check the similarity of data records. In this paper, we develop a non-visual automatic wrapper that questions the need for complex visual-based wrappers in data extraction. The novel techniques in our wrapper are (1) filtering rules to detect and filter out irrelevant data records, (2) a tree matching algorithm that uses frequency measures to increase the speed of data extraction, (3) an algorithm that calculates the number and size of the components of data records to detect the correct data region, (4) a data alignment algorithm that can align iterative (repeated HTML tags) and disjunctive (optional) data items, and (5) a data merging and partitioning method to solve the imperfect segmentation problem (the problem of correctly identifying the atomic entities in data items). Results show that our wrapper is as robust as, and in many cases outperforms, state-of-the-art wrappers such as ViNT and DEPTA. The wrapper could offer significant speed advantages when processing large volumes of website data, which could be helpful in metasearch engine development.
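
To make the idea of frequency-based tree comparison in technique (2) more concrete, the following is a minimal sketch and not the authors' algorithm, which the abstract does not specify. It assumes that a subtree can be summarised by how often each tag occurs in it, and that two sibling subtrees are likely data records of the same kind when those counts largely overlap. The node representation, the function names (`tag_counts`, `frequency_similarity`, `similar_runs`) and the 0.8 threshold are all hypothetical choices made for illustration.

```python
from collections import Counter

# Assumption: a node is (tag, [children]); text content is ignored here.

def tag_counts(node):
    """Count how often each tag occurs in the subtree rooted at node."""
    tag, children = node
    counts = Counter([tag])
    for child in children:
        counts.update(tag_counts(child))
    return counts

def frequency_similarity(a, b):
    """Multiset Jaccard similarity of two subtrees' tag counts, in [0, 1]."""
    ca, cb = tag_counts(a), tag_counts(b)
    shared = sum((ca & cb).values())  # element-wise minimum of the counts
    total = sum((ca | cb).values())   # element-wise maximum of the counts
    return shared / total if total else 1.0

def similar_runs(siblings, threshold=0.8):
    """Group adjacent siblings whose similarity meets the (assumed) threshold."""
    runs, current = [], siblings[:1]
    for prev, node in zip(siblings, siblings[1:]):
        if frequency_similarity(prev, node) >= threshold:
            current.append(node)
        else:
            runs.append(current)
            current = [node]
    if current:
        runs.append(current)
    return runs
```

Comparing tag counts costs time linear in the size of the two subtrees, which is cheaper than a full tree edit distance, at the price of ignoring tag order and nesting. A long run returned by `similar_runs` would be a candidate data region under these assumptions; a record with one extra optional tag still scores close to its neighbours, while an unrelated block such as a navigation table scores near zero.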