Towards Comparative Mining of Web Document Objects with NFA: WebOMiner System

Authors:
C. I. Ezeife;Titas Mutsuddy
Affiliations:
School of Computer Science, University of Windsor, Windsor, ON, Canada;School of Computer Science, University of Windsor, Windsor, ON, Canada
Venue:
International Journal of Data Warehousing and Mining
Year:
2012

Abstract

Extracting comparative, heterogeneous web content data, both derived and historical, from related web pages is still in its infancy. Discovering potentially useful and previously unknown information or knowledge from web contents, for example answering a query such as "list all articles on 'Sequential Pattern Mining' written between 2007 and 2011, including title, authors, volume, abstract, paper, citation, and year of publication," requires finding the schema of web documents from different web pages, integrating the web content data, and building a virtual or physical data warehouse before the content can be extracted and mined from the database. This paper proposes the WebOMiner system, a technique for automatic web content data extraction that models web sites of a specific domain, such as business-to-customer (B2C) web sites, as object-oriented database schemas. Wrappers based on non-deterministic finite state automata (NFA) are then built to recognize content types from this domain and are used to extract related contents from data blocks into an integrated database for future second-level mining for deep knowledge discovery.
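
The idea of an NFA-based wrapper that recognizes a content type can be illustrated with a minimal sketch. The abstract does not give the automaton WebOMiner actually uses, so everything below is an assumption made for illustration: the token types (`image`, `text`, `price`), the state names, and the transition table are hypothetical. The sketch only shows the general technique of simulating an NFA over the token sequence of one data block by tracking the set of currently reachable states.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# Hypothetical NFA for a "product" content type in a B2C data block.
# Token types, states, and transitions are illustrative assumptions,
# not the automaton described in the paper.
Transitions = Dict[Tuple[str, str], Set[str]]

PRODUCT_NFA: Transitions = {
    ("q0", "image"): {"q1"},        # optional leading product image
    ("q0", "text"): {"q2"},         # title without an image
    ("q1", "text"): {"q2"},         # title after an image
    ("q2", "text"): {"q2", "q3"},   # non-deterministic: description or brand
    ("q2", "price"): {"q4"},
    ("q3", "price"): {"q4"},
    ("q4", "price"): {"q4"},        # e.g. list price followed by sale price
}
START_STATE = "q0"
ACCEPTING_STATES: FrozenSet[str] = frozenset({"q4"})


def accepts(
    tokens: Iterable[str],
    transitions: Transitions = PRODUCT_NFA,
    start: str = START_STATE,
    accepting: FrozenSet[str] = ACCEPTING_STATES,
) -> bool:
    """Return True if the token sequence is recognized by the NFA."""
    current: Set[str] = {start}
    for token in tokens:
        # Follow every transition available from every reachable state.
        current = {
            nxt
            for state in current
            for nxt in transitions.get((state, token), set())
        }
        if not current:  # no state can consume this token
            return False
    return bool(current & accepting)


if __name__ == "__main__":
    print(accepts(["image", "text", "text", "price"]))
    print(accepts(["image", "price"]))
```

In this sketch a data block is accepted as a product record only if it contains at least one text token followed by at least one price token; a block accepted by such a wrapper could then be stored as a tuple in the integrated database.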