An empirical study on using hidden Markov model for search interface segmentation

Authors:
Ritu Khare;Yuan An
Affiliations:
Drexel University, Philadelphia, PA, USA;Drexel University, Philadelphia, PA, USA
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Abstract

This paper describes a hidden Markov model (HMM) based approach to search interface segmentation. Accessing the invisible contents of the deep Web requires automatic processing of search interfaces. This entails automatic segmentation, i.e., the task of grouping related components of an interface together. While a human can easily discern the logical relationships among interface components, machine processing of an interface is difficult. In this paper, we propose an approach to segmentation that leverages the probabilistic nature of the interface design process. The design process involves choosing components based on the underlying database query requirements and organizing them into suitable patterns. We simulate this process by creating an "artificial designer" in the form of a 2-layered HMM. The learned HMM acquires the implicit design knowledge required for segmentation. We empirically study the effectiveness of the approach across several representative domains of the deep Web. In terms of segmentation accuracy, the HMM-based approach outperforms an existing state-of-the-art approach by at least 10% in most cases. Furthermore, our cross-domain investigation shows that a single HMM trained on data having varied and frequent design patterns can accurately segment interfaces from multiple domains.
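
To make the idea of HMM-based segmentation concrete, the following is a minimal sketch and not the paper's 2-layered model. It uses a single-layer HMM with Viterbi decoding to label a sequence of interface components and then groups them into segments. The state names, the observation types (`text`, `textbox`, `select`), every probability value, and the rule that a segment starts at each `attribute_name` are assumptions chosen for illustration; in the approach described above, the parameters would be learned from data.

```python
import math

# Hypothetical hidden states (semantic roles of interface components).
STATES = ["attribute_name", "operator", "operand"]

# All probabilities are made-up illustrative values, not learned parameters.
START = {"attribute_name": 0.8, "operator": 0.1, "operand": 0.1}
TRANS = {
    "attribute_name": {"attribute_name": 0.1, "operator": 0.3, "operand": 0.6},
    "operator": {"attribute_name": 0.1, "operator": 0.1, "operand": 0.8},
    "operand": {"attribute_name": 0.6, "operator": 0.1, "operand": 0.3},
}
EMIT = {
    "attribute_name": {"text": 0.9, "textbox": 0.05, "select": 0.05},
    "operator": {"text": 0.2, "textbox": 0.1, "select": 0.7},
    "operand": {"text": 0.1, "textbox": 0.6, "select": 0.3},
}


def viterbi(observations):
    """Return the most likely hidden state sequence for the observations."""
    # Log-score of the best path ending in each state at position 0.
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
               for s in STATES}]
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for obs in observations[1:]:
        prev = scores[-1]
        cur, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            ptr[s] = best
        scores.append(cur)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def segment(observations):
    """Label components, then open a new segment at each attribute_name."""
    labels = viterbi(observations)
    segments = []
    for i, label in enumerate(labels):
        if label == "attribute_name" or not segments:
            segments.append([])
        segments[-1].append(i)
    return labels, segments


labels, segments = segment(["text", "textbox", "text", "select", "textbox"])
print(labels)
print(segments)
```

With these toy parameters, the five components decode to `attribute_name, operand, attribute_name, operator, operand`, which yields two segments: components 0 and 1, and components 2 through 4.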