Automated Semantic Analysis of Schematic Data

Authors:
Saikat Mukherjee;I. V. Ramakrishnan
Affiliations:
Integrated Data Systems Department, Siemens Corporate Research, Princeton, USA 08540;Computer Science Department, Stony Brook University, Stony Brook, USA 11794
Venue:
World Wide Web
Year:
2008

Abstract

Content in numerous Web data sources, designed primarily for human consumption, is not directly amenable to machine processing. Automated semantic analysis of such content facilitates its transformation into machine-processable, richly structured, semantically annotated data. This paper describes a learning-based technique for semantic analysis of schematic data, which are characterized by being template-generated from backend databases. Starting with a seed set of hand-labeled instances of semantic concepts in a set of Web pages, the technique learns statistical models of these concepts using lightweight content features. These models direct the annotation of diverse Web pages possessing similar content semantics. The principles behind the technique find application in information retrieval and extraction problems. Focused Web browsing activities require only selected fragments of particular Web pages but are often performed using bookmarks, which fetch the contents of the entire page. This results in information overload for users of devices with constrained interaction modalities, such as small-screen handhelds. Fine-grained information extraction from Web pages, which is typically performed using page-specific, syntactic expressions known as wrappers, suffers from a lack of scalability and robustness. We report on the application of our technique in developing semantic bookmarks for retrieving targeted browsing content and semantic wrappers for robust and scalable information extraction from Web pages sharing a semantic domain.
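
The seed-then-annotate workflow summarized in the abstract can be illustrated with a minimal sketch. The abstract does not name the statistical model or the exact features, so everything below is an assumption made for illustration: a multinomial naive Bayes classifier with add-one smoothing stands in for the "statistical models", lowercase word tokens stand in for the "lightweight content features", and the names `features`, `ConceptModel`, and `SEEDS`, along with the toy concepts "headline" and "navigation", are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def features(text):
    """Assumed lightweight content features: lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ConceptModel:
    """Illustrative stand-in: multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # concept -> token frequencies
        self.doc_counts = Counter()              # concept -> number of seed fragments
        self.vocab = set()                       # all tokens seen in the seeds

    def train(self, seeds):
        """Learn per-concept token statistics from (text, concept) seed pairs."""
        for text, concept in seeds:
            tokens = features(text)
            self.word_counts[concept].update(tokens)
            self.doc_counts[concept] += 1
            self.vocab.update(tokens)

    def annotate(self, text):
        """Return the concept with the highest log-posterior for a page fragment."""
        tokens = features(text)
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for concept in self.doc_counts:
            counts = self.word_counts[concept]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[concept] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best, best_score = concept, score
        return best


# Hypothetical hand-labeled seed fragments taken from a few pages.
SEEDS = [
    ("Senate passes budget bill after long debate", "headline"),
    ("Storm closes schools across the region", "headline"),
    ("Home World Business Sports Weather", "navigation"),
    ("Sign in Subscribe Contact us Site map", "navigation"),
]

model = ConceptModel()
model.train(SEEDS)

# Annotate fragments from a different page that shares the same content semantics.
for fragment in ("Sports Weather Contact us", "Budget debate closes schools"):
    print(fragment, "->", model.annotate(fragment))
```

Because the sketch relies only on fragment content and not on page-specific markup, the same trained model can be applied to pages generated from different templates, which is the property the abstract attributes to semantic bookmarks and semantic wrappers.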