Recent work has shown the effectiveness of leveraging layout and tag-tree structure for segmenting webpages and labeling HTML elements. However, how to effectively segment and label the text content inside HTML elements remains an open problem. Because much of the text on a webpage consists of fragments that are not strictly grammatical, traditional natural language processing techniques, which typically expect grammatical sentences, are not directly applicable. In this paper, we examine how to use layout and tag-tree structure in a principled way to help understand the text content of webpages. We propose to segment and label the page structure and the text content of a webpage in a joint discriminative probabilistic model. In this model, semantic labels of the page structure can be leveraged to help text content understanding, and semantic labels of the text phrases can be used in page structure understanding tasks such as data record detection. Integrating page structure and text content understanding thus yields a unified solution to webpage understanding. Experimental results on research homepage extraction show the feasibility and promise of our approach.