On the limited memory BFGS method for large scale optimization
Mathematical Programming: Series A and B
Learning to construct knowledge bases from the World Wide Web
Artificial Intelligence - Special issue on Intelligent internet systems
Accelerated focused crawling through online relevance feedback
Proceedings of the 11th international conference on World Wide Web
Learning Logical Definitions from Relations
Machine Learning
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
Using Reinforcement Learning to Spider the Web Efficiently
ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Focused Crawling Using Context Graphs
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Shallow parsing with conditional random fields
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
A comparison of algorithms for maximum entropy parameter estimation
COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20
CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
Hi-index | 0.01 |
In this article we describe a novel information extraction task on the web and show how it can be solved effectively using the emerging conditional exponential models. The task involves learning to find specific goal pages on large domain-specific websites. An example of such a task is to find computer science publications starting from university root pages. We encode this as a sequential labeling problem solved using Conditional Random Fields (CRFs). These models enable us to exploit a wide variety of features including keywords and patterns extracted from and around hyperlinks and HTML pages, dependency among labels of adjacent pages, and existing databases of named entities in a unified probabilistic framework. This is an important advantage over previous rule-based or generative models for tackling the challenges of diversity on web data.