Semi-supervised learning of semantic classes for query understanding: from the web and for the web

Authors:
Ye-Yi Wang; Raphael Hoffmann; Xiao Li; Jakub Szymanski
Affiliations:
Microsoft Corporation, Redmond, WA, USA; University of Washington, Seattle, WA, USA; Microsoft Corporation, Redmond, WA, USA; Microsoft Corporation, Redmond, WA, USA
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Abstract

Understanding the intent behind search queries can improve a user's search experience and boost a site's advertising profits. Query tagging via statistical sequential labeling models has been shown to perform well, but annotating the training set for supervised learning requires substantial human effort. Domain-specific knowledge, such as semantic class lexicons, reduces the amount of manual annotation needed, but much human effort is still required to maintain these lexicons as search topics evolve over time. This paper investigates semi-supervised learning algorithms that leverage structured data (HTML lists) from the Web to automatically generate semantic-class lexicons, which are used to improve query tagging performance, even with far less training data. We focus our study on understanding the correct objectives for the semi-supervised lexicon learning algorithms, which are crucial for the success of query tagging. Prior work on lexicon acquisition has largely focused on the precision of the lexicons, but we show that precision is not important if the lexicons are used for query tagging. A more adequate criterion should emphasize a trade-off between maximizing the recall of semantic class instances in the data and minimizing the confusability. This ensures that similar levels of precision and recall are observed on both the training and test sets, and hence prevents over-fitting to the lexicon features. Experimental results on retail product queries show that enhancing a query tagger with lexicons learned with this objective reduces word-level tagging errors by up to 25% compared to a baseline tagger that does not use any lexicon features. In contrast, lexicons obtained through a precision-centric learning algorithm even degrade the performance of a tagger compared to the baseline. Furthermore, the proposed method outperforms one in which semantic class lexicons have been extracted from a database.
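
To make the recall versus confusability trade-off concrete, the following is a minimal sketch on invented toy data. The abstract does not give the paper's formal definitions, so the two measures here are assumed proxies: recall is taken as the share of segments of the target class that the lexicon covers, and confusability as the share of lexicon matches that carry a different label. The function name, labels, and queries are all hypothetical.

```python
from typing import List, Set, Tuple

# A tagged query is a list of (phrase, label) segments; the labels are hypothetical.
TaggedQuery = List[Tuple[str, str]]


def lexicon_recall_and_confusability(
    queries: List[TaggedQuery], lexicon: Set[str], target: str
) -> Tuple[float, float]:
    """Illustrative proxies only (assumed definitions, not taken from the paper).

    recall: share of segments labeled `target` whose phrase is in the lexicon.
    confusability: share of lexicon-matched segments whose label is not `target`.
    """
    target_total = target_hits = matched_total = matched_other = 0
    for query in queries:
        for phrase, label in query:
            in_lexicon = phrase.lower() in lexicon
            if label == target:
                target_total += 1
                target_hits += in_lexicon
            if in_lexicon:
                matched_total += 1
                matched_other += label != target
    recall = target_hits / target_total if target_total else 0.0
    confusability = matched_other / matched_total if matched_total else 0.0
    return recall, confusability


# Toy retail-style queries, invented for illustration.
QUERIES: List[TaggedQuery] = [
    [("canon", "Brand"), ("digital camera", "Type")],
    [("apple", "Brand"), ("laptop", "Type")],
    [("green", "Color"), ("apple", "Other"), ("slicer", "Type")],
    [("sony", "Brand"), ("tv", "Type")],
]

narrow = {"canon"}                  # small lexicon with no ambiguous entries
broad = {"canon", "apple", "sony"}  # larger lexicon with one ambiguous entry

print(lexicon_recall_and_confusability(QUERIES, narrow, "Brand"))
print(lexicon_recall_and_confusability(QUERIES, broad, "Brand"))
```

On this toy data the narrow lexicon never matches a wrong segment but covers only one of the three brand mentions, while the broad lexicon covers all three at the cost of one ambiguous match ("apple"). This is the kind of trade-off the abstract argues matters more than lexicon precision alone when the lexicon supplies features to a tagger.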