Concordance-based entity-oriented search

Authors:
Mikhail Bautin;Steven Skiena
Affiliations:
Corresponding author;Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA. E-mail: {mbautin,skiena}@cs.sunysb.edu
Venue:
Web Intelligence and Agent Systems
Year:
2009

Abstract

We consider the problem of finding relevant named entities in response to a search query over a given text corpus. Entity search can readily augment conventional web search engines in a variety of applications. We use entity concordance documents to generate lists of relevant entities for arbitrary text queries. To assess the significance of entity search, we analyzed the AOL dataset of 36 million web search queries with respect to two sets of entities: (a) 2.3 million distinct entities extracted from a news text corpus and (b) 2.9 million Wikipedia article titles. The results clearly indicate that search engines should be aware of entities: depending on the matching criterion, 18-39% of all web search queries can be recognized as specifically searching for entities, while 73-87% of all queries contain entities. Our entity search engine creates a concordance document for each entity, consisting of all the sentences in the corpus that contain that entity. We then index and search these documents using open-source search software, which yields a ranked list of entities as the search result. Visit http://www.textmap.com for a demonstration of our entity search engine over a large news corpus. When the query is itself a named entity, we evaluate our system by comparing its results to the list of entities that have the highest statistical juxtaposition scores with the queried entity. The juxtaposition score measures how strongly two entities are related, expressed in terms of a probabilistic upper bound. The results show excellent performance, particularly over well-characterized classes of entities such as people.
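
The concordance idea can be illustrated with a small sketch. It is not the authors' implementation. The abstract does not name the open-source search software or the ranking function, so this sketch substitutes a plain TF-IDF score. Entity mentions are found by case-insensitive substring matching, where a real system would presumably use a named-entity recognizer. The function names `build_concordances` and `rank_entities` and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_concordances(sentences, entities):
    """Map each entity to every sentence that mentions it.

    Assumption: a mention is a case-insensitive substring match.
    """
    concordances = defaultdict(list)
    for sentence in sentences:
        lowered = sentence.lower()
        for entity in entities:
            if entity.lower() in lowered:
                concordances[entity].append(sentence)
    return dict(concordances)


def rank_entities(concordances, query):
    """Rank entities by a TF-IDF score of their concordance document.

    Assumption: TF-IDF stands in for whatever scoring the real
    search software applies.
    """
    # One bag of words per entity: its concordance document.
    docs = {
        entity: Counter(tokenize(" ".join(sents)))
        for entity, sents in concordances.items()
    }
    n_docs = len(docs)
    # Document frequency: in how many concordances each term occurs.
    doc_freq = Counter()
    for counts in docs.values():
        doc_freq.update(counts.keys())

    scores = {}
    for entity, counts in docs.items():
        length = sum(counts.values()) or 1
        score = 0.0
        for term in tokenize(query):
            if counts[term]:
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                score += (counts[term] / length) * idf
        if score > 0:
            scores[entity] = score
    # Highest-scoring entities first; entities with no match are omitted.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy corpus, one sentence per list item.
sentences = [
    "Alice Smith won the chess tournament in Oslo.",
    "Bob Jones coached the football team to victory.",
    "Alice Smith studied chess openings for years.",
    "Bob Jones met Alice Smith at a charity dinner.",
]
entities = ["Alice Smith", "Bob Jones"]

concordances = build_concordances(sentences, entities)
print(rank_entities(concordances, "chess"))
```

The result of a query is a list of entities rather than a list of source documents, because each indexed document stands for exactly one entity.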