Robust estimation of Google counts for social network extraction

Authors:
Yutaka Matsuo;Hironori Tomobe;Takuichi Nishimura
Affiliations:
AIST, Chiyoda-ku, Tokyo, Japan;AIST, Chiyoda-ku, Tokyo, Japan;AIST, Chiyoda-ku, Tokyo, Japan
Venue:
AAAI'07 Proceedings of the 22nd national conference on Artificial intelligence - Volume 2
Year:
2007

Abstract

Various studies in NLP and the Semantic Web use the so-called Google count: the hit count that a search engine (not only Google) returns for a query. However, the Google count is sometimes unreliable, especially when the count is large or when advanced operators such as OR and NOT are used. In this paper, we propose a novel algorithm that estimates the Google count robustly. It (i) uses the co-occurrence of terms as evidence to estimate the occurrence of a given word, and (ii) integrates multiple pieces of evidence for robust estimation. We evaluated our algorithm on more than 2000 queries across three datasets using the Google, Yahoo!, and MSN search engines. Our algorithm also provides estimated counts for any classifier that judges a web page as positive or negative. Consequently, we can estimate the number of documents on the entire web that refer to a particular person (as distinct from namesakes).
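The two ideas named in the abstract, co-occurrence as evidence and the integration of several pieces of evidence, can be illustrated with a minimal sketch. It is not the authors' algorithm, whose details the abstract does not give. It assumes that the target term occurs roughly independently of each auxiliary "probe" word, so that count(target) is approximately count(target AND probe) * N / count(probe), and it combines the per-probe estimates with a median. The function name, the probe words, the index size N, and all numbers are hypothetical.

```python
from statistics import median


def estimate_count(joint_counts, probe_counts, total_pages):
    """Estimate the hit count of a target term from co-occurrence counts.

    joint_counts: probe word -> hit count of "target AND probe" (hypothetical data)
    probe_counts: probe word -> hit count of the probe word alone
    total_pages:  assumed size N of the search engine index

    Assumption (not taken from the paper): the target is roughly independent
    of each probe, so count(target) ~= joint * N / count(probe).
    """
    estimates = []
    for probe, joint in joint_counts.items():
        probe_count = probe_counts.get(probe, 0)
        if probe_count <= 0:
            continue  # a probe with no hits carries no usable evidence
        estimates.append(joint * total_pages / probe_count)
    if not estimates:
        raise ValueError("no usable probe words")
    # The median keeps a single unreliable probe from dominating the result.
    return median(estimates)


# Toy numbers, invented for illustration only.
joint = {"alpha": 50, "beta": 100, "gamma": 100}
probes = {"alpha": 10_000, "beta": 20_000, "gamma": 5_000}
print(estimate_count(joint, probes, total_pages=1_000_000))
```

In this toy input two probes agree on an estimate of 5000 and the third suggests 20000; the median returns the value the agreeing probes support. A median is only one simple way to integrate evidence, chosen here for brevity.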