Noise robust detection of the emergence and spread of topics on the web

Authors:
Masahiro Inoue;Keishi Tajima
Affiliations:
Kyoto University, Yoshida-Honmachi, Sakyo, Kyoto, Japan;Kyoto University, Yoshida-Honmachi, Sakyo, Kyoto, Japan
Venue:
Proceedings of the 2nd Temporal Web Analytics Workshop
Year:
2012

Abstract

Because the same information appears on many Web pages, we often want to know which page first discussed it, or how it has spread across the Web over time. In this paper, we develop two methods: one that detects the first page that discussed the given information, and one that generates a graph showing how the number of pages discussing it has changed over time. To extract such information, we need to determine which pages discuss the given topic and when these pages were created. For the former step, we design a metric that estimates the degree to which a page includes the given information. For the latter step, we develop a technique for extracting creation timestamps from Web pages. Although timestamp extraction is a crucial component of temporal Web analysis, no prior research has shown in detail how to do it. Both steps are, however, still error-prone. To improve noise elimination, we examine not only the properties of each page but also the temporal relationships between pages. If the temporal relationships between a candidate page and other pages are unlikely under typical patterns of information spread on the Web, we eliminate the candidate page as noise. Our experimental results show that our methods achieve high precision and are suitable for practical use.
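
The two outputs described in the abstract (the first page discussing the information, and a timeline of how many pages discuss it) can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not define the inclusion-degree metric, so a simple term-overlap ratio stands in for it here. The function names, the 0.8 threshold, the page dictionary layout with `text` and `created` fields, and the choice of a cumulative monthly count are all assumptions made for illustration. Timestamp extraction and the temporal noise-elimination step are omitted, and creation dates are taken as given.

```python
from collections import Counter
import re


def tokens(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def inclusion_degree(info, page_text):
    """Fraction of the distinct terms of `info` that also occur in the page.

    Hypothetical stand-in for the paper's inclusion-degree metric.
    """
    info_terms = set(tokens(info))
    if not info_terms:
        return 0.0
    return len(info_terms & set(tokens(page_text))) / len(info_terms)


def matching_pages(info, pages, threshold=0.8):
    """Keep pages whose inclusion degree reaches the (assumed) threshold."""
    return [p for p in pages if inclusion_degree(info, p["text"]) >= threshold]


def first_page(matched):
    """Earliest matched page by creation date, or None if nothing matched."""
    return min(matched, key=lambda p: p["created"], default=None)


def spread_by_month(matched):
    """Cumulative number of matched pages per (year, month), in time order."""
    per_month = Counter((p["created"].year, p["created"].month) for p in matched)
    total, timeline = 0, []
    for month in sorted(per_month):
        total += per_month[month]
        timeline.append((month, total))
    return timeline
```

Given a list of pages whose `created` fields are `datetime.date` values, `first_page(matching_pages(info, pages))` returns the earliest matching page, and `spread_by_month` returns the points that could be plotted as the spread graph. A page with a wrongly extracted early date would be reported as the first page in this sketch, which is the kind of error the paper's temporal noise elimination is meant to address.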