Sifting micro-blogging stream for events of user interest

Authors:
Maxim Grinev;Maria Grineva;Alexander Boldakov;Leonid Novak;Andrey Syssoev;Dmitry Lizorkin
Affiliations:
Institute for System Programming of the Russian Academy of Sciences, Moscow, Russian Fed.;Institute for System Programming of the Russian Academy of Sciences, Moscow, Russian Fed.;Institute for System Programming of the Russian Academy of Sciences, Moscow, Russian Fed.;Institute for System Programming of the Russian Academy of Sciences, Moscow, Russian Fed.;Institute for System Programming of the Russian Academy of Sciences, Moscow, Russian Fed.;Institute for System Programming of the Russian Academy of Sciences, Moscow, Russian Fed.
Venue:
Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval
Year:
2009

Citing 2

BlogScope: a system for online analysis of high volume text streams
VLDB '07 Proceedings of the 33rd international conference on Very large data bases

Extracting key terms from noisy and multitheme documents
Proceedings of the 18th international conference on World wide web

Cited 5

Detecting dynamic association among twitter topics
Proceedings of the 21st international conference companion on World Wide Web

Uprising microblogs: a bayesian network retrieval model for tweet search
Proceedings of the 27th Annual ACM Symposium on Applied Computing

See what's enBlogue: real-time emergent topic identification in social media
Proceedings of the 15th International Conference on Extending Database Technology

Discovery and analysis of evolving topical social discussions on unstructured microblogs
ECIR'13 Proceedings of the 35th European conference on Advances in Information Retrieval

Mining user interest and its evolution for recommendation on the micro-blogging system
WAIM'13 Proceedings of the 14th international conference on Web-Age Information Management

Abstract

Micro-blogging is a new form of social communication that encourages users to share information about anything they are seeing or doing, a practice made easy by the ability to post brief text messages from a variety of devices. Twitter, the most popular micro-blogging tool, is growing rapidly [3]: up to 11% of online Americans were using Twitter by December 2008, compared to 6% in May 2008. Due to its nature, the micro-blogosphere has unique features: (i) it is a source of extremely up-to-date information about what is happening in the world; (ii) it captures the wisdom of millions of people and covers a broad range of domains. These features make the micro-blogosphere more than a popular medium of social communication: we believe that it has also become a valuable source of extremely up-to-date news on virtually any subject of user interest.
Using the micro-blogosphere in this new role poses the following challenges: (A) Since any given subject is generally mentioned in the micro-blogging stream on a continuous basis, a method is needed for locating periods of news on this subject. (B) Even for such periods, stream filtering is required to remove noise and to extract the messages that best describe the news. To address these challenges we make and exploit the following observations: (A) For an arbitrary subject, events that catch user interest attract distinguishably more attention than the average level of mentions of the subject, resulting in bursts of message activity for it. (B) Most of the messages in an activity burst describe a common event in close variations, either rephrased or "retweeted" between users.
We demonstrate TweetSieve, a system that allows obtaining news on any given subject by sifting the Twitter stream. Our work is related to frequency-based analysis applied to blogs [1], but the higher latency and lower coverage of blogs make the analysis less effective there than in the case of micro-blogs.
In the TweetSieve demo, the user expresses the subject of her interest as an arbitrary search string. The system shows the periods in which events occur for the subject and outputs the tweets that best describe each of the events. Figure 1 shows a screenshot of the system for "Semantic search" as a sample subject. The underlying process consists of two steps:
(1) Identifying activity bursts. The frequency curve is constructed by counting the messages in the stream that match the search string over time. Activity bursts in the curve are identified as the periods in which the frequency exceeds the average by more than the standard deviation.
(2) Selecting messages that best describe news events. To the set of all messages matching the search string in an activity burst, we apply a message-granular variation of our keyphrase extraction algorithm [2], which is specifically suited to filtering noisy data efficiently. The algorithm clusters messages by their similarity to each other and chooses central messages from the densest clusters. As the similarity measure we use the Jaccard coefficient over the "bag of words" representation of messages.
The demonstration illustrates the potential of our approach in bringing news acquisition to a new level of promptness and coverage range.
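The two steps described above can be illustrated with a minimal Python sketch. It is not the TweetSieve implementation: the function names (`find_bursts`, `jaccard`, `central_message`) are hypothetical, the burst threshold is assumed to be the mean plus one population standard deviation of the per-bin message counts, and the clustering algorithm of [2] is replaced by a much simpler stand-in that treats a burst as a single cluster and returns the message with the highest total Jaccard similarity to the others.

```python
from statistics import mean, pstdev


def find_bursts(counts):
    """Return (start, end) index pairs of consecutive time bins whose message
    count exceeds mean + one standard deviation (an assumed threshold)."""
    threshold = mean(counts) + pstdev(counts)
    bursts, start = [], None
    for i, count in enumerate(counts):
        if count > threshold and start is None:
            start = i                      # a burst begins
        elif count <= threshold and start is not None:
            bursts.append((start, i - 1))  # the burst ended at the previous bin
            start = None
    if start is not None:                  # burst runs to the end of the series
        bursts.append((start, len(counts) - 1))
    return bursts


def jaccard(a, b):
    """Jaccard coefficient of two word sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def central_message(messages):
    """Simplified stand-in for the clustering step: pick the message whose
    bag of words is, in total, most similar to all other messages."""
    bags = [set(m.lower().split()) for m in messages]
    scores = [
        sum(jaccard(bag, other) for j, other in enumerate(bags) if j != i)
        for i, bag in enumerate(bags)
    ]
    best = max(range(len(messages)), key=scores.__getitem__)
    return messages[best]


if __name__ == "__main__":
    # Hypothetical per-hour counts of messages matching a search string.
    hourly_counts = [2, 3, 2, 3, 2, 20, 22, 3, 2, 3]
    print(find_bursts(hourly_counts))  # [(5, 6)]

    # Hypothetical messages collected during the burst.
    burst_messages = [
        "Google launches semantic search today",
        "RT Google launches semantic search today",
        "google launches new semantic search",
        "I love pizza and semantic search",
    ]
    print(central_message(burst_messages))
```

On the sample data, the two high-volume bins form one burst, and the off-topic message scores lowest because it shares few words with the near-duplicate reports of the event. A fuller version would first group messages into several clusters and rank the clusters by density, as the abstract describes.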