Extracting semantic relations from query logs. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Computer.
Near-Term Prospects for Semantic Technologies. IEEE Intelligent Systems.
Finding high-quality content in social media. Proceedings of the 2008 International Conference on Web Search and Data Mining (WSDM '08).
Flickr tag recommendation based on collective knowledge. Proceedings of the 17th International Conference on World Wide Web.
Classifying tags using open content resources. Proceedings of the Second ACM International Conference on Web Search and Data Mining.
From capturing semantics to semantic search: a virtuous cycle. Proceedings of the 5th European Semantic Web Conference on The Semantic Web: Research and Applications (ESWC '08).
Prototype hierarchy based clustering for the categorization and navigation of web collections. Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
Distortion as a validation criterion in the identification of suspicious reviews. Proceedings of the First Workshop on Social Media Analytics.
Detection of text quality flaws as a one-class classification problem. Proceedings of the 20th ACM International Conference on Information and Knowledge Management.
Characterizing Wikipedia pages using edit network motif profiles. Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents.
Bursty event detection from collaborative tags. World Wide Web.
A breakdown of quality flaws in Wikipedia. Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality.
Predicting quality flaws in user-generated content: the case of Wikipedia. Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '12).
User Generated Content (UGC) is one of the main current trends on the Web. It has allowed anyone with Internet access to publish content in different media, such as text (e.g. blogs), photos, or video. This data can be crucial for many applications, in particular for semantic search. It is too early to say what impact UGC will have and to what extent; however, the impact will clearly be related to the quality of this content. So, how good is the content that people generate in the so-called Web 2.0? Clearly it is not as good as editorial content on the website of a publisher. However, success stories such as Wikipedia show that it can be quite good. In addition, the quality gap is balanced by volume, as user generated content is much larger than, say, editorial content. In fact, Ramakrishnan and Tomkins estimate that UGC generates from 8 to 10 GB daily, while the professional Web only generates 2 GB in the same time.

How can we estimate the quality of UGC? One possibility is to evaluate the quality directly, but that is not easy, as it depends on the type of content and the availability of human judgments. One example of such an approach is the study of Yahoo! Answers by Agichtein et al. They start from a judged question/answer collection in which good questions usually have good answers, and then predict good questions and good answers, obtaining an AUC (area under the curve of the precision-recall graph) of 0.76 and 0.88, respectively.

A second possibility is to obtain indirect evidence of the quality: use UGC for a given task and then evaluate the quality of the task results. One such example is the extraction of semantic relations by Baeza-Yates and Tiberi. To evaluate the quality of the results they used the Open Directory Project (ODP), showing that the results had a precision of over 60%.
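The indirect evaluation step can be sketched in a few lines. This is a toy illustration, not the actual pipeline of Baeza-Yates and Tiberi: the extracted relation pairs and the small directory standing in for the ODP are invented here, and the score is only a lower bound on precision, since pairs missing from the directory are not necessarily wrong.

```python
# Hypothetical sketch: extracted semantic relations are checked against
# a reference directory (a toy stand-in for the ODP). Pairs confirmed by
# the directory count as correct; the rest go to manual verification.

extracted = [("jaguar", "cat"), ("python", "snake"), ("ocaml", "camel")]
directory = {("jaguar", "cat"), ("python", "snake")}  # invented toy data

found = [r for r in extracted if r in directory]
not_found = [r for r in extracted if r not in directory]

# Only a lower bound: pairs absent from the directory may still be correct.
precision_lower_bound = len(found) / len(extracted)
print(precision_lower_bound)  # fraction confirmed by the directory
print(not_found)              # candidates for manual verification
```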
For the cases that were not found in the ODP, a manually verified sample showed that the real precision was close to 100%. What happened was that the ODP was not specific enough to contain very specific relations, and the problem gets worse every day as we have more data. This example shows the quality of the ODP as well as the semantics encoded in queries. Notice that we can regard queries as implicit UGC, because each query can be considered an implicit tag on the Web pages that are clicked for that query, and hence we have an implicit folksonomy.

A final alternative is to cross different UGC sources and infer from there the quality of those sources. An example of this case is the work by Van Zwol et al., where they use collective knowledge (the wisdom of crowds) to extend image tags, and show that almost 70% of the tags can be semantically classified by using WordNet and Wikipedia. This exposes the quality of both Flickr tags and Wikipedia.

Our main motivation is that by being able to generate semantic resources automatically from the Web (and in particular the Web 2.0), even with noise, and coupling that with open content resources, we can create a virtuous feedback circuit. In fact, explicit and implicit folksonomies can be used to do supervised machine learning without the need for manual intervention (or at least drastically reducing it) to improve semantic tagging. After that, we can feed the results back into the process and repeat it. Under the right conditions, every iteration should improve the output, obtaining a virtuous cycle. As a side effect, we can also improve Web search, our main goal.
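The feedback loop above can be sketched as a self-training iteration under strong assumptions: noisy folksonomy tags serve as the initial labels, a model re-tags the data, and its outputs are fed back as new training labels. Everything here is invented for illustration; the "model" is a toy majority-vote tagger, not the actual system.

```python
# Minimal sketch of the virtuous cycle: folksonomy tags -> supervised
# model -> new tags -> retraining. Data and tagger are toy assumptions.
from collections import Counter

def train(labeled):
    """Learn the majority tag for each token over (text, tag) pairs."""
    votes = {}
    for text, tag_ in labeled:
        for token in text.split():
            votes.setdefault(token, Counter())[tag_] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in votes.items()}

def tag(model, text):
    """Tag a text by majority vote of its known tokens."""
    hits = Counter(model[t] for t in text.split() if t in model)
    return hits.most_common(1)[0][0] if hits else None

seed = [("cute cat photo", "animal"), ("city skyline photo", "place")]
unlabeled = ["cat sleeping", "skyline at night"]

labeled = list(seed)
for _ in range(2):  # two feedback iterations of the cycle
    model = train(labeled)
    new = [(t, tag(model, t)) for t in unlabeled]
    labeled = seed + [(t, g) for t, g in new if g is not None]

print([tag(model, t) for t in unlabeled])  # → ['animal', 'place']
```

In the toy run, tags inferred in the first iteration become training data for the second; in a real deployment one would only feed back high-confidence outputs, which is what "under the right conditions" demands.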