Reliability and verification of natural language text on the world wide web

  • Authors:
  • Melanie Jeanne Martin; Roger T. Hartley

  • Affiliations:
  • New Mexico State University; New Mexico State University

  • Venue:
  • Reliability and verification of natural language text on the world wide web
  • Year:
  • 2005


Abstract

With the explosive growth of the World Wide Web has come not just an explosion of information, but also an explosion of false, misleading, and unsupported information. At the same time, the web is increasingly being used for tasks where information quality and reliability are vital, from legal and medical research by both professionals and lay people, to fact checking by journalists and research by government policy makers. In this thesis we define reliability as a measure of the extent to which information on a given web page can be trusted. We explore the standard criteria for determining the reliability of printed information and how those criteria can be translated to the web. Based on these criteria, the HTML markup of web pages, linguistic properties of the text, and the link topology of the Web, we develop a set of features for use in learned automatic classifiers. This enables us to classify web pages in the medical domain as reliable or unreliable with reasonable accuracy. As a secondary task, we also classify web pages from the medical domain by type (commercial, link, or patient leaflet). This work extends previous work on the reliability of information in the medical domain and on the reliability, or quality, of information on the web in general. It also contributes to our knowledge of which features are truly appropriate for determining reliability on the Web, through empirical testing and principled feature selection. We bring a greater level of automation to the task of determining the reliability of medical information on the web through the use of a variety of machine learning algorithms.
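The pipeline the abstract describes, extracting features from a page's HTML markup, its text, and its links, then feeding them to a learned classifier, can be illustrated with a minimal sketch. The feature names below (`num_links`, `num_words`, `avg_word_len`, `has_references`) are hypothetical stand-ins for the kinds of cues the thesis mentions, not the dissertation's actual feature set:

```python
import re

def extract_features(html: str) -> dict:
    """Extract toy reliability features from a page's HTML source.

    These features are illustrative only: link count stands in for
    link-topology cues, word statistics for linguistic properties, and
    a 'references' check for markup/content cues. The thesis's real
    feature set is richer and empirically selected.
    """
    text = re.sub(r"<[^>]+>", " ", html)          # strip tags to get plain text
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "num_links": len(re.findall(r"<a\s", html, re.I)),   # link-topology cue
        "num_words": len(words),                              # text-length cue
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "has_references": int(bool(re.search(r"references?", text, re.I))),
    }

page = "<html><body><p>See the references.</p><a href='x'>link</a></body></html>"
feats = extract_features(page)
```

In the thesis, feature vectors like this would be labeled by hand and passed to standard machine learning algorithms to train a reliable/unreliable classifier; the sketch stops at feature extraction.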