Text characteristics of English language university Web sites: Research Articles

  • Authors:
  • Mike Thelwall

  • Affiliations:
  • School of Computing and Information Technology, University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1SB, United Kingdom

  • Venue:
  • Journal of the American Society for Information Science and Technology
  • Year:
  • 2005

Abstract

The nature of the contents of academic Web sites is of direct relevance to the new field of scientific Web intelligence, and to search engine and topic-specific crawler designers. We analyze word frequencies in national academic Webs using the Web sites of three English-speaking nations: Australia, New Zealand, and the United Kingdom. Strong regularities were found in page size and word frequency distributions, but with significant anomalies. At least 26% of pages contain no words. High frequency words include university names and acronyms, Internet terminology, and computing product names: not always words in common usage away from the Web. A minority of low frequency words are spelling mistakes, with other common types including nonwords, proper names, foreign language terms, and computer science variable names. Based upon these findings, recommendations for data cleansing and filtering are made, particularly for clustering applications. © 2005 Wiley Periodicals, Inc.
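
The kind of analysis the abstract describes (counting word frequencies across pages, noting pages that contain no words, and filtering rare words before clustering) might be sketched roughly as follows. This is an illustrative sketch only, not the paper's method: the tokenization rule, the `min_count` threshold, the function names, and the sample pages are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters; the paper may define it differently.
WORD_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and return its words under the rule above."""
    return WORD_RE.findall(text.lower())


def summarize(pages, min_count=2):
    """Count word frequencies over a list of page texts (hypothetical input).

    Returns the full frequency table, the number of pages with no words,
    and the words that survive a simple low-frequency filter.
    """
    freq = Counter()
    empty = 0
    for text in pages:
        words = tokenize(text)
        if not words:
            empty += 1  # page contributes no words at all
            continue
        freq.update(words)
    # Hypothetical cleansing step: drop words seen fewer than min_count times.
    kept = {w: c for w, c in freq.items() if c >= min_count}
    return freq, empty, kept


# Made-up sample pages, for illustration only.
pages = [
    "University of Wolverhampton home page",
    "",
    "The university library page",
    "12345",
]
freq, empty, kept = summarize(pages)
print(empty, kept)
```

A frequency threshold is only one possible filter; removing university names, acronyms, or other site-specific terms would need an explicit word list, which is not attempted here.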