The nature of the contents of academic Web sites is of direct relevance to the new field of scientific Web intelligence, and to designers of search engines and topic-specific crawlers. We analyze word frequencies in national academic Webs using the Web sites of three English-speaking nations: Australia, New Zealand, and the United Kingdom. Strong regularities were found in page size and word frequency distributions, but with significant anomalies. At least 26% of pages contain no words. High-frequency words include university names and acronyms, Internet terminology, and computing product names, which are not always words in common usage away from the Web. A minority of low-frequency words are spelling mistakes; other common types include nonwords, proper names, foreign-language terms, and computer science variable names. Based upon these findings, recommendations for data cleansing and filtering are made, particularly for clustering applications. © 2005 Wiley Periodicals, Inc.
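As a rough illustration of the kind of analysis the abstract describes, the sketch below counts word frequencies over a set of page texts, tallies pages that contain no words, and applies a simple frequency filter before clustering. It is not the authors' method: the tokenization rule (runs of ASCII letters), the function names, and the parameters `min_count` and `stop_top` are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters; the paper's own
# definition of a word may differ.
WORD_RE = re.compile(r"[A-Za-z]+")


def tokenize(text):
    """Return the lowercased words found in one page's text."""
    return [w.lower() for w in WORD_RE.findall(text)]


def word_frequencies(pages):
    """Count word occurrences over all pages and count pages with no words."""
    counts = Counter()
    empty_pages = 0
    for text in pages:
        words = tokenize(text)
        if not words:
            empty_pages += 1
        counts.update(words)
    return counts, empty_pages


def filter_vocabulary(counts, min_count=2, stop_top=1):
    """Keep words seen at least min_count times, minus the stop_top most
    frequent words. Both thresholds are hypothetical illustration values."""
    top = {w for w, _ in counts.most_common(stop_top)}
    return {w for w, c in counts.items() if c >= min_count and w not in top}


if __name__ == "__main__":
    pages = ["The university home page of the web", "", "the university", "12345"]
    counts, empty_pages = word_frequencies(pages)
    print(counts.most_common(3))
    print("pages with no words:", empty_pages)
    print(sorted(filter_vocabulary(counts)))
```

Dropping both the rarest and the most frequent words is one common way to reduce noise from misspellings and site-wide boilerplate terms; suitable thresholds would depend on the corpus.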