On identifying academic homepages for digital libraries

Authors:
Sujatha Das Gollapalli;C. Lee Giles;Prasenjit Mitra;Cornelia Caragea
Affiliations:
Penn State University, University Park, PA, USA;Penn State University, University Park, PA, USA;Penn State University, University Park, PA, USA;Penn State University, University Park, PA, USA
Venue:
Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
Year:
2011

Abstract

Academic homepages are rich sources of information on scientific research and researchers. Most researchers provide information about themselves and links to their research publications on their homepages. In this study, we address the following questions related to academic homepages: (1) How many academic homepages are there on the Web? (2) Can we accurately discriminate between academic homepages and other webpages? (3) What information can be extracted about researchers from their homepages? To address the first question, we use mark-recapture techniques commonly employed in biometrics to estimate animal population sizes. Our results indicate that academic homepages comprise a small fraction of the Web, which makes automatic methods for discriminating them crucial. For the second question, we study the performance of content-based features for classifying webpages. We propose the use of topic models for identifying content-based features for classification and show that a small set of LDA-based features outperforms term features selected using traditional techniques such as aggregate term frequencies or mutual information. Finally, we address the extraction of names and research interests from an academic homepage. Term-topic associations obtained from topic models are used to design a novel, unsupervised technique that identifies short segments corresponding to the research interests that researchers specify on their academic homepages. We show the efficacy of our proposed methods on all three tasks by experimentally evaluating them on multiple publicly available datasets.
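
As a rough illustration of the mark-recapture idea mentioned in the abstract, the sketch below uses the textbook two-sample Lincoln-Petersen estimator. The abstract does not say which estimator the authors used, so this choice is an assumption; the function names and the toy URL samples are hypothetical.

```python
def lincoln_petersen(marked_first, caught_second, recaptured):
    """Two-sample mark-recapture estimate: N is roughly n1 * n2 / m.

    marked_first  -- size of the first sample (items "marked")
    caught_second -- size of the second, independent sample
    recaptured    -- number of items that appear in both samples
    """
    if recaptured <= 0:
        raise ValueError("at least one recaptured item is needed")
    return marked_first * caught_second / recaptured


def estimate_from_samples(sample_a, sample_b):
    """Estimate population size from two samples of identifiers (e.g. URLs)."""
    a, b = set(sample_a), set(sample_b)
    return lincoln_petersen(len(a), len(b), len(a & b))


if __name__ == "__main__":
    # Hypothetical samples of homepage URLs drawn from two independent sources.
    first = ["u1", "u2", "u3", "u4"]
    second = ["u3", "u4", "u5"]
    print(estimate_from_samples(first, second))
```

The estimate is only meaningful when the two samples are drawn independently and every page is equally likely to be sampled, conditions that are hard to guarantee on the Web; the sketch ignores such corrections.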