Academic homepages are rich sources of information on scientific research and researchers. Most researchers provide information about themselves and links to their research publications on their homepages. In this study, we address the following questions related to academic homepages: (1) How many academic homepages are there on the Web? (2) Can we accurately discriminate between academic homepages and other webpages? (3) What information can be extracted about researchers from their homepages? To address the first question, we use mark-recapture techniques, commonly employed in biometrics to estimate animal population sizes. Our results indicate that academic homepages comprise a small fraction of the Web, making automatic methods for discriminating them crucial. For the second question, we study the performance of content-based features for classifying webpages. We propose the use of topic models for identifying content-based features for classification and show that a small set of LDA-based features outperforms term features selected using traditional techniques such as aggregate term frequencies or mutual information. Finally, we address the extraction of names and research interests from an academic homepage. Term-topic associations obtained from topic models are used to design a novel, unsupervised technique for identifying the short segments of an academic homepage that describe the researcher's research interests. We show the efficacy of our proposed methods on all three tasks by evaluating them experimentally on multiple publicly available datasets.
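To make the mark-recapture idea concrete, here is a minimal sketch of the classic two-sample setting. The abstract does not say which estimator the study uses, so the Lincoln-Petersen and Chapman estimators below are an assumption chosen for illustration, and the function names and sample URLs are hypothetical. The intuition: if two independent samples are drawn from the same population, the fraction of the second sample that was already seen in the first approximates the fraction of the whole population covered by the first sample.

```python
def lincoln_petersen(n_first, n_second, n_both):
    """Two-sample population estimate: N ~= n_first * n_second / n_both.

    n_first  -- items seen ("marked") in the first sample
    n_second -- items seen in the second sample
    n_both   -- items seen in both samples ("recaptured")
    """
    if n_both <= 0:
        raise ValueError("estimate is undefined when the samples do not overlap")
    return n_first * n_second / n_both


def chapman(n_first, n_second, n_both):
    """Chapman's variant, which is less biased for small samples
    and stays defined when the overlap is zero."""
    return (n_first + 1) * (n_second + 1) / (n_both + 1) - 1


def estimate_from_samples(sample_a, sample_b):
    """Apply the Lincoln-Petersen estimate to two collections of identifiers
    (for example, homepage URLs gathered by two independent crawls)."""
    set_a, set_b = set(sample_a), set(sample_b)
    return lincoln_petersen(len(set_a), len(set_b), len(set_a & set_b))


if __name__ == "__main__":
    # Hypothetical counts: 200 pages in the first sample, 150 in the second,
    # 30 pages common to both.
    print(lincoln_petersen(200, 150, 30))
    print(round(chapman(200, 150, 30), 2))

    # Hypothetical URL samples: 3 in each, 1 shared.
    crawl_a = ["u1", "u2", "u3"]
    crawl_b = ["u3", "u4", "u5"]
    print(estimate_from_samples(crawl_a, crawl_b))
```

Both estimators assume that the two samples are independent and that every page is equally likely to be sampled. Samples drawn through search engines rarely satisfy these conditions exactly, so an estimate of this kind is best read as a rough order of magnitude.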