Building a search engine for computer science course syllabi

Authors:
Nakul Rathod;Lillian Cassel
Affiliations:
Villanova University, Villanova, PA, USA;Villanova University, Villanova, PA, USA
Venue:
Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries
Year:
2013

Abstract

Syllabi are rich educational resources. However, generic search engines do a poor job of finding Computer Science syllabi. Toward our goal of building a syllabus collection, we have trained several machine learning classifiers to distinguish Computer Science syllabi from other web pages and to identify, among other things, the discipline each syllabus represents (for instance, AI or SE). We crawled the websites of 50 Computer Science departments in the US and gathered 100,000 candidate pages. Our best classifiers are more than 90% accurate at identifying syllabi in real-world data. The syllabus repository we created is live for public use (at http://syllabus.sdakak.com) and contains more than 3000 syllabi that our classifiers extracted from the crawl data. We present an analysis of the various feature selection methods and classifiers used.
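
The abstract does not say which classifier or feature selection method performed best, so the following is only a minimal sketch of the general pipeline it describes: select a reduced vocabulary, train a classifier on labeled pages, and label new candidate pages. It assumes document-frequency feature selection and a multinomial Naive Bayes classifier as stand-ins. The function names, the `TRAIN` pages, and the labels `"syllabus"` and `"other"` are all hypothetical and invented for illustration; none of them come from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a page and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def select_features(docs, k):
    """Keep the k terms that appear in the most documents (document frequency)."""
    df = Counter()
    for text in docs:
        df.update(set(tokenize(text)))
    # Sort by descending document frequency, then alphabetically for determinism.
    ranked = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:k]}


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels, vocab):
        self.vocab = vocab
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.log_like = {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(
                t for d in class_docs for t in tokenize(d) if t in vocab
            )
            total = sum(counts.values()) + len(vocab)
            self.log_like[c] = {
                t: math.log((counts[t] + 1) / total) for t in vocab
            }
        return self

    def predict(self, text):
        """Return the class with the highest log posterior for the page."""
        tokens = [t for t in tokenize(text) if t in self.vocab]
        scores = {
            c: self.log_prior[c] + sum(self.log_like[c][t] for t in tokens)
            for c in self.classes
        }
        return max(scores, key=scores.get)


# Hypothetical toy training pages; real crawl data would be far larger.
TRAIN = [
    ("course syllabus grading policy textbook office hours lectures", "syllabus"),
    ("syllabus for algorithms course prerequisites grading exams textbook", "syllabus"),
    ("faculty research group publications and news", "other"),
    ("department news events and faculty directory", "other"),
]

docs = [text for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
vocab = select_features(docs, k=10)
model = NaiveBayes().fit(docs, labels, vocab)

print(model.predict("grading policy and textbook for this course"))
print(model.predict("faculty news and department events"))
```

Document frequency is used here only because it is simple to compute without external libraries; other selection criteria or classifiers could be swapped in through the same `select_features` and `fit` interfaces.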