Automatic semantic subject indexing of web documents in highly inflected languages

Authors:
Reetta Sinkkilä;Osma Suominen;Eero Hyvönen
Affiliations:
Semantic Computing Research Group (SeCo), Aalto University, Department of Media Technology, University of Helsinki, Department of Computer Science;Semantic Computing Research Group (SeCo), Aalto University, Department of Media Technology, University of Helsinki, Department of Computer Science;Semantic Computing Research Group (SeCo), Aalto University, Department of Media Technology, University of Helsinki, Department of Computer Science
Venue:
ESWC'11 Proceedings of the 8th extended semantic web conference on The semantic web: research and applications - Volume Part I
Year:
2011

Abstract

Structured semantic metadata about unstructured web documents can be created using automatic subject indexing methods, avoiding laborious manual indexing. A successful automatic subject indexing tool for the web should work with texts in multiple languages and be independent of the domain of discourse of the documents and controlled vocabularies. However, analyzing text written in a highly inflected language requires word form normalization that goes beyond rule-based stemming algorithms. We have tested the state-of-the-art automatic indexing tool Maui on Finnish texts using three stemming and lemmatization algorithms, with documents and vocabularies from different domains. Both of the lemmatization algorithms we tested performed significantly better than a rule-based stemmer, and the subject indexing quality was found to be comparable to that of human indexers.
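
The point about rule-based stemming can be illustrated with a toy sketch. It is not Maui's code and does not reproduce any of the algorithms evaluated in the paper: the suffix list, the lemma table, and the function names `naive_stem` and `lemmatize` are all hypothetical. It uses the Finnish noun "katu" (street), whose stem consonant changes in inflected forms such as "kadun" and "kadulla", so stripping endings alone does not recover a form that matches a vocabulary term.

```python
# Toy illustration only: the suffix list and the lemma table are hypothetical,
# not taken from Maui or from any real Finnish morphological analyzer.

SUFFIXES = ["ssa", "lla", "lle", "n"]  # a few Finnish case endings

# Hand-written lookup standing in for a real lemmatizer.
LEMMAS = {
    "katu": "katu",
    "kadun": "katu",
    "kadulla": "katu",
}


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word: str) -> str:
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)


forms = ["katu", "kadun", "kadulla"]
stems = {naive_stem(w) for w in forms}    # two distinct stems: "katu", "kadu"
lemmas = {lemmatize(w) for w in forms}    # a single lemma: "katu"
```

In this sketch the stemmer splits one word into two index keys, so a vocabulary concept labelled "katu" would miss the inflected occurrences, while the lookup-based normalization maps all three forms to the same label.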