Character N-Gram Tokenization for European Language Text Retrieval

Authors:
Paul McNamee;James Mayfield
Affiliations:
Applied Physics Laboratory, Johns Hopkins University, 11100 Johns Hopkins Road, Laurel, MD 20723-6099, USA. mcnamee@jhuapl.edu;Applied Physics Laboratory, Johns Hopkins University, 11100 Johns Hopkins Road, Laurel, MD 20723-6099, USA. mayfield@jhuapl.edu
Venue:
Information Retrieval
Year:
2004

Abstract

The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n = 4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.
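
To make the idea of overlapping character n-gram tokenization concrete, the following is a minimal sketch with n = 4 as the default. The function name `char_ngrams` is hypothetical, and the normalization choices (lowercasing, collapsing whitespace, padding with a single space at each end, letting n-grams span word boundaries) are illustrative assumptions; the abstract does not specify these details.

```python
from typing import List


def char_ngrams(text: str, n: int = 4) -> List[str]:
    """Return the overlapping character n-grams of `text`, in order.

    Hypothetical helper for illustration only; not the authors' code.
    """
    # Assumption: fold case and collapse runs of whitespace to one space.
    normalized = " ".join(text.lower().split())
    if not normalized:
        return []
    # Assumption: pad with one space on each side so that word-initial
    # and word-final n-grams are distinguishable from word-internal ones.
    padded = f" {normalized} "
    if len(padded) < n:
        # Text shorter than n: keep it as a single short token.
        return [padded]
    # Slide a window of width n one character at a time.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("juggling"))
    # [' jug', 'jugg', 'uggl', 'ggli', 'glin', 'ling', 'ing ']
```

Under these assumptions a padded string of length m yields m - n + 1 tokens, so a single eight-letter word produces seven 4-grams rather than one word token. This is one way to see where the increased storage and time requirements mentioned in the abstract come from.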