On the cost of multilingualism in database systems

  • Authors:
  • A. Kumaran;Jayant R. Haritsa

  • Affiliations:
  • Department of computer science and automation, Indian Institute of Science, Bangalore, India;Department of computer science and automation, Indian Institute of Science, Bangalore, India

  • Venue:
  • VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
  • Year:
  • 2003

Quantified Score

Hi-index 0.00

Visualization

Abstract

Database engines are well-designed for storing and processing text data based on Latin scripts. But in today's global village, databases should ideally support multilingual text data equally efficiently. While current database systems do support management of multilingual data, we are not aware of any prior studies that compare and quantify their performance in this regard. In this paper, we first compare the multilingual functionality provided by a suite of popular database systems. We find that while the systems support most SQL-defined multilingual functionality, some needed features are not yet implemented. We then profile their performance in handling text data in IS0:8859, the standard database character set, and in Unicode, the multilingual character set. Our experimental results indicate significant performance degradation while handling multilingual data in these database systems. Worse, we find that the query optimizer's accuracy is different between standard and multilingual data types. As a first step towards alleviating the above problems, we propose Cuniform, a compressed format that is trivially convertible to Unicode. Our initial experimental results with Cuniform indicate that it largely eliminates the performance degradation for multilingual scripts with small repertoires. Further, the Cuniform format can elegantly support extensions to SQL for multilexical text processing.