On the cost of multilingualism in database systems

Authors:
A. Kumaran;Jayant R. Haritsa
Affiliations:
Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India;Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India
Venue:
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Year:
2003

Abstract

Database engines are well-designed for storing and processing text data based on Latin scripts. But in today's global village, databases should ideally support multilingual text data equally efficiently. While current database systems do support management of multilingual data, we are not aware of any prior studies that compare and quantify their performance in this regard. In this paper, we first compare the multilingual functionality provided by a suite of popular database systems. We find that while the systems support most SQL-defined multilingual functionality, some needed features are not yet implemented. We then profile their performance in handling text data in ISO:8859, the standard database character set, and in Unicode, the multilingual character set. Our experimental results indicate significant performance degradation while handling multilingual data in these database systems. Worse, we find that the query optimizer's accuracy differs between standard and multilingual data types. As a first step towards alleviating the above problems, we propose Cuniform, a compressed format that is trivially convertible to Unicode. Our initial experimental results with Cuniform indicate that it largely eliminates the performance degradation for multilingual scripts with small repertoires. Further, the Cuniform format can elegantly support extensions to SQL for multilexical text processing.