Database engines are well designed for storing and processing text data in Latin scripts. In today's global village, however, databases should ideally support multilingual text data equally efficiently. While current database systems do support management of multilingual data, we are not aware of any prior studies that compare and quantify their performance in this regard. In this paper, we first compare the multilingual functionality provided by a suite of popular database systems. We find that while the systems support most SQL-defined multilingual functionality, some needed features are not yet implemented. We then profile their performance in handling text data in ISO 8859, the standard database character set, and in Unicode, the multilingual character set. Our experimental results indicate significant performance degradation when these database systems handle multilingual data. Worse, we find that the query optimizer's accuracy differs between standard and multilingual data types. As a first step towards alleviating these problems, we propose Cuniform, a compressed format that is trivially convertible to Unicode. Our initial experimental results with Cuniform indicate that it largely eliminates the performance degradation for multilingual scripts with small repertoires. Further, the Cuniform format can elegantly support extensions to SQL for multilexical text processing.