Unicode is a character encoding standard that allows an application to process characters and strings from most of the languages of the world. More than one million unique code points can be represented in Unicode. Unicode is neither a single-byte character set (SBCS) nor a double-byte character set (DBCS).
The C# and Java languages use Unicode exclusively for characters, character literals, and strings represented in code. This is true for all C# and Java applications, not just those generated by CA Gen.
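For example, a Java string literal can mix characters from different scripts, and the runtime keeps the full Unicode code point of each character regardless of the operating system codepage. The following is a minimal, illustrative sketch (not part of any CA Gen generated code):

    public class UnicodeLiteralDemo {
        public static void main(String[] args) {
            // A Java string literal may contain any Unicode character;
            // internally it is stored as 16-bit UTF-16 code units.
            String greeting = "Grüße, 世界";
            System.out.println("Characters: " + greeting.length());
            // Print each code point to show that the full Unicode value is kept.
            greeting.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
        }
    }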
Users of generated C# or Java applications can therefore enter input data that cannot be stored in a database that is not configured to store Unicode.
Customers should evaluate the advantages and disadvantages of converting the application's database to Unicode. If the database is not converted, C# and Java applications could submit unrecognized characters that cannot be stored correctly. Alternatively, customers may choose to use only C and COBOL applications with a database that is set to store DBCS characters.
This creates challenges when you attempt to use C# or Java code to access databases. Most existing databases store characters in a codepage that represents only a subset of the Unicode characters, so many of the characters processed by a C# or Java application cannot be represented in the database. Unsupported characters are usually replaced with a substitution character or a '?', depending on the database translation algorithm.
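The replacement behavior can be seen by encoding a string into a non-Unicode codepage. In the illustrative Java sketch below, ISO-8859-1 stands in for any non-Unicode database codepage; the actual substitution character depends on the database and driver:

    import java.nio.charset.StandardCharsets;

    public class CodepageLossDemo {
        public static void main(String[] args) {
            // The application string contains a character (the Euro sign)
            // that has no representation in the ISO-8859-1 codepage.
            String value = "price: 5 €";

            // Encoding to the narrower codepage silently replaces the
            // unmappable character with the charset's substitution byte ('?').
            byte[] stored = value.getBytes(StandardCharsets.ISO_8859_1);
            String readBack = new String(stored, StandardCharsets.ISO_8859_1);

            System.out.println(readBack);   // prints "price: 5 ?"
        }
    }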
Converting existing databases to Unicode is possible with most implementations and would allow all characters entered by a user and processed by the code to be stored. However, converting a database to use Unicode could require changes to existing applications and to the way they access the database.
For a CA Gen application that creates a new database, CA suggests that the new database always support Unicode encoding. Two major Unicode encoding schemes are supported by most databases: UTF-8 and UTF-16 (often implemented as UCS-2). UTF-16 is the simplest because every character in the database is represented as an unsigned 16-bit code unit, which is the same representation used for characters in a C# or Java application. However, UTF-16 uses more storage than necessary when most of the characters in a database have code points below 128 (0x80), because each of those characters still occupies two bytes. UTF-8 (8-bit Unicode Transformation Format) solves this problem: code points below 128 require only a single 8-bit byte of storage, code points between 128 and 2047 require two bytes, code points between 2048 and 65535 require three bytes, and so on.
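The storage trade-off can be made concrete by encoding the same strings with both schemes. The sketch below is illustrative only; note that the UTF-16 byte counts include the 2-byte byte-order mark written by the Java charset implementation:

    import java.nio.charset.StandardCharsets;

    public class EncodingSizeDemo {
        public static void main(String[] args) {
            String ascii = "CustomerName";   // all code points below 128
            String cjk   = "顧客名";          // code points between 2048 and 65535

            // UTF-8: one byte per ASCII character, three bytes per CJK character.
            System.out.println("ASCII UTF-8  bytes: "
                    + ascii.getBytes(StandardCharsets.UTF_8).length);    // 12
            System.out.println("CJK   UTF-8  bytes: "
                    + cjk.getBytes(StandardCharsets.UTF_8).length);      // 9

            // UTF-16: two bytes per character regardless of the code point value,
            // plus the 2-byte byte-order mark added by this charset.
            System.out.println("ASCII UTF-16 bytes: "
                    + ascii.getBytes(StandardCharsets.UTF_16).length);   // 26
            System.out.println("CJK   UTF-16 bytes: "
                    + cjk.getBytes(StandardCharsets.UTF_16).length);     // 8
        }
    }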
Further, a database created to support UTF-8 requires more assumptions to be made about the type and size of the characters stored in each text column. For example, a text column that is fixed at ten characters (bytes) in length correctly stores ten characters only when all ten code points are below 128; any higher code point requires more than one byte, so the full value may no longer fit in the column. When the type of data is unknown, character columns in a UTF-8 database should be defined as variable length and three times the length of the longest string written by the C# or Java application (in this example, 3 * 10, or thirty bytes).
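The sizing rule can be checked directly. In the illustrative sketch below, both values are exactly ten characters long in the application, but only the ASCII value fits in a ten-byte UTF-8 column; the CJK value needs 3 * 10 = 30 bytes:

    import java.nio.charset.StandardCharsets;

    public class Utf8ColumnSizingDemo {
        public static void main(String[] args) {
            // Both values are exactly ten characters long in the application.
            String asciiValue = "ABCDEFGHIJ";
            String cjkValue   = "顧客顧客顧客顧客顧客";   // ten CJK characters

            // In a UTF-8 database the column length is counted in bytes, so the
            // CJK value needs 3 * 10 = 30 bytes even though it is ten characters.
            System.out.println("ASCII bytes in UTF-8: "
                    + asciiValue.getBytes(StandardCharsets.UTF_8).length);  // 10
            System.out.println("CJK bytes in UTF-8:   "
                    + cjkValue.getBytes(StandardCharsets.UTF_8).length);    // 30

            // A variable-length column sized at thirty bytes
            // (for example, VARCHAR(30)) holds either value safely.
        }
    }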