Specifying a Custom Character Set


To use a custom character set, edit the xcom.glb file: set the INTERNAL_CONVERSION_TABLES parameter, and then set either the CODETABL value or the ETOA_FILENAME and/or ATOE_FILENAME values, depending on which custom character set files you created. To activate the changes in xcom.glb, you must restart xcomd.

To specify a custom character set in xcom.glb

1. From the command line, enter the following:
```
vi xcom.glb
```
   The xcom.glb file is opened for editing.
2. Set INTERNAL_CONVERSION_TABLES=NO.
3. Do one of the following:
   a. Change the value of CODETABL to the 1 to 3 character prefix specified for the xxxatoe.tab/xxxetoa.tab file names you created.
   b. Change the values for ETOA_FILENAME and/or for ATOE_FILENAME to the file names containing the customized files you created.
4. Save xcom.glb and exit the editor.
5. Restart xcomd to activate the changes made.
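
The parameter changes described above can also be scripted. The following Python sketch is illustrative only: it assumes that xcom.glb holds plain KEY=VALUE lines (as suggested by `INTERNAL_CONVERSION_TABLES=NO`), the function name `set_glb_params` is hypothetical, and the prefix `abc` is a placeholder for the prefix you chose for your own table files. It works on text in memory and does not touch the real file or restart xcomd.

```python
import re


def set_glb_params(text: str, updates: dict[str, str]) -> str:
    """Return `text` with each KEY=VALUE line named in `updates` rewritten.

    Assumption: parameters are stored one per line as KEY=VALUE.
    Keys that are not already present are appended at the end.
    """
    remaining = dict(updates)
    out_lines = []
    for line in text.splitlines():
        match = re.match(r"\s*([A-Z_]+)\s*=", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            out_lines.append(f"{key}={remaining.pop(key)}")
        else:
            out_lines.append(line)
    # Append any parameter that was not found in the existing text.
    out_lines.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out_lines) + "\n"


# Hypothetical file content and a placeholder prefix "abc".
sample = "INTERNAL_CONVERSION_TABLES=YES\nCODETABL=\n"
print(set_glb_params(sample, {"INTERNAL_CONVERSION_TABLES": "NO", "CODETABL": "abc"}))
```

Review the result by hand before saving it over xcom.glb, because the real file may contain comments or layout that this sketch does not anticipate.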

Unicode and Multi-Byte Character Sets Support for Data Transfer

Before the advent of Unicode, a significant number of character sets were devised to permit the representation of symbols used in the Chinese, Japanese, Korean, and Taiwanese (CJK) languages. Today, Unicode is favored and there is an ongoing transition from these legacy character sets to Unicode encodings, most notably UTF-8 and UTF-16. Many CJK legacy multi-byte character sets are ASCII based, as is the case for the most commonly used Unicode encodings (that is, UTF-8 and UTF-16).

In the IBM mainframe (predominantly EBCDIC) world, however, composite character sets are commonly employed, using a shift-in/shift-out encoding method. This mechanism allows a single-byte ASCII or EBCDIC character set to represent Latin characters, in tandem with a multi-byte character set that represents non-Latin characters. Shift-in and shift-out control characters are inserted in the data stream to signal a switch between the two embedded character sets. For example, the CCSID 937 character set combines an EBCDIC single-byte character set with a ‘Traditional Chinese’ multi-byte character set, whilst the CCSID 938 character set combines an ASCII single-byte character set with the same ‘Traditional Chinese’ multi-byte character set.
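
To make the shift-in/shift-out mechanism concrete, the following Python sketch splits a mixed byte stream into single-byte and double-byte runs. It is a toy illustration, not how CA XCOM Data Transport or ICU processes data. It assumes the control values conventionally used in EBCDIC mixed encodings (shift-out 0x0E to enter the double-byte set, shift-in 0x0F to leave it) and that these values never occur inside a double-byte character. The function name `split_mixed` and the sample bytes are hypothetical.

```python
SO = 0x0E  # shift-out: switch to the double-byte character set (assumed value)
SI = 0x0F  # shift-in: switch back to the single-byte character set (assumed value)


def split_mixed(data: bytes) -> list[tuple[str, bytes]]:
    """Split a mixed stream into ("SBCS", bytes) and ("DBCS", bytes) runs."""
    runs = []
    mode = "SBCS"
    current = bytearray()
    for byte in data:
        if byte == SO and mode == "SBCS":
            # Close the single-byte run and start a double-byte run.
            if current:
                runs.append((mode, bytes(current)))
            mode, current = "DBCS", bytearray()
        elif byte == SI and mode == "DBCS":
            # Close the double-byte run and return to single-byte mode.
            if current:
                runs.append((mode, bytes(current)))
            mode, current = "SBCS", bytearray()
        else:
            current.append(byte)
    if current:
        runs.append((mode, bytes(current)))
    return runs


# Two single-byte characters, two placeholder double-byte characters,
# then one more single-byte character.
stream = b"\xc1\xc2" + bytes([SO]) + b"\x4c\x41\x4c\x42" + bytes([SI]) + b"\xc3"
for mode, chunk in split_mixed(stream):
    print(mode, chunk.hex())
```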

CA XCOM Data Transport can transmit text files that are encoded using multi-byte character sets, including in-flight conversion of data between different character sets.

CA XCOM Data Transport utilizes the ICU (International Components for Unicode) toolkit to perform data conversion functions. For information on the ICU toolkit, please refer to the ICU website http://site.icu-project.org/.

Using Unicode Transfer

CA XCOM Data Transport is capable of transmitting data using either the UTF-8 or UTF-16 Unicode encodings. They can be specified with the CODE_FLAG parameter to allow for conversion of files from one character encoding to Unicode and back.

When a file’s data comprises a low ratio of non-Latin characters versus Latin characters, UTF-8 encoding will consume fewer bytes and therefore result in a faster transfer. In contrast, when a file’s data comprises a high ratio of non-Latin characters versus Latin characters, UTF-16 encoding will produce the best result.
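
The size difference is easy to check. The following Python sketch compares the encoded byte counts of a Latin-only string and a Korean string. The sample strings and the function name `encoded_sizes` are illustrative, and the counts exclude any byte order mark.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for `text`, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


latin = "Hello, world"
korean = "안녕하세요"

print(encoded_sizes(latin))   # (12, 24): UTF-8 is half the size
print(encoded_sizes(korean))  # (15, 10): UTF-16 is smaller
```

Each Latin character takes one byte in UTF-8 and two in UTF-16, whereas each Hangul syllable takes three bytes in UTF-8 and two in UTF-16.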

The Unicode transfer Data Formats parameter:

CODE_FLAG

Specifying Charset

The CA XCOM Data Transport sending server converts the input encoding to UTF-8 or UTF-16, while the CA XCOM Data Transport receiving server converts to the required output encoding. This divides the conversion workload between the two CA XCOM Data Transport servers.
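
The two-stage conversion can be sketched in Python as follows. This is a conceptual illustration only: Python's built-in codecs stand in for the ICU converters that the product uses, the function names are hypothetical, and the codec names `cp949` and `euc_kr` are Python's names for the charsets, not necessarily the values that the charset parameters accept.

```python
def sender_to_unicode(raw: bytes, local_charset: str) -> bytes:
    """Sending side: decode the local charset and emit UTF-8 for the wire."""
    return raw.decode(local_charset).encode("utf-8")


def receiver_from_unicode(wire: bytes, remote_charset: str) -> bytes:
    """Receiving side: decode UTF-8 from the wire and emit the remote charset."""
    return wire.decode("utf-8").encode(remote_charset)


source = "한글 text".encode("cp949")             # local file content
wire = sender_to_unicode(source, "cp949")        # what travels between servers
target = receiver_from_unicode(wire, "euc_kr")   # remote file content

print(target.decode("euc_kr"))
```

Because each side only converts to or from Unicode, neither server needs a direct converter between the two legacy charsets.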

Use the LOCAL_CHARSET and REMOTE_CHARSET parameters to choose the character encoding of the local file and the remote file. If they are not specified for the transfer, they default to the value of the DEFAULT_CHARSET global parameter in the xcom.glb file.

For a list of supported Charsets see Appendix C.

The Unicode transfer Charset parameters:

DEFAULT_CHARSET
LOCAL_CHARSET
REMOTE_CHARSET

Handling Conversion Errors

Not every character can be converted between Unicode and another charset. In most cases, Unicode is a superset of the characters supported by any given charset, so a Unicode character may have no equivalent in the target charset.

Use MBCS_INPUTERROR and MBCS_CONVERROR to specify the action that CA XCOM Data Transport takes when it encounters a character that cannot be converted.

When erroneous data is encountered during conversion, the following actions are possible:

Skip the erroneous data and continue.
Replace the erroneous data with the charset's default substitution character.
Replace the erroneous data with the supplied Unicode character.
Fail the transfer.

The CA XCOM Data Transport sending server uses MBCS_INPUTERROR to specify the action, whereas the CA XCOM Data Transport receiving server uses MBCS_CONVERROR. If these parameters are not specified, the values of the DEFAULT_INPUTERROR and DEFAULT_CONVERROR global parameters in xcom.glb are used.
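
The four actions can be illustrated with Python's codec error handlers. This is a sketch of the general idea, not the product's implementation. The action names `SKIP` and `REPLACE_WITH` and the function `convert` are hypothetical (only `FAIL` and `REPLACE` appear as parameter values in this document), and Python's default substitution character (`?`) may differ from the one a given charset defines.

```python
def convert(text: str, charset: str, on_error: str = "FAIL", substitute: str = "?") -> bytes:
    """Encode `text` into `charset`, applying one of four error actions."""
    if on_error == "FAIL":
        # Raises UnicodeEncodeError on the first unconvertible character.
        return text.encode(charset, errors="strict")
    if on_error == "SKIP":
        # Drops unconvertible characters and continues.
        return text.encode(charset, errors="ignore")
    if on_error == "REPLACE":
        # Uses the codec's default substitution character.
        return text.encode(charset, errors="replace")
    if on_error == "REPLACE_WITH":
        # Uses a caller-supplied character instead of the default.
        out = bytearray()
        for ch in text:
            try:
                out += ch.encode(charset)
            except UnicodeEncodeError:
                out += substitute.encode(charset)
        return bytes(out)
    raise ValueError(f"unknown action: {on_error}")


text = "가😀나"  # the emoji has no EUC-KR equivalent
print(convert(text, "euc_kr", "SKIP").decode("euc_kr"))
print(convert(text, "euc_kr", "REPLACE").decode("euc_kr"))
print(convert(text, "euc_kr", "REPLACE_WITH", "*").decode("euc_kr"))
```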

The Unicode transfer Conversion error handling parameters:

DEFAULT_INPUTERROR
DEFAULT_CONVERROR
MBCS_INPUTERROR
MBCS_CONVERROR

Record Processing

CA XCOM Data Transport uses a newline character, also known as a line break or end-of-line (EOL) marker, to identify records in text files. For Unicode transfers, all end-of-line delimiters are supported irrespective of the platform.

Use the LOCAL_DELIM or REMOTE_DELIM parameters to choose the delimiter depending on the charset used in a Unicode transfer. If these parameters are not specified, the value of the DEFAULT_DELIM global parameter in the xcom.glb file is used.
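
The following Python sketch shows what delimiter handling amounts to when an ASCII-based file is converted to EBCDIC. It is illustrative only: the function name `convert_records` is hypothetical, Python's `cp037` codec is assumed to correspond to CCSID 37, and the records are split on LF and rejoined with the EBCDIC NL byte (0x15 in code page 037).

```python
EBCDIC_NL = b"\x15"  # NL (New Line) in code page 037


def convert_records(data: bytes) -> bytes:
    """Convert LF-delimited ISO-8859-1 records to NL-delimited EBCDIC records."""
    text = data.decode("iso-8859-1")
    records = text.split("\n")
    # A trailing LF produces an empty final element; it is not a record.
    if records and records[-1] == "":
        records.pop()
    # Encode each record and terminate it with the EBCDIC delimiter.
    return b"".join(record.encode("cp037") + EBCDIC_NL for record in records)


print(convert_records(b"AB\nCD\n").hex())  # c1c215c3c415
```

Treating the delimiter separately from the record data matters here because a plain character-by-character conversion would map LF to the EBCDIC LF byte (0x25) rather than to NL.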

The Unicode transfer Delimiter handling parameters:

DEFAULT_DELIM
LOCAL_DELIM
REMOTE_DELIM

Examples

Example 1:

In the following example, the XCOMTCP command is used to run a Unicode transfer by specifying CODE_FLAG=UTF8.

The local file input.txt encoded in CP949 is converted to EUC-KR.

XCOMTCP -c1 -f MYCONFIG.CNF LOCAL_FILE=input.txt CODE_FLAG=UTF8 LOCAL_CHARSET=CP949 REMOTE_CHARSET=EUC-KR

Example 2:

In the following example, the conversion errors are handled.

If an erroneous character is found while reading the input file, the transfer fails. If an erroneous character is found while converting to the remote charset, the transfer continues, substituting the default substitution character for the character that cannot be converted.

XCOMTCP -c1 -f MYCONFIG.CNF LOCAL_FILE=input.txt CODE_FLAG=UTF16 LOCAL_CHARSET=CP949 REMOTE_CHARSET=EUC-KR MBCS_INPUTERROR=FAIL MBCS_CONVERROR=REPLACE

Example 3:

In the following example, the EBCDIC conversion is handled.

The ASCII-based text file is converted to EBCDIC encoding. The ASCII LF (Line Feed) delimiter is used to detect the end of each record, and the EBCDIC NL (New Line) character is written as the line delimiter in the output file.

XCOMTCP -c1 -f MYCONFIG.CNF LOCAL_FILE=input.txt CODE_FLAG=UTF8 LOCAL_CHARSET=ISO-8859-1 REMOTE_CHARSET=CCSID#37 LOCAL_DELIM=ASCII:LF REMOTE_DELIM=EBCDIC:NL