Field Types C1, C2, and C3

Field Types C1, C2, and C3—Character Data

These field types are compressed using the Huffman algorithm, coupled with elimination of successive repetitions of the same byte value. Each byte value is assigned a variable-length bit code: the most frequently occurring value receives the shortest code and the least frequently occurring value the longest. The frequency of occurrence of each value is determined during the Prepass and is stored in one of three character frequency tables, one for each of the character-type RDL field specifications C1, C2, and C3. When coding RDL specifications for type C fields, try to assign to the same frequency table those fields whose byte values are likely to have a similar distribution.
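The relationship between frequency and code length can be illustrated with a minimal sketch of textbook Huffman coding. It is not the CA Compress implementation; the function name huffman_code_lengths and the sample data are assumptions made for illustration only.

```python
import heapq
from collections import Counter

def huffman_code_lengths(freq):
    """Return {byte value: code length in bits} for a frequency table.

    Textbook Huffman construction; the product's internals may differ.
    """
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct value still needs one bit per occurrence.
        return {sym: 1 for sym in freq}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in a.items()}
        merged.update({s: d + 1 for s, d in b.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

sample = b"MISSISSIPPI RIVER"          # illustrative data, not from a real file
freq = Counter(sample)                  # stands in for a Prepass frequency table
lengths = huffman_code_lengths(freq)
total_bits = sum(freq[s] * lengths[s] for s in freq)
```

In this sample the letter I occurs most often, so it receives one of the shortest codes, and total_bits comes out well below the 8 bits per byte of the uncompressed field.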

For example, you can define predominantly alphabetic fields as type C1, predominantly numeric fields as C2, and fields with another kind of distribution as C3. This yields a better compression ratio than defining all fields as the same type, at no increase in processing overhead. Apart from each having its own frequency table, types C1, C2, and C3 are treated identically by CA Compress.
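The benefit of separate tables can be checked with a small sketch that compares the Huffman-coded size of an alphabetic field and a numeric field under one shared table and under two separate tables. The helper huffman_bits and the field contents are hypothetical, and the comparison ignores the space needed to store the tables themselves.

```python
import heapq
from collections import Counter

def huffman_bits(data):
    """Total bits needed to Huffman-code `data` with a table built from `data`.

    The cost of a Huffman tree equals the sum of the weights of all
    internal nodes, so the codes themselves never need to be built.
    """
    weights = list(Counter(data).values())
    if len(weights) <= 1:
        return len(data)  # one distinct value: one bit per byte
    heapq.heapify(weights)
    total = 0
    while len(weights) > 1:
        merged = heapq.heappop(weights) + heapq.heappop(weights)
        total += merged
        heapq.heappush(weights, merged)
    return total

# Illustrative field contents only.
names = b"SMITH JOHN          MARTINEZ ROSA       "
zips = b"07030100019021060614"

shared = huffman_bits(names + zips)                 # one table for everything
separate = huffman_bits(names) + huffman_bits(zips) # one table per field type
```

Because each separate table is optimal for its own field, separate can never exceed shared; with alphabets as different as letters and digits it is strictly smaller.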

Consider a name and address file in which the name occupies the first 40 positions, the street address the next 39, the city and state the next 28, and the ZIP code the final 5, for a record length of 112. The code could be:

C1F112.

but

C1F40, C2F39, C3F28, C2F5.

gives better results. After the Prepass, the C1 table is heavily skewed toward alphabetics, with the letters M, R, and S, the blank, and the vowels used most frequently. The more frequently a character is used, the shorter its bit code. The street address field consists predominantly of alphabetics, numerics, and blanks, so its compression may be improved by using a separate frequency table, C2. Because the ZIP code field is numeric, it can also be assigned to C2, although it is probably better for both fields to code it ZRF5. The city and state field may have approximately the same distribution as the name field, but because one more table (C3) is available, it should be used.
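How a layout such as this one feeds three separate frequency tables can be sketched as follows. The sketch assumes that F followed by a number denotes a fixed field length in bytes, as the example implies; parse_layout and tally are hypothetical helpers, not part of CA Compress, and the sample record is invented.

```python
import re
from collections import Counter

def parse_layout(spec):
    """Parse e.g. 'C1F40, C2F39.' into [('C1', 40), ('C2', 39)].

    Assumed syntax: type C1/C2/C3 followed by F and a fixed byte length.
    """
    return [(m.group(1), int(m.group(2)))
            for m in re.finditer(r"(C[123])F(\d+)", spec)]

def tally(records, layout):
    """Accumulate one byte-frequency table per field type, Prepass-style."""
    tables = {"C1": Counter(), "C2": Counter(), "C3": Counter()}
    for record in records:
        offset = 0
        for table, length in layout:
            # Iterating over bytes yields integer byte values.
            tables[table].update(record[offset:offset + length])
            offset += length
    return tables

layout = parse_layout("C1F40, C2F39, C3F28, C2F5.")

# One invented 112-byte record: name, street, city and state, ZIP code.
record = (b"SMITH JOHN".ljust(40) + b"12 MAIN STREET".ljust(39)
          + b"HOBOKEN NJ".ljust(28) + b"07030")
tables = tally([record], layout)
```

Here the street address and the ZIP code both contribute to the C2 table, while the name and the city and state each get a table of their own.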

Note: In the improbable event that all 256 byte values are equally represented in the file, each character translates into an 8-bit code. Even in this case, some compression may be obtained through the type C automatic elimination of successive duplicate byte values.
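Both points in the note can be illustrated with a short sketch. The ideal code length for 256 equally likely values is log2(256) = 8 bits, and a field with long runs of one value still shrinks once the runs are collapsed. The (value, run length) representation below is an assumption for illustration; the source does not describe how CA Compress actually encodes repetitions.

```python
import math
from itertools import groupby

# With 256 equally likely byte values, no code can average fewer than
# log2(256) bits per value, so every value ends up with an 8-bit code.
bits_per_value = -math.log2(1 / 256)

def collapse_runs(data):
    """Hypothetical run representation: a list of (byte value, run length)."""
    return [(value, len(list(group))) for value, group in groupby(data)]

# A 40-byte field padded with trailing blanks, a common source of runs.
field = b"ACME" + b" " * 36
runs = collapse_runs(field)
```

The 40-byte field collapses to five runs, the last of which covers all 36 trailing blanks.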