Schema Notes

This section describes the XML elements and attributes that you need to include in a BinaryTextorConfig.xml configuration file.

<UniversalBinaryTextor>

This root element contains zero or more <FileType> elements.

You can use a 'zero file type' configuration file to disable the BTE.

Zero File Type Configuration Files

If the <UniversalBinaryTextor> element contains zero <FileType> elements, the BTE is effectively disabled. A 'zero file type' configuration file is shown below:

<?xml version="1.0" encoding="utf-8" ?>
<UniversalBinaryTextor>
  <!--This is an empty configuration file-->
</UniversalBinaryTextor>

Note: You cannot use an empty BinaryTextorConfig.xml file to disable the BTE. You must use a 'zero file type' version of BinaryTextorConfig.xml.

<FileType>

This element specifies the type of file that you want the BTE to process. It has the following optional attribute:

name

The name is only used in log files. Use a descriptive name that identifies the file type. For example:

<FileType name="AFF Files for use in the Oil Industry">

Each <FileType> element can contain any number of <MagicNumber> and <Encoding> sub-elements.

<MagicNumber>

This element specifies the magic number, or file signature, of the file type that you want the BTE to process. You can include multiple <MagicNumber> elements for each <FileType> element.

If the file's magic number does not match the magic number specified in the configuration file, the BTE does not process the file.

This element has the following attributes:

value

This attribute specifies the actual magic number (or part of the magic number) used by the file type.

type

This attribute specifies whether the magic number is a hexadecimal string or an ASCII text string:

type="hex-string"
type="ascii-string"

For example, F8DE627B6 and A1A1A1 are valid hexadecimal magic numbers.

Likewise, 'abcPK£' is a valid magic number, because each of its characters is interpreted as a single byte. But 'Ωω' is not a valid magic number (because Ω and ω are not valid ASCII characters).

offset

This attribute specifies the location of the magic number within the file. The location is specified as a character offset, where zero specifies the first character, 1 specifies the second character, and so on.

Examples

This example matches files that begin with the hexadecimal 'EFBBBF' magic number.
```
<MagicNumber value="EFBBBF" type="hex-string" offSet="0"/>
```
This example matches any DLL or executable. These files begin with an 'MZ' magic number:
```
<MagicNumber value="MZ" type="ascii-string" offSet="0" />
```
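The offset attribute also lets you match a signature that does not sit at the start of a file. As a hedged sketch: MP4 files normally carry the ASCII string 'ftyp' in bytes 4 to 7, so matching them needs a non-zero offset. The file type name is illustrative only, the attribute spelling `offSet` simply follows the examples above, and the fragment assumes that the offset counts bytes for hexadecimal strings in the same way as for ASCII strings. The required <Encoding> elements are omitted for brevity.

```xml
<FileType name="MP4 video (illustrative)">
  <!-- 'ftyp' occupies bytes 4-7, so the zero-based offset is 4 -->
  <MagicNumber value="ftyp" type="ascii-string" offSet="4" />
  <!-- The same four bytes written as a hexadecimal string: 66 74 79 70 -->
  <MagicNumber value="66747970" type="hex-string" offSet="4" />
  <!-- Encoding elements would follow here -->
</FileType>
```

Both <MagicNumber> elements describe the same bytes, so the fragment behaves the same whichever way the BTE combines multiple magic numbers.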

<Encoding>

This element specifies the encoding system. The encoding system defines how individual text characters are represented in the files that you want the BTE to process.

The BTE supports four encodings: ASCII, UTF-8, Little Endian UTF-16 (more commonly known as Unicode), and Big Endian UTF-16.

The <Encoding> element has these attributes:

name

This attribute specifies the encoding system type. The supported values are:

ASCII
UTF8
UTF16-LITTLEENDIAN
UTF16-BIGENDIAN

minLength

This attribute specifies the shortest word that you want the BTE to extract, that is, how long a character string must be before the BTE considers it to be a valid string. For example, if you want CA DataMinder policies to detect files that contain the word 'Unipraxis', the BTE only needs to extract strings with a minimum length of 9 characters.

This example specifies a 6-character ASCII string as the shortest word that you want the BTE to extract:

<Encoding name="ASCII" minLength="6">

For each <Encoding> element, you must also specify which character ranges are valid. You define valid character ranges in the <CharSet> sub-elements.
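As a sketch of the 'Unipraxis' case described above, the following fragment sets minLength to 9 for two encodings, so that strings of 9 or more characters are extracted whether the file stores its text as ASCII or as Little Endian UTF-16. The choice of encodings and the character range are assumptions made for illustration; the <CharSet> element is described in the next section.

```xml
<!-- Extract ASCII strings of 9 or more characters -->
<Encoding name="ASCII" minLength="9">
  <CharSet start="0x20" end="0x7E" />
</Encoding>

<!-- Apply the same threshold to text stored as Little Endian UTF-16 -->
<Encoding name="UTF16-LITTLEENDIAN" minLength="9">
  <CharSet start="0x20" end="0x7E" />
</Encoding>
```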

<CharSet>

The character set defines the range of characters that are eligible for extraction. You typically ignore non-printing characters (such as paragraph markers) and only extract printing characters.

The <CharSet> elements determine which characters are considered valid constituents of strings. One or more elements are required.

You identify characters by their Unicode code point. You can identify a valid range by specifying the first and last characters in the range, or you can specify a Unicode block.

start, end

These attributes specify the first and last Unicode code points in the character range. Code points are expressed in hexadecimal with a 0x prefix (for example, 0x20), although the second example below writes them in decimal.

This example specifies the range of printable ASCII characters:

<CharSet start="0x20" end="0x7E" />

This example specifies the lower-case Latin alphabet (a to z), with the code points written in decimal:

<CharSet start="97" end="122" />
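Because an <Encoding> element can hold more than one <CharSet> element, several narrow ranges can be combined. The following sketch restricts extraction to ASCII letters and digits. It assumes that a character counts as valid if it falls inside any one of the listed ranges, which the text above implies but does not state outright.

```xml
<Encoding name="ASCII" minLength="6">
  <CharSet start="0x30" end="0x39" />  <!-- digits 0-9 -->
  <CharSet start="0x41" end="0x5A" />  <!-- upper case A-Z -->
  <CharSet start="0x61" end="0x7A" />  <!-- lower case a-z -->
</Encoding>
```

With these ranges, spaces and punctuation end a string, so the BTE would extract individual words rather than whole phrases.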

blockName

Block names are aliases for character ranges. You can find a list of valid block names at:

http://www.unicode.org/Public/UNIDATA/Blocks.txt

For example, this element is equivalent to the Arabic character range 0x0600..0x06FF:

<CharSet blockName="Arabic" />

Note: The BTE ignores case, spaces, hyphens, and underscores when checking block names. For example, the BTE interprets these block names as being the same:

<CharSet blockName="Basic Latin" />
<CharSet blockName="BasicLatin" />
<CharSet blockName="basic latin" />
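Putting the elements together, a complete configuration file might look like the sketch below. It is assembled only from the elements and attributes described in this section and has not been validated against a BTE installation. The file type name, the minLength values, and the choice of character sets are illustrative assumptions, and the `offSet` spelling follows the earlier examples.

```xml
<?xml version="1.0" encoding="utf-8" ?>
<UniversalBinaryTextor>
  <!-- The name attribute appears only in log files -->
  <FileType name="Windows executables and DLLs">
    <!-- Process only files that begin with 'MZ' -->
    <MagicNumber value="MZ" type="ascii-string" offSet="0" />

    <!-- ASCII strings of 6 or more printable characters -->
    <Encoding name="ASCII" minLength="6">
      <CharSet start="0x20" end="0x7E" />
    </Encoding>

    <!-- UTF-16 strings built from printable ASCII or Cyrillic characters -->
    <Encoding name="UTF16-LITTLEENDIAN" minLength="6">
      <CharSet start="0x20" end="0x7E" />
      <CharSet blockName="Cyrillic" />
    </Encoding>
  </FileType>
</UniversalBinaryTextor>
```

Here 'Cyrillic' is the Unicode block covering 0x0400..0x04FF. A range is used for ASCII instead of the 'Basic Latin' block because that block also includes the non-printing control characters below 0x20.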