How Does the Binary Text Extractor Work?

You configure the Binary Text Extractor (BTE) to extract text from specific file types. If CA DataMinder cannot analyze a file using its standard methods, it calls the BTE. If the BTE recognizes the file type, it extracts the text and passes it to a CA DataMinder policy engine or endpoint agent for analysis. CA DataMinder then applies policy to the file as normal.

A configuration file, BinaryTextorConfig.xml, specifies which file types the BTE can process. For each file type, the configuration file specifies the 'magic number' that identifies the file type and the criteria for the text to extract:

Magic Number

The magic number is a file signature (a text string or hexadecimal string) that identifies the file type that you want the BTE to support.

If the BTE detects that a file contains a magic number listed in BinaryTextorConfig.xml, the BTE proceeds to extract the text content.
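
As a rough illustration of this check, the following Python sketch tests whether a file's bytes contain any signature from a list. The function names, the '0x' prefix used to mark hexadecimal signatures, and the two sample signatures are assumptions made for this example. They are not taken from BinaryTextorConfig.xml, and the sketch makes no claim about where in a file the BTE actually looks for a magic number.

```python
from typing import Sequence


def parse_signature(spec: str) -> bytes:
    """Convert a signature written as text or as hex into raw bytes.

    Assumed convention for this sketch only: a leading '0x' marks a
    hexadecimal signature; anything else is treated as ASCII text.
    """
    if spec.lower().startswith("0x"):
        return bytes.fromhex(spec[2:])
    return spec.encode("ascii")


def contains_magic_number(data: bytes, signatures: Sequence[bytes]) -> bool:
    """Return True if any of the signatures occurs anywhere in the data."""
    return any(signature in data for signature in signatures)


# Illustrative signatures: the text string that opens PDF files and the
# first four bytes of the OLE compound file header.
signatures = [parse_signature("%PDF"), parse_signature("0xD0CF11E0")]

pdf_like = b"%PDF-1.4\n1 0 obj"
ole_like = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1" + b"\x00" * 8
unknown = b"no known signature here"
```

In this sketch, a file whose bytes match no listed signature is simply skipped, which mirrors the behavior described above: extraction only proceeds for recognized file types.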

Extracted Text

When you specify the text that you want the BTE to extract, you specify the encoding system, the character range, and the minimum string length. The BTE then extracts any text strings that match these requirements.

The encoding system defines how individual text characters are represented in the files that you want to analyze. The BTE supports ASCII, UTF-8, Little Endian UTF-16 (often referred to simply as 'Unicode' in Windows applications), and Big Endian UTF-16.
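
To show why the encoding matters, the following Python sketch encodes the same two-character string in each of the four encodings and prints the resulting bytes. It uses Python's standard codecs only; the variable names are chosen for this example and have no connection to the BTE configuration.

```python
text = "Hi"

# Map a readable label to the name of the matching Python codec.
codecs = {
    "ASCII": "ascii",
    "UTF-8": "utf-8",
    "Little Endian UTF-16": "utf-16-le",
    "Big Endian UTF-16": "utf-16-be",
}

encoded = {label: text.encode(codec) for label, codec in codecs.items()}

for label, raw in encoded.items():
    # Show each byte as two hex digits, separated by spaces.
    print(f"{label:22} {raw.hex(' ')}")
```

For plain English text, ASCII and UTF-8 produce identical bytes (48 69), while the two UTF-16 variants pad each character with a zero byte: 48 00 69 00 for Little Endian and 00 48 00 69 for Big Endian. A search that assumes the wrong encoding therefore does not find contiguous text in the file.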

The character range defines which characters are eligible for extraction. You typically ignore non-printing characters (such as paragraph markers) and only extract printing characters.

The minimum string length defines the shortest string that you want the BTE to extract. For example, if you want CA DataMinder policies to detect files that contain the word 'Unipraxis', the BTE only needs to extract strings that are at least 9 characters long.
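
The three extraction settings can be sketched together in Python. This is a minimal illustration, not the BTE's implementation: the function name, its parameters, and the sample data are hypothetical, and the character range is restricted to code points below 0x80 so that each character occupies one byte in ASCII or UTF-8 and one byte plus a zero byte in UTF-16.

```python
import re
from typing import List


def extract_strings(data: bytes, encoding: str = "ascii",
                    first: int = 0x20, last: int = 0x7E,
                    min_length: int = 4) -> List[str]:
    """Return every run of at least min_length characters in first..last.

    Simplification for this sketch: first and last must be below 0x80.
    """
    if not 0 <= first <= last < 0x80:
        raise ValueError("character range must lie within 0x00-0x7F")
    if min_length < 1:
        raise ValueError("min_length must be at least 1")

    # A byte-level character class such as [\x20-\x7e].
    char = b"[\\x%02x-\\x%02x]" % (first, last)

    # One encoded character, depending on the encoding system.
    if encoding in ("ascii", "utf-8"):
        unit = char
    elif encoding == "utf-16-le":
        unit = char + b"\\x00"      # low byte first, then the zero byte
    elif encoding == "utf-16-be":
        unit = b"\\x00" + char      # zero byte first, then the low byte
    else:
        raise ValueError("unsupported encoding: " + encoding)

    # Match min_length or more consecutive encoded characters.
    pattern = re.compile(b"(?:%b){%d,}" % (unit, min_length))
    return [match.group().decode(encoding) for match in pattern.finditer(data)]


# Hypothetical binary content: the same word stored once as ASCII and once
# as Little Endian UTF-16, separated by non-printing bytes and a short string.
sample = (b"\x00\x01Unipraxis\x00\xffabc\x02"
          + "Unipraxis".encode("utf-16-le") + b"\x00\x00")

ascii_hits = extract_strings(sample, "ascii", min_length=9)
utf16_hits = extract_strings(sample, "utf-16-le", min_length=9)
short_hits = extract_strings(sample, "ascii", min_length=3)
```

With a minimum length of 9, the short string 'abc' is ignored and only 'Unipraxis' is returned for each encoding. Lowering the minimum to 3 also returns 'abc', which shows how a higher minimum string length reduces the amount of irrelevant text passed on for analysis.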