Sets

A set matches any single character that is a member of the set. Sets are delimited by "[" and "]" and can contain literals, character ranges, character classes, collating elements, and equivalence classes. A set declaration that starts with "^" matches the complement of the elements that follow.

Character literals. Examples:

"[abc]" will match either of "a", "b", or "c".
"[^abc] will match any character other than "a", "b", or "c".

Character ranges. Examples:

"[a-z]" will match any character in the range "a" to "z".
"[^A-Z]" will match any character other than those in the range "A" to "Z".

Note that character ranges are highly locale dependent: they match any character that collates between the endpoints of the range. Ranges behave according to ASCII rules only when the default "C" locale is in effect. For example, if the library is compiled with the Win32 localization model, then [a-z] will match the ASCII characters a-z and also 'A', 'B', etc., but not 'Z', which collates just after 'z'. This locale-specific behavior can be disabled by specifying regbase::nocollate when compiling the expression, which forces ranges to collate according to ASCII character code; this is the default behavior when using regbase::normal. Likewise, if you use the POSIX C API functions, setting REG_NOCOLLATE turns off locale-dependent collation.
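
The difference between collating ranges and ASCII ranges can be modeled with a toy sort key. The sketch below is not the library's implementation: toy_collation_key is a hypothetical key that orders characters by letter first and puts each lower case letter just before its upper case form, which is enough to reproduce the [a-z] example above.

```python
def toy_collation_key(ch: str) -> tuple:
    """Hypothetical sort key: letter first, then lower case before upper case."""
    return (ch.lower(), ch.isupper())

def in_range_collated(ch: str, lo: str, hi: str) -> bool:
    """Range test that compares toy collation keys (locale-style behavior)."""
    return toy_collation_key(lo) <= toy_collation_key(ch) <= toy_collation_key(hi)

def in_range_by_code(ch: str, lo: str, hi: str) -> bool:
    """Range test that compares character codes (nocollate-style behavior)."""
    return ord(lo) <= ord(ch) <= ord(hi)

for ch in "aAzZ":
    print(ch, in_range_collated(ch, "a", "z"), in_range_by_code(ch, "a", "z"))
```

Under the toy key, 'A' falls inside [a-z] and 'Z' falls just outside it, while the code-based test accepts only the lower case letters.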

Character classes are denoted using the syntax "[:classname:]" within a set declaration; for example, "[[:space:]]" is the set of all whitespace characters. Character classes are only available if the flag regbase::char_classes is set. The available character classes are:

alnum	Any alphanumeric character.
alpha	Any alphabetical character a-z and A-Z. Other characters may also be included depending upon the locale.
blank	Any blank character, either a space or a tab.
cntrl	Any control character.
digit	Any digit 0-9.
graph	Any graphical character.
lower	Any lower case character a-z. Other characters may also be included depending upon the locale.
print	Any printable character.
punct	Any punctuation character.
space	Any whitespace character.
upper	Any upper case character A-Z. Other characters may also be included depending upon the locale.
xdigit	Any hexadecimal digit character, 0-9, a-f and A-F.
word	Any word character - all alphanumeric characters plus the underscore.
unicode	Any character whose code is greater than 255; this applies to the wide character traits classes only.
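
Python's re module does not understand the "[:classname:]" syntax, so the sketch below translates a few of the classes in the table into ordinary set bodies before matching. It is an ASCII-only approximation (roughly the "C" locale), not the library's own definition of each class; CLASS_BODIES and expand_classes are hypothetical names.

```python
import re

# Assumption: ASCII-only approximations of some classes from the table above.
CLASS_BODIES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": r" \t",
    "digit": "0-9",
    "lower": "a-z",
    "space": r" \t\n\r\f\v",
    "upper": "A-Z",
    "xdigit": "0-9a-fA-F",
    "word": "a-zA-Z0-9_",
}

def expand_classes(pattern: str) -> str:
    """Replace each [:name:] with an equivalent body usable inside a Python set.

    Raises KeyError for class names that are not in CLASS_BODIES.
    """
    return re.sub(r"\[:([a-z]+):\]", lambda m: CLASS_BODIES[m.group(1)], pattern)

print(expand_classes("[[:digit:]]"))    # [0-9]
print(expand_classes("[^[:alpha:]_]"))  # [^a-zA-Z_]
print(bool(re.fullmatch(expand_classes("[[:xdigit:]]+"), "DEADbeef42")))  # True
```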

Some shortcuts can be used in place of the character classes. Provided the flag regbase::escape_in_lists is set, you can use:

\w in place of [:word:]
\s in place of [:space:]
\d in place of [:digit:]
\l in place of [:lower:]
\u in place of [:upper:]
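
Python's re module happens to accept \w, \s, and \d inside a set, so those three shortcuts can be sketched there; it does not accept \l or \u with the meaning given above, so they are left out. Python is an assumed stand-in, and the helper name find_runs is hypothetical.

```python
import re

# Assumption: Python's re module stands in for the library described here.
def find_runs(set_pattern: str, text: str) -> list:
    """Return every maximal run of characters matched by the set."""
    return re.findall(set_pattern + "+", text)

print(find_runs(r"[\d\s]", "a1 2b"))     # ['1 2']
print(find_runs(r"[\w]", "foo_bar-baz"))  # ['foo_bar', 'baz']
print(find_runs(r"[^\d]", "12ab34"))     # ['ab']
```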

Collating elements take the general form [.tagname.] inside a set declaration, where tagname is either a single character or the name of a collating element. For example, [[.a.]] is equivalent to [a], and [[.comma.]] is equivalent to [,]. The library supports all the standard POSIX collating element names and, in addition, the following digraphs: "ae", "ch", "ll", "ss", "nj", "dz", "lj", each in lower, upper, and title case variations. Multi-character collating elements can result in the set matching more than one character: for example, [[.ae.]] would match two characters, but note that [^[.ae.]] would only match one character.
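
Python's re module has no collating elements, but the effect of a multi-character element inside a set can be approximated with an alternation. The sketch below emulates a set such as [[.ae.]bc], which matches either the two-character sequence "ae" or one of the single characters. The helper set_with_digraph is hypothetical, and it assumes singles is not empty.

```python
import re

# Assumption: an alternation emulates a set that holds one digraph.
def set_with_digraph(digraph: str, singles: str) -> str:
    """Build a pattern matching the digraph or any one character of singles."""
    return "(?:" + re.escape(digraph) + "|[" + re.escape(singles) + "])"

pattern = set_with_digraph("ae", "bc")  # emulates [[.ae.]bc]
print(bool(re.fullmatch(pattern, "ae")))  # True: two characters matched
print(bool(re.fullmatch(pattern, "b")))   # True: one character matched
print(bool(re.fullmatch(pattern, "a")))   # False: "a" alone is not a member
```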

Equivalence classes take the general form [=tagname=] inside a set declaration, where tagname is either a single character or the name of a collating element; the class matches any character that is a member of the same primary equivalence class as the collating element [.tagname.]. An equivalence class is a set of characters that collate the same; a primary equivalence class is a set of characters whose primary sort keys are all the same. (For example, strings are typically collated by character, then by accent, and then by case; the primary sort key then relates to the character, the secondary to the accentuation, and the tertiary to the case.) If there is no equivalence class corresponding to tagname, then [=tagname=] is exactly the same as [.tagname.]. Unfortunately, there is no locale-independent method of obtaining the primary sort key for a character, except under Win32. For other operating systems the library will "guess" the primary sort key from the full sort key (obtained from strxfrm), so equivalence classes are probably best considered broken under any operating system other than Win32.
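
The idea of a primary sort key can be approximated in Python by discarding the accent and case of a character and keeping only the base letter. The sketch below is an assumption-laden model, not how the library or strxfrm computes keys: primary_key and in_equivalence_class are hypothetical helpers built on Unicode decomposition.

```python
import unicodedata

def primary_key(ch: str) -> str:
    """Approximate primary key: drop accents (combining marks), then fold case."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return base.casefold()

def in_equivalence_class(ch: str, tag: str) -> bool:
    """Model of [=tag=]: ch matches if it shares tag's approximate primary key."""
    return primary_key(ch) == primary_key(tag)

# "\u00c1" is A with acute accent, "\u00e2" is a with circumflex.
for ch in ["a", "A", "\u00c1", "\u00e2", "b"]:
    print(repr(ch), in_equivalence_class(ch, "a"))
```

In this model, the accented and upper case variants of "a" all share its class, while "b" does not.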

To include a literal "-" in a set declaration, do one of the following: make it the first character after the opening "[" or "[^"; make it the endpoint of a range; write it as a collating element; or, if the flag regbase::escape_in_lists is set, precede it with an escape character, as in "[\-]". To include a literal "[", "]", or "^" in a set, make it the endpoint of a range, write it as a collating element, or precede it with an escape character if the flag regbase::escape_in_lists is set.
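
Two of these techniques, placing "-" first and escaping a special character, also work in Python's re module, which is used below as an assumed stand-in. The range-endpoint and collating-element forms are not shown because Python does not support them in the same way. The constant names are hypothetical.

```python
import re

# Assumption: Python's re module stands in for the library described here.
HYPHEN_FIRST = r"[-+]"         # "-" directly after "[" is a literal
HYPHEN_ESCAPED = r"[a\-z]"     # escaped "-" is a literal, so this is not a range
SPECIALS_ESCAPED = r"[\[\]^]"  # escaped brackets, plus "^" away from the start

print(re.findall(HYPHEN_FIRST, "a-b+c"))     # ['-', '+']
print(re.findall(HYPHEN_ESCAPED, "a-m-z"))   # ['a', '-', '-', 'z']
print(re.findall(SPECIALS_ESCAPED, "[x]^"))  # ['[', ']', '^']
```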