Unicode character classes

About character classes

The Unicode Character Database (UCD) defines a character class for every code-point in Unicode.

Within the UCD, each character class is written as a single upper-case Latin letter, denoting the major category, followed by a single lower-case Latin letter, denoting the minor category. For example, a code-point classified as an upper-case letter is given the character class Lu. The 'L' denotes "letter" as the major category, and the 'u' denotes "upper-case" as the minor category.

These character classes are translated into Eiffel code as INTEGER constants in UC_UNICODE_CONSTANTS. The names are structured such that they start with the minor category, are followed by the major category, and end with the word category. So we have Uppercase_letter_category, for instance.

The character class for a given code-point can be retrieved by means of character_class from ST_UNICODE_CHARACTER_CLASS_INTERFACE. This can be useful for defining your own classification routines. For instance, an is_alphanumeric routine might be defined in terms of character_class (the library doesn't provide one, as more than one plausible definition is possible).
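As an illustration, here is a minimal sketch of one possible is_alphanumeric. It assumes that character_class takes the code-point as an INTEGER and returns one of the INTEGER constants from UC_UNICODE_CONSTANTS, and that constants other than Uppercase_letter_category follow the naming scheme described above. The class name MY_CLASSIFIER and the choice of categories are hypothetical.

```eiffel
class MY_CLASSIFIER

inherit

	ST_UNICODE_CHARACTER_CLASS_ROUTINES

	UC_UNICODE_CONSTANTS

feature -- Status report

	is_alphanumeric (a_code: INTEGER): BOOLEAN
			-- Is `a_code' a letter or a decimal digit?
			-- (One plausible definition among several; the signature of
			-- `character_class' and the constant names other than
			-- `Uppercase_letter_category' are assumptions.)
		local
			l_class: INTEGER
		do
			l_class := character_class (a_code)
			Result := l_class = Uppercase_letter_category
				or l_class = Lowercase_letter_category
				or l_class = Titlecase_letter_category
				or l_class = Modifier_letter_category
				or l_class = Other_letter_category
				or l_class = Decimal_digit_number_category
		end

end
```

A different definition might also accept the letter-number and other-number categories, which is one reason the library leaves this choice to the caller.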

However, some code-points can usefully be considered to be in more than one category. The UCD deals with this situation by defining the principal category, and also defining various character properties. For example, the code-points in the range 24B6..24CF (CIRCLED LATIN CAPITAL LETTER A..CIRCLED LATIN CAPITAL LETTER Z) are classified as So (Other_symbol_category). But they also have the Other_Uppercase property. The library handles this by providing the routine is_upper_case, which includes all code-points in Uppercase_letter_category and all code-points with the Other_Uppercase property.
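The difference between the principal category and the property-based routine can be sketched as a query that holds for code-points such as those in 24B6..24CF. This is only a sketch: it assumes that is_upper_case and character_class both take the code-point as an INTEGER, and the class and feature names are hypothetical.

```eiffel
class UPPERCASE_PROPERTY_CHECK

inherit

	ST_UNICODE_CHARACTER_CLASS_ROUTINES

	UC_UNICODE_CONSTANTS

feature -- Status report

	is_upper_case_by_property_only (a_code: INTEGER): BOOLEAN
			-- Is `a_code' reported as upper-case by `is_upper_case'
			-- although its principal category is not Lu?
			-- (Argument types are assumed, not taken from the library.)
		do
			Result := is_upper_case (a_code)
				and character_class (a_code) /= Uppercase_letter_category
		end

end
```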

How to access the character class routines provided by the library

The character class routines provided by the library are defined in the deferred class ST_UNICODE_CHARACTER_CLASS_INTERFACE. To gain access to these routines, you can either inherit from ST_UNICODE_CHARACTER_CLASS_ROUTINES, which provides information from the latest available version of Unicode, or inherit from a class for a particular version of Unicode (e.g. ST_UNICODE_V410_CHARACTER_CLASS_ROUTINES for Unicode version 4.1.0).

Alternatively, you can access these routines as a client, by inheriting from ST_IMPORTED_UNICODE_CHARACTER_CLASS_ROUTINES or ST_IMPORTED_UNICODE_V410_CHARACTER_CLASS_ROUTINES, which provide the routines unicode_character_class and unicode_v410_character_class respectively.

The latter method is necessary if you want to be able to use routines from two different versions of Unicode within the same class.
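Client access to two versions from the same class might look like the following sketch. It assumes that unicode_character_class and unicode_v410_character_class each return an object offering character_class with an INTEGER argument. The class name and the comparison routine are hypothetical.

```eiffel
class VERSION_COMPARISON

inherit

	ST_IMPORTED_UNICODE_CHARACTER_CLASS_ROUTINES

	ST_IMPORTED_UNICODE_V410_CHARACTER_CLASS_ROUTINES

feature -- Status report

	class_differs_from_v410 (a_code: INTEGER): BOOLEAN
			-- Does the latest available Unicode version give `a_code'
			-- a different character class than Unicode 4.1.0?
			-- (The signature of `character_class' is an assumption.)
		do
			Result := unicode_character_class.character_class (a_code)
				/= unicode_v410_character_class.character_class (a_code)
		end

end
```

Because each version is reached through its own client routine, the two character_class features do not clash, which they would if both versioned routine classes were inherited directly.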