Unicode normalization routines provided by the library

Which class to choose

You have the choice of using the latest version of Unicode that is supported by the library, or you can choose a specific version.

You also have the choice of inheriting the routines, or accessing them through a client object. The latter is the only way to use more than one version of Unicode within a single class, although this seems an unlikely requirement.

At the time I write this documentation, 4.1.0 is the latest version of Unicode, and it is the only version supported by this library. But 5.0.0 is currently in beta, so for the purposes of illustration I shall assume that 5.0.0 is now live and that this library supports it. You could then inherit from one of the following:

ST_UNICODE_NORMALIZATION_ROUTINES
This would give you direct access to routines for version 5.0.0 of Unicode, but when 5.1.0 was supported by the library, you would then get the 5.1.0 routines.
ST_UNICODE_V500_NORMALIZATION_ROUTINES
This would also give you direct access to routines for version 5.0.0 of Unicode, but this would not change, even when the library was updated to support 5.1.0.
ST_UNICODE_V410_NORMALIZATION_ROUTINES
This would give you direct access to routines for version 4.1.0 of Unicode, and this would not change, even when the library was updated.
ST_IMPORTED_UNICODE_NORMALIZATION_ROUTINES
This would give you indirect access to routines for version 5.0.0 of Unicode, via the feature normalization, but when 5.1.0 was supported by the library, you would then get indirect access to the 5.1.0 routines via the same feature.
ST_IMPORTED_UNICODE_V500_NORMALIZATION_ROUTINES
This would give you indirect access to routines for version 5.0.0 of Unicode, via normalization_v500, but this would not change, even when the library was updated to support 5.1.0.
ST_IMPORTED_UNICODE_V410_NORMALIZATION_ROUTINES
This would give you indirect access to routines for version 4.1.0 of Unicode, via normalization_v410.

To use routines from both 5.0.0 and 4.1.0 versions of Unicode, you would import both of the last two classes in the preceding list.
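
For example, a class that only ever needs the latest supported routines can simply inherit them, whereas a class needing both versions would inherit the two imported classes and call through the normalization_v500 and normalization_v410 features. The sketch below illustrates both styles; the 5.0.0 names are the hypothetical ones assumed above, and the exact signature of as_nfd (a STRING argument and result, described in the next section) is also assumed for illustration.

    class MY_NORMALIZER

    inherit

        ST_UNICODE_NORMALIZATION_ROUTINES

    feature -- Basic operations

        normalized_nfd (a_string: STRING): STRING is
                -- `a_string' in Normalization Form D, using whichever
                -- Unicode version the library currently supports
            require
                a_string_not_void: a_string /= Void
            do
                Result := as_nfd (a_string)
            ensure
                normalized_nfd_not_void: Result /= Void
            end

    end

    class MY_DUAL_VERSION_NORMALIZER

    inherit

            -- Hypothetical 5.0.0 class, as assumed above:
        ST_IMPORTED_UNICODE_V500_NORMALIZATION_ROUTINES

        ST_IMPORTED_UNICODE_V410_NORMALIZATION_ROUTINES

    feature -- Basic operations

        compare_versions (a_string: STRING) is
                -- Normalize `a_string' to NFD with both Unicode versions
                -- and report whether the results differ.
            require
                a_string_not_void: a_string /= Void
            local
                nfd_new, nfd_old: STRING
            do
                nfd_new := normalization_v500.as_nfd (a_string)
                nfd_old := normalization_v410.as_nfd (a_string)
                if not nfd_new.is_equal (nfd_old) then
                    io.put_string ("The two Unicode versions disagree.%N")
                end
            end

    end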

The routines provided by the library

All these classes provide access to the same set of routines, via inheritance from ST_UNICODE_NORMALIZATION_INTERFACE. Note that ASCII and Latin-1 STRINGs can be passed to all of these routines, as well as UC_UTF8_STRINGs, as they all operate on code-points accessed via the item_code feature. By design, Unicode assigns the same code points to Latin-1 characters as the ISO-8859-1 encoding does, so compatibility is preserved.
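
As a small illustration, the sketch below shows a plain STRING and a UC_UTF8_STRING reporting the same code point through item_code, and both being passed to the same normalization routine. It assumes the enclosing class inherits one of the normalization classes above, that as_nfd (described below) takes and returns a STRING, and that make_from_string is a creation procedure of UC_UTF8_STRING.

    show_code_points is
            -- Plain STRINGs and UC_UTF8_STRINGs both expose Unicode
            -- code points through `item_code', so either may be passed
            -- to the normalization routines.
            -- (Assumes the enclosing class inherits one of the
            -- normalization classes above; `make_from_string' is
            -- assumed to be a creation procedure of UC_UTF8_STRING.)
        local
            l_plain: STRING
            l_utf8: UC_UTF8_STRING
        do
            l_plain := "abc"
            create l_utf8.make_from_string ("abc")
                -- Both report 97, the code point of 'a':
            io.put_integer (l_plain.item_code (1))
            io.put_new_line
            io.put_integer (l_utf8.item_code (1))
            io.put_new_line
                -- Both may be handed to the same routine:
            io.put_string (as_nfd (l_plain))
            io.put_new_line
            io.put_string (as_nfd (l_utf8))
            io.put_new_line
        end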

The following routines are available to check if a STRING is in the desired normalization form:

The following routines do not assume anything about the current normalization state of their argument. If the argument is already in the desired normalization form, then the original object will be returned. Otherwise a new object will be allocated and returned:

The following routines require that their argument is not already in the desired normalization form, and always allocate and return a new object:

So when you desire to work with a decomposed form:

  1. If you know that your string is in the desired form, do nothing
  2. If you know that your string is not in the desired form, call to_nfd or to_nfkd
  3. Otherwise, call as_nfd or as_nfkd. You can then test for object identity if you wish to know whether normalization was actually performed (see the sketch below).
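
A sketch of the third case (again assuming the enclosing class inherits one of the normalization classes; in Eiffel, `=' on reference types compares object identity):

    report_nfd (a_string: STRING) is
            -- Put `a_string' into NFD and report whether any
            -- normalization work was actually performed.
        require
            a_string_not_void: a_string /= Void
        local
            l_nfd: STRING
        do
            l_nfd := as_nfd (a_string)
            if l_nfd = a_string then
                    -- The original object came back:
                    -- `a_string' was already in NFD.
                io.put_string ("Already in NFD.%N")
            else
                    -- A new object was allocated:
                    -- normalization was actually performed.
                io.put_string ("Normalization was performed.%N")
            end
        end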

The composing routines always allocate and return a new object, irrespective of the status of the argument:
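
One consequence is that the object-identity test shown above tells you nothing when composing. The sketch below assumes the composing routines follow the same naming pattern as the decomposing ones (to_nfc by analogy with to_nfd); their exact names are not listed in this section.

    composed (a_string: STRING): STRING is
            -- `a_string' in NFC.
            -- (`to_nfc' is an assumed name, by analogy with `to_nfd'.)
        require
            a_string_not_void: a_string /= Void
        do
            Result := to_nfc (a_string)
                -- `Result' is always a freshly allocated object, even if
                -- `a_string' was already in NFC, so a `Result = a_string'
                -- test says nothing about the original form of the input.
        ensure
            composed_not_void: Result /= Void
        end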

The Unicode Character Database tabulates a lot of properties for each code-point. Some of these are relevant to the normalization process. Three of them might, in some circumstances, be of interest to an application: the canonical combining class, the decomposition mapping, and the decomposition type.

The canonical combining class determines how different parts of a character are overlaid when displayed.

The decomposition mapping lists the code-points in the decomposition of a code-point. The decomposition type indicates whether a code-point has a canonical decomposition or, if not, what type of compatibility decomposition it has. The results of the decomposition type routine are available as symbolic constants in UC_UNICODE_CONSTANTS. Note that if a code-point has no decomposition at all, the decomposition type routine still reports Canonical_decomposition_mapping, so you must check for a non-Void result from decomposition_mapping_property to confirm that a code-point actually has a canonical decomposition.
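
A sketch of that check is shown below. decomposition_mapping_property is the feature named above; decomposition_type_property is an assumed name for the decomposition type query, and the enclosing class is assumed to inherit UC_UNICODE_CONSTANTS (or otherwise have access to its constants) as well as one of the normalization classes.

    has_canonical_decomposition (a_code: INTEGER): BOOLEAN is
            -- Does code point `a_code' have a canonical decomposition?
            -- The decomposition type alone is not sufficient, because
            -- code points with no decomposition at all also report
            -- `Canonical_decomposition_mapping'.
            -- (`decomposition_type_property' is an assumed name; assumes
            -- the enclosing class inherits a normalization class and
            -- UC_UNICODE_CONSTANTS.)
        do
            Result := decomposition_mapping_property (a_code) /= Void
                and then decomposition_type_property (a_code) = Canonical_decomposition_mapping
        end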


Copyright © 2005, Colin Adams
mailto:colin-adams@users.sourceforge.net
http://www.gobosoft.com
Last Updated: 4 November 2005