About Unicode normalization

Unicode normalization is all about comparing strings for equality.

But I know how to compare two strings

But I already know how to compare two Unicode strings - I call STRING_.same_string or STRING_.same_case_insensitive, so why do I need normalization?

The problem with STRING_.same_string is that it compares two strings code-point by code-point, and assumes that if two code-points are the same, then the abstract characters they represent are the same. Well, that's true enough. But it also assumes that if two code-points are not the same, then the abstract characters they represent are not the same. This seems a reasonable assumption (ignoring surrogate code-points, which we can do because we don't have a UC_UTF16_STRING yet - when we do, STRING_.same_string will have to take them into account when comparing a UC_UTF16_STRING with a UC_UTF8_STRING), but it does not hold.

The reason this assumption fails is historical - Unicode was designed for round-trip compatibility with a number of legacy encodings, such as ISO-8859-1 (Latin-1). Latin-1 has a number of accented characters. One example is LATIN CAPITAL LETTER A WITH GRAVE, to give it its Unicode name, which has a code-point of 192.

Other encodings instead let you compose accented characters out of a base character plus a combining character - in this case LATIN CAPITAL LETTER A at code-point 65 followed by COMBINING GRAVE ACCENT at code-point 768 - and Unicode supports this mechanism too. Now should these two possibilities within Unicode be considered the same abstract character or not? In most cases you will want to consider them the same, but STRING_.same_string will consider them to be different, and so return False even if the strings differ in that one respect only.
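You can see the problem with any Unicode-aware tool. The following sketch - in Python, used here purely for illustration; the Gobo library is not involved - builds both representations and compares them code-point by code-point, just as STRING_.same_string would:

    # Two representations of the same abstract character:
    composed = "\u00C0"      # LATIN CAPITAL LETTER A WITH GRAVE (code-point 192)
    decomposed = "A\u0300"   # LATIN CAPITAL LETTER A (65) + COMBINING GRAVE ACCENT (768)

    print(composed, decomposed)    # both display as an A with a grave accent
    print(composed == decomposed)  # False: one code-point versus two

Both strings display identically, yet the code-point-by-code-point comparison reports them as different.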

There are other characters where the situation is not so clear. What about the character MATHEMATICAL BOLD CAPITAL A (code-point 119808)? Is this the same abstract character as LATIN CAPITAL LETTER A with just a slight presentational difference, or are they two distinct characters? This rather depends upon your application.

Canonical and Compatibility compositions and decompositions

Unicode answers these questions by giving you a choice in how they are answered. The basic idea is that you convert your strings so that the same kind of representation is used for characters throughout, and then you can perform a binary comparison (with STRING_.same_string, for instance). This process is called normalization, and you have a choice of four different ways of performing it, depending upon the requirements of your application.
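For instance, here is the pair of strings from above again, this time converted to the same form before the binary comparison. The sketch uses Python's unicodedata module to stand in for whatever normalizer your application has available:

    import unicodedata

    composed = "\u00C0"      # accented A as a single code-point
    decomposed = "A\u0300"   # accented A as base letter plus combining accent

    # Convert both strings to the same normalization form, then compare binarily:
    nfc_a = unicodedata.normalize("NFC", composed)
    nfc_b = unicodedata.normalize("NFC", decomposed)
    print(nfc_a == nfc_b)    # True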

The basic choice you have to make is whether to represent your characters using composed forms (such as LATIN CAPITAL LETTER A WITH GRAVE), or decomposed forms (such as LATIN CAPITAL LETTER A followed by COMBINING GRAVE ACCENT).

The other choice you have to make is whether minor presentational variations, such as MATHEMATICAL BOLD CAPITAL A versus LATIN CAPITAL LETTER A, are significant or not. Here Unicode has a bias - such distinctions are assumed meaningful by default. Decompositions of this kind (MATHEMATICAL BOLD CAPITAL A decomposes to LATIN CAPITAL LETTER A) are labelled compatibility decompositions, whereas decompositions of the first kind (such as LATIN CAPITAL LETTER A WITH GRAVE decomposing to LATIN CAPITAL LETTER A followed by COMBINING GRAVE ACCENT) are known as canonical decompositions. Note that all compositions are canonical - you cannot reverse a compatibility decomposition.
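The one-way nature of compatibility decompositions is easy to observe. In this sketch (Python again, purely illustrative), a canonical normalization leaves MATHEMATICAL BOLD CAPITAL A alone, a compatibility normalization folds it to a plain letter, and no normalization brings it back:

    import unicodedata

    bold_a = "\U0001D400"    # MATHEMATICAL BOLD CAPITAL A (code-point 119808)

    # Canonical decomposition does not touch a compatibility character:
    print(unicodedata.normalize("NFD", bold_a) == bold_a)   # True
    # Compatibility decomposition folds it to the plain letter:
    print(unicodedata.normalize("NFKD", bold_a))            # 'A'
    # And the fold is irreversible - composing again yields only a plain 'A':
    print(unicodedata.normalize("NFC", "A"))                # 'A', not bold A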

The four normalization forms

Accordingly, there are four normalization forms defined in Unicode (the W3C's Character Model defines additional forms - see Character Model for the World Wide Web 1.0: Normalization). Each form is described below; a sketch contrasting all four follows the list.

NFD
Normal Form Decomposition.

This is obtained by replacing all composed characters by their canonical decompositions.

NFKD
Normal Form Kompatibility Decomposition.

This is obtained by replacing all composed characters by their decompositions, whether they are canonical or compatibility decompositions.

NFC
Normal Form Composition.

This is obtained by replacing all composed characters by their canonical decompositions, and then in turn replacing all decomposed sequences by their canonical compositions.

NFKC
Normal Form Kompatibility Composition.

This is obtained by replacing all composed characters by their decompositions (whether canonical or compatibility), and then in turn replacing all decomposed sequences by their canonical compositions.
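Here is the promised contrast of the four forms - a Python sketch, where the form name is passed directly to unicodedata.normalize - applied to a string containing one character of each kind:

    import unicodedata

    s = "\u00C0\U0001D400"   # accented A (composed) + MATHEMATICAL BOLD CAPITAL A

    for form in ("NFD", "NFKD", "NFC", "NFKC"):
        print(form, [hex(ord(c)) for c in unicodedata.normalize(form, s)])

    # NFD  ['0x41', '0x300', '0x1d400']  accented A decomposed, bold A kept
    # NFKD ['0x41', '0x300', '0x41']     accented A decomposed, bold A folded
    # NFC  ['0xc0', '0x1d400']           accented A recomposed, bold A kept
    # NFKC ['0xc0', '0x41']              accented A recomposed, bold A folded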

The NFD and NFKD forms of a string tend to be longer than the NFC and NFKC forms. Note that a pure ASCII string is unchanged by any of these transformations; not so for a Latin-1 string, however.
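Both points are easy to check - a sketch, assuming the same Python setup as above:

    import unicodedata

    ascii_s = "Hello, world!"
    latin1_s = "\u00C0 bient\u00F4t"   # accented French text, pure Latin-1

    for form in ("NFD", "NFKD", "NFC", "NFKC"):
        # A pure ASCII string is invariant under every form:
        assert unicodedata.normalize(form, ascii_s) == ascii_s

    print(len(latin1_s))                                 # 9 code-points
    print(len(unicodedata.normalize("NFD", latin1_s)))   # 11 - each accent split off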

Since compatibility decompositions tend to equate to presentational differences, these are most naturally useful if you wish to do a case-insensitive comparison (since case is fundamentally a presentational difference itself). Note, however, that simply converting to NFKC or NFKD does not fold case differences - you have to further apply case folding to the resulting strings (see TODO).
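As a sketch of the idea - the helper name caseless_eq is made up for this example, and the Unicode standard's full caseless-matching algorithm interleaves normalization and case folding more carefully than this:

    import unicodedata

    def caseless_eq(a: str, b: str) -> bool:
        # Fold compatibility/presentational differences, then fold case.
        def fold(s: str) -> str:
            return unicodedata.normalize("NFKD", s).casefold()
        return fold(a) == fold(b)

    print(caseless_eq("Stra\u00DFe", "STRASSE"))  # True: sharp s case-folds to "ss"
    print(caseless_eq("\u00C0", "a\u0300"))       # True: composed versus decomposed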


Copyright © 2005, Colin Adams
colin-adams@users.sourceforge.net
http://www.gobosoft.com
Last Updated: 4 November 2005