Sunday, April 19, 2009

"Special" Characters - part II

One of the prerequisites for doing what I demonstrated in the previous entry is that you use tools that can display special characters.
As obvious as it sounds, you can only display diacritics or a euro sign when the tool actually knows how to render them. If your tooling cannot, you will see some strange things.

[screenshot: the same select statements shown with the terminal set to UTF-8, Windows-1252 and ISO 8859-1]

This screenshot is not as strange as it looks. The terminal session allows for various character encodings; it uses UTF-8 by default. That is why the characters are displayed correctly in the first select statement. The second one uses Western (Windows-1252), the third Western (ISO 8859-1).
I will now add the € symbol, using ISO-8859-15:
[screenshot: the euro sign entered under ISO 8859-15, then viewed with the terminal back on UTF-8]
The reason it does not display in the second select statement is that the terminal session has switched back to UTF-8.
Now, let's have a look at how Oracle treats all this:
SQL> select dump(a) from blah;

DUMP(A)
--------------------------------------------------------------------------------
Typ=1 Len=1: 113
Typ=1 Len=2: 195,167
Typ=1 Len=1: 231
Typ=1 Len=11: 195,134,32,195,166,32,197,146,32,197,147
Typ=1 Len=14: 195,133,32,195,152,32,195,159,32,195,145,32,195,177
Typ=1 Len=1: 164

First line, a standard lower-case "q": the value is 0x71, or decimal 113. Now how in the world does the value one hundred and thirteen get translated into some flagpole with a circle attached, which many people will recognize as the character "q"?

Code Points and Character Sets


That is all about "code points". There is a standing agreement that in computer environments, decimal 113 is represented on screen as the letter "q". There is even a standard for this, called ASCII, short for American Standard Code for Information Interchange.
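You can ask Oracle about that agreement directly. A quick sketch, assuming an ASCII-based database character set (such as US7ASCII or WE8ISO8859P1):

-- ASCII() returns the code point of a character in the database character set,
-- CHR() goes the other way around.
select ascii('q') from dual;
-- should return 113
select chr(113) from dual;
-- should return q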
Of course, not all the world uses this standard, so it is possible that the value 113 is represented by another sign in another environment. And some characters are simply too new to be in it, like the euro sign, which did not exist 30 years ago.
But even in the "good old days" it was possible to represent the symbol "q" with something else, for example when you used the Extended Binary Coded Decimal Interchange Code, or EBCDIC (pronounced 'eb-see-dick'). In EBCDIC, "q" is 0x98, or 152 decimal.
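Oracle can show the EBCDIC side of the story as well. A sketch using the CONVERT function (WE8EBCDIC500 is just one of Oracle's EBCDIC character sets; any EBCDIC variant will do for the letter "q"):

-- re-encode the letter 'q' from an ASCII-based character set into EBCDIC
-- and dump the resulting byte
select dump(convert('q', 'WE8EBCDIC500', 'WE8ISO8859P1')) from dual;
-- the single byte should come back as 152 (0x98), the EBCDIC code point for 'q'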
The second line looks odd: 195,167 (0xC3, 0xA7) for a c-cedilla. Let's not forget the terminal session uses UTF-8 character encoding by default, and that is what I used to insert all these characters in the first place.
UTF-8 is yet another standard by which signs on screens are connected to (computer) codes. If you want to know more about ASCII (in particular US7ASCII), ISO and UTF character encodings, search engines are your friend. Basically, the ASCII coding scheme only used 7 bits, allowing for just 2^7 code points, or 128 characters. The 8th bit was originally used for parity checking. As the first 32 codes were reserved for communication control characters, that did not allow for much more than the alphabet in upper and lower case (2*26 = 52 code points), the digits (0-9, 10 additional code points) and the other characters found on every typewriter (yet another 32 code points). That makes 126 code points; add the space and the delete character, and all your bits are used.
The PC industry skipped the parity check and extended the character set with another 128 characters. This extension actually became known as the Extended ASCII character set (as opposed to the original, 7-bit US7ASCII). As every manufacturer had its own preferences, about every PC manufacturer assigned its own code points to that upper half; some used it for primitive graphics, others for locale-specific adaptations.
The ISO 8859 standards define 2^8 code points, and a sub-standard exists for about every locale: ISO 8859-1 is known as ISO Latin-1, or Western European, ISO 8859-5 as the Cyrillic character set, ISO 8859-8 as the Hebrew one, ISO 8859-14 as the Celtic one, etc., etc.
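That is exactly where the confusion starts: the same single byte means a different character in every ISO 8859 variant. A sketch (CL8ISO8859P5 is Oracle's name for ISO 8859-5, the Cyrillic one):

-- take the single byte 231 (0xE7) and re-encode it into UTF-8,
-- once interpreting it as ISO 8859-1 and once as ISO 8859-5
select dump(convert(chr(231), 'AL32UTF8', 'WE8ISO8859P1')) from dual;
-- as Latin-1, byte 231 is the c-cedilla, which becomes 195,167 in UTF-8
select dump(convert(chr(231), 'AL32UTF8', 'CL8ISO8859P5')) from dual;
-- as Cyrillic, the very same byte is a different letter, and different UTF-8 bytes come back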

What actually is stored...


are code points. Not characters!
Now, how does the c-cedilla get translated to 0xC3, 0xA7? Simple: I entered a c-cedilla in a terminal session that supported UTF-8. The c-cedilla has the Unicode code point U+00E7, which UTF-8 encodes as the byte sequence "c3 a7". You can check the rest of the output against a conversion table, an example is here. Another fine example, with the full one-byte UTF-8 code table, is here.
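The step from U+00E7 to those two bytes is just bit shuffling: 0xE7 is 1110 0111 in binary, which does not fit UTF-8's single-byte form (that only covers the 7-bit ASCII range), so it goes into the two-byte form 110xxxxx 10xxxxxx. Filling in the 11 bits 000 1110 0111 gives 1100 0011 1010 0111, i.e. 0xC3 0xA7 - or 195,167 decimal. A sketch to verify this with DUMP's hexadecimal output format:

-- DUMP(expr, 16) shows the bytes in hex instead of decimal
select dump(convert(chr(231), 'AL32UTF8', 'WE8ISO8859P1'), 16) from dual;
-- should read: c3,a7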
Now, the SQL*Plus session uses US7ASCII (I specified that by setting NLS_LANG). The database uses US7ASCII. Conclusion: Oracle needs to convert nothing! It is told it gets US7ASCII from the session, and it stores US7ASCII - done! That is also why the raw, multi-byte UTF-8 sequences ended up in the table unchanged.
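If you want to double-check what the database side uses, a sketch (note that the character-set part of NLS_LANG itself is a pure client setting, so you will not find it in any database view - for that you have to look at the client environment):

-- the character set the database stores CHAR/VARCHAR2 data in
select value from nls_database_parameters where parameter = 'NLS_CHARACTERSET';
-- and the one used for NCHAR/NVARCHAR2 data
select value from nls_database_parameters where parameter = 'NLS_NCHAR_CHARACTERSET';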
Now - the last entry, the euro sign (164, or 0xA4), was entered while using ISO 8859-15 as the terminal setting. It comes back in UTF-8 as a question mark. Even though U+00A4 is defined as the (generic) currency symbol, a lone 0xA4 byte is not a valid UTF-8 sequence, so the UTF-8 terminal has nothing sensible to show for it.
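The byte values tell the whole story. A sketch comparing what the euro sign looks like in ISO 8859-15 and in UTF-8 (WE8ISO8859P15 and AL32UTF8 in Oracle terms):

-- byte 164 (0xA4) interpreted as ISO 8859-15 is the euro sign;
-- re-encoded into UTF-8 it needs three bytes
select dump(convert(chr(164), 'AL32UTF8', 'WE8ISO8859P15')) from dual;
-- should come back as 226,130,172 (0xE2 0x82 0xAC), the UTF-8 encoding of U+20AC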

Where Oracle comes in.

OK - I now changed the environment variable NLS_LANG to American_America.UTF8. The terminal supports UTF8, too. Now take a look at this:

SQL> truncate table blah;
Table truncated.

SQL> insert into blah values('q');
1 row created.

SQL> insert into blah values('ç');
1 row created.

SQL> select * from blah;
A
------------------------------------------------------------
q
c

SQL> select dump(a) from blah;
DUMP(A)
--------------------------------------------------------------------------------
Typ=1 Len=1: 113
Typ=1 Len=1: 99

Watch that last entry! It reads the character 'c', not 'ç'! The whole point here is that the network layer now actually performs a conversion - it gets UTF8, but has to map that into US7ASCII. US7ASCII has no code point for "ç", so it maps it to the closest thing it does have: "c". The same happens with the euro symbol:
SQL> insert into blah values('€');
1 row created.

SQL> select a,dump(a) from blah;
A DUMP(A)
---------- ------------------------------
q Typ=1 Len=1: 113
c Typ=1 Len=1: 99
? Typ=1 Len=1: 63

Ah - there it is! The infamous question mark! And the euro symbol has become decimal 63 (0x3F).

Mind you - now you have actually lost data! There is no way of knowing whether the stored 0x63 was originally a plain 'c' or a 'ç'. The same goes for the euro symbol: every character that could not be mapped is now just a question mark.
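You can watch the loss happen in a single statement. A sketch that pushes a c-cedilla through US7ASCII and back again with nested CONVERT calls (the exact replacement character depends on Oracle's conversion tables):

-- down-convert a Latin-1 c-cedilla (byte 231) to US7ASCII, then convert back:
-- the inner conversion has already thrown the accent away
select dump(convert(convert(chr(231), 'US7ASCII', 'WE8ISO8859P1'),
                    'WE8ISO8859P1', 'US7ASCII')) from dual;
-- the original 231 does not come back; all that is left is the replacement
-- (a plain 'c', decimal 99, just like the insert above ended up storing)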

Conclusion

  • Be careful with "special" characters - you may actually lose data!
  • You do not store characters in the database - you store code points. The actual character displayed depends on the character set in use.
  • Make sure all end-user components are capable of displaying "special" characters. If one shows the correct character and another does not, the latter uses the wrong character set. Choose a different one that maps the code point to the correct character (just as the other tool does).

4 comments:

Gareth said...

A couple of excellent posts - mandatory reading for anyone doing anything with multi-languages in Oracle!

Dik Pater said...

Frank,
very good explanation.
I also read
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:5783936214008
This is a great read about the many Oracle questions about character sets, it is so confusing...
One interesting link for you is machiatto.com .
Thanks for this article.
BTW what's the best characterset to use as a standard in my company and is it better to use the same characterset in all the databases?
Regards, Dik Pater.

Frank said...

@Gareth: thanks.
@Dik: your company would/should use WE8MSWIN1252 - see part III

Frank said...

Dik meant http://www.macchiato.com/