Wednesday, April 22, 2009

"Special" Characters - part III

Now that you know how Oracle databases store characters, as seen in part I and part II, it may be time to get realistic. After all, no one wants to loose information in a dramatic way as seen in part II.
One thing I forgot earlier: the euro symbol...
[frank@cs-frank03 ~]$ su - ora10gr2
Password:
[ora10gr2@cs-frank03 ~]$ . oraenv
ORACLE_SID = [ora10gr2] ? us7ascii
[ora10gr2@cs-frank03 ~]$ sqlplus /nolog
SQL*Plus: Release 10.2.0.3.0 - Production on Wed Apr 22 10:04:56 2009
Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

SQL> conn / as sysdba
Connected to an idle instance.
SQL> startup
ORACLE instance started.

Total System Global Area 289406976 bytes
Fixed Size 1261396 bytes
Variable Size 109052076 bytes
Database Buffers 176160768 bytes
Redo Buffers 2932736 bytes
Database mounted.
Database opened.
SQL> conn scott
Enter password:
Connected.
SQL> insert into blah values('€');
1 row created.

SQL> select dump(a) from blah;
DUMP(A)
--------------------------------------------------------------------------------
Typ=1 Len=1: 113
Typ=1 Len=1: 99
Typ=1 Len=1: 63
Typ=1 Len=1: 63
Typ=1 Len=1: 63
Typ=1 Len=3: 226,130,172

6 rows selected.

Now, ignore the first five entries - these are from the previous demonstrations. The last one is what it is all about, that is the Euro symbol! But how did it get to be three byte anyway?

The mystery of a single character stored as three bytes in US7ASCII

It is not. A mystery, that is. It is no rocket science at all. Here's the explanation:
  • The database and session are using the same NLS settings. Because of that, no conversion is done whatsoever. This is an important item to remember.
  • The terminal session supports UTF8
  • The decimal UTF8 representation of the Euro symbol is '226.130.172' - take a look here, search for 'euro'.
And that is exactly what is stored, and can be retrieved, and displayed as euro symbol. But only because the program I use (gnome-terminal) supports UTF8. Any other program, not understanding UTF8 will display something else!

Reality check

Enough playing around with US7ASCII. I do not think anyone in their right mind would build a US7ASCII database. Which makes you wonder why Oracle Corporation still uses it as the default? But let's move on.

Same way the US7ASCII database was built, now a WE8ISO8859P1 database is created. I have chosen WE8ISO8859P1 as it is the default for DBCA on West-European Unix and Linux.
For one of the clients, I will use the corresponding MS Windows code settings; WE8MSWIN1252. This will simulate the average workstation in this part of the world - predominantly Microsoft Windows.
The other client will simulate a DBA, logging in to the machine itself, and using the environment settings which come with the installation: NLS_LANG=American_America.WE8ISO8859P1; export NLS_LANG. Start seeing dark clouds yet? I do...
SQL> insert into blah values ('Æ');
1 row created.

SQL> insert into blah values('ß');
1 row created.

SQL> select a, dump(a,16) from blah;
A DUMP(A,16)
-------------------- --------------------
Æ Typ=1 Len=2: c3,86
ß Typ=1 Len=2: c3,9f

I know why that works - do you?
SQL> insert into blah values('Æ');
1 row created.

SQL> insert into blah values('ß');
1 row created.

SQL> select a, dump(a,16) from blah;
A DUMP(A,16)
-------------------- --------------------
Æ Typ=1 Len=2: c3,86
ß Typ=1 Len=2: c3,9f
Æ Typ=1 Len=1: c6
ß Typ=1 Len=1: df

HEY! What's that?
A dark cloud... Well, I cheated - something you cannot see from this output, is the fact I changed gnome-terminal to use ISO-8859-1, not UTF8. The last two inserts were done with ISO-8859-1. Now, that is a single byte character set, so the double bytes used by UTF8 (the first two entries) are no longer understood. Hence the "corrupt" characters when retrieving them. The ISO-8859-1 characters can be found here, and are correct from that standpoint. Remember, 0x0c6=198, 0x0df=223.

The case of the invalid Euro

I have had comments, that I was not storing the correct characters. My response to that was "define correct" (which is of course similar to one of the standard answers for any Oracle question: "it depends" ;) ). If I can store and retrieve characters, it must be correct, isn't it?
Let me insert a € symbol. This cannot be done while gnome-terminal is in ISO-8859-1 mode: ISO-8859-1 does not support the Euro symbol! So, switch to ISO-8859-15:
SQL> insert into blah values('€');
1 row created.

SQL> select a, dump(a) from blah;
A DUMP(A)
-------------------- --------------------
Æ Typ=1 Len=2: 195,134
ß Typ=1 Len=2: 195,159
Æ Typ=1 Len=1: 198
ß Typ=1 Len=1: 223
€ Typ=1 Len=1: 164

Switching back to ISO-8859-1, this is the result:
A DUMP(A)
-------------------- --------------------
Æ Typ=1 Len=2: 195,134
ß Typ=1 Len=2: 195,159
Æ Typ=1 Len=1: 198
ß Typ=1 Len=1: 223
¤ Typ=1 Len=1: 164

Nice, huh? When you lookup decimal 164 in the ISO-8859-1 reference, you will find it is defined as Currency sign. Pretty neat, I'd say to define the Euro symbol in the -15 extension on the code point for the currency sign.
But I also show things can go wrong. And it will get worse, when NLS settings come in!

Wrap up

So far, I have not even begun to fiddle around with NLS settings. All I did was change the capabilities of the front end. And I did not even touch web interfaces by far (which should of course use the correct HTML escaped code for a €: €)

Next and last part will be about NLS settings, as indicated above.

Not sure if and how your browser will display all these characters, here are two images:


2 comments:

Coskan Gundogar said...

Frank,

These serias helped me too much thank you for sharing. Will you be able to write the last part as you mentioned ?

Frank said...

Oops... in fact there's a draft of part 4, but I kinda' forgot what I was going to demonstrate.
Will look into it, time permitting.