Saturday, May 02, 2009

"Special" characters - part IV

By now, special characters should not be special. The whole point is that these characters have code points that have different shapes in different character sets. Once you realize that, it's simple. Make sure you use the same character set all the way.
If you can't do that, make sure you understand what to use, and how it's done. Reread previous parts.
Is there a silver bullet? A cure against all weird characters? No, there's not. No off-the-shelf menu to be found here. I do not know your system; you do. I can only depict a pretty standard situation I know about, and I can reproduce.

So, what about a pretty standard situation? How's this:
  • MS Windows client
  • Unix (well, Linux) based database, that has
  • a web interface, which is browsed from
  • a MS Windows client.

The Client


For the client install, I'll use a clean laptop; clean meaning: no previous Oracle installs. In this case, also meaning Vista wiped off, XP on again. So, all in all, pretty clean.
Of course, the (standard) install will use WE8MSWIN1252 as coding set.

The Database.


The database instance will be on Linux (CentOS to be precise). I'll use a 10G Release 2, as it is predominant in Europe. The (standard) install will use WE8ISO8859P1, which I changed to P15

The Web.


The web can be deleted here. The web interface on the database will encode the contents as UTF-8, as usual. Whether this is correct remains to be seen. There's an entry on displaying special characters on the web, too.

Preparations:


Install CentOS Server V4.7, and update.
Install missing Oracle stuff:

yum install gcc gcc-c++ openmotif21 sysstat gnome-libs libstdc++-devel compat-libstdc++-33 compat-db-4 libaio

If you don't, you will get this:

Exception in thread "main" java.lang.UnsatisfiedLinkError: /tmp/OraInstall2009-09-30_10-39-42AM/jre/1.4.2/lib/i386/libawt.so: libXp.so.6: cannot open shared object file: No such file or directory
as well as linking errors in ins_rdbms.mk.

Database


The ORCL instance was created using the following code:

connect "SYS"/"&&sysPassword" as SYSDBA
set echo on
spool /oracle/db10/admin/orcl/scripts/CreateDB.log
startup nomount pfile="/oracle/db10/admin/orcl/scripts/init.ora";
CREATE DATABASE "orcl"
MAXINSTANCES 8
MAXLOGHISTORY 1
MAXLOGFILES 16
MAXLOGMEMBERS 3
MAXDATAFILES 100
DATAFILE '/oracle/data/orcl/orcl/system01.dbf' SIZE 300M REUSE AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED
EXTENT MANAGEMENT LOCAL
SYSAUX DATAFILE '/oracle/data/orcl/orcl/sysaux01.dbf' SIZE 120M REUSE AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED
SMALLFILE DEFAULT TEMPORARY TABLESPACE TEMP TEMPFILE '/oracle/data/orcl/orcl/temp01.dbf' SIZE 20M REUSE AUTOEXTEND ON NEXT 640K MAXSIZE UNLIMITED
SMALLFILE UNDO TABLESPACE "UNDOTBS1" DATAFILE '/oracle/data/orcl/orcl/undotbs01.dbf' SIZE 200M REUSE AUTOEXTEND ON NEXT 5120K MAXSIZE UNLIMITED
CHARACTER SET WE8ISO8859P15
NATIONAL CHARACTER SET AL16UTF16
LOGFILE GROUP 1 ('/oracle/data/orcl/orcl/redo01.log') SIZE 51200K,
GROUP 2 ('/oracle/data/orcl/orcl/redo02.log') SIZE 51200K,
GROUP 3 ('/oracle/data/orcl/orcl/redo03.log') SIZE 51200K
USER SYS IDENTIFIED BY "&&sysPassword" USER SYSTEM IDENTIFIED BY "&&systemPassword";
spool off
Mind the difference with the default, the DBCA would choose: the characterset is WE8ISO8859P15 in order to support the €-symbol.
The environment is set whenever I login on the machine, by .bash_profile being executed. It has the following settings:

export ORACLE_BASE=/oracle/db10
export ORACLE_HOME=$ORACLE_BASE/10GRel2
export PATH=$ORACLE_HOME/bin:$PATH
export ORACLE_SID=orcl
export NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P15
export DISPLAY=192.168.1.100:0.0
echo "Oracle settings done"
Here, too, the €-symbol supporting characterset is used. Of course, you will only actually see it when your terminal program supports it. I hope that is clear now, else see part I.

Let's do the basic test, scott is created again:

SQL> create table blah (A varchar2(10) );
Table created.

SQL> insert into blah values ('€');
1 row created.

SQL> insert into blah values('Æ');
1 row created.

SQL> insert into blah values('ß');
1 row created.

SQL> col dump(a,16) format a20
SQL> select a, dump(a,16) from blah;
A DUMP(A,16)
---------- --------------------
€ Typ=1 Len=1: 80
Æ Typ=1 Len=1: c6
ß Typ=1 Len=1: df

That works. Kind of. First thing to notice, is that the code point for the € is now 0x80, where it used to be 0xa4 (or 164 decimal). Now for the MS Windows client... It should not present a problem, as MSWIN1252 is a superset of ISO8859P15. A superset is where all code points are mapped to the same characters, plus extra. In other words: all characters, seen with ISO8859P15 can still be seen with WIN1252, and then some (that ISO8859P15 does not have).

SQL> select * from blah;

A
----------


OK, that looks weird, as the client maps the characters, dictated by the code points to other characters then originally. Let's change the client, so that it can map the codepoints.

C:\>chcp 1252
Active code page: 1252

C:\>sqlplus scott/tiger@orcl
SQL*Plus: Release 10.2.0.1.0 - Production on Wed Sep 30 16:11:22 2009
Copyright (c) 1982, 2005, Oracle. All rights reserved.

Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - Production
With the Partitioning, Data Mining and Real Application Testing options

SQL> select * from blah;

A
----------
¿
Æ
ß
OK - a lot better, but still no €. And I can get this CLI session to display a € sign by using ALT-0128 on the numeric keypad. Heck, I can even insert the € correctly, why can't this stupid DBA?!?
SQL> insert into blah values ('€');
1 row created.

SQL> select A, dump(A,16) from blah;

A DUMP(A,16)
---------- ---------------
€ Typ=1 Len=1: a4
¿ Typ=1 Len=1: 80
Æ Typ=1 Len=1: c6
ß Typ=1 Len=1: df

SQL> commit;
See? First entry: the €!

Enter the War of the Worlds.


OK - what went wrong?!? The DBA inserted the € correctly. It could be retrieved, and -from a MicroSoft point of view- has the correct code point: 0x80, or 128 decimal (remember you can display the € on a MS Windows machine by holding the ALT button, and typing 0128 on a numeric keypad? Coincidence?)
The client cannot see the € correctly - instead it has the dreaded upside-down question mark. But it actually can insert a € sign, and retrieve it...

What is going on here? The client uses a default of AMERICAN_AMERICA.WE8MSWIN1252 for NLS_LANG (set in the registry during the client install). The DBA used the database character set for NLS_LANG: WE8ISO8859P15. Also, he copied the € symbol off a web page. His workstation? A MicroSoft Windows machine...
So: DBA has € in MS Win code page 437. That would be 0128, or 0x80. Copies that into a SQL*Plus session with NLS_LANG equal to DB Characterset. Because of that, not code point re-mapping takes place. 0x80 is stored as the €. Upon retrieval: same route other way around: € displays correctly.
Client however, retrieves 0x80 in WIN1252. The database delivers this in ISO8859P15, so the network substrate tries to remap 0x80. However, there is no mapping from 0x80 in ISO8859P15 to MSWIN1252 - hence the inverted question mark.
As the client inserts the €, there is a code point conversion from 0x80 to the corresponding currency sign in 8859P15: the € as 0xA4. Client can retrieve that, as the same logic applies as with the DBA session: same route other way around, and the character is displayed correctly.

Now, homework: explain what the DBA sees when the customer starts complaining, and why.
To be cont'd...

1 comment:

Frank said...

If you cannot see the € in IE, try Firefox; I use FF all the time, and it displays ok. IE (well, V6...) did not