Friday, April 24, 2009

Whoo!

Little Darling, it feels like years since it's been here.
Here comes the sun.

Just click the header for the press release.

Thursday, April 23, 2009

Database agnostic and Hibernate

You know, there once was a time I genuinely loved Java. I thought it would be a great language with even better security. Then came V0.8, and we had to rewrite everything we'd done so far during the class...
Since then, I changed. Java stepped out of the sandbox, was even tangled with by Microsoft, and I've become a bitter old man. I've seen too many Java "programmers" that just select from table a, and from table b, sort the a results, sort the b results, join a and b and discard 99% of the result set, and then complain about the performance of the database.

This entry made my life happier; maybe it is not "just me" - read the April 22nd, 2009 entries.

Grin.
By the way, Java programmers: read up on ref_cursors...
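
For the uninitiated: a ref cursor lets the database do the joining and sorting, and hands the client a finished result set to fetch from. A minimal PL/SQL sketch - tables a and b and their columns id, payload and a_id are made up for the example:

create or replace function get_joined_rows return sys_refcursor
as
  rc sys_refcursor;
begin
  -- let the database join, filter and sort; the client only fetches the result
  open rc for
    select a.id, b.payload
      from a
      join b on b.a_id = a.id
     order by a.id;
  return rc;
end;
/

A Java client simply fetches from the returned cursor, instead of dragging both tables over the network and joining them in memory.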

Wednesday, April 22, 2009

"Special" Characters - part III

Now that you know how Oracle databases store characters, as seen in part I and part II, it may be time to get realistic. After all, no one wants to lose information in as dramatic a way as seen in part II.
One thing I forgot earlier: the euro symbol...
[frank@cs-frank03 ~]$ su - ora10gr2
Password:
[ora10gr2@cs-frank03 ~]$ . oraenv
ORACLE_SID = [ora10gr2] ? us7ascii
[ora10gr2@cs-frank03 ~]$ sqlplus /nolog
SQL*Plus: Release 10.2.0.3.0 - Production on Wed Apr 22 10:04:56 2009
Copyright (c) 1982, 2006, Oracle. All Rights Reserved.

SQL> conn / as sysdba
Connected to an idle instance.
SQL> startup
ORACLE instance started.

Total System Global Area 289406976 bytes
Fixed Size 1261396 bytes
Variable Size 109052076 bytes
Database Buffers 176160768 bytes
Redo Buffers 2932736 bytes
Database mounted.
Database opened.
SQL> conn scott
Enter password:
Connected.
SQL> insert into blah values('€');
1 row created.

SQL> select dump(a) from blah;
DUMP(A)
--------------------------------------------------------------------------------
Typ=1 Len=1: 113
Typ=1 Len=1: 99
Typ=1 Len=1: 63
Typ=1 Len=1: 63
Typ=1 Len=1: 63
Typ=1 Len=3: 226,130,172

6 rows selected.

Now, ignore the first five entries - these are from the previous demonstrations. The last one is what it is all about: the Euro symbol! But how did it get to be three bytes anyway?

The mystery of a single character stored as three bytes in US7ASCII

It is not. A mystery, that is. It is not rocket science at all. Here's the explanation:
  • The database and session are using the same NLS settings. Because of that, no conversion is done whatsoever. This is an important item to remember.
  • The terminal session supports UTF8.
  • The decimal UTF8 representation of the Euro symbol is '226,130,172' - take a look here, search for 'euro'.
And that is exactly what is stored, and can be retrieved, and displayed as a euro symbol. But only because the program I use (gnome-terminal) supports UTF8. Any other program, one that does not understand UTF8, will display something else!
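
If you want to double-check which Unicode character a column really holds, ASCIISTR and DUMP can help. A small sketch, assuming a database and client character set that can actually represent the euro (e.g. AL32UTF8 - in the US7ASCII demo database above, these would not give meaningful results):

SQL> -- ASCIISTR rewrites anything non-ASCII as a \XXXX Unicode escape: expect \20AC
SQL> select asciistr('€') from dual;
SQL> -- DUMP format 1016 = hexadecimal bytes plus the character set name
SQL> select dump('€', 1016) from dual;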

Reality check

Enough playing around with US7ASCII. I do not think anyone in their right mind would build a US7ASCII database, which makes you wonder why Oracle Corporation still uses it as the default. But let's move on.

In the same way the US7ASCII database was built, a WE8ISO8859P1 database is now created. I have chosen WE8ISO8859P1 as it is the default for DBCA on West-European Unix and Linux.
For one of the clients, I will use the corresponding MS Windows code page: WE8MSWIN1252. This simulates the average workstation in this part of the world - predominantly Microsoft Windows.
The other client will simulate a DBA, logging in to the machine itself and using the environment settings that come with the installation: NLS_LANG=American_America.WE8ISO8859P1; export NLS_LANG. Start seeing dark clouds yet? I do...
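
For reference, the two client environments boil down to roughly this (syntax as each operating system wants it):

# the DBA, in a shell on the Linux server itself:
export NLS_LANG=American_America.WE8ISO8859P1

rem the end user, at a Windows command prompt, before starting sqlplus:
set NLS_LANG=AMERICAN_AMERICA.WE8MSWIN1252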
SQL> insert into blah values ('Æ');
1 row created.

SQL> insert into blah values('ß');
1 row created.

SQL> select a, dump(a,16) from blah;
A DUMP(A,16)
-------------------- --------------------
Æ Typ=1 Len=2: c3,86
ß Typ=1 Len=2: c3,9f

I know why that works - do you?
SQL> insert into blah values('Æ');
1 row created.

SQL> insert into blah values('ß');
1 row created.

SQL> select a, dump(a,16) from blah;
A DUMP(A,16)
-------------------- --------------------
Æ Typ=1 Len=2: c3,86
ß Typ=1 Len=2: c3,9f
Æ Typ=1 Len=1: c6
ß Typ=1 Len=1: df

HEY! What's that?
A dark cloud... Well, I cheated - something you cannot see from this output is the fact that I changed gnome-terminal to use ISO-8859-1, not UTF8. The last two inserts were done with ISO-8859-1. Now, that is a single-byte character set, so the double bytes used by UTF8 (the first two entries) are no longer understood. Hence the "corrupt" characters when retrieving them. The ISO-8859-1 characters can be found here, and are correct from that standpoint. Remember, 0xC6 = 198, 0xDF = 223.
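
You do not even need a terminal trick to reproduce the byte difference; Oracle's CONVERT function recodes a character between character sets in a single statement. A sketch, assuming the database character set is WE8ISO8859P1 (as in this demo), so that CHR(198) yields the raw byte 0xC6:

SQL> -- the Æ as ISO-8859-1 stores it: expect Typ=1 Len=1: c6
SQL> select dump(chr(198), 16) from dual;
SQL> -- the same character recoded to UTF8: expect Typ=1 Len=2: c3,86
SQL> select dump(convert(chr(198), 'UTF8', 'WE8ISO8859P1'), 16) from dual;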

The case of the invalid Euro

I have had comments that I was not storing the correct characters. My response to that was "define correct" (which is of course similar to one of the standard answers to any Oracle question: "it depends" ;) ). If I can store and retrieve characters, it must be correct, mustn't it?
Let me insert a € symbol. This cannot be done while gnome-terminal is in ISO-8859-1 mode: ISO-8859-1 does not support the Euro symbol! So, switch to ISO-8859-15:
SQL> insert into blah values('€');
1 row created.

SQL> select a, dump(a) from blah;
A DUMP(A)
-------------------- --------------------
Æ Typ=1 Len=2: 195,134
ß Typ=1 Len=2: 195,159
Æ Typ=1 Len=1: 198
ß Typ=1 Len=1: 223
€ Typ=1 Len=1: 164

Switching back to ISO-8859-1, this is the result:
A DUMP(A)
-------------------- --------------------
Æ Typ=1 Len=2: 195,134
ß Typ=1 Len=2: 195,159
Æ Typ=1 Len=1: 198
ß Typ=1 Len=1: 223
¤ Typ=1 Len=1: 164

Nice, huh? When you look up decimal 164 in the ISO-8859-1 reference, you will find it is defined as the currency sign. Pretty neat, I'd say, to define the Euro symbol in the -15 revision on the code point of the currency sign.
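
CONVERT can also demonstrate the double life of code point 164: the same single byte decodes to two different characters, depending on which ISO variant you claim it came from. A sketch, again assuming a single-byte database character set so that CHR(164) is the raw byte 0xA4 (WE8ISO8859P15 is Oracle's name for ISO-8859-15):

SQL> -- 0xA4 read as ISO-8859-15 is the euro: expect the UTF8 bytes e2,82,ac
SQL> select dump(convert(chr(164), 'UTF8', 'WE8ISO8859P15'), 16) from dual;
SQL> -- 0xA4 read as ISO-8859-1 is the currency sign: expect c2,a4
SQL> select dump(convert(chr(164), 'UTF8', 'WE8ISO8859P1'), 16) from dual;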
But I also show things can go wrong. And it will get worse when NLS settings come in!

Wrap up

So far, I have not even begun to fiddle around with NLS settings. All I did was change the capabilities of the front end. And I did not even touch web interfaces (which should of course use the correct HTML escape code for a €: &euro;).

The next and last part will be about NLS settings, as indicated above.

Not sure if and how your browser will display all these characters; here are two images:

[two screenshots, showing the terminal output as rendered locally]

Sunday, April 19, 2009

"Special" Characters - part II

One of the prerequisites for what I demonstrated in the previous entry is that you must use tools that can display special characters.
As logical as that seems, you can only display diacritics or a Euro sign when the tool actually knows how to render them. If, in the end, your tooling cannot, you will see some strange things.

[screenshot: the same select statement rendered under three terminal encodings]

This screenshot is not as strange as it looks. The terminal session actually allows for various character encodings; it is UTF-8 by default. That is why the characters are actually displayed in the first select statement. The second one uses Western Windows 1252, the third Western ISO-8859-1.
I will now add the € symbol, using ISO-8859-15:

[screenshot: the insert succeeds, but the € does not display in a second select statement]

The reason it does not display in that second select statement is that the terminal session was switched back to UTF8.
Now, let's have a look at how Oracle treats all this:
SQL> select dump(a) from blah;

DUMP(A)
--------------------------------------------------------------------------------
Typ=1 Len=1: 113
Typ=1 Len=2: 195,167
Typ=1 Len=1: 231
Typ=1 Len=11: 195,134,32,195,166,32,197,146,32,197,147
Typ=1 Len=14: 195,133,32,195,152,32,195,159,32,195,145,32,195,177
Typ=1 Len=1: 164

First line, the standard lower case "q": value 0x71, or decimal 113. Now how in the world does the value one-hundred-and-thirteen get translated into some flagpole with a circle attached, which many people will recognize as the character "q"?

Code Points and Character Sets


That is all about "code points". There's a standing agreement that in computer environments, decimal 113 gets represented on screen as the letter q. There's even a standard, called ASCII, which is short for American Standard Code for Information Interchange.
Of course, not all the world uses this standard, so it is possible that the value 113 gets represented by another sign in another environment. Which is true for new characters, like the euro sign. After all, that did not exist 30 years ago.
But even in the "good old days" it was possible to represent the symbol "q" with something else, for example when you used the Extended Binary Coded Decimal Interchange Code, or EBCDIC (pronounced 'epsedick'). In EBCDIC, "q" would be 0x98, or 152 decimal.
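
You can let Oracle itself confirm that, by the way - CONVERT also speaks EBCDIC. A one-liner sketch (WE8EBCDIC500 is one of the EBCDIC character sets Oracle ships):

SQL> -- 'q' recoded from ASCII to EBCDIC: expect Typ=1 Len=1: 98
SQL> select dump(convert('q', 'WE8EBCDIC500', 'US7ASCII'), 16) from dual;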
The second line looks odd: 195,167 (0xC3, 0xA7) for a c-cedilla. Let's not forget the terminal session uses UTF8 character encoding by default. And that is what I used to insert all these characters in the first place.
UTF8 is yet another standard by which signs on screens are connected to (computer) codes. If you want to know more about ASCII (particularly US7ASCII), ISO and UTF character encodings, internet search engines are your friend. Basically, the ASCII coding scheme only used 7 bits, allowing for just 2^7 code points, or 128 characters. The 8th bit was originally used for parity checking. As the first 32 codes were used as communication control characters, that did not allow for much more than the alphabet in upper and lower case (2*26 = 52 code points), numbers (0-9 = 10 additional code points) and some other characters found on every typewriter (yet another 32 code points). In total, 126 code points; add a space and the DEL character, and all your bits are used.
The PC industry skipped the parity check, and extended the character set with another 128 characters. This character set actually became known by that name: the Extended ASCII character set (or US8ASCII, as opposed to the original US7ASCII). As every manufacturer had its own preferences, about every PC manufacturer had its own code points; some used them for primitive graphics, others for locale-specific adaptations.
The ISO 8859 standards define 2^8 code points, and sub-standards exist for about every locale: ISO 8859-1 is known as ISO Latin-1, or Western European, whereas ISO 8859-5 is the Cyrillic character set, ISO 8859-8 the Hebrew one, etc.

What actually is stored...

...are code points. Not characters!
Now, how does the c-cedilla get translated to 0xC3, 0xA7? Simple: I entered a c-cedilla in a terminal session that supports UTF-8. In UTF-8, the c-cedilla (code point U+00E7) is encoded as "c3 a7". You can check the rest of the output against a conversion table; an example is here. Another fine example with the full 1-byte UTF8 code table is here.
Now, the SQL*Plus session uses US7ASCII (I specified that by setting NLS_LANG). The database uses US7ASCII. Conclusion: Oracle needs to convert nothing! It gets US7ASCII from the session, and it stores US7ASCII - done!
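
The database side of that equation is easy to verify, by the way; the client side is whatever NLS_LANG claims it to be - Oracle simply has to take your word for that:

SQL> -- shows NLS_CHARACTERSET (and NLS_NCHAR_CHARACTERSET)
SQL> select parameter, value from nls_database_parameters where parameter like '%CHARACTERSET';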
Now - the last entry, the euro sign (164, or 0xA4), was done whilst using ISO 8859-15 as the terminal setting. It comes back in UTF8 as a question mark: even though U+00A4 is defined as the currency symbol, a lone 0xA4 byte is not valid UTF8, so the terminal cannot render it.

Where Oracle comes in

OK - I now changed the environment variable NLS_LANG to American_America.UTF8. The terminal supports UTF8, too. Now take a look at this:

SQL> truncate table blah;
Table truncated.

SQL> insert into blah values('q');
1 row created.

SQL> insert into blah values('ç');
1 row created.

SQL> select * from blah;
A
------------------------------------------------------------
q
c

SQL> select dump(a) from blah;
DUMP(A)
--------------------------------------------------------------------------------
Typ=1 Len=1: 113
Typ=1 Len=1: 99

Watch that last entry! It reads the character 'c', not 'ç'! The whole point here is that now the network layer actually performs a conversion - it gets UTF8, but has to map that into US7ASCII. US7ASCII does not have a code point for 'ç', and maps it to 'c'. Similar for the euro symbol:
SQL> insert into blah values('€');
1 row created.

SQL> select a,dump(a) from blah;
A DUMP(A)
---------- ------------------------------
q Typ=1 Len=1: 113
c Typ=1 Len=1: 99
? Typ=1 Len=1: 63

Ah - there it is! The infamous question mark! And the euro symbol has become a decimal 63 (0x3F).

Mind you - now you have actually lost data! There is no way of knowing whether that 0x63 originally was the character 'c' or a 'ç'. Same for the euro symbol!
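
If you want to see this lossy mapping without juggling NLS_LANG and terminal settings, CONVERT shows it in a single statement. A sketch, assuming a database character set that can hold 'ç' in the first place (so not the US7ASCII demo database):

SQL> -- recode ç to US7ASCII: the cedilla is gone for good, expect a plain 'c'
SQL> select convert('ç', 'US7ASCII') from dual;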

Conclusion

  • Be careful with "special" characters - you may actually lose data!
  • You do not store characters in the database - you store code points. The actual character displayed depends on the character set in use.
  • Make sure all end user components are capable of displaying "special" characters. If one does show the correct character and the other doesn't, the latter uses the wrong character set. Choose a different one that matches the code points to the correct characters (just as the other tool does).

Wednesday, April 15, 2009

"Special" Characters - part I

Time and time again, people complain about "not being able to store special characters". Whatever special means - in this case diacritics, like à, é or ç, or currency symbols like the Yen or Euro sign: ¥ and €.
Educated guess... Sort of.
I even had an Oracle Master explaining that "the database had the wrong character set" because a web based application did not display the euro sign correctly. My 'polite' comment was "Humbug! A database stores whatever you throw at it". Sorry about that.
The database in question was 10.2.0.4, using WE8ISO8859P1 (not P15!).

So, what have we got:
- One workstation, running Linux (CentOS) 2.6.18-92.1.18.el5PAE #1 SMP
- One Oracle 10g Release 2 for Linux, patched to 10.2.0.3 (p5337014).
- Some scripts, generated by dbca
- NLS_LANG set to American_America.US7ASCII, ORACLE_BASE and ORACLE_HOME set

This is the crux:

CREATE DATABASE "us7ascii"
MAXINSTANCES 8
MAXLOGHISTORY 1
MAXLOGFILES 16
MAXLOGMEMBERS 3
MAXDATAFILES 100
DATAFILE '/oracle/data/us7ascii/system01.dbf' SIZE 300M REUSE AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED
EXTENT MANAGEMENT LOCAL
SYSAUX DATAFILE '/oracle/data/us7ascii/sysaux01.dbf' SIZE 120M REUSE AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED
SMALLFILE DEFAULT TEMPORARY TABLESPACE TEMP TEMPFILE '/oracle/data/us7ascii/temp01.dbf' SIZE 20M REUSE AUTOEXTEND ON NEXT 640K MAXSIZE UNLIMITED
SMALLFILE UNDO TABLESPACE "UNDOTBS1" DATAFILE '/oracle/data/us7ascii/undotbs01.dbf' SIZE 200M REUSE AUTOEXTEND ON NEXT 5120K MAXSIZE UNLIMITED
CHARACTER SET US7ASCII
NATIONAL CHARACTER SET AL16UTF16
LOGFILE GROUP 1 ('/oracle/data/us7ascii/redo01.log') SIZE 51200K,
GROUP 2 ('/oracle/data/us7ascii/redo02.log') SIZE 51200K,
GROUP 3 ('/oracle/data/us7ascii/redo03.log') SIZE 51200K
USER SYS IDENTIFIED BY "&&sysPassword" USER SYSTEM IDENTIFIED BY "&&systemPassword";


Ok - the database was made, and it cannot store any special characters by default, as the standard character set is 7-bit. Or can it?!?


SQL> create user scott identified by tiger default tablespace users;
User created.

SQL> grant connect, resource to scott;
Grant succeeded.

SQL> conn scott/tiger
Connected.
SQL> create table blah (a varchar2(20));
Table created.

SQL> insert into blah values ('q');
1 row created.

SQL> insert into blah values('ç');
1 row created.

SQL> select * from blah;

A
--------------------
q
ç

Oops... Can it get any worse? It can, actually:

SQL> select * from blah;

A
--------------------
q
ç

Æ æ Œ œ
Å Ø ß Ñ ñ


I guess I just demonstrated that you actually can store "special" characters in a US7ASCII-based Oracle database. So can we now stop the discussions?

A database (well, at least an Oracle database) stores whatever you throw at it. That is what databases are supposed to do.
Edit: of course, this is linked to the TOP-25...
more specifically, point #1 I made: "Improper encoding".