Wednesday, April 15, 2009

"Special" Characters - part I

Time and time again, people complain about "not being able to store special characters". Whatever "special" means - in this case diacritics, like à, é or ç, or currency symbols like the yen or euro sign: ¥ and €.
Educated guess... Sort of.
I even had an Oracle Master explain that "the database had the wrong character set" because a web-based application did not display the euro sign correctly. My 'polite' comment was: "Humbug! A database stores whatever you throw at it". Sorry about that.
The database in question was 10.2.0.4, using WE8ISO8859P1 (not P15!).
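For the record, checking which character set a database actually uses is trivial - a quick sketch (the value shown is simply what that particular database reported):

SQL> select value from nls_database_parameters
  2  where parameter = 'NLS_CHARACTERSET';

VALUE
----------------------------------------
WE8ISO8859P1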

So, what have we got:
- One workstation, running Linux (CentOS) 2.6.18-92.1.18.el5PAE #1 SMP
- One Oracle 10g Release 2 for Linux, patched to 10.2.0.3 (p5337014).
- Some scripts, generated by dbca
- NLS_LANG set to American_America.US7ASCII, ORACLE_BASE and ORACLE_HOME set
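Note that only the language and territory parts of NLS_LANG are visible from within the database; the character set part stays purely client-side, which is exactly why the server cannot second-guess it. A sketch, with the output I would expect for the setting above:

SQL> select * from nls_session_parameters
  2  where parameter in ('NLS_LANGUAGE', 'NLS_TERRITORY');

PARAMETER            VALUE
-------------------- --------------------
NLS_LANGUAGE         AMERICAN
NLS_TERRITORY        AMERICA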

This is the crux:

CREATE DATABASE "us7ascii"
MAXINSTANCES 8
MAXLOGHISTORY 1
MAXLOGFILES 16
MAXLOGMEMBERS 3
MAXDATAFILES 100
DATAFILE '/oracle/data/us7ascii/system01.dbf' SIZE 300M REUSE AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED
EXTENT MANAGEMENT LOCAL
SYSAUX DATAFILE '/oracle/data/us7ascii/sysaux01.dbf' SIZE 120M REUSE AUTOEXTEND ON NEXT 10240K MAXSIZE UNLIMITED
SMALLFILE DEFAULT TEMPORARY TABLESPACE TEMP TEMPFILE '/oracle/data/us7ascii/temp01.dbf' SIZE 20M REUSE AUTOEXTEND ON NEXT 640K MAXSIZE UNLIMITED
SMALLFILE UNDO TABLESPACE "UNDOTBS1" DATAFILE '/oracle/data/us7ascii/undotbs01.dbf' SIZE 200M REUSE AUTOEXTEND ON NEXT 5120K MAXSIZE UNLIMITED
CHARACTER SET US7ASCII
NATIONAL CHARACTER SET AL16UTF16
LOGFILE GROUP 1 ('/oracle/data/us7ascii/redo01.log') SIZE 51200K,
GROUP 2 ('/oracle/data/us7ascii/redo02.log') SIZE 51200K,
GROUP 3 ('/oracle/data/us7ascii/redo03.log') SIZE 51200K
USER SYS IDENTIFIED BY "&&sysPassword" USER SYSTEM IDENTIFIED BY "&&systemPassword";


OK - the database was created, and it should not be able to store any special characters, as US7ASCII is a 7-bit character set. Or can it?!?


SQL> create user scott identified by tiger default tablespace users;
User created.

SQL> grant connect, resource to scott;
Grant succeeded.

SQL> conn scott/tiger
Connected.
SQL> create table blah (a varchar2(20));
Table created.

SQL> insert into blah values ('q');
1 row created.

SQL> insert into blah values('ç');
1 row created.

SQL> select * from blah;

A
--------------------
q
ç

Oops... Can it get any worse? It can, actually - after a few more inserts along the same lines:

SQL> select * from blah;

A
--------------------
q
ç

Æ æ Œ œ
Å Ø ß Ñ ñ


I guess I just demonstrated that you actually can store "special" characters in a US7ASCII-based Oracle database. So can we now stop these discussions?

A database (well, at least an Oracle database) stores whatever you throw at it. That is what databases are supposed to do.
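A teaser for part II: the DUMP function shows the bytes that were actually stored. A sketch - what it reports depends entirely on what the terminal sent, e.g. the single byte 231 for an ISO 8859-1 'ç', or the pair 195,167 for a UTF-8 one:

SQL> select a, dump(a) from blah;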
Edit: of course, this is linked to the TOP-25...
more specifically, point #1 I made: "Improper encoding"

6 comments:

DomBrooks said...

However, if you use the DUMP function to inspect the data, are you actually storing the correct characters?

You can get an application to work together with a database to store whatever it wants and interpret/transform the byte stream in whatever way it wants - sure I've seen that.

http://orastory.wordpress.com/2007/02/26/characterset-inspection/

But if something else comes along and wants to use that data then it has to interpret the byte stream in exactly the same way otherwise it's meaningless.

So, can you store any characters in any character set properly? No.

Does what you're saying here mean you shouldn't do it properly? No.

Frank said...

Ah - the DUMP part is actually part II. But first, define "correct character" - I store it, and I can retrieve it - so it must be correct.

And what you say about interpreting the stream in the same way is exactly the whole point - anyone not doing that gets the "wrong" data.
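(For illustration: Oracle's CONVERT function makes this visible - the same character forced through a different character set assumption comes out mangled. A sketch; the character sets here are just examples:)

SQL> -- an ISO 8859-1 'ç' has no US7ASCII equivalent, so it becomes a replacement character
SQL> select convert('ç', 'US7ASCII', 'WE8ISO8859P1') as c from dual;

C
-
?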

DomBrooks said...

I get your point.

There are plenty of applications out there that work this way, more often than not unknowingly (the example I referred to on my blog used a double Windows codepage translation to get what it wanted in and out), until they want to use the data elsewhere.

And if you make assumptions about what you've got in the database and try to plug something else straight into the database to get what you think is there, then you can be in trouble. The moral - store it right the first time around.

I await part II ;)

DomBrooks said...

P.S. correct = the internal byte representation of the character is correct.

i.e. the client interpretation of the meaning of the bytes is not "by chance"

Other reasons why - NLS length semantics - if you are using two characters to store one, then any data length constraints may be improperly implemented.

For example, say you want to restrict column ename to be a varchar2(30 char). If you are storing multiple characters (as opposed to bytes) to represent a single multi-byte character higher up in your application, then you may be able to store far fewer than 30 characters.
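(A sketch of that pitfall, assuming an AL32UTF8 database where 'é' occupies two bytes; the table names are made up:)

SQL> create table t_byte (ename varchar2(30 byte));
SQL> create table t_char (ename varchar2(30 char));
SQL> -- 30 'é' characters are 60 bytes in UTF-8: they fit in t_char...
SQL> insert into t_char values (rpad('é', 30, 'é'));
SQL> -- ...but overflow the 30-byte limit of t_byte with ORA-12899
SQL> insert into t_byte values (rpad('é', 30, 'é'));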

DomBrooks said...

Sorry - I don't mean to bombard you with comments. I haven't thought through everything properly. The example I blogged about was here: http://orastory.wordpress.com/2007/01/24/doing-globalisation-the-hard-way/

And that example was with a UTF8 characterset so there was no excuse not to store things properly other than ignorance.

Frank said...

Dom,
- I made clear (I hope) that what you store are code points. These are always subject to interpretation - even by the front end. I demonstrated that.
So much for internal byte representation.
- multi-byte character sets: use NLS_LENGTH_SEMANTICS=CHAR rather than BYTE (see the sketch below). Beware that the maximum size of a VARCHAR2 column is still 4000 bytes.
- regarding your Cyrillic headline blog post: someone entered data using nls_lang set to WE8MSWIN1251, and stored in a UTF8 database?
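(A sketch of that switch - with session-level CHAR semantics, new column definitions count characters instead of bytes; the table name is made up:)

SQL> alter session set nls_length_semantics = CHAR;
SQL> -- ename now holds up to 30 characters, however many bytes each takes -
SQL> -- but the hard limit of 4000 bytes per VARCHAR2 column still applies
SQL> create table emp_demo (ename varchar2(30));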