Problems with Data Transfer (character sets)
From ICISWiki
I was having some problems loading the names from IWIS into an ICIS database created in PostgreSQL (I will call it Postgres from here on). After some struggling I came up with a number of steps that handles this satisfactory. It might not be the only solution, but it works.
In SQL server you can dump a table with the bcp command. A typical batch file would look like this
bcp pms_to_gms..names out c:\pms_to_gms\names.txt /c /t "|" /r \n /e c:\pms_to_gms\names.log /m 10 /S%1 /Usa /P%2
The problem I found out the hard way is, that only the lower part of the ANSI character set (the ASCII part you might say, values from chr(1) to chr(127)) is dumped as-is, while the upper part is getting converted. Thus the name 'NORTEÑO 67' in the output of the bcp looks like 'NORTE¥O 67'. The solution is to add the /C flag for codepage, and put ACP as codepage (OEM is the default if you don't specify the /C flag). Here is the new bcp command:
bcp pms_to_gms..names out c:\pms_to_gms\names.txt /c /CACP /t "|" /r \n /e c:\pms_to_gms\names.log /m 10 /S%1 /Usa /P%2
Now the output looks okay. To upload this into Postgres you need to handle codepages too. When creating the database you need to specify the codepage to be sure it is handled correctly. As UTF8 is the default codepage, which will assume Unicode characters if using the upper part of the character set (like 'Ñ'), you can get strange errors about invalid characters because it goes hunting for UNICODE double bytes in the input when uploading these characters.
When creating the database, you need to specify the encoding, so that it is not interpreting any characters as part of a UNICODE double byte. I presume you could select several different (Postgres Help has a topic 21.2 Character set support with these), thus LATIN1, LATIN6 and several others might be possible. If you use SJIS you will get japanese, and KOI8 will produce russian. I chose WIN1252 as being Western European codepage. If you will actually get a problem specifying KOI8 which is not double byte, I don't know, but if you use UTF8 (1-4 bytes pr. letter) or SJIS (1-2 bytes pr. letter) you will definitely get problems uploading, because some characters are then interpreted as double bytes (or triple, quadruple).
Here is what I did in the script to enforce the right encoding of the database:
CREATE database icislocal WITH ENCODING='WIN1252';
Now in the upload I'm using the COPY command. In this command I needed to specify the code page explicitly to avoid data transformation:
SET CLIENT_ENCODING='WIN1252';
COPY names
FROM 'C:\\Program Files\\pgAdmin III\\names.txt' DELIMITER '|'
NULL ''
I presume the client_encoding should match the one of the database, at least that seems logical.
This uploaded the names including 'NORTEÑO 67' correctly.
- Jesper Nørgaard Welen (CIMMYT)

