echo: net_dev
to: Leonard Erickson
from: Jonathan de Boyne Pollard
date: 2000-02-25 00:58:20
subject: Multi-byte character sets

LE>> Frankly, I think we should just go to Unicode. 

 JdBP>> By Unicode I presume you mean UCS-2.  That would mean a new PKT
 JdBP>> file format, of course.  It would also be highly inefficient for
 JdBP>> text that was mostly ISO 8859-1, since every other byte would be
 JdBP>> zero. (Although I do wonder how much of that would be eliminated 
 JdBP>> by ZIP, RAR, ARJ, ARC, and suchlike.)

 JdBP>> UTF-8 would be better.

 LE> I've been told there's a format where you give an "intro code" that
 LE> IDs the character subset, (essentially that first byte) and then only
 LE> have to use 16-bit chars for stuff that *isn't* in that set. Sort of
 LE> a "condensed mode"

I very much doubt that what you describe exists.  There would be no way to
distinguish 8-bit characters from 16-bit characters.  

As I said, UTF-8 would be better than UCS-2.  Aside from the storage
inefficiency and the problems with all of those zero bytes, there's the
problem of byte order (endianness) to consider with UCS-2, as well.  UTF-8
doesn't suffer from any of these.
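
To put rough numbers on that, here is a minimal Python sketch.  The sample
string is an arbitrary assumption, not taken from any real message, and
Python's "utf-16-le" and "utf-16-be" codecs are used as stand-ins for UCS-2,
which they match for text that stays within the 16-bit range.

```python
# Sketch: encoded size, zero bytes, and byte order for mostly ISO 8859-1 text.
# The sample string is hypothetical; escapes are used so that the source
# file itself stays pure ASCII.
text = "Caf\u00e9 cr\u00e8me, na\u00efve fa\u00e7ade"  # 24 characters, 4 non-ASCII

utf8 = text.encode("utf-8")
ucs2_le = text.encode("utf-16-le")  # little-endian, two bytes per character
ucs2_be = text.encode("utf-16-be")  # big-endian, two bytes per character

print(len(text), len(utf8), len(ucs2_le))   # 24 28 48
print(ucs2_le.count(0))                     # 24 zero bytes, one per character
print(0 in utf8)                            # False: no zero bytes in UTF-8 here
print(ucs2_le[:4].hex(), ucs2_be[:4].hex()) # 43006100 00430061
```

The last line shows the byte order problem: the same two characters produce
different byte sequences depending on which end comes first, so a reader has
to be told (or has to guess) which one was used.  The UTF-8 bytes are the
same on every machine.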

 LE> Also, from what I've seen of Unicode, a message that was in full
 LE> 16-bit format and mostly *ASCII* is where the high byte would be
 LE> zero. The characters present in ISO 8859-1 that aren't present in
 LE> ASCII are spread over *several* unicode "sets".

I don't know where you read that, but it's wrong.  The ISO 8859-1 character
set occupies positions 0 to 255 of the Unicode character set.  It was
deliberately designed this way.
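
This is easy to check.  The sketch below assumes Python's "latin-1" codec as
the ISO 8859-1 implementation and its "utf-16-be" codec as a stand-in for
big-endian UCS-2.

```python
# Sketch: every ISO 8859-1 byte value maps to the Unicode code point
# with the same number (0 to 255).
all_bytes = bytes(range(256))
decoded = all_bytes.decode("latin-1")

same_positions = all(ord(ch) == b for ch, b in zip(decoded, all_bytes))
print(same_positions)  # True

# Encoded as big-endian UCS-2, the first byte of each pair is the high byte.
ucs2 = decoded.encode("utf-16-be")
high_bytes = set(ucs2[0::2])
print(high_bytes)  # {0}
```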

Because of this, messages written in Cyrillic will have non-zero high
bytes, as will (parts of) messages that use line drawing and box drawing
characters, but the majority of messages written in Western European
languages will have every second byte set to zero if using UCS-2.
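
A small Python sketch of the difference follows.  The three sample strings
and the helper name high_bytes are illustrative assumptions, and "utf-16-be"
again stands in for big-endian UCS-2.

```python
def high_bytes(text):
    """Return the sorted distinct high bytes of text encoded as big-endian UCS-2."""
    encoded = text.encode("utf-16-be")
    return sorted(set(encoded[0::2]))

# Hypothetical samples, written with escapes so the source stays pure ASCII.
samples = {
    "western": "Gr\u00fc\u00dfe aus K\u00f6ln",            # German greeting
    "cyrillic": "\u041f\u0440\u0438\u0432\u0435\u0442",    # Russian greeting
    "box": "\u250c\u2500\u2510",                           # top edge of a box
}

for name, text in samples.items():
    print(name, [hex(b) for b in high_bytes(text)])
# western ['0x0']
# cyrillic ['0x4']
# box ['0x25']
```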

 » JdeBP «

--- FleetStreet 1.22 NR
* Origin: JdeBP's point, using Squish (2:257/609.3)
SEEN-BY: 201/0 100 200 209 300 329 400 407 411 505 600 203/600 204/450 700
SEEN-BY: 205/0 206/0 396/1 490/21 633/267 270
@PATH: 257/609 255/3 1 396/1 201/505 633/267

SOURCE: echomail via fidonet.ozzmosis.com
