echo: qedit
to: ALL
from: DIETER KOESSL
date: 1997-04-20 22:19:00
subject: RE: accented characters

From: Dieter Koessl 
Otavio,
> Chris Antos (Exchange) wrote:
> >
> > to answer your 3 questions:
> > - no conversion is happening, it's what i explained above.
>
> no, there IS some conversion being made! if I cut a N-tilde (#209 in
> WinANSI) from NOTEPAD and paste it into TSE I end up with a N-tilde
> (#165 in cp_437); you see that the conversion is clever enough to change
> the code (from #209 to #165) preserving the "look" of the character: an
> N with a tilde on top of it;
Well, it seems that Chris has overlooked a detail. But let's start at
the beginning.
The ASCII character set (an ANSI standard, X3.4) is a 7-bit code that
includes the ten digits, the basic lower and upper case letters, most
punctuation marks and the control characters (e.g. CR, LF and TAB). It
does not include any accented characters. Control characters (#0..#31
and #127) are, by definition, non-printing characters and as such have
_no_ visual counterpart.
This all changed with the advent of the IBM PC, which was targeted at an
international market. IBM extended the ASCII character set to 8 bits and
filled the new upper half (#128..#255) with accented characters, block
graphics and some Greek characters. It also defined visual counterparts
for the control characters. This became what is now known as CP 437.
Text-oriented apps usually depend heavily on this extended character
set. It quickly became clear that the accented characters in the
extended set didn't suffice for many languages, so new code pages were
invented, notably the "international" CP 850, which sacrificed some
block graphics and Greek characters for a more complete set of accented
characters. All these code pages later came to be known collectively as
_the_ OEM character set.
Things changed again with the advent of Windows, which exclusively used
the new ANSI character set, an 8-bit extension of the old 7-bit ASCII
character set. This new ANSI character set includes an extensive set of
accented characters and additional punctuation marks, but... it doesn't
include any block graphics or Greek characters.
Now, if you use a GUI editor, e.g. Notepad, it will store the characters
you have typed using the ANSI character set. Text-oriented editors, e.g.
TSE, will on the other hand use the OEM character set. This means that,
depending on which kind of editor you use, different numbers (bytes)
representing the same visual character will be stored in the file on
disk. It also means that the extended characters will be interpreted as
something entirely different if you open a file written with the other
kind of editor, e.g. open a file in TSE that was written with Notepad.
To summarize: what is stored on disk are only numbers in the range
0..255, and what is displayed on screen depends on how your program
interprets these numbers, i.e. which character set it uses.
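To make the N-tilde case concrete, here is a minimal sketch. It assumes Python, with its "cp1252" codec standing in for the Windows ANSI set and "cp437" for the OEM set; neither editor is involved, only the byte values.

```python
import unicodedata

ch = "Ñ"

# The same visual character is stored as a different byte in each set.
ansi_byte = ch.encode("cp1252")[0]   # 209, what Notepad would write
oem_byte = ch.encode("cp437")[0]     # 165, what a DOS-style editor would write
print(ansi_byte, oem_byte)

# Reading a byte with the "wrong" character set yields a different character.
ansi_read_as_oem = bytes([ansi_byte]).decode("cp437")
oem_read_as_ansi = bytes([oem_byte]).decode("cp1252")
print(unicodedata.name(ansi_read_as_oem))
print(unicodedata.name(oem_read_as_ansi))
```

The byte on disk never changes; only the table used to look it up does.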
Finally, the Windows clipboard can be used to store a lot of things,
including plain text. But this isn't so plain after all, because the
clipboard understands two kinds of text (you guessed it!): ANSI text and
OEM text. Now if you stuff in ANSI text, say via Notepad, and retrieve
OEM text, say via TSE, Windows will _automatically_ transform one
character set into the other. It does this as best it can and will fail
on certain characters, since each set includes characters which the
other doesn't. If Windows encounters such a character, it will produce a
block sign in ANSI or an underscore in OEM.
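The kind of translation described here can be imitated with a rough sketch. This is not the code Windows runs: it assumes Python, uses "cp1252" and "cp437" as stand-ins for the ANSI and OEM sets, and the function name ansi_to_oem and its underscore fallback are chosen for illustration only. The real Windows tables may pick a look-alike character in some cases instead of the fallback.

```python
def ansi_to_oem(data: bytes, fallback: bytes = b"_") -> bytes:
    """Translate ANSI (cp1252) bytes to OEM (cp437) bytes, character by character.

    Characters that have no OEM equivalent are replaced by `fallback`.
    """
    out = bytearray()
    # errors="replace" turns undefined ANSI bytes into U+FFFD, which then
    # also falls through to the fallback below.
    for ch in data.decode("cp1252", errors="replace"):
        try:
            out += ch.encode("cp437")
        except UnicodeEncodeError:
            out += fallback
    return bytes(out)

ansi_text = "Ñandú ©".encode("cp1252")   # as a GUI editor would store it
oem_text = ansi_to_oem(ansi_text)
print(list(ansi_text))
print(list(oem_text))
```

The N-tilde changes code from 209 to 165 while keeping its look, whereas the copyright sign, which CP 437 lacks, comes out as the fallback.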
hope this helps
Dieter
---
---------------
* Origin: apana>>>>>fidonet [sawasdi.apana.org.au] (3:800/846.13)

SOURCE: echomail via exec-pc
