| subject: | Re: Character encodings, transfer encodings, etc |
From: Ellen K.
So all the flavors of character encoding already have mappings? And then
the translation is created based on the rules for mapping the particular
flavor of character encoding to the desired translation? What happens to
the characters which inherently don't require more than 7 bits?
On Sun, 23 Oct 2005 11:55:14 -0700, "Rich" wrote in message
:
> It's a little more complicated than you think. An 'ñ' is Unicode U+00F1.
This character is not ASCII. It is encoded as 0xF1 in ISO-8859-1 and
Windows code page 1252. Your computer very likely runs with code page 1252
as the default multibyte (ANSI) encoding. If you encode this in UTF-8, a
common Unicode encoding, it is represented as 0xC3 0xB1.
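
The byte values quoted above can be checked with a few lines of Python. This is only a sketch using the standard codecs; the variable names are made up for illustration.

```python
import unicodedata

ch = "\u00f1"  # the n-with-tilde character, U+00F1

# Same character, three character encodings, different byte sequences.
latin1_bytes = ch.encode("iso-8859-1")
cp1252_bytes = ch.encode("cp1252")
utf8_bytes = ch.encode("utf-8")

print(unicodedata.name(ch))
print("iso-8859-1:", latin1_bytes.hex())
print("cp1252:    ", cp1252_bytes.hex())
print("utf-8:     ", utf8_bytes.hex())

# ASCII only covers code points 0..127, so this character cannot be encoded.
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("fits in ASCII:", ascii_ok)
```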
>
> The first thing you need to select is the character encoding. Let's
consider both ISO-8859-1 and UTF-8. Windows 1252 is the same as ISO-8859-1 here.
>
> For quoted printable ISO-8859-1 this would be expressed as =F1. For quoted
printable UTF-8 this would be presented as =C3=B1.
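
Going the other way shows why the character encoding has to be known as well as the transfer encoding. A rough sketch with Python's standard `quopri` module follows; `decode_qp` is a hypothetical helper name, not something from the thread.

```python
import quopri

def decode_qp(text: str, charset: str) -> str:
    """Undo the transfer encoding first, then the character encoding."""
    raw = quopri.decodestring(text.encode("ascii"))  # =XX escapes -> raw bytes
    return raw.decode(charset)                       # raw bytes -> characters

print(decode_qp("=F1", "iso-8859-1"))
print(decode_qp("=C3=B1", "utf-8"))

# Picking the wrong charset still "works" but yields two wrong characters.
print(decode_qp("=C3=B1", "iso-8859-1"))
```

Both of the first two calls give back the same single character; the third gives the familiar two-character garbage seen when UTF-8 is read as ISO-8859-1.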
>
> For base64 the answer is not easy as you need three bytes for a full base64
grouping. Here we have only one or two bytes depending on the character
encoding. There is a mechanism for this. If there is only one byte this
is padded with four zero bits to get two encoded characters which are
suffixed with "==". If there are only two bytes, this is padded
with two zero bits to get three encoded characters which are suffixed with
"=". I could do the work to find what the actual encoding is
here but I don't have any tools that just do base64 encoding and don't want
to take the time to fake it. The one byte ISO-8859-1 encoding could look
something like AQ== and the two byte UTF-8 encoding like T/r=.
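
For anyone who wants the real values rather than the placeholder guesses AQ== and T/r=, Python's standard `base64` module will do the work. A minimal sketch (variable names are illustrative only):

```python
import base64

one_byte = "\u00f1".encode("iso-8859-1")  # b"\xf1"
two_bytes = "\u00f1".encode("utf-8")      # b"\xc3\xb1"

# 0xF1 = 11110001 -> 111100 | 01 + 0000 (pad) -> two characters, then "=="
b64_one = base64.b64encode(one_byte).decode("ascii")

# 0xC3 0xB1 = 11000011 10110001 -> 110000 | 111011 | 0001 + 00 (pad)
# -> three characters, then "="
b64_two = base64.b64encode(two_bytes).decode("ascii")

print(b64_one)  # 8Q==
print(b64_two)  # w7E=
```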
>
>Rich
>
> "Ellen K." wrote in message
news:3ginl1drukln16uits5n5c9ll50mjovnt7{at}4ax.com...
> Thanks for the explanation. :)
>
> One picture (OK, one example in this case) being worth the
> proverbial 1000 words, a lower-case n with a tilde (ñ if it gets
> reproduced correctly by the time people read this) is ascii 241, i.e. it
> uses the first bit of an 8-bit byte. How is it expressed in
> quoted-printable and how is it expressed in base64?
>
> On Sat, 22 Oct 2005 22:31:34 -0700, "Rich" wrote in message
> :
>
> > There are multiple ways. The two most common are called
quoted-printable and base64.
> >
> > In quoted printable, a byte can be represented by =XX, where XX are
the hex digits for the byte. Because the '=' is the escape character, a
literal '=' is itself expressed as =3D.
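
The =XX rule is simple enough to write out by hand. The following is a deliberately simplified sketch, not a full implementation: it ignores the line-length limit, soft line breaks and trailing-whitespace rules that real quoted-printable encoders apply, and `qp_encode_bytes` is a made-up name.

```python
def qp_encode_bytes(data: bytes) -> str:
    """Simplified quoted-printable: escape '=' and anything outside 0x20..0x7E."""
    out = []
    for b in data:
        if b == 0x3D or b < 0x20 or b > 0x7E:
            out.append(f"={b:02X}")  # escape as '=' plus two uppercase hex digits
        else:
            out.append(chr(b))       # printable ASCII passes through unchanged
    return "".join(out)

print(qp_encode_bytes(b"a=b"))
print(qp_encode_bytes("ma\u00f1ana".encode("iso-8859-1")))
print(qp_encode_bytes("ma\u00f1ana".encode("utf-8")))
```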
> >
> > In base64 the byte sequence is divided in three byte groups which are
subdivided into four six bit units. The six bit units are mapped to
printable ASCII characters.
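
The three-bytes-into-four-units grouping can be sketched directly with shifts and masks. This is a hand-rolled illustration for comparison with the standard library, assuming the usual base64 alphabet; `b64_encode` is a hypothetical name and in practice `base64.b64encode` is the thing to use.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64_encode(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        group = data[i:i + 3]
        # Pack up to three bytes into 24 bits, zero-filled on the right.
        n = int.from_bytes(group.ljust(3, b"\0"), "big")
        # Split the 24 bits into four six-bit units and map each to a character.
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # A short final group carries len(group) + 1 real characters;
        # the remaining positions are filled with "=".
        keep = len(group) + 1
        out.append("".join(chars[:keep]) + "=" * (4 - keep))
    return "".join(out)

for sample in (b"Man", b"\xf1", b"\xc3\xb1"):
    print(sample, b64_encode(sample), base64.b64encode(sample).decode("ascii"))
```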
> >
> > Note that I refer to bytes, not characters, for the source. This is
because this transfer encoding is applied after any character encoding like
UTF-8. For example, with UTF-8 a single character is represented by from
one to four bytes. In quoted printable this becomes from one to 12 ASCII
characters. There are many character encodings in use
for many reasons. Quoted printable and base64 are usually selected based
on which results in a smaller size overall. At least that is the criterion
used by the clients I have seen.
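
To put numbers on the "one to 12" range and the size-based choice, here is a small sketch. The sample characters are arbitrary, and `smaller_encoding` is a hypothetical helper showing one way a client might pick; it is not taken from any particular mail program.

```python
import base64
import quopri

# One character each needing 1, 2, 3 and 4 bytes in UTF-8.
samples = ["A", "\u00f1", "\u20ac", "\U0001F600"]
for ch in samples:
    raw = ch.encode("utf-8")
    qp = quopri.encodestring(raw).decode("ascii")
    print(f"U+{ord(ch):04X}: {len(raw)} byte(s) -> {len(qp)} QP character(s): {qp}")

def smaller_encoding(raw: bytes) -> str:
    """Pick whichever transfer encoding gives the shorter result (assumed heuristic)."""
    qp_len = len(quopri.encodestring(raw))
    b64_len = len(base64.b64encode(raw))
    return "quoted-printable" if qp_len <= b64_len else "base64"

print(smaller_encoding(b"hello world, plain text"))
print(smaller_encoding("\u00f1".encode("utf-8") * 20))
```

Mostly-ASCII text comes out smaller in quoted printable, since only the odd byte triples in size; text that is mostly non-ASCII comes out smaller in base64, which grows everything by a fixed one third.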
> >
> > George's example is slightly different than what I describe above.
Headers like the subject use a different mechanism to identify encoding
than the message body and unlike the body allow mixing and matching in some
ways. His example is using a character encoding of "ascii" and
transfer encoding of base64. What bothered him is that the encoded form is
used when it wasn't necessary and presumably some tool he is using doesn't
understand this 12 year old standard.
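
The header form in George's example is the encoded-word syntax, =?charset?encoding?text?=, where B means base64. Python's standard `email.header` module can decode it; the sketch below feeds it the subject from his post, joined onto one line. `decode_subject` is an illustrative name only.

```python
from email.header import decode_header

subject = (
    "=?ascii?B?W1NQQU1dICBPbmxpbmUgUGF5bWVu?= "
    "=?ascii?B?dHMgYW5kIG91ciBzZWN1cmUgc2l0?= =?ascii?B?ZSE=?="
)

def decode_subject(value: str) -> str:
    parts = []
    for chunk, charset in decode_header(value):
        # Encoded words come back as bytes plus the declared charset;
        # plain text may come back as str with charset None.
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)

print(decode_subject(subject))
```

Whitespace between adjacent encoded words is dropped on decoding, which is why the three pieces rejoin into whole words.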
> >
> >Rich
> >
> > "Ellen K." wrote in message
news:485ml1121hsq9se5hg0l4d2tsci5c5vc6b{at}4ax.com...
> > Just curious, how are characters that require more than 7 bits encoded
> > into 7-bit?
> >
> > On Thu, 20 Oct 2005 17:21:58 -0700, "Rich" wrote in message
> > :
> >
> > > Email content is any encoding you want. The example you give is
valid even if silly. It's not a security issue in any case.
> > >
> > > BTW, email is not 7-bit, though it is encouraged to be encoded as such
because that provides better compatibility. There is a standard for
checking for 8-bit compatibility. See http://www.ietf.org/rfc/rfc1652.txt.
8-bit transport is not necessary, since anything can be encoded as 7-bit,
but it can be more efficient.
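
RFC 1652 has the server advertise the 8BITMIME keyword in its EHLO reply. A rough sketch of that check follows, along with a test for whether a body is 7-bit clean to begin with. The sample server reply and both function names are invented for illustration; a real client would read the reply from the SMTP connection.

```python
def supports_8bitmime(ehlo_response: str) -> bool:
    """Look for the 8BITMIME keyword among the EHLO reply lines."""
    for line in ehlo_response.splitlines():
        # Reply lines look like "250-KEYWORD params" or "250 KEYWORD params".
        if len(line) < 4 or not line.startswith("250"):
            continue
        words = line[4:].split()
        if words and words[0].upper() == "8BITMIME":
            return True
    return False

def is_7bit_clean(body: bytes) -> bool:
    """True if every byte fits in 7 bits, so no transfer encoding is needed."""
    return all(b < 0x80 for b in body)

# Hypothetical EHLO reply, made up for this example.
sample_reply = "250-mail.example.org\n250-8BITMIME\n250 SIZE 10240000"

print(supports_8bitmime(sample_reply))
print(is_7bit_clean(b"plain ASCII text"))
print(is_7bit_clean("ma\u00f1ana".encode("utf-8")))
```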
> > >
> > >Rich
> > >
> > > "Geo." wrote in message
news:4357ff5e$1{at}w3.nls.net...
> > > OK, I don't understand, so maybe someone can give me a rational
> > > explanation of this.
> > >
> > > Why would an email program accept
> > >
> > > Subject: =?ascii?B?W1NQQU1dICBPbmxpbmUgUGF5bWVu?=
> > > =?ascii?B?dHMgYW5kIG91ciBzZWN1cmUgc2l0?= =?ascii?B?ZSE=?=
> > >
> > > and decode it to
> > >
> > > [SPAM] Online Payments and our secure site!
> > >
> > > This just boggles the mind. I mean, if you were trying to create a secure
> > > application, wouldn't you restrict to a least common denominator instead
> > > of allowing everything? Email is 7-bit ASCII, not Unicode, correct? Is this
> > > somehow needed to allow a Unicode subject line where the RFCs don't allow it?
> > >
> > > Geo.
--- BBBS/NT v4.01 Flag-5
* Origin: Barktopia BBS Site http://HarborWebs.com:8081 (1:379/45)
| SOURCE: echomail via fidonet.ozzmosis.com | |