Subject: Re: Character encodings, transfer encodings, etc
From: "Rich"
It's a little more complicated than you think. An 'ñ' is Unicode
U+00F1. This character is not ASCII. It is encoded as 0xF1 in
ISO-8859-1 and Windows code page 1252. Your computer very likely runs
with code page 1252 as the default multibyte (ANSI) encoding. If you
encode this in UTF-8, a common Unicode encoding, it is represented as
0xC3 0xB1.
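
The byte values above can be checked by letting a language runtime do the character encoding. The following is only a minimal sketch in Python, not something from the thread; `latin-1`, `cp1252` and `utf-8` are Python's codec names for the encodings mentioned, and the variable names are illustrative.

```python
# Encode the single character U+00F1 under the three character encodings
# discussed above and show the resulting bytes in hex.
ch = "\u00f1"  # LATIN SMALL LETTER N WITH TILDE

encoded = {name: ch.encode(name) for name in ("latin-1", "cp1252", "utf-8")}

for name, raw in encoded.items():
    print(f"{name:8} {raw.hex().upper()}")

# ASCII has no code for this character, so a strict ASCII encode fails.
try:
    ch.encode("ascii")
    is_ascii = True
except UnicodeEncodeError:
    is_ascii = False
print("representable in ASCII:", is_ascii)
```

The two single-byte encodings give F1, UTF-8 gives C3B1, and the ASCII attempt is rejected.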

The first thing you need to select is the character encoding. Let's
consider ISO-8859-1 and UTF-8 both. Windows 1252 is the same as
ISO-8859-1 here.

For quoted printable ISO-8859-1 this would be expressed as =F1.
For quoted printable UTF-8 this would be presented as =C3=B1.
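
The two-step order (character encoding first, transfer encoding second) can be sketched with Python's standard `quopri` module. This is an illustration under the assumption that Python is available, not a tool anyone in the thread used; the variable names are made up for the example.

```python
import quopri

ch = "\u00f1"  # the character under discussion

# Step 1: pick a character encoding. Step 2: apply quoted-printable.
qp_latin1 = quopri.encodestring(ch.encode("latin-1")).decode("ascii")
qp_utf8 = quopri.encodestring(ch.encode("utf-8")).decode("ascii")
print(qp_latin1)
print(qp_utf8)

# The escape character itself has to be escaped as well.
qp_equals = quopri.encodestring(b"=").decode("ascii")
print(qp_equals)

# Decoding reverses both steps, provided the right charset is used.
round_trip = quopri.decodestring(qp_utf8.encode("ascii")).decode("utf-8")
```

This prints =F1, =C3=B1 and =3D, and `round_trip` is the original character again.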

For base64 the answer is not easy as you need three bytes for a full
base64 grouping. Here we have only one or two bytes depending on the
character encoding. There is a mechanism for this. If there is only one
byte this is padded with four zero bits to get two encoded characters
which are suffixed with "==". If there are only two bytes, this is
padded with two zero bits to get three encoded characters which are
suffixed with "=". I could do the work to find what the actual encoding
is here but I don't have any tools that just do base64 encoding and
don't want to take the time to fake it. The one byte ISO-8859-1
encoding could look something like AQ== and the two byte UTF-8 encoding
like T/r=.
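
The AQ== and T/r= strings in the post are explicitly stand-ins showing the shape of the result. The padding rule described above is small enough to work through by hand; the sketch below does that in Python and cross-checks against the standard `base64` module. The helper name `b64_tail` is hypothetical and only handles a final group of one or two bytes.

```python
import base64

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def b64_tail(data: bytes) -> str:
    """Encode a final group of 1 or 2 bytes by hand, showing the padding."""
    if len(data) not in (1, 2):
        raise ValueError("expected a 1- or 2-byte final group")
    bits = "".join(f"{b:08b}" for b in data)
    # 8 bits  -> pad with 4 zero bits (12 bits, 2 characters, then "==")
    # 16 bits -> pad with 2 zero bits (18 bits, 3 characters, then "=")
    bits += "0" * (-len(bits) % 6)
    chars = "".join(B64[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
    return chars + "=" * (3 - len(data))

one_byte = b64_tail("\u00f1".encode("latin-1"))   # 0xF1
two_bytes = b64_tail("\u00f1".encode("utf-8"))    # 0xC3 0xB1
print(one_byte, two_bytes)

# Cross-check the hand-rolled result against the standard library.
assert one_byte == base64.b64encode(b"\xf1").decode("ascii")
assert two_bytes == base64.b64encode(b"\xc3\xb1").decode("ascii")
```

Worked through this way, the one byte ISO-8859-1 form comes out as 8Q== and the two byte UTF-8 form as w7E=.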

Rich

"Ellen K." wrote in message news:3ginl1drukln16uits5n5c9ll50mjovnt7{at}4ax.com...
Thanks for the explanation. :)

One picture (OK, one example in this case) being worth the
proverbial 1000 words, a lower-case n with a tilde (ñ if it gets
reproduced correctly by the time people read this) is ascii 241, i.e. it
uses the first bit of an 8-bit byte. How is it expressed in
quoted-printable and how is it expressed in base64?

On Sat, 22 Oct 2005 22:31:34 -0700, "Rich" wrote in message:
> There are multiple ways. The two most common are called
> quoted-printable and base64.
>
> In quoted printable characters can be represented by =XX where XX are
> the hex digits for the byte. Because the '=' is an escape character it
> is expressed as =3D.
>
> In base64 the byte sequence is divided in three byte groups which are
> subdivided into four six bit units. The six bit units are mapped to
> printable ASCII characters.
>
> Note that I refer to bytes not characters for the source. This is
> because this transfer encoding is applied after any character encoding
> like UTF-8. For example, with UTF-8 a single character is represented
> by from one to four bytes. In quoted printable this becomes from one
> to 12 ASCII characters. There are many character encodings in use for
> many reasons. Quoted printable and base64 are usually selected based
> on which results in a smaller size overall. At least that is the
> criterion used by the clients I have seen.
>
> George's example is slightly different than what I describe above.
> Headers like the subject use a different mechanism to identify encoding
> than the message body and unlike the body allow mixing and matching in
> some ways. His example is using a character encoding of "ascii" and
> transfer encoding of base64. What bothered him is that the encoded form
> is used when it wasn't necessary and presumably some tool he is using
> doesn't understand this 12 year old standard.
>
>Rich
>
> "Ellen K." wrote in message news:485ml1121hsq9se5hg0l4d2tsci5c5vc6b{at}4ax.com...
> Just curious, how are characters that require more than 7 bits encoded
> into 7-bit?
>
> On Thu, 20 Oct 2005 17:21:58 -0700, "Rich" wrote in message:
>
> > Email content is any encoding you want. The example you give is
> > valid even if silly. It's not a security issue in any case.
> >
> > BTW, email is not 7-bit though it is encouraged to be encoded as
> > such because that provides better compatibility. There is a standard
> > for checking for 8-bit compatibility. See
> > http://www.ietf.org/rfc/rfc1652.txt. 8-bit transport is not necessary
> > since anything can be encoded as 7-bit, but it can be more efficient.
> >
> >Rich
> >
> > "Geo." wrote in message news:4357ff5e$1{at}w3.nls.net...
> > Ok I don't understand so maybe someone can give me a rational
> > explanation of this.
> >
> > Why would an email program accept
> >
> > Subject: =?ascii?B?W1NQQU1dICBPbmxpbmUgUGF5bWVu?=
> >  =?ascii?B?dHMgYW5kIG91ciBzZWN1cmUgc2l0?= =?ascii?B?ZSE=?=
> >
> > and decode it to
> >
> > [SPAM] Online Payments and our secure site!
> >
> > This just boggles the mind, I mean if you were trying to create a
> > secure application wouldn't you restrict to a least common instead of
> > allow everything? Email is 7bit ascii not unicode correct? Is this
> > somehow needed to allow unicode subject line where the RFC's don't
> > allow it?
> >
> > Geo.
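
Geo.'s Subject line at the bottom of the thread uses the header mechanism Rich mentions: each `=?charset?encoding?text?=` unit names its own character encoding and transfer encoding. A minimal sketch of decoding it in Python follows. The function name `decode_b_words` and the regular expression are assumptions made for this illustration; it handles only the "B" (base64) form and assumes the header contains nothing but encoded words.

```python
import base64
import re

# One encoded word has the shape =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BbQq])\?([^?]*)\?=")

def decode_b_words(header_value: str) -> str:
    """Decode a header value made up only of base64 ("B") encoded words."""
    parts = []
    for charset, encoding, text in ENCODED_WORD.findall(header_value):
        if encoding.upper() != "B":
            raise ValueError("this sketch only handles the B encoding")
        # Transfer decoding (base64) first, then character decoding.
        parts.append(base64.b64decode(text).decode(charset))
    # Whitespace between adjacent encoded words is not part of the text.
    return "".join(parts)

subject = (
    "=?ascii?B?W1NQQU1dICBPbmxpbmUgUGF5bWVu?= "
    "=?ascii?B?dHMgYW5kIG91ciBzZWN1cmUgc2l0?= "
    "=?ascii?B?ZSE=?="
)
print(decode_b_words(subject))
```

The three words decode to pieces that split mid-word ("...Paymen", "ts and our secure sit", "e!"), which is why the whitespace separating them has to be dropped when they are joined.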
 * Origin: Barktopia BBS Site http://HarborWebs.com:8081 (1:379/45)
SOURCE: echomail via fidonet.ozzmosis.com