| echo: | |
|---|---|
| to: | |
| from: | |
| date: | |
| subject: | Re: Character encodings, transfer encodings, etc |
From: "Rich"
I don't understand the first question. Quoted-printable and base64 are
two well-defined transfer encodings. They apply to data as a blob. The
data can be text or binary. ASCII, ISO-8859-1, and UTF-8 are character
encodings. There are multiple terms for these. The email standards call
these charsets. You need a character encoding even if you don't have a
transfer encoding.
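A rough sketch of the two layers, using Python's standard `email` package (the thread names no tools, so the library and the sample bodies are assumptions made for illustration). It builds one body that needs a transfer encoding and one that does not; both still carry a charset.

```python
from email.message import EmailMessage

# Non-ASCII body: a charset (ISO-8859-1) plus a transfer encoding.
encoded = EmailMessage()
encoded.set_content("\u00f1", charset="iso-8859-1", cte="quoted-printable")

# Pure ASCII body sent as-is: it still declares a charset,
# but there is nothing to transfer-encode.
plain = EmailMessage()
plain.set_content("n", charset="us-ascii", cte="7bit")

for msg in (encoded, plain):
    print(msg["Content-Type"], "|", msg["Content-Transfer-Encoding"])
```

The charset says how the text became bytes; the transfer encoding says how those bytes were made safe for the transport.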
There are two other common transfer encodings identified as 7bit and
8bit. The former is used when the data is already limited to 7-bit
characters and doesn't need to be encoded. 8bit is used when the
transport is 8-bit safe and again no encoding is needed.
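As a minimal sketch of that distinction, the hypothetical helper below (its name is invented for this example) labels a byte string by whether any byte has the high bit set. It is a simplification: real 7bit data must also avoid NUL bytes and overlong lines, and choosing 8bit presumes the transport really is 8-bit safe.

```python
def identity_transfer_encoding(data: bytes) -> str:
    """Hypothetical helper: pick the 'no encoding needed' label for data."""
    # Simplified check: only looks at the high bit of each byte.
    return "7bit" if all(byte < 0x80 for byte in data) else "8bit"

print(identity_transfer_encoding("n".encode("iso-8859-1")))       # 7bit
print(identity_transfer_encoding("\u00f1".encode("iso-8859-1")))  # 8bit
```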
You can use quoted-printable and base64 with all 7-bit data. There
would be no point, but nothing precludes it.
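A quick way to see this, sketched with Python's standard `quopri` and `base64` modules (the sample bytes are arbitrary and not from the thread): quoted-printable leaves plain 7-bit text untouched, while base64 still rewrites it and grows it by a third.

```python
import base64
import quopri

data = b"Hello, world"  # arbitrary 7-bit sample

qp = quopri.encodestring(data)
b64 = base64.b64encode(data)

print(qp, len(qp))    # same bytes as the input
print(b64, len(b64))  # 12 input bytes become 16 characters
```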
Again, George's example was with an email header, where the rules are
different. The encoding serves two purposes. One is to handle non-7-bit
bytes, as with a message body. Another is to identify the character
encoding. There are headers that are used to declare the character
encoding of a body part. There was nothing to declare the character
encoding of a header value. The RFC 1522 mechanism provides for this
too. In George's example the character encoding was "ascii".
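The encoded-word form can be unpacked by hand. The sketch below is a simplified decoder written for this example (the function name and regular expression are not from any library, and it handles only headers made up entirely of encoded words). It is applied to the Subject line from George's message quoted further down; each word has the shape `=?charset?encoding?text?=`.

```python
import base64
import quopri
import re

# One encoded word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=")

def decode_encoded_words(header_value: str) -> str:
    """Decode a header value consisting only of encoded words (simplified)."""
    parts = []
    for charset, encoding, text in ENCODED_WORD.findall(header_value):
        if encoding.upper() == "B":          # base64
            raw = base64.b64decode(text)
        else:                                # "Q": quoted-printable variant
            raw = quopri.decodestring(text.encode("ascii"), header=True)
        parts.append(raw.decode(charset))    # charset names the character encoding
    # Whitespace between adjacent encoded words is not part of the text.
    return "".join(parts)

subject = ("=?ascii?B?W1NQQU1dICBPbmxpbmUgUGF5bWVu?= "
           "=?ascii?B?dHMgYW5kIG91ciBzZWN1cmUgc2l0?= "
           "=?ascii?B?ZSE=?=")
print(decode_encoded_words(subject))
```

Each word carries both pieces of information: the `ascii` field is the character encoding and the `B` field is the transfer encoding.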
Rich
"Ellen K." wrote in message
news:5csnl1lg05pjmmc40kue0n0aomtcslv17s{at}4ax.com...
So all the flavors of character encoding already have mappings? And
then the translation is created based on the rules for mapping the
particular flavor of character encoding to the desired translation?
What happens to the characters which inherently don't require more than
7 bits?
On Sun, 23 Oct 2005 11:55:14 -0700, "Rich" wrote in message
:
> It's a little more complicated than you think. An 'ñ' is Unicode
> U+00F1. This character is not ASCII. It is encoded as 0xF1 in
> ISO-8859-1 and Windows code page 1252. Your computer very likely runs
> with code page 1252 as the default multibyte (ANSI) encoding. If you
> encode this in UTF-8, a common Unicode encoding, it is represented as
> 0xC3 0xB1.
>
> The first thing you need to select is the character encoding. Let's
> consider ISO-8859-1 and UTF-8 both. Windows 1252 is the same as
> ISO-8859-1 here.
>
> For quoted-printable ISO-8859-1 this would be expressed as =F1. For
> quoted-printable UTF-8 this would be expressed as =C3=B1.
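Both results can be reproduced with Python's standard `quopri` module (the choice of Python is an assumption; the thread itself uses no code). The character is first turned into bytes by the chosen charset, and only then quoted-printable encoded.

```python
import quopri

for charset in ("iso-8859-1", "cp1252", "utf-8"):
    raw = "\u00f1".encode(charset)     # character encoding: text -> bytes
    qp = quopri.encodestring(raw)      # transfer encoding: bytes -> ASCII
    print(f"{charset:11} bytes={raw.hex()} qp={qp.decode('ascii')}")
```

The ISO-8859-1 and code page 1252 rows come out identical, and the UTF-8 row shows the two-byte form.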
>
> For base64 the answer is not as easy, as you need three bytes for a
> full base64 grouping. Here we have only one or two bytes, depending on
> the character encoding. There is a mechanism for this. If there is only
> one byte, it is padded with four zero bits to get two encoded
> characters, which are suffixed with "==". If there are only two bytes,
> they are padded with two zero bits to get three encoded characters,
> which are suffixed with "=". I could do the work to find what the
> actual encoding is here, but I don't have any tools that just do base64
> encoding and don't want to take the time to fake it. The one-byte
> ISO-8859-1 encoding could look something like AQ== and the two-byte
> UTF-8 encoding like T/r=.
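The actual values are easy to compute with Python's standard `base64` module (an assumption of this sketch; any base64 tool would do). The "AQ==" and "T/r=" strings above only illustrate the shape of the output, with two and one padding characters respectively.

```python
import base64

one_byte = "\u00f1".encode("iso-8859-1")   # b"\xf1"
two_bytes = "\u00f1".encode("utf-8")       # b"\xc3\xb1"

print(base64.b64encode(one_byte).decode("ascii"))    # 8Q==
print(base64.b64encode(two_bytes).decode("ascii"))   # w7E=
```

Working the one-byte case by hand: 0xF1 is 11110001, which splits into 111100 ('8') and 01 plus four zero bits, 010000 ('Q').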
>
>Rich
>
> "Ellen K." wrote in message
> news:3ginl1drukln16uits5n5c9ll50mjovnt7{at}4ax.com...
> Thanks for the explanation. :)
>
> One picture (OK, one example in this case) being worth the
> proverbial 1000 words, a lower-case n with a tilde (ñ if it gets
> reproduced correctly by the time people read this) is ascii 241, i.e. it
> uses the first bit of an 8-bit byte. How is it expressed in
> quoted-printable and how is it expressed in base64?
>
> On Sat, 22 Oct 2005 22:31:34 -0700, "Rich" wrote in message
> :
>
> > There are multiple ways. The two most common are called
> > quoted-printable and base64.
> >
> > In quoted-printable, characters can be represented by =XX, where XX
> > are the hex digits for the byte. Because '=' is the escape character,
> > it is itself expressed as =3D.
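The rule can be written out directly. This is a deliberately simplified encoder (the function name is made up for this sketch): it ignores the line-length, soft line break, and trailing-whitespace rules that a real quoted-printable implementation must also follow.

```python
def qp_encode_bytes(data: bytes) -> str:
    """Simplified quoted-printable: escaping only, no line handling."""
    out = []
    for byte in data:
        # Printable ASCII (including space) passes through, except '='.
        if 0x20 <= byte <= 0x7E and byte != ord("="):
            out.append(chr(byte))
        else:
            out.append(f"={byte:02X}")   # =XX with uppercase hex digits
    return "".join(out)

print(qp_encode_bytes(b"1+1=2"))   # 1+1=3D2
print(qp_encode_bytes(b"\xf1"))    # =F1
```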
> >
> > In base64 the byte sequence is divided into three-byte groups, which
> > are subdivided into four six-bit units. The six-bit units are mapped
> > to printable ASCII characters.
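A sketch of that grouping for a single full group, written by hand rather than with a library so the bit slicing is visible (the helper name is hypothetical, and padding of short groups is left out). The sample bytes are the first three characters of George's decoded subject line.

```python
import string

# The standard 64-character alphabet: A-Z, a-z, 0-9, '+', '/'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def encode_group(group: bytes) -> str:
    """Encode exactly three bytes as four base64 characters."""
    if len(group) != 3:
        raise ValueError("this sketch handles full three-byte groups only")
    bits = int.from_bytes(group, "big")   # 24 bits in one integer
    # Take four 6-bit units, most significant first, and map each to a character.
    return "".join(ALPHABET[(bits >> shift) & 0x3F] for shift in (18, 12, 6, 0))

print(encode_group(b"[SP"))   # W1NQ
```

"W1NQ" is how the first encoded word in George's Subject header begins.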
> >
> > Note that I refer to bytes, not characters, for the source. This is
> > because the transfer encoding is applied after any character encoding
> > like UTF-8. For example, with UTF-8 a single character is represented
> > by from one to four bytes. In quoted-printable this becomes from one
> > to 12 ASCII characters. There are many character encodings in use for
> > many reasons. Quoted-printable and base64 are usually selected based
> > on which results in a smaller size overall. At least that is the
> > criterion used by the clients I have seen.
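Both points can be checked with a short sketch using Python's standard `quopri` and `base64` modules. The sample characters and the `smaller_transfer_encoding` helper are assumptions made for illustration; real mail clients may weigh other factors, such as readability of the raw message.

```python
import base64
import quopri

# One sample character for each UTF-8 length from one to four bytes.
for ch in ("n", "\u00f1", "\u20ac", "\U0001f600"):
    raw = ch.encode("utf-8")
    qp = quopri.encodestring(raw)
    print(f"U+{ord(ch):04X}: {len(raw)} byte(s) -> {len(qp)} QP character(s)")

def smaller_transfer_encoding(raw: bytes) -> str:
    """Hypothetical chooser: pick whichever encoding gives the shorter output."""
    qp_len = len(quopri.encodestring(raw))
    b64_len = len(base64.b64encode(raw))
    return "quoted-printable" if qp_len <= b64_len else "base64"

print(smaller_transfer_encoding("Payments for the se\u00f1or".encode("utf-8")))
print(smaller_transfer_encoding(("\u00f1" * 5).encode("utf-8")))
```

Mostly-ASCII text favors quoted-printable, since only the rare non-ASCII byte triples in size; text dominated by non-ASCII bytes favors base64, which always costs four characters per three bytes.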
> >
> > George's example is slightly different than what I describe above.
> > Headers like the subject use a different mechanism to identify
> > encoding than the message body and, unlike the body, allow mixing and
> > matching in some ways. His example is using a character encoding of
> > "ascii" and a transfer encoding of base64. What bothered him is that
> > the encoded form is used when it wasn't necessary, and presumably some
> > tool he is using doesn't understand this 12-year-old standard.
> >
> >Rich
> >
> > "Ellen K." wrote in message
> > news:485ml1121hsq9se5hg0l4d2tsci5c5vc6b{at}4ax.com...
> > Just curious, how are characters that require more than 7 bits
> > encoded into 7-bit?
> >
> > On Thu, 20 Oct 2005 17:21:58 -0700, "Rich" wrote in message
> > :
> >
> > > Email content is any encoding you want. The example you give is
> > > valid even if silly. It's not a security issue in any case.
> > >
> > > BTW, email is not limited to 7-bit, though it is encouraged to be
> > > encoded as such because that provides better compatibility. There is
> > > a standard for checking for 8-bit compatibility. See
> > > http://www.ietf.org/rfc/rfc1652.txt. 8-bit transport is not
> > > necessary, since anything can be encoded as 7-bit, but it can be
> > > more efficient.
> > >
> > >Rich
> > >
> > > "Geo." wrote in message
> > > news:4357ff5e$1{at}w3.nls.net...
> > > Ok I don't understand so maybe someone can give me a rational
> > > explanation of this.
> > >
> > > Why would an email program accept
> > >
> > > Subject: =?ascii?B?W1NQQU1dICBPbmxpbmUgUGF5bWVu?=
> > >  =?ascii?B?dHMgYW5kIG91ciBzZWN1cmUgc2l0?= =?ascii?B?ZSE=?=
> > >
> > > and decode it to
> > >
> > > [SPAM] Online Payments and our secure site!
> > >
> > > This just boggles the mind, I mean if you were trying to create
> > > secure application wouldn't you restrict to a least common instead
> > > of allow everything? Email is 7bit ascii not unicode correct? Is
> > > this somehow needed to allow unicode subject line where the RFC's
> > > don't allow it?
> > >
> > > Geo.
* Origin: Barktopia BBS Site http://HarborWebs.com:8081 (1:379/45)SEEN-BY: 633/267 270 5030/786 @PATH: 379/45 1 106/2000 633/267 |
|
| SOURCE: echomail via fidonet.ozzmosis.com | |