echo: nthelp
to: Ellen K.
from: Rich
date: 2005-10-23 14:27:44
subject: Re: Character encodings, transfer encodings, etc

From: "Rich" 

This is a multi-part message in MIME format.

------=_NextPart_000_04E8_01C5D7DD.EFBF67D0
Content-Type: text/plain;
        charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

   I don't understand the first question.  Quoted-printable and base64
are two well-defined transfer encodings.  They apply to data as a blob.
The data can be text or binary.  ASCII, ISO-8859-1, and UTF-8 are
character encodings.  There are multiple terms for these.  The email
standards call these charsets.  You need a character encoding even if you
don't have a transfer encoding.
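
A minimal Python sketch of the two layers, assuming the standard-library `base64` and `quopri` modules. The helper name `decode_part` is made up for illustration. It undoes the transfer encoding first (bytes to bytes) and only then applies the charset (bytes to text):

```python
import base64
import quopri


def decode_part(payload: bytes, transfer_encoding: str, charset: str) -> str:
    """Undo the transfer encoding, then decode the bytes with the charset."""
    enc = transfer_encoding.lower()
    if enc == "quoted-printable":
        raw = quopri.decodestring(payload)
    elif enc == "base64":
        raw = base64.b64decode(payload)
    elif enc in ("7bit", "8bit"):
        raw = payload  # identity: there is nothing to undo
    else:
        raise ValueError(f"unsupported transfer encoding: {transfer_encoding}")
    return raw.decode(charset)


# The same character arriving through different combinations of layers.
print(ascii(decode_part(b"=F1", "quoted-printable", "iso-8859-1")))
print(ascii(decode_part(b"w7E=", "base64", "utf-8")))

# No transfer encoding at all, but the charset still matters: UTF-8 bytes
# read as ISO-8859-1 turn into two wrong characters.
print(ascii(decode_part(b"\xc3\xb1", "8bit", "iso-8859-1")))
```

The last call is the case where there is no transfer encoding yet the result still depends on which character encoding is declared.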

   There are two other common transfer encodings identified as 7bit and
8bit.  The former is used when the data is already limited to 7-bit
characters and doesn't need to be encoded.  8bit is used when the
transport is 8-bit safe and again no encoding is needed.

   You can use quoted-printable and base64 with all 7-bit data.  There
would be no point but nothing precludes it.
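
A short Python illustration of this, using the standard-library `quopri` and `base64` modules on a sample string chosen for this example. Quoted-printable leaves plain 7-bit text untouched, while base64 grows it by about a third:

```python
import base64
import quopri

# Sample 7-bit data (chosen for this example; no '=' and no trailing spaces).
data = b"Plain 7-bit text needs no help"

qp = quopri.encodestring(data)
b64 = base64.b64encode(data)

print(qp)
print(b64)
print(len(data), len(qp), len(b64))

# Both still round-trip to the original bytes.
print(quopri.decodestring(qp) == data, base64.b64decode(b64) == data)
```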

   Again, George's example was with an email header, where the rules are
different.  The encoding serves two purposes.  One is to handle non-7-bit
bytes, as with a message body.  Another is to identify the character
encoding.  There are headers that are used to declare the character
encoding of a body part.  There was nothing to declare the character
encoding of a header value.  The RFC 1522 mechanism provides for this
too.  In George's example the character encoding was "ascii".
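
As a sketch of how a client handles that header, the subject quoted at the bottom of this thread can be decoded with Python's standard `email.header.decode_header`. The wrapper `decode_subject` is a hypothetical helper written for this example. Each encoded word names its own charset ("ascii") and transfer encoding ("B" for base64):

```python
from email.header import decode_header

# The Subject value from the thread, with the encoded words on one line.
subject = (
    "=?ascii?B?W1NQQU1dICBPbmxpbmUgUGF5bWVu?= "
    "=?ascii?B?dHMgYW5kIG91ciBzZWN1cmUgc2l0?= "
    "=?ascii?B?ZSE=?="
)


def decode_subject(value: str) -> str:
    """Join the decoded pieces, using the charset declared in each encoded word."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            parts.append(chunk.decode(charset or "ascii"))
        else:
            parts.append(chunk)  # header had no encoded words at all
    return "".join(parts)


decoded = decode_subject(subject)
print(decoded)  # [SPAM]  Online Payments and our secure site!
```

The whitespace between adjacent encoded words is dropped on decoding, which is why the three pieces join into one sentence.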

Rich

  "Ellen K." <72322.1016{at}compuserve.com> wrote in message
  news:5csnl1lg05pjmmc40kue0n0aomtcslv17s{at}4ax.com...
  So all the flavors of character encoding already have mappings?  And
  then the translation is created based on the rules for mapping the
  particular flavor of character encoding to the desired translation?
  What happens to the characters which inherently don't require more than
  7 bits?

  On Sun, 23 Oct 2005 11:55:14 -0700, "Rich" wrote in message
  <435bdc4b{at}w3.nls.net>:

  >   It's a little more complicated than you think.  An 'ñ' is Unicode
  >   U+00F1.  This character is not ASCII.  It is encoded as 0xF1 in
  >   ISO-8859-1 and Windows code page 1252.  Your computer very likely
  >   runs with code page 1252 as the default multibyte (ANSI) encoding.
  >   If you encode this in UTF-8, a common Unicode encoding, it is
  >   represented as 0xC3 0xB1.
  >
  >   The first thing you need to select is the character encoding.
  >   Let's consider ISO-8859-1 and UTF-8 both.  Windows 1252 is the same
  >   as ISO-8859-1 here.
  >
  >   For quoted-printable ISO-8859-1 this would be expressed as =F1.
  >   For quoted-printable UTF-8 this would be expressed as =C3=B1.
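
These byte values and their quoted-printable forms can be checked with a few lines of Python, assuming only the standard-library `quopri` module:

```python
import quopri

char = "\u00f1"  # LATIN SMALL LETTER N WITH TILDE

latin1_bytes = char.encode("iso-8859-1")  # one byte: F1
utf8_bytes = char.encode("utf-8")         # two bytes: C3 B1

qp_latin1 = quopri.encodestring(latin1_bytes)
qp_utf8 = quopri.encodestring(utf8_bytes)

print(latin1_bytes.hex(), qp_latin1)  # f1 b'=F1'
print(utf8_bytes.hex(), qp_utf8)      # c3b1 b'=C3=B1'

# Windows code page 1252 agrees with ISO-8859-1 for this character.
print(char.encode("cp1252") == latin1_bytes)
```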
  >
  >   For base64 the answer is not as easy, since you need three bytes for
  >   a full base64 grouping.  Here we have only one or two bytes depending
  >   on the character encoding.  There is a mechanism for this.  If there
  >   is only one byte, it is padded with four zero bits to get two encoded
  >   characters, which are suffixed with "==".  If there are only two
  >   bytes, they are padded with two zero bits to get three encoded
  >   characters, which are suffixed with "=".  I could do the work to find
  >   what the actual encoding is here but I don't have any tools that just
  >   do base64 encoding and don't want to take the time to fake it.  The
  >   one-byte ISO-8859-1 encoding could look something like AQ== and the
  >   two-byte UTF-8 encoding like T/r=.
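
For reference, the actual values are easy to compute with Python's standard `base64` module. The placeholders in the message have the right shape (two characters plus "==", three characters plus "="), and the real encodings come out as follows:

```python
import base64

one_byte = "\u00f1".encode("iso-8859-1")  # b"\xf1"
two_bytes = "\u00f1".encode("utf-8")      # b"\xc3\xb1"

b64_one = base64.b64encode(one_byte)
b64_two = base64.b64encode(two_bytes)

print(b64_one)  # b'8Q=='
print(b64_two)  # b'w7E='
```

Working the one-byte case by hand: 0xF1 is 11110001, which splits into 111100 (60, "8") and 01 followed by four zero bits, 010000 (16, "Q").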
  >
  >Rich
  >
  >  "Ellen K." <72322.1016{at}compuserve.com> wrote in message
  >  news:3ginl1drukln16uits5n5c9ll50mjovnt7{at}4ax.com...
  >  Thanks for the explanation.   :)
  >
  >  One picture (OK, one example in this case <g>) being worth the
  >  proverbial 1000 words, a lower-case n with a tilde (ñ if it gets
  >  reproduced correctly by the time people read this) is ascii 241, i.e. it
  >  uses the first bit of an 8-bit byte.  How is it expressed in
  >  quoted-printable and how is it expressed in base64?
  >
  >  On Sat, 22 Oct 2005 22:31:34 -0700, "Rich" wrote in message
  >  <435b1fe7{at}w3.nls.net>:
  >
  >  >   There are multiple ways.  The two most common are called
  >  >   quoted-printable and base64.
  >  >
  >  >   In quoted-printable, characters can be represented by =XX where XX
  >  >   are the hex digits for the byte.  Because the '=' is an escape
  >  >   character it is expressed as =3D.
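
A deliberately simplified Python sketch of that rule. The function names `qp_encode_byte` and `qp_encode` are made up for this example, and real quoted-printable also has rules for spaces, line breaks, and a 76-character line limit that are ignored here. On a short sample without spaces it agrees with the standard-library `quopri` module:

```python
import quopri


def qp_encode_byte(b: int) -> str:
    """Printable ASCII other than '=' passes through; anything else becomes =XX."""
    if 33 <= b <= 126 and b != 0x3D:
        return chr(b)
    return f"={b:02X}"


def qp_encode(data: bytes) -> str:
    # Simplification: no soft line breaks, so only suitable for short input.
    return "".join(qp_encode_byte(b) for b in data)


sample = b"1+1=2\xf1"
print(qp_encode(sample))            # 1+1=3D2=F1
print(quopri.encodestring(sample))  # b'1+1=3D2=F1'
```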
  >  >
  >  >   In base64 the byte sequence is divided into three-byte groups,
  >  >   which are subdivided into four six-bit units.  The six-bit units are
  >  >   mapped to printable ASCII characters.
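
The grouping can be written out by hand in a few lines of Python. This is an illustrative sketch, not how any particular mail client does it; `b64_encode` is a name invented here, and the result is checked against the standard `base64` module:

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_encode(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        group = data[i:i + 3]
        # Pack up to three bytes into 24 bits, zero-filled on the right.
        n = int.from_bytes(group.ljust(3, b"\0"), "big")
        # Split into four six-bit units, most significant first.
        units = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
        # A short final group keeps fewer units: 1 byte -> 2, 2 bytes -> 3.
        keep = len(group) + 1
        out.append("".join(ALPHABET[u] for u in units[:keep]) + "=" * (4 - keep))
    return "".join(out)


for sample in (b"", b"\xf1", b"\xc3\xb1", b"Man", b"[SPAM]"):
    print(sample, b64_encode(sample), base64.b64encode(sample).decode("ascii"))
```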
  >  >
  >  >   Note that I refer to bytes, not characters, for the source.  This
  >  >   is because the transfer encoding is applied after any character
  >  >   encoding like UTF-8.  For example, with UTF-8 a single character is
  >  >   represented by from one to four bytes.  In quoted-printable this
  >  >   becomes from one to 12 ASCII characters.  There are many character
  >  >   encodings in use for many reasons.  Quoted-printable and base64 are
  >  >   usually selected based on which results in a smaller size overall.
  >  >   At least that is the criterion used by the clients I have seen.
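
A rough Python sketch of that size-based choice, assuming the standard `quopri` and `base64` modules. The function `pick_transfer_encoding` and the sample inputs are invented for illustration, and real clients may apply other criteria as well:

```python
import base64
import quopri


def pick_transfer_encoding(raw: bytes) -> str:
    """Choose whichever transfer encoding yields the smaller output."""
    qp_len = len(quopri.encodestring(raw))
    b64_len = len(base64.encodebytes(raw))  # MIME-style, with line breaks
    return "quoted-printable" if qp_len <= b64_len else "base64"


mostly_ascii = "Se\u00f1or, the report is attached".encode("utf-8")
mostly_binary = bytes(range(128, 192))

print(pick_transfer_encoding(mostly_ascii))   # quoted-printable
print(pick_transfer_encoding(mostly_binary))  # base64

# One UTF-8 character costs between 1 and 12 quoted-printable characters.
print(len(quopri.encodestring("a".encode("utf-8"))))
print(len(quopri.encodestring("\U0001F600".encode("utf-8"))))
```

Text that is mostly ASCII tends to come out smaller in quoted-printable, since only the few non-ASCII bytes triple in size. Data that is mostly non-ASCII bytes comes out smaller in base64, which has a fixed overhead of about one third.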
  >  >
  >  >   George's example is slightly different from what I describe
  >  >   above.  Headers like the subject use a different mechanism to
  >  >   identify encoding than the message body and, unlike the body, allow
  >  >   mixing and matching in some ways.  His example is using a character
  >  >   encoding of "ascii" and a transfer encoding of base64.  What
  >  >   bothered him is that the encoded form is used when it wasn't
  >  >   necessary, and presumably some tool he is using doesn't understand
  >  >   this 12-year-old standard.
  >  >
  >  >Rich
  >  >
  >  >  "Ellen K." <72322.1016{at}compuserve.com> wrote in message
  >  >  news:485ml1121hsq9se5hg0l4d2tsci5c5vc6b{at}4ax.com...
  >  >  Just curious, how are characters that require more than 7 bits
  >  >  encoded into 7-bit?
  >  >
  >  >  On Thu, 20 Oct 2005 17:21:58 -0700, "Rich" wrote in message
  >  >  <43583435{at}w3.nls.net>:
  >  >
  >  >  >   Email content is any encoding you want.  The example you give
  >  >  >   is valid even if silly.  It's not a security issue in any case.
  >  >  >
  >  >  >   BTW, email is not 7-bit, though it is encouraged to be encoded
  >  >  >   as such because that provides better compatibility.  There is a
  >  >  >   standard for checking for 8-bit compatibility.  See
  >  >  >   http://www.ietf.org/rfc/rfc1652.txt.  8-bit transport is not
  >  >  >   necessary, since anything can be encoded as 7-bit, but it can be
  >  >  >   more efficient.
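
RFC 1652 has the server advertise the `8BITMIME` keyword in its EHLO reply. A minimal Python sketch of checking for it follows; the function name `supports_8bitmime` and the sample reply text are invented for this example. With a live connection, `smtplib.SMTP.has_extn("8bitmime")` performs a comparable check after `ehlo()`.

```python
def supports_8bitmime(ehlo_response: str) -> bool:
    """True if any EHLO reply line advertises the 8BITMIME keyword."""
    for line in ehlo_response.splitlines():
        # Reply lines look like "250-KEYWORD params" or "250 KEYWORD" (last line).
        keyword = line[4:].split(" ", 1)[0] if len(line) > 4 else ""
        if keyword.upper() == "8BITMIME":
            return True
    return False


# Hypothetical server replies, for illustration only.
with_ext = "250-mail.example.org\n250-SIZE 10240000\n250-8BITMIME\n250 HELP"
without_ext = "250-mail.example.org\n250 HELP"

print(supports_8bitmime(with_ext))     # True
print(supports_8bitmime(without_ext))  # False
```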
  >  >  >
  >  >  >Rich
  >  >  >
  >  >  >  "Geo." <fake{at}barkdom.com> wrote in message
  >  >  >  news:4357ff5e$1{at}w3.nls.net...
  >  >  >  Ok I don't understand so maybe someone can give me a rational
  >  >  >  explanation of this.
  >  >  >
  >  >  >  Why would an email program accept
  >  >  >
  >  >  >  Subject: =?ascii?B?W1NQQU1dICBPbmxpbmUgUGF5bWVu?=
  >  >  >  =?ascii?B?dHMgYW5kIG91ciBzZWN1cmUgc2l0?= =?ascii?B?ZSE=?=
  >  >  >
  >  >  >  and decode it to
  >  >  >
  >  >  >   [SPAM]  Online Payments and our secure site!
  >  >  >
  >  >  >  This just boggles the mind, I mean if you were trying to create a
  >  >  >  secure application wouldn't you restrict to a least common instead
  >  >  >  of allow everything? Email is 7bit ascii not unicode correct? Is
  >  >  >  this somehow needed to allow unicode subject line where the RFC's
  >  >  >  don't allow it?
  >  >  >
  >  >  >  Geo. <confused and trying not to read conspiracy into it>

------=_NextPart_000_04E8_01C5D7DD.EFBF67D0--

--- BBBS/NT v4.01 Flag-5
* Origin: Barktopia BBS Site http://HarborWebs.com:8081 (1:379/45)
SEEN-BY: 633/267 270 5030/786
@PATH: 379/45 1 106/2000 633/267

SOURCE: echomail via fidonet.ozzmosis.com
