echo: nthelp
to: Ellen K.
from: Rich
date: 2005-10-23 11:55:14
subject: Re: Character encodings, transfer encodings, etc

From: "Rich" 

   It's a little more complicated than you think.  An 'ñ' is Unicode
U+00F1.  This character is not ASCII.  It is encoded as 0xF1 in
ISO-8859-1 and Windows code page 1252.  Your computer very likely runs
with code page 1252 as the default multibyte (ANSI) encoding.  If you
encode this in UTF-8, a common Unicode encoding, it is represented as
0xC3 0xB1.

   The first thing you need to select is the character encoding.  Let's
consider both ISO-8859-1 and UTF-8.  Windows 1252 is the same as
ISO-8859-1 here.

   For quoted-printable ISO-8859-1 this would be expressed as =F1.  For
quoted-printable UTF-8 this would be expressed as =C3=B1.
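
   To make the two cases above concrete, here is a minimal Python sketch
that is not part of the original exchange.  It uses the standard
library's quopri module; the helper name qp and the variable names are
made up for illustration.  It character-encodes first and applies the
transfer encoding to the resulting bytes second.

```python
import quopri


def qp(text: str, charset: str) -> str:
    """Character-encode text, then quoted-printable-encode the bytes."""
    raw = text.encode(charset)
    return quopri.encodestring(raw).decode("ascii")


n_tilde = "\u00f1"  # lower-case n with tilde, U+00F1

latin1_qp = qp(n_tilde, "iso-8859-1")  # one byte, 0xF1
utf8_qp = qp(n_tilde, "utf-8")         # two bytes, 0xC3 0xB1
equals_qp = qp("=", "ascii")           # the escape character itself

print(latin1_qp)  # =F1
print(utf8_qp)    # =C3=B1
print(equals_qp)  # =3D
```

   The same character yields different quoted-printable text depending
only on which character encoding was chosen first.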

   For base64 the answer is not as easy, since you need three bytes for
a full base64 grouping.  Here we have only one or two bytes depending on
the character encoding.  There is a mechanism for this.  If there is
only one byte, it is padded with four zero bits to get two encoded
characters, which are suffixed with "==".  If there are only two bytes,
they are padded with two zero bits to get three encoded characters,
which are suffixed with "=".  I could do the work to find what the
actual encoding is here but I don't have any tools that just do base64
encoding and don't want to take the time to fake it.  The one byte
ISO-8859-1 encoding could look something like AQ== and the two byte
UTF-8 encoding like T/r= (these show the shape only, not the computed
values).
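
   The AQ== and T/r= above only illustrate the shape of the result.  As
a rough sketch of the padding rule just described, the following Python
works the final group out by hand and checks it against the standard
library's base64 module.  The function name b64_tail and the ALPHABET
constant are illustrative, not from any mail client.

```python
import base64

# Standard base64 alphabet: index 0-25 A-Z, 26-51 a-z, 52-61 0-9, then + and /
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_tail(data: bytes) -> str:
    """Encode a final group of 1 or 2 bytes, making the zero-bit padding explicit."""
    if len(data) not in (1, 2):
        raise ValueError("this sketch only handles a 1- or 2-byte final group")
    bits = "".join(f"{byte:08b}" for byte in data)
    # 8 bits get 4 zero bits appended (12 bits, 2 characters);
    # 16 bits get 2 zero bits appended (18 bits, 3 characters).
    bits += "0" * (-len(bits) % 6)
    chars = "".join(ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
    # Pad the output to a full four-character group with '='.
    return chars + "=" * (4 - len(chars))


latin1_b64 = b64_tail("\u00f1".encode("iso-8859-1"))  # input byte 0xF1
utf8_b64 = b64_tail("\u00f1".encode("utf-8"))         # input bytes 0xC3 0xB1

print(latin1_b64)  # 8Q==
print(utf8_b64)    # w7E=

# Cross-check against the library encoder.
assert latin1_b64 == base64.b64encode(b"\xf1").decode("ascii")
assert utf8_b64 == base64.b64encode(b"\xc3\xb1").decode("ascii")
```

   So for this character the one-byte ISO-8859-1 form comes out as 8Q==
and the two-byte UTF-8 form as w7E=.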

Rich

  "Ellen K." wrote in message
  news:3ginl1drukln16uits5n5c9ll50mjovnt7{at}4ax.com...
  Thanks for the explanation.   :)

  One picture (OK, one example in this case <g>) being worth the
  proverbial 1000 words, a lower-case n with a tilde (ñ if it gets
  reproduced correctly by the time people read this) is ascii 241, i.e.
  it uses the first bit of an 8-bit byte.  How is it expressed in
  quoted-printable and how is it expressed in base64?

  On Sat, 22 Oct 2005 22:31:34 -0700, "Rich" wrote in message
  <435b1fe7{at}w3.nls.net>:

  >   There are multiple ways.  The two most common are called
  > quoted-printable and base64.
  >
  >   In quoted-printable, characters can be represented by =XX where
  > XX are the hex digits for the byte.  Because the '=' is an escape
  > character, it is itself expressed as =3D.
  >
  >   In base64 the byte sequence is divided into three-byte groups,
  > which are subdivided into four six-bit units.  The six-bit units are
  > mapped to printable ASCII characters.
  >
  >   Note that I refer to bytes, not characters, for the source.  This
  > is because this transfer encoding is applied after any character
  > encoding like UTF-8.  For example, with UTF-8 a single character is
  > represented by from one to four bytes.  In quoted-printable this
  > becomes from one to 12 ASCII characters.  There are many character
  > encodings in use for many reasons.  Quoted-printable and base64 are
  > usually selected based on which results in a smaller size overall.
  > At least that is the criterion used by the clients I have seen.
  >
  >   George's example is slightly different than what I describe above.
  > Headers like the subject use a different mechanism to identify
  > encoding than the message body and, unlike the body, allow mixing and
  > matching in some ways.  His example is using a character encoding of
  > "ascii" and transfer encoding of base64.  What bothered him is that
  > the encoded form is used when it wasn't necessary and presumably some
  > tool he is using doesn't understand this 12 year old standard.
  >
  >Rich
  >
  >  "Ellen K." wrote in message
  >  news:485ml1121hsq9se5hg0l4d2tsci5c5vc6b{at}4ax.com...
  >  Just curious, how are characters that require more than 7 bits
  >  encoded into 7-bit?
  >
  >  On Thu, 20 Oct 2005 17:21:58 -0700, "Rich" wrote in message
  >  <43583435{at}w3.nls.net>:
  >
  >  >   Email content is any encoding you want.  The example you give
  >  > is valid even if silly.  It's not a security issue in any case.
  >  >
  >  >   BTW, email is not 7-bit, though it is encouraged to be encoded
  >  > as such because that provides better compatibility.  There is a
  >  > standard for checking for 8-bit compatibility.  See
  >  > http://www.ietf.org/rfc/rfc1652.txt.  It's not necessary since
  >  > anything can be encoded as 7-bit.  It can be more efficient.
  >  >
  >  >Rich
  >  >
  >  >  "Geo." wrote in message news:4357ff5e$1{at}w3.nls.net...
  >  >  Ok I don't understand so maybe someone can give me a rational
  >  >  explanation of this.
  >  >
  >  >  Why would an email program accept
  >  >
  >  >  Subject: =?ascii?B?W1NQQU1dICBPbmxpbmUgUGF5bWVu?=
  >  >  =?ascii?B?dHMgYW5kIG91ciBzZWN1cmUgc2l0?= =?ascii?B?ZSE=?=
  >  >
  >  >  and decode it to
  >  >
  >  >   [SPAM]  Online Payments and our secure site!
  >  >
  >  >  This just boggles the mind, I mean if you were trying to create
  >  >  secure application wouldn't you restrict to a least common
  >  >  instead of allow everything? Email is 7bit ascii not unicode
  >  >  correct? Is this somehow needed to allow unicode subject line
  >  >  where the RFC's don't allow it?
  >  >
  >  >  Geo. <confused and trying not to read conspiracy into it>
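
   For reference, the Subject header Geo. quotes at the bottom of the
thread can be decoded by hand.  The following is a minimal Python sketch
that assumes only the B (base64) form of encoded words; the regular
expression and the function name decode_b_words are illustrative, and a
real client would use a complete parser for header encoded words rather
than this shortcut.

```python
import base64
import re

# An encoded word has the shape =?charset?encoding?payload?=
# This sketch only recognises the B (base64) encoding.
ENCODED_WORD = re.compile(r"=\?([^?]+)\?[Bb]\?([^?]*)\?=")


def decode_b_words(header_value: str) -> str:
    """Decode every base64 encoded word in a header value and join the results."""
    parts = []
    for charset, payload in ENCODED_WORD.findall(header_value):
        parts.append(base64.b64decode(payload).decode(charset))
    # Whitespace that separates adjacent encoded words is not part of the
    # text, so the decoded pieces are simply concatenated.
    return "".join(parts)


subject = (
    "=?ascii?B?W1NQQU1dICBPbmxpbmUgUGF5bWVu?=\n"
    " =?ascii?B?dHMgYW5kIG91ciBzZWN1cmUgc2l0?= =?ascii?B?ZSE=?="
)
decoded = decode_b_words(subject)
print(decoded)  # [SPAM]  Online Payments and our secure site!
```

   Each encoded word carries its own charset label, which is the
"different mechanism" for headers mentioned earlier in the thread.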


--- BBBS/NT v4.01 Flag-5
* Origin: Barktopia BBS Site http://HarborWebs.com:8081 (1:379/45)
SEEN-BY: 633/267 270 5030/786
@PATH: 379/45 1 106/2000 633/267

SOURCE: echomail via fidonet.ozzmosis.com
