Issue 1096:  Issue about CDR encoding for wide characters (interop)
Source:  (, )
Nature:  Uncategorized Issue
Severity: 
Summary: 1.  When encoding a wchar, always put 4 bytes into the CDR stream,
 adding any zero pad bytes as necessary to make the value 4 bytes
 long.
 2.  When encoding a wstring, always encode the length as the total
 number of bytes used by the encoded value, regardless of whether the
 encoding is byte-oriented or not.
 

Resolution: 
Revised Text: 
Actions taken:
March 24, 1998: received issue
June 25, 1998: closed issue
Discussion: 

End of Annotations:=====
Return-Path: <loewis@hp832.informatik.hu-berlin.de>
Date: Wed, 25 Mar 1998 13:38:23 +0100
From: "Martin v. Loewis" <loewis@informatik.hu-berlin.de>
To: jon@floorboard.com
CC: jfh@hal.com, issues@omg.org, interop@omg.org
Subject: Re: omg_issues: Issue about CDR encoding for wide characters
References: <199803231932.LAA24237@kodiak.hal.com>
<3516BAF8.3E82FFD7@floorboard.com> <199803241056.LAA16737@pandora>
<3517EDBC.C0AA416C@floorboard.com>

> Are you saying that there are current byte-oriented codesets that use
> more than 4 bytes for some characters?  If so, then I don't know how we
> could solve this.

UTF-8, for one, can go up to 6 bytes
(http://www.stonehand.com/unicode/standard/wg2n1036.html). The
characters represented by encodings larger than four bytes are not yet
assigned, though.
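
To make the 6-byte figure concrete, here is a minimal sketch of how a reader
can derive the sequence length from the first byte alone, assuming the
original ISO 10646 layout of UTF-8 (later revisions of UTF-8 restrict
sequences to 4 bytes). The function name is hypothetical and not part of any
ORB API.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the number of octets in a UTF-8 sequence, given its lead byte.

    Assumes the original ISO 10646 definition, which allows up to 6 octets.
    """
    if lead < 0x80:
        return 1  # 0xxxxxxx
    if lead < 0xC0:
        raise ValueError("continuation byte, not a lead byte")
    if lead < 0xE0:
        return 2  # 110xxxxx
    if lead < 0xF0:
        return 3  # 1110xxxx
    if lead < 0xF8:
        return 4  # 11110xxx
    if lead < 0xFC:
        return 5  # 111110xx
    if lead < 0xFE:
        return 6  # 1111110x
    raise ValueError("0xFE and 0xFF never appear in UTF-8")
```

This only works because the reader knows the data is UTF-8; for an unknown
encoding no such rule is available.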

I think there is no way to process multi-byte characters embedded in
CDR if you don't know what the encoding is. An ORB should not accept a
codeset where it can't easily determine the length of an encoding.  It
seems that multibyte non-Unicode encodings will be used only if they
are native to both client and server.

> 1.  When encoding a wchar, always put 4 bytes into the CDR stream,
> adding any zero pad bytes as necessary to make the value 4 bytes
> long.

I don't think that this could work.
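
One likely reason, given the 6-byte UTF-8 forms mentioned above, is that a
fixed 4-byte slot cannot hold every encoded character. The sketch below
illustrates that limit only; the helper name and the sample byte sequence are
assumptions made for illustration.

```python
def pad_wchar_to_4(encoded: bytes) -> bytes:
    """Zero-pad one already-encoded character to exactly 4 octets."""
    if len(encoded) > 4:
        raise ValueError(
            f"{len(encoded)}-octet character does not fit in 4 octets"
        )
    return encoded.ljust(4, b"\x00")


# A 2-octet UTF-8 character pads cleanly to 4 octets.
padded = pad_wchar_to_4("\u00e9".encode("utf-8"))

# A 6-octet sequence in the original UTF-8 layout (lead byte 0xFC) is rejected.
six_octets = bytes([0xFC, 0x84, 0x80, 0x80, 0x80, 0x80])
try:
    pad_wchar_to_4(six_octets)
    fits = True
except ValueError:
    fits = False
```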

> 2.  When encoding a wstring, always encode the length as the total
> number of bytes used by the encoded value, regardless of whether the
> encoding is byte-oriented or not.

Even though this sounds reasonable, it contradicts the adopted
specifications.

Regards,
Martin

Return-Path: <jon@floorboard.com>
Sender: jon@floorboard.com
Date: Wed, 25 Mar 1998 09:18:38 -0800
From: Jonathan Biggar <jon@floorboard.com>
To: "Martin v. Loewis" <loewis@informatik.hu-berlin.de>
CC: jfh@hal.com, issues@omg.org, interop@omg.org
Subject: Re: omg_issues: Issue about CDR encoding for wide characters
References: <199803231932.LAA24237@kodiak.hal.com>
<3516BAF8.3E82FFD7@floorboard.com> <199803241056.LAA16737@pandora>
<3517EDBC.C0AA416C@floorboard.com> <199803251238.NAA22851@pandora>

Martin v. Loewis wrote:
> 
> > Are you saying that there are current byte-oriented codesets that use
> > more than 4 bytes for some characters?  If so, then I don't know how we
> > could solve this.
> 
> UTF-8, for one, can go up to 6 bytes
> (http://www.stonehand.com/unicode/standard/wg2n1036.html). The
> characters represented by encodings larger than four bytes are not yet
> assigned, though.

Ick, that's awful.

> I think there is no way to process multi-byte characters embedded in
> CDR if you don't know what the encoding is. An ORB should not accept a
> codeset where it can't easily determine the length of an encoding.  It
> seems that multibyte non-Unicode encodings will be used only if they
> are native to both client and server.

The idea, and I'm sure Bill Janssen would agree with me, is to make the
CDR encoding parsable without having to know what encoding was used for
wchar and wstring.  If you don't have this, then bridges and gateways
become much more complicated, because they must know and track every
possible transmission codeset.
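
A minimal sketch of what this buys a bridge, assuming the proposed rule that
the wstring length prefix counts octets: the bridge can step over the value
without knowing the codeset. The function name is hypothetical, and the
sketch assumes the buffer begins at the start of the CDR stream so that
4-byte alignment can be computed from the offset.

```python
import struct


def skip_wstring(buf: bytes, offset: int, little_endian: bool) -> int:
    """Return the offset just past a wstring whose prefix counts octets."""
    # CDR aligns an unsigned long on a 4-octet boundary.
    offset = (offset + 3) & ~3
    fmt = "<I" if little_endian else ">I"
    (nbytes,) = struct.unpack_from(fmt, buf, offset)
    end = offset + 4 + nbytes
    if end > len(buf):
        raise ValueError("wstring runs past the end of the buffer")
    return end


# Five opaque octets follow the length; the bridge never decodes them.
message = struct.pack(">I", 5) + b"\x00\x41\x00\x42\x43" + b"\x01"
next_offset = skip_wstring(message, 0, little_endian=False)
```

If the prefix counted characters instead, the same step would require
decoding the payload in the negotiated transmission codeset.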

> > 1.  When encoding a wchar, always put 4 bytes into the CDR stream,
> > adding any zero pad bytes as necessary to make the value 4 bytes
> > long.
> 
> I don't think that this could work.
> 
> > 2.  When encoding a wstring, always encode the length as the total
> > number of bytes used by the encoded value, regardless of whether the
> > encoding is byte-oriented or not.
> 
> Even though this sounds reasonable, it contradicts the adopted
> specifications.

Right, I am proposing this as a change to the standards to resolve this
problem.  Perhaps if I changed my proposal to require wchar to be always
encoded as 8 bytes?

-- 
Jon Biggar
Floorboard Software
jon@floorboard.com
jon@biggar.org

Return-Path: <jmason@gog.dublin.iona.ie>
To: Jonathan Biggar <jon@floorboard.com>
cc: "Martin v. Loewis" <loewis@informatik.hu-berlin.de>, jfh@hal.com,
        issues@omg.org, interop@omg.org
X-uri: http://www.iona.com/~jmason/
Subject: Re: omg_issues: Issue about CDR encoding for wide characters 
Content-ID: <7427.890847778.1@gog.dublin.iona.ie>
Date: Wed, 25 Mar 1998 17:42:58 +0000
From: Justin Mason <jmason@iona.com>

Jonathan Biggar said:

>The idea, and I'm sure Bill Janssen would agree with me, is to make the
>CDR encoding parsable without having to know what encoding was used for
>wchar and wstring.  If you don't have this, then bridges and gateways
>become much more complicated, because they must know and track every
>possible transmission codeset.

>> > 2.  When encoding a wstring, always encode the length as the total
>> > number of bytes used by the encoded value, regardless of whether the
>> > encoding is byte-oriented or not.
>> 
>> Even though this sounds reasonable, it contradicts the adopted
>> specifications.
>
>Right, I am proposing this as a change to the standards to resolve this
>problem.  Perhaps if I changed my proposal to require wchar to be always
>encoded as 8 bytes?

I agree with you in the wstring case, otherwise bridges are in big trouble
(spoken as a bridge spokesperson ;). However I don't know if the wchar
should be limited to 8 bytes...

It strikes me that the best way to handle both wstring and wchar is to
marshal them with the length as the length of the wstring/wchar in octets,
as suggested earlier for the wstring case.
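
A rough sketch of that idea follows. The width of each length field is an
assumption made here for illustration (one octet for a wchar, a big-endian
unsigned long for a wstring); alignment padding and any terminator are
ignored, and the function names are hypothetical.

```python
import struct


def marshal_wchar(ch: str, codec: str) -> bytes:
    """Emit one character as <octet count><encoded octets>."""
    encoded = ch.encode(codec)
    if len(encoded) > 255:
        raise ValueError("encoded wchar does not fit a 1-octet length")
    return bytes([len(encoded)]) + encoded


def marshal_wstring(s: str, codec: str) -> bytes:
    """Emit a string as <unsigned long octet count><encoded octets>."""
    encoded = s.encode(codec)
    return struct.pack(">I", len(encoded)) + encoded


# The same character takes 2 octets in either codec here, and the
# prefix always reports octets, never characters.
wchar_utf8 = marshal_wchar("\u00e9", "utf-8")
wstring_utf16 = marshal_wstring("h\u00e9", "utf-16-be")
```

With both forms length-prefixed, no fixed upper bound such as 4 or 8 bytes
per wchar has to be chosen.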

--j.

Return-Path: <jon@floorboard.com>
Sender: jon@floorboard.com
Date: Wed, 25 Mar 1998 09:53:25 -0800
From: Jonathan Biggar <jon@floorboard.com>
To: Justin Mason <jmason@iona.com>
CC: "Martin v. Loewis" <loewis@informatik.hu-berlin.de>, jfh@hal.com,
        issues@omg.org, interop@omg.org
Subject: Re: omg_issues: Issue about CDR encoding for wide characters
References: <199803251751.RAA01386@dublin.iona.ie>

Justin Mason wrote:
> 
> Jonathan Biggar said:
> 
> >The idea, and I'm sure Bill Janssen would agree with me, is to make the
> >CDR encoding parsable without having to know what encoding was used for
> >wchar and wstring.  If you don't have this, then bridges and gateways
> >become much more complicated, because they must know and track every
> >possible transmission codeset.
> 
> >> > 2.  When encoding a wstring, always encode the length as the total
> >> > number of bytes used by the encoded value, regardless of whether the
> >> > encoding is byte-oriented or not.
> >>
> >> Even though this sounds reasonable, it contradicts the adopted
> >> specifications.
> >
> >Right, I am proposing this as a change to the standards to resolve this
> >problem.  Perhaps if I changed my proposal to require wchar to be always
> >encoded as 8 bytes?
> 
> I agree with you in the wstring case, otherwise bridges are in big trouble
> (spoken as a bridge spokesperson ;). However I don't know if the wchar
> should be limited to 8 bytes...
> 
> It strikes me that the best way to handle both wstring and wchar is to
> marshal them with the length as the length of the wstring/wchar in octets,
> as suggested earlier for the wstring case.

I could live with that.
-- 
Jon Biggar
Floorboard Software
jon@floorboard.com
jon@biggar.org

Return-Path: <janssen@parc.xerox.com>
Date: Wed, 25 Mar 1998 14:41:54 PST
Sender: Bill Janssen <janssen@parc.xerox.com>
From: Bill Janssen <janssen@parc.xerox.com>
To: jon@floorboard.com, "Martin v. Loewis"
<loewis@informatik.hu-berlin.de>
Subject: Re: omg_issues: Issue about CDR encoding for wide characters
CC: jfh@hal.com, interop@omg.org
References: <199803231932.LAA24237@kodiak.hal.com>
<3516BAF8.3E82FFD7@floorboard.com> <199803241056.LAA16737@pandora>
<3517EDBC.C0AA416C@floorboard.com>
	<199803251238.NAA22851@pandora>

Although I wonder if it's reasonable to call UTF-8 a codeset.  I think
of it rather as a roughly Huffman-coded string form of the 32-bit
Unicode codeset.

Bill

Return-Path: <janssen@parc.xerox.com>
Date: Wed, 25 Mar 1998 14:45:21 PST
Sender: Bill Janssen <janssen@parc.xerox.com>
From: Bill Janssen <janssen@parc.xerox.com>
To: "Martin v. Loewis" <loewis@informatik.hu-berlin.de>,
        Jonathan Biggar <jon@floorboard.com>
Subject: Re: omg_issues: Issue about CDR encoding for wide characters
CC: jfh@hal.com, issues@omg.org, interop@omg.org
References: <199803231932.LAA24237@kodiak.hal.com>
<3516BAF8.3E82FFD7@floorboard.com> <199803241056.LAA16737@pandora>
<3517EDBC.C0AA416C@floorboard.com> <199803251238.NAA22851@pandora>
	<35193C6E.DA245FBD@floorboard.com>

Excerpts from local.omg: 25-Mar-98 Re: omg_issues: Issue about..
Jonathan Biggar@floorboa (1761*)

> > > 2.  When encoding a wstring, always encode the length as the total
> > > number of bytes used by the encoded value, regardless of whether the
> > > encoding is byte-oriented or not.

Probably to be consistent, we should do this for both wstring and
string.  Not a real change for string, due to the existing restrictions
on character sets for string types, but a change in semantics.

I wonder if we should think about removing the distinction between
string and wstring?

Bill

Return-Path: <loewis@hp832.informatik.hu-berlin.de>
Date: Thu, 26 Mar 1998 13:44:27 +0100
From: "Martin v. Loewis" <loewis@informatik.hu-berlin.de>
To: janssen@parc.xerox.com
CC: interop@omg.org
Subject: Re: omg_issues: Issue about CDR encoding for wide characters
References: <199803231932.LAA24237@kodiak.hal.com>
<3516BAF8.3E82FFD7@floorboard.com> <199803241056.LAA16737@pandora>
<3517EDBC.C0AA416C@floorboard.com>
	<199803251238.NAA22851@pandora>
<0p6MUmUB0KGW42whsn@holmes.parc.xerox.com>

> Although I wonder if it's reasonable to call UTF-8 a codeset.  I think
> of it rather as a roughly Huffman-coded string form of the 32-bit
> Unicode codeset.

When they talk about code sets in CORBA, they really mean encodings,
or coded character sets, as explained in 11.7.

Martin