Issue 1111:  Wchar and Char can both be multibyte (interop)
Source:  (, )
Nature:  Uncategorized Issue
Severity: 
Summary: With regard to the character encoding issue.
 
 This is not just a problem with Wchar and Wstring; it can happen
 with char and string as well.  Multibyte encodings are used for
 normal strings in the enhanced IDL as well as for Wchars.
 I think the problem is in the definition of char. For interop, the char type could be either:
 1) restricted to a syntactic thing which can only be used with single byte encodings, or
 2) encoded as a sequence of bytes if a multibyte encoding is chosen
 (we need to bite the bullet here).
 

Resolution: 
Revised Text: 
Actions taken:
March 25, 1998: received issue
June 25, 1998: closed issue
Discussion: 
 
End of Annotations:=====
Return-Path: <ter@holta.ho.lucent.com>
Date: Wed, 25 Mar 1998 17:48:05 -0500
From: ter@holta.ho.lucent.com (T E Rutt)
To: issues@omg.org, interop@omg.org
Subject: Wchar and Char can both be multibyte

With regard to the character encoding issue.

This is not just a problem with Wchar and Wstring; it can happen
with char and string as well.  Multibyte encodings are used for
normal strings in the enhanced IDL as well as for Wchars.

I think the problem is in the definition of char.  The CDR can be
interpreted to be dealing with sequences of bytes encoded using a
character encoding.  A character would sometimes be more than one
byte.  We should make it so the ORB just passes the bytes sent from
the application.  The character encoding should not be done by the
ORB.

In a Wstring or string which has negotiated a multibyte encoding, as
opposed to a fixed size encoding (e.g. Unicode as 16 bit things), we
need to clarify that the counts in the CDR encoding are "byte" counts,
not character counts.
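
The difference between the two counts can be illustrated with a short
Python sketch.  It assumes the CDR string layout of a 4-byte unsigned
length (counting a terminating null octet) followed by the encoded
octets; the helper name marshal_string and the choice of Shift_JIS as
the negotiated encoding are hypothetical and only serve the example.

```python
import struct


def marshal_string(text, codec, little_endian=False):
    """Frame text as a length-prefixed octet sequence.

    Hypothetical helper: the length is a BYTE count (including one
    terminating null octet), not a character count.
    """
    payload = text.encode(codec) + b"\x00"
    fmt = "<I" if little_endian else ">I"
    return struct.pack(fmt, len(payload)) + payload


text = "日本語"                              # three characters
encoded = text.encode("shift_jis")           # two bytes per character here
wire = marshal_string(text, "shift_jis")
length_field = struct.unpack(">I", wire[:4])[0]

print(len(text), len(encoded), length_field)
```

A receiver that treated the length field as a character count would
read too few octets for this string, which is why the count has to be
defined in bytes when a multibyte encoding is negotiated.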

In fact a single char for multibyte character sets probably will never
be sent outside of a string.

Thus I have always thought of a single Char (like Wchar) as being
a symbolic semantic thing which sometimes needs more than one byte to
represent, as in countries with more than 256 symbols in their alphabet.

For interop, the char type could be either:
1) restricted to a syntactic thing which can only be used with single
byte encodings, or
2) encoded as a sequence of bytes if a multibyte encoding is chosen
(we need to bite the bullet here).  The ORB knows the result of the
negotiation, so this choice is easy to detect.  However, the ORB would
use a single byte encoding of Char only if Latin1 or another single
byte encoding is negotiated (i.e. most likely the default case of no
service context data for encoding info).
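
The two options can be contrasted in a minimal Python sketch.  Both
function names are hypothetical, and the one-octet length prefix used
for option 2 is only an assumption made so the receiver can find the
end of the character; the message above does not specify a framing.

```python
def marshal_char_option1(ch, codec):
    """Option 1: char is legal only if it encodes to exactly one byte."""
    data = ch.encode(codec)
    if len(data) != 1:
        raise ValueError(
            f"{ch!r} needs {len(data)} bytes in {codec}; "
            "char is restricted to single byte encodings"
        )
    return data


def marshal_char_option2(ch, codec):
    """Option 2: char is a byte sequence.

    Assumed framing: one length octet followed by the encoded bytes.
    """
    data = ch.encode(codec)
    return bytes([len(data)]) + data


# Latin1 (the likely default when nothing is negotiated): one byte.
latin1_char = marshal_char_option1("é", "latin-1")

# A multibyte encoding: option 1 rejects the char, option 2 frames it.
try:
    marshal_char_option1("語", "shift_jis")
    rejected = False
except ValueError:
    rejected = True

framed = marshal_char_option2("語", "shift_jis")
print(latin1_char, rejected, framed)
```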


With strings, I see no problem in counting bytes for the encoding.
The negotiated string encoding would only be used by the ultimate user
of the international string; the ORB does not have to know what the
encoding is.

Tom Rutt


Return-Path: <janssen@parc.xerox.com>
Date: Wed, 25 Mar 1998 14:56:25 PST
Sender: Bill Janssen <janssen@parc.xerox.com>
From: Bill Janssen <janssen@parc.xerox.com>
To: issues@omg.org, interop@omg.org, ter@holta.ho.lucent.com (T E Rutt)
Subject: Re: Wchar and Char can both be multibyte
References: <199803252248.RAA28816@holta.ho.lucent.com>

Excerpts from local.omg: 25-Mar-98 Wchar and Char can both be .. T E
Rutt@holta.ho.lucent (1864*)

> For interop, the char type could be either:
> 1) restricted to a syntactic thing which can only be used with
> single byte encodings, or
> 2) we need to bite the bullet and make the encoding for char a
> sequence of bytes if a multibyte encoding is chosen.  The orb knows
> the result of the negotiation, so this choice is easy to detect.
> However the ORB would use single byte encoding of Char only if
> Latin1 or other single byte encodings are negotiated (i.e. most
> likely by default case of no service context data for encoding
> info).

Actually, I've come to believe that the presence of `char' and `wchar'
in the IDL type system is a mistake.  I believe that both should be
removed, leaving only `string' and `wstring', which are what's really
needed in this space.  The roles played by `char' and `wchar' are
sufficiently supported by the presence of enumerations and integers.
This would also help to remove silly issues like, ``what's the
marshalled form of the typecode for a char-discriminated union when a
different charset is indicated?''.

Bill

Return-Path: <loewis@hp832.informatik.hu-berlin.de>
Date: Thu, 26 Mar 1998 15:20:14 +0100
From: "Martin v. Loewis" <loewis@informatik.hu-berlin.de>
To: ter@holta.ho.lucent.com
CC: issues@omg.org, interop@omg.org
Subject: Re: Wchar and Char can both be multibyte
References:  <199803252248.RAA28816@holta.ho.lucent.com>

> With regard to the character encoding issue.
>
> This is not just a problem with Wchar, and Wstring, it can happen
> with char and string as well.  Multibyte encodings are used for
> normal strings in the enhanced IDL as well as for Wchars.

This sounds like it needs even further clarification. The spec
suggests that selecting a non-byte-oriented character set for TCS-C is
possible. What is then the encoding of characters in the various GIOP
headers? For GIOP::MessageHeader_1_0::magic, it clearly says
ISO-8859-1 (regardless of TCS-C). For GIOP::RequestHeader::operation,
no such statement is made.
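
A small Python sketch shows why the missing statement matters.  The
operation name get_name and the helper encode_operation are
hypothetical, and UTF-16 stands in for a non-byte-oriented TCS-C
purely as an assumption for the example.

```python
# The magic field is stated to be ISO-8859-1, so its bytes are fixed.
GIOP_MAGIC = "GIOP".encode("iso-8859-1")


def encode_operation(name, codec="iso-8859-1"):
    """Hypothetical helper: encode an operation name with a given codec."""
    return name.encode(codec)


op = "get_name"
as_latin1 = encode_operation(op)              # one byte per character
as_utf16 = encode_operation(op, "utf-16-be")  # two bytes per character

# Same name, different octets on the wire depending on the reading.
print(GIOP_MAGIC, as_latin1, as_utf16)
```

Two ORBs that read the spec differently would disagree on both the
length and the content of the operation field.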

Martin


Return-Path: <jon@floorboard.com>
Sender: jon@floorboard.com
Date: Thu, 26 Mar 1998 08:32:16 -0800
From: Jonathan Biggar <jon@floorboard.com>
To: "Martin v. Loewis" <loewis@informatik.hu-berlin.de>
CC: ter@holta.ho.lucent.com, issues@omg.org, interop@omg.org
Subject: Re: Wchar and Char can both be multibyte
References: <199803252248.RAA28816@holta.ho.lucent.com>
<199803261420.PAA11513@pandora>

Martin v. Loewis wrote:
> 
> > With regard to the character encoding issue.
> >
> > This is not just a problem with Wchar, and Wstring, it can happen
> > with char and string as well.  Multibyte encodings are used for
> > normal strings in the enhanced IDL as well as for
> > Wchars.
> 
> This sounds like it needs even further clarification. The spec
> suggests that selecting a non-byte-oriented character set for TCS-C
> is possible. What is then the encoding of characters in the various
> GIOP headers? For GIOP::MessageHeader_1_0::magic, it clearly says
> ISO-8859-1 (regardless of TCS-C). For GIOP::RequestHeader::operation,
> no such statement is made.

Ick.  No one ordered this!  The spec should be clear that operation is
ISO-8859-1.

-- 
Jon Biggar
Floorboard Software
jon@floorboard.com
jon@biggar.org