Issue 1111: Wchar and Char can both be multibyte (interop)
Source: (, )
Nature: Uncategorized Issue
Severity:
Summary: With regard to the character encoding issue. This is not just a problem with Wchar and Wstring; it can happen with char and string as well. Multibyte encodings are used for normal strings in the enhanced IDL as well as for Wchars. I think the problem is in the definition of char. For interop, the char type could be either: 1) restricted to a syntactic thing which can only be used with single byte encodings, or 2) we need to bite the bullet and make the encoding for char a sequence of bytes if a multibyte encoding is chosen.
Resolution:
Revised Text:
Actions taken:
March 25, 1998: received issue
June 25, 1998: closed issue
Discussion:
End of Annotations:=====

Return-Path:
Date: Wed, 25 Mar 1998 17:48:05 -0500
From: ter@holta.ho.lucent.com (T E Rutt)
To: issues@omg.org, interop@omg.org
Subject: Wchar and Char can both be multibyte

With regard to the character encoding issue.

This is not just a problem with Wchar and Wstring; it can happen with char and string as well. Multibyte encodings are used for normal strings in the enhanced IDL as well as for Wchars.

I think the problem is in the definition of char. The CDR can be interpreted to be dealing with sequences of bytes encoded using a character encoding. A character would sometimes be more than one byte.

We should make it so the orb just passes the bytes sent from the application. The character encoding should not be done by the ORB.

In a Wstring or string which has negotiated a multibyte encoding, as opposed to a fixed size encoding (i.e. Unicode as 16 bit things), we need to clarify the encoding of CDR to be "byte" counts, not character counts.

In fact a single char for multibyte character sets probably will never be sent outside of a string.
Thus I have always thought of a single Char (like Wchar) as being a symbolic semantic thing which sometimes needs more than one byte to represent in countries with more than 256 symbols in their alphabet.

For interop, the char type could be either:
1) restricted to a syntactic thing which can only be used with single byte encodings, or
2) we need to bite the bullet and make the encoding for char a sequence of bytes if a multibyte encoding is chosen. The orb knows the result of the negotiation, so this choice is easy to detect. However, the ORB would use single byte encoding of Char only if Latin1 or other single byte encodings are negotiated (i.e. most likely by default case of no service context data for encoding info).

With strings, I see no problem in counting bytes for the encoding. The negotiated string encoding would be only used by the ultimate user of the international string; the orb does not have to know what the encoding is.

Tom Rutt

Return-Path:
Date: Wed, 25 Mar 1998 14:56:25 PST
Sender: Bill Janssen
From: Bill Janssen
To: issues@omg.org, interop@omg.org, ter@holta.ho.lucent.com (T E Rutt)
Subject: Re: Wchar and Char can both be multibyte
References: <199803252248.RAA28816@holta.ho.lucent.com>

Excerpts from local.omg: 25-Mar-98 Wchar and Char can both be .. T E Rutt@holta.ho.lucent (1864*)

> For interop, the char type could be either:
> 1) restricted to a syntactic thing which can only be used with single
> byte encodings, or
> 2) we need to bite the bullet and make the encoding for char a sequence
> of bytes if a multibyte encoding is chosen. The orb knows the result of
> the negotiation, so this choice is easy to detect. However, the ORB would
> use single byte encoding of Char only if Latin1 or other single byte
> encodings are negotiated (i.e. most likely by default case of no service
> context data for encoding info).

Actually, I've come to believe that the presence of `char' and `wchar' are mistakes in the IDL type system.
I believe that both should be removed, leaving only `string' and `wstring', which are what's really needed in this space. The roles played by `char' and `wchar' are sufficiently supported by the presence of enumerations and integers. This would also help to remove silly issues like, ``what's the marshalled form of the typecode for a char-discriminated union when a different charset is indicated?''.

Bill

Return-Path:
Date: Thu, 26 Mar 1998 15:20:14 +0100
From: "Martin v. Loewis"
To: ter@holta.ho.lucent.com
CC: issues@omg.org, interop@omg.org
Subject: Re: Wchar and Char can both be multibyte
References: <199803252248.RAA28816@holta.ho.lucent.com>

> With regard to the character encoding issue.
>
> This is not just a problem with Wchar, and Wstring, it can happen
> with char and string as well. Multibyte encodings are used for
> normal strings in the enhanced IDL as well as for
> Wchars.

This sounds like it needs even further clarification. The spec suggests that selecting a non-byte-oriented character set for TCS-C is possible. What is then the encoding of characters in the various GIOP headers? For GIOP::MessageHeader_1_0::magic, it clearly says ISO-8859-1 (regardless of TCS-C). For GIOP::RequestHeader::operation, no such statement is made.

Martin

Return-Path:
Sender: jon@floorboard.com
Date: Thu, 26 Mar 1998 08:32:16 -0800
From: Jonathan Biggar
To: "Martin v. Loewis"
CC: ter@holta.ho.lucent.com, issues@omg.org, interop@omg.org
Subject: Re: Wchar and Char can both be multibyte
References: <199803252248.RAA28816@holta.ho.lucent.com> <199803261420.PAA11513@pandora>

Martin v. Loewis wrote:
>
> > With regard to the character encoding issue.
> >
> > This is not just a problem with Wchar, and Wstring, it can happen
> > with char and string as well. Multibyte encodings are used for
> > normal strings in the enhanced IDL as well as for
> > Wchars.
>
> This sounds like it needs even further clarification.
> The spec suggests that selecting a non-byte-oriented character set for
> TCS-C is possible. What is then the encoding of characters in the various
> GIOP headers? For GIOP::MessageHeader_1_0::magic, it clearly says
> ISO-8859-1 (regardless of TCS-C). For GIOP::RequestHeader::operation,
> no such statement is made.

Ick. No one ordered this! The spec should be clear that operation is ISO-8859-1.

--
Jon Biggar
Floorboard Software
jon@floorboard.com
jon@biggar.org
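
The byte-count versus character-count distinction raised in this thread can be illustrated with a minimal Python sketch. It assumes the usual CDR layout for a string (an unsigned long length that includes the terminating null, followed by the bytes) and ignores alignment padding. The helper names `marshal_string` and `unmarshal_string` are hypothetical and not taken from any ORB, and UTF-8 merely stands in for whatever multibyte code set has been negotiated.

```python
import struct


def marshal_string(text, encoding, big_endian=True):
    """Hypothetical helper: encode text as length-prefixed bytes.

    The length is a BYTE count (including the trailing null),
    not a character count. Alignment padding is ignored here.
    """
    payload = text.encode(encoding) + b"\x00"
    fmt = ">I" if big_endian else "<I"
    return struct.pack(fmt, len(payload)) + payload


def unmarshal_string(data, encoding, big_endian=True):
    """Hypothetical helper: read the byte count, then decode the payload."""
    fmt = ">I" if big_endian else "<I"
    (length,) = struct.unpack_from(fmt, data, 0)
    payload = data[4:4 + length]
    return payload[:-1].decode(encoding)  # drop the trailing null


# Three characters, each needing three bytes in UTF-8.
text = "\u65e5\u672c\u8a9e"
multi = marshal_string(text, "utf-8")
single = marshal_string("abc", "iso-8859-1")

print(len(text))                         # number of characters
print(struct.unpack(">I", multi[:4])[0])   # length field for the multibyte case
print(struct.unpack(">I", single[:4])[0])  # length field for the Latin-1 case
```

With a single byte encoding the length field equals the character count plus one, so the two interpretations are indistinguishable. With a multibyte encoding they diverge, and a receiver that does not know the encoding can still skip or forward the string only if the field is a byte count.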