Issue 4008: wchar endianness (corba-rtf)

Source: AT&T (Dr. Duncan Grisby, )
Nature: Uncategorized Issue
Severity:
Summary:
In a similar vein to Vishy's question about alignment, what should the endianness of a word-oriented wchar be? This applies both to single wchars, and the separate code points in a wstring. With the 2.3 spec, it seemed quite obvious to me that word-oriented wide characters should have the same endianness as the rest of the stream. After all, they are no different from any other word-oriented type.

However, with the new 2.4 spec, there is now a bizarre section saying that if, and only if, the TCS-W is UTF-16, all wchar values are marshalled big-endian unless there is a byte-order-mark telling you otherwise. I don't understand the point of this. Section 2.7 of the Unicode Standard, version 3.0 says [emphasis mine]:

"Data streams that begin with U+FEFF byte order mark are likely to contain Unicode values. It is recommended that applications sending or receiving _untyped_ data streams of coded characters use this signature. _If other signaling methods are used, signatures should not be employed._"

It seems quite clear to me that a GIOP stream is a _typed_ data stream which uses its own signalling methods. The Unicode standard therefore says that a BOM should _not_ be used.

I guess it's too late to clean up the UTF-16 encoding, but what about other word-oriented code sets? What if the end-points have negotiated the use of UCS-4? Should that be big-endian unless there's a BOM? The spec doesn't say.

Even worse, what if the negotiated encoding is something like Big5? That doesn't _have_ byte order marks. Big5 doesn't have a one-to-one Unicode mapping, so it's not sensible to always translate to UTF-16.

GIOP already has a perfectly good mechanism for sorting out this kind of issue. Please can wchar be considered on equal footing with all other types, and use the stream's endianness?

Resolution: see above
Revised Text: In orbrev/02-02-01
1.
Delete the following text from the second bullet item in 15.3.1.6:
"For example, if the TCS-W is ISO 10646 UCS-2 (Universal Character Set containing 2 bytes), then wide characters are represented as unsigned shorts. For ISO 10646 UCS-4, they are represented as unsigned longs."
2. Add the following paragraphs at end of 15.3.1.6:
"For GIOP 1.1, 1.2 and 1.3, UCS-2 and UCS-4 should be encoded using the endianness of the GIOP message, for backward compatibility. For GIOP 1.4, the byte order rules for UCS-2 and UCS-4 are the same as for UTF-16. UTF-16LE and UTF-16BE, from IANA codeset registry, have their own endianness definition. Thus these should be encoded using the endianness specified by their endianness definition."

Actions taken:
October 31, 2000: received issue
March 7, 2002: moved to Core RTF
December 11, 2002: closed issue

Discussion: resolved in previous RTF (issue 3405b).
Resolution: UCS-2 and UCS-4 are native codesets, and in newer Unicode Forum versions of the standard, they are not intended for transfer syntaxes. For backward compatibility with Java ORB, UCS-2 and UCS-4 will be treated like Integer, i.e., marshaling shall use the endianness of the message. For GIOP 1.4 onward, we should have the same interpretation for UCS-2 as with UTF.

End of Annotations:=====

To: interop@omg.org
Subject: wchar endianness
From: Duncan Grisby
Date: Tue, 31 Oct 2000 10:09:14 +0000
Sender: dpg1@uk.research.att.com

In a similar vein to Vishy's question about alignment, what should the endianness of a word-oriented wchar be? This applies both to single wchars, and the separate code points in a wstring. With the 2.3 spec, it seemed quite obvious to me that word-oriented wide characters should have the same endianness as the rest of the stream. After all, they are no different from any other word-oriented type.

However, with the new 2.4 spec, there is now a bizarre section saying that if, and only if, the TCS-W is UTF-16, all wchar values are marshalled big-endian unless there is a byte-order-mark telling you otherwise. I don't understand the point of this. Section 2.7 of the Unicode Standard, version 3.0 says [emphasis mine]:

"Data streams that begin with U+FEFF byte order mark are likely to contain Unicode values. It is recommended that applications sending or receiving _untyped_ data streams of coded characters use this signature. _If other signaling methods are used, signatures should not be employed._"

It seems quite clear to me that a GIOP stream is a _typed_ data stream which uses its own signalling methods. The Unicode standard therefore says that a BOM should _not_ be used.

I guess it's too late to clean up the UTF-16 encoding, but what about other word-oriented code sets? What if the end-points have negotiated the use of UCS-4? Should that be big-endian unless there's a BOM? The spec doesn't say.

Even worse, what if the negotiated encoding is something like Big5? That doesn't _have_ byte order marks. Big5 doesn't have a one-to-one Unicode mapping, so it's not sensible to always translate to UTF-16.

GIOP already has a perfectly good mechanism for sorting out this kind of issue. Please can wchar be considered on equal footing with all other types, and use the stream's endianness?

Cheers,

Duncan.

--
 -- Duncan Grisby \ Research Engineer --
 -- AT&T Laboratories Cambridge --
 -- http://www.uk.research.att.com/~dpg1 --

From: "Rutt, T E (Tom)"
To: interop@omg.org, "'Duncan Grisby'"
Subject: RE: wchar endianness
Date: Tue, 31 Oct 2000 09:58:54 -0500

The problem is that we point to the OSF code registry, which does not allow us to interpret UNICODE differently than the spec.

The byte order mark is optional in UTF-16.
This interpretation was given us by senior members of the Unicode forum. Do you disagree with their interpretation?

Tom Rutt
Chair interop rtf
ter

To: interop@omg.org
Subject: Re: wchar endianness
In-Reply-To: Message from "Rutt, T E (Tom)" of "Tue, 31 Oct 2000 09:58:54 EST."
From: Duncan Grisby
Date: Tue, 31 Oct 2000 15:40:05 +0000
Sender: dpg1@uk.research.att.com

On Tuesday 31 October, "Rutt, T E (Tom)" wrote:

> The problem is that we point to the OSF code registry, which does not allow
> Us to interpret UNICODE differently than the spec.

I'm not suggesting that anyone interprets Unicode differently from the spec. As far as I can see, the code set registry is just a standard way of indicating which character set we're talking about.

> the Byte order mark is optional in utf-16. This interpretation was given us
> by senior members of the Unicode forum. Do you disagree with their
> interpretation?

The passage from the Unicode specification I quoted says that the BOM is optional, but recommended, unless "other signaling methods are used", in which case a BOM "should not be employed". I would consider GIOP to be another "signaling method". Then there is section 3.1, "Conformance Requirements":

  "C3 A process shall interpret a Unicode value that has been serialized into a sequence of bytes by most significant byte first, _in the absence of higher-level protocols_."

Surely GIOP is a "higher-level protocol" which tells us the byte order. Obviously the senior members of the Unicode forum disagree, or they were asked the question out of context. Using a BOM adds nothing at all to GIOP, and just makes the marshalling code more complex than necessary.

Anyway, the CORBA spec says UTF-16 uses a BOM, so that's that. What about all the other word-based code sets?

Cheers,

Duncan.

From: debasish@us.ibm.com
Subject: Re: wchar endianness
To: interop@omg.org, Duncan Grisby
Date: Tue, 31 Oct 2000 10:55:57 -0600

I fully agree with Duncan -- the OMG byte order indicator should be sufficient to determine the endian-ness of the marshalled data. But apparently Unicode experts are telling us to ignore the clause "in the absence of higher-level protocols" while reading the Unicode spec. To me, the OMG byte order indicator indeed defines a higher level protocol.

After some lengthy discussions I came to the following understanding.

If the encoding is non-byte (integer) oriented (UCS-2, UCS-4)
   the OMG byte order indicator will be used to specify the endian-ness of the marshalled data.

If the encoding is byte oriented and endian-sensitive (UTF-16, UTF-32)
   the OMG byte order indicator will be ignored in determining the endian-ness
   No BOM will imply BIG-ENDIAN

If the encoding is byte oriented but endian-insensitive (UTF-8)
   OMG byte order indicator will determine the endian-ness.
   Any BOM (EF BB BF in UTF-8) should be interpreted as a 'zero-width no break space' character.

UTF-16 should not get any special status. OMG spec. should generalize statements dealing with UTF-16.

It appears to me, that the entire CORBA spec. dealing with codeset negotiation is somewhat informal and imprecise; it contains lots of inconsistencies and inaccuracies.
Time permitting, I will prepare a list of correction items to be incorporated in the OMG codeset negotiation spec.

Thanks,
Deb

Sender: jon@corvette.floorboard.com
Message-ID: <39FF03BD.277A6EEB@floorboard.com>
Date: Tue, 31 Oct 2000 09:39:09 -0800
From: Jonathan Biggar
To: Duncan Grisby
CC: interop@omg.org
Subject: Re: wchar endianness
References: <200010311540.PAA03021@pineapple.uk.research.att.com>

Duncan Grisby wrote:
> I'm not suggesting that anyone interprets Unicode differently from the
> spec. As far as I can see, the code set registry is just a standard
> way of indicating which character set we're talking about.

Naming the code set also specifies the semantics of how it is transmitted. The opinion we got from the Unicode folks is that UTF-16 is always big-endian, unless preceded by a BOM, in which case the BOM takes precedence.

> The passage from the Unicode specification I quoted says that the BOM
> is optional, but recommended, unless "other signaling methods are
> used", in which case a BOM "should not be employed". I would consider
> GIOP to be another "signaling method". Then there is section 3.1,
> "Conformance Requirements":
>
> "C3 A process shall interpret a Unicode value that has been
> serialized into a sequence of bytes by most significant byte
> first, _in the absence of higher-level protocols_."
>
> Surely GIOP is a "higher-level protocol" which tells us the byte
> order. Obviously the senior members of the Unicode forum disagree, or
> they were asked the question out of context. Using a BOM adds nothing
> at all to GIOP, and just makes the marshalling code more complex than
> necessary.

I disagree. Using a BOM allows the implementation to pass through a string that is not encoded in native byte-order without having to swap the data.
This can be a big win for bridges or the event or notification services.

> Anyway, the CORBA spec says UTF-16 uses a BOM, so that's that. What
> about all the other word-based code sets?

This is a good point, and we ought to make all word-based Unicode sets work the same way.

--
Jon Biggar
Floorboard Software
jon@floorboard.com
jon@biggar.org

To: Duncan Grisby
cc: interop@omg.org
Subject: Re: wchar endianness
In-Reply-To: Your message of "Tue, 31 Oct 2000 07:40:05 PST." <200010311540.PAA03021@pineapple.uk.research.att.com>
From: Bill Janssen
Message-Id: <00Oct31.094349pst."3448"@watson.parc.xerox.com>
Date: Tue, 31 Oct 2000 09:43:44 PST

Duncan, we sorted through all this about six months ago. I don't believe it should be re-opened.

Bill

To: Bill Janssen
cc: Duncan Grisby, interop@omg.org
Subject: Re: wchar endianness
In-Reply-To: Message from Bill Janssen of "Tue, 31 Oct 2000 09:43:44 PST." <00Oct31.094349pst."3448"@watson.parc.xerox.com>
From: Duncan Grisby
Date: Tue, 31 Oct 2000 17:53:12 +0000
Sender: dpg1@uk.research.att.com

On Tuesday 31 October, Bill Janssen wrote:

> Duncan, we sorted through all this about six months ago. I don't believe
> it should be re-opened.

So what was the answer with regard to code sets other than UTF-16? And why isn't it in the 2.4 spec?

Cheers,

Duncan.

From: "Rutt, T E (Tom)"
To: issues@emerald.omg.org, interop@omg.org, "'Juergen Boldt'"
Subject: RE: issue 4008 -- Interop RTF issue
Date: Tue, 31 Oct 2000 17:58:10 -0500

wchar endianness

In a similar vein to Vishy's question about alignment, what should the endianness of a word-oriented wchar be?
This applies both to single wchars, and the separate code points in a wstring. With the 2.3 spec, it seemed quite obvious to me that word-oriented wide characters should have the same endianness as the rest of the stream. After all, they are no different from any other word-oriented type.

However, with the new 2.4 spec, there is now a bizarre section saying that if, and only if, the TCS-W is UTF-16, all wchar values are marshalled big-endian unless there is a byte-order-mark telling you otherwise. I don't understand the point of this. Section 2.7 of the Unicode Standard, version 3.0 says [emphasis mine]:

"Data streams that begin with U+FEFF byte order mark are likely to contain Unicode values. It is recommended that applications sending or receiving _untyped_ data streams of coded characters use this signature. _If other signaling methods are used, signatures should not be employed._"

It seems quite clear to me that a GIOP stream is a _typed_ data stream which uses its own signalling methods. The Unicode standard therefore says that a BOM should _not_ be used.

I guess it's too late to clean up the UTF-16 encoding, but what about other word-oriented code sets? What if the end-points have negotiated the use of UCS-4? Should that be big-endian unless there's a BOM? The spec doesn't say.

Even worse, what if the negotiated encoding is something like Big5? That doesn't _have_ byte order marks. Big5 doesn't have a one-to-one Unicode mapping, so it's not sensible to always translate to UTF-16.

GIOP already has a perfectly good mechanism for sorting out this kind of issue. Please can wchar be considered on equal footing with all other types, and use the stream's endianness?

================================================================
Juergen Boldt
Senior Member of Technical Staff
Object Management Group        Tel. +1-781 444 0404 ext. 132
250 First Avenue, Suite 201    Fax: +1-781 444 0320
Needham, MA 02494, USA         Email: juergen@omg.org
================================================================

From: "Rutt, T E (Tom)"
To: interop@omg.org, "'Duncan Grisby'"
Subject: RE: wchar endianness
Date: Tue, 31 Oct 2000 17:50:07 -0500

What makes you think that all Unicode strings sent by an ORB will adhere to the same endianness? They may have originated from elsewhere.

Since the OSF codeset registry is straight UTF-16, a Unicode string without a BOM must be treated as most significant byte first.

ter

To: interop@omg.org
Subject: Re: wchar endianness
In-Reply-To: Message from Jonathan Biggar of "Tue, 31 Oct 2000 09:39:09 PST." <39FF03BD.277A6EEB@floorboard.com>
From: Duncan Grisby
Date: Wed, 01 Nov 2000 11:07:43 +0000
Sender: dpg1@uk.research.att.com

On Tuesday 31 October, Jonathan Biggar wrote:

> I disagree. Using a BOM allows the implementation to pass through a
> string that is not encoded in native byte-order without having to swap
> the data. This can be a big win for bridges or the event or
> notification services.

Of course, the same argument can be applied to all the other CORBA data types. A bridge or whatever still has to byte-swap all the rest of the data, including sequences of complex structs and unions which are far more effort. Why should wstrings be singled out for special treatment?

If an application has a UTF-16 wstring in its non-native byte order and wishes to transmit it, how does it tell the ORB about the byte order? It can't use a BOM. Section 15.3.1.6 of CORBA 2.4 says:

  "If the ORB decides to use BOM to indicate endianness, it shall add the BOM to the beginning of wchar or wstring values when encoding the value, _since it is not present in wchar or wstring values passed by the user_."

> > Anyway, the CORBA spec says UTF-16 uses a BOM, so that's that. What
> > about all the other word-based code sets?
>
> This is a good point, and we ought to make all word-based Unicode sets
> work the same way.

So in my implementation, should I assume that _all_ Unicode based code sets (UCS-2, UCS-4, etc.) use a BOM? And that non-Unicode code sets use the stream's endianness?

Cheers,

Duncan.

To: interop@omg.org, issues@omg.org
Subject: Re: issue 4008 -- Interop RTF issue
In-Reply-To: Message from Juergen Boldt of "Wed, 01 Nov 2000 10:31:09 EST." <4.2.0.58.20001101103052.00a9ebf0@emerald.omg.org>
From: Duncan Grisby
Date: Wed, 01 Nov 2000 16:44:11 +0000
Sender: dpg1@uk.research.att.com

On Wednesday 1 November, Juergen Boldt wrote:

> At 05:58 PM 10/31/00 -0500, Rutt, T E (Tom) wrote:
> >This issue was raised and closed by the last RTF.
> >
> >Unless there is new information, I do not see why it should be raised again.
> >
> >Please review the prior issue mailing dialogue: Issue 3405 b
>
> issue 4008 has been closed...

Oh for goodness sake! How can it possibly be closed? Nobody has come up with an answer about code sets other than UTF-16.

Forget UTF-16. The 2.4 spec has a perfectly usable definition as to how UTF-16 should be transmitted. My issue is that the spec says NOTHING AT ALL about the endianness of code sets other than UTF-16. It would seem quite obvious that word-based code sets should use the stream's endianness, if it wasn't for the UTF-16 special-case.

Does anyone actually bother to support code sets other than UTF-16? If so, what do they do?

Duncan.

From: "Rutt, T E (Tom)"
To: interop@omg.org, "'Duncan Grisby'"
Subject: RE: issue 4008 -- Interop RTF issue
Date: Wed, 1 Nov 2000 14:01:25 -0500

Thank you for this clarification.

What other "word oriented" codings are in the OSF codeset registry? These are the ones we need be concerned about.

Tom Rutt
ter

Sender: jbiggar@corvette.floorboard.com
Message-ID: <3A0098F5.E7FF6631@floorboard.com>
Date: Wed, 01 Nov 2000 14:28:05 -0800
From: Jonathan Biggar
To: Duncan Grisby
CC: interop@omg.org
Subject: Re: wchar endianness
References: <200011011107.LAA08770@pineapple.uk.research.att.com>

Duncan Grisby wrote:
>
> On Tuesday 31 October, Jonathan Biggar wrote:
>
> > I disagree. Using a BOM allows the implementation to pass through a
> > string that is not encoded in native byte-order without having to swap
> > the data. This can be a big win for bridges or the event or
> > notification services.
>
> Of course, the same argument can be applied to all the other CORBA
> data types. A bridge or whatever still has to byte-swap all the rest
> of the data, including sequences of complex structs and unions which
> are far more effort. Why should wstrings be singled out for special
> treatment?

Because wstrings are often long and are very common.
If a bridge wants to really avoid byte-swapping other data, it can always issue the GIOP message using a non-native byte-order anyway.

To: Duncan Grisby
cc: interop@omg.org, issues@omg.org
Subject: Re: issue 4008 -- Interop RTF issue
In-Reply-To: Your message of "Wed, 01 Nov 2000 08:44:11 PST." <200011011644.QAA19449@pineapple.uk.research.att.com>
From: Bill Janssen
Message-Id: <00Nov1.170222pst."3448"@watson.parc.xerox.com>
Date: Wed, 1 Nov 2000 17:02:16 PST

A "code set" is supposed to specify the endianness in its definition.

Bill

Date: Thu, 02 Nov 2000 00:45:59 +0000
From: Simon Nash
Organization: IBM
To: Duncan Grisby
CC: interop@omg.org
Subject: Re: wchar endianness
References: <200010311009.KAA27938@pineapple.uk.research.att.com>

Duncan,

Duncan Grisby wrote:
>
> In a similar vein to Vishy's question about alignment, what should the
> endianness of a word-oriented wchar be? This applies both to single
> wchars, and the separate code points in a wstring. With the 2.3 spec,
> it seemed quite obvious to me that word-oriented wide characters
> should have the same endianness as the rest of the stream. After all,
> they are no different from any other word-oriented type.
>
> However, with the new 2.4 spec, there is now a bizarre section saying
> that if, and only if, the TCS-W is UTF-16, all wchar values are
> marshalled big-endian unless there is a byte-order-mark telling you
> otherwise. I don't understand the point of this. Section 2.7 of the
> Unicode Standard, version 3.0 says [emphasis mine]:
>
> "Data streams that begin with U+FEFF byte order mark are likely to
> contain Unicode values. It is recommended that applications sending
> or receiving _untyped_ data streams of coded characters use this
> signature. _If other signaling methods are used, signatures should
> not be employed._"
>
> It seems quite clear to me that a GIOP stream is a _typed_ data stream
> which uses its own signalling methods. The Unicode standard therefore
> says that a BOM should _not_ be used.
>
GIOP is not using the BOM as a signature. A signature is a "hint" as to what type of data is present in the stream. This is one possible use of a BOM, but not the only use. The other use (which does apply to GIOP) is to indicate the endianness of the encoded data.

Simon
--
Simon C Nash, Technology Architect, IBM Java Technology Centre
Tel. +44-1962-815156 Fax +44-1962-818999
Hursley, England
Internet: nash@hursley.ibm.com
Lotus Notes: Simon Nash@ibmgb

To: interop@omg.org
Subject: Re: issue 4008 -- Interop RTF issue
In-Reply-To: Message from "Rutt, T E (Tom)" of "Wed, 01 Nov 2000 14:01:25 EST."
From: Duncan Grisby
Date: Thu, 02 Nov 2000 10:00:49 +0000
Sender: dpg1@uk.research.att.com

On Wednesday 1 November, "Rutt, T E (Tom)" wrote:

> What other "word oriented" codings are in the OSF codeset registry?

Certainly UCS-2 and UCS-4. There may be others -- I haven't checked yet.

If it is the case that the only two word oriented code sets are UCS-2 and UCS-4, then I guess the byte order mark can be used for them too, although I think that maybe it is strictly-speaking not permitted in a UCS, only a UTF.

Cheers,

Duncan.
-- -- Duncan Grisby \ Research Engineer -- -- AT&T Laboratories Cambridge -- -- http://www.uk.research.att.com/~dpg1 -- From: "Rutt, T E (Tom)" To: interop@omg.org, "'Duncan Grisby'" Cc: "Rutt, T E (Tom)" Subject: RE: issue 4008 -- Interop RTF issue Date: Thu, 2 Nov 2000 10:54:54 -0500 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2650.21) Content-Type: text/plain; charset="iso-8859-1" X-UIDL: / What other "word oriented" codings are in the OSF codeset registry? Certainly UCS-2 and UCS-4. There may be others -- I haven't checked yet. If it is the case that the only two word oriented code sets are UCS-2 and UCS-4, then I guess the byte order mark can be used for them too, although I think that maybe it is strictly-speaking not permitted in a UCS, only a UTF. Cheers, Duncan. -- -- Duncan Grisby \ Research Engineer -- -- AT&T Laboratories Cambridge -- -- http://www.uk.research.att.com/~dpg1 -- Date: Thu, 02 Nov 2000 18:53:10 +0000 From: Simon Nash Organization: IBM X-Mailer: Mozilla 4.72 [en] (Windows NT 5.0; I) X-Accept-Language: en MIME-Version: 1.0 To: Duncan Grisby CC: interop@omg.org Subject: Re: issue 4008 -- Interop RTF issue References: <200011021000.KAA20852@pineapple.uk.research.att.com> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii X-UIDL: pb5e9#3hd9`C@!!NDj!! Duncan, Duncan Grisby wrote: > > On Wednesday 1 November, "Rutt, T E (Tom)" wrote: > > > What other "word oriented" codings are in the OSF codeset registry? > > Certainly UCS-2 and UCS-4. There may be others -- I haven't checked > yet. > > If it is the case that the only two word oriented code sets are UCS-2 > and UCS-4, then I guess the byte order mark can be used for them too, > although I think that maybe it is strictly-speaking not permitted in a > UCS, only a UTF. > We should not use the BOM for UCS-2 and UCS-4. The only reason we used it for UTF-16 was that the Unicode rules mandated it. 
If it is not mandated (or "strictly speaking" not even permitted) for UCS then we should not use it for the GIOP/CDR encoding of UCS. Simon -- Simon C Nash, Technology Architect, IBM Java Technology Centre Tel. +44-1962-815156 Fax +44-1962-818999 Hursley, England Internet: nash@hursley.ibm.com Lotus Notes: Simon Nash@ibmgb From: "Rutt, T E (Tom)" To: "'Duncan Grisby'" Cc: "'interop@omg.org'" Subject: RE: issue 4008 -- Interop RTF issue Date: Thu, 2 Nov 2000 14:12:10 -0500 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2650.21) Content-Type: text/plain; charset="iso-8859-1" X-UIDL: ]Ol!!#>!"!o5U!!/) Would you like to propose that rule for the two UCS-2 and UCS-4? I would propose that UCS-2 and UCS-4 should be transmitted with the stream's endianness, but I don't really mind. The main thing is that it is specified. Cheers, Duncan. -- -- Duncan Grisby \ Research Engineer -- -- AT&T Laboratories Cambridge -- -- http://www.uk.research.att.com/~dpg1 -- From: "Rutt, T E (Tom)" To: "'interop@Omg.org'" Cc: "Rutt, T E (Tom)" Subject: Annex F of ISO/IEC 10646 on UCS-2 and 4 use of BOM Date: Thu, 2 Nov 2000 15:26:00 -0500 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2650.21) Content-Type: text/plain; charset="iso-8859-1" X-UIDL: 8Ced9lHmd9c=(e9"U`!! Here is the annex F, extracted from the on-line standard: ---- Annex F (informative) The use of "signatures" to identify UCS This annex describes a convention for the identification of features of the UCS, by the use of "signatures" within data streams of coded characters. The convention makes use of the character ZERO WIDTH NO-BREAK SPACE, and is applied by a certain class of applications. When this convention is used, a signature at the beginning of a stream of coded characters indicates that the characters following are encoded in the UCS-2 or UCS-4 coded representation, and indicates the ordering of the octets within the coded representation of each character (see 6.3). 
It is typical of the class of applications mentioned above, that some make use of the signatures when receiving data, while others do not. The signatures are therefore designed in a way that makes it easy to ignore them. In this convention, the ZERO WIDTH NO-BREAK SPACE character has the following significance when it is present at the beginning of a stream of coded characters: UCS-2 signature: FEFF UCS-4 signature: 0000 FEFF An application receiving data may either use these signatures to identify the coded representation form, or may ignore them and treat FEFF as the ZERO WIDTH NO-BREAK SPACE character. If an application which uses one of these signatures recognises its coded representation in reverse sequence (e.g. hexadecimal FFFE), the application can identify that the coded representations of the following characters use the opposite octet sequence to the sequence expected, and may take the necessary action to recognise the characters correctly. NOTE - The hexadecimal value FFFE does not correspond to any coded character within ISO/IEC 10646. ----- It seems to imply the same ordering as UTF-16 terutt@lucent.com Date: Thu, 02 Nov 2000 20:38:07 +0000 From: Simon Nash Organization: IBM X-Mailer: Mozilla 4.72 [en] (Windows NT 5.0; I) X-Accept-Language: en MIME-Version: 1.0 To: Duncan Grisby CC: interop@omg.org Subject: Re: issue 4008 -- Interop RTF issue References: <200011011644.QAA19449@pineapple.uk.research.att.com> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii X-UIDL: R`Ee9OU > On Wednesday 1 November, Juergen Boldt wrote: > > > At 05:58 PM 10/31/00 -0500, Rutt, T E (Tom) wrote: > > >This issue was raised and closed by the last RTF. > > > > > >Unless there is new information, I do not see why it should be raised again. > > > > > >Please review the prior issue mailing dialogue: Issue 3405 b > > > issue 4008 has been closed... > > Oh for goodness sake! How can it possibly be closed?
Nobody has come > up with an answer about code sets other than UTF-16. > > Forget UTF-16. The 2.4 spec has a perfectly usable definition as to > how UTF-16 should be transmitted. My issue is that the spec says > NOTHING AT ALL about the endianness of code sets other than UTF-16. > > It would seem quite obvious that word-based code sets should use the > stream's endianness, if it wasn't for the UTF-16 special-case. Does > anyone actually bother to support code sets other than UTF-16? If so, > what do they do? > All the ORBs I am familiar with support UCS-2 using the stream's endianness and never add a BOM. Simon -- Simon C Nash, Technology Architect, IBM Java Technology Centre Tel. +44-1962-815156 Fax +44-1962-818999 Hursley, England Internet: nash@hursley.ibm.com Lotus Notes: Simon Nash@ibmgb Date: Fri, 03 Nov 2000 16:43:17 +0000 From: Simon Nash Organization: IBM X-Mailer: Mozilla 4.72 [en] (Windows NT 5.0; I) X-Accept-Language: en MIME-Version: 1.0 To: "Rutt, T E (Tom)" CC: "'Duncan Grisby'" , "'interop@omg.org'" Subject: Re: issue 4008 -- Interop RTF issue References: Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii X-UIDL: jLU!!'=A!!COo!!~+8!! Tom, As I read this, the use of signatures on UCS is optional. Applications can choose to recognise an initial BOM on data received and use this to reverse the expected endianness, but they are not required to do this. Also, there is no statement that the expected endianness defaults to big-endian in the absence of a BOM. Both of these make the UCS case different from the UTF-16 case. Also, from a pragmatic standpoint, mandating big-endianness in the absence of a BOM would break interoperability with every existing ORB that supports UCS-2. Simon "Rutt, T E (Tom)" wrote: > > Although I am not a character set expert, I notice that annex f of 10646-1 > talks about signatures > To identify UCS. 
This annex states "if an application which uses one of > these signatures recognises > Its coded representation in reverse sequence (e.g., FFFE) the application can > Identify that the coded representation ... uses the opposite octet sequence > to the sequence expected ... > " > > Thus I interpret that to mean that the BOM applies to UCS-2 and UCS-4 as > much as it does to > UTF-16 and UTF-32. > > We do not have the LE and BE variants (in IANA registry) at our disposal in > the OSF codeset. > > Thus I cannot see how you think that UCS-2 and UCS-4 without a BOM do not > default to Big endian just as UTF-16 does. > > Tom Rutt > does > ter > > ---------- > From: Duncan Grisby [SMTP:dgrisby@uk.research.att.com] > Sent: Thursday, November 02, 2000 12:38 PM > To: Rutt, T E (Tom) > Subject: Re: issue 4008 -- Interop RTF issue > > On Thursday 2 November, "Rutt, T E (Tom)" wrote: > > > Would you like to propose that rule for the two UCS-2 and UCS-4? > > I would propose that UCS-2 and UCS-4 should be transmitted with the > stream's endianness, but I don't really mind. The main thing is that > it is specified. > > Cheers, > > Duncan. > > -- > -- Duncan Grisby \ Research Engineer -- > -- AT&T Laboratories Cambridge -- > -- http://www.uk.research.att.com/~dpg1 -- -- Simon C Nash, Technology Architect, IBM Java Technology Centre Tel. +44-1962-815156 Fax +44-1962-818999 Hursley, England Internet: nash@hursley.ibm.com Lotus Notes: Simon Nash@ibmgb From: "Rutt, T E (Tom)" To: "Rutt, T E (Tom)" , "'Simon Nash'" Cc: "'Duncan Grisby'" , "'interop@omg.org'" Subject: RE: issue 4008 -- Interop RTF issue Date: Fri, 3 Nov 2000 13:16:05 -0500 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2650.21) Content-Type: text/plain; charset="iso-8859-1" X-UIDL: p@C!!)Imd9&U_!!X,\d9 OK I will try this out in a wordsmithing round, with some other "controversial" resolutions.
Tom Rutt ter ---------- From: Simon Nash [SMTP:nash@hursley.ibm.com] Sent: Friday, November 03, 2000 11:43 AM To: Rutt, T E (Tom) Cc: 'Duncan Grisby'; 'interop@omg.org' Subject: Re: issue 4008 -- Interop RTF issue Tom, As I read this, the use of signatures on UCS is optional. Applications can choose to recognise an initial BOM on data received and use this to reverse the expected endianness, but they are not required to do this. Also, there is no statement that the expected endianness defaults to big-endian in the absence of a BOM. Both of these make the UCS case different from the UTF-16 case. Also, from a pragmatic standpoint, mandating big-endianness in the absence of a BOM would break interoperability with every existing ORB that supports UCS-2. Simon "Rutt, T E (Tom)" wrote: > > Although I am not a character set expert, I notice that annex f of 10646-1 > talks about signatures > To identify UCS. This annex states "if an application which uses one of > these signatures recognises > Its coded representation in reverse sequence (e.g., FFFE) the application can > Identify that the coded representation ... uses the opposite octet sequence > to the sequence expected ... > " > > Thus I interpret that to mean that the BOM applies to UCS-2 and UCS-4 as > much as it does to > UTF-16 and UTF-32. > > We do not have the LE and BE variants (in IANA registry) at our disposal in > the OSF codeset. > > Thus I cannot see how you think that UCS-2 and UCS-4 without a BOM do not > default to Big endian just as UTF-16 does. > > Tom Rutt > does > ter > > ---------- > From: Duncan Grisby [SMTP:dgrisby@uk.research.att.com] > Sent: Thursday, November 02, 2000 12:38 PM > To: Rutt, T E (Tom) > Subject: Re: issue 4008 -- Interop RTF issue > > On Thursday 2 November, "Rutt, T E (Tom)" wrote: > > > Would you like to propose that rule for the two UCS-2 and UCS-4? > > I would propose that UCS-2 and UCS-4 should be transmitted with the > stream's endianness, but I don't really mind.
The main thing is that > it is specified. > > Cheers, > > Duncan. > > -- > -- Duncan Grisby \ Research Engineer -- > -- AT&T Laboratories Cambridge -- > -- http://www.uk.research.att.com/~dpg1 -- -- Simon C Nash, Technology Architect, IBM Java Technology Centre Tel. +44-1962-815156 Fax +44-1962-818999 Hursley, England Internet: nash@hursley.ibm.com Lotus Notes: Simon Nash@ibmgb To: Simon Nash cc: "Rutt, T E (Tom)" , "'Duncan Grisby'" , "'interop@omg.org'" Subject: Re: issue 4008 -- Interop RTF issue In-Reply-To: Your message of "Fri, 03 Nov 2000 08:43:17 PST." <3A02EB25.E1F6B486@hursley.ibm.com> From: Bill Janssen Message-Id: <00Nov3.130750pst."3448"@watson.parc.xerox.com> Date: Fri, 3 Nov 2000 13:07:40 PST Content-Type: text X-UIDL: AH'!!\+ad9Z3&"!\'o!! Yes, there is no precise definition of what the wire form of UCS-2 is. I'd suggest that means it shouldn't be used by any ORBs that hope for interoperability. Defining our own version of UCS-2 (along the lines suggested by Simon), and registering it with the registry, and then using it, is a possibility, though. Bill From: "Rutt, T E (Tom)" To: Simon Nash , "'Bill Janssen'" Cc: "Rutt, T E (Tom)" , "'Duncan Grisby'" , "'interop@omg.org'" Subject: RE: issue 4008 -- Interop RTF issue Date: Fri, 3 Nov 2000 16:56:30 -0500 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2650.21) Content-Type: text/plain X-UIDL: i%cd9&;X!!K>4e9$Wad9 I agree, unfortunately we are using a dead registry. ter ---------- From: Bill Janssen [SMTP:janssen@parc.xerox.com] Sent: Friday, November 03, 2000 4:08 PM To: Simon Nash Cc: Rutt, T E (Tom); 'Duncan Grisby'; 'interop@omg.org' Subject: Re: issue 4008 -- Interop RTF issue Yes, there is no precise definition of what the wire form of UCS-2 is. I'd suggest that means it shouldn't be used by any ORBs that hope for interoperability. Defining our own version of UCS-2 (along the lines suggested by Simon), and registering it with the registry, and then using it, is a possibility, though. 
Bill Sender: jbiggar@corvette.floorboard.com Message-ID: <3A034779.80C7DA5@floorboard.com> Date: Fri, 03 Nov 2000 15:17:13 -0800 From: Jonathan Biggar X-Mailer: Mozilla 4.76 [en] (X11; U; SunOS 5.6 sun4u) X-Accept-Language: en MIME-Version: 1.0 To: "Rutt, T E (Tom)" CC: Simon Nash , "'Bill Janssen'" , "'Duncan Grisby'" , "'interop@omg.org'" Subject: Re: issue 4008 -- Interop RTF issue References: Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii X-UIDL: I~)!!&n^d9S2bd9E;!!! "Rutt, T E (Tom)" wrote: > > I agree, unfortunately we are using a dead registry. So let's fix it. I think this is rather urgent, since pointing to a dead registry means we can't fix the problems we have with UCS-2 et al. I looked at both registries, and it appears we can manage a transition. The OSF registry numbers are all specified in hexadecimal, and are all larger than 65536. The IANA registry are specified in decimal, and no current integer identifier (the MIBenum value) is larger than 2259. This means we can transition from the OSF registry to the IANA one, since there is no overlap of assigned values. We can specify that using OSF registry values is deprecated, and it appears we have several years (at the very minimum) before IANA assigns 65k worth of codeset values. -- Jon Biggar Floorboard Software jon@floorboard.com jon@biggar.org Date: Mon, 03 Dec 2001 17:58:28 -0800 From: Everett Anderson X-Mailer: Mozilla 4.73 [en] (Windows NT 5.0; U) X-Accept-Language: en,pdf,ja MIME-Version: 1.0 To: interop@omg.org, issues@omg.org Subject: Encoding of UCS-2 Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii X-UIDL: M~0!!`W/!!NV7e9n0,e9 Hi, I'd like to raise the following as a new issue on the interop RTF: There is a difference of interpretation of how to encode UCS-2 in at least two shipping implementations. By UCS-2, I'm referring to the encoding for OSF code set registry entry 0x00010100. 
The first interpretation is based on an example in the CORBA spec: See formal 01-02-01 15.3.1.6: "For example, if the TCS-W is ISO 10646 UCS-2 (Universal Character Set containing 2 bytes), then wide characters are represented as unsigned shorts." that leads to an implementation which puts no byte order marker and takes the byte order of the stream. This is further backed up by the IANA character set registry: http://www.iana.org/assignments/character-sets which states for UCS-2: "Source: the 2-octet Basic Multilingual Plane, aka Unicode this needs to specify network byte order: the standard does not specify (it is a 16-bit integer space)" The second interpretation is based on domain knowledge of i18n developers who insist that UCS-2 is encoded exactly the same way as UTF-16 -- optional byte order marker and defaults to big endian if no byte order marker available. I'd really be interested if someone could find some Unicode or ISO spec references that really detail this. Members from several companies have looked and found nothing conclusive to back up either interpretation of UCS-2. Thanks, Everett Date: Tue, 4 Dec 2001 10:53:34 +0100 (MET) Message-Id: <200112040953.fB49rYE17834@paros.informatik.hu-berlin.de> X-Authentication-Warning: paros.informatik.hu-berlin.de: loewis set sender to loewis@informatik.hu-berlin.de using -f From: Martin von Loewis To: everett.anderson@sun.com CC: interop@omg.org, issues@omg.org In-reply-to: <3C0C2DC4.707EE8C9@sun.com> (message from Everett Anderson on Mon, 03 Dec 2001 17:58:28 -0800) Subject: Re: Encoding of UCS-2 References: <3C0C2DC4.707EE8C9@sun.com> User-Agent: SEMI/1.13.7 (Awazu) FLIM/1.13.2 (Kasanui) Emacs/20.7 (sparc-sun-solaris2.8) MULE/4.0 (HANANOEN) MIME-Version: 1.0 (generated by SEMI 1.13.7 - "Awazu") Content-Type: text/plain; charset=US-ASCII X-UIDL: &]R!!cgmd9,a I'd really be interested if someone could find some Unicode or ISO > spec references that really detail this. 
Members from several > companies have looked and found nothing conclusive to back up either > interpretation of UCS-2. The statement in the IANA character set registry is correct. See Unicode Technical Report #17 for an elaboration, at http://www.unicode.org/unicode/reports/tr17/ Basically, UCS-2 is a Character Encoding Form (section 4), meaning that it is capable of representing character strings as sequences of "code units". In UCS-2, a code unit is 16 bits wide (thus representing 65536 values). Unfortunately, the code unit in UCS-2 is *not* a byte, so UCS-2 is not capable of representing a string on a byte stream. What you need is a Character Encoding Scheme, which is a mapping between code units and bytes. If code units are 7-bit or 8-bit, this mapping is typically straight-forward; for larger code units, you need to specify byte order. Unicode 3.0 has two CESs for UTF-16, namely UTF-16BE and UTF-16LE. Because of these problems, I would recommend to deprecate the use of the OSF registry entry 0x00010100 in GIOP. Regards, Martin Date: Tue, 04 Dec 2001 10:39:19 -0500 From: Tom Rutt Reply-To: tom@coastin.com Organization: Coast X-Mailer: Mozilla 4.78 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: Everett Anderson CC: interop@omg.org, issues@omg.org Subject: Re: Encoding of UCS-2 References: <3C0C2DC4.707EE8C9@sun.com> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii X-UIDL: ;IXd92L$e90`X!!&TGe9 This is already an open issue: Issue 4008 has not yet been resolved. This should be included as part of Issue 4008 Tom Rutt Preceding Chair Interop RTF tom@coastin.com Everett Anderson wrote: > Hi, > > I'd like to raise the following as a new issue on the interop RTF: > > There is a difference of interpretation of how to encode UCS-2 in at > least two shipping implementations. By UCS-2, I'm referring to the > encoding for OSF code set registry entry 0x00010100.
> > The first interpretation is based on an example in the CORBA spec: > > See formal 01-02-01 15.3.1.6: > > "For example, if the TCS-W is ISO 10646 UCS-2 (Universal Character >Set > containing 2 bytes), then wide characters are represented as >unsigned > shorts." > > that leads to an implementation which puts no byte order marker and > takes the byte order of the stream. > > This is further backed up by the IANA character set registry: > > http://www.iana.org/assignments/character-sets > > which states for UCS-2: > > "Source: the 2-octet Basic Multilingual Plane, aka Unicode > this needs to specify network byte order: the standard > does not specify (it is a 16-bit integer space)" > > The second interpretation is based on domain knowledge of i18n > developers who insist that UCS-2 is encoded exactly the same way as > UTF-16 -- optional byte order marker and defaults to big endian if >no > byte order marker available. > > I'd really be interested if someone could find some Unicode or ISO > spec references that really detail this. Members from several > companies have looked and found nothing conclusive to back up either > interpretation of UCS-2. > > Thanks, > Everett From: "Everett Anderson" To: Cc: , Subject: RE: Encoding of UCS-2 Date: Tue, 4 Dec 2001 10:50:09 -0800 Message-ID: MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) In-reply-to: <3C0CEE27.C4D23652@coastin.com> Importance: Normal X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4807.1700 Content-Type: text/plain; charset="us-ascii" X-UIDL: )-gd9M%md9]cMe9gj1e9 Hi, > This is already an open issue: > > Issue 4008 has not yet been resolved. > > This should be included as part of Issue 4008 I'd really prefer the details of the UCS-2 encoding stay out of the 4008 brawl since it is obviously generating a lot of UTF-16 encoding specific debate, and I'm afraid that they will be lumped together. 
- Everett Importance: Normal Sensitivity: Subject: RE: Encoding of UCS-2 To: "Everett Anderson" Cc: , , X-Mailer: Lotus Notes Release 5.0.3 (Intl) 21 March 2000 Message-ID: From: "Debasish Banerjee" Date: Tue, 4 Dec 2001 16:51:10 -0600 X-MIMETrack: Serialize by Router on d27ml601/27/M/IBM(Build M10_08082001 Beta 3|August 08, 2001) at 12/04/2001 04:51:12 PM MIME-Version: 1.0 Content-Disposition: inline Content-Type: multipart/alternative; Boundary="0__=09BBE18BDFE9EE688f9e8a93df938690918c09BBE18BDFE9EE68" X-UIDL: Hf:!!~e>e9>f7e9=RH!! Status: RO Everett, UCS-2 and UTF-16 encoding must be identical--UCS-2 treats the surrogates simply as unknown code points. The second quotation in your previous note: "Source: the 2-octet ... specify (it is a 16-bit integer space)" is from the IANA charset registry, which deals with the MIME world. In the MIME world, there are no general means to indicate the endian-ness. CORBA should never use U+FEFF (or U+FFFE) to indicate the endian-ness of UCS-2/UCS-4/UTF-16/UTF-32 encoded data. In fact, OMG should not use separate OSF code set ids for UCS-2 and UTF-16; UCS-2 and UTF-16 code set ids should be the same or they should be considered as aliases. For the purpose of data interchange, and code set conversions, UTF-16 should be identified with UCS-2, and UTF-32 should be identified with UCS-4. Thanks, Deb To: cc:, Subject:RE: Encoding of UCS-2 Hi, > This is already an open issue: > > Issue 4008 has not yet been resolved. > > This should be included as part of Issue 4008 I'd really prefer the details of the UCS-2 encoding stay out of the 4008 brawl since it is obviously generating a lot of UTF-16 encoding specific debate, and I'm afraid that they will be lumped together.
- Everett Importance: Normal Sensitivity: Subject: RE: To: "Everett Anderson" Cc: "Andy Piper" , "Duncan Grisby" , "Richard Sitze" , X-Mailer: Lotus Notes Release 5.0.3 (Intl) 21 March 2000 Message-ID: From: "Debasish Banerjee" Date: Tue, 4 Dec 2001 17:54:12 -0600 X-MIMETrack: Serialize by Router on d27ml601/27/M/IBM(Build M10_08082001 Beta 3|August 08, 2001) at 12/04/2001 05:54:14 PM MIME-Version: 1.0 Content-Disposition: inline Content-Type: multipart/alternative; Boundary="0__=09BBE18BDFEE1B858f9e8a93df938690918c09BBE18BDFEE1B85" X-UIDL: O-"!!#[Pe9L_?e9A%~e9 Status: RO Everett, My comments are embedded as ... . Thanks, Deb To:"Andy Piper" , "Duncan Grisby" , Richard Sitze/Charlotte/IBM@IBMUS cc:, Debasish Banerjee/Rochester/IBM@IBMUS Subject:RE: Hi, The CORBA spec is now very specific about how CDR encodes UTF-16 -- it defaults to big endian if no BOM is present, and thus you don't have to waste extra space on the wire. What about little-endian platforms? To serialize a wchar/wstring in little-endian platforms in UTF-16 one has to append the quite unnecessary U+FFFE. From what I understand reading the Unicode spec, a BOM is optional, but what tends to matter more to us is how the CORBA spec describes it. While I agree we should try to do what the Unicode spec intends, I don't see how the CORBA interpretation hurts Unicode adoption. While it is true, that the Unicode conformance clauses (Chapter 2.7, and rule C3 of Chapter 3.1 of http://www.unicode.org/unicode/uni2book/u2.html) are flexible enough to allow the current CORBA interpretation of BOM, the current OMG spec can be considered 'ad-hoc' at best, and 'misleading' at worst. The present CORBA spec. is using Unicode in an inefficient, and confusing way--(a) it is not following the Unicode recommendations regarding BOM, (b) BOM is mandated for little-endian UTF-16 but not for little-endian UCS-2.
For the universal acceptance of OMG and Unicode, we should try to follow the Unicode recommendations, and not use any of our improvisations. We are now shipping implementations using the current spec description of UTF-16, defaulting to always using BOMs since it seemed safer given that some implementations were taking the byte order of the stream and some were defaulting to big endian. - Everett > -----Original Message----- > From: Andy Piper [mailto:andyp@bea.com] > Sent: Tuesday, December 04, 2001 9:35 AM > To: Duncan Grisby; Richard Sitze > Cc: interop@omg.org; Debasish Banerjee > Subject: Re: > > > At 02:13 PM 12/4/01 +0000, Duncan Grisby wrote: > >I think you will find deployed ORBs using BOMs for UTF-16. > > WebLogic Server is one of them. > > >Your "urgent" issue is a repeat of issue 4008, which I raised in > >October 2000. I argued the sensible case of using the stream's > >endianness for a while, then gave up because the Powers > That Be were > >clearly never going to change their minds. > > I also argued on the J2EE interop list that the spec is > wrong in this > respect, but no-one seemed to agree so we went with the > status quo. In > particular the J2EE specs reference CORBA 2.3.1 and since > this version does > not have the CORBA 2.4 "clarification" it seemed legal to > me to interpret > the spec as you describe. > > >I never did get a straight answer about exactly _who_ is > was from the > >Unicode standards body that asserted GIOP must make use of BOMs. > >Whoever it was was clearly so important that their word should be > >believed in preference to the words in the Unicode specification. > > Surely this is the problem. To me the spec appears > unambiguous, but to > others not so. An explicit description of the encoding > would therefore > appear necessary to resolve these sorts of disagreement. 
> > Although I believe the CORBA 2.4 "clarification" is wrong > in spirit - you > only need BOMs in the absence of any indication of > endianess - it doesn't > seem wrong in execution since any conformant UTF-16 > implementation is > required to be able to interpret the data with or without BOMS. > > andy > > Sender: jon@floorboard.com Message-ID: <3C0D633B.AB910909@biggar.org> Date: Tue, 04 Dec 2001 15:58:51 -0800 From: Jonathan Biggar X-Mailer: Mozilla 4.77 [en] (X11; U; SunOS 5.7 sun4u) X-Accept-Language: en MIME-Version: 1.0 To: Debasish Banerjee CC: Duncan Grisby , interop@omg.org, Mark Davis , Markus Scherer Subject: Re: UTF-16 and BOM References: Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii X-UIDL: fkZ!!jkNe9K$""!6`De9 Status: RO Debasish Banerjee wrote: > Then we have to come up with a sound migration strategy. We > simply can not use Unicode in a wrong way. The OMG > recommended use of BOM in UTF-16 (and UTF-32) is detrimental to > the Unicode acceptance. As a member of the Interop RTF that was involved in hashing out the current handling of UTF-16, let me share what I understand was the thinking of the RTF. I think this issue is a confusion between Unicode encoding forms and encoding schemes. The former are for internal computer processing, and have no concept of byte order. The latter are used for interoperable transfer of Unicode data. First, remember that Unicode as represented in the ORB and IIOP is purely a transfer syntax, used only by ORBs. It is the responsibility of the ORB vendor to make sure that the ORB operates correctly with the native Unicode process encoding used by the OS that hosts the ORB. Note in particular these two paragraphs from the CDR specification (15.3.1.6): "If an ORB decides to use BOM to indicate endianness, it shall add the BOM to the beginning of wchar or wstring values when encoding the value, since it is not present in wchar or wstring values passed by the user. 
If a BOM is present at the beginning of a wchar or wstring received in a GIOP message, the ORB shall remove the BOM before passing the value to the user." So the ORB will add and strip off the BOM, if it chooses and/or needs to for the purposes of unambiguous transfer. There are three reasons I can think up off the bat why the ORB might do this: 1. The native Unicode environment is little-endian. The ORB can either prepend a BOM indicating that the data is little-endian, or it can byte-swap the data when encoding the CDR stream. The choice is left up to the ORB vendor, who presumably will choose the most efficient one for the environment at hand, or perhaps provide a way to set the behavior at runtime. 2. The ORB receives the Unicode data from a source in one byte order, and will embed it into a message that is going to be encoded in the other byte order. Again, the ORB must either use a BOM, or else byte-swap the data. Note that you can't finesse this by having the ORB encode the entire CDR stream in the same order as the Unicode data, since we might be aggregating Unicode data from more than one source with different byte orders. 3. The client program hands a Unicode string to the ORB that contains an initial zero-width non-breaking space. This is not a BOM, by the rules of Unicode, since we are not dealing with an encoding scheme, but the native Unicode encoding form, which has no concept of byte-order. In this case, the ORB must prepend a BOM to prevent the receiving ORB from misinterpreting the leading zero-width space as a BOM. Note that the Unicode standard strongly discourages placing a ZWNBS at the beginning of a Unicode string, so this could also be technically treated as an error case. To sum up, in my opinion the use of UTF-16 BOM in CDR is correct as specified.
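The three cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any ORB's actual code: the function names `encode_wstring_payload` and `decode_wstring_payload` are hypothetical, and the sketch handles only the UTF-16 code units of a wstring, leaving out the length prefix and alignment that CDR also requires. It assumes the rule described in this thread: a leading BOM selects the byte order and is stripped, and data without a BOM is read as big-endian.

```python
import codecs


def encode_wstring_payload(text: str, little_endian: bool = False) -> bytes:
    """Encode wstring code units as UTF-16 (hypothetical helper, no CDR framing)."""
    # A BOM is prepended when the data is written little-endian (cases 1 and 2),
    # or when the user's string itself starts with U+FEFF (case 3), so that the
    # receiver does not mistake the user's character for a BOM.
    needs_bom = little_endian or text.startswith("\ufeff")
    codec = "utf-16-le" if little_endian else "utf-16-be"
    bom = b""
    if needs_bom:
        bom = codecs.BOM_UTF16_LE if little_endian else codecs.BOM_UTF16_BE
    return bom + text.encode(codec)


def decode_wstring_payload(data: bytes) -> str:
    """Decode UTF-16 code units, honouring and stripping a leading BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: assumed big-endian, per the UTF-16 rule discussed in this thread.
    return data.decode("utf-16-be")
```

A sender that prefers to byte-swap instead of adding a BOM would simply call `encode_wstring_payload(text)` with the default big-endian setting; the receiver's logic is the same either way.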
-- Jon Biggar Floorboard Software jon@floorboard.com jon@biggar.org ----------------------------------------------------------------- In response to Duncan Grisby's question, attached below is the original response the RTF got from the Technical Director of Unicode, Inc. which we used to form the current mechanism for handling UTF-16 in CDR: > > From: Kenneth Whistler[SMTP:kenw@sybase.com] > > Sent: Wednesday, May 17, 2000 9:00 PM > > To: terutt@lucent.com > > Cc: Arnold.Winkler@unisys.com; kenw@sybase.com > > Subject: RE: FW: Is our interpretation of UTF-16 encoding > > correct? > > > > Mr. Rutt, > > > > Arnold passed your questions on to me, and I will endeavor to > > provide you with some answers. > > > > First of all, it is important to distinguish what we are now > > calling "encoding forms" from what we are calling "encoding > > schemes". UTF-16 is implicated in both. The full details of > > the distinctions are made in the Unicode Technical Report #17, > > Character Encoding Model. See: > > > > http://www.unicode.org/unicode/reports/tr17/ > > > > But the short take on this is as follows: > > > > **Encoding forms** essentially describe how the integral > > numerical values specified in a character encoding standard > > get realized as actual integral datatypes in computers. > > In the character encoding, the numerical values are just abstract > > integers, unassociated with particular datatypes. The character > > encoding form makes that association, so you can actually put > > characters into computer registers with known integral data sizes.
> > > > The Unicode Standard now has 3 encoding forms: > > > > UTF-8 variable width encoding form, using an 8-bit byte as > > the datatype (= "code unit"), and using 1 to 4 > > of those bytes per character > > > > UTF-16 variable width encoding form, using a 16-bit word (= > > "wyde") > > as the datatype, and using 1 to 2 of those wydes per > > character > > > > UTF-32 fixed width encoding form, using a 32-bit word as the > > datatype, and always using exactly 1 of those words > > per character > > > > The encoding forms are relevant to in-memory processing of Unicode > > data. It is also critical for API's to specify exactly which > > encoding form they are using, since the character and string > > parameters have to be bound to particular size datatypes by the > > compilers and linkers. > > > > The BOM is not relevant when you are talking about encoding forms, > > since you do not need it for in-memory processing, any more than > > you need such a mechanism to tell your compiler what order to > > put the bytes for handling a 16-, 32-, or 64-bit integer as a > > number in registers or in API's. That stuff is all handled > > automatically > > on each platform by the compilers. > > > > **Encoding schemes** describe how character data expressed in a > > particular encoding form is serialized into byte streams. > > > > UTF-8 doesn't need any particular encoding scheme, since it is > > byte-oriented already. You just pump out the bytes. > > > > UTF-16 has the byte-order problem when serialized as a byte > > stream. 
> > As a result, there are *three* encoding schemes associated with it:
> >
> >   UTF-16     if BOM present, 0xFE 0xFF ==> msb first
> >                              0xFF 0xFE ==> lsb first
> >              else msb first
> >
> >   UTF-16BE   msb first, no BOM
> >
> >   UTF-16LE   lsb first, no BOM
> >
> > In other words, the UTF-16BE and UTF-16LE encoding schemes require
> > no BOM signature (in fact they disallow it); these assume that the
> > information about byte order is identified correctly outside the
> > stream itself, for example, in MIME fields, other markup,
> > system context, or whatever. UTF-16 as an encoding *scheme* allows
> > the use of BOM as a signature, so that the stream can be
> > self-identifying as to byte order.
> >
> > UTF-32 is handled exactly by analogy to UTF-16. There are three
> > encoding schemes, UTF-32, UTF-32BE, UTF-32LE -- and their
> > interpretation is comparable, except that 32-bit entities are being
> > serialized as 4 bytes, and the signature BOM is 0x00 0x00 0xFE 0xFF,
> > and so on.
> >
> > O.k., now with all that background in place, I can try to get to
> > your particular questions.
> >
> > First, I think you need to sharply distinguish *string* API's and
> > *streams*. All the string-handling API's should be dealing with
> > the encoding forms. When you are talking about wide strings (i.e.,
> > Unicode strings in UTF-16 encoding form, bound to unsigned short*
> > in C or C++), then you are not dealing with BOM's at all -- or
> > should not be. The BOM's are not for string handling, and were never
> > intended to be. String concatenation operations, for example, should
> > not be examining and trimming off initial BOM's on Unicode strings,
> > and so on. If they are, something is terribly wrong in the
> > implementation.
> > Here is an example of a complete, conformant implementation of
> > a Unicode string concatenation:
> >
> >   unistring unistrcat( unistring s1, const unistring s2 )
> >   {
> >       unistring t1 = s1;
> >       unistring t2 = s2;
> >
> >       /* advance to the end of s1 */
> >       while ( *t1 != UNINULL )
> >       {
> >           t1++;
> >       }
> >       while ( *t2 != UNINULL )
> >       {
> >           *t1++ = *t2++;
> >       }
> >       *t1 = UNINULL;
> >       return ( s1 );
> >   }
> >
> > If "unistring" is type-bound to unsigned short*, then this works
> > for UTF-16. If "unistring" is type-bound to unsigned {32-bit} long*,
> > then this works for UTF-32. Note, no provisions for BOM's. There
> > should *not* be such provisions.
> >
> > Where you can run into BOM's, on the other hand, is when you are
> > serializing UTF-16 Unicode data into or out of byte streams intended
> > for interchange -- e.g. as plain text datafiles loose on the
> > Internet somewhere, or "loose" inside any mixed byte polarity
> > network sharing files. It is in those instances that you need to
> > make a decision as to which of the encoding schemes you will
> > declare and use. And if your encoding scheme is UTF-16, then it is
> > advisable to prepend a serialized UTF-16 byte stream with the BOM
> > on export, and then interpret any prepended BOM on import.
> >
> > But think of this as a minimal weight file formatting issue. You
> > get a file stream with a two-byte "header" that you interpret to
> > figure out the byte polarity of the data, and then you treat the
> > rest of the file stream as the "data", byte-swapping or not, and
> > feeding the results to string-based API's that can either treat
> > the "data" as one big long string, or, more usually, parse up the
> > contents according to whatever they are supposed to be doing.
> >
> > Further comments below.
> >
> > > -----Original Message-----
> > > From: Rutt, T E (Tom) [mailto:terutt@lucent.com]
> > > Sent: Tuesday, May 16, 2000 12:05 AM
> > > To: 'arnold.winkler@unisys.com'
> > > Subject: Is our interpretation of UTF-16 encoding correct?
> > >
> > > Arnold,
> > >
> > > I hope you can take some time to help the OMG Interop RTF resolve
> > > an urgent issue.
> > >
> > > With regard to encoding of UTF-16, the following proposal was
> > > forwarded by our Xerox member. We are referring to the OSF codeset
> > > designation UTF-16 as our basis, but the ISO standard is what
> > > people think is "sanctimonious".
> > >
> > > We cannot use the IANA BE, LE variants, because we are (probably
> > > should change) using OSF as our codeset registration authority.
> >
> > I cannot comment on the constraints you may be under regarding
> > what you can and cannot normatively reference in your
> > specifications.
> >
> > However, my general advice is to make sure that you "get it right"
> > and don't get stuck in some specification conundrum because of
> > what you think you have to refer to.
> >
> > > Here is what we are currently voting on:
> > >
> > > ------------
> > > UTF-16 can be big endian or little endian but defaults to big
> > > endian. The wording of the ISO spec indicates (though it does not
> > > clearly say -- well, we know how that goes) that by placing a BOM
> > > (byte order marker) at the front of the string, you can send it
> > > either way. So, our official interpretation of a UTF-16 string
> > > should be:
> >
> > Replace the "string" here with "serialized byte stream", and you
> > get closer to the intent.
> >
> > > if the first two bytes are FE FF, it's big-endian;
> > > if FF FE, it's little-endian;
> > > if neither, it's big-endian.
> >
> > Otherwise, this is correct.
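For readers who want the rule just quoted in concrete form (FE FF means big-endian, FF FE means little-endian, anything else defaults to big-endian, and the signature is dropped before the data is handed on as a string), here is a small C sketch. It is illustrative only and not taken from the thread or the specification; the function name `utf16_decode` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a serialized UTF-16 byte stream into native 16-bit code units.
 * A leading BOM selects the byte order and is not copied to the output.
 * Returns the number of code units written, or (size_t)-1 if the input
 * has an odd length or the output buffer is too small. */
size_t utf16_decode(const uint8_t *in, size_t in_len,
                    uint16_t *out, size_t out_cap)
{
    int big_endian = 1;   /* default when no BOM is present */
    size_t i = 0, n = 0;

    assert(in != NULL || in_len == 0);
    assert(out != NULL || out_cap == 0);
    if (in_len % 2 != 0)
        return (size_t)-1;

    if (in_len >= 2) {
        if (in[0] == 0xFE && in[1] == 0xFF) {
            i = 2;                    /* big-endian signature, skip it */
        } else if (in[0] == 0xFF && in[1] == 0xFE) {
            big_endian = 0;           /* little-endian signature, skip it */
            i = 2;
        }
    }

    for (; i < in_len; i += 2) {
        if (n == out_cap)
            return (size_t)-1;
        out[n++] = big_endian
            ? (uint16_t)((in[i] << 8) | in[i + 1])
            : (uint16_t)((in[i + 1] << 8) | in[i]);
    }
    return n;
}
```

Only the first two octets are ever treated as a signature; a U+FEFF that appears later in the stream is passed through as an ordinary character.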
> > > > Once you are dealing with *string*'s per se, however, you should > > not be > > dealing with BOM's. > > > > > > > > ---------------- > > > > > > The problem is, one member has asked a good question, > > > > > > "Is the BOM (if present) required to be removed from the > > transmitted > > > encoding before passing > > > the wstring up to the User through their normal Programming > > Language API > > for > > > WSTRING?" > > > > It should be removed from the *serialized byte stream* before the > > byte > > stream > > is reinterpreted as a properly byte-polarized UTF-16 wide *string* > > that > > a string-oriented API handles. > > > > Once again, try to distinguish the file stream side of the issue > > from > > the string side of the issue. The BOM is a *signature*, not to be > > interpreted > > as part of the content. > > > > And when you are using UTF-16 as an encoding *form* for your > > strings, > > the BOM should *not* occur at the front of *any* string. That is > > for when > > you are packing up Unicode UTF-16 textual data into a byte stream > > for > > interchange. > > > > > > > > The concern is that for application comparison of strings and > > such, > > > Is the BOM a "white space" or is it always ignored by User > > application > > > software? > > > > U+FEFF, when it appears inside strings (anyplace, not just at the > > beginning of strings), is interpreted as a ZERO WIDTH NO-BREAK > > SPACE. That > > is a legitimate character, and is not functioning as a > > byte-order-mark > > (BOM) > > in those instances. Ordinarily, a ZWNBSP would be treated as white > > space, > > since it is a particular kind of "space", but it is not always > > ignored > > by user application software. In particular, the whole point of > > using > > a ZWNBSP in text is to tell a line-breaker not to place a line > > break at > > a particular place in text (without having the visual gap that the > > regular > > NBSP has). 
> > > > If you are doing a binary comparison of strings, then the presence > > or > > absence of a U+FEFF character in a string obviously is going to > > make > > a difference. If you are allocating memory for storage, the > > presence > > or absence of a U+FEFF character obviously cannot be ignored. If > > you > > are doing culturally significant sorting with weight tables, then > > a U+FEFF character would typically be ignored, just as a ZERO > > WIDTH SPACE > > (U+200B) would typically be ignored for such sorting. > > > > But the important thing is that ZWNBSP character in text should > > not be > > there as a result of misinterpretation of BOM signature bytes on a > > data stream that were not removed when a serialized UTF-16 byte > > stream > > was turned into UTF-16 string(s) for in memory processing. > > > > > > > > If it it significant for comparisons, then we should clarify > > that it is > > > removed before passing > > > the string to the application from the ORB decoder. > > > > It should *not* be there on strings. > > > > It *should* be removed from the beginning of serialized UTF-16 > > byte > > streams that are being turned into strings. > > > > > > > > I await your helpful response? > > > > I hope this was helpful. > > > > --Ken Whistler, Technical Director, Unicode, Inc. To: Jonathan Biggar Cc: Debasish Banerjee , Duncan Grisby , interop@omg.org, jon@floorboard.com, Mark Davis MIME-Version: 1.0 Subject: Re: UTF-16 and BOM X-Mailer: Lotus Notes Release 5.0.7 March 21, 2001 Message-ID: From: "Markus Scherer" Date: Tue, 4 Dec 2001 17:04:20 -0800 X-MIMETrack: Serialize by Router on D04NM203/04/M/IBM(Release 5.0.8 |June 18, 2001) at 12/04/2001 08:04:40 PM, Serialize complete at 12/04/2001 08:04:40 PM Content-Type: text/plain; charset="us-ascii" X-UIDL: mh%!!8GP!!?Qfd9!91e9 Status: RO Hello, Let me explain how I see the application of the terms "encoding form" and "encoding scheme" in this context. (Dr. Mark Davis read this email over before I sent it.) 
First of all, I agree with Jonathan that the endianness of the encoding inside a message/encapsulation is independent from the endianness of the current machine. I believe that this is not in question in this discussion. Encoding forms (where different from encoding schemes) are relevant in contexts and environments where there are 16-bit units available. In such contexts, UTF-16 is stored and processed directly in terms of such 16-bit units. This is the case for in-process UTF-16 text handling. Encoding schemes are relevant in contexts and environments that deal strictly with byte streams (or byte vectors), without predefined 16-bit units. In these contexts, the byte order of non-byte-oriented data like UTF-16 text needs to be specified either by the protocol - in form of a flag in a header [which could in turn take the form of byte-order-specific charset tags like "UTF-16LE"] or in form of an implicit agreement - or, if all else fails, then one can use the Byte Order Mark (BOM) for UTF-16 text. ----- As far as I understand the Inter-ORB specification, there are two levels of data representation. The first level flattens data into primitives like unsigned short, octet, double, struct, array, etc. The second level then byte-serializes each of the primitives according to the byte order of the message/encapsulation. In this model, UTF-16 strings should be treated like vectors of unsigned short on the first level, and then byte-serialized according to the message/encapsulation byte order on the second level. This is how all other data types are treated. This is how GIOP 1.1 treated wchar and wstring. This first level is consistent with character encoding forms because you have real 16-bit types on this level. The second level is consistent with character encoding schemes because it takes care of the byte-serialization. You do not need to specify the byte order of the serialization of the UTF-16 text because that is specified in the message/encapsulation. 
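The two-level model described here can be sketched in C as follows: the UTF-16 string is first viewed as a vector of unsigned shorts, and each element is then byte-serialized in the byte order of the enclosing message, with no BOM involved. This shows the model this message argues for, not the GIOP 1.2 rule for UTF-16. The names `put_ushort` and `marshal_ushort_vector` are hypothetical, and CDR details such as the length prefix and alignment padding are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Second level: byte-serialize one unsigned short in the message order. */
static void put_ushort(uint8_t *out, uint16_t v, int little_endian)
{
    if (little_endian) {
        out[0] = (uint8_t)(v & 0xFF);
        out[1] = (uint8_t)(v >> 8);
    } else {
        out[0] = (uint8_t)(v >> 8);
        out[1] = (uint8_t)(v & 0xFF);
    }
}

/* First level: a UTF-16 string is just a vector of unsigned shorts, so it
 * is marshalled exactly like any other sequence of that type. The byte
 * order comes from the message or encapsulation, not from the text.
 * Returns the number of octets written, or (size_t)-1 if out is too small. */
size_t marshal_ushort_vector(const uint16_t *v, size_t n, int little_endian,
                             uint8_t *out, size_t out_cap)
{
    size_t i;

    assert(v != NULL || n == 0);
    assert(out != NULL || out_cap == 0);
    if (n > out_cap / 2)
        return (size_t)-1;

    for (i = 0; i < n; i++)
        put_ushort(out + 2 * i, v[i], little_endian);
    return 2 * n;
}
```

A surrogate pair needs no special handling at this level: it is simply two consecutive unsigned shorts.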
----- For GIOP 1.2, the encoding of wchar/wstring was changed. In my understanding, this introduces an inconsistency between the two Inter-ORB data representation levels. wchar/wstring are now already byte-serialized on the level of encoding forms, and treated as if there were no available 16-bit types. The byte order of the message/encapsulation is ignored for wchar/wstring, but used for all other data. This is unnecessary and inefficient, and will lead to confusion. - It goes against common principles of Inter-ORB data representation as well as common use of Unicode. - It allows the possibility of UTF-16 text with a different byte order than the rest of the message. - It allows the possibility of two different UTF-16 wstrings in the same message/encapsulation with different byte orders. However, note that this does not violate Unicode _conformance_. It is permissible to use a BOM in UTF-16 even when it is not necessary because a higher-level protocol could specify the byte order, unless the higher-level protocol specifically requires or prohibits the interpretation of U+FEFF as a BOM (like with the MIME charset name "UTF-16"). Best regards, markus Markus Scherer IBM GCoC-Unicode/ICU Cupertino, CA markus.scherer@us.ibm.com (also for SameTime) Jonathan Biggar Sent by: jon@floorboard.com 2001-12-04 15:58 To: Debasish Banerjee/Rochester/IBM@IBMUS cc: Duncan Grisby , interop@omg.org, Mark Davis/Cupertino/IBM@IBMUS, Markus Scherer/Cupertino/IBM@IBMUS Subject: Re: UTF-16 and BOM Debasish Banerjee wrote: > Then we have to come up with a sound migration strategy. We > simply can not use Unicode in a wrong way. The OMG > recommended use of BOM in UTF-16 (and UTF-32) is detrimental to > the Unicode acceptance. As a member of the Interop RTF that was involved in hashing out the current handling of UTF-16, let me share what I understand was the thinking of the RTF. I think this issue is a confusion between Unicode encoding forms and encoding schemes. 
The former are for internal computer processing, and have no concept of byte order. The latter are used for interoperable transfer of Unicode data. ... Sender: jon@floorboard.com Message-ID: <3C0D7B33.A872F987@biggar.org> Date: Tue, 04 Dec 2001 17:41:07 -0800 From: Jonathan Biggar X-Mailer: Mozilla 4.77 [en] (X11; U; SunOS 5.7 sun4u) X-Accept-Language: en MIME-Version: 1.0 To: Markus Scherer CC: Jonathan Biggar , Debasish Banerjee , Duncan Grisby , interop@omg.org, Mark Davis Subject: Re: UTF-16 and BOM References: Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii X-UIDL: WHc!!A^0e9Weed92,Nd9 Markus Scherer wrote: > > Hello, > > Let me explain how I see the application of the terms "encoding form" and > "encoding scheme" in this context. > (Dr. Mark Davis read this email over before I sent it.) > > First of all, I agree with Jonathan that the endianness of the encoding > inside a message/encapsulation is independent from the endianness of the > current machine. I believe that this is not in question in this > discussion. > > Encoding forms (where different from encoding schemes) are relevant in > contexts and environments where there are 16-bit units available. In such > contexts, UTF-16 is stored and processed directly in terms of such 16-bit > units. This is the case for in-process UTF-16 text handling. > > Encoding schemes are relevant in contexts and environments that deal > strictly with byte streams (or byte vectors), without predefined 16-bit > units. In these contexts, the byte order of non-byte-oriented data like > UTF-16 text needs to be specified either by the protocol - in form of a > flag in a header [which could in turn take the form of byte-order-specific > charset tags like "UTF-16LE"] or in form of an implicit agreement - or, if > all else fails, then one can use the Byte Order Mark (BOM) for UTF-16 > text. I completely agree with the above. 
> -----
>
> As far as I understand the Inter-ORB specification, there are two levels
> of data representation.
> The first level flattens data into primitives like unsigned short, octet,
> double, struct, array, etc.
> The second level then byte-serializes each of the primitives according to
> the byte order of the message/encapsulation.
>
> In this model, UTF-16 strings should be treated like vectors of unsigned
> short on the first level, and then byte-serialized according to the
> message/encapsulation byte order on the second level.
> This is how all other data types are treated.
> This is how GIOP 1.1 treated wchar and wstring.
>
> This first level is consistent with character encoding forms because you
> have real 16-bit types on this level.
> The second level is consistent with character encoding schemes because it
> takes care of the byte-serialization.
>
> You do not need to specify the byte order of the serialization of the
> UTF-16 text because that is specified in the message/encapsulation.

Arguing about what CDR should do as a clean model is nice, but doesn't help much when considering what the specification says CDR does do. CDR for GIOP 1.2 treats UTF-16 encodings specially, not as an abstract vector of shorts. The specification of CDR was changed from GIOP 1.1 to GIOP 1.2 because the CDR rules were not adequate to handle variable-sized encodings for wchar or wstring.

> -----
>
> For GIOP 1.2, the encoding of wchar/wstring was changed. In my
> understanding, this introduces an inconsistency between the two Inter-ORB
> data representation levels.
> wchar/wstring are now already byte-serialized on the level of encoding
> forms, and treated as if there were no available 16-bit types.
> The byte order of the message/encapsulation is ignored for wchar/wstring,
> but used for all other data.

Again, this was a purposeful decision. CDR defers to the OSF registry to define the rules of how to encode a wstring for a given codeset identifier.
The OSF registry defines the UTF-16 strictly, which means big-endian unless a BOM is used. The RTF felt we had no choice but to specify it this way, even though most of the RTF would have preferred a scheme that used the GIOP header (or CDR encapsulation) byte order flag. > This is unnecessary and inefficient, and will lead to confusion. I disagree. > - It goes against common principles of Inter-ORB data representation as > well as common use of Unicode. This case is clearly spelled out as a special case in the CDR specification. Yes, it is different than what you might expect, but it is not ambiguous. I also see little practical advantage to conforming to a "common use of Unicode", since the ORB is going to have to be prepared to byte-swap anyway. The rules for when it does it might be a hair more complicated, but hardly rocket science. > - It allows the possibility of UTF-16 text with a different byte order > than the rest of the message. I consider this an advantage. > - It allows the possibility of two different UTF-16 wstrings in the same > message/encapsulation with different byte orders. I also consider this an advantage. -- Jon Biggar Floorboard Software jon@floorboard.com jon@biggar.org Sender: jon@floorboard.com Message-ID: <3C0DAAB2.583B5106@biggar.org> Date: Tue, 04 Dec 2001 21:03:46 -0800 From: Jonathan Biggar X-Mailer: Mozilla 4.77 [en] (X11; U; SunOS 5.7 sun4u) X-Accept-Language: en MIME-Version: 1.0 To: Markus Scherer CC: Jonathan Biggar , Debasish Banerjee , Duncan Grisby , interop@omg.org, Mark Davis Subject: Re: UTF-16 and BOM References: Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii X-UIDL: ,%"!!Ca\d9kD?!!H1/e9 Markus Scherer wrote: > > Jonathan, > > > The specification of CDR was changed from GIOP 1.1 to GIOP 1.2 because > > the CDR rules were not adequate to handle variable-sized encodings for > > wchar or wstring. > > In what way were the 1.1 rules not adequate for UTF-16? 
> For which wide-character encoding other than UTF-16/32 were the 1.1 rules
> not adequate?

Note I didn't say UTF-16/32 specifically. The problem was with other codesets that used a variable number of bytes to encode characters. GIOP/CDR 1.1 assumed every character encoding was the same size.

The other difficulty was that it required explicit knowledge of the codeset semantics in order to remarshal the data. This meant that a GIOP bridge must know every possible codeset that it retransmits. The change in GIOP 1.2 for encoding wchar & wstring means that the data can be remarshalled verbatim without knowing the semantics of the codeset, since the length specifiers now declare the number of octets, not codepoints.

> For wchar, the possibility of a Unicode code point (with "Unicode code
> point" in the meaning of Unicode and SGML/XML) requiring two unsigned
> shorts in UTF-16 could have been handled in ways other than encoding the
> one or two unsigned shorts directly as (two or four) bytes.
>
> Note that depending on what notion of "character" is meant to be
> represented in a wchar, one might need in UTF-16 either
> - a single unsigned short (for a "Unicode code unit") or
> - one or two unsigned shorts (for a "Unicode code point") or
> - an arbitrary number of unsigned shorts (for a "user character" or
>   "grapheme cluster" like a sequence of "c" followed by "circumflex"
>   followed by "cedilla")
>
> It is not clear to me from the Inter-ORB specification which notion of
> "character" is meant to be represented in a wchar.
> I suggest to clarify this point in the text of the specification.

I'm not entirely sure either, although I'm pretty sure your first alternative (code unit) isn't the correct interpretation. As it stands now, the "grapheme cluster" interpretation is possible, with the limitation that the final encoded syntax in CDR can't be more than 255 bytes.

> For wstring, I do not see any problem with applying the 1.1 rules to
> UTF-16.
As I mentioned above, the changes were to handle other codesets other than ones for Unicode. -- Jon Biggar Floorboard Software jon@floorboard.com jon@biggar.org Date: Wed, 5 Dec 2001 10:32:47 +0100 (MET) Message-Id: <200112050932.fB59Wlk19826@paros.informatik.hu-berlin.de> X-Authentication-Warning: paros.informatik.hu-berlin.de: loewis set sender to loewis@informatik.hu-berlin.de using -f From: Martin von Loewis To: debasish@us.ibm.com CC: everett.anderson@sun.com, tom@coastin.com, interop@omg.org, issues@omg.org In-reply-to: (debasish@us.ibm.com) Subject: Re: Encoding of UCS-2 References: User-Agent: SEMI/1.13.7 (Awazu) FLIM/1.13.2 (Kasanui) Emacs/20.7 (sparc-sun-solaris2.8) MULE/4.0 (HANANOEN) MIME-Version: 1.0 (generated by SEMI 1.13.7 - "Awazu") Content-Type: text/plain; charset=US-ASCII X-UIDL: ? UCS-2 and UTF-16 encoding must be identical--UCS-2 treats the surrogates > simply as unknown code points. UCS-2 is a character encoding form using 16-bit code units, so it cannot be used on a byte stream. CORBA treats UTF-16 as a character encoding scheme, so it can be used on a byte stream (such as GIOP). Therefore, they *cannot* be identical. Furthermore, even if you look at the coded character sets that can be represented in these encodings, they are different. The code points you are referring to are illegal in UCS-2, not just "unknown". UTF-16 encodes many characters that UCS-2 does not encode. So they are by no means identical. Regards, Martin To: Jonathan Biggar cc: Debasish Banerjee , interop@omg.org, Mark Davis , Markus Scherer Subject: Re: UTF-16 and BOM In-Reply-To: Message from Jonathan Biggar of "Tue, 04 Dec 2001 15:58:51 PST." <3C0D633B.AB910909@biggar.org> From: Duncan Grisby Date: Wed, 05 Dec 2001 10:17:13 +0000 Sender: dpg1@uk.research.att.com Content-Type: text X-UIDL: 63H!!8''"!a6bd9aM:!! On Tuesday 4 December, Jonathan Biggar wrote: > 2. 
The ORB receives the Unicode data from a source in one byte order, > and will embed it into a message that is going to be encoded in the > other byte order. That might be useful, except that there is, of course, no way for an application to tell the ORB that it wants to do that. Cheers, Duncan. -- -- Duncan Grisby \ Research Engineer -- -- AT&T Laboratories Cambridge -- -- http://www.uk.research.att.com/~dpg1 -- To: Jonathan Biggar cc: Debasish Banerjee , interop@omg.org, Mark Davis , Markus Scherer Subject: Re: UTF-16 and BOM In-Reply-To: Message from Jonathan Biggar of "Tue, 04 Dec 2001 15:58:51 PST." <3C0D633B.AB910909@biggar.org> From: Duncan Grisby Date: Wed, 05 Dec 2001 10:20:56 +0000 Sender: dpg1@uk.research.att.com Content-Type: text X-UIDL: pi`d9GIBe9JH'!!b"Me9 On Tuesday 4 December, Jonathan Biggar wrote: [...] > In repsponse to Duncan Grisby's question, attached below is the original > response the RTF got from the Technical Director of Unicode, Inc. which > we used to form the currect mechanism for handling UTF-16 in CDR: It's clear from reading that that he was replying from a state of ignorance about the question that was being asked. Note how he keeps talking about "files" containing UTF-16... > > > Where you can run into BOM's, on the other hand, is when you are > > > serializing UTF-16 Unicode data into or out of byte streams > > > intended > > > for interchange -- e.g. as plain text datafiles loose on the > > > Internet somewhere, or "loose" inside any mixed byte polarity > > > network sharing files. It is in those instances that you need to That simply isn't the situation that GIOP is in, since at no time are the wstrings "loose". They always exist inside a GIOP message that has a byte order header. I accept that it is way too late to change the marshalling format for UTF-16. Apart from aesthetics (and a bit of code size) it really doesn't matter what the format for other word-oriented formats is, but it really needs to be specified. 
Cheers, Duncan. -- -- Duncan Grisby \ Research Engineer -- -- AT&T Laboratories Cambridge -- -- http://www.uk.research.att.com/~dpg1 -- Sender: jon@floorboard.com Message-ID: <3C0E4CF2.40231B81@biggar.org> Date: Wed, 05 Dec 2001 08:36:02 -0800 From: Jonathan Biggar X-Mailer: Mozilla 4.77 [en] (X11; U; SunOS 5.7 sun4u) X-Accept-Language: en MIME-Version: 1.0 To: Duncan Grisby CC: Jonathan Biggar , Debasish Banerjee , interop@omg.org, Mark Davis , Markus Scherer Subject: Re: UTF-16 and BOM References: <200112051020.KAA23475@pineapple.uk.research.att.com> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii X-UIDL: o@l!!GKK!!=(]!!hL5e9 Duncan Grisby wrote: > > On Tuesday 4 December, Jonathan Biggar wrote: > > [...] > > In repsponse to Duncan Grisby's question, attached below is the original > > response the RTF got from the Technical Director of Unicode, Inc. which > > we used to form the currect mechanism for handling UTF-16 in CDR: > > It's clear from reading that that he was replying from a state of > ignorance about the question that was being asked. Note how he keeps > talking about "files" containing UTF-16... It's not clear at all. Ken uses the term "files" as a generic term to indicate data that will be interchanged between systems. For GIOP it happens to be messages, but the use case is pretty much the same. > > > > Where you can run into BOM's, on the other hand, is when you are > > > > serializing UTF-16 Unicode data into or out of byte streams intended > > > > for interchange -- e.g. as plain text datafiles loose on the > > > > Internet somewhere, or "loose" inside any mixed byte polarity > > > > network sharing files. It is in those instances that you need to > > That simply isn't the situation that GIOP is in, since at no time are > the wstrings "loose". They always exist inside a GIOP message that has > a byte order header. > > I accept that it is way too late to change the marshalling format for > UTF-16. 
Apart from aesthetics (and a bit of code size) it really > doesn't matter what the format for other word-oriented formats is, but > it really needs to be specified. UTF-16 isn't a word oriented format. It is a byte-level encoding scheme that happens to group the bytes in twos. As Martin pointed out on this thread, you shouldn't use UCS-2 with GIOP/CDR, because GIOP is a 8-bit stream and UCS-2 is a 16 bit encoding. The ambiguity in byte order for UCS-2 when trying to stream it via CDR comes because it is being used inappropriately. -- Jon Biggar Floorboard Software jon@floorboard.com jon@biggar.org Sender: jon@floorboard.com Message-ID: <3C0E4E53.C986E522@biggar.org> Date: Wed, 05 Dec 2001 08:41:55 -0800 From: Jonathan Biggar X-Mailer: Mozilla 4.77 [en] (X11; U; SunOS 5.7 sun4u) X-Accept-Language: en MIME-Version: 1.0 To: Duncan Grisby CC: Jonathan Biggar , Debasish Banerjee , interop@omg.org, Mark Davis , Markus Scherer Subject: Re: UTF-16 and BOM References: <200112051017.KAA23461@pineapple.uk.research.att.com> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii X-UIDL: !=h!!55^d9NBL!!Sfd!! Duncan Grisby wrote: > > On Tuesday 4 December, Jonathan Biggar wrote: > > > 2. The ORB receives the Unicode data from a source in one byte order, > > and will embed it into a message that is going to be encoded in the > > other byte order. > > That might be useful, except that there is, of course, no way for an > application to tell the ORB that it wants to do that. And there shouldn't be, since an ORB should always receive and present Unicode data to the application in the native form. The case that I am referring to could be wstring data embedded in an Any that the application never unwrapped, or information internal to the ORB that is never presented to the user such as Unicode data embedded in GIOP service contexts, or a GIOP/GIOP bridge, or an interworking bridge that translates to a different distributed communication protocol. 
-- Jon Biggar Floorboard Software jon@floorboard.com jon@biggar.org To: Jonathan Biggar Cc: "Debasish Banerjee" , Duncan Grisby , interop@omg.org, Jonathan Biggar , Markus Scherer MIME-Version: 1.0 Subject: Re: UTF-16 and BOM X-Mailer: Lotus Notes Release 5.0.7 March 21, 2001 Message-ID: From: "Markus Scherer" Date: Wed, 5 Dec 2001 09:44:41 -0800 X-MIMETrack: Serialize by Router on D04NM203/04/M/IBM(Release 5.0.8 |June 18, 2001) at 12/05/2001 12:44:53 PM, Serialize complete at 12/05/2001 12:44:53 PM Content-Type: text/plain; charset="us-ascii" X-UIDL: 32[d9+!Le98j>e9o?0!! Comments below with *** markus Markus Scherer IBM GCoC-Unicode/ICU Cupertino, CA markus.scherer@us.ibm.com (also for SameTime) Jonathan Biggar Sent by: jon@floorboard.com 2001-12-05 08:36 To: Duncan Grisby cc: Jonathan Biggar , Debasish Banerjee/Rochester/IBM@IBMUS, interop@omg.org, Mark Davis/Cupertino/IBM@IBMUS, Markus Scherer/Cupertino/IBM@IBMUS Subject: Re: UTF-16 and BOM Duncan Grisby wrote: > > On Tuesday 4 December, Jonathan Biggar wrote: > > [...] > > In repsponse to Duncan Grisby's question, attached below is the original > > response the RTF got from the Technical Director of Unicode, Inc. which > > we used to form the currect mechanism for handling UTF-16 in CDR: > > It's clear from reading that that he was replying from a state of > ignorance about the question that was being asked. Note how he keeps > talking about "files" containing UTF-16... It's not clear at all. Ken uses the term "files" as a generic term to indicate data that will be interchanged between systems. For GIOP it happens to be messages, but the use case is pretty much the same. *** It is very clear. Ken is very careful with his use of words. *** The use case is in fact very different when it comes to *** serialization *** and byte order. *** For "files on the loose" one can not assume any meta information *** about the charset, the byte order, etc. *** For GIOP, all this and more is well defined. GIOP is not "loose". 
> > > > Where you can run into BOM's, on the other hand, is when you are > > > > serializing UTF-16 Unicode data into or out of byte streams intended > > > > for interchange -- e.g. as plain text datafiles loose on the > > > > Internet somewhere, or "loose" inside any mixed byte polarity > > > > network sharing files. It is in those instances that you need to > > That simply isn't the situation that GIOP is in, since at no time are > the wstrings "loose". They always exist inside a GIOP message that has > a byte order header. > > I accept that it is way too late to change the marshalling format for > UTF-16. Apart from aesthetics (and a bit of code size) it really > doesn't matter what the format for other word-oriented formats is, but > it really needs to be specified. UTF-16 isn't a word oriented format. It is a byte-level encoding scheme that happens to group the bytes in twos. As Martin pointed out on this *** UTF-16 is two things: *** 1. A word-oriented encoding form *** 2. A byte-serialization (encoding scheme) of that encoding form *** The Unicode Standard defines from version 1.0 that in its encoding *** form it must be treated exactly by using words, not pairs of *** bytes. *** This does not _require_ you to pick either the "form" or "scheme" *** at a *** particular level of GIOP, but it suggests what to do. *** Note that I keep talking about what's desirable. *** I realize that this was changed from what looks *** more "natural" to something else. *** Anything will _work_ for as long as it is fully specified and *** technically feasible. thread, you shouldn't use UCS-2 with GIOP/CDR, because GIOP is a 8-bit stream and UCS-2 is a 16 bit encoding. The ambiguity in byte order for UCS-2 when trying to stream it via CDR comes because it is being used inappropriately. *** As far as I understand, GIOP has two steps of representing data, *** as I pointed out earlier. GIOP 1.1 did the "natural" thing for wstring. 
To: Jonathan Biggar Cc: Debasish Banerjee , Duncan Grisby , interop@omg.org, Jonathan Biggar , Mark Davis MIME-Version: 1.0 Subject: Re: UTF-16 and BOM X-Mailer: Lotus Notes Release 5.0.7 March 21, 2001 Message-ID: From: "Markus Scherer" Date: Wed, 5 Dec 2001 09:52:58 -0800 X-MIMETrack: Serialize by Router on D04NM203/04/M/IBM(Release 5.0.8 |June 18, 2001) at 12/05/2001 12:53:06 PM, Serialize complete at 12/05/2001 12:53:06 PM Content-Type: text/plain; charset="us-ascii" X-UIDL: (Y)e9NAH!!,BV!!5E!e9 More comments with *** markus Markus Scherer IBM GCoC-Unicode/ICU Cupertino, CA markus.scherer@us.ibm.com (also for SameTime) Jonathan Biggar Sent by: jon@floorboard.com 2001-12-04 21:03 To: Markus Scherer/Cupertino/IBM@IBMUS cc: Jonathan Biggar , Debasish Banerjee/Rochester/IBM@IBMUS, Duncan Grisby , interop@omg.org, Mark Davis/Cupertino/IBM@IBMUS Subject: Re: UTF-16 and BOM Markus Scherer wrote: > > Jonathan, > > > The specification of CDR was changed from GIOP 1.1 to GIOP 1.2 because > > the CDR rules were not adequate to handle variable-sized encodings for > > wchar or wstring. > > In what way were the 1.1 rules not adequate for UTF-16? > For which wide-character encoding other than UTF-16/32 were the 1.1 rules > not adequate? Note I didn't say UTF-16/32 specifically. The problem was with other codesets that used a variable number of bytes to encode characters. GIOP/CDR 1.1 assumed every character encoding was the same size. *** I think this is important. *** I have some experience dealing with all kinds of encodings, and I *** am *** not aware of any text encoding that fulfills all three of the following: *** 1. not a form of Unicode *** 2. not byte-oriented *** 3. suitable for text data interchange *** Unless I completely misunderstand what wchar/wstring are meant for, *** any encoding must fulfill 2. and 3., right? *** Note that wchar_t encodings as used with libc libraries on Unixes do *** not satisfy 3. 
They are strictly defined for process-internal use only *** and generally only defined "opaque". The other difficulty was that it required explicit knowledge of the codeset semantics in order to remarshal the data. This meant that a GIOP bridge must know every possible codeset that it retransmits. The change in GIOP 1.2 for encoding wchar & wstring means that the data can be remarshalled verbatum without knowing the semantics of the codeset, since the length specifiers now declare the number of octets, not codepoints. *** If "every possible codeset" means the two UTF-16 and UTF-32, then *** the problem is at least limited. Importance: Normal Sensitivity: Subject: Re: Encoding of UCS-2 To: Martin von Loewis , everett.anderson@Sun.COM, tom@coastin.com, interop@omg.org, issues@omg.org Cc: "Mark Davis" X-Mailer: Lotus Notes Release 5.0.3 (Intl) 21 March 2000 Message-ID: From: "Debasish Banerjee" Date: Wed, 5 Dec 2001 17:40:50 -0600 X-MIMETrack: Serialize by Router on d27ml601/27/M/IBM(Build M10_08082001 Beta 3|August 08, 2001) at 12/05/2001 05:40:52 PM MIME-Version: 1.0 Content-Disposition: inline Content-Type: multipart/alternative; Boundary="0__=09BBE18ADF1259708f9e8a93df938690918c09BBE18ADF125970" X-UIDL: dPm!!af UCS-2 and UTF-16 encoding must be identical--UCS-2 treats the surrogates > simply as unknown code points. UCS-2 is a character encoding form using 16-bit code units, so it cannot be used on a byte stream. CORBA treats UTF-16 as a character encoding scheme, so it can be used on a byte stream (such as GIOP). Therefore, they *cannot* be identical. Furthermore, even if you look at the coded character sets that can be represented in these encodings, they are different. The code points you are referring to are illegal in UCS-2, not just "unknown". UTF-16 encodes many characters that UCS-2 does not encode. So they are by no means identical. 
Regards, Martin Sender: jon@floorboard.com Message-ID: <3C0ED697.58AA4E26@biggar.org> Date: Wed, 05 Dec 2001 18:23:19 -0800 From: Jonathan Biggar X-Mailer: Mozilla 4.77 [en] (X11; U; SunOS 5.7 sun4u) X-Accept-Language: en MIME-Version: 1.0 To: Markus Scherer CC: Jonathan Biggar , Debasish Banerjee , Duncan Grisby , interop@omg.org, Mark Davis Subject: Re: UTF-16 and BOM References: Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii X-UIDL: -f`d9\"5e9WI%!!@HZd9 Markus Scherer wrote: > > > The specification of CDR was changed from GIOP 1.1 to GIOP 1.2 because > > > the CDR rules were not adequate to handle variable-sized encodings for > > > wchar or wstring. > > > > In what way were the 1.1 rules not adequate for UTF-16? > > For which wide-character encoding other than UTF-16/32 were the 1.1 > rules > > not adequate? > > Note I didn't say UTF-16/32 specifically. The problem was with other > codesets that used a variable number of bytes to encode characters. > GIOP/CDR 1.1 assumed every character encoding was the same size. > > *** I think this is important. > *** I have some experience dealing with all kinds of encodings, and I am > *** not aware of any text encoding that fulfills all three of the > following: > > *** 1. not a form of Unicode > *** 2. not byte-oriented > *** 3. suitable for text data interchange > > *** Unless I completely misunderstand what wchar/wstring are meant for, > *** any encoding must fulfill 2. and 3., right? Nope. It is legal to use a byte-oriented TCS for wchar/wstring (like UTF-8), as well as a "word-oriented" TCS for char/string. Here are two relevant snippets from Chapter 13: "13.10.1.9 Char and Wchar Transmission Code Set (TCS-C and TCS-W) These two terms refer to code sets that are used for transmission between ORBs after negotiation is completed. As the names imply, the first one is used for char data and the second one for wchar data. Each TCS can be byte-oriented or non-byte oriented." 
and from 13.10.1.12:

"The intent is for TCS-C to be byte-oriented and TCS-W to be non-byte-oriented. However, this specification does allow both types of characters to be transmitted using the same transmission code set. That is, the selection of a transmission code set is orthogonal to the wideness or narrowness of the characters, although a given code set may be better suited for either narrow or wide characters."

The one thing the spec says you can't do effectively is use a multi-byte encoding when trying to transmit a single char value, but you can use byte or word-oriented encodings for string data.

> *** Note that wchar_t encodings as used with libc libraries on Unixes do
> *** not satisfy 3. They are strictly defined for process-internal use only
> *** and generally only defined "opaque".

Yup.

> The other difficulty was that it required explicit knowledge of the
> codeset semantics in order to remarshal the data. This meant that a
> GIOP bridge must know every possible codeset that it retransmits. The
> change in GIOP 1.2 for encoding wchar & wstring means that the data can
> be remarshalled verbatim without knowing the semantics of the codeset,
> since the length specifiers now declare the number of octets, not
> codepoints.
>
> *** If "every possible codeset" means the two UTF-16 and UTF-32, then
> *** the problem is at least limited.

I meant exactly what I said. We wanted it to handle any possible codeset that can be expressed as a sequence of octets, including ones that haven't been invented yet.
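The point about octet counts can be illustrated with a minimal sketch. It assumes a GIOP 1.2 style wstring laid out as an unsigned long octet count followed by that many octets, and that `offset` is already aligned for the length field. The function names `read_wstring_octets` and `remarshal_wstring` are hypothetical and not taken from any ORB.

```python
import struct


def read_wstring_octets(buf: bytes, offset: int, little_endian: bool):
    """Return (payload, next_offset) for a wstring whose length is in octets.

    Assumption: 'offset' is already aligned for the 4-byte length field.
    The payload is returned as opaque bytes; no code set knowledge is needed.
    """
    fmt = "<I" if little_endian else ">I"
    (n_octets,) = struct.unpack_from(fmt, buf, offset)
    start = offset + 4
    end = start + n_octets
    if end > len(buf):
        raise ValueError("wstring length runs past end of buffer")
    return buf[start:end], end


def remarshal_wstring(payload: bytes, little_endian: bool) -> bytes:
    """Re-emit the payload unchanged; only the length field follows the
    byte order of the outgoing message."""
    fmt = "<I" if little_endian else ">I"
    return struct.pack(fmt, len(payload)) + payload
```

A bridge written this way copies the character octets without inspecting them, which is why it would not need to know the semantics of the negotiated code set.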
:) -- Jon Biggar Floorboard Software jon@floorboard.com jon@biggar.org Date: Thu, 6 Dec 2001 10:52:05 +0100 (MET) Message-Id: <200112060952.fB69q5o22042@paros.informatik.hu-berlin.de> X-Authentication-Warning: paros.informatik.hu-berlin.de: loewis set sender to loewis@informatik.hu-berlin.de using -f From: Martin von Loewis To: debasish@us.ibm.com CC: everett.anderson@sun.com, tom@coastin.com, interop@omg.org, issues@omg.org, mark.davis@us.ibm.com In-reply-to: (debasish@us.ibm.com) Subject: Re: Encoding of UCS-2 References: User-Agent: SEMI/1.13.7 (Awazu) FLIM/1.13.2 (Kasanui) Emacs/20.7 (sparc-sun-solaris2.8) MULE/4.0 (HANANOEN) MIME-Version: 1.0 (generated by SEMI 1.13.7 - "Awazu") Content-Type: text/plain; charset=US-ASCII X-UIDL: m-6!!l'_d9C`Le9?(=!! > UCS-2 is used to designate a subset of Unicode characters that only > includes BMP characters. As above, there is no purpose to > distinguishing in practice between UCS-2 and UTF-16 as encoding > tags. They can be distinguished, but the recommendation is not to do > so when receiving text. When emitting text, the tag "UCS-2" should > be avoided. How is this relevant to GIOP, in particular in respect to this issue? CORBA 2.5 explains how UTF-16 is transmitted in GIOP messages. It is not obvious that the same statements also apply to transmission of UCS-2. Are you saying they should, or are you saying they should not? 
Regards, Martin Importance: Normal Sensitivity: Subject: Re: Encoding of UCS-2 To: Martin von Loewis Cc: everett.anderson@Sun.COM, tom@coastin.com, interop@omg.org, issues@omg.org, "Mark Davis" X-Mailer: Lotus Notes Release 5.0.3 (Intl) 21 March 2000 Message-ID: From: "Debasish Banerjee" Date: Thu, 6 Dec 2001 10:54:15 -0600 X-MIMETrack: Serialize by Router on d27ml601/27/M/IBM(Build M10_08082001 Beta 3|August 08, 2001) at 12/06/2001 10:54:30 AM MIME-Version: 1.0 Content-Disposition: inline Content-Type: multipart/alternative; Boundary="0__=09BBE189DFC9B5AF8f9e8a93df938690918c09BBE189DFC9B5AF" X-UIDL: ?c%"!40Wd9;lP!!n@Yd9 Martin, The relevant points are: -- UCS-2 and UTF-16 SHOULD not have separate OSF code set ids. Even if they do, the OSF ids SHOULD be considered equivalent (aliases). -- Any BOM-encoding rule meant for UTF-16 SHOULD be applicable to UCS-2 also. -- The above statements SHOULD be true for the pair too. UTF-32 SHOULD not be distinguished from UCS-4 for data transmission and code set conversions. Finally, the present CORBA spec. is not following Unicode recommendations in a few instances. Can we fix it? Also, for any future revisions of the CORBA specifications we should be very careful so as not to introduce any of our own Unicode interpretation, but follow Unicode guidelines, and recommendations. Thanks, Deb To:Debasish Banerjee/Rochester/IBM@IBMUS cc:everett.anderson@sun.com, tom@coastin.com, interop@omg.org, issues@omg.org, Mark Davis/Cupertino/IBM@IBMUS Subject:Re: Encoding of UCS-2 > UCS-2 is used to designate a subset of Unicode characters that only > includes BMP characters. As above, there is no purpose to > distinguishing in practice between UCS-2 and UTF-16 as encoding > tags. They can be distinguished, but the recommendation is not to do > so when receiving text. When emitting text, the tag "UCS-2" should > be avoided. How is this relevant to GIOP, in particular in respect to this issue? 
CORBA 2.5 explains how UTF-16 is transmitted in GIOP messages. It is not obvious that the same statements also apply to transmission of UCS-2. Are you saying they should, or are you saying they should not?

Regards,
Martin

Date: Thu, 6 Dec 2001 19:49:34 +0100 (MET)
Message-Id: <200112061849.fB6InYE23270@paros.informatik.hu-berlin.de>
From: Martin von Loewis
To: debasish@us.ibm.com
CC: everett.anderson@sun.com, tom@coastin.com, interop@omg.org, issues@omg.org, mark.davis@us.ibm.com
In-reply-to: (debasish@us.ibm.com)
Subject: Re: Encoding of UCS-2

> -- UCS-2 and UTF-16 SHOULD not have separate OSF code set ids. Even if they
> do, the OSF ids SHOULD be considered equivalent (aliases).

Well, they do. I don't agree they should be aliases.

> -- Any BOM-encoding rule meant for UTF-16 SHOULD be applicable to UCS-2
> also.

UCS-2, clearly, uses 16-bit code points and SHOULD not have been used for byte-oriented streams in the first place. So another option is to outlaw the usage of the UCS-2 code set id for GIOP, in favour of the UTF-16 code set id - in this case, they won't be equivalent.

> Finally, the present CORBA spec. is not following Unicode
> recommendations in a few instances. Can we fix it?

Depends on what specifically you are referring to. Things that have been clarified by resolutions of earlier issues so that they are now unambiguous SHOULD not be changed.

> Also, for any future revisions of the CORBA specifications we should
> be very careful so as not to introduce any of our own Unicode
> interpretation, but follow Unicode guidelines, and recommendations.
The issue is not that Unicode has been deliberately ignored. The issue is rather that the Unicode spec was ambiguous in the past, and that, for on-the-wire purposes, CORBA has clarified some of these ambiguities. There might be cases where the Unicode consortium has resolved these ambiguities with different wording, so that it appears that there is a contradiction. In no case should compliance with the Unicode spec be achieved by giving up interoperability.

Regards,
Martin

Subject: Re: Encoding of UCS-2
To: Martin von Loewis
Cc: everett.anderson@Sun.COM, tom@coastin.com, interop@omg.org, issues@omg.org, "Mark Davis"
From: "Debasish Banerjee"
Date: Thu, 6 Dec 2001 14:51:38 -0600

> -- UCS-2 and UTF-16 SHOULD not have separate OSF code set ids. Even if they
> do, the OSF ids SHOULD be considered equivalent (aliases).

Well, they do. I don't agree they should be aliases.

I know they do, and I do not agree with your viewpoint. Maybe we should raise this point as another interop issue.

> -- Any BOM-encoding rule meant for UTF-16 SHOULD be applicable to UCS-2
> also.

UCS-2, clearly, uses 16-bit code points and SHOULD not have been used for byte-oriented streams in the first place. So another option is to outlaw the usage of the UCS-2 code set id for GIOP, in favour of the UTF-16 code set id - in this case, they won't be equivalent.

An interesting proposal. It appears that your proposal provides a pragmatic solution.

> Finally, the present CORBA spec. is not following Unicode
> recommendations in a few instances. Can we fix it?

Depends on what specifically you are referring to.
Things that have been clarified by resolutions of earlier issues so that they are now unambiguous SHOULD not be changed.

I am specifically referring to the OMG BOM rule. An initial U+FEFF is a BOM for UTF-16 encoded OMG data, but a ZWNBS for UCS-2 encoded OMG data! Also, the CORBA spec. is presently silent about the interpretation of an initial U+FEFF for UCS-4 or UTF-32 encoded data. What am I supposed to assume--BOM or ZWNBS?

> Also, for any future revisions of the CORBA specifications we should
> be very careful so as not to introduce any of our own Unicode
> interpretation, but follow Unicode guidelines, and recommendations.

The issue is not that Unicode has been deliberately ignored. The issue is rather that the Unicode spec was ambiguous in the past, and that, for on-the-wire purposes, CORBA has clarified some of these ambiguities. There might be cases where the Unicode consortium has resolved these ambiguities with different wording, so that it appears that there is a contradiction. In no case should compliance with the Unicode spec be achieved by giving up interoperability.

The conformance clause C3 in the old Unicode spec. (Unicode V2.0, Addison-Wesley, 1996; Chapter 3.1) has not changed much in Version 3.0. Can you please point out the specific s or s in the Unicode spec., which got clarified by the CORBA specifications. Incidentally, the CORBA 2.5 spec. was published long after the Unicode 3.0 spec. was available in book form.

UTF-16 as the fallback code set for TCS-W was introduced in the CORBA spec. long before version 2.5. With this new 2.5 interpretation of BOM in UTF-16, we faced a serious interoperability problem. I am just curious how other vendors addressed the interoperability issue--what was ZWNBSP in CORBA 2.3/2.4.x(?) became BOM in 2.4.5!
Regards,
Martin

Sender: jon@floorboard.com
Message-ID: <3C100CD2.F7DE8583@biggar.org>
Date: Thu, 06 Dec 2001 16:26:58 -0800
From: Jonathan Biggar
To: Debasish Banerjee
CC: Martin von Loewis, everett.anderson@sun.com, tom@coastin.com, interop@omg.org, issues@omg.org, Mark Davis
Subject: Re: Encoding of UCS-2

Debasish Banerjee wrote:
> > UCS-2, clearly, uses 16-bit code points and SHOULD not have been used
> > for byte-oriented streams in the first place. So another option is to
> > outlaw the usage of the UCS-2 code set id for GIOP, in favour of the
> > UTF-16 code set id - in this case, they won't be equivalent.
>
> An interesting proposal. It appears that your proposal provides
> a pragmatic solution.

I agree. We should have a statement that only byte-oriented encoding schemes should be usable with GIOP/CDR. The examples in the current specification that use UCS-2 for a transfer code set (except for the specific statements about UCS-2 and UCS-4 for GIOP 1.1) should be removed.

> > > Finally, the present CORBA spec. is not following Unicode
> > > recommendations in a few instances. Can we fix it?
> > Depends on what specifically you are referring to. Things that have
> > been clarified by resolutions of earlier issues so that they are now
> > unambiguous SHOULD not be changed.
>
> I am specifically referring to the OMG BOM rule. An initial U+FEFF is a
> BOM for UTF-16 encoded OMG data, but a ZWNBS for UCS-2 encoded OMG data!
> Also, CORBA spec. is presently silent about the interpretation of an
> initial U+FEFF for UCS-4 or UTF-32 encoded data. What am I supposed to
> assume--BOM or ZWNBS?

ZWNBS.
The OMG BOM rule specifically states that it applies only to UTF-16, and that only for CDR, so it doesn't apply, for example, if UTF-16 happens to be the native process code set. > > Also, for any future revisions of the CORBA specifications we should > > > be very careful so as not to introduce any of our own Unicode > > > interpretation, but follow Unicode guidelines, and recommendations. > > The issue is not that Unicode has been deliberately ignored. The issue > > is rather that the Unicode spec was ambiguous in the past, and that, > > for on-the-wire purposes, CORBA has clarified some of these > > ambiguities. There might be cases where the Unicode consortium has > > resolved these ambiguities with different wording, so that it appears > > that there is a contradiction. In no case should compliance with the > > Unicode spec be achieved by giving up interoperability. > > The conformance clause C3 in the old Unicode spec. (Unicode V2.0, > Addison-Wesley, 1996; Chapter 3.1) has not changed much in Version 3.0. > Can you please point out the specific s or s > in the Unicode spec., which got clarified by the CORBA specifications. > Incidentally, CORBA 2.5 spec. was published long after Unicode 3.0 spec. > was available in book form. Your definition of "higher level protocol" from clause C3 is too limited. For GIOP/CDR 1.2, the higher level protocol not only includes the byte-order bit in the GIOP header or CDR encapsulation, it also includes the specific BOM rules from 15.3.1.6. I do not believe that current GIOP/CDR violates either the rules or spirit of the Unicode specification. > UTF-16 as the fallback code set for TCS-W was introduced in CORBA spec. > long before version 2.5. With this new 2.5 interpretation of BOM in > UTF-16, we faced serious interoperability problem. I am just curious > how other vendors addressed the interoperability issue--what was ZWNBSP > in CORBA 2.3/2.4.x(?) became BOM in 2.4.5! 
By the way, the BOM rule was added in CORBA 2.4 as a clarification for GIOP 1.2, not in CORBA 2.5.

--
Jon Biggar
Floorboard Software
jon@floorboard.com
jon@biggar.org

Date: Fri, 7 Dec 2001 11:16:52 +0100 (MET)
Message-Id: <200112071016.fB7AGql24441@paros.informatik.hu-berlin.de>
From: Martin von Loewis
To: debasish@us.ibm.com
CC: everett.anderson@sun.com, tom@coastin.com, interop@omg.org, issues@omg.org, mark.davis@us.ibm.com
In-reply-to: (debasish@us.ibm.com)
Subject: Re: Encoding of UCS-2

> > Finally, the present CORBA spec. is not following Unicode
> > recommendations in a few instances. Can we fix it?
> Depends on what specifically you are referring to. Things that have
> been clarified by resolutions of earlier issues so that they are now
> unambiguous SHOULD not be changed.
>
> I am specifically referring to the OMG BOM rule. An initial U+FEFF is a
> BOM for UTF-16 encoded OMG data, but a ZWNBS for UCS-2 encoded OMG
> data!

Are you claiming that CORBA is not following the Unicode recommendation here? In what respect? Either you are claiming that U+FEFF should not be used as a BOM in UTF-16, or not be used as a ZWNBS in UCS-2; your statement is unclear as to where you see the violation of the Unicode recommendations.

First let's look at UTF-16 encoded data, where UTF-16 is understood as a CES, in the spirit of RFC 2781. That RFC, in section 3.2, specifies the usage of the BOM for encoding UTF-16 data in a byte stream. It refers to Unicode section 2.4 and Annex F of ISO 10646. So that probably is not in violation of the Unicode spec.

Now, let's look at ZWNBS in UCS-2.
I cannot find anything in the CORBA specification that suggests any interpretation of U+FEFF for UCS-2. The BOM is mentioned only in 15.3.1.6, which only talks about UTF-16.

> Also, CORBA spec. is presently silent about the interpretation of an
> initial U+FEFF for UCS-4 or UTF-32 encoded data. What am I supposed
> to assume--BOM or ZWNBS?

The OSF registry does not list UTF-32 as a character set, so you cannot use that in GIOP. As for UCS-4, I'd suggest the same resolution as for UCS-2: Don't use it, since it operates on 32-bit code units, not on bytes.

> The conformance clause C3 in the old Unicode spec. (Unicode V2.0,
> Addison-Wesley, 1996; Chapter 3.1) has not changed much in Version 3.0.
> Can you please point out the specific s or s
> in the Unicode spec., which got clarified by the CORBA specifications.
> Incidentally, CORBA 2.5 spec. was published long after Unicode 3.0 spec.
> was available in book form.

There was a lot of confusion on the issue of code sequences and byte sequences in earlier formulations of Unicode. For example, please look at

http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html

which corrects usage of the term "byte sequence" into "code unit sequence". People could get the impression that both UCS-2 and UTF-16 allow transmission of Unicode data in a byte stream. That has only been clarified in recent Unicode versions.

Regards,
Martin

Date: Fri, 07 Dec 2001 19:17:13 -0800
From: Everett Anderson
To: Debasish Banerjee
CC: Martin von Loewis, tom@coastin.com, interop@omg.org, issues@omg.org, Mark Davis
Subject: Re: Encoding of UCS-2

Hi,

Again, I think this comes down to the following two things:

1. The CORBA spec was changed to be very clear about what UTF-16 means in CORBA.
Many people are using this, now, and at least where I am, we've finally started to get real interoperability for complex types -- I don't want to break this, again. We should not change the CORBA spec for UTF-16 -- it is unambiguous and explicit. I know many are still arguing that the way CORBA encodes UTF-16 isn't exactly true to the Unicode spec, but that's how it is in CORBA, now, and it in no way negatively affects Unicode outside of CORBA.

2. It sounds like UCS-2 should not be used in CORBA since it doesn't provide a mapping to bytes. Deprecating its use and any other conversion means which don't map to bytes seems reasonable.

Can we agree to these two points and move on, or at least move this discussion to the interop list rather than issues@omg.org? :)

> of the Unicode recommendations I think this was mentioned in
> earlier notes. But please take a look at the Conformance clause C3, and
> Chapter 2.7 of the Unicode 3.0 spec. The line "If other signalling methods
> are used, signatures SHOULD NOT BE EMPLOYED" is in Section 2.7 in the
> Byte Order Mark (BOM) section. In the CORBA domain, where there is a

I should really know better than to step into this one, but I believe that that part in Chapter 2.7 is talking about signatures, not byte order markers, and the "signaling methods" discussed there refer to ways of signaling the reader that the document contains Unicode text.
- Everett

Message-Id: <4.3.2.7.2.20011207214311.00b14240@san-francisco.beasys.com>
Date: Fri, 07 Dec 2001 21:44:25 -0800
To: Everett Anderson, Debasish Banerjee
From: Andy Piper
Subject: Re: Encoding of UCS-2
Cc: Martin von Loewis, tom@coastin.com, interop@omg.org, issues@omg.org, Mark Davis
In-Reply-To: <3C118639.8BC3A9A0@sun.com>

At 07:17 PM 12/7/2001 -0800, Everett Anderson wrote:
>2. It sounds like UCS-2 should not be used in CORBA since it doesn't
>provide a mapping to bytes. Deprecating its use and any other
>conversion means which don't map to bytes seems reasonable.

The problem with this is that UCS-2 is the only thing JDK 1.3.x understands. So deprecation may take a while, but you should consider making the native codeset of JDK 1.4 UTF-16 (if it's not already).
andy

Date: Sat, 8 Dec 2001 16:04:53 +0100 (MET)
Message-Id: <200112081504.fB8F4r926740@paros.informatik.hu-berlin.de>
From: Martin von Loewis
To: debasish@us.ibm.com
CC: everett.anderson@sun.com, tom@coastin.com, interop@omg.org, issues@omg.org, mark.davis@us.ibm.com
In-reply-to: (debasish@us.ibm.com)
Subject: Re: Encoding of UCS-2

> The present CORBA spec. is not violating the Unicode spec.; it is
> not following the Unicode recommendations.

I understand. Earlier, you've asked "Can we fix it?" to which, I believe, the only sensible answer is "no". Changing it (treatment of the BOM in UTF-16 in GIOP) will break interoperability between existing implementations, and will cause confusion to implementors, so it must not be done.

> -- Though, both UCS-2 and UTF-16 are of the same species, according
> to CORBA 2.5 spec., an initial U+FEFF is treated as a ZWNBS in UCS-2
> and as a BOM in UTF-16!

Where, in the text of CORBA 2.5, did you see the statement that U+FEFF is treated as ZWNBS in UCS-2? I firmly believe that no such text is in the text of CORBA 2.5. Assuming so is your interpretation, which may or may not be everybody else's interpretation. If there are different interpretations, we get an issue (as we currently do). That issue must be resolved, but until it is, please don't claim anything about what CORBA 2.5 says which it doesn't.

> The absence of an initial U+FEFF in the serialized version of S will
> be interpreted as in big-endian format. Outermost indicator is
> indicating (CORBA 2.5 spec.
> Chapter 15.2.1) little-endian, but ONLY
> for the UTF-16 encoded string we will treat the stuff as
> big-endian. Does it not sound inconsistent and funny?

I do not care about things sounding funny. There are many things in GIOP that are not the way I like them (including the notion of bi-endianness, which I think is very funny). The only thing that should matter to the revision task force is whether the text is underspecified, ambiguous, or contradictory. Since we both agree that the text says that UTF-16 strings are transmitted as big endian in the absence of a BOM, I don't think there is any issue to be solved for UTF-16 encoding. There was one, but that has been resolved.

> -- UTF-32 is another variant of Unicode. For UTF-32, again the CORBA 2.5
> spec. tells us to use the byte-order flag and not consider an initial
> U+FEFF to be a BOM!

The CORBA spec (by implication) tells us that UTF-32 cannot be used with GIOP. To use it, you'd need to have an OSF charset identifier for UTF-32, but there is none assigned. Since the OSF registry is no longer maintained, there won't be one assigned in the future. So CORBA is very clear how to use UTF-32 with GIOP: You cannot use it.

> A sound spec. or theory should be uniformly applicable to the entire
> domain.

A sound spec is something that specifies unambiguously the correct behaviour for every possible case. It may do so by using a uniform theory, or it may do so by enumerating all alternatives.

> OSF registry can have UTF-32 (or UTF-xx) tomorrow, and I do not
> know who maintains the OSF registry.

Just try to register it :-) The OSF registry is not maintained anymore.
Regards, Martin Date: Sat, 8 Dec 2001 16:11:59 +0100 (MET) Message-Id: <200112081511.fB8FBxi26752@paros.informatik.hu-berlin.de> X-Authentication-Warning: paros.informatik.hu-berlin.de: loewis set sender to loewis@informatik.hu-berlin.de using -f From: Martin von Loewis To: andy@xemacs.org CC: everett.anderson@Sun.COM, debasish@us.ibm.com, tom@coastin.com, interop@omg.org, mark.davis@us.ibm.com In-reply-to: <4.3.2.7.2.20011207214311.00b14240@san-francisco.beasys.com> (message from Andy Piper on Fri, 07 Dec 2001 21:44:25 -0800) Subject: Re: Encoding of UCS-2 References: <4.3.2.7.2.20011207214311.00b14240@san-francisco.beasys.com> User-Agent: SEMI/1.13.7 (Awazu) FLIM/1.13.2 (Kasanui) Emacs/20.7 (sparc-sun-solaris2.8) MULE/4.0 (HANANOEN) MIME-Version: 1.0 (generated by SEMI 1.13.7 - "Awazu") Content-Type: text/plain; charset=US-ASCII X-UIDL: con!!=Ogd9pb3e9`W(e9 Status: RO > The problem with this is that UCS-2 is the only thing JDK 1.3.x > understands. So deprecation may take a while, but you should > consider > making the native codeset of JDK 1.4 UTF-16 (if its not already). Is that really a problem? Apparently, use of UCS-2 in GIOP is underspecified at best. To make it work, you have to make additional assumptions. There is nothing wrong with vendors making those assumptions, to provide interworking with products they need to interoperate with. If that means that everybody follows the JDK 1.3 interpretation of UCS-2 in GIOP, that might be a bad thing. Essentially, deprecating it puts it into the domain of proprietary extensions. OTOH, if OMG would rule differently on this matter, JDK 1.3 might become non-conforming; that would perhaps do more harm, since vendors would have the choice of either a compliant or an interoperating implementation. 
Regards, Martin Sender: jon@floorboard.com Message-ID: <3C124899.86B25851@biggar.org> Date: Sat, 08 Dec 2001 09:06:33 -0800 From: Jonathan Biggar X-Mailer: Mozilla 4.77 [en] (X11; U; SunOS 5.7 sun4u) X-Accept-Language: en MIME-Version: 1.0 To: Andy Piper CC: Everett Anderson , Debasish Banerjee , Martin von Loewis , tom@coastin.com, interop@omg.org, issues@omg.org, Mark Davis Subject: Re: Encoding of UCS-2 References: <4.3.2.7.2.20011207214311.00b14240@san-francisco.beasys.com> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=us-ascii X-UIDL: ~WMe9RH%"!YnP!!mGc!! Status: RO Andy Piper wrote: > > At 07:17 PM 12/7/2001 -0800, Everett Anderson wrote: > >2. It sounds like UCS-2 should not be used in CORBA since it doesn't > >provide a mapping to bytes. Deprecating its use and any other > >conversion means which don't map to bytes seems reasonable. > > The problem with this is that UCS-2 is the only thing JDK 1.3.x > understands. So deprecation may take a while, but you should consider > making the native codeset of JDK 1.4 UTF-16 (if its not already). I think all the options have confused the issue. Everett only means that UCS-2 cannot be used in GIOP. The JDK uses UCS-2 as its native codeset, and that is fine. A Java ORB should not offer UCS-2 as a transfer syntax, and must translate between UCS-2 and UTF-16 when marshalling wstring data. 
-- Jon Biggar Floorboard Software jon@floorboard.com jon@biggar.org From: "Andy Piper" To: "Martin von Loewis" Cc: , , , , Subject: RE: Encoding of UCS-2 Date: Sat, 8 Dec 2001 09:38:24 -0800 Message-ID: <002201c1800f$235fa4e0$947ba8c0@TSUNAMI> MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0 In-Reply-To: <200112081511.fB8FBxi26752@paros.informatik.hu-berlin.de> X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4522.1200 Importance: Normal Content-Type: text/plain; charset="iso-8859-1" X-UIDL: Hi)!!Yo@e91)bd9'K5e9 Status: RO > > The problem with this is that UCS-2 is the only thing JDK 1.3.x > > understands. So deprecation may take a while, but you should > > consider > > making the native codeset of JDK 1.4 UTF-16 (if its not already). > > Is that really a problem? Apparently, use of UCS-2 in GIOP is > underspecified at best. To make it work, you have to make additional > assumptions. There is nothing wrong with vendors making those > assumptions, to provide interworking with products they need to > interoperate with. If that means that everybody follows the JDK 1.3 > interpretation of UCS-2 in GIOP, that might be a bad > thing. Essentially, deprecating it puts it into the domain of > proprietary extensions. I would rather everyone *did* follow the JDK rules. > OTOH, if OMG would rule differently on this matter, JDK 1.3 might > become non-conforming; that would perhaps do more harm, since > vendors > would have the choice of either a compliant or an interoperating > implementation. Yes this would be bad, it would make the JDK worthless as an RMI-IIOP client and I think that is one of its fundamentally useful features. 
andy Date: Sat, 8 Dec 2001 23:44:24 +0100 (MET) Message-Id: <200112082244.fB8MiOg27381@paros.informatik.hu-berlin.de> X-Authentication-Warning: paros.informatik.hu-berlin.de: loewis set sender to loewis@informatik.hu-berlin.de using -f From: Martin von Loewis To: jon@floorboard.com CC: andy@xemacs.org, everett.anderson@sun.com, debasish@us.ibm.com, tom@coastin.com, interop@omg.org, issues@omg.org, mark.davis@us.ibm.com In-reply-to: <3C124899.86B25851@biggar.org> (message from Jonathan Biggar on Sat, 08 Dec 2001 09:06:33 -0800) Subject: Re: Encoding of UCS-2 References: <4.3.2.7.2.20011207214311.00b14240@san-francisco.beasys.com> <3C124899.86B25851@biggar.org> User-Agent: SEMI/1.13.7 (Awazu) FLIM/1.13.2 (Kasanui) Emacs/20.7 (sparc-sun-solaris2.8) MULE/4.0 (HANANOEN) MIME-Version: 1.0 (generated by SEMI 1.13.7 - "Awazu") Content-Type: text/plain; charset=US-ASCII X-UIDL: +e!!!%'!!!PX#!!R\-e9 Status: RO > I think all the options have confused the issue. Everett only means > that UCS-2 cannot be used in GIOP. The JDK uses UCS-2 as its native > codeset, and that is fine. A Java ORB should not offer UCS-2 as a > transfer syntax, and must translate between UCS-2 and UTF-16 when > marshalling wstring data. I think Andy claims that the JDK ORB offers UCS-2 as a transfer syntax. I don't know whether this is true. If it is not, I was confusing the issue, too. 
Regards, Martin Date: Sat, 8 Dec 2001 23:48:34 +0100 (MET) Message-Id: <200112082248.fB8MmYI27389@paros.informatik.hu-berlin.de> X-Authentication-Warning: paros.informatik.hu-berlin.de: loewis set sender to loewis@informatik.hu-berlin.de using -f From: Martin von Loewis To: andy@xemacs.org CC: everett.anderson@Sun.COM, debasish@us.ibm.com, tom@coastin.com, interop@omg.org, mark.davis@us.ibm.com In-reply-to: <002201c1800f$235fa4e0$947ba8c0@TSUNAMI> (andy@xemacs.org) Subject: Re: Encoding of UCS-2 References: <002201c1800f$235fa4e0$947ba8c0@TSUNAMI> User-Agent: SEMI/1.13.7 (Awazu) FLIM/1.13.2 (Kasanui) Emacs/20.7 (sparc-sun-solaris2.8) MULE/4.0 (HANANOEN) MIME-Version: 1.0 (generated by SEMI 1.13.7 - "Awazu") Content-Type: text/plain; charset=US-ASCII X-UIDL: Z"?!!GGmd9m- > I would rather everyone *did* follow the JDK rules. Are you certain the JDK offers UCS-2 as a wchar codeset? If so, which one (0x00010100, 0x00010101, or 0x00010102)? Where is that documented? How does the JDK deal with endianness of UCS-2 inside the GIOP message? Where is that documented? The Task Force may guess, but if it guesses incorrectly as to what the JDK does. So we'd absolutely need documentation first.
Regards, Martin Subject: Re: Encoding of UCS-2 To: Jonathan Biggar Cc: Andy Piper , "Debasish Banerjee" , Everett Anderson , interop@omg.org, issues@omg.org, jon@floorboard.com, Martin von Loewis , tom@coastin.com, "Markus Scherer" X-Mailer: Lotus Notes Release 5.0.3 March 21, 2000 Message-ID: From: "Mark Davis" Date: Sat, 8 Dec 2001 17:47:57 -0800 X-MIMETrack: Serialize by Router on d27mc601/27/M/IBM(Release 5.0.8 |June 18, 2001) at 12/08/2001 07:48:08 PM MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-UIDL: *-'!!2-;e9`m\d9EHgd9 Status: RO > and must translate between UCS-2 and UTF-16 when In normal circumstances, the "translation" between UCS-2 and UTF-16 is a NOP: - Both can be either big or little endian, using a BOM to indicate that status unless supplanted by the appropriate higher-level protocol. - All BMP code points are expressed identically in UTF-16 and UCS-2. - When translating any Unicode formats, unassigned code points are maintained, thus supplementary characters in UTF-16, expressed as surrogate code units, will be maintained in UCS-2 (even though a UCS-2 program wouldn't know that those surrogate code units represented something different). This should not be surprising, since UTF-16 was designed expressly for compatibility with UCS-2, so that data channeled from a UTF-16 implementation through a UCS-2 pipe, and back out to a different UTF-16 implementation, would be maintained. It is not productive to distinguish between "UCS-2" and "UTF-16" when discussing *data*. The only time it may be useful is when talking about a program -- whether it supports particular types of characters. One can be compliant with the Unicode Standard and distinguish between these two tags, however the strong recommendation is not to do so. The discussion of Java and UCS-2 also confuses matters. A String in Java contains UTF-16 code units -- it is just that much of the Java software doesn't interpret the surrogates as representing code points.
When that string is serialized, it is UTF-16; one can see that by the way that Java serializes Strings into UTF-8. Mark ___ Mark Davis, IBM GCoC, Cupertino (408) 777-5850 [fax: 5892], mark.davis@us.ibm.com, president@unicode.org http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10275+N.+De+Anza&csz=95014 Jonathan Biggar To: Andy Piper , Debasish d.com> Banerjee/Rochester/IBM@IBMUS, Martin von Loewis Sent by: , tom@coastin.com, jon@floorboard interop@omg.org, issues@omg.org, Mark .com Davis/Cupertino/IBM@IBMUS Subject: Re: Encoding of UCS-2 2001.12.08 09:06 From: "Andy Piper" To: "Martin von Loewis" , Cc: , , , , , Subject: RE: Encoding of UCS-2 Date: Sat, 8 Dec 2001 18:39:15 -0800 Message-ID: <002b01c1805a$b14ba060$947ba8c0@TSUNAMI> MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0 In-Reply-To: <200112082244.fB8MiOg27381@paros.informatik.hu-berlin.de> X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4522.1200 Importance: Normal Content-Type: text/plain; charset="iso-8859-1" X-UIDL: KOPe9F7%e9E^cd9(FEe9 Status: RO > I think Andy claims that the JDK ORB offers UCS-2 as a transfer > syntax. I don't know whether this is true. If it is not, I was > confusing the issue, too. Well Everett can confirm or deny but I believe that is the case for JDK 1.3.x. Not only that it cannot do any codeset conversion so if you don't use UCS-2 you are hosed. JDK 1.4 is much better (it does conversion and negotiation) although I don't know what its default native codeset is. Just to be clear it uses UCS-2 taking the endianess of the stream and does not use or understand BOMs. It uses OSF registry entry 0x00010100. Its not documented as far as I am aware. 
andy From: "Everett Anderson" To: "Andy Piper" , "Martin von Loewis" , Cc: , , , , Subject: RE: Encoding of UCS-2 Date: Sun, 9 Dec 2001 02:22:04 -0800 Message-ID: MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) Importance: Normal X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4807.1700 In-Reply-To: <002b01c1805a$b14ba060$947ba8c0@TSUNAMI> Content-Type: text/plain; charset="iso-8859-1" X-UIDL: m>*!!JV=!!W,]d9494e9 Status: RO Hi Andy, While I definitely appreciate your comment on the JDK ORB being useful as a RMI-IIOP client, I need to explicitly say that the Sun JDK ORBs before the upcoming J2SE 1.4 (or current J2EE 1.3 RI) release were not interoperable for a variety of reasons, not just that they required UCS-2 as a wchar/wstring code set. I don't want their behavior to influence a decision here. For interoperability, the newer Sun ORBs behave correctly. I agree that for my group to maintain backwards compatibility with the older releases, it's nice to have UCS-2 in our code set conversion list. We currently (J2SE 1.4/J2EE 1.3) are advertising the following, hoping that other vendors' ORBs will end up taking Latin-1/UTF-16 or UTF-8/UTF-16 rather than jumping on UCS-2: char/string: native: ISO8859-1 conversion[]: UTF-8, ISO646 wchar/wstring: native: UTF-16 conversion[]: UCS-2 We also provide a proprietary undocumented system property for specifying exactly what code sets to use. (We usually don't document these types of implementation specific things to avoid users thinking that they're standard OMG properties, but they're available if people ask.) I can't expect anyone to interoperate with non-compliant older JDK ORBs except for new Sun ORBs, and I definitely don't want to decide on issue resolutions based on the non-compliant behavior. 
- Everett From: "Everett Anderson" To: "Martin von Loewis" , Cc: , , , Subject: RE: Encoding of UCS-2 Date: Sun, 9 Dec 2001 02:41:08 -0800 Message-ID: MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) Importance: Normal X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4807.1700 In-Reply-To: <200112082248.fB8MmYI27389@paros.informatik.hu-berlin.de> Content-Type: text/plain; charset="us-ascii" X-UIDL: b;Td9:Q_d9n<#!!CINd9 Status: RO Hi Martin, > > I would rather everyone *did* follow the JDK rules. > > Are you certain the JDK offers UCS-2 as a wchar codeset? If > so, which > one (0x00010100, 0x00010101, or 0x00010102)? Where is that > documented? > > How does the JDK deal with endianness of UCS-2 inside the GIOP > message? Where is that documented? > > The Task Force may guess, but if it guesses incorrectly as > to what the > JDK does.
So we'd absolutely need documentation first. In JDK 1.3.x, the default ORB encodes "UCS-2" (OSF registry 0x00010100) as 2 bytes, no byte order marker, and with the byte order of the stream/encapsulation to which it is writing. I believe at least one other vendor's ORB does this, and this interpretation is backed up by the CORBA spec's example ('encode UCS-2 as unsigned shorts'), as well as the note in the IANA registry about UCS-2, as I originally posted. I asked for an issue to be raised since a customer had interoperability issues since another ORB encodes UCS-2 exactly as CORBA defines UTF-16. As we've heard, it sounds like it is invalid to have UCS-2 in CORBA. Thus, I'd love to see a resolution like the following: 1. Remove the UCS-2 example. 2. Clarify that only encodings which result in bytes can be used with CORBA. (Someone would need to put this statement together since I don't think I have the terminology correct.) 3. Deprecate UCS-2 and code sets which don't match the above criteria. 4. As part of the deprecation, perhaps have newer client implementations not consider such code sets when performing the code set negotiation algorithm. This would still leave the problem Andy mentioned about interoperability with JDK 1.3.x and earlier Sun ORBs, but I don't think we can do anything, there. If a vendor really wants to go out of his way to make his ORB interoperable with the non-compliant legacy Sun ORB behavior, I can tell him everything our ORB liked and disliked, but that is not part of our spec work. 
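The JDK 1.3.x behaviour described above (two octets per code unit, no byte order mark, byte order taken from the enclosing stream or encapsulation) might look like the following Python sketch. The function name and the `little_endian` parameter are assumptions made for illustration; nothing here is taken from the Sun ORB source, and length prefixes and alignment are left out.

```python
import struct


def encode_ucs2_stream_order(text: str, little_endian: bool) -> bytes:
    """Encode text as UCS-2 code units in the byte order of the stream.

    Hypothetical helper: each code unit is written like an unsigned short,
    with no BOM, following the stream's byte-order flag.
    """
    fmt = "<H" if little_endian else ">H"
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            # UCS-2 has no surrogate mechanism, so this sketch rejects
            # anything outside the Basic Multilingual Plane.
            raise ValueError("UCS-2 cannot represent code points above U+FFFF")
        out += struct.pack(fmt, code_point)
    return bytes(out)


print(encode_ucs2_stream_order("A", little_endian=True).hex())   # 4100
print(encode_ucs2_stream_order("A", little_endian=False).hex())  # 0041
```

An ORB that instead applies the UTF-16 rule to UCS-2 would read the little-endian octets `41 00` as U+4100, which is the kind of interoperability failure the customer report describes.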
:) - Everett Date: Sun, 9 Dec 2001 18:39:23 +0100 (MET) Message-Id: <200112091739.fB9HdNu28823@paros.informatik.hu-berlin.de> X-Authentication-Warning: paros.informatik.hu-berlin.de: loewis set sender to loewis@informatik.hu-berlin.de using -f From: Martin von Loewis To: mark.davis@us.ibm.com CC: jon@floorboard.com, andy@xemacs.org, debasish@us.ibm.com, everett.anderson@sun.com, interop@omg.org, issues@omg.org, jon@floorboard.com, tom@coastin.com, markus.scherer@us.ibm.com In-reply-to: (mark.davis@us.ibm.com) Subject: Re: Encoding of UCS-2 References: User-Agent: SEMI/1.13.7 (Awazu) FLIM/1.13.2 (Kasanui) Emacs/20.7 (sparc-sun-solaris2.8) MULE/4.0 (HANANOEN) MIME-Version: 1.0 (generated by SEMI 1.13.7 - "Awazu") Content-Type: text/plain; charset=US-ASCII X-UIDL: Ca8!!0~S!!,*od9Sd@!! > In normal circumstances, the "translation" between UCS-2 and UTF-16 is a > NOP: > > - Both can be either big or little endian, using a BOM to indicate that > status unless supplanted by the appropriate higher-level protocol. I'm confused. From a strict Unicode Consortium point-of-view (TR#17), I understand that neither can use little or big endian, since they use 16-bit code units, not bytes. It is possible to define character encoding schemes for both of them. The Unicode consortium has defined UCS-2BE and UCS-2LE for Unicode 1.1 and UTF-16BE and UTF-16LE for Unicode 3.0. None of those schemes allows to use a BOM. Of course, other schemes are possible: For example, the IANA scheme with MIBEnum 1015 (which, strictly speaking, is a character map) allows to use the BOM with UTF-16. It appears that neither IANA nor the Unicode consortium has an encoding scheme that allows write UCS-2 into a byte stream, using the BOM to indicate byte order. Is TR#17 incorrect? > One can be compliant with the Unicode Standard and distinguish between > these two tags, however the strong recommendation is not to do so. 
I would hope that the strict recommendation is to always talk about character encoding schemes in byte-oriented transports, and never about character encoding forms (at least not when those have more than 256 code units). > The discussion of Java and UCS-2 also confuses matters. A String in Java > contains UTF-16 code units -- it is just that much of the Java software > doesn't interpret the surrogates as representing code points. When that > string is serialized, it is UTF-16; one can see that by the way that Java > serializes Strings into UTF-8. That is part of the confusion. The other part is that people, when talking about Java, UCS-2, and CORBA, may either talk about the internal storage of java.lang.String, or they may talk about transmitting CORBA wstring on the wire (the two need not be related at all). Regards, Martin Importance: Normal Sensitivity: Subject: Re: Encoding of UCS-2 To: Martin von Loewis Cc: everett.anderson@Sun.COM, tom@coastin.com, interop@omg.org, issues@omg.org, "Mark Davis" X-Mailer: Lotus Notes Release 5.0.3 (Intl) 21 March 2000 Message-ID: From: "Debasish Banerjee" Date: Fri, 7 Dec 2001 16:46:10 -0600 X-MIMETrack: Serialize by Router on d27ml601/27/M/IBM(Build M10_08082001 Beta 3|August 08, 2001) at 12/07/2001 04:46:13 PM MIME-Version: 1.0 Content-Disposition: inline Content-Type: multipart/alternative; Boundary="0__=09BBE188DFE18A3F8f9e8a93df938690918c09BBE188DFE18A3F" X-UIDL: ;73!!,C[!!`9!e96?>e9 Status: RO > > Finally, the present CORBA spec. is not following Unicode > > recommendations in a few instances. Can we fix it? > Depends on what specifically you are referring to. Things that have > been clarified by resolutions of earlier issues so that they are now > unambiguous SHOULD not be changed. > > I am specifically referring to the OMG BOM rule. Initial U+FFFE is a > BOM for UTF-16 encoded OMG data, but a ZWNBS for UCS-2 encoded OMG > data! Are you claiming that the CORBA is not following the Unicode recommendation here?
Yes In what respect? Either you are claiming that U+FFFE should not be use as a BOM in UTF-16, or not be used as a ZWNBS in UCS-2; your statement is unclear as to where you see the violation of the Unicode recommendations I think this was mentioned in earlier notes. But please take a look at the Conformance clause C3, and Chapter 2.7 of the Unicode 3.0 spec. The line "If other signalling methods are used, signatures SHOULD NOT BE EMPLOYED" is in Section 2.7 in the Byte Order Mark (BOM) section. In the CORBA domain, where there is a byte-order indicator flag, the BOM is not NECESSARY and it SHOULD not be used for byte-order indication. The present CORBA spec. is not violating the Unicode spec.; it is not following the Unicode recommendations. In UCS-2(UTF-16)/UCS-4(UTF-32) encoded OMG data U+FEFF and U+FFFE should always get interpreted as ZWNBSP regardless of their positions in the serialized streams. First let's look at UTF-16 encoded data, where UTF-16 is understood as a CES, in the spirit of RFC 2781. That RFC, in section 3.2, specifies the usage of the BOM for encoding UTF-16 data in a byte stream. It refers to Unicode section 2.4 and Annex F of ISO 10646. So that probably is not in violation of the Unicode spec The RFC 2781 is about the MIME world where there are no general byte-order indicators. MIME relies on BOM for plain UTF-16 data transmission. CORBA is different from MIME. We already have a byte-order indicator flag, which is marking the serialized stream implicitly as UTF-16BE or UTF-16LE. We should not carry over the MIME byte order indicator rule literally to the CORBA domain. The byte-order rules of RFC 2781 are context-sensitive in nature, they are applicable to the contexts of MIME and any other domain where there exists no separate byte-order indicators. Again, the present CORBA spec. is not violating the Unicode spec.; it is just not following the Unicode recommendations (refer to Chapter 2.7 of the Unicode 3.0 spec.). 
Now, let's look at ZWNBS in UCS-2. I cannot find anything in the CORBA specification that suggests any interpretation of U+FFFE for UCS-2. The BOM is mentioned only in 15.3.1.6, which only talks about UTF-16. UCS-2 is a restricted form of UTF-16. In fact, in a sense UTF-16 can be considered to be a superset of UCS-2. -- Though both UCS-2 and UTF-16 are of the same species, according to CORBA 2.5 spec., an initial U+FFFE is treated as a ZWNBS in UCS-2 and as a BOM in UTF-16! -- Suppose in a CDR encap. I have a long (L) and a UTF-16 encoded string (S). The CDR's byte-order indicator flag marks the entire encapsulation as little-endian. L is serialized as little-endian. The absence of an initial U+FFFE in the serialized version of S means that S will be interpreted as big-endian. Outermost indicator is indicating (CORBA 2.5 spec. Chapter 15.2.1) little-endian, but ONLY for the UTF-16 encoded string we will treat the stuff as big-endian. Does not it sound inconsistent and funny? -- UTF-32 is another variant of Unicode. For UTF-32, again the CORBA 2.5 spec. tells us to use the byte-order flag and not consider an initial U+FFFE to be a BOM! These are the precise reasons why this BOM rule in CORBA 2.5 spec. looks *ad-hoc*. A sound spec. or theory should be uniformly applicable to the entire domain. Special case approaches produce inconsistencies and in general have very little appeal. > Also, CORBA spec. is presently silent about the interpretation of an > initial U+FFFE for UCS-4 or UTF-32 encoded data. What am I supposed > to assume--BOM or ZWNBS? The OSF registry does not list UTF-32 as a character set, so you cannot use that in GIOP. As for UCS-4, I'd suggest the same resolution as for UCS-2: Don't use it, since it operates on 32-bit code units, not on bytes. OSF registry can have UTF-32 (or UTF-xx) tomorrow, and I do not know who maintains the OSF registry. But that is not an issue of paramount importance.
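The encapsulation example above, with the long L and the UTF-16 string S, can be made concrete with a short Python sketch. The helper name `marshal_l_and_s` is hypothetical, the encapsulation is assumed to be flagged little-endian as in the example, and the string's length prefix and CDR alignment padding are left out so that only the byte order is visible.

```python
import struct
from typing import Tuple


def marshal_l_and_s(l_value: int, s_value: str) -> Tuple[bytes, bytes]:
    """Return the octets for L and S inside a little-endian encapsulation.

    Hypothetical helper illustrating the CORBA 2.5 reading under discussion.
    """
    # L follows the encapsulation's byte-order flag (little-endian here).
    l_octets = struct.pack("<l", l_value)
    # S is UTF-16 with no BOM, so it is written big-endian regardless
    # of the flag.
    s_octets = s_value.encode("utf-16-be")
    return l_octets, s_octets


l_octets, s_octets = marshal_l_and_s(1, "A")
print(l_octets.hex(), s_octets.hex())  # 01000000 0041
```

The long has its least significant octet first while the string has its most significant octet first, both inside the same encapsulation, which is the inconsistency being pointed out.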
A safe and sound theory or specification should have enough flexibility to be uniformly applicable for all the non-byte oriented variants of Unicode. > The conformance clause C3 in the old Unicode spec. (Unicode V2.0, > Addison-Wesley, 1996; Chapter 3.1) has not changed much in Version 3.0. > Can you please point out the specific s or s > in the Unicode spec., which got clarified by the CORBA specifications. > Incidentally, CORBA 2.5 spec. was published long after Unicode 3.0 spec. > was available in book form. There was a lot of confusion on the issue of code sequences and byte sequences in earlier formulations of Unicode. For example, please look at http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html which corrects usage of the term "byte sequence" into "code unit sequence". People could get the impression that both UCS-2 and UTF-16 allow transmission of Unicode data in a byte stream. That has only been clarified in recent Unicode versions Thanks for the reference. I will look into it. Regards, Martin Importance: Normal Sensitivity: Subject: Re: Encoding of UCS-2 To: Martin von Loewis Cc: tom@coastin.com, interop@omg.org, "Mark Davis" , "Markus Scherer" X-Mailer: Lotus Notes Release 5.0.3 (Intl) 21 March 2000 Message-ID: From: "Debasish Banerjee" Date: Thu, 13 Dec 2001 13:51:09 -0600 X-MIMETrack: Serialize by Router on d27ml601/27/M/IBM(Build M10_08082001 Beta 3|August 08, 2001) at 12/13/2001 01:51:14 PM MIME-Version: 1.0 Content-Disposition: inline Content-Type: multipart/alternative; Boundary="0__=09BBE1B2DFCBDEAF8f9e8a93df938690918c09BBE1B2DFCBDEAF" X-UIDL: El1!!B[kd9,Y0!!:M3e9 >> -- Though, both UCS-2 and UTF-16 are of same species, according >> to CORBA 2.5 spec., an initial U+FFFE is treated as a ZWNBS in UCS-2 >> and as a BOM in in UTF-16! > Where, in the text of CORBA 2.5, did you see the statement that U+FFFE > is treated as ZWNBS in UCS-2? I firmly believe that no such text is in > the text of CORBA 2.5. 
Assuming so is your interpretation, which may > or may not be everybody else's interpretation. If there are different > interpretations, we get an issue (as we currently do). That issue must > be resolved, but until it is, please don't claim anything about what > CORBA 2.5 says which it doesn't A simple implication (in conjunction with the Unicode specifications). For sure, CORBA 2.5 spec. does not state anywhere in print that UTF-32 can not be used as TCS. But in the same note you derived the implication: " The CORBA spec (by implication) tells us that UTF-32 cannot be used with GIOP". I guess informal implications can be used in both instances. Otherwise, we may not also infer that CORBA 2.5 (in conjunction with the STATIC state of OSF registry) prohibits the use of UTF-32 in GIOP; can we? It seems from your note that there can be more than one interpretations of an initial "FF FE" in UCS-2 encoded GIOP data. I am curious to know about your interpretation of the same. > A sound spec is something that is specifies unambiguously the correct > behaviour for every possible case. It may do so by using a uniform > theory, or it may do so by enumerating all alternatives Probably there are not many well known theories in the field of artificial sciences which try to directly enumerate all the alternatives without relying on the principle of induction. In any case, it seems, we differ strongly in our viewpoints regarding Chapter 15.3.1.6 of the CORBA 2.5 spec. An initial "FF FE" must be interpreted as a BOM in an UTF-32 encoded GIOP data, but as (I guess, unspecified yet by the CORBA 2.5 spec. where UCS-2 encoding is allowed) when the same GIOP data is considered to be encoded in UCS-2--UCS-2 is a proper subset of UTF-32. 
Given a choice, I think we should try to adopt an uniformly applicable theory rather than prescribing *ad-hoc* solutions, which may continuously need clumsy enhancements to remove 'introduced inconsistencies' and also for encompassing domain elements which may get considered in future. CORBA may adopt in future a XYZ registry, which may have UTF-32, UTF-xx, etc. in its repertory.
Date: Wed, 27 Mar 2002 10:53:59 -0500 From: Tom Rutt Reply-To: tom@coastin.com Organization: Coast Enterprises LLC X-Mailer: Mozilla 4.78 [en] (Win98; U) X-Accept-Language: en MIME-Version: 1.0 To: orb-revision@omg.org CC: tom@coastin.com Subject: Issue 4008 Proposed Resolution Content-Type: multipart/mixed; boundary="------------86ADDF1BFBB6A1CF4B9BD96C" X-UIDL: 0jdd9[a~e9L1`d9`QF!! attached is a draft proposed resolution for Issue 4008 This is not for vote but for discussion at this time.
--
----------------------------------------------------
Tom Rutt email: tom@coastin.com
Tel: +1 732 801 5744 Fax: +1 732 774 5133
Fujitsu Software Corp

Issue 4008: wchar endianness (corba-rtf)

Source: AT&T (Dr. Duncan Grisby, dgrisby@uk.research.att.com)
Nature: Uncategorized Issue
Severity:
Summary:

In a similar vein to Vishy's question about alignment, what should the endianness of a word-oriented wchar be? This applies both to single wchars, and the separate code points in a wstring. With the 2.3 spec, it seemed quite obvious to me that word-oriented wide characters should have the same endianness as the rest of the stream. After all, they are no different from any other word-oriented type.

However, with the new 2.4 spec, there is now a bizarre section saying that if, and only if, the TCS-W is UTF-16, all wchar values are marshalled big-endian unless there is a byte-order-mark telling you otherwise. I don't understand the point of this. Section 2.7 of the Unicode Standard, version 3.0 says [emphasis mine]:

  "Data streams that begin with U+FEFF byte order mark are likely to contain Unicode values. It is recommended that applications sending or receiving _untyped_ data streams of coded characters use this signature. _If other signaling methods are used, signatures should not be employed._"

It seems quite clear to me that a GIOP stream is a _typed_ data stream which uses its own signalling methods. The Unicode standard therefore says that a BOM should _not_ be used.

I guess it's too late to clean up the UTF-16 encoding, but what about other word-oriented code sets? What if the end-points have negotiated the use of UCS-4? Should that be big-endian unless there's a BOM? The spec doesn't say. Even worse, what if the negotiated encoding is something like Big5? That doesn't _have_ byte order marks. Big5 doesn't have a one-to-one Unicode mapping, so it's not sensible to always translate to UTF-16.

GIOP already has a perfectly good mechanism for sorting out this kind of issue. Please can wchar be considered on equal footing with all other types, and use the stream's endianness?

Resolution:
UCS-2 and UCS-4 are native codesets, and are not intended for transfer syntaxes.

If little endian or big endian is desired, the new IANA code points for UTF16BE and UTF 16LE should be used.

Otherwise we should have the same interpretation for UCS-2 as with UTF.

Revised Text: Add clarification:
"
The byte order rules for UCS-2 and UCS-4 in GIOP are the same as for UTF-16.

GIOP 1.x allows the use of UTF16BE and UTF16LE, from the IANA codeset registry, for systems which desire to indicate a preference for a particular byte order.
"
Actions taken:
October 31, 2000: received issue
March 7, 2002: moved to Core RTF

Discussion:
related to previous RTF (issue 3405b)

X-Sender: andyp@san-francisco.beasys.com
X-Mailer: QUALCOMM Windows Eudora Version 4.3.2
Date: Wed, 27 Mar 2002 09:15:50 -0800
To: tom@coastin.com, orb-revision@omg.org
From: Andy Piper
Subject: Re: Issue 4008 Proposed Resolution
Cc: tom@coastin.com
In-Reply-To: <3CA1EB17.48A60303@coastin.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed

So I agree totally with the discussion here, but not with the resolution. I know of at least three implementations that already treat UCS-2 as the discussion appears to want, i.e. it is encoded using the endianness of the stream.

Given that we don't want people using UCS-2 anyway, and the mandate for UTF-16 is generally agreed to be perverse, it seems nonsensical to actually break existing implementations. So I would like to see wording that basically says the endianness of the stream shall be used apart from UTF-16 without a BOM.

Thanks

andy

At 10:53 AM 3/27/02 -0500, Tom Rutt wrote:
> attached is a draft proposed resolution for Issue 4008
>
> This is not for vote but for discussion at this time.
Date: Wed, 27 Mar 2002 12:26:09 -0500
From: Tom Rutt
Reply-To: tom@coastin.com
Organization: Coast Enterprises LLC
X-Mailer: Mozilla 4.78 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Andy Piper
CC: orb-revision@omg.org
Subject: Re: Issue 4008 Proposed Resolution
References: <4.3.2.7.2.20020327090925.00b07f00@san-francisco.beasys.com>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii

I cannot find any text in any specification supporting your view.

UCS-2 and UCS-4 in the ISO standard have text indicating that only Big Endian is allowed.

I think that the confusion lies in using UCS-2 as a transfer encoding. UTF-16 has the same values as UCS-2 for a large range of characters (16 bit unicode) but is different for the extended 4 octet characters.

The trouble, from my perspective, with your proposal is that it assumes everyone implemented the spec the same way the subset of implementations you refer to did. Someone may always send Big Endian because they read the ISO spec.
The BOM is a simple, safe check which is consistent with the interpretation we got from people in the unicode forum on the rules for UTF-16. There is also an IETF RFC 2781 regarding UTF-16, which has the same rules that we use for UTF-16. I quote directly from the RFC:

"
3.3 Choosing a label for UTF-16 text

Any labelling application that uses UTF-16 character encoding, and explicitly labels the text, and knows the serialization order of the characters in text, SHOULD label the text as either "UTF-16BE" or "UTF-16LE", whichever is appropriate based on the endianness of the text. This allows applications processing the text, but unable to look inside the text, to know the serialization definitively.

Text in the "UTF-16BE" charset MUST be serialized with the octets which make up a single 16-bit UTF-16 value in big-endian order. Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text.

Text in the "UTF-16LE" charset MUST be serialized with the octets which make up a single 16-bit UTF-16 value in little-endian order. Systems labelling UTF-16LE text MUST NOT prepend a BOM to the text.

Any labelling application that uses UTF-16 character encoding, and puts an explicit charset label on the text, and does not know the serialization order of the characters in text, MUST label the text as "UTF-16", and SHOULD make sure the text starts with 0xFEFF.

An exception to the "SHOULD" rule of using "UTF-16BE" or "UTF-16LE" would occur with document formats that mandate a BOM in UTF-16 text, thereby requiring the use of the "UTF-16" tag only.
"

Please show me a specification which supports your view.

Why not use the new IANA UTF16 BE and LE codepoints to fix your problem in a future release?

Andy Piper wrote:
> So I agree totally with the discussion here, but not with the resolution. I
> know of at least three implementations that already treat UCS-2 as the
> discussion appears to want, i.e. it is encoded using the endianness of the
> stream.
>
> Given that we don't want people using UCS-2 anyway, and the mandate for
> UTF-16 is generally agreed to be perverse, it seems nonsensical to actually
> break existing implementations. So I would like to see wording that
> basically says the endianness of the stream shall be used apart from UTF-16
> without a BOM.
>
> Thanks
>
> andy
>
> At 10:53 AM 3/27/02 -0500, Tom Rutt wrote:
> >attached is a draft proposed resolution for Issue 4008
> >
> >This is not for vote but for discussion at this time.

--
----------------------------------------------------
Tom Rutt email: tom@coastin.com
Tel: +1 732 801 5744 Fax: +1 732 774 5133
Fujitsu

Date: Wed, 27 Mar 2002 12:29:21 -0500
From: Tom Rutt
Reply-To: tom@coastin.com
Organization: Coast Enterprises LLC
X-Mailer: Mozilla 4.78 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: Andy Piper
CC: orb-revision@omg.org
Subject: Re: Issue 4008 Proposed Resolution
References: <4.3.2.7.2.20020327090925.00b07f00@san-francisco.beasys.com>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii

This is a second response; I found an even stronger quote from RFC 2781:

"
4.3 Interpreting text labelled as UTF-16

Text labelled with the "UTF-16" charset might be serialized in either big-endian or little-endian order. If the first two octets of the text is 0xFE followed by 0xFF, then the text can be interpreted as being big-endian. If the first two octets of the text is 0xFF followed by 0xFE, then the text can be interpreted as being little-endian. If the first two octets of the text is not 0xFE followed by 0xFF, and is not 0xFF followed by 0xFE, then the text SHOULD be interpreted as being big-endian.

All applications that process text with the "UTF-16" charset label MUST be able to read at least the first two octets of the text and be able to process those octets in order to determine the serialization order of the text. Applications that process text with the "UTF-16" charset label MUST NOT assume the serialization without first checking the first two octets to see if they are a big-endian BOM, a little-endian BOM, or not a BOM. All applications that process text with the "UTF-16" charset label MUST be able to interpret both big-endian and little-endian text.
"

----------------------------------------------------
Tom Rutt email: tom@coastin.com
Tel: +1 732 801 5744 Fax: +1 732 774 5133
Fujitsu

X-Sender: andyp@san-francisco.beasys.com
X-Mailer: QUALCOMM Windows Eudora Version 4.3.2
Date: Wed, 27 Mar 2002 09:58:57 -0800
To: tom@coastin.com
From: Andy Piper
Subject: Re: Issue 4008 Proposed Resolution
Cc: orb-revision@omg.org
In-Reply-To: <3CA20171.EBFB1C30@coastin.com>
References: <4.3.2.7.2.20020327090925.00b07f00@san-francisco.beasys.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed

To: ,
Subject: RE: Issue 4724 Proposed Resolution
Date: Wed, 27 Mar 2002 14:27:42 -0800
Message-ID:
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)
In-Reply-To: <3CA1EE4C.E5FB0120@coastin.com>
Importance: Normal
X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4910.0300
Content-Type: text/plain; charset="us-ascii"

Hi Tom,

Thanks for pushing for resolution of these issues, though at the moment, it does seem a bit overwhelming. I like most of them. Here are some comments:

4236 Switching to IANA registry

I'd really like to make the switch so we can have UTF-16BE/LE. I agree with Jonathan Biggar that more needs to be said about the numbers to maintain compatibility with older ORBs. Could ORBs with GIOP > 1.2 support use the IANA numbers in a separate IOR profile, and include a GIOP 1.2 profile with the OSF numbers?

4008 Making UCS-2 the same as CORBA defined UTF-16

I know several people from at least three companies that have tried to find spec justification for their UCS-2 encodings, but no one can find something that isn't contradicted somewhere else. At least one major ORB as well as the Sun Java ORBs encode UCS-2 with no byte order marker and take the byte order of the stream.
However, the Java ORB has a bug against it since this prevents interop with the other main interpretation for UCS-2. As we now seem to know, UCS-2 is only meant as a native code set, so we shouldn't be sending it at all. For our ORBs, I'd definitely prefer the interpretation we went with rather than "It's the same as UTF-16". :)

4824 Require code set context from client before dealing with wchar/wstring data

Is there any interpretation of the INS spec that might lead someone to develop an implementation that, given a corbaloc URL, wouldn't send a full IOR back to the client (the only way it could perform negotiation)?

- Everett

Date: Thu, 28 Mar 2002 08:38:13 +0000
From: Simon Nash
Organization: IBM
X-Mailer: Mozilla 4.72 [en] (Windows NT 5.0; I)
X-Accept-Language: en
MIME-Version: 1.0
To: tom@coastin.com
CC: corba-rtf@omg.org
Subject: Re: [Fwd: Issue 4008 Proposed Resolution]
References: <3CA20450.C6D03B34@coastin.com>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii

Tom,

I don't agree with making the UCS2 rules the same as the UTF16 rules. If use of UCS2 is to be discouraged, there is little value in making an incompatible change to how it works. The pain is definitely not worth any supposed gain.
Simon

Tom Rutt wrote:
> Subject: Issue 4008 Proposed Resolution
> Date: Wed, 27 Mar 2002 10:53:59 -0500
> From: Tom Rutt
> Organization: Coast Enterprises LLC
> To: orb-revision@omg.org
> CC: tom@coastin.com
>
> attached is a draft proposed resolution for Issue 4008
>
> This is not for vote but for discussion at this time.

--
Simon C Nash, Chief Technical Officer, IBM Java Technology
Tel. +44-1962-815156 Fax +44-1962-818999
Hursley, England
Internet: nash@hursley.ibm.com
Lotus Notes: Simon Nash@ibmgb

Sender: jbiggar@Resonate.com
Message-ID: <3CA4E123.F333795D@floorboard.com>
Date: Fri, 29 Mar 2002 13:48:19 -0800
From: Jonathan Biggar
X-Mailer: Mozilla 4.79 [en] (X11; U; SunOS 5.6 sun4u)
X-Accept-Language: en
MIME-Version: 1.0
To: tom@coastin.com
CC: "corba-rtf@omg.org"
Subject: Re: [Fwd: [Fwd: Issue 4008 Proposed Resolution]]
References: <3CA4BDDA.1EBCAEE9@coastin.com>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii

Tom Rutt wrote:
> If you can come up with text for UCS2 that everyone agrees to, then we can put it in.
>
> Simon Nash wrote:
>
> > Tom,
> > I don't agree with making the UCS2 rules the same as the UTF16 rules.
> > If use of UCS2 is to be discouraged, there is little value in making an
> > incompatible change to how it works. The pain is definitely not worth
> > any supposed gain.

Let me reiterate my position, which is supported by Unicode Technical Report #17 (http://www.unicode.org/unicode/reports/tr17/):

UCS-2 is an inappropriate choice for use in GIOP since it is a Character Encoding Form (CEF), which maps code points to integer values (code units), and not a Character Encoding Scheme (CES), which maps code units to serialized bytes. Since GIOP is a serialized byte protocol, only a CES makes sense.

Unfortunately, this has all been confused since the meaning of the acronym UCS-2 has changed over time.
In Unicode 1.1, it was used to identify a CES, but since Unicode 2.0, it is used to identify a CEF instead. The older UCS-2 (used as a CES) is identical to a subset of UTF-16 that doesn't use surrogates.

My position is that we should declare use of UCS-2 in GIOP to be deprecated, and UTF-16 should be used in its place in the future.

--
Jon Biggar
Floorboard Software
jon@floorboard.com
jon@biggar.org

X-Sender: andyp@san-francisco.beasys.com
X-Mailer: QUALCOMM Windows Eudora Version 4.3.2
Date: Fri, 29 Mar 2002 14:07:49 -0800
To: Jonathan Biggar , tom@coastin.com
From: Andy Piper
Subject: Re: [Fwd: [Fwd: Issue 4008 Proposed Resolution]]
Cc: "corba-rtf@omg.org"
In-Reply-To: <3CA4E123.F333795D@floorboard.com>
References: <3CA4BDDA.1EBCAEE9@coastin.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed

That's all fine, but doesn't address the confusion over implementation in existing ORBs. I think we should doc the status quo, as well as pointing out it's deprecated.

andy

Sender: jbiggar@Resonate.com
Message-ID: <3CA4E988.DEA17553@floorboard.com>
Date: Fri, 29 Mar 2002 14:24:08 -0800
From: Jonathan Biggar
X-Mailer: Mozilla 4.79 [en] (X11; U; SunOS 5.6 sun4u)
X-Accept-Language: en
MIME-Version: 1.0
To: Andy Piper
CC: tom@coastin.com, "corba-rtf@omg.org"
Subject: Re: [Fwd: [Fwd: Issue 4008 Proposed Resolution]]
References: <3CA4BDDA.1EBCAEE9@coastin.com> <4.3.2.7.2.20020329140621.00bf0430@san-francisco.beasys.com>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii

Andy Piper wrote:
>
> That's all fine, but doesn't address the confusion over implementation in
> existing Orbs. I think we should doc the status quo, as well pointing out
> its deprecated.

Unfortunately there isn't a status quo, since some ORBs have implemented UCS-2 using the byte-order of the CDR stream, and others have implemented UCS-2 to always use the default big-endian byte order. So somebody's ox gets to be gored. :(

--
Jon Biggar
Floorboard Software
jon@floorboard.com
jon@biggar.org

X-Sender: andyp@san-francisco.beasys.com
X-Mailer: QUALCOMM Windows Eudora Version 4.3.2
Date: Fri, 29 Mar 2002 14:54:34 -0800
To: Jonathan Biggar
From: Andy Piper
Subject: Re: [Fwd: [Fwd: Issue 4008 Proposed Resolution]]
Cc: tom@coastin.com, "corba-rtf@omg.org"
In-Reply-To: <3CA4E988.DEA17553@floorboard.com>
References: <3CA4BDDA.1EBCAEE9@coastin.com> <4.3.2.7.2.20020329140621.00bf0430@san-francisco.beasys.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed

> > That's all fine, but doesn't address the confusion over implementation in
> > existing Orbs. I think we should doc the status quo, as well pointing out
> > its deprecated.
>
> Unfortunately there isn't a status quo, since some ORBs have implemented
> UCS-2 using the byte-order of the CDR stream, and others have implemented
> UCS-2 to always use the default big-endian byte order. So somebody's ox
> gets to be gored. :(

That's why we vote :) Majority rules...

andy

X-Sender: andyp@san-francisco.beasys.com
X-Mailer: QUALCOMM Windows Eudora Version 4.3.2
Date: Wed, 10 Apr 2002 11:48:22 -0700
To: tom@coastin.com
From: Andy Piper
Subject: Re: Issue 4008 Proposed Resolution
Cc: orb-revision@omg.org

So here is my alternative proposed resolution:

Resolution:

UCS-2 and UCS-4 are native codesets, and are not intended for transfer syntaxes.

If little endian or big endian is desired, the new IANA code points for UTF16BE and UTF 16LE should be used.

However, many implementations exist which encode UCS-2 using the endianness of the GIOP message. To preserve backwards compatibility, implementations that must use UCS-2 should use the endianness of the GIOP message.

Revised Text: Add clarification:

"The use of UCS-2 and UCS-4 in GIOP is deprecated. However, to retain backwards compatibility, implementations that use these codesets should encode using the endianness of the GIOP message.
GIOP 1.x allows the use of UTF16BE and UTF16LE, from the IANA codeset registry, for systems which desire to indicate a preference for a particular byte order." Thanks andy Date: Thu, 11 Apr 2002 16:34:28 +0900 From: SHIMAMURA Masayoshi To: Andy Piper Subject: Re: Issue 4008 Proposed Resolution Cc: orb-revision@omg.org Organization: FUJITSU LIMITED X-Mailer: Becky! ver. 2.00.07 X-MIME-Autoconverted: from quoted-printable to 8bit by emerald.omg.org id g3B7c6e12832 Andy, On Wed, 10 Apr 2002 11:48:22 -0700 Andy Piper wrote: ... > UCS-2 and UCS-4 are native codesets, and are not intended for transfer > syntaxes. > > If little endian or big endian is desired, the new IANA code points for > UTF16BE and UTF 16LE should be used. However ISO permits using UCS-2 and UCS-4 as transfer syntaxes. Thus we should not enforce UTF-16BE and UTF-16LE instead of UCS-2 and UCS-4. > However, many implementations exist which encode UCS-2 using the > endianess of the GIOP message. To preserve backwards compatibility, > implementations that must use UCS-2 should use the endianess of the > GIOP message. It seems good in existing GIOP versions (GIOP 1.1-1.3) to preserve compatibility with existing ORB implementations. However the rule above is different from the byte order rules for UTF-16. I think we should not introduce additional rules into next GIOP version. It increases complexity of CORBA specification. > Revised Text: Add clarification: > > "The use of UCS-2 and UCS-4 in GIOP is deprecated. However, to > retain > backwards compatibility, implementations that use these codesets > should > encode using the endianess of the GIOP message. > > GIOP 1.x allows the use of UTF16BE and UTF16LE, from the IANA > codeset > registry, for systems which desire to indicate a preference for a > particular byte order." 
> I think following clarification is better: For GIOP 1.1, 1.2 and 1.3, UCS-2 and UCS-4 should be encoded using the endianess of the GIOP message for compatibility with existing implementations. For GIOP 1.4, the byte order rules for UCS-2 and UCS-4 are the same as for UTF-16. UTF-16LE and UTF-16BE, from IANA codeset registry, have its own endianess definition. Thus these should be encoded using its specific endianess. I'd like to add the clarification and IANA codeset registry to the specification at same time when we create GIOP 1.4. Regards, -- SHIMAMURA Masayoshi TEL:+81-45-476-4590(ext.7128-4241) FAX:+81-45-476-4726(ext.7128-6783) Planning Dep., Strategic Planning Div., Software Group, FUJITSU LIMITED X-Sender: andyp@san-francisco.beasys.com X-Mailer: QUALCOMM Windows Eudora Version 4.3.2 Date: Thu, 11 Apr 2002 10:54:38 -0700 To: SHIMAMURA Masayoshi From: Andy Piper Subject: Re: Issue 4008 Proposed Resolution Cc: orb-revision@omg.org At 04:34 PM 4/11/02 +0900, SHIMAMURA Masayoshi wrote: Andy, On Wed, 10 Apr 2002 11:48:22 -0700 Andy Piper wrote: ... > UCS-2 and UCS-4 are native codesets, and are not intended for transfer > syntaxes. > > If little endian or big endian is desired, the new IANA code points for > UTF16BE and UTF 16LE should be used. However ISO permits using UCS-2 and UCS-4 as transfer syntaxes. Thus we should not enforce UTF-16BE and UTF-16LE instead of UCS-2 and UCS-4. I actually have no problem using these for transfer as long as the encoding is unambiguous (which is currently not the case). > However, many implementations exist which encode UCS-2 using the > endianess of the GIOP message. To preserve backwards compatibility, > implementations that must use UCS-2 should use the endianess of the > GIOP message. It seems good in existing GIOP versions (GIOP 1.1-1.3) to preserve compatibility with existing ORB implementations. However the rule above is different from the byte order rules for UTF-16. 
I think we should not introduce additional rules into next GIOP version. It increases complexity of CORBA specification. > Revised Text: Add clarification: > > "The use of UCS-2 and UCS-4 in GIOP is deprecated. However, to retain > backwards compatibility, implementations that use these codesets should > encode using the endianess of the GIOP message. > > GIOP 1.x allows the use of UTF16BE and UTF16LE, from the IANA codeset > registry, for systems which desire to indicate a preference for a > particular byte order." > I think following clarification is better: For GIOP 1.1, 1.2 and 1.3, UCS-2 and UCS-4 should be encoded using the endianess of the GIOP message for compatibility with existing implementations. For GIOP 1.4, the byte order rules for UCS-2 and UCS-4 are the same as for UTF-16. UTF-16LE and UTF-16BE, from IANA codeset registry, have its own endianess definition. Thus these should be encoded using its specific endianess. I'd like to add the clarification and IANA codeset registry to the specification at same time when we create GIOP 1.4. This is ok by me. Actually to truly preserve backwards compatibility we might have to go further and preclude BOMs from UCS-2 encodings in GIOP 1.1, 1.2 and 1.3, but maybe this is going too far... andy Date: Fri, 12 Apr 2002 11:51:15 -0400 From: Tom Rutt Reply-To: tom@coastin.com Organization: Coast Enterprises LLC X-Mailer: Mozilla 4.78 [en] (Win98; U) X-Accept-Language: en To: Andy Piper CC: SHIMAMURA Masayoshi , orb-revision@omg.org Subject: Re: Issue 4008 Proposed Resolution Andy Piper wrote: > > > > This is ok by me. Actually to truly preserve backwards compatibility we > might have to go further and preclude BOMs from UCS-2 encodings in GIOP > 1.1, 1.2 and 1.3, but maybe this is going too far... > What if the sending orb receives a wstring encoded with a BOM as the first symbol, and that BOM is not the same as the endianness of the giop message?
> > andy -- ---------------------------------------------------- Tom Rutt email: tom@coastin.com Tel: +1 732 801 5744 Fax: +1 732 774 5133 Fujitsu Software X-Sender: andyp@san-francisco.beasys.com X-Mailer: QUALCOMM Windows Eudora Version 4.3.2 Date: Fri, 12 Apr 2002 10:32:31 -0700 To: tom@coastin.com From: Andy Piper Subject: Re: Issue 4008 Proposed Resolution Cc: SHIMAMURA Masayoshi , orb-revision@omg.org At 11:51 AM 4/12/02 -0400, Tom Rutt wrote: > This is ok by me. Actually to truly preserve backwards compatibility we > might have to go further and preclude BOMs from UCS-2 encodings in GIOP > 1.1, 1.2 and 1.3, but maybe this is going too far... > What if the sending orb receives a wstring encoded with a BOM as the first symbol, and that BOM is not the same as the endianness of the giop message? BOMs should take precedence IMO, that's how it is for UTF-16. My comment was more to do with the fact that certain implementations that rely on UCS-2 having the endianess of the message do not understand BOMs. But I don't think my comment is necessary since we are only trying to address the ambiguous (no-BOM) case. andy Date: Fri, 12 Apr 2002 15:31:56 -0400 From: Jishnu Mukerji Organization: Hewlett-Packard Company X-Mailer: Mozilla 4.78 [en] (Windows NT 5.0; U) X-Accept-Language: en To: corba-rtf@omg.org Subject: Issue 4008 Draft Resolution The following draft resolution needs some significant further work in order to fit it in well with section 13.10.2.6. There is also a need to provide details of fallback codesets from the IANA registry, which is est included in this resolution. Take a look at the paragraph that begins "In summary, the fallback codeset...." on page 13-48 of orbrev/02-02-01. I need specific language to fix this section to make it work with the IANA registry. All the specific codeset references from the OSF registry that exist for pre-1.4 GIOP need to be repeated with equivalent codeset details from the IANA registry.
Until we do that we should hold off on voting on this issue. So, Shimamura-san, Tom and Andy, your help is needed to complete this draft. Think about how to fix section 13.10.2.6 in its totality. Thanks, Jishnu. -- Issue 4008: wchar endianness (interop) http://cgi.omg.org/issues/issue4008.txt Source: AT&T (Dr. Duncan Grisby, dgrisby@uk.research.att.com) Nature: Uncategorized Issue Severity: Summary: . . . . . . . . Resolution: The use of various encoding schemes needs to be clarified. See proposed text change below. Revised Text: In document orbrev/02-02-01 in section 13.10.2.6 "Code Set Negotiation" insert the following paragraphs immediately following the paragraph that begins "In summary, the fallback codeset......" on page 13-48: "For GIOP 1.1, 1.2 and 1.3, UCS-2 and UCS-4 should be encoded using the endianess of the GIOP message for compatibility with existing implementations. For GIOP 1.4, the byte order rules for UCS-2 and UCS-4 are the same as for UTF-16. UTF-16LE and UTF-16BE, from IANA codeset registry, have its own endianess definition. Thus these should be encoded using its specific endianess." Actions taken: Incorporate changes and close issue. Date: Tue, 16 Apr 2002 14:08:30 +0900 From: SHIMAMURA Masayoshi To: Andy Piper Subject: Re: Issue 4008 Proposed Resolution Cc: orb-revision@omg.org, Tom Rutt Organization: FUJITSU LIMITED X-Mailer: Becky! ver. 2.00.07 Andy, I'd like to revise my proposal as follows. I changed only the first paragraph. I think it is much better for backward compatibility. For GIOP 1.1, 1.2 and 1.3, UCS-2 and UCS-4 should be encoded using endianess indicated by the BOM. If BOM is not presented, these should be encoded using the endianess of the GIOP message for backward compatibility. For GIOP 1.4, the byte order rules for UCS-2 and UCS-4 are the same as for UTF-16. UTF-16LE and UTF-16BE, from IANA codeset registry, have its own endianess definition. Thus these should be encoded using its specific endianess.
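As an illustration only, the first paragraph of the April 16 wording (a BOM decides the byte order, otherwise the byte order of the GIOP message applies) could be read by a receiver roughly as in the Python sketch below. The function name decode_ucs2 and its parameters are hypothetical, and the sketch assumes the caller has already extracted the raw octets of the wide-character data from the CDR stream; it is not taken from any ORB.

```python
import codecs


def decode_ucs2(payload: bytes, message_little_endian: bool) -> str:
    """Decode UCS-2 octets: a leading BOM wins, else use the GIOP message byte order.

    Hypothetical helper illustrating the April 16 proposal only.
    """
    if payload[:2] == codecs.BOM_UTF16_BE:      # FE FF: big-endian BOM
        order, body = "big", payload[2:]
    elif payload[:2] == codecs.BOM_UTF16_LE:    # FF FE: little-endian BOM
        order, body = "little", payload[2:]
    else:                                       # no BOM: follow the message flag
        order = "little" if message_little_endian else "big"
        body = payload
    if len(body) % 2:
        raise ValueError("UCS-2 payload must contain whole 16-bit code units")
    units = (int.from_bytes(body[i:i + 2], order) for i in range(0, len(body), 2))
    return "".join(chr(u) for u in units)
```

A receiver that reads UCS-2 purely as unsigned shorts has no such BOM branch, so it would hand the BOM to the application as an ordinary code unit.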
On Thu, 11 Apr 2002 10:54:38 -0700 Andy Piper wrote: > At 04:34 PM 4/11/02 +0900, SHIMAMURA Masayoshi wrote: > >Andy, > > > >On Wed, 10 Apr 2002 11:48:22 -0700 > >Andy Piper wrote: > >... > > > UCS-2 and UCS-4 are native codesets, and are not intended for transfer > > > syntaxes. > > > > > > If little endian or big endian is desired, the new IANA code points for > > > UTF16BE and UTF 16LE should be used. > > > >However ISO permits using UCS-2 and UCS-4 as transfer syntaxes. Thus we > >should not enforce UTF-16BE and UTF-16LE instead of UCS-2 and UCS-4. > > I actually have no problem using these for transfer as long as the encoding > is unambiguous (which is currently not the case). > > > > However, many implementations exist which encode UCS-2 using the > > > endianess of the GIOP message. To preserve backwards compatibility, > > > implementations that must use UCS-2 should use the endianess of the > > > GIOP message. > > > >It seems good in existing GIOP versions (GIOP 1.1-1.3) to preserve > >compatibility with existing ORB implementations. > > > >However the rule above is different from the byte order rules for UTF-16. > >I think we should not introduce additional rules into next GIOP version. > >It increases complexity of CORBA specification. > > > > > Revised Text: Add clarification: > > > > > > "The use of UCS-2 and UCS-4 in GIOP is deprecated. However, to retain > > > backwards compatibility, implementations that use these codesets should > > > encode using the endianess of the GIOP message. > > > > > > GIOP 1.x allows the use of UTF16BE and UTF16LE, from the IANA codeset > > > registry, for systems which desire to indicate a preference for a > > > particular byte order." > > > > > > >I think following clarification is better: > > > > For GIOP 1.1, 1.2 and 1.3, UCS-2 and UCS-4 should be encoded using > > the endianess of the GIOP message for compatibility with existing > > implementations. 
> > > > For GIOP 1.4, the byte order rules for UCS-2 and UCS-4 are the same > > as for UTF-16. > > > > UTF-16LE and UTF-16BE, from IANA codeset registry, have its own > > endianess definition. Thus these should be encoded using its > > specific endianess. > > > >I'd like to add the clarification and IANA codeset registry to the > >specification at same time when we create GIOP 1.4. > > This is ok by me. Actually to truly preserve backwards compatibility we > might have to go further and preclude BOMs from UCS-2 encodings in GIOP > 1.1, 1.2 and 1.3, but maybe this is going too far... > > andy Regards, -- SHIMAMURA Masayoshi TEL:+81-45-476-4590(ext.7128-4241) FAX:+81-45-476-4726(ext.7128-6783) Planning Dep., Strategic Planning Div., Software Group, FUJITSU LIMITED From: "Everett Anderson" To: "SHIMAMURA Masayoshi" , "Andy Piper" Cc: , "Tom Rutt" Subject: RE: Issue 4008 Proposed Resolution Date: Tue, 16 Apr 2002 11:32:29 -0700 X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) Importance: Normal Hi, > For GIOP 1.1, 1.2 and 1.3, UCS-2 and UCS-4 should be > encoded using > endianess indicated by the BOM. If BOM is not presented, these > should be encoded using the endianess of the GIOP message for > backward compatibility. Sorry, I've been away the past week or so. I don't think it's a satisfactory backwards compatibility story to say only to the receiver that if the BOM isn't present, it must take the byte order of the stream. What about existing ORBs who only read UCS-2 as unsigned shorts and aren't expecting BOMs at all? It seems like we must make it one way or the other until GIOP 1.3 (?) or 1.4. Just to give some weight to the argument about why some ORBs don't support or send BOMs with "UCS-2", I think this is the reasoning why our ORBs haven't: CORBA 01-09-01 15.3.1.6: "For example, if the TCS-W is ISO 10646 UCS-2 (Universal Character Set containing 2 bytes), then wide characters are represented as unsigned shorts. 
For ISO 10646 UCS-4, they are represented as unsigned longs." Just as with UTF-16, though there may be disagreement outside of CORBA about how to encode it, we've made the rules in CORBA CDR explicit. If we're going to change the rules such that it's okay to send a BOM with UCS-2, we should remove this text from the spec. - Everett X-Sender: andyp@san-francisco.beasys.com X-Mailer: QUALCOMM Windows Eudora Version 4.3.2 Date: Tue, 16 Apr 2002 16:37:13 -0700 To: "Everett Anderson" , "SHIMAMURA Masayoshi" From: Andy Piper Subject: RE: Issue 4008 Proposed Resolution Cc: , "Tom Rutt" At 11:32 AM 4/16/02 -0700, Everett Anderson wrote: Sorry, I've been away the past week or so. I don't think it's a satisfactory backwards compatibility story to say only to the receiver that if the BOM isn't present, it must take the byte order of the stream. What about existing ORBs who only read UCS-2 as unsigned shorts and aren't expecting BOMs at all? It seems like we must make it one way or the other until GIOP 1.3 (?) or 1.4. Just to give some weight to the argument about why some ORBs don't support or send BOMs with "UCS-2", I think this is the reasoning why our ORBs haven't: CORBA 01-09-01 15.3.1.6: "For example, if the TCS-W is ISO 10646 UCS-2 (Universal Character Set containing 2 bytes), then wide characters are represented as unsigned shorts. For ISO 10646 UCS-4, they are represented as unsigned longs." Just as with UTF-16, though there may be disagreement outside of CORBA about how to encode it, we've made the rules in CORBA CDR explicit. If we're going to change the rules such that it's okay to send a BOM with UCS-2, we should remove this text from the spec. You're right. I was only thinking about current implementations being backwardly compatible with old ones, not the other way around. So I guess I go back to my original comment - the above text should be left in and BOMs should not be used for these encodings.
andy Date: Fri, 19 Apr 2002 21:39:59 +0900 From: SHIMAMURA Masayoshi To: orb-revision@omg.org Subject: Re: Issue 4008 Draft Resolution Organization: FUJITSU LIMITED X-Mailer: Becky! ver. 2.00.07 Regarding Issue 4008 Resolution: If we take out the words regarding the BOM, we are left with the following from my April 16 proposal: " For GIOP 1.1, 1.2 and 1.3, UCS-2 and UCS-4 should be encoded using the endianess of the GIOP message, for backward compatibility. For GIOP 1.4, the byte order rules for UCS-2 and UCS-4 are the same as for UTF-16. UTF-16LE and UTF-16BE, from IANA codeset registry, have their own endianess definition. Thus these should be encoded using the specific endianess. " Does everyone agree with this resulting wording? Should we also add an additional paragraph to deprecate or even disallow UCS-2 in GIOP 1.4 in favor of UTF-16? Regards, -- SHIMAMURA Masayoshi TEL:+81-45-476-4590(ext.7128-4241) FAX:+81-45-476-4726(ext.7128-6783) Planning Dep., Strategic Planning Div., Software Group, FUJITSU LIMITED Date: Fri, 17 May 2002 14:43:57 -0400 From: Tom Rutt Reply-To: tom@coastin.com Organization: Fujitsu Software Corp X-Mailer: Mozilla 4.78 [en] (Win98; U) X-Accept-Language: en To: "corba-rtf@omg.org" , Jishnu Mukerji CC: TOM@coastin.com Subject: Proposed Resolution to Issue 4008 Jishnu The following is the proposed resolution for Issue 4008 Issue 4008: wchar endianness (corba-rtf) Click here for this issue's archive. Source: AT&T (Dr. Duncan Grisby, dgrisby@uk.research.att.com) Nature: Uncategorized Issue Severity: Summary: In a similar vein to Vishy's question about alignment, what should the endianness of a word-oriented wchar be? This applies both to single wchars, and the separate code points in a wstring. With the 2.3 spec, it seemed quite obvious to me that word-oriented wide characters should have the same endianness as the rest of the stream. After all, they are no different from any other word-oriented type. 
However, with the new 2.4 spec, there is now a bizarre section saying that if, and only if, the TCS-W is UTF-16, all wchar values are marshalled big-endian unless there is a byte-order-mark telling you otherwise. I don't understand the point of this. Section 2.7 of the Unicode Standard, version 3.0 says [emphasis mine]: "Data streams that begin with U+FEFF byte order mark are likely to contain Unicode values. It is recommended that applications sending or receiving _untyped_ data streams of coded characters use this signature. _If other signaling methods are used, signatures should not be employed._" It seems quite clear to me that a GIOP stream is a _typed_ data stream which uses its own signalling methods. The Unicode standard therefore says that a BOM should _not_ be used. I guess it's too late to clean up the UTF-16 encoding, but what about other word-oriented code sets? What if the end-points have negotiated the use of UCS-4? Should that be big-endian unless there's a BOM? The spec doesn't say. Even worse, what if the negotiated encoding is something like Big5? That doesn't _have_ byte order marks. Big5 doesn't have a one-to-one Unicode mapping, so it's not sensible to always translate to UTF-16. GIOP already has a perfectly good mechanism for sorting out this kind of issue. Please can wchar be considered on equal footing with all other types, and use the stream's endianness? Resolution: UCS-2 and UCS-4 are native codesets, and in newer Unicode Forum versions of the standard, they are not intended for transfer syntaxes. For backward compatibility with Java ORB, UCS-2 and UCS-4 will be treated like Integer, i.e., marshalling shall use the endianness of the message.
For GIOP 1.4 onward, we should have the same interpretation for UCS-2 as with UTF. Revised Text: Delete the following text from the second bullet item in 15.3.1.6 " For example, if the TCS-W is ISO 10646 UCS-2 (Universal Character Set containing 2 bytes), then wide characters are represented as unsigned shorts. For ISO 10646 UCS-4, they are represented as unsigned longs. " Add clarification at end of 15.3.1.6: " For GIOP 1.1, 1.2 and 1.3, UCS-2 and UCS-4 should be encoded using the endianess of the GIOP message, for backward compatibility. For GIOP 1.4, the byte order rules for UCS-2 and UCS-4 are the same as for UTF-16. UTF-16LE and UTF-16BE, from IANA codeset registry, have their own endianess definition. Thus these should be encoded using the specific endianess. " Actions taken: October 31, 2000: received issue March 7, 2002: moved to Core RTF Discussion: -- ---------------------------------------------------- Tom Rutt email: tom@coastin.com Tel: +1 732 801 5744 Fax: +1 732 774 5133 Coast Enterprises Date: Fri, 17 May 2002 15:09:50 -0400 From: Jishnu Mukerji Organization: Hewlett-Packard X-Mailer: Mozilla 4.79 [en] (Win98; U) X-Accept-Language: en To: corba-rtf@omg.org Subject: Proposed resolution for 4008 The following resolution will appear in the next vote unless there is significant technical objection to it. Jishnu. ____________________________________________________________________ Issue 4008: wchar endianness (interop) http://cgi.omg.org/issues/issue4008.txt Source: AT&T (Dr. Duncan Grisby, dgrisby@uk.research.att.com) Nature: Uncategorized Issue Severity: Summary: In a similar vein to Vishy's question about alignment, what should the endianness of a word-oriented wchar be? This applies both to single wchars, and the separate code points in a wstring. With the 2.3 spec, it seemed quite obvious to me that word-oriented wide characters should have the same endianness as the rest of the stream.
After all, they are no different from any other word-oriented type. However, with the new 2.4 spec, there is now a bizarre section saying that if, and only if, the TCS-W is UTF-16, all wchar values are marshalled big-endian unless there is a byte-order-mark telling you otherwise. I don't understand the point of this. Section 2.7 of the Unicode Standard, version 3.0 says [emphasis mine]: "Data streams that begin with U+FEFF byte order mark are likely to contain Unicode values. It is recommended that applications sending or receiving _untyped_ data streams of coded characters use this signature. _If other signaling methods are used, signatures should not be employed._" It seems quite clear to me that a GIOP stream is a _typed_ data stream which uses its own signalling methods. The Unicode standard therefore says that a BOM should _not_ be used. . . . . . . . . . . . . Resolution: UCS-2 and UCS-4 are native codesets, and in newer Unicode Forum versions of the standard, they are not intended for transfer syntaxes. For backward compatibility with Java ORB, UCS-2 and UCS-4 will be treated like Integer, i.e., marshaling shall use the endianness of the message. For GIOP 1.4 onward, we should have the same interpretation for UCS-2 as with UTF. Revised Text: In orbrev/02-02-01 1. Delete the following text from the second bullet item in 15.3.1.6 "For example, if the TCS-W is ISO 10646 UCS-2 (Universal Character Set containing 2 bytes), then wide characters are represented as unsigned shorts. For ISO 10646 UCS-4, they are represented as unsigned longs." 2. Add the following paragraphs at end of 15.3.1.6: "For GIOP 1.1, 1.2 and 1.3, UCS-2 and UCS-4 should be encoded using the endianess of the GIOP message, for backward compatibility. For GIOP 1.4, the byte order rules for UCS-2 and UCS-4 are the same as for UTF-16. UTF-16LE and UTF-16BE, from IANA codeset registry, have their own endianess definition.
Thus these should be encoded using the endianess specified by their endianness definition." Actions taken: Incorporate changes and close issue From: "Everett Anderson" To: "Bob Kukura" , "Jishnu Mukerji" Cc: Subject: RE: CORBA Core RTF Vote 3 Announcement Date: Fri, 7 Jun 2002 15:14:56 -0700 X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0) Importance: Normal Hi, -----Original Message----- 4008 - NO - This resolution does not appear to be backwardly compatible with IONA's current interpretation of the specification. Our current implementations of GIOP 1.2 treat UCS wchars similarly to UTF (big endian unless specified otherwise via BOM). We have not heard any reports of interop issues, so we suspect other vendors have made similar interpretations (or no one really uses the IDL wchar type). --------------------------- There is an interop issue with the Sun J2EE 1.3 RI and all Sun and IBM JDK ORBs that I know of. These ORBs do not use BOMs and take the byte order of the stream for UCS2 (due to the wording in the spec that UCS2 is encoded as unsigned shorts). We have a bug open on this interop problem between our ORBs filed by an external customer. I talked about it with IONA developers, and we tried to find Unicode spec justification for our encodings, but found nothing conclusive (probably since apparently no one is really supposed to use UCS2 as a transmission code set according to the issue discussion). The CORBA wchar/wstring types are crucial for RMI-IIOP. - Everett Date: Fri, 07 Jun 2002 13:59:53 -0400 From: Bob Kukura User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.4) Gecko/20011019 Netscape6/6.2 X-Accept-Language: en-us To: Jishnu Mukerji CC: corba-rtf@omg.org Subject: Re: CORBA Core RTF Vote 3 Announcement X-OriginalArrivalTime: 07 Jun 2002 18:02:38.0893 (UTC) FILETIME=[82C3B5D0:01C20E4D] IONA votes as follows: 3318 - YES - But I'm not happy with an aspect of the wording of the revised text. 
The entire reason that it's necessary to send the codeset context multiple times is that there is no guarantee what order request messages will be processed in the server. Therefore, any client that wants to reliably interoperate with all compliant servers must keep sending the codeset context in each outgoing request until it receives a reply to one of its requests sent on that connection. Since there is no requirement that the server process requests in the same order in which they were sent by the client, the resolution should not describe the server's behavior in terms of the order the requests were sent by the client. I'd recommend editorially rewording the resolution as follows: If the client (or the server if Bi-Directional GIOP is in use) sends multiple codeset service contexts with different parameter values, then the behavior is undefined. The receiver of a codeset service context with different values from those received on the same connection and processed previously may return a MARSHAL system exception with the standard minor code j. 2. Add the following minor code to the MARSHAL system exception. minor code: j description: codeset service contexts with different values received on the same connection. 3429 - YES 4008 - NO - This resolution does not appear to be backwardly compatible with IONA's current interpretation of the specification. Our current implementations of GIOP 1.2 treat UCS wchars similarly to UTF (big endian unless specified otherwise via BOM). We have not heard any reports of interop issues, so we suspect other vendors have made similar interpretations (or no one really uses the IDL wchar type). 4156 - YES 4167 - YES 4167 - ABSTAIN - I'm not sure all these TSC constraints were ever intended or are necessary. 4236 - YES 4806 - YES 4820 - YES 4824 - YES - I have similar concerns about the text implying that requests are processed by the server in the order in which they are received.
4835 - YES 4899 - YES -Bob Jishnu Mukerji wrote: Folks, Vote 3 of the CORBA Core RTF is now available at: http://cgi.omg.org/pub/orbrev/votes/core_rtf_vote_3.html It is due on June 7, so you have some time to vote on it. The following will lose voting rights if they do not cast a vote on this one, since they did not vote on Vote 2: Mitchel Sanders 2AB Ke Jin Borland Tom Rutt Fujitsu Jon Currey Highlander Roger Hart IBM So please do vote early and vote often. As usual if you do vote often, only the last vote cast by you will count. As I have mentioned earlier, I will be away on vacation until the 13th of June, and I will tabulate and publish results upon my return. Meanwhile, I am hoping that Tom Rutt will help out by prodding you to cast your vote in time, and deal with any immediate contingencies that might arise, specially of the editorial kind. If necessary we will do a quick turnaround vote in the week of June 17 to fix any residual problems. Other than that, this is the last vote for this round of the RTF, leading to an interim report to be filed by the Orlando meeting. As usual the list of issues to be is still to be found at: http://cgi.omg.org/pub/orbrev/drafts/corba_rtf-ytbr.html Please keep working on coming up with proposed resolutions. As usual we will put resolutions up for vote on a regular basis as soon as they become stable. So far we have made very good progress on Interop and Core issues. Interceptors currently needs the most attention since it has the largest backlog. The next on the list of those needing attention are the Messaging issue, and the clutch of remaining POA and ORB::shutdown vs. ORB::destroy issues. Upon my return I will shepherd 1139 to a conclusion. Another thing that should be brought to everyone's attention is that it appears that the Firewall RFP is stuck dead in the water since no one seems to be filing a BC Questionnaire for it. If this RFP does not get finalized then we will have a few issues to deal with: 1. 
How do we get the BiDir IIOP changes from that submission brought back into Core? 2. How/When do we go for IIOP 1.4? There are two issues that require IIOP 1.4 so far: a. IPV6 and IIOP, very important IMHO, since there is demand appearing for this in the marketplace. b. Object::repository_id() Any thoughts/comments/discussions of this issue would be very useful. However, I will not be participating in any discussions until June 13th. Upon my return I will ask the same questions again and see where it goes. Thanks, Jishnu. -- Bob Kukura Technical Leader ASP CORBA Team END 2 ANYWHERE 200 West Street Waltham, MA 02451 USA Tel (781) 902-8141 Fax (781) 902-8001 bob.kukura@iona.com
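For readers comparing the positions in this thread, the revised text put to the vote for Issue 4008 can be summarized as a small decision rule. The Python sketch below is one possible reading, not text from the specification: the function name wchar_byte_order and its parameters are hypothetical, the codeset is identified by a plain string rather than a registry code point, and the UTF-16 branch follows the rule as described in the discussion (big-endian unless a BOM says otherwise).

```python
from typing import Optional


def wchar_byte_order(codeset: str, giop_minor: int,
                     message_little_endian: bool,
                     bom_little_endian: Optional[bool] = None) -> str:
    """Return "big" or "little" for wide-character data.

    Hypothetical helper. bom_little_endian is None when no BOM was seen,
    otherwise the byte order that the BOM indicated.
    """
    name = codeset.upper()
    # UTF-16BE / UTF-16LE carry their own byte order definition.
    if name == "UTF-16BE":
        return "big"
    if name == "UTF-16LE":
        return "little"
    # GIOP 1.1 to 1.3: UCS-2 and UCS-4 follow the GIOP message byte order.
    # No BOM handling is applied here, since the revised text mentions none.
    if name in ("UCS-2", "UCS-4") and 1 <= giop_minor <= 3:
        return "little" if message_little_endian else "big"
    # UTF-16, and UCS-2 / UCS-4 from GIOP 1.4: big-endian unless a BOM says otherwise.
    if name in ("UTF-16", "UCS-2", "UCS-4"):
        return "little" if bom_little_endian else "big"
    raise ValueError(f"codeset not covered by this sketch: {codeset}")
```

Under this reading a GIOP 1.2 peer that applies the UTF-16 rule to UCS-2 (the interpretation IONA describes in its vote) and a peer that follows the message byte order disagree exactly when the message is little-endian and no BOM is sent.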