Issue 585: IDL Type Extensions: wchar and wstring CDR encoding (interop)

Source: (, )
Nature: Uncategorized
Severity:
Summary: Section 4.1 GIOP CDR Transfer Syntax: The spec should cover cases where TCS-W is byte-oriented or non-byte oriented
Resolution: duplicate to closed issue 1096
Revised Text: see core rtf 2.2 changes, changed to fix marshalling for giop 1.2
Actions taken:
May 29, 1997: received issue
June 22, 1998: moved from orb_revision to port-rtf
June 23, 1998: moved from port-rtf to interop
February 17, 1999: closed issue
February 19, 1999: closed issue; Resolved

Discussion: received issue

End of Annotations:=====

Return-Path:
To: issues@omg.org, orb_revision@omg.org
Reply-To: leou@austin.ibm.com
Subject: IDL Type Extensions: wchar and wstring CDR encoding issues
Date: Thu, 29 May 97 15:41:44 -0500
From: Leo Uzcategui

Reference: IDL Type Extensions V1.0 ( ptc/97-01-01 )

In section 4.1 GIOP CDR Transfer Syntax on p. 19, it states "a wide
character is represented as the fixed number of bits used to encode any
single codepoint in the TCS."
What does this mean when the negotiated TCS-W is byte-oriented?
The spec should cover cases where TCS-W is byte-oriented or non-byte
oriented.

Return-Path:
Reply-To: leou@austin.ibm.com
To: Masayoshi Shimamura
Cc: , , , , , , , , , , idltx@omg.org
Subject: Re: IDL Type Extensions Spec issues
Date: Wed, 04 Jun 97 08:16:36 -0500
From: Leo Uzcategui

Shimamura-san,

Thanks for your explanation. However, I believe the spec allows TCS-W to be
either byte-oriented or non-byte oriented.

On p. 7, under heading "Transmission Code Set" it states

"The intent is for TCS-C to be byte-oriented and TCS-W to be
non-byte-oriented. However, this specification does allow both types
of characters to be transmitted using the same transmission code set.
That is, the selection of a transmission code set is orthogonal to
the wideness or narrowness of the characters, although a given code
set may be better suited for either narrow or wide characters."

Unfortunately, I was not involved during the spec definition and am going
by what the spec says and discussion with some at IBM who were involved,
like Hari Madduri and Gary Miller.

Was there not agreement on this point? Can someone else please comment?

Thanks, Leo

I also cc'd idltx@omg.org in case I missed someone on the original list,
sorry if you get 2 copies.

> Dear Mr. Leo Uzcategui,
>
> Sorry for the late response.
>
> On Thu, 29 May 97 15:26:58 -0500
> Leo Uzcategui wrote:
> >
> > I need your help in trying to interpret the IDL Type Extensions V1.0
> > Spec ( ftp://ftp.omg.org/pub/docs/ptc/97-01-01.pdf [.ps] )
> > The spec is vague and open to misinterpretation with respect to the
> > CDR encoding of wchar/wstring data.
> >
> > I am somewhat unsure which OMG TF or RTF is responsible for definitive
> > answers (ORBOS?). But since you were all involved in the definition, you
> > can be of immediate help. I intend to mail these to issues@omg.org and
> > orb_revision@omg.org so they get recorded.
> >
> > In section 4.1 GIOP CDR Transfer Syntax on p. 19, it states "a wide
> > character is represented as the fixed number of bits used to encode any
> > single codepoint in the TCS."
> > What does this mean when the negotiated TCS-W is byte-oriented?
> > The spec should cover cases where TCS-W is byte-oriented or non-byte
> > oriented.
> >
>
> Negotiated TCS-W must be only non-byte oriented codeset. See section
> "2.1 Character Processing Terminology" on page 6.
>
> Byte-Oriented Code Set
>
> An encoding of characters where the numeric code corresponding
> to a character code element can occupy one or more bytes. A byte
> as used in this document is synonymous with octet, which
> occupies 8 bits. As noted above, byte-oriented code sets use the
> char data type.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~~~~~~~~~~~~~~~
>
> Non-Byte Oriented Code Set
>
> An encoding of characters where the numeric code corresponding
> to a character code element can occupy fixed 16 or 32 bits. As
> noted above, non-byte oriented code sets use the wchar data type.
>              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> > In section 4.1.2 on p. 20, spec states "A wide string is encoded as
> > an unsigned long indicating the length of the string in octets or
> > unsigned integers (determined by the transfer syntax for wchar) followed
> > by the individual wide characters. Both the string length and contents
> > include a terminating null. The terminating null character for a wstring
> > is also a wide character."
> > Again, what does this mean if TCS-W is byte-oriented?
>
> My answer is same above. Negotiated TCS-W must be only non-byte oriented
> codeset. In other words, native codeset and conversion codesets of
> *char type* must be *byte-oriented*. And also, native codeset and
> conversion codesets of *wchar type* must be *non-byte-oriented*.
>
> > Our interpretation:
> > If the negotiated TCS-W is non-byte oriented, then length is the number
> > of characters, each character encoded using a fixed number of bits (as
> > determined by the code set) plus the terminating null (also of the same
> > width as the other characters.)
>
> Strictly speaking, the length means the number of code element including
> terminating null. It is not the number of characters including null.
>
> Because, for example, Unicode has specification of combining character.
> Unicode can represent character which has diacritical mark such as grave
> accent by specification of combining character.
>
>                       \
> If the character is "A" (combination of character "A" and grave
> accent "`"), the character may be represented by combination:
> code of "A" and code of "`" in Unicode. In the case, number of
> characters is one, but the number of code elements is two.
>
>                 \
> If the string "A" with Unicode is represented by CDR, the length should
> be three: code of "A", code of "`" and code of null. Therefore, the
> length of string in CDR should mean number of code elements instead of
> number of characters.
>
> > If the negotiated TCS-W is byte-oriented, then length is the number of
> > bytes (not of characters), each character encoded using only the bytes
> > required by the chosen code set (not a fixed-width representation), plus
> > the terminating single byte null.
> >
>
> There is not such case.
>
>
> Regards,
>
> Masayoshi SHIMAMURA
> ------------------------------------------------------------------------
> Masayoshi SHIMAMURA                FUJITSU LIMITED
> Planning Department II             Tel: +81-45-476-4591 (Ext. 7128-4221)
> IT Strategy and Planning Div.      FAX: +81-45-476-4726 (Ext. 7128-6783)
> Software Group                     E-mail: shima@rp.open.cs.fujitsu.co.jp
> ------------------------------------------------------------------------

----------------------------------------------------------------------
Leo Uzcategui, IBM Corp.
11400 Burnet Rd, Austin, TX 78758
(512) 823-9573   fax:(512) 838-1032
----------------------------------------------------------------------

Return-Path:
From: hari@austin.ibm.com
Date: Wed, 4 Jun 1997 12:28:00 -0500
To: leou@austin.ibm.com, jfh@hal.com
Subject: Re: Re[2]: IDL Type Extensions Spec issues
Cc: shima@rp.open.cs.fujitsu.co.jp, davg@mfltd.co.uk, janssen@parc.xerox.com, kline_s@apollo.hp.com, hari@austin.ibm.com, rabin@osf.org, alex_thomas@globalnet.co.uk, nicktindall@vnet.ibm.com, yamada@hal.com, idltx@omg.org
Content-Md5: xXpmKBl75LX0vjAzzSIJ8g==

Jim, Leo:

The change I wanted in the referenced note below is consistent with the
paragraph referred to by Leo (on page 7). I see that the spec is
consistent with the view that a "WChar" type data could be transmitted
using either byte-oriented or non-byte oriented transmission codesets.
To the best of my understanding, that is what we (the revision committee
members) all wanted. What Shimamura San is referring to in the definitions
part of the spec is not excluding the possibility of transmitting wchar
type of data using either codesets. It is describing the more likely case.
If I recall correctly, Fujitsu was one of the strong advocates of that
flexibility where one could use either type of codesets.

I would have to disagree with Dr. Shimamura's interpretation as noted below
his remarks.

regards,
-Hari

> From root Wed Jun 4 11:03:27 1997
> From: jfh@hal.com (Jim Hughes)
> Message-Id: <199706041602.JAA02694@kodiak.hal.com>
> Date: Wed, 4 Jun 1997 09:02:31 -0700 (PDT)
> To: leou@austin.ibm.com
> Cc: shima@rp.open.cs.fujitsu.co.jp, leou@austin.ibm.com, davg@mfltd.co.uk, janssen@parc.xerox.com, kline_s@apollo.hp.com, hari@austin.ibm.com, rabin@osf.org, alex_thomas@globalnet.co.uk, nicktindall@vnet.ibm.com, yamada@hal.com, idltx@omg.org
> Subject: Re[2]: IDL Type Extensions Spec issues
> In-Reply-To: <9706041316.AA15270@leou.austin.ibm.com>
> X-Mailer: Ishmail 1.3-960829-sol24
> Mime-Version: 1.0
> Content-Type: text/plain
> Content-Length: 7652
> Status: RO
>
> Leo,
>
> I'll let Shimamura-san give a technical answer for Fujitsu to your response,
> but let me add some more information.
>
> The text you reference on page 7 was proposed for removal by me when I
> created Draft 6 of the final document because I thought it was confusing. IBM
> requested that it be re-instated per the email below. Probably Hari can give
> you the best reasons why this was done and how it should be implemented.
>
> Jim
>
> >>>>> Forwarded from hari@austin.ibm.com
>
> Date: Tue, 7 Jan 1997 14:23:25 -0600
> From: hari@austin.ibm.com
> To: jfh@hal.com
> Subject: My changes to draft 5
> Cc: nicktindall@VNET.IBM.COM, leou@austin.ibm.com, hari@austin.ibm.com
>
> Jim,
> Here are the changes that I would like:
>
> 1.
>    Once we revert to narrow and wide characters referring to IDL
>    "char" and "wchar" types of data, it becomes unambiguous and
>    even necessary to point out that they could be transmitted using
>    either byte-oriented or non-byte oriented code sets.
>    Therefore the removed text in your notes on pages 5, 6, and
>    25 needs to be reintroduced. (Incidentally Mr. Shimamura didn't
>    like the phrase "don't have to be different" (in your note on page
>    25). Rephrase it as you like, but I want that point to be made.)
>
>    Also at the bottom of page 18 your note says "previous material
>    suggesting that a byte-oriented code set could be used for 'wide
>    characters' was removed". This removed text needs to be restored.
>
> (rest of email deleted)
>
> <<<<< End forwarded message
>
>
> ================
> On Wed, 04 Jun 97 08:16:36 -0500, Leo Uzcategui wrote:
>
> >
> > Shimamura-san,
> >
> > Thanks for your explanation. However, I believe the spec allows TCS-W to be
> > either byte-oriented or non-byte oriented.
> >
> > On p. 7, under heading "Transmission Code Set" it states
> >
> > "The intent is for TCS-C to be byte-oriented and TCS-W to be
> > non-byte-oriented. However, this specification does allow both types
> > of characters to be transmitted using the same transmission code set.
> > That is, the selection of a transmission code set is orthogonal to
> > the wideness or narrowness of the characters, although a given code
> > set may be better suited for either narrow or wide characters."
> >
> > Unfortunately, I was not involved during the spec definition and am going
> > by what the spec says and discussion with some at IBM who were involved,
> > like Hari Madduri and Gary Miller.
> >
> > Was there not agreement on this point? Can someone else please comment?
> >
> > Thanks, Leo
> >
> > I also cc'd idltx@omg.org in case I missed someone on the original list,
> > sorry if you get 2 copies.
> >
> > > Dear Mr. Leo Uzcategui,
> > >
> > > Sorry for the late response.
> > >
> > > On Thu, 29 May 97 15:26:58 -0500
> > > Leo Uzcategui wrote:
> > > >
> > > > I need your help in trying to interpret the IDL Type Extensions V1.0
> > > > Spec ( ftp://ftp.omg.org/pub/docs/ptc/97-01-01.pdf [.ps] )
> > > > The spec is vague and open to misinterpretation with respect to the
> > > > CDR encoding of wchar/wstring data.
> > > >
> > > > I am somewhat unsure which OMG TF or RTF is responsible for definitive
> > > > answers (ORBOS?). But since you were all involved in the definition, you
> > > > can be of immediate help. I intend to mail these to issues@omg.org and
> > > > orb_revision@omg.org so they get recorded.
> > > >
> > > > In section 4.1 GIOP CDR Transfer Syntax on p. 19, it states "a wide
> > > > character is represented as the fixed number of bits used to encode any
> > > > single codepoint in the TCS."
> > > > What does this mean when the negotiated TCS-W is byte-oriented?
> > > > The spec should cover cases where TCS-W is byte-oriented or non-byte
> > > > oriented.
> > > >
> > >
> > > Negotiated TCS-W must be only non-byte oriented codeset. See section
> > > "2.1 Character Processing Terminology" on page 6.

No. TCS-W can be either.

> > >
> > > Byte-Oriented Code Set
> > >
> > > An encoding of characters where the numeric code corresponding
> > > to a character code element can occupy one or more bytes. A byte
> > > as used in this document is synonymous with octet, which
> > > occupies 8 bits. As noted above, byte-oriented code sets use the
> > > char data type.  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > ~~~~~~~~~~~~~~~
> > >
> > > Non-Byte Oriented Code Set
> > >
> > > An encoding of characters where the numeric code corresponding
> > > to a character code element can occupy fixed 16 or 32 bits. As
> > > noted above, non-byte oriented code sets use the wchar data type.
> > >              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > >
> > > > In section 4.1.2 on p.
> > > > 20, spec states "A wide string is encoded as
> > > > an unsigned long indicating the length of the string in octets or
> > > > unsigned integers (determined by the transfer syntax for wchar) followed
> > > > by the individual wide characters. Both the string length and contents
> > > > include a terminating null. The terminating null character for a wstring
> > > > is also a wide character."
> > > > Again, what does this mean if TCS-W is byte-oriented?
> > >
> > > My answer is same above. Negotiated TCS-W must be only non-byte oriented
> > > codeset. In other words, native codeset and conversion codesets of
> > > *char type* must be *byte-oriented*. And also, native codeset and
> > > conversion codesets of *wchar type* must be *non-byte-oriented*.

No again.

> > >
> > > > Our interpretation:
> > > > If the negotiated TCS-W is non-byte oriented, then length is the number
> > > > of characters, each character encoded using a fixed number of bits (as
> > > > determined by the code set) plus the terminating null (also of the same
> > > > width as the other characters.)
> > >
> > > Strictly speaking, the length means the number of code element including
> > > terminating null. It is not the number of characters including null.

Yes. I agree.

> > >
> > > Because, for example, Unicode has specification of combining character.
> > > Unicode can represent character which has diacritical mark such as grave
> > > accent by specification of combining character.
> > >
> > >                       \
> > > If the character is "A" (combination of character "A" and grave
> > > accent "`"), the character may be represented by combination:
> > > code of "A" and code of "`" in Unicode. In the case, number of
> > > characters is one, but the number of code elements is two.
> > >
> > >                 \
> > > If the string "A" with Unicode is represented by CDR, the length should
> > > be three: code of "A", code of "`" and code of null.
> > > Therefore, the
> > > length of string in CDR should mean number of code elements instead of
> > > number of characters.
> > >
> > > > If the negotiated TCS-W is byte-oriented, then length is the number of
> > > > bytes (not of characters), each character encoded using only the bytes
> > > > required by the chosen code set (not a fixed-width representation), plus
> > > > the terminating single byte null.
> > > >
> > >
> > > There is not such case.

I disagree. There sure is such a case.

> > >
> > >
> > > Regards,
> > >
> > > Masayoshi SHIMAMURA
> > > ------------------------------------------------------------------------
> > > Masayoshi SHIMAMURA                FUJITSU LIMITED
> > > Planning Department II             Tel: +81-45-476-4591 (Ext. 7128-4221)
> > > IT Strategy and Planning Div.      FAX: +81-45-476-4726 (Ext. 7128-6783)
> > > Software Group                     E-mail: shima@rp.open.cs.fujitsu.co.jp
> > > ------------------------------------------------------------------------
> > ----------------------------------------------------------------------
> > Leo Uzcategui, IBM Corp.
> > 11400 Burnet Rd, Austin, TX 78758
> > (512) 823-9573   fax:(512) 838-1032
> > ----------------------------------------------------------------------
> >

I have stated what I believe to be the consensus we arrived at. What do the
other RTF members say?

regards,
-Hari

Return-Path:
From: jfh@hal.com (Jim Hughes)
Date: Wed, 4 Jun 1997 09:02:31 -0700 (PDT)
To: leou@austin.ibm.com
Cc: shima@rp.open.cs.fujitsu.co.jp, leou@austin.ibm.com, davg@mfltd.co.uk, janssen@parc.xerox.com, kline_s@apollo.hp.com, hari@austin.ibm.com, rabin@osf.org, alex_thomas@globalnet.co.uk, nicktindall@vnet.ibm.com, yamada@hal.com, idltx@omg.org
Subject: Re[2]: IDL Type Extensions Spec issues

Leo,

I'll let Shimamura-san give a technical answer for Fujitsu to your response,
but let me add some more information.

The text you reference on page 7 was proposed for removal by me when I
created Draft 6 of the final document because I thought it was confusing.
IBM requested that it be re-instated per the email below. Probably Hari can give
you the best reasons why this was done and how it should be implemented.

Jim

>>>>> Forwarded from hari@austin.ibm.com

Date: Tue, 7 Jan 1997 14:23:25 -0600
From: hari@austin.ibm.com
To: jfh@hal.com
Subject: My changes to draft 5
Cc: nicktindall@VNET.IBM.COM, leou@austin.ibm.com, hari@austin.ibm.com

Jim,
Here are the changes that I would like:

1. Once we revert to narrow and wide characters referring to IDL
   "char" and "wchar" types of data, it becomes unambiguous and
   even necessary to point out that they could be transmitted using
   either byte-oriented or non-byte oriented code sets.
   Therefore the removed text in your notes on pages 5, 6, and
   25 needs to be reintroduced. (Incidentally Mr. Shimamura didn't
   like the phrase "don't have to be different" (in your note on page
   25). Rephrase it as you like, but I want that point to be made.)

   Also at the bottom of page 18 your note says "previous material
   suggesting that a byte-oriented code set could be used for 'wide
   characters' was removed". This removed text needs to be restored.

(rest of email deleted)

<<<<< End forwarded message


================
On Wed, 04 Jun 97 08:16:36 -0500, Leo Uzcategui wrote:

>
> Shimamura-san,
>
> Thanks for your explanation. However, I believe the spec allows TCS-W to be
> either byte-oriented or non-byte oriented.
>
> On p. 7, under heading "Transmission Code Set" it states
>
> "The intent is for TCS-C to be byte-oriented and TCS-W to be
> non-byte-oriented. However, this specification does allow both types
> of characters to be transmitted using the same transmission code set.
> That is, the selection of a transmission code set is orthogonal to
> the wideness or narrowness of the characters, although a given code
> set may be better suited for either narrow or wide characters."
>
> Unfortunately, I was not involved during the spec definition and am going
> by what the spec says and discussion with some at IBM who were involved,
> like Hari Madduri and Gary Miller.
>
> Was there not agreement on this point? Can someone else please comment?
>
> Thanks, Leo
>
> I also cc'd idltx@omg.org in case I missed someone on the original list,
> sorry if you get 2 copies.
>
> > Dear Mr. Leo Uzcategui,
> >
> > Sorry for the late response.
> >
> > On Thu, 29 May 97 15:26:58 -0500
> > Leo Uzcategui wrote:
> > >
> > > I need your help in trying to interpret the IDL Type Extensions V1.0
> > > Spec ( ftp://ftp.omg.org/pub/docs/ptc/97-01-01.pdf [.ps] )
> > > The spec is vague and open to misinterpretation with respect to the
> > > CDR encoding of wchar/wstring data.
> > >
> > > I am somewhat unsure which OMG TF or RTF is responsible for definitive
> > > answers (ORBOS?). But since you were all involved in the definition, you
> > > can be of immediate help. I intend to mail these to issues@omg.org and
> > > orb_revision@omg.org so they get recorded.
> > >
> > > In section 4.1 GIOP CDR Transfer Syntax on p. 19, it states "a wide
> > > character is represented as the fixed number of bits used to encode any
> > > single codepoint in the TCS."
> > > What does this mean when the negotiated TCS-W is byte-oriented?
> > > The spec should cover cases where TCS-W is byte-oriented or non-byte
> > > oriented.
> > >
> >
> > Negotiated TCS-W must be only non-byte oriented codeset. See section
> > "2.1 Character Processing Terminology" on page 6.
> >
> > Byte-Oriented Code Set
> >
> > An encoding of characters where the numeric code corresponding
> > to a character code element can occupy one or more bytes. A byte
> > as used in this document is synonymous with octet, which
> > occupies 8 bits. As noted above, byte-oriented code sets use the
> > char data type.
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > ~~~~~~~~~~~~~~~
> >
> > Non-Byte Oriented Code Set
> >
> > An encoding of characters where the numeric code corresponding
> > to a character code element can occupy fixed 16 or 32 bits. As
> > noted above, non-byte oriented code sets use the wchar data type.
> >              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > > In section 4.1.2 on p. 20, spec states "A wide string is encoded as
> > > an unsigned long indicating the length of the string in octets or
> > > unsigned integers (determined by the transfer syntax for wchar) followed
> > > by the individual wide characters. Both the string length and contents
> > > include a terminating null. The terminating null character for a wstring
> > > is also a wide character."
> > > Again, what does this mean if TCS-W is byte-oriented?
> >
> > My answer is same above. Negotiated TCS-W must be only non-byte oriented
> > codeset. In other words, native codeset and conversion codesets of
> > *char type* must be *byte-oriented*. And also, native codeset and
> > conversion codesets of *wchar type* must be *non-byte-oriented*.
> >
> > > Our interpretation:
> > > If the negotiated TCS-W is non-byte oriented, then length is the number
> > > of characters, each character encoded using a fixed number of bits (as
> > > determined by the code set) plus the terminating null (also of the same
> > > width as the other characters.)
> >
> > Strictly speaking, the length means the number of code element including
> > terminating null. It is not the number of characters including null.
> >
> > Because, for example, Unicode has specification of combining character.
> > Unicode can represent character which has diacritical mark such as grave
> > accent by specification of combining character.
> >
> >                       \
> > If the character is "A" (combination of character "A" and grave
> > accent "`"), the character may be represented by combination:
> > code of "A" and code of "`" in Unicode.
> > In the case, number of
> > characters is one, but the number of code elements is two.
> >
> >                 \
> > If the string "A" with Unicode is represented by CDR, the length should
> > be three: code of "A", code of "`" and code of null. Therefore, the
> > length of string in CDR should mean number of code elements instead of
> > number of characters.
> >
> > > If the negotiated TCS-W is byte-oriented, then length is the number of
> > > bytes (not of characters), each character encoded using only the bytes
> > > required by the chosen code set (not a fixed-width representation), plus
> > > the terminating single byte null.
> > >
> >
> > There is not such case.
> >
> >
> > Regards,
> >
> > Masayoshi SHIMAMURA
> > ------------------------------------------------------------------------
> > Masayoshi SHIMAMURA                FUJITSU LIMITED
> > Planning Department II             Tel: +81-45-476-4591 (Ext. 7128-4221)
> > IT Strategy and Planning Div.      FAX: +81-45-476-4726 (Ext. 7128-6783)
> > Software Group                     E-mail: shima@rp.open.cs.fujitsu.co.jp
> > ------------------------------------------------------------------------
> ----------------------------------------------------------------------
> Leo Uzcategui, IBM Corp.
> 11400 Burnet Rd, Austin, TX 78758
> (512) 823-9573   fax:(512) 838-1032
> ----------------------------------------------------------------------
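
Illustration (not part of the archived messages). The two readings of the
wstring length field debated in this thread can be put side by side. What
follows is only a minimal sketch under stated assumptions, not text from the
specification: UCS-2 (fixed 16-bit code elements) stands in for a
non-byte-oriented TCS-W, UTF-8 stands in for a byte-oriented TCS-W, CDR
alignment padding is ignored, and the function names are hypothetical.

```python
import struct


def wstring_length_field(text, byte_oriented):
    """Length value under the two interpretations discussed in the thread."""
    if byte_oriented:
        # Assumption: UTF-8 as the byte-oriented TCS-W.
        # Length = number of bytes, plus one single-byte terminating null.
        return len(text.encode("utf-8")) + 1
    # Assumption: UCS-2 as the non-byte-oriented TCS-W (BMP characters only).
    # Length = number of 16-bit code elements, plus one wide terminating null.
    return len(text.encode("utf-16-be")) // 2 + 1


def encode_wstring(text, byte_oriented, little_endian=False):
    """Unsigned long length followed by the characters and a terminating null."""
    order = "<" if little_endian else ">"
    length = wstring_length_field(text, byte_oriented)
    if byte_oriented:
        body = text.encode("utf-8") + b"\x00"
    else:
        codec = "utf-16-le" if little_endian else "utf-16-be"
        body = text.encode(codec) + b"\x00\x00"
    # "L" with an explicit byte order is a 4-byte unsigned integer.
    return struct.pack(order + "L", length) + body


# The example from the thread: "A" followed by a combining grave accent.
a_grave = "A\u0300"
print(wstring_length_field(a_grave, byte_oriented=False))
print(wstring_length_field(a_grave, byte_oriented=True))
print(encode_wstring(a_grave, byte_oriented=False).hex())
print(encode_wstring(a_grave, byte_oriented=True).hex())
```

For "A" followed by the combining grave accent, the first reading counts
three code elements (two plus the wide null), while the second counts four
bytes (one for "A", two for the accent in UTF-8, one for the null). The
sketch reflects only the interpretations argued in these messages; the
resolution above notes that marshalling was later changed for GIOP 1.2.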