Issue 1096: Issue about CDR encoding for wide characters (interop)

Source:
Nature: Uncategorized Issue
Severity:
Summary:
1. When encoding a wchar, always put 4 bytes into the CDR stream,
   adding any zero pad bytes as necessary to make the value 4 bytes
   long.
2. When encoding a wstring, always encode the length as the total
   number of bytes used by the encoded value, regardless of whether the
   encoding is byte-oriented or not.
Resolution:
Revised Text:
Actions taken:
March 24, 1998: received issue
June 25, 1998: closed issue

Discussion:

End of Annotations:=====

Return-Path:
Date: Wed, 25 Mar 1998 13:38:23 +0100
From: "Martin v. Loewis"
To: jon@floorboard.com
CC: jfh@hal.com, issues@omg.org, interop@omg.org
Subject: Re: omg_issues: Issue about CDR encoding for wide characters
References: <199803231932.LAA24237@kodiak.hal.com> <3516BAF8.3E82FFD7@floorboard.com> <199803241056.LAA16737@pandora> <3517EDBC.C0AA416C@floorboard.com>

> Are you saying that there are current byte-oriented codesets that use
> more than 4 bytes for some characters? If so, then I don't know how we
> could solve this.

UTF-8, for one, can go up to 6 bytes
(http://www.stonehand.com/unicode/standard/wg2n1036.html). The
characters represented by encodings larger than four bytes are not yet
assigned, though.

I think there is no way to process multi-byte characters embedded in
CDR if you don't know what the encoding is. An ORB should not accept a
codeset where it can't easily determine the length of an encoding. It
seems that multibyte non-Unicode encodings will be used only if they
are native to both client and server.

> 1. When encoding a wchar, always put 4 bytes into the CDR stream,
> adding any zero pad bytes as necessary to make the value 4 bytes
> long.

I don't think that this could work.

> 2. When encoding a wstring, always encode the length as the total
> number of bytes used by the encoded value, regardless of whether the
> encoding is byte-oriented or not.
Even though this sounds reasonable, it contradicts the adopted
specifications.

Regards,
Martin

Return-Path:
Sender: jon@floorboard.com
Date: Wed, 25 Mar 1998 09:18:38 -0800
From: Jonathan Biggar
To: "Martin v. Loewis"
CC: jfh@hal.com, issues@omg.org, interop@omg.org
Subject: Re: omg_issues: Issue about CDR encoding for wide characters
References: <199803231932.LAA24237@kodiak.hal.com> <3516BAF8.3E82FFD7@floorboard.com> <199803241056.LAA16737@pandora> <3517EDBC.C0AA416C@floorboard.com> <199803251238.NAA22851@pandora>

Martin v. Loewis wrote:
>
> > Are you saying that there are current byte-oriented codesets that use
> > more than 4 bytes for some characters? If so, then I don't know how we
> > could solve this.
>
> UTF-8, for one, can go up to 6 bytes
> (http://www.stonehand.com/unicode/standard/wg2n1036.html). The
> characters represented by encodings larger than four bytes are not yet
> assigned, though.

Ick, that's awful.

> I think there is no way to process multi-byte characters embedded in
> CDR if you don't know what the encoding is. An ORB should not accept a
> codeset where it can't easily determine the length of an encoding. It
> seems that multibyte non-Unicode encodings will be used only if they
> are native to both client and server.

The idea, and I'm sure Bill Janssen would agree with me, is to make the
CDR encoding parsable without having to know what encoding was used for
wchar and wstring. If you don't have this, then bridges and gateways
become much more complicated, because they must know and track every
possible transmission codeset.

> > 1. When encoding a wchar, always put 4 bytes into the CDR stream,
> > adding any zero pad bytes as necessary to make the value 4 bytes
> > long.
>
> I don't think that this could work.
>
> > 2. When encoding a wstring, always encode the length as the total
> > number of bytes used by the encoded value, regardless of whether the
> > encoding is byte-oriented or not.
>
> Even though this sounds reasonable, it contradicts the adopted
> specifications.

Right, I am proposing this as a change to the standards to resolve this
problem. Perhaps if I changed my proposal to require wchar to be always
encoded as 8 bytes?

--
Jon Biggar
Floorboard Software
jon@floorboard.com
jon@biggar.org

Return-Path:
To: Jonathan Biggar
cc: "Martin v. Loewis", jfh@hal.com, issues@omg.org, interop@omg.org
X-uri: http://www.iona.com/~jmason/
Subject: Re: omg_issues: Issue about CDR encoding for wide characters
Content-ID: <7427.890847778.1@gog.dublin.iona.ie>
Date: Wed, 25 Mar 1998 17:42:58 +0000
From: Justin Mason

Jonathan Biggar said:

>The idea, and I'm sure Bill Janssen would agree with me, is to make the
>CDR encoding parsable without having to know what encoding was used for
>wchar and wstring. If you don't have this, then bridges and gateways
>become much more complicated, because they must know and track every
>possible transmission codeset.

>> > 2. When encoding a wstring, always encode the length as the total
>> > number of bytes used by the encoded value, regardless of whether the
>> > encoding is byte-oriented or not.
>>
>> Even though this sounds reasonable, it contradicts the adopted
>> specifications.
>
>Right, I am proposing this as a change to the standards to resolve this
>problem. Perhaps if I changed my proposal to require wchar to be always
>encoded as 8 bytes?

I agree with you in the wstring case, otherwise bridges are in big trouble
(spoken as a bridge spokesperson ;). However I don't know if the wchar
should be limited to 8 bytes...

It strikes me that the best way to handle both wstring and wchar is to
marshal them with the length as the length of the wstring/wchar in octets,
as suggested earlier for the wstring case.

--j.

Return-Path:
Sender: jon@floorboard.com
Date: Wed, 25 Mar 1998 09:53:25 -0800
From: Jonathan Biggar
To: Justin Mason
CC: "Martin v.
Loewis", jfh@hal.com, issues@omg.org, interop@omg.org
Subject: Re: omg_issues: Issue about CDR encoding for wide characters
References: <199803251751.RAA01386@dublin.iona.ie>

Justin Mason wrote:
>
> Jonathan Biggar said:
>
> >The idea, and I'm sure Bill Janssen would agree with me, is to make the
> >CDR encoding parsable without having to know what encoding was used for
> >wchar and wstring. If you don't have this, then bridges and gateways
> >become much more complicated, because they must know and track every
> >possible transmission codeset.
>
> >> > 2. When encoding a wstring, always encode the length as the total
> >> > number of bytes used by the encoded value, regardless of whether the
> >> > encoding is byte-oriented or not.
> >>
> >> Even though this sounds reasonable, it contradicts the adopted
> >> specifications.
> >
> >Right, I am proposing this as a change to the standards to resolve this
> >problem. Perhaps if I changed my proposal to require wchar to be always
> >encoded as 8 bytes?
>
> I agree with you in the wstring case, otherwise bridges are in big trouble
> (spoken as a bridge spokesperson ;). However I don't know if the wchar
> should be limited to 8 bytes...
>
> It strikes me that the best way to handle both wstring and wchar is to
> marshal them with the length as the length of the wstring/wchar in octets,
> as suggested earlier for the wstring case.

I could live with that.

--
Jon Biggar
Floorboard Software
jon@floorboard.com
jon@biggar.org

Return-Path:
Date: Wed, 25 Mar 1998 14:41:54 PST
Sender: Bill Janssen
From: Bill Janssen
To: jon@floorboard.com, "Martin v. Loewis"
Subject: Re: omg_issues: Issue about CDR encoding for wide characters
CC: jfh@hal.com, interop@omg.org
References: <199803231932.LAA24237@kodiak.hal.com> <3516BAF8.3E82FFD7@floorboard.com> <199803241056.LAA16737@pandora> <3517EDBC.C0AA416C@floorboard.com> <199803251238.NAA22851@pandora>

Although I wonder if it's reasonable to call UTF-8 a codeset.
I think of it rather as a roughly Huffman-coded string form of the 32-bit
Unicode codeset.

Bill

Return-Path:
Date: Wed, 25 Mar 1998 14:45:21 PST
Sender: Bill Janssen
From: Bill Janssen
To: "Martin v. Loewis", Jonathan Biggar
Subject: Re: omg_issues: Issue about CDR encoding for wide characters
CC: jfh@hal.com, issues@omg.org, interop@omg.org
References: <199803231932.LAA24237@kodiak.hal.com> <3516BAF8.3E82FFD7@floorboard.com> <199803241056.LAA16737@pandora> <3517EDBC.C0AA416C@floorboard.com> <199803251238.NAA22851@pandora> <35193C6E.DA245FBD@floorboard.com>

Excerpts from local.omg: 25-Mar-98 Re: omg_issues: Issue about.. Jonathan Biggar@floorboa (1761*)

> > > 2. When encoding a wstring, always encode the length as the total
> > > number of bytes used by the encoded value, regardless of whether the
> > > encoding is byte-oriented or not.

Probably to be consistent, we should do this for both wstring and
string. Not a real change for string, due to the existing restrictions
on character sets for string types, but a change in semantics.

I wonder if we should think about removing the distinction between
string and wstring?

Bill

Return-Path:
Date: Thu, 26 Mar 1998 13:44:27 +0100
From: "Martin v. Loewis"
To: janssen@parc.xerox.com
CC: interop@omg.org
Subject: Re: omg_issues: Issue about CDR encoding for wide characters
References: <199803231932.LAA24237@kodiak.hal.com> <3516BAF8.3E82FFD7@floorboard.com> <199803241056.LAA16737@pandora> <3517EDBC.C0AA416C@floorboard.com> <199803251238.NAA22851@pandora> <0p6MUmUB0KGW42whsn@holmes.parc.xerox.com>

> Although I wonder if it's reasonable to call UTF-8 a codeset. I think
> of it rather as a roughly Huffman-coded string form of the 32-bit
> Unicode codeset.

When they talk about code sets in CORBA, they really mean encodings,
or coded character sets, as explained in 11.7.

Martin
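
Editorial illustration (not part of the thread): the length-in-octets
proposal discussed above can be sketched roughly as follows. This is a
minimal sketch, not the adopted CDR format. It assumes a 4-byte
big-endian length prefix and ignores CDR alignment, byte-order flags,
and terminators; the helper names encode_wstring and skip_wstring are
hypothetical. It only shows why a bridge could step over a wstring
without knowing the transmission codeset.

```python
import struct

def encode_wstring(text: str, codec: str) -> bytes:
    """Hypothetical helper: octet count (4 bytes, big-endian), then the encoded octets."""
    payload = text.encode(codec)
    return struct.pack(">I", len(payload)) + payload

def skip_wstring(buf: bytes, offset: int) -> int:
    """Return the offset just past the wstring at `offset`, without decoding it."""
    (length,) = struct.unpack_from(">I", buf, offset)
    return offset + 4 + length

# Two wstrings in different encodings, back to back in one stream.
word = "Gr\u00fc\u00dfe"
stream = encode_wstring(word, "utf-8") + encode_wstring(word, "utf-16-be")

# A bridge that knows neither encoding can still find both boundaries.
first_end = skip_wstring(stream, 0)
second_end = skip_wstring(stream, first_end)
assert second_end == len(stream)
```

If the prefix counted characters instead of octets, skip_wstring would
have to know how many octets each character occupies in the codeset in
use, which is the bridge problem raised in the thread.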