Issue 724: Non ASCII alphabetics in IDL identifiers (orb_revision)

Source: (, )
Nature: Uncategorized
Severity:
Summary: IDL identifiers can contain non-ASCII alphabetic characters. None of the language mappings deals with this. To fix this restrict IDL identifiers to ASCII characters, digits and underscore
Resolution: Since most of the implementation languages to which IDL is mapped do not accept non-ASCII character
Revised Text: All revisions refer to base document Rev 2.2+ i.e. Rev 2.2 as modified by ptc/98-03-03
Actions taken:
September 17, 1997: received issue
February 23, 1999: closed issue

Discussion:

End of Annotations:=====

Return-Path:
X-Authentication-Warning: foxtail.dstc.edu.au: michi owned process doing -bs
Date: Wed, 17 Sep 1997 08:45:55 +1000 (EST)
From: Michi Henning
To: issues@omg.org
Subject: Non-ASCII alphabetics in IDL identifiers

Hi,

Martin v. Loewis pointed out on the corba-dev list that IDL identifiers can contain non-ASCII alphabetic characters (such as various European language accented characters). However, none of the language mappings deal with this, and none of the target languages allow non-ASCII alphabetic characters in identifiers.

It appears that the easiest way to fix this is to restrict IDL identifiers to ASCII characters, digits, and underscore.

For reference, Martin's post and my reply are included below.

Cheers,

Michi.
--
Michi Henning +61 7 33654310
DSTC Pty Ltd +61 7 33654311 (fax)
University of Qld 4072 michi@dstc.edu.au
AUSTRALIA http://www.dstc.edu.au/BDU/staff/michi-henning.html

------------
From: "Martin v. Loewis"
To: corba-dev@qds.com
Subject: CORBA-DEV: Non-ASCII in IDL identifiers
Mime-Version: 1.0 (generated by tm-edit 7.105)
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by sparky.qds.com id BAA16511
Sender: owner-corba-dev@qds.com
Precedence: bulk
Reply-To: "Martin v.
Loewis"

The IDL lexical rules specify that IDL input is ISO-8859-1. They separate Latin-1 in letters, digits, graphical symbols, etc. Identifiers are specified as consisting of letters, digits and the underscore.

Now the question: Since accented characters are specified as letters, the text

module M{
  enum Farbe{rot, grün, blau};
  interface Fön{
    M::Farbe farbe();
  };
  enum languages{english, deutsch, français, espagñol}; //sp?
};

constitutes correct IDL. How is this mapped into target languages that only allow [A-Za-z0-9_]+ as identifiers (i.e. all target languages)?

TIA,
Martin

P.S. For those of you with MIME-challenged mail readers: the above text contains various accented characters.

------------
Date: Tue, 16 Sep 1997 12:32:15 +1000 (EST)
From: Michi Henning
To: "Martin v. Loewis"
cc: corba-dev@qds.com
Subject: Re: CORBA-DEV: Non-ASCII in IDL identifiers
In-Reply-To: <199709150855.KAA07414@punica>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from QUOTED-PRINTABLE to 8bit by sparky.qds.com id TAA18184
Sender: owner-corba-dev@qds.com
Precedence: bulk
Reply-To: Michi Henning

On Mon, 15 Sep 1997, Martin v. Loewis wrote:

> The IDL lexical rules specify that IDL input is ISO-8859-1. They
> separate Latin-1 in letters, digits, graphical symbols,
> etc. Identifiers are specified as consisting of letters, digits and
> the underscore.
>
> Now the question: Since accented characters are specified as
> letters, the text
>
> module M{
>   enum Farbe{rot, grün, blau};
>   interface Fön{
>     M::Farbe farbe();
>   };
>   enum languages{english, deutsch, français, espagñol}; //sp?
> };
>
> constitutes correct IDL. How is this mapped into target languages
> that only allow [A-Za-z0-9_]+ as identifiers (i.e. all target
> languages)?
>
> P.S. For those of you with MIME-challenged mail readers: the above
> text contains various accented characters.

That's a really interesting question.
I completely agree with your analysis. IDL definitely permits accented characters in identifiers. As you point out, these are illegal in the target languages as identifiers. As far as I know, none of the language bindings have addressed this. I suspect the reason is the anglo-centric mind set of "a letter is what ASCII defines as a letter".

I've tried this with a number of IDL compilers, and they all produce error messages with varying degrees of incomprehensibility :-(

I think this is simply a defect in the IDL specification. When you think about it, permitting characters in IDL identifiers that are not part of common implementation character sets creates mapping difficulties. For example, any simple translation rule can also cause clashes:

    enum Farbe { rot, grün, blau };
    enum ASCII_Farbe { blutrot, gruen, himmelblau };

For non-Germans, "gruen" is an alternative spelling for "grün" if umlauts are not available. The question is, do the two identifiers clash or not? They are in the same scope in IDL, but are spelled differently, so they should not clash. However, after translation, they both end up being the same:

    // C++, assuming the above mapping rule
    enum Farbe { rot, gruen, blau };
    enum ASCII_Farbe { blutrot, gruen, himmelblau }; // Redefinition

In my opinion, the easiest thing would be to restrict IDL identifiers to use *ASCII* letters, digits, and underscore. This neatly eliminates the problem. It discriminates against Europeans, I know. However, they've lived with this restriction in every programming language I know of, so adding one more shouldn't hurt too much ;-)

Cheers,

Michi.
--
Michi Henning +61 7 33654310
DSTC Pty Ltd +61 7 33654311 (fax)
University of Qld 4072 michi@dstc.edu.au
AUSTRALIA http://www.dstc.edu.au/BDU/staff/michi-henning.html

Return-Path:
Sender: jis@fpk.hp.com
Date: Wed, 24 Jun 1998 15:19:36 -0400
From: Jishnu Mukerji
Reply-To: jis@fpk.hp.com
Organization: Hewlett-Packard New Jersey Labs
To: orb_revision@omg.org
Subject: Issue 724 proposal

Please see attached.

Jishnu.
--
Jishnu Mukerji Email: jis@fpk.hp.com
Hewlett-Packard New Jersey Labs, Tel: +1 973 443 7528
300 Campus Drive, 2E-62, Fax: +1 973 443 7422
Florham Park, NJ 07932, USA.

Issue 724: Non ASCII alphabetics in IDL identifiers (orb_revision)

Source: DSTC (Mr. Michi Henning, michi@dstc.edu.au)
Nature: Uncategorized
Severity:
Summary: IDL identifiers can contain non-ASCII alphabetic characters. None of the language mappings deals with this. To fix this restrict IDL identifiers to ASCII characters, digits and underscore
Resolution: Since most of the implementation languages to which IDL is mapped do not accept non-ASCII characters in identifiers, and none of the language mappings for those languages deal with how non-ASCII characters get mapped to ASCII characters, it is time to recognize reality and restrict the alphabets in IDL identifiers to the ASCII subset of ISO 8859.1. The necessary changes in text are presented below
Revised Text: All revisions refer to base document Rev 2.2+ i.e. Rev 2.2 as modified by ptc/98-03-03

Section 3.2 last paragraph: Replace the paragraph with the following:

"Table 3-2 shows the ISO Latin-1 alphabetic characters; upper and lower case equivalences are paired. The first column of table 3-2 shows the ASCII subset of alphabetics, henceforth referred to as ASCII alphabetic."

Section 3.2.3: First Paragraph: replace by:

"An identifier is an arbitrarily long sequence of the ASCII alphabetic, digit, and underscore ("_") characters. The first character must be ASCII alphabetic.
All characters are significant."

Section 3.2.3 bullet list: Remove the second bullet item.

Actions taken:
September 17, 1997: received issue

Return-Path:
X-Authentication-Warning: tigger.dstc.edu.au: michi owned process doing -bs
Date: Thu, 25 Jun 1998 08:18:04 +1000 (EST)
From: Michi Henning
Reply-To: Michi Henning
To: Jishnu Mukerji
cc: orb_revision@omg.org
Subject: Re: Issue 724 proposal
Content-ID:

On Wed, 24 Jun 1998, Jishnu Mukerji wrote:

> Please see attached.

Jishnu,

some comments. I agree with the changes you proposed, but I think we need to make a few more for consistency.

Section 3.2 says:

    OMG IDL uses the ISO Latin-1 (8859.1) character set. This character set is divided...

I think this paragraph must be deleted, because it is no longer correct as it stands. Instead, I would suggest

    OMG IDL uses the ASCII character set, except for string literals and character literals, which use the ISO Latin-1 (8859.1) character set.

I would also suggest to delete tables 3-2, 3-3, 3-4, and 3-5. Now that identifiers are limited to ASCII letters, digits, and underscore, there is little point in retaining these tables.

Section 3.2.5, "Character Literals", 2nd para: I would remove the references to the tables.

Cheers,

Michi.
--
Michi Henning +61 7 33654310
DSTC Pty Ltd +61 7 33654311 (fax)
University of Qld 4072 michi@dstc.edu.au
AUSTRALIA http://www.dstc.edu.au/BDU/staff/michi-henning.html

Return-Path:
Sender: jis@fpk.hp.com
Date: Wed, 24 Jun 1998 18:32:54 -0400
From: Jishnu Mukerji
Organization: Hewlett-Packard New Jersey Labs
To: Michi Henning
Cc: orb_revision@omg.org
Subject: Re: Issue 724 proposal
References:

Michi, thanks for those suggestions. I agree with them and will incorporate in the final version that is put up for vote. I will send out the revised issue for review in a few days.

Thanks,

Jishnu.
--
Jishnu Mukerji Email: jis@fpk.hp.com
Hewlett-Packard New Jersey Labs, Tel: +1 973 443 7528
300 Campus Drive, 2E-62, Fax: +1 973 443 7422
Florham Park, NJ 07932, USA.

Return-Path:
Date: Fri, 3 Jul 1998 04:05:07 -0400
From: ter@holta.ho.lucent.com (T E Rutt)
To: orb_revision@omg.org
Subject: Concern on issue for ASCII identifiers

While I can accept the change for IDL identifiers to be restricted to ASCII alphabetics, I do not think it wise to eliminate the tables for Latin 1. The ISO standard for Latin 1 may not be easy for everyone to find.

Latin 1 is still the default coding scheme for IDL strings and Char types. Thus it has an important role in IDL still.

To get anything other than Latin 1 requires a negotiation with service context.

Also, the trader and notification constraint grammar depends on a restriction to Latin-1, not ASCII.

Thus the proposed change is too harsh.

The tables should remain describing Latin 1, and text regarding its use as default character set needs to be retained or beefed up.

Tom Rutt

(I am withholding my vote on vote2, until this issue on the issue is discussed further).

Return-Path:
Date: Fri, 03 Jul 1998 10:26:13 -0400
From: Jonathan Biggar
Organization: Floorboard Software
To: Jishnu Mukerji
CC: Dan Frantz, Jeff Mischkinsky, Martin Chapman, Tom Rutt, Jonathan Biggar, Ken Fleming, Stephen McNamara, Ken Urquhart, Randy Fox, Bill Janssen, orb_revision@omg.org
Subject: Re: Call for ORB Revision Vote 2
References: <359BE3FE.FC4D9F65@fpk.hp.com>

I vote yes on all issues except for:

#724: I agree with Tom. Keep the Latin-1 tables since they are still referenced as part of the spec for string & character literals.

#910, #999: I don't think we have solved all of the issues yet. I'm withholding my vote pending further discussion.
--
Jon Biggar
Floorboard Software
jon@floorboard.com
jon@biggar.org

Return-Path:
X-Authentication-Warning: tigger.dstc.edu.au: michi owned process doing -bs
Date: Sat, 4 Jul 1998 09:37:20 +1000 (EST)
From: Michi Henning
To: T E Rutt
cc: orb_revision@omg.org
Subject: Re: Concern on issue for ASCII identifiers

On Fri, 3 Jul 1998, T E Rutt wrote:

> While I can accept the change for IDL identifiers to
> be restricted to ASCII alphabetics, I do not think it wise
> to eliminate the tables for Latin 1. The ISO standard for
> Latin 1 may not be easy for everyone to find.

Well, I actually tried this and got quite a lot of hits from altavista that got me to the relevant tables straight away.

We are not publishing the ASCII table either, are we? ;-)

> Latin 1 is still the default coding scheme for IDL strings
> and Char types. Thus it has an important role in IDL still.
>
> [...]
>
> Thus the proposed change is too harsh.
>
> The tables should remain describing Latin 1, and text regarding
> its use as default character set needs to be retained or beefed up.

I'm not hung up about it. The tables won't do any harm, it's just that my impression was they were mainly included to define the upper-case to lower-case mapping. Since that is no longer an issue, I thought the tables were redundant.

However, if people feel strongly about it, I see no great harm in leaving the tables in place.

Cheers,

Michi.
--
Michi Henning +61 7 33654310
DSTC Pty Ltd +61 7 33654311 (fax)
University of Qld 4072 michi@dstc.edu.au
AUSTRALIA http://www.dstc.edu.au/BDU/staff/michi-henning.html

Return-Path:
Sender: jis@fpk.hp.com
Date: Mon, 06 Jul 1998 11:14:39 -0400
From: Jishnu Mukerji
Organization: Hewlett-Packard New Jersey Laboratories
To: Michi Henning
Cc: T E Rutt, orb_revision@omg.org
Subject: Re: Concern on issue for ASCII identifiers
References:

Michi Henning wrote:

> On Fri, 3 Jul 1998, T E Rutt wrote:
>
> > While I can accept the change for IDL identifiers to
> > be restricted to ASCII alphabetics, I do not think it wise
> > to eliminate the tables for Latin 1. The ISO standard for
> > Latin 1 may not be easy for everyone to find.
>
> Well, I actually tried this and got quite a lot of hits from altavista
> that got me to the relevant tables straight away.
>
> We are not publishing the ASCII table either, are we? ;-)
>
> > Latin 1 is still the default coding scheme for IDL strings
> > and Char types. Thus it has an important role in IDL still.
> >
> > [...]
> >
> > Thus the proposed change is too harsh.
> >
> > The tables should remain describing Latin 1, and text regarding
> > its use as default character set needs to be retained or beefed up.
>
> I'm not hung up about it. The tables won't do any harm, it's just that my
> impression was they were mainly included to define the upper-case to
> lower-case mapping. Since that is no longer an issue, I thought the tables
> were redundant.
>
> However, if people feel strongly about it, I see no great harm in leaving
> the tables in place.

OK. In the interest of quick resolution of this disagreement, let us keep the tables in. I will modify the proposed text accordingly and put up this issue for vote in its modified form in the next round.

Thanks all,

Jishnu.
Return-Path:
Date: Mon, 6 Jul 1998 05:11:03 -0400
From: ter@holta.ho.lucent.com (T E Rutt)
To: orb_revision@omg.org, soley@omg.org, lowe@omg.org
Subject: think twice before removing latin 1 identifiers

I have been talking with Kerry Raymond about the elimination of Latin-1 characters from identifiers in IDL, and we came up with the following points which bear further discussion before we approve the revision request:

1) IDL is not an ISO standard, and any OMG change will have to be propagated through the ISO defect process.

2) Latin-1 has most of the characters needed for european languages to express their common words.

3) We must be very sure that no-one has written IDL using the Latin-1 characters. Since vendor language bindings (except for java) probably would reject latin-1 extensions at IDL compile time it may be that no-one has actually done this.

I suggest that we float this proposed change to a much wider international audience than the orb revision RTF. We do not want to have an "internationalization" problem of major consequences on our hands.

Ugly escapes in language bindings might be a better approach, since if ascii is used they will not appear.

If we do this we better make sure our european users are really happy with it.

Tom Rutt

Return-Path:
X-Authentication-Warning: tigger.dstc.edu.au: michi owned process doing -bs
Date: Mon, 6 Jul 1998 21:31:37 +1000 (EST)
From: Michi Henning
To: T E Rutt
cc: orb_revision@omg.org, soley@omg.org, lowe@omg.org
Subject: Re: think twice before removing latin 1 identifiers

On Mon, 6 Jul 1998, T E Rutt wrote:

> 1) IDL is not an ISO standard, and any OMG change will
> have to be propagated through the ISO defect process.

I can't comment on the procedural side of things. However, given that most language mappings can't cope with non-ASCII characters, it seems a bit of a moot point.

> 2) Latin-1 has most of the characters needed for european
> languages to express their common words.
>
> 3) We must be very sure that no-one has written IDL using
> the Latin-1 characters. Since vendor language bindings
> (except for java) probably would reject latin-1 extensions
> at IDL compile time it may be that no-one has actually done this.

To the best of my knowledge, not a single IDL compiler on the planet actually can handle non-ASCII characters in IDL identifiers. Given that, we can hardly argue that the functionality is essential, can we?

More to the point, the requirement to support non-ASCII in IDL is actually in conflict with the requirement that IDL is preprocessed. This is because characters in the range 0x80-0xff (plus some ASCII characters) are actually illegal input to the preprocessor.

> I suggest that we float this proposed change to a much wider
> international audience than the orb revision RTF. We do not
> want to have an "internationalization" problem of major consequences
> on our hands.

Well, we don't have this problem now, because no IDL compilers handle non-ASCII. Given that we don't have a problem now, we would hardly create one by admitting to the truth...

Cheers,

Michi.
--
Michi Henning +61 7 33654310
DSTC Pty Ltd +61 7 33654311 (fax)
University of Qld 4072 michi@dstc.edu.au
AUSTRALIA http://www.dstc.edu.au/BDU/staff/michi-henning.html

Return-Path:
Date: Mon, 6 Jul 1998 08:33:14 -0400 (EDT)
From: Bill Beckwith
X-Sender: beckwb@gamma
To: T E Rutt
cc: orb_revision@omg.org, soley@omg.org, lowe@omg.org
Subject: Re: think twice before removing latin 1 identifiers

On Mon, 6 Jul 1998, T E Rutt wrote:
...
> Since vendor language bindings
> (except for java) probably would reject latin-1 extensions
> at IDL compile time it may be that no-one has actually done this.

Note that the current IDL to Ada mapping handles Latin-1.
-- Bill

Return-Path:
X-Sender: hlowe@emerald.omg.org
Date: Mon, 06 Jul 1998 08:47:33 -0400
To: ter@holta.ho.lucent.com (T E Rutt), orb_revision@omg.org, soley@omg.org
From: Henry Lowe
Subject: Re: think twice before removing latin 1 identifiers

At 05:11 AM 7/6/98 -0400, T E Rutt wrote:
>I have been talking with Kerry Raymond about the
>elimination of Latin-1 characters from identifiers
>in IDL, and we came up with the following points which
>bear further discussion before we approve the revision
>request:
>
>1) IDL is not an ISO standard, and any OMG change will
>have to be propagated through the ISO defect process.
>

From the standards point of view, removing Latin-1 would have the potential for causing quite a stir in ISO which is promoting internationalization.

Return-Path:
Sender: jis@fpk.hp.com
Date: Mon, 06 Jul 1998 11:28:28 -0400
From: Jishnu Mukerji
Organization: Hewlett-Packard New Jersey Laboratories
To: Henry Lowe
Cc: T E Rutt, orb_revision@omg.org, soley@omg.org
Subject: Re: think twice before removing latin 1 identifiers
References: <3.0.1.32.19980706084733.00723124@emerald.omg.org>

Henry Lowe wrote:

> At 05:11 AM 7/6/98 -0400, T E Rutt wrote:
> >I have been talking with Kerry Raymond about the
> >elimination of Latin-1 characters from identifiers
> >in IDL, and we came up with the following points which
> >bear further discussion before we approve the revision
> >request:
> >
> >1) IDL is not an ISO standard, and any OMG change will
> >have to be propagated through the ISO defect process.
> >
> From the standards point of view, removing Latin-1 would
> have the potential for causing quite a stir in ISO which
> is promoting internationalization.

Then the only remaining way to deal with this is to raise an issue against every language binding that does not handle full Latin-1 identifiers properly to provide mappings for the full latin-1 character set to the name space that is available for that implementation language. Additionally we have to modify the requirement that the IDL be pre-processed using a C++-like pre-processor, 'cause that won't let any latin-1 characters through anyway:-).
Personally, I think hell has higher chances of freezing over before those issues will be addressed:-).

So we can either (1) state the reality as is or (2) start walking down the arduous road to change reality to some ideal (i.e. full latin-1 acceptable for identifiers) or (3) stick our heads in the proverbial sand and ignore it for now or forever and hope nobody will bug us about it.:-)

Personally I prefer (1) and taking the lumps since I don't see anyone really investing to make (2) happen because the commercial returns from doing that are so minuscule. Of course (3) is always a chicken's way out of this mess;-) and we could certainly do that anytime:-).

I do however, agree with Tom that it is prudent to float this issue with the proposed resolution to a broader audience with complete reasoning for why this makes sense, and defer it to a followup RTF, if people feel strongly about it.

I do not want the rest of the revision to get hung up on this one issue.

Cheers,

Jishnu.

Return-Path:
Sender: jis@fpk.hp.com
Date: Mon, 06 Jul 1998 12:17:38 -0400
From: Jishnu Mukerji
Organization: Hewlett-Packard New Jersey Laboratories
To: Jishnu Mukerji
Cc: Henry Lowe, T E Rutt, orb_revision@omg.org, soley@omg.org
Subject: Re: think twice before removing latin 1 identifiers
References: <3.0.1.32.19980706084733.00723124@emerald.omg.org> <35A0ED1B.F789D407@fpk.hp.com>

Jishnu Mukerji wrote:

> So we can either (1) state the reality as is or (2) start walking down
> the arduous road to change reality to some ideal (i.e. full latin-1
> acceptable for identifiers) or (3) stick our heads in the proverbial sand
> and ignore it for now or forever and hope nobody will bug us about
> it.:-)
>
> Personally I prefer (1) and taking the lumps since I don't see anyone
> really investing to make (2) happen because the commercial returns from
> doing that are so minuscule. Of course (3) is always a chicken's way out
> of this mess;-) and we could certainly do that anytime:-).
>
> I do however, agree with Tom that it is prudent to float this issue with
> the proposed resolution to a broader audience with complete reasoning
> for why this makes sense, and defer it to a followup RTF, if people feel
> strongly about it.

In order to get a measure of how strongly people feel about this issue, I will include a vote on issue 724 with modified proposed resolution taking into account the comments regarding the tables. Depending on how that vote goes we will decide what to do with this issue. Sounds fair?

> I do not want the rest of the revision to get hung up on this one issue.

That bit still remains true.:-)

Thanks,

Jishnu.

Return-Path:
X-Sender: hlowe@emerald.omg.org
Date: Mon, 06 Jul 1998 12:25:00 -0400
To: Jishnu Mukerji, Henry Lowe
From: Henry Lowe
Subject: Re: think twice before removing latin 1 identifiers
Cc: T E Rutt, orb_revision@omg.org, soley@omg.org
References: <3.0.1.32.19980706084733.00723124@emerald.omg.org>

Jishnu,

What I wrote shouldn't be read as my taking a position on this issue (staff aren't supposed to, though, privately I'd agree with you -- take the lumps now and move on). I was just warning that in ISO, if we're not very careful, some voters (NB's) who have a predilection for character sets might take removing Latin-1 as a move against I18N.

My note was nothing more than a warning that some ISO folk could take this the wrong way and, if we're to adopt your choice (1) below, we want to get our ISO ducks lined up beforehand.
Regards,
Henry

------------------------------------------
At 11:28 AM 7/6/98 -0400, Jishnu Mukerji wrote:
>Henry Lowe wrote:
>
>> At 05:11 AM 7/6/98 -0400, T E Rutt wrote:
>> >I have been talking with Kerry Raymond about the
>> >elimination of Latin-1 characters from identifiers
>> >in IDL, and we came up with the following points which
>> >bear further discussion before we approve the revision
>> >request:
>> >
>> >1) IDL is not an ISO standard, and any OMG change will
>> >have to be propagated through the ISO defect process.
>> >
>> From the standards point of view, removing Latin-1 would
>> have the potential for causing quite a stir in ISO which
>> is promoting internationalization.
>
> Then the only remaining way to deal with this is to raise an issue
>against every language binding that does not handle full Latin-1
>identifiers properly to provide mappings for the full latin-1 character
>set to the name space that is available for that implementation
>language. Additionally we have to modify the requirement that the IDL be
>pre-processed using a C++-like pre-processor, 'cause that won't let any
>latin-1 characters through anyway:-). Personally, I think hell has
>higher chances of freezing over before those issues will be
>addressed:-).
>
>So we can either (1) state the reality as is or (2) start walking down
>the arduous road to change reality to some ideal (i.e. full latin-1
>acceptable for identifiers) or (3) stick our heads in the proverbial sand
>and ignore it for now or forever and hope nobody will bug us about
>it.:-)
>
>Personally I prefer (1) and taking the lumps since I don't see anyone
>really investing to make (2) happen because the commercial returns from
>doing that are so minuscule. Of course (3) is always a chicken's way out
>of this mess;-) and we could certainly do that anytime:-).
>
>I do however, agree with Tom that it is prudent to float this issue with
>the proposed resolution to a broader audience with complete reasoning
>for why this makes sense, and defer it to a followup RTF, if people feel
>strongly about it.
>
>I do not want the rest of the revision to get hung up on this one issue.
>
>Cheers,
>
>Jishnu.
>

Return-Path:
Sender: jis@fpk.hp.com
Date: Mon, 06 Jul 1998 12:39:18 -0400
From: Jishnu Mukerji
Organization: Hewlett-Packard New Jersey Laboratories
To: Bill Beckwith
Cc: T E Rutt, orb_revision@omg.org, soley@omg.org, lowe@omg.org
Subject: Re: think twice before removing latin 1 identifiers
References:

Bill Beckwith wrote:

> On Mon, 6 Jul 1998, T E Rutt wrote:
> ...
> > Since vendor language bindings
> > (except for java) probably would reject latin-1 extensions
> > at IDL compile time it may be that no-one has actually done this.
>
> Note that the current IDL to Ada mapping handles Latin-1.
>
> -- Bill

Hmmm, so how do you do that in a real compiler that uses a "preprocessor" as described in "The Annotated C++ Reference Manual", one that recognizes tokens as made up of only ASCII characters, as one is required to do by the last paragraph of Section 3.3 of the CORBA Core? In particular, are the filenames that appear in the #include restricted to ASCII or should they also be full latin-1? How do you get an ASCII only preprocessor to meaningfully read pragmas that contain full latin-1 repository IDs?

The point I am trying to make by asking this question is that this issue revolves around a rather large bit of internal inconsistency that exists in the CORBA Core specification that we need to fix, the sooner the better.

Thanks,

Jishnu.
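
[Editorial sketch, not part of the archived thread.] The identifier wording proposed for Section 3.2.3 (first character ASCII alphabetic, then ASCII alphabetics, digits, and underscore) and the 0x80-0xff byte range mentioned in the discussion can both be checked mechanically. The following Python is only a minimal illustration under those assumptions; the names `is_ascii_idl_identifier` and `non_ascii_positions` are hypothetical, and the check ignores keywords and every other lexical rule of IDL.

```python
import re
from typing import List

# Proposed rule from the Issue 724 revised text: the first character is
# ASCII alphabetic; the rest are ASCII alphabetic, digit, or underscore.
_ASCII_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_ascii_idl_identifier(name: str) -> bool:
    """Return True if name satisfies the proposed ASCII-only rule."""
    return _ASCII_IDENTIFIER.fullmatch(name) is not None


def non_ascii_positions(source: bytes) -> List[int]:
    """Return the offsets of all bytes in the range 0x80-0xff."""
    return [i for i, b in enumerate(source) if b >= 0x80]
```

Under this sketch, an identifier such as "Farbe" passes, while one containing an umlaut is rejected, and the Latin-1 encoding of that identifier is reported as containing a byte outside the ASCII range.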
Return-Path:
X-Authentication-Warning: tigger.dstc.edu.au: michi owned process doing -bs
Date: Tue, 7 Jul 1998 06:51:48 +1000 (EST)
From: Michi Henning
To: Henry Lowe
cc: Jishnu Mukerji, T E Rutt, orb_revision@omg.org, soley@omg.org
Subject: Re: think twice before removing latin 1 identifiers

On Mon, 6 Jul 1998, Henry Lowe wrote:

> Jishnu,
>
> What I wrote shouldn't be read as my taking a position on
> this issue (staff aren't supposed to, though, privately
> I'd agree with you -- take the lumps now and move on).
> I was just warning that in ISO, if we're not very careful,
> some voters (NB's) who have a predilection for character sets
> might take removing Latin-1 as a move against I18N.
>
> My note was nothing more than a warning that some ISO
> folk could take this the wrong way and, if we're to adopt
> your choice (1) below, we want to get our ISO ducks lined
> up beforehand.

Hmmm... I think the ISO folks can take this any way they like. It doesn't matter how they take it, because right now,

  1) the IDL spec has an internal contradiction (it requires use of the preprocessor which is physically incapable of handling ISO Latin-1),

  2) not a single IDL compiler implements processing of ISO Latin-1,

  3) even if an IDL compiler were to process ISO Latin-1, it would not generate legal language bindings because for several language bindings, allowing these characters generates illegal source code,

  4) for languages without native support for ISO Latin-1, mapping such identifiers would create bindings that are unusably ugly.

So, people can scream all they like for internationalization -- the technical reality appears to be that supporting ISO Latin-1 IDL identifiers is technically infeasible (and by the way, that is not the same as saying that CORBA can't do internationalization).
As Jishnu said, we can

  1) stick our heads in the sand and pretend the problem does not exist,

  2) or we can define a new preprocessor that handles ISO Latin-1 and then require the defective language bindings to add the missing support,

  3) or we can admit that any solution to this would result in something too ugly to be used and therefore such a solution isn't worth creating.

My vote is clearly for option three.

Someone previously suggested to contact the Europeans to get an opinion. Keep in mind that to date, very few programming languages permit non-ASCII identifiers. Despite this, European programmers have managed to survive quite well for the last three decades or so...

Cheers,

Michi.
--
Michi Henning +61 7 33654310
DSTC Pty Ltd +61 7 33654311 (fax)
University of Qld 4072 michi@dstc.edu.au
AUSTRALIA http://www.dstc.edu.au/BDU/staff/michi-henning.html

Return-Path:
From: "Daniel R. Frantz"
To:
Subject: RE: think twice before removing latin 1 identifiers
Date: Mon, 6 Jul 1998 17:33:32 -0400
X-MSMail-Priority: Normal
Importance: Normal
X-MimeOLE: Produced By Microsoft MimeOLE V4.72.2106.4

In the time I served on ISO/IEC JTC1 SC22 (Languages and Environments), it was very clear that having non-ASCII characters in identifiers just wasn't going to happen. If I remember correctly, SC22 recognized that fact in an informal document and considered the issue closed.

In order to do it correctly, you have to be able to do much more than Latin-1. You have to do all the other character sets in the ISO world, including the multiple byte sets such as 10646, which includes provisions for both fixed width and variable length character codes. In order to do that, there has to be some private vendor method for stating what character set is going to be used for the identifiers in this particular compilation unit. You need that so that the compiler can recognize, for example, that it is reading a source file with Unicode rather than ASCII for its identifiers.
Otherwise, you can't tell what an identifier is since those two-byte characters could legally be interpreted as one-byte characters with punctuation and such. When you have that situation, the source code (the file) is not in the least bit portable because there's nothing to force a vendor to support a "optional" particular character set and certainly not all of them in the world. Recognizing this, SC22 stated that the primary goal is for languages to have some means of dealing with multiple character sets at run time. Not all languages can do even that yet. The next step, obviously more difficult, is to be able to express character constants in other than the ASCII character set. Most can't. At the time I stopped attending SC22, about 1994, that was the situation. It may have changed since then, but I seriously doubt it. We should certainly contact SC22 to find out, but the surprise will be to find that SC22 believes that languages should have anything else but ASCII identifiers as part of a standard. My point? ISO languages aren't expecting to deal with non-ASCII identifiers. Except for some research, non-production languages, it just doesn't happen. IDL shouldn't try to solve a problem that no other ISO language solves. This will not come as a shock to the ISO language community. Perhaps other parts of ISO will be shocked if they don't know the background, but we can just refer them to their own committee, SC22. Dan Frantz Return-Path: X-Sender: giddiv@gamma Date: Mon, 06 Jul 1998 20:52:42 -0400 To: Michi Henning , Henry Lowe From: Victor Giddings Subject: Re: think twice before removing latin 1 identifiers Cc: Jishnu Mukerji , T E Rutt , orb_revision@omg.org, soley@omg.org References: <3.0.1.32.19980706122500.0076bc54@emerald.omg.org> I hesitate to wade in, but.... At 06:51 AM 7/7/98 +1000, Michi Henning wrote: > >Hmmm... I think the ISO folks can take this any way they like. 
It doesn't >matter how they take it, because right now, > > 1) the IDL spec has an internal contradiction (it requires use > of the preprocess which is physically incapable of handling > ISO Latin-1), This refers back to the following from that Jishnu provided (my insertion): >At 12:39 PM 7/6/98 -0400, Jishnu Mukerji wrote: >Hmmm, so how do you do that in a real compiler that uses a >"preprocessor" as described in "The Annotated C++ Reference Manual", one >that recognizes tokens as made up of only ASCII characters, as one is >required to do by the last paragraph of Section 3.3 of the CORBA Core? >In particular, as the filenames that appear in the #include restricted >to ASCII or should they also be full latin-1? How do you get an ASCII >only preprocessor to meaningfully read pragmas that contain full latin-1 >respository IDs? > >The point I am trying to make by asking this question is that this issue >revolves around a rather large bit of internal inconsistency that exists >in the CORBA Core specification that we need to fix, the sooner the >better. (end inserted quote) This may well be a contradiction, but there are several other fixes: 1) Adopt the C++ Draft International Standard and deal with the "upper half" in the language mappings (of which 2 already have). 2) Spell out the intended syntax in the CORBA spec, e.g. adopt Latin-1 as the character set for the preprocessor tokens. I'd favor the second. > > 2) not a single IDL compiler implements processing of ISO Latin-1, There is at least one that implements this, the ORBexpress IDL compiler. We gave the explicit statement in the CORBA specification precedence over the referenced text book. > > 3) even if an IDL compiler were to process ISO Latin-1, it would > not generate legal language bindings because for several > language bindings, allowing these characters generates illegal > source code, Again, this could be handled by the language mappings. 
I note that the C and C++ mappings are the only ones that do not discuss the lexical mapping of names to the target language. > > 4) for languages without native support for ISO Latin-1, mapping > such identifiers would create bindings that are unusably ugly. No language mapping is without warts; several instances of a "ugly names" were brought up earlier. However, the "ugliness" can be addressed by the user, i.e. if the user doesn't like the result, the user can change the IDL. To keep the ugliness out of the CORBA specifications, I suggest that the IDL Style Guide address this issue. I suggest the mandatory guidelines for identifiers in adopted submissions include: 1) limitation to ASCII character set 2) limitation to 30 characters 3) no trailing or successive underscores (actually, I think this is already covered) 4) no identifiers that differ only by presence of an underscore (e.g., a_partridge_in_a_pear_tree, APartridgeInAPearTree) I'm sure there's more.. Further, the recommendation could be made that, for portability between languages, the above are recommended practices in all IDL. >So, people can scream all they like for internationalization -- the technical >reality appears to be that supporting ISO Latin-1 IDL identifiers is >technically infeasible (and by the way, that is not the same as saying >that CORBA can't do internationalization). > >As Jishnu said, we can > > 1) stick our heads in the sand and pretend the problem does not exist, > > 2) or we can define a new preprocessor that handles ISO Latin-1 > and then require the defective language bindings to add the > missing support, > > 3) or we can admit that any solution to this would result > in something too ugly to be used and therefore such a > solution isn't worth creating. I'm not sure what we are "creating". AFAIK, the requirement for Latin-1 identifiers goes back to the original specification. >My vote is clearly for option three. 
> >Someone previously suggested to contact the Europeans to get an opinion. >Keep in mind that to date, very few programming languages permit non-ASCII >identifiers. Despite this, European programmers have managed to survive >quite well for the last three decades or so... But were they happy about it? > > Cheers, > > Michi. >-- >Michi Henning +61 7 33654310 >DSTC Pty Ltd +61 7 33654311 (fax) >University of Qld 4072 michi@dstc.edu.au >AUSTRALIA http://www.dstc.edu.au/BDU/staff/michi-henning.html > My (non)vote is for 2. Return-Path: X-Authentication-Warning: tigger.dstc.edu.au: michi owned process doing -bs Date: Tue, 7 Jul 1998 12:12:03 +1000 (EST) From: Michi Henning To: Victor Giddings cc: Henry Lowe , Jishnu Mukerji , T E Rutt , orb_revision@omg.org, soley@omg.org Subject: Re: think twice before removing latin 1 identifiers On Mon, 6 Jul 1998, Victor Giddings wrote: > This may well be a contradiction, but there are several other fixes: > 1) Adopt the C++ Draft International Standard and deal with the > "upper > half" in the language mappings (of which 2 already have). > 2) Spell out the intended syntax in the CORBA spec, e.g. adopt > Latin-1 as > the character set for the preprocessor tokens. > > I'd favor the second. Right. And all the vendors can go and implement their own preprocessor that handles ISO Latin-1 instead of calling the preprocessor that comes with the C++ compiler? I don't think that would be very popular -- last time I looked, a C++ preprocessor took about 30000 lines of code. > > 2) not a single IDL compiler implements processing of ISO Latin-1, > > There is at least one that implements this, the ORBexpress IDL compiler. We > gave the explicit statement in the CORBA specification precedence over the > referenced text book. Which referenced textbook are you referring to? 
At any rate, the ORBexpress compiler *must* be in non-compliance with the spec, because to do this, its preprocessor must tokenize differently than what is required by the C++ spec. Also, how do you generate a C or C++ mapping for IDL identifiers that contain accented characters? And how does the C or C++ compiler then deal with these? > > 3) even if an IDL compiler were to process ISO Latin-1, it would > > not generate legal language bindings because for several > > language bindings, allowing these characters generates illegal > > source code, > > Again, this could be handled by the language mappings. I note that the C > and C++ mappings are the only ones that do not discuss the lexical mapping > of names to the target language. They *do* discuss the lexical mapping to the target language. It's just that the lexical mapping that is discussed results in illegal source code if the identifiers contain anything but ASCII letters, digits, and underscores. For the record, I do not see how the COBOL mapping could deal with this either. I believe Ada and Java can handle the identifiers. Not sure about Smalltalk. Once we get Eiffel, Oberon, Objective-C, Perl, Visual Basic, or other mappings, the set of languages that cannot handle ISO Latin-1 will increase, not decrease. > > 4) for languages without native support for ISO Latin-1, mapping > > such identifiers would create bindings that are unusably ugly. > > No language mapping is without warts; several instances of a "ugly names" > were brought up earlier. However, the "ugliness" can be addressed by the > user, i.e. if the user doesn't like the result, the user can change the > IDL. No. The user cannot change the IDL, because very frequently, the IDL is written by someone else. > To keep the ugliness out of the CORBA specifications, I suggest that > the IDL Style Guide address this issue. 
I suggest the mandatory > guidelines > for identifiers in adopted submissions include: > 1) limitation to ASCII character set > 2) limitation to 30 characters > 3) no trailing or successive underscores (actually, I think this is > already > covered) > 4) no identifiers that differ only by presence of an underscore > (e.g., > a_partridge_in_a_pear_tree, APartridgeInAPearTree) > I'm sure there's more.. Fine. So we have "mandatory guidelines"? That sounds oxymoronic to me. If a particular input to the IDL compiler is legal, then the spec must state how this input is transformed into output. Right now, the requirement for the preprocessor makes identifiers containing characters in the range 0x80 - 0xff illegal input. Even if we declare this input legal, the IDL compiler will still generate illegal output for some target languages. > Further, the recommendation could be made that, for portability between > languages, the above are recommended practices in all IDL. So we have "mandatory guidelines" that are "recommended practice"? This doesn't make sense to me. Either it is legal, in which case it has to work, or it is not legal. > >As Jishnu said, we can > > > > 1) stick our heads in the sand and pretend the problem does > not exist, > > > > 2) or we can define a new preprocessor that handles ISO > Latin-1 > > and then require the defective language bindings to add the > > missing support, > > > > 3) or we can admit that any solution to this would result > > in something too ugly to be used and therefore such a > > solution isn't worth creating. > > I'm not sure what we are "creating". AFAIK, the requirement for > Latin-1 > identifiers goes back to the original specification. The original specification was no better than the current one. It contained the same internal contradiction and conveniently ignored the issue of mapping ISO Latin-1 identifiers. The fact that this has been there from day one doesn't make it right. > But were they happy about it? 
One person (Martin von Loewis, Germany) responded that programmers are quite used to not having native characters for programming language identifiers. And, as Dan Frantz pointed out, ISO programming languages don't handle this either. I am German myself and have worked as a programmer in Germany. I have never perceived the lack of German characters in programming language identifiers as a problem. Nor has any of my German colleagues ever expressed dissatisfaction because of the absence of native characters in identifiers. This entire thing is a non-issue -- Europeans have programmed without these characters for more than three decades. They still program without these characters today. Only very few programming languages offer support for native characters in identifiers. Despite that, we have not seen the demise of the non-English speaking software industry. Of course, if we really insist, we *can* redefine the preprocessor to handle ISO Latin-1, and we *can* change all the language bindings to create mangled names that will look incomprehensible and ugly at the source-code level, and we *can* put up with the fact that such a mapping would possibly become order-sensitive, and we *can* coincidentally declare the majority of commercial IDL compilers as non-compliant. We can do all of the above to find a solution for a problem that no-one has complained about. After all, CORBA doesn't handle ISO Latin-1 today, and it hasn't keeled over and died because of it. Surely, there are better things to expend our energy on? Cheers, Michi. -- Michi Henning +61 7 33654310 DSTC Pty Ltd +61 7 33654311 (fax) University of Qld 4072 michi@dstc.edu.au AUSTRALIA http://www.dstc.edu.au/BDU/staff/michi-henning.html Return-Path: Date: Tue, 7 Jul 1998 03:47:13 -0400 From: ter@holta.ho.lucent.com (T E Rutt) To: orb_revision@omg.org Subject: I cave in on ASCII IDL Identifiers I say Uncle I can now vote yes for the ASCII restriction on IDL identifiers.
I would like the final proposal to include some clarification (perhaps a note) stating that string literals used in IDL constants can be from the Latin-1 unrestricted character set. Tom Rutt Return-Path: Date: Tue, 7 Jul 1998 03:47:46 -0700 From: Lewis Stiller To: victor.giddings@ois.com CC: michi@dstc.edu.au, hlowe@omg.org, jis@fpk.hp.com, ter@holta.ho.lucent.com, orb_revision@omg.org, soley@omg.org Subject: A modest proposal I see two similar issues: non-ASCII characters in IDL, and fixed point types In both we see a specification placed into the Core which is either vague or unimplementable; we see that the majority of ORBs do not implement this specification; and we see folks defending the integrity of the specification by arguing that it can be fixed. I think this is the wrong way to approach things. When a specification is in the Core it should be useful, implementable and unambiguous. If it is not, it should be removed, not tinkered with. Here then is my proposal: When a Core specification has been ratified for some period of time to be determined, say 18 months, and when there is doubt that it is implementable, there should be a "survey of actual practice". The survey of actual practice should poll a representative sample of commercial ORBs, let us say the top 20 commercial ORBs measured by licenses or something (or all ORBs, if we want). Some test should then be administered to determine whether these ORBs that claim to be CORBA compliant in fact implement at all the relevant section of the spec.
If more than 80% of the ORBs do not implement that feature in some sense, then that spec should be removed from the Core by default. One can make an exception when the ORB vendors have concrete plans to implement it. That is, as a point of philosophy I do think CORBA Core functionality should be universally, or at least widely, available. I am speaking as an ORB implementor. The presence of so many portions of the specification which are sort of tacitly agreed to be ignored I believe increases the learning curve of users, increases the cost of ORB development, and impairs the credibility and acceptance of CORBA. I admit this proposal may be tongue-in-cheek in some ways, but there has to be some mechanism in place for generating specifications that are available and implementable rather than letting the Core accrete increasingly baroque and obscure features. > X-Sender: giddiv@gamma > From: Victor Giddings > > I hesitate to wade in, but.... > > At 06:51 AM 7/7/98 +1000, Michi Henning wrote: > > > >Hmmm... I think the ISO folks can take this any way they like. It > doesn't > >matter how they take it, because right now, > > > > 1) the IDL spec has an internal contradiction (it requires use > > of the preprocess which is physically incapable of handling > > ISO Latin-1), > > This refers back to the following from that Jishnu provided (my > insertion): > >At 12:39 PM 7/6/98 -0400, Jishnu Mukerji wrote: > >Hmmm, so how do you do that in a real compiler that uses a > >"preprocessor" as described in "The Annotated C++ Reference > Manual", one > >that recognizes tokens as made up of only ASCII characters, as one > is > >required to do by the last paragraph of Section 3.3 of the CORBA > Core? > >In particular, as the filenames that appear in the #include > restricted > >to ASCII or should they also be full latin-1? How do you get an > ASCII > >only preprocessor to meaningfully read pragmas that contain full > latin-1 > >respository IDs? 
> > > >The point I am trying to make by asking this question is that this > issue > >revolves around a rather large bit of internal inconsistency that > exists > >in the CORBA Core specification that we need to fix, the sooner > the > >better. > (end inserted quote) > > This may well be a contradiction, but there are several other > fixes: > 1) Adopt the C++ Draft International Standard and deal with the > "upper > half" in the language mappings (of which 2 already have). > 2) Spell out the intended syntax in the CORBA spec, e.g. adopt > Latin-1 as > the character set for the preprocessor tokens. > > I'd favor the second. > > > > 2) not a single IDL compiler implements processing of ISO > Latin-1, > > There is at least one that implements this, the ORBexpress IDL > compiler. We > gave the explicit statement in the CORBA specification precedence > over the > referenced text book. > > > > > 3) even if an IDL compiler were to process ISO Latin-1, it > would > > not generate legal language bindings because for several > > language bindings, allowing these characters generates > illegal > > source code, > > Again, this could be handled by the language mappings. I note that > the C > and C++ mappings are the only ones that do not discuss the lexical > mapping > of names to the target language. > > > > > 4) for languages without native support for ISO Latin-1, > mapping > > such identifiers would create bindings that are unusably > ugly. > > No language mapping is without warts; several instances of a "ugly > names" > were brought up earlier. However, the "ugliness" can be addressed > by the > user, i.e. if the user doesn't like the result, the user can change > the > IDL. To keep the ugliness out of the CORBA specifications, I > suggest that > the IDL Style Guide address this issue. 
I suggest the mandatory > guidelines > for identifiers in adopted submissions include: > 1) limitation to ASCII character set > 2) limitation to 30 characters > 3) no trailing or successive underscores (actually, I think this is > already > covered) > 4) no identifiers that differ only by presence of an underscore > (e.g., > a_partridge_in_a_pear_tree, APartridgeInAPearTree) > I'm sure there's more.. > > Further, the recommendation could be made that, for portability > between > languages, the above are recommended practices in all IDL. > > >So, people can scream all they like for internationalization -- > the technical > >reality appears to be that supporting ISO Latin-1 IDL identifiers > is > >technically infeasible (and by the way, that is not the same as > saying > >that CORBA can't do internationalization). > > > >As Jishnu said, we can > > > > 1) stick our heads in the sand and pretend the problem does > not exist, > > > > 2) or we can define a new preprocessor that handles ISO > Latin-1 > > and then require the defective language bindings to add the > > missing support, > > > > 3) or we can admit that any solution to this would result > > in something too ugly to be used and therefore such a > > solution isn't worth creating. > > I'm not sure what we are "creating". AFAIK, the requirement for > Latin-1 > identifiers goes back to the original specification. > > >My vote is clearly for option three. > > > >Someone previously suggested to contact the Europeans to get an > opinion. > >Keep in mind that to date, very few programming languages permit > non-ASCII > >identifiers. Despite this, European programmers have managed to > survive > >quite well for the last three decades or so... > > But were they happy about it? > > > > Cheers, > > > > Michi. > >-- > >Michi Henning +61 7 33654310 > >DSTC Pty Ltd +61 7 33654311 (fax) > >University of Qld 4072 michi@dstc.edu.au > >AUSTRALIA > http://www.dstc.edu.au/BDU/staff/michi-henning.html > > > > My (non)vote is for 2. 
> > Return-Path: X-Authentication-Warning: tigger.dstc.edu.au: michi owned process doing -bs Date: Tue, 7 Jul 1998 21:41:31 +1000 (EST) From: Michi Henning To: Lewis Stiller cc: victor.giddings@ois.com, hlowe@omg.org, jis@fpk.hp.com, ter@holta.ho.lucent.com, orb_revision@omg.org, soley@omg.org Subject: Re: A modest proposal On Tue, 7 Jul 1998, Lewis Stiller wrote: > When a specification is in the Core it should be useful, > implementable and unambiguous. If it is not, it should be removed, > not > tinkered with. I couldn't agree more. In fact, I suggested to remove the fixed-point types from the spec until they hang together, but I didn't find many takers ;-) Looks like its OK to put garbage in, but it's not OK to take garbage out :-( > Here then is my proposal: > > When a Core specification has been ratified for some period of > time to be determined, say 18 months, and when there is doubt that > it > is implementable, there should be a "survey of actual practice". The > survey of actual practice should poll a representative sample of > commercial ORBs, let us say the top 20 commercial ORBs measured by > licenses or something (or all ORBs, if we want). > > Some test should then be administered to determine whether > these ORBs that claim to be CORBA compliant in fact implement at all > the relevant section of the spec. If more than 80% of the ORBs do > not > implement that feature in some sense, then that spec should be > removed from the Core by default. One can make an exception when the > ORB vendors > have concrete plans to implement it. I think the proposal floated by Andrew Watson for a preliminary status on the spec is similar to this. I agree though -- if a feature hasn't made it into commercial implementations after some time, it should be removed. (How many ORBs correctly implement context clauses? 
I honestly don't know, but given the apparently universal non-use of the feature, I'd be inclined to make it the first candidate to be burned at the stake ;-) > I admit this proposal may be tongue-in-cheek in some ways, but there > has to be some mechanism in place for generating specifications that > are available and implementable rather than letting the Core accrete > increasingly baroque and obscure features. Yes, the old QA problem. I think part of it is that I can happily participate in what is effectively an international standards body without any qualifications whatsoever. If my company has enough clout, and if I shout long and hard enough, I'll have it my way, regardless of whether what I'm proposing makes sense or not (see the #pragma fiasco). After I've done the damage, I can turn around, walk away from it all, and wash my hands of the entire affair. Cleaning up the mess then becomes an SEP -- somebody else's problem :-( I'm not sure whether there really is a practical way out of this. The OMG is an organization that is to a large degree held together by volunteer labor, and it is a vendor organization. The practice of putting specs out by consensus and compromise means that they tend to be acceptable to a majority of vendors, and that the specs get out quickly. The alternative is to introduce all sorts of QA measures, implementation requirements, to provide a reference implementation, etc. If we do this, the quality of the specs will probably go up, but so will the time it takes to produce them. Right now, I think Andrew's proposal to keep specifications in limbo for a while seems to be the best bet to me. At least it will give us some chance to sanity-check the stuff before it becomes the law... Cheers, Michi. 
-- Michi Henning +61 7 33654310 DSTC Pty Ltd +61 7 33654311 (fax) University of Qld 4072 michi@dstc.edu.au AUSTRALIA http://www.dstc.edu.au/BDU/staff/michi-henning.html Return-Path: From: Jeffrey Mischkinsky Subject: Re: I cave in on ASCII IDL Identifiers To: ter@holta.ho.lucent.com (T E Rutt) Date: Tue, 7 Jul 1998 17:57:39 -0700 (PDT) Cc: orb_revision@omg.org 'T E Rutt' writes: > > I say UNcle > > I can now vote yes for the ASCII restriction on IDL identifiers. Now I feel comfortable stating our position :-) Seriously, I think the pragmatics here far outweigh any of the other arguments. So I'm happy to vote for the restriction. > > > I would like the final proposal to include some clarification > (perhaps > a note) stating that string literals used in IDL constants can be > from the Latin-1 unrestricted character set. I agree that would be a helpful addition. jeff > > Tom Rutt > > > > -- Jeff Mischkinsky jmischki@dcn.davis.ca.us +1 530-758-9850 jeffm@inprise.com +1 650-312-5158 jeffm@visigenic.com +1 650-312-5158 Return-Path: Sender: jon@floorboard.com Date: Tue, 07 Jul 1998 21:03:42 -0700 From: Jonathan Biggar To: Lewis Stiller CC: victor.giddings@ois.com, michi@dstc.edu.au, hlowe@omg.org, jis@fpk.hp.com, ter@holta.ho.lucent.com, orb_revision@omg.org, soley@omg.org Subject: Re: A modest proposal References: <199807071047.DAA10147@corba.franz.com> Lewis Stiller wrote: > When a Core specification has been ratified for some period > of > time to be determined, say 18 months, and when there is doubt that > it > is implementable, there should be a "survey of actual practice". The > survey of actual practice should poll a representative sample of > commercial ORBs, let us say the top 20 commercial ORBs measured by > licenses or something (or all ORBs, if we want). > > Some test should then be administered to determine whether > these ORBs that claim to be CORBA compliant in fact implement at all > the relevant section of the spec.
If more than 80% of the ORBs do > not > implement that feature in some sense, then that spec should be > removed from the Core by default. One can make an exception when the > ORB vendors > have concrete plans to implement it. > > That is, as a point of philosophy I do think CORBA Core > functionality should be universally, or at least widely, available. > > I am speaking as an ORB implementor. The presence of so many > portions > of the specification which are sort of tacitly agreed to be ignored > I > believe increases the learning curve of users, increases the cost of > ORB development, and impairs the credibility and acceptance of > CORBA. > > I admit this proposal may be tongue-in-cheek in some ways, but there > has to be some mechanism in place for generating specifications that > are available and implementable rather than letting the Core accrete > increasingly baroque and obscure features. I agree we have a problem with broken stuff getting in the spec, but I don't think that this fixes the problem. Rather, this would give ORB vendors more power to ignore the latest version of the CORE and simply refuse to implement new features. It would also punish an ORB vendor that was trying to keep up to date, because new features added to its ORB could be arbitrarily taken away. Rather we need to build up processes that avoid having patently broken stuff added to the ORB. -- Jon Biggar Floorboard Software jon@floorboard.com jon@biggar.org Return-Path: Sender: jis@fpk.hp.com Date: Wed, 08 Jul 1998 11:13:22 -0400 From: Jishnu Mukerji Organization: Hewlett-Packard New Jersey Laboratories To: T E Rutt Cc: orb_revision@omg.org Subject: Re: vote yes on IDL indentifier restrictions References: <199807081425.KAA02204@holta.ho.lucent.com> T E Rutt wrote: > I vote yes on the restriction of IDL identifiers to ASCII characters > > However, we need to clarify the state of IDL string literals in > constants. 
Most say latin 1 is ok for constants, however we need to > come to agreement. > > With regards to the ISO ODP IDL, a defect report can be raised in the > ISO process, using the preprocessor inconsistency as the justification, > after the OMG revision is voted in and accepted. > > If the ballot on the ISO defect report has any objections, that would > mean that the opponents would have to propose an alternative solution. > OMG RTF members could get involved in the resolution of the ISO ballot, > considering any new solutions proposed. > > Chances are great that european IDL users are already cautious and would > accept the pragmatic change. Tom, A modified resolution for issue 724 will be included in the next round of voting taking into account the comments that have come back from voters in this round. The two main comments are: (i) Retain the character tables. (ii) Clarify that Latin-1 characters can be used in string literals. Of these (ii) is already sort of there. The resolution text says: "OMG IDL uses the ASCII character set, except for string literals and character literals, which use the ISO Latin-1 (8859.1) character set." However, if someone comes up with a better way of clarifying this we can use that text instead. Thanks, Jishnu. Return-Path: X-Sender: beckwb@gamma Date: Fri, 10 Jul 1998 15:59:05 -0400 To: Jishnu Mukerji From: Bill Beckwith Subject: Re: think twice before removing latin 1 identifiers Cc: T E Rutt , orb_revision@omg.org, soley@omg.org, lowe@omg.org, jeff@ois.com, vic@ois.com References: At 12:39 PM 7/6/98 , Jishnu Mukerji wrote: >Bill Beckwith wrote: > >> On Mon, 6 Jul 1998, T E Rutt wrote: >> ... >> > Since vendor language bindings >> > (except for java) probably would reject latin-1 extensions >> > at IDL compile time it may be that no-one has actually done this. >> >> Note that the current IDL to Ada mapping handles Latin-1.
>> >> -- Bill > >Hmmm, so how do you do that in a real compiler that uses a >"preprocessor" as described in "The Annotated C++ Reference Manual", one >that recognizes tokens as made up of only ASCII characters, as one is >required to do by the last paragraph of Section 3.3 of the CORBA Core? >In particular, as the filenames that appear in the #include restricted >to ASCII or should they also be full latin-1? How do you get an ASCII >only preprocessor to meaningfully read pragmas that contain full latin-1 >respository IDs? Section C.3 of "The C++ Programming Language - 3rd Edition" does not specify the ASCII character set. In addition, the C++ DIS does not limit the characters of a preprocessor to ASCII. As a practical matter even if you don't have a Latin-1 capabable preprocessor you still could use Latin-1 characters in your IDL. The IDL compiler is then free to map those extra Latin-1 characters to languages that can support Latin-1. >The point I am trying to make by asking this question is that this issue >revolves around a rather large bit of internal inconsistency that exists >in the CORBA Core specification that we need to fix, the sooner the >better. I'm not so sure I view the original Core specification of Latin-1 as an inconsistency. The CORBA spec was specific to Latin-1. The CORBA spec depended on "The Annotated C++ Reference Manual" which was not specific to character set. The ANSI C spec and the C++ DIS avoid the specification of either ASCII or Latin-1. The problem is that a specific specification (CORBA) depended on an ambiguous specification (C++). One view is that the CORBA spec added definition to the C++ specification. 
-- Bill

X-Authentication-Warning: tigger.dstc.edu.au: michi owned process doing -bs
Date: Sat, 11 Jul 1998 10:46:44 +1000 (EST)
From: Michi Henning
To: Bill Beckwith
cc: Jishnu Mukerji, T E Rutt, orb_revision@omg.org, soley@omg.org, lowe@omg.org, jeff@ois.com, vic@ois.com
Subject: Re: think twice before removing latin 1 identifiers

On Fri, 10 Jul 1998, Bill Beckwith wrote:

> Section C.3 of "The C++ Programming Language - 3rd Edition" does not
> specify the ASCII character set. In addition, the C++ DIS does not
> limit the characters of a preprocessor to ASCII.
>
> As a practical matter even if you don't have a Latin-1 capable
> preprocessor you still could use Latin-1 characters in your IDL. The IDL
> compiler is then free to map those extra Latin-1 characters to languages
> that can support Latin-1.

Bill,

I'm sorry, but this is completely wrong.

From the C++ DIS:

Section 2.1, Phases of translation:

    Physical source file characters are mapped, in an
    implementation-defined manner, to the source character set. [...]
    Any source file character not in the basic source character set is
    replaced by the universal-character name that designates that
    character.

    [...]

    The source file is decomposed into preprocessing tokens [...].

    [...]

    Preprocessing directives are executed and macro invocations are
    expanded. If a character sequence that matches the syntax of
    universal-character-name is produced by token concatenation,
    the behavior is undefined.

FYI, the C++ source character set is ASCII letters, digits, and the
punctuation characters that are the ones normally used, such as _, {, },
etc.

So, after the source has been mapped to the C++ source character set, it
will only contain the above characters. Things like characters in the range
0x80 - 0xff and ASCII characters not in the basic source character set
appear as the universal character name, such as \u0080.
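The first translation phase described above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not text from the DIS and not how any particular compiler or ORB implements it: the input is assumed to be ISO Latin-1 with one byte per character, and the helper names in_basic_source_set and to_universal_names are made up for this example.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: true if the byte is one of the characters this
// sketch treats as the basic source character set (ASCII letters, digits,
// the usual punctuation, and whitespace).
bool in_basic_source_set(unsigned char c) {
    static const std::string basic =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        "_{}[]#()<>%:;.?*+-/^&|~!=,\\\"'"
        " \t\v\f\n";
    return basic.find(static_cast<char>(c)) != std::string::npos;
}

// Hypothetical phase-1 style mapping. Assumes the input is ISO Latin-1,
// so each byte value is also the code point used in the \uXXXX name.
std::string to_universal_names(const std::string& latin1) {
    std::string out;
    for (unsigned char c : latin1) {
        if (in_basic_source_set(c)) {
            out += static_cast<char>(c);
        } else {
            char buf[8];  // room for backslash, 'u', four hex digits, NUL
            std::snprintf(buf, sizeof buf, "\\u%04X",
                          static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}
```

Given the Latin-1 bytes g, r, 0xFC, n (the German word for green), to_universal_names returns gr\u00FCn, while an all-ASCII identifier such as blau comes back unchanged.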
Next, we do tokenizing in preparation for macro substitution (section 2.4):

    Each preprocessing token that is converted to a token shall have
    the lexical form of a keyword, an identifier, a literal, an operator,
    or a punctuator. [...]

So, what's an identifier?

It's in section 2.10:

    identifier:
        nondigit
        identifier nondigit
        identifier digit

    nondigit: one of
        [ ASCII letter or underscore ]

    digit:
        [ 0 - 9 ]

So, now we can analyze what happens when we present the preprocessor
with an identifier such as grUEn (I've used UE here to denote the single
German umlaut that looks like a u with two dots above it).

The first pass maps the identifier to the source character set. This
results in our identifier appearing as:

    gr\u00FCn

The tokenizing pass then breaks that into the *three* tokens with value

    IDENTIFIER  "gr"
    ???         \u00FC
    IDENTIFIER  "n"

Problem number one is that \u00FC does not match any of the legal token
syntaxes, so we get a syntax error as soon as \u00FC is seen. The only way
I can legally have \u00FC appear in a program is as part of a string or
character literal, or in a comment.

Problem number two is that macro substitution gets all screwed up:

    #define TIMES_TWO(arg) (2 * (arg))

If I call this with our grUEn identifier

    some_func(TIMES_TWO(grUEn));

I get errors again, because *three* tokens are passed to the macro, so the
number of actual arguments does not match the formal number of arguments
to the macro.

Problem number three is that even if all of the above would work, it still
results in illegal C++ under the current mapping rules:

    // IDL
    enum Farbe { rot, grUEn, blau };

    // C++
    enum Farbe { rot , gr \u00FC n , blau } ;

[ I've used blanks to show token boundaries here. ]

Of course, this is syntactically malformed.

> I'm not so sure I view the original Core specification of Latin-1 as an
> inconsistency. The CORBA spec was specific to Latin-1. The CORBA spec
> depended on "The Annotated C++ Reference Manual" which was not specific
> to character set.
This is incorrect. My copy of the ARM shows that the rules are very much
the same, just not stated as precisely as in the DIS. In particular, the
syntax for identifiers is the same.

> The ANSI C spec and the C++ DIS avoid the specification of either ASCII or
> Latin-1.

Not correct, see above.

> The problem is that a specific specification (CORBA) depended on
> an ambiguous specification (C++). One view is that the CORBA spec added
> definition to the C++ specification.

I hope the argument above shows that this simply is not correct. To argue
that CORBA added to the definition of C++ is presumptuous beyond belief.
For one, the OMG is in no position to do this. Second, I don't think the
ANSI C++ committee requires any help. Third, if the CORBA specification
were written with even a fraction of the rigor of the DIS, we wouldn't
be in this mess in the first place.

Any idea that it is possible to use the preprocessor with identifiers
outside the C++ source character set is simply wishful thinking.

Michi.
--
Michi Henning               +61 7 33654310
DSTC Pty Ltd                +61 7 33654311 (fax)
University of Qld 4072      michi@dstc.edu.au
AUSTRALIA                   http://www.dstc.edu.au/BDU/staff/michi-henning.html

Date: Sat, 11 Jul 1998 23:42:15 -0400 (EDT)
From: Bill Beckwith
X-Sender: beckwb@gamma
To: Michi Henning
cc: orb_revision@omg.org, jeff@ois.com, vic@ois.com
Subject: Re: think twice before removing latin 1 identifiers

On Sat, 11 Jul 1998, Michi Henning wrote:

> On Fri, 10 Jul 1998, Bill Beckwith wrote:
...
> So, what's an identifier?
>
> It's in section 2.10:
>
>     identifier:
>         nondigit
>         identifier nondigit
>         identifier digit
>
>     nondigit: one of
>         [ ASCII letter or underscore ]

Congratulations. I couldn't find any mention of ASCII in the text.
I'll have to figure out why my greps didn't show up this reference.
You're right, this does cause the dependency on ASCII, however hidden
the reference is.

Gee, why did they keep the trigraphs if the character set is locked
to ASCII?

...
> > The problem is that a specific specification (CORBA) depended on
> > an ambiguous specification (C++). One view is that the CORBA spec added
> > definition to the C++ specification.
>
> I hope the argument above shows that this simply is not correct.

Yes, thank you for the correction.

> To argue
> that CORBA added to the definition of C++ is presumptuous beyond belief.

Hmmm. I think you need to re-read my sentence. "One view ..."
Also, when specifications are not explicit then presumptions are the
only recourse.

> For one, the OMG is in no position to do this. Second, I don't think the
> ANSI C++ committee requires any help.

CORBA certainly _could_ extend or modify any referenced standard
when incorporating it. For example, one could easily see referencing
select portions of POSIX 1003.1 and altering some aspect as
necessary for the CORBA standard. This doesn't modify the original
standard, it just presents a modified version for use with CORBA.
I'm not recommending this, just pointing out the possibility.

> Third, if the CORBA specification
> were written with even a fraction of the rigor of the DIS, we wouldn't
> be in this mess in the first place.

I couldn't agree more. It seems that ambiguity is a goal in many of
these specifications.

> Any idea that it is possible to use the preprocessor with identifiers
> outside the C++ source character set is simply wishful thinking.

Or about four days development work with a good engineer and the GNU
cpp source. As has been mentioned before, our product does support
Latin-1 characters in the IDL source.
-- Bill

X-Authentication-Warning: tigger.dstc.edu.au: michi owned process doing -bs
Date: Sun, 12 Jul 1998 14:05:17 +1000 (EST)
From: Michi Henning
To: Bill Beckwith
cc: orb_revision@omg.org, jeff@ois.com, vic@ois.com
Subject: Re: think twice before removing latin 1 identifiers

On Sat, 11 Jul 1998, Bill Beckwith wrote:

> >     identifier:
> >         nondigit
> >         identifier nondigit
> >         identifier digit
> >
> >     nondigit: one of
> >         [ ASCII letter or underscore ]
>
> Congratulations. I couldn't find any mention of ASCII in the text.
> I'll have to figure out why my greps didn't show up this reference.
> You're right this does cause the dependency on ASCII however hidden
> the reference is.

Actually, it is a bit more subtle than that. The source character set
could be encoded in something other than ASCII, such as EBCDIC or even
ISO Latin-1. However, you are not allowed to have any characters in the
source file other than the ones in the source character set, however
encoded.

> Gee, why did they keep the trigraphs if the character set is locked
> to ASCII?

The trigraphs are for local keyboards that don't have some of the ASCII
characters on them. For example, on a German keyboard, you won't normally
find all of the weird ASCII characters, such as } or |. Instead, the
corresponding keys are used for things like umlauts. I think it is worse
in other languages with more non-ASCII characters, such as Swedish and
Danish. The trigraphs allow me to still enter things like { without
having to have an actual key for that character on the keyboard.

Another example is British keyboards. A lot of them do not have a # key.
Instead, the same key is used for the British Pound currency symbol.
Depending on the terminal, the Pound symbol may actually occupy the same
code position as # and just be rendered as a Pound symbol on the screen.
In that case, I can simply type a Pound sign in my code where I would
normally use a #.
However, the terminal may be using ISO Latin-1, in which case the Pound
symbol maps to the ISO Latin-1 code for Pound, and the # symbol is simply
inaccessible from the keyboard.

> > To argue
> > that CORBA added to the definition of C++ is presumptuous beyond belief.
>
> Hmmm. I think you need to re-read my sentence. "One view ..."
> Also, when specifications are not explicit then presumptions are the
> only recourse.
>
> > For one, the OMG is in no position to do this. Second, I don't think the
> > ANSI C++ committee requires any help.
>
> CORBA certainly _could_ extend or modify any referenced standard
> when incorporating it. For example one could easily see referencing
> select portions of POSIX 1003.1 and altering some aspect as
> necessary for the CORBA standard. This doesn't modify the original
> standard, it just presents a modified version for use with CORBA.
> I'm not recommending this, just pointing out the possibility.

My apologies for being so sharp -- bad day yesterday (which isn't an
excuse I know :-(

I misread this as meaning that CORBA had modified the C++ definition in
some normative sense, which I still maintain would be presumptuous ;-)

We could of course reference the DIS and then say that the behavior
differs from the DIS in some way. I'm not sure I like this though -- we
would end up with something that is almost, but not quite, like the
original referenced thing. Or, to quote Douglas Adams, we could end up
with something that is almost, but not quite, entirely unlike something
completely different. Could be highly confusing... ;-)

> > Any idea that it is possible to use the preprocessor with identifiers
> > outside the C++ source character set is simply wishful thinking.
>
> Or about four days development work with a good engineer and the GNU
> cpp source. As has been mentioned before our product does support
> Latin-1 characters in the IDL source.

Yes, you can do all these things, no doubt.
But do all vendors want to use the GNU cpp source, given the copyleft
restrictions? Some ORBs don't have their own preprocessor at all, and
instead rely on the one that is provided by the target C++ compiler.
Do we really want to force these ORBs to ship their own preprocessor
instead? I think not.

Besides, even if we were to change the preprocessor to deal with ISO
Latin-1, we would still have the problem of how to map ISO Latin-1 to
languages that can't handle these characters. None of the mappings that
result are pretty, unfortunately.

Cheers,

Michi.
--
Michi Henning               +61 7 33654310
DSTC Pty Ltd                +61 7 33654311 (fax)
University of Qld 4072      michi@dstc.edu.au
AUSTRALIA                   http://www.dstc.edu.au/BDU/staff/michi-henning.html

Date: Sun, 12 Jul 1998 00:17:23 -0400 (EDT)
From: Bill Beckwith
X-Sender: beckwb@gamma
To: Michi Henning
cc: Bill Beckwith, orb_revision@omg.org, jeff@ois.com, vic@ois.com
Subject: Re: think twice before removing latin 1 identifiers

On Sun, 12 Jul 1998, Michi Henning wrote:

> Besides, even if we were to change the preprocessor to deal with ISO Latin-1,
> we would still have the problem of how to map ISO Latin-1 to languages
> that can't handle these characters. None of the mappings that result are
> pretty, unfortunately.

There are lots of things about IDL that are hard to map to a
particular language. None the less, this is a justification that
I can live with. Hopefully you and others will be sympathetic
on other issues that cause major headaches for other languages. :-)

-- Bill