Issue 2847: ambiguity in wstrings having a Unicode escape sequence (orb_revision)

Source: (, )
Nature: Uncategorized Issue
Severity:
Summary: According to the spec, a wide string may contain one or more Unicode escape sequences of the form \uxxxx, where the x's are hex digits. It also says that such an escape sequence may not have the value 0, and that leading 0s are optional in the hex digits. Taking all this into account, how does a parser tell the difference between (imaginary parentheses inserted to show the writer's intent)
Resolution:
Revised Text:
Actions taken:
August 13, 1999: received issue
October 30, 2000: closed issue

Discussion:

End of Annotations:=====
X-Sender: parsons@mail.cs.wustl.edu
X-Mailer: QUALCOMM Windows Eudora Light Version 3.0.6 (32)
Date: Fri, 13 Aug 1999 17:29:07 -0500
To: orb_revision@omg.org
From: Jeff Parsons
Subject: ambiguity in wstrings having a Unicode escape sequence
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
X-UIDL: 9dfb5d2729770bb4a5762ab3de130ce6

Hi,

According to the spec, a wide string may contain one or more Unicode escape sequences of the form \uxxxx, where the x's are hex digits. It also says that such an escape sequence may not have the value 0, and that leading 0s are optional in the hex digits. Taking all this into account, how does a parser tell the difference between (imaginary parentheses inserted to show the writer's intent)

L"hello(\u000a)bcd"

which is legal, and

L"hello(\u000)abcd"

which is not?

For that matter, how can one tell when an escape sequence ends? In

L"hello\uabcdplayer"

how are we to know how many of the characters following \u are included in the escape sequence?

A logical guess might be that the parser will accept the longest Unicode escape sequence it can. But then a wstring such as

L"hello(\ubcd)everybody"

must be rewritten as

L"hello(\u0bcd)everybody"

and now the leading 0 is no longer optional.

Have I missed something in the spec or is this really a hole?
thanks,
Jeff Parsons

Sender: jis@fpk.hp.com
Message-ID: <37BC2BCB.1A6E4FD3@fpk.hp.com>
Date: Thu, 19 Aug 1999 12:07:39 -0400
From: Jishnu Mukerji
Organization: Hewlett-Packard EIAL
X-Mailer: Mozilla 4.08 [en] (X11; U; HP-UX B.11.00 9000/889)
MIME-Version: 1.0
To: Juergen Boldt
CC: issues@emerald.omg.org, orb_revision@emerald.omg.org, cxx_revision@emerald.omg.org
Subject: Re: isue 2847 -- Core RTF Issue?
References: <4.1.19990819113438.009dfb60@emerald.omg.org>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii
X-UIDL: 04a9e7715a716ca647f17fed0d6baf22

Juergen Boldt wrote:
>
> This is issue # 2847
>
> ambiguity in wstrings having a Unicode escape sequence
>
> According to the spec, a wide string may contain one or
> more Unicode escape sequences of the form \uxxxx, where
> the x's are hex digits. It also says that such an
> escape sequence may not have the value 0, and that
> leading 0s are optional in the hex digits.
> Taking all this into account, how does a parser
> tell the difference between (imaginary parentheses
> inserted to show the writer's intent)

Looks like it has something to do with Core Chapter 3. So it is a core issue.

Jishnu.
--
Jishnu Mukerji, Systems Architect
Hewlett-Packard EIAL, 300 Campus Drive, 2E-62, Florham Park, NJ 07932, USA.
Email: jis@fpk.hp.com
Tel: +1 973 443 7528
Fax: +1 973 443 7422
Sender: jis@fpk.hp.com
Message-ID: <3828B7B3.CD8F19B3@fpk.hp.com>
Date: Tue, 09 Nov 1999 19:09:23 -0500
From: Jishnu Mukerji
Organization: Hewlett-Packard EIAL
X-Mailer: Mozilla 4.08 [en] (X11; U; HP-UX B.11.00 9000/889)
MIME-Version: 1.0
To: orb_revision@omg.org
Subject: Issue 2847 proposed resolution
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii
X-UIDL: ~n3!!05'e9nc4!!Q--e9

Issue 2847: ambiguity in wstrings having a Unicode escape sequence (orb_revision)
http://www.omg.org/issues/issue2847.txt

Resolution: Once the parser starts scanning the hex number in the string, it keeps scanning until it hits the first invalid character, at which point it converts the string of valid hex digits scanned so far into an integer value, and proceeds to scan forward assuming that it is scanning a new character in the string. So for example:

In order to deal with L"hello\uabcdplayer" one can write it as

L"hello\ua" L"bcdplayer" or
L"hello\uab" L"cdplayer" or
L"hello\uabc" L"dplayer" or
L"hello\uabcd" L"player" // same as L"hello\uabcdplayer"

depending on what one wants.

Actions taken: Close no change
___________________________________________________________

Comments?

Jishnu.
--
Jishnu Mukerji, Systems Architect
Hewlett-Packard EIAL, 300 Campus Drive, 2E-62, Florham Park, NJ 07932, USA.
Email: jis@fpk.hp.com
Tel: +1 973 443 7528
Fax: +1 973 443 7422
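
The scanning rule in the proposed resolution above can be illustrated with a small sketch. It is not taken from any ORB or IDL compiler. The names scan_wstring_body and concat_wstrings are hypothetical, and the cap of four hex digits is an assumption based on the \uxxxx form quoted in the issue (the resolution text itself only says "until it hits the first invalid character"). Only \u escapes are handled; other escape sequences are passed through untouched.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def scan_wstring_body(body, max_digits=4):
    # Decode \u escapes in the body of ONE wide string literal
    # (the text between the quotes of L"...").
    # max_digits=4 is an assumption taken from the \uxxxx form.
    out = []
    i = 0
    while i < len(body):
        if body.startswith("\\u", i):
            start = i + 2
            j = start
            # Greedy scan: consume hex digits until the first character
            # that is not a hex digit (or until the assumed cap).
            while (j < len(body)
                   and j - start < max_digits
                   and body[j] in HEX_DIGITS):
                j += 1
            digits = body[start:j]
            if not digits:
                raise ValueError("no hex digit after the u escape")
            value = int(digits, 16)
            if value == 0:
                # The issue text says an escape may not have the value 0.
                raise ValueError("escape sequence with value 0")
            out.append(chr(value))
            i = j  # resume scanning as if at a new character
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


def concat_wstrings(*bodies):
    # Adjacent literals are scanned separately, then joined, so an
    # escape can never run across the boundary between two literals.
    return "".join(scan_wstring_body(b) for b in bodies)
```

Under these assumptions, scanning the single literal body hello\ubcdeverybody consumes "bcde" as the escape, while splitting it as hello\ubcd followed by everybody yields the character 0x0bcd and then the full word, which is the workaround the resolution describes.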