Issue 3019: using XML for traits and metadata (pids-rtf2)

Source: Level Seven Visualizations (Mr. Jon Farmer, jon(at)level7vis.com)
Nature: Uncategorized Issue
Severity:
Summary: Since this representation of trait values in XML is fully compliant at the IDL level, it's not a functionality change. The expression of trait metadata (so as to document the STRUCTURE of it) is a functionality change, but it is defensible as a fix, even just to make it possible for PIDS clients to validate dates -- not just the textual format but the value domains of the basic and abstract datatypes.

I think we can all agree at this point that we need an operation on IdentificationComponent called get_trait_metatdata, and I feel strongly that it should use DTD notation. If we agree at this level (please, everyone -- do you agree so far?), then I think the next question is how to express basic and abstract types in the DTD.
Resolution:
Revised Text:
Actions taken:
October 21, 1999: received issue

Discussion:

End of Annotations:=====

From: "Jon Farmer"
To: "PIDS RTF"
Subject: using XML for traits and metadata
Date: Thu, 21 Oct 1999 14:18:56 -0400

Since this representation of trait values in XML is fully compliant at the IDL level, it's not a functionality change. The expression of trait metadata (so as to document the STRUCTURE of it) is a functionality change, but it is defensible as a fix, even just to make it possible for PIDS clients to validate dates -- not just the textual format but the value domains of the basic and abstract datatypes.
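
The distinction drawn above between validating a date's textual format and validating its value domain can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the metadata dictionary stands in for whatever a trait-metadata operation might return, and its field names, the date format, and the lower bound are hypothetical rather than taken from the PIDS specification or from any DTD in this thread.

```python
from datetime import date, datetime

# Hypothetical metadata for one date-valued trait. The keys "format" and
# "min" are illustrative only; they are not defined by PIDS or HL7.
BIRTH_DATE_METADATA = {"format": "%Y%m%d", "min": date(1850, 1, 1)}


def validate_date_trait(text, metadata, today):
    """Return (ok, reason) for a date trait value supplied as text."""
    # Step 1: textual format. Does the string parse at all?
    try:
        value = datetime.strptime(text, metadata["format"]).date()
    except ValueError:
        return False, "bad textual format"
    # Step 2: value domain. Is the parsed date within the allowed range?
    if value < metadata["min"] or value > today:
        return False, "outside value domain"
    return True, "ok"


reference_day = date(2000, 1, 1)
good = validate_date_trait("19991021", BIRTH_DATE_METADATA, reference_day)
bad_format = validate_date_trait("1999-10-21", BIRTH_DATE_METADATA, reference_day)
bad_domain = validate_date_trait("20991021", BIRTH_DATE_METADATA, reference_day)
```

A client can only perform the second check if the server publishes the metadata in some agreed notation, which is the point of the proposed operation.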

I think we can all agree at this point that we need an operation on IdentificationComponent called get_trait_metatdata, and I feel strongly that it should use DTD notation. If we agree at this level (please, everyone -- do you agree so far?), then I think the next question is how to express basic and abstract types in the DTD. David's DTD shows string vs. binary (if that's what the S and B stand for), but how do we denote a "date"? I recommend we do it precisely the way HL7 V3 does it, however that is. I am sure it is in the V3 MDF. Can somebody point us to exactly where (URL) to look for the rules by which DTD constructs are derived from RIM abstract datatypes? Actually, I saw that this very topic was under sometimes-heated debate at the HL7 control-query TC in Atlanta, but I don't know if it has settled out. I need somebody to take this task and report back to us.

-----Original Message-----
From: David Forslund
To: drbob@bigfoot.com; Jon Farmer
Cc: Eyestone, Scott M; pids-rtf2; Ham, Gary A; Janet.Martino@ha.osd.mil; JMcCain@HQT.IHS.GOV
Date: Tuesday, August 10, 1999 11:10 AM
Subject: RE: PIDS RTF2 issues discussion conference call

>To stimulate discussion, I've attached the DTD that we have made our PIDS handle. Basically, it can handle any data structures of Traits that conform to this DTD. In addition, I've attached the IDL that allows for the type information to be specified and used. This is the ONLY extension of the PIDS spec we have used, and it isn't really necessary. Our server transparently supports the existing PIDS Traits, which are a subset of this DTD.
>
>Dave
>
>At 11:32 PM 8/9/99 -0600, David Forslund wrote:
>
>>At 11:54 PM 8/9/99 -0500, Bobby R. Treat wrote:
>>>I don't agree there are multiple traits being compared to each other, in the sense you seem to mean. Strings are compared to Strings. Subclassed off Strings might be SSANs which, again, are compared to each other -- not to names or birthdays or anything else.
>>>Other String subclasses might include FirstName, LastName, and MiddleInitial -- each of which would have VERY different Information metrics.
>>
>>I agree with Bobby. Each Trait is compared to a Trait with the same name. If the name corresponds to a different type (e.g., HL7/PatientName vs vCard/PHOTO), the comparison algorithm could be different. Bobby is talking, however, about a slight extension to the current PIDS spec, not what the current spec directly supports. However, since a Trait is an Any, it isn't too big of a leap for it to support this. Tim's CinderellaTrait is an example of a complex, user-defined Trait, whose comparison would have to be done between other CinderellaTraits (glass slippers).
>>
>>>Look at the Smalltalk Number subclasses, which support "<", ">", "<=", "=", ">=", "~=" messages polymorphically. Each subclass may support each comparison differently. In the context of military patient records, there are many kinds of comparison needed, and they can all be categorized in terms of Trait classes.
>>>
>>>Other classes would support aggregation of lower-level Traits -- the Composite trait class, for instance, could support an arbitrary group of Traits, each with a name. The Information measure for the Composite Trait would compare two groups by comparing sub-traits with the same name, adding up the resulting Information measures derived from the subtrait classes (assuming independent traits). Name could be a Composite subclass that includes "LastName", "FirstName", and "MiddleInitial" as components. Maybe it really should be NameSex, with Sex as a fourth component, to recognize that a difference in LastName provides powerful negative Information for males but not for females.
>>
>>Our PIDS server supports this now.
>>Since an array of Traits is directly supported by the IDL, we have made the simple extension (I claim not really an extension) that allows for a user-defined, arbitrary tree. The leaves of the tree can be strings, binary, etc. One can then compare similar-depth trees with the same types at the leaves. This we have implemented, and it allows for fairly complex traits (including public-key certificates, physician-patient relationships, multiple login/password combinations, etc.), all of which can be user-defined by the client. We have a DTD which defines the legal relationship of the trait components.
>>
>>>If you have a partial demographic profile, and you have 5 million records that may match it, it isn't the fault of any specific algorithm or interface (or analyst) that you have possibly 5 million compares to do. It will take creativity to minimize the work. Just as is done in all optimized databases, indexes can be prebuilt and used to make searches efficient, and the Trait classes can be responsible for knowing about and using such indices. What you can't do is use SQL-style LIKE to solve all problems -- it isn't flexible or powerful enough.
>>
>>Certainly the latter is true, and SQL-style LIKE statements are not used by commercial PIDS implementations, to my knowledge. Optimized, fast, and accurate searches are where the competition in PIDS will shine (as well as in accurate correlations, as Jon has mentioned).
>>
>>>Testing and tuning thresholds until they work with a real population means your algorithm is tied to that population. There's no proof it will work for another population until you've tried it. Also, just because you can input a threshold that gives you what you want doesn't make that threshold meaningful. Does it work in all situations, no matter which or how many traits are compared? If it's meaningful, what does it mean?
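
The user-defined trait tree Dave describes earlier in this message, with typed leaves and comparison restricted to trees of the same shape, could be sketched roughly as follows. This is a hypothetical illustration in Python: the class `TraitNode` borrows only its name from the attached TraitNode.idl, whose contents are not shown in this thread, and the fields and functions are assumptions rather than the actual server design.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union


@dataclass
class TraitNode:
    """Hypothetical node of a user-defined trait tree."""
    name: str
    value: Optional[Union[str, bytes]] = None   # set on leaves only
    children: List["TraitNode"] = field(default_factory=list)


def same_shape(a: TraitNode, b: TraitNode) -> bool:
    """True when both trees have the same names, depth, and leaf types."""
    if a.name != b.name or len(a.children) != len(b.children):
        return False
    if not a.children:
        return type(a.value) is type(b.value)
    # Children are paired by position; a real service might pair by name.
    return all(same_shape(x, y) for x, y in zip(a.children, b.children))


def count_matching_leaves(a: TraitNode, b: TraitNode) -> Tuple[int, int]:
    """Return (matching leaves, total leaves) for two same-shape trees."""
    if not a.children:
        return (1 if a.value == b.value else 0, 1)
    matched = total = 0
    for x, y in zip(a.children, b.children):
        m, t = count_matching_leaves(x, y)
        matched += m
        total += t
    return matched, total


left = TraitNode("Name", children=[TraitNode("LastName", "Smith"),
                                   TraitNode("FirstName", "Jon")])
right = TraitNode("Name", children=[TraitNode("LastName", "Smith"),
                                    TraitNode("FirstName", "John")])
binary = TraitNode("Name", children=[TraitNode("LastName", b"Smith"),
                                     TraitNode("FirstName", "Jon")])
```

Here `left` and `right` are comparable and agree on one of two leaves, while `binary` is rejected by `same_shape` because one leaf is binary rather than a string. Exact equality at the leaves is the simplest possible comparison and is used only to keep the sketch short.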
>>>
>>>If you have a method that's 99.9% accurate, I have three reactions to that:
>>>
>>>1) I'm really more concerned about a concrete interface issue. I don't claim nobody has done the neural net work, or whatever, and come up with good matching algorithms. The plain fact is, the PIDS spec doesn't allow users to get all the kinds of comparison they want. Use your implementation if it works, but put a useful interface in front, and support all the comparisons needed. In part, I'm claiming that didn't happen the first time through, because people didn't see comparison as a Trait behavior, as they should have done.
>>
>>I think I disagree. The PIDS spec allows for all kinds of comparisons; they just can't be specified by the client. This was deliberate in the first PIDS spec, to allow for as much flexibility as possible. If we tightened down the spec too early, we might not have the ability to match in as many different ways. Like Traits are to be compared, and that can be handled with comparison as a Trait behavior in an implementation.
>>
>>>2) If you take the sum-total of all CHCS sites and look for duplicates, with an error rate of .1%, what does that mean? Let's say there are 20 million CHCS records altogether. (I'm sure that's conservative.) n*(n-1)/2 possible matches times .1% = 199,999,990,000 expected matching errors -- more than the total number of records!! Good luck resolving them all. Even in day-to-day operation, how many people walk into CHCS sites in a day, worldwide? Multiply by .1% and make that many errors every day, and in a year you'll have another fine mess indeed. So, you need a bigger number than 99.9% to impress me. Add six more nines, and you're getting close for pre-install. For operation, you need at least a couple more nines.
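
The arithmetic in point 2 can be reproduced with a short Python sketch. It takes the message's own assumptions at face value: 20 million records, every pair of records compared once, and the quoted error rate applied to each pairwise comparison. Whether that is the right way to read the 99.9% figure is disputed later in the thread, so the sketch only checks the numbers. The function names are made up for illustration.

```python
from fractions import Fraction


def candidate_pairs(n: int) -> int:
    """Number of distinct record pairs among n records: n*(n-1)/2."""
    return n * (n - 1) // 2


def expected_errors(n: int, error_rate: Fraction) -> Fraction:
    """Expected wrong decisions if every pair is compared once."""
    return candidate_pairs(n) * error_rate


records = 20_000_000
at_three_nines = expected_errors(records, Fraction(1, 1000))   # 99.9% accurate
at_nine_nines = expected_errors(records, Fraction(1, 10**9))   # six more nines

print(int(at_three_nines))    # 199999990000, the figure in the message
print(float(at_nine_nines))   # 199999.99
```

Under these assumptions, even nine nines of per-comparison accuracy still leaves roughly 200,000 expected errors across a one-time pass over 20 million records, which is why the message describes that level as only "getting close".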
>>>Yes, you can avoid the pre-install matching problem by finding duplicates only as people come in, but it means your database will always be chock-FULL of what your own matching algorithm would say are duplicates, if asked. It's a tough problem, it really is. If you've solved it, point (1) still applies.
>>
>>I don't believe you have used Jon's statistics correctly, but I'm sure he will respond.
>>
>>>3) You're undoubtedly working with much cleaner records than CHCS has to offer.
>>>
>>>"We've been toying with adding a function to dock confidence points for misses, but not sure how important that is, given the accuracy we have now."
>>>
>>>Think about it some more. I'll gladly help as devil's advocate, in my spare time.
>>
>>Dave
>>
>>>Bobby
>>>
>>>-----Original Message-----
>>>From: Jon Farmer [mailto:jfarmer@ic.net]
>>>Sent: Monday, August 09, 1999 8:24 PM
>>>To: drbob@bigfoot.com; David W. Forslund
>>>Cc: Eyestone, Scott M; pids-rtf2; Ham, Gary A; Janet.Martino@ha.osd.mil
>>>Subject: Re: PIDS RTF2 issues discussion conference call
>>>
>>> >I think comparison is a Trait behavior. Treating it that way means that selecting the Trait class determines the matching algorithm.
>>> >
>>>Since there are multiple traits being compared to each other, I don't see comparison as "normalizing" to the trait class.
>>>
>>>Nor do I see it as feasible, even with OBV, to incur 5 million method invocations for a search. Now, if an IMPLEMENTATION wants to do it that way, with some value add -- more power to them (literally), but at the conceptual responsibility level, the PIDS is responsible for matching profiles of traits submitted as a unit -- not individual profiles. It's not a trait identification service; it's a person identification service.
>>>
>>> >"Minus log probability of this degree of closeness by chance" has the virtue of being additive across independent traits.
>>> >Without additivity -- or some such combinatorial property -- it's damned difficult to make sense of a measure. Any metric, in any context in mathematics, requires some such property -- like the standard "triangle inequality" that says d(A,C) <= d(A,B) + d(B,C). (I'm pretty sure the Information measure satisfies that one too, by the way.)
>>>
>>>Most definitely! There is no way to programmatically/procedurally cover all the combinatorial cases of "what's missing" unless you've got that additive effect.
>>>
>>> >
>>> >When searching for matches with partial demographic information, how do you make sense of it when on some records the middle initial or sex or rank or age is missing, but on others they're all there? The Information measure adds nothing for the missing traits, and remains mathematically accurate in doing so. It also provides an explicit test statistic, which has the advantage that lots of people [think they] understand that.
>>> >
>>>
>>>Agree
>>> >Yes, the Information measure is dependent on the reference population, but so is anything else that's useful.
>>>Disagree. We have run for three years in a 500k population with a set of discrete matching trait groups (conclude match if any one of these trait groups hits on each trait exactly) and statistical matching (weights and thresholds). The customer claims well in excess of 99.9% correctness in terms of failed consolidations or false consolidations.
>>>
>>>Now -- to get to that they had to experiment with the matching policy for a few months and then QA samples of matching decisions. The weights and thresholds were NOT computed with a prespecified reference population -- unless the fact that they tested it against a real population means it was.
>>>
>>>I think that qualifies as meaningful!
>>>
>>> >
>>> >The concept of negative Information (information AGAINST a match) is also very important.
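
The additivity argument above, including the claim that missing traits simply add nothing, can be made concrete with a small Python sketch. The chance-agreement probabilities below are invented for illustration (powers of two, so the results come out as whole bits) and do not describe any real population; the function names are likewise hypothetical. Negative Information for disagreeing traits is not modeled here.

```python
import math

# Made-up probabilities that two unrelated records agree on a trait by chance.
CHANCE_OF_AGREEMENT = {"LastName": 1 / 1024, "FirstName": 1 / 64, "Sex": 1 / 2}


def information_bits(trait: str) -> float:
    """Information = minus log (base 2) of the chance of agreement."""
    return -math.log2(CHANCE_OF_AGREEMENT[trait])


def profile_information(agreeing_traits) -> float:
    """Sum the Information over traits that are present and agree.

    A trait missing from either record is simply left out of the list,
    so it contributes nothing to the total.
    """
    return sum(information_bits(t) for t in agreeing_traits)


full = profile_information(["LastName", "FirstName", "Sex"])   # 10 + 6 + 1 bits
partial = profile_information(["LastName", "FirstName"])       # Sex missing

# Additivity: for independent traits, the summed Information equals the
# Information of the joint chance probability (the product of the parts).
joint_chance = math.prod(CHANCE_OF_AGREEMENT.values())
```

With these numbers `full` is 17 bits and `partial` is 16 bits, and `full` equals `-math.log2(joint_chance)`, which is the additive property the message relies on.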
>>> >A Trait like sex that's almost useless for matching can be very useful for preventing poor matches. Hence I like the idea of two polymorphic Trait behaviors for positive and negative information.
>>>
>>>We've been toying with adding a function to dock confidence points for misses, but not sure how important that is, given the accuracy we have now.
>>> >
>>> >Bobby
>>> >
>>> >-----Original Message-----
>>> >From: David Forslund [mailto:dwf@acl.lanl.gov]
>>> >Sent: Monday, August 09, 1999 4:08 PM
>>> >To: Jon Farmer
>>> >Cc: drbob@bigfoot.com; Eyestone, Scott M; pids-rtf2; Ham, Gary A; Janet.Martino@ha.osd.mil
>>> >Subject: Re: PIDS RTF2 issues discussion conference call
>>> >
>>> >
>>> >The item that is in the RTF for discussion is a way to request or find out what matching "algorithm" is being used on the server. Currently, there is no way to find this out through the standard IDL. What we mean by a matching algorithm is still an open issue.
>>> >
>>> >Dave
>>> >
>>> >Jon Farmer writes:
>>> > > >
>>> > > >>Most of all, I've been arguing for a matching measure that's meaningful, like "probability of this degree of closeness by chance" (information = minus log of that). A meaningful metric is the first step in pulling together a meaningful interface. Otherwise, you have no consistent way of asking the "give me close matches" question.
>>> > >
>>> > >
>>> > > The current spec provides for a confidence scoring between 0 and 1, with zero meaning no chance, 1 being certainty; therefore all real values should be between them. The idea of defining it as a probability was brought up and discussed repeatedly and at length in the original spec effort.
>>> > > Essentially, we cannot call it a probability since many underlying matching algorithms are not calculating probabilities in a formal sense (actually likelihoods), based on the distinctiveness of each incoming value among the universe of already-known values. Note that the selectivity of a given value cannot even be approximated unless the whole local population is loaded!
>>> > >
>>> > > It was decided that in order NOT to disqualify all implemented algorithms except probabilistic ones, we would define it as a confidence. Then it becomes the responsibility of the implementor to provide a meaningful assignment of confidences to the candidates. In our product we apply several algorithms, including a weights-and-thresholds algorithm that gives us a confidence metric that is more or less linear with likelihood if you tune it right. However, somebody else's regular-expression matcher, or a primitive direct-total-exact-hit-or-nothing matcher, has a harder challenge to produce a 0-1 confidence scoring. But this is much better than saying "PIDS only works for probabilistic algorithms".
>>> > >
>>> > > Another issue arises, however, in the context of a delegated find_candidates (f/c) operation. If my delegating PIDS gets an f/c request and delegates it to three back-end PIDS servers, how valid is it to return the results interleaved together, assuming that the client is going to trust the rankings based on scores produced by different implementations?
>>> > > I was going to post this as an issue, but then I realized it could be resolved as follows: the delegating PIDS, instead of passing the result sets back all merged together, should do the back-end searches, LOG EACH ONE INTO ITS OWN "REGISTER THESE IDS", and then hit its own f/c for the final result set, with all confidence scores computed by the same algorithm.
>>> > >
>>> > > >
>>> > > >I think what you want is a way to specify the closeness requirement from the client, or at least know what the server is doing. That is one of the issues in the current PIDS RTF. I don't know what the resolution will be.
>>> > >

[] pids.dtd
[] TraitNode.idl

From: "benjamas limprasert"
To: dwf@lanl.gov
Cc: jfarmer@ic.net, pids-rtf2@omg.org, drbob@bigfoot.com, EyestonS@BATTELLE.ORG, Ham@BATTELLE.ORG, Janet.Martino@ha.osd.mil, JMcCain@HQT.IHS.GOV, lovemum@hotmail.com
Subject: Matching algorithm of PIDs
Date: Tue, 13 Jun 2000 21:24:20 PDT

Hello,

I read the text file on the web site "www.omg.org/issues/issue3019.txt". I am interested in algorithms for matching patients from multiple domains (multiple PIDS), because my thesis is about a "federated trader of PIDs on multiple domains."

Example: there are two domains, Domain A and Domain B. (A domain can be a hospital, a laboratory, or something else.) Suppose I am a patient who registered in Domain A last year, and now I am sick and register at Domain B. Domain B wants to query information about me from Domain A by matching and searching traits. (IDs in Domain A have 7 digits and IDs in Domain B have 10 digits.)

What algorithm do you think I can use? If you have references about such algorithms (a file or a web site), please tell me or send them to me.

Thank you very much.

Best Regards,
Benjamas Limprasert
Master student (computer science)
Mahidol University, Thailand

From: "Jon Farmer"
To: "benjamas limprasert"
Subject: Re: Matching algorithm of PIDs
Date: Wed, 14 Jun 2000 14:04:21 -0400
Organization: Care Data Systems

Hi Benjamas,

You probably know that matching cannot legitimately be done on identifiers, except WITHIN a domain. The matching that a PIDS does across domains is always on the traits, such as name components, birth date, address, etc. Therefore the number of digits in an identifier is immaterial, except that the identifier needs an "address space size" big enough to accommodate the target population, and maybe some more digits as "check digits". Some people stuff traits into identifiers, but that is a risky practice, because traits change and few organizations have the resources to be updating both identifiers and the records they point to.

Algorithms range from simplistic "direct hit" comparisons of trait values, to those that compute "closeness" (confidence) numbers (like weights of trait types and thresholds that constitute a hit), to those that apply probability formulas to compute the appropriate weights of specific trait values.

As a commercial vendor of a PIDS, it would be disloyal to my company to give you algorithms, but good luck!

Jon

From: Tom Culpepper
To: coas-ftf@emerald.omg.org
Subject: COAS RTF #1
Date: Tue, 07 Nov 2000 14:23:51 -0600

All,

I have attached the COAS RTF #1 Report that I am working on, based on the issues that have come in. In the report I cut and pasted the issues and am merely using the Rejected piece as a placeholder. We need to look at each one and decide whether it is an issue and, if so, what we need to do with it, or, if it is not an issue, why not.
I have started putting some of my comments into some of the issues and would like feedback about how others feel. My comments are not intended to be the final answer, just a starting point.

Dave, issue 3136. I believe this is not an issue. Am I right?

Chuck, issue 3019. Can we use the Naming Service to identify the "functional IDL" level of conformance, and use the Trader as a way to ask for more specific capabilities such as "HL7 Inside" or "Single Struct COAS"? It is much like the White and Yellow Pages: I can go to the white pages to find a plumber and go to the yellow pages to get more detail.

We have until 30 November to complete this report. The comment deadline was 30 October. All issues besides 3136 and 3019 came in after 30 October, so technically we don't have to do anything with them, but I would like to do as much as possible with your help.

Thanks,
Tom

_______________________________________________________
Thomas C. Culpepper              2AB, Inc.
Senior Architect                 Integration Architects
tculpepper@2ab.com               www.2ab.com
205-621-7455 ext 107             OMG Domain Member
Technical Co-Chair CORBAmed      www.omg.org/homepages/corbamed
_________________________________________________________
"Solutions for Distributed Business (SM)"