Issue 3019: using XML for traits and metadata (pids-rtf2)
Source: Level Seven Visualizations (Mr. Jon Farmer, jon(at)level7vis.com)
Nature: Uncategorized Issue
Severity: 
Summary: Since this representation of trait values in XML is fully compliant at the
IDL level, it's not a functionality change.
The expression of trait metadata (so as to document the STRUCTURE of it) is
a functionality change, but it is defensible as a fix, even to make
it possible for PIDS clients to validate dates - not just textual format but
the value domains at the basic and abstract datatypes.

I think we can all agree at this point that we need an operation on
IdentificationComponent called

        get_trait_metadata

and I feel strongly that it should use DTD notation.  If we agree at this
level (please everyone - do you agree so far?), then I think the next
question is how to express basic and abstract types in the DTD.
Resolution: 
Revised Text: 
Actions taken:
October 21, 1999: received issue
Discussion: 

End of Annotations:=====
Reply-To: "Jon Farmer" <jfarmer@ic.net>
From: "Jon Farmer" <jfarmer@ic.net>
To: "PIDS RTF" <pids-rtf2@omg.org>
Subject: using XML for traits and metadata
Date: Thu, 21 Oct 1999 14:18:56 -0400
MIME-Version: 1.0
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 4.72.2106.4
X-MimeOLE: Produced By Microsoft MimeOLE V4.72.2106.4
Content-Type: multipart/mixed;
	boundary="----=_NextPart_000_001E_01BF1BCF.3601A180"
X-UIDL: Ena!!b+I!!E"=!!E5J!!

Since this representation of trait values in XML is fully compliant at the
IDL level, it's not a functionality change.
The expression of trait metadata (so as to document the STRUCTURE of it) is
a functionality change, but it is defensible as a fix, even to make
it possible for PIDS clients to validate dates - not just textual format but
the value domains at the basic and abstract datatypes.

I think we can all agree at this point that we need an operation on
IdentificationComponent called

        get_trait_metadata

and I feel strongly that it should use DTD notation.  If we agree at this
level (please everyone - do you agree so far?), then I think the next
question is how to express basic and abstract types in the DTD.

David's DTD shows string vs. binary (if that's what the S and B stand for),
but how do we denote a "date"?  I recommend we do it precisely the way HL7
V3 does it - however that is.  I am sure it is in the V3 MDF.  Can somebody
point us to exactly where (URL) to look for the rules by which DTD
constructs are derived from RIM abstract datatypes?  Actually, I saw that
this very topic was under sometimes-heated debate at the HL7 control-query
TC in Atlanta, but I don't know if it has settled out.  I need somebody to
take this task and report back to us.
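
As a rough illustration of what validating a date trait against both its
textual format and its value domain could look like on the client side, here
is a minimal Python sketch.  The element name, the type attribute, the trait
name, the YYYYMMDD format, and the 1850 lower bound are all assumptions made
for illustration; none of them comes from pids.dtd, the PIDS IDL, or the
HL7 V3 MDF.

```python
import xml.etree.ElementTree as ET
from datetime import date, datetime

# Hypothetical trait fragment; names and layout are illustrative only.
TRAIT_XML = '<trait name="example/BirthDate" type="date">19610315</trait>'


def validate_date_trait(xml_text, earliest=date(1850, 1, 1)):
    """Check the textual format and the value domain of a date trait."""
    node = ET.fromstring(xml_text)
    if node.get("type") != "date":
        raise ValueError("not a date trait")
    text = (node.text or "").strip()
    # Textual format: exactly eight digits, assumed to mean YYYYMMDD.
    if len(text) != 8 or not text.isdigit():
        raise ValueError("expected eight digits (YYYYMMDD)")
    # strptime rejects impossible calendar dates such as 19610231.
    parsed = datetime.strptime(text, "%Y%m%d").date()
    # Value domain (assumed): not before `earliest`, not in the future.
    if not (earliest <= parsed <= date.today()):
        raise ValueError("date outside the allowed domain")
    return parsed


print(validate_date_trait(TRAIT_XML))
```

The point of the sketch is only that metadata naming the datatype ("date")
is what lets a client go beyond a string check; how that datatype would be
declared in a DTD is the open question raised above.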

-----Original Message-----
From: David Forslund <dwf@lanl.gov>
To: drbob@bigfoot.com <drbob@bigfoot.com>; Jon Farmer <jfarmer@ic.net>
Cc: Eyestone, Scott M <EyestonS@BATTELLE.ORG>; pids-rtf2
<pids-rtf2@emerald.omg.org>; Ham, Gary A <Ham@BATTELLE.ORG>;
Janet.Martino@ha.osd.mil <Janet.Martino@ha.osd.mil>;
JMcCain@HQT.IHS.GOV <JMcCain@HQT.IHS.GOV>
Date: Tuesday, August 10, 1999 11:10 AM
Subject: RE: PIDS RTF2 issues discussion conference call


>To stimulate discussion, I've attached the DTD that we have made our PIDS
>handle.  Basically, it can handle any data structures of Traits that conform
>to this DTD.  In addition, I've attached the IDL that allows for the type
>information to be specified and used.  This is the ONLY extension of the
>PIDS spec we have used and it isn't really necessary.  Our server
>transparently supports the existing PIDS Traits which are a subset of this
>DTD.
>
>Dave
>At 11:32 PM 8/9/99 -0600, David Forslund wrote:
>
>>At 11:54 PM 8/9/99 -0500, Bobby R. Treat wrote:
>>>I don't agree there are multiple traits being compared to each other, in the
>>>sense you seem to mean.  Strings are compared to Strings.  Subclassed off
>>>Strings might be SSANs which, again, are compared to each other -- not to
>>>names or birthdays or anything else.  Other String subclasses might include
>>>FirstName, LastName, and MiddleInitial -- each of which would have VERY
>>>different Information metrics.
>>
>>I agree with Bobby.  Each Trait is compared to a Trait with the same name.
>>If the name corresponds to a different type (e.g., HL7/PatientName vs
>>vCard/PHOTO), the comparison algorithm could be different.  Bobby is
>>talking, however, about a slight extension to the current PIDS spec, not
>>what the current spec directly supports.
>>However, since a Trait is an Any, it isn't too big of a leap for it to
>>support this.  Tim's CinderellaTrait is an example of a complex,
>>user-defined Trait whose comparison would have to be done between other
>>CinderellaTraits (glass slippers).
>>
>>
>>>Look at the Smalltalk Number subclasses, which support "<", ">", "<=", "=",
>>>">=", "~=" messages polymorphically.  Each subclass may support each
>>>comparison differently.  In the context of military patient records, there
>>>are many kinds of comparison needed, and they can all be categorized in
>>>terms of Trait classes.
>>>
>>>Other classes would support aggregation of lower-level Traits -- the
>>>Composite trait class, for instance, could support an arbitrary group of
>>>Traits, each with a name.  The Information measure for the Composite Trait
>>>would compare two groups by comparing sub-traits with the same name, adding
>>>up the resulting Information measures derived from the subtrait classes
>>>(assuming independent traits).  Name could be a Composite subclass that
>>>includes "LastName", "FirstName", and "MiddleInitial" as components.  Maybe
>>>it really should be NameSex, with Sex as a fourth component, to recognize
>>>that a difference in LastName provides powerful negative Information for
>>>males but not for females.
>>
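
The Composite trait idea just described can be sketched in a few lines of
Python.  This is only an illustration of "compare sub-traits with the same
name and add up the Information": the chance-agreement probabilities are
made-up numbers, exact equality stands in for whatever per-trait comparison
a real Trait class would use, and disagreements simply contribute nothing
(no negative Information is modelled here).

```python
import math

# Assumed probability that two unrelated people agree on each sub-trait.
# These values are invented for the example.
CHANCE = {"LastName": 1 / 1024, "FirstName": 1 / 64, "MiddleInitial": 1 / 16}


def information_bits(p_chance):
    """Information of an agreement: minus log2 of its chance probability."""
    return -math.log2(p_chance)


def composite_information(a, b, chance=CHANCE):
    """Sum the Information of sub-traits that both groups carry and agree on.

    Sub-traits missing from either group add nothing, which is what makes
    the measure usable with partial profiles.
    """
    total = 0.0
    for name in a.keys() & b.keys():
        if a[name] == b[name]:
            total += information_bits(chance[name])
    return total


full = {"LastName": "Smith", "FirstName": "Ann", "MiddleInitial": "Q"}
partial = {"LastName": "Smith", "FirstName": "Ann"}
print(composite_information(full, full))     # 10 + 6 + 4 bits
print(composite_information(full, partial))  # 10 + 6 bits
```

Adding the per-trait values is only justified under the independence
assumption stated in the message above.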
>>Our PIDS server supports this now.  Since an array of Traits is directly
>>supported by the IDL, we have made the simple extension (I claim not really
>>an extension) that allows for a user-defined, arbitrary tree.  The leaves
>>of the tree can be strings, binary, etc.  One can then compare similar-depth
>>trees with the same types at the leaves.  This we have implemented, and it
>>allows for fairly complex traits (including public-key certificates,
>>physician-patient relationships, multiple login/password combinations,
>>etc.), all of which can be user-defined by the client.
>>We have a DTD which defines the legal relationship of the trait components.
>>
>>
>>>If you have a partial demographic profile, and you have 5 million records
>>>that may match it, it isn't the fault of any specific algorithm or interface
>>>(or analyst) that you have possibly 5 million compares to do.  It will take
>>>creativity to minimize the work.  Just as is done in all optimized
>>>databases, indexes can be prebuilt and used to make searches efficient, and
>>>the Trait classes can be responsible for knowing about and using such
>>>indices.  What you can't do is use SQL-style LIKE to solve all problems --
>>>it isn't flexible or powerful enough.
>>
>>Certainly the latter is true, and SQL-style LIKE statements are not used
>>by commercial PIDS implementations, to my knowledge.  Optimized, fast, and
>>accurate searches are where the competition in PIDS will shine (as well as
>>in accurate correlations, as Jon has mentioned).
>>
>>
>>>Testing and tuning thresholds until they work with a real population means
>>>your algorithm is tied to that population.  There's no proof it will work
>>>for another population, until you've tried it.  Also, just because you can
>>>input a threshold that gives you what you want, doesn't make that threshold
>>>meaningful.  Does it work in all situations, no matter which or how many
>>>traits are compared?  If it's meaningful, what does it mean?
>>>
>>>If you have a method that's 99.9% accurate, I have three reactions to that:
>>>
>>>1) I'm really more concerned about a concrete interface issue.  I don't
>>>claim nobody has done the neural net work, or whatever, and come up with
>>>good matching algorithms.  The plain fact is, the PIDS spec doesn't allow
>>>users to get all the kinds of comparison they want.  Use your implementation
>>>if it works, but put a useful interface in front, and support all the
>>>comparisons needed.  In part, I'm claiming that didn't happen the first time
>>>through, because people didn't see comparison as a Trait behavior, as they
>>>should have done.
>>
>>I think I disagree.  The PIDS spec allows for all kinds of comparisons;
>>they just can't be specified by the client.  This was deliberate in the
>>first PIDS spec to allow for as much flexibility as possible.  If we
>>tightened down the spec too early, we might not have the ability to match
>>in as many different ways.  Like Traits are to be compared and can be
>>handled with comparison as a Trait behavior in an implementation.
>>
>>
>>>2) If you take the sum-total of all CHCS sites and look for duplicates, with
>>>an error rate of .1%, what does that mean?  Let's say there are 20 million
>>>CHCS records altogether.  (I'm sure that's conservative.)  n*(n-1)/2
>>>possible matches times .1% = 199,999,990,000 expected matching errors --
>>>more than the total number of records!!  Good luck resolving them all.  Even
>>>in day-to-day operation, how many people walk into CHCS sites in a day,
>>>worldwide?  Multiply by .1% and make that many errors every day, and in a
>>>year you'll have another fine mess indeed.  So, you need a bigger number
>>>than 99.9% to impress me.  Add six more nines, and you're getting close for
>>>pre-install.  For operation, you need at least a couple more nines.  Yes,
>>>you can avoid the pre-install matching problem by finding duplicates only as
>>>people come in, but it means your database will always be chock-FULL of what
>>>your own matching algorithm would say are duplicates, if asked.  It's a
>>>tough problem, it really is.  If you've solved it, point (1) still applies.
>>
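
The figure quoted in point (2) can be reproduced with a few lines of Python.
The sketch takes the message's own premise at face value, namely that the
0.1% error rate applies independently to every one of the n*(n-1)/2 pairwise
comparisons; whether that is the right way to read a "99.9% accurate" claim
is exactly what the thread goes on to dispute.

```python
# Reproduce the arithmetic from point (2) using exact integer math.
n = 20_000_000                 # records, as assumed in the message
pairs = n * (n - 1) // 2       # distinct pairs of records
expected_errors = pairs // 1000  # 0.1% of all pairwise comparisons

print(f"{pairs:,} pairs")                      # 199,999,990,000,000 pairs
print(f"{expected_errors:,} expected errors")  # 199,999,990,000 expected errors
print(expected_errors > n)                     # True: more errors than records
```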
>>I don't believe you have used Jon's statistics correctly, but I'm sure he
>>will respond.
>>
>>
>>>3) You're undoubtedly working with much cleaner records than CHCS
has to
>>>offer.
>>>
>>>"We've been toying with adding a function to dock confidence points for
>>>misses, but not sure how important that is, given the accuracy we have now."
>>>
>>>Think about it some more.  I'll gladly help as devil's advocate, in my spare
>>>time.
>>
>>Dave
>>
>>>Bobby
>>>
>>>-----Original Message-----
>>>From: Jon Farmer [mailto:jfarmer@ic.net]
>>>Sent: Monday, August 09, 1999 8:24 PM
>>>To: drbob@bigfoot.com; David W. Forslund
>>>Cc: Eyestone, Scott M; pids-rtf2; Ham, Gary A; Janet.Martino@ha.osd.mil
>>>Subject: Re: PIDS RTF2 issues discussion conference call
>>>
>>> >I think comparison is a Trait behavior.  Treating it that way means that
>>> >selecting the Trait class determines the matching algorithm.
>>> >
>>>Since there are multiple traits being compared to each other, I don't see
>>>comparison as "normalizing" to the trait class.
>>>
>>>Nor do I see it as feasible, even with OBV, to incur 5 million method
>>>invocations for a search.  Now, if an IMPLEMENTATION wants to do it that
>>>way, with some value add - more power to them (literally), but at the
>>>conceptual responsibility level, the PIDS is responsible for matching
>>>profiles of traits submitted as a unit - not individual profiles.  It's not
>>>a trait identification service, it's a person identification service.
>>>
>>> >"Minus log probability of this degree of closeness by chance" has the
>>> >virtue of being additive across independent traits.  Without additivity --
>>> >or some such combinatorial property -- it's damned difficult to make sense
>>> >of a measure.  Any metric, in any context in mathematics, requires some
>>> >such property -- like the standard "triangle inequality" that says
>>> >d(A,C) <= d(A,B) + d(B,C).  (I'm pretty sure the Information measure
>>> >satisfies that one too, by the way.)
>>>
>>>Most Definitely!  There is no way to programmatically/procedurally cover all
>>>the combinatorial cases of "what's missing" unless you have that additive
>>>effect.
>>>
>>> >
>>> >When searching for matches with partial demographic information, how do
>>> >you make sense of it when on some records the middle initial or sex or
>>> >rank or age is missing, but on others they're all there?  The Information
>>> >measure adds nothing for the missing traits, and remains mathematically
>>> >accurate in doing so.  It also provides an explicit test statistic, which
>>> >has the advantage that lots of people [think they] understand that.
>>> >
>>>
>>>Agree
>>> >Yes, the Information measure is dependent on the reference population, but
>>> >so is anything else that's useful.
>>>Disagree.  We have run for three years in a 500k population with a set of
>>>discrete matching trait groups (conclude match if any one of these trait
>>>groups hits on each trait exactly) and statistical (weights and thresholds).
>>>The customer claims well in excess of 99.9% correctness in terms of failed
>>>consolidations or false consolidations.
>>>
>>>Now - to get to that they had to experiment with the matching policy for a
>>>few months and then QA samples of matching decisions.  The weights and
>>>thresholds were NOT computed with a prespecified reference population -
>>>unless the fact that they tested it against a real population means it was.
>>>
>>>I think that qualifies as meaningful!
>>>
>>> >
>>> >The concept of negative Information (information AGAINST a match) is also
>>> >very important.  A Trait like sex that's almost useless for matching can be
>>> >very useful for preventing poor matches.  Hence I like the idea of two
>>> >polymorphic Trait behaviors for positive and negative information.
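
One common way to put numbers on "information for" and "information against"
a match is the pair of agreement and disagreement weights used in
Fellegi-Sunter style record linkage.  That technique is swapped in here
purely for illustration; the thread does not say which formula anyone has in
mind.  The values of m (chance that two records of the same person agree on
sex) and u (chance that two unrelated records agree) are assumptions.

```python
import math


def agreement_weight(m, u):
    """Bits in favour of a match when the trait agrees."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Bits in favour of a match when the trait disagrees (negative)."""
    return math.log2((1 - m) / (1 - u))


# Assumed values for the sex trait: recorded consistently 99% of the time
# for the same person, and agreeing by chance half of the time.
M_SEX = 0.99
U_SEX = 0.5

positive = agreement_weight(M_SEX, U_SEX)
negative = disagreement_weight(M_SEX, U_SEX)
print(round(positive, 2), round(negative, 2))
```

With these assumed inputs an agreement on sex is worth a little under one
bit, while a disagreement counts for more than five bits against the match,
which is the asymmetry the message above describes.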
>>>
>>>We've been toying with adding a function to dock confidence points for
>>>misses, but not sure how important that is, given the accuracy we have now.
>>> >
>>> >Bobby
>>> >
>>> >-----Original Message-----
>>> >From: David Forslund [mailto:dwf@acl.lanl.gov]
>>> >Sent: Monday, August 09, 1999 4:08 PM
>>> >To: Jon Farmer
>>> >Cc: drbob@bigfoot.com; Eyestone, Scott M; pids-rtf2; Ham, Gary A;
>>> >Janet.Martino@ha.osd.mil
>>> >Subject: Re: PIDS RTF2 issues discussion conference call
>>> >
>>> >
>>> >The item that is in the RTF for discussion is a way to request or find
>>> >out what matching "algorithm" is being used on the server.  Currently,
>>> >there is no way to find this out through the standard IDL.  The issue
>>> >of defining what we mean by a matching algorithm is still an issue.
>>> >
>>> >Dave
>>> >
>>> >Jon Farmer writes:
>>> > > >
>>> > > >>Most of all, I've been arguing for a matching measure that's
>>> > > >>meaningful, like "probability of this degree of closeness by chance"
>>> > > >>(information = minus log of that).  A meaningful metric is the first
>>> > > >>step in pulling together a meaningful interface.  Otherwise, you have
>>> > > >>no consistent way of asking the "give me close matches" question.
>>> > >
>>> > >
>>> > > The current spec provides for a confidence scoring between 0 and 1,
>>> > > with zero meaning no chance, 1 being certainty; therefore all real
>>> > > values should be between them.  The idea of defining it as a
>>> > > probability was brought up and discussed repeatedly and at length in
>>> > > the original spec effort.
>>> > > Essentially, we cannot call it a probability since many underlying
>>> > > matching algorithms are not calculating probabilities in a formal sense
>>> > > (actually likelihoods), based on the distinctiveness of each incoming
>>> > > value among the universe of already-known values.  Note that the
>>> > > selectivity of a given value cannot even be approximated unless the
>>> > > whole local population is loaded!
>>> > >
>>> > > It was decided that in order to NOT disqualify all implemented
>>> > > algorithms except probabilistic ones - we would define it as a
>>> > > confidence.  Then, it becomes the responsibility of the implementor to
>>> > > provide a meaningful assignment of confidences to the candidates.  In
>>> > > our product we apply several algorithms, including a
>>> > > weights-and-thresholds algorithm that gives us a confidence metric that
>>> > > is more or less linear with likelihood if you tune it right.  However,
>>> > > somebody else's regular-expression matcher - or a primitive
>>> > > direct-total-exact-hit-or-nothing matcher - has a harder challenge to
>>> > > produce a 0-1 confidence scoring.  But this is much better than saying
>>> > > "PIDS only works for probabilistic algorithms".
>>> > >
>>> > > Another issue arises, however, in the context of a delegated
>>> > > find_candidates (f/c) operation.  If my delegating PIDS gets a f/c
>>> > > request and delegates it to three back-end PIDS servers, how valid is
>>> > > it to return the results interleaved together, assuming that the client
>>> > > is going to trust the rankings based on scores produced by different
>>> > > implementations?  I was going to post this as an issue, but then I
>>> > > realized it could be resolved as follows - the delegating PIDS -
>>> > > instead of passing the result sets back all merged together - should do
>>> > > the back-end searches, LOG EACH ONE INTO ITS OWN "REGISTER THESE IDS"
>>> > > and then hit its own f/c for the final result set with all confidence
>>> > > scores computed by the same algorithm.
>>> > >
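
The delegation scheme proposed above (search the back ends, register what
they return locally, then rank everything with one algorithm) can be
sketched as follows.  This is a toy model in Python, not PIDS IDL: the
back ends are plain functions returning (id, traits) pairs, the local
registry is a dict, and the scoring function is a stand-in chosen only to
make the example runnable.

```python
def exact_fraction(profile, traits):
    """Stand-in local scorer: fraction of profile traits matched exactly."""
    hits = sum(1 for key, value in profile.items() if traits.get(key) == value)
    return hits / len(profile)


def delegated_find_candidates(profile, backends, local_score, threshold=0.5):
    """Collect candidates from every back end, then score them all locally."""
    registry = {}
    for search in backends:
        for candidate_id, traits in search(profile):
            # Any confidence the back end computed is deliberately ignored.
            registry[candidate_id] = traits
    scored = [(local_score(profile, traits), cid)
              for cid, traits in registry.items()]
    # Highest confidence first; every score comes from the same algorithm.
    return sorted((item for item in scored if item[0] >= threshold),
                  reverse=True)


# Two hypothetical back-end servers with hard-coded answers.
def backend_a(profile):
    return [("A1", {"last": "Smith", "first": "Ann"})]


def backend_b(profile):
    return [("B7", {"last": "Smith", "first": "Bob"}),
            ("B9", {"last": "Jones", "first": "Al"})]


query = {"last": "Smith", "first": "Ann"}
print(delegated_find_candidates(query, [backend_a, backend_b], exact_fraction))
```

Because every candidate is rescored by the delegating server, the ranking no
longer depends on how each back end happens to assign its confidences.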
>>> > > >
>>> > > >I think what you want is a way to specify the closeness requirement
>>> > > >from the client or at least know what the server is doing.  That is
>>> > > >one of the issues in the current PIDS RTF.  I don't know what the
>>> > > >resolution will be.
>>> > >
>>> > >
>>> >
>>> >
>

[] pids.dtd 

[] TraitNode.idl 

X-Originating-IP: [202.28.159.201]
From: "benjamas limprasert" <lovemum@hotmail.com>
To: dwf@lanl.gov
Cc: jfarmer@ic.net, pids-rtf2@omg.org, drbob@bigfoot.com,
        EyestonS@BATTELLE.ORG, Ham@BATTELLE.ORG,
	Janet.Martino@ha.osd.mil,
        JMcCain@HQT.IHS.GOV, lovemum@hotmail.com
Subject: Matching algorithm of PIDs
Date: Tue, 13 Jun 2000 21:24:20 PDT
Mime-Version: 1.0
Content-Type: text/plain; format=flowed
X-UIDL: 4N7!!EPE!!d(j!!B9^!!

Hello,

I read the text file on the web site "www.omg.org/issues/issue3019.txt".
I am interested in matching algorithms for matching patients across
multiple domains (multiple PIDs), because my thesis is about a "federated
trader of PIDs on multiple domains."
For example, there are 2 domains, Domain A and Domain B.  (A domain can be
a hospital, a laboratory, or other...)
Suppose I am a patient who registered in Domain A last year, and now I'm
sick and register at Domain B.
Domain B wants to query information about me from Domain A by matching
and searching traits.  (IDs in Domain A have 7 digits and IDs in Domain B
have 10 digits.)
What algorithm do you think I can use?  If you have references about
algorithms (a file or website), could you please tell me or send them to me?
Thank you very much.

Best Regards,
Benjamas Limprasert
Master student (computer science)
Mahidol University,Thailand
________________________________________________________________________
Get Your Private, Free E-mail from MSN Hotmail at
http://www.hotmail.com


Reply-To: "Jon Farmer" <jfarmer@ic.net>
From: "Jon Farmer" <jfarmer@ic.net>
To: "benjamas limprasert" <lovemum@hotmail.com>, <dwf@lanl.gov>
Cc: <pids-rtf2@omg.org>, <drbob@bigfoot.com>, <EyestonS@BATTELLE.ORG>,
        <Ham@BATTELLE.ORG>, <Janet.Martino@ha.osd.mil>,
	<JMcCain@HQT.IHS.GOV>,
        <lovemum@hotmail.com>
References: <20000614042420.33827.qmail@hotmail.com>
Subject: Re: Matching algorithm of PIDs
Date: Wed, 14 Jun 2000 14:04:21 -0400
Organization: Care Data Systems
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 5.00.2314.1300
X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2314.1300
Content-Type: text/plain;
	      charset="iso-8859-1"
X-UIDL: 4R4!!YDd!!ToR!!7;kd9

Hi Benjamas,
You probably know that matching cannot legitimately be done on identifiers,
except WITHIN a domain.  The matching that a PIDS does across domains is
always on the traits, such as name components, birth date, address, etc.
Therefore the number of digits of identifiers is immaterial except in that
it needs to have an "address space size" big enough to accommodate the
target population and maybe some more digits as "check digits".  Some people
stuff traits into identifiers, but that is a risky practice, because traits
change and few organizations have the resources to be updating both
identifiers and the records they point to.

Algorithms range from simplistic "direct hit" comparisons of trait values,
to those that compute "closeness" (confidence) numbers (like weights of
trait types and thresholds that constitute a hit), to those that apply
probability formulas to compute the appropriate weights of specific trait
values.
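
To make the first two families concrete, here is a minimal Python sketch of
a "direct hit" matcher next to a weights-and-thresholds matcher.  It is a
generic illustration only: the trait names, the weights, and the 0.7
threshold are invented for the example and do not describe any vendor's
product.

```python
# Assumed per-trait-type weights (integers keep the arithmetic exact).
WEIGHTS = {"last_name": 4, "first_name": 2, "birth_date": 3, "postal_code": 1}


def direct_hit(query, record):
    """Simplest approach: every supplied trait must match exactly."""
    return all(record.get(key) == value for key, value in query.items())


def weighted_confidence(query, record, weights=WEIGHTS):
    """Confidence in [0, 1]: matched weight over weight of supplied traits."""
    possible = sum(weights[key] for key in query)
    earned = sum(weights[key] for key, value in query.items()
                 if record.get(key) == value)
    return earned / possible if possible else 0.0


def is_hit(query, record, threshold=0.7):
    """Apply the threshold that turns a confidence into a yes/no hit."""
    return weighted_confidence(query, record) >= threshold


query = {"last_name": "Smith", "first_name": "Ann", "birth_date": "19610315"}
record = {"last_name": "Smith", "first_name": "Anne", "birth_date": "19610315"}
print(direct_hit(query, record))
print(weighted_confidence(query, record))
print(is_hit(query, record))
```

In this example the differing first name defeats the direct-hit matcher,
while the weighted matcher still scores 7 out of 9 and clears the assumed
threshold.  The third family, which derives weights from how common each
specific value is, would need population statistics and is not shown.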

As a commercial vendor of a PIDS, it would be disloyal to my company to give
you algorithms, but good luck!
Jon

X-Sender: tculpepr@pop.mindspring.com
X-Mailer: QUALCOMM Windows Eudora Pro Version 4.0.1 
Date: Tue, 07 Nov 2000 14:23:51 -0600
To: coas-ftf@emerald.omg.org
From: Tom Culpepper <tculpepper@2ab.com>
Subject: COAS RTF #1
In-Reply-To: <4.2.0.58.20001106154117.00ad1a00@emerald.omg.org>
Mime-Version: 1.0
Content-Type: multipart/alternative;
	      types="text/plain,text/html";
	      boundary="=====================_26253820==_.ALT"
X-UIDL: mp/!!GX-e9-n[d9R)[!!

All,
I have attached the COAS RTF #1 Report that I am working on based on
issues that have come in. In the report I cut and pasted the issues
and am merely using the Rejected piece as a placeholder. We need to
look at each one and decide whether it is an issue and, if so, what we
need to do with it or, if it is not an issue, why not.

I have started putting some of my comments into some of the issues and
would like feedback about how others feel. My comments are not
intended to be the final answer, just a starting point.

Dave, issue 3136. I believe this is not an issue. Am I right?

Chuck, issue 3019. Can we use the Naming Service to identify the
"functional IDL" level of conformance, and use the Trader as a way to
ask for more specific capabilities such as "HL7 Inside" or "Single
Struct COAS"? Much like the White and Yellow pages: I can go to the
white pages to find a plumber and go to the yellow pages to get more
detail.

We have until 30 November to complete this report. The comment
deadline was 30 October. All other issues besides 3136 & 3019 came in
after 30 October, so technically we don't have to do anything with them,
but I would like to do as much as possible with your help.

Thanks

Tom
_______________________________________________________
Thomas C. Culpepper                    2AB, Inc.
Senior Architect                            Integration Architects
tculpepper@2ab.com                    www.2ab.com
205-621-7455 ext 107                   OMG Domain Member

Technical Co-Chair CORBAmed     www.omg.org/homepages/corbamed	
_________________________________________________________ 
"Solutions for Distributed Business (SM)"