Issue 3557: why does SimilaritySearchHit contain a list of Alignments (biomolecular-ftf)

Source: (Mr. Philip Lijnzaad, p.lijnzaad@med.uu.nl)
Nature: Uncategorized Issue
Severity:
Summary: during our implementation work we came across the following point that is puzzling: why does SimilaritySearchHit contain a list of Alignments? A SearchHit is a match of one target sequence with one or more matched sequences. Practically by definition, each such hit is _one_ alignment (even if the alignment is more than pairwise). So we should either change the spec to be

valuetype SimilaritySearchHit : SearchHit
{
  public Alignment the_alignment;
};

or, leave the spec and document that the only legal length for alignment_list is 1. Thoughts?

Resolution: Accepted. Chose to make only the documentation change.

Revised Text: After the first para of 2.1.19 (p. 2-53), insert the following text.

The member alignment_list is added to the SearchHit valuetype. This AlignmentList contains the details of a similarity search hit in the form of an Alignment. This list will frequently be of length one, but can be used to group all 'local' hits pertaining to a single found sequence into one SimilaritySearchHit. That is, one SimilaritySearchHit may contain one 'local match' or several 'local matches' on one sequence. What is most appropriate depends on the analysis that was run to obtain the hits, and/or on the objective of a service; an implementation must document these semantics.

Also put this blurb (apart from the first sentence) just before 2.1.20, on page 2-55.

Add the following to the Return value description of alignment_list on p. 2-54.

If the alignment_list contains more than one Alignment, each Alignment shall involve the same two objects, i.e., only two Identifiers are used within one SimilaritySearchHit, one for the initial query_sequence (see SearchResult) and one for the current target sequence.
Since each Alignment in a SimilaritySearchHit is a pairwise alignment, the result of the Alignment's num_rows() method shall be 2.

On p. 2-60, end of 2.1.21 (just above section 2.1.22), insert the following text.

The method get_hits() returns a list of SearchHits. If this concerns SimilaritySearchHits, then these elements will in turn contain a list of Alignments. The latter can be used to group all 'local' hits pertaining to a single found sequence into one SimilaritySearchHit. That is, one SimilaritySearchHit may contain one 'local match' or several 'local matches' on one sequence. What is most appropriate depends on the analysis that was run to obtain the hits, and/or on the objective of a service. An implementation must document these semantics.

Actions taken:
April 11, 2000: received issue
May 24, 2001: closed issue

Discussion:

End of Annotations:=====
To: biomolecular-ftf@omg.org
cc: muilu@ebi.ac.uk
Subject: more issues ...
Reply-to: lijnzaad@ebi.ac.uk
Content-Type: text

Dear all,

during our implementation work we came across the following point that is puzzling:

why does SimilaritySearchHit contain a list of Alignments? A SearchHit is a match of one target sequence with one or more matched sequences. Practically by definition, each such hit is _one_ alignment (even if the alignment is more than pairwise). So we should either change the spec to be

valuetype SimilaritySearchHit : SearchHit
{
  public Alignment the_alignment;
};

or, leave the spec and document that the only legal length for alignment_list is 1. Thoughts?

                                                                      Philip
--
Not getting what you want is sometimes a wonderful stroke of luck.
-----------------------------------------------------------------------------
Philip Lijnzaad, lijnzaad@ebi.ac.uk | European Bioinformatics Institute,rm A2-24
+44 (0)1223 49 4639                 | Wellcome Trust Genome Campus, Hinxton
+44 (0)1223 49 4468 (fax)           | Cambridgeshire CB10 1SD, GREAT BRITAIN
PGP fingerprint: E1 03 BF 80 94 61 B6 FC 50 3D 1F 64 40 75 FB 53

Date: Thu, 13 Apr 2000 16:47:07 -0700
From: Scott Markel
Organization: NetGenics, Inc.
X-Mailer: Mozilla 4.72 [en] (WinNT; U)
X-Accept-Language: en
MIME-Version: 1.0
To: lijnzaad@ebi.ac.uk
CC: biomolecular-ftf@omg.org, muilu@ebi.ac.uk
Subject: Re: more issues ...
References: <200004111348.OAA08609@o2-3.ebi.ac.uk>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii

Philip,

Sorry for the delay in responding. Our resident BLAST expert was out of town for a few days and I just finished getting his opinion.

Philip Lijnzaad wrote:
>
> Dear all,
>
> during our implementation work we came across the following point
> that is puzzling:
>
> why does SimilaritySearchHit contain a list of Alignments? A
> SearchHit is a match of one target sequence with one or more matched
> sequences. Practically by definition, each such hit is _one_
> alignment (even if the alignment is more than pairwise). So we should
> either change the spec to be
>
> valuetype SimilaritySearchHit : SearchHit
> {
>   public Alignment the_alignment;
> };
>
> or, leave the spec and document that the only legal length for
> alignment_list is 1. Thoughts?

I agree that there's a problem, but I'm not convinced that the solution is to change AlignmentList to a single Alignment.

BLAST returns high scoring pairs (HSPs). An HSP consists of a query sequence, a matched sequence, some scores, and a single alignment. A single matched sequence can occur in multiple HSPs.
If we equate an HSP to a SearchHit, then the only way to find out about multiple hits involving a single matched sequence involves looking at all SearchHits and checking each SearchHit's id. Everyone I know who has done it this way has regretted it. It should be easy to get all HSPs involving a single matched sequence.

The BSA specification has done this, but what's murky is how to identify a single HSP (scores + alignment) from a SimilaritySearchHit. The array of Alignments makes the alignment part easy. The hit_info Properties is less clear. The specification doesn't indicate what's in the hit_info, e.g., for every alignment, there's a score or set of scores.

This can be addressed in several ways.

* Add text explaining how the hit_info Properties are organized. This
  leaves the IDL unchanged.

* Change hit_info to some other kind of object, e.g., a 2D array of
  floats (the scores) ordered the same way the alignment array is
  ordered. This leaves out any non-float information for HSPs.

* Add a new layer in the model called HighScoringPair and have IDL
  that looks something like the following. SearchResult and
  BioSequenceIdentifierResolver are unchanged.

  valuetype HighScoringPair
  {
    public CosPropertyService::Properties hsp_info;
  };

  valuetype AlignedHighScoringPair : HighScoringPair
  {
    public Alignment the_alignment;
  };

  typedef sequence<HighScoringPair> HighScoringPairList;
  typedef sequence<AlignedHighScoringPair> AlignedHighScoringPairList;

  valuetype SearchHit
  {
    public Identifier id;
    public CosPropertyService::Properties hit_info;
    public HighScoringPairList high_scoring_pairs;
  };

It would be nice if we could get by with a simple text modification, but that may involve sweeping too much under the carpet.

Comments?

Scott
--
Scott Markel, Ph.D.       NetGenics, Inc.
smarkel@netgenics.com     4350 Executive Drive
Tel: 858 455 5223         Suite 260
FAX: 858 455 1388         San Diego, CA 92121

Date: Fri, 14 Apr 2000 09:25:49 +0100 (BST)
Message-Id: <200004140825.JAA13093@o2-3.ebi.ac.uk>
X-Authentication-Warning: o2-3.ebi.ac.uk: lijnzaad set sender to lijnzaad@ebi.ac.uk using -f
From: Philip Lijnzaad
To: smarkel@netgenics.com
CC: biomolecular-ftf@omg.org, muilu@ebi.ac.uk
In-reply-to: <38F65C7B.C651D605@netgenics.com> (message from Scott Markel on Thu, 13 Apr 2000 16:47:07 -0700)
Subject: Re: more issues ...
Reply-to: lijnzaad@ebi.ac.uk
References: <200004111348.OAA08609@o2-3.ebi.ac.uk> <38F65C7B.C651D605@netgenics.com>
Content-Type: text

On Thu, 13 Apr 2000 16:47:07 -0700,
"Scott" == Scott Markel writes:

>> why does SimilaritySearchHit contain a list of Alignments? A SearchHit is a
>> match of one target sequence with one or more matched sequences. Practically
>> by definition, each such hit is _one_ alignment (even if the alignment is
>> more than pairwise).

Scott> I agree that there's a problem, but I'm not convinced that the
Scott> solution is to change AlignmentList to a single Alignment.

Scott> BLAST returns high scoring pairs (HSPs). An HSP consists of a query
Scott> sequence, a matched sequence, some scores, and a single alignment. A
Scott> single matched sequence can occur in multiple HSPs.

For repetitive sequences, yes.

Scott> If we equate an HSP to a SearchHit,

yes, that's my reading

Scott> then the only way to find out about multiple hits
Scott> involving a single matched sequence involves looking at all SearchHits
Scott> and checking each SearchHit's id.

yes.

Scott> Everyone I know who has done it this
Scott> way has regretted it. It should be easy to get all HSPs involving a
Scott> single matched sequence.

Scott> The BSA specification has done this, but what's murky is how to identify
Scott> a single HSP (scores + alignment) from a SimilaritySearchHit.
Scott> The array of Alignments makes the alignment part easy. The hit_info
Scott> Properties is less clear. The specification doesn't indicate what's in
Scott> the hit_info, e.g., for every alignment, there's a score or set of
Scott> scores.

Scott> This can be addressed in several ways.

Scott> * Add text explaining how the hit_info Properties are organized. This
Scott>   leaves the IDL unchanged.

Scott> * Change hit_info to some other kind of object, e.g., a 2D array of
Scott>   floats (the scores) ordered the same way the alignment array is
Scott>   ordered. This leaves out any non-float information for HSPs.

Scott> * Add a new layer in the model called HighScoringPair and have IDL
Scott>   that looks something like the following. SearchResult and
Scott>   BioSequenceIdentifierResolver are unchanged.

Scott>   valuetype HighScoringPair
Scott>   {
Scott>     public CosPropertyService::Properties hsp_info;
Scott>   };

Scott>   valuetype AlignedHighScoringPair : HighScoringPair
Scott>   {
Scott>     public Alignment the_alignment;
Scott>   };

Scott>   typedef sequence<HighScoringPair> HighScoringPairList;
Scott>   typedef sequence<AlignedHighScoringPair> AlignedHighScoringPairList;

Scott>   valuetype SearchHit
Scott>   {
Scott>     public Identifier id;
Scott>     public CosPropertyService::Properties hit_info;
Scott>     public HighScoringPairList high_scoring_pairs;
Scott>   };

Scott> It would be nice if we could get by with a simple text modification, but
Scott> that may involve sweeping too much under the carpet.

Scott> Comments?

Scott> Scott

--
Not getting what you want is sometimes a wonderful stroke of luck.
-----------------------------------------------------------------------------
Philip Lijnzaad, lijnzaad@ebi.ac.uk | European Bioinformatics Institute,rm A2-24
+44 (0)1223 49 4639                 | Wellcome Trust Genome Campus, Hinxton
+44 (0)1223 49 4468 (fax)           | Cambridgeshire CB10 1SD, GREAT BRITAIN
PGP fingerprint: E1 03 BF 80 94 61 B6 FC 50 3D 1F 64 40 75 FB 53

Date: Fri, 14 Apr 2000 11:58:27 +0100 (BST)
Message-Id: <200004141058.LAA13118@o2-3.ebi.ac.uk>
X-Authentication-Warning: o2-3.ebi.ac.uk: lijnzaad set sender to lijnzaad@ebi.ac.uk using -f
From: Philip Lijnzaad
To: smarkel@netgenics.com
CC: biomolecular-ftf@omg.org, muilu@ebi.ac.uk
In-reply-to: <38F65C7B.C651D605@netgenics.com> (message from Scott Markel on Thu, 13 Apr 2000 16:47:07 -0700)
Subject: Re: more issues ...
Reply-to: lijnzaad@ebi.ac.uk
References: <200004111348.OAA08609@o2-3.ebi.ac.uk> <38F65C7B.C651D605@netgenics.com>
Content-Type: text

On Thu, 13 Apr 2000 16:47:07 -0700,
"Scott" == Scott Markel writes:

>> why does SimilaritySearchHit contain a list of Alignments? A SearchHit is a
>> match of one target sequence with one or more matched sequences. Practically
>> by definition, each such hit is _one_ alignment (even if the alignment is
>> more than pairwise).

Scott> I agree that there's a problem, but I'm not convinced that the
Scott> solution is to change AlignmentList to a single Alignment.

Scott> BLAST returns high scoring pairs (HSPs). An HSP consists of a query
Scott> sequence, a matched sequence, some scores, and a single alignment. A
Scott> single matched sequence can occur in multiple HSPs.

For repetitive sequences and multi-domain proteins (in either target or matched sequence), yes.

Scott> If we equate an HSP to a SearchHit,

yes, that's my reading.

Scott> then the only way to find out about multiple hits involving a single
Scott> matched sequence involves looking at all SearchHits and checking each
Scott> SearchHit's id.

yes.
Scott> Everyone I know who has done it this way has regretted it.

Mmm ... it also depends on the application, of course. If you're going after the repeats or the single domains, you don't want this thing collated into one bigger hit. Another example is that users may expect that the Hits are ordered, but two region-hits on one sequence may not sort next to each other in the total output due to intervening region-hits with other sequences (this is actually quite common).

BLAST used to give back single matching regions (which is obviously a pain); WUBLAST 2.? now takes the primary output, and has an algorithm that tries to patch things back into a single, bigger alignment using gaps, which for many cases is clearly useful. Fasta just returns the single best alignment for any sequence, so there, implicitly, 'Hit = sequence'.

A brief poll of a number of local experts was equivocal on the matter; a third was adamant that 'Hit' = 'local match', a third said 'It depends' and the rest was equally adamant that 'Hit = sequence'.

The whole thing, of course, also crucially depends on what a sequence is. The 'hit=sequence' view only really makes sense if the majority of database sequences are defined as genes (i.e., not big clones or contigs, let alone chromosomes, because the thing would then become meaningless).

Scott> It should be easy to get all HSPs involving a single matched sequence.
Scott> The BSA specification has done this,

But is not clear enough about the whole matter. 2.1.18 says 'SearchHit provides information about a particular sequence that was found', which doesn't determine whether this should be a region or not. E.g., should SearchResult state that the Ids of its SearchHits should be unique (or at least that that is the intended use)? I think this would be too limiting. But we should state something about it.

BTW, can SearchHit (and/or SearchResult) be only about BioSequences? I thought that we tried to keep it more general.
If that's the case, maybe we should alter the wording, because it's very tailored to BioSequences and, more importantly, it doesn't say whether it is legal to use Search{Hit,Result} for things other than BioSequences (e.g., text research, which I remember reading somewhere in the spec).

Scott> but what's murky is how to identify a single HSP (scores + alignment)
Scott> from a SimilaritySearchHit. The array of Alignments makes the
Scott> alignment part easy. The hit_info Properties is less clear. The
Scott> specification doesn't indicate what's in the hit_info, e.g., for every
Scott> alignment, there's a score or set of scores.

yes, agreed.

Scott> This can be addressed in several ways.

Scott> * Add text explaining how the hit_info Properties are organized. This
Scott>   leaves the IDL unchanged.

and document the whole matter a bit more. It's not very clean, but it has the advantage that both views ('hit'='local match' and hit=sequence) can be represented, without favouring one over the other.

Scott> * Change hit_info to some other kind of object, e.g., a 2D array of
Scott>   floats (the scores) ordered the same way the alignment array is
Scott>   ordered. This leaves out any non-float information for HSPs.

yes, and it's a bit messy anyway.

Scott> * Add a new layer in the model called HighScoringPair and have IDL
Scott>   that looks something like the following. SearchResult and
Scott>   BioSequenceIdentifierResolver are unchanged.
Scott>   valuetype HighScoringPair
Scott>   {
Scott>     public CosPropertyService::Properties hsp_info;
Scott>   };

Scott>   valuetype AlignedHighScoringPair : HighScoringPair
Scott>   {
Scott>     public Alignment the_alignment;
Scott>   };

Scott>   typedef sequence<HighScoringPair> HighScoringPairList;
Scott>   typedef sequence<AlignedHighScoringPair> AlignedHighScoringPairList;

Scott>   valuetype SearchHit
Scott>   {
Scott>     public Identifier id;
Scott>     public CosPropertyService::Properties hit_info;
Scott>     public HighScoringPairList high_scoring_pairs;
Scott>   };

This is probably the cleanest way of doing it, but it is a fairly big change/addition to the spec (and see note about sorting).

Again, I think it all depends on what we think is the right (or rather, the most commonly used) semantics of the word 'Hit'. Without there being a clear consensus, maybe the spec should not choose one or the other. I think it is ultimately the Analysis program that says what are hits and what are not. Users probably will _know_ that an (old) BLAST hit is a region hit, and that a WUBLAST hit is a sequence-hit. With users and analysis programs coming and going, the standard should probably be minimally constraining (but be explicit about this in the doc.)

In summary, my (current ;-) preference is to keep the IDL the way it is, and fix the doc.

Cheers,

                                                                      Philip
--
Not getting what you want is sometimes a wonderful stroke of luck.
-----------------------------------------------------------------------------
Philip Lijnzaad, lijnzaad@ebi.ac.uk | European Bioinformatics Institute,rm A2-24
+44 (0)1223 49 4639                 | Wellcome Trust Genome Campus, Hinxton
+44 (0)1223 49 4468 (fax)           | Cambridgeshire CB10 1SD, GREAT BRITAIN
PGP fingerprint: E1 03 BF 80 94 61 B6 FC 50 3D 1F 64 40 75 FB 53
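
The two readings of 'hit' discussed in this thread ('hit = local match' versus 'hit = sequence') and the constraint in the revised text can be illustrated with a small Python sketch. It is not taken from the BSA specification or from any of the messages: the Alignment and SimilaritySearchHit classes below are hypothetical stand-ins rather than the IDL language mapping, and group_local_hits and check_hit are made-up helper names. The sketch assumes each local match arrives as a (target id, pairwise alignment) pair.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Alignment:
    """Hypothetical stand-in: an alignment reduced to the ids of its rows."""
    row_ids: Tuple[str, ...]

    def num_rows(self) -> int:
        return len(self.row_ids)


@dataclass
class SimilaritySearchHit:
    """Hypothetical stand-in: one found sequence plus its local matches."""
    id: str
    alignment_list: List[Alignment] = field(default_factory=list)


def group_local_hits(
    region_hits: List[Tuple[str, Alignment]]
) -> List[SimilaritySearchHit]:
    """Collate 'hit = local match' records into 'hit = sequence' records.

    Hits are returned in the order their target sequence was first seen,
    so local matches that were not adjacent in the input end up together.
    """
    by_target: Dict[str, SimilaritySearchHit] = {}
    for target_id, alignment in region_hits:
        hit = by_target.setdefault(target_id, SimilaritySearchHit(id=target_id))
        hit.alignment_list.append(alignment)
    return list(by_target.values())


def check_hit(query_id: str, hit: SimilaritySearchHit) -> bool:
    """True if every Alignment is pairwise and uses only the query and target ids."""
    allowed = {query_id, hit.id}
    return all(
        a.num_rows() == 2 and set(a.row_ids) == allowed
        for a in hit.alignment_list
    )


# Two local matches on seqA are separated by a match on seqB in the flat list.
flat = [
    ("seqA", Alignment(("query", "seqA"))),
    ("seqB", Alignment(("query", "seqB"))),
    ("seqA", Alignment(("query", "seqA"))),
]
for hit in group_local_hits(flat):
    print(hit.id, len(hit.alignment_list), check_hit("query", hit))
```

A service that prefers the 'hit = local match' reading would simply skip the grouping step and return one SimilaritySearchHit per alignment, each with an alignment_list of length one; either way the revised text asks the implementation to document which reading it uses.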