Issue 3871: IntervalList get_gaps(in AlignmentElement element, in Interval the_interval (biomolecular-ftf)

Source: (Mr. Philip Lijnzaad, p.lijnzaad@med.uu.nl)
Nature: Uncategorized Issue
Severity:
Summary: add an operation
>
> IntervalList get_gaps(in AlignmentElement element, in Interval the_interval);
>
> to the interface Alignment. Its job is to simply return all the gaps of a
> particular sequence in a particular alignment. For symmetry with
> get_seq_region(), the_interval is also given, thus limiting the gaps to those
> that you're interested in.

Resolution: Accepted. Changed the operation signature slightly from the above suggestion.

Revised Text: Add

  CompositeSeqRegion get_gaps(in string key, in Interval the_interval)
      raises(ElementNotInAlignment, IntervalOutOfBounds);

to the Alignment interface (section 2.1.15, IDL change on p. 2-43 and method
block on p. 2-46). Update the IDL in appendix C.1.

The coordinates of a gap would be those of the original sequence; gaps of
length 0 are not allowed. A start == 0 would be before the first
nucleotide/aminoacid; a start = N is a gap between nucleotides/aminoacids
N and N+1 (so start = sequence.length would be after the last).

Actions taken:
September 19, 2000: received issue
May 24, 2001: closed issue

Discussion:

End of Annotations:=====

Date: Tue, 19 Sep 2000 12:47:48 +0100 (BST)
Message-Id: <200009191147.e8JBlmt33657@o2-3.ebi.ac.uk>
X-Authentication-Warning: o2-3.ebi.ac.uk: lijnzaad set sender to lijnzaad@ebi.ac.uk using -f
From: Philip Lijnzaad
To: biomolecular-ftf@omg.org
cc: senger@ebi.ac.uk, muilu@ebi.ac.uk
Subject: Issue 3687b (or whatever number): adding Gap and get_gaps()
Reply-to: lijnzaad@ebi.ac.uk
Content-Type: text

Dear all,

The EBI proposes as a resolution to 3687b (second sub-issue of 3687), to add

  typedef Interval Gap;
  typedef IntervalList GapList;

  GapList get_gaps(in AlignmentElement element, in Interval the_interval);

to the Alignment interface.
One gap would be represented as {start,length}, hence the use of Interval.
The Gap[List] typedefs are to make the semantics clearer (elsewhere, an
Interval is an existing segment; here it denotes a missing segment).

The coordinates of a Gap would be those of the original sequence; gaps of
length 0 are not allowed. A Gap.start == 0 would be before the first
nucleotide/aminoacid; a Gap.start = N is a gap between nucleotides/aminoacids
N and N+1 (so Gap.start = sequence.length would be after the last).

(If the submitters are unhappy about Gap and GapList, Interval and
IntervalList are fine with the EBI too).

Philip
--
When C++ is your hammer, everything looks like a thumb. (Steven Haflich)
-----------------------------------------------------------------------------
Philip Lijnzaad, lijnzaad@ebi.ac.uk \ European Bioinformatics Institute,rm A2-24
+44 (0)1223 49 4639                 / Wellcome Trust Genome Campus, Hinxton
+44 (0)1223 49 4468 (fax)           \ Cambridgeshire CB10 1SD, GREAT BRITAIN
PGP fingerprint: E1 03 BF 80 94 61 B6 FC 50 3D 1F 64 40 75 FB 53

Date: Tue, 19 Sep 2000 12:51:27 +0100 (BST)
Message-Id: <200009191151.e8JBpR437450@o2-3.ebi.ac.uk>
X-Authentication-Warning: o2-3.ebi.ac.uk: lijnzaad set sender to lijnzaad@ebi.ac.uk using -f
From: Philip Lijnzaad
To: juergen@omg.org
cc: biomolecular-ftf@omg.org, senger@ebi.ac.uk, muilu@ebi.ac.uk
Subject: Re: split into new issues:
Reply-to: lijnzaad@ebi.ac.uk
Content-Type: text

Dear Juergen,

I made a mistake: there's another sub-issue:

> could you please, for the biomolecular-ftf, split issue 3687 into two new
> issues:
>
> 3687a:
>
> typedef sequence<Interval> IntervalList; is missing from the spec; can this
> be added?
>
> ------------------------------------------------------------------------
> 3687b:
>
> add an operation
>
> IntervalList get_gaps(in AlignmentElement element, in Interval the_interval);
>
> to the interface Alignment. Its job is to simply return all the gaps of a
> particular sequence in a particular alignment. For symmetry with
> get_seq_region(), the_interval is also given, thus limiting the gaps to those
> that you're interested in.
>
> ------------------------------------------------------------------------

I forgot a third sub-issue:

3687c:

Separate the Alignment interface into more manageable pieces as follows:

  interface SimpleAlignment : CosLifeCycle::LifeCycleObject {
      // ...
      // here, everything BUT get_seq_region()
      // ...
  };

  interface Alignment : SimpleAlignment {
      SeqRegion get_seq_region(
          in AlignmentElement element,
          in Interval the_interval)
        raises(ElementNotInAlignment, IntervalOutOfBounds);
  };

------------------------------------------------------------------------

with apologies,

Philip
--
When C++ is your hammer, everything looks like a thumb. (Steven Haflich)
-----------------------------------------------------------------------------
Philip Lijnzaad, lijnzaad@ebi.ac.uk \ European Bioinformatics Institute,rm A2-24
+44 (0)1223 49 4639                 / Wellcome Trust Genome Campus, Hinxton
+44 (0)1223 49 4468 (fax)           \ Cambridgeshire CB10 1SD, GREAT BRITAIN
PGP fingerprint: E1 03 BF 80 94 61 B6 FC 50 3D 1F 64 40 75 FB 53

Date: Mon, 16 Oct 2000 22:36:51 -0700
From: Scott Markel
Organization: NetGenics, Inc.
X-Mailer: Mozilla 4.75 [en] (WinNT; U)
X-Accept-Language: en
MIME-Version: 1.0
To: BSA FTF
Subject: Re: issues 3870 - 3872 Biomolecular FTF issues
References: <4.2.0.58.20000928141053.00c6f410@emerald.omg.org>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii

Juergen Boldt wrote:
>
> This is issue # 3870
>
> typedef sequence<Interval> IntervalList; is missing from the spec

This behavior is already contained in CompositeSeqRegion, which extends
Interval. An Interval can already represent an array of Intervals.

> --------------------------------
>
> This is issue 3871
>
> IntervalList get_gaps(in AlignmentElement element, in Interval the_interval
>
> add an operation
> >
> > IntervalList get_gaps(in AlignmentElement element, in Interval the_interval);
> >
> > to the interface Alignment. Its job is to simply return all the gaps of a
> > particular sequence in a particular alignment. For symmetry with
> > get_seq_region(), the_interval is also given, thus limiting the gaps to those
> > that you're interested in.

Since this issue was discussed many times during design and we know that
the functionality is provided by repeated calls to get_seq_region(), I'd
prefer not to make this change. Obviously nothing prevents a vendor from
providing this additional method in an extension.

> ------------------------------
>
> This is issue # 3872
>
> Separate the Alignment interface into more manageable pieces
>
> Separate the Alignment interface into more manageable pieces as
> follows:
>
> interface SimpleAlignment : CosLifeCycle::LifeCycleObject {
>     // ...
>     // here, everything BUT get_seq_region()
>     // ...
> }
>
> interface Alignment : SimpleAlignment {
>     SeqRegion get_seq_region(
>         in AlignmentElement element,
>         in Interval the_interval)
>       raises(ElementNotInAlignment, IntervalOutOfBounds);
> }

I'm against this. I don't see what's gained by splitting Alignment into
two interfaces. In either case, one interface or two, an implementation of
get_seq_region() is required.

Scott

--
Scott Markel, Ph.D.       NetGenics, Inc.
smarkel@netgenics.com     4350 Executive Drive
Tel: 858 455 5223         Suite 260
FAX: 858 455 1388         San Diego, CA 92121

X-Authentication-Warning: o2-3.ebi.ac.uk: lijnzaad set sender to lijnzaad@ebi.ac.uk using -f
From: Philip Lijnzaad
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Message-ID: <14828.35323.945505.510495@o2-3.ebi.ac.uk>
Date: Tue, 17 Oct 2000 18:18:51 +0100
To: Scott Markel
Cc: BSA FTF
Subject: get_gaps() [ was: Re: issues 3870 - 3872 Biomolecular FTF issues ]
In-Reply-To: "Re: issues 3870 - 3872 Biomolecular FTF issues" dated Oct 16
References: <4.2.0.58.20000928141053.00c6f410@emerald.omg.org> <39EBE573.D721F2BD@netgenics.com>
X-Mailer: VM 6.76 under Emacs 20.5.1
Reply-To: lijnzaad@ebi.ac.uk
Content-Type: text/plain; charset=us-ascii

MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <14828.19451.896807.499449@o2-3.ebi.ac.uk>
Date: Tue, 17 Oct 2000 13:54:19 +0100
To: Martin Senger, alan
Cc: Philip Lijnzaad, muilu@ebi.ac.uk
Subject: Re: get_gaps
In-Reply-To: "get_gaps" dated Oct 17
References: <14828.12191.786808.473644@o2-3.ebi.ac.uk>
X-Mailer: VM 6.76 under Emacs 20.5.1
Reply-To: lijnzaad@ebi.ac.uk
Bcc: lijnzaad@ebi.ac.uk
--text follows this line--

Dear all,

this is just to clarify our experience and position, as we do feel strongly
about adding get_gaps() to the spec.

>> This is issue 3871
>>
>> IntervalList get_gaps(in AlignmentElement element, in Interval the_interval
>>
>> add an operation
>> >
>> > IntervalList get_gaps(in AlignmentElement element, in Interval the_interval);
>> >
>> > to the interface Alignment. Its job is to simply return all the gaps of a
>> > particular sequence in a particular alignment. For symmetry with
>> > get_seq_region(), the_interval is also given, thus limiting the gaps to those
>> > that you're interested in.
Scott> Since this issue was discussed many times during design and we know that
Scott> the functionality is provided by repeated calls to get_seq_region(), I'd
Scott> prefer not to make this change.

Consider the following alignment:

         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATTC-----------------------------------------------------------------GACGGCCCATG
ATTC---------------------------------------------------------------------------G
A--------------------------------------------------------------------GACGGCCCATG
TTTCTTGTGTCTCAAGGACAGAAGAGACTTCAGGTTCCCCCAGGAGATGGTAAAAGGGAGCCAGTTGCAGAAGGCCCATG

The underlying sequences are:

ATTCGACGGCCCATG
ATTCG
AGACGGCCCATG
TTTCTTGTGTCTCAAGGACAGAAGAGACTTCAGGTTCCCCCAGGAGATGGTAAAAGGGAGCCAGTTGCAGAAGGCCCATG

which know nothing about gaps in them. In the current IDL, Alignment itself
'has' the gaps, but only in a very implicit way. In all likelihood, an
implementation will have a much more explicit internal representation of
where the gaps are.

Now, in the absence of a SingleCharacterAlignmentEncoder (which is an
optional interface) or a get_gaps() method (which we think should be added),
the only way for a client to find out where the gaps in a particular
AlignmentElement are is something along the following lines:

    Alignment alt = ...;
    AlignmentElement []elts = alt.get_alignment_elements(...);
    AlignmentElement elt = elts[x];
    int columns = alt.num_columns();
    Interval segment = new Interval(0, 1);      // start length
|   for(int pos=1; pos<=columns; pos++) {
|     segment.start(pos);                       // set start to pos
|     String found = alt.get_seq_region(elt, segment);  // one remote call per column
|     if ( found.equals("") )
|       character = "-";
|     else
|       character = found;
|     ...
|   }

That is one round trip to the server per column (and this is for just _one_
element). Maybe this can be optimized to some log(N) algorithm, but I would
strongly object to forcing a 'stupid' client to implement functionality that
could be made available much more easily by the server (again, the server
most likely has some fairly explicit internal representation of the gaps in
its Alignments).

Note that Alignments can be really big: thousands of columns is not
exceptional in the genomic era. For alignments with M sequences, the problem
is of course M times as big. The whole thing is compounded for Assemblies,
where the number of alignment columns is large, but the sequences are
relatively short. I.e., an Assembly consists _mostly_ of gaps, so you spend
most time finding out exactly where those gaps are. In other words, 'finding
out where the gaps are' is a central part of the functionality of an
Alignment.

Just for the sake of argument, assume an alignment of 10 sequences, 10,000
columns, one round trip takes 50 ms (in a local area network; Internet wide
deployment is completely out of the question). Rendering this alignment in
the simplistic way shown above would take:

    10 * 10000 * 50 ms == 5000 seconds,

or roughly one hour and 23 minutes. (If the log(N) assumption is correct, it
would take on the order of 5 seconds). I believe these numbers are realistic,
and they demonstrate that the current IDL just doesn't scale.

Now, if get_gaps() is available, life gets much easier (differences are only
in the area marked '|'):

|   Interval [] gaps = alt.get_gaps(elt, segment);  // one remote call
|   String seq = ...;                               // original sequence
|   StringBuffer withGaps = new StringBuffer(seq);
|
|   // insert from the last gap to the first, so that earlier
|   // coordinates (which refer to the original sequence) stay valid
|   for (int i=gaps.length-1; i>=0; i--) {
|     Interval gapCoords = gaps[i];
|     String gap = StringUtils.createRepeatedString(gapCoords, "-");
|     withGaps.insert(gapCoords.start, gap);
|   }

Alignment is one of the central data types in the BSA standard, so I think
there really is no option but to get it right.
Leaving something like get_gaps() to vendor-specific extensions leaves the
BSA standard close to unusable as far as Alignments are concerned.

It will come as no surprise that EBI will vote in favour of adding a
get_gaps() method :-)

With kind regards,

Philip
--
Time passed, which, basically, is its job. -- Terry Pratchett (in: Equal Rites)
-----------------------------------------------------------------------------
Philip Lijnzaad, lijnzaad@ebi.ac.uk \ European Bioinformatics Institute,rm A2-24
+44 (0)1223 49 4639                 / Wellcome Trust Genome Campus, Hinxton
+44 (0)1223 49 4468 (fax)           \ Cambridgeshire CB10 1SD, GREAT BRITAIN
PGP fingerprint: E1 03 BF 80 94 61 B6 FC 50 3D 1F 64 40 75 FB 53

Date: Tue, 17 Oct 2000 16:16:08 -0700
From: Scott Markel
Organization: NetGenics, Inc.
X-Mailer: Mozilla 4.75 [en] (WinNT; U)
X-Accept-Language: en
MIME-Version: 1.0
To: BSA FTF
Subject: [OMG-BSA] proposed resolutions #3
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii

Here's a third batch of proposed resolutions. Based on email traffic
there seems to be a consensus on the proposals. In a few cases the
consensus is 2 of 3. In most cases it's 2 of 2.

I'm not reproducing the entire text of the individual issues. You can
find that at http://cgi.omg.org/issues/biomolecular-ftf.html.

This message is *not* a vote. I plan to send out a message kicking off
a vote probably on Friday. The timing depends on the responses to this
message.

Scott

========================================================================

Issue 3870: typedef sequence<Interval> IntervalList; is missing from the spec

Proposed resolution:
No change. CompositeSeqRegion already has this behavior.
------------------------------------------------------------------------

Issue 3871: IntervalList get_gaps(in AlignmentElement element, in Interval the_interval

Proposed resolution:
Add

  CompositeSeqRegion get_gaps(in string key, in Interval the_interval)
      raises(ElementNotInAlignment, IntervalOutOfBounds);

to the Alignment interface (section 2.1.15, IDL change on p. 2-43 and
method block on p. 2-46). Update the IDL in appendix C.1.

The coordinates of a gap would be those of the original sequence; gaps of
length 0 are not allowed. A start == 0 would be before the first
nucleotide/aminoacid; a start = N is a gap between nucleotides/aminoacids
N and N+1 (so start = sequence.length would be after the last).

------------------------------------------------------------------------

Issue 3872: Separate the Alignment interface into more manageable pieces

Proposed resolution:
No change.

------------------------------------------------------------------------

Issue 3874: add an Identifier to SeqRegion

Proposed resolution:
Add "sequence Identifier" to the SeqRegion paragraph on p. 2-6 (section
2.1.5). Add "public Identifier id;" to the IDL on the bottom of the same
page. Add a description block describing the ID as a sequence ID on
p. 2-7. Update the IDL in appendix C.1.

------------------------------------------------------------------------

Issue 3875: inheritance in annotation iterators

Proposed resolution:
Add the following to the single sentence description in section 2.1.7
(p. 2-15)

    SeqAnnotationIterator is not used directly in this specification,
    but is provided as a convenience for vendor-specific IDL extensions
    and future OMG specifications where a collection of Annotations
    contains only SeqAnnotations. SeqAnnotationIterator is an optional
    interface.

In addition, add SeqAnnotationIterator to the list of optional interfaces
in section 1.1 (p. 1-1).
------------------------------------------------------------------------

Issue 3924: octet iterator

Proposed resolution:
Out of scope.

------------------------------------------------------------------------

Issue 3962: clarification of strand_type and CompositeSeqRegions

Proposed resolution:
Add the following to the first paragraph under CompositeSeqRegion
(section 2.1.5, p. 2-7) and to region_operator's description (section
2.1.5, p. 2-8).

    All CompositeSeqRegions are expected to be translated in a
    depth-first traversal, along each node of the tree represented by
    the CompositeSeqRegions. This includes those nodes that have
    region_operator equal to JOIN or ORDER.

Add the following to the descriptions of BioSequence's seq_interval()
(section 2.1.9, p. 2-24) and NucleotideSequence's translate_seq_region()
(section 2.1.10, p. 2-30).

    If the StrandType is minus, the string returned should be taken as
    reverse-complemented.

Add the following to the description of NucleotideSequence's
reverse_complement_interval() (section 2.1.10, p. 2-28).

    If the StrandType is minus, the string returned should be taken as
    reverse-complemented. This will result in a no-op, i.e., the
    strand_type leads to reverse-complementing, which is then
    reverse-complemented again due to the semantics of the method,
    resulting in the same string that would be returned from
    seq_interval().

------------------------------------------------------------------------

Issue 3963: clarification of SeqRegionOperator.ORDER

Proposed resolution:
Add the following to the description of the enum SeqRegionOperator's
ORDER value (section 2.1.5, p. 2-8).

    Typically, it is used to represent a discontinuous region to which a
    descriptive annotation pertains.
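As a small illustration of the no-op described under issue 3962 above: reverse-complementing a string twice gives back the original string. The following is only a minimal sketch under stated assumptions. It works on plain DNA strings containing A, C, G and T, and the class and method names (StrandSketch, reverseComplement) are hypothetical and are not part of the BSA IDL.

```java
final class StrandSketch {

    /** Complement of a single base; only A, C, G and T are handled (an assumption of this sketch). */
    static char complement(char base) {
        switch (base) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:
                throw new IllegalArgumentException("unexpected base: " + base);
        }
    }

    /** Reads the sequence back to front and complements every base. */
    static String reverseComplement(String seq) {
        StringBuilder out = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            out.append(complement(seq.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String plus = "ATTCG";
        String minus = reverseComplement(plus);
        System.out.println(minus);                     // CGAAT
        System.out.println(reverseComplement(minus));  // ATTCG
    }
}
```

Applying the operation once for the minus strand and once more for the method's own semantics therefore cancels out, which is the situation the proposed text calls a no-op.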
------------------------------------------------------------------------

Issue 3964: OutOfBounds exceptions for circular sequences if start = 0

Proposed resolution:
Add text similar to the following (BioSequence and Interval replaced by
derived types, as appropriate) to the descriptions of

  * IntervalOutOfBounds (section 2.1.9, p. 2-21)
  * SeqRegionOutOfBounds (section 2.1.9, p. 2-21)
  * SeqAnnotationOutOfBounds (section 2.1.23, p. 2-63)

and to the exceptions of

  * BioSequence's seq_interval() (section 2.1.9, p. 2-24)
  * NucleotideSequence's reverse_complement_interval() (section 2.1.10, p. 2-28)
  * NucleotideSequence's translate_seq_region() (section 2.1.10, p. 2-30)

    Raises IntervalOutOfBounds if the Interval's start is less than 1 or
    if its start+length-1 is greater than the length of the BioSequence.
    If a BioSequence represents circular DNA, then this exception should
    be raised if the Interval's start is less than 1 or greater than the
    length of the BioSequence, or if its length is greater than the
    length of the BioSequence.

------------------------------------------------------------------------

Issue 3965: add BioSequence.get_annotations_by_name()?

Proposed resolution:
Out of scope.

------------------------------------------------------------------------

--
Scott Markel, Ph.D.       NetGenics, Inc.
smarkel@netgenics.com     4350 Executive Drive
Tel: 858 455 5223         Suite 260
FAX: 858 455 1388         San Diego, CA 92121

Date: Tue, 17 Oct 2000 15:27:12 -0700
From: Scott Markel
Organization: NetGenics, Inc.
X-Mailer: Mozilla 4.75 [en] (WinNT; U)
X-Accept-Language: en
MIME-Version: 1.0
To: lijnzaad@ebi.ac.uk
CC: BSA FTF
Subject: Re: get_gaps() [ was: Re: issues 3870 - 3872 Biomolecular FTF issues ]
References: <4.2.0.58.20000928141053.00c6f410@emerald.omg.org> <39EBE573.D721F2BD@netgenics.com> <14828.35323.945505.510495@o2-3.ebi.ac.uk>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset=us-ascii
Philip,

Philip Lijnzaad wrote:
>
> Dear all,
>
> this is just to clarify our experience and position, as we do feel strongly
> about adding get_gaps() to the spec.

I've noticed. :)

> Consider the following alignment:
>
>          1         2         3         4         5         6         7         8
> 12345678901234567890123456789012345678901234567890123456789012345678901234567890
> ATTC-----------------------------------------------------------------GACGGCCCATG
> ATTC---------------------------------------------------------------------------G
> A--------------------------------------------------------------------GACGGCCCATG
> TTTCTTGTGTCTCAAGGACAGAAGAGACTTCAGGTTCCCCCAGGAGATGGTAAAAGGGAGCCAGTTGCAGAAGGCCCATG

Small aside: Please note that this particular example can be represented
by 5 columns.

> It will come as no surprise that EBI will vote in favour of adding a
> get_gaps() method :-)

I'll go along with adding get_gaps(), but not IntervalList or the proposed
SimpleAlignment/Alignment split.

Scott

--
Scott Markel, Ph.D.       NetGenics, Inc.
smarkel@netgenics.com     4350 Executive Drive
Tel: 858 455 5223         Suite 260
FAX: 858 455 1388         San Diego, CA 92121
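To make the gap coordinate convention from the accepted resolution concrete (coordinates refer to the original sequence; start 0 is before the first residue; start N is between residues N and N+1), here is a minimal client-side sketch in Java. It is an illustration under stated assumptions, not the BSA API: Gap is a hypothetical stand-in for an Interval with only a start and a length, the gaps are assumed to arrive sorted by ascending start, and GapSketch, withGaps and gapsOf are names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GapSketch {

    /** Hypothetical stand-in for the spec's Interval: a start and a length only. */
    static final class Gap {
        final int start;   // number of residues that precede the gap (0 = before the first)
        final int length;  // number of gap columns; gaps of length 0 are not allowed

        Gap(int start, int length) {
            if (start < 0 || length <= 0) {
                throw new IllegalArgumentException("bad gap: " + start + "," + length);
            }
            this.start = start;
            this.length = length;
        }
    }

    /** Rebuilds a gapped row from the ungapped sequence and its gaps (ascending by start). */
    static String withGaps(String seq, List<Gap> gaps) {
        StringBuilder row = new StringBuilder(seq);
        // Walk backwards so that inserts do not shift the positions still to be used.
        for (int i = gaps.size() - 1; i >= 0; i--) {
            Gap g = gaps.get(i);
            if (g.start > seq.length()) {
                throw new IllegalArgumentException("gap starts past the end: " + g.start);
            }
            char[] dashes = new char[g.length];
            Arrays.fill(dashes, '-');
            row.insert(g.start, dashes);
        }
        return row.toString();
    }

    /** Inverse direction: reads the gaps off a gapped row, in original-sequence coordinates. */
    static List<Gap> gapsOf(String row) {
        List<Gap> gaps = new ArrayList<>();
        int residues = 0;  // residues seen so far
        int i = 0;
        while (i < row.length()) {
            if (row.charAt(i) == '-') {
                int runStart = i;
                while (i < row.length() && row.charAt(i) == '-') {
                    i++;
                }
                gaps.add(new Gap(residues, i - runStart));
            } else {
                residues++;
                i++;
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        // 15 residues plus one 65-column gap after residue 4 gives an 80-column row.
        String row = withGaps("ATTCGACGGCCCATG", Arrays.asList(new Gap(4, 65)));
        System.out.println(row.length());  // 80
        System.out.println(row);
    }
}
```

Under the assumption that the rows of the example alignment are 80 columns wide, as the ruler suggests, the first row corresponds to a single gap with start 4 and length 65. The whole row is then rebuilt from one list of gaps rather than from one remote call per column, which is the scaling argument made in the thread.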