Biomolecular Sequence Analysis Entities
RFP Response
Joint Initial Submission

EMBL-EBI (European Bioinformatics Institute)
NetGenics, Inc.

OMG Document lifesci/2001-03-04

© Copyright 2001 by EMBL-EBI and NetGenics Inc.

The organisations listed above hereby grant a royalty-free license to the Object Management Group, Inc. (OMG) for world-wide distribution of this document or any derivative works thereof, so long as the OMG reproduces the copyright notices and the below paragraphs on all distributed copies.

The material in this document is submitted to the OMG for evaluation. Submission of this document does not represent a commitment to implement any portion of this specification in the products of the submitters.

WHILE THE INFORMATION IN THIS PUBLICATION IS BELIEVED TO BE ACCURATE, THE COMPANIES LISTED ABOVE MAKE NO WARRANTY OF ANY KIND WITH REGARD TO THIS MATERIAL INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. The companies listed above shall not be liable for errors contained herein or for incidental or consequential damages in connection with the furnishing, performance or use of this material. The information contained in this document is subject to change without notice.

This document contains information, which is protected by copyright. All Rights Reserved. Except as otherwise provided herein, no part of this work may be reproduced or used in any form or by any means graphic, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems without the permission of one of the copyright owners. All copies of this document must include the copyright and other information contained on this page.

The copyright owners grant member companies of the OMG permission to make a limited number of copies of this document (up to fifty copies) for their internal use as part of the OMG evaluation process.

RESTRICTED RIGHTS LEGEND.
Use, duplication, or disclosure by government is subject to restrictions as set forth in subdivision (c) (1) (ii) of the Right in Technical Data and Computer Software Clause at DFARS 252.227.7013.

CORBA, OMG and Object Request Broker are trademarks of the Object Management Group.

Table of Contents

1 PREFACE
  1.1 Submission Contact Points
  1.2 Supporting Organisations
  1.3 Acknowledgements
  1.4 Proof of Concept
2 INTRODUCTION
  2.1 General remarks
  2.2 Overview
  2.3 module DsLSRBioObjects
    2.3.1 BioObject
    2.3.1 Annotatable
    2.3.2 AlignmentElement
    2.3.3 Assembly
    2.3.4 BioObjectIdentifierResolver
    2.3.5 Gene
    2.3.6 Alphabet
    2.3.7 FiniteAlphabet
    2.3.8 Symbol
    2.3.9 BasisSymbol
    2.3.10 AtomicSymbol
    2.3.11 SymbolArray
    2.3.12 SymbolIterator
    2.3.13 SequenceCluster
    2.3.14 AssemblyConstruction
    2.3.15 BioCollection
    2.3.16 SequenceCollection
    2.3.17 TreeNode
    2.3.18 Taxon
    2.3.19 WeightMatrix
    2.3.20 AbstractMatrix
    2.3.21 DMatrix
3 RESPONSE TO RFP REQUIREMENTS
  3.1 Mandatory RFP Requirements
  3.2 Optional RFP Requirements
  3.3 Relationship to existing or emerging OMG Specifications
  3.4 Issues To Be Discussed
4 GENERAL DESCRIPTION
  4.1 General Aims
  4.2 Design Issues
    4.2.1 Compatibility with BioJava et al.
    4.2.2 OMG's New Model Driven Architecture
    4.2.3 BSA Implementation Lessons
    4.2.4 Identifiers and ControlledVocabularies
    4.2.5 Implementation of Lightweight Sequence Model
    4.2.6 Query Mechanism for Annotations and Properties
5 MODULES AND INTERFACES
  5.1 Modifications to the BSA specification
    5.1.1 BioSequence
    5.1.2 AlignmentElement
  5.2 New Entities
    5.2.1 Exceptions
    5.2.2 Annotatable
    5.2.3 BioObject
    5.2.4 Assembly
    5.2.5 BioObjectIdentifierResolver
    5.2.6 Gene
    5.2.7 Alphabet
    5.2.8 FiniteAlphabet
    5.2.9 Symbol
    5.2.10 BasisSymbol
    5.2.11 AtomicSymbol
    5.2.12 SymbolArray
    4.2.7 SymbolIterator
    5.2.13 SymbolList
    5.2.14 SequenceCluster
    5.2.15 AssemblyConstruction
    5.2.16 BioCollection
    5.2.17 SequenceCollection
    5.2.18 FuzzySeqRegion
    5.2.19 LabeledSeqRegion
    5.2.20 TreeNode
    5.2.21 Taxon
    5.2.22 TaxonList
    5.2.23 WeightMatrix
    5.2.24 AbstractMatrix
    5.2.25 DMatrix
    5.2.26 DoubleArray
    5.2.27 DoubleList
6 GLOSSARY
7 REFERENCES

1 Preface
This document describes a proposal for standards in response to the RFP Biomolecular Sequence Analysis Entities, Object Management Group (OMG) document number lifesci/2000-12-16.

1.1 Submission Contact Points
Juha Muilu
European Bioinformatics Institute
Wellcome Trust Genome Campus
Hinxton, Cambridge CB10 1SD
United Kingdom
(+44) 1223 494624
muilu@ebi.ac.uk

Scott Markel
NetGenics, Inc.
4350 Executive Drive
Suite 260
San Diego, CA 92121
USA
(+1) 858 455 5223
smarkel@netgenics.com

1.2 Supporting Organisations
The following organisations have been involved in the process of developing, prototyping and/or reviewing this submission. The submitters of this response thank them for participating and giving their valuable input.

Sanger Centre, United Kingdom

1.3 Acknowledgements
The authors of this document wish to express their appreciation to those listed below (in alphabetical order) for their contributions of ideas and experience. Ultimately, the ideas expressed in this document are those of the authors and do not necessarily reflect the views or ideas of these individuals, nor does the inclusion of their names imply an endorsement of the final product.

Ewan Birney birney@ebi.ac.uk
Greg Cox gcox@netgenics.com
Michael Dickson mdickson@netgenics.com
Heikki Lehväslaiho heikki@ebi.ac.uk
Philip Lijnzaad lijnzaad@ebi.ac.uk
Matthew Pocock mrp@sanger.ac.uk
Alan Robinson alan@ebi.ac.uk
Martin Senger senger@ebi.ac.uk
T. Alphonse Thanaraj thararaj@ebi.ac.uk

1.4 Proof of Concept
This proposal is based on both the Biomolecular Sequence Analysis (dtc/2000-11-01) implementations by EBI and NetGenics and the BioJava (www.biojava.org), BioCorba (www.biocorba.org) and BioPerl (www.bioperl.org) initiatives.

2 Introduction

2.1 General remarks
In this document we describe a standard for representing biological entities that extend the Biomolecular Sequence Analysis Specification (BSA). Some modifications are also made to the BSA specification itself. These modifications mainly factor out commonalities between existing and new entities.

2.2 Overview
This section contains a brief overview of the main (note: lists are missing) new bio-objects proposed in this submission.

2.3 module DsLSRBioObjects
The new interfaces and structs are additions to the DsLSRBioObjects module.

2.3.1 BioObject
BioObject is used as a generalisation for biological entities that have an existence and life cycle of their own. Typically these entities have a name and an id, and they inherit from the CosLifeCycleObject interface.

2.3.1 Annotatable
The entity provides methods for handling annotations. The interface corresponds to the Annotatable interface of BioJava, except that it can also contain SeqAnnotations, i.e., sequence features.

2.3.2 AlignmentElement
AlignmentElement, from the BSA specification, corresponds to one 'row' in a traditional alignment. AlignmentElement now wraps a BioObject, not an Object.

2.3.3 Assembly
The interface extends the Alignment and the BioCollection interfaces. Assembly defines an entity used to represent a unique alignment of two or more sequences, which produce one or more contiguous consensus sequences (or consensus elements).

2.3.4 BioObjectIdentifierResolver
This general resolver for BioObjects is analogous to the BioSequenceIdentifierResolver in the BSA specification.

2.3.5 Gene
Gene inherits from BioObject.
The entity provides an additional method for getting gene products as cDNA transcripts (reverse transcripts from mRNA).

2.3.6 Alphabet
The concept of Alphabet is adapted from the BioJava project. Modifications have been made mainly to make it consistent with the BSA specification.

2.3.7 FiniteAlphabet
FiniteAlphabet inherits from Alphabet. It represents an alphabet that is based on a finite set of Symbols.

2.3.8 Symbol
Symbol represents a single token in the sequence. A Symbol can have multiple synonyms or matches within the same Alphabet, which makes it possible to represent ambiguity codes and gaps.

2.3.9 BasisSymbol
BasisSymbol inherits from Symbol. It represents a Symbol that can be composed from other symbols.

2.3.10 AtomicSymbol
AtomicSymbol inherits from BasisSymbol. It is a marker interface that represents an unambiguous symbol.

2.3.11 SymbolArray
SymbolArray represents a list of Symbols that belong to an Alphabet. The sequence string is the stringified representation of the list.

2.3.12 SymbolIterator
SymbolIterator is a strongly typed iterator for Symbols.

2.3.13 SequenceCluster
The SequenceCluster, which inherits from SequenceCollection, represents a collection of sequences that share some transitive or non-transitive sequence similarity.

2.3.14 AssemblyConstruction
AssemblyConstruction inherits from ClusterConstruction. It represents a cluster construction that is based on multiple/pairwise sequence alignments.

2.3.15 BioCollection
BioCollection inherits from BioObject. The BioCollection is a base class for collections.

2.3.16 SequenceCollection
SequenceCollection, a collection of BioSequences, can be used, for example, to describe data sets ranging from sequence databases to clone libraries, which is an important part of the SequenceCluster annotation.

2.3.17 TreeNode
TreeNode represents a general tree node with a parent and children.

2.3.18 Taxon
Taxon extends the TreeNode. It is used to store taxonomy data.
2.3.19 WeightMatrix
WeightMatrices are used to describe variations and intensities of signals, such as quality measures of sequences.

2.3.20 AbstractMatrix
Abstraction for two-dimensional matrices.

2.3.21 DMatrix
DMatrix inherits from AbstractMatrix and contains a two-dimensional array of double precision floating point numbers.

3 Response to RFP Requirements

3.1 Mandatory RFP Requirements
The Sequence Analysis Entities Proposal shall:
- Define the following biomolecular sequence analysis entities in OMG IDL:
  - Biomolecular sequence alphabet (including ambiguous and complementary bases)
    See Alphabet in section 5.2.7.
  - Fuzzy locations
    See FuzzySeqRegion in section 5.2.18.
  - Weight matrices
    See WeightMatrix in section 5.2.23.
  - Patterns, including profiles and HMMs
    Patterns are not addressed in this initial submission, but will be covered in the revised submission.
  - Phylogenetic trees
    Phylogenetic trees are not addressed in this initial submission, but will be covered in the revised submission.
  - Assembly (including trace and quality data)
    See Assembly in section 5.2.4.
  - Composite (nested) annotations
    The submitters were able to design a solution for Gene without needing to introduce composite annotations. Composite annotations had been added to the RFP requirements specifically because it was thought that they would prove useful in the design of Genes.
  - Gene
    See Gene in section 5.2.6.
  - GeneticCode extensions, including initiators and terminators
    GeneticCode extensions are not addressed in this initial submission, but will be covered in the revised submission.
  - BioSequence extensions, including biomolecular sequence alphabet
    See BioSequence in section 5.1.1.
  - Taxonomy
    See Taxon in section 5.2.21.
  For use by the analysis machinery defined in the Biomolecular Sequence Analysis specification, sequences (lists) of these entities shall be defined.
    This has been done.
- Clearly define the biological concepts expressed by these entities.
  The biological concepts expressed by the entities are clearly defined.
- Include a glossary defining terms used in the proposal.
  A glossary is given in Section 6.

3.2 Optional RFP Requirements
There are no optional requirements.

3.3 Relationship to existing or emerging OMG Specifications
- Property Service
  Some entities use a CosProperty::PropertySet attribute to specify additional characteristics.
- Collection Service
  We have considered using the Collection Service to provide the functionality desired in BioCollection and SequenceCollection. However, it was felt that requiring implementers to support all of the CosCollection::Collection functionality is not realistic. While the full generality of this service is useful, most of it is not needed and the burden of implementing this would be prohibitive.
- LSR Biomolecular Sequence Analysis (dtc/2000-11-01)
  The current RFP explicitly requests extensions to the BSA specification. The changes are being made directly to the BSA DsLSRBioObjects module.

3.4 Issues To Be Discussed
Address compatibility with the existing Biomolecular Sequence Analysis specification, and recommend migration paths where necessary.

The submitters are also implementers of the Biomolecular Sequence Analysis specification. After discussions with the BioJava and BioCorba communities it was decided that the best course of action for the life sciences community was to rewrite portions of the BSA specification to bring it in line with the BioJava and BioCorba initiatives.

4 General description
This section describes the principles that are used by many of the components in this proposal, along with explanations of the design rationale. The more detailed descriptions provided in the next section, Modules and Interfaces, refer to those provided here.

4.1 General Aims
The submitters have done two things in this submission: made changes to existing BSA entities and added new BSANE entities.
Where possible, the changes to existing BSA entities were kept to a minimum.

4.2 Design Issues

4.2.1 Compatibility with BioJava et al.
While the BSA submitters were implementing that specification, several new Open Source efforts started. BioJava and BioCorba were patterned after the very successful BioPerl effort. The BSA submitters are working with the BioJava, BioCorba, and BioPerl communities to bring the OMG specifications in line with the Open Source efforts. Many of the new BSANE entities are taken from the BioJava interfaces.

4.2.2 OMG's New Model Driven Architecture
The submitters are leveraging the new MDA approach in the work to merge the BSA specification and the BSANE submissions with the Open Source efforts discussed above. It is expected that the revised submission will contain a platform independent model (PIM), along with a platform specific model (PSM) for CORBA. The revised submission may also have a PSM for XML.

4.2.3 BSA Implementation Lessons
The implementers had some difficulties in implementing the BSA specification's DsLSRAnalysis module because the XML metadata and the IDL-based bio-objects were not based on a common PIM. Specifically, it was hard to map analysis results to bio-objects. This is the reason why sequences of new bio-objects have been defined.

The submitters have also rethought the use of a general name/value pair mechanism as a placeholder for additional data. The Annotatable interface provides more general access to this behavior.

4.2.4 Identifiers and ControlledVocabularies
The reader is directed to the BSA and Genomic Maps specifications for details on the use of Identifiers and ControlledVocabularies.

4.2.5 Implementation of Lightweight Sequence Model
There is a concern that the current sequence model with annotations is too heavyweight for large-scale genomic applications. A solution is to refactor the BioSequence into three layers, as done in BioCorba.
The first layer contains just an accessor method for getting the sequence string; the second layer adds identifiers and a name. The third layer corresponds to the current implementation.

4.2.6 Query Mechanism for Annotations and Properties
There may be a need to provide an advanced query mechanism for annotations. Currently the annotations can be queried only by name and/or sequence region. A solution is to use a separate utility interface or to implement the functionality directly in the Annotatable interface.

5 Modules and Interfaces
This chapter describes the attributes and methods, as well as the semantics of the standard, in more detail. New and modified BSA entities are shown in a light colour in UML diagrams.

5.1 Modifications to the BSA specification

5.1.1 BioSequence
BioSequence inherits from the BioObject and SymbolArray interfaces. BioObject provides attributes for id, name, description and the_basis, and accessor methods for annotations. The SymbolArray provides a mechanism for handling sequence strings and alphabets. The BioSequence does not provide any attributes or methods of its own.

The methods and attributes inherited from the BioObject are the same as those defined in the BSA specification for the BioSequence, except for one method that allows getting annotations by name.

The interface components inherited from the SymbolArray provide the accessor methods for getting the sequence string and a sequence interval, as defined in the BSA specification. In addition, the interface provides the sequence Alphabet, which forms the basis set of Symbols needed to construct the sequence string. The interface also has a method for getting all Symbols over the sequence.

5.1.2 AlignmentElement

readonly attribute string key
The unique reference to each AlignmentElement. Note that there may be more than one copy of a particular BioObject in the Alignment.
readonly attribute SeqRegion seq_region
The seq_region represents the coordinates of a particular segment of the element that is aligned in the current Alignment, and that is considered one 'row' in the Alignment. The coordinates are those of the original BioObject, not those of the Alignment.

readonly attribute BioObject element
The actual BioObject which is aligned. Usually the element is of type BioSequence.

readonly attribute Annotatable the_annotation
The attribute contains additional information about the AlignmentElement in the context of the Alignment. For example, the following information is often needed:
- The orientation of assembled sequences. The original orientation is part of the sequence annotation. If the original orientation is not available, it is assumed to be STRAND_PLUS.
- Masked sequence regions, using CompositeSeqRegions. Masked regions are parts of the original sequence that do not contribute to the consensus sequence (contig).
- In the case of consensus elements (see Assembly), at least the following information is needed:
  - Method used to calculate the consensus
  - Possible goodness or quality measure of the consensus

5.2 New Entities

5.2.1 Exceptions

exception IllegalSymbolException { string reason; };
The exception is raised by Alphabet::get_ambiguity(in SymbolList symbols) and Alphabet::get_symbol(in SymbolList symbols) if one of the symbols in the list is not part of the Alphabet.

5.2.2 Annotatable
The entity provides methods for handling annotations. The interface corresponds to the Annotatable interface of BioJava, except that it can also contain SeqAnnotations, i.e., sequence features.

It is recognised that extended query facilities may be needed to access the Annotations. The query facilities can be implemented using a separate utility interface, which provides methods for extracting the annotations from the Annotatable instances.
AnnotationList get_annotations_by_name(in string name) raises (IdentifierNotResolvable)
Provides access to the annotations by name. Raises IdentifierNotResolvable if the name is not part of the controlled vocabulary in the context of the entity.

Annotation operations from BioSequence
Other annotation operations from the BSA BioSequence, such as get_annotations() and add_annotation(), are moved to this interface.

5.2.3 BioObject
BioObject is used as a generalisation for biological entities that have an existence and life cycle of their own. Typically these entities have a name and an id, and they inherit from the CosLifeCycleObject interface.

readonly attribute Identifier id
The id attribute represents an ID for the BioObject. Typically the database name and key will be encoded in the Identifier. The id shall not be empty.

readonly attribute string name
A human readable common name for the BioObject (such as an EST or gene name).

readonly attribute string description
The description attribute, a concise description of the BioObject, typically includes functional information, e.g. the contents of the description line from a FASTA file.

readonly attribute Basis the_basis
The the_basis attribute specifies whether the BioObject is based on experimental (BASIS_EXPERIMENTAL) or computational (BASIS_COMPUTATIONAL) evidence, or both (BASIS_BOTH). The information can also be not known (BASIS_NOT_KNOWN), or not applicable (BASIS_NOT_APPLICABLE) if the underlying BioObject represents an entity that is neither computational nor experimental, for example the Alphabet.

5.2.4 Assembly
The interface extends the Alignment and the BioCollection interfaces. Assembly defines an entity used to represent a unique alignment of two or more sequences, which produce one or more contiguous consensus sequences (or consensus elements).
In the case of bio-sequences, the identifier of a consensus sequence should be assembly_id.N/consensus_id.M, where consensus_id.M is the id and version number of the consensus and assembly_id.N is the id and version number of the assembly.

An Assembly may have references to other assemblies. This is used to describe the situation where assemblies are logically but not physically connected. For example, they can have sequences that are derived from both ends of the same clone. The name of the linking property, e.g., cloneid, should be part of the annotation of the assembly.

readonly attribute AlignmentElementList consensus_elements
List of consensus elements produced from the underlying fixed Assembly. This list can be empty if consensus elements are not available.

AssemblyList get_linked()
Returns the list of Assemblies that are linked with the Assembly, for example assemblies that have sequences derived from the same clone but do not have a connecting sequence between them.

5.2.5 BioObjectIdentifierResolver
This general resolver for BioObjects is analogous to the BioSequenceIdentifierResolver in the BSA specification.

BioObject resolve(in Identifier id) raises (IdentifierNotFound, IdentifierNotResolvable, IdentifierNotUnique)
The resolve() method provides the BioObject for the particular Identifier.
Raises IdentifierNotFound if the database and the identifier within the database can be resolved but the Identifier is not present.
Raises IdentifierNotResolvable if the database and the identifier within the database cannot be resolved, such that the Identifier cannot even be searched for.
Raises IdentifierNotUnique if the Identifier specification is ambiguous and returns more than one object.

5.2.6 Gene
Inherits from BioObject. The entity provides an additional method for getting gene products as cDNA transcripts (reverse transcripts from mRNA). Complementary DNA is used to describe the products because those can be compared directly with the genomic DNA and cDNA databases (like ESTs).
The cDNAs should contain proper annotation about the poly-A tail, exon boundaries, etc. All the SeqAnnotations must be in the coordinates of the cDNA. The Gene object has the gene structure annotations in the coordinates of the original database entry (or a view) defined by the id. A controlled vocabulary is used for the annotations.

readonly attribute NucleotideSequenceList cdna_transcripts
The attribute cdna_transcripts describes the products of the Gene as cDNAs. The exon annotation of the Gene must correspond to the cDNA.

NucleotideSequence get_genomic_sequence(in unsigned long n) raises (SeqRegionOutOfBounds)
Returns the underlying genomic sequence locus extended n bases beyond the first and last element of the Gene. The resulting NucleotideSequence object has annotations about the gene structures etc. in the coordinates of the sequence fragment. The SeqRegionOutOfBounds exception is raised if the sequence fragment cannot be extended as far as requested.

5.2.7 Alphabet
The concept of Alphabet is adapted from the BioJava project. Modifications have been made mainly to make it consistent with the BSA specification. Notably, BioJava's SymbolList is renamed to SymbolArray, and SymbolList is used to represent a sequence of Symbols, as is done with the other IDL sequences. More information about Alphabets and Symbols can be found at www.biojava.org.

Alphabet extends BioObject. An Alphabet contains a set of symbols, which can be concatenated to form symbol lists. A sequence string, for example, is the stringified representation of a symbol list (the tokens of its symbols). An Alphabet can be composed from other alphabets.

readonly attribute AlphabetList alphabets
Ordered list of Alphabets, which make up this compound alphabet.
The order is important because it defines the order in which symbols are assigned to alphabets in the get_symbol method (see below).

boolean contains(in Symbol s)
Returns true if the Symbol belongs to the Alphabet.

Symbol get_ambiguity(in SymbolList symbols) raises (IllegalSymbolException)
Returns an ambiguity symbol, which represents the list of symbols. The order of symbols in the list is not significant. All symbols in the list must be members of this alphabet; otherwise IllegalSymbolException is raised.

Symbol get_symbol(in SymbolList symbols) raises (IllegalSymbolException)
Returns a Symbol, which represents the ordered list of symbols given as a parameter. Each symbol in the list must be a member of a different sub-alphabet, in the order defined by the alphabets attribute. For example, codons can be represented by a compound Alphabet of three DNA Alphabets, in which case the get_symbol(List[a, g, t]) method of the Alphabet returns the Symbol for the codon agt. IllegalSymbolException is raised if the members of symbols are not Symbols over the alphabets defined by the alphabets attribute.

5.2.8 FiniteAlphabet
FiniteAlphabet inherits from Alphabet. The interface makes the distinction between infinite and finite alphabets.

readonly attribute unsigned long size
Size of the FiniteAlphabet.

readonly attribute SymbolArray symbols
List of symbols, which make up this alphabet. The SymbolArray contains one AtomicSymbol for each AtomicSymbol in this alphabet.

5.2.9 Symbol
Symbol represents a single token in the sequence. A Symbol can have multiple synonyms or matches within the same Alphabet, which makes it possible to represent ambiguity codes and gaps. Symbols can also be composed from an ordered list of other symbols. For example, codons can be represented by a single Symbol using a compound Alphabet made from three DNA Alphabets.

readonly attribute Alphabet matches
(Sub) Alphabet of symbols matched by this symbol, including the symbol itself (e.g., if the symbol is the DNA ambiguity code W, then matches contains the symbols for A and T).

readonly attribute string name
Descriptive name for the symbol.

readonly attribute string token
Token for the symbol.

5.2.10 BasisSymbol
Inherits from Symbol. Represents a Symbol that can be composed from other symbols.

readonly attribute SymbolList symbols
List of Symbols that this symbol is composed from.

5.2.11 AtomicSymbol
AtomicSymbol inherits from BasisSymbol and represents a Symbol that is not ambiguous, i.e., its matches attribute has an Alphabet that contains just that one symbol. A single DNA nucleotide, a codon and an amino acid are examples of AtomicSymbols.

5.2.12 SymbolArray
SymbolArray (SymbolList in BioJava) represents a sequence of symbols that belong to an Alphabet.

readonly attribute unsigned long length
The number of symbols in this SymbolArray.

readonly attribute string seq
Stringified representation of the symbol list.

readonly attribute Alphabet the_alphabet
The alphabet that this SymbolArray is over.

SymbolList get_symbols(in unsigned long how_many, in Interval ival, out SymbolIterator the_rest) raises (IntervalOutOfBounds, SeqRegionInvalid)
Uses the list/iterator pattern to provide access to the Symbols. A list of no more than how_many elements is returned as the direct result. The remaining elements, if any, are available through the iterator returned in the out parameter. Only Symbols that overlap with ival are returned. If ival is null, all Symbols are returned.
Raises IntervalOutOfBounds if the Interval's start is less than 1 or if its start+length-1 is greater than the length of the SymbolArray. If the SymbolArray represents circular DNA, then this exception should be raised if the Interval's start is less than 1 or greater than the length of the SymbolArray, or if its length is greater than that of the BioSequence.
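The bounds rule just described can be sketched as follows. This is only an illustrative sketch in Python, not part of the IDL: the function name check_interval and the Python exception class are hypothetical stand-ins, and 1-based positions are assumed, as the rule implies.

```python
class IntervalOutOfBounds(Exception):
    """Hypothetical Python stand-in for the IDL exception of the same name."""


def check_interval(start, length, array_length, circular=False):
    """Raise IntervalOutOfBounds if the interval does not fit the SymbolArray.

    Assumes 1-based positions, so the last position covered by a linear
    interval is start + length - 1.
    """
    if start < 1:
        raise IntervalOutOfBounds("start is less than 1")
    if circular:
        # Circular DNA: the interval may wrap past the end, but the start
        # must lie on the array and the length cannot exceed the array.
        if start > array_length or length > array_length:
            raise IntervalOutOfBounds("interval does not fit circular array")
    elif start + length - 1 > array_length:
        raise IntervalOutOfBounds("interval extends past the end")
```

Under these assumptions, check_interval(8, 5, 10) raises for a linear array, while check_interval(8, 5, 10, circular=True) passes because the interval is allowed to wrap around.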
IntervalOutOfBounds is also raised if the Interval or its specialisations are not applicable to the underlying entity.
Raises SeqRegionInvalid if ival is an invalid SeqRegion. Examples include an incorrect StrandType, or an invalid CompositeSeqRegion (e.g. one that has a wrong SeqRegionOperator or contains overlaps or circularities).

string seq_interval(in Interval the_interval) raises (IntervalOutOfBounds, SeqRegionInvalid)
Provides access to sub-sequence strings. The Interval indicates which sub-sequence should be returned. The entire sequence may also be obtained using the seq attribute. If the_interval is a SeqRegion and the StrandType is STRAND_MINUS, the string should be taken as the reverse complement.
Raises IntervalOutOfBounds if the Interval's start is less than 1 or if its start+length-1 is greater than the length of the SymbolArray. If the SymbolArray represents circular DNA, then this exception should be raised if the Interval's start is less than 1 or greater than the length of the SymbolArray, or if its length is greater than that of the BioSequence.
Raises SeqRegionInvalid if the_interval is an invalid SeqRegion. Examples include an incorrect StrandType, or an invalid CompositeSeqRegion (e.g. one that has a wrong SeqRegionOperator or contains overlaps or circularities).

4.2.7 SymbolIterator
SymbolIterator provides a strongly typed iterator for Symbols.

boolean next(out Symbol the_symbol) raises (IteratorInvalid)
The next() operation returns the next Symbol in its out parameter the_symbol and returns a boolean value. If the iterator is at the end of the set, it returns FALSE and sets the output the_symbol to null. Otherwise the method returns TRUE.
Raises IteratorInvalid if the iterator is no longer valid (e.g., the underlying collection has changed).

boolean next_n(in unsigned long how_many, out SymbolList symbols) raises (IteratorInvalid)
next_n() returns Symbols in the SymbolList out parameter symbols, containing at most the number specified in the first parameter (how_many), and returns a boolean value directly. When the iterator is at the end of the set, it returns FALSE and the symbols parameter will have length zero. In all cases the length of symbols will be the minimum of how_many and the number of elements remaining. The method returns TRUE if the iterator is not at the end of the set.
Raises IteratorInvalid if the iterator is no longer valid (e.g., the underlying collection has changed).

void reset()
Sets the iterator to the start of the set.

void destroy()
Frees the iterator object.

5.2.13 SymbolList

typedef sequence<Symbol> SymbolList;

5.2.14 SequenceCluster
The SequenceCluster, which inherits from SequenceCollection, represents a collection of sequences that share some transitive or non-transitive sequence similarity. The cluster (which is a tree) can be hierarchical and non-exclusive, i.e., the same sequence can be in one or more sub-clusters. For example, a higher-level super cluster can be based on pairwise alignments and a lower-level cluster on multiple alignments. An example of such a hierarchical cluster is the STACK database (www.sanbi.ac.za/Dbases.html).

readonly attribute ClusterConstruction construction
Captures information that describes how the cluster is built.

SequenceClusterList get_sub_clusters(in unsigned long how_many, out SequenceClusterIterator the_rest)
Uses the list/iterator pattern to provide access to the sub-clusters. A list of no more than how_many elements is returned as the direct result. The remaining elements, if any, are available through the iterator returned in the out parameter.

SequenceCluster get_parent_cluster()
Returns the parent cluster.
Null is returned if the SequenceCluster is a root cluster.

5.2.15 AssemblyConstruction

AssemblyConstruction inherits from ClusterConstruction. It represents a cluster construction that is based on multiple or pairwise sequence alignments.

IdentifierList get_hits(in Identifier id) raises (IdentifierNotFound, IdentifierNotResolvable)

Gets the identifiers of sequences which are "linked" with the given sequence (for example, sequences which are considered to be similar in the context of clustering). The method returns only the ids of "linked" sequences that are used in the clustering process, and not necessarily the matches based on an all-against-all comparison. The concept is adopted from the JESAM/EGI system (corba.ebi.ac.uk/EST).

Raises IdentifierNotFound if the cluster and the identifier within the cluster can be resolved but the Identifier is not present. IdentifierNotResolvable is raised if the cluster and the identifier within the cluster cannot be resolved, such that the Identifier cannot even be searched for.

AssemblyList get_assemblies(in Identifier id) raises (IdentifierNotFound, IdentifierNotResolvable)

Returns all assemblies in which the given sequence exists, using the same semantics as the get_hits method. There can be many assemblies from the same set of sequences. In that case the score of each assembly is stored in the annotation of the assembly using a constrained vocabulary. The ids of the assemblies should also be kept consistent by using proper version numbering. Inspired by the JESAM/EGI system.

Raises IdentifierNotFound if the cluster and the identifier within the cluster can be resolved but the Identifier is not present. IdentifierNotResolvable is raised if the cluster and the identifier within the cluster cannot be resolved, such that the Identifier cannot even be searched for.

5.2.16 BioCollection

Inherits from BioObject. The BioCollection is a base class for collections.

readonly attribute unsigned long size

Number of elements in the collection.

5.2.17 SequenceCollection

Collection of BioSequences.
It can be used, for example, to describe data sets ranging from sequence databases to clone libraries, which is an important part of the SequenceCluster annotation.

BioSequenceList get_elements(in unsigned long how_many, out BioSequenceIterator rest_iter)

Returns BioSequences using the list/iterator pattern. A list of no more than how_many elements is returned as the direct result. The remaining elements, if any, are available through the iterator returned in the out parameter.

5.2.18 FuzzySeqRegion

FuzzySeqRegion inherits from SeqRegion. It represents fuzzy sequence regions, using minimum and maximum values for the boundaries. The fuzzy boundaries are converted to start and length information using predefined rules (to be specified). By default the start value is the minimum start position and the length is calculated from that minimum to the maximum end position. The concept corresponds to the BioPerl co-ordinate model.

The entity that uses the fuzzy locations must describe how the conversion is done using an appropriate annotation (possibly by providing an object reference to an instance that does the conversion).

readonly attribute unsigned long start_min

Minimum start position.

readonly attribute unsigned long start_max

Maximum start position.

readonly attribute unsigned long end_min

Minimum end position.

readonly attribute unsigned long end_max

Maximum end position.

5.2.19 LabeledSeqRegion

LabeledSeqRegion inherits from FuzzySeqRegion. The LabeledSeqRegion represents approximate locations, where the location information is described using labels (like chromosomal locations). The labels are mapped to the numerical values inherited from the FuzzySeqRegion using some consistent scheme.
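The default FuzzySeqRegion conversion described in 5.2.18, which LabeledSeqRegion inherits, can be written out as a short Python sketch. The submission leaves the exact conversion rules to be specified, so this is an illustration only: the class name FuzzySeqRegionSketch, the method default_interval(), and the inclusive 1-based length formula end_max - start_min + 1 are assumptions, the last chosen to match the 1-based start and start + length - 1 convention used for Intervals elsewhere in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuzzySeqRegionSketch:
    """Hypothetical value object holding the four fuzzy boundary attributes."""

    start_min: int
    start_max: int
    end_min: int
    end_max: int

    def default_interval(self):
        """Return (start, length) using the default rule.

        start  = minimum start position
        length = span from that minimum to the maximum end position
                 (assumed inclusive, 1-based co-ordinates)
        """
        start = self.start_min
        length = self.end_max - self.start_min + 1
        return start, length


# Usage: a region starting somewhere in 10..12 and ending somewhere in 20..25.
region = FuzzySeqRegionSketch(start_min=10, start_max=12, end_min=20, end_max=25)
start, length = region.default_interval()
```

Under this rule the resulting Interval is the widest span consistent with the fuzzy boundaries, so it always covers every position the region could occupy.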
The source of inspiration is BioPerl's Mutation/LiveSeq modules (www.biojava.org).

The entity that uses the LabeledSeqRegions must describe how the conversion to the numerical values is done using an appropriate annotation (possibly by providing an object reference to an instance that does the conversion).

readonly attribute string start_label

Label which describes the start position.

readonly attribute string end_label

Label which describes the end position.

5.2.20 TreeNode

TreeNode represents a general tree node with a parent and children.

Tree get_parent()

Returns the parent node.

TreeList get_children()

Returns the list of child nodes. The list is empty for leaf nodes.

5.2.21 Taxon

Taxon extends TreeNode and BioObject. It is used to store taxonomy data. Based loosely on the EMBL taxonomy IDL (http://corba.ebi.ac.uk/EMBL_taxonomy.html).

TaxonList get_lineage()

Returns the list of ancestor nodes. Ancestors are stored in order from the parent to the root.

readonly attribute string rank

Position in a taxonomy tree (species, genus, kingdom…).

readonly attribute StringList synonyms

List of scientific synonyms, excluding the name.

readonly attribute StringList common_names

List of common names.

5.2.22 TaxonList

typedef sequence<Taxon> TaxonList

5.2.23 WeightMatrix

WeightMatrices are used to describe variations and intensities of signals, such as quality measures of sequences. A WeightMatrix can be part of a sequence annotation, using a defined name as a key.

readonly attribute AbstractMatrix matrix

Two-dimensional matrix. The number of columns is equal to the length of the signal and the number of rows is equal to the size of the alphabet. There is a one-to-one mapping between the rows and the SymbolList elements. Numbering starts from zero.

readonly attribute Alphabet the_alphabet

Alphabet of the matrix. The SymbolList of the alphabet corresponds to the rows of the matrix.

5.2.24 AbstractMatrix

Abstraction for two-dimensional matrices.
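To illustrate the row and column conventions, the WeightMatrix layout described in 5.2.23 (one row per alphabet symbol, one column per signal position, both numbered from zero) can be modelled in Python as follows. This is a sketch only: the class WeightMatrixSketch, its weight() lookup and the example values are hypothetical and not part of the proposed IDL, and the alphabet is reduced to a plain list of symbol strings.

```python
class WeightMatrixSketch:
    """Hypothetical model: rows follow the alphabet's symbol order, columns follow signal positions."""

    def __init__(self, symbols, matrix):
        if len(matrix) != len(symbols):
            raise ValueError("one row is required per alphabet symbol")
        self.symbols = list(symbols)
        self.matrix = [list(row) for row in matrix]
        # One-to-one mapping between a symbol and its zero-based row index.
        self._row_of = {symbol: row for row, symbol in enumerate(self.symbols)}

    @property
    def rows(self):
        """Number of rows, i.e. the size of the alphabet."""
        return len(self.matrix)

    @property
    def columns(self):
        """Number of columns, i.e. the length of the signal."""
        return len(self.matrix[0]) if self.matrix else 0

    def weight(self, symbol, position):
        """Look up the weight of a symbol at a zero-based signal position."""
        return self.matrix[self._row_of[symbol]][position]


# Usage: a signal of length 3 over a four-symbol alphabet (values are made up).
wm = WeightMatrixSketch(
    ["A", "C", "G", "T"],
    [[0.7, 0.1, 0.2],
     [0.1, 0.1, 0.6],
     [0.1, 0.7, 0.1],
     [0.1, 0.1, 0.1]],
)
```

Keeping the symbol-to-row mapping tied to the alphabet's symbol order means that a client needs only the alphabet and the matrix to interpret any cell.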
readonly attribute unsigned long rows

Number of rows in the matrix.

readonly attribute unsigned long columns

Number of columns in the matrix.

5.2.25 DMatrix

Inherits from AbstractMatrix. It contains a two-dimensional array of double precision floating point numbers.

readonly attribute DoubleArray matrix

Two-dimensional array of double precision floating point numbers.

5.2.26 DoubleArray

typedef sequence<DoubleList> DoubleArray

5.2.27 DoubleList

typedef sequence<double> DoubleList

6 Glossary

This glossary defines a number of terms used in the document; some of them are particular to the current document, others are used throughout the domain of genome mapping. The definitions provided here can differ from the actual usage in sub-fields of molecular genetics and laboratory practice. Terms set in a bold typeface are themselves entries in the glossary.

definedTerm
Defined by anotherTerm.

7 References

Object Management Group. 2000. Biomolecular Sequence Analysis Final Adopted Specification. OMG Document dtc/2001-11-01.
Object Management Group. 2000. Biomolecular Sequence Analysis Entities RFP. OMG Document lifesci/2000-12-16.
Object Management Group. 1998. CORBAservices: Common Object Services Specification. OMG Document formal/98-12-09.
Object Management Group. 2001. CORBA v2.4.2. OMG Document formal/2001-02-33.
Object Management Group. 1999. Genomic Maps Adopted Specification. OMG Document dtc/1999-12-01.
BioCorba. http://www.biocorba.org/.
BioJava. http://www.biojava.org/.
BioPerl. http://www.bioperl.org/.
STACK. www.sanbi.ac.za/Dbases.html.
JESAM/EGI. corba.ebi.ac.uk/EST.