Chemical Structure Access & Representation
Response to OMG LSR RFP 8

A Proposal Submitted by:
Intelligent Solutions, 751 Malena Dr., Ann Arbor, MI 48103
LION Bioscience, 9880 Campus Point Drive, San Diego, CA 92121

October 27, 2003

Table of Contents

A Proposal
1 Administrative
1.1 Submission Contact Point
1.2 Overview to the Material in the Submission
1.3 Statement of Proof of Concept
2 Chemical Markup Language within the Context of CSAR
2.1 Introduction
2.2 CML the Chemistry XML
2.2.1 The Semantic Web of Chemistry
2.2.2 Creating, reading and editing XML/CML documents
2.3 Mapping of Models
2.3.1 Platform-Independent and Platform-Specific UML Models
2.3.2 A formal approach to describing objects
2.4 Class Diagram: cml / Overview
2.5 Resolution of RFP Mandatory and Optional Requirements
2.5.1 Mandatory Requirements
2.5.2 Optional Requirements
2.6 Relationship to existing or emerging OMG Specifications
2.7 [Chemical] Elements
2.7.1 Molecule
2.7.2 Properties
2.7.3 Associations
2.7.4 Operations
2.7.5 MoleculeFactory
2.7.6 Properties
2.7.7 Associations
2.7.8 Operations
2.8 MoleculeUtil
2.8.1 Properties
2.8.2 Associations
2.8.3 Operations
2.9 Atom
2.9.1 Properties
2.9.2 Associations
2.9.3 Operations
2.10 AtomFactory
2.10.1 Properties
2.10.2 Associations
2.10.3 Operations
2.11 AtomParity
2.11.1 Properties
2.11.2 Associations
2.11.3 Operations
2.12 Bond
2.12.1 Properties
2.12.2 Associations
2.12.3 Operations
2.13 Electron
2.13.1 Properties
2.13.2 Associations
2.13.3 Operations
2.14 Isotope
2.14.1 Properties
2.14.2 Associations
2.14.3 Operations
2.15 IsotopeFactory
2.15.1 Properties
2.15.2 Associations
2.15.3 Operations
2.16 AtomFactory
2.16.1 Properties
2.16.2 Associations
2.16.3 Operations
2.17 BondFactory
2.17.1 Properties
2.17.2 Associations
2.17.3 Operations
2.18 NumericAtomParity
2.18.1 Properties
2.18.2 Associations
2.18.3 Operations
2.19 StringAtomParity
2.19.1 Properties
2.19.2 Associations
2.19.3 Operations
2.20 ChemicalElement
2.20.1 Properties
2.20.2 Associations
2.20.3 Operations
2.21 BondStereo (new properties only)
2.21.1 Properties
2.22 Bond (new properties only…)
2.22.1 Properties
2.23 ChemicalElementFactory
2.23.1 Properties
2.23.2 Associations
2.23.3 Operations
2.24 IsotopeSet
2.24.1 Properties
2.24.2 Associations
2.24.3 Operations
2.25 IsotopeSetFactory
2.25.1 Properties
2.25.2 Associations
2.25.3 Operations
2.26 PeriodicTable
2.26.1 Properties
2.26.2 Associations
2.26.3 Operations
2.27 PeriodicTableFactory
2.27.1 Properties
2.27.2 Associations
2.27.3 Operations
2.28 BondStereo
2.28.1 Properties
2.28.2 Associations
2.28.3 Operations
2.29 Coordinate2
2.29.1 Properties
2.29.2 Associations
2.29.3 Operations
2.30 Coordinate3
2.30.1 Properties
2.30.2 Associations
2.30.3 Operations
2.31 AbstractAngle
2.31.1 Properties
2.31.2 Associations
2.31.3 Operations
2.32 Formula
2.32.1 Properties
2.32.2 Associations
2.32.3 Operations
2.33 FormulaElement
2.33.1 Properties
2.33.2 Associations
2.33.3 Operations
2.34 Crystal
2.34.1 Properties
2.34.2 Associations
2.34.3 Operations
2.35 Query and Collection
2.36 Class Diagrams
2.37 Recommendation
2.37.1 Conformance Points
2.37.2 Details
3 Glossary
Appendix A: UML Use Cases
Actor Catalogue
Master Use Case
Detailed Use Case Scenarios
Appendix B: Java Code
Appendix C: Use Cases for Chemistry

Table of Figures

Figure 1 The representation problem. (CTL File Formats, MDL Information Systems)
Figure 2 Intended solution
Figure 3 Platform-Independent Model with Stereotypes
Figure 4 Mappings and transformations.
Figure 5 Summary
Figure 6 CSAR/CML Diagram
Figure 7 Search Module
Figure 8 Search Module (Properties Classes)
Figure 9 DB Module
Figure 10 Collection Module
Figure 11 Atom
Figure 12 Bond
Figure 13 Coordinates
Figure 14 Formula
Figure 15 Molecule
Figure 16 Other
Figure 17 MOLUTIL Module
Figure 18 Structure and Translation
Figure 19 Architecture
Figure 20 Processing diagram
Figure 21 Search and DB Interface Modules interacting with each other to perform a chemical structure search against database.
Figure 22 General Use Case Scenario

Acknowledgements

The authors of this document wish to express their appreciation to those listed below (in alphabetic order) for their contributions of ideas and experience. Ultimately, the ideas expressed in this document are those of the authors and do not necessarily reflect the views or ideas of these individuals, nor does the inclusion of their names imply an endorsement of the final product.
Moreover, the inclusion of the CML materials does not imply that the authors of CML endorse the content of this proposal.

David Benton, w.david.benton@gsk.com
Robert DeWitte, robert.dewitte@acdlabs.com
James P. Fry, james.fry@pfizer.com
Saikrishna R. Gangeddula, saikrishna.gangeddula@mail.com
Shogo Ito, shogo@gandolf.acad.emich.edu
Peter Murray-Rust, peter.murray-rust@nottingham.ac.uk
Wendy L. Sharp, wsharp@gandolf.acad.emich.edu
Valery Tkachenko, valt@acdlabs.ru
Scott Markel, scott.markel@lionbioscience.com
Tony Parsons, tony_parsons@sandwich.pfizer.com
Henry Rzepa, h.rzepa@ic.ac.uk
Richard Scott, richard.scott@denovopharma.com

1 Administrative

© Copyright 2002 and 2003 by Intelligent Solutions and LION Bioscience

The organizations listed above hereby grant a royalty-free license to the Object Management Group, Inc. (OMG) for world-wide distribution of this document or any derivative works thereof, so long as the OMG reproduces the copyright notices and the below paragraphs on all distributed copies.

The material in this document is submitted to the OMG for evaluation. Submission of this document does not represent a commitment to implement any portion of this specification in the products of the submitters.

WHILE THE INFORMATION IN THIS PUBLICATION IS BELIEVED TO BE ACCURATE, THE COMPANIES LISTED ABOVE MAKE NO WARRANTY OF ANY KIND WITH REGARD TO THIS MATERIAL INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. The companies listed above shall not be liable for errors contained herein or for incidental or consequential damages in connection with the furnishing, performance or use of this material. The information contained in this document is subject to change without notice.

This document contains information that is protected by copyright. All Rights Reserved.
Except as otherwise provided herein, no part of this work may be reproduced or used in any form or by any means, graphic, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems, without the permission of one of the copyright owners. All copies of this document must include the copyright and other information contained on this page.

The copyright owners grant member companies of the OMG permission to make a limited number of copies of this document (up to fifty copies) for their internal use as part of the OMG evaluation process.

RESTRICTED RIGHTS LEGEND. Use, duplication, or disclosure by government is subject to restrictions as set forth in subdivision (c) (1) (ii) of the Rights in Technical Data and Computer Software Clause at DFARS 252.227.7013.

CORBA, OMG and Object Request Broker are trademarks of Object Management Group.

1.1 Submission Contact Point

Intelligent Solutions:
Juan Esteva
751 Malena Dr.
Ann Arbor, MI 48103
734.996-3628
juan.esteva@intellsolutions.com

LION bioscience:
Mitchell Miller
955 Ridge Hill Lane Suite 30
Midvale, UT 84047
801.569.1390
mitchell.miller@lionbioscience.com

1.2 Overview to the Material in the Submission

A set of specifications describing the chemical structure access and retrieval processes is provided herein. The specifications are given using class hierarchies, use case scenarios, and sequence statements. Appendix A contains the XML representation for the core CML and Appendix B contains sample code. The XMI file for the entire project is provided upon request.

1.3 Statement of Proof of Concept

Several code segments have been developed to test the use case scenarios presented below. Please also refer to Appendix A and Appendix B.

Search Chemical Structure (against legacy database)
• Create CML molecule from a legacy format.
• Select the values for use as parameters in a query.
• Construct a query using interfaces in Search and DB modules.
• Transport resulting query to server.
• Vendor Implementation translates the query into the legacy format.
• Vendor Implementation performs the search from the database.
• Vendor Implementation translates the result into the CML object model.
• Transport the result back to client.
• Transform the result into a legacy format.

Search Chemical Structure (within collection of CML molecules)
• Create CML molecule from a legacy format.
• Select the values for use as parameters in a query.
• Construct a query using interfaces in Search and Collection modules.
• Search the collection using the query built above.
• Transform the result into a legacy format.

Add Chemical Structure
• Create CML molecule.
• Transport the new molecule to server.
• Vendor Implementation translates the insert query into the legacy format.
• Vendor Implementation performs the insert into the database.
• Vendor Implementation translates the status of the insert into standard format.
• Transport the result back to client.

Update Chemical Structure
• Search the database for the chemical structure to be updated.
• Replace the corresponding CML molecule with a newly constructed CML molecule.
• Transport the update query along with the new molecule to server.
• Vendor Implementation translates the query into the legacy format.
• Vendor Implementation performs the update to the database.
• Vendor Implementation translates the status of the update into standard format.
• Transport the result back to client.

Delete Chemical Structure
• Search the database for the chemical structure to be deleted.
• Delete the corresponding CML molecule. [Actual deletion is performed by Vendor Implementation]
• Transport the delete query to server.
• Vendor Implementation translates the query into the legacy format.
• Vendor Implementation performs the delete on the database.
• Vendor Implementation translates the status of the delete into standard format.
• Transport the result back to client.
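The search, add, update and delete scenarios above share one shape: the client talks to a vendor-neutral interface, and the vendor implementation reports a status in a standard form. The following Java fragment is a minimal sketch of that shape only. Every name in it (StructureStore, MoleculeRecord, Status, InMemoryStore) is hypothetical and is not taken from the interfaces defined in this submission; an in-memory map stands in for the legacy database, and a small value class stands in for a CML molecule.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Demo entry point. All type and method names in this sketch are hypothetical.
class UseCaseSketch {
    public static void main(String[] args) {
        StructureStore store = new InMemoryStore();
        store.add(new MoleculeRecord("m1", "C2H6O", 46.07));
        store.add(new MoleculeRecord("m2", "C6H6", 78.11));
        store.update(new MoleculeRecord("m1", "C2H6O", 46.068));
        store.delete("m2");
        // Query by an intrinsic property (molecular weight).
        List<MoleculeRecord> hits = store.search(m -> m.molecularWeight < 50.0);
        System.out.println(hits.size() + " hit(s), first id = " + hits.get(0).id);
    }
}

// Stand-in for a CML molecule: an identifier plus two intrinsic properties.
final class MoleculeRecord {
    final String id;
    final String formula;
    final double molecularWeight;

    MoleculeRecord(String id, String formula, double molecularWeight) {
        this.id = id;
        this.formula = formula;
        this.molecularWeight = molecularWeight;
    }
}

// Assumed "standard format" for the outcome of a lifecycle operation.
enum Status { OK, NOT_FOUND, DUPLICATE }

// Vendor-neutral contract that the client programs against.
interface StructureStore {
    Status add(MoleculeRecord molecule);
    Status update(MoleculeRecord molecule);
    Status delete(String id);
    List<MoleculeRecord> search(Predicate<MoleculeRecord> query);
}

// Stand-in for a vendor implementation; a real one would translate each
// call into the legacy database's own query format.
final class InMemoryStore implements StructureStore {
    private final Map<String, MoleculeRecord> rows = new LinkedHashMap<>();

    public Status add(MoleculeRecord molecule) {
        if (rows.containsKey(molecule.id)) {
            return Status.DUPLICATE;
        }
        rows.put(molecule.id, molecule);
        return Status.OK;
    }

    public Status update(MoleculeRecord molecule) {
        if (!rows.containsKey(molecule.id)) {
            return Status.NOT_FOUND;
        }
        rows.put(molecule.id, molecule);
        return Status.OK;
    }

    public Status delete(String id) {
        return rows.remove(id) != null ? Status.OK : Status.NOT_FOUND;
    }

    public List<MoleculeRecord> search(Predicate<MoleculeRecord> query) {
        List<MoleculeRecord> hits = new ArrayList<>();
        for (MoleculeRecord molecule : rows.values()) {
            if (query.test(molecule)) {
                hits.add(molecule);
            }
        }
        return hits;
    }
}
```

The sketch leaves out transport between client and server and the translation to and from legacy formats, both of which the scenarios assign to the vendor implementation.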
Translate Chemical Structures between Representations
• Forward transformation will create a CML representation from a legacy representation.
• Reverse transformation will create a legacy representation from a CML representation.

Manipulate Associated Intrinsic Properties
• Given a legacy representation of a molecule or result set, create CML molecule.
• Then gain access to its corresponding intrinsic properties, including atoms, bond types, connectivity, molecular weight, and molecular formula.

2 Chemical Markup Language within the Context of CSAR

2.1 Introduction

Peter Murray-Rust and Henry Rzepa state that a number of problems arise when trying to capture the contextual meaning of chemistry, and that these problems impede the use of this information. For instance, the following extract from a typical molecular science journal illustrates not only how precisely data and information must be represented, but also how much human perception is required to translate this information, as presented in this (linear) form, into e.g. a reproducible experiment or a mechanistic interpretation:

"Thiamin phosphate synthase catalyzes the formation of thiamin phosphate from 4-amino-5-(hydroxymethyl)-2-methylpyrimidine pyrophosphate and 5-(hydroxyethyl)-4-methylthiazole phosphate. The reaction involves... dissociative mechanism...carbenium ion intermediate...and pyrimidine iminemethide observed in the crystal..."

Note the profusion of chemical structure information, concepts and terms, which only a trained human chemist could easily process. Quantitative concepts and units are also ubiquitous:

"A 500 µL aliquot of 0.8 µM TP synthase in 50 mM Tris-HCl (pH 7.5) and 6 mM MgCl2 incubated at room temperature with 50 µM CF3HMP-PP."

An even greater degree of human perception is required when handling graphical chemical representations, which may contain many, often fuzzy and dangerous, human-only semantics (e.g. 2D representations of 3D properties, relative stereochemistry, aromaticity, hydrogen and other "weak" bonding, use of generic and "R" groups, reaction arrows and mechanisms, etc.). The challenge therefore is to develop an infrastructure which can be routinely used to capture, store and appropriately filter and display such information (see Figure 1). Consequently, the chemically-specific information types needed to support the capture and use of this information should include:
• Molecules and substances
• Reactions
• Analytical information, especially spectra
• Computation and simulation (QM, mechanics, dynamics, etc.)
• "Data-centric" concepts (numbers, units, arrays, matrices, etc.)
• Specialist software for display, editing, searching, etc.
• Support for "adjoining" disciplines such as bio areas, materials science, etc.

Figure 1 The representation problem. (CTL File Formats, MDL Information Systems)

Figure 1 shows some of the common file formats that are used to represent chemical information. Therefore, the tasks that can be identified in implementing an XML solution include:
• The creation of documents from both legacy sources (see Figure 1) of data and de novo by humans
• The creation and capture of metadata (dictionaries of terms, tables of contents, codes, etc.)
• Specification of namespaces (a reserved addressing scheme for information)
• Human validation of the system (conformance to agreed specifications)
• Machine validation of documents (according to a specified and agreed schema)
• Document transformation (XSLT)
• Rendering and display (XSL-FO, domain-specific such as molecular representations)

The design of an XML-based markup language should provide for:
• A simple, extensible DTD or Schema (do not over-complicate and make it modular)
• Agreed semantics
• One (or more) agreed and published ontologies
• Agreed examples and conformance tests
• A community of critical mass

Appropriate tools for accomplishing this should be identified.
These might include:
• XML Writers
• XML Readers
• Legacy converters (difficult because of variation and ambiguity in the original data, which may require some degree of perception for an accurate conversion)
• Validators
• Dictionaries
• Editors

Custom-written XSLT style sheets and generic editors will do some of these, but a DOM (Document Object Model, which represents a syntax-free abstraction of the data in memory) is probably essential for many subjects.

2.2 CML the Chemistry XML

Markup Languages created using eXtensible Markup Language (XML) are becoming accepted as the medium for passing generic and domain-specific information over networks. In many cases this information represents persistent information objects which can be reused in other applications and other contexts. Such messages and objects are likely to become the medium of chemical information flow, replacing the incompatible "legacy file formats" currently used in most applications.

XML makes a considerable contribution to software design because it can describe and document data structures. Thus, an XML document represents a tree in which the component nodes and attributes are precisely defined and to which any tree-based algorithm (enumeration, sorting, walking, diffing, merging, pruning, etc.) can be applied. Because trees are common and useful, the W3C XML activity has created an Application Programming Interface (API) for XML trees called the Document Object Model (DOM). This encourages standardization of terminology, documentation and algorithms. We believe this can form the core for component-based computing and describe its extension to chemistry.

2.2.1 The Semantic Web of Chemistry

Our dream is for a "Semantic Web" for chemistry. This concept epitomizes the ability for a machine to receive a document over a network without prior agreement on its contents and to carry out meaningful processing without human direction. The sender of the message has to represent:
1. What the components of the document are
2. Semantics (how the components are to be processed)
3. The ontology used (what the components mean)

XML solves the first; the encoding, markup, structure and namespace are precisely defined. It deliberately does not address the others. There is no universally agreed-upon mechanism or set of mechanisms for encoding semantics; the most common is to associate executable code (e.g. plug-ins or Java applets), which behaves in a predetermined way when installed on the recipient's machine. Agent technology and CORBA are alternative approaches but are normally only deployed for projects where there is much prior agreement between senders and recipients. In this paper, we propose a Java interface as a first step.

2.2.2 Creating, reading and editing XML/CML documents

Creating systems for processing structured documents is more difficult than often appreciated. It requires:
• A precise, agreed specification that has been tested through prototype implementations
• Tools to write compliant documents
• Tools to read compliant documents
• Tools to modify (edit) and validate compliant documents

If the specification is flexible, a read/write/edit tool must cater to all possible variants of documents, often a major challenge. We argue here that it must be implemented with a DOM, and that a CMLDOM is critical to chemical interoperability.

CML deliberately does not cover all chemistry but concentrates on "molecules" (discrete entities representable by a formula and usually a connection table). It supports a hierarchy for compound molecules (clathrates, macromolecules, etc.). It also supports reactions and macromolecular structures/sequences (though it can interoperate with other macromolecular XML languages as they are developed). It has no specific support for physicochemical concepts but can support labeled numeric datatypes of several sorts which can cover a wide range of requirements.
It allows quantities and properties to be specifically attached to molecules, atoms or bonds.

However, CML does not provide a robust and all-encompassing solution. For instance:
• In CML, there are no classes to facilitate searches.
• In CML, there are no classes to facilitate searches for properties.
• CML has no separate class for Cartesian coordinates, although it has a Polar class for polar coordinates.

The goal of the proposed model (see § 2.37) is to make use of the representational and translational capabilities offered by CML and complement them with classes that facilitate searches and allow the creation of collections (see Figure 2).

Figure 2 Intended solution

2.3 Mapping of Models

2.3.1 Platform-Independent and Platform-Specific UML Models

There are multiple ways to transform a platform-independent model (PIM) expressed using UML into a corresponding platform-specific model (PSM) expressed using UML:
1. A human could study the PIM and manually construct a PSM, perhaps manually constructing the one-off refinement mapping between the two.
2. A human could study the PIM and utilize models of known refinement patterns to reduce the burden in constructing the PSM and the refinement relationship between the two.
3. An algorithm could be applied to the PIM to create a skeleton of the PSM to be enhanced by hand, perhaps using some of the same refinement patterns as in item 2.
4. An algorithm could create a complete PSM from a complete PIM, explicitly or implicitly recording the refinement relationship for use by other automated tools.

Note that the above list does not address the production of executable code from a platform-specific model. There are also variations in which the platform-specific model is not a semantically rich UML model but, rather, is expressed via a language such as IDL, Java interfaces, XML, etc.

Fully-automated transformations are feasible in certain constrained environments.
The degree to which transformations can be automated is considerably enhanced when the following conditions hold:
• There is no legacy to take into account.
• The model that serves as input to the transformation is semantically rich.
• The transformation algorithms are of high quality.

It is much easier to generate executable code for structural features (attributes, certain associations and similar properties) of a model than for behavioral features (operations), because the behavior of property getters and setters is quite simple.

Automation of transformations is more tractable when the transformation is parameterized, i.e. when a human selects from a pre-defined set of options to determine how the transformation is performed. For example, a system that transforms a UML model to an XML DTD could allow some control over how a UML class's attributes are transformed, giving a human the choice to put them in an ATTLIST or to put each attribute in a separate ELEMENT [as in the UML model].

Some of these points are illustrated by a CORBA-specific model, corresponding to the fragment of a platform-independent model shown in Figure 3. Note that this model fragment has been annotated with a stereotype to denote that Account is an entity as opposed to a process. The model fragment has also been enhanced to indicate that the Account number constitutes a unique identifier for Account instances.

Figure 3 Platform-Independent Model with Stereotypes

The submitters made use of the capabilities offered by Rational Rose and the Unisys XML Metadata Interchange (XMI) to accomplish the transformation from the PIM to the PSM. Given two metamodels defined by the OMG, one describing UML and one describing Java, we can derive the APIs for these metamodels by applying mappings defined in the JMI specification. These APIs are the common metadata access APIs for UML tools and Java IDEs respectively.
Since the respective metamodels describe the functions of the tools well, the JMI mapping results in a rich set of APIs for integration. The process is illustrated in Figure 4 below:

Figure 4 Mappings and transformations.

2.3.2 A formal approach to describing objects

The Unified Modeling Language (UML) [Grady Booch and Jacobson 1999] is rapidly becoming the de facto method for the representation of Object-Oriented (OO) concepts. Most implementations have graphical interfaces and provide tools for supporting the creation of software. For example, a framework described in UML can be converted to an IDL or XML representation or to Java class libraries. We have been able to transfer our CMLDOM Java bytecode to collaborators, who have easily been able to use UML-based tools to represent and use the class functionality.

Although XML is not directly an OO approach, it is easy and common to map it into OO frameworks. XML can support most OO concepts, the most common being:
• Abstraction. The formalization of the essential properties of a (real-world) entity. The abstraction is usually a subset of most human perceptions of the real world. For concrete subjects like chemistry, XML elements normally map well into these formalizations. Thus, elements in CML represent a core set of universal self-consistent properties of molecules, atoms, etc. from which we create corresponding OO abstractions.
• Encapsulation. Each object is a "black box" and clients of it only have access through the interface. The implementation is not specified and details cannot normally be obtained from the interface, which exposes components through methods. The interface is often tied closely to the XML elements and their attributes, though some of the methods are related to XML prose descriptions.
• Aggregation. An object contains other objects (a "has A" relationship). This maps directly into the content model of XML DTDs, so that a DTD defines the object aggregation model. For CML, a <molecule> element can (directly or indirectly) contain <atom> elements, <bond> elements, etc., and a Molecule object can contain Atom objects, Bond objects, etc.
• Polymorphism. Objects that share a common interface can be used in identical ways by other objects. Thus, all markup languages share common features (such as XMLAttributes) through XML itself. For example, MathMLDOM objects (MathML being the W3C's specification for mathematics) and CMLDOM objects both implement the W3CDOM interface, so either can be "plugged into" a generic W3CDOM-compliant tool.
• Inheritance. An object can inherit properties from another (its superclass), and can modify or add to them. Inheritance can be from a single superclass or many ("multiple inheritance"); Java has eschewed multiple inheritance and so do we, using interfaces where appropriate. XML has no direct concept of inheritance, though the recent XMLSchema specification provides some of this functionality with multiaxiality through facets. CMLDOM is designed for inheritance as outlined below.

Other properties which XML does not normally support are:
• State and Sequence. These relate to dynamic processes in the UML sense. They are not currently used in CML.
• Behavior. This is normally provided through methods. XML is, at heart, a language for describing static documents, though many specific markup languages have found the need to specify behavior. This is true for chemistry, but we defer the detailed behavior-related methods to the following ontological paper [A Universal approach to Web-based Chemistry using XML and CML, P. Murray-Rust, H. S. Rzepa, M. Wright and S. Zara, ChemComm, 2000, 1471-1472.]. Since CML V1.0 does not formalize the behavior of CML objects, the CMLDOM interface will be the major way of achieving this.
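The aggregation and polymorphism points above can be illustrated with the generic W3C DOM API that ships with the Java platform. The fragment below is a minimal sketch, not the CMLDOM itself: it parses a small CML-like document and counts the atoms and bonds contained in the molecule. The element and attribute names in the fragment (molecule, atom, bond, elementType, atomRefs, order) are illustrative assumptions and are not guaranteed to match a particular CML version; the class name AggregationSketch and its helper methods are likewise hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AggregationSketch {
    // Illustrative CML-like fragment; element and attribute names are assumptions.
    static final String FRAGMENT =
          "<molecule id=\"m1\">"
        + "<atom id=\"a1\" elementType=\"C\"/>"
        + "<atom id=\"a2\" elementType=\"O\"/>"
        + "<bond atomRefs=\"a1 a2\" order=\"2\"/>"
        + "</molecule>";

    // Parse an XML string into a generic W3C DOM tree.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    // Count descendant elements with the given tag name. This mirrors the
    // "has A" relationship: a molecule directly or indirectly contains atoms.
    static int countDescendants(Element parent, String tagName) {
        return parent.getElementsByTagName(tagName).getLength();
    }

    public static void main(String[] args) {
        Element molecule = parse(FRAGMENT).getDocumentElement();
        System.out.println("molecule " + molecule.getAttribute("id")
            + ": atoms=" + countDescendants(molecule, "atom")
            + ", bonds=" + countDescendants(molecule, "bond"));
    }
}
```

Because the code relies only on org.w3c.dom interfaces, the same two helper methods would work unchanged on any other XML vocabulary, which is the sense in which a domain-specific DOM can be plugged into a generic DOM-compliant tool.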
2.4 Class Diagram: cml / Overview
Figure 4 Summary
Figure 5 CSAR/CML Diagram

2.5 Resolution of RFP Mandatory and Optional Requirements
The following is the list of requirements put forward in the Chemical Structure Access and Representation RFP (OMG LSR RFP 8).

2.5.1 Mandatory Requirements
Define interfaces that provide uniform access to 2-D chemical structures, associated intrinsic properties, and collections thereof, independent of underlying implementation. These interfaces shall include the functionality listed below. In all of these, the interfaces must handle stereochemistry, the associations between core structures and their salts, and the association between tautomeric structures.

Support the lifecycle of single as well as collections of 2-D chemical structures including associated intrinsic properties. Lifecycle support includes create, insert (into collections), update and delete. Intrinsic properties include, but are not limited to: unique identifier (if assigned), atoms, bond types, connectivity, 3-D spatial coordinates, molecular weight and molecular formula. Any calculations used must make available the method of calculation used, a description of the method, and literature references if available.

The proposal addresses the manipulation of these properties; please refer to the CML module definitions, which include the CML, NONCML, and MOLUTIL interfaces.

Search for and retrieve 2-D chemical structures from collections, independent of underlying storage implementation, using the following methods:
• Exact match – search for all 2-D structures in a collection that exactly match a query structure.
• Similarity – search for all 2-D structures in a collection that are loosely related to a query structure.
• Substructure – search for all 2-D structures in a collection that contain a user-defined partial chemical structure.
• Tautomer – search for all 2-D structures in a collection that are tautomers of a query structure.
• Generic (i.e., Markush, R-group, etc.) – search for all 2-D structures in a collection in which the identified structures are contained in a query represented by a generic structure.
• Search for intrinsic properties, as defined in 6.5.1.1.
• Searches consisting of any combination of the above methods using Boolean and comparison predicates (<, >, =, ~, AND, OR, etc.). For example: find all 2-D structures that match a query structure and whose molecular weight is equal to some number.

Provide access to representations of 2-D chemical structures in each of the following formats regardless of underlying implementation: MOL, SD, and MOL2 files; SMILES strings and SLN.

Provide the following operations on collections of 2-D chemical structures: union, intersection, subtraction, and disjunction.

The proposal addresses the requirements stated above. A series of use case scenarios has been documented that takes into consideration not only the search criteria listed above but also a composite search using structures, associated intrinsic properties and extrinsic properties. See Appendices B and C.

Address the issues related to translating structures and structural queries between representations, for example the potential loss of information.

Central to the manipulation of chemical structures in this proposal is the Chemical Markup Language (CML) [http://www.xml-cml.org/], which we are considering using. CML acts primarily as an abstraction of current chemical models (usually connection-table based). It is intended to be able to represent any of them faithfully and to allow syntactic, semantic and ontological conversion. We have made some progress with the more common file types (so far, MOL and SMILES representations have been translated into CML) but will also be dependent on any standardized ontology to emerge from the IUPAC efforts.
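The four collection operations named in the mandatory requirements (union, intersection, subtraction, and disjunction) can be sketched in Java as follows. This is only an illustration under stated assumptions: structures are represented by their unique identifiers, the class name StructureSets is hypothetical, and "disjunction" is read here as the symmetric difference (members of exactly one collection), which the RFP text does not define explicitly.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Set operations on collections of structures, represented here by their identifiers. */
class StructureSets {

    /** Structures present in either collection. */
    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    /** Structures present in both collections. */
    static Set<String> intersection(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    /** Structures present in a but not in b. */
    static Set<String> subtraction(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    /** Assumption: "disjunction" means symmetric difference (in exactly one collection). */
    static Set<String> disjunction(Set<String> a, Set<String> b) {
        Set<String> result = union(a, b);
        result.removeAll(intersection(a, b));
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical hit lists from two searches over the same collection.
        Set<String> substructureHits = new TreeSet<>(Arrays.asList("m1", "m2", "m3"));
        Set<String> weightHits = new TreeSet<>(Arrays.asList("m2", "m3", "m4"));

        System.out.println("union:        " + union(substructureHits, weightHits));
        System.out.println("intersection: " + intersection(substructureHits, weightHits));
        System.out.println("subtraction:  " + subtraction(substructureHits, weightHits));
        System.out.println("disjunction:  " + disjunction(substructureHits, weightHits));
    }
}
```

The intersection case corresponds to the combined search quoted in the requirements: structures that match a query structure AND satisfy a molecular weight predicate.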
Provide extensibility with respect to the scope defined in section 6.2 and functionality defined in 6.5.1 without requiring changes to the interfaces.

The proposal does this by providing parameterized interfaces.

Clearly define all objects in terms of the chemical concepts expressed by them. Include a glossary defining terms and concepts used in the response.

The submission does this.

2.5.2 Optional Requirements
There are no optional requirements.

2.6 Relationship to existing or emerging OMG Specifications
• Model Driven Architecture. While the model provides only the data structures needed for the interchange and manipulation of Chemical Representation data and specifies no services, it does share the MDA approach of using the UML model as the basis of generating the DTD from the XMI. Given this model it is now possible to generate data structures that meet different language standards.
• XML Metadata Interchange. The model is submitted in XMI format, v1.1, generated from Rational Rose Enterprise Edition 2001 using the Unisys Rose XML Tools Version 1.3.2 add-in.
• Query Service. The Query Service was not used; instead the submitters suggest that it is possible to implement a Query Service for Chemical Representation data using Xdb. Xdb is an XML document repository providing structured storage of XML data, at present using a Relational Database Management System (RDBMS) mapping over PostgreSQL. Xdb provides a fast and scalable XML database framework with support for both XML Path Language (XPath) and XML Query Language (XQL), and the ability to store XML documents and SAX APIs.
• Bibliographic Query Service. Although the BQS specification is not directly incorporated, the attributes and annotation from associations of the Bibliographic Reference can be used in queries to data sources that support the interfaces in the BQS specification.
• Collection Service. The Collection Service is not used. Instead, the W3C DOM and XML-DEV SAX parsers provide similar capabilities for XML data as the Collection Service for IDL data. The DOM parser is ideal for smaller XML data and provides full navigation backwards and forwards. The SAX parser provides a forward-only traversal of the data, which is ideal for parsing large XML documents.

2.7 [Chemical] Elements
2.7.1 Molecule
At the heart of our model is the Molecule entity, which represents a chemical substance. One Molecule contains zero or more sub-Molecules (with no limit on the depth to which Molecules can be nested), zero or more Atoms and zero or more Bonds. Molecules nested within a Molecule give our model the ability to accommodate sets of tautomers, conformers, residues, mixtures (not required for this submission, but definitely useful in chemistry) and other complex chemical entities. Molecules have the following properties:

2.7.2 Properties
• Fuzzy – boolean; when true, this molecule is only intended to query databases and probably does not reflect a real chemical substance that a researcher would make or isolate and then store in a database.
• childRole – string; when this Molecule is associated with another Molecule, childRole contains a description of the relationship.

2.7.3 Associations
We use Associations to record the relationship between Molecule and other entities:
• ParentMolecule – identifies the next Molecule up in the hierarchy.
• subMolecules – the converse relationship, identifying the set of sub-Molecules under this Molecule.
• molecularAtoms – a pointer to the set of Atoms that belong to this Molecule.
• immediateParentMolecule – the converse relationship to molecularAtoms, identifying the Molecule immediately above the set of Atoms.
• molecularBonds – a pointer to the set of Bonds that belong to this Molecule.
• molecularBondAngles – a pointer to the set of BondAngles that belong to this Molecule.
• molecularTorsions – a pointer to the set of TorsionAngles that belong to this Molecule.
• molecularCrystal – a pointer to a Crystal unit, defining the crystal structure of this molecule.
• molecularElectrons – a pointer to the set of Electrons that belong to this Molecule.

2.7.4 Operations
The following operations define the behavior of Molecule.
• addAtom – appends an Atom to the set indicated by molecularAtoms.
• deleteAtom – removes an Atom from the set indicated by molecularAtoms.
• addBond – appends a Bond to the set indicated by molecularBonds.
• deleteBond – removes a Bond from the set indicated by molecularBonds.
• addBondAngle – appends a BondAngle to the set indicated by molecularBondAngles.
• deleteBondAngle – removes a BondAngle from the set indicated by molecularBondAngles.
• addTorsionAngle – appends a TorsionAngle to the set indicated by molecularTorsions.
• deleteTorsionAngle – removes a TorsionAngle from the set indicated by molecularTorsions.
• getAtomByID – a convenience method for locating an Atom given an (alphanumeric) ID.
• addElectron – appends an Electron to the set indicated by molecularElectrons.
• deleteElectron – removes an Electron from the set indicated by molecularElectrons.
• addMolecule – appends a sub-Molecule to the set indicated by subMolecules.
• addMoleculeWithRole – appends a sub-Molecule to the set indicated by subMolecules and specifies its childRole value.
• getAtom – given an integer, n, returns the nth Atom in this Molecule's molecularAtoms.
• deleteMolecule – removes a Molecule from the set indicated by subMolecules.
• hasCoords – indicates whether this Molecule has coordinates of a specified type (either 2D or 3D).
• getNonHydrogenAtoms – returns an array of Atoms associated with this Molecule (molecularAtoms), omitting hydrogens.
• getCalculatedMolecularMass – returns the molecular weight of this Molecule by summing the weights of constituent Atoms.

2.7.5 MoleculeFactory
MoleculeFactory is an interface that defines the behavior of something that creates Molecules.
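As a rough illustration of how a few of the Molecule operations listed in 2.7.4 and a MoleculeFactory might look in Java, consider the sketch below. It is not the normative model: the Atom class is a minimal stand-in that carries its own symbol and weight (in the model these come from the associated ChemicalElement), bonds and nested sub-Molecules are omitted, lists are used where the text says arrays, zero-based indexing in getAtom is an assumption, and the single-argument createMolecule signature is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Minimal stand-in for Atom; the symbol and weight fields are simplifications for this sketch. */
class Atom {
    final String atomID;
    final String symbol;
    final double atomicWeight;

    Atom(String atomID, String symbol, double atomicWeight) {
        this.atomID = atomID;
        this.symbol = symbol;
        this.atomicWeight = atomicWeight;
    }
}

/** Sketch of a subset of the Molecule operations; bonds and sub-Molecules are left out. */
class Molecule {
    private final List<Atom> molecularAtoms = new ArrayList<>();

    void addAtom(Atom atom) { molecularAtoms.add(atom); }

    boolean deleteAtom(Atom atom) { return molecularAtoms.remove(atom); }

    /** Returns the nth Atom; zero-based indexing is an assumption of this sketch. */
    Atom getAtom(int n) { return molecularAtoms.get(n); }

    /** Returns the Atom with the given ID, or null if there is none. */
    Atom getAtomByID(String id) {
        for (Atom atom : molecularAtoms) {
            if (atom.atomID.equals(id)) {
                return atom;
            }
        }
        return null;
    }

    /** Returns all Atoms of this Molecule except hydrogens. */
    List<Atom> getNonHydrogenAtoms() {
        List<Atom> heavy = new ArrayList<>();
        for (Atom atom : molecularAtoms) {
            if (!"H".equals(atom.symbol)) {
                heavy.add(atom);
            }
        }
        return heavy;
    }

    /** Sums the weights of the constituent Atoms. */
    double getCalculatedMolecularMass() {
        double mass = 0.0;
        for (Atom atom : molecularAtoms) {
            mass += atom.atomicWeight;
        }
        return mass;
    }
}

/** Simplified factory: the model's createMolecule also takes Bonds. */
interface MoleculeFactory {
    Molecule createMolecule(List<Atom> atoms);
}

class MoleculeSketch {
    /** A trivial factory that copies the given Atoms into a new Molecule. */
    static final MoleculeFactory SIMPLE_FACTORY = atoms -> {
        Molecule molecule = new Molecule();
        for (Atom atom : atoms) {
            molecule.addAtom(atom);
        }
        return molecule;
    };

    public static void main(String[] args) {
        Molecule water = SIMPLE_FACTORY.createMolecule(Arrays.asList(
                new Atom("a1", "O", 15.999),
                new Atom("a2", "H", 1.008),
                new Atom("a3", "H", 1.008)));
        System.out.println("Heavy atoms: " + water.getNonHydrogenAtoms().size());
        System.out.println("Calculated mass: " + water.getCalculatedMolecularMass());
    }
}
```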
2.7.6 Properties
none
2.7.7 Associations
none
2.7.8 Operations
• createMolecule – initializes a Molecule with a set of Atoms and Bonds. (Note: at some point we may want to add an additional input parameter of type Molecule[].)
• getAtomFactory – returns something that can create Atoms.
• getBondFactory – returns something that can create Bonds.

2.8 MoleculeUtil
The interface MoleculeUtil defines the behavior of something that calculates properties for a given Molecule.
2.8.1 Properties
none
2.8.2 Associations
none
2.8.3 Operations
• getCalculatedFormula – automatically generates a molecular formula string.
• getMolecularWeight – returns a double-precision real value for the mass of the Molecule.

2.9 Atom
Atom represents a location within a molecule, generally a chemical atom (see glossary). A set of properties is defined. Properties listed as arrays are generally single-valued for registerable structures (see glossary) but may have zero to many values for query structures.
2.9.1 Properties
• occupancy – whether the position occupied in coordinate space by this Atom actually has a chemical atom within it.
• formalCharges – a value or set of values for this Atom indicating whether it has gained (<0) or lost (>0) electrons relative to the uncombined form of the [chemical] element.
• hydrogenCounts – a value or set of values indicating the number of hydrogen atoms attached to this Atom. These hydrogen atoms may be used for substructure querying or display.
• nonHydrogenCounts – the number or allowed numbers of heavy Atoms attached to this Atom.
• atomID – string identifier attached to this Atom.
• substitutionCount – a query property of this Atom, used exclusively in database queries, indicating the number of heavy atoms attached. It can take non-negative integral values, plus '*' to indicate 'as drawn.'
• hydrogenCount – a string property, distinct from the integer array hydrogenCounts. It corresponds to the MDL molfile field indicating the number of hydrogens that must be present. It can take non-negative integral values, plus '*' to indicate 'as drawn.'
• ringMembership – a query property indicating the number of rings in which this Atom participates. It can take non-negative integral values, plus '*' to indicate 'as drawn.'
• aromaticity – a Boolean that indicates whether this Atom has 'aromaticity' as defined in the glossary.
2.9.2 Associations
• xy – relates the Atom to a set of 2D screen coordinates for display.
• xyz – relates the Atom to a set of 3D coordinates specifying location in space.
• xyzFract – relates the Atom to a set of fractional crystal coordinates.
• immediateParentMolecule – relates the Atom to the Molecule that encapsulates it.
• elementTypes – defines the Atom type by relating it to a ChemicalElement in a periodic table.
• ligands – defines a set of Atoms that are bonded to this Atom.
• atomParities – defines the chirality (if any) of this Atom.
2.9.3 Operations
• addLigand – appends a new Atom to this Atom's list of attachments.
• getLigands – returns an array of bonded Atoms.
• getNonHydrogenLigands – convenience method to provide a list of bonded heavy Atoms.

2.10 AtomFactory
The AtomFactory interface specifies the behavior of things that create Atoms.
2.10.1 Properties
none
2.10.2 Associations
none
2.10.3 Operations
• createAtom – instantiates a new Atom.

2.11 AtomParity
The AtomParity interface defines a generalized pattern of behavior for definitions of atom-level chirality. Atom-level chirality means that the atom has some 'handedness' as a tetrahedral atom with 4 unlike groups around it. The setting for this chirality, returned by the getStereoCenter method, can be either numeric (as in MDL software) or a string (as in Daylight software). Atom parity is optional in some systems; molecular chirality may be fully specified using bond markings. The AtomParity interface is realized by two classes: NumericAtomParity and StringAtomParity, which store the parity as a double-precision real and a string, respectively.
2.11.1 Properties
none
2.11.2 Associations
none
2.11.3 Operations
• equals – compares two different AtomParities, which may be in different formats.
• isChiral – returns true if the Atom has defined asymmetry, for example, as a tetrahedral atom with 4 different substituents.
• isSpecified – returns true when isChiral returns true AND a specific parity is set.
• getStereoCenter – returns the actual value for this stereocenter.

2.12 Bond
Bond represents a chemical linkage between two Atoms. Properties listed as arrays are generally single-valued for registerable structures (see glossary) but may have zero to many values for query structures.
2.12.1 Properties
• bondOrders – an indication of the strength of the Bond. Common values include SINGLE, DOUBLE, TRIPLE, AROMATIC, RESONANT, ANY.
• bondLength – the distance between the atoms on either side of the Bond. This only has meaning for 3D structures.
• Topology – a query property that indicates whether the Bond is part of a ring, excluded from ring membership or unspecified.
2.12.2 Associations
• bondAtoms – relates the Bond to its constituent (pair of) Atoms.
• bondStereos – specifies the stereochemistry of the Bond.
• immediateParentMolecule – the container Molecule.
2.12.3 Operations
• getAtomByID – returns the constituent Atom of this Bond having the specified ID.
• getOtherAtom – given one Atom, returns the second Atom constituent of this Bond.
• contains – returns true if the specified Atom is part of this Bond.

2.13 Electron
Electron is reserved for future use.
2.13.1 Properties
• none
2.13.2 Associations
• none
2.13.3 Operations
• none

2.14 Isotope
Isotope represents one possible configuration of an atom of a given [chemical] element, defined by its mass. (Isotopes of a given [chemical] element differ from one another in the number of neutrons.)
2.14.1 Properties
• mass – inertial property of the atom.
• abundance – fraction of the [chemical] element consisting of this Isotope.
2.14.2 Associations
• none
2.14.3 Operations
• none

2.15 IsotopeFactory
IsotopeFactory provides a uniform interface for things that create Isotopes.
2.15.1 Properties
• none
2.15.2 Associations
• none
2.15.3 Operations
• createIsotope – instantiates an Isotope object.

2.16 AtomFactory
AtomFactory provides a uniform interface for things that create Atoms.
2.16.1 Properties
• none
2.16.2 Associations
• none
2.16.3 Operations
• createAtom – instantiates an Atom object.

2.17 BondFactory
BondFactory provides a uniform interface for things that create Bonds.
2.17.1 Properties
• none
2.17.2 Associations
• none
2.17.3 Operations
• CreateBond – instantiates an empty Bond object.
• createBondForAtoms – instantiates a Bond given a pair of Atoms.

2.18 NumericAtomParity
NumericAtomParity is a realization of the AtomParity interface that uses numbers to hold the parity information.
2.18.1 Properties
• numericParity – the value of the current Atom's stereochemical parity.
• STEREO_PARITY_UNKNOWN – an indication that no information is available about the current Atom's stereochemical parity, probably because of a lack of experimental data.
• STEREO_PARITY_UNSPECIFIED – an indication that no information has been provided about the current Atom's stereochemical parity, probably because the property has not been given a value.
• STEREO_PARITY_EVEN – an indication that the mathematical function describing the current Atom's stereochemical parity gives a value divisible by 2.
• STEREO_PARITY_ODD – an indication that the mathematical function describing the current Atom's stereochemical parity gives a value not evenly divisible by 2.
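The relationship between the AtomParity interface (2.11) and its numeric realization can be sketched in Java as shown below. This is a minimal sketch under stated assumptions: the numeric values assigned to the STEREO_PARITY constants are hypothetical (the submission does not give them), the chirality flag is passed in rather than perceived from the structure, and the cross-format equals operation is omitted.

```java
/** Sketch of the AtomParity interface; the cross-format equals operation is omitted. */
interface AtomParity {
    boolean isChiral();
    boolean isSpecified();
    Object getStereoCenter();
}

/** Numeric realization: the parity is held as a double-precision real. */
class NumericAtomParity implements AtomParity {
    // Hypothetical constant values, chosen only for this sketch.
    static final double STEREO_PARITY_UNSPECIFIED = 0.0;
    static final double STEREO_PARITY_ODD = 1.0;
    static final double STEREO_PARITY_EVEN = 2.0;
    static final double STEREO_PARITY_UNKNOWN = 3.0;

    private final boolean chiral;
    private final double numericParity;

    /** The chiral flag is supplied by the caller; real code would derive it from the Atom. */
    NumericAtomParity(boolean chiral, double numericParity) {
        this.chiral = chiral;
        this.numericParity = numericParity;
    }

    @Override
    public boolean isChiral() {
        return chiral;
    }

    /** True only when the Atom is chiral AND a specific (odd or even) parity is set. */
    @Override
    public boolean isSpecified() {
        return chiral
                && (numericParity == STEREO_PARITY_ODD || numericParity == STEREO_PARITY_EVEN);
    }

    /** Returns the stored parity value, boxed as a Double. */
    @Override
    public Object getStereoCenter() {
        return numericParity;
    }

    public static void main(String[] args) {
        AtomParity odd = new NumericAtomParity(true, STEREO_PARITY_ODD);
        AtomParity unknown = new NumericAtomParity(true, STEREO_PARITY_UNKNOWN);
        System.out.println("odd specified: " + odd.isSpecified());
        System.out.println("unknown specified: " + unknown.isSpecified());
    }
}
```

Returning Object from getStereoCenter is one way to let numeric and string realizations share the interface; a generic type parameter would be an alternative design.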
2.18.2 Associations
• none
2.18.3 Operations
• none

2.19 StringAtomParity
StringAtomParity is a realization of the AtomParity interface that uses text to hold the parity information.
2.19.1 Properties
• SMILESString – the value of the current Atom's stereochemical parity.
• STEREO_PARITY_L – an indication that the current Atom's stereochemical configuration resembles a standard 'L' Atom.
• STEREO_PARITY_R – an indication that the current Atom's stereochemical configuration is classified as 'R' according to the Cahn-Ingold-Prelog rules.
• STEREO_PARITY_@ – an indication that the current Atom has substituents arranged in an 'anticlockwise' fashion.
• STEREO_PARITY_@@ – an indication that the current Atom has substituents arranged in a 'clockwise' fashion.
• STEREO_PARITY_S – an indication that the current Atom's stereochemical parity is classified as 'S' according to the Cahn-Ingold-Prelog rules.
• STEREO_PARITY_D – an indication that the current Atom's stereochemical parity resembles a standard 'D' Atom.
• STEREO_PARITY_UNKNOWN – an indication that no information is available about the current Atom's stereochemical parity, probably because of experimental limitations.
• STEREO_PARITY_UNSPECIFIED – an indication that no information has been provided about the current Atom's stereochemical parity, possibly because the property has not been given a value.
2.19.2 Associations
• none
2.19.3 Operations
• none

2.20 ChemicalElement
ChemicalElement provides a way of describing the properties of a category of Atoms, related by having the same number of protons in the nucleus.
2.20.1 Properties
• atomicNumber – an integer that identifies this [chemical] element, equal to the number of protons in the nucleus.
• atomicWeight – the gravitational mass of the [chemical] element relative to carbon (a standard). This number is generally a weighted average of the Isotopes that make up the [chemical] element.
• covalentRadius – one half of the distance between two singly-bonded atoms of the [chemical] element.
• symbol – 1-3 letters that are used to represent the [chemical] element.
• VDWRadius – the closest a non-bonded atom can approach without incurring very strong repulsive forces.
• valenceElectrons – number of electrons in the outermost shell of an atom of this [chemical] element.
2.20.2 Associations
None
2.20.3 Operations
None

2.21 BondStereo (new properties only)
2.21.1 Properties
• BONDSTEREO_CIS – describes a double bond whose principal substituents ('principal' is loosely defined) are on the same side.
• BONDSTEREO_TRANS – describes a double bond whose principal substituents ('principal' is loosely defined) are on opposite sides.
• BONDSTEREO_Z – describes a double bond whose principal substituents (defined according to Cahn-Ingold-Prelog rules) are on the same side.
• BONDSTEREO_E – describes a double bond whose principal substituents (defined according to Cahn-Ingold-Prelog rules) are on opposite sides.
• BONDSTEREO_WEDGE – describes a bond that appears to be 'coming out of the page.'
• BONDSTEREO_HATCH – describes a bond that appears to be 'going into the page.'
• BONDSTEREO_UNKNOWN – describes a bond whose stereochemistry is not available because of experimental limitation.

2.22 Bond (new properties only…)
2.22.1 Properties
• BOND_ORDER_SINGLE – a bond containing only a single shared pair of electrons.
• BOND_ORDER_DOUBLE – a bond containing two shared pairs of electrons.
• BOND_ORDER_TRIPLE – a bond containing three shared pairs of electrons.
• BOND_ORDER_ANY – a bond that may contain any number of shared electron pairs.
• BOND_ORDER_RESONANT – a bond whose electronic configuration cannot be described by an integral number of shared electron pairs.
• BOND_ORDER_AROMATIC – a bond whose properties are in between those of single and double bonds and which also reflects a strong interaction with adjacent aromatic bonds.
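The Bond operations of 2.12.3 and the bond order values above can be illustrated with the following Java sketch. It rests on several assumptions: the bond orders are modeled as an enum rather than as BOND_ORDER_ constants, bondOrders is held as a set so that a query structure can allow several orders while a registerable structure holds one, the Atom class is a minimal stand-in with only an ID, and getOtherAtom throws an exception for an Atom that is not part of the Bond (the submission does not say what should happen in that case).

```java
import java.util.EnumSet;
import java.util.Set;

/** Bond orders modeled as an enum; the model itself lists BOND_ORDER_ constants. */
enum BondOrder { SINGLE, DOUBLE, TRIPLE, ANY, RESONANT, AROMATIC }

/** Minimal stand-in for Atom, carrying only its ID. */
class Atom {
    final String atomID;

    Atom(String atomID) {
        this.atomID = atomID;
    }
}

/** Sketch of a Bond between two Atoms with one or more allowed bond orders. */
class Bond {
    private final Atom first;
    private final Atom second;
    private final Set<BondOrder> bondOrders;

    /** One order for a registerable structure; extra alternatives for a query structure. */
    Bond(Atom first, Atom second, BondOrder order, BondOrder... alternatives) {
        this.first = first;
        this.second = second;
        this.bondOrders = EnumSet.of(order, alternatives);
    }

    /** Returns true if the specified Atom is part of this Bond. */
    boolean contains(Atom atom) {
        return atom == first || atom == second;
    }

    /** Given one Atom of the Bond, returns the other; rejects Atoms outside the Bond. */
    Atom getOtherAtom(Atom atom) {
        if (atom == first) {
            return second;
        }
        if (atom == second) {
            return first;
        }
        throw new IllegalArgumentException("Atom is not part of this Bond");
    }

    /** Returns the constituent Atom with the given ID, or null if neither matches. */
    Atom getAtomByID(String id) {
        if (first.atomID.equals(id)) {
            return first;
        }
        if (second.atomID.equals(id)) {
            return second;
        }
        return null;
    }

    /** Returns a copy of the allowed bond orders. */
    Set<BondOrder> getBondOrders() {
        return EnumSet.copyOf(bondOrders);
    }

    public static void main(String[] args) {
        Atom carbon = new Atom("a1");
        Atom oxygen = new Atom("a2");
        Bond carbonyl = new Bond(carbon, oxygen, BondOrder.DOUBLE);
        Bond queryBond = new Bond(carbon, oxygen, BondOrder.SINGLE, BondOrder.DOUBLE);
        System.out.println(carbonyl.getOtherAtom(carbon).atomID);
        System.out.println(queryBond.getBondOrders());
    }
}
```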
2.23 ChemicalElementFactory
ChemicalElementFactory provides a uniform interface for things that create ChemicalElements.
2.23.1 Properties
None
2.23.2 Associations
None
2.23.3 Operations
• createChemicalElement – instantiates a ChemicalElement having a specified atomic number.
• createIsotopicallyLabeledChemicalElement – instantiates a ChemicalElement having a specified atomic number and a given set of Isotopes.

2.24 IsotopeSet
An IsotopeSet is a grouping of Isotopes that define the composition of a sample of the [chemical] element.
2.24.1 Properties
None
2.24.2 Associations
None
2.24.3 Operations
• getMostCommonIsotope – returns the most prominent Isotope of the current [chemical] element.
• getIsotopeAbundance – returns the fraction of a given Isotope within the set.
• getAveragedWeight – returns the mean of the weights of the Isotopes making up the set.

2.25 IsotopeSetFactory
IsotopeSetFactory provides a uniform interface for things that create IsotopeSets.
2.25.1 Properties
None
2.25.2 Associations
None
2.25.3 Operations
• createIsotopeSet – instantiates an IsotopeSet.

2.26 PeriodicTable
A grouping of ChemicalElements that provides a complete representation of all the atom types used in some chemical system.
2.26.1 Properties
None
2.26.2 Associations
None
2.26.3 Operations
• getElement – given an atomic symbol, returns the corresponding ChemicalElement.
• getElementByAtomicNumber – given a number, returns the ChemicalElement with that many protons.
• getSymbol – given a number, returns the atomic symbol for the ChemicalElement with that many protons.

2.27 PeriodicTableFactory
PeriodicTableFactory provides a uniform interface for things that create PeriodicTables.
2.27.1 Properties
None
2.27.2 Associations
None
2.27.3 Operations
• createPeriodicTable – instantiates an empty PeriodicTable.

2.28 BondStereo
Interface BondStereo defines the behavior of classes that define a Bond's stereochemistry.
2.28.1 Properties
None
2.28.2 Associations
None
2.28.3 Operations
• equals – provides a means of comparing the stereochemistry of two Bonds.
• getValue – returns the value of the current Bond's stereochemistry.

2.29 Coordinate2
Coordinate2 represents a set of x, y coordinates which specify the placement of an atom on a 2D display grid.
2.29.1 Properties
• x – the abscissa of this coordinate set.
• y – the ordinate of this coordinate set.
2.29.2 Associations
None
2.29.3 Operations
None

2.30 Coordinate3
Coordinate3 represents a set of x, y, z coordinates which specify the placement of an atom in 3D space.
2.30.1 Properties
• x – the abscissa of this coordinate set.
• y – the ordinate of this coordinate set.
• z – the depth coordinate of this set.
2.30.2 Associations
None
2.30.3 Operations
None

2.31 AbstractAngle
A generalization of the behavior of bond (or 3-center) angles and torsional (or 4-center) angles. Realizations: BondAngle and TorsionAngle.
2.31.1 Properties
• angle – the measure of this item.
2.31.2 Associations
• units – specifies the measurement type of this angle (generally, radians or degrees).
• angleAtoms – a pointer to the (3 or 4) Atoms that constitute this angle.
2.31.3 Operations
None

2.32 Formula
Formula represents a listing of the atoms and quantities within a Molecule. It is built of formulaElements (q.v.) blocks. Since there are multiple ways of calculating the molecular formula for a given molecule (depending on, for example, whether salt fragments that are not explicitly included in the structure are counted), there may be more than one Formula for a given Molecule and, therefore, Formulas may contain sub-Formulas.
2.32.1 Properties
• overallCount – total number of atoms in this formula.
2.32.2 Associations
• formulaFormulaElements – relates the Formula to the constituent formulaElements.
2.32.3 Operations
• createFromString – initializes a Formula from text input.
• addElement – appends a [chemical] element (including a count) to the Formula.
• getElementTypes – returns an array representing the types of atoms present.
• getElementCounts – returns an array of numbers representing the number of times each type of atom (from the array returned by getElementTypes) occurs in the Formula.
• addFormula – appends a Formula representation to this Formula.
• getFormulaAtoms – returns an array of Atoms represented herein.
• deleteAllFormulas – removes all sub-Formulas from this Formula.
• deleteFormula – removes one sub-Formula from this Formula.
• getFormalCharge – returns the surfeit or deficit of electrons in this Formula.
• setFormalCharge – changes the surfeit or deficit of electrons in this Formula.
• getCalculatedMolecularMass – returns the weight generated for this Formula.
• getFormattedString – generates a printable text representation of the Formula, given a display mode.

2.33 FormulaElement
FormulaElement refers to a combination of atom type ([chemical] element) and count. It is used in defining molecular formulas.
2.33.1 Properties
• ElementType – pointer to an entry in a periodic table defining a kind of atom.
• Count – the number of times an ElementType occurs in a formula unit.
2.33.2 Associations
None
2.33.3 Operations
None

2.34 Crystal
Crystal represents a crystal unit, defining the crystal structure of a Molecule.
2.34.1 Properties
• ACELL – parameter of a crystal unit
• BCELL – parameter of a crystal unit
• CCELL – parameter of a crystal unit
• ALPHA – parameter of a crystal unit
• BETA – parameter of a crystal unit
• GAMMA – parameter of a crystal unit
• Z-FLOAT – parameter of a crystal unit
• Z-INT – parameter of a crystal unit
• SPACEGROUP – parameter of a crystal unit
2.34.2 Associations
None
2.34.3 Operations
None

2.35 Query and Collection
• ChemSearchEngineManager needs to be very generic to provide lookups for various legacy databases.
• To simplify the interface of ChemSearchEngine, the association between SearchCriteria and ChemSearchEngine was dropped. SearchCriteriaGroup will have one or more SearchCriteria.
• SearchableProperty will have an attribute called intrinsic_flag to indicate whether a property is intrinsic or extrinsic.
• The SearchCriteria interface will have an optional attribute called factor, used, for example, to indicate a similarity factor for the SIMILAR ACCURACY QUALIFIER.
• A row in ResultSet contains a column for Molecule (from CML) and as many columns as there are queried extrinsic properties. Intrinsic properties will be included as attributes of the Molecule.
• Split() in ChemSearchEngine splits the query into sub-queries: intrinsic and extrinsic (external properties) queries. The latter may be stored in a conventional relational database.
• ResultSet can be updateable for insert, delete, and update.
• The following diagram shows Search and Collection Modules interacting with each other to perform a chemical structure search within a collection.

2.36 Class Diagrams
Search Module – holds interfaces that are used to search against legacy databases as well as collections.
Figure 6 Search Module
Figure 7 Search Module (Properties Classes)
DB Module – holds interfaces to interact with legacy databases.
Figure 8 DB Module
Collection Module – represents a collection of chemical elements.
Figure 9 Collection Module
The collection module makes use of the Java API class Collections (public class Collections, which extends Object). This class consists exclusively of static methods that operate on or return collections. It contains polymorphic algorithms that operate on collections, "wrappers", which return a new collection backed by a specified collection, and a few other odds and ends. The elements of these collections are Molecules.
Atom
Figure 10 Atom
Figure 11 Bond
Figure 12 Coordinates
Figure 13 Formula
Figure 14 Molecule
Figure 15 Other
Figure 16 MOLUTIL Module

2.37 Recommendation
The following figures provide a pictorial description of the proposed recommendation.
Figure 181817 Structure and Translation Figure 191918 Architecture The DB Interface module has been specified in a database independent manner. Consequently, any commercial database can be used to persist the objects mentioned throughout this standard. Figure 202019 Processing diagram 2.37.1 Conformance Points The UML model included here subsumes all the elements of the CSAR model including the CMLCore schema and the additional processing classes. Consequently, the UML model is normative. 2.37.12.37.2 Details The following paragraphs provide detailed descriptions of how the recommendation satisfies each of the requirements of the RFP. Define interfaces that provide uniform access to 2-D chemical structures associated intrinsic properties, and collections thereof, independent of underlying implementation. These interfaces shall include the functionality listed below. In all of these, the interfaces must handle stereochemistry, the associations between core structures and their salts, and the association between tautomeric structures. Support the lifecycle of single as well as collections of 2-D chemical structures including associated intrinsic properties. Lifecycle support includes create, insert (into collections), update and delete. Intrinsic properties include, but are not limited to: unique identifier (if assigned), atoms, bond types, connectivity, 3-D spatial coordinates, molecular weight and molecular formula. Any calculations used must make available the method of calculation used, a description of the method, and literature references if available. Not applicable this time. ? Search for and retrieve 2-D chemical structures from collections, independent of underlying storage implementation, using the following methods: ? Exact match – search for all 2-D structures in a collection that exactly match a query structure. ? Similarity – search for all 2-D structures in a collection that are loosely related to a query structure. ? 
Substructure – search for all 2-D structures in a collection that contain a user-defined partial chemical structure. ? Tautomer – search for all 2-D structures in a collection that are tautomers of a query structure. ? Generic (i.e., Markush, R-group, etc.) – search for all 2-D structures in a collection in which the identified structures are contained in a query represented by a generic structure. ? Search for intrinsic properties, as defined above ? Searches consisting of any combination of the above methods using Boolean and comparison predicates (<, >, =, ~, AND, OR, etc.). For example: find all 2-D structures that match a query structure and whose molecular weight is equal to some number. Processing Classes Involved Search Module?Holds interfaces that are used to search against legacy databases as well as collections. DB Interface Module?Holds interfaces to interact with legacy databases. Collection Module?Represents a collection of chemical elements. The following diagram shows Search and DB Interface Modules interacting with each other to perform a chemical structure search against database. Figure 212120 Sea Search and DB Interface Modules interacting with each other to perform a chemical structure search against database. 3 Glossary Aromaticity This is a quality possessed by many, many common compounds, from benzene to phenylalanine in which multiple double bonds, 'conjugate,' sharing electrons. This sharing produces a structure of lower energy than one in which the double bonds re isolated. (The above is very simple explanation of a common phenomenon. For more information, see any basic text on organic chemistry.) Different software systems define aromaticity differently. The biggest differences are whether aromaticity is specified in atom types or in bonds or perceived from the arrangement of bonds. Daylight's SMILES notation generally specifies aromatic atoms using lower-case letters. 
For example, benzene (an aromatic ring of 6 carbon atoms) is defined as 'c1ccccc1'. Because the atoms are denoted with lower-case letters, they are distinguished from cyclohexane (a non-aromatic or 'aliphatic' ring of carbon atoms (C1CCCCC1). MDL's molfile, on the other hand, defines aromatics from an arrangement of alternating single and double bonds within a ring of appropriate size. Still other systems explicitly define aromaticity using explicitly designated aromatic bonds. To make matters even more complicated, MDL supports an aromatic bond type that can be used for bonds within a molfile that can be used to query a database but not registered. Atom The smallest particle of an [chemical] element that can exist either alone or in combination, retaining any properties of the [chemical] element. We extend this definition to include points in space (that can be used to define the position of other points); 'superatoms' or atoms that represent a collection of other atoms. Bit string A contiguous set of characters consisting entirely of 1s and 0s. A bit string can be used to encode a good deal of information in a compact way. Bond A chemical link between two atoms. Bonds are classified as ionic (transfer of electrons from one atom to another); covalent (sharing of electrons, generally an equal number from each atom); dative (sharing of two electrons from a single atom); or hydrogen (attraction of electron-starved hydrogen atom to electron rich heavy atom.) Charge A deficiency or excess of electrons on a particular object, giving rise to a positive or negative charge, respectively. (www.allwords.com) Molecules can carry charges, which are often attributed to specific atoms. Chiral Adjective applied to a molecule that cannot be superimposed on its mirror image. ('Chirality' is the corresponding noun.) 
An example of a chiral structure for 2-chloro-2-iodo-butane is shown below:

The compound has a mirror image that cannot be superimposed on it; that mirror image is termed its enantiomer.

Connection table
A means of representing the atoms contained within a molecule and the bonds that hold them together.

Counterion
A set of one or more bonded atoms, with opposite charge and generally smaller size, that accompanies another charged set of bonded atoms.

Cyclic, acyclic bonds
When chemical bonds occur within a ring, they are termed 'cyclic.' 'Acyclic bonds,' by contrast, occur in open-chain structures.

Electrons (p or pi)
Molecules containing double or triple bonds typically have electrons that project outside the line between the atoms in the bond. When two or more double or triple bonds are in close proximity, the electrons in the pi bonds interact and spread out over all the atoms involved.

Fingerprints
Fingerprints are bit strings that are based on features of a chemical structure. In this regard, they are similar to Structural keys (q.v.). Fingerprints differ from keys in that the bits they contain are typically 'folded over,' or combined with one another using a logical OR, to reduce the size of the string.

Heteroatoms
Atoms that are neither carbon nor hydrogen are considered 'heteroatoms' and are often handled differently by software systems.

Markush structure
It is common to represent chemical structures as a common core containing marked substitution sites, plus a set of possible structures for each substitution point.
These Markush structures can be used in several ways:
1) to represent a set of compounds analyzed in order to determine the effect of varying substituents on compound activity (SAR, short for 'Structure-Activity Relationships')
2) to represent a set of compounds produced using combinatorial techniques (synthesized by serially attaching different chemical groups to a common core)
3) to produce a fine-tuned substructure query
In this document, we use one Assembly to represent the core (atoms of type 'R' designate the substitution points), plus one additional Assembly for each R-Group.

Molecule
The smallest particle into which a [chemical] element or a compound can be divided without changing its chemical and physical properties; a group of like or different atoms held together by chemical forces. (www.allwords.com) Generally, composed of atoms held together by bonds. A molecule can represent:
1. an entire chemical entity
2. one portion of a complex chemical entity (such as a mixture, or a set of tautomers or conformers)
3. a collection of other molecules as defined in 2) above

Orbital
A subdivision of a nuclear shell containing zero, one, or two electrons (m-w.com)

Query structure
Most chemical software systems require structures to meet certain requirements in order to be entered into a database or used for calculation. A structure that meets these criteria is classified as 'registerable.' (Generally, atom types must correspond to entities in the periodic table and bond types must be well defined.) Additionally, the software will allow a chemist to draw structures that do not meet the criteria for registration but can be used to retrieve molecules from a database using substructure searching (see below). An example of a structure that is valid for query but not for registration is one with just one bond designated as aromatic.

Radical
An atom or a group of atoms with at least one unpaired electron.
(www.allwords.com)
singlet – a radical with two unpaired electrons whose spins are opposite
doublet – a radical with a single unpaired electron
triplet – a radical with two unpaired electrons whose spins are aligned

Registerable structure
A chemical structure that meets the criteria for inclusion in a repository. (Compare with 'Query structure' above.)

Similarity Search
A chemical data query in which a user seeks compounds that resemble a given structure without necessarily having a substructure (q.v.) match. The resemblance is often intuitive to a chemist but has a mathematical basis in terms of common features or properties. Similarity searches typically include a cutoff value X so the user sees only structures having X% or more similarity to the query structure. The mathematics of similarity searching requires:
1. A means of evaluating individual chemical structures. Generally, this involves computing keys (q.v.) or fingerprints (q.v.).
2. A metric for comparing keys or fingerprints from two structures. The most common of these is the Tanimoto coefficient (q.v.).
At search time, a user-supplied query structure is compared with every other structure in the database, and those with a similarity metric greater than the cutoff are considered 'hits.' (Bit operations are typically fast enough to make a large number of comparisons practicable.)

Spin state
A way of characterizing the angular momentum of electrons. Individual electrons may spin 'up' or 'down.' Two or more electrons may have their spins parallel or antiparallel.

Stereochemistry
The study of the effect of the configuration of atoms around asymmetric atoms and bonds.

Structural keys
When compounds are registered into most chemical search software systems, the structures are scanned for the presence of predefined features, such as heteroatoms (q.v.) or 6-membered rings. Each key sets a bit within a string that may be hundreds of characters long. These keys are used for substructure and similarity searching.
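The similarity comparison described in the Similarity Search and Structural keys entries can be sketched in Java. This is only an illustration under stated assumptions: keys or fingerprints are assumed to be held as java.util.BitSet values, and the class name TanimotoSketch, its method names, the treatment of two empty bit strings, and the sample bit positions are all hypothetical; none of them is part of the submitted API. The similarity is computed with the Tanimoto coefficient, c / (a + b - c), where c counts the bits ON in both strings and a and b count the bits ON in each string.

```java
import java.util.BitSet;

public class TanimotoSketch {

    /** Builds a bit string with the given positions switched ON (hypothetical helper). */
    public static BitSet bits(int... positions) {
        BitSet set = new BitSet();
        for (int p : positions) {
            set.set(p);
        }
        return set;
    }

    /** Tanimoto coefficient c / (a + b - c) of two key or fingerprint bit strings. */
    public static double tanimoto(BitSet keysA, BitSet keysB) {
        BitSet common = (BitSet) keysA.clone();
        common.and(keysB);                 // bits ON in both strings
        int c = common.cardinality();
        int a = keysA.cardinality();       // bits ON in structure A
        int b = keysB.cardinality();       // bits ON in structure B
        int denominator = a + b - c;
        // Assumption: two empty bit strings are reported as 0.0 rather than dividing by zero.
        return denominator == 0 ? 0.0 : (double) c / denominator;
    }

    /** A candidate is a hit when its similarity reaches the cutoff (a fraction, e.g. 0.7). */
    public static boolean isHit(BitSet query, BitSet candidate, double cutoff) {
        return tanimoto(query, candidate) >= cutoff;
    }

    public static void main(String[] args) {
        BitSet query = bits(0, 1, 2, 5);
        BitSet candidate = bits(1, 2, 5, 7);
        // a = 4, b = 4, c = 3, so the coefficient is 3 / (4 + 4 - 3)
        System.out.println(tanimoto(query, candidate));
        System.out.println(isHit(query, candidate, 0.7));
    }
}
```

The sketch treats the cutoff as inclusive, following the 'X% or more' wording; a system that requires strictly greater similarity would compare with > instead.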
Substructure
One chemical structure is said to be a substructure of another if the first structure can be located within the second. (The second is said to be the superstructure of the first.) All structures are substructures of themselves. A substructure search scans a database for all substructural matches.

Tanimoto coefficient
Mathematical formula for evaluating the similarity of two structures:

$$S_{A,B} = \frac{c}{a + b - c}$$

where
$S_{A,B}$ = similarity of structures A and B
$c$ = number of features the two structures have in common for the given property. (In the case of structure keys or fingerprints, this means the number of ON bits when the two bit strings are ANDed.)
$a$ = number of features ON in structure A
$b$ = number of features ON in structure B
[From J. Chem. Inf. Comput. Sci. 1998, 38, 983-996]

Tautomer
'One of two or more structural isomers that exist in equilibrium and are readily converted from one isomeric form to another.' From http://www.bartleby.com/65/ta/tautomer.html . An illustration of the tautomers of ethyl acetoacetate is shown:

Valence
The number of bonds an atom has to other atoms.

Appendix A: UML Use Cases

Actor Catalogue

Actor | Definition
Client | A logical client that represents any given system that will interface with the provided IDL API.

Master Use Case

Figure: General Use Case Scenario

Detailed Use Case Scenarios

Add Chemical Structure
Desired Outcome: A chemical structure is inserted into the database.
Entered When: Actor invokes the insert function.
Finished When: A chemical structure is inserted into the database and a confirmation is returned to the actor.
Description: The purpose of this function is to insert a chemical structure into the database. A chemical structure can be specified using CML.
Data Elements:
Name | Description
Chemical structure | A chemical structure to be inserted into the database.

Add Chemical Structure – Add Unique Chemical Structure
Desired Outcome: A chemical structure is inserted into the database only if it will not be a duplicate.
Entered When: Actor invokes the insert_if_not_duplicate function.
Finished When: A chemical structure is inserted into the database and a confirmation is returned to the actor. A collection of duplicated chemical structures is returned if any duplicate is found.
Description: The purpose of this function is to insert a unique chemical structure into the database. An exact match search will be performed to ensure the uniqueness of the input chemical structure. A chemical structure can be specified using CML.
Data Elements:
Name | Description
Chemical structure | A chemical structure to be inserted into the database.

Update Chemical Structure
Desired Outcome: Existing chemical structure data in the database is updated.
Entered When: Actor invokes the update function.
Finished When: Existing chemical structure data in the database is updated and a confirmation is returned to the actor.
Description: The purpose of this function is to update existing chemical structure data in the database. A chemical structure can be specified using CML.
Data Elements:
Name | Description
Chemical structure | A chemical structure to replace the existing data.

Delete Chemical Structure
Desired Outcome: An existing chemical structure in the database is removed.
Entered When: Actor invokes the delete function.
Finished When: An existing chemical structure in the database is removed and a confirmation is returned to the actor.
Description: The purpose of this function is to delete an existing chemical structure in the database. The chemical structure to be deleted can be specified using CML or an ID.
Data Elements:
Name | Description
Chemical structure/ID | Identifier of the chemical structure to be deleted.

Search Chemical Structure
Desired Outcome: A collection of structures meeting the search criteria is returned.
Entered When: Actor invokes the search function.
Finished When: A collection of structures is returned. An empty collection is returned if no structure meets the specified criteria.
Description: The purpose of this function is to identify a collection of structures meeting the specified criteria. Search criteria can be specified by a query string and/or a structure (CML). The returned collection is in CML, which can be transformed into another format such as SMILES, MOL, or SLN.
Data Elements:
Name | Description
Query string | Specifies the search criteria.
Comments: The following are sample query strings used in MDL ISIS/Host:
    molstructure = [{mjk38smasd903kqlads90rmlw9masksoaskdoq}]
    molstructure tautomer [{mjk38smasd903kqlads90rmlw9masksoaskdoq}]
    pdnum = '0123456-0000'
    pdnum like '0123456-%'
    mol.weight < 500
The following are sample query strings used in the Daylight DayCart (Daylight Chemistry Cartridge for Oracle), http://www.daylight.com/meetings/mug2000/Delany/cartridge.html :
    select * from medium where exact(smiles, 'O=c1ccocc1') = 1
    select count(smiles) from large where contains(smiles, '>>O=c1c(C)cocc1') = 1;

Exact Match
    molstructure = [{mjk38smasd903kqlads90rmlw9masksoaskdoq}]
    select * from medium where exact(smiles, 'O=c1ccocc1') = 1

Similarity
    molstructure = [{mjk38smasd903kqlads90rmlw9masksoaskdoq}] and factor = 0.7

Substructure
    molstructure sss [{mjk38smasd903kqlads90rmlw9masksoaskdoq}]
    select * from medium where contains(smiles, 'O=c1ccocc1') = 1

Tautomer
    molstructure tautomer [{mjk38smasd903kqlads90rmlw9masksoaskdoq}]
    select * from medium where tautomer(smiles, 'O=c1ccocc1') = 1
Comments: (from Tripos' RFI) There are potentially different methods of working with tautomers. One is to recognize the potential for rearrangement at registration into the chemical database, with flags that highlight the areas where tautomerism can take place. An alternative method is to register the tautomers as different compounds and then flag their tautomeric relatives. The latter is what PD does.

Generic
Comments: i.e.
Markush, R-group, etc.

Intrinsic Properties
    mol.weight < 500

Combination
    molstructure sss [{mjk38smasd903kqlads90rmlw9masksoaskdoq}] and mol.weight < 500

Get Intrinsic Properties
Desired Outcome: An intrinsic property of a chemical structure is returned.
Entered When: Actor invokes the access (get) function.
Finished When: An intrinsic property of a chemical structure is returned to the actor.
Description: The purpose of this function is to query an intrinsic property of a chemical structure.
Data Elements:
Name | Description
Intrinsic property | Name of the intrinsic property to be queried.

Translate Chemical Structure from One to Another
Desired Outcome: A chemical structure representation of one type is transformed into another type.
Entered When: Actor invokes the translate function.
Finished When: A chemical structure representation of one type is transformed into another type, which is then returned to the actor.
Description: The purpose of this function is to transform the representation of one chemical structure into another representation.
Data Elements:
Name | Description
Structure representation type | Source representation.
Structure representation | Chemical structure to be translated.
Structure representation type | Destination representation.

Appendix B: Java Code

This appendix contains a sampling of the Java code generated for the project.
The complete Java API and XMI files can be found at: http://www.intellsolutions.com/serv01.htm

    package pim_chem.cml;

    public class BondStereo
    {
        public String BONDSTEREO_CIS = "CIS";
        public String BONDSTEREO_TRANS = "TRANS";
        public String BONDSTEREO_Z = "Z";
        public String BONDSTEREO_E = "E";
        public String BONDSTEREO_WEDGE = "WEDGE";
        public String BONDSTEREO_HATCH = "HATCH";
        public String BONDSTEREO_UNKNOWN = "UNKNOWN";
        public static final String STRING = "STRING";
        public static final String SYMBOL = "SYMBOL";
        public static final String CIS = "C";
        public static final String TRANS = "T";
        public static final String WEDGE = "W";
        public static final String HATCH = "H";
        public static final String UNKNOWN = "U";

        /**
         *
         * @param obj
         * @return boolean
         * @roseuid 3B48CBF4033F
         */
        public boolean equals(Object obj)
        {
            return true;
        }

        /**
         *
         * @return java.lang.String
         * @roseuid 3B48CBF4034A
         */
        public String getValue()
        {
            return null;
        }
    }

    //Source file: F:\\OMG\\pim_chem\\search\\SelectPropertyGroup.java

    package pim_chem.search;

    public interface SelectPropertyGroup
    {
        public SearchableProperty theSearchableProperty;

        /**
         * @param arg0
         * @return boolean
         * @roseuid 3A120EE20091
         */
        public boolean addSelectProperty(SelectProperty arg0);

        /**
         * @param arg0
         * @return boolean
         * @roseuid 3A120F2C0343
         */
        public boolean removeSelectProperty(SelectProperty arg0);

        /**
         * @return boolean
         * @roseuid 3A120F4A03C9
         */
        public boolean removeAll();
    }

Appendix C: Use Cases for Chemistry

1) Structure search
   A) using ISIS/Draw to sketch a molecule for a substructure search of a Daylight database
      I simple structure (e.g., biphenyl)
      II more complicated structure (disconnected fragments; atom lists; substitution counts; charges, etc.)
         a) two fragments (twofrag.mol)
         b) two fragments + atom list (tfraglist.mol)
         c) two fragments + atom list + substitution counts (tfsub.mol)
      III R Group Query
      IV S-Group data
   B) using the Ertl Java editor to generate a SMILES to search an MDL database
   C) browsing the hits
      I Browsing regular structures
      II Browsing structures with polymeric constructs
      III Browsing structures to which user does not have rights
2) Use ChemSymphony to search the Web-based Available Chemicals Directory (ACD) system for suppliers of a given set of structures
   A) superstructures of aromatic acid chlorides – very simple substructure to find a large class of compounds (aromacchlor.mol)
   B) p-nitrobenzoic acid – search for a specific compound (pnitrobenz.mol)
3) Looking for similar compounds
   A) ISIS/Draw front-end to Unity (or RS3) database back end
   B) ChemSymphony front-end to MDL database
4) Registration
   A) using ISIS/Draw to generate a molfile for registration into Daylight (convert to SMILES for direct chemical registration, as well as saving the molfile to an Oracle field)
      I Simple structure; everything in molfile translates to SMILES
      II Complex structure (charges, valence) but properties do translate
      III Parts of the structure (brackets, S-Group data) do not translate to SMILES
   B) using ChemSymphony to generate structures for registration to a Unity database
5) Registration correction
   A) database is MDL; need to locate molecule by ID, replace structure. Drawing tool is ChemSymphony
      I simple structure – no loss of information
      II more complicated structure – all information can be translated
      III very complicated structure – some information does not map

D. H. Peapus, H. J. Chiu and N. Campobasso, Biochemistry, 2001, 40, 10103-10114.

This section is taken verbatim from: Peter Murray-Rust and Henry S. Rzepa, "Chemical Markup, XML and the World-Wide Web. Part II: Information Objects and the CMLDOM"