CORBA and XML: conflict or cooperation?
Every now and then the computer industry gets swept up in a wave of
enthusiasm for some new Silver Bullet that's apparently going to solve
everyone's problems overnight. Actually, these days the wild surges of
millennial euphoria seem to come at annual intervals. Usually the technology
in question actually is a step forward, able to solve real problems better
or faster than was possible before. However, as word spreads about the
power of the new technique some people will inevitably try to apply it
to the wrong problems. It's a bit like the enthusiasm for microwave ovens
when they first became cheap enough for anyone to buy; one could buy microwave
cookbooks explaining how to use them to cook everything from a complete
Christmas dinner to a soufflé. Fortunately, after a while sanity
returned, and people now use microwaves for what they're best at, and go
back to making toast in the toaster or roasting the turkey in the oven,
just as they always did, because they're the best tools for the job.
The same is true in the computer business, and as with cooking gadgets,
it's important to get the balance right. Pointing out that you shouldn't
try to make soup in your bread maker doesn't in any way diminish the fact
that it's very, very good at making bread. In just the same way, this paper
aims to put the current enthusiasm for XML in perspective without in any
way detracting from or criticising XML, which is an excellent tool for
the job for which it was designed. However, the question "Will XML replace
middleware?" is being asked so often at the moment that it seems appropriate
to pen a few words on what applications XML is (and is not) suited for,
and in particular why it isn't going to replace middleware solutions like
CORBA (or vice versa, for that matter). To do this properly, we have to
start with a little history. So, are you sitting comfortably? Then we'll
begin.
A Little History
XML stands for eXtensible Markup Language. It's a simplified subset
of a previous markup language standard called SGML (Standard Generalised
Markup Language), and was devised by a committee of the World Wide Web
Consortium (W3C) in response to the need for a generalisation of HTML, the Hypertext
Markup Language used for formatting Web pages.
SGML was conceived as a successor to document markup languages like
TeX, troff, and nroff. These languages add formatting directives to plain
text to tell typesetters, laser printers and other high-quality output devices
how to format the text in various fonts of different sizes and styles.
When they first appeared in the 1960s, markup languages were designed to
be written by hand; one would use a text editor to create a plain text
document, adding in the occasional markup directive to indicate that some
piece of text should be printed in bold, or centred, or whatever. Of course,
it was important to make sure there was no confusion between the content
and the markup directives, so each family of markup languages had a set
of conventions for separating them. For instance, in nroff and troff the
directives are on lines beginning with a full stop (or period), while TeX
begins directives with a "\" character.
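To make this concrete, here is the same heading marked up in both
styles (an illustrative sketch; the exact directives vary between
versions and macro sets):

    .ce
    \fBChapter One\fP

tells troff to centre the next output line and set it in bold, while
the plain TeX equivalent is:

    \centerline{\bf Chapter One}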
As use of markup languages became more widespread, macros were added
as a convenience feature. If headings in your document are to be displayed
in centred bold 14 point Helvetica, it would soon get tedious to write
four directives to change font, size, weight and justification for each
heading. With a macro facility one can define a single command to do all
this. Better yet, if you later decide your headings should be in Zapf Chancery
instead, changing the definition of the "heading" macro automatically does
the job everywhere you've used the macro.
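In troff, for instance, a heading macro might be sketched like this
(the macro name and details are invented for illustration):

    .de H           \" define a macro called H
    .ce             \" centre the next output line
    .ft B           \" switch to a bold font
    .ps 14          \" set the point size to 14 point
    \\$1            \" the heading text (first argument)
    .ps             \" restore the previous point size
    .ft             \" restore the previous font
    ..

A document then simply says .H "Introduction" wherever a heading is
needed, and redefining H restyles every heading in the document at once.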
Structure vs. presentation
Pretty soon authors creating complex documents found themselves maintaining
large libraries of macro definitions, and never using raw formatting directives
in the documents at all. Unix man pages are a good example - they're defined
using the "man" macros for the nroff text formatter, making it easy to
create manual pages with a consistent appearance.
During the 70s and 80s it became clear that the best way to use markup
was by formalising this approach; create a set of directives for describing
the structure of the document as sections, subsections, bulleted items
and so on, then separately define how to format those structural elements
on paper. By keeping these two kinds of definitions (of structure and presentation)
separate, altering the formatting of the documents or even re-using the
content in new documents could be a completely mechanical process. Furthermore,
automatic tools can process the documents to do jobs like building a contents
page by listing all the headings. If your job is maintaining the many tons
of paper documentation for (say) a commercial airliner, representing the
logical structure of the document in this way is no small advantage, since
it allows the same source documents to be used to deliver information in
a number of different formats. Again, Unix man pages are a good example;
when the manuals are printed on a high resolution printer, using the same
source text with a different library of (troff) macro definitions automatically
creates book-quality manual pages rather than the screen-formatted pages
generated from the same sources by nroff.
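For example, the same manual page source can be rendered either way
with the standard formatters (command names as on a traditional Unix
system; "ls.1" is an illustrative file name):

    nroff -man ls.1     (screen-quality output for terminals)
    troff -man ls.1     (typesetter output for high-quality printing)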
SGML, DSSSL and HTML
SGML was designed by ISO (the International Organisation for Standardisation)
as a new standardised markup language that enshrined this separation of
structure and presentation. In order to apply SGML one creates a Document
Type Definition (DTD) that defines the set of valid tags for the documents
being created, and uses DSSSL (the ISO-standardised Document Style Semantics
and Specification Language that accompanies SGML) to define how to display
text labelled with those tags. Between them the DTD and DSSSL definitions
fill the same rôle as the macro library in older markup languages.
SGML has achieved limited success in large organisations that maintain
very large documentation sets, but the SGML standard alone is over 500
pages, and the accompanying DSSSL (rhymes with "whistle") standard is also
rather large and uses a syntax based on the Scheme programming language,
which some people find hard to learn. Many users lack the will or resources
to climb the SGML learning curve.
Meanwhile, at CERN in Geneva, Tim Berners-Lee was creating a simple
SGML DTD to define a few document structure tags like "heading" and "numbered
list" for defining the structure of documentation to be shared between
nuclear physicists over computer networks. This simple application of SGML,
called HTML (HyperText Markup Language) didn't have any accompanying way
of defining the appearance of documents - that was provided by settings
in the Web browser used to display the HTML document. The original HTML
specification was in essence an SGML DTD describing the syntax of
HTML documents, with the added wrinkle that one of the tags defined a way
to hyperlink to another HTML document.
HTML has, of course, been much more widely used than SGML, but as its
use spread, two problems became apparent. The first was that HTML defined
only the structure of Web page elements, with no associated way of specifying
their presentation, so Web page designers had no way of controlling
exactly how their creations looked. As Web pages became more sophisticated,
with more graphic content, this became a serious problem, and ad-hoc extensions
were added to HTML to allow direct control of presentation by specifying
fonts, font sizes, text colours, and so on (which of course completely
violates the original SGML design principles). At the same time, because
HTML had one fixed DTD, document designers had no way to create new
structure tags to represent document structure in particular HTML applications.
Without either an extension mechanism (like macros) or a way of defining
and controlling presentation, the original HTML neatly fell between two
stools, and short-term product development pressures have inevitably pushed
it towards being a presentation markup language which provides the Web
page designer with detailed control over how his/her document appears,
rather than representing its logical structure. While this deals effectively
with the primary purpose of Web pages, which is to be viewed by people
using Web browsers, the increasing size and ubiquity of the Web is creating
an increasing demand for Web pages that can be manipulated by Web-scanning
"robots" such as the search engines which "read" and catalogue millions
of Web pages daily. It became clear that the lack of structure encoding
threatened to slow down the development of the Web.
Enter XML
One solution to the problem of HTML's lack of structure would simply
have been to step up one level and use SGML and DSSSL directly on the Web.
However, the complexity of the ISO standards militated against this; something
simpler was needed. In mid-1996 Jon Bosak, an influential member of the
SGML community, persuaded W3C to set up an SGML Editorial Review Board
and Working Group, to define a simplified, extensible subset of SGML designed
for the Web. The final XML 1.0 specification was published by W3C in February
1998, and will be complemented by two further specifications currently
being prepared: XLL (the eXtensible Linking Language, for defining how
XML documents are linked together) and XSL (the eXtensible Style Language,
for defining how XML markup is formatted for display).
What should XML be used for?
XML is being enthusiastically embraced in many application domains because
there are a lot of applications that need to store data intended for human
use, but which it is also useful to manipulate by machine. One example
might be storing and displaying mailing list information. Defining and
using an XML DTD for storing address data (see the sketch below) makes it comparatively easy to
write applications to (say) generate address labels without inadvertently
printing the phone number in the postcode field. There are a large number
of initiatives to replace home-grown markup formats with applications of
XML - examples include Bioinformatic Sequence Markup Language (BSML), Weather
Observation Markup Format (OMF), the Extensible Log Format (XLF - a markup
format for logging information generated by Web servers), and others for
legal documents, real estate information and many more. In each case the
working group simply needs to define a DTD that defines the tags and how
they can be legally combined. These DTDs can then be used with XML parsers
and other XML tools to rapidly create applications to process and display
the stored information in whatever way is required. Of course, there are
still standardisation issues to be addressed, such as who controls the
libraries of tag definitions, how to manage version control in those libraries,
how to manage using multiple libraries simultaneously (especially when
tag names collide). Nevertheless, using XML for these applications is
a lot simpler than creating a completely new markup language from scratch
every time, with a lot more scope for re-using the work of others.
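As a sketch of the mailing list example above (the tag names are
invented for illustration, not taken from any published DTD), the DTD
might declare:

    <!ELEMENT address  (name, street, city, postcode, phone?)>
    <!ELEMENT name     (#PCDATA)>
    <!ELEMENT street   (#PCDATA)>
    <!ELEMENT city     (#PCDATA)>
    <!ELEMENT postcode (#PCDATA)>
    <!ELEMENT phone    (#PCDATA)>

and a conforming document would then contain entries such as:

    <address>
      <name>A. N. Other</name>
      <street>1 High Street</street>
      <city>Cambridge</city>
      <postcode>CB1 1AA</postcode>
      <phone>01223 000000</phone>
    </address>

Because the DTD says exactly which elements may appear and in what
order, a label-printing application can pick out the name, street, city
and postcode with no danger of mistaking the phone number for the
postcode.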
One important point to note is that nowhere in the XML DTDs is there
a way of specifying what an XML tag "means", just where it can be positioned
in relationship to other tags, and (using XSL) how to format it on a display.
Creators of XML DTDs naturally choose short, descriptive names
for their tags, just as PC users choose short descriptive names for their
files, so it's very appealing to think that XML files are "self-describing",
because to an English speaker it's intuitive that an <address> tag labels
an address or a <date-of-birth> tag labels a person's birthday. However,
this is just the intuitive "meaning" we assign to the terms by assuming
that the creator of the DTD used these words in the way we would expect;
if the creator of the DTD had instead specified his tags in a foreign language
or using some private code we'd be none the wiser. XML files are in fact
just as "self-describing" as a C program or a database schema.
What shouldn't XML be used for?
The common thread in XML applications is that the document content is
intended to be read by people. Because XML is intended for marking up human-readable,
textual data, it is by the same token a rather inefficient way of storing
information that only ever needs to be machine-readable. The embedded XML
tags provide a way to extract or format particular parts of the content,
but the content itself will not usually be interpreted by the computers,
only by the ultimate human user - which is why it makes sense to store
it in human-readable form. Of course, it's perfectly possible to write
parsers to read in (say) formatted floating point numbers from an XML file
so that they can be processed, but it's relatively time-consuming, and
the XML file would be relatively much larger than one written in native
floating-point format.
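As a rough illustration (using an invented tag name), a double
precision floating point number occupies 8 bytes in its native IEEE
binary form, whereas the same value written out as

    <reading>2.718281828459045</reading>

takes around 36 bytes of text - and still has to be parsed back into
binary before any arithmetic can be done with it.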
When the requirement is to exchange data between cooperating computer
applications, there are other, more efficient ways of defining and storing
the data. Traditionally these definitions of data formats for machine communication
are called Interface Definition Languages (IDLs) because they're used for
defining the interfaces between cooperating computer applications. In contrast
to markup, which is used for the long-term storage of human-readable data,
IDLs define the smaller packets of transient, machine readable data that
are exchanged between the components of a distributed application when
some particular event occurs.
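For example, here is a minimal sketch in OMG IDL (the interface and
operation names are invented for illustration):

    // A remote address book service; the IDL declares only the
    // operations and data types visible at the interface, saying
    // nothing about how either side implements or stores them.
    interface AddressBook {
        exception UnknownName {};
        string lookupPhoneNumber(in string name)
            raises (UnknownName);
        void addEntry(in string name, in string phoneNumber);
    };

An IDL compiler turns this declaration into client stubs and server
skeletons in the programming languages of your choice, and the ORB
converts the parameters to and from its wire format automatically.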
Interface Definition Languages are the most visible components of a
class of software known as "middleware" - software which is
neither part of an operating system nor an application, but is used to
link together the various parts of a distributed application spread across
geographically separated computers. By their very nature, successful middleware
solutions blend into the background, making few impositions on the users,
designers and programmers of a distributed system. Many of today's most
widely-used middleware packages implement the CORBA (Common Object Request Broker
Architecture) specification, published by the OMG (Object Management Group).
Although IDL is the most visible aspect of middleware, there's much
more to it than that; middleware solutions like CORBA also provide security
to authenticate users and control access to resources, error handling to
gracefully handle the failures that are inevitable in a distributed computing
system, and a host of other support functions to keep computer networks
running smoothly. In these sorts of distributed computing applications
the data are transient, transferred between computers, often not permanently
stored anywhere and probably never seen by human eyes. To use XML as the
data encoding in such applications is less efficient than the compact,
native machine representations used to marshal data in (for instance) the
IIOP wire format used by CORBA implementations. Of course, if the requirement
is to store data for the long term and extract human-readable summaries
and reports then XML would be the more appropriate medium - but for the
data exchanges that tie together the components of a distributed system,
using XML would be expensive and pointless.
Summary
XML and middleware are complementary technologies. XML is intended for
the storage and manipulation of text making up human-readable documents
like Web pages, while middleware solutions like CORBA tie together cooperating
computer applications exchanging transient data that will probably never
be directly read by anyone. Neither of these technologies will replace
the other, but instead they will increasingly be used together - not least
in the specifications published by OMG, the body responsible for the CORBA
specification.