Non-XML Documents in XPL
1. Introduction
In PresentationServer, XPL and pipelines deal only with XML documents.
This means that only pure XML infosets circulate between processor outputs
and processor inputs in a pipeline. There is, however, often a
need to handle non-XML data in pipelines, in particular:
- Binary documents: any document that can be represented as a stream of
  bytes. In principle this is true of every document, but some document formats
  are almost always represented this way: images, sounds, PDF documents, etc.
- Text documents: any document that can be represented as a stream of
  characters. Some documents are better looked at this way, such as plain text
  files, HTML files, and even the textual representation of XML.
PresentationServer addresses this question by defining two standard XML document
formats to embed binary and text documents within an XML infoset. This solution has
the benefit of keeping XPL simple by limiting it to pure XML infosets, while
allowing XPL to conveniently manipulate any binary and text document.
2. Binary Documents
A binary document consists of a document root element containing
character data encoded with Base64. An xsi:type attribute with the value
xs:base64Binary is also present, as well as an optional content-type
attribute, for example:
<document xsi:type="xs:base64Binary" content-type="image/jpeg">/9j/4AAQSkZJRgABAQEBygHKAAD/2wBDAAQDAwQDAwQEBAQFBQQFBwsHBwYGBw4KCggLEA4R ... KKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooA//2Q==
</document>
Note
For the curious, the Base64 encoding is documented in
RFC 2045. This encoding represents
binary data by mapping it to a set of 64 ASCII characters.
Such documents are not meant to be read by users, in the same way that regular
binary files are not meant to be examined by users. Binary documents are generated
by PresentationServer processors, like the URL generator and converters. They are consumed by processors like
the HTTP serializer, the Email processor, and converters.
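As an illustration of the format described above, the following Python sketch wraps a byte string in a binary document and reads it back. It is not PresentationServer code: the function names make_binary_document and read_binary_document are hypothetical, and the xmlns:xsi and xmlns:xs declarations are added here as an assumption so that the result parses as standalone XML.

```python
import base64
import xml.etree.ElementTree as ET
from typing import Optional, Tuple
from xml.sax.saxutils import quoteattr


def make_binary_document(data: bytes, content_type: Optional[str] = None) -> str:
    """Wrap raw bytes in a <document xsi:type="xs:base64Binary"> element."""
    attrs = [
        # Namespace declarations are an assumption of this sketch.
        'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"',
        'xmlns:xs="http://www.w3.org/2001/XMLSchema"',
        'xsi:type="xs:base64Binary"',
    ]
    if content_type is not None:
        # content-type is optional, as in the format description.
        attrs.append("content-type=" + quoteattr(content_type))
    # The Base64 alphabet contains no XML special characters,
    # so the payload needs no escaping.
    payload = base64.b64encode(data).decode("ascii")
    return "<document " + " ".join(attrs) + ">" + payload + "</document>"


def read_binary_document(xml_text: str) -> Tuple[bytes, Optional[str]]:
    """Return the decoded bytes and the content-type (None if absent)."""
    root = ET.fromstring(xml_text)
    return base64.b64decode(root.text or ""), root.get("content-type")


if __name__ == "__main__":
    doc = make_binary_document(b"\xff\xd8\xff\xe0", "image/jpeg")
    print(doc)
    print(read_binary_document(doc))
```

This version holds the whole payload in memory, which is fine for small documents but does not stream.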
3. Text Documents
A text document consists of a document root element containing the
text. An xsi:type attribute is also present, as well as an optional
content-type attribute:
<document xsi:type="xs:string" content-type="text/plain">This is line one of the input document! This is line two of the input document! This is line three of the input document!
</document>
The content-type attribute may have a charset parameter
providing a hint for the character encoding, for example:
<document xsi:type="xs:string" content-type="text/plain; charset=iso-8859-1">This is line one of the input document! This is line two of the input document! This is line three of the input document!
</document>
Because XML character data itself is represented in Unicode (in other words, it is
designed to allow representing, in the same document, all the characters specified by
the Unicode specification), there is no requirement for specifying a character
encoding in XML pipelines. However, when an XML infoset is read or written as a
textual XML document, specifying a character encoding may be a useful hint. For
example, a URL generator can, with this mechanism, communicate to an HTTP serializer
the preferred character encoding obtained when the document was read. The serializer
may then use that hint, but it is by no means authoritative.
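The following Python sketch shows one way a consumer might treat the charset parameter as a non-authoritative hint. It is only an illustration under stated assumptions: the names charset_hint and serialize_text_document are hypothetical, the fallback to utf-8 is a policy chosen for this example rather than documented PresentationServer behavior, and the namespace declarations are added so the sample parses on its own.

```python
import xml.etree.ElementTree as ET
from email.message import Message
from typing import Optional, Tuple


def charset_hint(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a content-type value, if any."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def serialize_text_document(xml_text: str, default: str = "utf-8") -> Tuple[bytes, str]:
    """Encode the text of a text document, returning (bytes, encoding used)."""
    root = ET.fromstring(xml_text)
    text = root.text or ""
    encoding = charset_hint(root.get("content-type")) or default
    try:
        return text.encode(encoding), encoding
    except (LookupError, UnicodeEncodeError):
        # The hint names an unknown encoding, or one that cannot represent
        # the text: ignore it and use the default (an assumption of this sketch).
        return text.encode(default), default


if __name__ == "__main__":
    doc = (
        '<document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'
        ' xmlns:xs="http://www.w3.org/2001/XMLSchema"'
        ' xsi:type="xs:string"'
        ' content-type="text/plain; charset=iso-8859-1">caf\u00e9</document>'
    )
    print(serialize_text_document(doc))
```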
In general, XML documents can be read and written using the utf-8
character encoding, which allows representing all the Unicode characters. However,
when dealing with other types of text documents, tools such as text editors may not
be able to deal correctly with utf-8. In such cases, it can be useful
to use more widely supported character encodings such as iso-8859-1 or
us-ascii. The drawback is that such encodings can represent a much
smaller set of characters than utf-8.
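The trade-off can be checked directly. In this small Python sketch, the helper name encodable is hypothetical and the sample string is made up for the illustration; it simply tests which of the three encodings mentioned above can represent a given piece of text.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # "naïve €5": the i-diaeresis exists in iso-8859-1, the euro sign does not.
    sample = "na\u00efve \u20ac5"
    for name in ("us-ascii", "iso-8859-1", "utf-8"):
        print(name, encodable(sample, name))
```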
Unlike binary documents, text documents can easily be examined by users. They can
also be easily manipulated by languages such as XSLT.
Like binary documents, they are generated by PresentationServer processors, like
the URL generator and converters. They are consumed by processors like
the HTTP serializer, the Email processor, and converters.
4. Streaming
Processors can stream binary and text documents by issuing a series of short
SAX character events. It is therefore possible to generate "infinitely" long binary
and text documents with a constant amount of memory, assuming both the sender and
the receiver of the document are able to perform streaming. This is the case, for
example, with the URL generator and
the HTTP serializer.
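To illustrate the receiving side of such streaming, here is a minimal Python sketch using the standard xml.sax module. It is not PresentationServer's implementation: the class Base64StreamingHandler and the function decode_stream are hypothetical names. The handler decodes each batch of character events as it arrives and keeps at most three undecoded characters between events, so memory use does not grow with the size of the document.

```python
import base64
import io
import xml.sax


class Base64StreamingHandler(xml.sax.ContentHandler):
    """Decode the Base64 content of a binary document chunk by chunk."""

    def __init__(self, sink):
        super().__init__()
        self._sink = sink      # any object with a write(bytes) method
        self._pending = ""     # leftover characters, always fewer than 4

    def characters(self, content):
        # Drop whitespace, then decode the largest prefix whose length is a
        # multiple of 4 (Base64 maps 4 characters to at most 3 bytes).
        chunk = self._pending + "".join(content.split())
        usable = len(chunk) - len(chunk) % 4
        self._sink.write(base64.b64decode(chunk[:usable]))
        self._pending = chunk[usable:]

    def endDocument(self):
        if self._pending:
            raise ValueError("truncated Base64 data")


def decode_stream(xml_chunks, sink):
    """Feed XML text to an incremental SAX parser one chunk at a time."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(Base64StreamingHandler(sink))
    for chunk in xml_chunks:
        parser.feed(chunk)
    parser.close()


if __name__ == "__main__":
    payload = bytes(range(256))
    doc = (
        '<document xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'
        ' xmlns:xs="http://www.w3.org/2001/XMLSchema"'
        ' xsi:type="xs:base64Binary">'
        + base64.b64encode(payload).decode("ascii")
        + "</document>"
    )
    out = io.BytesIO()
    # Deliberately feed tiny chunks that ignore Base64 and tag boundaries.
    decode_stream((doc[i:i + 7] for i in range(0, len(doc), 7)), out)
    print(out.getvalue() == payload)
```

In this sketch the sink is an in-memory buffer for ease of checking; writing to a file or socket instead would keep the whole path constant in memory.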