URL Generator
1. Introduction
Generators are a special category of processors that have no XML data inputs, only
outputs. They are generally used at the top of an XML pipeline to generate XML data
from a Java object or other non-XML source.
The URL generator fetches a document from a URL and produces an XML output document.
Common protocols such as http:, ftp:, and file: are supported, as well as the
PresentationServer resource protocol (oxf:). See Resource Managers for more
information about the oxf: protocol.
2. Content Type
The URL generator operates in several modes depending on the content type of
the source document. The content type is determined according to the following
priorities:
- Use the content type in the content-type element of the configuration if
  force-content-type is set to true.
- Use the content type set by the connection (for example, the content type
  sent with the document by an HTTP server), if any. Note that when using the
  oxf: or file: protocol, the connection content type is never available. When
  using the http: protocol, the connection content type may or may not be
  available depending on the configuration of the HTTP server.
- Use the content type in the content-type element of the configuration, if
  specified.
- Use application/xml.
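The priorities above can be sketched as a small function. This is an illustrative Python sketch, not PresentationServer code: the function name and parameters are hypothetical, and treating a forced but empty content-type element as "not forced" is an assumption.

```python
from typing import Optional

DEFAULT_CONTENT_TYPE = "application/xml"


def resolve_content_type(config_type: Optional[str],
                         force_content_type: bool,
                         connection_type: Optional[str]) -> str:
    """Return the effective content type, applying the four priorities in order."""
    # 1. A forced configuration value wins over everything else.
    #    (Assumption: forcing has no effect when no content type is configured.)
    if force_content_type and config_type:
        return config_type
    # 2. Otherwise use the type reported by the connection, if any.
    #    For oxf: and file: URLs this is always None.
    if connection_type:
        return connection_type
    # 3. Fall back to the configured value, if specified.
    if config_type:
        return config_type
    # 4. Last resort.
    return DEFAULT_CONTENT_TYPE
```

For example, a configured text/html with no forcing loses to a connection type of text/plain, but is used when the connection reports nothing.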
3. XML Mode
The XML mode is selected when the content type is text/xml,
application/xml, or ends with +xml according to the
selection algorithm above. The generator fetches the specified URL and parses
the XML document. If the validating option is set to true, a validating
parser is used; otherwise, a non-validating parser is used. A validating parser
makes it possible to validate the document against a DTD.
Example:
<p:processor name="oxf:url-generator" xmlns:p="http://www.orbeon.com/oxf/pipeline">
    <p:input name="config">
        <config>
            <url>oxf:/urlgen/note.xml</url>
            <content-type>application/xml</content-type>
            <validating>true</validating>
        </config>
    </p:input>
    <p:output name="data" id="xml"/>
</p:processor>
Note
The URL must point to a well-formed XML document. If it doesn't, an exception
will be raised.
4. HTML Mode
The HTML mode is selected when the content type is text/html
according to the selection algorithm above. In this mode, the URL generator
uses HTML Tidy to transform
HTML into XML. This feature is useful to later extract information from HTML
using XPath.
Examples:
<p:processor name="oxf:url-generator" xmlns:p="http://www.orbeon.com/oxf/pipeline">
    <p:input name="config">
        <config>
            <url>http://www.cnn.com</url>
            <content-type>text/html</content-type>
        </config>
    </p:input>
    <p:output name="data" id="html"/>
</p:processor>
<p:processor name="oxf:url-generator" xmlns:p="http://www.orbeon.com/oxf/pipeline">
    <p:input name="config">
        <config>
            <url>oxf:/html/example.html</url>
            <content-type>text/html</content-type>
            <force-content-type>true</force-content-type>
        </config>
    </p:input>
    <p:output name="data" id="html"/>
</p:processor>
Note
HTML Tidy tolerates some malformed HTML, but you should point the generator at
well-formed HTML whenever possible.
5. Text Mode
The text mode is selected when the content type, according to the selection
algorithm above, starts with text/ and is neither text/html nor text/xml, for
example text/plain. In this mode, the URL generator reads the input as a
text file and produces an XML document containing the text read.
Example:
<p:processor name="oxf:url-generator" xmlns:p="http://www.orbeon.com/oxf/pipeline">
    <p:input name="config">
        <config>
            <url>oxf:/list.txt</url>
            <content-type>text/plain</content-type>
        </config>
    </p:input>
    <p:output name="data" id="text"/>
</p:processor>
Assume the input document contains the following text:
This is line one of the input document!
This is line two of the input document!
This is line three of the input document!
The resulting document consists of a document root element
containing the text according to the text document format. An
xsi:type attribute is also present, as well as a
content-type attribute:
<document xsi:type="xs:string" content-type="text/plain">This is line one of the input document!
This is line two of the input document!
This is line three of the input document!
</document>
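The shape of the text document can be reproduced with a few lines of Python. This is only a sketch of the output format shown above, not the generator's implementation: the function name is hypothetical, and namespace declarations for xsi and xs are omitted, as they are in the sample output.

```python
from xml.sax.saxutils import escape, quoteattr


def text_to_document(text: str, content_type: str = "text/plain") -> str:
    """Wrap plain text in a <document> root element, as in text mode."""
    # escape() replaces &, < and > so that arbitrary text is safe as
    # element content; quoteattr() quotes and escapes the attribute value.
    return '<document xsi:type="xs:string" content-type=%s>%s</document>' % (
        quoteattr(content_type),
        escape(text),
    )
```

Characters that are special in XML, such as < and &, are escaped in the element content, so any text file can be carried this way.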
Note
The URL generator performs streaming. It generates a stream of short character
SAX events. It is therefore possible to generate an "infinitely" long document
with a constant amount of memory, assuming the generator is connected to other
processors that do not require storing the entire stream of data in memory, for
example the SQL processor (with an appropriate configuration to stream BLOBs) or
the HTTP serializer.
6. Binary Mode
The binary mode is selected when the content type is neither one of the XML
content types nor one of the text/* content types. In this mode,
the URL generator uses a Base64 encoding to transform binary content into XML
according to the binary document
format. For example:
<p:processor name="oxf:url-generator" xmlns:p="http://www.orbeon.com/oxf/pipeline">
    <p:input name="config">
        <config>
            <url>oxf:/my-image.jpg</url>
            <content-type>image/jpeg</content-type>
        </config>
    </p:input>
    <p:output name="data" id="image-data"/>
</p:processor>
The resulting document consists of a document root node containing
character data encoded with Base64. An xsi:type attribute is also
present, as well as a content-type attribute, if found:
<document xsi:type="xs:base64Binary" content-type="image/jpeg">/9j/4AAQSkZJRgABAQEBygHKAAD/2wBDAAQDAwQDAwQEBAQFBQQFBwsHBwYGBw4KCggLEA4R ... KKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooAKKKKACiiigAooooA//2Q==
</document>
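Taken together, the rules for the XML, HTML, text, and binary modes amount to a simple dispatch on the effective content type. The Python sketch below is illustrative only: the function name and the returned mode labels are hypothetical, and lower-casing the content type before comparison is an assumption.

```python
def select_mode(content_type: str) -> str:
    """Map an effective content type to one of the four generator modes."""
    ct = content_type.strip().lower()  # assumption: comparison is case-insensitive
    # XML mode: text/xml, application/xml, or anything ending with +xml.
    if ct in ("text/xml", "application/xml") or ct.endswith("+xml"):
        return "xml"
    # HTML mode: exactly text/html.
    if ct == "text/html":
        return "html"
    # Text mode: any other text/* type, for example text/plain.
    if ct.startswith("text/"):
        return "text"
    # Binary mode: everything else, for example image/jpeg.
    return "binary"
```

The order of the checks matters: text/xml and text/html must be tested before the generic text/ prefix.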
Note
The URL generator performs streaming. It generates a stream of short character
SAX events. It is therefore possible to generate an "infinitely" long document
with a constant amount of memory, assuming the generator is connected to other
processors that do not require storing the entire stream of data in memory, for
example the SQL processor (with an appropriate configuration to stream BLOBs) or
the HTTP serializer.
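The constant-memory property described in this note can be illustrated with a chunked Base64 encoder. This Python sketch is not the generator's implementation: the function name and chunk size are hypothetical, and it assumes a buffered stream whose read() returns full chunks until the end of the data.

```python
import base64
import io
from typing import BinaryIO, Iterator


def base64_chunks(stream: BinaryIO, chunk_size: int = 3 * 1024) -> Iterator[str]:
    """Yield short Base64 strings, one per chunk, without buffering the input."""
    # Base64 encodes 3 input bytes as 4 output characters. Reading a multiple
    # of 3 bytes at a time means only the last chunk can carry "=" padding,
    # so the concatenated chunks equal the encoding of the whole input.
    if chunk_size <= 0 or chunk_size % 3 != 0:
        raise ValueError("chunk_size must be a positive multiple of 3")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield base64.b64encode(chunk).decode("ascii")
```

Each yielded string plays the role of one short character event; a consumer that forwards the strings as they arrive never holds more than one chunk in memory.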
7. Character Encoding
For text and XML, the character encoding is determined as follows:
- Use the encoding in the encoding element of the configuration if
  force-encoding is set to true.
- Use the encoding set by the connection (for example, the encoding sent with
  the document by an HTTP server), if any, unless ignore-connection-encoding is
  set to true (for XML documents, precedence is given to the connection
  encoding as per RFC 3023). Note that when using the oxf: or file: protocol,
  the connection encoding is never available. When using the http: protocol,
  the connection encoding may or may not be available depending on the
  configuration of the HTTP server. The encoding is specified along with the
  content type in the content-type header, for example:
  content-type: text/html; charset=iso-8859-1.
- Use the encoding in the encoding element of the configuration, if specified.
- For XML, the character encoding is determined automatically by the XML
  parser.
- For text, including HTML, use the default of iso-8859-1.
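The encoding rules above can be sketched in the same way as the content type rules. The Python below is illustrative and makes several assumptions: all names are hypothetical, a return value of None stands for "let the XML parser detect the encoding", and the text default is written as iso-8859-1.

```python
from email.message import Message
from typing import Optional


def connection_charset(content_type_header: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a content-type header value, if any."""
    if not content_type_header:
        return None
    msg = Message()
    msg["Content-Type"] = content_type_header
    return msg.get_content_charset()


def resolve_encoding(mode: str,
                     config_encoding: Optional[str],
                     force_encoding: bool,
                     connection_encoding: Optional[str],
                     ignore_connection_encoding: bool) -> Optional[str]:
    """Apply the encoding priorities; None means XML parser auto-detection."""
    # 1. A forced configuration value wins.
    if force_encoding and config_encoding:
        return config_encoding
    # 2. Connection encoding, unless explicitly ignored.
    if connection_encoding and not ignore_connection_encoding:
        return connection_encoding
    # 3. Configured value, if specified.
    if config_encoding:
        return config_encoding
    # 4. XML: leave detection to the parser.
    if mode == "xml":
        return None
    # 5. Text and HTML: default encoding (assumed name).
    return "iso-8859-1"
```

For the header value text/html; charset=iso-8859-1, connection_charset returns iso-8859-1, which resolve_encoding then prefers over a non-forced configuration value.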
When reading XML documents, the preferred method of determining the character
encoding is to let either the connection or the XML parser auto-detect the
encoding. In some instances, it may be necessary to override the encoding. The
force-encoding and encoding elements can be used for this purpose, for example:
<p:processor name="oxf:url-generator" xmlns:p="http://www.orbeon.com/oxf/pipeline">
    <p:input name="config">
        <config>
            <url>oxf:/urlgen/note.xml</url>
            <content-type>application/xml</content-type>
            <encoding>iso-8859-1</encoding>
            <force-encoding>true</force-encoding>
        </config>
    </p:input>
    <p:output name="data" id="xml"/>
</p:processor>
This use should be reserved for cases where it is known that a document
specifies an incorrect encoding and it is not possible to modify the document.
HTML example:
<p:processor name="oxf:url-generator" xmlns:p="http://www.orbeon.com/oxf/pipeline">
    <p:input name="config">
        <config>
            <url>http://www.cnn.com</url>
            <content-type>text/html</content-type>
            <encoding>iso-8859-1</encoding>
        </config>
    </p:input>
    <p:output name="data" id="html"/>
</p:processor>
Note that only the following encodings are supported for HTML documents:
Also note that use of the HTML <meta> tag to specify the
encoding from within an HTML document is not supported.
8. HTTP Headers
When retrieving a document from an HTTP server, you can optionally specify the
headers sent to the server by adding one or more header elements,
as illustrated in the example below:
<p:processor name="oxf:url-generator" xmlns:p="http://www.orbeon.com/oxf/pipeline">
    <p:input name="config">
        <config>
            <url>http://www.cnn.com</url>
            <content-type>text/html</content-type>
            <header>
                <name>User-Agent</name>
                <value>Mozilla/5.0</value>
            </header>
            <header>
                <name>Accept-Language</name>
                <value>en-us,fr-fr</value>
            </header>
        </config>
    </p:input>
    <p:output name="data" id="html"/>
</p:processor>
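To make the mapping explicit, the following Python sketch reads the header elements of such a configuration and returns the name/value pairs that would be sent with the request. It only illustrates the structure of the configuration; the function name is hypothetical and this is not how PresentationServer processes the input.

```python
import xml.etree.ElementTree as ET

# Same configuration as in the example above, without the pipeline wrapper.
CONFIG = """<config>
    <url>http://www.cnn.com</url>
    <content-type>text/html</content-type>
    <header><name>User-Agent</name><value>Mozilla/5.0</value></header>
    <header><name>Accept-Language</name><value>en-us,fr-fr</value></header>
</config>"""


def request_headers(config_xml: str) -> list:
    """Return (name, value) pairs for each header element, in document order."""
    root = ET.fromstring(config_xml)
    return [(h.findtext("name"), h.findtext("value"))
            for h in root.findall("header")]
```

Each header element contributes exactly one pair, so repeating the element is how several headers are specified.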
9. Cache Control
It is possible to configure whether the URL generator caches documents locally
in the PresentationServer cache. By default, it does. To disable caching, use
the cache-control/use-local-cache element, for example:
<p:processor name="oxf:url-generator" xmlns:p="http://www.orbeon.com/oxf/pipeline">
    <p:input name="config">
        <config>
            <url>http://www.cnn.com</url>
            <content-type>text/html</content-type>
            <cache-control>
                <use-local-cache>false</use-local-cache>
            </cache-control>
        </config>
    </p:input>
    <p:output name="data" id="html"/>
</p:processor>
Using the local cache causes the URL generator to check if the document is in
the PresentationServer cache first. If it is, its validity is checked with the
protocol handler (looking at the last modified date for files, the
last-modified header for http, etc.). If the cached document is
valid, it is used. Otherwise, it is fetched and put in the cache.
When the local cache is disabled, the document is never revalidated and always
fetched.
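The lookup described above can be sketched as follows. This is a simplified Python model under stated assumptions, not the PresentationServer cache: the class and its fetch and last_modified callables are hypothetical, and validity is reduced to comparing a single last-modified stamp.

```python
from typing import Callable, Dict, Optional, Tuple


class LocalCache:
    """Toy model of the local cache with last-modified validation."""

    def __init__(self,
                 fetch: Callable[[str], bytes],
                 last_modified: Callable[[str], Optional[float]]) -> None:
        self._fetch = fetch                  # retrieves the document body
        self._last_modified = last_modified  # asks the protocol handler
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, url: str, use_local_cache: bool = True) -> bytes:
        # Cache disabled: always fetch, never validate or store.
        if not use_local_cache:
            return self._fetch(url)
        stamp = self._last_modified(url)
        entry = self._entries.get(url)
        # Cached and still valid: reuse the stored document.
        if entry is not None and stamp is not None and entry[0] == stamp:
            return entry[1]
        # Missing or stale: fetch and, if a stamp is known, store it.
        body = self._fetch(url)
        if stamp is not None:
            self._entries[url] = (stamp, body)
        return body
```

With the cache enabled, a second request for an unchanged document costs one validity check and no fetch; with it disabled, every request fetches.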
10. Relative URLs
URLs passed to the URL generator can be relative. For example, consider the
following pipeline fragment declared in a file called
oxf:/my-pipelines/backend/import.xpl:
<p:processor name="oxf:url-generator" xmlns:p="http://www.orbeon.com/oxf/pipeline">
    <p:input name="config">
        <config>
            <url>../../documents/claim.xml</url>
        </config>
    </p:input>
    <p:output name="data" id="file"/>
</p:processor>
In this case, the URL resolves to oxf:/documents/claim.xml.
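The resolution can be checked with a short Python sketch. It is an illustration only: the function name is hypothetical, and it works on the path directly with posixpath because oxf: is not a scheme that urllib.parse.urljoin resolves relative references for.

```python
import posixpath


def resolve_relative(base_url: str, relative: str) -> str:
    """Resolve a relative URL against the URL of the enclosing pipeline file."""
    scheme, sep, base_path = base_url.partition(":")
    if not sep:
        raise ValueError("base URL must be absolute: %r" % base_url)
    # A reference that already has a scheme (e.g. http:) is left unchanged.
    if ":" in relative.split("/", 1)[0]:
        return relative
    # Resolve against the directory containing the base document, then
    # collapse "." and ".." segments.
    joined = posixpath.join(posixpath.dirname(base_path), relative)
    return scheme + ":" + posixpath.normpath(joined)
```

Called with oxf:/my-pipelines/backend/import.xpl and ../../documents/claim.xml, it returns oxf:/documents/claim.xml; a plain claim.xml would resolve to oxf:/my-pipelines/backend/claim.xml.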