Also see the XMLC 2.2 Release Note, XMLC 2.2.1 Release Note, XMLC 2.2.2 Release Note, XMLC 2.2.3 Release Note, XMLC 2.2.4 Release Note, XMLC 2.2.5 Release Note, XMLC 2.2.6 Release Note, XMLC 2.2.7.1 Release Note, XMLC 2.2.8.1 Release Note, XMLC 2.2.9 Release Note, XMLC 2.2.10 Release Note, XMLC 2.2.11 Release Note, XMLC 2.2.12 Release Note, XMLC 2.2.13 Release Note, XMLC 2.2.14 Release Note, XMLC 2.2.15 Release Note, and XMLC 2.2.16 Release Note.
It took a *lot* of work (and even some patches, which have been applied to the official Xerces2 source repository at http://xerces.apache.org) to make it happen, but XMLC is now compatible with Xerces2! As of XMLC 2.2.6, XMLC could already run in an environment containing the DOM3 APIs, such as under JDK 1.5. However, DOM3 support in XMLC 2.2.6+ is only API-deep and not actually implemented. With Xerces2 as the new DOM implementation base for XMLC 2.3, DOM3 is finally fully supported.
Note that versions of Xerces prior to 2.8.0 are totally incompatible with XMLC 2.3. This is due to the required fix for issue XERCESJ-1133. Even then, fixes available in Xerces 2.8.1, such as XERCESJ-1181 and XERCESJ-1187, probably make 2.8.1 the minimum recommended version. Now that 2.9.0 has been released with a major performance enhancement (XERCESJ-1200), it is the recommended version.
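The version floor above can be checked at startup. The following is only a sketch: it assumes the Xerces2 class org.apache.xerces.impl.Version is on the classpath (looked up reflectively so the code still compiles and runs without it) and that its version string looks something like "Xerces-J 2.9.0". The class name XercesVersionCheck and its helper methods are hypothetical, not part of XMLC.

```java
import java.lang.reflect.Method;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XercesVersionCheck {
    // Oldest Xerces2 release that XMLC 2.3 can work with, per the note above.
    static final String MINIMUM = "2.8.0";

    // Pulls the first dotted number out of a string such as "Xerces-J 2.9.0".
    static String numericPart(String versionString) {
        Matcher m = Pattern.compile("\\d+(\\.\\d+)*").matcher(versionString);
        return m.find() ? m.group() : "0";
    }

    // Numeric, component-by-component comparison of dotted versions.
    static int compareVersions(String a, String b) {
        String[] pa = a.split("\\.");
        String[] pb = b.split("\\.");
        int n = Math.max(pa.length, pb.length);
        for (int i = 0; i < n; i++) {
            int x = i < pa.length ? Integer.parseInt(pa[i]) : 0;
            int y = i < pb.length ? Integer.parseInt(pb[i]) : 0;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return 0;
    }

    static boolean isSupported(String versionString) {
        return compareVersions(numericPart(versionString), MINIMUM) >= 0;
    }

    // Returns the Xerces2 version string, or null if Xerces2 is not on the classpath.
    static String detectXercesVersion() {
        try {
            Class<?> c = Class.forName("org.apache.xerces.impl.Version");
            Method m = c.getMethod("getVersion");
            return (String) m.invoke(null);
        } catch (ReflectiveOperationException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String v = detectXercesVersion();
        if (v == null) {
            System.out.println("Xerces2 not found on the classpath");
        } else {
            System.out.println(v + (isSupported(v) ? " is usable" : " is too old for XMLC 2.3"));
        }
    }
}
```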
XMLC now requires Apache XML Commons Resolver as a new dependency. It is important to note that, until its 1.2 release, Resolver had broken support for XCatalog (the original Xerces1 XCatalog implementation that XMLC uses; not the XMLCatalog OASIS specification). I supplied a fix (XMLCOMMONS-38562), which has been applied to the source repository, released as part of version 1.2, and packaged with Xerces-2.9.0. Versions of Resolver prior to 1.2 are totally incompatible with XMLC 2.3.
XMLC 2.3 is, pretty much, a drop-in replacement for XMLC 2.2.xx, assuming your classes are already compiled with -for-deferred-parsing (or you use dynamic loading with no build-time compile step). Please report observations to the contrary. Here's the list of jar dependencies...
Of the above, 2 - 6 may be shared by multiple applications (xercesImpl.jar and xml-apis.jar should, if possible, be placed in an endorsed directory when using an application server). XMLC itself is not yet ready to be shared this way, but may be in the future, so make sure to deploy any xmlc-*.jar files as part of each individual application. Each of these jars is shipped in the XMLC binary zip package.
One note on build tools: since Ant and, probably, Maven ship with Xerces and put it on the system classpath, make sure the latest version of Xerces is in the respective build tool's lib directory, as this will override any version referenced inside the build file.
All existing parsers have been removed in favor of a new XML parser (XercesDOMParser) extending the Xerces2 DOMParser and a new HTML parser (XercesHTMLDOMParser) extending the XML parser, using the NekoHTML HTMLConfiguration. Parsing is now performed 100% inside Xerces2/NekoHTML with XMLC's minimal parser extensions merely gathering XMLCDocument information and performing entity resolution. DOM correctness is now entirely enforced by the underlying parsers, respectively.
Note that in cases where "tidy" or "swing" parser types have been specified in XMLC configuration, NekoHTML will be used instead. No need to recompile with a new configuration. To explicitly specify NekoHTML as the parser, use the new parser type "nekohtml".
Xerces2 is much more strict about DOM creation. Static Loading builds the DOM from scratch. When it comes to appending DocumentType, Entity, and EntityReference nodes, Xerces2 throws exceptions stating that operations on these objects are read only. Using Deferred Parsing and Dynamic Loading solves this issue, since the DOM is always built by the parser. The DOM is parsed once (and reparsed if a change to the source markup is detected) and cached. DOM instances are clones of the master cached DOM, with the cloning performed internally by Xerces, allowing for write operations on normally read-only nodes. An added benefit was the opportunity to lighten up XMLC by removing a good deal of code that is now unnecessary.
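To illustrate the parse-once, clone-per-use idea described above, here is a minimal sketch using only the JDK's built-in JAXP/DOM classes. It is not XMLC's actual implementation: the class CachedDomTemplate and its methods are hypothetical, and change detection on the source markup is left out.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CachedDomTemplate {
    // The master DOM: built by the parser exactly once, never handed out directly.
    private final Document master;

    CachedDomTemplate(String markup) {
        try {
            master = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse template markup", e);
        }
    }

    // Each caller gets its own deep clone, so edits never touch the master.
    Document newInstance() {
        return (Document) master.cloneNode(true);
    }

    // Text content of the master's root element, for checking it is untouched.
    String masterText() {
        return master.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) {
        CachedDomTemplate template = new CachedDomTemplate("<page><title>draft</title></page>");
        Document copy = template.newInstance();
        copy.getDocumentElement().getFirstChild().setTextContent("edited");
        System.out.println("copy:   " + copy.getDocumentElement().getTextContent());
        System.out.println("master: " + template.masterText());
    }
}
```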
The -for-deferred-parsing flag is still recognized, but ignored since Deferred Parsing is now used by default. As such, there is no longer a requirement to specify -for-deferred-parsing.
DOMFactory interface - custom DOM Factories will need to be updated
getDocumentClassName() method - meant to return the fully qualified class name of the Document implementation being represented by the factory. This is used to feed the Xerces2 XMLParserConfiguration "http://apache.org/xml/properties/dom/document-class-name" property, informing Xerces of the custom DOM with which to bind.
createAccessorGenerator() and createDocBuilderGenerator() methods - the only AccessorGenerator and DocBuilderGenerator implementations now in use are the Deferred Parsing implementations, so there is no longer a need to specify these.
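As a rough picture of how getDocumentClassName() ends up as a parser property, consider the sketch below. Everything here except the property URI and the method name is an assumption made for illustration: DocumentClassNameSource is a stand-in for the relevant part of DOMFactory (not the real interface), com.example.dom.MyDocumentImpl is a made-up class name, and a plain Map stands in for the parser configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DocumentClassNameSketch {
    // Xerces2 property named in the release note above.
    static final String DOCUMENT_CLASS_NAME_PROPERTY =
            "http://apache.org/xml/properties/dom/document-class-name";

    // Hypothetical stand-in for the one DOMFactory method of interest here.
    interface DocumentClassNameSource {
        String getDocumentClassName();
    }

    // Example factory; the class name it returns is invented for illustration.
    static DocumentClassNameSource exampleFactory() {
        return new DocumentClassNameSource() {
            public String getDocumentClassName() {
                return "com.example.dom.MyDocumentImpl";
            }
        };
    }

    // Collects the properties that would be handed to the parser configuration.
    static Map<String, Object> parserProperties(DocumentClassNameSource factory) {
        Map<String, Object> props = new LinkedHashMap<String, Object>();
        props.put(DOCUMENT_CLASS_NAME_PROPERTY, factory.getDocumentClassName());
        return props;
    }

    public static void main(String[] args) {
        System.out.println(parserProperties(exampleFactory()));
    }
}
```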
OutputOptions.FORMAT_XHTML
Updated DOMFormatter, XMLFormatter and HTMLFormatter, and OutputOptions. XMLFormatter now supports outputting the HTML DOM as XHTML and HTMLFormatter supports outputting the XHTML DOM as HTML. For the former, one must provide oo.setFormat(OutputOptions.FORMAT_XHTML), and for the latter oo.setFormat(OutputOptions.FORMAT_HTML) (avoid either if using LazyDOM and preformatted text). When left unspecified, the appropriate default format will be applied. HTML documents being output as XHTML will have the XHTML 1.0 transitional doctype, replacing the original document's doctype (unless it is a frameset doctype, in which case the default is set to the XHTML 1.0 frameset doctype), or have the XHTML transitional doctype inserted if no doctype existed in the HTML document. XHTML documents being output as HTML will have the equivalent HTML 4.01 doctype applied and will have XML/XHTML-specific attributes stripped, such as "xmlns", "xml:lang" (the value of which is transferred to the "lang" attribute), and "xml:space". Don't forget that doctypes can always be overridden via OutputOptions.
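The attribute handling described for XHTML-as-HTML output can be pictured with a small standalone sketch. This is not HTMLFormatter's code (which works at serialization time rather than by editing the DOM); it simply applies the same rules to a DOM tree using JDK classes. The class XhtmlAttributeStripper is hypothetical, and overwriting any existing "lang" value is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XhtmlAttributeStripper {
    // Removes "xmlns" and "xml:space", and moves "xml:lang" to "lang", recursively.
    static void strip(Element e) {
        if (e.hasAttribute("xml:lang")) {
            // Assumption: the xml:lang value wins over any existing lang value.
            e.setAttribute("lang", e.getAttribute("xml:lang"));
            e.removeAttribute("xml:lang");
        }
        e.removeAttribute("xmlns");
        e.removeAttribute("xml:space");
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                strip((Element) c);
            }
        }
    }

    // Parses markup with the JDK's default (non-namespace-aware) parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Document d = parse("<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\">"
                + "<body><pre xml:space=\"preserve\">x</pre></body></html>");
        strip(d.getDocumentElement());
        System.out.println("lang=" + d.getDocumentElement().getAttribute("lang"));
    }
}
```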
Added new OutputOptions.set/getNewlineCharSequence(). This allows for overriding the default '\n' newline character in order to set platform-specific newlines when pretty printing. For instance, if writing to console or file on Windows, one might use oo.setNewlineCharSequence(new char[]{'\r','\n'}). Overriding is unnecessary when outputting to a web browser. Note that where the Xerces/NekoHTML parser preserves white-space, the newline character will be '\n'. This option only applies to newlines added by pretty printing.
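Rather than hard-coding '\r','\n', one could derive the sequence from the running platform. The helper below is a hypothetical sketch using only the JDK; the char[] it returns is the kind of value setNewlineCharSequence() is shown taking above.

```java
class NewlineChoice {
    // '\n' for browser output (the default); the platform separator for console or file.
    static char[] newlineFor(boolean browserOutput) {
        return browserOutput
                ? new char[] {'\n'}
                : System.lineSeparator().toCharArray();
    }

    public static void main(String[] args) {
        char[] nl = newlineFor(false);
        System.out.println("platform newline length: " + nl.length);
        // With XMLC on the classpath, this would then be passed along as:
        // oo.setNewlineCharSequence(nl);
    }
}
```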
Added new OutputOptions.set/getForceHTMLLowerCase(). Setting this to true results in the HTML DOM being serialized in lower-case. This option applies only to the HTML DOM when using the HTMLFormatter. When outputting XHTML as HTML, DOM elements are already in lower-case so this conversion overhead is skipped. However, it still needs to be set to true or elements will actually be forced upper-case to be consistent with how the HTML DOM is normally serialized.
Note that when formatting HTML as XHTML or XHTML as HTML, the XMLC CDATA hack continues to be supported, though discouraged because it is non-standard and may produce output that is not well formed (which is why HTML, whether output as HTML or XHTML, should be served as "text/html" instead of "application/xhtml+xml"). The XMLC CDATA hack is *not* supported when serializing as XML/XHTML from markup validated at parse time.
HTMLFormatter is now capable of outputting the doctype stored in the document, because the new NekoHTML-based parser stores the doctype there. It is no longer necessary to provide doctype overrides via oo.setPublicId(prefPublicId) and oo.setSystemId(prefSystemId) just to get a doctype into the output, though these overrides are still supported by HTMLFormatter (and XMLFormatter).
oo.setPreserveSpace(false) is now recognized by both XMLFormatter and HTMLFormatter (Note: this is not applied within elements having xml:space="preserve" nor within <pre>, <script>, and <style> elements in HTML). It hasn't been recognized until now, and the default has always been true anyway. Set it to false to remove extra whitespace that the XML parser preserves between elements (only applies when character content is *all* whitespace between elements). When not using pretty printing, a single space character is written to preserve single space separation, which is accomplished using newlines when pretty printing.
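The rule above (drop text that is *all* whitespace between elements, but leave protected elements alone) can be sketched as a DOM pass using only JDK classes. This is an illustration, not the formatters' code: the real formatters act while writing output and emit a single space or newline in place of the removed text, whereas this hypothetical WhitespaceStripper simply deletes the nodes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class WhitespaceStripper {
    // True for elements whose content must be left exactly as parsed.
    static boolean isProtected(Element e) {
        String name = e.getNodeName().toLowerCase();
        return "preserve".equals(e.getAttribute("xml:space"))
                || name.equals("pre") || name.equals("script") || name.equals("style");
    }

    // Removes whitespace-only text nodes below 'node', skipping protected subtrees.
    static void strip(Node node) {
        if (node instanceof Element && isProtected((Element) node)) {
            return;
        }
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // fetch before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                strip(child);
            }
            child = next;
        }
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Document d = parse("<div>\n  <p>a</p>\n  <pre>\n <b>x</b>\n</pre>\n</div>");
        strip(d.getDocumentElement());
        System.out.println("children of div: "
                + d.getDocumentElement().getChildNodes().getLength());
    }
}
```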
XMLFormatter now recognizes xml:space="preserve" (which is the default for the XHTML <pre>, <script>, and <style> elements, even when left unspecified by the user) while HTMLFormatter recognizes the <pre>, <script>, and <style> elements, specifically; in both cases, these sections won't be formatted at all, even when using pretty printing. And for HTML/XHTML <script> and <style> elements, no attempt is made to resolve special characters/entities; the text is output as is, as if inside a CDATASection.
Pretty printing code has been reworked for both XMLFormatter and HTMLFormatter to provide nearly the same pretty printing capabilities that HTMLFormatter has always had, while handling the fact that Xerces2 preserves whitespace (Xerces1 just got rid of it when parsing HTML). There are some caveats. Because the parser preserves space, true pretty printing is only consistently possible when using both oo.setPrettyPrinting(true) and oo.setPreserveSpace(false). Even then, there are still some quirks to be found, but it works pretty well. For pretty printing, the following OutputOptions settings are recommended for XHTML and HTML, respectively...
//XHTML (default output format when using XMLC's XHTML DOM)
oo.setEnableXHTMLCompatibility(true); //applies only when outputting HTML/XHTML as XHTML
oo.setUseAposEntity(false);
oo.setOmitXMLHeader(true); //always set to false if serving as content-type "text/html", inconsequential when formatting HTML as XHTML or XHTML as HTML
oo.setPrettyPrinting(true);
oo.setPreserveSpace(false); //provides for cleaner output even when pretty printing not enabled
oo.setIndentSize(2); //optional, default is 4

//HTML (default output format when using the HTML DOM)
oo.setUseAposEntity(false);
oo.setPrettyPrinting(true);
oo.setPreserveSpace(false); //provides for cleaner output even when pretty printing not enabled
oo.setIndentSize(2); //optional, default is 4

//HTML as XHTML - use XHTML options above plus...
oo.setFormat(org.enhydra.xml.io.OutputOptions.FORMAT_XHTML);

//XHTML as HTML - use HTML options above plus...
oo.setFormat(org.enhydra.xml.io.OutputOptions.FORMAT_HTML);
Don't forget that with Xerces2 as XMLC's new DOM backbone, it is now possible to utilize standard serialization facilities such as the DOM3 LSSerializer, using domImplLS.createLSSerializer(), or the JAXP Transformer API. However, your mileage may vary and these standard serializers certainly won't take into account LazyDOM specifics.
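For reference, a minimal DOM3 Load and Save serialization looks roughly like this. The sketch uses the parser bundled with the JDK rather than an XMLC-generated document, so it only shows the standard API calls; the class Dom3SerializeSketch is hypothetical, and obtaining the DOMImplementationLS via getFeature("LS", "3.0") is just one of several ways to get hold of it.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

class Dom3SerializeSketch {
    // Serializes a document to a string with the standard DOM3 LSSerializer.
    static String serialize(Document doc) {
        DOMImplementationLS domImplLS =
                (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        LSSerializer serializer = domImplLS.createLSSerializer();
        // Leave out the <?xml ...?> declaration for this example.
        serializer.getDomConfig().setParameter("xml-declaration", Boolean.FALSE);
        return serializer.writeToString(doc);
    }

    // Builds a tiny document in memory so the example has something to write.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("note");
            Element greeting = doc.createElement("greeting");
            greeting.setTextContent("hello");
            root.appendChild(greeting);
            doc.appendChild(root);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument()));
    }
}
```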