Scaling XML to High-Volume -- Dos and Don'ts


Developers looking to scale their pilot XML projects to embrace more volume or to link to more systems should carefully evaluate whether today's popular approaches will truly serve their needs.

A recent whitepaper from ZapThink, a web services consultancy in Waltham, Mass., found drawbacks with all of the top XML performance-tuning options, including:
(a) XSL (Extensible Stylesheet Language),

(b) compression,

(c) using smaller "element names,"

(d) using non-standard XML parsers, and even

(e) rewriting XML rules and/or business logic.

"Developers need to start their planning from the basic assumption that XML is inefficient. If developers don't spend time thinking clearly about what that means to their systems environment, they could run into challenges," ZapThink analyst Ronald Schmelzer told Integration Developer News.

Overcoming Hazards of XML Shortcuts

The problem with scaling XML, Schmelzer said, is that it's difficult to determine just where the inefficiencies will crop up. Because many XML projects are low- and medium-volume projects designed to be small pilot tests, he said, "at first, often these XML inefficiencies don't really show up."

Schmelzer describes XML's inherent challenges this way in a recent ZapThink column:

"XML is a text-based, human-readable, and metadata-encoded markup language that operates on the principle that the metadata that describes a message's meaning and context accompanies the content of the message. As a result, XML document sizes can easily be ten to twenty times larger than an equivalent binary representation of the same information."
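The size penalty is easy to demonstrate with a short sketch. The snippet below is an illustrative comparison, not a benchmark from the ZapThink paper; the record layout and element names are invented. It encodes the same three records once as XML and once as packed binary:

```python
import struct
import xml.etree.ElementTree as ET

# Invented example records: (id, price, quantity)
records = [(1, 19.99, 3), (2, 5.49, 12), (3, 120.00, 1)]

# XML encoding: the descriptive metadata travels with every record.
root = ET.Element("orders")
for rid, price, qty in records:
    order = ET.SubElement(root, "order")
    ET.SubElement(order, "identifier").text = str(rid)
    ET.SubElement(order, "unitPrice").text = str(price)
    ET.SubElement(order, "quantity").text = str(qty)
xml_bytes = ET.tostring(root)

# Equivalent binary encoding: 4-byte int, 8-byte double, 4-byte int per record.
bin_bytes = b"".join(struct.pack("<idi", rid, price, qty)
                     for rid, price, qty in records)

print(len(xml_bytes), len(bin_bytes))  # the XML is many times larger
```

Even this tiny example produces XML several times the size of the binary form; with longer element names and deeper nesting, the gap widens toward the ten-to-twenty-fold range Schmelzer describes.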

Schmelzer says these "inefficiencies" can crop up and bite a developer in three main areas:

1. Bandwidth -- Simply transferring XML documents -- even without XML schema transformations -- can eat up a lot of bandwidth. A network might need up to 10-20 MB of bandwidth for high-volume transfers, Schmelzer said.

2. Processor overload -- While a growing number of developers are using XSL (Extensible Stylesheet Language) to help with XML throughput, Schmelzer says it's not a true answer. "XSL taxes a processor quite a lot, especially if you're doing 100 XSL transactions per second." And these XSL latencies can multiply, especially as developers begin to construct XML storage solutions.

3. Storage -- The more XML documents (or parts of documents) developers need to store, the more they may look to use XSL in many places. "Once you make a decision to use XSL, you may find that code will grow like wildfire," Schmelzer said.
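Before blaming the network for any of the three bottlenecks above, it helps to measure the per-message parsing cost on its own. A minimal sketch, using only Python's standard-library `timeit` and `xml.etree` (the document shape is invented for illustration):

```python
import timeit
import xml.etree.ElementTree as ET

# A small, invented purchase-order message.
DOC = "<order><id>42</id><item sku='A-1'>widget</item><qty>7</qty></order>"

def parse_once():
    # A full parse builds an element tree; this is the per-message CPU cost
    # that multiplies at hundreds of messages per second.
    root = ET.fromstring(DOC)
    return root.findtext("qty")

n = 1000
seconds = timeit.timeit(parse_once, number=n)
print(f"{n} parses took {seconds:.4f}s "
      f"({n / seconds:.0f} messages/sec on this machine)")
```

Dividing the measured rate into a target message volume gives a rough sense of how much headroom remains before parsing alone -- never mind XSL transformation -- saturates a processor.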

Once the bottleneck (or potential bottleneck) is diagnosed, developers should take care in applying a solution, Schmelzer said. "In the push for better performance, developers are doing things to XML that aren't exactly standard," he said. Schmelzer admits there are "no established best practices" for speeding up XML, and so "people are doing their own thing. Sometimes it works and sometimes it doesn't," he said.

Performance and Interoperability Trade-offs
Even though these techniques seem to solve today's performance problem, Schmelzer warned that the widespread use of non-standard solutions will probably constrain the interoperability of XML systems -- both inside and outside the firewall.

"These solutions may speed up some performance to a point, but they don't work all the time for every bottleneck," Schmelzer said. "Developers need to realize that when doing XML data sharing or document transfers, the other side of the communication has to understand these compressed formats or rewritten element names. Naturally, one consequence is that while your first project may work just fine, the more systems you add in, the more likely you'll lose interoperability."

  • Compressing or "Squeezing" XML -- While XML compression helps maximize bandwidth and limits storage needs, it does little for CPU overload. In fact, Schmelzer said, unrestrained or inappropriate use of compression could actually worsen processor performance, because XML documents moving across a network must now be decompressed, processed, parsed and re-compressed at each hop. This raises the unhappy possibility that all the performance gained from XML compression could be lost to extra processing requirements. Schmelzer also points out that some developers have resorted to shrinking their element names down to one or two characters. While such short tags definitely save space compared with longer, descriptive names, the resulting XML is for all practical purposes no longer human readable.

  • Ignoring XML Validity -- Simply skipping the processing step of validating XML documents is another approach to improving XML performance. In fact, ZapThink's research has shown that few businesses use XML validation of any type (either DTD or W3C XML Schema) as part of runtime XML-based business processes. Instead, developers will check their XML for validity only during the test or design phases of an implementation, and then simply trust that the documents remain valid thereafter.

  • Rewriting Parsers and XML -- Looking for more dramatic results, developers are increasingly using proprietary parsers -- programs that let developers recompile XML so that it supports only a subset of all available XML functionality. Some more hands-on developers are even rewriting XML's rules themselves -- eliminating the need for end tags or removing case-sensitivity within XML documents. In essence, these developers are creating new, proprietary markup languages of their own.
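The compression trade-off in the first bullet can be sketched with Python's standard-library `zlib` module. The document contents are invented, and this illustrates only the size-versus-CPU trade, not any ZapThink benchmark:

```python
import zlib

# Invented, repetitive XML payload -- verbose element names compress well.
record = ("<customerOrder><customerName>Acme</customerName>"
          "<orderQuantity>5</orderQuantity></customerOrder>")
xml_doc = ("<customerOrders>" + record * 200 + "</customerOrders>").encode("utf-8")

compressed = zlib.compress(xml_doc)     # smaller on the wire and on disk...
restored = zlib.decompress(compressed)  # ...but every hop pays CPU to undo it

print(len(xml_doc), len(compressed))
```

The repetitive metadata compresses dramatically, which is exactly the bandwidth win Schmelzer concedes -- but note that both ends must run the compress/decompress step, and both ends must agree on the scheme, which is where the interoperability risk enters.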

In Search of Solutions

Schmelzer admits there aren't many standard answers to these problems -- at least not today. He vocally suggests that the W3C, OASIS or another web services standards body take up the issue of making high-volume XML more efficient. "As companies begin to roll out bigger and bigger XML projects, it will become much more evident that this problem merits standards attention," he told IDN.

But in the meantime, he suggests the boost in performance should come from tuning the hardware or the application server -- not the XML itself. "To avoid problems down the line, the developer should implement XML as a standard as much as he can, and optimize his hardware, software application or network for performance," he said.

Schmelzer posed the following question: "At what point does a compressed, stripped-down, non-validating 'XML-like' format leave the standards behind and represent a proprietary data format?"

To guard against tweaking XML out of standards compliance, Schmelzer also suggests that XML be converted before it's dropped onto the network wire, and that the resulting post-XML data format then be compressed. XSLT processors are one flavor of such an option, but even this technique needs to be evaluated for the type of XML traffic being pushed through the network, because XSLT processors can slow application servers by adding pre- or post-processing work, he added.
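One way to read that advice: keep the document standard XML end to end, and let the transport boundary do the squeezing. A minimal sketch using Python's standard-library `gzip` (the payload is invented; real deployments would more likely rely on transport features such as HTTP content-encoding negotiation than hand-rolled compression):

```python
import gzip
import xml.etree.ElementTree as ET

payload = b"<invoice><total currency='USD'>149.95</total></invoice>"

# Sender side: compress at the boundary; the XML itself is untouched.
wire_bytes = gzip.compress(payload)

# Receiver side: decompress first, then hand standard XML to any parser.
received = gzip.decompress(wire_bytes)
total = ET.fromstring(received).findtext("total")
print(total)  # -> 149.95
```

Because the compression lives entirely at the boundary, any standards-compliant peer that can undo the transport encoding can still parse the document with an off-the-shelf parser -- no rewritten element names, no proprietary format.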

"There should be somebody else worrying about how to squeeze more performance out of the XML traffic on the pipe so that developers don't need to write specialized parsers," he said. "That's the way EDI did it, with everyone writing their own parsers, and it definitely made things less interoperable."

The full copy of Schmelzer's ZapThink column, "Breaking XML To Optimize Performance," is available.