Thursday, September 10, 2009

XSD Validation in Scala

One of the coolest features of Scala is that it treats XML as a first-class data type. It can be embedded directly into program code (much like E4X in ActionScript) and processed very succinctly with the language's pattern matching constructs.

But one key feature that the Scala libraries are lacking is support for validation of XML documents against an XSD schema. Fortunately, the JDK has a standard mechanism for this in the javax.xml.validation package. And with a little bit of work we can combine these capabilities to get the best of both platforms.
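To get a feel for the JDK side on its own, here is a minimal sketch that validates a document using nothing but javax.xml.validation. The object name JdkValidationSketch, the isValid helper, and the tiny greeting schema are made up for illustration; only the JDK calls are real.

```scala
import java.io.StringReader
import javax.xml.XMLConstants
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.SchemaFactory
import scala.util.Try

object JdkValidationSketch {
  // Hypothetical schema: a single <greeting> element holding a string.
  val xsd: String =
    """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      |  <xs:element name="greeting" type="xs:string"/>
      |</xs:schema>""".stripMargin

  // Compile the schema once; a Schema object can be reused across documents.
  private val schema = SchemaFactory
    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
    .newSchema(new StreamSource(new StringReader(xsd)))

  // Validator.validate returns normally for a conforming document and
  // throws a SAXException otherwise, so success of the Try means "valid".
  def isValid(xml: String): Boolean =
    Try(schema.newValidator().validate(new StreamSource(new StringReader(xml)))).isSuccess
}
```

This tells us whether a document is valid, but it hands back no Scala XML tree, which is the gap the rest of this post fills.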

The scala.xml.XML object provides a set of simple functions for reading XML from various sources (files, streams, etc.). A quick inspection of the code in this class shows all the load* methods to be wrappers around the loadXML method of the scala.xml.parsing.NoBindingFactoryAdapter class.

NoBindingFactoryAdapter (via its parent class scala.xml.parsing.FactoryAdapter) extends the standard org.xml.sax.helpers.DefaultHandler class and serves as a bridge between the Java and Scala XML worlds. The relevant loadXML method is defined on FactoryAdapter and looks like this:

def loadXML(source: InputSource): Node = {
  // create parser
  val parser: SAXParser = try {
    val f = SAXParserFactory.newInstance()
    f.setNamespaceAware(false)
    f.newSAXParser()
  } catch {
    case e: Exception =>
      Console.err.println("error: Unable to instantiate parser")
      throw e
  }

  // parse file
  scopeStack.push(TopScope)
  parser.parse(source, this)
  scopeStack.pop
  return rootElem
}


Essentially, this method creates a standard javax.xml.parsers.SAXParser object and passes itself in as the ContentHandler. The callback methods in FactoryAdapter convert the standard SAX parsing events into instances of the core Scala XML types. This is our hook for introducing XSD validation.

The javax.xml.validation.ValidatorHandler is a class that acts as a filter of sorts (although not a true org.xml.sax.XMLFilter), intercepting callbacks sent from a SAX parser to a ContentHandler and generating errors if the data doesn't conform to its schema definition. We will create a subclass of NoBindingFactoryAdapter that interposes a ValidatorHandler between the parser and the callbacks defined in FactoryAdapter to more-or-less transparently implement validation.
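The interposition idea can be tried out with plain JDK classes before Scala's adapter gets involved. In the sketch below, RecordingHandler is a hypothetical stand-in for FactoryAdapter that simply records element names, and the note schema is invented for the example. The wiring (parser feeds the ValidatorHandler, which forwards to the downstream handler) is the part that carries over.

```scala
import java.io.StringReader
import javax.xml.XMLConstants
import javax.xml.parsers.SAXParserFactory
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.SchemaFactory
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler
import scala.collection.mutable.ListBuffer

object ValidatorHandlerSketch {
  // Hypothetical stand-in for FactoryAdapter: records each element it sees.
  class RecordingHandler extends DefaultHandler {
    val seen = ListBuffer.empty[String]
    override def startElement(uri: String, localName: String,
                              qName: String, attrs: Attributes): Unit = {
      seen += localName
    }
  }

  // Hypothetical schema: <note> must contain <to> followed by <body>.
  val xsd: String =
    """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      |  <xs:element name="note">
      |    <xs:complexType>
      |      <xs:sequence>
      |        <xs:element name="to" type="xs:string"/>
      |        <xs:element name="body" type="xs:string"/>
      |      </xs:sequence>
      |    </xs:complexType>
      |  </xs:element>
      |</xs:schema>""".stripMargin

  def parseValidated(xml: String): List[String] = {
    val schema = SchemaFactory
      .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
      .newSchema(new StreamSource(new StringReader(xsd)))

    val factory = SAXParserFactory.newInstance()
    factory.setNamespaceAware(true) // schema validation needs namespace-aware events
    val reader = factory.newSAXParser().getXMLReader

    val downstream = new RecordingHandler
    val vh = schema.newValidatorHandler()
    vh.setContentHandler(downstream) // validator forwards events downstream
    reader.setContentHandler(vh)     // parser sends its events to the validator

    // With no ErrorHandler set on vh, a validation error is thrown
    // as a SAXParseException from parse.
    reader.parse(new InputSource(new StringReader(xml)))
    downstream.seen.toList
  }
}
```

For a conforming document the downstream handler sees every element; for a non-conforming one the parse is aborted with an exception before a result is returned.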

We will use the loadXML method of FactoryAdapter as a starting point for implementation:

import javax.xml.parsers.SAXParser
import javax.xml.parsers.SAXParserFactory
import javax.xml.validation.Schema
import javax.xml.validation.ValidatorHandler
import org.xml.sax.InputSource
import org.xml.sax.XMLReader
import scala.xml.{Elem, TopScope}
import scala.xml.parsing.NoBindingFactoryAdapter

class SchemaAwareFactoryAdapter(schema: Schema) extends NoBindingFactoryAdapter {

  override def loadXML(source: InputSource): Elem = {
    // create a namespace-aware parser (required for schema validation)
    val parser: SAXParser = try {
      val f = SAXParserFactory.newInstance()
      f.setNamespaceAware(true)
      f.setFeature("http://xml.org/sax/features/namespace-prefixes", true)
      f.newSAXParser()
    } catch {
      case e: Exception =>
        Console.err.println("error: Unable to instantiate parser")
        throw e
    }

    // chain: parser -> ValidatorHandler -> this adapter
    val xr = parser.getXMLReader()
    val vh = schema.newValidatorHandler()
    vh.setContentHandler(this)
    xr.setContentHandler(vh)

    // parse file
    scopeStack.push(TopScope)
    xr.parse(source)
    scopeStack.pop
    return rootElem.asInstanceOf[Elem]
  }
}


The key differences here are that we have enabled namespace awareness on the parser (which is a requirement for schema validation!) and stuck our ValidatorHandler in between the parser and the FactoryAdapter instance. Because we are actually overriding NoBindingFactoryAdapter#loadXML (which has a return type of Elem) we need the cast in the final line.

So now we can validate XML as follows:

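import java.io.File
import javax.xml.XMLConstants
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.SchemaFactory
import org.xml.sax.InputSource
import scala.xml.XML

// A schema can be loaded like this:

val sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
val s = sf.newSchema(new StreamSource(new File("foo.xsd")))

// and whenever we would want to do something like:
// (InputSource has no File constructor, so we pass the file's URI as the system ID)

val is = new InputSource(new File("foo.xml").toURI.toString)
val xml = XML.load(is)

// instead we'll use our class:

val validatedIs = new InputSource(new File("foo.xml").toURI.toString)
val validatedXml = new SchemaAwareFactoryAdapter(s).loadXML(validatedIs)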
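One thing worth knowing when using this: a ValidatorHandler with no ErrorHandler set throws a SAXParseException on the first validation error. If you would rather see every problem in a document, you can set your own ErrorHandler on the ValidatorHandler. The following is only a sketch under that assumption; CollectingErrorsSketch, validationErrors and the note schema are hypothetical names, and the same setErrorHandler call could be added inside loadXML above.

```scala
import java.io.StringReader
import javax.xml.XMLConstants
import javax.xml.parsers.SAXParserFactory
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.SchemaFactory
import org.xml.sax.{ErrorHandler, InputSource, SAXParseException}
import scala.collection.mutable.ListBuffer

object CollectingErrorsSketch {
  // Hypothetical schema: <note> must contain <to> followed by <body>.
  val xsd: String =
    """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      |  <xs:element name="note">
      |    <xs:complexType>
      |      <xs:sequence>
      |        <xs:element name="to" type="xs:string"/>
      |        <xs:element name="body" type="xs:string"/>
      |      </xs:sequence>
      |    </xs:complexType>
      |  </xs:element>
      |</xs:schema>""".stripMargin

  // Returns the messages of all validation errors found (empty when valid).
  def validationErrors(xml: String): List[String] = {
    val schema = SchemaFactory
      .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
      .newSchema(new StreamSource(new StringReader(xsd)))

    val errors = ListBuffer.empty[String]
    val vh = schema.newValidatorHandler()
    vh.setErrorHandler(new ErrorHandler {
      def warning(e: SAXParseException): Unit = ()
      // Record the error and let validation continue.
      def error(e: SAXParseException): Unit = { errors += e.getMessage }
      // Fatal errors are still rethrown.
      def fatalError(e: SAXParseException): Unit = throw e
    })

    val factory = SAXParserFactory.newInstance()
    factory.setNamespaceAware(true)
    val reader = factory.newSAXParser().getXMLReader
    reader.setContentHandler(vh) // no downstream handler: we only validate here
    reader.parse(new InputSource(new StringReader(xml)))
    errors.toList
  }
}
```

Note that this handler only sees errors raised by the validator. A document that is not well-formed still makes the parser itself throw.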