Thursday, September 10, 2009

XSD Validation in Scala

One of the coolest features of Scala is that it treats XML as a first-class data type. It can be embedded directly into program code (much like E4X in ActionScript) and processed very succinctly with the language's pattern matching constructs.

But one key feature that the Scala libraries are lacking is support for validation of XML documents against an XSD schema. Fortunately, the JDK has a standard mechanism for this in the javax.xml.validation package. And with a little bit of work we can combine these capabilities to get the best of both platforms.

The scala.xml.XML object provides a set of simple functions for reading XML from various sources (files, streams, etc.). A quick inspection of the code in this class shows all the load* methods to be wrappers around the loadXML method of the scala.xml.parsing.NoBindingFactoryAdapter class.

NoBindingFactoryAdapter (via its parent class scala.xml.parsing.FactoryAdapter) extends the standard org.xml.sax.helpers.DefaultHandler class and serves as a bridge between the Java and Scala XML worlds. The relevant loadXML method is defined on FactoryAdapter and looks like this:

def loadXML(source: InputSource): Node = {
  // create parser
  val parser: SAXParser = try {
    val f = SAXParserFactory.newInstance()
    f.setNamespaceAware(false)
    f.newSAXParser()
  } catch {
    case e: Exception =>
      Console.err.println("error: Unable to instantiate parser")
      throw e
  }

  // parse file
  scopeStack.push(TopScope)
  parser.parse(source, this)
  scopeStack.pop
  return rootElem
}


Essentially, this method creates a standard javax.xml.parsers.SAXParser object and passes itself in as the ContentHandler. The callback methods in FactoryAdapter convert the standard SAX parsing events into instances of the core Scala XML types. This is our hook for introducing XSD validation.
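To make the "SAX events in, tree out" idea concrete, here is a minimal sketch in plain Java. It is not the FactoryAdapter implementation; the TreeBuildingHandler class and its Element type are invented for illustration, and they ignore attributes and namespaces entirely.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, simplified stand-in for FactoryAdapter: builds a tree from SAX callbacks.
class TreeBuildingHandler extends DefaultHandler {

    // Minimal element type, invented for this sketch.
    static final class Element {
        final String name;
        final List<Element> children = new ArrayList<>();
        final StringBuilder text = new StringBuilder();
        Element(String name) { this.name = name; }
    }

    // Elements that have been opened but not yet closed, innermost first.
    private final Deque<Element> open = new ArrayDeque<>();
    private Element root;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(new Element(qName));
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!open.isEmpty()) open.peek().text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // A finished element is attached to its parent, or becomes the root.
        Element done = open.pop();
        if (open.isEmpty()) root = done; else open.peek().children.add(done);
    }

    // Mirrors the shape of loadXML: create a parser, pass the handler in, return the root.
    static Element load(String xml) throws Exception {
        TreeBuildingHandler handler = new TreeBuildingHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.root;
    }
}
```

Because the tree is assembled purely from callbacks, anything that sits between the parser and the handler can inspect or reject the events before the tree is built.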

The javax.xml.validation.ValidatorHandler is a class that acts as a filter of sorts (although not a true org.xml.sax.XMLFilter), intercepting callbacks sent from a SAX parser to a ContentHandler, and generating errors if the data doesn't conform to its schema definition. We will create a subclass of NoBindingFactoryAdapter that interposes a ValidatorHandler between the parser and the callback methods defined in FactoryAdapter to more-or-less transparently implement validation.

We will use the loadXML method of FactoryAdapter as a starting point for implementation:

import javax.xml.parsers.SAXParser
import javax.xml.parsers.SAXParserFactory
import javax.xml.validation.Schema
import javax.xml.validation.ValidatorHandler
import org.xml.sax.InputSource
import org.xml.sax.XMLReader
import scala.xml.{Elem, TopScope}
import scala.xml.parsing.NoBindingFactoryAdapter

class SchemaAwareFactoryAdapter(schema: Schema) extends NoBindingFactoryAdapter {

  override def loadXML(source: InputSource): Elem = {
    // create parser
    val parser: SAXParser = try {
      val f = SAXParserFactory.newInstance()
      f.setNamespaceAware(true)
      f.setFeature("http://xml.org/sax/features/namespace-prefixes", true)
      f.newSAXParser()
    } catch {
      case e: Exception =>
        Console.err.println("error: Unable to instantiate parser")
        throw e
    }

    // route SAX events through the validator before they reach this adapter
    val xr = parser.getXMLReader()
    val vh = schema.newValidatorHandler()
    vh.setContentHandler(this)
    xr.setContentHandler(vh)

    // parse file
    scopeStack.push(TopScope)
    xr.parse(source)
    scopeStack.pop
    return rootElem.asInstanceOf[Elem]
  }
}


The key differences here are that we have enabled namespace awareness on the parser (which is a requirement for schema validation!) and stuck our ValidatorHandler in between the parser and the FactoryAdapter instance. Because we are actually overriding NoBindingFactoryAdapter#loadXML (which has a return type of Elem) we need the cast in the final line.
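The same parser, ValidatorHandler, ContentHandler chain can be tried out in plain Java without any Scala classes. The sketch below is only an illustration: the ValidatingChain class and its one-element schema are made up for the example, and a do-nothing DefaultHandler stands in for the FactoryAdapter at the end of the chain.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ValidatingChain {
    // Toy schema invented for this sketch: <count> must hold an integer.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='count' type='xs:int'/>"
      + "</xs:schema>";

    // Returns true if the document parses and validates, false if a SAX error is raised.
    static boolean isValid(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // schema validation needs namespace awareness
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // parser -> validator -> downstream handler
        ValidatorHandler validator = schema.newValidatorHandler();
        validator.setContentHandler(new DefaultHandler());
        reader.setContentHandler(validator);

        try {
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // With no ErrorHandler set, the validator throws on the first error.
            return false;
        }
    }
}
```

With this toy schema, isValid("<count>3</count>") should come back true and isValid("<count>three</count>") false, since the second document fails the xs:int check.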

So now we can validate XML as follows:

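import java.io.{File, FileInputStream}
import javax.xml.XMLConstants
import javax.xml.transform.stream.StreamSource
import javax.xml.validation.SchemaFactory
import org.xml.sax.InputSource
import scala.xml.XML

// A schema can be loaded in like ...

val sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
val s = sf.newSchema(new StreamSource(new File("foo.xsd")))

// and whenever we would want to do something like:
// (InputSource has no File constructor, so we wrap the file in a stream)

val is = new InputSource(new FileInputStream("foo.xml"))
val xml = XML.load(is)

// instead we'll use our class:

val is = new InputSource(new FileInputStream("foo.xml"))
val xml = new SchemaAwareFactoryAdapter(s).loadXML(is)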
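By default a ValidatorHandler with no ErrorHandler throws a SAXParseException at the first validation error. If you would rather see every problem in a document, you can install an org.xml.sax.ErrorHandler that records errors instead of throwing. Below is a rough Java sketch of that idea; the CollectingValidation class and its schema are hypothetical, and in the Scala adapter the equivalent step would be a call to setErrorHandler on the ValidatorHandler before parsing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class CollectingValidation {
    // Toy schema invented for this sketch: <items> holds one or more integer <n> elements.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='items'><xs:complexType><xs:sequence>"
      + "<xs:element name='n' type='xs:int' maxOccurs='unbounded'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Returns the validation messages for the document; an empty list means it validated.
    static List<String> validate(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));

        final List<String> problems = new ArrayList<>();
        ValidatorHandler validator = schema.newValidatorHandler();
        validator.setContentHandler(new DefaultHandler());
        validator.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { /* ignored in this sketch */ }
            @Override public void error(SAXParseException e) {
                // Record the problem and let parsing continue.
                problems.add("line " + e.getLineNumber() + ": " + e.getMessage());
            }
            @Override public void fatalError(SAXParseException e) throws SAXException {
                throw e;
            }
        });

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setContentHandler(validator);
        reader.parse(new InputSource(new StringReader(xml)));
        return problems;
    }
}
```

Note that a single bad value may be reported more than once, so the length of the list is not a reliable count of distinct mistakes. Documents that are not well-formed still fail with an exception from the parser itself.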

Friday, June 05, 2009

Adding a Remotely Accessible REPL to a Clojure Application

One of the distinguishing features of Lisp systems, including Clojure, is that the components that parse and compile code are accessible to user programs, even at runtime (i.e. "the whole language always available"). This enables sophisticated language features like macros, and is very different from the traditional Java worldview of distinct compile- and run-times.

The most familiar application of this concept is the interactive REPL that comes with Clojure. Because the REPL is written in Clojure itself, it is available to any user program as the function clojure.main/repl. We can use this in conjunction with the networking libraries from the JDK to add a REPL interface to our application that is accessible over a socket connection. This gives us the ability to interact with and even modify a running application (e.g. by adding new code or changing data values) simply by connecting to the REPL and evaluating Clojure expressions. Moreover, since none of these activities requires an application restart or a lengthy rebuild process, it can be a huge time saver during the development process.

There are only a few modifications to our skeletal web app to make this happen. First, I'll make a slight modification to the ClojureContextListener so that it accepts a whitespace-separated list of files to evaluate on startup, rather than just a single file:
public void contextInitialized(ServletContextEvent sce) {
    try {
        ServletContext sc = sce.getServletContext();
        String evalOnContextInitialized = sc.getInitParameter("evalOnContextInitialized");
        // split the parameter on runs of whitespace and load each script in order
        String[] scripts = evalOnContextInitialized.split("\\s+");
        for (String script : scripts) {
            RT.loadResourceScript(script);
        }
    }
    catch (Exception e) {
        // log an error message here ...
    }
}
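One wrinkle with splitting on whitespace: if the parameter value starts with whitespace (easy to do when the value is wrapped onto its own line in web.xml), String.split returns an empty first element, and the listener would then try to load a script with an empty name. A small helper along these lines avoids that; the ScriptList class is a hypothetical name, not part of the original listener.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for turning the init parameter into a clean list of script names.
class ScriptList {
    static List<String> parse(String paramValue) {
        // A missing or blank parameter simply means "nothing to load".
        if (paramValue == null || paramValue.trim().isEmpty()) {
            return Collections.emptyList();
        }
        // Trim first so leading whitespace does not produce an empty first element.
        return Arrays.asList(paramValue.trim().split("\\s+"));
    }
}
```

The listener could then loop over ScriptList.parse(sc.getInitParameter("evalOnContextInitialized")) instead of splitting the raw string.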
Then we'll create the file myapp/repl.clj in the WEB-INF/classes directory of our application. It provides the functions we need to start and run the REPL. Our REPL will run on a separate thread and serve a single client at a time. Because the built-in repl function expects to send and receive data from *in*, *out* and *err*, we just need to rebind them to the input and output streams associated with our socket connection:
(ns myapp.repl
  (:require clojure.main)
  (:import (java.io InputStreamReader PrintWriter)
           (java.net ServerSocket Socket)
           (clojure.lang LineNumberingPushbackReader)))

(defn do-on-thread
  "Create a new thread and run function f on it. Returns the thread object that
  was created."
  [f]
  (let [thread (new Thread f)]
    (.start thread)
    thread))

(defn socket-repl
  "Start a new REPL that is connected to the input/output streams of
  socket."
  [socket]
  (let [socket-in (new LineNumberingPushbackReader
                       (new InputStreamReader
                            (.getInputStream socket)))
        socket-out (new PrintWriter
                        (.getOutputStream socket) true)]
    (binding [*in* socket-in
              *out* socket-out
              *err* socket-out]
      (clojure.main/repl))))

(defn start-repl-server
  "Creates a new thread and starts a REPL server listening on port. Returns
  the server socket that was just created."
  [port]
  (let [server-socket (new ServerSocket port 0)]
    (do-on-thread #(while true (socket-repl (.accept server-socket))))
    server-socket))
And we'll add at the end of the file the following snippet that starts the REPL server, if the system property replPort has been defined:
(let [repl-port (System/getProperty "replPort")]
  (if (not (nil? repl-port))
    (def *repl-server* (start-repl-server (Integer/parseInt repl-port)))))
This will allow us to disable the remote REPL (for example, in production) by simply omitting the replPort property. Lastly, we'll add this to web.xml:
<context-param>
  <param-name>evalOnContextInitialized</param-name>
  <param-value>myapp/repl.clj myapp/web.clj</param-value>
</context-param>
Assuming that we have started our application with the -DreplPort=12345 option, at this point we can use any generic client (such as netcat) to connect to our application:
tinman:~ sean$ nc localhost 12345
clojure.core=>
 



UPDATE: It turns out that a remote REPL capability is available in the clojure.contrib.server-socket package. I found it accidentally when my Emacs tags file led me to its socket-repl function instead of my own. Interestingly, the two implementations are very similar.

Wednesday, June 03, 2009

Setting up a Clojure development environment is fairly straightforward, although the tool support is fairly limited at present. The choices are:
  1. An Emacs clojure-mode, which provides some support for the SLIME Lisp development environment;
  2. An Eclipse plugin, Clojure-dev;
  3. A NetBeans plugin, Enclojure.
All of these are fairly rudimentary. They each support syntax highlighting and formatting and some degree of REPL integration. The Eclipse and NetBeans plugins provide the project management capabilities of their environments, but the more advanced features (refactoring, automatic symbol cross-referencing, debugging) aren't available. Hopefully they will be added as the tools mature.

At this point, the most advanced option is Enclojure. As an Eclipse user for the past few years, I have found learning a new tool somewhat annoying, but most of the features are present in NetBeans. The key advantage of Enclojure (over Clojure-dev) at this point is the ability to connect to a remote REPL, which is a very cool feature that I'll talk more about in a future post.

Monday, June 01, 2009

Because Clojure has such tight integration with the JVM, embedding it into a Java application is a snap. In particular, writing Clojure code that runs inside of a J2EE web container and handles web requests is trivial.

We just need to create two simple pieces of scaffolding code in Java to help the container find and dispatch web requests to our Clojure functions.

The first item is a ServletContextListener that initializes the Clojure runtime and evaluates our code when the web application is first loaded by the container:

package myapp.servlet;

import javax.servlet.ServletContext;
import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;

import clojure.lang.RT;

public class ClojureContextListener implements ServletContextListener {

    public void contextInitialized(ServletContextEvent sce) {
        try {
            ServletContext sc = sce.getServletContext();
            String script = sc.getInitParameter("evalOnContextInitialized");
            RT.loadResourceScript(script);
        }
        catch (Exception e) {
            // log an error message ...
        }
    }

    public void contextDestroyed(ServletContextEvent sce) {
        // nothing to do here ...
    }

}

We'll expect a <context-param> named "evalOnContextInitialized" in the application's deployment descriptor to contain the name of a file containing our functions.

The second bit of helper code is a simple HttpServlet that delegates calls to its service method to a Clojure function of our choice:

package myapp.servlet;

import javax.servlet.ServletConfig;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import clojure.lang.RT;
import clojure.lang.Var;

public class ClojureServlet extends HttpServlet {

    private Var entryPoint;

    @Override
    public void init(ServletConfig sc) throws ServletException {
        // let GenericServlet keep the config so getServletConfig() works later
        super.init(sc);
        String entryPointString = sc.getInitParameter("entryPoint");
        if (entryPointString != null) {
            // "namespace/name" -> Var lookup
            String[] nsAndName = entryPointString.split("/", 2);
            entryPoint = RT.var(nsAndName[0], nsAndName[1]);
        }
        else {
            throw new ServletException("Servlet " + sc.getServletName() +
                " missing 'entryPoint' parameter!");
        }
    }

    @Override
    protected void service(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException {
        try {
            entryPoint.invoke(req, resp);
        }
        catch (Exception e) {
            throw new ServletException(e);
        }
    }

}

The name of the Clojure function will be specified as an <init-param> in the servlet's configuration. We will follow the Clojure namespace/name convention for identifying our function.
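The servlet above splits the parameter on "/" and trusts that both halves are present, so a value with no slash would fail with an ArrayIndexOutOfBoundsException rather than a helpful message. A slightly more defensive parse might look like the following sketch; the EntryPoint class is a hypothetical helper, not something Clojure provides.

```java
// Hypothetical value class for a "namespace/name" entry point specification.
class EntryPoint {
    final String namespace;
    final String name;

    private EntryPoint(String namespace, String name) {
        this.namespace = namespace;
        this.name = name;
    }

    // Splits on the first slash and rejects values missing either half.
    static EntryPoint parse(String spec) {
        int slash = (spec == null) ? -1 : spec.indexOf('/');
        if (slash <= 0 || slash == spec.length() - 1) {
            throw new IllegalArgumentException(
                "Expected namespace/name but got: " + spec);
        }
        return new EntryPoint(spec.substring(0, slash), spec.substring(slash + 1));
    }
}
```

In init, the two fields could then be passed to RT.var in place of nsAndName[0] and nsAndName[1], with the IllegalArgumentException wrapped in a ServletException.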

The final (and somewhat anti-climactic) piece of the puzzle is to create the Clojure code that actually handles the web request. This will be a function of two arguments that accepts the javax.servlet.http.HttpServletRequest and javax.servlet.http.HttpServletResponse objects provided by the web container. We'll put in a simple "Hello world" handler to demonstrate.

(ns myapp.web)

(defn hello
  "Hello world"
  [req resp]
  (.. resp (getWriter) (println (str "<html><body>" "Hello world" "</body></html>"))))

Since our hello function is in the "myapp.web" namespace, it should be contained in a file named myapp/web.clj that is on the application's CLASSPATH. In our case, this means in a directory beneath the WEB-INF/classes directory of the WAR file.
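The mapping from namespace to file name is mechanical: dots become directory separators, hyphens become underscores, and ".clj" is appended. As a quick illustration, here is that rule written out in Java; the NamespacePaths class is a made-up helper for this post, not a Clojure API.

```java
// Hypothetical helper that spells out the namespace-to-resource naming rule.
class NamespacePaths {
    static String resourceFor(String namespace) {
        // "my-app.web" -> "my_app/web.clj"
        return namespace.replace('-', '_').replace('.', '/') + ".clj";
    }
}
```

So "myapp.web" maps to myapp/web.clj, which is the value we pass to the context listener.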

Accordingly, our web.xml file should look like:
<web-app version="2.4"
         xmlns="http://java.sun.com/xml/ns/j2ee"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee http://java.sun.com/xml/ns/j2ee/web-app_2_4.xsd">

  <context-param>
    <param-name>evalOnContextInitialized</param-name>
    <param-value>myapp/web.clj</param-value>
  </context-param>

  <listener>
    <listener-class>myapp.servlet.ClojureContextListener</listener-class>
  </listener>

  <servlet>
    <servlet-name>ClojureServlet</servlet-name>
    <servlet-class>myapp.servlet.ClojureServlet</servlet-class>
    <init-param>
      <param-name>entryPoint</param-name>
      <param-value>myapp.web/hello</param-value>
    </init-param>
  </servlet>

  <servlet-mapping>
    <servlet-name>ClojureServlet</servlet-name>
    <url-pattern>/hello</url-pattern>
  </servlet-mapping>

</web-app>

And the resulting web application will be laid out as follows:

myapp.war/
  WEB-INF/
    web.xml
    classes/
      myapp/
        web.clj
        servlet/
          ClojureServlet.class
          ClojureContextListener.class
    lib/
      clojure-1.0.0.jar


Note that the single clojure-1.0.0.jar file is the only required dependency. That's all there is to it!

Friday, May 29, 2009

Clojure is a new dynamic programming language for the JVM. It is a dialect of Lisp that provides:

  1. Well-defined concurrency semantics for all language features. Clojure strongly encourages a functional programming style, and as such variables are single assignment and most data is immutable. For mutable data, Clojure supplies a software transactional memory system.
  2. Total integration with the JVM. Clojure data elements are instances of java.lang.Number, java.lang.String, etc. There is no bridging of parallel type systems (for example, from a canonical C language implementation). Clojure code compiles directly to Java bytecode and can be profiled and debugged using standard tools. Access to Java code/libraries from Clojure is totally seamless.
  3. Collection types as first-class language features. There is syntax for dealing with sets, maps and vectors, and their implementations have the same immutability and concurrency semantics as the other data types.
  4. Macros. Coming from a Java background this is perhaps the coolest feature of Clojure.

There are several excellent introductions to the language. The "Clojure for Java Programmers" lecture (first part, second part) assumes no familiarity with Lisp and is a decent introduction to the language. The version of the lecture for experienced Lisp programmers (first part, second part) goes into the language in much greater detail and is highly recommended. Once you have sat through the "non-Lisp" version you will have enough background to comprehend the "Lisp" version.