Wednesday, November 08, 2006

ChitChat with the Groovy Compiler

Ever wished to be able to influence the compiling process in Java?

To do this you usually don't influence the compiler directly. Instead you use AOP tools with strange syntax rules and extra configuration files, or you work at the bytecode level and use the instrumentation API. But these techniques have their limits. To be able to instrument a class you need an already compiled class that passes the verifier. What if you can't produce such a class?

But ok, let us go a few steps back. Programming by convention is on everyone's lips, so let us look at adding a method depending on the class name. I guess people would use an inter-type declaration in AspectJ to add methods to a certain class. The problem here is that we don't know the class beforehand. We react to the name of the class, and only if the name fits do we add the method.

Another thing is adding imports. There is no point in adding imports to a class file, because all class names in a class file are already resolved against the imports. Adding another import would be pointless. But if you want to let people write a small piece of code where they can forget about importing every class they use, then we are talking about adding something to what the compiler gets as input. The easiest solution would be to use some kind of macro system: expand the macros and feed the result to the compiler. But then, what about the line numbering? Exceptions will show useless information about the position in the source file, and compiler errors will point to the wrong places. That's a bit bad.
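To make the line numbering problem concrete, here is a minimal sketch in plain Java. It is not part of Groovy; the class LineShiftSketch, its two helper methods and the my.company.Helper import are made up for this illustration. It prepends import lines to a source text and shows that the same statement ends up on a different line in what the compiler would see.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of Groovy: a naive "macro" step that
// prepends import lines to the source text before it reaches the compiler.
class LineShiftSketch {

    // Returns the 1-based number of the first line containing the marker,
    // or -1 if no line contains it.
    static int lineOf(String source, String marker) {
        String[] lines = source.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].contains(marker)) {
                return i + 1;
            }
        }
        return -1;
    }

    // Textual expansion: one import statement per line in front of the source.
    static String prependImports(String source, List<String> imports) {
        StringBuilder sb = new StringBuilder();
        for (String imp : imports) {
            sb.append("import ").append(imp).append('\n');
        }
        return sb.append(source).toString();
    }

    public static void main(String[] args) {
        String original = "def user = new UserDAO()\nuser.save()\n";
        String expanded = prependImports(original,
                Arrays.asList("my.company.UserDAO", "my.company.Helper"));
        System.out.println("line in the file the author wrote: "
                + lineOf(original, "user.save()"));
        System.out.println("line in the text the compiler sees: "
                + lineOf(expanded, "user.save()"));
    }
}
```

Every error message or stack trace produced from the expanded text would be off by the number of inserted lines, which is the reason to prefer a change that does not touch the source text.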

Compilers normally work on an AST, an abstract syntax tree. Such a tree represents the source as a tree in which each node stands for a piece of the logic expressed in the source. For example, the expression "foo.bar()" is in Groovy a method call expression with foo as the object the call is made on, bar as the name of the method and an empty argument list. Such a method call expression is a node in the AST. I am afraid that explaining all the expression types and statements in Groovy would be too much for this small article. I suggest looking into the packages org.codehaus.groovy.ast.expr, org.codehaus.groovy.ast.stmt and org.codehaus.groovy.ast. Of course, having around 35 expressions and 19 statements doesn't make this task very pleasant, but it is needed. Groovy uses the visitor pattern to visit the nodes. That makes it a bit difficult for people who are not used to the pattern, but it really is nice to have.
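For readers who have not met the visitor pattern yet, the following is a minimal sketch in plain Java. All types in it (ExprVisitor, Expr, VariableExpr, MethodCallExpr, CalledMethodCollector) are hypothetical stand-ins invented for this article and far simpler than the real classes in org.codehaus.groovy.ast.expr. It builds the tree for "foo.bar()" and lets a visitor collect the names of the called methods.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, not the Groovy AST classes.
interface ExprVisitor {
    void visitVariable(VariableExpr expr);
    void visitMethodCall(MethodCallExpr expr);
}

abstract class Expr {
    // Each node type dispatches to "its" visit method.
    abstract void accept(ExprVisitor visitor);
}

class VariableExpr extends Expr {
    final String name;

    VariableExpr(String name) {
        this.name = name;
    }

    void accept(ExprVisitor visitor) {
        visitor.visitVariable(this);
    }
}

class MethodCallExpr extends Expr {
    final Expr object;          // the object the call is made on
    final String method;        // the name of the method
    final List<Expr> arguments; // may be empty

    MethodCallExpr(Expr object, String method, List<Expr> arguments) {
        this.object = object;
        this.method = method;
        this.arguments = arguments;
    }

    void accept(ExprVisitor visitor) {
        visitor.visitMethodCall(this);
    }
}

// A visitor that records the name of every method called in an expression.
class CalledMethodCollector implements ExprVisitor {
    final List<String> methods = new ArrayList<String>();

    public void visitVariable(VariableExpr expr) {
        // a plain variable contains no call, nothing to record
    }

    public void visitMethodCall(MethodCallExpr expr) {
        methods.add(expr.method);
        expr.object.accept(this);
        for (Expr argument : expr.arguments) {
            argument.accept(this);
        }
    }
}

class AstVisitorSketch {
    public static void main(String[] args) {
        // The tree for "foo.bar()": a call of bar on foo with no arguments.
        Expr call = new MethodCallExpr(new VariableExpr("foo"), "bar", new ArrayList<Expr>());
        CalledMethodCollector collector = new CalledMethodCollector();
        call.accept(collector);
        System.out.println(collector.methods); // prints [bar]
    }
}
```

The nice part is that the traversal logic lives in the visitor, so a new kind of analysis is a new visitor class and the node classes stay untouched.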

Each phase of the compilation process after the AST is created consists of a visitor traveling along the AST and replacing or modifying nodes. Btw, the Groovy AST is also used elsewhere, for example in Eclipse for the outline.

Now imagine you could change the AST yourself. Rename methods, add code, add methods... yes, right, we wanted to add a method if the class name fulfills a certain convention.

The Groovy compiler offers some hooks here that you can use to change what the compiler will produce. What I will explain now applies only to the case where a file is compiled using the GroovyClassLoader, for example because a script is found on the classpath or because we use the GroovyShell to execute a script. The basic process is to create a compilation unit, which will hold the AST and other information, add a source to the compilation unit and then start the compilation. By subclassing GroovyClassLoader you can override createCompilationUnit(CompilerConfiguration, CodeSource), the very first step. The CompilationUnit is really our compiler. It holds all the logic. So if you return a subclass of CompilationUnit here, you would return a custom compiler. But I don't want to go that far here.

I talked about hooks, and subclassing CompilationUnit would possibly be a bit much. I also talked about the phases a compiler goes through, and that's where CompilationUnit offers hooks. The addPhaseOperation methods allow you to give the compiler additional logic for a phase of your choice. In Groovy these phases are:

  • initialization: open files, set up the environment and such
  • parsing: use the grammar to produce a tree of tokens representing the source
  • conversion: make a real AST out of these token trees
  • semantic analysis: check for all the things the grammar can't check for, resolve classes and other things
  • canonicalization: complete the AST
  • instruction selection: choose the instruction set, for example java5 or pre java5
  • class generation: create the binaries in memory
  • output: write the binaries to the file system
  • finalization: cleanup

Not all of these phases are really filled. The canonicalization step is currently empty, but that is an implementation detail. If we want to operate on the AST then we need one, thus the conversion phase is the earliest phase in which to do that. You have to decide whether your operation is about changing a certain source file or a certain class. Each file may contain multiple classes, so these are not the same. For example, if we want to add a method, then it makes not much sense to work on the source (SourceUnit); we work on the ClassNodes instead.
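The idea of hooking extra logic into a phase can be modelled in a few lines of plain Java. The sketch below is a toy and makes no claim about how CompilationUnit is implemented: ToyCompilationUnit, its PhaseOperation interface and the trace list are invented for this illustration, and phases are identified by name where the real compiler takes an int constant from Phases.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a compiler with phase hooks. The real logic
// lives in org.codehaus.groovy.control.CompilationUnit.
class ToyCompilationUnit {

    // The phase names from the list above, in execution order.
    static final String[] PHASES = {
        "initialization", "parsing", "conversion", "semantic analysis",
        "canonicalization", "instruction selection", "class generation",
        "output", "finalization"
    };

    // An extra piece of logic that runs in one phase.
    interface PhaseOperation {
        void call(List<String> trace);
    }

    private final Map<String, List<PhaseOperation>> operations =
            new HashMap<String, List<PhaseOperation>>();

    ToyCompilationUnit() {
        for (String phase : PHASES) {
            operations.put(phase, new ArrayList<PhaseOperation>());
        }
    }

    // Register an operation for the phase with the given name.
    void addPhaseOperation(PhaseOperation operation, String phase) {
        List<PhaseOperation> registered = operations.get(phase);
        if (registered == null) {
            throw new IllegalArgumentException("unknown phase: " + phase);
        }
        registered.add(operation);
    }

    // Walk through all phases in order and run the operations of each one.
    List<String> compile() {
        List<String> trace = new ArrayList<String>();
        for (String phase : PHASES) {
            trace.add("phase: " + phase);
            for (PhaseOperation operation : operations.get(phase)) {
                operation.call(trace);
            }
        }
        return trace;
    }

    public static void main(String[] args) {
        ToyCompilationUnit cu = new ToyCompilationUnit();
        cu.addPhaseOperation(new PhaseOperation() {
            public void call(List<String> trace) {
                trace.add("  custom operation");
            }
        }, "conversion");
        for (String line : cu.compile()) {
            System.out.println(line);
        }
    }
}
```

In this toy the custom operation shows up in the trace right after the conversion phase starts and before semantic analysis, which is the point of choosing a phase: you decide how much work the compiler has already done when your code sees the AST.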

In our example we would call the addPhaseOperation(PrimaryClassNodeOperation, int) method. You may wonder about the name "primary class node operation". A class node represents a class, but in the compilation process there are two kinds of classes: precompiled classes and classes we want to compile. The primary class nodes are the classes we want to compile.

If you know the AST then the next steps are quite easy. For example:

class MyOperation extends PrimaryClassNodeOperation {
    public void call(SourceUnit source, GeneratorContext context, ClassNode classNode) throws CompilationFailedException {
        // "code" is the statement holding the method body; it is left out here
        classNode.addMethod("foo", Opcodes.ACC_PUBLIC, null, null, null, code);
    }
}

This would add a method named foo to every class we compile. foo takes no arguments, is public and has the return type Object. The contents of the method are stored in "code". I haven't specified that here to keep the article from getting too complicated. Ok, let us think back to the original task of adding a method if the class name fulfills a certain convention. For example, if a class is XyzFoo, then add a method xyz; if the class is AbcFoo, then abc. Ok, that is a bit crazy, but well, it is just a demonstration of the flexibility:

import java.security.CodeSource;

import org.codehaus.groovy.ast.ClassNode;
import org.codehaus.groovy.ast.stmt.BlockStatement;
import org.codehaus.groovy.classgen.GeneratorContext;
import org.codehaus.groovy.control.CompilationFailedException;
import org.codehaus.groovy.control.CompilationUnit;
import org.codehaus.groovy.control.CompilerConfiguration;
import org.codehaus.groovy.control.Phases;
import org.codehaus.groovy.control.SourceUnit;
import org.codehaus.groovy.control.CompilationUnit.PrimaryClassNodeOperation;
import org.objectweb.asm.Opcodes;

import groovy.lang.GroovyClassLoader;

class MyGroovy extends GroovyClassLoader {
    protected CompilationUnit createCompilationUnit(CompilerConfiguration config, CodeSource source) {
        CompilationUnit cu = super.createCompilationUnit(config, source);
        // hook into the conversion phase, the earliest phase that has an AST
        cu.addPhaseOperation(new PrimaryClassNodeOperation() {
            public void call(SourceUnit source, GeneratorContext context, ClassNode classNode) throws CompilationFailedException {
                String name = classNode.getName();
                if (name.endsWith("Foo") && name.length() > 3) {
                    // strip the "Foo" suffix and use the rest as the method name
                    name = name.substring(0, name.length() - 3);
                    // an empty method body
                    BlockStatement code = new BlockStatement();
                    classNode.addMethod(name, Opcodes.ACC_PUBLIC, null, null, null, code);
                }
            }
        }, Phases.CONVERSION);
        return cu;
    }
}
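Note that the listing uses the prefix of the class name unchanged, so XyzFoo gets a method named Xyz. If you want the lowercase xyz from the description, the name derivation could look like the following sketch. ConventionNames and methodNameFor are hypothetical names made up for this example, and the package stripping is only there on the assumption that the class name you get may be fully qualified.

```java
// Hypothetical helper for the naming convention described above.
class ConventionNames {

    // Returns the method name derived from a class name ending in "Foo",
    // or null if the convention does not apply to this class.
    static String methodNameFor(String className) {
        // Drop a package prefix, in case the name is fully qualified.
        String simpleName = className.substring(className.lastIndexOf('.') + 1);
        if (!simpleName.endsWith("Foo") || simpleName.length() <= 3) {
            return null;
        }
        String prefix = simpleName.substring(0, simpleName.length() - 3);
        // Lower-case the first character: "Xyz" becomes "xyz".
        return Character.toLowerCase(prefix.charAt(0)) + prefix.substring(1);
    }

    public static void main(String[] args) {
        System.out.println(methodNameFor("XyzFoo"));           // prints xyz
        System.out.println(methodNameFor("my.company.AbcFoo")); // prints abc
        System.out.println(methodNameFor("Foo"));              // prints null
    }
}
```

Inside the phase operation you would call such a helper with classNode.getName() and only add the method when the result is not null.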

Grails uses this mechanism to add the toString and hashCode methods as well as the id field to domain classes.

Another easy example is adding an import. But since we don't add an import to a specific class, but to all classes in a file, we are operating on the SourceUnit (the representation of the source file), so we use a SourceUnitOperation.

cu.addPhaseOperation(new SourceUnitOperation() {
    public void call(SourceUnit source) throws CompilationFailedException {
        source.getAST().addImport("User", ClassHelper.make("my.company.UserDAO"));
    }
}, Phases.CONVERSION);

Here we add an import for UserDAO and make the class known as User inside the file we compile. That is comparable to "import my.company.UserDAO as User" in Groovy.

As you may see, this is much more than aspect-oriented programming could do. You can place arbitrary logic inside the compiler and modify the ASTs. If you want to do more complex operations or more local ones, then you might really have to go into the AST. I suggest using the LabelVerifier as an entry point for this. A little debugging helps you understand what the AST looks like and how the visiting works. I plan to write more extensive documentation about these things, but that will be on the Groovy wiki. The purpose of this article is just to give you an insight into the internals of the compilation and how it can be influenced from the outside. Without knowing the AST there is not really much you can do. It is not that the AST is very complicated or badly documented, but learning about over 60 classes can't be done in a short article. The biggest example of AST traversal is AsmClassGenerator, the part of the compiler that creates the binaries. But it is a big class, and not a good one to learn from.

My plans for the future are to add a mechanism that allows you to write some text that gets mixed into the classes the compiler produces. In the import example above you would maybe write something like

cu.mixinPerSource("import my.company.UserDAO as User")

Or in the method example:

cu.mixinPerClass {"def ${it[0..-4]}(){}"}

But it looks like this will have to wait until after 1.0.

You could also use the AST to transform any DSL containing binary operations such as "&" and "<" into a builder or a method-based version. You could transform it into a database request or whatever. There are many possible uses for this. And in fact something like that is already used in GroovySQL - but well hidden ;)
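To give an impression of what such a transformation means, here is a small sketch in plain Java. It does not use the Groovy AST and says nothing about how GroovySQL works internally; Cond, Field, IntLiteral and BinaryCond are hypothetical node types invented for this example. A tree of binary operations, as a compiler would build it for "age < 30 & salary > 1000", is turned into the condition of a database request.

```java
// Hypothetical node types for a tiny condition DSL, not Groovy AST classes.
abstract class Cond {
    // Render this node as a fragment of an SQL condition.
    abstract String toSql();
}

class Field extends Cond {
    final String name;

    Field(String name) {
        this.name = name;
    }

    String toSql() {
        return name;
    }
}

// Only int literals here; a real translation should use bind parameters
// instead of writing values into the SQL text.
class IntLiteral extends Cond {
    final int value;

    IntLiteral(int value) {
        this.value = value;
    }

    String toSql() {
        return String.valueOf(value);
    }
}

class BinaryCond extends Cond {
    final String operator;
    final Cond left;
    final Cond right;

    BinaryCond(String operator, Cond left, Cond right) {
        this.operator = operator;
        this.left = left;
        this.right = right;
    }

    String toSql() {
        // "&" and "|" become SQL keywords, comparison operators pass through.
        String sqlOperator = operator;
        if ("&".equals(operator)) {
            sqlOperator = "AND";
        } else if ("|".equals(operator)) {
            sqlOperator = "OR";
        }
        return "(" + left.toSql() + " " + sqlOperator + " " + right.toSql() + ")";
    }
}

class DslToSqlSketch {
    public static void main(String[] args) {
        // The tree for: age < 30 & salary > 1000
        Cond condition = new BinaryCond("&",
                new BinaryCond("<", new Field("age"), new IntLiteral(30)),
                new BinaryCond(">", new Field("salary"), new IntLiteral(1000)));
        System.out.println(condition.toSql()); // prints ((age < 30) AND (salary > 1000))
    }
}
```

With the real compiler the tree would not be built by hand, of course. A phase operation would find the binary expressions in the AST and replace them with whatever builds the request.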

So may I ask again: Ever wished to be able to influence the compiling process in Java?