Javac Internals

In my research project I develop the Whiteoak compiler. Whiteoak is a language that extends Java with features like structural conformance and type classes.

Here is some information to help you better understand the design of the javac compiler (version 1.5). It should be helpful for anyone trying to hack javac to support additional functionality.

Parsing (package: com.sun.tools.javac.parser)
  • The lexical scanner is written by hand. For performance reasons, javac does not use an automatically generated scanner. If you need to add a keyword, an operator, etc., open the Scanner class.
  • For the same reason, the Parser class is also a manually written recursive descent parser.
  • The superclass of all AST classes is com.sun.tools.javac.tree.Tree. Subclasses are defined as inner classes of Tree (you CAN define your own nodes outside of the Tree class).
  • AST nodes can be instantiated simply by calling the constructor. You don't have to use the factory class com.sun.tools.javac.tree.TreeMaker. Nonetheless, TreeMaker is convenient because it sets the pos field of the generated nodes, thereby making sure error messages are associated with the correct source line.
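
The difference between calling a constructor directly and going through a position-aware factory can be sketched as follows. This is a simplified stand-in and not javac's actual code: Node, Ident, Literal and NodeMaker are hypothetical names, and only the pos field is taken from the description above.

```java
// Simplified stand-ins for javac's Tree and TreeMaker; all class names are hypothetical.
abstract class Node {
    int pos = -1; // source position; -1 means "not set"
}

class Ident extends Node {
    final String name;
    Ident(String name) { this.name = name; }
}

class Literal extends Node {
    final Object value;
    Literal(Object value) { this.value = value; }
}

class NodeMaker {
    private int pos = -1;

    // Select the position to stamp on nodes created from now on.
    NodeMaker at(int pos) { this.pos = pos; return this; }

    Ident ident(String name) {
        Ident n = new Ident(name);
        n.pos = pos;
        return n;
    }

    Literal literal(Object value) {
        Literal n = new Literal(value);
        n.pos = pos;
        return n;
    }
}

class NodeMakerDemo {
    public static void main(String[] args) {
        Ident direct = new Ident("x");                  // position left unset
        Ident made = new NodeMaker().at(42).ident("x"); // position recorded by the factory
        System.out.println(direct.pos + " " + made.pos); // prints -1 42
    }
}
```

A node built with the bare constructor works, but any diagnostic attached to it has no useful source position unless you assign pos yourself.
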
The basic architecture
  • The main class: com.sun.tools.javac.main.Main. The main method there is compile()
  • A single context object is created for each invocation of Main.compile(). This context object represents the current compilation session.
  • An AST class provides just the basic information about a node. The compilation process is carried out by several "modules" that are implemented externally to the AST classes, e.g., type checking, bytecode generation, inference of type parameters, ...
  • Some of the modules are implemented as visitors. Other modules are used by visitor modules but are not visitors themselves.
  • The modules are singletons with respect to a compilation session and are cached by the Context object. Thus, if you need to access (for example) the symtab module use Symtab.instance(context) to get this instance.
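
To make the shape of this architecture concrete, here is a minimal sketch under stated assumptions: Session, Expr, Num, Add, ExprVisitor and Evaluator are hypothetical names, and the session uses Class objects as keys for brevity, which is a simplification of the real mechanism. It shows plain AST classes, a module written as a visitor outside those classes, and an instance(...) method that returns one module object per session.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-compilation session that caches one instance of each module.
class Session {
    private final Map<Class<?>, Object> modules = new HashMap<>();

    @SuppressWarnings("unchecked")
    <T> T get(Class<T> key) { return (T) modules.get(key); }

    <T> void put(Class<T> key, T module) { modules.put(key, module); }
}

// AST classes carry only basic data plus an accept method.
abstract class Expr {
    abstract <R> R accept(ExprVisitor<R> v);
}

class Num extends Expr {
    final int value;
    Num(int value) { this.value = value; }
    <R> R accept(ExprVisitor<R> v) { return v.visitNum(this); }
}

class Add extends Expr {
    final Expr left, right;
    Add(Expr left, Expr right) { this.left = left; this.right = right; }
    <R> R accept(ExprVisitor<R> v) { return v.visitAdd(this); }
}

interface ExprVisitor<R> {
    R visitNum(Num n);
    R visitAdd(Add a);
}

// A module implemented as a visitor, living outside the AST classes.
class Evaluator implements ExprVisitor<Integer> {
    // Return the session's Evaluator, creating and caching it on first use.
    static Evaluator instance(Session session) {
        Evaluator e = session.get(Evaluator.class);
        if (e == null) {
            e = new Evaluator();
            session.put(Evaluator.class, e);
        }
        return e;
    }

    private Evaluator() {}

    public Integer visitNum(Num n) { return n.value; }
    public Integer visitAdd(Add a) { return a.left.accept(this) + a.right.accept(this); }
}

class ArchitectureDemo {
    public static void main(String[] args) {
        Session session = new Session();
        Expr e = new Add(new Num(2), new Num(3));
        System.out.println(e.accept(Evaluator.instance(session))); // prints 5
    }
}
```
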
Major modules

Here are the modules that do most of the compilation work.
  • com.sun.tools.javac.comp.Lower - Rewrites AST nodes (e.g., generates calls to Integer.valueOf() whenever boxing of an int is in order)
  • com.sun.tools.javac.comp.Attr - Augments the AST with type/symbol information
  • com.sun.tools.javac.comp.Enter - Defines symbols for encountered definitions
  • com.sun.tools.javac.jvm.Gen - Generates bytecode
  • com.sun.tools.javac.code.Types - Type system (subclassing, unboxed type, ...)
  • com.sun.tools.javac.code.Symtab - Predefined symbols (primitive types, Object type, Class type)
  • com.sun.tools.javac.comp.Check - Type checking helper class
  • com.sun.tools.javac.util.Name.Table - String table
Utility Classes (package: com.sun.tools.javac.util)
  • List - A generic immutable (functional) list. Note that many methods return a new list, so you cannot ignore the return value. For example, prepending the integer 5 to the list ms: List<Integer> ns = ms.prepend(5);
  • ListBuffer - A builder for a List
  • Name - Similar to Java's String, but uses a string table to ensure that two identical strings will be represented by the same Name object. Use one of the fromXXX() static methods to obtain an object. Use Name.Table.instance(Context) to obtain the Table module.
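
The two ideas behind these utility classes, a list whose operations return new lists and a table that hands out one object per distinct string, can be sketched in a few lines. ConsList and Interned are hypothetical miniatures written for illustration; they are not the javac classes and omit almost all of their functionality.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature of an immutable cons list.
final class ConsList<A> {
    final A head;            // null only in the empty list
    final ConsList<A> tail;  // null only in the empty list

    private ConsList(A head, ConsList<A> tail) { this.head = head; this.tail = tail; }

    static <A> ConsList<A> nil() { return new ConsList<>(null, null); }

    boolean isEmpty() { return tail == null; }

    // Returns a NEW list; the receiver is left unchanged.
    ConsList<A> prepend(A x) { return new ConsList<>(x, this); }

    int size() {
        int n = 0;
        for (ConsList<A> l = this; !l.isEmpty(); l = l.tail) n++;
        return n;
    }
}

// Hypothetical miniature of a name backed by a string table.
final class Interned {
    final String text;
    private Interned(String text) { this.text = text; }

    static final class Table {
        private final Map<String, Interned> names = new HashMap<>();

        // Equal strings always map to the same Interned object.
        Interned fromString(String s) {
            return names.computeIfAbsent(s, Interned::new);
        }
    }
}

class UtilDemo {
    public static void main(String[] args) {
        ConsList<Integer> ms = ConsList.<Integer>nil().prepend(1);
        ms.prepend(5);                        // result discarded: ms is unchanged
        ConsList<Integer> ns = ms.prepend(5); // keep the returned list instead
        System.out.println(ms.size() + " " + ns.size()); // prints 1 2
    }
}
```

Because equal strings share one object, names obtained from the same table can be compared by reference.
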
Symbols vs. Nodes

Some of the AST nodes have a symbol field (in particular: Tree.Ident as well as the various nodes that represent declarations). The Symbols hierarchy is somewhat similar to the AST hierarchy so it is important to understand the differences to avoid confusion.

The first step in the compilation process is lexical scanning+parsing which outputs a TREE structure (the AST). A tree has no cycles so if your program has three references to a variable named 'x' the resulting AST will have three identifier nodes holding the token 'x'.

At some point during compilation the compiler realizes that these references all refer to the same entity: the variable x. Obviously, it is much easier to work with a single node than with three (for example, you don't want to record x's type in three different locations).
To this end, the compiler generates a GRAPH structure where each definition in the program is represented exactly once. In our simple example, the compiler creates a VarSymbol object to represent the x variable. Then, the symbol field in the three AST nodes (the three references to 'x') is set to point at this VarSymbol object.
Later on, when x's type is resolved, the type field in the VarSymbol vertex is set to a ClassSymbol object, which in turn points to symbols representing its superclass, superinterfaces, fields, methods, etc.

To conclude, the symbol field in an AST node associates the TREE node with the corresponding vertex from the GRAPH of symbols.
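
The tree-versus-graph distinction can be illustrated with a small sketch. VarSym, IdentNode and Resolver are hypothetical stand-ins, and the type is kept as a plain string for brevity; javac's own symbol and node classes carry far more state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One vertex of the symbol graph: exists exactly once per definition.
class VarSym {
    final String name;
    String type; // recorded once, visible through every reference
    VarSym(String name) { this.name = name; }
}

// One node of the tree: exists once per occurrence in the source text.
class IdentNode {
    final String name;
    VarSym sym; // null until the reference is resolved
    IdentNode(String name) { this.name = name; }
}

class Resolver {
    private final Map<String, VarSym> scope = new HashMap<>();

    void declare(String name) { scope.put(name, new VarSym(name)); }

    // Point every identifier node at the single symbol for its name.
    void resolve(List<IdentNode> refs) {
        for (IdentNode ref : refs) ref.sym = scope.get(ref.name);
    }
}

class SymbolDemo {
    // Build three separate references to 'x' and resolve them.
    static List<IdentNode> threeRefsToX() {
        List<IdentNode> refs = new ArrayList<>();
        for (int i = 0; i < 3; i++) refs.add(new IdentNode("x"));
        Resolver resolver = new Resolver();
        resolver.declare("x");
        resolver.resolve(refs);
        return refs;
    }

    public static void main(String[] args) {
        List<IdentNode> refs = threeRefsToX();
        System.out.println(refs.get(0).sym == refs.get(2).sym); // prints true
    }
}
```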

Subclassing a module

Let's assume you want to subclass the Symtab module.
  1. First, define a new subclass of the Symtab class: NewSymtab.
  2. Add a constructor that takes a Context object as a parameter.
  3. Add a preRegister() method to the NewSymtab class:
    public static void preRegister(final Context context)
    {
        // Register a factory so that the first lookup of symtabKey creates a NewSymtab.
        context.put(symtabKey, new Context.Factory<Symtab>()
        {
            public Symtab make()
            {
                return new NewSymtab(context);
            }
        });
    }
  4. Add a call to NewSymtab.preRegister(context) immediately after the new Context object is created in Main (Subclassing Main to do that is a good idea).
  5. Note that NewSymtab will inherit a static instance(Context) method from Symtab. The caching framework ensures that this method will return the correct object (i.e., a NewSymtab object), but the return type is the superclass type: Symtab. Therefore, you should redefine the static instance() method as follows:
    public static NewSymtab instance(Context context)
    {
        return (NewSymtab) Symtab.instance(context);
    }
  6. If your subclass needs to access other modules, add these modules as fields, and initialize them in the constructor by calling the corresponding instance(Context) method. In some cases you can inherit such fields from the super-class but usually the visibility level is private so the subclass cannot access these.
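
To see why pre-registering a factory is enough to swap in the subclass, here is a self-contained sketch of the mechanism. MiniContext, BaseTable and ExtendedTable are hypothetical miniatures that mirror the steps above; the lookup logic is an assumption about how such a cache can work, not a copy of javac's Context.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical miniature of a context that stores either a module or a factory for it.
class MiniContext {
    static final class Key<T> {}
    interface Factory<T> { T make(); }

    private final Map<Key<?>, Object> table = new HashMap<>();

    <T> void put(Key<T> key, Factory<T> factory) { table.put(key, factory); }
    <T> void set(Key<T> key, T instance) { table.put(key, instance); }

    @SuppressWarnings("unchecked")
    <T> T get(Key<T> key) {
        Object o = table.get(key);
        if (o instanceof Factory) {
            o = ((Factory<?>) o).make(); // run the factory on first lookup...
            table.put(key, o);           // ...and cache what it produced
        }
        return (T) o;
    }
}

class BaseTable {
    protected static final MiniContext.Key<BaseTable> tableKey = new MiniContext.Key<>();

    static BaseTable instance(MiniContext context) {
        BaseTable t = context.get(tableKey);
        return t != null ? t : new BaseTable(context);
    }

    protected BaseTable(MiniContext context) {
        context.set(tableKey, this); // register so later lookups reuse this object
    }

    String describe() { return "base"; }
}

class ExtendedTable extends BaseTable {
    protected ExtendedTable(MiniContext context) { super(context); }

    // Must run before anyone asks the context for the table.
    static void preRegister(final MiniContext context) {
        context.put(tableKey, new MiniContext.Factory<BaseTable>() {
            public BaseTable make() { return new ExtendedTable(context); }
        });
    }

    // Same cached object as BaseTable.instance, with the narrower static type.
    static ExtendedTable instance(MiniContext context) {
        return (ExtendedTable) BaseTable.instance(context);
    }

    @Override String describe() { return "extended"; }
}

class SubclassDemo {
    public static void main(String[] args) {
        MiniContext context = new MiniContext();
        ExtendedTable.preRegister(context);
        System.out.println(BaseTable.instance(context).describe()); // prints extended
    }
}
```

Code that only knows about the base class still calls BaseTable.instance(context) and transparently receives the subclass, which is why the registration has to happen right after the context is created.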

4 comments :: Javac Internals

  1. Hi,

    I'm trying to use com.sun.tools.javac.main.JavaCompiler to compile Java source at runtime. It's important to me to give the source as a string and not as a file. Did you try it? Do you have an example?

  2. Of course you can tweak the compiler code to do that, but I think there is a much easier way: (a) create a temp file; (b) dump your string to this file; (c) invoke the compiler's main class, passing the path to the temp file as an argument.

  3. This is invaluable. Thanks a lot. I think this outline should make it into some sort of official wiki for the openjdk, the official documentation is either a bit lacking on high-level overview or I'm bad at finding these things.

    Thank you once again.

  4. This comment has been removed by the author.
