This file was automatically generated from http://svn.pugscode.org/pugs/docs/notes/precompilation_cache.pod on Mon Sep 10 11:50:25 2007 GMT, revision 17733.
Precompilation cache in Pugs
Rather than parse every .pl or .pm file pugs sees over and over again as they are needed, pugs stores the results of compilation in a cache. This gives the benefit of speed without the awkwardness of an opaque object file.
In the first version of the design, we aim for simplicity, reasonable forward compatibility, but not kitchen sink flexibility. For now, this is a Pugs mechanism. One cache should support several versions of Pugs, sharing objects where possible, and allowing easy maintenance. If it gets screwed up, the admin should be able to delete the cache directory and not suffer from anything but subsequent temporary cold cache slowdowns.
Objects stored in the cache are compilation units after parsing. Perl
6 has separate compilation, and presumably every compilation unit has
one canonical abstract representation per version of Pugs. Where this is
not the case, e.g. if a BEGIN block intends to change the compilation
outcome of this unit, the author of the code should mark the unit as not
cacheable. XXX: notation for this. XXX: Also need to add deps for used
modules that export macros. Especially macros in the prelude. Including
the dependencies' hashes in the hash of the source code should suffice.
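As a rough illustration of the last point, the sketch below folds the keys of a unit's dependencies into the unit's own hash, so a change to a macro-exporting module changes the key of everything that uses it. It is written in Python for illustration only. The name unit_key, the NUL separator, and the choice to hash dependencies in the order given are all assumptions, not part of the design above.

```python
import hashlib

def unit_key(source: bytes, dep_keys=()) -> str:
    """Hypothetical cache key: SHA-1 of the source plus its dependencies' keys.

    dep_keys are the hex keys of modules the unit depends on (for example,
    modules that export macros). They are hashed in the order given.
    """
    h = hashlib.sha1()
    h.update(source)
    for dep in dep_keys:
        h.update(b"\0")               # separator (an assumption of this sketch)
        h.update(dep.encode("ascii"))
    return h.hexdigest()

# A changed dependency yields a different key for the dependent unit.
prelude_old = unit_key(b"macro foo { 1 }")
prelude_new = unit_key(b"macro foo { 2 }")
user_old = unit_key(b"use Prelude; foo;", [prelude_old])
user_new = unit_key(b"use Prelude; foo;", [prelude_new])
```

With no dependencies, the key reduces to the plain SHA-1 of the source.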
We now describe how cache objects are keyed in the cache, and the on-disk format of a cached object.
Example:
~joe/.pugscache/1/14/148071fa07847bc0de8df7d75cd03072f27239c2/9600-9658 # < $HOME .pugscache $H1 $H2 $SHA1 $ReleaseRev-$ParserRev >.catdir
The fast path for cache usage is successful lookup of a valid precompiled unit.
The cache is (in the first instance) a filesystem directory under the
user's $HOME. A global cache would have been nice for disk and CPU
performance on multiuser machines, but since original source code is
easily deducible from the compiled version, this is a security issue we
prefer to avoid. If we have pluggable backends we could potentially just
allow a memcached backend as well.
Every compilation unit entering the cache is stored in a directory of its own. The entry location is a function of (a hash of) the source code. Inside that directory there is a file named with the Pugs version and the Pugs parser version. This prevents us from loading a precompiled object for a unit that has changed, or for a parser that would have emitted a different structure than the one present in the cache. Using the hash as a directory name, without the parser version etc., lets us check in one stat whether we have a cached entry or not, and then readdir for the compatible version info, instead of always having to readdir the entire cache level. (To keep the file count in a given directory manageable, the file may be hashed into a directory inside the cache. Optimal hashing depth may vary among systems and should be a user setting. On "modern" systems no extra dirs might actually be fastest.)
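The layout described above can be sketched as follows, in Python for illustration only. The function name cache_entry_path and the depth parameter are hypothetical. The sketch assumes SHA-1 as the hash and, following the example path, uses successively longer prefixes of the digest as the intermediate directory levels.

```python
import hashlib
from pathlib import PurePosixPath

def cache_entry_path(home, source, release_rev, parser_rev, depth=2):
    """Build $HOME/.pugscache/$H1/$H2/$SHA1/$ReleaseRev-$ParserRev.

    depth is the number of intermediate hashing directories (a user
    setting in the design above); depth=0 puts the SHA-1 directory
    directly under .pugscache.
    """
    digest = hashlib.sha1(source).hexdigest()
    path = PurePosixPath(home) / ".pugscache"
    for n in range(1, depth + 1):
        path /= digest[:n]            # "1", then "14", as in the example
    return path / digest / f"{release_rev}-{parser_rev}"

entry = cache_entry_path("/home/joe", b"", 9600, 9658)
```

Only the bytes of the source enter the key, so the same text reached through two different file names maps to the same entry.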
For now the current keying scheme implies that the source file needs
to be read from storage even if a precompiled version of it exists in
the cache, because we require its hash. There may be ways to optimize
that part away, but they can be added in the future. Not having the
compilation unit name as part of the hash key is nice, because it means
we don't have to worry about name canonicalization or source files with
similar names but different locations. It also means we can cache objects
that aren't files at all, such as eval $string results (as long as
$string is the same across runs), or units received over the network
(as long as we can get their hashes before we attempt to compile them).
The cached object is a compressed YAML document containing a serialized Haskell Pugs structure.
We use gzip currently as the de-facto compression mechanism. This is the
easiest to deploy with Pugs: GHC bundles zlib, and Data.FastPackedString
which we use anyway for file IO contains enough bindings to read gzipped
files. Compression is desirable because serialized YAML for precompiled
units is large: for example, the 22 KB Prelude.pm takes on the order
of 1.2 MB to serialize, but 47 KB in gzipped form. There are better
compression algorithms out there; for example, BZip2 compresses the
same file to only 20 KB. If you wish to work on a patch for changing the
compression scheme, it should be no less portable or deployable than
the current setup (i.e., it must bundle or implement whatever scheme you
choose), and it should provide a transparent readFile function that
identifies the actual format by magic number.
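A minimal sketch of such a transparent reader, in Python for illustration only, might look like this. The function names are hypothetical. The sketch relies on the standard magic numbers (1f 8b for gzip, "BZh" for bzip2) and assumes that anything else is uncompressed YAML.

```python
import bz2
import gzip

GZIP_MAGIC = b"\x1f\x8b"   # first two bytes of every gzip stream
BZIP2_MAGIC = b"BZh"       # first three bytes of every bzip2 stream

def decode_cached_object(raw: bytes) -> bytes:
    """Return the serialized YAML, whatever compression wraps it."""
    if raw.startswith(GZIP_MAGIC):
        return gzip.decompress(raw)
    if raw.startswith(BZIP2_MAGIC):
        return bz2.decompress(raw)
    return raw                 # assumption: no known magic means plain YAML

def read_cached_object(path) -> bytes:
    """Hypothetical stand-in for the transparent readFile described above."""
    with open(path, "rb") as fh:
        return decode_cached_object(fh.read())
```

Callers never need to know which scheme wrote the file, so the compression can change without invalidating entries already in the cache.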
pugs -CParse-YAML File.pm or Pugs::Internals::emit_yaml output a
serialized form of the following structure. Pugs::Internals::eval_p6c
or the module loader load it:
data CompUnit = MkCompUnit
    { ver  :: Int        -- currently 1
    , desc :: String     -- e.g., the name of the contained module
    , glob :: (TVar Pad) -- pad for unit Env
    , ast  :: Exp        -- AST of unit
    }
The ver field is currently set to 1. It is always the first element in
the CompUnit structure, for forward compatibility. The desc field is
for diagnostics. glob is the global pad for this unit, and finally,
ast is the parsed tree.
try {
    my $hash = $HASHFUNC($source);
    my $dir = cachedir($hash);
    die "no precompiled version found" unless $dir ~~ :d;
    for $dir.readdir.sort:{numerically} -> $fn {
        my ($pugsrev, $parserrev) = $fn ~~ /(\d+)-(\d+)/ orelse next;
        next if $XXX_handwaving($pugsrev, $parserrev); # against %?CONFIG<pugsrev> etc.
        load_precompiled($fn) orelse {
            $fn.rm;
            die "error loading cached version: $!";
        };
        return; # success
    }
    die "no precompiled version found";
    CATCH {
        my $compunit = $source.parse;
        $compunit.load;
        return if $compunit.nocache;
        cache_cleanups;
        write_cache($compunit, $dir);
    }
}
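The version check that the pseudocode hand-waves could be sketched as below, in Python for illustration only. The name find_compatible_entry is hypothetical, and requiring an exact match on both revisions is an assumption of this sketch. The design leaves the actual compatibility rule open.

```python
import os
import re

ENTRY_NAME = re.compile(r"^(\d+)-(\d+)$")   # $ReleaseRev-$ParserRev

def find_compatible_entry(unit_dir, pugs_rev, parser_rev):
    """Return the path of a usable cached entry in unit_dir, or None.

    A missing directory means the unit was never cached. Compatibility
    is assumed here to mean an exact match on both revisions.
    """
    try:
        names = os.listdir(unit_dir)
    except FileNotFoundError:
        return None
    for name in names:
        m = ENTRY_NAME.match(name)
        if m and (int(m.group(1)), int(m.group(2))) == (pugs_rev, parser_rev):
            return os.path.join(unit_dir, name)
    return None
```

A None result corresponds to the slow path in the pseudocode: parse the source, load it, and write a fresh entry.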
There are a few cache maintenance tools that can be added bit by bit. They probably deserve a section of their own but for now: