Sunday, December 9, 2007

Memoization revisited

In last week's thrilling episode, the world watched in horror as I attempted memoization with Java Closures, then really pushed my luck with some dodgy curry.

This week, we're going to take the memoization plotline through a whole new twisty-turny story-arc, and see what happens with multi-key memoization.

Our story shall revolve around a class Font, which for some important but unspecified reason needs to ensure that only one instance will be created for any particular combination of its name and pointSize attributes:

public class Font {

private static Map<String, Map<Double, Font>> instanceMap =
new HashMap<String, Map<Double, Font>>();

public static synchronized Font getInstance(String name, double pointSize) {
Font font = null;
Map<Double, Font>> pointSizeMap = instanceMap.get(name);
if (pointSizeMap == null) {
pointSizeMap = new HashMap<Double, Font>();
instanceMap.put(name, pointSizeMap);
}
else {
font = pointSizeMap.get(pointSize);
}
if (font == null) {
font = new Font(name, pointSize);
pointSizeMap.put(pointSize, font);
}
return font;
}

private Font(String name, double pointSize) {
// construct the Font
}
}

There are a few problems with this. Although the code is fairly simple, it's still easy to make a mistake when implementing it. Worse, if we were writing a Font class for real we'd probably want to specify more than just the name and pointSize parameters - the example above doesn't even allow the user to specify any styles such as italic, or bold. The code just gets nastier when we do so:

public class Font {

private static Map<String, Map<Double, Map<EnumSet<FontStyle>, Font>>> instanceMap =
new HashMap<String, Map<Double, Map<EnumSet<FontStyle>, Font>>>();

public static synchronized Font getInstance(String name, double pointSize, EnumSet<FontStyle> styles) {
Font font = null;
Map<EnumSet<FontStyle>, Font>> stylesMap = null;
Map<Double, Map<EnumSet<FontStyle>, Font>>> pointSizeMap = instanceMap.get(name);
if (pointSizeMap == null) {
pointSizeMap = new HashMap<Double, Font>();
instanceMap.put(name, pointSizeMap);
stylesMap = new HashMap<EnumSet<FontStyle>, Font>>();
pointSizeMap.put(pointSize, stylesMap);
}
else {
stylesMap = pointSizeMap.get(pointSize);
if (stylesMap == null) {
stylesMap = new HashMap<EnumSet<FontStyle>, Font>>();
pointSizeMap.put(pointSize, stylesMap);
}
else {
font = stylesMap.get(styles);
}
}
if (font == null) {
font = new Font(name, pointSize);
stylesMap.put(pointSize, font);
}
return font;
}

private Font(String name, double pointSize) {
// construct the Font
}
}

We could abandon the nested Maps for a single Map with some kind of compound key - perhaps a String representation of the Font's attributes:

public class Font {

private static Map<String, Font> instanceMap = new HashMap<String, Font>();

public static synchronized Font getInstance(String name, double pointSize, EnumSet<FontStyle> styles) {
String key = name + " " + pointSize + " " + styles;
Font font = instanceMap.get(key);
// ...
}

// ...
}

Unfortunately this can often result in poorly distributed hash values - the hash codes for "Courier 10.0" and "Courier 10.5" are very close, compared to those for Double.valueOf(10.0) and Double.valueOf(10.5).
Worse, it's fragile. The Windows machine next to me has two distinct fonts "Rockwell" and "Rockwell Condensed", but what if "Condensed" happens to be the name of one of my FontStyles? Which font does a key of "Rockwell Condensed 10.0" represent - is it the Rockwell font with Condensed styling, or the Rockwell Condensed font with no styling?

Alternatively, we could opt to create a dedicated FontAttributes class to use as our compound key, but this just shuffles the complexity into this new class's hashCode() and equals() methods.

Finally, the simple synchronization technique employed exhibits poor concurrency, which is fixable, but at the cost of even more complexity.

Utilizing memoization, what we'd really like would be something like this:

public class Font {

private static {String, Double => Font} memoizer =
memoize({String name, Double pointSize => new Font(name, pointSize)});

public static Font getInstance(String name, double pointSize) {
return memoizer.invoke(name, pointSize);
}

private Font(String name, double pointSize) {
// construct the Font
}
}

Here, all the complexity involved in managing the creation of Font instances (including synchronization) is dealt with by the memoize() method, and it's practically effortless to add additional parameters.

So far though, we've only seen a version of memoize() which works with one key:

static <A, R> {A => R} memoize({A => R} fn) {
Map<A, R> cache = new HashMap<A, R>();
return {A a =>
R value = null;
synchronized(cache) {
value = cache.get(a);
if (value == null) {
value = fn.invoke(a);
cache.put(a, value);
}
}
value
};
}

How can we extend this to work with two, three, or even any number of keys? Obviously we can overload it, so let's see what a version which takes two keys might look like:

static <A1, A2, R> {A1, A2 => R} memoize({A1, A2 => R} fn) {
Map<A1, Map<A2, R>> cache1 = new HashMap<A1, Map<A2, R>>();
return {A1 a1, A2 a2 =>
R value = null;
synchronized(cache1) {
Map<A2, R> cache2 = cache1.get(a1);
if (cache2 == null) {
cache2 = new HashMap<A2, R>();
cache1.put(a1, cache2);
}
else {
value = cache2.get(a2);
}
if (value == null) {
value = fn.invoke(a1, a2);
cache2.put(a2, value);
}
}
value
};
}

This works, but it suffers from the same problems as our earlier implementation of getInstance() which used nested Maps - poor concurrency and increased complexity as the number of keys increases. It is better though, because now at least these problems don't need to be replicated throughout countless getInstance() methods. I'll deal with the concurrency issue later, but let's see if we can improve the complexity situation.

A useful observation we can make is that once we've retrieved the 'inner' Map (cache2), we're back to doing single-key memoization, which is a problem we already have a solution to. Can we define our two-key version of memoize() in terms of the single-key version?

In order to call the single-key memoize({A1 => R} fn), we somehow need to turn the fn we have, which is of type {A1, A2 => R}, into something of type {A1 => R}. This is exactly what partial application can do for us here, since we have the value of the first argument in a1:

static <A1, A2, R> {A1, A2 => R} memoize({A1, A2 => R} fn) {
Map<A1, {A2 => R}> cache = new HashMap<A1, {A2 => R}>();
return {A1 a1, A2 a2 =>
{A2 => R} valueMemoizer = null;
synchronized(cache) {
valueMemoizer = cache.get(a1);
if (valueMemoizer == null) {
valueMemoizer = memoize(partial(fn, a1));
cache.put(a1, valueMemoizer);
}
}
valueMemoizer.invoke(a2)
};
}

The cache is now a simple Map of keys of type A1 to values which are single-key memoizers. This is much better, since the code no longer increases in complexity as we increase the number of keys - we can see this clearly when we define our three-key version in terms of the two-key version:

static <A1, A2, A3, R> {A1, A2, A3 => R} memoize({A1, A2, A3 => R} fn) {
Map<A1, {A2, A3 => R}> cache = new HashMap<A1, {A2, A3 => R}>();
return {A1 a1, A2 a2, A3 a3 =>
{A2, A3 => R} valueMemoizer = null;
synchronized(cache) {
valueMemoizer = cache.get(a1);
if (valueMemoizer == null) {
valueMemoizer = memoize(partial(fn, a1));
cache.put(a1, valueMemoizer);
}
}
valueMemoizer.invoke(a2, a3)
};
}

In fact, the code is almost identical.

However, it's still a shame that we need to write lots of versions of memoize() to cater for different numbers of keys. Isn't there a way to do this with just one version?

Actually, there is :) If we look carefully at what's happening in the two- and three-key versions, they are essentially 'unwrapping' fn, 'injecting' memoization at each stage (ie. on each key). This sounds a lot like currying - turning a multi-argument function into a series of single-argument functions - and funnily enough, we can throw away our two- and three-key versions and rewrite the memoizer usage in curried form, just using the single-key version of memoize():

public class Font {

private static {String => {Double => Font}} memoizer =
memoize({String name => memoize({Double pointSize => new Font(name, pointSize)})});

public static Font getInstance(String name, double pointSize) {
return memoizer.invoke(name).invoke(pointSize);
}

private Font(String name, double pointSize) {
// construct the Font
}
}

So we only needed the single-key version of memoize() after all... great!

Except... we've now increased the complexity and understanding required for anyone using memoize() - not a great move in terms of API design. Perhaps there's a compromise though - we could have a method which did all 'injection' of memoization in curried form for us:

public class Font {

private static {String, Double => Font} memoizer =
injectMemoizer({String name, Double pointSize => new Font(name, pointSize)});

public static Font getInstance(String name, double pointSize) {
return memoizer.invoke(name, pointSize);
}

private Font(String name, double pointSize) {
// construct the Font
}
}

Here's a version of injectMemoizer() which copes with this:

static <A1, A2, R> {A1, A2 => R} injectMemoizer({A1, A2 => R} fn) {
{A1 => {A2 => R}} memoizer = memoize({A1 a1=> memoize({A2 a2 => fn.invoke(a1, a2)})});
return uncurry(memoizer);
}

OK, API usage is back to being relatively 'simple' again, but now we need a version of injectMemoizer() for 2 keys, 3 keys, etc etc... haven't we gone back a step? Well, yes and no. One more change and we can abstract out the call to memoize(), leaving us with a general purpose 'inject' method which we can use for other things. Perhaps we want to inject something that logs the values of the arguments, or substitutes their values for other values if some condition is met. We'll still need any number of versions of 'inject', but now it'll be in the same league as curry, uncurry and partial: one of a handful of building block methods with which this problem is more acceptable.

To do all this we need to find a way to tell inject() what it is that we want it to use (in our case, memoize()). So an additional parameter is required, but what is its type? Looking at the above example we can see that it is something which takes a function type with one argument and a return value, and returns an object of the same function type. These types aren't fixed though: in the first call to memoize() the function type is {A1 => {A2 => R}}, and in the second it is {A2 => R}.

Furthermore, we need to pass in an object, but currently we have a method. Putting all these requirements together suggests that we rewrite memoize() in terms of a type (so that we can have an object of that type), and since this is Java, that type should be an interface, which I'll call Bob because I have no idea what it should be called (same applies to inject for that matter), and the name Bob reminds me of an amusing episode of Blackadder.

public interface Bob {
<A, R> {A => R} invoke({A => R} fn);
}

public class Memoizer implements Bob {
public <A, R> {A => R} invoke({A => R} fn) {
Map<A, R> cache = new HashMap<A, R>();
return {A a =>
R value = null;
synchronized(cache) {
value = cache.get(a);
if (value == null) {
value = fn.invoke(a);
cache.put(a, value);
}
}
value
};
}
}

static <A1, A2, R> {A1, A2 => R} inject({A1, A2 => R} fn, Bob bob) {

{A1 => {A2 => R}} injected = bob.invoke({A1 a1 => bob.invoke({A2 a2 => fn.invoke(a1,a2)})});
return uncurry(injected);
}

Here's how we use it:

public class Font {

private static {String, Double => Font} memoizer =
inject({String name, Double pointSize => new Font(name, pointSize)}, new Memoizer());

public static Font getInstance(String name, double pointSize) {
return memoizer.invoke(name, pointSize);
}

private Font(String name, double pointSize) {
// construct the Font
}
}

And we're done... well, other than writing alternative versions of inject() for different versions of arguments, adding exception transparency, sorting out the poor concurrency in Memoizer.invoke(), and working out what Bob and inject() should really be called ;)

The key point is that we've now seen examples of currying, uncurrying and partial application being used as building blocks to support the creation and/or usage of a more obviously useful API.

3 comments:

Anonymous said...

Hi, your examples demonstrate a high expertise on your part, well above what is likely to achieve the "normal" java developer. I assume that in a regular project you would suggest building a single key element from its differents parts (like a Pair<U,V>, or a List<T>, or whatever object that correctly implements equals/hashcode). It would probably be just as efficient, and so far easier to read and maintain.

Mark Mahieu said...

vince,

Thanks for your comments.

I'd certainly rather see a Pair<U,V> than a Map of Maps, and all the associated logic... as long as this Pair class was written once, rather than reimplemented every time.

What's nice about using a memoizer of some form is that all the logic (including synchronization, if required) can be hidden away in one class or method - if bugs are found in that logic, or improvements made to the synchronization, it's just one place to make the changes.

I also think the 'single-key' form of memoizer is simple enough for "normal" Java developers to pick up and use easily, and could be passed a Pair<U,V> as you suggest in order to cope with 2-key scenarios.

What I was trying to do here was to extend that to work with any number of keys (removing the need for the Pair class and friends entirely), but I think the final version sacrifices some of the simplicity of the single-key memoize() method.

I wouldn't suggest that anyone write a memoizer every time they need this kind of functionality - it's only really viable when provided as part of a pre-written library.

Nicole said...

Well, I do not really believe this will have success.