Thursday, June 26, 2008

Schwartzian Transforms in Java

Libraries like Java's Collections API contain numerous algorithms tuned to give good performance in a theoretical 'typical case', but it can be very easy to forget that these algorithms aren't magical, and have certain requirements of their own if that goal is to be achieved.

On numerous occasions, I've written or encountered code that seemed to work perfectly well during development and testing, but performed very poorly when it had to deal with a larger or more complex data-set, something which often happens only after the code has been put into production use.

Here's a program which lists the contents of a specified directory ordered by size, with the size of a child directory being calculated from the total size of all of its contents:

import java.io.*;
import java.util.*;

public class FileSizes {

public static void main(String[] args) {

File directory = new File(args[0]);
List<File> files = new ArrayList<File>(Arrays.asList(directory.listFiles()));

Collections.sort(files, new Comparator<File>() {
public int compare(File file1, File file2) {
long file1Size = calculateSize(file1);
long file2Size = calculateSize(file2);
if (file1Size == file2Size)
return 0;
return file1Size < file2Size ? -1 : 1;
}
});

for (File file : files)
System.out.println(file);
}

static long calculateSize(File file) {
long size = 0;
if (file.isDirectory())
for (File child : file.listFiles())
size += calculateSize(child);
else
size = file.length();
return size;
}
}

Now, how long does this program take to execute?

Potentially a lot longer than it should, that's how long.

The problem here is that during the sort operation, each item being sorted may be compared against the other items multiple times - exactly how many depending on the sorting algorithm in use and the size and ordering of the initial data-set. If that comparison operation involves some expensive calculation, as it does in this rather blatant example, it's normally wasted effort to perform that calculation every time. In such cases, it's often preferable to cache the result of the calculation the first time it's performed, and reuse that cached version if it's needed again.

A common technique used by Perl programmers faced with this problem is to employ something known as the Schwartzian Transform, which looks something like this:

my @sorted_data =
map { $_->[0] }
sort { $a->[1] cmp $b->[1] }
map { [$_, expensive_calculation($_)] }
@unsorted_data;

The idea here is that we start with the unsorted data (read the code from the bottom line up), and apply a mapping function to it which maps each item to a 2 element array containing the original item at index 0 and its associated calculated value at index 1. These are then sorted, using the element at index 1 for the comparisons, and the result is finally fed through another mapping function to extract just the original items, now in the desired order.

We can try something similar in Java. We'll need a mapping method, and let's make it general enough that we can give it any Iterable (such as a List) and a Mapper to apply to each element:

interface Mapper<T,U> {
U map(T item);
}

<T,U> Iterable<U> map(final Iterable<T> items, final Mapper<? super T, ? extends U> mapper) {
return new Iterable<U>() {
public Iterator<U> iterator() {
Iterator<T> iter = items.iterator();
return new Iterator<U>() {
public boolean hasNext() { return iter.hasNext(); }
public U next() { return mapper.map(iter.next()); }
public void remove() { iter.remove(); }
};
}
};
}

To match the Perl example, we'll also want a sort method which sorts and returns a new List rather than modifying the one passed to it:

<T> List<T> sort(Iterable<T> items, Comparator<? super T> comparator) {
List<T> list = new ArrayList<T>();
for (T t : items)
list.add(t);
Collections.sort(list, comparator);
return list;
}

Finally we'll need some kind of data structure to hold the item together with its calculated-value. A type-safe Pair class will do:

class Pair<A,B> {
public static <A,B> Pair<A,B> of(A fst, B snd) {
return new Pair<A,B>(fst, snd);
}
private A fst;
private B snd;
private Pair(A fst, B snd) {
this.fst = fst;
this.snd = snd;
}
public A fst() { return fst; }
public B snd() { return snd; }
}

Now we can replace the Collections.sort call in the original example with a Schwartzian Transform:

Iterable<File> sortedFiles =
map(sort(map(files,
new Mapper<File, Pair<File,Long>>() {
public Pair<File,Long> map(File f) {
return Pair.of(f, calculateSize(f));
}
}),
new Comparator<Pair<File,Long>>() {
public int compare(Pair<File,Long> p1, Pair<File,Long> p2) {
return p1.snd().compareTo(p2.snd());
}
}),
new Mapper<Pair<File,Long>,File>() {
public File map(Pair<File,Long> p) {
return p.fst();
}
});

Well, that's pretty horrible, whichever way you try to lay out the code. Still, there's one improvement we could make - since we're just writing a Comparator which extracts the Long values and compares them, we could overload our sort method to accept a Mapper instead of a Comparator:

<T, U extends Comparable<U>> List<T> sort(Iterable<T> items, final Mapper<? super T, U> mapper) {
List<T> list = new ArrayList<T>();
for (T t : items)
list.add(t);
Collections.sort(list, new Comparator<T>() {
public int compare(T t1, T t2) { return mapper.map(t1).compareTo(mapper.map(t2)); }
});
return list;
}

That allows us to shorten the code a little:

Iterable<File> sortedFiles =
map(sort(map(files,
new Mapper<File, Pair<File,Long>>() {
public Pair<File,Long> map(File f) {
return Pair.of(f, calculateSize(f));
}
}),
new Mapper<Pair<File,Long>,Long>() {
public Long map(Pair<File,Long> p) {
return p.snd();
}
}),
new Mapper<Pair<File,Long>,File>() {
public File map(Pair<File,Long> p) {
return p.fst();
}
});

It's still far too cumbersome though - if I were writing this 'for real' I'd probably have given up by now.

Let's try it using closures instead of anonymous classes:

Iterable<File> sortedFiles =
map(sort(map(files,
{File f => Pair.of(f, calculateSize(f))}),
{Pair<File,Long> p => p.snd()}),
{Pair<File,Long> p => p.fst()});

That's much better to my eyes - maybe not quite as succinct as the Perl version, but it's digestible.

The Mapper interface can be replaced by function types so let's lose that and apply a bit more closures magic to the map and sort methods:

<T,U> Iterable<U> map(Iterable<T> items, {T=>U} mapper) {
return {=>
Iterator<T> iter = items.iterator();
new Iterator<U>() {
public boolean hasNext() { return iter.hasNext(); }
public U next() { return mapper.invoke(iter.next()); }
public void remove() { iter.remove(); }
}
};
}

<T,U extends Comparable<U>> List<T> sort(Iterable<T> items, {T=>U} c) {
List<T> list = new ArrayList<T>();
for (T t : items)
list.add(t);
Collections.sort(list, {T t1, T t2 => c.invoke(t1).compareTo(c.invoke(t2))});
return list;
}

Of course we could also tuck all that map/sort/map stuff away in a handy reusable method:

<T,U extends Comparable<U>> Iterable<T> schwartzianSort(Iterable<T> items, {T=>U} mapper) {
return map(sort(map(items,
{T t => Pair.of(t, mapper.invoke(t))}),
{Pair<T,U> p => p.snd()}),
{Pair<T,U> p => p.fst()});
}

Leaving us with:

public static void main(String[] args) {

File directory = new File(args[0]);
List<File> files = new ArrayList<File>(Arrays.asList(directory.listFiles()));

for (File file : schwartzianSort(files, {File f => calculateSize(f)}))
System.out.println(file);
}

Much nicer!

Sometimes, it's not immediately obvious what a method like that should look like, or how it might be implemented. Breaking the algorithm down into a series of functions, as the Schwartzian Transform does, can be a useful way to approach the problem at hand, but only if the language you're using has sufficiently practical constructs.

updated 29/06/08 to fix a bug in map() pointed out by Konstantin Triger.
updated 03/07/08 to fix a bug in the Comparator pointed out by bjkail.

13 comments:

Anonymous said...

I recently had a similar experience with mapping objects from one domain model to another. The code first looped through a collection of objects and mapped them one by one to target objects in a new new collection using a generic mapper. Both the source and target collections were then looped over again in a subsequent block of code to perform some additional custom mapping across each pair.

Closures would make it very easy to perform the same task in one pass and in one iteration without any boilerplate.

hlovatt said...

Most people would find:

final File directory = new File( args[ 0 ] );
final SortedMap<Long, File> files = new TreeMap<Long, File>();
for ( final File file : directory.listFiles() ) { files.put( size( file ), file ); }
for ( final File file : files.values() ) { out.println( file ); }

easier to follow. Therefore this is not a great motivating example for closures.

hlovatt said...

Oops I meant to use multimap from the Google Collections library - sorry!

Mark Mahieu said...

Howard, the map->sort->map technique gets more interesting when you want to introduce concurrency, which is where I really want to go with this example - I'll try to get the next part up soon.

hlovatt said...

You could take a look at Doug Lee's ParallelArray class. This uses a parallel declarative style of programming. He gives examples of parallel sorting. ParallelArray is likely in Java 7.

Konstantin Triger said...

The Iterable returned by the following code:

<T,U> Iterable<U>
map(Iterable<T>
items, {T=>U} mapper) {
Iterator<T>
iter = items.iterator();
return {=> new Iterator<U>() {
public boolean hasNext() { return iter.hasNext(); }
public U next() { return mapper.invoke(iter.next()); }
public void remove() { iter.remove(); }
}};
}


Has a potential bug - it can be used only once since the iterator is initialized only once...

The "correct" version will be something like:

<T,U> Iterable<U>
map(Iterable<T>
items, {T=>U} mapper) {
return {=> new Iterator<U>
() {
private final Iterator<T> iter = items.iterator();

public boolean hasNext() { return iter.hasNext(); }
public U next() { return mapper.invoke(iter.next()); }
public void remove() { iter.remove(); }
}};
}

To avoid this kind of errors, I think, Java should consider adding kind of yield statement, so the code above will be like:

<T,U>
Iterable<U>
map(Iterable<T> items, {T=>U} mapper) {
for (T t : items)
yield return mapper.invoke(t);
}

And the compiler will handle the necessairy details.

Mark Mahieu said...

Konstantin,

Oops! Good spot, thanks for pointing it out! I'll fix that now...

bjkail said...

p1.snd() == p2.snd() is wrong since it uses identity comparison.

Anonymous said...

Nice post. And good explanation of the Schwartzian transformation. One thing is bugging me however. Why doesn't the comparator have a reference to hashmap mapping between File instances and it's size? When encountering a File instance not in the map, it'll compute its size and add it to the map. Effectively the same 'memorization' approach, without all those extra abstractions?

cheers,
kasper
(www.firstclassthoughts.co.uk)

Mark Mahieu said...

bjkail: thanks!

Mark Mahieu said...

Kasper,

That's a good question. There are two reasons.

Firstly, introducing a HashMap means that the performance of the sort becomes dependent upon how well the hash codes are distributed. Even with an excellent implementation of hashCode() (and equals() ), the 'wrong' data can cause the sort to take several times longer to complete. With a poor implementation of those two methods, I've seen it take hundreds of times longer to complete.

It's a subject worthy of another blog entry, although Joshua Bloch's Effective Java has a good overview.

Having said that, the technique you describe would be perfectly fine in many situations. I certainly wouldn't want to encourage anyone to reach for Schwartzian Transforms as a first measure every time they need to sort some data :)

The second reason is that I want to explore how we could make this code utilise concurrency, for which the map->sort->map approach is very useful. That will be the subject of a follow-up.

www.mueblesbaratos.nom.es said...

Pretty helpful material, thanks so much for the article.

三重古天樂 said...

酒店經紀 酒店小姐 酒店經紀公司
台北酒店工作 台北酒店經紀人 酒店兼差 酒店兼職 酒店公關 酒店上班 台北酒店上班 酒店打工 酒店工作內容 酒店pt 台北酒店 便服店 禮服店 酒店 酒店工作 酒店上班時間 酒店上班薪水 酒店上班內容 現領 當日領 酒店經紀 酒店打工 酒店工作 酒店上班 酒店兼差 酒店兼職 打工兼差 打工兼職 台北酒店 酒店應徵 禮服店 酒店 經紀 打工 兼差便服店 台北酒店經紀 酒店薪水 酒店工作內容 酒店時間 酒店經濟 美式餐廳 咖啡廳
打工兼職 打工兼差 酒店營業時間
酒店經濟 酒店面試
2015/08/26