Asymptotically Typeable

Home Blog RSS

Recursion Elimination — Or how to make pretty code ugly

Sun, 18 Aug 2019
  1. 1 Introduction
  2. 2 Recursive code that is easy to rewrite
  3. 3 Quest for a general algorithm
    1. 3.1 Hand wavy description
    2. 3.2 Are these steps tractable for more complicated code?
  4. 4 Interim — Case study: Lambda calculus evaluator
  5. 5 Continuation-passing style (CPS)
    1. 5.1 Motivation
    2. 5.2 High-level algorithm for CPS-ing a recursive function
    3. 5.3 CPS-ing dup, beta, and eval
  6. 6 Defunctionalization
    1. 6.1 Motivation
    2. 6.2 How is this relevant?
    3. 6.3 High-level algorithm for defunctionalizing a function
    4. 6.4 Defunctionalizing dup, beta, and eval
  7. 7 Inlining & Tail-call Optimization
  8. 8 Some Tangible Results

Introduction

Recursive code is the epitome of beautiful (read elegant, pretty, concise...) code, but is the bane of performance. A classic example is the Fibonacci function, here's the slow yet elegant recursive version:

int fibonacci(uint n) {
    return n < 2 ? 1 : fibonacci(n-1) + fibonacci(n-2);
}

And here is the fast yet unenlightening iterative version:

int fibonacci(int n) {
    int current = 1; int next = 1; int temp;
    while (n >= 2) {
        temp = current + next;
        current = next;
        next = temp;
        n--;
    }
    return next;
}

The other day I stumbled upon James Kopell's blog post (and video presentation) The best refactoring you've never heard of in which he presents a framework to transform recursive functions into iterative ones. What's the catch? Firstly, the transformation is hard to automate. Secondly, the resulting code's running time is identical to the recursive one's. So forget about transforming the recursive Θ(φn) Fibonacci into the iterative Θ(n) code by applying the transformation mindlessly. And finally, the transformed code is just really ugly... Regardless, I think this transformation could be helpful.

In this blog post I'll guide you through the steps to transform a few complicated recursive functions into iterative ones. In the process I hope to present the motivation behind this transformation, and to go deeper in the explanations of the components that make it up. To the curious, here's an example taken from Kopell's post that showcases the transformation:

// RECURSIVE VERSION
void printTree(Tree* tree) {
    if (tree != null) {
        printTree(tree.left);
        print(tree.node);
        printTree(tree.right);
    }
}

// ITERATIVE VERSION
class Cont {
    Tree tree;
    Cont next;
}
 
void printTree(Tree tree, Cont cont) {
    while (true) {
        if (tree != null) {
            cont = new Cont(tree, cont);
            tree = tree.left;
        } else {
            if (cont != null) {
                print(cont.tree.node);
                tree = cont.tree.right;
                cont = cont.next;
            } else return;
        }
    }
}

Recursive code that is easy to rewrite

Tail-recursive functions, i.e. functions whose last instruction is a recursive one, can be easily transformed into iterative code. Consider the following code that prints a linked list. We can transform it into iterative code by following these instructions:

  1. Wrapping the body of the function with a while (true) { ... },
  2. adding a break; (or a return) in the branches that contain no recursive calls,
  3. and replacing the recursive call with a re-assignment of the function's arguments (beware that the order of assignment may be important!).
// RECURSIVE
void printList(LinkedList list) {
    if (list == null) {
        print("()");
    } else {
        print(list.data + " -> ");
        printList(list.next);
    }
}

// ITERATIVE AFTER DOING TAIL CALL OPTIMIZATION
void printList(LinkedList list) {
    while (true) {
        if (list == null) {
            print("()");
            break;
        } else {
            print(list.data + " -> ");
            list = list.next;
        }
    }
}

For some functions that operate on the recursive call, e.g. factorial, we can use accumulators to transform the function into a tail-recursive one. Here's the factorial example:

// RECURSIVE
int factorial(int n, int acc = 1) {
    if (n < 2)
        return acc;
    return factorial(n-1, acc*n);
}

// ITERATIVE
int factorial(int n) {
    int acc = 1;
    while (true) {
        if (n < 2)
            return acc;
        acc = acc * n;   // the order of assignment is important here!
        n = n - 1;
    }
}

Quest for a general algorithm

Hand wavy description

The steps described here are the same ones presented in Kopell's presentation. To give value to my presentation and not regurgitate Kopell's I have tried to motivate the same steps from a different perspective.

How can we rewrite any recursive function iteratively? One approach is to see if this problem reduces to the simple case, i.e. we ask ourselves, can we rewrite any recursive function in a tail-recursive way? It's not clear how to accomplish that, let's see why. Consider the recursive printTree code, the code after printTree(tree.left) (lines 5-6) is code that needs to continue to evaluate. We can't just stop after the first recursive call, and the second recursive call prevents us from using the accumulator trick. The key insight is to recognize that lines 5 and 6 are the continuation of the first recursive call (i.e. what is left to be done). Rewriting the code in continuation-passing style (CPS) will abstract the remaining recursive calls into the continuation. At this stage we end up with a function with a single (albeit complicated) function call. This function call will be none other than the first recursive call. Now, our function is tail recursive that is passing around closures (the continuations). Alas these closures carry around an environment of their own, so we still can't do tail-call optimization. The second and last missing piece of the puzzle is Defunctionalization. This trick allows us to encode higher-order functions in terms of data. Putting everything together: encoding the continuations transforms the code into a recursive function (a couple actually) that we can apply TCO to.

The transformation, which I dub Recursion Elimination, consists of four stages.

  1. Rewriting in continuation-passing style (CPS)
  2. Defunctionalizing
  3. Inlining
  4. Doing tail-call optimization (TCO)

The last step, tail-call optimization, may need to be done twice: before and after inlining.

Are these steps tractable for more complicated code?

Sure printTree is a cute example, but I wonder if the transformation would be humanly tractable to apply to more complicated recursive code. So I set about rewriting The Most Beautiful Program Ever Written (spoiler: it's a lisp interpreter) in an iterative way. The resulting code, far from being called the ugliest program ever written, sure is cryptic and doubtlessly unmanageable. Transforming it was a challenge at first and so I've definitely learned from it. Furthermore the result was up to 24% faster than the recursive version, and it never segfaults (in practice)! I also noticed that with enough practice this sort of transformation could be mindlessly done quickly by hand.

Disclaimer: I didn't write a lisp interpreter but rather a lambda calculus evaluator. The language that we will implement will be very tiny. It consists of variables, abstractions (aka functions), and [function] applications. The only hard coded function that we will support is print that prints to stdout whatever it is given and returns it as its value. The interpreter is written in the D programming language, and all code samples from here on will be in that language. D is similar to C/C++, however here's a few things that you might've not known:

  1. Function arguments can have default values.
  2. Besides func(val) for calling functions, it's also possible to write val.func() or val.func. In general func(val, ...) can be rewritten as val.func(...). Empty parenthesis can be omitted.
  3. Closures, or functions with context, are called delegates. The type of a delegate is R delegate(T0, T1, ...) where R is the return type, and T0, T1, ... are the argument types.
  4. Structs can be stored in the heap by doing new StructName(...) which returns a pointer to this struct, or on the stack by doing StructName(...)
  5. D has a garbage collector.

Interim — Case study: Lambda calculus evaluator

The program that we will transform is the following:

enum TType { VAR, APP, ABS };

struct Term {
    TType type;
    string a;       // a is either the variable name, or the function's only parameter
    Term* t1;       // t1 is the body of the function, or the left term in an application
    Term* t2;       // t2 is the right term in an application
}

alias Env = Term*[string];  // a hashmap whose keys are strings and values are Term*

Term* dup(Term* t) {
    if (t == null) return null;
    return new Term(t.type, t.a, dup(t.t1), dup(t.t2));
}

Term* beta(Term* term, string var, Term* val) {
    if (term == null) return term;
    final switch (term.type) {
        case TType.VAR:
            return term.a == var ? val : term;
        case TType.ABS:
            if (term.a == var) return term;
            term.t1 = beta(term.t1, var, val);
            return term;
        case TType.APP:
            term.t1 = beta(term.t1, var, val);
            term.t2 = beta(term.t2, var, val);
            return term;
    }
}

Term* eval(Term* term, Env env, void delegate(Term*, int) interfunc, int depth = 0) {
    interfunc(term, depth);

    final switch (term.type) {
        case TType.VAR:
            if (term.a == "print") {
                return term;
            } else if ((term.a in env) !is null) {
                return eval(dup(env[term.a]), env, interfunc, depth);
            } else assert(false, "Unbound variable " ~ term.a);
        
        case TType.APP:
            if (term.t1.type == TType.ABS) {
                term.t1.t1 = beta(term.t1.t1, term.t1.a, dup(term.t2));
                return eval(term.t1.t1, env, interfunc, depth);
            } else {
                if (term.t1.type == TType.VAR && term.t1.a == "print") {
                    writefln("[print] %s", term.t2.toString);
                    return eval(term.t2, env, interfunc, depth-1);
                }
                term.t1 = eval(term.t1, env, interfunc, depth+1);
                term.t2 = eval(term.t2, env, interfunc, depth+1);
                return eval(term, env, interfunc, depth);
            }
        
        case TType.ABS:
            return term;
    }
}

Three functions are omitted since they are not relevant to this blog post. They are: the parsing function, the string toString(Term* t) function that displays a term, and the main function that runs the whole show. You can find these in typeless.git (there's a REPL to play around with!).

There are three recursive functions that we will rewrite iteratively:

  1. Term* dup(Term* t) creates a deep copy of a term.
  2. Term* beta(Term* t, string var, Term* val) replaces all the occurrences of the variable var by the value val. For example beta(parse("((lambda x x) x) x"), "x", parse("y")) produces ((lambda x x) y) y. Notice the outer xs were replaced and the one in the lambda is not (since it's shadowed by the lambda's argument).
  3. Term* eval(Term* term, Env env, ...) reduces the given term until it's a value (i.e. not possible to reduce it any further). A variable is reduced by looking up its value in the environment. If the variable is not defined (i.e. not present in the environment) then execution halts with an error message.
    The remaining two arguments, interfun and depth are for debugging purposes. Between every reduction a call to interfunc happens. The arguments to it are the current term and the depth of the reduction. I use this function to pretty-print the evaluation as it is happening, and to count the total number of reductions.
I figured these functions are complicated enough to test the applicability of Recursion Elimination.

Continuation-passing style (CPS)

Motivation

Very briefly Code written in CPS is code that explicitly defines what it does after every instruction.

Let's examine our opening statement again: "Recursive code is the epitome of beautiful code". If we consider "complete" code (self-descriptive) as criteria for beauty then our recursive Fibonacci example is surprisingly not beautiful. Consider this small variation

int fibonacci(int n) {
    writef("n=%d; ", n);
    return n < 2 ? 1 : fibonacci(n-1) + fibonacci(n-2);
}

What will we read on the console if we call fibonacci(3)? Without knowing much about D we can't know whether we'll see n=3; n=1; n=2;, or n=3; n=2; n=1;, or an infinite stream of numbers. The semantics of this function depend on the order of evaluation of fibonacci(n-1) + fibonacci(n-2); i.e. which side of + is evaluated first. And it also depends on whether the branches of the ternary operator x ? y : z are all evaluated first (hence the infinite stream of numbers) or not. Compiling this code with gcc will produce undefined behavior. On the other hand compiling this code with GNU ELisp where the order of evaluation is specified we know the output will be n=3; n=1; n=2;. Hence we cannot attest to the completeness of this recursive function as its semantics depend on the host language.

A trivial fix is to rewrite the code with assignments like so:

int fibonacci(int n) {
    writefln("computing fibonacci %d", n);
    if (n < 2) return 1;
    int left = fibonacci(n-1);
    int right = fibonacci(n-2);
    return left + right;
}

But who's to say the compiler (or a successor programmer) won't shuffle the order? As motivated in the outline above we really should transform everything into a single instruction since order of evaluation cannot be ambiguous when there's only one instruction to be evaluated. We know that after executing fibonacci(n-1) we have to call the function
(int left) { int right = fibonacci(n-2); return left + right; } with left being whatever the first computation produced. To achieve this we will rewrite fibonacci in continuation-passing style like so:

int fibonacci(int n, int delegate(int) continutaion) {
    writefln("computing fibonacci %d", n);
    if (n < 2) return continuation(1);
    return fibonacci(n-1, (int left) {
        return fibonacci(n-2, (int right) {
            return continuation(left + right);
        });
    });
}

Continuations are very useful, their usage spans many more applications. Continuations allow you to explicitly specify the control flow. A language equipped with continuations can use them to simulate try/catch, return, loops, and even "time-traveling". I recommend reading Continuations by example: Exceptions, time-traveling search, generators, threads, and coroutines by Matt Might. Or you can watch William Byrd's excellent video tutorial Intro to continuations, call/cc, and CPS.

High-level algorithm for CPS-ing a recursive function

  1. Add an extra argument cont to your function.
  2. In all branches where there are no recursive calls, replace return value; with return cont(value);.
  3. In all branches where there are recursive calls, add cont as a parameter to the last recursive call.
  4. Wrap the instructions that follow every recursive call in a function. This function will be the extra parameter to the recursive call.

CPS-ing dup, beta, and eval

There is not much to be said about this step of the transformation here, so here's the transformed code:

Term* dupCPS(Term* t, Term* delegate(Term*) cont) {
    if (t == null) return cont(null);
    return dupCPS(t.t1, (Term* ans1) {
        return dupCPS(t.t2, (Term* ans2) {
            return cont(new Term(t.type, t.a, ans1, ans2));
        });
    });
}

Term* betaCPS(Term* term, string var, Term* val,
              Term* delegate(Term*) cont = (Term* t) { return t; }) {
    if (term == null) return cont(null);
    if (term.type == TType.VAR) {
        return cont(term.a == var ? val : term);
    } else if (term.type == TType.ABS) {
        if (term.a == var) return cont(term);
        return betaCPS(term.t1, var, val, (Term* ans) {
            term.t1 = ans;
            return cont(term);
        });
    } else {
        return betaCPS(term.t1, var, val, (Term* ans) {
            term.t1 = ans;
            return betaCPS(term.t2, var, val, (Term* ans) {
                term.t2 = ans;
                return cont(term);
            });
        });
    }
}

Term* evalCPS(Term* term, Env env, void delegate(Term*, int) interfunc, int depth = 0,
              Term* delegate(Term*) cont = (Term* t) { return t; }) {
    interfunc(term, depth);
    
    if (term.type == TType.VAR) {
        if (term.a == "print") {
            return cont(term);
        } else if ((term.a in env) !is null) {
            return evalCPS(env[term.a].dup, env, interfunc, depth, cont);
        } else assert(false, "Unbounded variable " ~ term.a);
    } else if (term.type == TType.APP) {
        if (term.t1.type == TType.ABS) {
            term.t1.t1 = betaCPS(term.t1.t1, term.t1.a, term.t2.dup);
            return evalCPS(term.t1.t1, env, interfunc, depth, cont);
        } else {
            if (term.t1.type == TType.VAR && term.t1.a == "print") {
                writefln("[print] ", term.t2.toString);
                return evalCPS(term.t2, env, interfunc, depth-1, cont);
            }
            return evalCPS(term.t1, env, interfunc, depth+1, (Term* ans) {
                term.t1 = ans;
                return evalCPS(term.t2, env, interfunc, depth+1, (Term* ans) {
                    term.t2 = ans;
                    return evalCPS(term, env, interfunc, depth, cont);
                });
            });
        }
    } else {
        return cont(term);
    }
}

Defunctionalization

Motivation

Very briefly Defunctionalizing some code is roughly equivalent to writing a tiny interpreter for that code in the host language.

John C. Reynolds introduced defunctionalization in 1972 in his paper Definitional interpreters for higher-order programming languages [PDF] to address these issues:

  1. Eliminating higher-order functions from interpreters. Interpreters free of higher-order functions can be implemented using low-level languages (e.g. C) that do not support them. (side-note: C supports function pointers which allow you to pass and return functions, but it does not support closures.)
  2. Making the order of evaluation in the interpreter explicit. This issue was discussed in detail in the section about CPS. But I suppose you agree with me that making the semantics of a language dependent on the semantics of the interpreter's host language is a horrible idea.

CPS solves the second issue, and defunctionalization solves the first. Briefly, the idea is to pass descriptions of functions (which is data) rather than closures. If f is a function, then all applications f(x1, ..., xn) are replaced by apply(descriptionOfF, x1, ..., xn). Consider the following program which contains only one higher-order function, map:

int[] map(int[] list, int delegate(int) mapper) {
    if (list.length == 0) return [];
    else                  return [ mapper(list[0]) ] ~ map(list[1..$], mapper);
}
auto someNumbers(int maxN, int parg1 = 2, int parg2 = 3) {
    return range(1, maxN).map(x => pow(parg1, x))
                         .map(x => pow(parg2, x))
                         .map(x => x+1);
}

To begin with, we collect all the functions that are passed to map. In this case they are x => pow(parg1, x), x => pow(parg2, x), and x => x+1. Next, we identify all the free variables inside these closures, i.e. all the variables that assume their value from some parent scope. These are parg1 and parg2 in the above example. Finally, we create a data structure that holds all these free variables and a label to differentiate the closures. For the code above, we create an int label that will tell us whether we should exponentiate, or increment. Since parg1 and parg2 are never used in the same closure we can save some space and use a single field to encode them instead of two.

struct HOFunc { // Datastructure describing all lambdas passed to the higher-order function `map`
    int behavior;   // 0 means "do x+1", 1 means "do pow(powArg,x)"
    int powArg;     // the first argument to pow
}

int apply(int arg, HOFunc func) {
    if (func.behavior == 1) return pow(func.powArg, arg);
    if (func.behavior == 0) return arg + 1;
    assert(false, "Unknown function description ", func);
}

We must then rewrite map and someNumbers to use apply and HOFunc. All calls to the arg-functions will be replaced by calls to apply, and all introductions of arg-functions will be replaced by HOFunc data that describes said function. Here's how we would rewrite those two functions:

int[] map(int[] list, HOFunc mapper) {
    if (list.length == 0) return [];
    else                  return [ apply(list[0], mapper) ] ~ map(list[1..$], mapper);
}
auto someNumbers(int maxN, int parg1 = 2, int parg2 = 3) {
    return range(1, maxN).map(HOFunc(1, parg1))
                         .map(HOFunc(1, parg2))
                         .map(HOFunc(0, 0));    // 2nd arg is useless, can be anything
}

How is this relevant?

Look again at dupCPS, betaCPS, and evalCPS. You'll see that these are higher-order functions that take the continuation (a function) as their final argument. Once we defunctionalize all three functions we'll end up with code that doesn't use any higher-order functions, and all three functions (six in reality if we count the apply functions) will be tail-recursive.

High-level algorithm for defunctionalizing a function

  1. Locate all the higher-order functions in your code; i.e. functions that take functions as arguments, or return functions.
  2. For every higher order function:
    1. Locate all the functions that are passed to/returned from this function.
    2. Find all the functions' free variables.
    3. Create a struct that holds all these free variables. Add a label that acts as identification for every function.
    4. Replace all applications to these functions by calls to the special apply function.
    5. Replace all the function introductions with a value from the struct.

Defunctionalizing dup, beta, and eval

Instead of introducing one giant apply function and one giant HOFunc struct we will introduce one of each for each of the three higher-order functions.

Let's start with dupCPS. There are only two closures in the whole program that are fed to dupCPS and these are the ones in its body. The free variables of both these closures are Term* t, Term* ans1, and Term* delegate(Term*) cont. From here on we will rename all continuations in the set of free variables to next.

struct DupCont {
    bool inner;         // false is outer closure (line 3), true is inner closure (line 4)
    Term* term;         // the Term t
    Term* ans1;
    DupCont* next;      // the next continuation
}

Term* applyDup(Term* ans, DupCont* cont) {
    if (cont == null) return ans;
    if (cont.inner) {
        Term* t = new Term(cont.term.type, cont.term.a, cont.ans1, ans);
        return applyDup(t, cont.next);
    } else {
        return dupDefun(cont.term.t2, new DupCont(true, cont.term, ans, cont.next));
    }
}

Term* dupDefun(Term* t, DupCont* cont = null) {
    if (t == null) return applyDup(t, cont);
    return dupDefun(t.t1, new DupCont(false, t, null, cont));
}

The same procedure can be applied to betaCPS to get betaDefun. This time, we notice that the parameters var and val are constant and will be redundant in the struct. So instead we pass them to applyBeta. Moreover, there's no label to indicate which closure we're executing, we use the type to distinguish them since there's only one closure used for a given type.

struct BetaCont {
    TType type;         // technically redundant, we could use term.type instead
    int argNum;         // 0 means we're working with term.t1, 1 means term.t2
    Term* term;
    BetaCont* next;
}

Term* applyBeta(Term* ans, string var, Term* val, BetaCont* cont) { // var, val in here because they never change
    if (cont == null) return ans;
    switch (cont.type) {
        case TType.ABS:
            cont.term.t1 = ans;
            return applyBeta(cont.term, var, val, cont.next);
        case TType.APP:
            if (cont.argNum == 1) {
                cont.term.t1 = ans;
                return betaDefun(cont.term.t2, var, val, new BetaCont(cont.type, 2, cont.term, cont.next));
            } else {
                cont.term.t2 = ans;
                return applyBeta(cont.term, var, val, cont.next);
            }
        default: applyBeta(ans, var, val, cont.next);
    }
}

Term* betaDefun(Term* term, string var, Term* val, BetaCont* cont = null) {
    if (term == null) return applyBeta(term, var, val, cont);
    final switch (term.type) {
        case TType.VAR:
            return applyBeta(term.a == var ? val : term, var, val, cont);
        case TType.ABS:
            if (term.a == var) return applyBeta(term, var, val, cont);
            else return betaDefun(term.t1, var, val, new BetaCont(TType.ABS, 0, term, cont));
        case TType.APP:
            return betaDefun(term.t1, var, val, new BetaCont(TType.APP, 1, term, cont));
    }
}

evalCPS only has two closures passed to it, one at line 51, and the other at line 53. It should be noted that we are not interested in defunctionalizing the interfunc argument away. There's nothing special about this code that hasn't been said before, so here is the defunctionalization of evalCPS for the sake of completeness.

struct EvalCont {
    Term* term;
    bool inner;     // true is closure at line 53, and false is closure at line 51
    int depth;
    EvalCont* next;
}

Term* applyEval(Term* ans, Env env, void delegate(Term*, int) interfunc, EvalCont* cont) {
    if (cont == null) return ans;
    if (cont.inner) {
        cont.term.t2 = ans;
        return evalDefun(cont.term, env, interfunc, cont.depth+1, cont.next);
    } else {
        cont.term.t1 = ans;
        return evalDefun(cont.term.t2, env, interfunc, cont.depth,
                         new EvalCont(cont.term, true, cont.depth, cont.next));
    }
}

Term* evalDefun(Term* term, Env env, void delegate(Term*, int) interfunc,
                int depth = 0, EvalCont* cont = null) {
    interfunc(term, depth);
    if (term.type == TType.VAR) {
        if (term.a == "print")
            return applyEval(term, env, interfunc, cont);
        else if ((term.a in env) !is null)
            return evalDefun(dupDefun(env[term.a]), env, interfunc, depth, cont);
        else assert(false, "Unbounded variable " ~ term.a);
    } else if (term.type == TType.APP) {
        if (term.t1.type == TType.ABS) {
            term.t1.t1 = betaDefun(term.t1.t1, term.t1.a, dupDefun(term.t2));
            return evalDefun(term.t1.t1, env, interfunc, depth-1, cont);
        }
        if (term.t1.type == TType.VAR && term.t1.a == "print") {
            writefln("[print] %s", term.t2.toString);
            return evalDefun(term.t2, env, interfunc, depth-1, cont);
        }
        return evalDefun(term.t1, env, interfunc, depth+1,
                         new EvalCont(term, false, depth, cont));
    } else return applyEval(term, env, interfunc, cont);
}

There is something really interesting worth pointing out. Take a look at DupCont, BetaCont, and EvalCont. All three look like a linked list. In fact all three are a special linked list; a stack. We knew all along (but never mentioned it) that we ought to be manipulating a stack at some point since we are simulating recursive calls. And here is where the stack arises in our code.

Inlining & Tail-call Optimization

Now if you carefully look at the six functions: applyDup & dupDefun, applyBeta & betaDefun, and applyEval & evalDefun, you will notice that every couple of functions (as grouped previously) are mutually-recursive, and most importantly tail recursive! To complete our transformation we just need to inline the special apply functions and do TCO on every couple.

I grouped the inline and TCO steps together since in some cases, as is in applyBeta and applyDup, one must do TCO on the apply function first before inlining. So some functions will require three steps: do TCO on apply, inline apply with the function, and finally do TCO on the whole function.

I have already explained how to do TCO in section 2, so I won't repeat myself. Again, for the sake of completeness here are the three functions with inlining and TCO applied to them:

Term* dupOpt(Term* t, DupCont* cont = null) {
    while (true) {
        if (t == null) {
            Term* ans = t;
            DupCont* acont = cont;
            while (true) {
                if (acont == null) return ans;
                if (acont.inner) {
                    ans = new Term(acont.term.type, acont.term.a, acont.ans1, ans);
                    acont = acont.next;
                } else {
                    t = acont.term.t2;
                    cont = new DupCont(acont.term, true, ans, acont.next);
                    break;
                }
            }
        } else {
            cont = new DupCont(t, false, null, cont);
            t = t.t1;
        }
    }
}
Term* betaOpt(Term* term, string var, Term* val, BetaCont* cont = null) {
    Term* ans;
    BetaCont* acont;
    do {
        bool computeAns = false;
        if (term == null) {
            computeAns = true;
            ans = term; acont = cont;
        } else if (term.type == TType.VAR) {
            computeAns = true;
            ans = term.a == var ? val : term; acont = cont;
        } else if (term.type == TType.ABS) {
            if (term.a == var) {
                computeAns = true;
                ans = term; acont = cont;
            } else {
                cont = new BetaCont(TType.ABS, 0, term, cont);
                term = term.t1;
            }
        } else {
            cont = new BetaCont(TType.APP, 1, term, cont);
            term = term.t1;
        }
        if (computeAns) {
            while(true) {
                if (acont == null) return ans;
                if (acont.type == TType.ABS) {
                        acont.term.t1 = ans;
                        ans = acont.term;
                        acont = acont.next;
                } else if (acont.type == TType.APP) {
                    if (acont.argNum == 1) {
                        acont.term.t1 = ans;
                        term = acont.term.t2;
                        cont = new BetaCont(acont.type, 2, acont.term, acont.next);
                        break;
                    } else {
                        acont.term.t2 = ans;
                        ans = acont.term;
                        acont = acont.next;
                    }
                } else {
                    acont = acont.next;
                }
            }
        }
    } while (true);
}
Term* evalOpt(alias beta, alias dup)
             (Term* term, Env env, void delegate(Term*, int) interfunc, int depth = 0, EvalCont* cont = null) {
    Term* ans;
    EvalCont* acont;
    do {
        interfunc(term, depth);
        int computeAns = false;
        if (term.type == TType.VAR) {
            if (term.a == "print") {
                computeAns = true;
                ans = term; acont = cont;
            } else if ((term.a in env) !is null) {
                term = dup(env[term.a]);
            } else assert(false, "Unbounded variable " ~ term.a);
        } else if (term.type == TType.APP) {
            if (term.t1.type == TType.ABS) {
                term.t1.t1 = beta(term.t1.t1, term.t1.a, dup(term.t2));
                term = term.t1.t1;
                depth--;
            } else if (term.t1.type == TType.VAR && term.t1.a == "print") {
                writefln("[print] %s", term.t2.toString);
                term = term.t2;
                depth--;
            } else {
                cont = new EvalCont(term, false, depth, cont);
                term = term.t1;
                depth++;
            }
        } else {
            computeAns = true;
            ans = term; acont = cont;
        }
        if (computeAns) {
            if (acont == null) return ans;
            if (acont.inner) {
                acont.term.t2 = ans;
                term = acont.term;
                depth = acont.depth+1;
                cont = acont.next;
            } else {
                acont.term.t1 = ans;
                term = acont.term.t2;
                depth = acont.depth;
                cont = new EvalCont(acont.term, true, acont.depth, acont.next);
            }
        }
    } while (true);
}

Some Tangible Results

With the help of Church encoding to encode numbers and booleans I wrote a program that computes the fibonacci numbers (it's stdlib.lc in typeless.git). The table below displays 30 results, each for a different n (the input) and level of optimization. For a given run, if a function's cell contains the check mark ✓ then the transformed (i.e. iterative) version of that function was used. The results are grouped by n and the fastest result is in bold. I benchmarked the code on my slow notebook whose brain is an Intel Atom x5-Z8350 processor (1.9GHz / 2M cache) with 1.8 GB RAM.

Number of reductions eval beta dup Total (sec)
150,992
fibonacci(8)
0.358
0.415
0.373
0.768
0.413
0.849
0.791
0.847
4,710,572
fibonacci(15)
14.41
12.92 (+11.5%)
14.22
30.34
13.31
27.86
32.34
28.14
12,354,971
fibonacci(17)
45.33
41.61
39.9
85.33
37.32 (+21.5%)
78.73
20,000,475
fibonacci(18)
Segfault (@ 67.69)
Segfault (@ 67.01)
63.45
62.2
52,389,823
fibonacci(20)
162.34
174.28
221,990,915
fibonacci(23)
693.21
705.99

Before running these benchmarks I've profiled the interpreter. Briefly, the function that is called the most is dup, 4.3 times more than beta which is called 2.5 times more than eval.

As expected the code that uses all recursive versions eventually segfaults, which is basically an indication that we ran out stack space. What may be surprising is that using dupOpt instead of dup actually slows things down, by a lot! Keep in mind that accessing data on the heap is slower than on the stack, dup is already a memory intensive operation, dupOpt will be even more memory intensive, which I guess is the reason behind the slow-down.

However what I am incapable of explaining is why eventually, for really large n, using both evalOpt and betaOpt (version A) is slower than using eval and betaOpt (version B). My guess is that memory becomes too fragmented after too many pushes/pops on the EvalCont and BetaCont linked lists. I conjecture that eventually B will segfault for some large n. However, I tried running B on a macbook that is on average five times faster than my notebook with n=28 taking some 2.4 billion operations. B did not segfault and it ran significantly faster than A.

In conclusion, we should keep the recursive eval and dup and use the iterative betaOpt instead of beta. So beware of overtrusting the "iterative code is quicker than recursive code" idiom since that wasn't the case for 2/3 of our functions! The lesson drawn from this experiment was: As always, first benchmark and then decide.