Asymptotically Typeable
https://blog.grgz.me/
Thoughts and experiments of a programming-languages passionate
Tue, 15 Mar 2022 19:25:36 -0000
Context folks
<p>
In Nautilus' <a href="https://nautil.us/deep-learning-is-hitting-a-wall-14467/" target="_blank">Deep Learning Is Hitting a Wall</a>, Gary Marcus elaborates on his bone of contention with deep learning.
Here I'll jot down my own with AI at large when applied to social situations.</p><p>
Let's talk about self-driving cars.<p>
I want to focus on two videos about autonomous cars.
<a href="https://www.youtube.com/watch?v=9wXRahO5-t8" target="_blank">The first</a> is in Switzerland.
It's an idyllic day.
Lightly windy.
The Tesla's passenger points his phone at the dashboard, recording an irregularity while crappy music is filling up the interior of the car and the entirety of the video.
The Tesla is parked in front of a grocery shop called Coop.
Its banner is white, vertical, and has the text in two colors: CO in red and OP in yellow.
The flag is flailing, and flickering between red and yellow on the Tesla's screen is a street light.
It thought the flag was a street light.
<a href="https://youtu.be/RVkLI9pPd24?t=166" target="_blank">The second video</a> is of a Tesla insisting on maneuvering through a worksite around a man holding a stop sign in the middle of the road. It did not take the man's stop sign seriously. It treated him like fallen debris.</p><p>
What is a street light? Or simpler still. What is a stop sign?</p><p>
Answer 1: it is a red octagon with a white border and a white script screaming STOP.<br>
<em>Problem:</em> so a Tesla ought to halt if it saw a Stop sign sticker pasted on the bumper of some prankster.</p><p>
Answer 2: same as answer 1, but a stop sign is on a straight and tall-ish pole.<br>
<em>Problem:</em> Accidents happen, and a car could catapult itself into a stop sign and leave it bowing at an arbitrary angle. So, soon after such an accident, a Tesla ought to ignore the mutilated stop sign.</p><p>
Answer 3: same as answer 1, but it's standing on the pavement or way above the street.<br>
<em>Problem:</em> Axiom: Humans wear exotic t-shirts. Ergo, facing someone wearing a stop-sign t-shirt standing still on the pavement by a bus station, a Tesla ought to stop.</p><p>
Answer 4: same as answer 3, but a stop sign must not move.<br>
<em>Problem:</em> On a particularly windy day in a particularly disreputable neighborhood with unreliable infrastructure, a Tesla ought to keep going.<br>
<em>Problem:</em> Confronting a miserable man holding and rocking the stop sign, identically to the second video, the Tesla ought to swerve.</p><p>
We can go about this forever, conjuring hypothetical yet probable edge cases.
This is not mental masturbation.
False negatives, racing past a stop sign, and false positives, screeching to a halt, can cause accidents.
Someone's got to pay! Should the driver pay? But Tesla advertised an autonomous vehicle, nudging drivers to believe, although the contrary is in the fine print, that it is well capable of taking control.
Masquerading. Manipulating. Marketing.
So, should Tesla pay? Or should the guy with a Stop-sign t-shirt pay?
When such car accidents occur and remain unresolved, they undermine people's confidence in a fair justice system and safe roads.</p><p>
So why is it okay to apply machine learning to protein folding, board games, and data analysis, but not to autonomous cars, judicial rulings, and writing public policy?</p><p>
Would I be a cretin if I call out the obvious conclusion that context is key? Context folks.
There are cultural contexts that no computer can conjure in its measly mind.
At first, your eyes catch this stop-sign straddling sport in a bright yellow or orange jacket.
Then you think that they're standing aimlessly in the middle of the road next to a construction site.
Then you realize they're eyeing cars and their drivers, avoiding their gaze when they appear frustrated, and reciprocating a small smile with a similar one of their own.
Tiny cultural cues allow you to conclude that the creature is not crazy.
They're a human with a purpose.
They care about the drivers, and drivers should care about their stop signs.
Their Stop sign says stop.</p><p>
Context derives intention.
Intention informs judgment.
And judgment guides decisions.</p><p>
I hear the rebukes.
A large enough and deep enough neural network will extrapolate this cultural context.
Sure. Aside from <a href="https://en.wikipedia.org/wiki/Bonini's_paradox">Bonini's paradox</a>.
Nothing stops it as long as the training data only comes from the geographical and temporal spot where execution is expected.
Two regions; two networks.
Crossing from one to the other with the same car requires retraining.
Just as you must retake a driver's exam, the car must be reprogrammed.</p><p>
But these vehicles are on the road now.
The singular plan to make them safe is to push human drivers away from roads.
Sculpting a schism where some roads allow you behind the wheels and others behind the invisible driver.
Yesterday horse-drawn chariots were banned from highways, tomorrow the meaty man.</p><p>
But in the age of fast internet and fast food, we must squash these bugs now.
The rank and file, the rich, and the revenue depend on it.
So how will we do it? My prediction is in line with Clive Thompson's observation: <a href="https://onezero.medium.com/ai-wont-steal-your-job-but-it-ll-sure-make-it-suck-210973fc5e0a" target="_blank">"more jobs; worse jobs"</a>.
Shit jobs will be created to assist autonomous cars.</p><p>
Manufacturers will hire flesh-and-bone drivers to meander the streets, alleys, and boulevards of developed countries adorned with strict driving laws (forget the streets of Beirut).
You needn't look any further than the internet to extrapolate my conclusion.
Every time a robot believes you to be of its kin you must scrutinize blurry and pixelated photos for stairs, buses, crosswalks, and street lights.
You are already assisting the machine by virtually meandering the streets in your armchair.
When the cars need live data, these carbon chauffeurs will gather it and stream it into the psyche of the silicon cruising the vicinities.
Street Popularurbanway is home to a crowd of foremen and workers birthing a building from these hours to these hours for the next five months.
Alley Famousdudestreet, where Famousdude School is, is busy tonight because of a prom.
The turn East on the intersection between Waytonether Street and Hellhighway is blocked by a boulder. Etc...
The misfortune of these drivers is that an AI specter will dictate to them the routes to drive.</p><p>
Robots at the ends of a production line and humans sandwiched in between.
Faceless humans toil for the appliance.</p></p>
https://blog.grgz.me/posts/context-folks.html
Mon, 14 Mar 2022
A harder-to-crack Wordle
<p>
Wordle is a game that took the Tweetosphere by storm, seemingly transforming Twitter into the now-defunct emoji-only social network <a href="https://emoj.li" target="_blank">Emojli</a>.
Countless nerds and word junkies likely litter your Twitter timeline with a daily stream of 30 green, grey, and yellow emoji boxes.
So if you haven't joined the status-quo and engaged in this sweet new Internet addiction, here's the gist of the game.
It's really quite simple.
You have to guess the daily secret word in six tries.
After submitting a guess, the letters in the correct position will light in green, those that appear in the answer but are in the wrong place will light in yellow, and those not in the word dwell in a dull grey.
It's a very effective game to make you realize how few five-lettered words you remember or can spit out on demand.
</p>
<p>
Everyone and their mother have been presenting ideal strategies for playing Wordle.
I want to do something different.
I want to crack Wordle and exhibit fixes that make the game uncrackable.
I know, I know, I'm cheating.
But I like guarantees, and I like to play a game knowing it's guaranteed to be fair.
</p>
<p>
I'm not trying to condemn the authors of Wordle and its derivatives for not thinking about making the app uncrackable.
Quite the contrary, they made a great game that I enjoy playing (and breaking) every day.
Wordle was literally a labor of love.
Here's a quote from the <a href="https://www.nytimes.com/2022/01/03/technology/wordle-word-game-creator.html" target="_blank">New York Times</a>: "Josh Wardle, a software engineer in Brooklyn, knew his partner loved word games, so he created a guessing game for just the two of them".
Me? I'm just the party-pooper who's here to tell everyone how to break (and possibly fix) it.
</p>
<h2>Wordle's setup</h2>
<p>
The game was such a hit that it spawned an alternative for <a href="https://wordle.danielfrg.com/" target="_blank">Spanish</a>, and a couple for math like <a href="https://www.mathler.com/" target="_blank">Mathler</a> and <a href="https://nerdle.com/" target="_blank">Nerdle</a>.
I argue these spin-offs are possible for two reasons: first, the code is easy to fork, and second, the backend is cheap.
</p>
<p>
The backend is so cheap that it could practically be removed entirely.
The only interaction you'll have with the server is the initial page load.
That's it.
Zero contact is made to check if a word is in the dictionary.
Zero contact is made to list the correctly-placed letters.
Zero contact is made.
<em>Nil</em>.
</p>
<p>
All but Nerdle can be downloaded locally and played independently of any server.
Nerdle requires a server to download the daily answer.
From here on, the post will assume that it's perfectly acceptable to run tiny daily computations on the server like in Nerdle's case.
</p>
<h2>Attacks on Wordle and derivatives</h2>
<p>
The setup of Wordle and its derivatives implies that the logic and the data it needs (the dictionary and the daily word) are already present in the code.
Both Wordle and Spanish Wordle provide the dictionary in clear text in the application's code (if you think about it, Mathler does not need a dictionary, the expression just needs to parse correctly).
And both Wordle and Mathler put the answer in clear text in your web browser's local storage.
</p>
<p>
Ergo, it's easy to crack the answer; just read it from the browser's localStorage. The cracking code fits in half-a-tweet:
</p>
<blockquote>Stuck on wordle and want to see the solution? Make a bookmark with this URL: <br><br>javascript:alert(JSON.parse(localStorage.gameState).solution)<br><br>and open it in wordle<br><br>— GEZ (@_typeable) <a href="https://twitter.com/_typeable/status/1485584616761307138">January 24, 2022</a></blockquote>
<br>
<p>
Nerdle does something strange.
It does store the solution in the localStorage, but only after you've interacted with the page beyond loading it.
So if you open the page and look in the localStorage, you won't find anything.
Press a key or a button on the screen, and the solution is written right in the localStorage making Wordle's crack achievable here.
Now I'm a lazy guy, and I want my cracks to work with as little interaction with the page as possible.
</p>
<p>
However, if you look in the Network tab, you'll find a request to a strange URL that returns an 8-character word.
Nerdle is the only one that requires an 8-character solution.
Could this be an encoding of the equation of the day? An invasive operation inserting a "debugger" statement with surgical accuracy reveals that it's indeed the encoding.
The last segment of this URL is the MD5 hash of the number of days between Nerdle's launch and today.
</p>
<style>code pre { line-height: 1rem !important; }</style>
The crack, this time in Python 2, fits in four lines:
<code class="block">
<pre>import hashlib, urllib, time</pre>
<pre>offset=int((time.time()*1000-16426368e5)/864e5)</pre>
<pre>url="https://nerdle.com/words/"+hashlib.md5(str(offset)).hexdigest()</pre>
<pre>print "".join([chr((ord(c)+113)%126) for c in urllib.urlopen(url).read()])</pre>
</code>
<p>
On the other hand, Spanish Wordle is harder to crack at first glance.
The localStorage's "solution" entry is encrypted with what appears to be AES.
The encryption key is buried deep in the code's logic, but nothing a surgically placed "debugger" statement can't step you through.
Inspecting a curious line of code containing "AES.decrypt(t,c)", which happened to be called with the encrypted solution, reveals that the key and the salt (both pre-generated and not updated daily) are respectively "llanos" and "ibai".
Parenthesis: I googled llanos and ibai out of curiosity.
The results revealed what I assume to be Spanish Wordle's creator's alter-ego or favorite e-sport internet streamer: <a href="https://en.wikipedia.org/wiki/Ibai_Llanos" target="_blank">Ibai Llanos</a>.
How is he related to Spanish Wordle?
Your guess is as good as mine.
End Parenthesis.
</p>
<p>
So it's possible to crack, and the cracker still fits in a single tweet:
</p>
<blockquote>
Wordle(ES) crack bookmark:<br><br>
javascript:var d=document;var s=d.createElement("script");s.onload=()=>alert(CryptoJS.AES.decrypt(JSON.parse(localStorage.getItem("solution")),"llanos").toString(CryptoJS.enc.Utf8).slice("ibai".length));s.src="https://cdnjs.cloudflare.com/ajax/libs/crypto-js/3.1.2/rollups/aes.js";d.body.append(s) <br><br>
— GEZ (@_typeable) <a href="https://twitter.com/_typeable/status/1490810350836629512">February 7, 2022</a></blockquote>
<p>
The cracker is more involved, but here's the gist of it: it starts by loading CryptoJS from cdnjs, since accessing the internal AES algorithm proved to be a nightmare.
Once loaded, the script uses the library to decrypt the "solution" entry in localStorage, removes the salted string, and reveals it in the most anachronistic pop-up ever.
</p>
<p>
<em>Post-script:</em> As I was getting ready to publish this post, I came across <a href="https://www.youtube.com/watch?v=v68zYyaEmEA" target="_blank">3Blue1Brown's video</a> about finding the best strategy to find an answer in Wordle.
Go watch it.
What's relevant in this video is that Grant quickly mentions a possible attack at 4:47.
Here's what he says: "The way that it's [the list of possible answers] visible in the source code is in the specific order in which answers come up from day-to-day. That you can always look up what tomorrow's answer will be."
</p>
<p>
This brings us to these fundamental questions: can we keep everything encrypted?
Can we compute the properties of a guess without needing to store clear text versions anywhere?
If there are encryption keys, can we avoid providing their generation algorithm?
Can we do that in the least amount of memory possible?
This sounds like a job for homomorphic encryption, but let's not get carried away and think of simpler alternatives.
</p>
<p>
To give you an idea of how much each application uses, Wordle's code is in a 173.34KB javascript file, Spanish Wordle's code is in a 639.04KB file, and Mathler's code is in a 198.71KB file.
</p>
<h2>Admissible attacks</h2>
<p>
If we can answer yes to all the questions above, we should be vulnerable to only two attacks.
</p>
<p>
The first is a naive daily dictionary attack, in which the attacker generates a rainbow table for each combination of letters and tries each individually.
I say each five-letter word combination because the dictionary will be "encrypted," so the attacker will not know which words we use and which we don't.
</p>
<p>
In the second attack, the attacker utilizes the properties of the guess to improve the searching strategy.
I argue that this is perfectly fine because the attacker is now playing the game, as intended!
</p>
<h2>The proposed solutions</h2>
<p>
<em>Disclaimer</em>: I am not an expert in cryptography or privacy-preserving techniques.
If you are an expert and think my analysis is faulty, please shoot a tweet at me.
</p>
<p>
In the following, I present three techniques that I experimented with.
All three proposals provide almost-complete secrecy about the answer and the dictionary, and they satisfy the zero-contact requirement of the setup.
There is only some daily code that the backend must execute.
If you want to read what I deem to be the best technique (shortest code, simplest conceptually, and most economical memory-wise), then head over to the last section.
The order in which I present the solutions follows my chronology of discovering them.
</p>
<p>
<em>Update (12/02/2022):</em>
I have implemented a proof-of-concept which is available on my <a href="https://github.com/geezee/poc-no-crack-wordle" target="_blank">Github</a>.
There is also a running instance that generates a new word every hour <a href="https://grgz.me/wordle" target="_blank">on my website</a>.
To encourage cracking this instance, no code was obfuscated or minimized, no library was used, and only the essential features were implemented.
</p>
<h3>Bloom Filters</h3>
<p>
If you are unaware of Bloom Filters, they are fundamentally a set with two operations: adding and checking membership.
A Bloom filter starts as a fixed-length array of 0s.
To add an item to the Bloom filter, you hash it with k different hash functions, then put a 1 at each of the k indices given by those hashes.
To check the membership of an item, you hash it with the same k functions and make sure that all the entries in the array at those indices are 1s.
</p>
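<p>
To make the description concrete, here is a minimal Python sketch of such a filter. The class name, the use of salted SHA-256 to derive the k indices, and the tiny parameters are my own assumptions for illustration; any family of k independent hash functions would do.
</p>

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with k hash functions (illustrative sketch)."""

    def __init__(self, size_bits, num_hashes):
        self.size = size_bits
        self.k = num_hashes
        # One byte per bit keeps the sketch readable; a real filter packs bits.
        self.bits = bytearray(size_bits)

    def _indices(self, item):
        # Assumption: derive the k indices by salting SHA-256 with the hash number.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        # False positives are possible, false negatives are not.
        return all(self.bits[idx] for idx in self._indices(item))


bf = BloomFilter(size_bits=1024, num_hashes=4)
for word in ["ghost", "glass", "gamer"]:
    bf.add(word)
print("ghost" in bf)  # True: an added item is always reported as present
```

<p>
Note that the array stores neither the words nor their hashes, only the union of the bits they set.
</p>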
<p>
A Bloom filter of all the words in the dictionary will not store these words, nor their hashes.
So they are perfect for hiding what the elements are.
</p>
<p>
Great, if the server encodes a Bloom filter in the code, then we can tell if a guess is a valid word or not.
Next question: how do we check if the second letter of the guess is in the correct place?
</p>
<p>
I propose that we construct a Bloom filter for every word index.
The words we will add to that filter are those valid words with the correct letter in that position.
For example, if today's mystery word is "ghost," we will construct five filters.
The first will contain words like "glass" and "gamer," the second words like "phase" and "rhyme," the third words like "frost" and "shout," etc...
</p>
<p>
Notice that if a word is in all those five filters, then the guess is the word of the day.
So nothing special has to be done.
</p>
<p>
Okay, now how do we check if a letter which appears in the solution is wrongly placed?
</p>
<p>
Even more bloom filters! To be precise, five more bloom filters.
Each one will be associated with a word index.
So in the first Bloom filter, we will put all the words whose first letter occurs in the answer but is not the answer's first letter.
More concretely, if today's word is again "ghost," the first Bloom filter will contain words like "honey," "octal," "state," and "taunt".
The same is done for the other four indices.
</p>
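<p>
Which words end up in which of the ten filters can be sketched as follows. I use plain Python sets as stand-ins for the Bloom filters so the membership is easy to inspect; the function name and the tiny word list are assumptions for illustration only.
</p>

```python
def position_sets(answer, dictionary):
    """Return (correct, misplaced): one set of words per letter position."""
    n = len(answer)
    correct = [set() for _ in range(n)]    # right letter in the right place
    misplaced = [set() for _ in range(n)]  # letter occurs in the answer, elsewhere
    for word in dictionary:
        for i, letter in enumerate(word):
            if letter == answer[i]:
                correct[i].add(word)
            elif letter in answer:
                misplaced[i].add(word)
    return correct, misplaced


words = ["ghost", "glass", "phase", "khaki", "honey", "octal", "state", "taunt"]
correct, misplaced = position_sets("ghost", words)
print(sorted(correct[0]))             # ['ghost', 'glass']
print(sorted(misplaced[0]))           # ['honey', 'octal', 'state', 'taunt']
print(set.intersection(*correct))     # {'ghost'}
```

<p>
Everything is built in a single pass over the dictionary, and only the answer itself lands in all five "correct" sets.
</p>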
<p>
Done.
Problem solved.
Probably though, right?
Right.
A Bloom filter is a probabilistic data structure.
There will be some false positives.
In other words, we may say "correct, the second letter is indeed an E," yet the actual answer's second letter is an S.
Imagine the barrage of intimidating tweets you'll receive when this happens.
This actually happened to me while testing out this solution.
Not the stream of tweets, fortunately, but the false positives.
The trivial fix would be to iterate over all 26<sup>5</sup> possible words, check if the bloom filter has false positives, and regenerate it with new hashes if it's the case.
</p>
<p>
These bloom filters can be constructed in one pass over the dictionary.
So the runtime is linear in the size of the dictionary.
It should be fast to do by a server just before midnight every night.
</p>
<p>
On the other side, how much will our poor gamer have to download? Sadly no caching can be leveraged (as with all the solutions I present).
<a href="https://hur.st/bloomfilter" target="_blank">This is a fantastic calculator</a> to find the optimal parameters for a Bloom filter.
Our dictionary will include the 13K words of Wordle and have 1/26<sup>5</sup> false positive probability that boils down to 53.8KB.
The remaining 10 Bloom filters will contain 13K words in the worst case with a chance of 1/13K of false positives that end up being 31.29KB each.
Total: 366.7KB.
Twice Wordle and some change.
</p>
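<p>
For the curious, the sizes above can be reproduced with the textbook formula for the optimal number of bits, m = -n ln(p) / (ln 2)<sup>2</sup>. The sketch below assumes exactly 13,000 words and counts a KB as 1024 bytes, which is what seems to match the figures I quoted.
</p>

```python
import math


def bloom_size_bits(n_items, false_positive_rate):
    """Optimal bit count of a Bloom filter: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n_items * math.log(false_positive_rate) / math.log(2) ** 2)


def kib(bits):
    return bits / 8 / 1024


# Assumption: 13,000 words in the dictionary and in each worst-case position filter.
dictionary_kib = kib(bloom_size_bits(13_000, 1 / 26**5))
position_kib = kib(bloom_size_bits(13_000, 1 / 13_000))
total_kib = dictionary_kib + 10 * position_kib
print(f"{dictionary_kib:.1f} + 10 x {position_kib:.2f} = {total_kib:.1f}")
```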
<h3>Finite-Field Polynomials</h3>
<p>
Can we do better memory-wise?
An optimal Bloom filter has 50% of its entries 0s.
Can we get rid of those somehow?
</p>
<p>
This proposal relies on observing that a polynomial of the shape (x-x<sub>1</sub>)...(x-x<sub>n</sub>) has precisely these n roots: x<sub>1</sub>, x<sub>2</sub>, until x<sub>n</sub>, and has n coefficients besides its leading 1.
As it so happens, this is also the case in finite fields such as <strong>Z</strong><sub>p</sub> where p is a prime, i.e., the integers between 0 (inclusive) and p (exclusive) with arithmetic done modulo p.
</p>
<p>
It's rather critical to work in <strong>Z</strong><sub>p</sub> and not good-ol' <strong>R</strong>.
The reason has nothing to do with floating-point errors or compressing coefficients.
Working in <strong>R</strong> allows an attacker to use Newton's method to find all the roots of a given polynomial in just a handful of applications.
We want to hide as many properties about our polynomial as possible from the attacker.
Luckily, root finding in <strong>Z</strong><sub>p</sub> is hard.
The standard algorithm I'm aware of is <a href="https://en.wikipedia.org/wiki/Chien_search" target="_blank">Chien search</a> which is essentially just a fancy brute-force search.
</p>
<p>
To detect whether a guess is in the dictionary or not, we start by choosing a <em>perfect</em> hashing function for all the possible 26<sup>5</sup> words.
Next, we compute the coefficients of (x-h(w<sub>1</sub>))...(x-h(w<sub>n</sub>)) where w<sub>1</sub>...w<sub>n</sub> are the words in our dictionary.
When presented with a guess, we can evaluate the polynomial at the hash of the guess and check whether the answer is zero or not.
</p>
<p>
In my experience, this perfect hashing function can be found pretty quickly, often in less than 5 trials.
Moreover, computing the polynomial's coefficients by expanding its factors can be done in quadratic time in the size of the dictionary.
And the evaluation can be done in linear time in the size of the dictionary.
</p>
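<p>
Here is a small sketch of the membership polynomial. The function names, the toy prime 101, and the made-up hash values are assumptions for illustration; the real construction would use the hashes of the dictionary words and a much larger prime.
</p>

```python
def poly_from_roots(roots, p):
    """Coefficients (lowest degree first) of prod (x - r) over Z_p. Quadratic time."""
    coeffs = [1]
    for r in roots:
        nxt = [0] * (len(coeffs) + 1)
        for k, c in enumerate(coeffs):
            nxt[k + 1] = (nxt[k + 1] + c) % p  # contribution of c * x
            nxt[k] = (nxt[k] - r * c) % p      # contribution of c * (-r)
        coeffs = nxt
    return coeffs


def poly_eval(coeffs, x, p):
    """Horner evaluation in Z_p. Linear time."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc


P = 101                   # toy prime, stands in for the real modulus
hashes = [3, 17, 42, 99]  # pretend these are the hashes of the dictionary words
coeffs = poly_from_roots(hashes, P)
print([poly_eval(coeffs, h, P) for h in hashes])  # [0, 0, 0, 0]
print(poly_eval(coeffs, 5, P) != 0)               # True: 5 is not a root
```

<p>
Because <strong>Z</strong><sub>p</sub> is a field, a product of non-zero factors is never zero, so only the chosen hashes evaluate to 0.
</p>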
<p>
Since our domain must be <strong>Z</strong><sub>p</sub>, and we must support encoding 26<sup>5</sup> words, we'll choose the prime 11881379.
This is the first prime after 26<sup>5</sup>.
So all our numbers, hashes, and coefficients can be stored in log<sub>2</sub>(11881379) = 24 bits.
</p>
<p>
Next problem: how can we know whether the first letter is correct or exists in the answer?
I propose to encode all these properties in a single small number.
This is the scheme I adopted.
Let's work in base 3, where 0 means a letter is correctly placed, 1 means the letter exists in the answer, and 2 means that the letter does not exist.
For example, if today's word is once again "ghost," the guess "chogs" is encoded as 20011 in base 3 or 166 in base 10.
</p>
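<p>
The encoding can be sketched in a few lines of Python. The function name <code>compare</code> is my choice, and the sketch follows the simple rule stated above literally, so it does no special handling of repeated letters.
</p>

```python
def compare(guess, answer):
    """Encode the feedback for `guess` as a base-3 number, one digit per letter.

    0: letter is correctly placed, 1: letter occurs in the answer, 2: letter is absent.
    """
    code = 0
    for g, a in zip(guess, answer):
        if g == a:
            digit = 0
        elif g in answer:
            digit = 1
        else:
            digit = 2
        code = code * 3 + digit
    return code


print(compare("chogs", "ghost"))  # 166, i.e. 20011 in base 3
print(compare("ghost", "ghost"))  # 0: every letter is correctly placed
```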
<p>
So we can construct another polynomial that interpolates the hash of every dictionary word and its encoding.
However, this time, we need a perfect hashing function not for 26<sup>5</sup> words but for 13K words.
Luckily 13001 is a nice close prime number.
This means all our numbers for this polynomial will fit in log<sub>2</sub>(13001) = 14 bits.
</p>
<p>
This interpolation can be trivially done using Lagrange interpolation in cubic time.
Which in practice will take a long time, hardly less than half an hour.
But it can be squeezed down to a quadratic time which is just fast enough.
</p>
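<p>
One way to get the quadratic running time, sketched below under my own naming, is to expand the product of all the (x - x<sub>i</sub>) factors once and then obtain each Lagrange basis polynomial by a linear-time synthetic division. This is a sketch with a toy prime, not necessarily the exact code I ran.
</p>

```python
def poly_eval(coeffs, x, p):
    """Horner evaluation in Z_p (coefficients lowest degree first)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % p
    return acc


def interpolate(points, p):
    """Coefficients of the polynomial through `points` over Z_p, in O(n^2).

    Assumes the x values are pairwise distinct modulo the prime p.
    """
    n = len(points)
    # Master polynomial M(x) = prod (x - x_i).
    master = [1]
    for r, _ in points:
        nxt = [0] * (len(master) + 1)
        for k, c in enumerate(master):
            nxt[k + 1] = (nxt[k + 1] + c) % p
            nxt[k] = (nxt[k] - r * c) % p
        master = nxt
    result = [0] * n
    for xi, yi in points:
        # Synthetic division: basis(x) = M(x) / (x - xi).
        basis = [0] * n
        basis[n - 1] = master[n]
        for k in range(n - 1, 0, -1):
            basis[k - 1] = (master[k] + xi * basis[k]) % p
        # basis(xi) equals prod_{j != i} (xi - xj), the Lagrange denominator.
        denom = poly_eval(basis, xi, p)
        scale = yi * pow(denom, p - 2, p) % p  # inverse via Fermat's little theorem
        for k in range(n):
            result[k] = (result[k] + scale * basis[k]) % p
    return result


P = 13                              # toy prime
points = [(1, 5), (2, 3), (4, 6)]   # made-up (hash, encoding) pairs
coeffs = interpolate(points, P)
print([poly_eval(coeffs, x, P) for x, _ in points])  # [5, 3, 6]
```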
<p>
The first polynomial has 13K coefficients taking each 24 bits.
The second one also has 13K coefficients, but each takes 14 bits.
Total: 61.75KB.
This is a reduction by a factor of almost 6 from the Bloom filter solution!
Putting us at roughly a third of the size of Wordle.
</p>
<h3>Hashes to Hashes</h3>
<p>
The motivation is as follows: there is still some waste in the finite field approach; the result of computing our second polynomial is 14 bits long, yet we only need 8 bits to store the biggest encoding (22222 in ternary = 242 in decimal).
How to represent a function with a domain of 14 bits and a codomain of 8 bits?
Hash-tables, D'uh!
And do we really need that first polynomial now?
Can't we just map these 24 bits directly to these 8 bits?
</p>
<p>
Of course, we can!
But there's a catch, we should salt these 8 bits.
Otherwise, we'd be vulnerable to a new attack.
The number of words that hash into 21112 in ternary (203 in decimal) is a good enough hint to the answer.
For example, if the attacker sees 103 numbers that hash into 203, they'll deduce that the answer must be one of only 31 possible words (tiara, angry, evoke, amigo, latex, to list a few).
If they spot 27 203s, then the range of possible words is 174.
If that number turns out to be still too large for the attacker, they can look at how many hashes map into 21211, for example, and take the intersection of those two lists.
In this fashion, an attacker can avoid a full dictionary attack. Salting the property with the hash of the word (using a different hashing function than the key) eliminates this attack.
</p>
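<p>
To illustrate the counting attack on an unsalted table, here is a toy sketch. The four-word dictionary and the helper names are made up, and the attacker is assumed to know the candidate words; the point is only that the number of entries carrying a given property already narrows down the answer.
</p>

```python
from collections import Counter


def compare(guess, answer):
    """Base-3 feedback: 0 correct place, 1 occurs elsewhere, 2 absent."""
    code = 0
    for g, a in zip(guess, answer):
        code = code * 3 + (0 if g == a else 1 if g in answer else 2)
    return code


def pattern_counts(dictionary, answer):
    """How many dictionary words map to each property for a given answer."""
    return Counter(compare(w, answer) for w in dictionary)


def candidates(dictionary, pattern, observed_count):
    """Answers for which exactly `observed_count` words map to `pattern`."""
    return [a for a in dictionary
            if pattern_counts(dictionary, a)[pattern] == observed_count]


words = ["ghost", "chogs", "honey", "glass"]
leaked = pattern_counts(words, "ghost")  # what an unsalted table would reveal
print(candidates(words, 166, leaked[166]))  # ['ghost']
```

<p>
Once the stored values are salted, the attacker can no longer group entries by property, so these counts are not available.
</p>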
<style>code pre { line-height: 1rem !important; }</style>
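The whole construction fits neatly in this seven-line Python script (hash(w, k) is assumed to be a keyed hash function and compare(w, answer) the base-3 encoding from the previous section; neither is defined here):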
<code class="block">
<pre>import random</pre>
<pre>def n_bits(num, n): return num & ((1<<n)-1)</pre>
<pre>words = list(filter(lambda x: len(x)==5, map(lambda x: x[:-1], open("dict.txt").readlines())))</pre>
<pre>k1 = random.randint(0, 1<<25)</pre>
<pre>k2 = random.randint(0, 1<<9)</pre>
<pre>answer = random.choice(words)</pre>
<pre>table = { n_bits(hash(w,k1), 24): compare(w,answer) + n_bits(hash(w,k2), 8) for w in words }</pre>
</code>
<p>
The code above clearly computes the hash table in one swoop over the dictionary.
So its runtime is linear.
</p>
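<p>
For readers who want something runnable, here is a fuller sketch of the same idea together with the lookup a client would perform. Everything specific in it is an assumption of mine: the keyed hash is BLAKE2b truncated to the needed number of bits, the salt is added modulo 256, and the word list is a toy. It also only checks for collisions among dictionary words, so a non-word could still land on an occupied slot.
</p>

```python
import hashlib
import os


def keyed_hash(word, key, bits):
    """Hypothetical keyed hash: BLAKE2b of the word, truncated to `bits` bits."""
    digest = hashlib.blake2b(word.encode(), key=key, digest_size=8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def compare(guess, answer):
    """Base-3 feedback: 0 correct place, 1 occurs elsewhere, 2 absent."""
    code = 0
    for g, a in zip(guess, answer):
        code = code * 3 + (0 if g == a else 1 if g in answer else 2)
    return code


def build_table(words, answer, k1, k2):
    """Map 24-bit word hashes to salted 8-bit properties; None on a collision."""
    table = {}
    for w in words:
        slot = keyed_hash(w, k1, 24)
        if slot in table:
            return None  # not a perfect hash: the caller retries with new keys
        table[slot] = (compare(w, answer) + keyed_hash(w, k2, 8)) % 256
    return table


def lookup(table, guess, k1, k2):
    """Property of `guess`, or None if its slot is empty (not a dictionary word)."""
    slot = keyed_hash(guess, k1, 24)
    if slot not in table:
        return None
    return (table[slot] - keyed_hash(guess, k2, 8)) % 256


words = ["ghost", "glass", "phase", "honey", "octal", "state", "taunt"]
answer = "ghost"
table = None
while table is None:  # draw keys until the hash is collision-free on the dictionary
    k1, k2 = os.urandom(16), os.urandom(16)
    table = build_table(words, answer, k1, k2)
print(lookup(table, "ghost", k1, k2))  # 0: every letter is correctly placed
```

<p>
Since the largest property is 242, adding and removing the salt modulo 256 round-trips without loss.
</p>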
<p>
What about memory consumption? The key of the hash table is 24 bits long, and the values are 8 bits long.
So for every entry, we only need 32 bits.
In other words, one integer!
How elegant!
In total, our hashtable will be 52KB big.
That's slightly less than the 58.3KB gzipped Wordle code.
</p>
<h2>Conclusion</h2>
<p>
It's possible to have versions of Wordle that do not expose either the dictionary or the word of the day to the player.
The downside of the Bloom filter approach, besides its relatively porky size, is its probabilistic nature.
So extra work has to be done to check that no player will complain.
</p>
<p>
Although the latter two methods are not probabilistic, they rely on perfect hashes.
This undoubtedly means extra work.
But where they shine is in their slim and graceful size.
</p>
<p>
Earlier I said these schemes provide almost-complete secrecy about the dictionary because there was one property that I couldn't hide: the size of the dictionary.
Here, the Bloom filter has the advantage because the number of elements inside it can only be approximated.
This parameter could be a clue for the attacker to factor in his evil scheme.
But I believe that this can be fixed by randomly adding bogus elements that map to invalid properties to the data structures, making them appear larger than they actually are.
However, this entails more bytes to be downloaded.
</p>
https://blog.grgz.me/posts/crack-wordle.html
Wed, 09 Feb 2022
On Computing Derivatives
<script type="text/x-mathjax-config">
MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}});
</script>
<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<p>
In this blog post, I review techniques for computing $\frac{\partial}{\partial x_i} f$ for $f: \mathbb{R}^n \to \mathbb{R}^m$.
Broadly, there are three techniques: numerical, symbolic, and automatic differentiation.
Among them, the most remarkable and useful is the Automatic Differentiation (AD) technique in its reverse mode (or adjoint mode).
I will show you how, using first principles, you could've derived it yourself.
(Do not expect code in this blog post, but expect a lot of equations.)
</p>
<p>
In practice, every time you face an optimization problem, you will have to compute a derivative.
Gradient-descent techniques require the first-order derivative, and Newton's method requires a second-order one.
If you still need motivation to carry on, consider that training a neural network is an optimization problem.
The de-facto neural network training algorithm is Backpropagation.
It starts by computing its derivative, updating the weights in reverse, rinsing, and repeating...
And where does AD come into play?
Well, Backpropagation is a special case of AD.
It's also precisely this reverse-mode AD I mentioned earlier.
And what's more, Backpropagation is an optimized way of implementing reverse-mode AD.
So stick around if you want to know its origins and why you propagate backward.
</p>
Some of my sources and recommended readings are:
<ul>
<li><a href="https://www.jmlr.org/papers/v18/17-468.html" target="_blank">Automatic Differentiation in Machine Learning: a Survey</a>
by Baydin, Pearlmutter, Radul, and Siskind, in the 2018 Journal of Machine Learning Research (JMLR'18)</li>
<li><a href="https://www.cs.princeton.edu/courses/archive/fall19/cos597C/files/wengert1964.pdf" target="_blank">A Simple Automatic Derivative Evaluation Program</a>
by R. E. Wengert in Communications of the ACM, Volume 7, Issue 8, Aug. 1964, doi: 10.1145/355586.364791</li>
<li><a href="https://rufflewind.com/2016-12-30/reverse-mode-automatic-differentiation" target="_blank">Reverse-mode automatic differentiation: a tutorial</a> by Rufflewind</li>
<li>Some old notes for courses I've assisted in.</li>
</ul>
<br>
<h2>Numerical Differentiation</h2>
<p>
First, a foreword: In this section, I will only consider $f: \mathbb{R} \to \mathbb{R}$ that is once, twice, or four times differentiable.
But the techniques are easily generalizable to any $f: \mathbb{R}^n \to \mathbb{R}^m$.
</p>
<p>
To compute derivatives the first idea is to use the mathematical definition of the derivative
$$ \frac{d}{dx} f(x) = \lim_{h \to 0} \frac{f(x+h)-f(x)}{h} $$
We can choose a small $\varepsilon > 0$ and compute $\frac{f(x+\varepsilon)-f(x)}{\varepsilon}$.
</p>
<p>
This is an approximation and not an exact value, which raises the question: how good is this approximation?
Using Taylor's theorem, we can answer that question.
We know that
$$ f(x) = f(x_0) + f'(x_0) (x-x_0) + O\left((x-x_0)^2\right) $$
This $O(\cdot)$ is the big-O notation that we computer scientists know and love.
Setting $x \to x+\varepsilon$, and $x_0 \to x$, we get
$$ f(x+\varepsilon) = f(x) + f'(x)\varepsilon + O(\varepsilon^2) $$
and after rearranging:
$$ f'(x) = \frac{f(x+\varepsilon)-f(x)}{\varepsilon} + O(\varepsilon) $$
which means that our error is proportional to $\varepsilon$;
cutting $\varepsilon$ in half will cut the error in half too.
</p>
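<p>
As a quick sanity check on a function chosen purely for illustration, take $f(x) = x^2$, whose derivative is $2x$:
$$ \frac{(x+\varepsilon)^2 - x^2}{\varepsilon} = \frac{2x\varepsilon + \varepsilon^2}{\varepsilon} = 2x + \varepsilon $$
The approximation misses $f'(x) = 2x$ by exactly $\varepsilon$, in line with the $O(\varepsilon)$ bound.
</p>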
<p>
Can we approximate better?
Luckily, a fancy trick with a fancy name comes to the rescue: central differencing scheme.
It tells us to use the following approximation:
$$ f'(x) \approx \frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon} $$
To analyze its error, we use Taylor's theorem twice.
Once with $x \to x+\varepsilon$ and $x_0 \to x$, and once again with $x \to x-\varepsilon$ and $x_0 \to x$.
This leads to these two identities:
$$ f(x+\varepsilon) = f(x) + f'(x)\varepsilon + \frac{f''(x)}{2}\varepsilon^2 + O(\varepsilon^3) $$
$$ f(x-\varepsilon) = f(x) - f'(x)\varepsilon + \frac{f''(x)}{2}\varepsilon^2 + O(\varepsilon^3) $$
Subtracting the second identity from the first and rearranging we get:
$$ f'(x) = \frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon} + O(\varepsilon^2)$$
which means that our error decreases quadratically with $\varepsilon$;
cutting $\varepsilon$ in half will cut the error by a factor of four!
</p>
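<p>
Again as an illustrative check, take $f(x) = x^3$, whose derivative is $3x^2$:
$$ \frac{(x+\varepsilon)^3 - (x-\varepsilon)^3}{2\varepsilon} = \frac{6x^2\varepsilon + 2\varepsilon^3}{2\varepsilon} = 3x^2 + \varepsilon^2 $$
Here the approximation is off by exactly $\varepsilon^2$, so halving $\varepsilon$ divides the error by four.
</p>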
<p>
Can we still do better?
I will guide you through a scheme that has an error of $O(\varepsilon^4)$ which I hope will show you how to generalize this technique without having me spell it out.
Somehow symmetry plays a role in improving the error, so inspired by the central differencing scheme, let's apply Taylor's theorem at these four points:
$x-2\varepsilon$, $x-\varepsilon$, $x+\varepsilon$, and $x + 2\varepsilon$.
We get:
$$ f(x+2\varepsilon) = f(x) + 2f'(x)\varepsilon + 2f''(x)\varepsilon^2 + \frac{4f'''(x)}{3}\varepsilon^3 + \frac{2f''''(x)}{3}\varepsilon^4 + O(\varepsilon^5) $$
$$ f(x+\varepsilon) = f(x) + f'(x)\varepsilon + \frac{f''(x)}{2}\varepsilon^2 + \frac{f'''(x)}{6}\varepsilon^3 + \frac{f''''(x)}{24}\varepsilon^4 + O(\varepsilon^5) $$
$$ f(x-\varepsilon) = f(x) - f'(x)\varepsilon + \frac{f''(x)}{2}\varepsilon^2 - \frac{f'''(x)}{6}\varepsilon^3 + \frac{f''''(x)}{24}\varepsilon^4 + O(\varepsilon^5) $$
$$ f(x-2\varepsilon) = f(x) - 2f'(x)\varepsilon + 2f''(x)\varepsilon^2 - \frac{4f'''(x)}{3}\varepsilon^3 + \frac{2f''''(x)}{3}\varepsilon^4 + O(\varepsilon^5) $$
Let's multiply the first by $-\frac{1}{12}$, the second by $\frac{8}{12}$, the third by $-\frac{8}{12}$, and the fourth by $\frac{1}{12}$.
Then if we add them all together and rearrange, we get:
$$ f'(x) = \frac{f(x-2\varepsilon)-8f(x-\varepsilon)+8f(x+\varepsilon)-f(x+2\varepsilon)}{12\varepsilon} + O(\varepsilon^4)$$
meaning that cutting $\varepsilon$ in half will cut the error by a factor of sixteen!
</p>
<p>
If you can figure out where these $\pm\frac{1}{12}$ and $\pm\frac{8}{12}$ magic numbers come from then you can generalize this technique to your heart's desire and reduce the error as much as you want; as low as $O(\varepsilon^{64})$ if you choose to.
The astute reader can generalize this technique even further to reach <em>Richardson extrapolation</em>.
</p>
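<p>
Here is a hedged Python sketch of the four-point scheme, using the weights $\mp\frac{1}{12}$ and $\pm\frac{8}{12}$, which cancel the $\varepsilon^3$ terms and leave $12f'(x)\varepsilon$ in the sum.
The name <code>five_point_diff</code> (the usual name of this stencil, which counts the unused centre point), the test function $sin$, and the step sizes are my own choices.
</p>

```python
import math

def five_point_diff(f, x, eps):
    """Fourth-order approximation of f'(x) from four symmetric samples."""
    return (f(x - 2 * eps) - 8 * f(x - eps)
            + 8 * f(x + eps) - f(x + 2 * eps)) / (12 * eps)

# Illustrative choice: f = sin, whose exact derivative is cos.
x = 1.0
exact = math.cos(x)
err_coarse = abs(five_point_diff(math.sin, x, 0.1) - exact)
err_fine = abs(five_point_diff(math.sin, x, 0.05) - exact)

# Halving eps should divide the error by roughly sixteen.
ratio = err_coarse / err_fine
print(ratio)
```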
<p>
The first criticism is that this scheme will always yield an approximation and never an exact value.
That criticism becomes critical when we consider actual implementations using floating-point numbers:
If we choose $\varepsilon$ to be small while we're interested in computing the derivative at a relatively large $x$ (orders of magnitude larger than $\varepsilon$) then we'll have to add a small number to a large number: the first capital sin.
Moreover, if our function is smooth enough, then $f(x-\varepsilon)$ and $f(x+\varepsilon)$ will be close to each other, and we end up subtracting numbers of a similar magnitude: the second capital sin.
Both are well-known problems that should be avoided when dealing with floating-point arithmetic.
These two problems and some more are highlighted in <a href="https://www.soa.org/news-and-publications/newsletters/compact/2014/may/com-2014-iss51/losing-my-precision-tips-for-handling-tricky-floating-point-arithmetic/" target="_blank">Losing My Precision</a>.
And, if that is not enough, choosing too small an $\varepsilon$ leads us to commit Deadly Sin #4: Ruination by Rounding (explained in McCartin's <em>Seven Deadly Sins of Numerical Computation</em> (1998), doi:10.1080/00029890.1998.12004989, which you should not use sci-hub to read for free).
In short, a tiny $\varepsilon$ reduces the truncation error (the approximation error) but inflates the round-off error (the rounding error due to floating-point arithmetic)!
And to make matters even worse, quoting from that same paper, "The perplexing facet of this phenomenon is that the precise determination of $h_{opt}$ [the optimal $\varepsilon$ that reaches an equilibrium between truncation and round-off errors] is impossible in practice."
</p>
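<p>
The tug-of-war between truncation and round-off is easy to observe.
The sketch below sweeps $\varepsilon$ over powers of ten for the one-sided difference; the function $sin$, the point $x = 1$, and the range of the sweep are arbitrary choices of mine, and the exact numbers printed will depend on your floating-point hardware.
</p>

```python
import math

def forward_diff(f, x, eps):
    """One-sided approximation (f(x + eps) - f(x)) / eps of f'(x)."""
    return (f(x + eps) - f(x)) / eps

x = 1.0
exact = math.cos(x)

# Absolute error for eps = 1e-1, 1e-2, ..., 1e-15.
errors = {}
for k in range(1, 16):
    eps = 10.0 ** -k
    errors[eps] = abs(forward_diff(math.sin, x, eps) - exact)

best_eps = min(errors, key=errors.get)
for eps, err in errors.items():
    print(f"eps = {eps:.0e}   error = {err:.3e}")
print("smallest error at eps =", best_eps)
```

<p>
The error first shrinks as $\varepsilon$ shrinks, then grows again once rounding dominates, so the best $\varepsilon$ sits somewhere in the middle of the sweep.
</p>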
<p>
The second criticism is that computing $f'(x)$ (for an error of $O(\varepsilon^4)$) calls the function $f$ four times.
If $f$ is expensive to call, computing the derivative becomes quadruply expensive.
What if $f$ were a deep neural network with millions of inputs?
In other words, if $f: \mathbb{R}^{1,000,000} \to \mathbb{R}$, then learning via gradient descent requires computing $\frac{\partial}{\partial x_i}f$ for all million parameters, resulting in 4,000,000 evaluations of the neural network for one optimization iteration.
</p>
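<p>
To make the cost concrete, here is a small sketch that counts calls while computing a numerical gradient with the four-point scheme.
The wrapper <code>Counted</code>, the function <code>numerical_gradient</code>, and the toy function (a sum of squares) are hypothetical names and choices made for illustration.
</p>

```python
class Counted:
    """Wraps a function and counts how many times it is called."""
    def __init__(self, f):
        self.f = f
        self.calls = 0

    def __call__(self, xs):
        self.calls += 1
        return self.f(xs)

def numerical_gradient(f, xs, eps=1e-3):
    """Four-point estimate of every partial derivative of f at xs."""
    grad = []
    for i in range(len(xs)):
        def shifted(k):
            ys = list(xs)
            ys[i] += k * eps          # move only the i-th coordinate
            return f(ys)
        grad.append((shifted(-2) - 8 * shifted(-1)
                     + 8 * shifted(1) - shifted(2)) / (12 * eps))
    return grad

counted = Counted(lambda xs: sum(x * x for x in xs))
grad = numerical_gradient(counted, [1.0, 2.0, 3.0])
print(grad, counted.calls)            # 4 calls per input
```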
<br>
<h2>Symbolic Differentiation</h2>
<p>
The previous section ended with the two downsides of numerical methods:
they only approximate, and they perform multiple evaluations per input variable.
Symbolic Differentiation attempts to eliminate both concerns, with a slight caveat mentioned at the end.
Whereas numerical methods are done at runtime, symbolic differentiation is (generally) done at compile-time.
</p>
<p>
Numerical methods were inspired by the definition of the derivative.
Symbolic methods are inspired by the specific definition of some derivatives.
What do I mean?
We all learned in high school that $\left(x^n\right)' = nx^{n-1}$, and that $(uv)' = u'v + uv'$, etc...
Notice that these definitions do not use any $\varepsilon$ and are exact.
We use $=$ not $\approx$.
So let's make use of them, alongside the insight that a program is just a complicated composition of basic ideas like multiplication, addition, composition, etc... whose derivatives we know.
</p>
<p>
Given a program encoding a mathematical expression, a compiler manipulates its symbols (much like a high-school student blindly applying the rules of differentiation) to generate another program that computes the derivative exactly.
</p>
<style>code pre { line-height: 1rem !important; }</style>
In a basic programming language with a syntax defined like:
<code class="block">
<pre>t ::= c (c is a real number)</pre>
<pre> | x (variables)</pre>
<pre> | t1 + t2</pre>
<pre> | t1 - t2</pre>
<pre> | t1 * t2</pre>
<pre> | t ^ c (to a power of a constant)</pre>
<pre> | log t | exp t | sin t | cos t</pre>
</code>
We can define a recursive function <code>d</code> which given the variable <code>x</code> to differentiate with respect to, operates on a syntax tree to produce another syntax tree representing the derivative.
In an ML-like syntax we can define this function like so:
<code class="block">
<pre>d x c = 0</pre>
<pre>d x x = 1</pre>
<pre>d x y = 0</pre>
<pre>d x (t1 + t2) = (d x t1) + (d x t2)</pre>
<pre>d x (t1 - t2) = (d x t1) - (d x t2)</pre>
<pre>d x (t1 * t2) = t1 * (d x t2) + (d x t1) * t2</pre>
<pre>d x (t ^ c) = c * t ^ (c-1) * (d x t)</pre>
<pre>d x (log t) = (d x t) * t ^ -1</pre>
<pre>d x (exp t) = (d x t) * (exp t)</pre>
<pre>d x (sin t) = (d x t) * (cos t)</pre>
<pre>d x (cos t) = (d x t) * -1 * (sin t)</pre>
</code>
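<p>
For readers who want something executable, the rules above can be sketched in Python.
The encoding is an assumption of mine: constants are numbers, variables are strings, and every other term is a tuple such as <code>("+", t1, t2)</code> or <code>("sin", t)</code>.
The helper <code>evaluate</code> is likewise hypothetical and exists only to check the result numerically.
</p>

```python
import math

def d(x, t):
    """Syntax tree of the derivative of term t with respect to variable x."""
    if isinstance(t, (int, float)):
        return 0                                    # d x c = 0
    if isinstance(t, str):
        return 1 if t == x else 0                   # d x x = 1, d x y = 0
    op = t[0]
    if op in ("+", "-"):
        return (op, d(x, t[1]), d(x, t[2]))
    if op == "*":                                   # product rule
        return ("+", ("*", t[1], d(x, t[2])), ("*", d(x, t[1]), t[2]))
    if op == "^":                                   # power of a constant
        return ("*", ("*", t[2], ("^", t[1], t[2] - 1)), d(x, t[1]))
    if op == "log":
        return ("*", d(x, t[1]), ("^", t[1], -1))
    if op == "exp":
        return ("*", d(x, t[1]), t)
    if op == "sin":
        return ("*", d(x, t[1]), ("cos", t[1]))
    if op == "cos":
        return ("*", ("*", d(x, t[1]), -1), ("sin", t[1]))
    raise ValueError(f"unknown operator {op!r}")

BINARY = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
          "*": lambda a, b: a * b, "^": lambda a, b: a ** b}
UNARY = {"log": math.log, "exp": math.exp, "sin": math.sin, "cos": math.cos}

def evaluate(t, env):
    """Numeric value of term t, looking variables up in the dict env."""
    if isinstance(t, (int, float)):
        return t
    if isinstance(t, str):
        return env[t]
    if t[0] in BINARY:
        return BINARY[t[0]](evaluate(t[1], env), evaluate(t[2], env))
    return UNARY[t[0]](evaluate(t[1], env))

expr = ("*", "x", ("sin", ("cos", "x")))            # x * sin(cos(x))
derivative = d("x", expr)
print(derivative)
print(evaluate(derivative, {"x": 0.5}))
```

<p>
The tree that comes out is unsimplified (it is full of multiplications by 1 and additions of 0), but evaluating it gives the same number as the closed-form derivative.
</p>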
<p>
Some languages already have such a built-in construct.
In Maple for example you can do <code>diff(x*sin(cos(x)), x)</code> and get $sin(cos(x))−xsin(x)cos(cos(x))$ back.
You can also take higher-order derivatives, like <code>diff(sin(x), x$3)</code>, which is the same as <code>diff(sin(x), x, x, x)</code>, meaning $\frac{\partial^3}{\partial x^3}sin(x)$, and get back $-cos(x)$.
The documentation of this <code>diff</code> function can be found <a href="https://www.maplesoft.com/support/help/maple/view.aspx?path=diff" target="_blank">here</a>.
But by far the most popular, and one I feel sorry for you, dear reader, if you didn't use during your calculus course, is <a href="https://reference.wolfram.com/language/ref/Derivative.html" target="_blank">Mathematica's</a>, which is exposed through a web interface on WolframAlpha.
As an example, if we run the query <a href="https://www.wolframalpha.com/input/?i=derivative+of+x*sin%28cos%28x%29%29" target="_blank">"derivative of x*sin(cos(x))"</a> on wolframalpha.com we get back the answer <code>-x cos(cos(x)) sin(x) + sin(cos(x))</code>.
Some other languages are expressive enough to do symbolic differentiation through libraries.
As an example, you can consult Julia's <a href="https://github.com/JuliaSymbolics/Symbolics.jl" target="_blank">Symbolics.jl</a>.
</p>
<p>
Programs are usually more complicated than simple mathematical expressions.
Heck, mathematical expressions can be more complicated than those expressible by our simple language.
Definitions can include recursive calls, branching, a change of variables, or aliasing.
Let's see how to add support for aliasing, i.e. variable definitions, i.e. let expressions of the form <code>let x = t1 in t2</code>.
Here's a simple example of a program whose derivative (with respect to x) we'd like to take:
</p>
<code class="block" style="margin:-10px 0px">
<pre>let y = 2*x*x in</pre>
<pre>y + cos(y)</pre>
</code>
<p>
What could its derivative be?
Mathematically we have two definitions: $y(x) = 2x^2$, and $f(x) = y(x) + cos(y(x))$.
The derivative $\frac{df}{dx}$ is $y'(x) - y'(x) * sin(y(x))$ which depends on $y(x)$ and on a new function we call $y'(x)$.
And we know that $y'(x) = \frac{dy}{dx} = 4x$.
We could express that in code as follows:
</p>
<code class="block" style="margin:-10px 0px">
<pre>let y = 2*x*x in</pre>
<pre>let y' = 4*x in</pre>
<pre>y' - y' * sin(y)</pre>
</code>
<p>
We need to modify our <code>d x t</code> function to allow <code>let y = t1 in t2</code>.
It's important to remember that our expression may depend on y and its derivative.
So, deriving a let-expression must keep the variable binding as-is, introduce the variable's derivative, and derive the body.
The first obvious modification is:
</p>
<code class="block" style="margin:-10px 0px">
<pre>d x (let y = t1 in t2) =</pre>
<pre> let y = t1 in</pre>
<pre> let y' = (d x t1) in</pre>
<pre> (d x t2)</pre>
</code>
<p>
Of course, we assume that <code>y'</code> is a fresh variable that isn't introduced, nor mentioned, in the body of the let-expression.
But that is not all.
What will <code>d x (y + cos(y))</code> yield?
Our <code>d x y</code> rule says that the result should be 0, so the body will evaluate to <code>d x (y + cos(y)) = 0 - 0 * sin(y)</code>.
Not good.
Luckily the fix is simple.
Instead of returning 0, we should return the derivative of the variable (which must have been introduced in an earlier let-expression in a term only free in x).
So the second modification should be:
</p>
<code class="block" style="margin:-10px 0px">
<pre>d x y = y'</pre>
</code>
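<p>
Both modifications can be sketched in Python for a small subset of the language (constants, variables, <code>+</code>, <code>-</code>, <code>*</code>, <code>sin</code>, <code>cos</code>, and <code>let</code>).
The tuple encoding, with <code>("let", y, t1, t2)</code> standing for <code>let y = t1 in t2</code>, and the helper <code>evaluate</code> are assumptions of mine.
The sketch also assumes that every variable other than <code>x</code> is let-bound and that appending <code>'</code> to a name yields a fresh variable.
</p>

```python
import math

def d(x, t):
    """Derivative of term t with respect to x, with support for let."""
    if isinstance(t, (int, float)):
        return 0
    if isinstance(t, str):
        # Any variable other than x is assumed to be let-bound, so its
        # derivative is the primed variable that the let rule introduces.
        return 1 if t == x else t + "'"
    op = t[0]
    if op == "let":
        _, y, t1, t2 = t
        # Keep the binding, bind the derivative, then derive the body.
        return ("let", y, t1, ("let", y + "'", d(x, t1), d(x, t2)))
    if op in ("+", "-"):
        return (op, d(x, t[1]), d(x, t[2]))
    if op == "*":
        return ("+", ("*", t[1], d(x, t[2])), ("*", d(x, t[1]), t[2]))
    if op == "sin":
        return ("*", d(x, t[1]), ("cos", t[1]))
    if op == "cos":
        return ("*", ("*", d(x, t[1]), -1), ("sin", t[1]))
    raise ValueError(f"unknown operator {op!r}")

def evaluate(t, env):
    """Numeric value of term t, looking variables up in the dict env."""
    if isinstance(t, (int, float)):
        return t
    if isinstance(t, str):
        return env[t]
    op = t[0]
    if op == "let":
        _, y, t1, t2 = t
        return evaluate(t2, {**env, y: evaluate(t1, env)})
    if op == "+":
        return evaluate(t[1], env) + evaluate(t[2], env)
    if op == "-":
        return evaluate(t[1], env) - evaluate(t[2], env)
    if op == "*":
        return evaluate(t[1], env) * evaluate(t[2], env)
    return {"sin": math.sin, "cos": math.cos}[op](evaluate(t[1], env))

# let y = 2*x*x in y + cos(y)
program = ("let", "y", ("*", ("*", 2, "x"), "x"), ("+", "y", ("cos", "y")))
result = evaluate(d("x", program), {"x": 0.7})
print(result)
```

<p>
At $x = 0.7$ the result should agree with $4x - 4x \cdot sin(2x^2)$.
</p>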
<p>
I tried to search for a language that allows you to take the derivative of arbitrary expressions, but I came up empty-handed.
By arbitrary expression I mean one where computation and control-flow are mixed in.
For example, I know of no library that can differentiate the following function:
</p>
<code class="block" style="margin:-10px 0px">
<pre>real foo(real x1, real x2) {</pre>
<pre> let f = lambda g: g(lambda y: y^2 + x2);</pre>
<pre> let z = x1*sin(x2)/log(x1^2);</pre>
<pre> if (x1 < 0) return foo(-x1, z);</pre>
<pre> else return z*f(lambda h: h(x1-cos(x2)));</pre>
<pre>}</pre>
</code>
<p>
Which is a complicated way of computing this function:
$$ f(x_1, x_2) = \begin{cases}
f\left(-x_1, \frac{x_1sin(x_2)}{log(x_1^2)}\right) &\mbox{if } x_1 < 0\\
\frac{x_1sin(x_2)}{log(x_1^2)}\left((x_1-cos\;x_2)^2+x_2\right) &\mbox{otherwise}
\end{cases} $$
</p>
<p>
There is a downside to using symbolic derivation: it goes by the name <em>expression swell</em>.
It's due to the large sub-expression that the derivative of a product produces.
As an example, deriving the product $x(2x+1)(x-1)^2$ produces the expression $(2x+1)(x-1)^2+2x(x-1)^2+2x(2x+1)(x-1)$.
Multiplying the original expression by $(3-x)$ before deriving produces the even larger expression
$$(2x+1)(x-1)^2(3-x)+2x(x-1)^2(3-x)+2x(2x+1)(x-1)(3-x)-x(2x+1)(x-1)^2$$
Of course, such an expression can be reduced to $-10x^4+36x^3-27x^2-2x+3$, but that may not always be possible.
Consider the modified expression
$$cos(x)(2sin(x)+1)(log(x)-1)^2(3-e^x)$$
whose derivative cannot be reduced.
</p>
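<p>
A rough way to watch the swell happen is to count syntax-tree nodes.
The sketch below is restricted to terms built from numbers, <code>"x"</code>, <code>+</code>, and <code>*</code>; the helpers <code>size</code> and <code>product</code> and the family of products $(x+1)(x+2)\cdots(x+k)$ are my own illustrative choices, and no simplification is attempted.
</p>

```python
def d(t):
    """Derivative with respect to "x" of a term built from numbers, "x", + and *."""
    if t == "x":
        return 1
    if not isinstance(t, tuple):
        return 0
    op, a, b = t
    if op == "+":
        return ("+", d(a), d(b))
    return ("+", ("*", a, d(b)), ("*", d(a), b))    # product rule

def size(t):
    """Number of nodes in the syntax tree t."""
    return 1 + size(t[1]) + size(t[2]) if isinstance(t, tuple) else 1

def product(k):
    """The term (x + 1) * (x + 2) * ... * (x + k)."""
    t = ("+", "x", 1)
    for c in range(2, k + 1):
        t = ("*", t, ("+", "x", c))
    return t

for k in (2, 4, 8, 16):
    print(k, size(product(k)), size(d(product(k))))
```

<p>
The product grows linearly with the number of factors, while its unsimplified derivative grows quadratically, because each application of the product rule copies the whole left operand.
</p>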
<p>
So even if symbolic differentiation produces an exact value and only requires the evaluation of one function per input to compute the derivative,
this derivative function may be more computationally expensive than the original function.
That is the caveat I mentioned at the start of this section.
</p>
<br>
<h2>Forward-Mode Automatic Differentiation</h2>
<p>
Here's the status so far: numerical differentiation is trivial to implement but is not exact, whereas symbolic differentiation is exact but harder to implement.
And both share the fact that computing a derivative is expensive.
Automatic differentiation tries to take the best of both worlds:
Like symbolic differentiation, AD spits out an exact result, and like numerical differentiation, it spits out a numerical value, all the while keeping the complexity of evaluating the derivative linear.
</p>
<p>
Let's reconsider that last expression, and test rigorously the claim that its symbolic derivative is much more complicated.
Naively the expression can be expressed as follows:
</p>
<code class="block" style="margin:-10px 0px">
<pre>cos(x) * (2 * sin(x) + 1) * (log(x) - 1)^2 * (3 - exp(x))</pre>
</code>
<p>
And its derivative as defined by the <code>d x</code> code transformation, which I've naively implemented in Haskell, is:
</p>
<code class="block" style="margin:-10px 0px">
<pre> cos(x) * (2 * sin(x) + 1) * ( (log(x) - 1)^2 * (0 - 1 * exp(x))</pre>
<pre> + 2 * (1 * x^-1 - 0) * (log(x) - 1)^1 * (3 - exp(x)))</pre>
<pre>+ (log(x) - 1)^2 * (3 - exp(x)) * (cos(x) * ((2 * 1 * cos(x) + 0 * sin(x)) + 0)</pre>
<pre> + -1 * 1 * sin(x) * (2 * sin(x) + 1))</pre>
</code>
<p>
But using our simple programming language there are other programs that are equivalent to this expression.
For example consider <code>let p1 = cos(x) in p1 * (2*sin(x)+1) * (log(x)-1)^2 * (3-exp(x))</code>.
What is the derivative of a program that introduces as many non-trivial let-expressions as possible?
For example what's the derivative of the following program?
</p>
<code class="block" style="margin:-10px 0">
<pre>let v1 = cos(x) in</pre>
<pre>let v2 = sin(x) in</pre>
<pre>let v3 = 2 * v2 in</pre>
<pre>let v4 = v3 + 1 in</pre>
<pre>let v5 = log(x) in</pre>
<pre>let v6 = v5 - 1 in</pre>
<pre>let v7 = v6^2 in</pre>
<pre>let v8 = exp(x) in</pre>
<pre>let v9 = 3 - v8 in</pre>
<pre>let v10 = v1 * v4 in</pre>
<pre>let v11 = v7 * v9 in</pre>
<pre>v10 * v11</pre>
</code>
<p>
I invite you to check that indeed this program (which can be seen as a compiled-version) is equivalent to the original expression.
I also invite you to take its derivative by hand.
In case you're lazy, here's the result using the same code transformation that I've naively implemented in Haskell:
</p>
<code class="block" style="margin:-10px 0">
<pre>let v1 = cos(x) in let v1' = -1 * 1 * sin(x) in</pre>
<pre>let v2 = sin(x) in let v2' = 1 * cos(x) in</pre>
<pre>let v3 = 2 * v2 in let v3' = 2 * v2' + 0 * v2 in</pre>
<pre>let v4 = v3 + 1 in let v4' = v3' + 0 in</pre>
<pre>let v5 = log(x) in let v5' = 1 * x^-1 in</pre>
<pre>let v6 = v5 - 1 in let v6' = v5' - 0 in</pre>
<pre>let v7 = v6^2 in let v7' = 2 * v6' * v6^1 in</pre>
<pre>let v8 = exp(x) in let v8' = 1 * exp(x) in</pre>
<pre>let v9 = 3 - v8 in let v9' = 0 - v8' in</pre>
<pre>let v10 = v1 * v4 in let v10' = v1 * v4' + v1' * v4 in</pre>
<pre>let v11 = v7 * v9 in let v11' = v7 * v9' + v7' * v9 in</pre>
<pre>v10 * v11' + v10' * v11</pre>
</code>
<p>
Look carefully at what this new code does: for every simple let-expression, we have another simple-ish let-expression that computes its derivative.
The body of the let-expressions is also simple.
The complexity of evaluating it is similar to 2 or 3 calls of the original program.
So the claim that symbolic derivation produces a derivative that is not linear in the program's size is bogus.
But that's if we transform our code into a representation with as many let-expressions as possible.
And that is precisely where automatic differentiation starts!
</p>
<p>
There are two gateways to automatic differentiation.
The first is an interpretation of the exercise described above in the programming domain that is well expressed in the abstract of Wengert's 1964 paper, which introduced forward-mode AD:
"The technique permits the computation of numerical values of derivatives without developing analytical expressions for the derivatives.
The key to the method is the decomposition of the given function, by the introduction of intermediate variables, into a series of elementary functional steps."
Another way of expressing it can be found in Baydin, Pearlmutter, Radul, and Siskind's survey:
"When we are concerned with the accurate numerical evaluation of derivatives and not so much with their actual symbolic form, it is in principle possible to significantly simplify computations by storing only the values of intermediate sub-expressions in memory.
[...] This [...] idea forms the basis of AD and provides an account of its simplest form:
<em>apply symbolic differentiation at the elementary operation level and keep intermediate numerical results, in lockstep with the evaluation of the main function.</em>"
</p>
<p>
The second gateway is a mathematical re-interpretation of the two quotes above:
Fundamentally, every program is a composition of basic functions (like adding, multiplying, exponentiating, etc...) whose derivatives are known a-priori.
So a program boils down to a representation of the sort:
$$ P = f_1(f_2(f_3(\cdots f_r(x_1, \cdots, x_n)\cdots))) $$
whose derivative can be taken using the chain rule (in its simplest form):
$$ \frac{\text{d}}{\text{d}x} f(g(x)) = f'(g(x)) \cdot g'(x) $$
The "chain rule applied to elementary functions" is the other gateway to AD.
</p>
<p>
It's always a good idea to see many examples, so now and in the next section we'll see how to take the derivative of this new function:
$$ f(x_1, x_2, x_3) = \frac{log(x_1)}{x_3}\left(sin\left(\frac{log(x_1)}{x_3}\right)+e^{x_3}\cdot x_2 \cdot sin(x_1)\right) $$
One way this function can be "compiled" to a sequence of elementary function applications is like so:
$$\begin{align*}
v_1 &= log(x_1) \\
v_2 &= \frac{v_1}{x_3} \\
v_3 &= sin(v_2) \\
v_4 &= e^{x_3} \\
v_5 &= sin(x_1) \\
v_6 &= x_2 \cdot v_5 \\
v_7 &= v_4 \cdot v_6 \\
v_8 &= v_3 + v_7 \\
v_9 &= v_2 \cdot v_8 \\
f &= v_9
\end{align*}$$
</p>
<p>
If we are interested in computing $\frac{\partial f}{\partial x_1}$, we derive each identity line-by-line with respect to $x_1$, for example, $\dot{v_1} = \frac{1}{x_1}$, and $\dot{v_2} = \frac{1}{x_3}\cdot \dot{v_1}$ (where we make use of $\dot{v_1}$ which was just computed).
If we are interested in computing $\frac{\partial f}{\partial x_2}$ we will have to perform the same derivative again with respect to $x_2$.
Ultimately we are computing $n$ different programs, one for each input variable, so computing the derivative of $f: \mathbb{R}^{1,000,000} \to \mathbb{R}$ requires generating a million other programs.
To avoid doing that, we use a trick.
We differentiate with respect to some direction $s$ and treat our input variables as intermediary variables.
In other words, we assume we are given $\dot{x_i}=\frac{\partial x_i}{\partial s}$ as inputs.
If we want to compute the derivative with respect to $x_1$ then, it's enough to set $\dot{x_1} = 1$ and every other $\dot{x_i} = 0$.
</p>
<p>
Deriving one intermediate variable at a time, from the first to the last (whence the "forward-mode" part of the name), produces the following sequence, whose last variable holds the value of the derivative in the direction $(\dot{x_1}, \dot{x_2}, \dot{x_3})$ at the point $(x_1, x_2, x_3)$:
$$\begin{align*}
\dot{v_1} &= \frac{\dot{x_1}}{x_1} \\
\dot{v_2} &= \frac{1}{x_3} \dot{v_1} - \frac{v_1}{x_3^2} \dot{x_3} \\
\dot{v_3} &= \dot{v_2} cos(v_2) \\
\dot{v_4} &= \dot{x_3} e^{x_3} \\
\dot{v_5} &= \dot{x_1} cos(x_1) \\
\dot{v_6} &= x_2 \cdot \dot{v_5} + \dot{x_2} \cdot v_5 \\
\dot{v_7} &= v_4 \cdot \dot{v_6} + \dot{v_4} \cdot v_6 \\
\dot{v_8} &= \dot{v_3} + \dot{v_7} \\
\dot{v_9} &= v_2 \cdot \dot{v_8} + \dot{v_2} \cdot v_8 \\
\dot{f} &= \dot{v_9}
\end{align*} $$
</p>
<p>
Just as with all other methods, if we are interested in all derivatives of $f: \mathbb{R}^n \to \mathbb{R}^m$ then we must evaluate the derivative-program $n$ times; once for each input.
</p>
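<p>
One common way to implement this lockstep evaluation is with "dual numbers": every value carries its own $\dot{v}$ along with it.
The sketch below is a minimal Python version for the running example; the class <code>Dual</code>, the function <code>evaluate</code>, and the sample point are assumptions of mine, and only the operations that $f$ needs are defined.
</p>

```python
import math
from dataclasses import dataclass

@dataclass
class Dual:
    val: float   # the value v
    dot: float   # its derivative in the chosen direction s

    def __add__(self, o):
        return Dual(self.val + o.val, self.dot + o.dot)

    def __mul__(self, o):
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)

    def __truediv__(self, o):
        return Dual(self.val / o.val,
                    self.dot / o.val - self.val * o.dot / o.val ** 2)

def log(a): return Dual(math.log(a.val), a.dot / a.val)
def sin(a): return Dual(math.sin(a.val), a.dot * math.cos(a.val))
def exp(a): return Dual(math.exp(a.val), a.dot * math.exp(a.val))

def f(x1, x2, x3):
    v1 = log(x1)
    v2 = v1 / x3
    v3 = sin(v2)
    v4 = exp(x3)
    v5 = sin(x1)
    v6 = x2 * v5
    v7 = v4 * v6
    v8 = v3 + v7
    return v2 * v8

def evaluate(point, direction):
    """Value of f at point and its derivative along direction, in one pass."""
    return f(*(Dual(p, s) for p, s in zip(point, direction)))

point = (2.0, 3.0, 1.5)
df_dx1 = evaluate(point, (1.0, 0.0, 0.0)).dot   # seed x1' = 1, others 0
print(df_dx1)
```

<p>
Getting $\frac{\partial f}{\partial x_2}$ or $\frac{\partial f}{\partial x_3}$ means calling <code>evaluate</code> again with a different seed, which is exactly the "once for each input" cost mentioned above.
</p>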
<br>
<h2>Reverse-mode Automatic Differentiation</h2>
<p>
The only sensible gateway to reverse-mode AD that I have found is the chain rule (which I glossed over in the previous section).
How did forward-mode use the chain rule?
Let's look at how a few $\dot{v_i}$s were computed (keeping in mind that we are differentiating with respect to some direction $s$, so we only know $x_1,x_2,x_3,\dot{x_1},\dot{x_2},\dot{x_3}$, and all the intermediate $v_i$s):
$$ \begin{align*}
\dot{v_1} &\overset{def}= \frac{\partial v_1}{\partial s} \\
&= \frac{\partial v_1}{\partial x_1}\frac{\partial x_1}{\partial s} \\
&= \frac{1}{x_1} \dot{x_1} \\
\dot{v_2} &\overset{def}= \frac{\partial v_2}{\partial s} \\
&= \frac{\partial v_2}{\partial v_1}\frac{\partial v_1}{\partial s} + \frac{\partial v_2}{\partial x_3}\frac{\partial x_3}{\partial s} \\
&= \frac{1}{x_3} \dot{v_1} - \frac{v_1}{x_3^2}\dot{x_3} \\
\dot{v_8} &\overset{def}= \frac{\partial v_8}{\partial s} \\
          &= \frac{\partial v_8}{\partial v_3}\frac{\partial v_3}{\partial s} + \frac{\partial v_8}{\partial v_7}\frac{\partial v_7}{\partial s} \\
&= 1 \dot{v_3} + 1 \dot{v_7}
\end{align*} $$
The essential observation is that in every application of the chain rule, we have fixed the denominator $\partial s$ and expanded the numerator.
In other words, we define $\dot{y} \overset{def}= \frac{\partial y}{\partial s}$ where $s$ is fixed and attempt to express every $\dot{y}$ in terms of other $\dot{z}$s.
That is, to compute $\dot{v_i}$ we ask ourselves the question, which variables $v_j$ should we use to do $\frac{\partial v_i}{\partial v_j}\cdot\dot{v_j}$?
The answer is those variables that appear in the definition of $v_i$.
</p>
<p>
What happens if we fix the numerator instead?
In other words, to compute $\overline{v_i}\overset{def}=\frac{\partial f}{\partial v_i}$ where $f$ is the fixed output variable, we ask ourselves: which variables $v_j$ do we need in order to form $\overline{v_j}\frac{\partial v_j}{\partial v_i}$?
Firstly notice that we removed the direction vector $s$.
After all, it was introduced as an optimization trick.
Moreover, what we are interested in is the derivative $\frac{\partial f}{\partial x_i}$.
The answer is those variables $v_j$ that include $v_i$ in their definition.
</p>
<p>
Now let's apply this new trick (fixing the numerator) to the intermediary variables $v_i$.
We will use the adjoint notation, which is the notation introduced in the earlier paragraph $\overline{v_i} \overset{def}= \frac{\partial f}{\partial v_i}$.
Using this notation, the answer we are after becomes $\overline{x_1}, \overline{x_2}, \overline{x_3}$.
The first thing to observe is that $\overline{f}=1$.
In the forward-mode, we knew $\dot{x_1},\dot{x_2},\dot{x_3}$, so we naturally had to go from top-down.
Here we only know $\overline{f}$, so we have no choice but to go bottom-up.
</p>
$$\begin{align*}
\overline{v_9} &= \frac{\partial f}{\partial v_9} = 1, \\
\overline{v_8} &= \frac{\partial f}{\partial v_8} = \frac{\partial f}{\partial v_9} \cdot \frac{\partial v_9}{\partial v_8} = \overline{v_9} \cdot v_2,\\
\overline{v_7} &= \frac{\partial f}{\partial v_7} = \frac{\partial f}{\partial v_8} \cdot \frac{\partial v_8}{\partial v_7} = \overline{v_8} \cdot 1,\\
\overline{v_6} &= \frac{\partial f}{\partial v_6} = \frac{\partial f}{\partial v_7} \cdot \frac{\partial v_7}{\partial v_6} = \overline{v_7} \cdot v_4,\\
\vdots\\
\overline{v_2} &= \frac{\partial f}{\partial v_2} \\
&= \frac{\partial f}{\partial v_9}\frac{\partial v_9}{\partial v_2} + \frac{\partial f}{\partial v_3}\frac{\partial v_3}{\partial v_2} \\
&= \overline{v_9}\cdot v_8 + \overline{v_3} \cdot cos(v_2)
\end{align*}$$
<p>
As mentioned earlier, the equation for $\overline{v_i}$ involves those variables whose definitions include $v_i$.
That is why we use both $\overline{v_9}$ and $\overline{v_3}$ when computing the adjoint $\overline{v_2}$.
In practice, when implementing this algorithm, we build a dependency graph whose nodes are the intermediary values, the inputs, and the outputs.
In this graph, $v_i$ is connected to $v_j$ if $v_j$ includes $v_i$ in its definition.
Computing the derivative amounts to traversing this graph forwards to compute the values $v_i$, then traversing it backward to propagate the adjoints.
This is precisely the back-propagation algorithm used in training neural networks.
</p>
<p>For our running example, the graph looks like this:<br>
<img src="./depgraph.png" height="300px" style="display:block">
</p>
<p>
To formalize our graph-oriented algorithm from the previous paragraph a little, let's introduce some definitions.
Let $f_v: \mathbb{R}^r \to \mathbb{R}$ be the function that vertex $v$ computes, and let $\partial_1 f_v, \cdots, \partial_r f_v$ be its partial derivatives, which are statically known.
Let $I_v$ be the vector of the vertices with an outgoing edge to $v$.
Let $O_v$ be the set of vertices that $v$ has an outgoing edge to.
And finally, let $i(v_1, v_2)$ be the position that $v_1$ occupies in $I_{v_2}$.
For example, $f_{v_2}(a, b) = \frac{a}{b}$, $\partial_1 f_{v_2} = \frac{1}{b}$, $\partial_2 f_{v_2} = -\frac{a}{b^2}$.
And $I_{v_2} = [v_1, x_3]$; here the order matters, since $f_{v_2}(I_{v_2})$ should be the value of $v_2$.
And $O_{v_2} = \{v_3, v_9\}$, and $i(v_1, v_2) = 1$, and $i(x_3, v_2) = 2$.
Then:
$$ \overline{x} = \sum_{v \in O_x} \overline{v} \cdot \partial_{i(x,v)}f_v\left(I_v\right) $$
</p>
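<p>
The formula above translates almost directly into code.
What follows is a minimal Python sketch, not a production design: the class <code>Var</code>, the helpers <code>add</code>, <code>mul</code>, <code>div</code>, <code>log</code>, <code>sin</code>, <code>exp</code>, and the function <code>backward</code> are hypothetical names.
Each vertex stores its incoming edges together with the corresponding local partial derivative, and <code>backward</code> visits the vertices so that every vertex is handled after all the vertices in its $O_v$.
</p>

```python
import math

class Var:
    """A vertex of the dependency graph: value, incoming edges, adjoint."""
    def __init__(self, value, inputs=()):
        self.value = value
        self.inputs = inputs     # pairs (input vertex, local partial derivative)
        self.adjoint = 0.0

def add(a, b): return Var(a.value + b.value, ((a, 1.0), (b, 1.0)))
def mul(a, b): return Var(a.value * b.value, ((a, b.value), (b, a.value)))
def div(a, b): return Var(a.value / b.value,
                          ((a, 1.0 / b.value), (b, -a.value / b.value ** 2)))
def log(a): return Var(math.log(a.value), ((a, 1.0 / a.value),))
def sin(a): return Var(math.sin(a.value), ((a, math.cos(a.value)),))
def exp(a): return Var(math.exp(a.value), ((a, math.exp(a.value)),))

def backward(output):
    """Propagate adjoints from output back to every vertex it depends on."""
    order, seen = [], set()
    def visit(v):                        # post-order: inputs before users
        if id(v) not in seen:
            seen.add(id(v))
            for parent, _ in v.inputs:
                visit(parent)
            order.append(v)
    visit(output)
    for v in order:
        v.adjoint = 0.0
    output.adjoint = 1.0
    for v in reversed(order):            # users before their inputs
        for parent, partial in v.inputs:
            parent.adjoint += v.adjoint * partial

def gradient(x1, x2, x3):
    """Value of the running example and all three partial derivatives."""
    x1, x2, x3 = Var(x1), Var(x2), Var(x3)
    v2 = div(log(x1), x3)
    v8 = add(sin(v2), mul(exp(x3), mul(x2, sin(x1))))
    f = mul(v2, v8)
    backward(f)
    return f.value, (x1.adjoint, x2.adjoint, x3.adjoint)

value, grads = gradient(2.0, 3.0, 1.5)
print(value, grads)
```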
<p>
And finally, for completeness, here are the adjoints for our running example:
$$ \begin{align*}
\overline{f} &= 1 \\
\overline{v_9} &= \overline{f} \\
\overline{v_8} &= \overline{v_9} v_2 \\
\overline{v_7} &= \overline{v_8} \\
\overline{v_6} &= \overline{v_7} v_4 \\
\overline{v_5} &= \overline{v_6} x_2 \\
\overline{v_4} &= \overline{v_7} v_6 \\
\overline{v_3} &= \overline{v_8} \\
\overline{v_2} &= \overline{v_3} cos(v_2) + \overline{v_9} v_8 \\
\overline{v_1} &= \overline{v_2} \frac{1}{x_3} \\
\overline{x_3} &= \overline{v_4} e^{x_3} - \overline{v_2} \frac{v_1}{x_3^2} \\
\overline{x_2} &= \overline{v_6} v_5 \\
\overline{x_1} &= \overline{v_1} \frac{1}{x_1} + \overline{v_5} cos(x_1) \\
\end{align*} $$
</p>
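<p>
Written out as straight-line Python, the forward pass followed by the adjoint pass looks like the sketch below.
The function name <code>value_and_adjoints</code> and the sample point are my own; the adjoint lines follow from applying $\overline{v_i} = \sum_j \overline{v_j}\frac{\partial v_j}{\partial v_i}$ to each definition in the compiled program.
</p>

```python
import math

def value_and_adjoints(x1, x2, x3):
    # Forward pass: compute every intermediate value.
    v1 = math.log(x1)
    v2 = v1 / x3
    v3 = math.sin(v2)
    v4 = math.exp(x3)
    v5 = math.sin(x1)
    v6 = x2 * v5
    v7 = v4 * v6
    v8 = v3 + v7
    v9 = v2 * v8

    # Backward pass: propagate adjoints from the output to the inputs.
    v9_bar = 1.0
    v8_bar = v9_bar * v2
    v7_bar = v8_bar
    v6_bar = v7_bar * v4
    v5_bar = v6_bar * x2
    v4_bar = v7_bar * v6
    v3_bar = v8_bar
    v2_bar = v3_bar * math.cos(v2) + v9_bar * v8
    v1_bar = v2_bar / x3
    x3_bar = v4_bar * v4 - v2_bar * v1 / x3 ** 2   # v4 is exp(x3)
    x2_bar = v6_bar * v5
    x1_bar = v1_bar / x1 + v5_bar * math.cos(x1)
    return v9, (x1_bar, x2_bar, x3_bar)

value, adjoints = value_and_adjoints(2.0, 3.0, 1.5)
print(value, adjoints)
```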
<p>
That is all!
One last observation is that in a single backward pass of $f$ we have managed to compute <u>all</u> its partial derivatives,
which explains why reverse-mode AD is the method of choice in a neural network setting.
</p>
<p>
To deal with $f: \mathbb{R}^n \to \mathbb{R}^m$ (in our example $m=1$), we will have to compute $m$ different programs, one for each output.
We can of course reuse the same trick of introducing a direction $s$ and setting $\overline{y_i} = \frac{\partial s}{\partial y_i}$.
To compute all $\frac{\partial y_1}{\partial x_i}$ we just set $\overline{y_1} = 1$ and every other $\overline{y_i} = 0$.
</p>
<p>
So, when should we use each method?
If we are differentiating $f: \mathbb{R}^n \to \mathbb{R}^m$, and $n \leq m$ then the numerical method is the simplest to implement, while the forward-mode AD is the fastest.
And if $n > m$ then reverse-mode AD is the fastest.
</p>
https://blog.grgz.me/posts/on-computing-derivatives.html
Wed, 18 Aug 2021
A Case for For Loops
<style>
blockquote { border-left: 5px solid rgba(0, 0, 0, 0.2); margin: 20px 0; padding-left: 5rem; }
code.in { padding: 0; background: transparent; border: 0; }</style><p>
I do not like for loops.
I think for loops should not be the first tool developers reach for when they wish to loop.
If you have the chance to use <code>map</code>, <code>filter</code>, or <code>reduce</code>,
and if your host language is capable of <a href="https://github.com/frenchy64/stream-fusion" target="_blank">fusing</a> them,
then it's always<sup>[source?]</sup> more readable to sequence them.</p><p>
But I think I found a case where for loops really are the most elegant solution.</p><p>
To illustrate this scenario, imagine we were asked to find the prime numbers in a list of the first <em>n</em> Fibonacci numbers.
Next, we were asked to find the square numbers in the same list of <em>n</em> Fibonacci numbers.
In fact we were asked by some Fibonacci-obsessed mathematician;
in other words we will be computing Fibonacci numbers a lot.
So it's sensible to imagine some function <code>int[] firstNFibonacci(int n)</code> defined somewhere.</p><p>
However, this Fibonacci-crazed oddball is of course interested in rare properties of Fibonacci numbers;
which means that, of the <em>n</em> numbers computed, hardly any will remain.
So it's a waste of time (and mostly space) to construct a list of size <em>n</em> and to loop over it once to populate it with Fibonacci numbers, then loop over it again to keep the interesting ones.</p><p>
It's obviously better if <code>firstNFibonacci</code> can yield each number to the filtering logic as soon as it is computed, before carrying on to compute the next.
For the sake of generality I will assume your host language does not have a <code>yield</code> keyword (nor a synonym).
So we'll have to simulate yielding.
The way I choose to do it is to include the state of the computation as an argument (so the function can carry on where it left off),
and to include the next state in the output of the function (so the caller can chain these next states and make progress).</p>
Here is an example:<code class="block">
<pre>type state = (int, int, int);</pre>
<pre> </pre>
<pre>(int, state) firstNFibonacci(int n, state s) {</pre>
<pre> int found = s[0], a = s[1], b = s[2];</pre>
<pre> return found < n</pre>
<pre> ? (a+b, (found+1, b, a+b))</pre>
<pre> : (-1, s); // -1 indicates end of computation</pre>
<pre>}</pre></code><p>
The question now is, how do we use this iterator correctly?
Here are three ways of finding the prime fibonacci numbers.
I will let you judge which is <strong>completely</strong> correct.</p><code class="block">
<pre>state s = (0, 0, 1); // (none found, first fib., second fib.)</pre>
<pre>int n = 1000;</pre>
<pre>var result = firstNFibonacci(n, s);</pre>
<pre>while (result[0] > -1) {</pre>
<pre> if (is_prime(result[0])) print(result[0]);</pre>
<pre> result = firstNFibonacci(n, result[1]);</pre>
<pre>}</pre></code><code class="block">
<pre>state s = (0, 0, 1); // (none found, first fib., second fib.)</pre>
<pre>int n = 1000;</pre>
<pre>var result, value;</pre>
<pre>do {</pre>
<pre> result = firstNFibonacci(n, s);</pre>
<pre> value = result[0];</pre>
<pre> s = result[1];</pre>
<pre> if (is_prime(value)) print(value);</pre>
<pre>} while (value > -1);</pre></code><code class="block">
<pre>state s = (0, 0, 1); // (none found, first fib., second fib.)</pre>
<pre>int n = 1000;</pre>
<pre>var result, value;</pre>
<pre>while (true) {</pre>
<pre> result = firstNFibonacci(n, s);</pre>
<pre> value = result[0];</pre>
<pre> s = result[1];</pre>
<pre> if (is_prime(value)) print(value);</pre>
<pre> if (value > -1) break;</pre>
<pre>}</pre></code><p>
What happens when <code>n = 0</code>?
Will the first number always be considered?
Will the last number always be considered?
Will it halt?
The (do)while loops above invite the reader to ask these questions.
Questions whose answers are not obvious and therefore require a context switch in the head of the reader.
(In fact the last two versions are not completely correct)</p>
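<p>
One way to probe these questions is to run the loops and record every value that reaches the filtering logic.
The Python sketch below ports the state-passing function and the first two loops; the names <code>first_n_fibonacci</code>, <code>while_variant</code>, and <code>do_while_variant</code> are mine, the callback <code>visit</code> stands in for <code>if (is_prime(value)) print(value)</code>, and since Python has no do-while the second loop is emulated with <code>while True</code> and a trailing test.
</p>

```python
def first_n_fibonacci(n, state):
    """Return (next value, next state), or (-1, state) once n values were produced."""
    found, a, b = state
    if found < n:
        return a + b, (found + 1, b, a + b)
    return -1, state                     # -1 indicates end of computation

def while_variant(n, visit):
    value, state = first_n_fibonacci(n, (0, 0, 1))
    while value > -1:
        visit(value)
        value, state = first_n_fibonacci(n, state)

def do_while_variant(n, visit):
    state = (0, 0, 1)
    while True:                          # emulates do { ... } while (value > -1)
        value, state = first_n_fibonacci(n, state)
        visit(value)
        if not value > -1:
            break

seen_while, seen_do = [], []
while_variant(3, seen_while.append)
do_while_variant(3, seen_do.append)
print(seen_while)
print(seen_do)
```

<p>
The do-while port hands the end-of-computation marker to the filtering logic, even when <code>n = 0</code>; the while port never does.
</p>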
Now consider the for loop version:<code class="block">
<pre>int n = 1000;</pre>
<pre> </pre>
<pre>for ( var result = firstNFibonacci(n, (0, 0, 1))</pre>
<pre> ; result[0] > -1</pre>
<pre> ; result = firstNFibonacci(n, result[1]))</pre>
<pre>{ </pre>
<pre> var value = result[0];</pre>
<pre> if (is_prime(value)) print(value);</pre>
<pre>}</pre></code><p>
I consider this version elegant because it fits with the semantics of for loops perfectly.
The first component is an initialization step,
the second is a termination condition,
and the third is the means of making progress.</p><p>
In conclusion, like gotos, there seem to be sensible use cases for for loops.
However I still believe that most of the time, for loops are not the tool for the job.
This investigation made me wonder if this is the reason why languages like Java adopted the same <code>for</code>
keyword for iterators instead of introducing a synonym.
But that is a question for the historians.</p>
https://blog.grgz.me/posts/a-case-for-for-loops.html
Sun, 20 Jun 2021
C: Fork as a knife
<style>
blockquote { border-left: 5px solid rgba(0, 0, 0, 0.2); margin: 20px 0; padding-left: 5rem; }
code.in { padding: 0; background: transparent; border: 0; }</style><p>
Did you know that C's <code>int fork()</code> can fail?
I didn't until I read
<a href="https://rachelbythebay.com/w/2020/12/06/forked/" target="_blank">retvals, terrible teaching, and admitting we have a problem</a>.
First thing I did afterwards was type in my trusty terminal <code class="in">man fork</code>
and summon my machine's inner librarian.
Here's what this <a href="https://www.commandlinux.com/man-page/man3/fork.3am.html" target="_blank">man page</a>,
the one installed on my machine, had to say about <code class="in">fork()</code>:
<blockquote>
This function creates a new process. The return value is the zero in the
child and the process-id number of the child in the parent, or -1 upon
error. In the latter case, <code class="in">ERRNO</code> indicates the problem.
</blockquote></p>
Cool, cool... Scrolling down to some examples I find:<code class="block">
<pre>@load "fork"</pre>
<pre>...</pre>
<pre>if ((pid = fork()) == 0)</pre>
<pre> print "hello from the child"</pre>
<pre>else</pre>
<pre> print "hello from the parent"</pre></code>
Paraphrasing Rachelbythebay's blog post:
WTF?! Where's the <code>else if (pid < 0)</code>?
Why isn't anyone writing it?
The blog post concludes:<ol><li>
We're not including it in our examples, so those who are learning
from our examples won't know it exists, and,</li><li>
the (arguably) worst reason of all: laziness combined with
a lack of creativity; the classical reason why a programmer won't do
something: "it makes my code messy".</li></ol><p>
Anyways, messy code is not a valid excuse, it never was and it never will be.
Any code can always be "un-messified" with a little bit of creativity.
At worst it can be hidden behind a function call;
<code>else handle_fork_fail()</code>
won't ruin the zen energy of any code if you ask me.
Why couldn't the man page example have been like this?</p><code class="block">
<pre>@load "fork"</pre>
<pre>...</pre>
<pre>if ((pid = fork()) == 0)</pre>
<pre> print "hello from the child"</pre>
<pre>else if (pid > 0)</pre>
<pre> print "hello from the parent"</pre>
<pre>else</pre>
<pre> ...</pre></code><p>
As example-writers we must always be aware that we will be imitated often,
and especially by those who know least.
And we must always be aware that our code will likely be copy-pasted pretty
much everywhere (example: your next car's firmware).</p><p>
Rachelbythebay's blog post points to what I believe is the fundamental issue.
It's not in the examples as much as it's in <code class="in">fork()</code> itself.
Luckily the blog puts it better than I could, here:
<a target="_blank" href="https://rachelbythebay.com/w/2011/07/15/ui/">if too many users are wrong, it's probably your fault</a>.</p><p>
Now here's an anecdote.
When I arrived in the French part of Switzerland I had already been speaking the official French,
<em>le français Parisien</em>, for the last fifteen years of my life.
First day there what do you do?
You go buy some food and the stuff the airport ruined.
Normal.
I got to the cashier and heard a number I had never heard before.
You should know that in official French, numbers like 99 are literally pronounced "four twenties ten nine" (<em>quatre-vingt-dix-neuf</em>, 4*20+10+9=99).
What I heard was a new word, something like "nine-ty nine" (<em>nonante-neuf</em>). The Swiss got creative, and it blew my mind!
</p><p>
Coming back to <code class="in">fork()</code>,
we are complicating our lives when we use ad-hoc return values,
and explicitly encoding statuses as integers isn't helping.
At least hide them behind a <code class="in">typedef</code> or a <code class="in">#define</code>, have some shame.
Try it with me, say this out loud: "if the new process is zero", sounds funky, no?</p><p>
If you want to surrender to the status quo and be pragmatic, here's a <code class="in">fork()</code> example for you.</p><code class="block">
<pre>#include <stdio.h></pre>
<pre>#include <unistd.h></pre>
<pre>#include <errno.h></pre>
<pre>#include <string.h></pre>
<pre> </pre>
<pre>int _g_last_pid_value;</pre>
<pre>int (*old_fork)() = fork;</pre>
<pre></pre>
<pre>#define last_fork_in_child if (_g_last_pid_value == 0)</pre>
<pre>#define last_fork_in_parent if (_g_last_pid_value > 0)</pre>
<pre>#define last_fork_error if (_g_last_pid_value < 0)</pre>
<pre>#define fork() (_g_last_pid_value = old_fork())</pre>
<pre></pre>
<pre>int main() {</pre>
<pre> int pid = fork();</pre>
<pre></pre>
<pre> last_fork_in_child printf("hello from the child\n");</pre>
<pre> last_fork_in_parent printf("hello from the parent (child=%d)\n", pid);</pre>
<pre> last_fork_error fprintf(stderr, "error %d\n%s\n", errno, strerror(errno));</pre>
<pre></pre>
<pre> return 0;</pre>
<pre>}</pre></code><p>
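For the <code class="in">typedef</code> flavour of the same idea, here is a small sketch that gives each outcome of <code class="in">fork()</code> a name.
The names <code class="in">fork_outcome</code>, <code class="in">classify_fork</code>, and <code class="in">describe_fork</code> are made up for this post, nothing in <code class="in">unistd.h</code> is called that.</p>

```c
#include <assert.h>
#include <sys/types.h>   /* pid_t */

/* Hypothetical names: nothing in unistd.h is called fork_outcome. */
typedef enum { FORK_IN_CHILD, FORK_IN_PARENT, FORK_FAILED } fork_outcome;

/* Translates fork()'s overloaded integer into a named outcome. */
static fork_outcome classify_fork(pid_t pid) {
    if (pid == 0) return FORK_IN_CHILD;
    if (pid > 0)  return FORK_IN_PARENT;
    return FORK_FAILED;
}

/* A switch over the enum with no default branch, so a forgotten case can be flagged. */
static const char *describe_fork(fork_outcome outcome) {
    switch (outcome) {
        case FORK_IN_CHILD:  return "hello from the child";
        case FORK_IN_PARENT: return "hello from the parent";
        case FORK_FAILED:    return "fork failed";
    }
    assert(!"unreachable: unknown fork_outcome");
    return "";
}
```

<p>Now you can say "if the fork failed" out loud and it sounds normal.
As a bonus, GCC and Clang with <code class="in">-Wall</code> warn when a <code class="in">switch</code> over an enum leaves out one of its values, so deleting the <code class="in">FORK_FAILED</code> case gets noticed.
It is still only a warning though.</p><p>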
But, if you want to do better, ask yourself:
What are you telling your user?
Are you telling them "hey buddy, don't forget about the error case, no biggie if you forget, but please try to remember it",
or are you saying "listen pal, that's not gonna work if you won't tell me what to do if it fails"?</p><p>
If you want to absolutely force the user to provide an error handler then more sophisticated macros will be required.
My go-to strategy is a type-driven approach,
this way if a handler isn't provided the compiler will always let me know (I don't have to wait until runtime).
I implemented a proof-of-concept in D (the full code is at the end),
here's an example illustrating my fork's interface:</p><code class="block">
<pre>MyFork.with_child_handler({</pre>
<pre> writeln("Hello from the child");</pre>
<pre> })</pre>
<pre> .with_parent_handler((int pid) {</pre>
<pre> writefln("Hello from the parent (child=%d)", pid);</pre>
<pre> })</pre>
<pre> .with_error_handler({</pre>
<pre> writeln("An error occurred");</pre>
<pre> })</pre>
<pre> .run();</pre></code><p>
The handlers can be provided in any order,
but exactly one handler of each kind must be provided:
providing the same kind twice, or calling <code class="in">run()</code> before all three are set, will produce a compile error.
The actual forking is done when <code class="in">run()</code> is executed.
Basically there are eight phantom types (represented by mixed-in structs):</p><code class="block">
<pre>struct MyFork {</pre>
<pre> MyForkWithChild with_child_handler(ChildHandler handler);</pre>
<pre> MyForkWithParent with_parent_handler(ParentHandler handler);</pre>
<pre> MyForkWithError with_error_handler(ErrorHandler handler);</pre>
<pre>}</pre>
<pre>...</pre>
<pre>struct MyForkWithParent {</pre>
<pre> private ParentHandler parent_handler;</pre>
<pre> MyForkWithChildParent with_child_handler(ChildHandler handler);</pre>
<pre> MyForkWithParentError with_error_handler(ErrorHandler handler);</pre>
<pre>}</pre>
<pre>...</pre>
<pre>struct MyForkWithChildParentError {</pre>
<pre> private ChildHandler child_handler;</pre>
<pre> private ParentHandler parent_handler;</pre>
<pre> private ErrorHandler error_handler;</pre>
<pre> public void run() {</pre>
<pre> int pid = fork();</pre>
<pre> if (pid == 0) child_handler();</pre>
<pre> else if (pid > 0) parent_handler(pid);</pre>
<pre> else error_handler();</pre>
<pre> }</pre>
<pre>}</pre></code><p>
To reiterate, my point is: messy code has never been a valid excuse, and never will be.
Besides sequencing <code>.with_..._handler</code>s we could use named arguments.
Or we could use fancier macros.
Or we could simulate the whole thing with function calls like the following example does.
There are a million ways to tidy up your code <strong>and</strong> do error handling.</p><code class="block">
<pre>int (*old_fork)() = fork;</pre>
<pre>void fork(void (*child)(), void (*parent)(int), void (*error)()) {</pre>
<pre> int pid = old_fork();</pre>
<pre> if (pid == 0) child();</pre>
<pre> else if (pid > 0) parent(pid);</pre>
<pre> else error();</pre>
<pre>}</pre></code><p>
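The named-arguments idea can be approximated in plain C too, with a struct of handlers and C99 designated initializers.
This is a sketch under my own assumptions: <code class="in">fork_handlers</code>, <code class="in">dispatch_fork_result</code>, <code class="in">fork_with_handlers</code>, and the <code class="in">on_...</code> handlers are hypothetical names,
and unlike the snippet above it leaves the real <code class="in">fork</code> alone so that it compiles next to <code class="in">unistd.h</code>.</p>

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bundle of handlers; the field names act as argument names. */
typedef struct {
    void (*child)(void);
    void (*parent)(pid_t child_pid);
    void (*error)(int saved_errno);
} fork_handlers;

/* Routes a fork() return value to exactly one handler.
   Kept separate from the fork itself so the routing can be tried with made-up pids. */
static void dispatch_fork_result(pid_t pid, int saved_errno, fork_handlers h) {
    assert(h.child && h.parent && h.error);
    if (pid == 0) h.child();
    else if (pid > 0) h.parent(pid);
    else h.error(saved_errno);
}

/* Forks, dispatches, and ends the child once its handler returns. */
static pid_t fork_with_handlers(fork_handlers h) {
    pid_t pid = fork();
    int saved_errno = errno;   /* only meaningful when pid < 0 */
    dispatch_fork_result(pid, saved_errno, h);
    if (pid == 0) _exit(0);
    return pid;
}

/* Example handlers that only count how often they ran. */
static int children_seen, parents_seen, errors_seen;
static void on_child(void) { children_seen++; }
static void on_parent(pid_t child_pid) { (void)child_pid; parents_seen++; }
static void on_error(int saved_errno) {
    errors_seen++;
    fprintf(stderr, "fork failed: %s\n", strerror(saved_errno));
}
```

<p>A call then reads like
<code class="in">fork_with_handlers((fork_handlers){ .child = on_child, .parent = on_parent, .error = on_error })</code>.
A forgotten field is only caught at runtime by the <code class="in">assert</code>, so this is tidier but weaker than the type-driven version.</p><p>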
For the interested here is the complete compilable/executable source code of the D code mentioned above.
The eight phantom types are not hand-written (you think I'm insane?)
they are generated at compile-time by D's mixins and compile-time-function-execution (C++ template magic allows you to do that too).
You could try this code by copy-pasting it in
<a href="https://run.dlang.io" target="_blank">https://run.dlang.io</a>.
The implementation of the macros and templates and co. may be complicated,
but the interface ought to be as simple as possible and should encode our requirements correctly (preaching <em>The Right Thing</em>).
If we wanted to force the user to provide the handlers in a specific order then we would not need eight structs.
Similarly, if we wanted to allow the user to override the handler (by calling <code class="in">with_child_handler</code> many times for example) then we would not need eight structs.
Eight is the number of possible states we could be in (states representing which handlers were provided).
The struct that has all three handlers is the only struct with a <code class="in">run()</code> function.
If we wanted to make the parent handler optional then all structs with a child handler and an error handler (two of them) will have a copy of <code class="in">run()</code>.</p><code class="block">
<pre>import std.stdio;</pre>
<pre> </pre>
<pre>mixin template HasChildHandler() {</pre>
<pre> private void delegate() child_handler;</pre>
<pre>}</pre>
<pre>mixin template HasParentHandler() {</pre>
<pre> private void delegate(int) parent_handler;</pre>
<pre>}</pre>
<pre>mixin template HasErrorHandler() {</pre>
<pre> private void delegate() error_handler;</pre>
<pre>}</pre>
<pre> </pre>
<pre> </pre>
<pre>mixin template ProvideHandlerSetter(T, DG, string field) {</pre>
<pre> mixin(`T with_` ~ field ~ `_handler(DG dg) {</pre>
<pre> auto value = T();</pre>
<pre> value.` ~ field ~ `_handler = dg;</pre>
<pre> static if (__traits(compiles, this.child_handler)</pre>
<pre> && __traits(compiles, value.child_handler))</pre>
<pre> value.child_handler = this.child_handler;</pre>
<pre> static if (__traits(compiles, this.parent_handler)</pre>
<pre> && __traits(compiles, value.parent_handler))</pre>
<pre> value.parent_handler = this.parent_handler;</pre>
<pre> static if (__traits(compiles, this.error_handler)</pre>
<pre> && __traits(compiles, value.error_handler))</pre>
<pre> value.error_handler = this.error_handler;</pre>
<pre> return value;</pre>
<pre> }`);</pre>
<pre>}</pre>
<pre> </pre>
<pre> </pre>
<pre>mixin template ProvideChildHandlerSetter(T) {</pre>
<pre> mixin ProvideHandlerSetter!(T, void delegate(), "child");</pre>
<pre>}</pre>
<pre>mixin template ProvideParentHandlerSetter(T) {</pre>
<pre> mixin ProvideHandlerSetter!(T, void delegate(int), "parent");</pre>
<pre>}</pre>
<pre>mixin template ProvideErrorHandlerSetter(T) {</pre>
<pre> mixin ProvideHandlerSetter!(T, void delegate(), "error");</pre>
<pre>}</pre>
<pre> </pre>
<pre> </pre>
<pre>mixin({</pre>
<pre> string structName(bool child, bool parent, bool error) {</pre>
<pre> string name = "MyFork" ~ (child || parent || error ? "With" : "");</pre>
<pre> if (child) name ~= "Child";</pre>
<pre> if (parent) name ~= "Parent";</pre>
<pre> if (error) name ~= "Error";</pre>
<pre> return name;</pre>
<pre> }</pre>
<pre> string result = "";</pre>
<pre> for (int i=0; i<8; i++) {</pre>
<pre> bool child = (i & 0b001) == 1;</pre>
<pre> bool parent = ((i & 0b010) >> 1) == 1;</pre>
<pre> bool error = ((i & 0b100) >> 2) == 1;</pre>
<pre> result ~= "struct " ~ structName(child, parent, error) ~ " {";</pre>
<pre> result ~= (child ? "mixin HasChildHandler" : "mixin ProvideChildHandlerSetter!" ~ structName(true, parent, error)) ~ ";";</pre>
<pre> result ~= (parent ? "mixin HasParentHandler" : "mixin ProvideParentHandlerSetter!" ~ structName(child, true, error)) ~ ";";</pre>
<pre> result ~= (error ? "mixin HasErrorHandler" : "mixin ProvideErrorHandlerSetter!" ~ structName(child, parent, true)) ~ ";";</pre>
<pre> if (child && parent && error) {</pre>
<pre> result ~= q{void run() {</pre>
<pre> import core.sys.posix.unistd : fork, pid_t;</pre>
<pre> pid_t pid = fork();</pre>
<pre> if (pid == 0) child_handler();</pre>
<pre> else if (pid > 0) parent_handler(pid);</pre>
<pre> else error_handler();</pre>
<pre> }};</pre>
<pre> }</pre>
<pre> result ~= "}";</pre>
<pre> }</pre>
<pre> return result;</pre>
<pre>}());</pre>
<pre> </pre>
<pre> </pre>
<pre>void main() {</pre>
<pre> MyFork()</pre>
<pre> .with_error_handler({</pre>
<pre> "ERROR".writeln;</pre>
<pre> })</pre>
<pre> .with_child_handler({</pre>
<pre> "Hello from the child".writeln;</pre>
<pre> })</pre>
<pre> .with_parent_handler((child_pid) {</pre>
<pre> "Hello from the parent (child=%d)".writefln(child_pid);</pre>
<pre> })</pre>
<pre> .run();</pre>
<pre>}</pre></code>
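<p>As for forcing a specific order: that variant is small enough to write by hand, even in C.
Below is a sketch with one struct per stage, assuming the order child, parent, error.
All the names (<code class="in">fork_with_child</code>, <code class="in">fork_ready</code>, <code class="in">run_fork</code>, the <code class="in">demo_...</code> handlers) are hypothetical,
and <code class="in">dispatch</code> is split out only so the routing can be exercised without forking.</p>

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

typedef void (*child_fn)(void);
typedef void (*parent_fn)(pid_t child_pid);
typedef void (*error_fn)(void);

/* One struct per stage; only the last stage is accepted by run_fork(). */
typedef struct { child_fn child; } fork_with_child;
typedef struct { child_fn child; parent_fn parent; } fork_with_child_parent;
typedef struct { child_fn child; parent_fn parent; error_fn error; } fork_ready;

static fork_with_child with_child_handler(child_fn child) {
    assert(child);
    fork_with_child stage = { child };
    return stage;
}

static fork_with_child_parent with_parent_handler(fork_with_child prev, parent_fn parent) {
    assert(parent);
    fork_with_child_parent stage = { prev.child, parent };
    return stage;
}

static fork_ready with_error_handler(fork_with_child_parent prev, error_fn error) {
    assert(error);
    fork_ready stage = { prev.child, prev.parent, error };
    return stage;
}

/* Routes a pid to exactly one of the three handlers. */
static void dispatch(fork_ready f, pid_t pid) {
    if (pid == 0) f.child();
    else if (pid > 0) f.parent(pid);
    else f.error();
}

/* Forks, dispatches, and ends the child once its handler returns. */
static pid_t run_fork(fork_ready f) {
    pid_t pid = fork();
    dispatch(f, pid);
    if (pid == 0) _exit(0);
    return pid;
}

/* Example handlers that only count how often they ran. */
static int demo_calls[3];
static void demo_child(void) { demo_calls[0]++; }
static void demo_parent(pid_t child_pid) { (void)child_pid; demo_calls[1]++; }
static void demo_error(void) { demo_calls[2]++; }
```

<p>Handing <code class="in">run_fork</code> anything but a <code class="in">fork_ready</code> is rejected by the compiler because the struct types are incompatible,
so skipping a handler fails at compile time, with three structs instead of eight.</p>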
https://blog.grgz.me/posts/fork-as-a-knife.html
https://blog.grgz.me/posts/fork-as-a-knife.htmlThu, 10 Dec 2020Translation Of and Notes On Simone Weil's Letter to <em>Cahiers Du Sud</em>
on the Responsibilities of Literature.<style>
p { font-family: serif; }
em > em { font-style: normal; }
hr { width: 33%; border: 0; border-bottom: 1px solid #ccc; }
a.footnote { background: transparent; border: 0; }
hr.footnotes { margin-bottom: 50px; }
hr.footnotes + ol li { margin-bottom: 20px; }
sup { font-size: 55%; }</style><p><em>
Written in 1941 and published in 1951 in <em>Cahiers du Sud</em>, a literary
journal based in Marseilles, this letter is Simone Weil's response to the
journal's editor-in-chief Léon-Gabriel Gros who contributed two
chronicles, the first in October 1940 and the second in March 1941, that evoke
the responsibility of writers. Gros presents two theses in his two
contributions: The first thesis is the official thesis, that of the Vichy
government, that blames writers for the fall of France. The second is
that of the <em>Zone Libre</em>, which wishes writers would adopt a more moral
stance to help France.</em></p><p><em>
Throughout the translation I underlined some sentences. These sentences I believe
are either key points in Weil's argument, or are one of her assumptions, beliefs,
or observations.
Similarly the footnotes present are also entirely my own and represent my notes and
reflections about the text.
The original letter has neither footnotes nor underlines.</em></p><p><em>
About this translation: this translation is entirely my own. I have tried
to preserve the structure of the sentences and Weil's choice of words,
therefore some sentences may sound clunky. Throughout the letter Weil uses
the expression <em>le bon et le mal</em> which could either mean "the good
and the bad", or "the good and the evil". I have chosen to use the word
bad always instead of evil so as to avoid any extremization. In one
instance where Weil uses <em>opposition du bien et du mal</em> I have chosen
to translate it as "good vs. evil" as opposed to the more literal "opposition
of the good and the bad".<br></em></p><p><em>
Very brief summary: Writers have responsibilities (paragraph 2). And literature's
been losing value, but has not lost its prestige which adds responsibility
to the writers (paragraph 3). But there's more (paragraph 4). Because
the writers have always had access to and knowledge of values (paragraph 5).
But we have to act quickly before movements like Dadaism and Surrealism become
the norm (paragraph 6), just look at the words they, and those before them,
have used (paragraph 7). Now, the contemporaries show signs of lack of values
(paragraph 8). So let's go back to owning values (paragraph 9) by looking
inside (paragraph 10).</em></p>
<hr><p>
While reading Gros'<a href="#footnote-1-text" id="footnote-1-ref" class="footnote"><sup>1</sup></a> allusion to the controversy surrounding the
responsibility of writers, I was not able to resist going back to this
question and defend a point of view contrary to the journal's,
contrary to that of almost all who are sympathetic to me, and resembling
in form, by bad luck, that of people for whom I have no sympathy.</p><p>
<u>I believe in the responsibility of the writers of that era that just
passed</u> towards the misfortune of our time. By that I do not only mean the
defeat of France; the misfortune goes back a long way.
It goes around the whole world, that is in Europe, in America, and in
the other continents, for as long as Western influence has penetrated
there<a href="#footnote-2-text" id="footnote-2-ref" class="footnote"><sup>2</sup></a>.</p><p>
It is true, just as Mauriac<a href="#footnote-3-text" id="footnote-3-ref" class="footnote"><sup>3</sup></a> remarked, that the best contemporary
books are hardly ever read.
But the responsibility of the writers cannot be measured by the number
of books sold.
Because <u>the prestige of literature is immense</u>.
We can verify this if we consider the efforts done in the
past by some political groups to insure the names of popular writers for
demagogic goals.
For those who even the name of some popular writer is unknown to, do not
experience any less the prestige of that literature they are unaware of.
We have never read more than we are reading today.
<u>We do not read books, but we read mediocre and bad periodicals</u>; these
periodicals are everywhere, in the villages, in the suburbs, and now, due
to the effect of our time's literary customs, between the worst of these
periodicals and the best of our writers there are no ruptures of continuity.
This fact, which is known or rather confusedly felt by the public, adorns
in their eyes the most ignoble advertising firms with all the prestige
of high literature.
<u>There was, throughout the previous years, an incredible baseness</u>, such as
some sentimental consultations accorded by some known writers.
With no doubt all doesn't fall like so; a lot was needed.
But those who fell like so were not disavowed nor pushed-back by others;
they did not lose any consideration from their peers' milieu.
This ease of literary manners, <u>this tolerance of the baseness gives to
our most eminent writers, a responsibility in the demoralisation</u> of any
farm-girl who never left her village and who never heard their names<a href="#footnote-4-text" id="footnote-4-ref" class="footnote"><sup>4</sup></a>.</p><p>
But writers have a more direct responsibility.</p><p>
<u>The essential character of the first half of the 20<sup>th</sup> century
is the weakening and the near collapse of the notion of value</u>.
It is one of the rare phenomena that seem to be, for as long as we know,
really new in the history of humanity.
It could be, of course, the case that this happened before through periods
whose memory faded due to forgetfulness, as it could be later the case for
our time.
This phenomenon manifested itself in many domains foreign to literature,
even in everything.
The substitution for quantity of quality in the industrial production,
the discredit that fell on qualified work among the workers, the
substitution for the diploma of the culture as the goal of studies among
the studious youth<a href="#footnote-5-text" id="footnote-5-ref" class="footnote"><sup>5</sup></a> are some of these manifestations.
Even science does not hold the same criteria for value ever since the
abandonment of classical science<a href="#footnote-6-text" id="footnote-6-ref" class="footnote"><sup>6</sup></a>.
But writers were <em>par excellence</em>, the guardians of the lost
treasure<a href="#footnote-7-text" id="footnote-7-ref" class="footnote"><sup>7</sup></a>, and some drew vanity from that loss.</p><p>
Dadaism, and Surrealism are extreme cases.
They expressed the drunkenness of a total license, the drunkenness where
the soul dives when, rejecting any consideration for any value, it delivers
itself to the immediate.
<u>The good is the pole to which is necessarily oriented the human spirit</u>,
not only in action, but in all efforts, including the effort of pure
intelligence.
The surrealists erected as a model disoriented thinking; they chose
as supreme value the complete absence of values<a href="#footnote-8-text" id="footnote-8-ref" class="footnote"><sup>8</sup></a>.
The license has always intoxicated men, and is the reason why, all
along history cities were sacked.
But the sacking of cities didn't always have a literary equivalent.
Surrealism, as such, is an equivalent.</p><p>
<u>The other writers of the same period and the previous one did not go as
far, but almost all - three or four excluded - are more or less affected
by the same lack, the lack of any sense of value</u>.
Some words such as spontaneity, sincerity, gratuity, richness,
enrichment, words that imply an indifference almost entire to the
oppositions of values, started appearing more frequently under their
pens than words that relate to the good and the bad.
In fact this last species of words has degraded, especially those that
are related to the good, as Valéry<a href="#footnote-9-text" id="footnote-9-ref" class="footnote"><sup>9</sup></a> remarked some years ago.
Words such as virtue, nobility, honour, honesty, and generosity either
became almost impossible to pronounce or held a bastardised meaning; <u>the
language doesn't provide anymore any resource to legitimately praise the
character of a human</u>.
It provides a little bit more, but hardly any to praise the spirit; even
the word spirit itself, the words of intelligence, intelligent and
others similar have also been degraded.
The destiny of words portrays sensibly the progressive collapse of the
sense of value, and though this destiny does not depend on the writers,
we cannot prevent making them particularly responsible, since words are
part of their business.</p><p>
We have lately praised a lot, and righteously, the book of Bergson<a href="#footnote-10-text" id="footnote-10-ref" class="footnote"><sup>10</sup></a>; we
have talked a lot about the influence this work had on the thoughts and
the literature of our time.
However, in the center of the philosophy that guides his first three books
lies a notion essentially foreign to any account of value, maybe even
foreign to any account of life.
Not surprisingly in vain some willed this philosophy to be foundations to
Catholicism, which in fact did not need any, since its own are much older.
The work of Proust<a href="#footnote-11-text" id="footnote-11-ref" class="footnote"><sup>11</sup></a> is full of analyses that try to describe the state of
disoriented souls; the good only appears in rare moments, or as a
consequence of either memory, or beauty, eternity lets itself be sensed
throughout Time.
We could make analogous remarks on many writers before and certainly
after 1914.
In a general manner the literature of the 20<sup>th</sup> century is
essentially psychological.
And psychology consists of describing the state of souls in laying them
all out on the same plane without discriminating for values, <u>as if the
good and the bad are exterior to them</u>, and <u>as if the effort towards the
good could be absent at any moment during the thought process of any human</u><a href="#footnote-12-text" id="footnote-12-ref" class="footnote"><sup>12</sup></a>.</p><p>
The writers are not to be professors of morals, but they are to express the
human condition. Since <u>nothing is as essential to human life, for every
human and at every instant, as the good and the bad</u>.
When literature becomes in part indifferent to the problem of good
<em>vs.</em> evil, it betrays its function and cannot begin to claim
excellence<a href="#footnote-13-text" id="footnote-13-ref" class="footnote"><sup>13</sup></a>.
Racine made fun of Jansenists in his youth, but he stopped making fun of
them while writing <em>Phèdre</em>, and <em>Phèdre</em> is
his masterpiece.
From that point of view it is not true that there is a continuity in
French literature.
It is not true that Rimbaud and his successors (putting aside some
passages from A Season In Hell) follow the footsteps of Villon<a href="#footnote-14-text" id="footnote-14-ref" class="footnote"><sup>14</sup></a>.
Why does it matter that Villon has stolen?
The act of stealing became, on its part, probably an effect of
necessity, or probably a sin, but it was not an adventure nor a free
act.
The feeling of the good and the bad impregnates all his verses, just as
it impregnates all works that are not foreign to human destiny<a href="#footnote-15-text" id="footnote-15-ref" class="footnote"><sup>15</sup></a>.</p><p>
Certainly, there are things more foreign still to the good and the bad
than amorality, and it's a certain morality.
Those who blame now the popular writers are worth infinitely less<a href="#footnote-16-text" id="footnote-16-ref" class="footnote"><sup>16</sup></a>,
and the "redress" that some would like to impose will be much
worse than the state of things we're pretending to remedy.
If the current sufferings ever lead to a redress, it will certainly not
happen because of slogans, but through the silence and the solitude of
morality, through the penalties, the misery, the terrors, and in the
most intimate place in each spirit.</p>
<hr><p><em>
Note on the conclusion: Simone Weil is not being a professor of morals in
this letter, she is not teaching us or giving us a moral idea.
She is not trying to grade (by blaming) other writers, and seems to be
vehemently opposed to that idea. Instead what we see is a communication
of an observation by Simone Weil, she's pointing something out;
she's expressing a human condition.
Her observation is: if any redress or improvement needs to be done
then it will never come from another person, nor a group, nor their ideas,
nor their slogans. The only source of redress needs to come from within
every person, "in the silence and the moral solitude [...] and in the
most intimate place in each spirit". She's inviting us to dig deep within.</em></p>
<hr class="footnotes"><ol><li><span id="footnote-1-text" /></li></ol>
Léon-Gabriel Gros (1905-1985) was the editor-in-chief of
<em>Cahiers du Sud</em> who contributed two chronicles about the responsibility
of writers. Read also the introduction of this translation.
<a href="#footnote-1-ref">[back]</a><li><span id="footnote-2-text" /></li>
On the surface this may look like an Americo-Eurocentric position, but one must
keep in mind that Simone Weil read the Mahabharata and knew Sanskrit, so she is
very well aware of the rest of the world.<br>
In my opinion, in this sentence she only means that Western thinking is a
fundamental part of this misfortune, she doesn't seem to find that misfortune
in the East.
<a href="#footnote-2-ref">[back]</a><li><span id="footnote-3-text" /></li>
François Charles Mauriac (1885-1970) was a writer, journalist, member of the
<em>Académie Française</em>, and Nobel Literature prize laureate.
<a href="#footnote-3-ref">[back]</a><li><span id="footnote-4-text" /></li>
So we can conclude that it is the responsibility of <em>eminent</em> writers
to seek out the readers that do not know them. In other words, to be promoted
from a writer to an eminent writer it is one's job to advertise oneself and give
as broad an audience as possible to one's writings.
<a href="#footnote-4-ref">[back]</a><li><span id="footnote-5-text" /></li>
These are supposed to be arguments for the collapse of values. It is true that
these examples represent the collapse of the <em>old</em> values, but I
do not think it is fair to describe these as only a collapse. To me these
represent a shift, a change, a substitution as she says, of values. So people
are not losing their values, they are just changing them.
<a href="#footnote-5-ref">[back]</a><li><span id="footnote-6-text" /></li>
I do not know what Weil means by classical science. The best guess I can provide
is that she means natural philosophy by classical science, i.e. science as it
was up to the late 19<sup>th</sup> century before the rise of modern science.
<a href="#footnote-6-ref">[back]</a><li><span id="footnote-7-text" /></li>
Imagine you were locked up underground during a nuclear explosion. You go back
up to the world and realize the only survivors are a handful of children.
Will you not feel responsible towards their education? Will you not tell them
the stories of old, or teach them what it is to be human?<br>
Similarly, I imagine Weil would believe that it is the responsibility of religious
hermits and recluses to go back to society and share the revelations they
receive in the wild.
<a href="#footnote-7-ref">[back]</a><li><span id="footnote-8-text" /></li>
The Dadaists and the Surrealists have explicitly adopted the antithesis of
Simone Weil, so they are "the enemy", it is not surprising to see that paragraph
end the way it does.
<a href="#footnote-8-ref">[back]</a><li><span id="footnote-9-text" /></li>
Ambroise Paul Toussaint Jules Valéry (1871-1945) was a writer and philosopher.
He was nominated for the Nobel Literature prize 12 times.
<a href="#footnote-9-ref">[back]</a><li><span id="footnote-10-text" /></li>
Henri Bergson (1859-1941) was a philosopher. I am not knowledgeable enough to infer
the exact ideas that offend Weil.
<a href="#footnote-10-ref">[back]</a><li><span id="footnote-11-text" /></li>
The work Weil alludes to is Marcel Proust's (1871-1922) In Search of Lost Time
(1913-1927, earlier title is: Remembrance of Things Past).
<a href="#footnote-11-ref">[back]</a><li><span id="footnote-12-text" /></li>
I suppose we can conclude that for Simone Weil the good and the bad, ethics,
and morals are never entirely exterior to anyone.<br>
We could also infer that, for Weil, psychology (as she interpreted it, or as
the standard interpretation of the time was) is a futile project.
<a href="#footnote-12-ref">[back]</a><li><span id="footnote-13-text" /></li>
Therefore literature is <em>primarily</em> about the good <em>vs.</em> the evil.
Its function is to tell moral stories about human lives.
<a href="#footnote-13-ref">[back]</a><li><span id="footnote-14-text" /></li>
François Villon (1431-1463) is the best known French medieval poet. His
(short) life was shrouded in mystery and crime.<br>
In June 1455 he committed his first crime: killing Philippe Chermoye, a priest who
attacked him first (or so were the priest's last words) after a brawl started.
Since self-defense was not a legal excuse at the time Villon had to suffer
banishment from Paris, his place of birth.<br>
The following January Villon received two pardons from King Charles VII, one for
<em>François des Loges, aka Villon</em>, and the other for <em>François
de Montcorbier</em>, <em>de Loges</em> and <em>de Montcorbier</em> appear both
in his official documents. He used the last name <em>Villon</em> to refer to
himself in his writings. Villon is the last name of his foster father.<br>
The following December Villon participated in robbing the chapel of <em>Collège
de Navarre</em> which went unnoticed for some six months. It is said that Villon
fled Paris shortly after committing the crime. Anyways, in November 1462 Villon was
arrested for an unrelated crime. During his arrest the <em>Collège de Navarre</em>
theft was revived and he was sentenced to death, which was later commuted to
banishment in January 1463.<br>
Nothing is known of his whereabouts after January 1463 and it is assumed he died
soon after being banished.<br>
The question that follows in the letter refers to the theft of the chapel.
<a href="#footnote-14-ref">[back]</a><li><span id="footnote-15-text" /></li>
Contrast this with the way Weil describes Proust's In Search of Lost Time.
<a href="#footnote-15-ref">[back]</a><li><span id="footnote-16-text" /></li>
I think this sentence is very nuanced. On the surface it may seem that Weil is,
herself, blaming writers such as the Dadaists and the Surrealists in paragraph 6.
So she must be making a distinction between popular writers (<em>écrivains
célèbres</em>) and eminent writers (<em>les écrivains les
plus éminents</em> in paragraph 3).<br>
The only (arguably) popular writers she criticises in the letter are Bergson, Proust,
and Rimbaud. She never blamed any of these writers for
the loss of values (even if they personally, or their work, lacked values),
she only refers to them as exhibiting the symptoms of loss of values.
At worst she blames them for <em>unknowingly</em> spreading the loss of values.
<a href="#footnote-16-ref">[back]</a>
https://blog.grgz.me/posts/responsibilities-of-literature.htmlSat, 05 Dec 2020Upon Reading a Friend's Message about the Awesomeness of the Kosaraju Algorithm<style>
.poetry { counter-reset: poetry; max-width: 500px; margin: auto; margin-left: 100px; }
.poetry ol { padding: 0; margin: 0; padding-left: 20pt; }
.poetry li { list-style: none; counter-increment: poetry; text-indent: -20pt; margin-right: 30px; }
.poetry .empty-line { margin-bottom: 30pt; }
.poetry li:nth-child(5n):after { content: counter(poetry); display: inline-block; float: right; margin-right: -30px; font-size: 10px; }</style>
<div class="poetry"><ol><li>Let us say you reader work as a match-maker</li><li>And a client complain'd about a deal-breaker.</li><li>"She dated a dear friend before" was what he said,</li><li>"I'd find it most abhor if we had shared a bed.</li><li>Not known to my social circle make sure she be,</li><li>Or by god, this bad-buck business I'll rate shabby."
<div class='empty-line'></div></li><li>A problem like this one dear ol' Kosaraju</li><li>Tried to solve. But was it more than what he could chew?</li><li>Fear not, for foresight sleekly set him out to find</li><li>An algorithm so neat; a one-of-a-kind.</li><li>One that runs linearly in time and in space,</li><li>Designed neatly to be speedy and full of grace.
<div class='empty-line'></div></li><li>You must interpret first people as a network</li><li>Where Bob connects to Alice if he was a jerk,</li><li>or a lover, or simply in acknowledgement.</li><li>Then consume one strongly connected component</li><li>That clearly does not include your careful client</li><li>And pick potentials from that pool to be compliant.
<div class='empty-line'></div></li><li>But how do you go about finding a strongly</li><li>Connected component from a graph so quickly?</li><li>Precisely that process Kosaraju and kin</li><li>Published the year after nineteen seventy sev'n.
<div class='empty-line'></div></li><li>Let list L from holding elements be acquit,</li><li>First step is to set all vertices to-visit.</li><li>Then one by one through the vertices you should loop</li><li>And view each as visited, and its neighbors group</li><li>To recurse on them after in L they'd instill</li><li>If the vertex was marked to-visit, else nil.</li><li>When pushing the neighbors be certain to prepend</li><li>Cause the next step will crucially on it depend:</li><li>To go in twine in the L-defined-order 'long</li><li>The vertices and assign each itself belong.</li><li>Where assignment is defined being twixt two nodes</li><li>For former in the latter's assignment it abodes</li><li>And its neighbors' assignment's nest in the latter</li><li>If the former is assigned, else nil's the matter.
<div class='empty-line'></div></li><li>For your match-making mess mister reader rummage</li><li>Through those client-unassigned to avoid damage.</li></ol>
</div>
https://blog.grgz.me/posts/kosaraju_poem.htmlSun, 08 Mar 2020Recursion Elimination — Or how to make pretty code ugly<h2 id="sec-1">Introduction</h2><p>
Recursive code is the epitome of beautiful (read elegant, pretty, concise...) code, but is the bane of performance.
A classic example is the Fibonacci function. Here's the slow yet elegant recursive version:</p><code class="block">
<pre>int fibonacci(int n) {</pre>
<pre> return n < 2 ? 1 : fibonacci(n-1) + fibonacci(n-2);</pre>
<pre>}</pre></code><p>
And here is the fast yet unenlightening iterative version:</p><code class="block">
<pre>int fibonacci(int n) {</pre>
<pre> int current = 1; int next = 1; int temp;</pre>
<pre> while (n >= 2) {</pre>
<pre> temp = current + next;</pre>
<pre> current = next;</pre>
<pre> next = temp;</pre>
<pre> n--;</pre>
<pre> }</pre>
<pre> return next;</pre>
<pre>}</pre></code><p>
The other day I stumbled upon James Koppel's blog post (and video presentation) <a href="http://www.pathsensitive.com/2019/07/the-best-refactoring-youve-never-heard.html" target="_blank">The best refactoring you've never heard of</a> in which he presents a framework to transform recursive functions into iterative ones.
What's the catch?
Firstly, the transformation is hard to automate.
Secondly, the resulting code's asymptotic running time is identical to the recursive one's.
So forget about transforming the recursive <em>Θ(φ<sup>n</sup>)</em> Fibonacci into the iterative <em>Θ(n)</em> code by applying the transformation mindlessly.
And finally, the transformed code is just really ugly...
Regardless, I think this transformation could be helpful.</p><p>
In this blog post I'll guide you through the steps to transform a few complicated recursive functions into iterative ones.
In the process I hope to present the motivation behind this transformation, and to go deeper in the explanations of the components that make it up.
For the curious, here's an example taken from Koppel's post that showcases the transformation:</p><code class="block">
<pre>// RECURSIVE VERSION</pre>
<pre>void printTree(Tree tree) {</pre>
<pre> if (tree != null) {</pre>
<pre> printTree(tree.left);</pre>
<pre> print(tree.node);</pre>
<pre> printTree(tree.right);</pre>
<pre> }</pre>
<pre>}</pre>
<pre></pre>
<pre>// ITERATIVE VERSION</pre>
<pre>class Cont {</pre>
<pre> Tree tree;</pre>
<pre> Cont next;</pre>
<pre>}</pre>
<pre> </pre>
<pre>void printTree(Tree tree, Cont cont) {</pre>
<pre> while (true) {</pre>
<pre> if (tree != null) {</pre>
<pre> cont = new Cont(tree, cont);</pre>
<pre> tree = tree.left;</pre>
<pre> } else {</pre>
<pre> if (cont != null) {</pre>
<pre> print(cont.tree.node);</pre>
<pre> tree = cont.tree.right;</pre>
<pre> cont = cont.next;</pre>
<pre> } else return;</pre>
<pre> }</pre>
<pre> }</pre>
<pre>}</pre></code><h2 id="sec-2">Recursive code that is easy to rewrite</h2><p>
Tail-recursive functions, i.e. functions whose last instruction is a recursive call, can be easily transformed into iterative code.
Consider the following code that prints a linked list.
We can transform it into iterative code by following these instructions:</p><ol><li>Wrapping the body of the function with a <code>while (true) { ... }</code>,</li><li>adding a <code>break;</code> (or a <code>return</code>) in the branches that contain no recursive calls,</li><li>and replacing the recursive call with a re-assignment of the function's arguments (beware that the order of assignment may be important!).</li></ol><code class="block">
<pre>// RECURSIVE</pre>
<pre>void printList(LinkedList list) {</pre>
<pre> if (list == null) {</pre>
<pre> print("()");</pre>
<pre> } else {</pre>
<pre> print(list.data + " -> ");</pre>
<pre> printList(list.next);</pre>
<pre> }</pre>
<pre>}</pre>
<pre></pre>
<pre>// ITERATIVE AFTER DOING TAIL CALL OPTIMIZATION</pre>
<pre>void printList(LinkedList list) {</pre>
<pre> while (true) {</pre>
<pre> if (list == null) {</pre>
<pre> print("()");</pre>
<pre> break;</pre>
<pre> } else {</pre>
<pre> print(list.data + " -> ");</pre>
<pre> list = list.next;</pre>
<pre> }</pre>
<pre> }</pre>
<pre>}</pre></code><p>
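The warning about the order of assignment matters as soon as one new argument is computed from the old value of another. As a minimal sketch, written in D (the language used later in this post), take Euclid's algorithm; the names <code>gcdRec</code> and <code>gcdIter</code> are made up for this illustration and appear nowhere else in the post:</p><code class="block">
<pre>import std.stdio : writeln;</pre>
<pre></pre>
<pre>// RECURSIVE (the recursive call is the last instruction)</pre>
<pre>int gcdRec(int a, int b) {</pre>
<pre>    if (b == 0) return a;</pre>
<pre>    return gcdRec(b, a % b);</pre>
<pre>}</pre>
<pre></pre>
<pre>// ITERATIVE: the new b is computed from the OLD a, so a is saved first</pre>
<pre>int gcdIter(int a, int b) {</pre>
<pre>    while (true) {</pre>
<pre>        if (b == 0) return a;</pre>
<pre>        int oldA = a; // writing "a = b; b = a % b;" would compute b % b, i.e. 0</pre>
<pre>        a = b;</pre>
<pre>        b = oldA % b;</pre>
<pre>    }</pre>
<pre>}</pre>
<pre></pre>
<pre>void main() {</pre>
<pre>    assert(gcdRec(48, 18) == 6);</pre>
<pre>    assert(gcdIter(48, 18) == 6);</pre>
<pre>    writeln(gcdIter(48, 18)); // prints 6</pre>
<pre>}</pre></code><p>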
For some functions that operate on the result of the recursive call, e.g. factorial, we can use accumulators to transform the function into a tail-recursive one.
Here's the factorial example:</p><code class="block">
<pre>// TAIL-RECURSIVE (WITH AN ACCUMULATOR)</pre>
<pre>int factorial(int n, int acc = 1) {</pre>
<pre> if (n < 2)</pre>
<pre> return acc;</pre>
<pre> return factorial(n-1, acc*n);</pre>
<pre>}</pre>
<pre></pre>
<pre>// ITERATIVE</pre>
<pre>int factorial(int n) {</pre>
<pre> int acc = 1;</pre>
<pre> while (true) {</pre>
<pre> if (n < 2)</pre>
<pre> return acc;</pre>
<pre> acc = acc * n; // the order of assignment is important here!</pre>
<pre> n = n - 1;</pre>
<pre> }</pre>
<pre>}</pre></code><h2 id="sec-3">Quest for a general algorithm</h2><h3 id="sec-3.1">Hand wavy description</h3><p>
The steps described here are the same ones presented in Koppel's presentation.
To give value to my presentation and not regurgitate Koppel's, I have tried to motivate the same steps from a different perspective.</p><p>
How can we rewrite any recursive function iteratively?
One approach is to see if this problem reduces to the simple case, i.e. we ask ourselves, can we rewrite any recursive function in a tail-recursive way?
It's not clear how to accomplish that; let's see why.
Consider the recursive <code>printTree</code> code: the code after <code>printTree(tree.left)</code> (lines 5-6) is code that must <em>continue</em> to run once that call returns.
We can't just stop after the first recursive call, and the second recursive call prevents us from using the accumulator trick.
The key insight is to recognize that lines 5 and 6 are the <em>continuation</em> of the first recursive call (i.e. what is left to be done).
Rewriting the code in <a href="https://en.wikipedia.org/wiki/Continuation-passing_style" target="_blank">continuation-passing style</a> (CPS) will abstract the remaining recursive calls into the continuation.
At this stage we end up with a function with a single (albeit complicated) function call.
This function call will be none other than the first recursive call.
Now our function is a tail-recursive one that passes around closures (the continuations).
Alas these closures carry around an environment of their own, so we still can't do tail-call optimization.
The second and last missing piece of the puzzle is <a href="https://en.wikipedia.org/wiki/Defunctionalization" target="_blank">Defunctionalization</a>.
This trick allows us to encode higher-order functions in terms of data.
Putting everything together: encoding the continuations transforms the code into a recursive function (a couple actually) that we can apply TCO to.</p><p>The transformation, which I dub <em>Recursion Elimination</em>, consists of four stages.</p><ol><li>Rewriting in continuation-passing style (CPS)</li><li>Defunctionalizing</li><li>Inlining</li><li>Doing tail-call optimization (TCO)</li></ol><p>The last step, tail-call optimization, may need to be done twice: before and after inlining.</p><h3 id="sec-3.2">Are these steps tractable for more complicated code?</h3><p>
Sure, <code>printTree</code> is a cute example, but I wondered if the transformation would be humanly tractable to apply to more complicated recursive code.
So I set about rewriting <a href="https://www.youtube.com/watch?v=OyfBQmvr2Hc" target="_blank">The Most Beautiful Program Ever Written</a> (<strong>spoiler:</strong> it's a Lisp interpreter) in an iterative way.
The resulting code, far from being called the ugliest program ever written, sure is cryptic and doubtlessly unmanageable.
Transforming it was a challenge at first and so I've definitely learned from it.
Furthermore the result was up to <em>24% faster</em> than the recursive version, and it never segfaults (in practice)!
I also noticed that with enough practice this sort of transformation could be mindlessly done quickly by hand.</p><p>
<em>Disclaimer</em>: I didn't write a Lisp interpreter but rather a lambda calculus evaluator.
The language that we will implement will be very tiny.
It consists of variables, abstractions (aka functions), and [function] applications.
The only hard coded function that we will support is <code>print</code> that prints to stdout whatever it is given and returns it as its value.
The interpreter is written in the <a href="https://dlang.org" target="_blank">D programming language</a>, and all code samples from here on will be in that language.
D is similar to C/C++; however, here are a few things that you might not know:<ol><li>Function arguments can have default values.</li><li>Besides <code>func(val)</code> for calling functions, it's also possible to write <code>val.func()</code> or <code>val.func</code>.
In general <code>func(val, ...)</code> can be rewritten as <code>val.func(...)</code>.
Empty parentheses can be omitted.</li><li>Closures, or functions with context, are called delegates.
The type of a delegate is <code>R delegate(T0, T1, ...)</code> where <code>R</code> is the return type, and <code>T0, T1, ...</code> are the argument types.</li><li>Structs can be stored in the heap by doing <code>new StructName(...)</code> which returns a pointer to this struct, or on the stack by doing <code>StructName(...)</code></li><li>D has a garbage collector.</li></ol></p><h2 id="sec-4">Interim — Case study: Lambda calculus evaluator</h2><p>The program that we will transform is the following:</p><code class="block">
<pre>enum TType { VAR, APP, ABS };</pre>
<pre></pre>
<pre>struct Term {</pre>
<pre> TType type;</pre>
<pre> string a; // a is either the variable name, or the function's only parameter</pre>
<pre> Term* t1; // t1 is the body of the function, or the left term in an application</pre>
<pre> Term* t2; // t2 is the right term in an application</pre>
<pre>}</pre>
<pre></pre>
<pre>alias Env = Term*[string]; // a hashmap whose keys are strings and values are Term*</pre>
<pre></pre>
<pre>Term* dup(Term* t) {</pre>
<pre> if (t == null) return null;</pre>
<pre> return new Term(t.type, t.a, dup(t.t1), dup(t.t2));</pre>
<pre>}</pre>
<pre></pre>
<pre>Term* beta(Term* term, string var, Term* val) {</pre>
<pre> if (term == null) return term;</pre>
<pre> final switch (term.type) {</pre>
<pre> case TType.VAR:</pre>
<pre> return term.a == var ? val : term;</pre>
<pre> case TType.ABS:</pre>
<pre> if (term.a == var) return term;</pre>
<pre> term.t1 = beta(term.t1, var, val);</pre>
<pre> return term;</pre>
<pre> case TType.APP:</pre>
<pre> term.t1 = beta(term.t1, var, val);</pre>
<pre> term.t2 = beta(term.t2, var, val);</pre>
<pre> return term;</pre>
<pre> }</pre>
<pre>}</pre>
<pre></pre>
<pre>Term* eval(Term* term, Env env, void delegate(Term*, int) interfunc, int depth = 0) {</pre>
<pre> interfunc(term, depth);</pre>
<pre></pre>
<pre> final switch (term.type) {</pre>
<pre> case TType.VAR:</pre>
<pre> if (term.a == "print") {</pre>
<pre> return term;</pre>
<pre> } else if ((term.a in env) !is null) {</pre>
<pre> return eval(dup(env[term.a]), env, interfunc, depth);</pre>
<pre> } else assert(false, "Unbound variable " ~ term.a);</pre>
<pre> </pre>
<pre> case TType.APP:</pre>
<pre> if (term.t1.type == TType.ABS) {</pre>
<pre> term.t1.t1 = beta(term.t1.t1, term.t1.a, dup(term.t2));</pre>
<pre> return eval(term.t1.t1, env, interfunc, depth);</pre>
<pre> } else {</pre>
<pre> if (term.t1.type == TType.VAR && term.t1.a == "print") {</pre>
<pre> writefln("[print] %s", term.t2.toString);</pre>
<pre> return eval(term.t2, env, interfunc, depth-1);</pre>
<pre> }</pre>
<pre> term.t1 = eval(term.t1, env, interfunc, depth+1);</pre>
<pre> term.t2 = eval(term.t2, env, interfunc, depth+1);</pre>
<pre> return eval(term, env, interfunc, depth);</pre>
<pre> }</pre>
<pre> </pre>
<pre> case TType.ABS:</pre>
<pre> return term;</pre>
<pre> }</pre>
<pre>}</pre></code><p>
Three functions are omitted since they are not relevant to this blog post.
They are: the parsing function, the <code>string toString(Term* t)</code> function that displays a term, and the main function that runs the whole show.
You can find these in <a href="https://github.com/geezee/typeless/tree/13f629e9039a3af0ef10673ad06bdf521fca1ef4" target="_blank">typeless.git</a> (there's a REPL to play around with!).</p><p>There are three recursive functions that we will rewrite iteratively:<ol><li><code>Term* dup(Term* t)</code> creates a deep copy of a term.</li><li><code>Term* beta(Term* t, string var, Term* val)</code> replaces all the occurrences of the variable <code>var</code> by the value <code>val</code>.
For example <code>beta(parse("((lambda x x) x) x"), "x", parse("y"))</code> produces <code>((lambda x x) y) y</code>.
Notice that the outer <code>x</code>s were replaced while the one inside the lambda was not (since it's shadowed by the lambda's argument).</li><li><code>Term* eval(Term* term, Env env, ...)</code> reduces the given term until it's a value (i.e. not possible to reduce it any further).
A variable is reduced by looking up its value in the environment.
If the variable is not defined (i.e. not present in the environment) then execution halts with an error message.<br>
The remaining two arguments, <code>interfunc</code> and <code>depth</code>, are for debugging purposes.
Between every reduction a call to <code>interfunc</code> happens.
The arguments to it are the current term and the depth of the reduction.
I use this function to pretty-print the evaluation as it is happening, and to count the total number of reductions.</li></ol>
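</p><p>To make the <code>beta</code> example above concrete without the omitted parser, here is a minimal sketch that builds <code>((lambda x x) x) x</code> by hand and checks which <code>x</code>s get replaced. The helpers <code>mkVar</code>, <code>mkAbs</code>, and <code>mkApp</code> are hypothetical stand-ins for <code>parse</code>; <code>Term</code> and <code>beta</code> are copied from the listing above.</p><code class="block">
<pre>import std.stdio : writeln;</pre>
<pre></pre>
<pre>enum TType { VAR, APP, ABS }</pre>
<pre></pre>
<pre>struct Term { TType type; string a; Term* t1; Term* t2; }</pre>
<pre></pre>
<pre>// hypothetical constructors, standing in for the omitted parser</pre>
<pre>Term* mkVar(string name) { return new Term(TType.VAR, name, null, null); }</pre>
<pre>Term* mkAbs(string param, Term* inner) { return new Term(TType.ABS, param, inner, null); }</pre>
<pre>Term* mkApp(Term* f, Term* x) { return new Term(TType.APP, "", f, x); }</pre>
<pre></pre>
<pre>// same beta as in the listing above</pre>
<pre>Term* beta(Term* term, string var, Term* val) {</pre>
<pre>    if (term == null) return term;</pre>
<pre>    final switch (term.type) {</pre>
<pre>        case TType.VAR:</pre>
<pre>            return term.a == var ? val : term;</pre>
<pre>        case TType.ABS:</pre>
<pre>            if (term.a == var) return term; // var is shadowed, leave the body alone</pre>
<pre>            term.t1 = beta(term.t1, var, val);</pre>
<pre>            return term;</pre>
<pre>        case TType.APP:</pre>
<pre>            term.t1 = beta(term.t1, var, val);</pre>
<pre>            term.t2 = beta(term.t2, var, val);</pre>
<pre>            return term;</pre>
<pre>    }</pre>
<pre>}</pre>
<pre></pre>
<pre>void main() {</pre>
<pre>    // ((lambda x x) x) x</pre>
<pre>    Term* t = mkApp(mkApp(mkAbs("x", mkVar("x")), mkVar("x")), mkVar("x"));</pre>
<pre>    Term* r = beta(t, "x", mkVar("y"));</pre>
<pre>    assert(r is t);              // beta rewrites the term in place</pre>
<pre>    assert(r.t2.a == "y");       // outermost x replaced</pre>
<pre>    assert(r.t1.t2.a == "y");    // argument of the lambda replaced</pre>
<pre>    assert(r.t1.t1.t1.a == "x"); // x inside the lambda untouched</pre>
<pre>    writeln("ok");</pre>
<pre>}</pre></code><p>Because <code>beta</code> works in place, callers that still need the original term have to copy it first, which is presumably why <code>eval</code> calls <code>dup</code> before substituting.</p><p>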
I figured these functions are complicated enough to test the applicability of <em>Recursion Elimination</em>.</p><h2 id="sec-5">Continuation-passing style (CPS)</h2><h3 id="sec-5.1">Motivation</h3><p><strong>Very briefly</strong> Code written in CPS is code that explicitly defines what it does after every instruction.</p><p>
Let's examine our opening statement again: "Recursive code is the epitome of beautiful code".
If we consider "complete" (self-descriptive) code as a criterion for beauty then our recursive Fibonacci example is surprisingly not beautiful.
Consider this small variation:</p><code class="block">
<pre>int fibonacci(int n) {</pre>
<pre> writef("n=%d; ", n);</pre>
<pre> return n < 2 ? 1 : fibonacci(n-1) + fibonacci(n-2);</pre>
<pre>}</pre></code><p>
What will we read on the console if we call <code>fibonacci(3)</code>?
Without knowing much about D we can't know whether the output will start with <code>n=3; n=1; n=2;</code>, or with <code>n=3; n=2; n=1;</code>, or be an infinite stream of numbers.
The semantics of this function depend on the order of evaluation of <code>fibonacci(n-1) + fibonacci(n-2)</code>; i.e. which side of <code>+</code> is evaluated first.
They also depend on whether the branches of the ternary operator <code>x ? y : z</code> are all evaluated first (hence the infinite stream of numbers) or not.
In C the order in which the operands of <code>+</code> are evaluated is unspecified, so the equivalent program compiled with <code>gcc</code> may print either order.
On the other hand, in GNU ELisp, where the <a href="https://gnu.org/software/emacs/manual/html_node/elisp/Function-Forms.html" target="_blank">order of evaluation</a> is specified (arguments are evaluated from left to right), we know the equivalent program's output will start with <code>n=3; n=2; n=1;</code>.
Hence we cannot attest to the completeness of this recursive function as its semantics depend on the host language.</p><p>A trivial fix is to rewrite the code with assignments like so:</p><code class="block">
<pre>int fibonacci(int n) {</pre>
<pre> writefln("computing fibonacci %d", n);</pre>
<pre> if (n < 2) return 1;</pre>
<pre> int left = fibonacci(n-1);</pre>
<pre> int right = fibonacci(n-2);</pre>
<pre> return left + right;</pre>
<pre>}</pre></code><p>
But who's to say the compiler (or a successor programmer) won't shuffle the order?
As motivated in the outline above we really should transform everything into a single instruction since order of evaluation cannot be ambiguous when there's only one instruction to be evaluated.
We know that after executing <code>fibonacci(n-1)</code> we have to call the function<br>
<code>(int left) { int right = fibonacci(n-2); return left + right; }</code> with <code>left</code> being whatever the first computation produced.
To achieve this we will rewrite <code>fibonacci</code> in <a href="https://en.wikipedia.org/wiki/Continuation-passing_style" target="_blank">continuation-passing style</a> like so:</p><code class="block">
<pre>int fibonacci(int n, int delegate(int) continuation) {</pre>
<pre> writefln("computing fibonacci %d", n);</pre>
<pre> if (n < 2) return continuation(1);</pre>
<pre> return fibonacci(n-1, (int left) {</pre>
<pre> return fibonacci(n-2, (int right) {</pre>
<pre> return continuation(left + right);</pre>
<pre> });</pre>
<pre> });</pre>
<pre>}</pre></code><p>
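To actually run the CPS version, something has to be passed in as the very first continuation. Here is a minimal sketch, assuming the identity function plays that role (nothing is left to be done, so the answer is handed back as is). The global <code>trace</code> is a hypothetical addition that replaces the <code>writefln</code> call so that the order of the calls can be checked:</p><code class="block">
<pre>import std.stdio : writeln;</pre>
<pre></pre>
<pre>int[] trace; // hypothetical: records the argument of every call, in order</pre>
<pre></pre>
<pre>int fibonacci(int n, int delegate(int) continuation) {</pre>
<pre>    trace ~= n;</pre>
<pre>    if (n < 2) return continuation(1);</pre>
<pre>    return fibonacci(n-1, (int left) {</pre>
<pre>        return fibonacci(n-2, (int right) {</pre>
<pre>            return continuation(left + right);</pre>
<pre>        });</pre>
<pre>    });</pre>
<pre>}</pre>
<pre></pre>
<pre>void main() {</pre>
<pre>    // the identity continuation kicks off the computation</pre>
<pre>    int result = fibonacci(3, (int x) { return x; });</pre>
<pre>    assert(result == 3);</pre>
<pre>    // fibonacci(n-1) always runs before fibonacci(n-2): the order is now</pre>
<pre>    // spelled out by the code instead of being left to the host language</pre>
<pre>    assert(trace == [3, 2, 1, 0, 1]);</pre>
<pre>    writeln(result, " ", trace);</pre>
<pre>}</pre></code><p>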
Continuations are very useful; their usage spans many more applications.
Continuations allow you to explicitly specify the control flow.
A language equipped with continuations can use them to simulate try/catch, return, loops, and even "time-traveling".
I recommend reading <a href="http://matt.might.net/articles/programming-with-continuations--exceptions-backtracking-search-threads-generators-coroutines/" target="_blank">Continuations by example: Exceptions, time-traveling search, generators, threads, and coroutines</a> by Matt Might.
Or you can watch William Byrd's excellent video tutorial <a href="https://www.youtube.com/watch?v=2GfFlfToBCo" target="_blank">Intro to continuations, call/cc, and CPS</a>.</p><h3 id="sec-5.2">High-level algorithm for CPS-ing a recursive function</h3><ol><li>Add an extra argument <code>cont</code> to your function.</li><li>In all branches where there are no recursive calls, replace <code>return value;</code> with <code>return cont(value);</code>.</li><li>In all branches where there are recursive calls, add <code>cont</code> as a parameter to the <em>last</em> recursive call.</li><li>Wrap the instructions that follow every recursive call in a function.
This function will be the extra parameter to the recursive call.</li></ol><h3 id="sec-5.3">CPS-ing dup, beta, and eval</h3><p>There is not much to be said about this step of the transformation here, so here's the transformed code:</p><code class="block">
<pre>Term* dupCPS(Term* t, Term* delegate(Term*) cont) {</pre>
<pre> if (t == null) return cont(null);</pre>
<pre> return dupCPS(t.t1, (Term* ans1) {</pre>
<pre> return dupCPS(t.t2, (Term* ans2) {</pre>
<pre> return cont(new Term(t.type, t.a, ans1, ans2));</pre>
<pre> });</pre>
<pre> });</pre>
<pre>}</pre>
<pre></pre>
<pre>Term* betaCPS(Term* term, string var, Term* val,</pre>
<pre> Term* delegate(Term*) cont = (Term* t) { return t; }) {</pre>
<pre> if (term == null) return cont(null);</pre>
<pre> if (term.type == TType.VAR) {</pre>
<pre> return cont(term.a == var ? val : term);</pre>
<pre> } else if (term.type == TType.ABS) {</pre>
<pre> if (term.a == var) return cont(term);</pre>
<pre> return betaCPS(term.t1, var, val, (Term* ans) {</pre>
<pre> term.t1 = ans;</pre>
<pre> return cont(term);</pre>
<pre> });</pre>
<pre> } else {</pre>
<pre> return betaCPS(term.t1, var, val, (Term* ans) {</pre>
<pre> term.t1 = ans;</pre>
<pre> return betaCPS(term.t2, var, val, (Term* ans) {</pre>
<pre> term.t2 = ans;</pre>
<pre> return cont(term);</pre>
<pre> });</pre>
<pre> });</pre>
<pre> }</pre>
<pre>}</pre>
<pre></pre>
<pre>Term* evalCPS(Term* term, Env env, void delegate(Term*, int) interfunc, int depth = 0,</pre>
<pre> Term* delegate(Term*) cont = (Term* t) { return t; }) {</pre>
<pre> interfunc(term, depth);</pre>
<pre> </pre>
<pre> if (term.type == TType.VAR) {</pre>
<pre> if (term.a == "print") {</pre>
<pre> return cont(term);</pre>
<pre> } else if ((term.a in env) !is null) {</pre>
<pre> return evalCPS(env[term.a].dup, env, interfunc, depth, cont);</pre>
<pre> } else assert(false, "Unbound variable " ~ term.a);</pre>
<pre> } else if (term.type == TType.APP) {</pre>
<pre> if (term.t1.type == TType.ABS) {</pre>
<pre> term.t1.t1 = betaCPS(term.t1.t1, term.t1.a, term.t2.dup);</pre>
<pre> return evalCPS(term.t1.t1, env, interfunc, depth, cont);</pre>
<pre> } else {</pre>
<pre> if (term.t1.type == TType.VAR && term.t1.a == "print") {</pre>
<pre> writefln("[print] %s", term.t2.toString);</pre>
<pre> return evalCPS(term.t2, env, interfunc, depth-1, cont);</pre>
<pre> }</pre>
<pre> return evalCPS(term.t1, env, interfunc, depth+1, (Term* ans) {</pre>
<pre> term.t1 = ans;</pre>
<pre> return evalCPS(term.t2, env, interfunc, depth+1, (Term* ans) {</pre>
<pre> term.t2 = ans;</pre>
<pre> return evalCPS(term, env, interfunc, depth, cont);</pre>
<pre> });</pre>
<pre> });</pre>
<pre> }</pre>
<pre> } else {</pre>
<pre> return cont(term);</pre>
<pre> }</pre>
<pre>}</pre></code><h2 id="sec-6">Defunctionalization</h2><h3 id="sec-6.1">Motivation</h3><p><strong>Very briefly</strong> Defunctionalizing some code is roughly equivalent to writing a tiny interpreter for that code in the host language.</p><p>
John C. Reynolds introduced defunctionalization in 1972 in his paper <a href="https://surface.syr.edu/cgi/viewcontent.cgi?article=1012&context=lcsmith_other" target="_blank">Definitional interpreters for higher-order programming languages <sup>[PDF]</sup></a> to address these issues:</p><ol><li>Eliminating higher-order functions from interpreters.
Interpreters free of higher-order functions can be implemented using low-level languages (e.g. C) that do not support them.
(side-note: C supports function pointers which allow you to pass and return functions, but it does not support closures.)</li><li>Making the order of evaluation in the interpreter explicit.
This issue was discussed in detail in the section about CPS.
But I suppose you agree with me that making the semantics of a language dependent on the semantics of the interpreter's host language is a horrible idea.</li></ol><p>CPS solves the second issue, and defunctionalization solves the first.
Briefly, the idea is to pass descriptions of functions (which are data) rather than closures.
If <em>f</em> is a function, then all applications <em>f(x<sub>1</sub>, ..., x<sub>n</sub>)</em> are replaced by <em>apply(descriptionOfF, x<sub>1</sub>, ..., x<sub>n</sub>)</em>.
Consider the following program which contains only one higher-order function, <code>map</code>:</p><code class="block">
<pre>int[] map(int[] list, int delegate(int) mapper) {</pre>
<pre> if (list.length == 0) return [];</pre>
<pre> else return [ mapper(list[0]) ] ~ map(list[1..$], mapper);</pre>
<pre>}</pre>
<pre>auto someNumbers(int maxN, int parg1 = 2, int parg2 = 3) {</pre>
<pre> return range(1, maxN).map(x => pow(parg1, x))</pre>
<pre> .map(x => pow(parg2, x))</pre>
<pre> .map(x => x+1);</pre>
<pre>}</pre></code><p>
To begin with, we collect all the functions that are passed to <code>map</code>.
In this case they are <code>x => pow(parg1, x)</code>, <code>x => pow(parg2, x)</code>, and <code>x => x+1</code>.
Next, we identify all the free variables inside these closures, i.e. all the variables that assume their value from some parent scope.
These are <code>parg1</code> and <code>parg2</code> in the above example.
Finally, we create a data structure that holds all these free variables and a label to differentiate the closures.
For the code above, we create an int label that will tell us whether we should exponentiate, or increment.
Since <code>parg1</code> and <code>parg2</code> are never used in the same closure we can save some space and use a single field to encode them instead of two.</p><code class="block">
<pre>struct HOFunc { // Datastructure describing all lambdas passed to the higher-order function `map`</pre>
<pre> int behavior; // 0 means "do x+1", 1 means "do pow(powArg,x)"</pre>
<pre> int powArg; // the first argument to pow</pre>
<pre>}</pre>
<pre></pre>
<pre>int apply(int arg, HOFunc func) {</pre>
<pre> if (func.behavior == 1) return pow(func.powArg, arg);</pre>
<pre> if (func.behavior == 0) return arg + 1;</pre>
<pre> assert(false, "Unknown function description");</pre>
<pre>}</pre></code><p>We must then rewrite <code>map</code> and <code>someNumbers</code> to use <code>apply</code> and <code>HOFunc</code>.
All calls to the arg-functions will be replaced by calls to <code>apply</code>,
and all introductions of arg-functions will be replaced by <code>HOFunc</code> data that describes said function.
Here's how we would rewrite those two functions:</p><code class="block">
<pre>int[] map(int[] list, HOFunc mapper) {</pre>
<pre> if (list.length == 0) return [];</pre>
<pre> else return [ apply(list[0], mapper) ] ~ map(list[1..$], mapper);</pre>
<pre>}</pre>
<pre>auto someNumbers(int maxN, int parg1 = 2, int parg2 = 3) {</pre>
<pre> return range(1, maxN).map(HOFunc(1, parg1))</pre>
<pre> .map(HOFunc(1, parg2))</pre>
<pre> .map(HOFunc(0, 0)); // 2nd arg is useless, can be anything</pre>
<pre>}</pre></code><h3 id="sec-6.2">How is this relevant?</h3><p>
Look again at <code>dupCPS</code>, <code>betaCPS</code>, and <code>evalCPS</code>.
You'll see that these are higher-order functions that take the continuation (a function) as their final argument.
Once we defunctionalize all three functions we'll end up with code that doesn't use any higher-order functions,
and all three functions (six in reality if we count the apply functions) will be tail-recursive.</p><h3 id="sec-6.3">High-level algorithm for defunctionalizing a function</h3><ol><li>Locate all the higher-order functions in your code; i.e. functions that take functions as arguments, or return functions.</li><li>For every higher order function:<ol><li>Locate all the functions that are passed to/returned from this function.</li><li>Find all the functions' free variables.</li><li>Create a struct that holds all these free variables.
Add a label that acts as identification for every function.</li><li>Replace all applications to these functions by calls to the special apply function.</li><li>Replace all the function introductions with a value from the struct.</li></ol></li></ol><h3 id="sec-6.4">Defunctionalizing dup, beta, and eval</h3><p>
Instead of introducing one giant <code>apply</code> function and one giant <code>HOFunc</code> struct we will introduce one of each for each of the three higher-order functions.</p><p>Let's start with <code>dupCPS</code>.
There are only two closures in the whole program that are fed to <code>dupCPS</code> and these are the ones in its body.
The free variables of both these closures are <code>Term* t</code>, <code>Term* ans1</code>, and <code>Term* delegate(Term*) cont</code>.
From here on we will rename all continuations in the set of free variables to <code>next</code>.</p><code class="block">
<pre>struct DupCont {</pre>
<pre> bool inner; // false is outer closure (line 3), true is inner closure (line 4)</pre>
<pre> Term* term; // the Term t</pre>
<pre> Term* ans1;</pre>
<pre> DupCont* next; // the next continuation</pre>
<pre>}</pre>
<pre></pre>
<pre>Term* applyDup(Term* ans, DupCont* cont) {</pre>
<pre> if (cont == null) return ans;</pre>
<pre> if (cont.inner) {</pre>
<pre> Term* t = new Term(cont.term.type, cont.term.a, cont.ans1, ans);</pre>
<pre> return applyDup(t, cont.next);</pre>
<pre> } else {</pre>
<pre> return dupDefun(cont.term.t2, new DupCont(true, cont.term, ans, cont.next));</pre>
<pre> }</pre>
<pre>}</pre>
<pre></pre>
<pre>Term* dupDefun(Term* t, DupCont* cont = null) {</pre>
<pre> if (t == null) return applyDup(t, cont);</pre>
<pre> return dupDefun(t.t1, new DupCont(false, t, null, cont));</pre>
<pre>}</pre></code><p>
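As a quick sanity check, here is a sketch of how <code>dupDefun</code> might be exercised. The definitions are repeated from above so that the snippet is self-contained; the term is built by hand (the parser is omitted from this post), and the assertions in <code>main</code> are only illustrative:</p><code class="block">
<pre>import std.stdio : writeln;</pre>
<pre></pre>
<pre>enum TType { VAR, APP, ABS }</pre>
<pre></pre>
<pre>struct Term { TType type; string a; Term* t1; Term* t2; }</pre>
<pre></pre>
<pre>struct DupCont { bool inner; Term* term; Term* ans1; DupCont* next; }</pre>
<pre></pre>
<pre>Term* applyDup(Term* ans, DupCont* cont) {</pre>
<pre>    if (cont == null) return ans;</pre>
<pre>    if (cont.inner) {</pre>
<pre>        // both children are copied: build the new node, resume the caller</pre>
<pre>        Term* t = new Term(cont.term.type, cont.term.a, cont.ans1, ans);</pre>
<pre>        return applyDup(t, cont.next);</pre>
<pre>    } else {</pre>
<pre>        // the left child is copied: remember it and copy the right child</pre>
<pre>        return dupDefun(cont.term.t2, new DupCont(true, cont.term, ans, cont.next));</pre>
<pre>    }</pre>
<pre>}</pre>
<pre></pre>
<pre>Term* dupDefun(Term* t, DupCont* cont = null) {</pre>
<pre>    if (t == null) return applyDup(t, cont);</pre>
<pre>    return dupDefun(t.t1, new DupCont(false, t, null, cont));</pre>
<pre>}</pre>
<pre></pre>
<pre>void main() {</pre>
<pre>    // (lambda x x) y</pre>
<pre>    Term* id = new Term(TType.ABS, "x", new Term(TType.VAR, "x", null, null), null);</pre>
<pre>    Term* t = new Term(TType.APP, "", id, new Term(TType.VAR, "y", null, null));</pre>
<pre>    Term* copy = dupDefun(t);</pre>
<pre>    assert(copy !is t && copy.t1 !is t.t1);          // fresh nodes</pre>
<pre>    assert(copy.type == TType.APP);</pre>
<pre>    assert(copy.t1.t1.a == "x" && copy.t2.a == "y"); // same shape and names</pre>
<pre>    copy.t2.a = "z";</pre>
<pre>    assert(t.t2.a == "y");                           // the original is untouched</pre>
<pre>    writeln("ok");</pre>
<pre>}</pre></code><p>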
The same procedure can be applied to <code>betaCPS</code> to get <code>betaDefun</code>.
This time, we notice that the parameters <code>var</code> and <code>val</code> are constant and will be redundant in the struct.
So instead we pass them to <code>applyBeta</code>.
Moreover, there's no dedicated label to indicate which closure we're executing: we use the <code>type</code> to distinguish them, together with <code>argNum</code> for applications, which are the only type with two closures.</p><code class="block">
<pre>struct BetaCont {</pre>
<pre> TType type; // technically redundant, we could use term.type instead</pre>
<pre> int argNum; // for APP: 1 means the answer goes into term.t1, 2 means term.t2 (ABS uses 0 and ignores it)</pre>
<pre> Term* term;</pre>
<pre> BetaCont* next;</pre>
<pre>}</pre>
<pre></pre>
<pre>Term* applyBeta(Term* ans, string var, Term* val, BetaCont* cont) { // var, val in here because they never change</pre>
<pre> if (cont == null) return ans;</pre>
<pre> switch (cont.type) {</pre>
<pre> case TType.ABS:</pre>
<pre> cont.term.t1 = ans;</pre>
<pre> return applyBeta(cont.term, var, val, cont.next);</pre>
<pre> case TType.APP:</pre>
<pre> if (cont.argNum == 1) {</pre>
<pre> cont.term.t1 = ans;</pre>
<pre> return betaDefun(cont.term.t2, var, val, new BetaCont(cont.type, 2, cont.term, cont.next));</pre>
<pre> } else {</pre>
<pre> cont.term.t2 = ans;</pre>
<pre> return applyBeta(cont.term, var, val, cont.next);</pre>
<pre> }</pre>
<pre> default: return applyBeta(ans, var, val, cont.next);</pre>
<pre> }</pre>
<pre>}</pre>
<pre></pre>
<pre>Term* betaDefun(Term* term, string var, Term* val, BetaCont* cont = null) {</pre>
<pre> if (term == null) return applyBeta(term, var, val, cont);</pre>
<pre> final switch (term.type) {</pre>
<pre> case TType.VAR:</pre>
<pre> return applyBeta(term.a == var ? val : term, var, val, cont);</pre>
<pre> case TType.ABS:</pre>
<pre> if (term.a == var) return applyBeta(term, var, val, cont);</pre>
<pre> else return betaDefun(term.t1, var, val, new BetaCont(TType.ABS, 0, term, cont));</pre>
<pre> case TType.APP:</pre>
<pre> return betaDefun(term.t1, var, val, new BetaCont(TType.APP, 1, term, cont));</pre>
<pre> }</pre>
<pre>}</pre></code><p>
<code>evalCPS</code> only has two closures passed to it, one at line 51, and the other at line 53.
It should be noted that we are not interested in defunctionalizing the <code>interfunc</code> argument away.
There's nothing special about this code that hasn't been said before, so here is the defunctionalization of <code>evalCPS</code> for the sake of completeness.</p><code class="block">
<pre>struct EvalCont {</pre>
<pre> Term* term;</pre>
<pre> bool inner; // true is closure at line 53, and false is closure at line 51</pre>
<pre> int depth;</pre>
<pre> EvalCont* next;</pre>
<pre>}</pre>
<pre></pre>
<pre>Term* applyEval(Term* ans, Env env, void delegate(Term*, int) interfunc, EvalCont* cont) {</pre>
<pre> if (cont == null) return ans;</pre>
<pre> if (cont.inner) {</pre>
<pre> cont.term.t2 = ans;</pre>
<pre> return evalDefun(cont.term, env, interfunc, cont.depth+1, cont.next);</pre>
<pre> } else {</pre>
<pre> cont.term.t1 = ans;</pre>
<pre> return evalDefun(cont.term.t2, env, interfunc, cont.depth,</pre>
<pre> new EvalCont(cont.term, true, cont.depth, cont.next));</pre>
<pre> }</pre>
<pre>}</pre>
<pre></pre>
<pre>Term* evalDefun(Term* term, Env env, void delegate(Term*, int) interfunc,</pre>
<pre> int depth = 0, EvalCont* cont = null) {</pre>
<pre> interfunc(term, depth);</pre>
<pre> if (term.type == TType.VAR) {</pre>
<pre> if (term.a == "print")</pre>
<pre> return applyEval(term, env, interfunc, cont);</pre>
<pre> else if ((term.a in env) !is null)</pre>
<pre> return evalDefun(dupDefun(env[term.a]), env, interfunc, depth, cont);</pre>
<pre> else assert(false, "Unbound variable " ~ term.a);</pre>
<pre> } else if (term.type == TType.APP) {</pre>
<pre> if (term.t1.type == TType.ABS) {</pre>
<pre> term.t1.t1 = betaDefun(term.t1.t1, term.t1.a, dupDefun(term.t2));</pre>
<pre> return evalDefun(term.t1.t1, env, interfunc, depth-1, cont);</pre>
<pre> }</pre>
<pre> if (term.t1.type == TType.VAR && term.t1.a == "print") {</pre>
<pre> writefln("[print] %s", term.t2.toString);</pre>
<pre> return evalDefun(term.t2, env, interfunc, depth-1, cont);</pre>
<pre> }</pre>
<pre> return evalDefun(term.t1, env, interfunc, depth+1,</pre>
<pre> new EvalCont(term, false, depth, cont));</pre>
<pre> } else return applyEval(term, env, interfunc, cont);</pre>
<pre>}</pre></code><p>
There is something really interesting worth pointing out.
Take a look at <code>DupCont</code>, <code>BetaCont</code>, and <code>EvalCont</code>.
All three look like linked lists.
In fact, all three are a special kind of linked list: a stack.
Creating a new continuation whose <code>next</code> is the current one is a push, and moving on to <code>cont.next</code> is a pop.
We knew all along (but never mentioned it) that we would have to manipulate a stack at some point, since we are simulating recursive calls.
And this is where the stack arises in our code.</p><h2 id="sec-7">Inlining & Tail-call Optimization</h2><p>Now if you look carefully at the six functions:
<code>applyDup</code> & <code>dupDefun</code>,
<code>applyBeta</code> & <code>betaDefun</code>,
and <code>applyEval</code> & <code>evalDefun</code>,
you will notice that each pair of functions (as grouped above) is mutually recursive and, most importantly, tail-recursive!
To complete our transformation we just need to inline the special <code>apply</code> functions and do TCO on each pair.</p><p>
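To see what inlining and TCO do to such a pair on something smaller than the interpreter, here is a minimal sketch that sums a binary tree.
Everything in it (<code>Node</code>, <code>SumCont</code>, <code>applySum</code>, <code>sumDefun</code>, <code>sumOpt</code>) is a hypothetical example of mine and not part of the interpreter.
<code>applySum</code> and <code>sumDefun</code> play the roles of an apply function and its defunctionalized partner, and <code>sumOpt</code> is roughly what is left after doing TCO on <code>applySum</code>, inlining it, and doing TCO on the whole.</p><code class="block">
<pre>struct Node { int value; Node* left; Node* right; } // hypothetical tree type</pre>
<pre></pre>
<pre>// Defunctionalized continuation: one heap-allocated "stack frame".</pre>
<pre>struct SumCont {</pre>
<pre>    bool inner;    // false: the left subtree is being summed, true: the right one</pre>
<pre>    Node* node;</pre>
<pre>    int leftSum;   // sum of the left subtree, only meaningful when inner is true</pre>
<pre>    SumCont* next; // the next continuation</pre>
<pre>}</pre>
<pre></pre>
<pre>int applySum(int ans, SumCont* cont) {</pre>
<pre>    if (cont is null) return ans;</pre>
<pre>    if (cont.inner)</pre>
<pre>        return applySum(cont.node.value + cont.leftSum + ans, cont.next);</pre>
<pre>    return sumDefun(cont.node.right, new SumCont(true, cont.node, ans, cont.next));</pre>
<pre>}</pre>
<pre></pre>
<pre>int sumDefun(Node* n, SumCont* cont = null) {</pre>
<pre>    if (n is null) return applySum(0, cont);</pre>
<pre>    return sumDefun(n.left, new SumCont(false, n, 0, cont));</pre>
<pre>}</pre>
<pre></pre>
<pre>// The same computation with applySum inlined and every tail call turned into a loop step.</pre>
<pre>int sumOpt(Node* n) {</pre>
<pre>    SumCont* cont = null;</pre>
<pre>    while (true) {</pre>
<pre>        if (n !is null) {</pre>
<pre>            cont = new SumCont(false, n, 0, cont); // push, then descend to the left</pre>
<pre>            n = n.left;</pre>
<pre>            continue;</pre>
<pre>        }</pre>
<pre>        int ans = 0; // inlined body of applySum starts here</pre>
<pre>        while (true) {</pre>
<pre>            if (cont is null) return ans;</pre>
<pre>            if (cont.inner) {</pre>
<pre>                ans = cont.node.value + cont.leftSum + ans;</pre>
<pre>                cont = cont.next; // pop</pre>
<pre>            } else {</pre>
<pre>                n = cont.node.right;</pre>
<pre>                cont = new SumCont(true, cont.node, ans, cont.next);</pre>
<pre>                break; // back to the outer loop, i.e. the former call to sumDefun</pre>
<pre>            }</pre>
<pre>        }</pre>
<pre>    }</pre>
<pre>}</pre>
<pre></pre>
<pre>unittest {</pre>
<pre>    Node* root = new Node(1, new Node(2, null, null), new Node(3, null, null));</pre>
<pre>    assert(sumDefun(root) == 6);</pre>
<pre>    assert(sumOpt(root) == 6);</pre>
<pre>}</pre></code><p>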
I grouped the inlining and TCO steps together since in some cases, as in <code>applyBeta</code> and <code>applyDup</code>, one must do TCO on the apply function before inlining it, because these apply functions also call themselves.
So some functions require three steps: do TCO on apply, inline apply into the function, and finally do TCO on the whole function.</p><p>I have already explained how to do TCO in <a href="#sec-2">section 2</a>, so I won't repeat myself.
Again, for the sake of completeness here are the three functions with inlining and TCO applied to them:</p><code class="block">
<pre>Term* dupOpt(Term* t, DupCont* cont = null) {</pre>
<pre> while (true) {</pre>
<pre> if (t == null) {</pre>
<pre> Term* ans = t;</pre>
<pre> DupCont* acont = cont;</pre>
<pre> while (true) {</pre>
<pre> if (acont == null) return ans;</pre>
<pre> if (acont.inner) {</pre>
<pre> ans = new Term(acont.term.type, acont.term.a, acont.ans1, ans);</pre>
<pre> acont = acont.next;</pre>
<pre> } else {</pre>
<pre> t = acont.term.t2;</pre>
<pre> cont = new DupCont(acont.term, true, ans, acont.next);</pre>
<pre> break;</pre>
<pre> }</pre>
<pre> }</pre>
<pre> } else {</pre>
<pre> cont = new DupCont(t, false, null, cont);</pre>
<pre> t = t.t1;</pre>
<pre> }</pre>
<pre> }</pre>
<pre>}</pre></code><code class="block">
<pre>Term* betaOpt(Term* term, string var, Term* val, BetaCont* cont = null) {</pre>
<pre> Term* ans;</pre>
<pre> BetaCont* acont;</pre>
<pre> do {</pre>
<pre> bool computeAns = false;</pre>
<pre> if (term == null) {</pre>
<pre> computeAns = true;</pre>
<pre> ans = term; acont = cont;</pre>
<pre> } else if (term.type == TType.VAR) {</pre>
<pre> computeAns = true;</pre>
<pre> ans = term.a == var ? val : term; acont = cont;</pre>
<pre> } else if (term.type == TType.ABS) {</pre>
<pre> if (term.a == var) {</pre>
<pre> computeAns = true;</pre>
<pre> ans = term; acont = cont;</pre>
<pre> } else {</pre>
<pre> cont = new BetaCont(TType.ABS, 0, term, cont);</pre>
<pre> term = term.t1;</pre>
<pre> }</pre>
<pre> } else {</pre>
<pre> cont = new BetaCont(TType.APP, 1, term, cont);</pre>
<pre> term = term.t1;</pre>
<pre> }</pre>
<pre> if (computeAns) {</pre>
<pre> while(true) {</pre>
<pre> if (acont == null) return ans;</pre>
<pre> if (acont.type == TType.ABS) {</pre>
<pre> acont.term.t1 = ans;</pre>
<pre> ans = acont.term;</pre>
<pre> acont = acont.next;</pre>
<pre> } else if (acont.type == TType.APP) {</pre>
<pre> if (acont.argNum == 1) {</pre>
<pre> acont.term.t1 = ans;</pre>
<pre> term = acont.term.t2;</pre>
<pre> cont = new BetaCont(acont.type, 2, acont.term, acont.next);</pre>
<pre> break;</pre>
<pre> } else {</pre>
<pre> acont.term.t2 = ans;</pre>
<pre> ans = acont.term;</pre>
<pre> acont = acont.next;</pre>
<pre> }</pre>
<pre> } else {</pre>
<pre> acont = acont.next;</pre>
<pre> }</pre>
<pre> }</pre>
<pre> }</pre>
<pre> } while (true);</pre>
<pre>}</pre></code><code class="block">
<pre>Term* evalOpt(alias beta, alias dup)</pre>
<pre> (Term* term, Env env, void delegate(Term*, int) interfunc, int depth = 0, EvalCont* cont = null) {</pre>
<pre> Term* ans;</pre>
<pre> EvalCont* acont;</pre>
<pre> do {</pre>
<pre> interfunc(term, depth);</pre>
<pre> bool computeAns = false;</pre>
<pre> if (term.type == TType.VAR) {</pre>
<pre> if (term.a == "print") {</pre>
<pre> computeAns = true;</pre>
<pre> ans = term; acont = cont;</pre>
<pre> } else if ((term.a in env) !is null) {</pre>
<pre> term = dup(env[term.a]);</pre>
<pre> } else assert(false, "Unbound variable " ~ term.a);</pre>
<pre> } else if (term.type == TType.APP) {</pre>
<pre> if (term.t1.type == TType.ABS) {</pre>
<pre> term.t1.t1 = beta(term.t1.t1, term.t1.a, dup(term.t2));</pre>
<pre> term = term.t1.t1;</pre>
<pre> depth--;</pre>
<pre> } else if (term.t1.type == TType.VAR && term.t1.a == "print") {</pre>
<pre> writefln("[print] %s", term.t2.toString);</pre>
<pre> term = term.t2;</pre>
<pre> depth--;</pre>
<pre> } else {</pre>
<pre> cont = new EvalCont(term, false, depth, cont);</pre>
<pre> term = term.t1;</pre>
<pre> depth++;</pre>
<pre> }</pre>
<pre> } else {</pre>
<pre> computeAns = true;</pre>
<pre> ans = term; acont = cont;</pre>
<pre> }</pre>
<pre> if (computeAns) {</pre>
<pre> if (acont == null) return ans;</pre>
<pre> if (acont.inner) {</pre>
<pre> acont.term.t2 = ans;</pre>
<pre> term = acont.term;</pre>
<pre> depth = acont.depth+1;</pre>
<pre> cont = acont.next;</pre>
<pre> } else {</pre>
<pre> acont.term.t1 = ans;</pre>
<pre> term = acont.term.t2;</pre>
<pre> depth = acont.depth;</pre>
<pre> cont = new EvalCont(acont.term, true, acont.depth, acont.next);</pre>
<pre> }</pre>
<pre> }</pre>
<pre> } while (true);</pre>
<pre>}</pre></code><h2 id="sec-8">Some Tangible Results</h2><p>With the help of <a href="https://en.wikipedia.org/wiki/Church_encoding" target="_blank">Church encoding</a> to encode numbers and booleans, I wrote a program that computes the Fibonacci numbers (it's <code>stdlib.lc</code> in <a href="https://github.com/geezee/typeless/tree/13f629e9039a3af0ef10673ad06bdf521fca1ef4" target="_blank">typeless.git</a>).
The table below displays 30 results, each for a different <em>n</em> (the input) and level of optimization.
For a given run, if a function's cell contains the check mark ✓ then the transformed (i.e. iterative) version of that function was used.
The results are grouped by <em>n</em> and the fastest result is in bold.
I benchmarked the code on my slow notebook, whose brain is an <a href="https://ark.intel.com/content/www/us/en/ark/products/93361/intel-atom-x5-x8350-processor-2m-cache-up-to-1-92-ghz.html" target="_blank">Intel Atom x5-Z8350 processor (1.9 GHz / 2M cache)</a> with 1.8 GB of RAM.</p>
<table>
<thead>
<tr>
<th>Number of reductions</th>
<th><code>eval</code></th>
<th><code>beta</code></th>
<th><code>dup</code></th>
<th>Total (sec)</th>
</tr>
</thead>
<tbody>
<tr> <td rowspan="8">150,992<br>fibonacci(8)</td> <td></td> <td></td> <td></td> <td><strong>0.358</strong></td> </tr>
<tr> <td>✓</td> <td></td> <td></td> <td>0.415</td> </tr>
<tr> <td></td> <td>✓</td> <td></td> <td>0.373</td> </tr>
<tr> <td></td> <td></td> <td>✓</td> <td>0.768</td> </tr>
<tr> <td>✓</td> <td>✓</td> <td></td> <td>0.413</td> </tr>
<tr> <td>✓</td> <td></td> <td>✓</td> <td>0.849</td> </tr>
<tr> <td></td> <td>✓</td> <td>✓</td> <td>0.791</td> </tr>
<tr> <td>✓</td> <td>✓</td> <td>✓</td> <td>0.847</td> </tr>
<tr class="thick-top"> <td rowspan="8">4,710,572<br>fibonacci(15)</td> <td></td> <td></td> <td></td> <td>14.41</td> </tr>
<tr> <td>✓</td> <td></td> <td></td> <td><strong>12.92 (+11.5%)</strong></td> </tr>
<tr> <td></td> <td>✓</td> <td></td> <td>14.22</td> </tr>
<tr> <td></td> <td></td> <td>✓</td> <td>30.34</td> </tr>
<tr> <td>✓</td> <td>✓</td> <td></td> <td>13.31</td> </tr>
<tr> <td>✓</td> <td></td> <td>✓</td> <td>27.86</td> </tr>
<tr> <td></td> <td>✓</td> <td>✓</td> <td>32.34</td> </tr>
<tr> <td>✓</td> <td>✓</td> <td>✓</td> <td>28.14</td> </tr>
<tr class="thick-top"> <td rowspan="6">12,354,971<br>fibonacci(17)</td> <td></td> <td></td> <td></td> <td>45.33</td> </tr>
<tr> <td>✓</td> <td></td> <td></td> <td>41.61</td> </tr>
<tr> <td></td> <td>✓</td> <td></td> <td>39.9</td> </tr>
<tr> <td></td> <td></td> <td>✓</td> <td>85.33</td> </tr>
<tr> <td>✓</td> <td>✓</td> <td></td> <td><strong>37.32 (+21.5%)</strong></td> </tr>
<tr> <td>✓</td> <td>✓</td> <td>✓</td> <td>78.73</td> </tr>
<tr class="thick-top"> <td rowspan="4">20,000,475<br>fibonacci(18)</td> <td></td> <td></td> <td></td> <td>Segfault (@ 67.69)</td> </tr>
<tr> <td>✓</td> <td></td> <td></td> <td>Segfault (@ 67.01)</td> </tr>
<tr> <td></td> <td>✓</td> <td></td> <td>63.45</td> </tr>
<tr> <td>✓</td> <td>✓</td> <td></td> <td><strong>62.2</strong></td> </tr>
<tr class="thick-top"> <td rowspan="2">52,389,823<br>fibonacci(20)</td> <td></td> <td>✓</td> <td></td> <td><strong>162.34</strong></td> </tr>
<tr> <td>✓</td> <td>✓</td> <td></td> <td>174.28</td> </tr>
<tr class="thick-top"> <td rowspan="2">221,990,915<br>fibonacci(23)</td> <td></td> <td>✓</td> <td></td> <td><strong>693.21</strong></td> </tr>
<tr> <td>✓</td> <td>✓</td> <td></td> <td>705.99</td> </tr>
</tbody>
</table><p>
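The percentages next to two of the bold entries can be read as the speed-up relative to the fully recursive run for the same <em>n</em>.
Assuming the formula is baseline / time - 1 (an assumption, though it reproduces both annotations), a hypothetical helper <code>speedup</code> would look like this:</p><code class="block">
<pre>import std.format : format;</pre>
<pre></pre>
<pre>// Speed-up of `time` relative to `baseline`, formatted like the table's annotations.</pre>
<pre>// Assumed formula: baseline / time - 1, expressed in percent.</pre>
<pre>string speedup(double baseline, double time) {</pre>
<pre>    return format("%+.1f%%", (baseline / time - 1.0) * 100.0);</pre>
<pre>}</pre>
<pre></pre>
<pre>unittest {</pre>
<pre>    assert(speedup(14.41, 12.92) == "+11.5%"); // fibonacci(15), iterative eval only</pre>
<pre>    assert(speedup(45.33, 37.32) == "+21.5%"); // fibonacci(17), iterative eval and beta</pre>
<pre>}</pre></code><p>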
Before running these benchmarks I profiled the interpreter.
Briefly, the function that is called the most is <code>dup</code>: 4.3 times more often than <code>beta</code>, which in turn is called 2.5 times more often than <code>eval</code>.</p><p>
As expected, the code that uses all recursive versions eventually segfaults, which basically indicates that we ran out of stack space.
What may be surprising is that using <code>dupOpt</code> instead of <code>dup</code> actually slows things down, by a lot!
Keep in mind that allocating and accessing data on the heap is generally slower than on the stack. <code>dup</code> is already a memory-intensive operation, and <code>dupOpt</code> is even more so because it allocates <code>DupCont</code> nodes on the heap where <code>dup</code> simply used the call stack, which I guess is the reason behind the slowdown.</p><p>
However, what I cannot explain is why, for really large <em>n</em>, using both <code>evalOpt</code> and <code>betaOpt</code> (version A) eventually becomes slower than using <code>eval</code> and <code>betaOpt</code> (version B).
My guess is that memory becomes too fragmented after too many pushes and pops on the <code>EvalCont</code> and <code>BetaCont</code> linked lists.
I conjecture that B will eventually segfault for some large <em>n</em>.
However, I tried running B with <em>n=28</em>, which takes some 2.4 billion operations, on a MacBook that is on average five times faster than my notebook.
B did not segfault, and it ran significantly faster than A.</p><p>
In conclusion, we should keep the recursive <code>eval</code> and <code>dup</code> and use the iterative <code>betaOpt</code> instead of <code>beta</code>.
So beware of overtrusting the "iterative code is quicker than recursive code" idiom, since that wasn't the case for two of our three functions!
The lesson drawn from this experiment is: as always, first benchmark and then decide.</p>
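<p>Here is a minimal sketch of what such a benchmark could look like, assuming Phobos' <code>std.datetime.stopwatch.benchmark</code>.
The tree, <code>countRec</code>, <code>countIter</code>, and <code>runBenchmark</code> are hypothetical stand-ins and not the interpreter's code: one version keeps its pending work on the call stack, the other in heap-allocated frames.</p><code class="block">
<pre>import std.datetime.stopwatch : benchmark;</pre>
<pre>import std.stdio : writefln;</pre>
<pre></pre>
<pre>struct Node { Node* left; Node* right; } // hypothetical test data</pre>
<pre></pre>
<pre>// Builds a full binary tree with 2^depth - 1 nodes.</pre>
<pre>Node* build(int depth) {</pre>
<pre>    if (depth == 0) return null;</pre>
<pre>    return new Node(build(depth - 1), build(depth - 1));</pre>
<pre>}</pre>
<pre></pre>
<pre>// Recursive count: pending work lives on the call stack.</pre>
<pre>size_t countRec(Node* n) {</pre>
<pre>    if (n is null) return 0;</pre>
<pre>    return 1 + countRec(n.left) + countRec(n.right);</pre>
<pre>}</pre>
<pre></pre>
<pre>// Iterative count: pending work lives in heap-allocated frames,</pre>
<pre>// much like the DupCont, BetaCont, and EvalCont lists.</pre>
<pre>struct Frame { Node* node; Frame* next; }</pre>
<pre></pre>
<pre>size_t countIter(Node* root) {</pre>
<pre>    size_t total = 0;</pre>
<pre>    Frame* stack = root is null ? null : new Frame(root, null);</pre>
<pre>    while (stack !is null) {</pre>
<pre>        Node* n = stack.node;</pre>
<pre>        stack = stack.next;                                      // pop</pre>
<pre>        total++;</pre>
<pre>        if (n.left !is null) stack = new Frame(n.left, stack);   // push</pre>
<pre>        if (n.right !is null) stack = new Frame(n.right, stack); // push</pre>
<pre>    }</pre>
<pre>    return total;</pre>
<pre>}</pre>
<pre></pre>
<pre>// Times both versions on the same tree and prints the total time of each.</pre>
<pre>void runBenchmark(int depth = 16, uint runs = 20) {</pre>
<pre>    Node* tree = build(depth);</pre>
<pre>    size_t sink = 0; // accumulates the results so the calls are not dead code</pre>
<pre>    void rec()  { sink += countRec(tree); }</pre>
<pre>    void iter() { sink += countIter(tree); }</pre>
<pre>    auto times = benchmark!(rec, iter)(runs);</pre>
<pre>    writefln("recursive: %s", times[0]);</pre>
<pre>    writefln("iterative: %s", times[1]);</pre>
<pre>}</pre></code><p>Calling <code>runBenchmark()</code> from <code>main</code> prints the total time of each version over 20 runs.
The numbers depend on the machine, the compiler flags, and the garbage collector, so treat them as a starting point and not as a verdict.</p>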
https://blog.grgz.me/posts/recursion_elimination.html
Sun, 18 Aug 2019