My many misadventures with Literate Programming
Literate programming, or LP, is one of those things that promise some utopian way of Building Shit™, where you will be elevated to some higher level of being. Or something. Just like purely functional programming or the borrow checker. It fixes bugs! Makes you a better programmer! And yet, even without such qualifications, this concept has always fascinated me. Essentially, it's the formalization of every single "code tutorial": it's literally "code, but pretend you are writing a tutorial on how to do it". Hell, people point to Physically Based Rendering as an example of "LP done right".
Most modern day conceptions of LP involve some sort of notebook, like Jupyter Notebook. Don't get me wrong, those are great. The fact that you can see output of some piece of code from the get-go is very helpful for understanding in a way not so easily achieved by traditional LP. But "trad LP" is the slightly more interesting formulation of Donald Knuth (you might know him from TeX). The Knuthian definition of "literate programming" emphasizes code reordering, something that you couldn't get in notebooks. It is essentially "folding code" 40 years before it was cool. (Then again, he was always like 40 years ahead of his time—if you've heard of variable fonts, why not check out METAFONT?). It involves two separate, yet parallel, processes:
- Tangling: generating source code files to feed to a programming language compiler.
- Weaving: creating documents to print and read, e.g. HTML, PDF, PS, etc.
And they both are generated from the exact same source or set of sources. You may look around for examples, but they should involve pieces of code that are hidden underneath names, and then instantiated sort of like a "macro version" of a macro. And it really is, in a way, another view of programming: instead of writing-as-you-go or writing pseudo-code first, you literally just write pseudo-code and then translate it in the same document into the language of your choice.
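Tangling, in other words, is recursive macro expansion over named chunks. Here's a minimal sketch of a tangler in Python; the `<<name>>=` definition syntax and the lone-`@` chunk terminator are borrowed from noweb, while the function names and the (lack of) error handling are my own invention:

```python
import re

# A toy tangler: chunks are defined with "<<name>>=" and may reference
# other chunks by putting "<<name>>" on a line of its own.
CHUNK_DEF = re.compile(r"^<<(.+)>>=$")
CHUNK_USE = re.compile(r"^(\s*)<<(.+)>>$")

def parse_chunks(source: str) -> dict[str, list[str]]:
    """Collect chunk bodies; repeated definitions append to the chunk."""
    chunks: dict[str, list[str]] = {}
    current = None
    for line in source.splitlines():
        if m := CHUNK_DEF.match(line):
            current = m.group(1)
            chunks.setdefault(current, [])
        elif line == "@":          # a lone "@" ends the current chunk
            current = None
        elif current is not None:
            chunks[current].append(line)
    return chunks

def tangle(chunks: dict[str, list[str]], name: str) -> str:
    """Recursively expand chunk references, preserving indentation."""
    out = []
    for line in chunks[name]:
        if m := CHUNK_USE.match(line):
            indent, ref = m.groups()
            out.extend(indent + l for l in tangle(chunks, ref).splitlines())
        else:
            out.append(line)
    return "\n".join(out)
```

The point of the sketch is the reordering: chunks can be defined in any order in the source, and `tangle(parse_chunks(src), "main")` stitches them into compiler-ready output.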
But trying this out has, of course, been a bit tricky and definitely experimental. A lot of the documentation on LP as understood here dates back to the 1980s through the early 2000s. Which makes sense, as some of the "exciting" benefits of LP have since evaporated with advances in tooling:
- We don't need pretty-printed source anymore, as syntax highlighting has become fast and practical for most programs out there. Besides, seeing `≠` instead of `!=` seems rather disorienting (although some people don't think so, evidenced by the myriad of coding fonts out there that have that as an actual ligature!)
- It isn't as necessary anymore to create indices on our own, since with IDEs (especially with the invention of the LSP) you can simply jump to where something is defined and have the symbol index right there on the sidebar.
- Chunks are an early form of "code folding". Well, today we have `#region` and actual code folding. Hell, there's some pushback against `#region`, some of which seems also applicable to LP as it is applied.
Regardless, I think my entry point to all this was a few articles (probably this one) that, despite the advances above, still believe in Knuth's vision. Maybe people really like explaining. Maybe capturing the context with the code, especially in that way, is still seen as a nice thing. Or that reordering code is still pretty cool, even in an age where you can define functions in any order and not have to define one after another, as in the days when LP was first invented. I'm a little bit of all of those, and so I've experimented with a few different tools for Knuthian LP.
literate & srcweave
These two are mentioned in the article linked in the previous paragraph; srcweave in particular was written by the article's author. As I said before, LP is like a formalization of "code tutorials", so why not write an entire tutorial using this very method? I wrote it in literate initially, then rewrote it in srcweave after a bit. I did try to make the most of it, putting in little graphics of my own that explain some hardware-related concept while having the actual code to operate it beneath. If I had something to say about these two:
- srcweave: More versatile, harder to set up. `srcweave` does not assume text processing will be done with it and leaves things like Markdown conversion to other tools. Moreover, it's written in SBCL, whose package story I don't find all that smooth-sailing for some reason.
- literate: All-in-one, less versatile, easier to set up. It uses D's package system, which works well. Unlike `srcweave`, however, it assumes a certain structure.
NailIt
Both literate and srcweave use Markdown. Which is nice, but they still have their own format on top, which makes the source not that presentable on GitHub, or any other git forge with a Markdown viewer. So I thought, why not make a tool that… just used Markdown? It'd be great to have one as a README.md. That's why I wrote NailIt. I don't think I've ever used it in any serious sense other than for its own program description; maybe that time will come yet. I did say I wanted less "refer to the section number a code block is in" and more "refer to code blocks directly" in my output; indeed, literate and srcweave still refer to blocks/chunks by their section numbers. Which I think is probably a carry-over from Knuth's original tooling, which was built for books (WEB, CWEB, and then noweb follow this model). But we live in an age of hyperlinks, and so maaaaybe we ought to say the exact context where X is used.
Leo
This one is essentially the LP IDE. You create or load in a single project file that contains all your code, and you navigate through an outline of nodes (chunks), each of which can be edited interactively. Nodes have hierarchy—sub-nodes can either be the traditional <<program part>> or another documentation node. A single node can even have multiple references (called "clones")—for instance, you have a node containing a FooBar class. You can have a reference to the node underneath a node called How to do FooBar and a node called Class index at the same time—it's exactly the same node and a change to one will change the other—but it allows you to view the thing in multiple different contexts.
I thought this was the closest LP has been to being actually viable, especially for languages whose LSPs are iffy. It looked so clean, and so useful. Especially its killer feature: Leo can "untangle" changes made to tangled files back to the source! Especially @clean, which outputs… basically regular source code, so you get the best of both worlds! …In theory, of course. While it does a pretty good job, it's not perfect and it can trip up sometimes, so care must still be taken not to change too much in the tangled program, lest it untangles it all wrong. As always, too good to be true.
Since this is a tool with its own format, that means there's a certain level of "lock-in". This is mitigated somewhat due to Leo having multiple implementations: There's the main version that runs on PyQt (which in my experience isn't quite smooth-sailing on some setups I have), and then there's LeoJS (a VSCode extension). No matter which one I use, there's unfortunately some level of friction, probably on my end rather than theirs. The Qt version does run on Termux while I'm "on-the-go", so to speak, but opening the file browser is slow for some reason. Again, I'd chalk it up to Qt. Even without the technical problems, I'm still not sure if I want to go with this for my day to day work on account of "plain code" being easier to work with in general.
Inweb
This is the solution built by Graham Nelson, who I am told is practically the king of interactive fiction, a genre of which my only notion is... uh... going North, South, or Dennis. But it is also a fascinating tool in its own right, and its output is very much modern. It imposes a book-like structure right down to the file names: code and documentation live in sections which have Names like This.w (and yes, it wants spaces), and those sections live in folders like Chapter 1 and Chapter 13, their titles given in a file called Contents.w at the root of the project. You can string multiple projects together in "webs", although I haven't found a use for this feature yet. I think I might like the idea of being forced to structure my program like an actual book: it lets me see at a glance the context of where a piece of code might live. If I want to know where to Configure the Reflurberator, I can go to the section called "Configure the Reflurberator" in the table of contents.
But while, like other LP tools, it claims support for "any language", there is notable red-carpet treatment for C-likes. In practice, it's really just C, plus its own little C dialect "InC", which among other things adds a sort of `Namespace::capability`.
Attempting to use it with C++ made my C++ code end up just being C. Since Inweb recognizes function definitions like `void MyClass::something(`, it helpfully tries to emit a C-style declaration for it first. Which causes all sorts of compiler errors as the compiler tries to reconcile the resulting declaration with the class that follows, and I needed `= (very early code)` hacks to resolve it. I ended up refactoring my code to look like `void MyClass__something(MyClass &self, ..)`. You can imagine from here how templates and `using` complicate matters further. So yeah, I might as well just use C or InC. What about indented languages like Python or Nim? Suffice to say, it's not a good fit: its output doesn't care about spacing as much as it should for those kinds of languages, which is sad, because I really like its output.
Unlike the tools mentioned so far, this one does not allow multiple tangled files to be generated. I get why: you have a complete program here, why would you want to split it into files again? You already organize the program in a structure that fits it. But the real casualty is incremental compilation, which has been utterly decimated. Tangling always results in one big program, and that program has to be compiled anew every single time. You couldn't recompile, say, just the dingo.c part of myprogram, because it always creates myprogram.c as one big amalgamation (as SQLite would say). There is support, however, for creating extra files attached to the program, such as configuration files or a Makefile, and those are quite intentionally limited.
noweb
Initially I brushed this off, thinking it was joined at the hip with TeX. Debian attaching TeXLive as this thing's dependency did not help at all and, yes, in fact scared me off it. But after looking into it again: it's none of those things. It's even better.
- TeX isn't actually parsed or even required by the tool. It only parses `<<code usages>>`, `<<code definitions>>=`, and `[[quoted code]]`. That's literally it. The documentation parts aren't even touched, and will pass right through.
- Building the thing is equally simple: it only needs C and some form of AWK. There are mentions of a language called Icon (think: an alternative to Perl and Python) but it isn't needed either.
- Finally, it's actually a whole suite of tools. `noweb`, `notangle`, and `noweave` are shell scripts that drive more granular tools: `markup`, `noidx`, and `tohtml`/`totex`.
It's Unix as all hell, and that is actually its killer feature. Instead of being an all-in-one parser and converter like the other tools so far, these are each their own little programs, right down to the fact that stdin and stdout are literally all you need to operate them. The parser, `markup`, converts the source into its own intermediate format; the idea being that anyone can ingest this intermediate format and produce documents in their own way. Everything else, even code indexing (provided by `noidx`), is optional.
Anyone can slot in their own tooling; no wonder it has a 30-year (and counting) lifespan. So, made my own I did. It definitely isn't quite what I want just yet, but to me it's already preferable to what noweb's default HTML output looks like. Notably, at present, references still refer to chunk numbers, despite what I set forth in NailIt, but I'm sure even that can be worked around. The notable thing is that since I'm working with an easily parsed intermediate format, I can add restrictions of my own: in my tool, I forced file names to follow a number-chapter format, the idea being that a Makefile or a glob on a source directory should be all I need instead of a Contents.w. If I wanted to, I could also forbid appending to code chunks, on the grounds that it hinders analyzing the code quickly. Or make chunk names do special stuff, like declare function names. Whatever I want, I can do it here.
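To give a flavor of what "ingesting the intermediate format" looks like, here's a sketch in Python of rebuilding code chunks from `markup`'s line-oriented output. The keyword names (`@begin`/`@end`, `@defn`, `@use`, `@text`, `@nl`) are from my reading of the noweb documentation and should be treated as assumptions, not an exhaustive grammar:

```python
# Consume markup's intermediate format: one "@keyword [rest]" per line.
# Real markup output has more keywords (@file, @index, @quote, ...);
# this sketch only handles the ones needed to rebuild code chunks.

def chunks_from_markup(stream: str) -> dict[str, str]:
    """Rebuild named code chunks, flattening chunk references
    back into <<name>> placeholders."""
    chunks: dict[str, str] = {}
    name, buf, in_code = None, [], False
    for line in stream.splitlines():
        keyword, _, rest = line.partition(" ")
        if keyword == "@begin" and rest.startswith("code"):
            in_code, name, buf = True, None, []
        elif keyword == "@end" and rest.startswith("code"):
            if name is not None:
                chunks[name] = chunks.get(name, "") + "".join(buf)
            in_code = False
        elif keyword == "@defn":
            name = rest          # the <<name>>= being defined
        elif keyword == "@use" and in_code:
            buf.append(f"<<{rest}>>")
        elif keyword == "@text" and in_code:
            buf.append(rest)
        elif keyword == "@nl" and in_code:
            buf.append("\n")
    return chunks
```

From here, enforcing my own rules (file naming, no appending, special chunk names) is just ordinary string checks on `name` before accepting a chunk, which is exactly why the pipe-friendly design is so pleasant.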
Tips?
In trying to use LP as a practical tool for assisting coding in some instances, here's my attempt at formulating some light guides:
Don't hide your variables. A code chunk is, in a way, its own function. In a chunk, I want to know what `output` is and does. So I name my chunks something like `add something to "output"`.
Ensure that any variable mentioned in a code chunk is either:
- defined in whatever chunk uses it (so you can "follow the yellow brick road" of "Used in" links), or
- local to that chunk.
The way Inweb handles this in C code is that every chunk is surrounded by `{` and `}`. This is meant to make `if (something) @<do other thing@>` possible, but the local scoping is a good side effect.
Not every chunk needs a whole novel. "Old-school" LP says that the explanation is the enlightening part, but that worked in a time when it was still humanly possible to slow down and bask in every little detail. Now? Well, you just want things to work. Over-explanation was one of the things that led ifupdown's current maintainer to cut out noweb entirely. That leaves us with: when should you split off a chunk? At least one of the following should apply:
- The chunk is more than a page's worth, but you feel it shouldn't be its own function.
- It's not completely obvious what it does and it genuinely needs a paragraph instead of a one-line comment up top.
- The chunk is related to some other aspect of the program that would not make sense to be explained in the current section.
My trick is to make each function its own chunk. Inweb handles this automatically for C, because C function definitions follow a quite regular pattern. Not so with noweb, since it's closer to a "universal" tool. My trick there is to just use `<<function somethingHappens(shit)>>`, with the declaration placed either "above" the chunk (in the thing that uses it) or "under" the chunk (as part of the chunk).
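One nice thing about this naming convention is that it's machine-checkable. A sketch in Python of scanning a noweb-style file for `<<function ...>>=` chunks and building a little index (the convention is mine; noweb itself attaches no meaning to chunk names):

```python
import re

# Find chunks following my "<<function name(args)>>=" convention.
# MULTILINE lets ^/$ anchor each chunk-definition line.
FUNC_CHUNK = re.compile(r"^<<function\s+(\w+)\s*\((.*)\)>>=$", re.MULTILINE)

def function_index(nw_source: str) -> list[tuple[str, str]]:
    """Return (name, args) pairs for every function-style chunk."""
    return [(m.group(1), m.group(2)) for m in FUNC_CHUNK.finditer(nw_source)]
```

With something like this in the pipeline, a custom weaver could emit a real function index, or even refuse to weave when a chunk breaks the convention.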
Reduce the number of chunks overall. Every chunk is another possibility to break the flow of control as the reader follows it. This especially includes the kind of "appending to chunks" common to LP tools. While useful for "adding functions" as they are defined, when overused it can hinder understanding a function quickly (remember, the world moves fast these days…).