One of the big ideas I have been pressed into developing over the course of architecting Oración is a thing called modular memory. I wrote about this once before, but back then the idea was a lot fuzzier and conflated with nimbs (a general solution for arbitrary-precision integers). After some writing about the matter on i486 machines, I came up with a much more coherent rendition of modular memory, which I present here. Funnily enough, it bears a lot of resemblance to the khipus of Incan accounting fame.
In my old terminology, I coined the words twig and trunk to help flesh out the idea. But upon further thought, I think the term knot is more fitting, given the pattern of usage we'll showcase below. For practicality's sake, I've defined four sizes of knots: 8-knots, 12-knots, 16-knots, and 20-knots, which refer to 256-byte, 4096-byte, 64-kibibyte, and 1-mebibyte contiguous blocks of memory respectively; the number in each name is the power of two of the block size in bytes.
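Hinterlib's actual definitions aren't public yet, so here is a minimal sketch in portable C of how the four knot sizes and a knot descriptor might look. All of the names here are mine, and the flat base pointer glosses over real-mode pointer details for now:

```c
#include <stdint.h>

/* A knot's order is the power of two of its size in bytes. */
enum knot_order {
    KNOT8  = 8,   /* 256 bytes  */
    KNOT12 = 12,  /* 4096 bytes */
    KNOT16 = 16,  /* 64 KiB     */
    KNOT20 = 20   /* 1 MiB      */
};

#define KNOT_SIZE(order) ((uint32_t)1 << (order))

/* One knot: a contiguous block of memory plus its order. */
struct knot {
    uint8_t        *base;  /* start of the contiguous block */
    enum knot_order order;
};
```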
The idea is that algorithms can be configured to work with such knots of memory and, using a little roll-over housekeeping scratch data in registers and some additional logic, can continue their function again and again from one knot to the next in a well-formed succession. For cases where that housekeeping is too much of a performance hit, such a function can take a flag telling it to skip the housekeeping and assume the knots are laid out contiguously anyway, giving near-full performance without giving up the flexibility of modular memory logic.
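Here is a hypothetical sketch of that pattern, building on the definitions above: a byte-summing routine that either rolls over from knot to knot, or, when the caller passes the flag, treats the whole chain as one flat block. The function name and signature are mine, not Hinterlib's:

```c
#include <stddef.h>
#include <stdint.h>

/* Sum every byte across a chain of `count` knots. When
 * `assume_contiguous` is nonzero, the roll-over housekeeping is
 * skipped and the knots are trusted to sit back to back in memory. */
uint32_t knot_sum(const struct knot *knots, size_t count, int assume_contiguous)
{
    uint32_t sum = 0;

    if (count == 0)
        return 0;

    if (assume_contiguous) {
        /* Fast path: one flat run starting at the first knot's base. */
        uint32_t total = 0;
        const uint8_t *p = knots[0].base;
        for (size_t i = 0; i < count; i++)
            total += KNOT_SIZE(knots[i].order);
        while (total--)
            sum += *p++;
        return sum;
    }

    /* Modular path: each time a knot is exhausted, roll over to the
     * next knot's base and keep going. */
    for (size_t i = 0; i < count; i++) {
        const uint8_t *p = knots[i].base;
        uint32_t remaining = KNOT_SIZE(knots[i].order);
        while (remaining--)
            sum += *p++;
    }
    return sum;
}
```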
Believe it or not, the main attraction of modular memory is performance. This might seem counterintuitive to the casual systems engineer, who thinks, “doesn’t introducing more complexity, or necessitating any kind of stop and start, make things worse?” And yes, on machines with large pointers and tons of abstraction in hardware and kernel space, that is true. But you’re not thinking big enough. On extremely high-performance hardware kernels, you would rather addresses be only as large as the program truly needs. In conventional computing, this is a hard wall that everyone copes with rather than tries to overcome. Even your newfangled Nvidia GPU is doing tons of rearranging-the-deck-chairs-on-the-Titanic behind the scenes, consuming gobs of precious energy. The only reason it can’t do better is that your code is too naïve to let it… until now, thanks to modular memory.
No one can deny the theoretical truth that using fewer address bits outright is less wasteful than using more. Even if you try to boil them away with abstractions, the cost is simply shifted to some other layer of the stack. Abstractions will never cheat the laws of the universe, and you will always pay for them in electricity at some point. Unless, of course, you can get smart enough to actually, properly do without that which you do not need. This is hard, and programmers are so notoriously lazy that they extol laziness as a virtue. I wish they would stop doing that.
I am proving the efficacy of modular memory by employing it to great effect on a class of computers everyone used to know and love but which has largely been relegated to the dustbin of novelty and nostalgia: IBM PC compatibles. Specifically, I am building a research lab filled with such computers, mostly sporting i286s and i486s with Cirrus graphics cards, and on them I am running a research operating system I call Sirius DOS, built on top of MS-DOS 6.22. Everything will stay in real mode, and I will achieve something no one else comprehensively did in real mode in the 1990s: leverage all of a system’s memory. Professional-grade software like Microsoft Windows adopted protected mode as soon as it was practical, and all of your hackerware written by John Carmack et al. was usually running in some variety of unreal mode. No matter how you got there, the consensus was universal: 16-bit pointers sucked, and you needed to get 32-bit ones by any means necessary.
Software modular memory shows, for the first time, that you never needed 32-bit pointers to have painless access to more memory than 16 bits can address. Importantly, it shows that you don’t need 32-bit pointers to get 32 bits’ worth of memory fed to algorithms in a generalised manner. Any algorithm you write can be adjusted to handle this without compromise and with minimal overhead.
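To make that concrete on the real-mode hardware described above: a 16-knot (64 KiB) spans exactly one real-mode segment, so a chain of knots can be tracked as a plain array of 16-bit segment values, and no pointer wider than segment:offset is ever formed. The following is a hypothetical sketch, not Sirius DOS’s actual code; it assumes a 16-bit DOS compiler such as Borland Turbo C (which supplies the `far` keyword and the MK_FP macro in dos.h), and it leaves filling in the segment table to whatever allocator the system uses:

```c
#include <dos.h>  /* MK_FP: Borland/Turbo C macro that builds a far pointer */

/* A chain of 16-knots: each 64 KiB block spans exactly one real-mode
 * segment, so a single 16-bit segment value identifies a whole knot. */
#define KNOT_COUNT 16                      /* 16 x 64 KiB = 1 MiB reachable */
static unsigned int knot_segs[KNOT_COUNT]; /* filled in by the allocator */

/* Read one byte out of the chain. `index` is a 32-bit logical address
 * (the caller keeps it below KNOT_COUNT << 16), but the pointer that
 * actually gets dereferenced is just segment:offset, i.e. two 16-bit
 * halves, never a flat 32-bit pointer. */
unsigned char knot_read(unsigned long index)
{
    unsigned int seg = knot_segs[index >> 16];  /* pick the knot    */
    unsigned int off = (unsigned int)index;     /* offset within it */
    return *(unsigned char far *)MK_FP(seg, off);
}
```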
Now think of the power of doing this on a modern chip with 16-bit pointers… or on thousands of them in parallel… are you getting it yet?
The implements for this will be coming with Hinterlib 2.0. Look forward to it.