Gary's Website

Identity: Keyoxide
Electronic Mail: gary@mooremoore.net
SourceHut: ~gary_moore
GitHub: @doggodoge
BlueSky: @mooremoore.net
 _______                    __        
|     __|.---.-.----.--.--.|  |.-----.
|    |  ||  _  |   _|  |  | |_||__ --|
|_______||___._|__| |___  |    |_____|
                    |_____|           
 __                __           __                       
|  |--.----.---.-.|__|.-----.--|  |.--.--.--------.-----.
|  _  |   _|  _  ||  ||     |  _  ||  |  |        |  _  |
|_____|__| |___._||__||__|__|_____||_____|__|__|__|   __|
                                                  |__|   
---

WASM SIMD in C Compiled With Zig
--------------------------------

So I've been writing a bit of code for my JS string interner, and I've set a
hard rule for myself to only use JS. Really bumping into some sort of limits
there though.

I'm talking about prefix search. The naive way to find strings that match a
prefix is to do a byte by byte scan. A better way is to pack multiple bytes
into one number and use bitmasks to match. That's effectively what I went
with. It's called SIMD Within a Register (SWAR).

You kinda have your hands tied behind your back from the start doing this in JS
though, as JS numbers are 64 bit floats and the bitwise operators only work on
32 bit integers. So you're stuck with 32 bits, which means comparing 4 single
byte characters at a time. That still gets us 4x the performance over the
simple byte scan in V8.
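
To make the SWAR idea a bit more concrete, here's a minimal sketch in C rather
than JS, sticking to 32 bit words to mirror the JS constraint. The names
(pack4, swar_has_prefix) are made up for illustration, and this is not the
interner's actual code.

```c
#include <stddef.h>
#include <stdint.h>

// Pack up to 4 bytes into one 32 bit word, first byte in the low bits.
// Hypothetical helper, not taken from the interner itself.
static uint32_t pack4(const uint8_t *p, size_t n) {
    uint32_t word = 0;
    for (size_t i = 0; i < n && i < 4; i++) {
        word |= (uint32_t)p[i] << (8 * i);
    }
    return word;
}

// Returns 1 if `str` starts with `prefix`, comparing 4 bytes per step.
int swar_has_prefix(const uint8_t *str, size_t str_len,
                    const uint8_t *prefix, size_t prefix_len) {
    if (prefix_len > str_len) {
        return 0;
    }
    size_t i = 0;
    // Full words: the XOR is zero only when all 4 bytes are equal.
    for (; i + 4 <= prefix_len; i += 4) {
        if ((pack4(str + i, 4) ^ pack4(prefix + i, 4)) != 0) {
            return 0;
        }
    }
    // Tail of 1 to 3 bytes: mask off the bytes past the end of the prefix.
    size_t rest = prefix_len - i;
    if (rest > 0) {
        uint32_t mask = (1u << (8 * rest)) - 1u;
        size_t avail = str_len - i < 4 ? str_len - i : 4;
        uint32_t diff = pack4(str + i, avail) ^ pack4(prefix + i, rest);
        if ((diff & mask) != 0) {
            return 0;
        }
    }
    return 1;
}
```

The byte loop in pack4 only keeps the sketch independent of endianness. In
real code the prefix would be packed once up front, and the string side would
be a single 4 byte load.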

Still, I was curious what you can do these days with WASM, which now supports
128 bit wide SIMD. That maps pretty much perfectly to my Mac's NEON instruction
set for SIMD.

Initially I tried just writing the weird Sexpr .wat stuff directly. It's
relatively straightforward to do so. Still, it wasn't the most pleasant thing
in the world to write.

The reason I avoided writing it in Zig is because it's not at all stable. I
didn't write it in C because I didn't want the emscripten headache. Then it
dawned on me, Zig is a really good C cross compiler. So I ended up just writing
the code in C with WASM SIMD intrinsics, and cross compiled with Zig. It worked
out beautifully! And Clang wrote better WASM than I could.
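
As a rough idea of what the C side can look like, here's a minimal sketch of a
16 bytes at a time prefix check using the intrinsics from Clang's
wasm_simd128.h. The function name and the scalar fallback are assumptions for
illustration, not the code I actually shipped. The SIMD path only compiles in
when targeting WASM with SIMD enabled, so the same file still builds natively.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__wasm_simd128__)
#include <wasm_simd128.h>
#endif

// Returns 1 if `str` starts with `prefix`. Hypothetical name.
int has_prefix_simd(const uint8_t *str, size_t str_len,
                    const uint8_t *prefix, size_t prefix_len) {
    if (prefix_len > str_len) {
        return 0;
    }
    size_t i = 0;
#if defined(__wasm_simd128__)
    // Compare 16 bytes per iteration. Equal lanes come back as 0xFF,
    // so all_true means every byte in the chunk matched.
    for (; i + 16 <= prefix_len; i += 16) {
        v128_t a = wasm_v128_load(str + i);
        v128_t b = wasm_v128_load(prefix + i);
        if (!wasm_i8x16_all_true(wasm_i8x16_eq(a, b))) {
            return 0;
        }
    }
#endif
    // Scalar tail (and the whole comparison on non-WASM builds).
    for (; i < prefix_len; i++) {
        if (str[i] != prefix[i]) {
            return 0;
        }
    }
    return 1;
}
```

Building it should be something along the lines of
`zig cc -target wasm32-freestanding -msimd128 -O2 -nostdlib -Wl,--no-entry
-Wl,--export=has_prefix_simd -o prefix.wasm prefix.c`, but treat the exact
flags as an assumption and check them against your Zig version.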

I'd definitely recommend trying this combo, WASM + C + Zig. Absolute power
trio.

Note: I'll get some code examples down here at some point.

---

Allocators in C
---------------

Just finished my arena allocator. It works by using mmap to reserve 4GB of
virtual memory up front (relying on overcommit, so pages only get backed once
they're touched), and when you reset the allocator, the offset just gets set
to 0 and we madvise MADV_DONTNEED. Seems to work well so far! :) It can also
return an Allocator struct which defines an interface, so we can have
functions just take that interface.
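
A minimal sketch of what such an interface could look like, assuming a context
pointer plus a couple of function pointers. The field names, copy_string and
the malloc backed stand-in are all hypothetical, it's just to show the shape
of the idea.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical interface: a context pointer plus the functions acting on it.
typedef struct Allocator {
    void *ctx;
    void *(*alloc)(void *ctx, size_t size);
    void (*reset)(void *ctx);
} Allocator;

// Code that needs memory takes an Allocator and doesn't care what backs it.
char *copy_string(Allocator a, const char *s) {
    size_t len = strlen(s) + 1;
    char *out = a.alloc(a.ctx, len);
    if (out != NULL) {
        memcpy(out, s, len);
    }
    return out;
}

// Stand-in backend for the sketch: plain malloc, ignoring ctx.
// An arena would pass itself as ctx and bump its offset instead.
static void *malloc_alloc(void *ctx, size_t size) {
    (void)ctx;
    return malloc(size);
}

static void malloc_reset(void *ctx) {
    (void)ctx; // nothing to do, malloc has no bulk reset
}

Allocator malloc_allocator(void) {
    Allocator a = { NULL, malloc_alloc, malloc_reset };
    return a;
}
```

Swapping the backend is then just a matter of handing copy_string a different
Allocator value.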

We also set the last page to PROT_NONE so it acts as a guard page. If anything
in the application tries to read or write that page, the app just crashes.
Gives us some basic protection for free, the CPU's memory management unit (MMU)
handles it.

Here's how the memory is initialised, and the guard page set:

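#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

Arena *arena_init(void) {
    // Reserve 4GB of address space; pages are only backed once touched.
    void *mem = mmap(NULL, FOUR_GB, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        return NULL;
    }

    long page_size = sysconf(_SC_PAGESIZE);

    // Make the last page inaccessible so it acts as a guard page.
    void *guard_page = (char *)mem + FOUR_GB - page_size;
    if (mprotect(guard_page, page_size, PROT_NONE) != 0) {
        munmap(mem, FOUR_GB);
        return NULL;
    }

    Arena *arena = malloc(sizeof(Arena));
    if (arena == NULL) {
        // Don't leak the mapping if the bookkeeping struct can't be allocated.
        munmap(mem, FOUR_GB);
        return NULL;
    }
    arena->mem = mem;
    arena->cap = FOUR_GB;
    arena->offset = 0;

    return arena;
}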

Just point to the start of the last page, pass in the page size of your
platform, PROT_NONE, and it's off to the races!

Important to programmatically get the page size rather than hard code it as
some popular platforms these days (Apple Silicon) use a 16K page size rather
than 4K.

Also, here's how the memory is cleared, and it's probably the main thing that
makes memory arenas so powerful:

void arena_reset(Arena *arena) {
    if (arena == NULL) {
        return;
    }
    madvise(arena->mem, arena->offset, MADV_DONTNEED);
    arena->offset = 0;
}

The important thing here, we just reset the offset to 0, and we can reuse all
that mem. A sort of interesting thing here is the use of madvise to tell the
kernel that we don't need this memory anymore, so the kernel can go ahead and
reclaim it.

Kind of on the fence about the madvise in there. It sort of defeats the point
of the arena, adding a syscall in there is kinda heavy, and where arenas are
used, memory usage before a free tends to reach a similar high water mark. Hmm,
now I'm thinking about removing it.
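
For context on why resetting the offset is all it takes, here's a rough sketch
of the allocation side as a bump allocator. The field names follow the
snippets above, but the types, the arena_alloc name and the alignment
parameter are assumptions for illustration.

```c
#include <stddef.h>

// Field names follow the snippets above; the exact types are assumed.
typedef struct Arena {
    unsigned char *mem;
    size_t cap;
    size_t offset;
} Arena;

// Hypothetical bump allocation. `align` must be a power of two.
void *arena_alloc(Arena *arena, size_t size, size_t align) {
    // Round the offset up to the next multiple of align.
    size_t start = (arena->offset + (align - 1)) & ~(align - 1);
    if (start > arena->cap || size > arena->cap - start) {
        return NULL; // out of space
    }
    arena->offset = start + size;
    return arena->mem + start;
}
```

Every allocation is just an add and a bounds check, and nothing is tracked per
allocation, which is why setting offset back to 0 frees everything at once.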

---

Gabe Deck
---------

I've been playing games a good bit on the Gabe Deck[0], and it's got a really
unique control scheme. I've been having a mess with Raylib recently in order to
get a bit better at C programming[1]. I might have a look at integrating Steam
Input to see what cool stuff you can do with the track pads and gyro.

[0]: https://store.steampowered.com/steamdeck
[1]: https://git.sr.ht/~gary_moore/game-playground-raylib

---

Automatic .plan Upload
----------------------

This is my first proper automatic upload of my .plan file using my kqueue based
watcher[0]. Should use zero CPU while not in use, and I can write these plan files
by just editing the file in neovim or whatever editor, and on save it'll
automatically get uploaded to my website.

I'm not sure how long the upload takes, but it's surely < 500ms. By the time I
save and alt+tab refresh my browser the content is there already.

We're using bunny.net and have completely moved away from Cloudflare. This
website is just a template on the bunny.net equivalent of workers, and this
plan just gets uploaded to their object storage. On request, the template gets
filled in from object storage, so the way we update this plan is actually
rather simple.

  1. Edit .plan
  2. Save
  3. Watcher picks up the change
  4. Watcher uploads to bunny.net object storage
  5. Future requests to the worker with my site automatically pick up
     my changes
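
Step 3 is the interesting bit. A minimal sketch of waiting on a file with
kqueue could look like the below. It's not the code from the watcher itself,
wait_for_change is a made up name, and on platforms without kqueue it just
returns -1 so the file still compiles there.

```c
#include <fcntl.h>
#include <unistd.h>

#if defined(__APPLE__) || defined(__FreeBSD__) || defined(__NetBSD__) || \
    defined(__OpenBSD__) || defined(__DragonFly__)
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#define HAVE_KQUEUE 1
#endif

// Blocks until `path` is written to, renamed or deleted.
// Returns 1 on a change, -1 on error or on platforms without kqueue.
int wait_for_change(const char *path) {
#ifdef HAVE_KQUEUE
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    int kq = kqueue();
    if (kq < 0) {
        close(fd);
        return -1;
    }

    // Ask for vnode events on the open file descriptor.
    struct kevent change;
    EV_SET(&change, fd, EVFILT_VNODE, EV_ADD | EV_CLEAR,
           NOTE_WRITE | NOTE_RENAME | NOTE_DELETE, 0, NULL);

    // Register and wait in one call. With a NULL timeout the thread
    // sleeps in the kernel until something happens, hence the zero CPU.
    struct kevent event;
    int n = kevent(kq, &change, 1, &event, 1, NULL);

    close(kq);
    close(fd);
    return n > 0 ? 1 : -1;
#else
    (void)path;
    return -1;
#endif
}
```

One thing to watch for: editors that save by writing a temp file and renaming
it over the original swap the inode, so a real watcher needs to reopen the
path after a rename or delete event.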

Lovely! Personal micro-blogging doesn't need to be so complicated :)

Just had another thought there, this is a pretty neat thing for covering live
events. Might want to somehow send updates via server-sent events or something,
but on the editor side, seems great for that.

[0]: https://git.sr.ht/~gary_moore/toe

---

Blog Publishing Platform
------------------------

So I have an idea for this, something similar to what I've done already for
publishing to my gemlog on the gemini protocol. That is, why not just send the
blog post to myself as a signed email? Already got a decent bit of the code,
just need to flesh out HTML templating vs the much simpler gem format.

That's what I think I'll do eventually in order to have a platform for the
below article series. There's one important deviation compared to how I did it
before, though. Instead of the silly stuff with the git repo and SourceHut
build step as before, I'll just push to a bucket, and do something with a
bunny.net script.

---

"Low-Level" JS article series idea
----------------------------------

So this is something that's been floating about in my head for a while now.
There are some interesting performance topics that I haven't really seen
covered much with regards to JavaScript, and I suppose the common thread is
manual memory management with typed arrays.

There are some specific, probably niche, topics here:

  1. Packing low cardinality information into bitfields
  2. Discussion on deopt pitfalls wrt polymorphic func parameters
  3. SIMD within a register, specifically in u32s
  4. Buffer backed string interning for cache locality
  5. Eliminating GC churn with fixed slots and freelists
  6. Improving instruction level parallelism with loop unrolling
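
To give a flavour of topic 5, here's a minimal sketch of fixed slots with an
index based freelist, written in C for brevity. The names and the pool size
are hypothetical. The same layout should carry over to a Uint32Array in JS,
since the freelist only ever stores slot indices.

```c
#include <stdint.h>

#define SLOT_COUNT 4        // hypothetical pool size
#define NO_SLOT UINT32_MAX  // marks the end of the freelist

typedef struct SlotPool {
    uint32_t next_free[SLOT_COUNT]; // next_free[i] = free slot after i
    uint32_t head;                  // first free slot, or NO_SLOT
} SlotPool;

// Chain every slot into the freelist: 0 -> 1 -> 2 -> ... -> NO_SLOT.
void pool_init(SlotPool *p) {
    for (uint32_t i = 0; i < SLOT_COUNT; i++) {
        p->next_free[i] = (i + 1 < SLOT_COUNT) ? i + 1 : NO_SLOT;
    }
    p->head = 0;
}

// Pop a slot index off the freelist, or NO_SLOT if the pool is full.
uint32_t pool_acquire(SlotPool *p) {
    uint32_t slot = p->head;
    if (slot != NO_SLOT) {
        p->head = p->next_free[slot];
    }
    return slot;
}

// Push a slot index back so the next acquire reuses it.
void pool_release(SlotPool *p, uint32_t slot) {
    p->next_free[slot] = p->head;
    p->head = slot;
}
```

Nothing is allocated after pool_init, so there's nothing for a GC to chase.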

And more topics as I think of them I suppose. Before doing a wee series on JS
perf, I suppose I should build out some way to publish blogs on my personal
site. I already publish to the gemini protocol because the LLM crawlers don't
go looking there, but I've got this .plan thing here now, and I think I can
borrow the kqueue stuff for a really low friction CMS-ish thing.