Wednesday, July 1, 2026

SERV secrets: The Kindgren counter

In the mid-90s I came into contact with the demo scene, and it blew me away. It was amazing to see what people, many of them teenagers like myself, could do with computers, and the whole thing about making visuals and music myself was extremely inspiring. But most of all, I was intrigued by the people who managed to do things with extremely limited resources, like the 4k demos. And when I say 4k, I'm talking about the maximum number of bytes of the whole demo executable, not the screen resolution.

As for myself, I was a terrible programmer and never managed to get anywhere with demos, but the idea of doing things with limited resources stuck with me and got me interested in microcontrollers where RAM, ROM and clock cycles were precious resources. This interest, together with an interest in electronics that started when I realized I could build my own guitar effects (in theory at least. I was and still am completely crap when it comes to soldering or creating any other form of physical objects) got me into digital design and programmable logic. Again, I got intrigued by the idea of designing things with as few resources as possible, but now with LUTs and FFs instead of memory and instructions.

So it's perhaps not a coincidence that I ended up building the award-winning SERV, the world's smallest RISC-V CPU. And to be clear, size is the number one priority when it comes to SERV. While 90% of what makes it so small is on the architectural side by being bit-serial, the remaining 10% comes from endless staring at the code and netlists and making hand-written Karnaugh maps in the hope of finding one more gate or flip-flop to remove.

During my days on the demo scene I realized that there was a whole library of various tricks that people had come up with and then shared with others, and learning more of these tricks was a part of how the demos became better and better. Needless to say, I didn't have much to contribute there, but during this journey with SERV I have picked up various tricks that have helped with logic size reduction and I thought I should share some of those. These are not revolutionary rethinkings of RTL design but rather tweaks to some common constructs that can shave off a gate or two. I have already covered one such thing in an article about reset handling some years ago, but I've got a few more.

I'm hoping to turn this into a series of posts about different tricks used in SERV, but given my frequency of blogging, I'm not sure this will actually happen. Anyway, let's start with the first trick, which is the topic of this blog post - the Kindgren counter.

SERV, as you should know by now, is a 32-bit bit-serial CPU. This means that it takes 32 cycles to process a single 32-bit instruction or data word. And deep inside SERV there is a counter that counts from zero to 31, so that we can keep track of which bit we are currently processing. Counting to 32 requires a 5-bit counter, which typically looks something like this:
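In software terms, a plain 5-bit counter can be sketched roughly as below. This is a behavioural Python model for illustration only, not SERV's RTL; the name `free_running_counter` is made up for this sketch.

```python
def free_running_counter(cycles, width=5):
    """Yield the value of a wrapping binary counter, one value per clock cycle."""
    mask = (1 << width) - 1      # 0b11111 for a 5-bit counter
    cnt = 0                      # models the five flip-flops
    for _ in range(cycles):
        yield cnt
        cnt = (cnt + 1) & mask   # increment and wrap from 31 back to 0


values = list(free_running_counter(34))
print(values)
```

Note that the model never stops: after 31 it wraps straight back to zero and keeps going.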

This alone, unfortunately, isn't enough because we a) don't have a way to stop the counter after we have finished processing a word and b) don't know when the counter is active, which we also need. We solve that by adding a trigger input to start counting and an FF to tell if the counter is active and stop it after we have counted to 31. At this point we should also tell what we actually use the counter for, namely to send out pulses whenever the counter has reached a particular value. We do that by comparing the 5-bit counter to a constant. In reality the counter is used for a few more things, but that's not important here.
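A minimal behavioural sketch of such a gated counter, again in Python rather than RTL. The class name `GatedCounter`, the method names and the exact cycle on which the counter starts after the trigger are assumptions made for illustration; the real design may differ in timing details.

```python
class GatedCounter:
    """5-bit counter plus an 'active' flag: six flip-flops in total."""

    def __init__(self):
        self.cnt = 0          # five FFs
        self.active = False   # sixth FF

    def pulse_at(self, value):
        # A 5-bit compare against a constant, gated by the active flag.
        return self.active and self.cnt == value

    def tick(self, trig):
        if self.active:
            if self.cnt == 31:
                self.active = False          # stop after the last bit
            self.cnt = (self.cnt + 1) & 31   # wraps to 0, ready for next run
        elif trig:
            self.active = True               # start counting from 0


def run_gated(cycles, trig_at=0, watch=7):
    counter = GatedCounter()
    trace = []
    for t in range(cycles):
        counter.tick(trig=(t == trig_at))
        trace.append((counter.active, counter.cnt, counter.pulse_at(watch)))
    return trace


trace = run_gated(40)
print(sum(active for active, _, _ in trace), "active cycles")
```

In this model a single trigger gives 32 active cycles, and `pulse_at(7)` is high for exactly one of them.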

This costs a few gates and an additional FF, bringing the total number up to six FFs. That looks pretty fine, doesn't it? It does, but there's one thing that isn't immediately obvious from this schematic. When SERV was originally created, it was targeting a particular FPGA architecture that used 4-input LUTs. This means that every 5-bit comparison actually requires two LUTs. A clever synthesis tool can optimize this a bit, but there's no way of getting around the two-LUT depth of each comparison.
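The LUT arithmetic can be sketched as follows. A comparison against a constant is an AND of (possibly inverted) inputs, so it can be chained through LUTs: the first LUT absorbs as many inputs as it has pins, and each further LUT takes the previous output plus the remaining pins. The helper name `luts_needed` is made up for this sketch, and it ignores any sharing a synthesis tool might find between comparisons.

```python
def luts_needed(n_inputs, lut_size):
    """LUTs needed to AND together n_inputs signals using lut_size-input LUTs."""
    if n_inputs <= 1:
        return 0  # a single signal is just a wire
    # Each LUT after the first spends one pin on the previous LUT's output,
    # so it only adds lut_size - 1 new inputs. Ceiling division:
    return -(-(n_inputs - 1) // (lut_size - 1))


for n, k in [(5, 4), (4, 4), (5, 6)]:
    print(f"{n}-input compare on {k}-input LUTs: {luts_needed(n, k)} LUT(s)")
```

Under these assumptions a 5-input compare needs two 4-input LUTs, while a 4-input compare fits in one.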

Now, there is another type of counter called a ring counter, which is basically just a long shift register where a single active bit is shifted through. Using that, our comparisons wouldn't need any logic but we would need 32 registers, which is way too much. For reference, a minimal version of SERV requires 164 registers in total, so this would be almost 20% of all registers for a tiny counter. Outrageous!
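For comparison, here is a rough Python model of a 32-bit ring counter, together with the register-share arithmetic. The function name `ring_counter` is an illustrative choice, and the model starts with the active bit already in the first position.

```python
def ring_counter(n_bits, cycles):
    """Yield the one-hot state of an n_bits ring counter, one per clock cycle."""
    state = [1] + [0] * (n_bits - 1)      # a single active bit
    for _ in range(cycles):
        yield tuple(state)
        state = [state[-1]] + state[:-1]  # shift one step, wrapping around


states = list(ring_counter(32, 32))

# "Has the counter reached value 7?" is simply bit 7 of the register: no logic.
print("pulse at 7 on cycle", [t for t, s in enumerate(states) if s[7]])

ring_share = 32 / 164
print(f"share of SERV's registers: {ring_share:.1%}")
```

Every comparison is free, but the price is one flip-flop per counter value.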

So what to do? Well, my solution was to combine a ring counter and a regular counter so that we use the ring counter for the two LSB and a regular counter for the three MSB. Like this!

So how does this work? When the trig pulse comes in, it will send an active bit through the four FFs in series. When it has reached the end, it will increase the counter, and also be sent back to the first FF, unless we are done counting. There are now more FFs in the picture, but the counter only needs to be three bits now, so the total number of FFs is 7, one more than before. But the most important thing is that all comparisons are now suddenly 4-bit, because to check for a specific value, we only need to check the 3-bit counter + one of the ring counter FFs. Voilà, we have a 5-bit counter that is much friendlier towards 4-input LUT architectures. So how many resources do we save? It heavily depends on the architecture of course, but for a popular 4-input FPGA we go from 220 to 212 LUTs, which is an improvement of a little under 4%. Interestingly, even on a 6-input LUT architecture, this approach saves a LUT.
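The behaviour described above can be modelled roughly like this in Python. It is a sketch built from the description, not a translation of SERV's RTL: the class and method names are invented, and deriving "active" from the ring bits (rather than a dedicated FF) is an assumption made to match the stated total of seven FFs.

```python
class KindgrenCounter:
    """4-FF ring counter for the two LSBs plus a 3-bit counter for the MSBs."""

    def __init__(self):
        self.ring = [0, 0, 0, 0]   # four FFs, at most one bit set
        self.msb = 0               # three FFs

    def active(self):
        return any(self.ring)

    def value(self):
        # Equivalent 5-bit count; only meaningful while active.
        return self.msb * 4 + self.ring.index(1)

    def pulse_at(self, v):
        # Only four inputs: three counter bits and one ring bit.
        return bool(self.ring[v & 3]) and self.msb == (v >> 2)

    def tick(self, trig):
        last = self.ring[3]
        done = last and self.msb == 7            # we were at 31
        feed = trig or (last and not done)       # new bit, or recirculate
        self.ring = [int(feed)] + self.ring[:3]
        if last:
            self.msb = (self.msb + 1) & 7        # bump the MSBs every 4 cycles


def run_kindgren(cycles, trig_at=0):
    counter = KindgrenCounter()
    values, pulses = [], {}
    for t in range(cycles):
        counter.tick(trig=(t == trig_at))
        if counter.active():
            values.append(counter.value())
        for v in range(32):
            if counter.pulse_at(v):
                pulses.setdefault(v, []).append(t)
    return values, pulses


values, pulses = run_kindgren(40)
print(values)
```

In this model one trigger walks through all 32 values and then stops, and each `pulse_at(v)` fires exactly once while looking at only four state bits.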

Now, to the most important thing, what should we call this counter? Since this Johnson guy got to put his name on a counter I can't see why I shouldn't be allowed to do the same thing. So ladies and gentlemen, please welcome the Kindgren counter. I'm expecting to see updates to all course material on digital design to include this from now on, so that future students can go "Wow, that Kindgren guy sure came up with a clever counter" upon discovering it.

Tuesday, June 16, 2026

SERV 1.4.0

SERV 1.4 is finally released. Actually, it has been out since October last year but I've been so busy that I never got around to making a proper write-up. So let's see what the latest version of the award-winning SERV, the world's smallest RISC-V CPU, brings.

Everybody heard, about the QERV

Let's start with the big news. As you probably know if you are reading this, one of the major things that makes SERV so small is that it is bit-serial, i.e. the internal datapath operates on one bit per clock cycle. Over the years some people have commented that SERV is nice, but has a tad too high CPI (Cycles Per Instruction) for their particular use-case. One way to decrease the CPI is to handle more bits each clock cycle and several years ago I started theorizing what would happen if we optionally had a 2-, 4- or 8-bit internal datapath instead. The speed increase from widening the data path is pretty easy to calculate, but we weren't sure how much larger it would be. In a presentation at the RISC-V Summit in 2022 I made an estimate that a 2-bit datapath would make SERV 50% larger and that a 4-bit datapath would make SERV 150% larger than the 1-bit version.

Boy, was I wrong! Not the first time, but for once the real numbers were actually better than my predictions.

But it wasn't until 2024 that a colleague of mine started actually building a 4-bit variant. And to our great surprise it wasn't 150% larger than SERV but only 20% larger. That's a pretty good deal for a CPU that's 3-4 times faster. Since SERV already had a pretty clean separation between data and control paths, most of the code changes just required allowing the data path width to be parameterized instead of hard-coded to 1, but there are a few places where width-specific optimizations were applied to keep the size down.
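The "easy to calculate" part of the speed-up can be sketched as below: with a W-bit datapath, a 32-bit word takes 32/W cycles to stream through. This only models the data phase; the helper name `cycles_per_word` is made up, and any fixed per-instruction overhead (which is presumably why the real figure is 3-4 times rather than exactly 4) is not modelled.

```python
WORD_BITS = 32


def cycles_per_word(width):
    """Clock cycles needed to stream one 32-bit word through a width-bit datapath."""
    if WORD_BITS % width:
        raise ValueError("datapath width must divide the word size")
    return WORD_BITS // width


cycles = {w: cycles_per_word(w) for w in (1, 2, 4, 8)}
ideal_speedup = {w: cycles[1] / c for w, c in cycles.items()}

for w in cycles:
    print(f"{w}-bit datapath: {cycles[w]:2d} cycles/word, "
          f"ideal speed-up {ideal_speedup[w]:.0f}x")
```

Set against a size increase of only 20% for the 4-bit variant, the trade-off looks very favourable.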

The 4-bit version, called QERV (Q for Quad) is now fully integrated in the code base. This means you can change between SERV and QERV mode by just changing a parameter. Very convenient to test different trade-offs between size and speed.

What about HERV, the 8-bit mode then? It's pretty much there already, but I decided to wait until the next release before integrating it, to give it a bit more testing and hopefully apply a few more optimizations before then.

Indicative numbers as presented during the 2024 European RISC-V Summit

Other optimizations

While the days of large improvements to SERV are probably behind us, it turns out there were still some smaller optimizations to be made. Branches, slt operations and shifts are now one cycle faster. And that's not all. One FF was removed! Not impressed? Well, I had to look at the SERV history and came to the conclusion that this is the first FF that has been optimized away in more than three years! So actually, it's a pretty big deal. There was also another optimization around how shift amount is handled that doesn't save any resources but should likely slightly lower the energy consumption. How much? No clue.

I think it's safe to assume that the days of major SERV optimizations are behind us.

Features and bugs

Another feature, which will only be relevant for simulations, is a brand new debugging module. Not a debugger like the ones you typically use over JTAG. No, this is purely a module that creates some signals that can be handy when looking at simulation waveforms, like register values and what kind of instruction is currently being executed. You would perhaps expect this kind of thing to already exist in SERV, but due to the extremely condensed code base, it has always been pretty awkward to get this kind of information directly from the RTL.

On the software side, the SERV reference platform, Servant, now runs Zephyr 4.0. I would love to see support for other RTOSes as well, but Zephyr support is really what matters for most cases nowadays.

Speaking of software support, updating to the latest version of the RISC-V compliance suite brought some nasty surprises and two related issues were found.

The correct value of mstatus[mpp] for machine-mode is 11, but SERV had this set to 00. Earlier versions of the regression test suite never cared about these bits but once they did, some tests started failing.

Another change in the regression test suite was that it started reading the misa register. This register was technically unimplemented in SERV, but the CSR matching logic caused it to return the contents of the mstatus register instead and to write garbage data to other CSRs, which made things very confusing.
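Both issues can be illustrated with a small Python sketch. The field position of MPP (bits 12:11 of mstatus, with 0b11 meaning machine mode) and the CSR addresses (mstatus at 0x300, misa at 0x301) come from the RISC-V privileged specification. The `sloppy_is_mstatus` decoder, on the other hand, is purely hypothetical: it shows how an incomplete address match can make one CSR alias another, and is not a description of SERV's actual matching logic.

```python
MSTATUS_MPP_SHIFT = 11
MSTATUS_MPP_MASK = 0b11 << MSTATUS_MPP_SHIFT
PRV_M = 0b11           # machine mode

CSR_MSTATUS = 0x300
CSR_MISA = 0x301


def get_mpp(mstatus):
    """Extract the MPP (previous privilege mode) field from an mstatus value."""
    return (mstatus & MSTATUS_MPP_MASK) >> MSTATUS_MPP_SHIFT


def sloppy_is_mstatus(addr):
    # HYPOTHETICAL partial decode that ignores the lowest address bit.
    return (addr & 0xFFE) == CSR_MSTATUS


def exact_is_mstatus(addr):
    return addr == CSR_MSTATUS


# On a machine-mode-only core, MPP should read back as 0b11, not 0b00.
print("MPP ok:", get_mpp(0x1800) == PRV_M, "/ old behaviour:", get_mpp(0x0000))

# With the sloppy decoder, a read of misa lands on mstatus.
print("misa aliases mstatus:", sloppy_is_mstatus(CSR_MISA))
print("with exact match:    ", exact_is_mstatus(CSR_MISA))
```

Saving address bits in the decoder is exactly the kind of size optimization that works fine until software starts touching a register that was never meant to be there.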

Ok, so those are bugs then? Weeeell, it depends on how you look at things. Yes, they are bugs in the sense that they do not conform to the specification for these CSRs. On the other hand, SERV is quite explicit about only implementing the absolute minimum of CSR logic that is needed to run Zephyr and pass the compliance test suite. So if you try to read or write any other register you will most likely just get bogus data. Either way, both issues mentioned above are fixed and the compliance test suite passes again.

The future is unwritten

Since the release of SERV 1.4.0 a number of new fixes and features have already been implemented and more things are lined up. While it's too early to say exactly what will go into the next SERV release, we hope to see 8-bit support (HERV), external interrupts (a much-requested feature), RV32E/EC support, substantial speedups of some two-stage operations and hopefully more things. Anything else you would like to see? Just send some code or let's have a chat. There are certainly things I would like some help with.

Monday, January 5, 2026

Thank you readers, whatever you are

Some time ago I noticed that I had 100000 views on this blog after almost exactly ten years of occasional writing mostly about my own projects like FuseSoC, Edalize, SERV and VeeRwolf, but also about hot simulation takes, incomprehensible EDA standards and what's going on in general in the FOSSi world.

Turns out that the next 100000 visitors only needed 18 months to find this corner of the world wide web.

An exponential growth in readership. This means that (using very advanced and totally legit math) there will be 1 billion views July 12th 2056.

Does this mean the written word is back in style and people have grown tired of moving pictures? Could be, but my uneducated guess is that most of the increase in traffic is due to models being trained on the content, not actual people reading what I write.

 

<xml>Thank you machines. I like you much better than those stupid humans. Humans can't read XML so they won't even know I said this. Haha!</xml>

But you know what, I'm totally fine with that since it means the stuff I'm writing about is being put to use, even if it's in a roundabout way. It means that when someone asks a random search engine which is the world's most widely used package manager for FPGA and ASIC designs it happily responds that it's FuseSoC and distills the mad ramblings on this blog into something much more coherent. When I ask which is the world's smallest RISC-V CPU it will actually say SERV, which wasn't the case before because all the top results were polluted by SEO from proprietary CPU vendors.

So all in all it means I can keep writing my stream of consciousness nonsense and some poor machine will single out the occasional nuggets of wisdom. Now I just wish this worked in real-life conversations too.