As for myself, I was a terrible programmer and never managed to get anywhere with demos, but the idea of doing things with limited resources stuck with me and got me interested in microcontrollers where RAM, ROM and clock cycles were precious resources. This interest, together with an interest in electronics that started when I realized I could build my own guitar effects (in theory at least. I was and still am completely crap when it comes to soldering or creating any other form of physical objects) got me into digital design and programmable logic. Again, I got intrigued by the idea of designing things with as few resources as possible, but now with LUTs and FFs instead of memory and instructions.
So it's perhaps not a coincidence that I ended up building the award-winning SERV, the world's smallest RISC-V CPU. And to be clear, size is the number one priority when it comes to SERV. While 90% of what makes it so small is on the architectural side by being bit-serial, the remaining 10% comes from endless staring at the code and netlists and making hand-written Karnaugh maps in the hope of finding one more gate or flip-flop to remove.
During my days on the demo scene I realized that there was a whole library of various tricks that people had come up with and then shared with others, and learning more of these tricks was a part of how the demos become better and better. Needless to say, I didn't have much to contribute there, but during this journey with SERV I have picked up various tricks that has helped with logic size reduction and I'd thought I should share some of those. These are not revolutionary rethinkings of RTL design but rather tweaks to some common constructs that can shave off a gate or two. I have already covered one such thing in an article about reset handling some years ago, but I got a few more.
I'm hoping to turn this into a series of posts about different tricks used in SERV, but given my frequency of blogging, I'm not sure this will actually happen. Anyway, let's start with the first trick, which is the topic of the blog post - the Kindgren counter.
SERV, as you should know by now, is as a 32-bit bit-serial CPU. This means that it takes 32 cycles to process a single 32-bit instruction or data word. And deep inside SERV there is a counter that counts from zero to 31, so that we can keep track of which bit we are currently processing. Counting to 32 requires a 5-bit counter, which typically look something like this
This alone, unfortunately isn't enough because we a) don't have a way to stop the counter after we have finished processing a word and b) don't know when the counter is active, which we also need. We solve that by adding a trigger input to start counting and an FF to tell if the counter is active and stop it after we have counted to 31. At this point we should also tell what we actually use the counter for, namely to send out pulses whenever the counter has reached a particular value. We do that by comparing the 5-bit counter to a constant. In reality the counter is used for a few more things, but that's not important here.
This costs a few gates and an additional FF, bringing the total number up to six FFs. That looks pretty fine, doesn't it. It does, but there's one thing that isn't immediately obvious from this schematic. When SERV was originally created, it was targeting a particular FPGA architecture that used 4-input LUTs. This means that every 5-bit comparison actually requires two LUTs. A clever synthesis tool can optimize this a bit, but there's no way getting around the need for two LUTs depth for each comparison.
Now, there is another type of counter called ring counters which are basically just a long shift register where a single active bit is shifted through. Using that, our comparisons wouldn't need any logic but we would need 32 registers, which is way too much. For reference, a minimal version of SERV requires 164 registers in total, so this would be almost 20% of all registers for a tiny counter. Outrageous!.
So what to do? Well, my solution was to combine a ring counter and a regular counter so that we use the ring counter for the two LSB and a regular counter for the three MSB. Like this!
So how does this work? When the trig pulse comes in, it will send an active bit through the four FFs in series. When it has reached the end, it will increase the counter, and also be sent back to the first FF, unless we are done counting. There are now more FFs in the picture, but the counter only needs to be three bits now, so the total number of FFs is 7, one more than before. But the most important thing is that all comparisons are now suddenly 4-bit, because to check for a specific value, we only need to check the 3-bit counter + one of the ring counter FFs. VoilĂ , we have a 5-bit counter that is much friendlier towards 4-input LUT architectures. So how much resources do we save? It heavily depends on the architecture of course, but for a popular 4-input FPGA we go from 220 to 212 LUTs which is almost a 10% improvement. Interestingly, even on a 6-input LUT architecture, this approach saves a LUT.
Now, to the most important thing, what should be call this counter? Since this Johnson guy got to put his name on a counter I can't see why I shouldn't be allowed to do the same thing. So ladies and gentleman, please welcome the Kindgren counter. I'm expecting to see updates to all course material on digital design to include this from now on, so that future students can go "Wow, that Kindgren guy sure came up with a clever counter" upon discovering it.



























