Every once in a while there is a question in the Arduino forum on how fast you can toggle an IO pin. The answer to this question depends on your line of thinking / what you will allow as an answer. So here is my take on performance optimization for its own sake.
First of all there are pins PB6 and PB7 which are connected to the crystal. Since these pins can be used as IO pins (but only if running without crystal) I would conclude the maximum speed is 16 MHz. However this feels like cheating as these pins just oscillate due to the crystal. Let us look at the next option for super fast output. According to the data sheet the fuses can be changed such that the system clock will be passed to PB0 which corresponds to digital 8. This will again give a 16 MHz output. But it is still somehow cheating because it requires to change the fuses which is probably out of reach for a lot of Arduino tinkerers. The next thing is to go for one of the hardware timers and toggle one of the output pins at each clock cycle. This would give an output of 8 MHz. I would not consider this cheating but it is somewhat boring because the hardware does all the work.
Pondering about this kind of arguments I started to wonder how fast can I toggle a pin in software? And as an afterthought how fast can I count a 20 bit counter (all the IO pins of my Arduino) in software? Of course this is just a somewhat theoretical exercise / performance optimization for its own sake. In a production environment one would feed a clock signal into counter chips. This would be cheaper and faster. However back to my issue: how fast can an Arduino count a 20 bit output without using a hardware counter?
First of all it is clear that instead of digitalWrite() I will use direct port manipulation. The naive solution would be to increment port D (digital 0-7), then carry on to port B (digital 8-13) and finally carry on to port C (digital 14-19).
This solution has two significant drawbacks though. The most significant drawback is that this approach does not work at all. The issue arises once this counter reaches 255. That is the point in time all bits of port D are set to 1 and all other ports are 0. The next step will set all bits of port D to 0 and bit 0 of port B to 1. Since this is performed in software the output will jump from 255 to 0 (after port D is incremented) and then to 256 (after the increment is carried to port B). Changing the update order does not make this any better. Then the output will jump from 255 to 0 and then 256. In both cases it is just impossible to count glitch free.
The other issue with the naive solution is that every time a carry operation is required two output ports must be toggled thus reducing the maximum reachable speed.
In order to resolve both issues I figured that it is better to not count in binary at all. Instead I will implement a Gray code counter. See the table below how Gray code looks compared to binary.
The point is that counting in Gray code will only toggle 1 bit per step. Thus it avoids both issues I already mentioned. As it will turn out it also allows to come up with a very fast implementation.
Besides this theoretical considerations there are practical considerations as well. How to implement this with as little computations as possible? And along the same train of thought: how to implement this with as little memory operations as possible? How to keep as much as possible in some CPU registers?
You may want to ponder this a little bit before you read on and have a look at my first solution.
Read More: Fast Counter