I wrote a partial parallel-ROM-emulator-like program, similar to what is described in this thread: viewtopic.php?t=335460
The PIO program waits for the read signal to go active (low) and then runs an IN instruction that shifts the states of the address input pins into the ISR, which already holds the upper address bits. This brings the ISR shift count up to the auto-push threshold, so the address is auto-pushed to the RX_FIFO. DMA channel 1 (DMA1), already enabled, waits for the RX_FIFO DREQ and then copies the 32-bit address from the PIO RX_FIFO to the READ_ADDR register of DMA channel 2 (DMA2), using the alias of that register that also triggers the channel. DMA2 is paced by the PIO TX_FIFO DREQ (it works the same with an always-active DREQ). Once triggered by DMA1, it reads a 16-bit value from the data array in RAM at the address now in READ_ADDR and writes it to the PIO TX_FIFO. Address bus bit 0 is kept low to prevent a DMA bus error on an unaligned access. Meanwhile the PIO program has been waiting on an auto-pull, triggered by an OUT pindirs instruction whose shift count exhausted the OSR. Once the TX_FIFO has data, the following OUT pins instruction gets that data, outputs it to the pins and completes. DMA2 is configured with CHAIN_TO to re-trigger DMA1, which continues the cycle.
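To make the data path concrete, here is a small host-side Python model of one read cycle. It is only a sketch under assumptions that are not stated above: the ISR is assumed to shift left (so the 24 bits preloaded from Y end up above the 8 pin bits), `TABLE_BASE` is a hypothetical 256-byte-aligned RAM address, and the FIFOs and the DMA register are plain Python objects rather than hardware.

```python
from collections import deque

TABLE_BASE = 0x20001000  # hypothetical, 256-byte-aligned address of the data array
ram = {}                 # toy RAM: byte address -> 16-bit value


def isr_after_in(y, pins):
    """ISR contents after 'in y, 24' then 'in pins, 8', assuming shift-left."""
    isr = y & 0xFFFFFF                               # in y, 24 (done in the previous cycle)
    isr = ((isr << 8) | (pins & 0xFF)) & 0xFFFFFFFF  # in pins, 8 -> auto-push
    return isr


def read_cycle(y, pins):
    """Model one bus read: address in on the pins, 16-bit data out."""
    rx_fifo, tx_fifo = deque(), deque()
    rx_fifo.append(isr_after_in(y, pins))   # PIO auto-push
    dma2_read_addr = rx_fifo.popleft()      # DMA1: RX_FIFO -> DMA2 READ_ADDR (trigger alias)
    if dma2_read_addr & 1:
        raise ValueError("16-bit DMA read must be halfword aligned")
    tx_fifo.append(ram[dma2_read_addr])     # DMA2: RAM -> TX_FIFO (16-bit)
    return tx_fifo.popleft()                # PIO auto-pull, then 'out pins 16'


# Fill 128 halfword entries; address bit 0 is always low on the bus.
for i in range(128):
    ram[TABLE_BASE + 2 * i] = 0xA000 + i

y = TABLE_BASE >> 8                         # value preloaded into Y
print(hex(read_cycle(y, 0x04)))             # prints 0xa002
```

With this layout the table has to start on a 256-byte boundary, because the 8 pin bits replace the low byte of the address outright.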
The side-set signal changes in four places, which makes it possible to measure the delays between those instructions. The first IN instruction has a side-set that sets the level of an output pin, and so does the instruction after the OUT pins instruction. The time between those two points is 14 cycles (at 125 MHz on RP2040). This is with input synchronization bypassed, though that does not matter because I am measuring on an output pin, from the IN instruction to the instruction after the OUT. The measurement has to be taken on the instruction after the OUT, because the side-set of the OUT instruction itself takes effect much earlier, before the TX_FIFO has received the DMA data, so that position cannot be used.
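For reference, the cycle count translates to wall-clock time as follows. This is plain arithmetic on the figures above (14 cycles, 125 MHz); the function name is only illustrative.

```python
def cycles_to_ns(cycles, clk_hz):
    """Duration of `cycles` system clock cycles in nanoseconds."""
    return cycles * 1e9 / clk_hz


print(cycles_to_ns(14, 125_000_000))  # prints 112.0
```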
Code:
.program pioRegRead
.side_set 1 opt // Signal used to measure the times.

loop:
    out pindirs 32 side 0 // Output-enable of the data bus. Also triggers auto-pull (as it drains OSR to trig. level).
    out pins 16 side 1    // Output the data that was fetched from DMA to TX_FIFO previously by the auto-pull.
    mov osr, null side 0  // Load value for disabling pindirs.
    wait 1 gpio 1         // Wait for the end of /RD.
    out pindirs 16        // Disable data outputs. Don't cause auto-pull.
public entryPoint:
    wait 0 gpio 1         // Wait for /RD.
    in pins, 8 side 1     // Load the lower 8 bits (the upper bits are loaded already). Also causes an auto-push.
    in y, 24              // Load the upper 24 addr bits in ISR from Y (for the next read cycle).
    mov osr, ~ null       // Preload OSR with a value to enable outputs. Also clears the old shift-count.
    jmp loop
A guess about how the 14 cycles may be spread over the operations done:
- 1 cycle IN pins to ISR and then to RX_FIFO
- 6 cycles DMA1
- 6 cycles DMA2
- 1 cycle autopull from TX FIFO to OSR and output from OSR to pins
Furthermore, I ran a test in which only one DMA channel was triggered by the PIO RX_FIFO DREQ and wrote to the PIO TX_FIFO (same PIO code). That took 8 cycles from the IN instruction to the instruction after the OUT. So 14 - 8 = 6 cycles are added by putting one more channel in the 'chain'. Perhaps the estimate above is not correct and one of those cycles comes from using the trigger register alias instead. But I also tried triggering DMA2 with CHAIN_TO, and that gave the same timing.
Perhaps the following is a more accurate estimate, though 5 cycles per DMA channel still seems like a lot:
- 2 cycles IN pins to ISR and then to RX_FIFO
- 5 cycles DMA1
- 5 cycles DMA2
- 2 cycles autopull from TX FIFO to OSR and output from OSR to pins
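One way to compare the two splits is to check each against both measurements: 14 cycles with two channels and 8 cycles with one. The sketch below assumes, which may not hold on real hardware, that the PIO-side overhead is identical in both tests and that every DMA channel in the path costs the same number of cycles.

```python
def predicted_cycles(pio_in, per_dma, pio_out, n_channels):
    """Total IN-to-after-OUT latency for a given per-stage split."""
    return pio_in + per_dma * n_channels + pio_out


guesses = {
    "1 + 6 + 6 + 1": (1, 6, 1),
    "2 + 5 + 5 + 2": (2, 5, 2),
}
measured = {2: 14, 1: 8}  # DMA channels in the path -> measured cycles

for name, (pio_in, per_dma, pio_out) in guesses.items():
    fits = all(
        predicted_cycles(pio_in, per_dma, pio_out, n) == cycles
        for n, cycles in measured.items()
    )
    print(name, "matches both measurements:", fits)
```

Under those assumptions only the first split reproduces the 8-cycle single-channel result; the second would predict 2 + 5 + 2 = 9 cycles.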
The thread mentioned above specifies somewhat shorter DMA times. Is there anything that can be done to improve DMA performance or PIO FIFO delays?
Is the RP2350 any faster than the RP2040 in this respect, i.e. in the PIO IN/OUT instruction <-> FIFO path and in DMA performance?
It would be nice if the MCU did not have to run more than 10 times faster than the bus it is interfacing with, but perhaps this is not what the PIO was meant for.
Statistics: Posted by wisi — Tue Aug 13, 2024 10:31 pm