Accelerating a 2-D image convolution operation on FPGA Hardware
Part One - The Architecture Outline
Part Two - The Convolution Engine
Part Three - The Activation Function
Part Five - Adding fixed-point arithmetic to our design
Part Six - Putting it all together: The CNN Accelerator
Part Seven - System integration of the Convolution IP
Part Eight - Creating a multi-layer neural network in hardware.
In the run-up to building a fully functional neural network accelerator, it is important that we attempt a full system integration of the IP we are writing. Before we take on the full complexity of a neural network with multiple layers, we should build the logic around our IP that lets it operate as part of a larger system, because that is where the promised acceleration is actually realised.
For this little experiment, let’s use the convolver IP that can take an image of any size and perform 2-D convolution over it.
The convolver IP, in its current state, needs the following things to become operational as part of a larger system in which a processor controls the data flow to and from the IP:
- An AXI-Stream slave interface the processor can use to feed inputs to the convolver.
- An AXI-Stream master interface the convolver can use to send its output back to the processor.
- A set of configuration registers that store the weights required for a particular convolution operation. For this we will be using an AXI-Lite slave interface that the processor can write to.
- Zooming out, a DMA block is required to pull data from memory and manage the input and output AXI-Stream interfaces to this IP.
The following block diagram shows the overall architecture and dataflow we want to achieve.
We will be using a PYNQ-Z2 board, which is arguably the best-suited platform for designing ML hardware. Once we have a working system in place for the convolver IP, adding more elements to our IP and other bells and whistles to this setup should only be an incremental effort. This setup will also greatly help us in verifying our design on actual hardware as we continue to add features and eventually deploy a full neural network model.
We begin by adding the AXI-Lite slave interface to our wrapper. This will support a parameterized number of configuration registers that the processor can access through its address space. I have stuck to the standard notation used by Xilinx for the AXI interfaces in its IPs in order to make things easier for the IP packager later.
//Interface
//We will be using the default AXI data width of 32 in this experimental setup;
//when we work with actual neural network models, this can be optimized.
//As for the address width, we will have to configure k*k weights apart from
//a few extra configuration registers here and there.
parameter C_S_AXI_DATA_WIDTH = 32,
parameter C_S_AXI_ADDR_WIDTH = 2 + $clog2(KERNEL_DIM*KERNEL_DIM) +1,
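//For example, with KERNEL_DIM = 3 this evaluates to 2 + $clog2(9) + 1 = 7,
//i.e. a 128-byte address window holding 32 word-aligned registers:
//enough for the 8 reserved/debug registers plus the 9 kernel weights, with headroom.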
// Global Clock Signal
input wire S_AXI_ACLK,
// Global Reset Signal. This Signal is Active LOW
input wire S_AXI_ARESETN,
//Write Address Interface
input wire [C_S_AXI_ADDR_WIDTH-1 : 0] S_AXI_AWADDR,
input wire [2 : 0] S_AXI_AWPROT,
input wire S_AXI_AWVALID,
output wire S_AXI_AWREADY,
//Write Data Interface
input wire [C_S_AXI_DATA_WIDTH-1 : 0] S_AXI_WDATA,
input wire [(C_S_AXI_DATA_WIDTH/8)-1 : 0] S_AXI_WSTRB,
input wire S_AXI_WVALID,
output wire S_AXI_WREADY,
//Write Response Interface
output wire [1 : 0] S_AXI_BRESP,
output wire S_AXI_BVALID,
input wire S_AXI_BREADY,
//Read Address Interface
input wire [C_S_AXI_ADDR_WIDTH-1 : 0] S_AXI_ARADDR,
input wire [2 : 0] S_AXI_ARPROT,
input wire S_AXI_ARVALID,
output wire S_AXI_ARREADY,
//Read Data Interface
output wire [C_S_AXI_DATA_WIDTH-1 : 0] S_AXI_RDATA,
output wire [1 : 0] S_AXI_RRESP,
output wire S_AXI_RVALID,
input wire S_AXI_RREADY
//Skipping the AXI-Lite state machine that handles the protocol and timing
//between the interface signals.
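//From that state machine, the only things the logic below relies on are the
//register write/read strobes. In the standard Xilinx-generated AXI4-Lite
//template they are formed roughly like this (shown here as an assumption -
//your generated wrapper may name these internal signals differently):
assign slv_reg_wren = axi_wready && S_AXI_WVALID && axi_awready && S_AXI_AWVALID;
assign slv_reg_rden = axi_arready && S_AXI_ARVALID && ~axi_rvalid;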
//Write to configuration registers based on address
generate
  genvar g;
  for(g = 0; g < NUM_REGS; g = g + 1) begin
    always @( posedge S_AXI_ACLK ) begin
      if ( S_AXI_ARESETN == 1'b0 ) begin
        slv_regs[g] <= 'd0;
      end else if(slv_reg_wren) begin
        if(axi_awaddr[ADDR_LSB+OPT_MEM_ADDR_BITS:ADDR_LSB] == g) begin
          slv_regs[g] <= S_AXI_WDATA[C_S_AXI_DATA_WIDTH-1:0];
        end
      end else begin
        slv_regs[g] <= slv_regs[g];
      end
    end
  end
endgenerate
//Read out from configuration registers based on read address
always @(*) begin
  reg_data_out <= slv_regs[axi_araddr[ADDR_LSB+OPT_MEM_ADDR_BITS:ADDR_LSB]];
end
// Output register for the read data
always @( posedge S_AXI_ACLK ) begin
  if ( S_AXI_ARESETN == 1'b0 ) begin
    axi_rdata <= 0;
  end
  else begin
    // When there is a valid read address (S_AXI_ARVALID) that has been
    // accepted by the slave (axi_arready), output the read data
    if (slv_reg_rden) begin
      axi_rdata <= reg_data_out; // register read data
    end
  end
end
That gives our IP a bunch of configurable, address-mapped registers.
Using these registers, let’s capture the weights in a format our IP requires.
//A soft reset for the convolution IP controllable by the processor
assign conv_rst = slv_regs[1][0];
//Concatenating the weights because our IP requires them in such a format
generate
  genvar k;
  //Weights begin only at the 9th register; the first 8 are reserved for debug and other configs
  for(k = 8; k < 8 + KERNEL_DIM*KERNEL_DIM; k = k + 1) begin
    assign concat_cntrl_reg[ (k-8+1)*C_S_AXI_DATA_WIDTH-1 : (k-8)*C_S_AXI_DATA_WIDTH ] = slv_regs[k];
  end
endgenerate
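For reference, with KERNEL_DIM = 3 the register map seen by the processor works out to something like this (a sketch based on the layout above; how the reserved registers are used is up to us):
//Byte offset   Register          Purpose
//0x00          slv_regs[0]       reserved / debug
//0x04          slv_regs[1]       bit 0 = conv_rst (soft reset for the convolver)
//0x08 - 0x1C   slv_regs[2..7]    reserved / debug
//0x20 - 0x40   slv_regs[8..16]   kernel weights, one 32-bit word each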
With the configuration registers in place, let's take care of the streaming interfaces that will pump data in and out of the IP. Further down, we will also have to handle a specific nuance that comes with the AXI-Stream interface, or any streaming interface for that matter: backpressure from the downstream slave.
//AXI-Stream IF for datapath
//master to slave IF
input s_axis_valid,
input [C_S_AXI_DATA_WIDTH-1:0] s_axis_data,
output s_axis_ready,
//Slave to Master IF
output m_axis_valid,
output [C_S_AXI_DATA_WIDTH-1:0] m_axis_data,
output m_axis_last,
input m_axis_ready,
Now let's see how to convert this interface into something our IP can work with.
The instantiation template for our IP looks like this:
convolver #(
  .N(C_S_AXI_DATA_WIDTH),
  .n(IM_DIM),
  .k(KERNEL_DIM),
  .s(1)
) u_conv (
  //Clocks and resets
  .clk(S_AXI_ACLK),
  .global_rst(conv_rst),      //processor-controlled soft reset
  //Configuration interface
  .weight1(concat_cntrl_reg), //the config registers we just built via AXI-Lite
  //Input interface -> connects to the AXI-Stream slave side
  .activation(),
  .ce(),
  //Output interface -> connects to the AXI-Stream master side
  .conv_op(),
  .end_conv(),
  .valid_conv()
);
Some of these are straightforward: the activation input can be connected directly to the incoming stream data signal s_axis_data, and the conv_op output can be connected to the stream master data m_axis_data. The valid_conv signal likewise gets connected to the m_axis_valid pin.
Coming to the trickier ones:
- ce (clock enable) - This is simply the gating signal that allows the convolution logic to take a step forward. When it is low, the operation is frozen. At first glance, it's obvious that this should be connected to the combination of signals that indicates a valid AXI data beat on the input slave interface, meaning:
assign conv_ce = s_axis_valid && s_axis_ready;
However, this logic is still missing one thing: in a streaming interface of any kind, the master can transfer data to the slave only when the slave is ready.
The output of the convolver is an AXI master interface that feeds a slave, namely the DMA. That transaction cannot happen while the DMA de-asserts its ready signal.
Since there is no storage of any kind inside our IP, the entire convolution pipeline has to be halted whenever the downstream DMA slave becomes unavailable, which makes our clock enable signal:
assign conv_ce = s_axis_valid && s_axis_ready && m_axis_ready;
Since there is no reason to stall the pipeline other than the DMA slave not being ready, the ready signal of our convolution IP, which is the slave to the DMA on the input streaming interface, also becomes:
assign s_axis_ready = m_axis_ready;
There is one more signal left, one that a lot of AXI beginners forget. It is very important, because the DMA slave needs to be informed when a transaction from its master is complete. Otherwise it will keep waiting for more data beats and leave the channel stuck forever.
This is the m_axis_last signal, which gets connected to the end_conv output of the convolver IP.
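Putting all of this together, the datapath glue inside the wrapper ends up looking like the sketch below (the convolver port names come from the instantiation template above; the assigns simply consolidate what we derived in this section):
//Glue between the AXI-Stream wrapper signals and the convolver core
assign conv_ce      = s_axis_valid && s_axis_ready && m_axis_ready; //advance only on a valid input beat while downstream is ready
assign s_axis_ready = m_axis_ready;                                 //backpressure from the DMA propagates straight through
assign m_axis_data  = conv_op;                                      //convolver output drives the master data bus
assign m_axis_valid = valid_conv;                                   //a beat is valid whenever the convolver says so
assign m_axis_last  = end_conv;                                     //tell the DMA when the frame is complete
//and inside the instantiation: .activation(s_axis_data), .ce(conv_ce)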
Full System implementation:
The complete system block diagram looks something like this when built for the ZYNQ-7020 SoC that sits on our PYNQ-Z2 board.
Here I'm using the chiprentals facilities to access a PYNQ-Z2 board along with a host Linux system with all the tools pre-installed.
There is extra debug logic in this system which is required for us to be able to look at waveforms of actual hardware signals via the ILA. The Debug Bridge IP allows an XVC server to access these ILAs over a TCP connection, through which the Vivado Hardware Manager lets us see the actual signals in hardware, much like we would on a real oscilloscope. This is one of the coolest features enabled by the Xilinx ecosystem, and very soon I'll write an article on its internal details.
Otherwise, the connections from the DMA to the convolver IP, and from the Processing System to the DMA, are the same as in our architecture diagram; the only difference is the several AXI Interconnect blocks inserted by Vivado to manage the addressing of the different blocks.
Implementation Challenges
There was just one detail that ended up breaking the timing of our design during implementation at a 100 MHz clock frequency. Once I dug into the failing path, the cause was obvious. The MAC (multiply-accumulate) code in our RTL currently looks like this:
//file: mac_manual.v
always @(posedge clk) begin
  if(sclr) begin
    temp_q <= 0;
    c_q    <= 0;
  end
  else if(ce) begin
    temp_q <= (a*b + c);
  end
end
assign p = temp_q;
For the keen-eyed, it should be obvious why timing failed on this path: a 32-bit multiplier followed by a 32-bit adder, all lumped into a single combinational cloud. There is no reason why it should be this way, and we can cut down the critical path with a simple register retiming optimization:
always @(posedge clk) begin
  if(sclr) begin
    temp_q <= 0;
    c_q    <= 0;
  end
  else if(ce) begin
    c_q    <= c;
    temp_q <= (a*b);
  end
end
assign p = temp_q + c_q;
This modification preserves the single-cycle latency of the operation but significantly shortens the critical path by moving the adder to after the flip-flop stage temp_q.
Packaging the IP
Another super convenient feature of the Xilinx ecosystem is that it lets you package the IP in a reusable format and even handles the versioning for it.
I will skip the details of this step; you should be able to find them pretty much anywhere else. It is a simple 1:1 mapping between the AXI protocol signals that Vivado understands and the actual wrapper signals on your design.
Software platform
What we need to test this IP in a real-world system is a processor that is closely coupled with the FPGA fabric. On a typical FPGA-only development board, it would be meaningless to compare the hardware performance with that of a CPU which is in no way connected to the FPGA.
The ZYNQ SoC, on the other hand, provides the perfect environment: two ARM cores tightly coupled to the FPGA fabric via a variety of AXI interfaces.
The PYNQ system lets you use IPython notebooks through which you can effortlessly load the FPGA bitstream, instruct the DMA to fetch data from DDR memory and pump it through the convolver IP, and at the same time use pure-software Python functions to perform the same operation on the ARM processor.
Another luxury we have is that we can save ourselves from getting involved with the extremely low-level details of each IP and its corresponding driver.
Now, if you wish to fully understand an embedded Linux system from the top down, then getting into the drivers is probably a good idea. But if you are trying out ideas at the architectural level, you want a system that abstracts away these complexities so that you can focus on the bigger ideas.
Let’s start writing software for our newly integrated IP block:
#Load the required packages
from pynq import Overlay
from scipy import signal
from pynq.lib import DebugBridge
from pynqutils.runtime.repr_dict import ReprDict
from pynq import allocate
import numpy as np
import time
#Load Overlay
ol = Overlay("/home/xilinx/pynq/overlays/conv_fpga/conv_acc.bit")
#Load the IPs inside this overlay
dma = ol.axi_dma_0
conv_ip = ol.conv_axi_wrap_0
#Load the dma channels
dma_send = dma.sendchannel
dma_recv = dma.recvchannel
#Define the SW function for the convolution operation
def strideConv(arr, arr2, s): #the function that performs the 2-D convolution in software
    return signal.convolve2d(arr, arr2[::-1, ::-1], mode='valid')[::s, ::s]
#soft reset convolution IP - remember this reset from the RTL above?
conv_ip.write(0x4,0x01)
conv_ip.read(0x4)
conv_ip.write(0x4,0x00)
ksize = 3 #kernel dimension
im_size = 4 #image dimension
#SW execution
kernel = np.arange(0,ksize*ksize,1).reshape((ksize,ksize)) #the kernel is a matrix of increasing numbers
act_map = np.arange(0,im_size*im_size,1).reshape((im_size,im_size)) #the activation map is a matrix of increasing numbers
#Configure the weights for the convolution kernel via the AXI-Lite interface
kernel_flat = kernel.flatten() #flatten the kernel so we can write it one register at a time
weight_ram_addr = 32           #weights live from the 9th register onwards (byte offset 0x20)
for x in kernel_flat:
    conv_ip.write(weight_ram_addr, int(x))
    weight_ram_addr += 4
#Perform a software execution of the operation and measure how long it takes
st = time.time()
conv = strideConv(act_map,kernel,1)
et = time.time()
sw_time = et - st
print("sw time = ", sw_time)
#Allocate buffers in RAM to hold the activation matrix (the input to the convolver)
#and the output activation matrix received after the operation
ip_buffer_sz = im_size*im_size
op_buffer_sz = (im_size - ksize + 1)*(im_size - ksize + 1)
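#e.g. with im_size = 4 and ksize = 3, the valid output is 2x2, so only 4 words come back from the IP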
input_buffer = allocate(shape=(ip_buffer_sz,), dtype=np.uint32)
output_buffer = allocate(shape=(op_buffer_sz,), dtype=np.uint32)
for i in range(ip_buffer_sz):
    input_buffer[i] = i
#Send the activation matrix to the IP via a DMA send transaction
#Capture the output on the DMA slave interface
#Measure how long this entire operation takes
st = time.time()
dma_send.transfer(input_buffer)
dma_recv.transfer(output_buffer)
dma_send.wait() #the transfers are non-blocking, so wait for both channels
dma_recv.wait() #to complete before stopping the clock
et = time.time()
hw_time = et - st
#The acceleration achieved is then:
acc_factor = sw_time/hw_time
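Before comparing timings, it is also worth sanity-checking that the hardware computed the right thing. A minimal check could look like the sketch below; it assumes the IP streams out the valid convolution outputs in row-major order:
#Compare the DMA output against the software result (assumes row-major output order)
hw_result = np.array(output_buffer).reshape(conv.shape)
print("outputs match:", np.array_equal(hw_result, conv.astype(np.uint32)))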
First let’s look at the output for a small matrix convolution just to see our system in action:
Software Execution:
Hardware Execution:
Not so surprisingly, the acceleration factor for this operation was 0.42, meaning the hardware was slower than the software. This is expected given the small size of the activation map: more time was spent moving the data around than on the actual computation.
The speedup comes when we use larger matrices, things the size of an actual image.
I have summarized the experimental results in the following table:
Image Dimension | Kernel Dimension | Speedup Factor |
---|---|---|
4 | 3 | 0.42 |
600 | 3 | 134.15 |
1200 | 3 | 466.52 |
Note that the sw_time variable varies quite a bit from run to run, even for the same activation matrix. Software always has a non-deterministic latency, which is yet another reason to use FPGAs.
You can find all the code in this article in the GitHub repository.