The convolution Engine

convolver

In this article, we design a fully parameterized 2D convolution engine in Verilog HDL and verify its functionality with a golden model written in python.

Now you can access an FPGA LAB environment on demand with chiprentals. This service is currently in the Beta stage and is being provided for free. Go perform your projects in a real world environment. Book your slot here. The PYNQ-Z2 board being offered currently is the best choice for trying out ML and AI acceleration projects like this one.

Part Zero - The Introduction

Part One - The Architecture Outline

Part Two - The Convolution Engine

Part Three - The Activation Function

Part Four - The Pooling Unit

Part Five - Adding fixed-point arithmetic to our design

Part Six - Putting it all together: The CNN Accelerator

Part Seven - System integration of the Convolution IP

Part Eight - Creating a multi-layer neural network in hardware.

From here on we shall refer to a convolution layer as a conv layer.
Our design implements a fully parameterized convolution operation in Verilog. By 'fully parameterized', I mean to say that the same code can be used to synthesize a conv layer of any size the architecture demands(conv3x3, conv7x7,conv11x11). You just have to vary the parameters passed to this code before the synthesis.
The primary idea behind the approach to this design is to build a highly pipelined streaming architecture wherein the processing module does not have to stop at any point in time. i.e any part of the convolver does not wait for any other part to finish its work and supply it with the result. Each stage is performing a small portion of the entire work continuously on a different part of the input every clock cycle. This is not just a feature of this design, it is a general principle called 'pipelining' that is widely used to break down large computational processes into smaller steps and also increase the highest frequency that the entire circuit can operate at.
Our design uses the MAC (Multiply and Accumulate) units with the aim of mapping these operations to the DSP blocks of the FPGA. Achieving that would make the multiplication and addition operations much faster and consume less power since the DSP blocks are implemented in hard macros. which is to say, the DSP blocks have already been synthesized, placed, and routed into silicon on the FPGA device in the most efficient way possible, unlike general IP blocks wherein only the tried and tested RTL code is supplied to you and can be synthesized at your will.

NOTE: This design unit was implemented on an FPGA as a part of a complete Processor + FPGA system to measure the actual acceleration it provides. You can read about it in Part Seven - System integration of the Convolution IP.

The convolver is nothing but a set of MAC units and a few shift registers, which when supplied with the right inputs, output the result of the convolution after a fixed number of clock cycles. Let's Dig into it!
Let us assume the size of the input Feature map (image) to be of a dimension NxN and our kernel (filter/window) to be of the dimensions KxK. It is understood obviously, that K < N.
s represents the value of the stride with which the window moves over the feature map.

Now, if you know how the 2-D convolution works, the size of the output feature map will be smaller than that of the input feature map. Precisely, if we do not have some form of zero paddings on the input feature map, the output will be of the size (N-K+1)/s x (N-K+1)/s

As you can see in the image, our convolver has MAC units laid out in the size of the convolution kernel and a couple of shift registers connected at various levels in the pipeline. The thing to note about the inputs to these various MAC units is that.

The kernel weights are placed such that each MAC unit has a fixed value of the weight throughout the process of the convolution, but each mac unit has a different value placed on it according to its position.
Whereas, for the input activation values, the same value is placed on each and every one of the MAC units except that value changes with every clock cycle according to the input feature map.

Additionally, we have a few more signals like the valid_conv and end_conv signal which tell the outside world if the outputs of this convolver module are valid or not. But why would the convolver produce invalid results in the first place? After the window is done moving over a particular row of the input, it continues to wrap around to the next row creating invalid outputs. We could avoid calculation during the wrap-around period, but that would require us to stop the pipeline, which is against our streaming design principle. So instead we just discard the outputs that we believe are invalid with the help of a valid conv signal. Look at the following animation to understand better.

Convolver Animation

This animation shows how the various inputs to the Convolver are being modified with each clock cycle and also tracks the value of a particular variable as it makes its way to the output, this shows you the mathematical reasoning behind the design and how the convolution is actually being calculated.

If you observe the final value of x which is

w0*a0 + w1*a1 + w2*a2 + w3*a4 + w4*a5 + w5*a6 + w6*a8 + w7*a9 + w8*a10

The code snippet pertaining to the convolution unit has been shown below. It is fairly well commented and uses the Verilog generate loop to simplify the code for all those MAC units, also, the code is fully parameterized, meaning you can scale it to any size of N and K you wish depending on which neural network you are building.

ADVANTAGES OF THIS ARCHITECTURE:

The input feature map has to be sent in only once (i.e the stored activations have to be accessed from the memory only once) thus greatly reducing the memory access time and the storage requirement. The shift registers create a kind of local caching for the previously accessed values.
The convolution for any sized input can be calculated with this architecture without having to break the input or temporarily store it elsewhere due to low compute capability
The stride value can also be changed for some exotic architectures where a larger stride is required.

Now let's dig into:

The Code

Before we start writing Verilog for our design, we need to have a golden model. i.e. a basic model in code that is able to achieve the functionality we are targeting to create in hardware. You can code it in which ever language you like, but I choose python because I love python.

import numpy as np 
from scipy import signal

ksize = 3                                                               #ksize x ksize convolution kernel
im_size = 4                                                             #im_size x im_size input activation map

def strideConv(arr, arr2, s):                                             #the function that performs the 2D convolution
    return signal.convolve2d(arr, arr2[::-1, ::-1], mode='valid')[::s, ::s]

kernel =  np.arange(0,ksize*ksize,1).reshape((ksize,ksize))                   #the kernel is a matrix of increasing numbers
act_map = np.arange(0,im_size*im_size,1).reshape((im_size,im_size))           #the activation map is a matrix of increasin numbers

conv = strideConv(act_map,kernel,1)
print(kernel)
print(act_map)
print(conv) 

Running the above code gives us:

kernel
 [[0 1 2]
 [3 4 5]
 [6 7 8]] 

activation map
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]] 

convolution output
 [[258 294]
 [402 438]] 

So our goal would be to design a Verilog module that achieves the same set of outputs for any given inputs to the above python function.

NOTE on utilizing DSP blocks for MAC units - when the synthesis tools encounter a multiply and accumulate operation, they usually implement that code in a DSP block on the FPGA instead of using regular slice logic. However in some cases like high resource utilization scenarios the tools do not have enough dsp blocks and implement multipliers and adders in regular slice logic instead. To be sure, we could use the (* use_dsp = "yes" *) attribute in xilinx or a similar attribute in other tools (extract_mac for Altera tools) to force them to use the relevant DSP resources on the FPGA for that particular module.However, it is not necessary in most cases.

This document from Xilinx and this one from intel will tell you more about this stuff.

I have written lots of comments in the code to better explain it -

//file: convolver.v
`timescale 1ns / 1ps

module convolver #(
parameter n = 9'h00a,     // activation map size
parameter k = 9'h003,     // kernel size 
parameter s = 1,          // value of stride (horizontal and vertical stride are equal)
parameter N = 16,         //total bit width
parameter Q = 12          //number of fractional bits in case of fixed point representation.
)(
input clk,
input ce,
input global_rst,
input [N-1:0] activation,
input [(k*k)*16-1:0] weight1,
output[N-1:0] conv_op,
output valid_conv,
output end_conv
);

reg [31:0] count,count2,count3,row_count;
reg en1,en2,en3;

wire [15:0] tmp [k*k+1:0];
wire [15:0] weight [0:k*k-1];

//breaking our weights into separate variables. We are forced to do this because verilog does not allow us to pass multi-dimensional 
//arrays as parameters
//----------------------------------------------------------------------------------------------------------------------------------
generate
    genvar l;
    for(l=0;l<k*k;l=l+1)
    begin
        assign weight [l][N-1:0] = weight1[N*l +: N];       
    end 
endgenerate
//----------------------------------------------------------------------------------------------------------------------------------
assign tmp[0] = 32'h0000000;

//The following generate loop enables us to lay out any number of MAC units specified during the synthesis, without having to commit to a //fixed size 
generate
genvar i;
  for(i = 0;i<k*k;i=i+1)
  begin: MAC
    if((i+1)%k ==0)                       //end of the row
    begin
      if(i==k*k-1)                        //end of convolver
      begin
      (* use_dsp = "yes" *)               //this line is optional depending on tool behaviour
      mac_manual #(.N(N),.Q(Q)) mac(      //implements a*b+c
        .clk(clk),                        // input clk
        .ce(ce),                          // input ce
        .sclr(global_rst),                // input sclr
        .a(activation),                   // activation input [15 : 0] a
        .b(weight[i]),                    // weight input [15 : 0] b
        .c(tmp[i]),                       // previous mac sum input [32 : 0] c
        .p(conv_op)                       // output [32 : 0] p
        );
      end
      else
      begin
      wire [N-1:0] tmp2;
      //make a mac unit
      (* use_dsp = "yes" *)               //this line is optional depending on tool behaviour
      mac_manual #(.N(N),.Q(Q)) mac(                   
        .clk(clk), 
        .ce(ce), 
        .sclr(global_rst), 
        .a(activation), 
        .b(weight[i]), 
        .c(tmp[i]), 
        .p(tmp2) 
        );

      variable_shift_reg #(.WIDTH(32),.SIZE(n-k)) SR (
          .d(tmp2),                  // input [32 : 0] d
          .clk(clk),                 // input clk
          .ce(ce),                   // input ce
          .rst(global_rst),          // input rst
          .out(tmp[i+1])             // output [32 : 0] q
          );
      end
    end
    else
    begin
    (* use_dsp = "yes" *)               //this line is optional depending on tool behaviour
   mac_manual #(.N(N),.Q(Q)) mac2(                    
      .clk(clk), 
      .ce(ce),
      .sclr(global_rst),
      .a(activation),
      .b(weight[i]),
      .c(tmp[i]), 
      .p(tmp[i+1])
      );
    end 
  end 
endgenerate

//The following logic generates the 'valid_conv' and 'end_conv' output signals that tell us if the output is valid.
always@(posedge clk) 
begin
  if(global_rst)
  begin
    count <=0;                      //master counter: counts the clock cycles
    count2<=0;                      //counts the valid convolution outputs
    count3<=0;                      // counts the number of invalid onvolutions where the kernel wraps around the next row of inputs.
    row_count <= 0;                 //counts the number of rows of the output.  
    en1<=0;
    en2<=1;
    en3<=0;
  end
  else if(ce)
  begin
    if(count == (k-1)*n+k-1)        // time taken for the pipeline to fill up is (k-1)*n+k-1
    begin
      en1 <= 1'b1;
      count <= count+1'b1;
    end
    else
    begin 
      count<= count+1'b1;
    end
  end
  if(en1 && en2) 
  begin
    if(count2 == n-k)
    begin
      count2 <= 0;
      en2 <= 0 ;
      row_count <= row_count + 1'b1;
    end
    else 
    begin
      count2 <= count2 + 1'b1;
    end
  end

  if(~en2) 
  begin
  if(count3 == k-2)
  begin
    count3<=0;
    en2 <= 1'b1;
  end
  else
    count3 <= count3 + 1'b1;
  end
  //one in every 's' convolutions becomes valid, also some exceptional cases handled for high when count2 = 0
  if((((count2 + 1) % s == 0) && (row_count % s == 0))||(count3 == k-2)&&(row_count % s == 0)||(count == (k-1)*n+k-1))
  begin                                                                                                                        
    en3 <= 1;                                                                                                                             
  end
  else 
    en3 <= 0;
end
    assign end_conv = (count>= n*n+2) ? 1'b1 : 1'b0;
    assign valid_conv = (en1&&en2&&en3);
endmodule

As you can notice in the above code, we use the MAC units and shift registers to lay out the planned architecture. While the shift register is a straight forward shift-register module wherein you can supply the data width and depth of the shift register as parameters, the mac_manual module is also quite simple to implement, at least when we are working with integers.

//file: mac_manual.v
//note that this file has been modified after the addiion of fixed point arithmetic support
//this post has not been edited for the same keeping in mind first time readers. Do refer 
//to that post in this series.

module mac_manual #(
    parameter N = 16,
    parameter Q = 12
    )(
    input clk,sclr,ce,
    input [N-1:0] a,
    input [N-1:0] b,
    input [N-1:0] c,
    output reg [N-1:0] p
    );

always@(posedge clk,posedge sclr)
 begin
    if(sclr)
    begin
        p<=0;
    end
    else if(ce)
    begin
        p <= (a*b+c);                   //performs the multiply accumulate operation
    end
 end
endmodule

Verifying our design with a testbench

//convolver_tb.v

`timescale 1ns / 1ps
module convolver_tb;

    // Inputs
    reg clk;
    reg ce;
    reg [143:0] weight1;
    reg global_rst;
    reg [15:0] activation;

    // Outputs
    wire [31:0] conv_op;
    wire end_conv;
    wire valid_conv;
    integer i;
    parameter clkp = 40;
    // Instantiate the Unit Under Test (UUT)
    convolver #(9'h004,9'h003,1) uut (
        .clk(clk), 
        .ce(ce), 
        .weight1(weight1), 
        .global_rst(global_rst), 
        .activation(activation), 
        .conv_op(conv_op), 
        .end_conv(end_conv), 
        .valid_conv(valid_conv)
    );

    initial begin
        // Initialize Inputs
        clk = 0;
        ce = 0;
        weight1 = 0;
        global_rst = 0;
        activation = 0;
        // Wait 100 ns for global reset to finish
        #100;
         clk = 0;
        ce = 0;
        weight1 = 0;
        activation = 0;
         global_rst =1;
         #50;
         global_rst =0; 
         #10;   
        ce=1;
        //we use the same set of weights and activations as the sample input in the golden model (python code) above.
        weight1 = 144'h0008_0007_0006_0005_0004_0003_0002_0001_0000; 
        for(i=0;i<255;i=i+1) begin
        activation = i;
        #clkp; 
        end
    end 
      always #(clkp/2) clk=~clk;      
endmodule

Running the above test bench gives us the following result in Vivado:

convolver simulation result

The values corresponding to the state when the valid_conv signal is high are the convolution outputs. Its quite clear that these outputs match the ones the python script gave us for the same input.

Here is a link to the exact git commit at this point. i.e before adding the Fixed-point arithmetic support to this project.

Fixed-point Arithmetic

You might wanna read this post, further in this series and come back to understand the changes we're about to make.

The advantage of a fully modular device like ours is that we only have to change the arithmetic (addition and multiplication) modules in order to get the whole thing working with fixed point numbers. Nothing else needs to be changed as the concept of that binary point is only in our heads. The hardware still only sees regular integers (bit-fields).

//file: mac_manual_fp
module mac_manual_fp(
    input clk,sclr,ce,
    input [15:0] a,
    input [15:0] b,
    input [31:0] c,
    output reg [31:0] p
    );
    wire [31:0] sum;
    wire [31:0] m;

qmult #(3,12) mul(a,b,m);            //The fixed-point multiplier module
qadd  #(3,12) add(m,c,sum);          //The fixed-point adder module
always@(posedge clk,posedge sclr)
 begin
    if(sclr)
    begin
        p<=0;
    end
    else if(ce)
    begin
        p <= sum;
        //p <= (a*b+c);               //The previous mac_manual operation for regular integers
    end
 end
endmodule

We just need to change the convolver.v module to call mac_manual_fp instead of mac_manual. To understand how the qmult and qadd arithmetic modules work, you might wanna go through this post and come back.

This design unit was implemented on an FPGA as a part of a complete Processor + FPGA system to measure the actual acceleration it provides. You can read about it in Part Seven - System integration of the Convolution IP.

All the design files along with their test benches can be found at the Github Repo

PREVIOUS:The architecture outline
NEXT :The activation function