Common Design Patterns/Practices
The end goal of RTL design, especially for ASIC, is to produce smallest and fastest circuit possible. To do so, we need to understand how synthesis tools analyze and optimize the design. In addition, we are also concerned about the simulation speed, since waiting tests to run is effectively wasting engineering efforts. Although synthesis and simulation tools have numerous optimization passes and transformations, one important factor for the end result is the design pattern, i.e., do the code follow the tool’s design guide. Lots of optimization are tuned for a particular design pattern, making the code easier to to be understood and thus simplified by the tools. In addition, some design patterns may simplify the code structures and make the code more readable and reusable.
In this chapter we will go through some common design practices and how we should program the logic and structure the source code.
4.1 Compiler Directives and Packages
Similar to C/C++, SystemVerilog defines a preprocessing stage where macros are expanded in the original source code. The compiler directives are not turing-complete and less versatile than that of C/C++,meaning even fix-bounded recursive computation is difficult to be specified in SystemVerilog. Nevertheless, it allows some level of preprocessing in SystemVerilog.
4.1.1 Compiler Directives
There are several compiler directives defined by the language. We will cover some of the most used macros here:
`__FILE__
`__LINE__
`define
`else
`elseif
`ifdef
`ifndef
`endif
`undef
`timescale
`include
`__FILE__
and `__LINE__
are used the same way as __FILE__
and __LINE__
in C/C++. Users can use that for test bench debugging. During preprocessing, these two compiler directives will be replaced with the actually file name and line number.
`define
allows you to define macros, which can be used later in the code. We will show two examples where the first one defines values, and the second one define function-like code snippets, which takes arguments. Notice that unlike C/C++, macros have to be prefixed with `
when used in code.
`define VALUE 10
module top (input logic clk);
logic [31:0] a;
always_ff @(posedge clk)
a <= `VALUE;
endmodule
In the example above, we define `VALUE
to be 10, and used it as register value. Even though we cover the usage here, please avoid defining constant values as macros in such way. It is because:
- It is difficult to find where the macro is defined, e.g. either from a file or command line options
- There is no namespace regarding macro values. If there are two macros shares the same name, whichever gets parsed later will be used. This may cause unexpected bugs that is difficult to debug, since the compiler may not issue warning for macro re-definition.
We highly recommend to use define constants in a package, which will be covered later in this chapter.
Another way to use `define
is to define some code snippets which can be re-used later, as shown in the example below (also in code/04/macros_arguments.sv
):
`define REGISTER(NAME, WIDTH, VALUE, CLK) \
logic [WIDTH-1:0] NAME; \
always_ff @(posedge CLK) begin \
NAME <= VALUE; \
end
module top;
logic clk;
logic [15:0] in;
// declare 3 registers that are pipelined to signal in, in sequence
`REGISTER(reg1, 16, in, clk)
`REGISTER(reg2, 16, reg1, clk)
`REGISTER(reg3, 16, reg2, clk)
// set the clock to 0 at time = 0, then tick the clock every 2 unit of time
initial clk = 0;
always clk = #2 ~clk;
initial begin
for (int i = 0; i < 3; i++) begin
in = i;
// wait for a cycle
#4;
// print out the register value
$display("reg1: %d reg2: %d reg3: %d", reg1, reg2, reg3);
end
$finish;
end
endmodule
We will see the expected output, where x
denotes uninitialized register value:
reg1: 0 reg2: x reg3: x
reg1: 1 reg2: 0 reg3: x
reg1: 2 reg2: 1 reg3: 0
In the code example above, we first define three registers that are pipelined to signal in
(in chained fashion). The macro REGISTER
first defines the register given NAME
and WIDTH
, then it instantiate an always_ff
block and assign the VALUE
to the register as every clock cycle. Notice that we have to use \
for multi-line definitions.
Although sometimes using a macro may save time and make the code more reusable, it is important to find a balance between repetitive code segments and macro usage. Keep in mind that macro is substituted during preprocessing stage, it will make source-code level debugging challenging. You also need to be careful about macro re-definition since all the macros are in global namespace.
During the macro definition, sometimes you need to undefine some macro names for a different usage. Similar to C/C++, you can use `undef
to un-define the macro.
`ifdef
and `ifndef
can be used to test whether certain macro has been defined (or not defined). You need to close the compiler directives with `endif
. You can also add `else
and`elseif
to account for different scenarios. Notice that for a header file, they can be used together with `define
to provide an include guard, which allows the header file to be included in multiple places. Their usages are identical to those of C/C++, so we will not cover them here.
`timescale
is an important compiler directive useful to simulators. It specifies the unit of measurement for time and precision of time in specific design elements. There can be only be at most one timescale defined for any compilation-unit scope. In other words, it is illegal to define timescales at two different source files compiled together. The syntax for `timescale
is shown below:
// general syntax
`timescale time_unit / time_precision
// e.g.
`timescale 1ns / 1ps
`timescale 1ns / 1ns
The time_unit
argument specifies the unit of measurement time and delays, and the time_precision
argument specifies how delay values are rounded before used in simulation. time_precision
should be at least as precise as time_unit
, since time_precision
is used for finer precision of simulation. The unit of time_unit
and time_precision
can be s
, ms
, us
, ns
, ps
, and fs
. The integer part specifies an order of magnitude for the size of the value, in other words, the only valid number is 1
, 10
, and 100
.
Timescale is crucial to simulate jittering and timing violation. It is also required for any power-related analysis. It is highly recommend to include timescale in your top-level test bench, even though it is not used.
`include
serves the same purpose as #include
in C/C++, where it includes definitions from another file. It’s highly recommended to provide an include guard to the include file. If the filename is enclosed in quotes, e.g. `include "filename.svh"
, the compiler will first search its current working directory, and then search any user-specified locations. If the filename is enclosed in angle brackets, e.g. `include <filename.svh>
, the filename has to be files defined by language standard. This rule is similar to that of C/C++.
4.1.2 Packages
Although `include
provides a way for designers to share definitions, the compiler directives essentially asks the compiler to copy the content of included file into the source file, which is a legacy feature influenced by C. As modern programming languages start to use modules/packages to structure the source code, e.g. module in C++20, SystemVerilog introduce a construct called package
that allows designers to reuse definitions, interfaces, and functions. Since package
is synthesizable, it is highly recommend to use it in both RTL and test benches. Here is an example of package:
package my_def_pkg;
// local parameters
localparam VALUE = 42;
// struct
typedef struct {
logic a;
logic b;
} my_struct_t;
// enum
typedef enum logic { RED, GREEN } color_t;
// function
function logic and_op(logic a, logic b);
return a & b;
endfunction
endpackage: my_def_pkg
Here is an incomplete list of constructs that are allowed inside a package:
- parameter declaration, e.g.
parameter
andlocalparam
- function declaration, e.g. automatic function
- data declaration, e.g., struct and enum
- DPI import and export
- class declaration
- package import declaration
Since parameter
cannot be redefined in side a package, we highly recommend to use localparam
in lieu of parameter
since they are functionally identical in a package
. In other words, localparam
does not have the visibility restriction in a package
.
4.1.2.1 Package Import
To use the package definition in other modules, we need to use import
keyword to import definition. There are several ways to import contents of a package and we will cover two commonly used approaches here:
wildcard import. This is similar to Python’s
from pkg_name import *
:explicit import. This is similar to Python’s
from pkg_name import class_name
:
After importing, the identifiers (i.e. struct names or enum value names) can be used directly in the module. One thing to notice that there are several places where we can put package import. Depends on where the content of the package is used, there are two standard approaches to do so:
If the identifier is used for module port definition, the import needs to placed before port list:
Otherwise, we shall put the import inside the module:
4.1.2.2 Import Packages within a Package
Like software programming languages, you can import a package content inside another package, and the “chained” imports can be visible to the consumer. Here is an example (code/04/chained_packages.sv
) illustrates the package imports:
package def1_pkg;
typedef enum logic[1:0] {ADD, SUB, MULT, DIV} alu_opcode_t;
endpackage: def1_pkg
package def2_pkg;
// import alu_opcode_t from def1_pkg
import def1_pkg::alu_opcode_t;
// define a new struct that include alu_opcode_t
typedef struct {
alu_opcode_t alu_opcode;
logic[7:0] addr;
} opcode_t;
endpackage: def2_pkg
module top;
// alu_opcode_t is NOT accessible from def2_pkg
// the next line is ILLEGAL
// import def2_pkg::alu_opcode_t;
import def2_pkg::*;
opcode_t opcode;
endmodule: top
Notice unlike some software programming language such as Python, where the imported identifier is accessible as part of the new package, SystemVerilog prohibits such behavior. If you try to import alu_opcode_t
from def2_pkg
, you will get a recursive import error in the compiler.
4.1.2.3 Package Usage Caveats
Since the content of a package is scoped, when use wildcard import, there is a chance of naming conflict. A rule of thumb is that when a naming conflicts, always resort to explicit import. Some coding styles prohibit the usage of wildcard import, which make the code a little bit more verbose, but more readable and maintainable. The exact scoping rule is beyond of scope of this book, but interested user should refer to Table 26-1 in 1800-2017.
Another caveat is that packages have to be compiled before any module files that rely on them. One systematic way is to rely on build tools such as make
to ensure the order of compilation. Another simple way to do is to put packages before other sources while supplying file names to the tools.
4.2 Finite State Machines
Finite State Machine (FSM) is the core part of hardware control logic. How well the FSM is designed can directly impact the synthesis and verification effort, since these tools have somewhat restricted expectation of how a FSM should be written. Although the theory of FSM is beyond the scope of this book, we will try to cover as much as possible while going over the major topics regarding FSM.
4.2.1 Moore and Mealy FSM
Generally speaking there are two types of FSM commonly used in hardware design, namely Moore and Mealy machine. Moore machine, named after Edward F. Moore, is a type of FSM whose output values are determined solely by its current state. On the other hand, Mealy machine, named after George H. Mealy, is a type of FSM whose output values are determined by its current state and the current inputs. To draw a formal distinction between Moore and Mealy machine, we can refer to the following mathematical notations.
- A finite set of states \(S\)
- An initial state \(S_0\) such that \(S_0 \in S\)
- A finite input set \(\Sigma\)
- A finite output set \(\Lambda\)
- A state transition function \(T: \Sigma \times S \rightarrow S\)
- An output function \(G\)
For Moore machines, the output function is \(G: S \rightarrow \Lambda\), whereas for Mealy machines, the output function is \(G: \Sigma \times S \rightarrow \Lambda\). Although Moore and Mealy machine are mathematically equivalent, there is a major difference when represented as a state transition diagram, as shown in Figure 4 and 5, where both diagram describes the logic that counts consecutive ones and output 1 once the count reaches 2. As a notation, the label on edges in Moore machine represents the input values and the label on the node represents the output value. In Mealy machine, the label on the edge follows input/output notation.
Due to such difference, we will see timing and area related difference when we design Moore and Mealy machines in SystemVerilog: - To describe the same control logic, Moore machines tend to have more states than Mealy machines - The output from a Moore machines tends to have one extra cycle delay compared to Mealy machines.
Choosing which type of machine to use usually depends on the control logic you are trying to model. Since Mealy machines can be used as Moore machine if inputs are ignored when computing the outputs, Mealy machines are more general. Although nothing prevents you mixing these two machines together, it is highly recommend to stick to one coding style so that tools can recognize your design easily.
4.2.2 FSM State Encoding
There are several different ways to encode your states \(S\), one-hot encoding, Gray encoding, and binary encoding. Given \(|S| = N\):
- one-hot encoding implies that only one of its bits is set to
1
for a particular state. That means the total number of bits required to represent the states is \(N\). The Hamming distance of this encoding is 2, meaning we have to flip 2 bits for a state transition. - Gray encoding, named after Frank Gray, is a special encoding scheme that only requires \(log2(N)\) bits to encode. In addition, its Hamming distance is designed to be 1, which means only one bit change is required to transit a state
- Binary encoding means the state value is assigned by its index in the states. As a result, it requires \(log(N)\) to encode. Since each state transition may require flipping all bits, e.g., state 0 transits to state 3 for 2-bit state, its hamming distance is \(O(N)\).
Each encoding has its own advantages. For instance, since only one bit is required to test the state variable, one-hot encoding allows smaller multiplexing logic, and Gary encoding allows low switching power, thus favorable for low-power design. The choice to choose which encoding is more of an engineering topic depends on the design needs. As a result, many synthesis tools offer ability to recode FSM states during synthesis automatically. As a result, designers can code the FSM in one encoding scheme and synthesize it in a different scheme. However, this also implies that the synthesized version of RTL is different from the original RTL where all the verification is done. As a result, some corner-case bugs may occur when the tools re-encode the FSM. In general we recommend the design team decides on an encoding scheme early on based on some engineering experiment result. Doing so ensures the consistency between synthesis and verification.
In SystemVerilog, we typically use enum
to define states. Compared to old school methods such as `define
and localparam
, using enum
allows type-checking from the compiler, which makes the code safer and easier to debug. Below are several examples using one-hot encoding, Gray encoding, and binary encoding.
// on-hot encoding
typedef enum logic[3:0] {
IDLE = 4'b0001,
READY = 4'b0010,
BUSY = 4'b0100,
ERROR = 4'b1000
} hot_hot_state_t;
// Gray encoding
typedef enum logic[2:0] {
RED = 4'b00,
GREEN = 4'b01,
BLUE = 4'b11,
YELLOW = 4'b10
} gray_state_t;
// binary encoding
typedef enum logic[1:0] {
STAGE_0 = 2'd0,
STAGE_1 = 2'd1,
STAGE_2 = 2'd2,
STAGE_3 = 2'd3
} binary_state_t;
4.2.3 General FSM Structure
As indicated by the formal definition of FSM, we need to design two components of the FSM: state transition logic \(T\) and output function \(G\). However, since FSM needs to hold its state, we need another component that sequentially update the FSM state. As a result, a typical FSM always have three components, as shown in the Figure 6.
4.2.4 One-, Two-, and Three-Block FSM Coding Style
Although there are three necessary components for an FSM, sometimes we can merge some components together into a single process. As a result, we have three popular FSM coding style, commonly referred as one-block, two-block, and three-block FSM coding style.
In the following subsections, we will use count consecutive one as an example to show different coding styles. The definition of all states is shown below as a SystemVerilog package.
`ifndef COUNT_ONE_FSM_PKG
`define COUNT_ONE_FSM_PKG
package count_one_fsm_pkg;
typedef enum logic[1:0] {
moore_state0,
moore_state1,
moore_state2
} moore_state_t;
typedef enum logic {
mealy_state0,
mealy_state1
} mealy_state_t;
endpackage
`endif // COUNT_ONE_FSM_PKG
4.2.4.1 Three-Block FSM Coding Style
Three-block FSM coding style is usually implemented as a Moore machine where:
- One block is used to update
state
withnext_state
. - One block is used to determine
next_state
based onstate
and current inputs. - One block is used to compute output based on
state
.
The complete example of three-block FSM is shown below (code/04/three_block_fsm_moore.sv
):
module three_block_fsm_moore (
input logic clk,
input logic rst_n,
input logic in,
output logic out
);
import count_one_fsm_pkg::*;
moore_state_t state, next_state;
// block 1: state <- next_state
always_ff @(posedge clk, negedge rst_n) begin
if (!rst_n) begin
state <= moore_state0;
end
else begin
state <= next_state;
end
end
// block 2: determine next_state
always_comb begin
case (next_state)
moore_state0: begin
if (in) next_state = moore_state1;
else next_state = moore_state0;
end
moore_state1: begin
if (in) next_state = moore_state2;
else next_state = moore_state0;
end
moore_state2: begin
if (in) next_state = moore_state2;
else next_state = moore_state0;
end
default: begin
next_state = moore_state0;
end
endcase
end
// block 3: determine output based on state
always_comb begin
case (state)
moore_state0: out = 0;
moore_state1: out = 0;
moore_state2: out = 1;
default: out = 0;
endcase
end
endmodule: three_block_fsm_moore
4.2.4.2 Two-Block FSM Coding Style
Two-block FSM is usually implemented in Mealy machine where: 1. One block is used to update state
with next_state
. 2. One block is used to determine next_state
and the outputs, based on state
and current inputs.
The complete example of two-block FSM is shown below (code/04/two_block_fsm_mealy.sv
):
module two_block_fsm_mealy (
input logic clk,
input logic rst_n,
input logic in,
output logic out
);
import count_one_fsm_pkg::*;
mealy_state_t state, next_state;
// block 1: state <- next_state
always_ff @(posedge clk, negedge rst_n) begin
if (!rst_n) begin
state <= mealy_state0;
end
else begin
state <= next_state;
end
end
// block 2: determine next_state and output
always_comb begin
case (state)
mealy_state0: begin
if (in) begin
next_state = mealy_state1;
out = 0;
end
else begin
next_state = mealy_state0;
out = 0;
end
end
mealy_state1: begin
if (in) begin
next_state = mealy_state1;
out = 1;
end
else begin
next_state = mealy_state0;
out = 0;
end
end
endcase
end
endmodule: two_block_fsm_mealy
Using Mealy machine based two-block FSM has the advantage that output can update whenever input changes without the need to wait for the next cycle. However, it makes the maintenance difficult. Since the next state logic and output are coded together, if we need to adjust the FSM, significant restructure may be needed in two-block style. It is up to the design team to decide which style to use.
4.2.4.3 One-Block FSM Coding Style
One-block merges all the blocks together. As a result, maintaining and debugging such FSM is very challenging and we highly discourage people to adopt such FSM style unless absolute necessary. However, for completeness, we will show the code example people so that readers can recognize such programming style in practice.
module one_block_fsm_mealy (
input logic clk,
input logic rst_n,
input logic in,
output logic out
);
import count_one_fsm_pkg::*;
mealy_state_t state;
// one block: state update, next state, and output are in the same always_ff block
always_ff @(posedge clk, negedge rst_n) begin
if (!rst_n) begin
state <= mealy_state0;
end
else begin
case (state)
mealy_state0: begin
if (in) begin
state <= mealy_state1;
out <= 0;
end
else begin
state <= mealy_state0;
out <= 0;
end
end
mealy_state1: begin
if (in) begin
state <= mealy_state1;
out <= 1;
end
else begin
state <= mealy_state0;
out <= 0;
end
end
default: begin
state <= mealy_state0;
out <= 0;
end
endcase
end
end
endmodule: one_block_fsm_mealy
4.2.5 How to Write FSM Effectively
Designing an efficient FSM requires engineering work and experiments. A typical workflow is shown below:
- Identify states and state transition logic and turn it into a design specification.
- Implement FSM based on the specification
- (Optional) optimize the FSM based on feedbacks.
The first step of FSM design involves with design exploration about how many states are needed, what coding style to use, what state encoding to use, and what’s the output logic. A common way to visualize the FSM is to represent it in a state transition diagram. Another way to represent the FSM is to use tables, where each row represents a state transition. After all states have been identified, we can further optimize the FSM throw methods such as state reduction, where states with exactly the same logic (same outputs and same transition) can be merged into one.
Once the specification has been decided, translating it into FSM is very mechanical. Each transition arc can be expressed as a case item as we discussed earlier and so is the output logic. Once the implementation is done, we need to thoroughly test the it against common bugs such as dead lock or unreachable state. Some issues could be implementation related and some may be specification related. In any cases we need to fix the design/specification to meet the design requirements. We will discuss strategies about discovering deadlock and unreachable state when discussing formal verification later in the book.
4.3 Ready/Valid Handshake
Ready/valid handshake is one of the most used design pattern when transferring data in a latency-insensitive manner. It consists of two components, the source and the sink, where data flows from the former to the latter. The source uses valid
signal to indicate whether the data is valid and the sink uses ready
signal to indicate whether it is ready to receive data, as shown in the figure below.
Because ready/valid is latency-insensitive, each signal has precise semantics at the posedge of the clock (we assume we are dealing with synchronous circuit): - If the valid signal is high @(posedge clk)
, we know that data
is valid as well - If the ready signal is high @posedge (clk)
AND the valid signal is high as well, we complete the data transfer. The size of transfer is often referred as one word. - If the system wishes to transfer more data, then we need to complete a series of one-word transfer, until the entire packet is transferred.
The timing diagram below shows cases where a transfer should or should not occur.
Ready/valid handshake has several design pitfalls that needs to avoid: 1. If the source waits for the sink’s ready before asserting valid and vice versa, there will be chance of deadlock since both parties are waiting for each other. To avoid this, the control signal should be computed independently. 2. If the ready/valid signals are computed purely on combinational logic, there will be a combinational loop between the source and sink. To resolve this, either source or sink needs to register the control signals, or compute the signals based on some flopped states.
4.4 Commonly Used Design Building Blocks
In this section we lists some code examples of commonly used design building blocks. These circuits are commonly used in various circuit designs and are optimized for high synthesis quality.
4.4.1 Registers
There are various types registers, such as synchronous and asynchronous registers. Each type has their own benefits. The design team should decide ahead of time what types of registers to use consistently throughout the design. All the code examples here use negative reset.
4.4.2 Asynchronous Reset Registers
Asynchronous reset register has reset on its sensitivity list.
logic r, value;
always_ff @(posedge clk, negedge rst_n) begin
if (!rst_n) begin
r <= 1'b0;
end
else begin
r <= value;
end
end
4.4.2.1 Synchronous Reset Registers
Unlike Asynchronous reset registers, synchronous reset register only resets the register on clock edge, hence the name “synchronous”.
logic r, value;
always_ff @(posedge clk) begin
if (!rst) begin
r <= 1'b0;
end
else begin
r <= value;
end
end
4.4.2.2 Chip-enable Registers
Chip-enable registers has additional single that enables or disables the value update (sometimes called clock-gating). On ASIC, there are usually specially design cells to handle such logic. As a result, if you follow the code example below you will get optimal synthesis result. We will use asynchronous reset register as an example.
logic r, value;
always_ff @(posedge clk, negedge rst_n) begin
if (!rst_n) begin
r <= 1'b0;
end
else if (c_en) begin
r <= value;
end
end
In generally we do not recommend using your own logic control the register update, for instance, multiplexing the update value instead of using the syntax above, or creating your own clock based on the enable logic. These kinds of modification are unlikely to be picked up by the synthesis tools, hence reduce synthesis quality.
4.4.2.3 Power-up Values
Some FPGA tool chains allows initial values to be set along with declaration, as shown below. Since this approach does not work for ASIC, we do not recommend such approach if you want your code to be portable.
4.4.3 Multiplexer
Multiplexer is a type of hardware circuit that selects output signals from a list of input signals. There are many ways to implement a multiplexer and we will cover two common implementation of multiplexers.
4.4.3.1 case
-based Multiplexer
The simplest way to implement a multiplexer is using case
statement. It is straightforward to implement and also allows synthesis tools to recognize the multiplexer and optimize the netlist. Here is an example of multiplexer that takes 5 inputs. Notice that the number of inputs does not need to be 2’s power.
module Mux5
#(parameter int WIDTH = 1) (
input logic[WIDTH-1:0] I0,
input logic[WIDTH-1:0] I1,
input logic[WIDTH-1:0] I2,
input logic[WIDTH-1:0] I3,
input logic[WIDTH-1:0] I4,
input logic[$clog2(5):0] S,
output logic[WIDTH-1:0] O
);
always_comb begin
unique case (S)
0: O = I0;
1: O = I1;
2: O = I2;
3: O = I3;
4: O = I4;
default:
O = I0;
endcase
end
endmodule
Notice that default is used to handle edges cases where the select signal S
is out of range or containing x
.
A slightly shorten version is to merge all the input signals into an array and use index operator as multiplexer, as shown below:
module Mux
#(parameter int WIDTH=1,
parameter int NUM_INPUT=2) (
input logic[NUM_INPUT-1:0][WIDTH-1:0] I,
input logic[$clog2(NUM_INPUT)-1:0] S,
output logic[WIDTH-1:0] O
);
assign O = (S < NUM_INPUT)?
I[S]:
I[0];
endmodule
In the code example above, we implicitly ask the synthesis tool to create a multiplexer for us. There are several advantage of this approach:
- We let synthesis tool to do its job to optimize the design
- The module works with any arbitrary number inputs (
NUM_INPUT
has to be larger than 1), as well as outputs.
4.4.3.2 AOI Multiplexer
In situations where hand-optimization is required, we can implement an AOI max. AOI stands for AND-OR-Invert, which implies the the basic logic operation we are going to do with the inputs. AOI gates are efficient with CMOS technology since we can use NAND and NOR logic gate to construct AOI gate.
There are two components of AOI mux, namely a precoder and AOI logic. The precoder translate select signal into one-hot encoding, and AOI logic merge the inputs into output based on the one-hot-encoded select signal. Here is the complete implementation of the AOI mux with 5 inputs (code/04/aoi_mux.sv
).
module aoi_mux
#(parameter int WIDTH=1,
parameter int NUM_INPUT=2) (
input logic[NUM_INPUT-1:0][WIDTH-1:0] I,
input logic[$clog2(NUM_INPUT)-1:0] S,
output logic[WIDTH-1:0] O
);
// calculate the ceiling of num_input / 2
localparam NUM_OPS = (NUM_INPUT + 1) >> 1;
localparam MAX_RANGE = NUM_INPUT >> 1;
logic [NUM_INPUT-1:0] sel_one_hot;
// simplified one-hot precoder.
assign sel_one_hot = (S < NUM_INPUT)?
1 << S:
0;
// intermediate results
logic [NUM_OPS-1:0][WIDTH-1:0] inter_O;
// AOI logic part
always_comb begin
// working on each bit
for (int w = 0; w < WIDTH; w++) begin
// half the tree
for (int i = 0; i < MAX_RANGE; i++) begin
inter_O[i][w] = (sel_one_hot[i * 2] & I[i * 2][w]) |
(sel_one_hot[i * 2 + 1] & I[i * 2 + 1][w]);
end
// need to take care of odd number of inputs
if (NUM_INPUT % 2) begin
inter_O[MAX_RANGE][w] = sel_one_hot[MAX_RANGE * 2] & I[MAX_RANGE * 2][w];
end
end
end
// compute the final result, i.e. OR the intermediate result together
// notice that |inter_O doesn't work here since it will reduce to 1-bit signal
always_comb begin
O = 0;
for (int i = 0; i < NUM_OPS; i++) begin
O = O | inter_O[i];
end
end
endmodule
The example above can be explained with matrix operation. After one-hot encoding transformation, we create a matrix \(S\) where \(S[i] = sel\_one\_hot\) for \(i \in \{0, 1, \dots, NUM\_INPUT - 1\}\). In other words, all entries in matrix S is zero except for the column indicated by the select signal, which are all one’s. The input signals can be expressed as \(I\) where each row of \(I\) is one input. We then compute the following result: \[ O_{inter} = S \times I \]
Notice that since \(S\) only consists of one’s and zero’s, multiplication is effectively performing AND operation. Matrix \(O_{inter}\) has similar characteristic as matrix \(S\) due to the property of one-hot encoding. To obtain the result, we can do a row-wise OR reduction to obtain the final result. Since CMOS technology is more area efficient when we fuse AND and OR operation together, instead of computing one row at a time, we can compute two rows together, hence the variable NUM_OPS
is computed based on \(\lceil \frac{NUM\_INPUT}{2} \rceil\). Readers are encouraged to work out the process with some simple examples.
AOI mux is an example of how we can express the same logic in a clever way that is optimized for CMOS technology. This kind of optimization requires keen insight on the logic as well as deep understanding of logic synthesis. Unless required, we do not recommend to hand-optimize common logic such as adder or multiplexer since it may not achieve better result than synthesis tools and error prone. Use the syntax sugar offered by the SystemVerilog language and let synthesis tools do the heavy lifting. If the code follows the coding style, synthesis tools can pick up easily and perform automatic optimization.
4.5 Wishbone Protocol: A Case Study
A common place for bugs to occur is the interface between components, where each component may have different design assumptions. One approach to limit such bugs is to adhere to a well-specified protocol such that each component will follow and thus reduce the interface error. In this chapter we will take a look at a simple yet complete protocol, namely WIshbone, and how we can write RTL code based on the spec.
Unlike protocols such as AXI4, Wishbone is an open-source hardware bus interface, which allows engineers and hobbyists to share public domain designs.
4.5.1 Wishbone Introduction
Wishbone bus consists of two channels: a request channel which can either be read or write, and an acknowledge (ACK) channel. These two channels connect the bus master and slave together, as shown in the figure below.
The master has a list of signals specified by the specification. Notice that it is explicitly stated that IPs can change the interface name (PERMISSION 2.0.0), we will use the names used in the specification to make it easier to compare with the document. Notice that the specification follows the naming convention that suffix _O
indicates output port and _I
indicates input port.
There are a list of signals that’s shared between master and slave interfaces:
We’ll ignore TGD_I
and TGD_O
in this section, but keep in mind that they can transfer very useful metadata information such as error checking code to protect data.
Below shows the complete interface ports for the master (excluding the shared ports).
Signal Name | Function |
---|---|
ACK_I |
The acknowledge indicates the normal termination of a bus cycle |
ADR_O |
The address used for read/write request |
CYC_O |
The cycle output. When asserted, indicates a valid bus cycle in progress |
STALL_I |
When asserted, indicates that the current slave is not able to accept the transfer |
ERR_I |
When asserted, indicates an abnormal cycle termination |
LOCK_O |
When asserted, indicates the current bus cycle is uninterruptible |
RTY_I |
When asserted, indicates that the interface is not ready to accept/send data and the cycle should be retried |
SEL_O |
Indicates where valid data is expected on the DAT_I signal array during read cycles, and where it is placed on the DAT_O signal array during write cycles |
STB_O |
The strobe output indicates a valid data transfer cycle. It is used to qualify other signals on the interface. |
TGA_O |
Address tag type, which contains information associated with address lines, which can be qualified by STR_O . |
TGC_O |
Cycle tag type, which contains information associated with bus cycles, which can be qualified by signal CYC_O . |
WE_O |
Write enable output, which indicates whether the current local bus cycle is a read or write cycle. |
Again, we will ignore tag information. Interested readers should check out the specification.
The slave interface is symmetric with the master slave: XX_I
from master will have a correspondence port XX_O
in the slave and vice versa. In general, Wishbone interface is simpler than other bus interface such as Advanced Microcontroller Bus Architecture (AMBA), which is the reason why we can explain the protocol without lengthy details here.
4.5.2 Wishbone Master Example
We present here a simplified version of master module, where the read write behavior is controlled via a simple interface. For any real-world practice, we need to connect the master to an IP that directly controls the master’s behavior. We also drop the tag, lock, and byte select interface for simplicity, but keep in mind that in a real IP interface we need to implement this as well! We will focus on register read write instead of block transfer; we will also drop corner case handling such as error and retry. Interested readers should try to implement block transfer and other missing features.
First, we need to define the IO ports, where the width or the data is parametrized by WIDTH
. We also need to add other parameterization for control and data signals.
module wb_master #(parameter WIDTH=32,
parameter ADDR_WIDTH=16) (
input logic CLK_I,
input logic[WIDTH-1:0] DAT_I,
output logic[WIDTH-1:0] DAT_O,
input logic RST_I,
input logic ACK_I,
output logic[ADDR_WIDTH-1:0] ADR_O,
output logic CYC_O,
input logic STALL_I,
output logic STB_O,
output logic WE_O
// external controls
input logic write,
input logic enable,
input logic[ADDR_WIDTH-1:0] addr,
input logic[WIDTH-1:0] wdata,
output logic[WIDTH-1:0] rdata,
output logic ready,
output logic ack
);
Notice that due to the naming convention, STALL_I
is essentially slave’s ready signal, and STB_O
is the valid signal. With that in mind, we can quickly sketch out the logic for sending commands based on the controls signals. Notice that in Wishbone, every output will be registered. Notice that since we need to wait for the client to acknowledge the transition, we need a FSM to determine the transmission state (we will use 2-block FSM to implement this). Since we are only interested in one single register transfer, there is no need to keep track of the number of words transferred.
Based on the state, we have three different outputs:
always_comb begin
unique case (state)
IDLE: begin
CYC_O = 0;
STB_O = 0;
end
BUSY: begin
CYC_O = 1;
STB_O = 1;
end
endcase
end
We then need to change the state based on the control signals. Since we are only interested in one word transfer, we start the transaction when the external control signal enable
is high and the slave is ready. Based on whether it is a read or write request, we set the WB control data differently. After initiate the transaction, the master enter busy state, which waits for the slave to acknowledge back. After that the master signals the external client the end of transaction and returns back to idle state.
always_ff @(posedge CLK_I) begin
// reset on high
if (RST_I) begin
state <= IDLE;
// reset all registered outputs
ADDR_O <= 0;
WE_O <= 0;
DATA_O <= 0;
// external control signal
ack <= 0;
ready <= 1;
end
else begin
unique case (state)
IDLE: begin
// only when the we're asked to send data
// and slave is ready
if (enable && !STALL_I) begin
ADDR_O <= addr;
// write request
if (write) begin
DATA_O <= wdata;
WE_O <= 1;
end else begin
DATA_O <= 0;
WE_O <= 0;
end
SEL_O <= 1;
state <= BUSY;
// external control signal
ready <= 0;
ack <= 0;
end
else begin
// external control signal
ready <= 1;
ack <= 0;
end
end
BUSY: begin
// wait for slave ack
if (ACK_I) begin
// we good
state <= IDLE;
DATA_O <= 0;
// we assume control client will hold this signal until response gets back
if (enable) begin
ack <= 1;
if (!write) begin
// if it's a read
wdata <= DAT_I;
end
else begin
wdata <= 0;
end
end
end
end
endcase
end
end