Common Design Patterns/Practices

The end goal of RTL design, especially for ASICs, is to produce the smallest and fastest circuit possible. To do so, we need to understand how synthesis tools analyze and optimize the design. We are also concerned about simulation speed, since waiting for tests to run wastes engineering effort. Although synthesis and simulation tools have numerous optimization passes and transformations, one important factor for the end result is the design pattern, i.e., whether the code follows the tool’s design guide. Many optimizations are tuned for particular design patterns, which makes conforming code easier for the tools to understand and thus simplify. In addition, some design patterns simplify the code structure and make the code more readable and reusable.

In this chapter we will go through some common design practices and how we should program the logic and structure the source code.

4.1 Compiler Directives and Packages

Similar to C/C++, SystemVerilog defines a preprocessing stage where macros are expanded in the original source code. The compiler directives are not Turing-complete and are less versatile than those of C/C++, meaning that even bounded recursive computation is difficult to specify in SystemVerilog. Nevertheless, they allow some level of preprocessing.

4.1.1 Compiler Directives

There are several compiler directives defined by the language. We will cover some of the most commonly used ones here:

  1. `__FILE__
  2. `__LINE__
  3. `define
  4. `else
  5. `elsif
  6. `ifdef
  7. `ifndef
  8. `endif
  9. `undef
  10. `timescale
  11. `include

`__FILE__ and `__LINE__ are used the same way as __FILE__ and __LINE__ in C/C++, and are handy for test bench debugging. During preprocessing, these two compiler directives are replaced with the actual file name and line number.
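As a minimal sketch of the debugging use mentioned above (the module name file_line_demo is made up for illustration), the two directives can be passed straight to $display:

```systemverilog
module file_line_demo;
    initial begin
        // `__FILE__ expands to a string literal and `__LINE__ to a decimal
        // number, so they can be used directly as $display arguments
        $display("reached %s:%0d", `__FILE__, `__LINE__);
    end
endmodule
```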

`define allows you to define macros, which can be used later in the code. We will show two examples: the first defines a value, and the second defines a function-like code snippet that takes arguments. Notice that unlike C/C++, macros have to be prefixed with ` when used in code.
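The original listing for the first example is not reproduced here. A minimal sketch of a value macro, assuming a hypothetical module value_reg with a clock input clk, could look like this:

```systemverilog
// define a macro named VALUE that expands to 10
`define VALUE 10

module value_reg (
    input  logic       clk,
    output logic [7:0] value
);
    // the macro is referenced with a leading backtick
    always_ff @(posedge clk) begin
        value <= `VALUE;
    end
endmodule
```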

In the example above, we define `VALUE to be 10 and use it as a register value. Even though we cover the usage here, please avoid defining constant values as macros in this way, because:

  1. It is difficult to find where the macro is defined, e.g., in a file or via command-line options.
  2. There is no namespace for macros. If two macros share the same name, whichever gets parsed later will be used. This may cause unexpected bugs that are difficult to debug, since the compiler may not issue a warning for macro redefinition.

We highly recommend defining constants in a package instead, which will be covered later in this chapter.

Another way to use `define is to define some code snippets which can be re-used later, as shown in the example below (also in code/04/macros_arguments.sv):
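The listing itself is not reproduced here, so the following is a reconstruction under assumptions: the macro name REGISTER and its NAME, WIDTH, and VALUE arguments come from the text, while the CLK argument, the module name, the 16-bit width, and the test stimulus are guesses.

```systemverilog
// function-like macro: declares a register and the always_ff that updates it
// (each continued line must end with a backslash)
`define REGISTER(NAME, WIDTH, VALUE, CLK) \
    logic [WIDTH-1:0] NAME;               \
    always_ff @(posedge CLK) begin        \
        NAME <= VALUE;                    \
    end

module macros_arguments;
    logic        clk;
    logic [15:0] in;

    // three registers chained to the signal in
    `REGISTER(reg1, 16, in,   clk)
    `REGISTER(reg2, 16, reg1, clk)
    `REGISTER(reg3, 16, reg2, clk)

    initial begin
        clk = 0;
        for (int i = 0; i < 3; i++) begin
            in = i;
            #1 clk = 1;   // rising edge: every register samples its input
            #1 clk = 0;
            $display("reg1: %d reg2: %d reg3: %d", reg1, reg2, reg3);
        end
    end
endmodule
```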

We will see the expected output, where x denotes uninitialized register value:

reg1:     0 reg2:     x reg3:     x
reg1:     1 reg2:     0 reg3:     x
reg1:     2 reg2:     1 reg3:     0

In the code example above, we first define three registers that are pipelined from the signal in (in a chained fashion). The macro REGISTER first declares the register given NAME and WIDTH, then instantiates an always_ff block that assigns VALUE to the register at every clock cycle. Notice that we have to use \ for multi-line definitions.

Although using a macro may sometimes save time and make the code more reusable, it is important to find a balance between repetitive code segments and macro usage. Keep in mind that macros are substituted during the preprocessing stage, which makes source-level debugging challenging. You also need to be careful about macro redefinition since all macros live in the global namespace.

During the macro definition, sometimes you need to undefine some macro names for a different usage. Similar to C/C++, you can use `undef to un-define the macro.

`ifdef and `ifndef can be used to test whether a certain macro has been defined (or not defined). You need to close these compiler directives with `endif. You can also add `else and `elsif to account for different scenarios. Notice that for a header file, they can be used together with `define to provide an include guard, which allows the header file to be included in multiple places. Their usage is nearly identical to that of C/C++, so we will not cover them in detail here.
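A minimal sketch of an include guard combined with conditional compilation is shown here. The macro names COMMON_DEFS_SVH, SIMULATION, and FPGA, and the type and parameter names, are made up for illustration.

```systemverilog
// common_defs.svh (hypothetical header file)
`ifndef COMMON_DEFS_SVH
`define COMMON_DEFS_SVH

// everything between the guard lines is seen at most once per compilation unit
typedef logic [7:0] byte_t;

// pick a value depending on which macro the build flow defines
`ifdef SIMULATION
localparam int TIMEOUT_CYCLES = 100;
`elsif FPGA
localparam int TIMEOUT_CYCLES = 10_000;
`else
localparam int TIMEOUT_CYCLES = 1_000_000;
`endif

`endif // COMMON_DEFS_SVH
```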

`timescale is an important compiler directive for simulators. It specifies the unit of measurement for time and the precision of time for the design elements that follow it. A `timescale directive stays in effect until another `timescale directive is read, so when source files with different timescales are compiled together, the result can depend on the compilation order; it is safest to use a single, consistent timescale. The syntax is `timescale time_unit / time_precision.

The time_unit argument specifies the unit of measurement for times and delays, and the time_precision argument specifies how delay values are rounded before being used in simulation. time_precision must be at least as precise as time_unit, i.e., it cannot be a longer unit of time. The units of time_unit and time_precision can be s, ms, us, ns, ps, and fs. The integer part specifies an order of magnitude for the size of the value; the only valid numbers are 1, 10, and 100.
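As an illustration of the two arguments, here is a small sketch with a 1 ns time unit and 100 ps precision. The module name and the delay values are arbitrary choices.

```systemverilog
`timescale 1ns / 100ps

module timescale_demo;
    logic clk = 1'b0;

    // #5 means 5 time units, i.e. 5 ns, so the clock period is 10 ns
    always #5 clk = ~clk;

    initial begin
        // 1.26 ns is not a multiple of the 100 ps precision,
        // so the delay is rounded to 1.3 ns before it is used
        #1.26;
        $display("time after delay: %0t", $realtime);
        #100 $finish;
    end
endmodule
```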

Timescale is crucial for simulating jitter and timing violations. It is also required for any power-related analysis. It is highly recommended to include a timescale in your top-level test bench, even if it is not used.

`include serves the same purpose as #include in C/C++: it includes definitions from another file. It’s highly recommended to provide an include guard in the included file. If the filename is enclosed in quotes, e.g. `include "filename.svh", the compiler will first search its current working directory, and then search any user-specified locations. If the filename is enclosed in angle brackets, e.g. `include <filename.svh>, the file has to be one defined by the language standard. This rule is similar to that of C/C++.

4.1.2 Packages

Although `include provides a way for designers to share definitions, the compiler directive essentially asks the compiler to copy the content of the included file into the source file, which is a legacy feature influenced by C. As modern programming languages have started to use modules/packages to structure the source code, e.g. modules in C++20, SystemVerilog introduces a construct called package that allows designers to reuse definitions, interfaces, and functions. Since packages are synthesizable, it is highly recommended to use them in both RTL and test benches. Here is an example of a package:
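The original listing is not reproduced here. The following sketch shows what such a package might contain. The name alu_opcode_t appears later in the text, while alu_pkg, DATA_WIDTH, the enum labels, and alu_eval are assumptions.

```systemverilog
package alu_pkg;
    // constants are declared with localparam
    localparam int DATA_WIDTH = 16;

    // type definitions shared by RTL and test benches
    typedef enum logic [1:0] {
        ADD,
        SUB,
        AND_OP,
        OR_OP
    } alu_opcode_t;

    // functions declared in a package should be automatic
    function automatic logic [DATA_WIDTH-1:0] alu_eval(
        input alu_opcode_t           op,
        input logic [DATA_WIDTH-1:0] a,
        input logic [DATA_WIDTH-1:0] b
    );
        case (op)
            ADD:     return a + b;
            SUB:     return a - b;
            AND_OP:  return a & b;
            OR_OP:   return a | b;
            default: return '0;
        endcase
    endfunction
endpackage
```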

Here is an incomplete list of constructs that are allowed inside a package:

  1. parameter declaration, e.g. parameter and localparam
  2. function declaration, e.g. automatic function
  3. data declaration, e.g., struct and enum
  4. DPI import and export
  5. class declaration
  6. package import declaration

Since a parameter cannot be overridden inside a package, we highly recommend using localparam in lieu of parameter; the two are functionally identical in a package. In other words, localparam does not impose any extra visibility restriction in a package.

4.1.2.1 Package Import

To use package definitions in other modules, we need to use the import keyword. There are several ways to import the contents of a package, and we will cover two commonly used approaches here:

  1. wildcard import. This is similar to Python’s from pkg_name import *:

  2. explicit import. This is similar to Python’s from pkg_name import class_name:
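The two import styles listed above can be sketched as follows. The package demo_pkg and its contents are hypothetical and are included only so that the example stands alone.

```systemverilog
package demo_pkg;
    typedef enum logic {ADD, SUB} alu_opcode_t;
    localparam int DATA_WIDTH = 8;
endpackage

module wildcard_import_demo;
    // wildcard import: every identifier in demo_pkg becomes a candidate
    import demo_pkg::*;

    alu_opcode_t           op;
    logic [DATA_WIDTH-1:0] value;
    initial op = ADD;
endmodule

module explicit_import_demo;
    // explicit import: only the named identifiers become visible;
    // the enum label is imported explicitly as well because it is
    // used by name below
    import demo_pkg::alu_opcode_t;
    import demo_pkg::SUB;

    alu_opcode_t op;
    initial op = SUB;
endmodule
```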

After importing, the identifiers (e.g., struct names or enum value names) can be used directly in the module. Notice that there are several places where we can put a package import. Depending on where the content of the package is used, there are two standard approaches:

  1. If the identifier is used in the module port definition, the import needs to be placed before the port list:

  2. Otherwise, we shall put the import inside the module:
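Both placements can be sketched as follows. The package and module names are made up for illustration.

```systemverilog
package port_demo_pkg;
    typedef enum logic {ADD, SUB} alu_opcode_t;
endpackage

// 1. the type is used in the port list, so the import sits in the
//    module header, between the module name and the port list
module alu
    import port_demo_pkg::*;
(
    input  alu_opcode_t op,
    input  logic [7:0]  a,
    input  logic [7:0]  b,
    output logic [7:0]  result
);
    assign result = (op == ADD) ? a + b : a - b;
endmodule

// 2. the type is only used internally, so the import sits in the body
module opcode_holder (
    input  logic clk,
    input  logic sel,
    output logic is_add
);
    import port_demo_pkg::*;

    alu_opcode_t op;
    always_ff @(posedge clk) begin
        if (sel) op <= ADD;
        else     op <= SUB;
    end
    assign is_add = (op == ADD);
endmodule
```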

4.1.2.2 Import Packages within a Package

As in software programming languages, you can import the contents of one package inside another package, and the definitions built on these “chained” imports are visible to the consumer. Here is an example (code/04/chained_packages.sv) that illustrates chained package imports:

Notice that unlike some software programming languages such as Python, where an imported identifier is accessible as part of the importing package, SystemVerilog prohibits such behavior. If you try to import alu_opcode_t from def2_pkg, you will get a recursive import error from the compiler.
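The chained import described above might look like the sketch below. The names def2_pkg and alu_opcode_t come from the text, whereas def1_pkg, alu_command_t, and the module are assumptions.

```systemverilog
package def1_pkg;
    typedef enum logic {ADD, SUB} alu_opcode_t;
endpackage

package def2_pkg;
    // def2_pkg uses a type from def1_pkg to build its own definition
    import def1_pkg::*;

    typedef struct packed {
        alu_opcode_t opcode;
        logic [7:0]  a;
        logic [7:0]  b;
    } alu_command_t;
endpackage

module chained_consumer;
    // alu_command_t is declared in def2_pkg, so this import is fine
    import def2_pkg::alu_command_t;

    // alu_opcode_t is declared in def1_pkg and must be imported from there;
    // "import def2_pkg::alu_opcode_t;" would be rejected by the compiler
    import def1_pkg::alu_opcode_t;

    alu_command_t cmd;
    alu_opcode_t  op;
endmodule
```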

4.1.2.3 Package Usage Caveats

Since the content of a package is scoped, there is a chance of naming conflicts when using wildcard imports. A rule of thumb is to resort to explicit imports whenever names conflict. Some coding styles prohibit wildcard imports altogether, which makes the code a little more verbose but more readable and maintainable. The exact scoping rules are beyond the scope of this book; interested readers should refer to Table 26-1 in 1800-2017.

Another caveat is that packages have to be compiled before any module files that rely on them. One systematic way is to rely on build tools such as make to ensure the order of compilation. Another simple way is to list packages before other sources when supplying file names to the tools.

4.2 Finite State Machines

The Finite State Machine (FSM) is the core part of hardware control logic. How well an FSM is designed directly impacts the synthesis and verification effort, since these tools have somewhat restricted expectations of how an FSM should be written. Although the theory of FSMs is beyond the scope of this book, we will try to cover as much as possible while going over the major topics.

4.2.1 Moore and Mealy FSM

Generally speaking, there are two types of FSM commonly used in hardware design, namely Moore and Mealy machines. A Moore machine, named after Edward F. Moore, is a type of FSM whose output values are determined solely by its current state. A Mealy machine, named after George H. Mealy, is a type of FSM whose output values are determined by its current state and the current inputs. To draw a formal distinction between Moore and Mealy machines, we can define an FSM using the following mathematical notation:

  • A finite set of states \(S\)
  • An initial state \(S_0\) such that \(S_0 \in S\)
  • A finite input set \(\Sigma\)
  • A finite output set \(\Lambda\)
  • A state transition function \(T: \Sigma \times S \rightarrow S\)
  • An output function \(G\)

For Moore machines, the output function is \(G: S \rightarrow \Lambda\), whereas for Mealy machines, the output function is \(G: \Sigma \times S \rightarrow \Lambda\). Although Moore and Mealy machines are mathematically equivalent, there is a major difference when they are represented as state transition diagrams, as shown in Figures 4 and 5, where both diagrams describe logic that counts consecutive ones and outputs 1 once the count reaches 2. As a notation, the label on an edge in a Moore machine represents the input value and the label on a node represents the output value. In a Mealy machine, the label on an edge follows the input/output notation.

Figure 4: State transition diagram for Moore Machine.
Figure 5: State transition diagram for Mealy Machine.

Due to this difference, we will see timing- and area-related differences when we design Moore and Mealy machines in SystemVerilog:

  • To describe the same control logic, Moore machines tend to have more states than Mealy machines.
  • The output of a Moore machine tends to have one extra cycle of delay compared to a Mealy machine.

Choosing which type of machine to use usually depends on the control logic you are trying to model. Since a Mealy machine can be used as a Moore machine if the inputs are ignored when computing the outputs, Mealy machines are more general. Although nothing prevents you from mixing these two kinds of machine, it is highly recommended to stick to one coding style so that tools can recognize your design easily.

4.2.2 FSM State Encoding

There are several ways to encode the states \(S\): one-hot encoding, Gray encoding, and binary encoding. Given \(|S| = N\):

  • One-hot encoding means that exactly one bit is set to 1 for any particular state, so the total number of bits required to represent the states is \(N\). The Hamming distance between any two states is 2, meaning we have to flip 2 bits for a state transition.
  • Gray encoding, named after Frank Gray, is a special encoding scheme that only requires \(\lceil \log_2 N \rceil\) bits. In addition, the Hamming distance between consecutive states is designed to be 1, which means only one bit changes when the FSM moves to the next state in the sequence.
  • Binary encoding means the state value is assigned by its index in the list of states. As a result, it also requires \(\lceil \log_2 N \rceil\) bits. Since a state transition may require flipping all bits, e.g., state 0 transitions to state 3 for a 2-bit state, its Hamming distance can be as large as \(\lceil \log_2 N \rceil\).

Each encoding has its own advantages. For instance, since only one bit needs to be tested to decode the state, one-hot encoding allows smaller multiplexing logic, while Gray encoding has low switching power and is thus favorable for low-power designs. Which encoding to choose is an engineering decision that depends on the design needs. Many synthesis tools can re-encode FSM states automatically during synthesis, so designers can code the FSM in one encoding scheme and synthesize it in a different one. However, this also implies that the synthesized version of the RTL differs from the original RTL on which all the verification was done, so some corner-case bugs may appear when the tools re-encode the FSM. In general, we recommend that the design team decide on an encoding scheme early on, based on engineering experiments. Doing so ensures consistency between synthesis and verification.

In SystemVerilog, we typically use enum to define states. Compared to old school methods such as `define and localparam, using enum allows type-checking from the compiler, which makes the code safer and easier to debug. Below are several examples using one-hot encoding, Gray encoding, and binary encoding.
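The original listings are not reproduced here. The sketch below encodes the same four hypothetical states in the three schemes; the package name, the type names, and the label prefixes are all made up for illustration.

```systemverilog
package fsm_encoding_demo_pkg;
    // one-hot: N = 4 states need 4 bits, exactly one bit set per state
    typedef enum logic [3:0] {
        OH_IDLE = 4'b0001,
        OH_LOAD = 4'b0010,
        OH_EXEC = 4'b0100,
        OH_DONE = 4'b1000
    } one_hot_state_t;

    // Gray: 2 bits, consecutive states differ in exactly one bit
    typedef enum logic [1:0] {
        GRAY_IDLE = 2'b00,
        GRAY_LOAD = 2'b01,
        GRAY_EXEC = 2'b11,
        GRAY_DONE = 2'b10
    } gray_state_t;

    // binary: 2 bits, the value is simply the index of the state
    typedef enum logic [1:0] {
        BIN_IDLE = 2'd0,
        BIN_LOAD = 2'd1,
        BIN_EXEC = 2'd2,
        BIN_DONE = 2'd3
    } binary_state_t;
endpackage
```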

4.2.3 General FSM Structure

As indicated by the formal definition of an FSM, we need to design two components: the state transition logic \(T\) and the output function \(G\). However, since the FSM needs to hold its state, we need another component that sequentially updates the FSM state. As a result, a typical FSM always has three components, as shown in Figure 6.

Figure 6: General FSM structure for Moore and Mealy machine.

4.2.4 One-, Two-, and Three-Block FSM Coding Style

Although there are three necessary components in an FSM, sometimes we can merge some of them into a single process. As a result, we have three popular FSM coding styles, commonly referred to as the one-block, two-block, and three-block FSM coding styles.

In the following subsections, we will use the count-consecutive-ones logic as an example to show the different coding styles. The definition of all states is shown below as a SystemVerilog package.
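The package listing is not reproduced here. A plausible sketch, with a hypothetical package name and state names chosen to reflect how many consecutive ones have been seen, is:

```systemverilog
package count_ones_fsm_pkg;
    // ZERO: no consecutive one seen yet
    // ONE:  one 1 seen
    // TWO:  two (or more) consecutive ones seen
    typedef enum logic [1:0] {
        ZERO,
        ONE,
        TWO
    } state_t;
endpackage
```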

4.2.4.1 Three-Block FSM Coding Style

Three-block FSM coding style is usually implemented as a Moore machine where:

  1. One block is used to update state with next_state.
  2. One block is used to determine next_state based on state and current inputs.
  3. One block is used to compute output based on state.

The complete example of three-block FSM is shown below (code/04/three_block_fsm_moore.sv):
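The original file is not reproduced here, so the following is only a sketch of the three-block style under assumptions: the port names and state names are guesses, the state type is declared locally so the example stands alone (in practice it would come from a shared package), and the FSM is assumed to keep its output high while further ones arrive.

```systemverilog
module three_block_fsm_moore (
    input  logic clk,
    input  logic rst_n,
    input  logic in,
    output logic out
);
    typedef enum logic [1:0] {ZERO, ONE, TWO} state_t;
    state_t state, next_state;

    // block 1: update state with next_state
    always_ff @(posedge clk, negedge rst_n) begin
        if (!rst_n) state <= ZERO;
        else        state <= next_state;
    end

    // block 2: determine next_state from state and the current input
    always_comb begin
        next_state = state;
        case (state)
            ZERO: if (in) next_state = ONE;
            ONE: begin
                if (in) next_state = TWO;
                else    next_state = ZERO;
            end
            TWO: if (!in) next_state = ZERO;
            default: next_state = ZERO;
        endcase
    end

    // block 3: Moore output, a function of state only
    always_comb begin
        out = (state == TWO);
    end
endmodule
```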

4.2.4.2 Two-Block FSM Coding Style

Two-block FSM is usually implemented as a Mealy machine where:

  1. One block is used to update state with next_state.
  2. One block is used to determine next_state and the outputs, based on state and current inputs.

The complete example of two-block FSM is shown below (code/04/two_block_fsm_mealy.sv):
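Again, the original file is not reproduced here. The sketch below illustrates the two-block style with assumed port and state names. Because the output may depend on the input, two states are enough for the count-consecutive-ones example.

```systemverilog
module two_block_fsm_mealy (
    input  logic clk,
    input  logic rst_n,
    input  logic in,
    output logic out
);
    // declared locally so that the sketch stands alone
    typedef enum logic {ZERO, ONE} state_t;
    state_t state, next_state;

    // block 1: update state with next_state
    always_ff @(posedge clk, negedge rst_n) begin
        if (!rst_n) state <= ZERO;
        else        state <= next_state;
    end

    // block 2: next_state and output, from state and the current input
    always_comb begin
        next_state = state;
        out        = 1'b0;
        case (state)
            ZERO: if (in) next_state = ONE;
            ONE: begin
                if (in) out = 1'b1;   // second consecutive one arrives now
                else    next_state = ZERO;
            end
            default: next_state = ZERO;
        endcase
    end
endmodule
```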

A Mealy-machine-based two-block FSM has the advantage that the output can update whenever the input changes, without waiting for the next cycle. However, it makes maintenance more difficult: since the next-state logic and the output logic are coded together, adjusting the FSM may require significant restructuring. It is up to the design team to decide which style to use.

4.2.4.3 One-Block FSM Coding Style

The one-block style merges all the blocks together. As a result, maintaining and debugging such an FSM is very challenging, and we highly discourage adopting this style unless absolutely necessary. However, for completeness, we show a code example so that readers can recognize this programming style in practice.
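The original listing is not reproduced here. A sketch of what the one-block style might look like for the same count-consecutive-ones logic (port and state names are assumptions) follows. Note how the state update, next-state decision, and a registered output are interleaved in one process.

```systemverilog
module one_block_fsm (
    input  logic clk,
    input  logic rst_n,
    input  logic in,
    output logic out
);
    typedef enum logic [1:0] {ZERO, ONE, TWO} state_t;
    state_t state;

    always_ff @(posedge clk, negedge rst_n) begin
        if (!rst_n) begin
            state <= ZERO;
            out   <= 1'b0;
        end else begin
            out <= 1'b0;  // default, overridden below when needed
            case (state)
                ZERO: if (in) state <= ONE;
                ONE: begin
                    if (in) begin
                        state <= TWO;
                        out   <= 1'b1;
                    end else begin
                        state <= ZERO;
                    end
                end
                TWO: begin
                    if (in) out   <= 1'b1;
                    else    state <= ZERO;
                end
                default: state <= ZERO;
            endcase
        end
    end
endmodule
```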

4.2.5 How to Write FSM Effectively

Designing an efficient FSM requires engineering work and experiments. A typical workflow is shown below:

  1. Identify states and state transition logic and turn it into a design specification.
  2. Implement FSM based on the specification
  3. (Optional) Optimize the FSM based on feedback.

The first step of FSM design involves design exploration: how many states are needed, which coding style and state encoding to use, and what the output logic is. A common way to visualize the FSM is to represent it as a state transition diagram. Another way is to use tables, where each row represents a state transition. After all states have been identified, we can further optimize the FSM through methods such as state reduction, where states with exactly the same logic (same outputs and same transitions) are merged into one.

Once the specification has been decided, translating it into an FSM is very mechanical. Each transition arc can be expressed as a case item, as we discussed earlier, and so can the output logic. Once the implementation is done, we need to thoroughly test it against common bugs such as deadlock or unreachable states. Some issues may be implementation-related and some may be specification-related. In either case, we need to fix the design or the specification to meet the design requirements. We will discuss strategies for discovering deadlock and unreachable states when discussing formal verification later in the book.

4.3 Ready/Valid Handshake

The ready/valid handshake is one of the most used design patterns for transferring data in a latency-insensitive manner. It consists of two components, the source and the sink, where data flows from the former to the latter. The source uses the valid signal to indicate whether the data is valid, and the sink uses the ready signal to indicate whether it is ready to receive data, as shown in the figure below.

Figure 7: Ready/Valid block diagram

Because ready/valid is latency-insensitive, each signal has precise semantics at the posedge of the clock (we assume we are dealing with a synchronous circuit):

  • If the valid signal is high @(posedge clk), we know that the data is valid as well.
  • If the ready signal is high @(posedge clk) AND the valid signal is high as well, the data transfer completes. The size of one transfer is often referred to as one word.
  • If the system wishes to transfer more data, it needs to complete a series of one-word transfers until the entire packet is transferred.
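The transfer condition above can be sketched as a small sink that stores one word. The module name, the one-entry buffer, and the pop input used to drain it are assumptions made for illustration.

```systemverilog
module rv_sink #(
    parameter int WIDTH = 8
) (
    input  logic             clk,
    input  logic             rst_n,
    // ready/valid interface
    input  logic             valid,
    input  logic [WIDTH-1:0] data,
    output logic             ready,
    // hypothetical consumer side: pop frees the buffer
    input  logic             pop,
    output logic             full,
    output logic [WIDTH-1:0] word
);
    logic transfer;

    // the sink can accept a word whenever its buffer is empty
    assign ready    = !full;
    // a word is transferred only when valid and ready are both high
    assign transfer = valid && ready;

    always_ff @(posedge clk, negedge rst_n) begin
        if (!rst_n) begin
            full <= 1'b0;
            word <= '0;
        end else if (transfer) begin
            full <= 1'b1;
            word <= data;
        end else if (pop) begin
            full <= 1'b0;
        end
    end
endmodule
```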

The timing diagrams below show cases where a transfer should or should not occur.

Figure 8: No data transfer
Figure 9: No data transfer
Figure 10: One successful ready/valid data transfer

The ready/valid handshake has several design pitfalls that need to be avoided:

  1. If the source waits for the sink’s ready before asserting valid and vice versa, there is a chance of deadlock since both parties are waiting for each other. To avoid this, the control signals should be computed independently.
  2. If the ready/valid signals are computed purely with combinational logic, there will be a combinational loop between the source and the sink. To resolve this, either the source or the sink needs to register the control signals, or compute them based on some flopped state.
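One way to avoid both pitfalls on the source side is to drive valid from a flop that never looks at ready when it is being set. The sketch below assumes a hypothetical push/push_data request interface; a push that arrives while a word is still pending is simply ignored here.

```systemverilog
module rv_source #(
    parameter int WIDTH = 8
) (
    input  logic             clk,
    input  logic             rst_n,
    // hypothetical producer side
    input  logic             push,
    input  logic [WIDTH-1:0] push_data,
    // ready/valid interface
    input  logic             ready,
    output logic             valid,
    output logic [WIDTH-1:0] data
);
    // valid and data are registers, so there is no combinational path
    // from ready to valid, and valid is asserted without waiting for ready
    always_ff @(posedge clk, negedge rst_n) begin
        if (!rst_n) begin
            valid <= 1'b0;
            data  <= '0;
        end else if (!valid && push) begin
            valid <= 1'b1;
            data  <= push_data;
        end else if (valid && ready) begin
            // the word was transferred at this clock edge
            valid <= 1'b0;
        end
    end
endmodule
```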

4.4 Commonly Used Design Building Blocks

In this section we list some code examples of commonly used design building blocks. These circuits are used in a wide variety of designs and are written for high synthesis quality.

4.4.1 Registers

There are various types of registers, such as synchronous-reset and asynchronous-reset registers. Each type has its own benefits. The design team should decide ahead of time which type of register to use consistently throughout the design. All the code examples here use an active-low reset.

4.4.2 Asynchronous Reset Registers

An asynchronous reset register has its reset signal on the sensitivity list.
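A minimal sketch of such a register, with assumed port names and an active-low reset rst_n, is:

```systemverilog
module async_reset_reg #(
    parameter int WIDTH = 8
) (
    input  logic             clk,
    input  logic             rst_n,
    input  logic [WIDTH-1:0] d,
    output logic [WIDTH-1:0] q
);
    // the falling edge of rst_n is on the sensitivity list,
    // so the reset takes effect without waiting for a clock edge
    always_ff @(posedge clk, negedge rst_n) begin
        if (!rst_n) q <= '0;
        else        q <= d;
    end
endmodule
```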

4.4.2.1 Synchronous Reset Registers

Unlike asynchronous reset registers, a synchronous reset register only resets on a clock edge, hence the name “synchronous”.
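A minimal sketch of a synchronous reset register, again with assumed port names and an active-low reset, is:

```systemverilog
module sync_reset_reg #(
    parameter int WIDTH = 8
) (
    input  logic             clk,
    input  logic             rst_n,
    input  logic [WIDTH-1:0] d,
    output logic [WIDTH-1:0] q
);
    // only the clock is on the sensitivity list; rst_n is sampled
    // like any other input at the rising clock edge
    always_ff @(posedge clk) begin
        if (!rst_n) q <= '0;
        else        q <= d;
    end
endmodule
```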

4.4.2.2 Chip-enable Registers

Chip-enable registers have an additional signal that enables or disables the value update (sometimes called clock gating). On ASICs, there are usually specially designed cells to handle such logic. As a result, if you follow the code example below, you give the synthesis tool the best chance to map the register onto those cells. We will use an asynchronous reset register as an example.
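A sketch of the coding pattern, with an assumed enable port named en, is:

```systemverilog
module enable_reg #(
    parameter int WIDTH = 8
) (
    input  logic             clk,
    input  logic             rst_n,
    input  logic             en,
    input  logic [WIDTH-1:0] d,
    output logic [WIDTH-1:0] q
);
    always_ff @(posedge clk, negedge rst_n) begin
        if (!rst_n)  q <= '0;
        // no else branch: q holds its value while en is low
        else if (en) q <= d;
    end
endmodule
```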

In general, we do not recommend using your own logic to control the register update, for instance, multiplexing the update value instead of using the syntax above, or creating your own clock based on the enable logic. These kinds of modifications are unlikely to be recognized by the synthesis tools and hence reduce synthesis quality.

4.4.2.3 Power-up Values

Some FPGA tool chains allow initial values to be set along with the declaration, as shown below. Since this approach does not work for ASICs, we do not recommend it if you want your code to be portable.
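A sketch of the pattern (the counter, its width, and the initial value are arbitrary choices) is:

```systemverilog
module power_up_counter (
    input  logic       clk,
    output logic [7:0] count
);
    // the initializer in the declaration is the power-up value on
    // FPGA tool chains that support it; there is no reset signal
    logic [7:0] count_r = 8'd42;

    always_ff @(posedge clk) begin
        count_r <= count_r + 8'd1;
    end

    assign count = count_r;
endmodule
```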

4.4.3 Multiplexer

A multiplexer is a type of hardware circuit that selects its output signal from a list of input signals. There are many ways to implement a multiplexer, and we will cover two common implementations.

4.4.3.1 case-based Multiplexer

The simplest way to implement a multiplexer is to use a case statement. It is straightforward to implement and also allows synthesis tools to recognize the multiplexer and optimize the netlist. Here is an example of a multiplexer that takes 5 inputs. Notice that the number of inputs does not need to be a power of 2.
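The original listing is not reproduced here. In the sketch below, the select signal S comes from the text, while the module name, the other port names, and the value driven in the default branch are assumptions.

```systemverilog
module case_mux5 #(
    parameter int WIDTH = 8
) (
    input  logic [WIDTH-1:0] I0,
    input  logic [WIDTH-1:0] I1,
    input  logic [WIDTH-1:0] I2,
    input  logic [WIDTH-1:0] I3,
    input  logic [WIDTH-1:0] I4,
    input  logic [2:0]       S,
    output logic [WIDTH-1:0] O
);
    always_comb begin
        case (S)
            3'd0:    O = I0;
            3'd1:    O = I1;
            3'd2:    O = I2;
            3'd3:    O = I3;
            3'd4:    O = I4;
            // S is 5, 6, 7, or contains x/z
            default: O = '0;
        endcase
    end
endmodule
```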

Notice that default is used to handle edge cases where the select signal S is out of range or contains x.

A slightly shorter version is to merge all the input signals into an array and use the index operator as the multiplexer, as shown below:
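A sketch of the array-based version is given here. NUM_INPUT is named in the text; the module name, WIDTH, SEL_WIDTH, and the port names are assumptions.

```systemverilog
module array_mux #(
    parameter  int WIDTH     = 8,
    parameter  int NUM_INPUT = 5,
    localparam int SEL_WIDTH = $clog2(NUM_INPUT)
) (
    // all inputs merged into one packed array
    input  logic [NUM_INPUT-1:0][WIDTH-1:0] I,
    input  logic [SEL_WIDTH-1:0]            S,
    output logic [WIDTH-1:0]                O
);
    // indexing the array implies a multiplexer
    assign O = I[S];
endmodule
```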

In the code example above, we implicitly ask the synthesis tool to create a multiplexer for us. There are several advantages to this approach:

  1. We let the synthesis tool do its job of optimizing the design.
  2. The module works with an arbitrary number of inputs (NUM_INPUT has to be larger than 1), as well as outputs.

4.4.3.2 AOI Multiplexer

In situations where hand-optimization is required, we can implement an AOI mux. AOI stands for AND-OR-Invert, which describes the basic logic operations we are going to perform on the inputs. AOI gates are efficient in CMOS technology since we can use NAND and NOR logic gates to construct an AOI gate.

There are two components in an AOI mux, namely a precoder and the AOI logic. The precoder translates the select signal into a one-hot encoding, and the AOI logic merges the inputs into the output based on the one-hot-encoded select signal. Here is the complete implementation of the AOI mux with 5 inputs (code/04/aoi_mux.sv).
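The file code/04/aoi_mux.sv is not reproduced here, so the following is a reconstruction under assumptions rather than the original listing. NUM_INPUT, NUM_OPS, and sel_one_hot are named in the text; everything else (port names, WIDTH, the loop structure) is a guess at one reasonable way to express the precoder and the AND-OR logic.

```systemverilog
module aoi_mux #(
    parameter  int WIDTH     = 8,
    parameter  int NUM_INPUT = 5,
    localparam int SEL_WIDTH = $clog2(NUM_INPUT)
) (
    input  logic [NUM_INPUT-1:0][WIDTH-1:0] in,
    input  logic [SEL_WIDTH-1:0]            sel,
    output logic [WIDTH-1:0]                out
);
    // two inputs are handled by one fused AND-OR operation
    localparam int NUM_OPS = (NUM_INPUT + 1) / 2;

    // precoder: binary select to one-hot select
    logic [NUM_INPUT-1:0] sel_one_hot;
    always_comb begin
        for (int i = 0; i < NUM_INPUT; i++) begin
            sel_one_hot[i] = (sel == i);
        end
    end

    // AND each input with its one-hot bit, OR the results pairwise
    logic [NUM_OPS-1:0][WIDTH-1:0] inter;
    always_comb begin
        for (int i = 0; i < NUM_OPS; i++) begin
            inter[i] = {WIDTH{sel_one_hot[2*i]}} & in[2*i];
            if (2*i + 1 < NUM_INPUT) begin
                inter[i] |= {WIDTH{sel_one_hot[2*i+1]}} & in[2*i+1];
            end
        end
    end

    // final OR reduction: at most one intermediate row is non-zero
    always_comb begin
        out = '0;
        for (int i = 0; i < NUM_OPS; i++) begin
            out |= inter[i];
        end
    end
endmodule
```

In this sketch an out-of-range select leaves every one-hot bit low, so the output is all zeros.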

The example above can be explained with matrix operations. After the one-hot encoding transformation, we create a matrix \(S\) where \(S[i] = sel\_one\_hot\) for \(i \in \{0, 1, \dots, NUM\_INPUT - 1\}\). In other words, all entries in matrix \(S\) are zero except for the column indicated by the select signal, whose entries are all ones. The input signals can be expressed as \(I\), where each row of \(I\) is one input. We then compute the following result: \[ O_{inter} = S \times I \]

Notice that since \(S\) only consists of ones and zeros, multiplication effectively performs an AND operation. Matrix \(O_{inter}\) has characteristics similar to those of matrix \(S\) due to the property of one-hot encoding. To obtain the final result, we can do a row-wise OR reduction. Since CMOS technology is more area-efficient when we fuse the AND and OR operations together, instead of computing one row at a time, we can compute two rows together, hence the variable NUM_OPS is computed as \(\lceil \frac{NUM\_INPUT}{2} \rceil\). Readers are encouraged to work out the process with some simple examples.

The AOI mux is an example of how we can express the same logic in a clever way that is optimized for CMOS technology. This kind of optimization requires keen insight into the logic as well as a deep understanding of logic synthesis. Unless required, we do not recommend hand-optimizing common logic such as adders or multiplexers, since doing so is error-prone and may not achieve better results than the synthesis tools. Use the syntactic sugar offered by the SystemVerilog language and let synthesis tools do the heavy lifting. If the code follows the coding style, synthesis tools can recognize it easily and perform automatic optimization.

4.5 Wishbone Protocol: A Case Study

A common place for bugs to occur is the interface between components, where each component may have different design assumptions. One approach to limiting such bugs is to adhere to a well-specified protocol that every component follows, which reduces interface errors. In this section we will take a look at a simple yet complete protocol, namely Wishbone, and how we can write RTL code based on the spec.

Unlike protocols such as AXI4, Wishbone is an open-source hardware bus interface, which allows engineers and hobbyists to share public domain designs.

4.5.1 Wishbone Introduction

Wishbone bus consists of two channels: a request channel which can either be read or write, and an acknowledge (ACK) channel. These two channels connect the bus master and slave together, as shown in the figure below.

Figure 11: Wishbone channel diagram

The master has a list of signals defined by the specification. Although the specification explicitly states that IPs can change the interface names (PERMISSION 2.0.0), we will use the names from the specification to make it easier to compare with the document. Notice that the specification follows the naming convention that the suffix _O indicates an output port and _I indicates an input port.

There is a list of signals that are shared between the master and slave interfaces:

Table 4: Interface signals shared between Wishbone master and slave.
| Signal Name | Function |
| --- | --- |
| CLK_I | All Wishbone output signals are registered at the rising edge of CLK_I. All Wishbone input signals are stable before the rising edge of CLK_I |
| DAT_I | The data input array to pass binary data. Maximum 64-bit |
| DAT_O | The data output array to pass binary data. Maximum 64-bit |
| RST_I | Reset signal. This signal only resets the Wishbone interface, not required to reset the other part of the IP. |
| TGD_I | Data tag type, which contains additional information about the data. Must be specified in the IP datasheet. |
| TGD_O | Data tag type, same as TGD_I |

We’ll ignore TGD_I and TGD_O in this section, but keep in mind that they can transfer very useful metadata information such as error checking code to protect data.

The table below shows the complete interface ports for the master (excluding the shared ports).

Table 5: Wishbone master interface ports.
| Signal Name | Function |
| --- | --- |
| ACK_I | The acknowledge indicates the normal termination of a bus cycle |
| ADR_O | The address used for read/write request |
| CYC_O | The cycle output. When asserted, indicates a valid bus cycle in progress |
| STALL_I | When asserted, indicates that the current slave is not able to accept the transfer |
| ERR_I | When asserted, indicates an abnormal cycle termination |
| LOCK_O | When asserted, indicates the current bus cycle is uninterruptible |
| RTY_I | When asserted, indicates that the interface is not ready to accept/send data and the cycle should be retried |
| SEL_O | Indicates where valid data is expected on the DAT_I signal array during read cycles, and where it is placed on the DAT_O signal array during write cycles |
| STB_O | The strobe output indicates a valid data transfer cycle. It is used to qualify other signals on the interface. |
| TGA_O | Address tag type, which contains information associated with address lines, which can be qualified by STB_O. |
| TGC_O | Cycle tag type, which contains information associated with bus cycles, which can be qualified by signal CYC_O. |
| WE_O | Write enable output, which indicates whether the current local bus cycle is a read or write cycle. |

Again, we will ignore tag information. Interested readers should check out the specification.

The slave interface is symmetric to the master interface: a port XX_I on the master has a corresponding port XX_O on the slave and vice versa. In general, the Wishbone interface is simpler than other bus interfaces such as the Advanced Microcontroller Bus Architecture (AMBA), which is why we can explain the protocol here without lengthy details.

4.5.2 Wishbone Master Example

We present here a simplified version of the master module, where the read/write behavior is controlled via a simple interface. In any real-world design, we would need to connect the master to an IP that directly controls the master’s behavior. We also drop the tag, lock, and byte-select signals for simplicity, but keep in mind that a real IP interface needs to implement them as well! We will focus on register reads and writes instead of block transfers; we will also drop corner-case handling such as error and retry. Interested readers should try to implement block transfers and the other missing features.

First, we need to define the IO ports, where the width of the data is parametrized by WIDTH. We also need to add other parameters for the control and data signals.

Notice that STALL_I is essentially the inverse of the slave’s ready signal, and STB_O is the valid signal. With that in mind, we can quickly sketch out the logic for sending commands based on the control signals. In Wishbone, every output is registered. Since we need to wait for the slave to acknowledge the transaction, we need an FSM to track the transmission state (we will use a 2-block FSM to implement this). Since we are only interested in a single register transfer, there is no need to keep track of the number of words transferred.

Based on the state, we have three different outputs:

We then need to change the state based on the control signals. Since we are only interested in one-word transfers, we start the transaction when the external control signal enable is high and the slave is ready. Based on whether it is a read or a write request, we set the WB control and data signals differently. After initiating the transaction, the master enters the busy state, in which it waits for the slave to acknowledge. After that, the master signals the end of the transaction to the external client and returns to the idle state.
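The original listings for this module are not reproduced here. The sketch below is one possible reading of the description, not the author's code: the Wishbone port names and WIDTH come from the text, while ADDR_WIDTH, the client-side ports (enable, we, addr, wdata, rdata, done), and the state names are assumptions. As a simplification, CYC_O and STB_O are decoded from the flopped state rather than driven by dedicated registers, and the master waits in a REQUEST state until STALL_I is low instead of checking it before starting.

```systemverilog
module wb_master #(
    parameter int WIDTH      = 32,
    parameter int ADDR_WIDTH = 32
) (
    // shared Wishbone signals
    input  logic                  CLK_I,
    input  logic                  RST_I,
    input  logic [WIDTH-1:0]      DAT_I,
    output logic [WIDTH-1:0]      DAT_O,
    // master Wishbone signals
    input  logic                  ACK_I,
    input  logic                  STALL_I,
    output logic [ADDR_WIDTH-1:0] ADR_O,
    output logic                  CYC_O,
    output logic                  STB_O,
    output logic                  WE_O,
    // hypothetical client-side control interface
    input  logic                  enable,
    input  logic                  we,
    input  logic [ADDR_WIDTH-1:0] addr,
    input  logic [WIDTH-1:0]      wdata,
    output logic [WIDTH-1:0]      rdata,
    output logic                  done
);
    typedef enum logic [1:0] {IDLE, REQUEST, BUSY} state_t;
    state_t state, next_state;
    logic   finish;  // the bus cycle terminates in this clock cycle

    // block 1: state register plus request/response data registers
    always_ff @(posedge CLK_I) begin
        if (RST_I) begin
            state <= IDLE;
            done  <= 1'b0;
            WE_O  <= 1'b0;
        end else begin
            state <= next_state;
            done  <= finish;  // one-cycle pulse to the client
            if (state == IDLE && enable) begin
                // latch the request so it stays stable during the cycle
                ADR_O <= addr;
                DAT_O <= wdata;
                WE_O  <= we;
            end
            if (finish) rdata <= DAT_I;
        end
    end

    // block 2: next-state and output logic
    always_comb begin
        next_state = state;
        finish     = 1'b0;
        CYC_O      = 1'b0;
        STB_O      = 1'b0;
        case (state)
            IDLE: if (enable) next_state = REQUEST;
            REQUEST: begin
                CYC_O = 1'b1;
                STB_O = 1'b1;
                if (!STALL_I) begin
                    // request accepted; the slave may acknowledge right away
                    if (ACK_I) begin
                        finish     = 1'b1;
                        next_state = IDLE;
                    end else begin
                        next_state = BUSY;
                    end
                end
            end
            BUSY: begin
                CYC_O = 1'b1;
                if (ACK_I) begin
                    finish     = 1'b1;
                    next_state = IDLE;
                end
            end
            default: next_state = IDLE;
        endcase
    end
endmodule
```

In this sketch the client is expected to deassert enable once the request has been accepted; done and rdata become valid one cycle after the slave's acknowledge.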

4.5.3 Wishbone Slave Example

4.5.4 Interconnect and Arbiter