Simulation and Test Benches

A significant portion of the language is dedicated to test benches and testing. In this chapter we will cover some commonly used techniques for writing efficient test benches for your hardware designs.

6.1 How SystemVerilog Simulator Works

Before we delve into the details of how to write a proper test bench, we need to establish a solid understanding of how a simulator works and how it schedules events. This will help us troubleshoot bugs and errors in the future.

A specification-compliant SystemVerilog simulator follows a discrete event execution model, where the simulation time advances with value updates. The hardware design is inherently parallel, where processes such as always_comb and always_ff are executed concurrently. Each time the value of a net/variable changes, we have an update event, and any process that is sensitive to that event needs to be evaluated as well, which is called an evaluation event. At each “timestamp”, the simulator first processes the update events, then runs the resulting evaluation events, and loops back to see whether those evaluations triggered more update events.

The term for “timestamp” in SystemVerilog is simulation time. It can be transformed back to real time using the timescale compiler directive introduced earlier in the book. We use simulation time, or simply time, throughout the entire chapter to avoid confusion.

Although the design and test bench are parallel by nature, most simulators are single-threaded and follow certain rules when evaluating the code to ensure it is conceptually correct. Typically the simulator divides each time slot into multiple regions, and events are scheduled into these regions in a pre-defined order. Within each region, the events can be scheduled arbitrarily, allowing the simulator to perform optimizations when it sees fit. Figure 12 shows how the time slot is divided into different regions and the execution flow between different regions.

Figure 12: Event scheduling regions. Image taken from SystemVerilog LRM Figure 4-1

PLI regions will be discussed in much detail later in the book. For now it is enough to know that there are regions reserved for third-party libraries that can be loaded into the simulator and can have direct access to the simulator state.

Fully covering each region requires lengthy detail, and readers are encouraged to read through the LRM and even try to implement a simple interpreter-based simulator. We will focus on three major regions: the active events region, the inactive events region, and the NBA events region.

Generally speaking, any events (e.g. blocking assignments) specified in always_comb and continuous assignments are evaluated in the active events region. The simulator continues to evaluate the events in the active events region in a loop until no events are left in the region. If there is an explicit zero-delay timing control, i.e. #0, in the process, the process is suspended and the following events are scheduled into the inactive events region. Again, the simulator runs in a loop to clear out the events in the inactive events region.

The NBA events region contains nonblocking assignment updates. It is only executed after the preceding active and inactive regions are cleared.

6.1.1 Simulation order

The SystemVerilog LRM guarantees a certain scheduling order. Any simulator that claims to be standard compliant should obey the following execution order:

  1. Statements within a begin-end block shall be executed in lexical order, i.e., the order in which they appear in the source code.
  2. NBAs shall be performed in the order in which the statements were executed.

To understand the second requirement, let’s consider the following example:
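A minimal sketch of such a case, assuming a module named nba_order (a hypothetical name) in which a single variable a receives two nonblocking assignments in the same initial block:

```systemverilog
// Sketch only: the module name and the trailing $display are illustrative assumptions.
module nba_order;
    logic a;

    initial begin
        a <= 0;  // first NBA update scheduled for a
        a <= 1;  // second NBA update scheduled for a
        // Both updates are applied in the NBA region in the order they were
        // executed, so the later one determines the value seen afterwards.
        #1 $display("a = %b", a);
    end
endmodule
```

Because the updates must be applied in execution order, the value observed after the time step is the one written by the second statement.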

At the end of the time step, variable a is first assigned 0 and then 1.

As one can suspect, such ordering poses a hard restriction on reordering-related compiler optimizations. Simulation vendors typically employ different types of optimization that ensure the semantics are met, but not necessarily the actual ordering of execution. For instance, if no third-party entity is expected to read out the exact simulation order (e.g. a debugger that allows stepping through the code), the simulator can reorder the statements as long as they are free of side effects and the result matches the ordering semantics. This significantly speeds up the simulation but requires extra flags if users wish to debug and step through the code, e.g. the -line_debug flag in Xcelium. Verilator, on the other hand, only offers a reordered simulation order for the sake of performance. As a result, it is not standard compliant.

The SystemVerilog LRM, however, does not specify the order in which processes are evaluated. As a result, it is up to the simulator to decide which process to execute first. This introduces nondeterminism among simulators. Another source of nondeterminism comes from the fact that the simulator may suspend a process and place the partially completed events as pending events in the event region whenever it encounters a timing control statement. This typically happens in the test bench instead of the RTL design, since synthesizable RTL disallows timing control except for the event control of always_ff.

6.2 Timing Controls

Timing is one of the most important factors to consider when writing a test bench. Should the signal be stable before the clock edge, and how long should the signal be valid for? What does delay mean? This section will cover various aspects of timing controls.

The compiler directive `timescale specifies the time unit and the precision used by the modules that follow it. Since different modules may have different timescales, the simulator needs to make a decision on how to represent simulation time. In most simulators, in fact any simulator that supports the VPI standard (discussed later), simulation time is represented as an unsigned 64-bit integer, even though the RTL model may express delays as real numbers. To do so, each delay is rounded off to the specified precision and then scaled to the simulation time unit. Consider the following example:
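A sketch consistent with the numbers worked through in this section, assuming module A is compiled under `timescale 1ns/10ps and module B under `timescale 1us/10ns. The directive values are inferred from the arithmetic, and the module bodies are illustrative assumptions:

```systemverilog
// Sketch only: timescale values and module bodies are assumptions.
`timescale 1ns/10ps
module A;
    logic a;
    // 1.2 time units of 1ns, rounded to a precision of 10ps
    initial #1.2 a = 1;
endmodule

`timescale 1us/10ns
module B;
    logic b;
    // 3.4 time units of 1us, rounded to a precision of 10ns
    initial #3.4 b = 1;
endmodule
```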

For all modules, 10ps is the finest precision, so 1 simulation time unit corresponds to 10ps. Before we convert every delay into simulation time, we first round the delay to the module’s precision. So 1.2 in module A becomes \(1.2ns = 120 \times 10ps\), i.e. 120 units of 10 picoseconds; 3.4 in module B becomes \(3.4us = 340 \times 10ns\), i.e. 340 units of 10 nanoseconds. Then we scale everything into simulation time. Hence 1.2 in module A becomes 120 simulation time units and 3.4 in module B becomes 340000 simulation time units, where each unit is 10 picoseconds.

To obtain the simulation time, we can use $time, which can be printed out either via %d or %t in the $display function.

The most common usage of timing control is setting the clock. A standard code style is shown below:

Notice that the clock changes its value every 10 units of time, hence the clock period is 20 units of time. Because this always block runs forever, we have to terminate the simulation with the builtin SystemVerilog task $finish, as shown below:
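A minimal sketch of this clocking style, assuming a module named clk_ex and a run length of 100 time units (both are illustrative choices):

```systemverilog
// Sketch only: module name and run length are assumptions.
module clk_ex;
    logic clk;

    initial clk = 0;
    // Toggle every 10 time units, giving a period of 20 time units
    always #10 clk = ~clk;

    initial begin
        #100;
        // $time can be printed with either %d or %t
        $display("stopping at %0d (%t)", $time, $time);
        $finish;  // without this, the always block would run forever
    end
endmodule
```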

To synchronize the values against the clock, we highly discourage readers from setting delays by hand, which is error-prone and reduces readability. Instead, we recommend using event controls (@) directly. Here is an example:

In this way, we are guaranteed that signals input1 and input2 are set before the rising edge of the clock signal, regardless of the clock period! If you have checking/assertion logic, you can place it after the negative edge of the clock, assuming there is no synchronous logic that depends on the negative edge of the clock in your design (dual-edge triggering typically happens in some high-performance designs), as shown below:
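A sketch of this drive-then-check pattern, assuming a stand-in registered adder as the design (the adder, the signal sum, and the module name are hypothetical and only there to make the example self-contained):

```systemverilog
// Sketch only: the adder "DUT" and all names besides input1/input2 are assumptions.
module sync_tb_ex;
    logic       clk = 0;
    logic [7:0] input1, input2, sum;

    always #10 clk = ~clk;

    // Stand-in design: registers the sum on the rising edge
    always_ff @(posedge clk) sum <= input1 + input2;

    initial begin
        // Inputs are set before the rising edge, independent of the period
        input1 = 8'd1;
        input2 = 8'd2;
        @(posedge clk);  // the design samples the inputs here
        @(negedge clk);  // outputs have settled by the falling edge
        assert (sum == 8'd3) else $error("unexpected sum: %0d", sum);
        $finish;
    end
endmodule
```

Since no absolute delay appears in the stimulus, changing the clock period does not require touching the initial block.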

We will discuss more complex but reusable test bench design pattern later in the chapter.

6.2.1 Fork and Join

Because hardware is inherently concurrent, in many cases we want to have multiple threads performing tasks at the same time, either driving or checking different parts of the design. SystemVerilog offers fork and join semantics that are similar to those of software programming languages, e.g. std::thread in C++.

The general syntax for fork and join is shown below. Notice that each statement inside the fork join is an individual thread, so if you want complex logic, you need to enclose it in a begin-end block.

Here is a simple example to illustrate how to use fork and join:
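A minimal sketch of such an example, assuming a module named fork_join_ex (hypothetical) and delays chosen to agree with the finish times reported in this section:

```systemverilog
// Sketch only: module name and delay values are assumptions.
module fork_join_ex;
    initial begin
        fork
            begin  // thread 1: multiple statements need begin-end
                #10;
                $display("Thread 1 finished at %t", $time);
            end
            begin  // thread 2
                #5;
                $display("Thread 2 finished at %t", $time);
            end
            // thread 3: a single statement is a thread on its own
            #20 $display("Thread 3 finished at %t", $time);
        join
        $finish;
    end
endmodule
```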

Running the file (code/06/ with xrun, we get:

Thread 2 finished at                    5
Thread 1 finished at                   10
Thread 3 finished at                   20

Notice that you can even have nested fork join, i.e. one thread can spawn multiple threads as well. Although the fork join semantics is similar to software programming languages, there are some properties we need to keep in mind:

  1. All statements are executed concurrently, regardless of whether it is simulated on a single CPU core or not.
  2. Timing controls are local to each fork block and are computed relative to the simulation time when entering the block.
  3. It is always a good practice to name the fork block, especially when you’re creating variables inside, as shown below:

  4. Since fork and join is part of SystemVerilog’s timing control, it is not allowed inside a function. You need to use a task instead.
  5. Any objects declared inside the fork-join block are managed by the simulator, so we don’t need to worry about dangling references or memory leaks. However, they should be declared as automatic so that they are local to the block.
  6. You cannot put fork-join inside always_comb.

Different Join Semantics

There are three different join keywords we can use in SystemVerilog and each has different semantics:

  • join: this keyword blocks the execution until all the forked processes finish. This is similar to join() in software threads.
  • join_any: this keyword blocks until any of the forked processes finishes. As a result, some processes may still be running when the execution of the main thread continues.
  • join_none: this keyword does not block execution; the forked processes continue to execute in the background.
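The three keywords can be compared side by side with a sketch like the following, which also names each fork block. The module name, block labels, and delays are illustrative assumptions, and wait fork is used at the end to let the background threads finish:

```systemverilog
// Sketch only: names and delays are assumptions.
module join_ex;
    initial begin
        fork : wait_all
            #5  $display("[join] a done @ %0t", $time);
            #10 $display("[join] b done @ %0t", $time);
        join  // resumes only after both threads finish
        $display("after join @ %0t", $time);

        fork : wait_first
            #5  $display("[join_any] a done @ %0t", $time);
            #10 $display("[join_any] b done @ %0t", $time);
        join_any  // resumes once the first thread finishes
        $display("after join_any @ %0t", $time);

        fork : no_wait
            #5 $display("[join_none] a done @ %0t", $time);
        join_none  // resumes immediately
        $display("after join_none @ %0t", $time);

        wait fork;  // block until every child thread has finished
        $finish;
    end
endmodule
```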

6.3 Standard Data Structures

SystemVerilog introduces many common data structures to help designers build complex test logic. These data structure interfaces are heavily influenced by the C++ standard library. We will take a quick look at some commonly used data structures. Interested readers should refer to the LRM for more information. Keep in mind that none of the data structures introduced in this section are synthesizable, as with any construct discussed in this chapter.

6.3.1 Dynamic Array

Most arrays in SystemVerilog are fixed-size and their dimensions cannot be changed at run time. A dynamic array, as its name suggests, is an unpacked array whose dimension can be changed at run time. To declare a dynamic array we can use the following syntax:

You can also combine it with other arrays, as shown below, which declares a fixed-size array of dynamic arrays.

To initialize the dynamic array, we can use the keyword new with the targeted dimension:

Keep in mind that even though we have allocated the dynamic array, its elements only hold the default value of the element type. As a result, you get x when reading the elements of a 4-state type such as logic.

To loop through the array, we can simply do

Notice that we implicitly create an index variable i with the foreach keyword.

Below is a list of methods associated with the dynamic array:

  • size(): in addition to the standard system function $size(), a dynamic array has a method that returns the size of the array.
  • delete(): clears all the elements, leaving an empty array.
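The declaration, allocation, iteration, and methods described above can be put together in a short sketch. The module and variable names are illustrative assumptions:

```systemverilog
// Sketch only: names and sizes are assumptions.
module dyn_array_ex;
    int data[];      // dynamic array of int
    int banks[4][];  // fixed-size array of 4 dynamic arrays

    initial begin
        data     = new[8];  // allocate 8 elements
        banks[0] = new[2];  // each inner array is sized separately

        // foreach implicitly declares the index variable i
        foreach (data[i]) data[i] = i * 2;

        $display("size = %0d, $size = %0d", data.size(), $size(data));
        data.delete();      // data is now empty
        $display("size after delete = %0d", data.size());
    end
endmodule
```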

6.3.2 Queue

A queue is SystemVerilog’s equivalent of vector in C++. To declare a queue, we can use the following syntax:

Like normal arrays, queues support slicing operations:

  • Like the usual slicing operator, the indexing is inclusive, that is, queue[a:b] returns b - a + 1 elements.
  • If the slice is out of range or malformed, e.g., queue[1:0], an empty queue is returned.
  • If any 4-state value containing x or z is used for slicing, an empty queue is returned.

Looping through the queue is the same as looping through dynamic arrays:

Below is a list of methods associated with the queue:

  • size(): in addition to the standard system function $size(), size() returns the size of the queue.
  • delete(index): deletes the element at the given index; if index is not provided as a function argument, clears the queue.
  • insert(index, value): inserts the value at the given index.
  • push_back(value): puts the element at the end of the queue.
  • pop_back(): removes and returns the last element of the queue. If the queue is empty, the default value for the data type is returned and a warning may be issued.
  • push_front(value): puts the element at the front of the queue.
  • pop_front(): removes and returns the first element of the queue. If the queue is empty, the default value for the data type is returned and a warning may be issued.
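A short sketch exercising these methods and a slice, with the queue contents tracked in comments. The module and variable names are illustrative assumptions:

```systemverilog
// Sketch only: names and values are assumptions.
module queue_ex;
    int q[$];     // queue of int
    int part[$];  // holds a slice of q
    int head;

    initial begin
        q.push_back(1);
        q.push_back(2);
        q.push_front(0);       // q = {0, 1, 2}
        q.insert(1, 5);        // q = {0, 5, 1, 2}
        head = q.pop_front();  // head = 0, q = {5, 1, 2}
        part = q[0:1];         // inclusive slice: {5, 1}

        $display("head = %0d, size = %0d, slice size = %0d",
                 head, q.size(), part.size());
        foreach (q[i]) $display("q[%0d] = %0d", i, q[i]);

        q.delete();            // no index given: clears the queue
    end
endmodule
```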

6.3.3 Associative Array

An associative array is SystemVerilog’s equivalent of map containers in C++. The index expression can be any legal SystemVerilog type and the size of the container grows as more elements are inserted. To declare an associative array, we can use the following syntax:

SystemVerilog supports using * as a wildcard for index type with the following restrictions:

  1. The index type must be an integral type, but indices can have different sizes. The “true value” is used for indexing; that is, SystemVerilog needs to resolve two values with different sizes to the same index location if their values match.
  2. 4-state values containing x or z are illegal.
  3. Non-integral index types/values are illegal and will result in an error.
  4. Strings can be used, but will be cast to integral values.

To initialize the associative array when declaring it, we can use the following syntax:

Similar to other data structures, we can loop through the associative array using foreach keyword:

Below is a list of useful methods for associative arrays:

  • size(): in addition to the standard system function $size(), size() returns the number of elements in the associative array.
  • delete([index]): if index is provided, deletes the index and its associated value from the array. If index is not provided as a function argument, clears the entire array.
  • exists(index): returns 1 if the element with the given index exists and 0 otherwise.
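A sketch that declares a string-indexed associative array with an initializer and then uses the methods above. The module name, variable name, and contents are illustrative assumptions:

```systemverilog
// Sketch only: names and contents are assumptions.
module assoc_ex;
    // Associative array indexed by string, initialized at declaration
    int scores[string] = '{"alice": 90, "bob": 75};

    initial begin
        scores["carol"] = 82;  // the container grows on insertion

        if (scores.exists("bob"))
            scores.delete("bob");  // remove a single entry

        // foreach implicitly declares the index variable name
        foreach (scores[name])
            $display("%s -> %0d", name, scores[name]);

        $display("entries = %0d", scores.size());
        scores.delete();  // no index given: clears the whole array
    end
endmodule
```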

6.4 Event Control and Synchronization

Because the programming model of a standard RTL test bench requires concurrency and therefore synchronization, SystemVerilog offers various constructs and keywords to help programmers reason about the concurrency.

The basic synchronization unit is the event, which can be either named or unnamed. An unnamed event is created implicitly by detecting value changes on nets and variables. There are three types of value changes that can trigger an event:

  • posedge: it happens when the net changes from 0 to a non-zero value, or from x/z to 1, e.g. 0 -> 1 or 0 -> x
  • negedge: it happens when the net changes from 1 to a non-one value, or from x/z to 0, e.g. 1 -> 0 or 1 -> x
  • edge: it happens whenever posedge or negedge happens.

Only integral values or strings can be used in the implicit event.

To synchronize the logic with edge-triggered events, we need to use @ keyword as shown below. Notice that we have seen the event control in always_ff and previous sections on how to write a simple test bench!

Events can also be OR-ed together so that the code can be synchronized by any of the events, as shown below:

Notice that SystemVerilog also offers syntactic sugar that uses a comma (,) as the OR operator in event expressions, which we have seen in always_ff earlier:
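The three forms of edge-triggered event control discussed here can be sketched in one place. The module name, signal names, and delays are illustrative assumptions:

```systemverilog
// Sketch only: all names and delays are assumptions.
module event_ctrl_ex;
    logic clk   = 0;
    logic rst_n = 1;

    always #5 clk = ~clk;
    initial #12 rst_n = 0;  // assert the active-low reset between clock edges

    initial begin
        @(posedge clk);                   // single edge event
        $display("@(%0t) posedge clk", $time);

        @(posedge clk or negedge rst_n);  // whichever event happens first
        $display("@(%0t) clock edge or reset", $time);

        @(posedge clk, negedge rst_n);    // comma form, same meaning as or
        $display("@(%0t) clock edge or reset", $time);
        $finish;
    end
endmodule
```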

Another way to synchronize events is to block the execution until a condition becomes true. This is called level-sensitive, as opposed to edge-sensitive in the case of using @. To do so, we need the wait keyword, which evaluates a specified condition. If the condition is false, the following procedural statement is blocked until that condition becomes true. Below shows an example (code/06/ of wait with fork:
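A minimal sketch of wait combined with fork, assuming a module named wait_ex (hypothetical) and a delay of 10 time units chosen to agree with the printout reported in this section:

```systemverilog
// Sketch only: module name and delay are assumptions.
module wait_ex;
    logic a = 0;

    initial begin
        fork
            #10 a = 1;  // one thread updates the value
            begin       // the other blocks until the condition holds
                wait (a);
                $display("@(%0t) a = %0d", $time, a);
            end
        join
    end
endmodule
```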

After running the example we will see the following printout, which is expected.

@(10) a = 1

Although @ and wait seem similar, they are fundamentally different since one is edge-triggered and the other is level-triggered. One direct implication of this is that they are scheduled differently in the simulator.

A named event can be constructed through the builtin type in SystemVerilog, event, which allows aliasing, as shown below.

To trigger a named event, we can use -> and ->>. ->> is the non-blocking version of ->. To wait for an event to be triggered, we can use triggered with wait keyword, as shown below.
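A sketch of a named event with an alias, assuming a module named event_ex and event names e and e_alias (e_alias and the module name are hypothetical). The trigger is sent through the alias and observed through the original handle:

```systemverilog
// Sketch only: module name, e_alias, and the delay are assumptions.
module event_ex;
    event e;
    event e_alias;

    initial begin
        e_alias = e;  // both handles now refer to the same event
        fork
            #10 -> e_alias;  // blocking trigger; ->> is the nonblocking form
            begin
                wait (e.triggered);
                $display("@(%0t) e is triggered", $time);
            end
        join
    end
endmodule
```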

We should expect similar output as the wait example:

@(10) e is triggered

There are several advantages of using events compared to using normal signals:

  • Events can be passed into tasks and other hierarchy due to their aliasing semantics.
  • Events avoid a common case of race condition. Consider the following example:

    If the simulator evaluates the wait statement and updates the value in the same time step, the order of execution is undetermined, which is a race condition. Waiting on triggered, however, is guaranteed to work properly regardless of the order of execution, because the triggered state persists until the end of the time step in which the event was triggered.
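The same-time-step case can be sketched as follows, assuming a module named event_race_ex (hypothetical). The trigger and the wait are both executed at time 0, so the simulator is free to run either thread first:

```systemverilog
// Sketch only: module and event names are assumptions.
module event_race_ex;
    event e;

    initial begin
        fork
            -> e;  // trigger in the same time step as the wait below
            begin
                // Unblocks whether or not the trigger above ran first.
                // An @(e) here could miss the trigger and block forever.
                wait (e.triggered);
                $display("@(%0t) saw e", $time);
            end
        join
    end
endmodule
```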

6.4.1 Semaphore: How to Avoid Race Conditions

A natural challenge in a concurrent software system is the race condition. Since hardware simulation is typically done in software, race conditions can happen if not taken care of. SystemVerilog offers a construct called semaphore to facilitate shared resource synchronization. In this section we assume readers have some basic knowledge of POSIX Threads (pthread). If not, we highly recommend reading over the Linux manual page of pthreads(7) and other related pages.

To initialize the semaphore we can use the following syntax, where we declare and initialize a semaphore s with 10 initial resources:

To get a certain number of resources from the semaphore, we can use the get() method. Notice that this method is blocking, meaning that the next procedural statement will be evaluated only after the method returns, i.e. after successfully obtaining the desired resources.

To release resources back to the semaphore, we can use the put() method. This will unblock threads that are waiting for resources:

A best-effort attempt at getting resources can be made via try_get(). Notice that this method is non-blocking and the caller thread should check the return value: a positive integer means the requested resources were obtained, and 0 means they were not available.

Here is an example (code/06/ of semaphore with fork-join:
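A sketch of a semaphore shared by three forked threads, assuming a module named semaphore_ex (hypothetical) with 2 initial resources and delays chosen to agree with the finish times reported in this section:

```systemverilog
// Sketch only: module name, key counts, and delays are assumptions.
module semaphore_ex;
    semaphore s = new(2);  // two resources available initially

    initial begin
        fork
            begin  // thread 1 holds one resource for 10 time units
                s.get(1);
                #10 $display("Thread 1 finished @ %0t", $time);
                s.put(1);
            end
            begin  // thread 2 holds one resource for 20 time units
                s.get(1);
                #20 $display("Thread 2 finished @ %0t", $time);
                s.put(1);
            end
            begin  // thread 3 needs both resources, so it blocks
                #5 s.get(2);
                $display("Thread 3 finished @ %0t", $time);
                s.put(2);
            end
        join

        // Non-blocking attempt: a non-zero return means the keys were obtained
        if (s.try_get(2)) $display("try_get obtained both resources");
    end
endmodule
```

Thread 3 delays its request so that both resources are already taken when it asks, which makes it wait until threads 1 and 2 have both returned theirs.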

We should expect the following output:

Thread 1 finished @ 10
Thread 2 finished @ 20
Thread 3 finished @ 20

Although SystemVerilog does not offer a mutex construct, one can easily be implemented by setting the initial resource count to 1. Interested readers should try to implement a mutex in SystemVerilog.

6.4.2 Mailboxes: Thread-safe Messaging Passing

A mailbox is a message passing construct that allows message exchanges between different processes. As the name suggests, its design follows the concept of a “mailbox” in real life. That is, the mailbox has a fixed capacity and mail will be rejected if the mailbox is full: the delivery person needs to come back later to make another delivery attempt. Similarly, a mailbox in SystemVerilog is also a fixed-capacity container that is able to block a process’s delivery attempt if it is full.

To create a mailbox, we can use the following constructor:

Notice that the default constructor sets the capacity to 0, which implies unlimited capacity. In this case the mailbox functions as a FIFO with unlimited capacity.

To put a message into a mailbox, we can simply use put(obj). obj can be any expression or object handle. To get a message from a mailbox, we can use the get() method. Both put() and get() follow FIFO ordering, which can be a nice property for verification work. Notice that put() and get() are blocking, meaning that if the mailbox is full, put() will block the current process until there is an empty space in the mailbox, and get() will block until there is at least one message in the mailbox.

If non-blocking calls are required, we can use try_put() and try_get(). try_put() returns 0 if the mailbox is full and a positive integer if the action is successful. try_get() returns 0 if the mailbox is empty and a positive integer if the action is successful. Since try_get() assigns to a variable with a potentially incompatible type, a negative number is returned if a type error happens.

To check the number of messages in the mailbox, we can use the num() method. Please notice that there will be a race condition if a process calls num() first and then uses the result to decide whether to put/get messages. Since these two actions are not atomic, another process can act on the mailbox in between, so that by the time the message is put or retrieved, the earlier result from num() is no longer accurate! Designers should consider using try_get()/try_put() instead!
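The blocking and non-blocking calls can be sketched with one producer and one consumer sharing a bounded mailbox. The module name, block labels, capacity, and delays are illustrative assumptions:

```systemverilog
// Sketch only: names, capacity, and delays are assumptions.
module mailbox_ex;
    mailbox #(int) mb = new(2);  // bounded: at most 2 pending messages
    int v;

    initial begin
        fork
            begin : producer
                for (int i = 0; i < 4; i++) begin
                    mb.put(i);  // blocks while the mailbox is full
                    $display("@(%0t) put %0d", $time, i);
                end
            end
            begin : consumer
                repeat (4) begin
                    #10 mb.get(v);  // blocks while the mailbox is empty
                    $display("@(%0t) got %0d", $time, v);
                end
            end
        join

        // Non-blocking: returns 0 because every message has been consumed
        if (mb.try_get(v) == 0) $display("mailbox is empty");
    end
endmodule
```

Messages come out in the order they were put in, and the producer stalls whenever two messages are already pending.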

Here is an example (code/06/ of using various methods of the mailbox:

We will see the following output:

[0]: @(10) put in value: 0
[1]: @(10) get value: 0
[0]: @(20) put in value: 1
[1]: @(20) get value: 1
[0]: @(30) put in value: 2
[2]: @(30) get value: 2 after 30 attempts
[0]: @(40) put in value: 3
[2]: @(40) get value: 3 after 10 attempts

Notice that due to scheduling differences between simulators, you may see slightly different outputs, as thread 1 and thread 2 might swap output values. This is because thread 1 and thread 2 are competing to get messages from the same mailbox.

Keen readers may notice that one can implement a semaphore from a mailbox. Readers are encouraged to try it out!

6.5 Generator, Driver, Monitor, and Scoreboard Design Pattern

As the design gets more complex, it is common to go through several generations, where the micro-architecture of the design is tweaked in each generation to get better performance. To avoid wasting effort on building testing infrastructure for each generation, designers have adopted the so-called generator-driver-monitor-scoreboard pattern. The main idea behind this testing pattern is to reuse as many components as possible while testing different designs (each often called the design under test (DUT)).

The four components serve the following purposes:

  • Generator: generates test stimulus at the transaction level.
  • Driver: takes test stimulus and drives the DUT. Performs transaction-level to signal-level conversion.
  • Monitor: monitors the DUT interface and extracts signals of interest. Performs signal-level to transaction-level conversion.
  • Scoreboard: compares the transaction-level information to a gold model and reports if there is an error.

As long as the interface between each component is well-defined, we can either replace or reuse some components depending on the DUT. The components typically communicate through thread-safe channels, such as a mailbox, where the content of the message is typically a transaction class or struct. We will first cover the components one by one, then show a complete example of a test environment where all the components interact with each other.

6.5.1 A Simple Ready-Valid Design

Before we explain the test environment, let’s take a brief look at our DUT, which is a simple multiplier with a ready-valid interface. Ideally the multiplier would actually be implemented in a multi-cycle fashion. Since this is not the focus of this chapter, we will substitute it with a combinational multiplier followed by several fake pipeline stages. Readers should refer to the relevant section in the design chapter if unfamiliar with the ready-valid design pattern. Here is the code for our DUT (code/06/ Notice that we mimic a pipelined multiplier for the sake of simplicity.

Our mult_ex listens to ready_in and uses the operands a and b to produce hi and lo. Once the result is ready, we set valid to high. Everything is controlled by a simple 2-block FSM.

Another design aspect we need to take care of is the interface. We will connect each component directly using an interface to make the code simpler and easier to maintain. Here is a simple interface design we will use (code/06/

Notice that we use modport to directly connect some outputs of the driver to the monitor. If needed, we can split the interface into two separate ones should the design get complex, i.e. one interface for the driver and one for the monitor.

6.5.2 Generator Design

The role of the generator is to produce input stimulus for our dut. There are many ways to produce the desired inputs; here are just a few commonly used ones:

  1. Constrained random. This method utilizes the simulator’s solver to produce large quantities of random yet valid inputs to test out the dut. We will discuss it later in the book.
  2. Trace replay. This is used to recreate a realistic testing environment where the inputs are obtained from real-world usage.
  3. Manually generated input sequence. For a small design, the testing sequence can be directly coded by the designers. Although it is not scalable, it is often used as a directed test for corner cases that are difficult to cover using constrained random.

To design a generator, we first need to write an input class that encapsulates all the input information for a transaction. In this case, since we’re working with a multiplier, we can just put the operand values in the class:

Notice that the GeneratorXact class is parametrized by operand width. In this case we can re-use the same class on multipliers with different widths.

To communicate with the driver, we will use a mailbox inside the generator, which will be passed in during the constructor. We’ll use simple random number generator for the operands, and we will cover constrained random later in the book. Here is the

Notice that we have a public task main() that’s used to produce input transactions. This task will be called inside the test environment.
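A sketch of how such a transaction class and generator might look, assuming operand fields named a and b, a class named GeneratorSketch, and a constructor argument num_xact for the number of transactions (the class bodies, GeneratorSketch, num_xact, and the demo module are all hypothetical):

```systemverilog
// Sketch only: class bodies, GeneratorSketch, num_xact, and generator_ex are assumptions.
class GeneratorXact #(int WIDTH = 16);
    logic [WIDTH-1:0] a;
    logic [WIDTH-1:0] b;
endclass

class GeneratorSketch #(int WIDTH = 16);
    mailbox #(GeneratorXact #(WIDTH)) mb;  // shared with the driver
    int num_xact;

    function new(mailbox #(GeneratorXact #(WIDTH)) mb, int num_xact);
        this.mb       = mb;
        this.num_xact = num_xact;
    endfunction

    // Produce num_xact transactions with random operands
    task main();
        repeat (num_xact) begin
            GeneratorXact #(WIDTH) x = new();
            x.a = WIDTH'($urandom());
            x.b = WIDTH'($urandom());
            mb.put(x);
        end
    endtask
endclass

module generator_ex;
    mailbox #(GeneratorXact #(16)) mb;
    GeneratorSketch #(16) gen;

    initial begin
        mb  = new();  // unbounded, so put() never blocks here
        gen = new(mb, 4);
        gen.main();
        $display("%0d transactions queued", mb.num());
    end
endmodule
```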

6.5.3 Driver Design

The role of the driver is to serialize the input stimulus onto the interface bus. Unlike the generator, it needs to understand the interface protocol our dut is using, in this case, a simple ready-valid handshake. It pulls the transaction from the mailbox used by the generator, and then drives the net, as shown in the code below (code/06/

Notice that in addition to the main() task, we have a reset task that’s responsible for initializing the dut. We also need to obey the ready-valid protocol, that is, we wait until the dut is ready, otherwise we hold the pending transaction and wait. The driver does not need to know details such as the total number of transactions. All it does is take one transaction from the mailbox (if any), and then drive the interface.

Also notice that we use a new syntax using the keyword virtual to get the reference for interface. Doing so allows us to directly set values to the interface as if the interface is an object.
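A sketch of a driver built around a virtual interface. The interface mult_if_sketch, its signals valid_in and ready, the fixed 8-bit operand width, and the classes XactSketch and DriverSketch are all hypothetical stand-ins, since the actual interface definition is not reproduced in this section:

```systemverilog
// Sketch only: every name below is an assumption.
interface mult_if_sketch (input logic clk);
    logic [7:0] a, b;
    logic       valid_in;  // driver -> DUT: operands are valid
    logic       ready;     // DUT -> driver: DUT can accept operands
    modport driver (input clk, ready, output a, b, valid_in);
endinterface

class XactSketch;
    logic [7:0] a, b;
endclass

class DriverSketch;
    virtual mult_if_sketch.driver vif;  // reference to the interface
    mailbox #(XactSketch) mb;           // filled by the generator

    function new(virtual mult_if_sketch.driver vif, mailbox #(XactSketch) mb);
        this.vif = vif;
        this.mb  = mb;
    endfunction

    task reset();
        vif.a        = '0;
        vif.b        = '0;
        vif.valid_in = 1'b0;
    endtask

    task main();
        XactSketch x;
        forever begin
            mb.get(x);            // blocks until a transaction arrives
            @(negedge vif.clk);   // drive away from the sampling edge
            vif.a        = x.a;
            vif.b        = x.b;
            vif.valid_in = 1'b1;
            // Hold the transaction until the DUT reports ready at a rising edge
            do @(posedge vif.clk); while (!vif.ready);
            @(negedge vif.clk);
            vif.valid_in = 1'b0;
        end
    endtask
endclass
```

The driver only knows the handshake and the mailbox, so the same class can be reused with a different generator.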

6.5.4 Monitor Design

The monitor taps into the interface bus and de-serializes signals into the high-level transaction class. Similar to the driver, it needs to understand the interface protocol and then put the transaction object into a mailbox shared with the scoreboard. Because it usually takes multiple cycles to complete the data collection, the monitor typically has internal state to store information.

Below shows the monitor that listens to the interface ports and gathers data when the dut is ready (code/06/

Similar to the driver, it uses a mailbox to interact with the scoreboard. It waits for the input valid signal to go high, then grabs the input signals a and b. It then waits until the valid output goes high, i.e. the dut has successfully computed the output, and grabs the outputs lo and hi. Once we have everything we need for the packet, we assemble the packet and put it into the mailbox. Notice that the monitor does not care whether the output of the computation is correct or not!

6.5.5 Scoreboard Design

Once we have a complete packet, we can compare the output against the model we have. We can also figure out whether any packet is missing or malformed. The scoreboard typically interfaces with a high-level functional model that’s written in either C/C++ or SystemVerilog. We will cover how to interface with a C/C++ model later. For now we will simply compute the gold output in SystemVerilog. Below shows the scoreboard for our multiplier (code/06/ :

Notice that we also keep track of number of transactions, in case there is some protocol bug that drops transaction packets. Once we get the transaction from the mailbox, we simply compute the gold output and assert the result.
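A sketch of a scoreboard along these lines, assuming a monitor transaction class with fields a, b, hi, and lo, where hi and lo are the upper and lower halves of the product (the class names MonitorXactSketch and ScoreboardSketch and the counter num_xact are hypothetical):

```systemverilog
// Sketch only: class names, field layout, and num_xact are assumptions.
class MonitorXactSketch #(int WIDTH = 16);
    logic [WIDTH-1:0] a, b;    // operands seen on the interface
    logic [WIDTH-1:0] hi, lo;  // upper and lower halves of the result
endclass

class ScoreboardSketch #(int WIDTH = 16);
    mailbox #(MonitorXactSketch #(WIDTH)) mb;  // filled by the monitor
    int num_xact = 0;                          // transactions checked so far

    function new(mailbox #(MonitorXactSketch #(WIDTH)) mb);
        this.mb = mb;
    endfunction

    task main();
        MonitorXactSketch #(WIDTH) x;
        logic [2*WIDTH-1:0] gold;
        forever begin
            mb.get(x);
            // Gold model: full-width product of the two operands
            gold = x.a * x.b;
            assert ({x.hi, x.lo} == gold)
                else $error("mismatch for %0d * %0d", x.a, x.b);
            num_xact++;
        end
    endtask
endclass
```

The environment can compare num_xact against the number of generated transactions to detect dropped packets.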

6.5.6 Test Environment Setup

Now that we have all the major components written, the next step is to set up the test environment. The role of the environment is to instantiate and run the test suites. Below shows an example of a test environment (`code/06/

In the test environment, we instantiate the test components as well as the mailboxes. Notice that the constructor takes the full interface and uses modports when instantiating test components. The main test task, test(), uses fork so that each component runs concurrently. We finish the test when the number of transactions received in the scoreboard equals the number generated by the generator. The entry task is run(), which first resets the dut, then calls test(), and eventually finish().

To use the test environment, we need the following test bench code (code/06/mult_top):

The test bench top drives the clock as well as the reset signal. Notice that in order to avoid infinite loop when we have a missing packet (the end condition will never trigger), we set a terminal condition based on the number of cycles run.
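The cycle-based terminal condition can be sketched on its own as follows. The module name watchdog_ex and the limit MAX_CYCLES are hypothetical, and the rest of the test bench top is omitted:

```systemverilog
// Sketch only: module name and MAX_CYCLES are assumptions.
module watchdog_ex;
    localparam int MAX_CYCLES = 1000;

    logic clk = 0;
    always #10 clk = ~clk;

    // Watchdog: ends the simulation even if the end condition never triggers
    initial begin
        repeat (MAX_CYCLES) @(posedge clk);
        $display("Timeout after %0d cycles", MAX_CYCLES);
        $finish;
    end
endmodule
```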