CVA6’s Instruction and Write-Back Data Caches

Our work on Culsans (see previous blog entry) forced us to become very familiar with the architecture of the caches in CVA6. We did an in-depth analysis of the RTL code and decided to share this information, given that there is not much documentation available on the topic.

CVA6’s source files include 3 different caches (their geometry is checked in the sketch after this list):

  • an instruction cache
    • 16kB
    • 4 ways, set-associative
    • 128-bit cache lines
    • VIPT
    • random replacement strategy (via LFSR)
  • a write-back (WB) data cache
    • 32kB
    • 8 ways, set-associative
    • 128-bit cache lines
    • VIPT
    • random replacement strategy (via LFSR)
  • a write-through (WT) data cache
    • same characteristics as the WB cache
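
As a quick sanity check of these parameters, the number of sets and the address-bit split can be derived directly from size, associativity and line width. Below is a minimal C sketch (the constants mirror the list above; the helper names are ours, not CVA6’s):

#include <stdio.h>

/* Derive the set count and address split of a cache from its size,
   associativity and line width. */
static void cache_geometry(const char *name, unsigned size_bytes,
                           unsigned ways, unsigned line_bits)
{
	unsigned line_bytes  = line_bits / 8;              /* 128 bit = 16 B */
	unsigned sets        = size_bytes / (ways * line_bytes);
	unsigned offset_bits = __builtin_ctz(line_bytes);  /* log2(16)  = 4  */
	unsigned index_bits  = __builtin_ctz(sets);        /* log2(256) = 8  */
	printf("%s: %u sets, %u offset bits, %u index bits\n",
	       name, sets, offset_bits, index_bits);
}

int main(void)
{
	cache_geometry("I-cache", 16 * 1024, 4, 128); /* 256 sets, 4+8 bits */
	cache_geometry("D-cache", 32 * 1024, 8, 128); /* 256 sets, 4+8 bits */
	return 0;
}

In both cases the 4 offset bits plus 8 index bits fit within the 12-bit page offset of 4 KiB pages, which is what allows the VIPT organization to work without aliasing.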

In this post we are going to focus on the first two caches, since a very well-written paper by Thales and the University Grenoble Alpes, focused on the WT cache, is already available.

By default the std_cache_subsystem module is instantiated: this includes the I-cache and the WB D-cache. By defining the parameter WT_DCACHE it is possible to use the wt_cache_subsystem instead.

Cache subsystem

std_cache_subsystem

The cache subsystem includes the two caches and an arbiter that selects which of the requests coming from the caches is propagated to the AXI interface.

Instruction cache

The instruction cache receives fetch requests from the frontend module. If an address is not present in the cache, a translation request must be forwarded to the MMU (when caching is enabled).

The following diagram depicts the state machine controlling the cache behaviour.

icache_fsm

At reset, or when a flush command is received (e.g. via a FENCE.I instruction), the controller empties the content of the cache; this operation consists of resetting all the valid bits – no writeback is needed, since the instruction cache never modifies the data.

From the IDLE state the controller goes into the READ state upon reception of a request from the frontend unit. The controller checks whether the data is already in the cache memory; in case of a hit, the controller can accept a second request (if present), otherwise it goes back to the IDLE state; in case of a miss, the controller goes into the MISS state, waits for the data from main memory, then goes back to IDLE. Since kill operations (e.g. due to an exception) are asynchronous events, in the sense that they can occur while a memory operation is ongoing, the controller must wait for the completion of the address translation or of the AXI transactions before becoming IDLE again. Dedicated KILL_* states have been introduced for this purpose.
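
The following much-simplified C model summarizes these transitions (a sketch only: the real RTL distinguishes several KILL_* states depending on which transaction is outstanding):

typedef enum { FLUSH, IDLE, READ, MISS, KILL_MISS } icache_state_t;

/* Simplified next-state function of the I-cache controller. */
icache_state_t icache_next(icache_state_t s, int req, int hit,
                           int refill_done, int kill)
{
	switch (s) {
	case FLUSH:
		return IDLE;                        /* valid bits cleared       */
	case IDLE:
		return req ? READ : IDLE;
	case READ:
		if (kill) return IDLE;              /* nothing left outstanding */
		if (hit)  return req ? READ : IDLE; /* back-to-back requests    */
		return MISS;
	case MISS:
		if (kill) return KILL_MISS;         /* AXI transaction ongoing  */
		return refill_done ? IDLE : MISS;
	case KILL_MISS:
		return refill_done ? IDLE : KILL_MISS; /* drain, then idle      */
	}
	return IDLE;
}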

Data cache

The following diagram schematically represents the architecture of the data cache.

dcache

It is composed of 6 elements:

  • 3 cache controllers
  • a miss handler block
  • a tag comparator
  • the memory itself (8 SRAM blocks for the data, 8 SRAM blocks for the tags – 8 being the number of ways)

There is one cache controller for each of the possible sources of data requests: the store unit, the load unit and the page table walker (PTW). These modules compete with each other, and with the miss handler, to access the tag comparator (and therefore the cache content).

cache_ctrl_fsm

Upon a data request, the cache controller issues a request to the tag comparator. The tag comparator can process only one request at a time, therefore the incoming requests are prioritized: miss handler requests, if they are not related to AMO operations, have the highest priority; then come PTW requests, load unit requests, store unit requests and, finally, AMO requests.
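
In other words, the selection logic behaves like a fixed-priority arbiter; here is a C sketch (the request flags and names are illustrative, not the RTL signal names):

typedef enum { NONE, MISS_HANDLER, PTW, LOAD_UNIT, STORE_UNIT, AMO } grant_t;

/* Fixed-priority selection of the next tag-comparator user.
 * req[] flags one pending request per source; miss-handler requests
 * related to AMOs drop to the lowest priority. */
grant_t select_requester(const int req[], int miss_is_amo)
{
	if (req[MISS_HANDLER] && !miss_is_amo) return MISS_HANDLER;
	if (req[PTW])                          return PTW;
	if (req[LOAD_UNIT])                    return LOAD_UNIT;
	if (req[STORE_UNIT])                   return STORE_UNIT;
	if (req[MISS_HANDLER] && miss_is_amo)  return AMO;
	return NONE;
}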

In case of a hit on a read request, the cache controller can immediately issue a second request, if available. In case of a hit on a write request, the cache controller goes into the STORE_REQ state, waits until the update of the cache content has completed, and then goes back to the IDLE state. In case of a miss, the controller must wait for the data coming from the shared memory; a request is sent to the miss handler, which takes care of interfacing with the outside world. The FSM first goes into WAIT_REFILL_GNT, to wait for the fetch of the data performed by the miss handler, then into IDLE, WAIT_CRITICAL_WORD or WAIT_REFILL_VALID depending on whether the request was a write, a read or a bypass, respectively. The "critical word" is the word pointed to by the address specified in the request; the name distinguishes it from the whole cache line (2 × 64-bit words). The critical word is the one forwarded to the requesting port in case of a read. WAIT_REFILL_VALID is probably a confusing name for a state that waits for data coming from the bypass AXI interface; accesses to non-cacheable memory locations are handled like cache misses.
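
Since a 128-bit line holds exactly two 64-bit words, selecting the critical word boils down to a single address bit; a sketch:

#include <stdint.h>

/* Bit 3 of the address selects which of the two 64-bit words in a
 * 128-bit cache line is the "critical" one to forward to the requester. */
uint64_t critical_word(const uint64_t line[2], uint64_t addr)
{
	return line[(addr >> 3) & 1];
}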

The miss status holding register (MSHR) is a data structure which

  1. stores the information about the cache miss which is currently being processed
  2. is used to synchronize the cache controllers and the miss handler: each cache controller has to wait for the miss handler to finish its operations (e.g. evicting the same cache line that the cache controller is writing, or operating on the same address the cache controller is reading) before accessing the cache – the WAIT_MSHR state accomplishes this function.
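
In C terms, the MSHR can be pictured as a single-entry record plus the conflict check performed in WAIT_MSHR (a sketch; field names are illustrative, not the RTL signal names):

#include <stdint.h>
#include <stdbool.h>

/* Single-entry miss status holding register. */
typedef struct {
	bool     valid;  /* a miss is currently being processed      */
	bool     we;     /* the pending miss originates from a write */
	uint64_t addr;   /* physical address of the missing access   */
} mshr_t;

/* A cache controller stalls in WAIT_MSHR while the miss handler works
 * on the same cache line, i.e. while the index bits (bits 11:4 with
 * 16-byte lines and 256 sets) match. */
bool mshr_conflict(const mshr_t *m, uint64_t addr)
{
	return m->valid && (((m->addr ^ addr) >> 4) & 0xff) == 0;
}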

miss_handler_fsm

The miss handler is quite a complex component: it not only handles miss requests, it also serves AMO requests (coming from the CPU’s execution stage) and takes care of cache flushing and writeback operations.

Upon reception of a flush command (e.g. a FENCE instruction), the miss handler scans the cache content by issuing requests to the tag comparator: every valid cache line is invalidated and every dirty cache line’s content is written back to the shared memory.
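
Conceptually, the flush is a walk over all sets and ways; a C sketch (types and helpers are illustrative):

#define SETS 256
#define WAYS 8

typedef struct {
	int valid, dirty;
	unsigned long long tag;
	unsigned long long data[2]; /* 128-bit line */
} cacheline_t;

void writeback(const cacheline_t *l); /* copy the line to shared memory */

/* Invalidate every valid line, writing back the dirty ones first. */
void flush_cache(cacheline_t lines[SETS][WAYS])
{
	for (unsigned set = 0; set < SETS; set++)
		for (unsigned way = 0; way < WAYS; way++) {
			cacheline_t *l = &lines[set][way];
			if (!l->valid)
				continue;
			if (l->dirty)
				writeback(l);
			l->valid = l->dirty = 0;
		}
}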

Upon reception of a miss request from one of the cache controllers, the miss handler must:

  • identify if there is an empty cache line
  • evict a cache line (if there is no empty line), selected pseudo-randomly via an LFSR (see the sketch after this list)
  • fetch the data from the shared memory
  • save the new cacheline
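
The pseudo-random victim selection can be modeled with a small LFSR; a sketch (an 8-bit maximal-length polynomial is used here for illustration, not necessarily the one in the RTL):

#include <stdint.h>

static uint8_t lfsr = 0xff; /* any non-zero seed */

/* One step of an 8-bit Fibonacci LFSR (taps x^8 + x^6 + x^5 + x^4 + 1). */
static uint8_t lfsr_step(void)
{
	uint8_t bit = ((lfsr >> 7) ^ (lfsr >> 5) ^ (lfsr >> 4) ^ (lfsr >> 3)) & 1;
	lfsr = (uint8_t)((lfsr << 1) | bit);
	return lfsr;
}

/* Victim among the 8 ways: the low 3 bits of the LFSR state. */
unsigned victim_way(void)
{
	return lfsr_step() & 0x7;
}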

As mentioned above, bypass requests are handled in a similar way to cache misses, but the cache itself obviously remains untouched (no eviction and no replacement).

AMO requests are served when there are no other ongoing operations and are performed like any other memory access which bypasses the cache. In total the miss handler must arbitrate among 4 possible bypass requests: AMO plus one request per cache controller.

A RISC-V MCU for ROS 2

roscore-v – this is the name of the latest project we’ve started together with Acceleration Robotics, a Spanish robotics semiconductor startup that designs robot compute architectures to make robots faster.

The roscore-v project aims to optimize the ROS processing flow by creating a native ROS 2 hardware implementation in a RISC-V MCU, a dedicated hardware accelerator that supports the computations needed for robotics.

The project will be developed as part of the ROS 2 Hardware Acceleration Working Group and powered by one of the OpenHW Group’s CORE-V CPUs.
The plan is to start by developing an FPGA prototype, which will be converted into a custom chip in a second stage.

3…2…1 Let’s start with Culsans!

Culsans – the Etruscan version of Janus, the two-faced and also four-faced god, god of the first and last of the year, of the beginning and the end, of the cardinal points and thus of order in general.

For us, Culsans is a tightly-coupled cache coherency unit for a multi-core processor based on CVA6.
Like the ancient god, its responsibility is to maintain order (and data consistency) among the memory accesses performed by the 2 to 4 CPUs which are part of the system.

We don’t want to develop another OpenPiton – our focus is more on speed than on scalability.

It’s now time to go write some RTL.
Stay tuned if you want to know more about this and our upcoming activities.

Adding IRQ support to CVA6 in LiteX

In the previous blog post we left the CVA6 port in LiteX basically without interrupt support. Since this is an unacceptable limitation, I’ve decided to tackle this problem as a second step.

What I had to do:

  1. remove the definition of the UART_POLLING macro (in core.py)
  2. make sure that the interrupt sources are routed as inputs to the PLIC included in the cva6_wrapper (in core.py)
  3. define the PLIC’s register map and irq helper functions (in irq.h)
  4. implement the isr() and plic_init() functions for CVA6 (in isr.c) and use them (in crt0.S)

The first 2 points are trivial. The last 2 are not particularly complex either, since comparable implementations were already available for other cores (e.g. Rocket or BlackParrot).
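
To give an idea of points 3 and 4, here is a minimal C sketch of the PLIC helpers. The offsets follow the standard RISC-V PLIC memory map and the base address matches the wrapper mapping described in the previous post; everything else (names, source IDs) is illustrative, not the actual LiteX code:

#define PLIC_BASE          0x0c000000UL
#define PLIC_PRIORITY(i)   (PLIC_BASE + 4 * (i))              /* per source  */
#define PLIC_ENABLE(c)     (PLIC_BASE + 0x2000 + 0x80 * (c))  /* per context */
#define PLIC_THRESHOLD(c)  (PLIC_BASE + 0x200000 + 0x1000 * (c))
#define PLIC_CLAIM(c)      (PLIC_BASE + 0x200004 + 0x1000 * (c))

static inline void plic_wr(unsigned long a, unsigned v) { *(volatile unsigned *)a = v; }
static inline unsigned plic_rd(unsigned long a) { return *(volatile unsigned *)a; }

void uart_isr(void); /* provided by the LiteX UART driver */

void plic_init(void)
{
	plic_wr(PLIC_PRIORITY(1), 1);  /* assume the UART is source 1 */
	plic_wr(PLIC_ENABLE(0), 1 << 1);
	plic_wr(PLIC_THRESHOLD(0), 0); /* accept all priorities */
}

void isr(void)
{
	unsigned claim;
	/* claim/complete loop: service every pending source */
	while ((claim = plic_rd(PLIC_CLAIM(0))) != 0) {
		if (claim == 1)
			uart_isr();
		plic_wr(PLIC_CLAIM(0), claim); /* signal completion */
	}
}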

However, this is what happened when launching a simulation:

        __   _ __      _  __
       / /  (_) /____ | |/_/
      / /__/ / __/ -_)>  <
     /____/_/\__/\__/_/|_|
   Build your hardware, easily!

 (c) Copyright 2012-2022 Enjoy-Digital
 (c) Copyright 2007-2015 M-Labs

 BIOS built on Jul 11 2022 15:32:12
 BIOS CRC passed (30b7046a)

 LiteX git sha1: --------

--=============== SoC ==================--
CPU:		CVA6 @ 1MHz
BUS:		WISHBONE 32-bit @ 4GiB
CSR:		32-bit data
ROM:		128KiB
SRAM:		8KiB


--============== Boot ==================--
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
Timeout
No boot medium found

--============= Console ================--

litex> hheellpp

As you can notice from the console prompt, every key received via UART was displayed twice. The problem was not simply cosmetic: the commands were also interpreted incorrectly.

Some debugging led me to the uart_isr() function as the source of the problem:

void uart_isr(void)
{
	unsigned int stat, rx_produce_next;
	stat = uart_ev_pending_read();
	if(stat & UART_EV_RX) {
		while(!uart_rxempty_read()) {
			rx_produce_next = (rx_produce + 1) & UART_RINGBUFFER_MASK_RX;
			if(rx_produce_next != rx_consume) {
				rx_buf[rx_produce] = uart_rxtx_read();
				rx_produce = rx_produce_next;
			}
			uart_ev_pending_write(UART_EV_RX);
		}
	}
    ...
}

In particular, uart_ev_pending_write() was executed twice, which meant that uart_rxempty_read() returned 0 (i.e. a non-empty RX FIFO) twice. The waveforms obtained in simulation indicated that the execution of uart_rxempty_read() started before uart_ev_pending_write() had been able to finish its execution and clear the status. Adding a fence instruction right after uart_ev_pending_write() solved the problem.
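
In uart_isr() the change boils down to something like this (a sketch; fence is the RISC-V memory-ordering instruction, inserted right after clearing the pending bit):

			uart_ev_pending_write(UART_EV_RX);
			/* Make sure the write clearing the pending bit has taken
			   effect before rxempty is sampled again. */
			asm volatile("fence" ::: "memory");

With the fence in place the console behaves as expected: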

        __   _ __      _  __
       / /  (_) /____ | |/_/
      / /__/ / __/ -_)>  <
     /____/_/\__/\__/_/|_|
   Build your hardware, easily!

 (c) Copyright 2012-2022 Enjoy-Digital
 (c) Copyright 2007-2015 M-Labs

 BIOS built on Jul 11 2022 21:08:06
 BIOS CRC passed (30b7046a)

 LiteX git sha1: --------

--=============== SoC ==================--
CPU:		CVA6 @ 1MHz
BUS:		WISHBONE 32-bit @ 4GiB
CSR:		32-bit data
ROM:		128KiB
SRAM:		8KiB


--============== Boot ==================--
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
Timeout
No boot medium found

--============= Console ================--

litex> help

LiteX BIOS, available commands:

help                     - Print this help
ident                    - Identifier of the system
crc                      - Compute CRC32 of a part of the address space
flush_cpu_dcache         - Flush CPU data cache

boot                     - Boot from Memory
reboot                   - Reboot
serialboot               - Boot from Serial (SFL)

mem_list                 - List available memory regions
mem_read                 - Read address space
mem_write                - Write address space
mem_copy                 - Copy address space
mem_test                 - Test memory access
mem_speed                - Test memory speed
mem_cmp                  - Compare memory content

The PR is here.

On the bitter side, I’ve not been able to test the implementation on FPGA. I have a Nexys A7 board, hosting a Xilinx Artix-7 100T, which is apparently too small for the latest and greatest version of CVA6. An idea for the next iteration could be to investigate this problem.

Porting OpenHW’s CVA6 to LiteX

I’ve just seen LiteX’s console appear on my terminal after booting my Digilent Nexys A7 board with a LiteX bitfile containing OpenHW’s CVA6 core.

        __   _ __      _  __
       / /  (_) /____ | |/_/
      / /__/ / __/ -_)>  <
     /____/_/\__/\__/_/|_|
   Build your hardware, easily!

 (c) Copyright 2012-2022 Enjoy-Digital
 (c) Copyright 2007-2015 M-Labs

 BIOS built on May 19 2022 12:08:20
 BIOS CRC passed (c172daf0)

 LiteX git sha1: 48b523cf

--=============== SoC ==================--
CPU:            CVA6 @ 75MHz
BUS:            WISHBONE 32-bit @ 4GiB
CSR:            32-bit data
ROM:            128KiB
SRAM:           8KiB
MAIN-RAM:       8KiB 

--========== Initialization ============--
Memtest at 0x40000000 (8.0KiB)...
  Write: 0x40000000-0x40002000 8.0KiB   
   Read: 0x40000000-0x40002000 8.0KiB   
Memtest OK
Memspeed at 0x40000000 (Sequential, 8.0KiB)...
  Write speed: 17179869183.10GiB/s
   Read speed: 17179869183.10GiB/s

--============== Boot ==================--
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
             Timeout
No boot medium found

--============= Console ================--

litex> help
LiteX BIOS, available commands:

help                     - Print this help
ident                    - Identifier of the system
crc                      - Compute CRC32 of a part of the address space
flush_cpu_dcache         - Flush CPU data cache                                   
leds                     - Set Leds value                                         
                                                                                  
boot                     - Boot from Memory                                       
reboot                   - Reboot
serialboot               - Boot from Serial (SFL)

mem_list                 - List available memory regions
mem_read                 - Read address space
mem_write                - Write address space
mem_copy                 - Copy address space
mem_test                 - Test memory access
mem_speed                - Test memory speed
mem_cmp                  - Compare memory content


Ident: LiteX SoC on Nexys4DDR 2022-05-19 12:07:58
crc <address> <length>
boot <address> [r1] [r2] [r3]
Available memory regions:
ROM       0x10000000 0x20000 
SRAM      0x20000000 0x2000 
MAIN_RAM  0x40000000 0x2000 
CSR       0x80000000 0x10000 

litex> 

This being my first experiment with the tool, I got pretty excited and decided it was worth a post on PlanV’s brand-new blog page. So, here we go!

I’ve already mentioned that I am new to LiteX, but I have been a long-time friend of CVA6, having contributed to its integration in Hensoldt Cyber’s MiG-V. Using it as a starting point for my adventure with LiteX was simply a natural choice.

Adding a core is actually an easy process; what you have to do is:

  • pack the source files in a dedicated pythondata repository
  • create a cva6 class which connects the Verilog ports to the LiteX ports

Wrapping CVA6

By looking at the cores already included in the tool, I noticed that the interface between the CPU and the rest of the system is normally composed of

  • one or more “data” buses (Wishbone or AXI or OBI)
  • a bus of interrupt sources
  • JTAG signals
  • clock and reset

CVA6’s ports are:

  • AXI interface
  • hart ID
  • boot address
  • IRQ (2 bits, for M and S mode) – these normally come from the PLIC
  • IPI – this is normally generated by the CLINT
  • timer IRQ – this is normally generated by the CLINT
  • debug request – coming from the debug module
  • clock and reset

Hart ID and boot address can be hardcoded. The routing of the interrupt and debug request signals puzzled me. cv32e40p shows an example of how to integrate the debug module (DM) at core level. I could have followed the same approach and integrated the DM, PLIC and CLINT at LiteX level, but there were (and still are) some open questions (1) which prevented me from following this path. Therefore I decided to integrate all these components in a Verilog wrapper; the approach is not new in the LiteX ecosystem: I’ve seen the same done, e.g., for Rocket Chip.

cva6_wrapper.png

The architecture is probably not optimal, since the data must traverse 2 interconnects (the AXI interconnect inside the wrapper and the main Wishbone interconnect of the LiteX system), but it is good enough for a first attempt.

The memory mapping is:

Target   Start addr   Length
DM       0x00000000   0x1000
CLINT    0x02000000   0xC0000
PLIC     0x0C000000   0x3FFFFFF
others   0x10000000   0xEFFFFFFF

The pythondata-cpu-cva6 repository contains a snapshot of the original CVA6 repository. Together with the original source files and cva6_wrapper.sv, the pythondata repository also includes 2 SystemVerilog packages: ariane_pkg.sv, which is a copy of the one included in the CVA6 repository and which specifies some parameters specific to the current implementation, and cva6_wrapper_pkg.sv, which defines the memory mapping within the wrapper as well as the configuration parameters for CVA6.

CVA6 class

This part of the design is quite trivial. I just had to connect the IOs of the cva6_wrapper module to the appropriate signals in the CVA6 class (derived from the CPU class). On top of this, the AXI interface has to be converted to Wishbone (since the system bus uses this protocol) and the memory map for the system peripherals has to be defined. The system peripherals have to be mapped to addresses higher than 0x10000000 (see the mapping done in the CVA6 wrapper):

Target   Start addr
ROM      0x10000000
SRAM     0x20000000
CSR      0x80000000

Note that the CSR address range must be included in the io_regions parameter.

The Verilog source files are included in the project by parsing the flist file in pythondata-cpu-cva6. CVA6 offers several versions of this file list, depending on the project configuration. For the time being I have only used cv64a6_imafdc_sv39.

Testing

I’ve tested the code both using Verilator and an FPGA board.

To start a simulation with Verilator you just have to run

litex_sim --cpu-type=cva6 --trace

The --trace option is necessary to generate the waveforms for debugging.

To implement the code on FPGA, the command to run is

python3 -m litex_boards.targets.digilent_nexys4ddr --cpu-type=cva6 --build

from within the litex-boards folder.

The behaviour with Verilator and on the FPGA is slightly different: the simulation showed no problems, while on the FPGA the console froze after printing a few characters.

The issue seems to be related to the interrupt handling, but I’ve not yet found the root cause. I’ve just found that compiling the code with the flag -DUART_POLLING overcomes the problem.

Another problem I encountered is a failure during the SDRAM initialization:

--========== Initialization ============--
Initializing SDRAM @0x40000000...
Switching SDRAM to software control.
Read leveling:
  m0, b00: |00000000000000000000000000000000| delays: -
  m0, b01: |00000000000000000000000000000000| delays: -
  m0, b02: |11111111111110000000000000000000| delays: 08+-08
  m0, b03: |00000000000001111111111111111111| delays: 22+-09
  m0, b04: |00000000000000000000000000000000| delays: -
  m0, b05: |00000000000000000000000000000000| delays: -
  m0, b06: |00000000000000000000000000000000| delays: -
  m0, b07: |00000000000000000000000000000000| delays: -
  best: m0, b03 delays: -
  m1, b00: |00000000000000000000000000000000| delays: -
  m1, b01: |00000000000000000000000000000000| delays: -
  m1, b02: |11111111111100000000000000000000| delays: 07+-07
  m1, b03: |00000000000001111111111111111111| delays: 22+-09
  m1, b04: |00000000000000000000000000000000| delays: -
  m1, b05: |00000000000000000000000000000000| delays: -
  m1, b06: |00000000000000000000000000000000| delays: -
  m1, b07: |00000000000000000000000000000000| delays: -
  best: m1, b03 delays: -
Switching SDRAM to hardware control.
Memtest at 0x40000000 (2.0MiB)...
  Write: 0x40000000-0x40200000 2.0MiB     
   Read: 0x40000000-0x40200000 2.0MiB     
  bus errors:  14/256
  addr errors: 0/8192
  data errors: 520296/524288
Memtest KO
Memory initialization failed

I have not yet debugged this issue. It is possible to work around it by not including the SDRAM in the design, building the system with the --integrated-main-ram-size option

python3 -m litex_boards.targets.digilent_nexys4ddr --cpu-type=cva6 --integrated-main-ram-size=8192 --build

Notes

(1)

  • how to connect the interrupt variable of the CPU class (which contains all the interrupt sources at platform level) if the actual CPU expects a single IRQ bit (generated by the PLIC)?
  • how to define the memory mapping of the DM, PLIC and CLINT?