Our work on Culsans (see the previous blog entry) forced us to become very familiar with the architecture of the caches in CVA6. We did an in-depth analysis of the RTL code and decided to share this information, given that there is not much documentation available on this topic.

CVA6’s source files include three different caches:

  • an instruction cache
    • 16 kB
    • 4-way set-associative
    • 128-bit cache lines
    • VIPT (virtually indexed, physically tagged)
    • random replacement strategy (via LFSR)
  • a write-back (WB) data cache
    • 32 kB
    • 8-way set-associative
    • 128-bit cache lines
    • VIPT
    • random replacement strategy (via LFSR)
  • a write-through (WT) data cache
    • same characteristics as the WB cache

In this post we are going to focus on the first two caches, since a very well-written paper by Thales and Université Grenoble Alpes, focused on the WT cache, is already available.

By default, the std_cache_subsystem module is instantiated; this includes the I-cache and the WB D-cache. By defining the WT_DCACHE parameter it is possible to use the wt_cache_subsystem instead.
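
As a minimal sketch (module name and port list are invented for illustration; the real CVA6 top level wires up AXI ports and many more signals), the parameter-driven selection can be pictured as a generate block:

```systemverilog
// Illustrative only: a generate block picks one of the two subsystems
// at elaboration time, based on the WT_DCACHE parameter.
module cache_subsystem_sel #(
  parameter bit WT_DCACHE = 1'b0  // 0: std (WB) subsystem, 1: WT subsystem
) (
  input logic clk_i,
  input logic rst_ni
  // request/response and AXI ports omitted for brevity
);
  if (WT_DCACHE) begin : gen_wt_dcache
    // wt_cache_subsystem i_cache_subsystem ( /* ... */ );
  end else begin : gen_std_dcache
    // std_cache_subsystem i_cache_subsystem ( /* ... */ );
  end
endmodule
```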

Cache subsystem

std_cache_subsystem

The cache subsystem includes the two caches and an arbiter, which selects which of the requests coming from the caches is propagated to the AXI interface.
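
A toy version of this arbitration, assuming simple valid/ready handshakes instead of full AXI channels, could look like this (module and signal names are invented; the priority choice is ours, not necessarily CVA6's):

```systemverilog
// Fixed-priority arbiter sketch: the d-cache wins over the i-cache
// whenever both request the memory interface in the same cycle.
module simple_mem_arb (
  input  logic icache_req_i,
  input  logic dcache_req_i,
  input  logic mem_ready_i,
  output logic icache_gnt_o,
  output logic dcache_gnt_o,
  output logic mem_req_o
);
  always_comb begin
    mem_req_o    = icache_req_i | dcache_req_i;
    dcache_gnt_o = dcache_req_i & mem_ready_i;
    icache_gnt_o = icache_req_i & ~dcache_req_i & mem_ready_i;
  end
endmodule
```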

Instruction cache

The instruction cache receives data requests from the frontend module. If an address is not present in the cache, a translation request must be forwarded to the MMU (assuming caching is enabled).

The following diagram depicts the state machine controlling the cache behaviour.

icache_fsm

At reset, or when a flush command is received (e.g. via the FENCE.I instruction), the controller empties the content of the cache; this operation consists of resetting all the valid bits – no writeback is needed, since the instruction cache never modifies the data.

From the IDLE state the controller moves to the READ state upon receiving a request from the frontend unit. The controller checks whether the requested address is already present in the cache memory; in case of a hit, the controller can accept a second request (if present), otherwise it returns to the IDLE state; in case of a miss, the controller enters the MISS state, waits for the data from the main memory, then goes back to the IDLE state. Since kill operations (e.g. due to an exception) are asynchronous events, in the sense that they can occur while a memory operation is ongoing, the controller must wait for the completion of the address translation or AXI transaction before becoming IDLE again. Dedicated KILL_* states have been introduced for this purpose.
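
The transitions described above can be condensed into the following sketch (state names follow the diagram, input names are invented, and several corner cases of the real cva6_icache FSM are omitted):

```systemverilog
// Condensed model of the i-cache controller transitions described above.
module icache_fsm_sketch (
  input  logic clk_i, rst_ni,
  input  logic flush_i,        // flush command (e.g. FENCE.I)
  input  logic flush_done_i,   // all valid bits cleared
  input  logic req_i,          // request from the frontend
  input  logic hit_i,          // tag comparison result
  input  logic kill_i,         // asynchronous kill (e.g. exception)
  input  logic atrans_done_i,  // address translation completed
  input  logic refill_done_i   // AXI refill completed
);
  typedef enum logic [2:0] {
    FLUSH, IDLE, READ, MISS, KILL_ATRANS, KILL_MISS
  } state_e;

  state_e state_d, state_q;

  always_comb begin
    state_d = state_q;
    unique case (state_q)
      FLUSH:       if (flush_done_i)       state_d = IDLE;
      IDLE:        if (flush_i)            state_d = FLUSH;
                   else if (req_i)         state_d = READ;
      READ: begin
        if (kill_i)                        state_d = KILL_ATRANS;
        else if (hit_i)                    state_d = req_i ? READ : IDLE; // back-to-back hits
        else                               state_d = MISS;
      end
      MISS:        if (kill_i)             state_d = KILL_MISS; // AXI must still complete
                   else if (refill_done_i) state_d = IDLE;
      KILL_ATRANS: if (atrans_done_i)      state_d = IDLE;      // wait out the MMU
      KILL_MISS:   if (refill_done_i)      state_d = IDLE;      // wait out the refill
      default:                             state_d = IDLE;
    endcase
  end

  // The controller starts in FLUSH so the cache is emptied at reset.
  always_ff @(posedge clk_i or negedge rst_ni)
    if (!rst_ni) state_q <= FLUSH;
    else         state_q <= state_d;
endmodule
```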

Data cache

The following diagram schematically represents the architecture of the data cache.

dcache

It is composed of 6 elements:

  • 3 cache controllers
  • a miss handler block
  • a tag comparator
  • the memory itself (8 SRAM blocks for the data, 8 SRAM blocks for the tags – 8 being the number of ways; see the address breakdown below)
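
Before looking at the controllers, it is worth deriving how an address is split by this geometry. Assuming the 56-bit physical addresses of CVA6's Sv39 configuration:

```systemverilog
package dcache_geom_pkg;
  // Address breakdown for the 32 kB, 8-way WB d-cache; the 16 kB,
  // 4-way i-cache yields the same index/offset split.
  localparam int unsigned PLEN        = 56;                 // physical address width (Sv39)
  localparam int unsigned CACHE_BYTES = 32 * 1024;
  localparam int unsigned WAYS        = 8;
  localparam int unsigned LINE_BYTES  = 128 / 8;            // 16 B per cache line
  localparam int unsigned SETS        = CACHE_BYTES / (WAYS * LINE_BYTES); // 256
  localparam int unsigned OFFSET_BITS = $clog2(LINE_BYTES); // 4
  localparam int unsigned INDEX_BITS  = $clog2(SETS);       // 8
  localparam int unsigned TAG_BITS    = PLEN - INDEX_BITS - OFFSET_BITS;   // 44
endpackage
```

Note that INDEX_BITS + OFFSET_BITS equals the 12-bit page offset: the set can be selected with the (untranslated) virtual address while the tag comparison uses the physical address, which is exactly what makes the VIPT organization work.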

There is one cache controller for each of the possible sources of data requests: the store unit, the load unit and the page table walker (PTW). These modules compete with each other, and with the miss handler, to access the tag comparator (and therefore the cache content).

cache_ctrl_fsm

Upon receiving a data request, the cache controller issues a request to the tag comparator. The tag comparator can process only one request at a time, so incoming requests are prioritized: miss handler requests, if they are not related to AMO operations, have the highest priority; then come PTW requests, load unit requests, store unit requests and, finally, AMO requests.
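
In other words, the arbitration is a fixed-priority chain, which can be sketched as follows (all names are invented; the real logic covers more corner cases):

```systemverilog
// Fixed-priority selection of the single tag-comparator port.
typedef enum logic [2:0] {NONE, MISS_HANDLER, PTW, LOAD, STORE, AMO} master_e;

module tag_cmp_prio (
  input  logic    mh_req_i,     // miss handler request
  input  logic    mh_is_amo_i,  // ...related to an AMO operation
  input  logic    ptw_req_i,
  input  logic    load_req_i,
  input  logic    store_req_i,
  input  logic    amo_req_i,
  output master_e grant_o
);
  always_comb begin
    grant_o = NONE;
    if      (mh_req_i && !mh_is_amo_i) grant_o = MISS_HANDLER;
    else if (ptw_req_i)                grant_o = PTW;
    else if (load_req_i)               grant_o = LOAD;
    else if (store_req_i)              grant_o = STORE;
    else if (amo_req_i)                grant_o = AMO;
  end
endmodule
```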

In case of a hit on a read request, the cache controller can immediately serve a second request, if available. In case of a hit on a write request, the cache controller goes into the STORE_REQ state, waits for the update of the cache content to complete, and then returns to the IDLE state. In case of a miss, the controller must wait for the data coming from the shared memory: a request is sent to the miss handler, which takes care of interfacing with the outside world. The FSM first enters WAIT_REFILL_GNT, waiting for the miss handler to fetch the data, then moves to IDLE, WAIT_CRITICAL_WORD or WAIT_REFILL_VALID depending on whether the request was a write, a read or a bypass, respectively. The "critical word" is the word pointed to by the address specified in the request; the name distinguishes it from the whole cache line (2x 64-bit words). In case of a read, the critical word is the one forwarded to the requesting port. WAIT_REFILL_VALID is a somewhat confusing name for a state that waits for data coming from the bypass AXI interface. Accesses to non-cacheable memory locations are modeled like cache misses.
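
As an illustration of the critical-word notion: with 128-bit lines and 64-bit words, a single address bit decides which half of the refilled line is forwarded (signal names invented):

```systemverilog
// The refill returns a full 128-bit line; address bit 3 selects which
// of its two 64-bit words was actually requested (the "critical word").
module critical_word_sel (
  input  logic [55:0]  req_addr_i,
  input  logic [127:0] refill_line_i,
  output logic [63:0]  critical_word_o
);
  assign critical_word_o = req_addr_i[3] ? refill_line_i[127:64]
                                         : refill_line_i[63:0];
endmodule
```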

The miss status holding register (MSHR) is a data structure which

  1. stores the information about the cache miss which is currently being processed
  2. is used to synchronize the cache controllers and the miss handler: each cache controller has to wait for the miss handler to finish its operations (e.g. evicting the same cache line that the cache controller is writing, or operating on the same address the cache controller is reading) before accessing the cache – the WAIT_MSHR state accomplishes this function.
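
Conceptually, a single-entry MSHR holds something like the following fields (names and widths are illustrative, not the exact CVA6 definition):

```systemverilog
// Single-entry MSHR sketch: one outstanding miss at a time.
typedef struct packed {
  logic        valid;   // a miss is currently being processed
  logic [55:0] addr;    // physical address of the missing cache line
  logic [1:0]  id;      // which cache controller issued the miss
  logic        we;      // the miss was caused by a store
  logic [7:0]  be;      // byte enables for the store data
  logic [63:0] wdata;   // store data to be merged after the refill
} mshr_t;
```

A cache controller in the WAIT_MSHR state simply stalls while valid is set and its own request address matches addr.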

miss_handler_fsm

The miss handler is quite a complex component: it not only handles miss requests, but also serves AMO requests (coming from the CPU’s execution stage) and takes care of cache flushing and writeback operations.

Upon reception of a flush command (e.g. a FENCE instruction), the miss handler scans the cache content by issuing requests to the tag comparator: every valid cache line is invalidated and every dirty cache line’s content is written back to the shared memory.
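
Behaviourally, the flush scan boils down to a nested loop over sets and ways (a software-style model; the actual FSM visits one index per cycle through the tag comparator):

```systemverilog
// Behavioural model of the flush scan, not the cycle-accurate FSM.
module flush_sketch #(
  parameter int unsigned SETS = 256,
  parameter int unsigned WAYS = 8
) ();
  logic valid_q [SETS][WAYS];
  logic dirty_q [SETS][WAYS];

  // Stand-in for the AXI write-back of one dirty line.
  task automatic writeback_line(input int s, input int w);
    // the line's data would be copied to the shared memory here
  endtask

  task automatic flush_cache();
    for (int s = 0; s < SETS; s++)
      for (int w = 0; w < WAYS; w++)
        if (valid_q[s][w]) begin
          if (dirty_q[s][w]) writeback_line(s, w); // dirty lines go out first
          valid_q[s][w] = 1'b0;                    // every valid line is invalidated
        end
  endtask
endmodule
```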

Upon reception of a miss request from one of the cache controllers, the miss handler must:

  • identify whether there is an empty cache line in the target set
  • evict a cache line if there is no empty one, selected pseudo-randomly via an LFSR (see the sketch after this list)
  • fetch the data from the shared memory
  • save the new cache line
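
The first two steps amount to a victim-selection circuit, sketched below (the LFSR width and polynomial are chosen for illustration and are not necessarily those used by CVA6):

```systemverilog
// Prefer an invalid way; otherwise evict a pseudo-randomly chosen one.
module victim_sel #(
  parameter int unsigned WAYS = 8
) (
  input  logic                    clk_i, rst_ni,
  input  logic                    miss_i,       // advance the LFSR on each miss
  input  logic [WAYS-1:0]         valid_i,      // valid bits of the target set
  output logic [$clog2(WAYS)-1:0] victim_way_o,
  output logic                    evict_o       // all ways valid: eviction needed
);
  logic [7:0] lfsr_q;

  // 8-bit maximal-length Fibonacci LFSR (taps 8,6,5,4).
  always_ff @(posedge clk_i or negedge rst_ni)
    if (!rst_ni)     lfsr_q <= 8'hFF;
    else if (miss_i) lfsr_q <= {lfsr_q[6:0], lfsr_q[7] ^ lfsr_q[5] ^ lfsr_q[4] ^ lfsr_q[3]};

  always_comb begin
    evict_o      = &valid_i;                      // no free way in the set
    victim_way_o = lfsr_q[$clog2(WAYS)-1:0];      // pseudo-random victim
    for (int w = WAYS - 1; w >= 0; w--)
      if (!valid_i[w]) victim_way_o = w[$clog2(WAYS)-1:0]; // a free way wins
  end
endmodule
```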

Bypass requests, as mentioned above, are handled in a similar way to cache misses, except that the cache content remains untouched (no eviction, no replacement).

AMO requests are served when there are no other ongoing operations and are performed like any other memory access that bypasses the cache. In total, the miss handler must arbitrate between 4 possible bypass requests: AMO requests plus one bypass request per cache controller.