Our work on Culsans (see the previous blog entry) forced us to become very familiar with the architecture of the caches in CVA6. We did an in-depth analysis of the RTL code and decided to share this information, given that there is not much documentation available on this topic.
CVA6’s source files include 3 different caches:

- an instruction cache
  - 16 kB
  - 4-way set-associative
  - 128-bit cache lines
  - VIPT
  - random replacement strategy (via LFSR)
- a write-back (WB) data cache
  - 32 kB
  - 8-way set-associative
  - 128-bit cache lines
  - VIPT
  - random replacement strategy (via LFSR)
- a write-through (WT) data cache
  - same characteristics as the WB cache
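To make the geometry concrete: the number of sets equals size / (ways × line bytes), which works out to 256 sets for both caches. A minimal parameter sketch (the names are ours, not taken from the CVA6 sources):

```systemverilog
// Hypothetical parameter sketch of the geometry above; names are ours,
// not taken from the CVA6 sources.
package cache_geometry_pkg;
  // instruction cache: 16 kB, 4 ways, 128-bit lines
  localparam int unsigned ICACHE_SIZE      = 16 * 1024;  // bytes
  localparam int unsigned ICACHE_WAYS      = 4;
  localparam int unsigned ICACHE_LINE_BITS = 128;
  // 16384 / (4 ways * 16 bytes/line) = 256 sets
  localparam int unsigned ICACHE_SETS =
      ICACHE_SIZE / (ICACHE_WAYS * (ICACHE_LINE_BITS / 8));

  // data cache (WB and WT variants): 32 kB, 8 ways, 128-bit lines
  localparam int unsigned DCACHE_SIZE      = 32 * 1024;  // bytes
  localparam int unsigned DCACHE_WAYS      = 8;
  localparam int unsigned DCACHE_LINE_BITS = 128;
  // 32768 / (8 ways * 16 bytes/line) = 256 sets
  localparam int unsigned DCACHE_SETS =
      DCACHE_SIZE / (DCACHE_WAYS * (DCACHE_LINE_BITS / 8));
endpackage
```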
In this post we are going to focus on the first two caches, since a very well-written paper by Thales and the University Grenoble Alpes, focused on the WT cache, is already available.
By default the `std_cache_subsystem` module is instantiated: this includes the I-cache and the WB D-cache. By defining the parameter `WT_DCACHE` it is possible to make use of the `wt_cache_subsystem` instead.
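A simplified sketch of how this selection could look; the actual top-level instantiation in CVA6 differs in ports and parameterization:

```systemverilog
// Simplified sketch of the subsystem selection; the actual top-level
// instantiation in CVA6 differs in ports and parameterization.
module cache_subsystem_sel #(
  parameter bit WT_DCACHE = 1'b0
) (
  input logic clk_i,
  input logic rst_ni
  // frontend, load/store and AXI ports omitted for brevity
);
  if (WT_DCACHE) begin : gen_cache_wt
    // write-through D-cache + I-cache
    // wt_cache_subsystem i_cache_subsystem ( /* ... */ );
  end else begin : gen_cache_wb
    // write-back D-cache + I-cache (the default)
    // std_cache_subsystem i_cache_subsystem ( /* ... */ );
  end
endmodule
```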
Cache subsystem
The cache subsystem includes the two caches and an arbiter that selects which of the requests coming from the caches is propagated to the AXI interface.
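As an illustration, a minimal fixed-priority sketch of such an arbiter could look as follows; we have not verified which priority scheme the actual RTL implements:

```systemverilog
// Minimal fixed-priority sketch of the request arbitration towards AXI;
// we have not verified which priority scheme the actual RTL implements.
module cache_axi_arb (
  input  logic dcache_req_i,
  input  logic icache_req_i,
  output logic dcache_gnt_o,
  output logic icache_gnt_o
);
  // In this sketch the D-cache wins ties; the I-cache is granted only
  // when the D-cache has no pending request.
  assign dcache_gnt_o = dcache_req_i;
  assign icache_gnt_o = icache_req_i & ~dcache_req_i;
endmodule
```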
Instruction cache
The instruction cache receives data requests from the `frontend` module. If an address is not present in the cache, a translation request must be forwarded to the MMU (when caching is enabled).
The following diagram depicts the state machine controlling the cache behaviour.
At reset, or when a flush command (e.g. via a `FENCE.I` instruction) is received, the controller empties the content of the cache; this operation consists of resetting all the `valid` bits – no writeback is needed, since the instruction cache doesn’t modify the data.

From the `IDLE` state the controller goes into the `READ` state upon reception of a request from the `frontend` unit. The controller checks whether the data is already in the cache memory; in case of a hit, the controller can accept a second data request (if present), otherwise it goes back to the `IDLE` state; in case of a miss, the controller goes into the `MISS` state, waits for the data from the main memory, then goes back to the `IDLE` state. Since kill operations (e.g. due to an exception) are asynchronous events, in the sense that they can occur while a memory operation is ongoing, the controller must wait for the completion of the address translation or AXI transactions before becoming `IDLE` again. Dedicated `KILL_*` states have been introduced for this purpose.
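A reduced behavioural sketch of this FSM is shown below; the state names follow the description above, but the transitions are simplified, the dedicated `KILL_*` states are collapsed into a single `KILL` state, and the signal names are ours:

```systemverilog
// Reduced behavioural sketch of the I-cache control FSM described above.
module icache_fsm_sketch (
  input logic clk_i,
  input logic rst_ni,
  input logic flush_i,    // e.g. triggered by FENCE.I
  input logic req_i,      // request from the frontend
  input logic hit_i,      // tag comparison result
  input logic kill_i,     // e.g. caused by an exception
  input logic mem_done_i  // refill / outstanding transaction completed
);
  typedef enum logic [2:0] {FLUSH, IDLE, READ, MISS, KILL} state_e;
  state_e state_d, state_q;

  always_comb begin
    state_d = state_q;
    case (state_q)
      FLUSH: state_d = IDLE;                 // all valid bits cleared
      IDLE:  if (req_i) state_d = READ;
      READ: begin
        if (kill_i)     state_d = KILL;      // translation may be in flight
        else if (hit_i) state_d = req_i ? READ : IDLE; // back-to-back requests
        else            state_d = MISS;      // fetch line from main memory
      end
      MISS:  if (kill_i)          state_d = KILL;  // AXI transaction in flight
             else if (mem_done_i) state_d = IDLE;
      KILL:  if (mem_done_i)      state_d = IDLE;  // wait for completion
      default: state_d = IDLE;
    endcase
    if (flush_i) state_d = FLUSH;            // flush wins over everything
  end

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) state_q <= FLUSH;  // cache is emptied at reset
    else         state_q <= state_d;
  end
endmodule
```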
Data cache
The following diagram schematically represents the architecture of the data cache.
It is composed of 6 elements:
- 3 cache controllers
- a miss handler block
- a tag comparator
- the memory itself (8 SRAM blocks for the data, 8 SRAM blocks for the tags – 8 being the number of ways)
There is one cache controller for each of the possible sources of data requests: the store unit, the load unit and the page table walker (PTW). These modules compete with each other, and with the miss handler, to access the tag comparator (and therefore the cache content).
Upon a data request, the cache controller issues a request to the tag comparator. The tag comparator can process only one request at a time, therefore the incoming requests are prioritized: the miss handler’s requests, when not related to AMO operations, have the highest priority; then come PTW requests, load unit requests, store unit requests and, finally, AMO requests.
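The resulting selection logic can be sketched as a simple fixed-priority encoder (signal names are ours):

```systemverilog
// Sketch of the fixed-priority selection described above (names are ours).
// Priority, highest first: miss handler (non-AMO), PTW, load unit,
// store unit, AMO.
module tag_cmp_prio (
  input  logic       miss_req_i,   // from the miss handler, non-AMO
  input  logic       ptw_req_i,    // page table walker
  input  logic       load_req_i,   // load unit
  input  logic       store_req_i,  // store unit
  input  logic       amo_req_i,    // AMO, lowest priority
  output logic [4:0] gnt_o         // one-hot grant
);
  always_comb begin
    gnt_o = '0;
    if      (miss_req_i)  gnt_o[0] = 1'b1;
    else if (ptw_req_i)   gnt_o[1] = 1'b1;
    else if (load_req_i)  gnt_o[2] = 1'b1;
    else if (store_req_i) gnt_o[3] = 1'b1;
    else if (amo_req_i)   gnt_o[4] = 1'b1;
  end
endmodule
```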
In case of a hit on a read request, the cache controller can immediately issue a second request, if available.
In case of a hit on a write request, the cache controller goes into the `STORE_REQ` state, waits until the update of the cache content completes, and then goes back to the `IDLE` state.

In case of a miss, the controller must wait for the data coming from the shared memory; a request is sent to the miss handler, which takes care of interfacing with the outside world. The FSM goes first into `WAIT_REFILL_GNT`, to wait for the fetch of data performed by the miss handler, then into `IDLE`, `WAIT_CRITICAL_WORD` or `WAIT_REFILL_VALID`, depending on whether the request was respectively a write, a read or a bypass.
"Critical word" indicates the word which is pointed by the address specified during the request; the name distinguishes it from the whole cacheline (2x 64bit words). The "critical word" is the one which is forwarded to the requesting port in case of a read.
`WAIT_REFILL_VALID` is probably a confusing name for a state that waits for data coming from the bypass AXI interface.
Access to non-cacheable memory locations is modeled like a cache miss.
The miss status holding register (MSHR) is a data structure which

- stores the information about the cache miss which is currently being processed
- is used to synchronize the cache controllers and the miss handler: each cache controller has to wait for the miss handler to finish its operations (e.g. evicting the same cache line that the cache controller is writing, or operating on the same address the cache controller is reading) before accessing the cache – the `WAIT_MSHR` state accomplishes this function.
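A sketch of what an MSHR entry and the corresponding collision check might look like; field names and widths are our assumptions, not the actual CVA6 ones:

```systemverilog
// Illustrative MSHR entry and the collision check that keeps a cache
// controller in WAIT_MSHR; field names and widths are our assumptions.
typedef struct packed {
  logic        valid;  // a miss is currently being processed
  logic [55:0] addr;   // physical address of the missing cache line
  logic [2:0]  way;    // victim way chosen for the refill
  logic        we;     // the miss was caused by a write
} mshr_t;

// A controller stalls while the miss handler works on the same set,
// otherwise it could access a line that is about to be replaced.
function automatic logic mshr_conflict (
  input mshr_t       mshr,
  input logic [55:0] addr
);
  // with 256 sets and 16-byte lines, the index is addr[11:4]
  return mshr.valid && (mshr.addr[11:4] == addr[11:4]);
endfunction
```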
The miss handler is quite a complex component: it not only handles miss requests, but also serves AMO requests (coming from the CPU’s execution stage) and takes care of cache flushing and writeback operations.
Upon reception of a flush command (e.g. a `FENCE` instruction), the miss handler scans the cache content by issuing requests to the tag comparator: every valid cache line is invalidated and every dirty cache line’s content is written back to the shared memory.
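Behaviourally, the flush operation boils down to the following scan (a sketch, not the actual RTL, which steps through the sets with an FSM and a counter; `writeback_line` is a hypothetical helper):

```systemverilog
// Behavioural sketch of the flush scan described above.
module flush_sketch;
  localparam int NUM_SETS = 256;
  localparam int NUM_WAYS = 8;

  logic valid [NUM_WAYS][NUM_SETS];
  logic dirty [NUM_WAYS][NUM_SETS];

  // hypothetical stand-in for the datapath that writes a line to memory
  task automatic writeback_line (int s, int w);
    // issue an AXI write of the line stored in (way w, set s)
  endtask

  task automatic flush_cache ();
    for (int s = 0; s < NUM_SETS; s++)
      for (int w = 0; w < NUM_WAYS; w++)
        if (valid[w][s]) begin
          if (dirty[w][s]) writeback_line(s, w); // dirty lines go to memory
          valid[w][s] = 1'b0;                    // every valid line is invalidated
          dirty[w][s] = 1'b0;
        end
  endtask
endmodule
```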
Upon reception of a miss request from one of the cache controllers, the miss handler must:
- identify whether there is an empty cache line
- evict a cache line if there is no empty one, the victim being selected pseudo-randomly via an LFSR (see the sketch after this list)
- fetch the data from the shared memory
- save the new cache line
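The pseudo-random victim selection can be sketched with a simple Fibonacci LFSR; the width, polynomial and names below are our choice, not necessarily those of the CVA6 RTL:

```systemverilog
// Pseudo-random victim selection via an 8-bit Fibonacci LFSR; width,
// polynomial (x^8 + x^6 + x^5 + x^4 + 1) and names are our choice.
module lfsr_victim_sel (
  input  logic       clk_i,
  input  logic       rst_ni,
  input  logic       advance_i,    // step the LFSR, e.g. on every refill
  output logic [2:0] victim_way_o  // one of the 8 ways
);
  logic [7:0] lfsr_q;

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni)
      lfsr_q <= 8'hFF;  // any non-zero seed works
    else if (advance_i)
      lfsr_q <= {lfsr_q[6:0], lfsr_q[7] ^ lfsr_q[5] ^ lfsr_q[4] ^ lfsr_q[3]};
  end

  assign victim_way_o = lfsr_q[2:0];  // low bits select the way to evict
endmodule
```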
As mentioned above, bypass requests are handled in a similar way to cache misses, but obviously the cache remains untouched (no eviction or replacement).
AMO requests are served when there are no other ongoing operations and are performed like any other memory access which bypasses the cache. In total the miss handler must arbitrate 4 possible bypass requests: AMO requests and one request per cache controller.