gem5; first clear MOESI multicore analysis of the counter example

Ciro Santilli 六四事件 法轮功
2020-06-05 07:00:06 +00:00
parent 0d5c7f5c4c
commit 8a4fc8e9ae


@ -13001,7 +13001,7 @@ Tested in gem5 d7d9bc240615625141cd6feddbadd392457e49eb.
This is the simplest of all protocols, and therefore the first one you should study to learn how Ruby works.
To study it, we can take an approach similar to what was done at: <<gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus>>.
Our full command line will be something like
@ -13033,7 +13033,7 @@ It presumably implements a crossbar switch along the lines of: https://en.wikipe
One simple example of its operation can be seen at: xref:gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis[xrefstyle=full]
But arguably interesting effects can only be observed when we have more than one CPU as in <<gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus>>.
TODO: describe it in more detail. It appears to be a very simple mechanism.
@ -15225,11 +15225,9 @@ so we understand that by default the classic cache:
The actual representation is done via separate state bits: https://github.com/gem5/gem5/blob/9fc9c67b4242c03f165951775be5cd0812f2a705/src/mem/cache/cache_blk.hh#L66 and MOESI appears explicitly only on the pretty printing.
This pretty printing appears for example in the `--trace Cache` lines as shown at <<gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches>> and with a few more transitions visible at xref:gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus[xrefstyle=full].
===== gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs
It would be amazing to analyze a simple example with interconnect packets possibly invalidating caches of other CPUs.
@ -15237,18 +15235,20 @@ To observe it we could create one well controlled workload with instructions tha
If we don't use such instructions that flush memory, we would only see the interconnect at work when the caches run out of space.
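For example, on aarch64 one such instruction is `DC CIVAC` (clean and invalidate data cache line by virtual address to the point of coherency). The following is only a hypothetical sketch of such a workload, not a file from this repo, and it assumes that userspace cache maintenance is enabled, which Linux normally does:
....
/* Hypothetical sketch, not part of this repo. */
#include <stdint.h>
#include <stdio.h>

/* Push the cache line holding p out of the local data cache so that the
 * next access has to go back through the memory system. */
static inline void flush_dcache_line(void *p) {
    __asm__ volatile("dc civac, %0" : : "r"(p) : "memory");
    __asm__ volatile("dsb sy" : : : "memory");
}

int main(void) {
    static volatile uint64_t x;
    x = 1;                             /* brings the line in and dirties it */
    flush_dcache_line((void *)&x);     /* writes it back and invalidates the local copy */
    printf("%lu\n", (unsigned long)x); /* this read should miss again */
    return 0;
}
....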
For this study, we will use the same CLI as <<gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis>> but with multiple CPUs and a multithreaded workload which shares a variable across threads.
We will then focus on the memory behaviour of the shared variable to see if we can observe cache coherency on the crossbar.
We can use link:userland/c/atomic.c[] (see also <<c-multithreading>>) at LKMC 7c01b29f1ee7da878c7cc9cb4565f3f3cf516a92 and gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1, as in <<detailed-gem5-analysis-of-how-data-races-happen>>:
....
./run \
--arch aarch64 \
--cli-args '2 10' \
--cpus 3 \
--emulator gem5 \
--gem5-worktree master3 \
--trace FmtFlag,Cache,DRAM,ExecAll,XBar \
--userland userland/c/atomic.c \
-- \
--caches \
;
....
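For reference, the workload has roughly the following shape. This is only a minimal sketch and not the actual link:userland/c/atomic.c[]: we merely assume here that the two CLI arguments are the thread count and the per-thread iteration count, and that `my_thread_main` bumps one atomic and one non-atomic counter:
....
/* Sketch only, not the real userland/c/atomic.c. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

atomic_ulong atomic_counter;
unsigned long non_atomic_counter; /* racy on purpose */
unsigned long niters;

void *my_thread_main(void *arg) {
    (void)arg;
    for (unsigned long i = 0; i < niters; i++) {
        atomic_counter++;     /* a single atomic RMW sequence */
        non_atomic_counter++; /* plain LDR / ADD / STR: increments can be lost */
    }
    return NULL;
}

int main(int argc, char **argv) {
    unsigned long nthreads = argc > 1 ? strtoul(argv[1], NULL, 0) : 2;
    niters = argc > 2 ? strtoul(argv[2], NULL, 0) : 10;
    pthread_t *threads = malloc(nthreads * sizeof(*threads));
    for (unsigned long i = 0; i < nthreads; i++)
        pthread_create(&threads[i], NULL, my_thread_main, NULL);
    for (unsigned long i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);
    printf("atomic %lu\n", atomic_load(&atomic_counter));
    printf("non-atomic %lu\n", non_atomic_counter);
    free(threads);
    return 0;
}
....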
@ -15258,6 +15258,165 @@ The <<gem5-config-dot,`config.dot.svg`>> now looks like this but with 3 CPUs ins
.`config.dot.svg` for a system with two TimingSimpleCPU with caches.
image::{cirosantilli-media-base}gem5_config_TimingSimpleCPU_caches_2_CPUs_12c917de54145d2d50260035ba7fa614e25317a3.svg?sanitize=true[height=600]
Once again we focus on the shared function region `my_thread_main`, which is where the interesting cross-core memory collisions will be happening.
As a maybe-not-so-interesting warm-up, we have a look at the very first `my_thread_main` icache accesses:
....
93946000: Cache: system.cpu1.icache: access for ReadReq [8b0:8b3] IF miss
93946000: Cache: system.cpu1.icache: createMissPacket: created ReadCleanReq [880:8bf] IF from ReadReq [8b0:8b3] IF
93946000: Cache: system.cpu1.icache: handleAtomicReqMiss: Sending an atomic ReadCleanReq [880:8bf] IF
93946000: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[5] packet ReadCleanReq [880:8bf] IF
93946000: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[5] packet ReadCleanReq [880:8bf] IF SF size: 1 lat: 1
93946000: Cache: system.cpu0.icache: handleSnoop: snoop hit for ReadCleanReq [880:8bf] IF, old state is state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
93946000: Cache: system.cpu0.icache: new state is state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
93946000: DRAM: system.mem_ctrls: recvAtomic: ReadCleanReq 0x880
93946000: Cache: system.cpu1.icache: handleAtomicReqMiss: Receive response: ReadResp [880:8bf] IF in state 0
93946000: Cache: system.cpu1.icache: Block addr 0x880 (ns) moving from state 0 to state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
93946000: ExecEnable: system.cpu1: A0 T0 : @my_thread_main : sub sp, sp, #48 : IntAlu : D=0x0000003fffd6b9a0 flags=(IsInteger)
93946500: Cache: system.cpu1.icache: access for ReadReq [8b4:8b7] IF hit state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
93946500: Cache: system.cpu1.dcache: access for WriteReq [a19a8:a19af] hit state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0x14 set: 0x66 way: 0
93946500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+4 : str x0, [sp, #8] : MemWrite : D=0x0000007ffffefc70 A=0x3fffd6b9a8 flags=(IsInteger|IsMemRef|IsStore)
....
Now that we know how to read cache logs from <<gem5-event-queue-timingsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches>>, it is easier to understand what happened:
* the physical address of `my_thread_main` is 0x8b0, and the request for it is a miss, since this is the first time CPU1 goes near that region: CPU1 was previously executing standard library code far from our text segment
* CPU0 already has that cache line (0x880) in its cache at <<what-is-the-coherency-protocol-implemented-by-the-classic-cache-system-in-gem5,state E of MOESI>>, so the snoop hits and CPU0 moves to S. We can look up the logs to see exactly where CPU0 had previously read that address:
+
....
table: 1, dirty: 0
59135500: Cache: system.cpu0.icache: Block addr 0x880 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
59135500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[1] packet WritebackClean [8880:88bf]
59135500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[1] packet WritebackClean [8880:88bf] SF size: 0 lat: 1
59135500: DRAM: system.mem_ctrls: recvAtomic: WritebackClean 0x8880
59135500: ExecEnable: system.cpu0: A0 T0 : @frame_dummy : stp
....
* the request does touch RAM: it does not get served by the other cache directly. CPU1 is now also at state S for the block
* the second cache request from CPU1 is 4 bytes further ahead at 0x8b4, and this time it is of course a hit.
+
Since this is an STR, it also does a dcache access, in this case to 0xA19A8, near its stack pointer. That access is a hit, which is not surprising: stack accesses are basically the very first thing any C code does, and some setup code must already have run on CPU1 before `my_thread_main`.
Now let's look for the incremented integer address that is shared across threads. We know from <<detailed-gem5-analysis-of-how-data-races-happen>> that the read happens at `my_thread_main+36`, so searching for the first occurrence:
....
93952500: Cache: system.cpu1.icache: access for ReadReq [8d4:8d7] IF hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x23 way: 0
93952500: Cache: system.cpu1.dcache: access for ReadReq [2060:2063] miss
93952500: Cache: system.cpu1.dcache: createMissPacket: created ReadSharedReq [2040:207f] from ReadReq [2060:2063]
93952500: Cache: system.cpu1.dcache: handleAtomicReqMiss: Sending an atomic ReadSharedReq [2040:207f]
93952500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[6] packet ReadSharedReq [2040:207f]
93952500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[6] packet ReadSharedReq [2040:207f] SF size: 0 lat: 1
93952500: DRAM: system.mem_ctrls: recvAtomic: ReadSharedReq 0x2040
93952500: Cache: system.cpu1.dcache: handleAtomicReqMiss: Receive response: ReadResp [2040:207f] in state 0
93952500: Cache: system.cpu1.dcache: Block addr 0x2040 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
93952500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+36 : ldr x0, [x0] : MemRead : D=0x0000000000000000 A=0x411060 flags=(IsInteger|IsMemRef|IsLoad)
....
so we determine that the physical address of the shared variable is 0x2060, while the ExecEnable line shows its virtual address 0x411060. The access was a miss, and the line then went into E.
So we look ahead to the following accesses to that physical address, before CPU2 reaches that point of the code and starts making requests as well.
First there is the STR that follows the first LDR, which is of course a hit:
....
93954500: Cache: system.cpu1.dcache: access for WriteReq [2060:2063] hit state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
93954500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+52 : str x1, [x0] : MemWrite : D=0x0000000000000001 A=0x411060 flags=(IsInteger|IsMemRef|IsStore)
....
It found the line in E, so we presume that the write moves it to M. The second read then confirms that it was in M:
....
93964500: Cache: system.cpu1.dcache: access for ReadReq [2060:2063] hit state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
93964500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+36 : ldr x0, [x0] : MemRead : D=0x0000000000000001 A=0x411060 flags=(IsInteger|IsMemRef|IsLoad)
....
and so on.
Now let's jump to when CPU2 starts making requests.
The first time this happens is on its first LDR at:
....
94058500: Cache: system.cpu2.dcache: access for ReadReq [2060:2063] miss
94058500: Cache: system.cpu2.dcache: createMissPacket: created ReadSharedReq [2040:207f] from ReadReq [2060:2063]
94058500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Sending an atomic ReadSharedReq [2040:207f]
94058500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet ReadSharedReq [2040:207f]
94058500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet ReadSharedReq [2040:207f] SF size: 1 lat: 1
94058500: Cache: system.cpu1.dcache: handleSnoop: snoop hit for ReadSharedReq [2040:207f], old state is state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
94058500: Cache: system.cpu1.dcache: new state is state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
94058500: CoherentXBar: system.membus: recvAtomicBackdoor: Not forwarding ReadSharedReq [2040:207f]
94058500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Receive response: ReadResp [2040:207f] in state 0
94058500: Cache: system.cpu2.dcache: Block addr 0x2040 (ns) moving from state 0 to state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
94058500: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+36 : ldr x0, [x0] : MemRead : D=0x0000000000000009 A=0x411060 flags=(IsInteger|IsMemRef|IsLoad)
....
and from this we see:
* CPU1 moves from M to O
* CPU2 moves from I to S
It also appears that no DRAM was accessed since there are no logs for it, so did the XBar get the value directly from the other cache? TODO: why did the earlier `93946000: DRAM` read happen then, since CPU0 had the line when CPU1 asked for it?
The above log sequence also makes it clear that it is the XBar that maintains coherency: the CPU2 cache tells the XBar what it is doing, and the XBar then tells the caches of the other CPUs about it, which leads CPU1 to move to O.
Then CPU1 hits its LDR on O:
....
94060500: Cache: system.cpu1.dcache: access for ReadReq [2060:2063] hit state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
94060500: ExecEnable: system.cpu1: A0 T0 : @my_thread_main+36 : ldr x0, [x0] : MemRead : D=0x0000000000000009 A=0x411060 flags=(IsInteger|IsMemRef|IsLoad)
....
and then CPU2 writes, moving itself to M and CPU1 to I:
....
94060500: Cache: system.cpu2.dcache: access for WriteReq [2060:2063] hit state: 5 (S) valid: 1 writable: 0 readable: 1 dirty: 0 | tag: 0 set: 0x81 way: 0
94060500: Cache: system.cpu2.dcache: createMissPacket: created UpgradeReq [2040:207f] from WriteReq [2060:2063]
94060500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Sending an atomic UpgradeReq [2040:207f]
94060500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet UpgradeReq [2040:207f]
94060500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[10] packet UpgradeReq [2040:207f] SF size: 1 lat: 1
94060500: Cache: system.cpu1.dcache: handleSnoop: snoop hit for UpgradeReq [2040:207f], old state is state: d (O) valid: 1 writable: 0 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
94060500: Cache: system.cpu1.dcache: new state is state: 0 (I) valid: 0 writable: 0 readable: 0 dirty: 0 | tag: 0xffffffffffffffff set: 0x81 way: 0
94060500: CoherentXBar: system.membus: recvAtomicBackdoor: Not forwarding UpgradeReq [2040:207f]
94060500: Cache: system.cpu2.dcache: handleAtomicReqMiss: Receive response: UpgradeResp [2040:207f] in state 5
94060500: Cache: system.cpu2.dcache: Block addr 0x2040 (ns) moving from state 5 to state: f (M) valid: 1 writable: 1 readable: 1 dirty: 1 | tag: 0 set: 0x81 way: 0
94060500: ExecEnable: system.cpu2: A0 T0 : @my_thread_main+52 : str x1, [x0] : MemWrite : D=0x000000000000000a A=0x411060 flags=(IsInteger|IsMemRef|IsStore)
....
and so on, they just keep fighting over that address and changing one another's state.
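The transitions we have just observed can be summarized as a tiny snooping state machine. The following is only an illustrative sketch with made up names, not gem5 code; it just replays the dcache story of the shared counter:
....
#include <stdio.h>

enum state { I, S, E, O, M };
static const char *names[] = { "I", "S", "E", "O", "M" };

/* What a cache does to its own copy when it snoops another cache's request. */
static enum state snoop(enum state s, int other_is_write) {
    if (other_is_write)
        return I;      /* e.g. CPU1 O -> I when CPU2 sends the UpgradeReq */
    switch (s) {
    case M: return O;  /* e.g. CPU1 M -> O when CPU2 sends the ReadSharedReq */
    case E: return S;  /* e.g. CPU0 E -> S when CPU1 fetches the icache line */
    default: return s; /* S, O and I are unchanged by a snooped read */
    }
}

/* What the requesting cache ends up in after its own access. */
static enum state local_access(enum state s, int is_write, int someone_else_had_it) {
    if (is_write)
        return M;                       /* write hit or upgrade: always ends in M */
    if (s != I)
        return s;                       /* read hit: no transition */
    return someone_else_had_it ? S : E; /* read miss: S if shared, E if exclusive */
}

int main(void) {
    enum state cpu1 = I, cpu2 = I;
    cpu1 = local_access(cpu1, 0, 0); /* CPU1 LDR miss:         I -> E */
    cpu1 = local_access(cpu1, 1, 0); /* CPU1 STR hit:          E -> M */
    cpu1 = snoop(cpu1, 0);           /* CPU2's LDR snoops:     M -> O */
    cpu2 = local_access(cpu2, 0, 1); /* CPU2 LDR miss, shared: I -> S */
    cpu1 = snoop(cpu1, 1);           /* CPU2's STR snoops:     O -> I */
    cpu2 = local_access(cpu2, 1, 0); /* CPU2 STR:              S -> M */
    printf("cpu1 %s cpu2 %s\n", names[cpu1], names[cpu2]);
    return 0;
}
....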
====== gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs and Ruby
Now let's do the exact same thing we did for <<gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus>>, but with <<gem5-ruby-build,Ruby>> rather than the classic system.
Since we have fully understood coherency in that previous example, it should now be easier to understand what is going on with Ruby:
....
./run \
--arch aarch64 \
--cli-args '2 10' \
--cpus 3 \
--emulator gem5 \
--trace FmtFlag,DRAM,ExecAll,Ruby \
--userland userland/c/atomic.c \
-- \
--cpu-type AtomicSimpleCPU \
--ruby \
;
....
Note that now the `--trace Cache,XBar` flags have no effect, since Ruby entirely replaces those classic memory model components with its own, so we enable the `Ruby` trace flag instead. Note however that this flag is very verbose and produces about 10x more output than the classic memory experiment.
Also remember that ARM's default Ruby protocol is `'MOESI_CMP_directory'` (Ruby protocols are selected at gem5 build time, so changing the protocol requires a rebuild).
First we note that the output of the experiment is the same as with the classic memory system: the atomic counter reaches the full 2 * 10 = 20, while the non-atomic one lost an increment:
....
atomic 20
non-atomic 19
....
TODO
===== gem5 event queue MinorCPU syscall emulation freestanding example analysis
The events <<gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis,for the Atomic CPU>> were pretty simple: basically just ticks.
@ -17626,6 +17785,8 @@ https://stackoverflow.com/questions/12346487/what-do-each-memory-order-mean
TODO let's understand that fully one day.
This is the C++ version of the more general <<memory-consistency>> concept.
[[cpp-parallel-algorithms]]
===== C++ parallel algorithms
@ -23858,15 +24019,35 @@ fatal: fatal condition interrupts.size() != numThreads occurred: CPU system.cpu
https://en.wikipedia.org/wiki/Cache_coherence
In simple terms, when a certain group of caches of different CPUs is coherent, reads on one core always see the writes previously made by other cores. TODO: is it that strict, or just ordering? TODO: what about simultaneous reads and writes?
http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf mentions that:
Cache coherence:
- guarantees eventual write propagation
- guarantees a single order of all writes to same location
- no guarantees on when writes propagate
And notably it contrasts that with <<memory-consistency>>, which according to them is about ordering requirements on _different_ addresses.
Algorithms to keep the caches of different cores of a system coherent. Only matters for multicore systems.
The main goal of such systems is to reduce the number of messages that have to be sent on the coherency bus, and even more importantly, to memory (which passes first through the coherency bus).
The main software use case example to have in mind is that of multiple threads incrementing an atomic counter as in link:userland/cpp/atomic/std_atomic.cpp[], see also: <<atomic-cpp>>. Then, if one processors writes to the cache, other processors have to know about it before they read from that address.
Note that cache coherency only applies to memory read/write instructions that explicitly ask for coherency.
Even if caches are coherent, this is still not enough to avoid data race conditions, because coherence does not enforce atomicity of read-modify-write sequences. This is for example shown at: <<detailed-gem5-analysis-of-how-data-races-happen>>.
In most ISAs, such coherency-requesting instructions tend to be the minority, and are only used when something is going to modify memory that is known to be shared across threads. For example, an <<x86-thread-synchronization-primitives, x86 LOCK>> would be used to increment atomic counters that get incremented across several threads. Outside of those cases, cache coherency is not guaranteed, and behaviour is undefined.
==== Memory consistency
According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf "memory consistency" is about ordering requirements of different memory addresses.
This is represented explicitly in C++ for example <<cpp-memory-order>>.
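As a hedged illustration, not a file from this repo: coherence already guarantees that all cores agree on the order of writes to `flag` alone, but only the consistency model, surfaced in C11 through the `memory_order` arguments, says anything about whether the write to `data` has to become visible before the write to `flag`:
....
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int data, flag;

void *producer(void *arg) {
    (void)arg;
    atomic_store_explicit(&data, 42, memory_order_relaxed);
    /* With relaxed ordering, a weakly ordered ISA such as aarch64 may make
     * flag visible before data; memory_order_release would forbid that. */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    return NULL;
}

void *consumer(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&flag, memory_order_relaxed))
        ;
    /* Relaxed ordering legally allows 0 to be printed here; a release store
     * paired with an acquire load on flag would guarantee 42. */
    printf("data = %d\n", atomic_load_explicit(&data, memory_order_relaxed));
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
....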
===== Sequential Consistency
According to http://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture07-sc.pdf[], the strongest possible consistency, everything nicely ordered as you'd expect.
==== Can caches snoop data from other caches?
@ -24130,17 +24311,26 @@ Splits the Shared of <<msi-cache-coherence-protocol>> into a new Exclusive state
Exclusive is entered from Invalid after a PrRd, but only if the reply came from DRAM (<<can-caches-snoop-data-from-other-caches,or if we snooped that no one sent the reply to DRAM for us to read it>>)! If the reply came from another cache, we go directly to shared instead. It is this extra information that allows for the split of S.
This is why the simplified transition diagram shown in many places e.g.: https://upload.wikimedia.org/wikipedia/commons/c/c1/Diagrama_MESI.GIF is not a proper state machine: I can go to either S or E given a PrRd.
The advantage of this over MSI is that when we move from Exclusive to Modified, no invalidate message is required, reducing bus traffic: https://en.wikipedia.org/wiki/MESI_protocol#Advantages_of_MESI_over_MSI
This is a common case in read-modify-write loops (see the toy message count after these lists). On MSI, a core would:
* first do PrRd
* send BusRd (to move any M to S), get data, and go to Shared
* then PrWr must send BusUpgr to invalidate other Shared and move to M
With MESI:
* the PrRd could go to E instead of S depending on who services it
* if it does go to E, then the PrWr only moves it to M, there is no need to send BusUpgr because we know that no one else is in S
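To make the message count concrete, here is a toy tally for that read-then-write on a block that no other cache holds; this is only an illustrative sketch, unrelated to gem5's actual implementation:
....
#include <stdio.h>

int main(void) {
    int msi_msgs = 0, mesi_msgs = 0;

    /* PrRd miss, and no other cache has the block. */
    msi_msgs++;  /* BusRd: the line enters S, MSI cannot tell it is alone */
    mesi_msgs++; /* BusRd: the line enters E, because nobody else answered */

    /* PrWr on the same block. */
    msi_msgs++;  /* BusUpgr: must invalidate potential sharers before going to M */
                 /* MESI: silent E -> M transition, no bus message at all */

    printf("MSI: %d bus messages, MESI: %d\n", msi_msgs, mesi_msgs);
    return 0;
}
....
This silent E to M upgrade is also what we observed in the classic memory log of xref:gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus[xrefstyle=full]: CPU1's first LDR put the line in E and the following STR hit went straight to M without any `UpgradeReq` on the bus.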
gem5 12c917de54145d2d50260035ba7fa614e25317a3 has two <<gem5-ruby-build,Ruby>> MESI models implemented: `MESI_Two_Level` and `MESI_Three_Level`.
==== MOSI cache coherence protocol
https://en.wikipedia.org/wiki/MOSI_protocol The critical MSI vs MOSI section was a bit bogus though: https://en.wikipedia.org/w/index.php?title=MOSI_protocol&oldid=895443023 but I edited it :-)
In MSI, it feels wasteful that an M to S transition needs to flush to memory: why do we need to flush right now, given that even more caches now have that data? Why not wait until later and try to gain something from this deferral?
@ -24175,6 +24365,8 @@ https://en.wikipedia.org/wiki/MOESI_protocol
In gem5 9fc9c67b4242c03f165951775be5cd0812f2a705, MOESI is the default cache coherency protocol of the <<gem5-ruby-build,classic memory system>> as shown at xref:what-is-the-coherency-protocol-implemented-by-the-classic-cache-system-in-gem5[xrefstyle=full].
A good and simple example showing several MOESI transitions in the classic memory model can be seen at: xref:gem5-event-queue-atomicsimplecpu-syscall-emulation-freestanding-example-analysis-with-caches-and-multiple-cpus[xrefstyle=full].
gem5 12c917de54145d2d50260035ba7fa614e25317a3 has several <<gem5-ruby-build,Ruby>> MOESI models implemented: `MOESI_AMD_Base`, `MOESI_CMP_directory`, `MOESI_CMP_token` and `MOESI_hammer`.
== About this repo