see also: Section 19.18, “gem5 ARM platforms”.
This generates yet new separate images with new magic constants:
The number of cores is modified as explained at: Section 19.3.1, “Number of cores”.
taskset from the util-linux package sets the initial core affinity of a program:
arm and aarch64 configs present in the official ARM gem5 Linux kernel fork as described at: Section 19.9, “gem5 arm Linux kernel patches”. Some of the configs present there are added by the patches.
Jason’s magic x86_64 config: http://web.archive.org/web/20171229121642/http://www.lowepower.com/jason/files/config which is referenced at: http://web.archive.org/web/20171229121525/http://www.lowepower.com/jason/setting-up-gem5-full-system.html. QEMU boots with that by removing # CONFIG_VIRTIO_PCI is not set.
the driver for that hardware type, e.g. hard drive/SSD/virtio type.
Here, Linux does not know how to communicate with the given hardware to get bytes from it at all, so it cannot even see the device.
In simulation, the most important often missing one is virtio, which needs:
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_BLK=y
Bibliography:
In this section we will play with them.
The following files contain examples to access that data and test it out:
First get a virtual address to play with:
Now we can verify that linux/virt_to_phys_user.out gave the correct physical address in the following ways:
but the problem is that this method does not allow easily running a different script without running the boot again. The ./gem5.sh script works around that by using m5 readfile as explained further at: Section 19.6.3, “gem5 checkpoint restore and run a different script”.
Now you can play a fun little game with your friends:
Those problems should be insignificant if the benchmark runs for long enough, however.
Besides optimizing a program for a given CPU setup, chip developers can also do the inverse, and optimize the chip for a given benchmark!
The rabbit hole is likely deep, but let’s scratch a bit of the surface.
./run --arch arm --cpus 2 --emulator gem5
getconf _NPROCESSORS_CONF
In user mode simulation, QEMU v4.0.0 always shows the number of cores of the host, presumably because the thread switching uses host threads directly, which would make that harder to implement.
At 369a47fc6e5c2f4a7f911c1c058b6088f8824463 + 1 QEMU appears to spawn 3 host threads plus one for every new guest thread created. Remember that userland/posix/pthread_count.c spawns N + 1 total threads if you count the main thread.
Some info at: TimingSimpleCPU analysis #1 but highly TODO :-)
TODO These look promising:
Can be set across emulators with:
./run --memory 512M
get_avphys_pages() * sysconf(_SC_PAGESIZE) = 0x1D178000
AV means available and gives the free memory: https://stackoverflow.com/questions/14386856/c-check-available-ram/57659190#57659190
This can be explored pretty well from gem5 config.ini.
se.py just has a single DDR3_1600_8x8 DRAM with size given as Memory size and physical address starting at 0.
fs.py also has that DDR3_1600_8x8 DRAM, but can have more memory types. Notably, aarch64 has, as shown in RealView.py VExpress_GEM5_Base:
0x00000000-0x03ffffff: (      0 -  64 MiB) Boot memory (CS0)
0x04000000-0x07ffffff: ( 64 MiB - 128 MiB) Reserved
0x08000000-0x0bffffff: (128 MiB - 192 MiB) NOR FLASH0 (CS0 alias)
0x0c000000-0x0fffffff: (192 MiB - 256 MiB) NOR FLASH1 (Off-chip, CS4)
0x80000000-          : (  2 GiB -        ) DRAM
We place the entry point of our baremetal executables right at the start of DRAM with our Baremetal linker script.
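The mechanism can be sketched with a minimal GNU ld linker script. This is a simplified illustration with the VExpress_GEM5_Base DRAM base hardcoded, not the repository's actual Baremetal linker script:

```
ENTRY(_start)
SECTIONS
{
    /* VExpress_GEM5_Base: DRAM starts at 2 GiB. */
    . = 0x80000000;
    .text : { *(.text*) }
    .rodata : { *(.rodata*) }
    .data : { *(.data*) }
    .bss : { *(.bss*) }
}
```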
This can be seen indirectly with:
./getvar --arch aarch64 --emulator gem5 entry_address
which gives 0x80000000 (2147483648 in decimal), or more directly with some gem5 tracing:
./run \
  --arch aarch64 \
  --baremetal baremetal/arch/aarch64/no_bootloader/exit.S \
  --emulator gem5 \
  --trace ExecAll,-ExecSymbol \
  --trace-stdout \
;
and we see that the first instruction runs at 0x80000000:
0: system.cpu: A0 T0 : 0x80000000
TODO: what are the boot memory and NOR FLASH used for?
CommMonitor: you can place this SimObject in between two ports to get extra statistics about the packets that are going through.
It only works on timing requests, and does not seem to dump any memory values, only add extra statistics.
For example, the patch patches/manual/gem5-commmonitor-se.patch hacks a CommMonitor between the CPU and the L1 cache on top of gem5 1c3662c9557c85f0d25490dc4fbde3f8ab0cb350:
patch -d "$(./getvar gem5_source_dir)" -p 1 < patches/manual/gem5-commmonitor-se.patch
That patch was done largely by copying what fs.py --memcheck does with a MemChecker object.
You can then run with:
./run \
  --arch aarch64 \
  --emulator gem5 \
  --userland userland/arch/aarch64/freestanding/linux/hello.S \
  -- \
  --caches \
  --cpu-type TimingSimpleCPU \
;
and now we have some new extra histogram statistics such as:
system.cpu.dcache_mon.readBurstLengthHist::samples 1
One neat thing about this is that it is agnostic to the memory object type, so you don’t have to recode those statistics for every new type of object that operates on memory packets.
TODO These look promising:
and also: gem5-dist: https://publish.illinois.edu/icsl-pdgem5/
As of gem5 872cb227fdc0b4d60acc7840889d567a6936b6e1, fs.py defaults to 2GHz:
Analogous to QEMU:
Analogous to QEMU, on the first shell:
We are unable to use gdbserver because of networking as mentioned at: Section 14.3.1.3, “gem5 host to guest networking”
gem5’s secondary core GDB setup is a hack and spawns one gdbserver for each core in separate ports, e.g. 7000, 7001, etc.
Analogous to QEMU’s Snapshot, but better since it can be started from inside the guest, so we can easily checkpoint after a specific guest event, e.g. just before init is done.
since boot has already happened, and the parameters are already in the RAM of the snapshot.
In order to debug checkpoint restore bugs, this minimal setup using userland/freestanding/gem5_checkpoint.S can be handy:
A quick way to get a gem5 syscall emulation mode or full system checkpoint to observe is:
You want to automate running several tests from a single pristine post-boot state.
gem5 can switch to a different CPU model when restoring a checkpoint.
Besides switching CPUs after a checkpoint restore, fs.py also has the --fast-forward option to automatically run the script from the start on a less detailed CPU, and switch to a more detailed CPU at a given tick.
The in-tree util/cpt_upgrader.py is a tool to upgrade checkpoints taken from an older version of gem5 to be compatible with the newest version, so you can update gem5 without having to re-run the simulation that generated the checkpoints.
Remember that in the gem5 command line, we can either pass options to the script being run as in:
m5ops are magic instructions which lead gem5 to do magic things, like quitting or dumping stats.
m5 is a guest command line utility that is installed and run on the guest, that serves as a CLI front-end for the m5ops
This can be a good test m5ops since it executes very quickly.
End the simulation.
Makes gem5 dump one more statistics entry to the gem5 m5out/stats.txt file.
End the simulation with a failure exit event:
Send a guest file to the host. 9P is a more advanced alternative.
Read a host file pointed to by the fs.py --script option to stdout.
Ermm, just another m5 readfile that only takes integers and only from CLI options? Is this software so redundant?
Trivial combination of m5 readfile + execute the script.
gem5 allocates some magic instructions on unused instruction encodings for convenient guest instrumentation.
Let’s study how the gem5 m5 executable uses them:
include/gem5/asm/generic/m5ops.h also describes some annotation instructions.
https://gem5.googlesource.com/arm/linux/ contains an ARM Linux kernel fork with a few gem5 specific Linux kernel patches created by ARM Holdings on top of a few upstream kernel releases.
drm: Add component-aware simple encoder allows you to see images through VNC, see: Section 13.3, “gem5 graphic mode”
gem5: Add support for gem5’s extended GIC mode adds support for more than 8 cores, see: Section 19.3.1.2, “gem5 ARM full system with more than 8 cores”
Tested on 649d06d6758cefd080d04dc47fd6a5a26a620874 + 1.
We have observed that with the kernel patches, boot is 2x faster, falling from 1m40s to 50s.
When you run gem5, it generates an m5out directory at:
The files in that directory contain some very important information about the run, and you should become familiar with every one of them.
Contains the UART output, whether from the Linux kernel or from the baremetal system.
m5out/system.workload.dmesg: this file used to be called just m5out/system.dmesg, but the name was changed after the workload refactorings of March 2020.
This file contains important statistics about the run:
and after that the file size went down to 21KB.
We can make gem5 dump statistics in the HDF5 format by adding the magic h5:// prefix to the file name as in:
TODO
This describes the internals of the gem5 m5out/stats.txt file.
The m5out/config.ini file contains a very good high level description of the system:
Modifying the config.ini file manually does nothing since it gets overwritten every time.
The m5out/config.dot file contains a graphviz .dot file that provides a simplified graphical view of a subset of the gem5 config.ini.
We use the m5term in-tree executable to connect to the terminal instead of a direct telnet.
We have made a crazy setup that allows you to just cd into submodules/gem5, and edit Python scripts directly there.
By default, we use configs/example/fs.py script.
But can the people from the project be convinced of that?
These are just very small GTest tests that test a single class in isolation, they don’t run any executables.
This section is about running the gem5 in-tree tests.
This error happens when the following instruction limits are reached:
In order to use different build options, you might also want to use gem5 build variants to keep the build outputs separate from one another.
How to use it in LKMC: Section 18.8, “Debug the emulator”.
./build-gem5 --gem5-build-type fast
Exiting @ tick 18446744073709551615 because simulate() limit reached
Profiling builds as of 3cea7d9ce49bda49c50e756339ff1287fd55df77 both use -g -O3, and disable asserts and logging like the gem5 fast build, and:
TODO test properly, benchmark vs GCC.
If gem5 appears to have a C++ undefined behaviour bug, which is often very difficult to track down, you can try to build it with the following extra SCons options:
gem5 has two types of memory system:
Tested in gem5 d7d9bc240615625141cd6feddbadd392457e49eb.
This is the simplest of all protocols, and therefore the first one you should study to learn how Ruby works.
Crossbar, or XBar in the code, is the default CPU interconnect that gets used by fs.py if --ruby is not given.
Python 3 support was mostly added in 2019 Q3 at around a347a1a68b8a6e370334be3a1d2d66675891e0f1, but remained buggy for some time afterwards.
gem5 has a few in-tree CPU models for different purposes.
From this we see that there are basically only 4 C++ CPU models in gem5: Atomic, Timing, Minor and O3. All others are basically parametrizations of those base types.
BaseSimpleCPU: simple abstract CPU without a pipeline.
AtomicSimpleCPU: the default one. Memory accesses happen instantaneously. The fastest simulation except for KVM, but not realistic at all.
TimingSimpleCPU: memory accesses are realistic, but the CPU has no pipeline. The simulation is faster than detailed models, but slower than AtomicSimpleCPU.
MinorCPU: generic in-order superscalar core.
DerivO3CPU: generic out-of-order core. "O3" stands for "Out Of Order"!
DerivO3CPU pipeline stages: mentioned at: http://www.m5sim.org/Visualization
The gem5 platform is selectable with the --machine option, which is named after the analogous QEMU -machine option, and which sets gem5’s --machine-type.
Certain ISAs like ARM have bootloaders that are automatically run before the main image to setup basic system state.
Internals under other sections:
CPU0 already has that cache line (0x880) in its cache at state E of MOESI, so it snoops and moves to S. We can look up the logs to see exactly where CPU0 had previously read that address:
59135500: Cache: system.cpu0.icache: Block addr 0x880 (ns) moving from state 0 to state: 7 (E) valid: 1 writable: 1 readable: 1 dirty: 0 | tag: 0 set: 0x22 way: 0
59135500: CoherentXBar: system.membus: recvAtomicBackdoor: src system.membus.slave[1] packet WritebackClean [8880:88bf] SF size: 0 lat: 1
59135500: DRAM: system.mem_ctrls: recvAtomic: WritebackClean 0x8880

and so on, they just keep fighting over that address and changing one another’s state.
19.21.4.5. gem5 event queue TimingSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs and Ruby
Now let’s do the exact same we did for gem5 event queue AtomicSimpleCPU syscall emulation freestanding example analysis with caches and multiple CPUs, but with Ruby rather than the classic system, and with TimingSimpleCPU (atomic does not work with Ruby).
Since we have fully understood coherency in that previous example, it should now be easier to understand what is going on with Ruby:
  --trace FmtFlag,DRAM,ExecAll,Ruby \
  --userland userland/c/atomic.c \
  -- \
  --cpu-type TimingSimpleCPU \
  --ruby \
;

TODO
The events for the Atomic CPU were pretty simple: basically just ticks.
TODO like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard but with the hazard.
Like gem5 event queue MinorCPU syscall emulation freestanding example analysis, but even more complex, since it is for the gem5 DerivO3CPU!
This section and children are tested at LKMC 144a552cf926ea630ef9eadbb22b79fe2468c456.
Let’s have a look at the arguably simplest example userland/arch/aarch64/freestanding/linux/hazardless.S.
Now let’s do the same as in gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazardless but with a hazard: userland/arch/aarch64/freestanding/linux/hazard.S.
Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: hazard but now with an LDR stall: userland/arch/aarch64/freestanding/linux/stall.S.
Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall but now with an LDR stall: userland/arch/aarch64/freestanding/linux/stall-gain.S.
Like gem5 event queue DerivO3CPU syscall emulation freestanding example analysis: stall-gain but now with some dependencies after the LDR: userland/arch/aarch64/freestanding/linux/stall-hazard4.S.
Now let’s try to see some Speculative execution in action with userland/arch/aarch64/freestanding/linux/speculative.S.
libguestfs: https://serverfault.com/questions/246835/convert-directory-to-qemu-kvm-virtual-disk-image/916697#916697, in particular vfs-minimum-size
use methods described at: Section 19.6.3, “gem5 checkpoint restore and run a different script” instead of putting builds on the root filesystem
The cache sizes were chosen to match the host 2017 Lenovo ThinkPad P51 to improve the comparison. Ideally we should also use the same standard library.
Note that this will take a long time, and will produce a humongous ~40GB stats file as explained at: Section 19.10.3.2, “gem5 only dump selected stats”
Sources: