Characterize the AWS Graviton memory subsystem using ASCT

If you've ever deployed memory-bound workloads on AWS Graviton, you know that CPU compute speed is only part of the story. Another factor in real-world performance is how efficiently your code accesses the memory subsystem, specifically the cache hierarchy, interconnects, and physical DRAM.

In this article, I will walk through how to use the Arm System Characterization Tool (ASCT) to analyze the memory subsystem of AWS Graviton2 (c6g) and Graviton4 (c8g) instances. You will learn how to explore the CPU core topology, verify cache structures, and run low-level benchmarks to measure memory latency and single-core streaming bandwidth. Comparing results across Graviton generations provides practical insights into how architectural changes, such as larger caches and DDR5 memory, impact application performance.

Identify AWS Graviton CPU topology, cache hierarchy, and NUMA configuration

Memory subsystem impact on AWS Graviton performance

When dealing with memory-bound applications, memory behavior is often the primary driver of overall performance. The CPU-side memory subsystem (caches, interconnects, and DRAM) directly determines how fast instructions can retrieve their data. Knowing the specific latencies, streaming bandwidth, and cache resource sharing on your EC2 instances is useful to diagnose performance bottlenecks and tune your code for Graviton.

Identify CPU, cache, and NUMA topology

Before running benchmarks, you need to understand how the target systems are organized. Factors like the number of cores, cluster grouping, DRAM generation, and even kernel versions can change how memory behaves. Having this topological map beforehand is key to making sense of the benchmark numbers collected later.

For this analysis, I am using two AWS Graviton instances:

• AWS Graviton2 (c6g.16xlarge) instance with Arm Neoverse N1 cores.

• AWS Graviton4 (c8g.16xlarge) instance with Arm Neoverse V2 cores.

Both test systems run Ubuntu 24.04 with 64 CPUs, 128 GB RAM, and 32 GB of storage.

If you want to follow along with the same setup, you can launch these two EC2 instances using the AWS Console or CLI:

• A c6g.16xlarge running Ubuntu 24.04 (Graviton2)

• A c8g.16xlarge running Ubuntu 24.04 (Graviton4)

Make sure you can SSH into both instances.

Collect basic system information

When comparing multiple instances, setting a descriptive hostname on each system makes it easier to keep track of outputs when using $(hostname) in file paths and terminal commands.

On the Graviton4 instance, run:

sudo hostnamectl set-hostname graviton4-c8g

On the Graviton2 instance, run:

sudo hostnamectl set-hostname graviton2-c6g

Log out and log back in for the new hostname to appear in your shell prompt.

Now, record the kernel version and operating system by running the following command on your systems:

uname -a && cat /etc/os-release

Both systems should return a similar output (note that your specific kernel version might differ depending on recent packages and updates):

Linux graviton2-c6g 6.17.0-1007-aws #7~24.04.1-Ubuntu SMP Thu Jan 22 20:37:30 UTC 2026 aarch64 aarch64 aarch64 GNU/Linux
PRETTY_NAME="Ubuntu 24.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.4 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

Next, print out the CPU information:

lscpu

The lscpu output provides the CPU model, core count, threads per core, sockets, and NUMA node configurations. Here is the output from the Graviton2 instance:

Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 1
Stepping: r3p1
BogoMIPS: 243.75
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Caches (sum of all):
L1d: 4 MiB (64 instances)
L1i: 4 MiB (64 instances)
L2: 64 MiB (64 instances)
L3: 32 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-63

The Graviton4 instance contains Neoverse-V2 cores:

Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: ARM
Model name: Neoverse-V2
Model: 1
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 1
Stepping: r0p1
BogoMIPS: 2000.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
Caches (sum of all):
L1d: 4 MiB (64 instances)
L1i: 4 MiB (64 instances)
L2: 128 MiB (64 instances)
L3: 36 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-63

These instances represent two generations of Arm Neoverse server cores. Graviton2 uses Neoverse N1 cores featuring a private 1 MB L2 cache per core and a 32 MB shared L3. Graviton4 uses Neoverse V2 cores, doubling the private L2 cache to 2 MB and sharing a 36 MB L3. These differences in cache capacity and memory architecture directly impact latency and bandwidth results.

Notice how much longer the Flags output is for Neoverse V2. Because Neoverse V2 is based on the Armv9 architecture, it adds extensions like SVE2, BF16, and I8MM which are absent on the Armv8-based Neoverse N1.

Check DRAM configuration

Confirm the memory size with free -h:

free -h

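Both instances are provisioned with 128 GB of RAM. free reports slightly less (123Gi here) because part of the memory is reserved by the kernel and firmware. The output is similar on both systems: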

total used free shared buff/cache available
Mem: 123Gi 1.6Gi 121Gi 1.2Mi 1.2Gi 121Gi

Graviton2 instances use DDR4 memory, while Graviton4 instances upgrade to DDR5. DDR5 provides higher bandwidth per channel than DDR4, though it typically carries a slightly higher access latency baseline in nanoseconds.

Also note that both systems operate as a single NUMA node. This means all 64 cores have uniform access to all physical memory. On multi-socket systems with multiple NUMA nodes, access latency depends on which node the data resides on.
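The single-node layout can also be confirmed from sysfs. The following is a minimal sketch, assuming the standard Linux layout where each NUMA node appears as a nodeN directory under /sys/devices/system/node. The function name count_numa_nodes and its optional directory argument are illustrative, not part of any tool used in this article.

```shell
#!/usr/bin/env bash
# count_numa_nodes [DIR]: count nodeN directories under DIR.
# DIR defaults to the standard sysfs location (an assumption about your kernel).
count_numa_nodes() {
  root="${1:-/sys/devices/system/node}"
  count=0
  for d in "$root"/node[0-9]*; do
    if [ -d "$d" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

# On the two instances described here this should agree with lscpu's "NUMA node(s)" field.
count_numa_nodes
```

A result greater than 1 would mean that the latency you measure depends on which node the benchmark memory is allocated from.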

Explore the core and cluster topology

Arm-based systems typically group cores into clusters that share specific cache levels. Understanding these cluster boundaries is critical because latency and bandwidth behavior will shift when threads cross them.

You can use lscpu -e to inspect the per-core details:

lscpu -e

The output layout is identical on both systems:

CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0 0 0 0 0:0:0:0 yes
1 0 0 1 1:1:1:0 yes
2 0 0 2 2:2:2:0 yes
3 0 0 3 3:3:3:0 yes
...
63 0 0 63 63:63:63:0 yes

On both systems, every core has its own unique L1d, L1i, and L2 index, confirming these caches are private. Meanwhile, the L3 index is 0 for all cores, confirming a single shared Last-Level Cache (LLC) across all 64 cores.
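Reading 64 rows by eye is error-prone, so the same check can be scripted. This is a sketch that assumes the lscpu -e header contains a column literally named L1d:L1i:L2:L3, as in the output above. The function name count_cache_ids is made up for illustration.

```shell
#!/usr/bin/env bash
# count_cache_ids: read `lscpu -e` output on stdin and count the distinct
# L2 and L3 IDs found in the L1d:L1i:L2:L3 column.
count_cache_ids() {
  awk '
    # Locate the cache column by its header name instead of a fixed position.
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "L1d:L1i:L2:L3") col = i; next }
    col && $col ~ /:/ {
      split($col, id, ":")
      if (!(id[3] in l2)) { l2[id[3]] = 1; n2++ }
      if (!(id[4] in l3)) { l3[id[4]] = 1; n3++ }
    }
    END { printf "L2 instances: %d, L3 instances: %d\n", n2, n3 }
  '
}

# Demo on the abbreviated rows shown in this article (five CPUs listed).
count_cache_ids <<'EOF'
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0 0 0 0 0:0:0:0 yes
1 0 0 1 1:1:1:0 yes
2 0 0 2 2:2:2:0 yes
3 0 0 3 3:3:3:0 yes
63 0 0 63 63:63:63:0 yes
EOF
```

On a live system you would pipe the real output in with lscpu -e | count_cache_ids. One L2 instance per core and a single L3 instance matches the private-L2, shared-L3 layout described above.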

Visualize the topology with hwloc

The hwloc utility generates a tree diagram of how cores, caches, and memory are arranged. Install it using the package manager:

sudo apt-get install -y hwloc

Generate a topology PNG:

hwloc-ls --of png > topology.png

Here is the topology diagram from a Graviton2 c6g instance:

And here is the diagram from a Graviton4 c8g instance:

These diagrams outline the cache layout, showing each core's private L1 and L2 caches and how they connect to the shared L3 cache.

Enumerate caches from sysfs

To read cache properties directly from the Linux kernel, you can probe sysfs.

Save the following script as cache.sh:

#!/usr/bin/env bash
# Print level, type, size, and sharing for every cache visible to CPU 0.
for c in /sys/devices/system/cpu/cpu0/cache/index*; do
  echo "=== $(basename "$c") ==="
  echo "Level: $(cat "$c/level")"
  echo "Type: $(cat "$c/type")"
  echo "Size: $(cat "$c/size")"
  echo "Shared CPU list: $(cat "$c/shared_cpu_list")"
  echo
done

Run the script on each instance:

bash ./cache.sh

On Graviton2 (c6g), the output is:

=== index0 ===
Level: 1
Type: Data
Size: 64K
Shared CPU list: 0

=== index1 ===
Level: 1
Type: Instruction
Size: 64K
Shared CPU list: 0

=== index2 ===
Level: 2
Type: Unified
Size: 1024K
Shared CPU list: 0

=== index3 ===
Level: 3
Type: Unified
Size: 32768K
Shared CPU list: 0-63

On Graviton4 (c8g), the output is:

=== index0 ===
Level: 1
Type: Data
Size: 64K
Shared CPU list: 0

=== index1 ===
Level: 1
Type: Instruction
Size: 64K
Shared CPU list: 0

=== index2 ===
Level: 2
Type: Unified
Size: 2048K
Shared CPU list: 0

=== index3 ===
Level: 3
Type: Unified
Size: 36864K
Shared CPU list: 0-63

On both systems, index3 is the L3 cache, and its shared_cpu_list of 0-63 verifies that it is shared among all 64 cores. The L3 is 32 MB on Graviton2 and 36 MB on Graviton4.

Reference system profiles

Here is a summary of the system specifications:

Property       Graviton2 (c6g)   Graviton4 (c8g)
-------------------------------------------------
CPU model      Neoverse N1       Neoverse V2
Core count     64                64
L1D / L1I      64 KB / 64 KB     64 KB / 64 KB
L2 (private)   1 MB              2 MB
L3 (shared)    32 MB             36 MB
DRAM type      DDR4              DDR5

With the physical layout of these processors mapped out, you can analyze the cache hierarchies and how their configurations affect application latency and bandwidth.

Analyze AWS Graviton cache hierarchy and performance characteristics

Cache levels and performance cliffs

Every memory fetch in your application goes through a search from closest to farthest cache level. If a data request misses in L1 and L2, it falls back to L3, and eventually to physical DRAM.

Because cache access speed scales inversely with capacity (L1 is extremely fast but small; L3 is larger but slower), knowing where these boundaries lie helps you identify where performance cliffs occur as your workload's active memory footprint (working set) grows.

Cache levels on AWS Graviton systems

Both Graviton2 and Graviton4 use 4-way L1 caches with 64-byte lines and 8-way L2 caches with 64-byte lines.

The primary architectural change is the L2 cache size. Graviton2 has a 1 MB private L2 cache per core. Graviton4 doubles this to a 2 MB private L2 cache. This doubled private cache size is beneficial for multi-threaded workloads, as it keeps more data closer to the execution pipeline and minimizes cache-coherency traffic traveling over the shared L3 interconnect.
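To make the effect of the larger L2 concrete, the sketch below names the smallest cache level that can hold a given working set. It is a deliberately simplified capacity check: it ignores that the L3 is shared with other cores and that code and unrelated data also occupy cache. The function name fits_level is hypothetical, and the sizes are the sysfs values reported earlier.

```shell
#!/usr/bin/env bash
# fits_level WS_KIB L1_KIB L2_KIB L3_KIB
# Print the smallest level whose capacity covers a working set of WS_KIB.
fits_level() {
  if   [ "$1" -le "$2" ]; then echo L1
  elif [ "$1" -le "$3" ]; then echo L2
  elif [ "$1" -le "$4" ]; then echo L3
  else echo DRAM
  fi
}

# A 1.5 MB (1536K) per-thread working set:
fits_level 1536 64 1024 32768   # Graviton2 sizes, prints L3
fits_level 1536 64 2048 36864   # Graviton4 sizes, prints L2
```

The same working set that spills into the shared L3 on Graviton2 stays in the private L2 on Graviton4.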

Querying structured cache properties

You can use a shell script to generate a structured summary of the associativity and line sizes across all levels.

Save this script as cache2.sh:

#!/usr/bin/env bash
# Summarize size, associativity, line size, and sharing for each cache of CPU 0.
# Add more CPU numbers to the list to inspect other cores.
for cpu in 0; do
  echo "=== CPU $cpu ==="
  for idx in /sys/devices/system/cpu/cpu${cpu}/cache/index*; do
    level=$(cat "$idx/level")
    type=$(cat "$idx/type")
    size=$(cat "$idx/size")
    ways=$(cat "$idx/ways_of_associativity")
    line=$(cat "$idx/coherency_line_size")
    shared=$(cat "$idx/shared_cpu_list")
    echo " L${level} ${type}: ${size}, ${ways}-way, ${line}B line, shared with CPUs: ${shared}"
  done
done

Run the script on each instance:

bash ./cache2.sh

Graviton2 output:

=== CPU 0 ===
L1 Data: 64K, 4-way, 64B line, shared with CPUs: 0
L1 Instruction: 64K, 4-way, 64B line, shared with CPUs: 0
L2 Unified: 1024K, 8-way, 64B line, shared with CPUs: 0
L3 Unified: 32768K, 16-way, 64B line, shared with CPUs: 0-63

Graviton4 output:

=== CPU 0 ===
L1 Data: 64K, 4-way, 64B line, shared with CPUs: 0
L1 Instruction: 64K, 4-way, 64B line, shared with CPUs: 0
L2 Unified: 2048K, 8-way, 64B line, shared with CPUs: 0
L3 Unified: 36864K, 12-way, 64B line, shared with CPUs: 0-63

Three attributes to pay attention to:

• Associativity: Higher associativity reduces set conflict misses but can raise base lookup latency.

• Line size: 64 bytes on both systems, as on most modern CPUs. It is the unit in which data moves between cache levels, so it sets the granularity that data alignment and padding should target.

• Shared CPU list: Shows exactly which cores share, and therefore compete for, the cache level.

Microarchitectural concepts

Before running the benchmarks, it is helpful to keep a few CPU concepts in mind:

Cache line and spatial locality

When you fetch a single byte from memory, the hardware retrieves the entire 64-byte block (a cache line) that contains it. If your code accesses data sequentially, this improves efficiency because the rest of that line is already in L1 when you need it. If your code jumps around randomly, you waste bandwidth, since each access fetches 64 bytes but may use only one.
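A rough way to quantify this is the fraction of each fetched line that a strided scan actually uses. The sketch below assumes aligned elements no larger than a cache line and a constant stride. The function name line_utilization is illustrative.

```shell
#!/usr/bin/env bash
# line_utilization ELEM_BYTES STRIDE_BYTES [LINE_BYTES]
# Percentage of each fetched cache line that is used when reading ELEM_BYTES
# every STRIDE_BYTES. LINE_BYTES defaults to 64.
line_utilization() {
  awk -v e="$1" -v s="$2" -v l="${3:-64}" 'BEGIN {
    span = (s < e) ? e : s     # accesses cannot be denser than the element size
    if (span > l) span = l     # past one line, every access pulls in a new line
    printf "%.1f\n", e / span * 100
  }'
}

line_utilization 8 8     # sequential 8-byte reads: 100.0
line_utilization 8 64    # one 8-byte read per line: 12.5
line_utilization 1 64    # one byte per line: 1.6
```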

Associativity and conflict misses

Caches are organized into sets. A 4-way set associative cache can hold up to 4 distinct cache lines that map to the same set. If your access pattern keeps 5 active lines mapped to the same set, one of them is evicted on every round, causing conflict misses even if the rest of the cache is completely empty.
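The number of sets follows from the values sysfs reports: sets = size / (ways × line size). The sketch below also prints the address distance at which two lines land in the same set, assuming plain modulo indexing on the address. Real hardware, especially a shared L3, may hash addresses, so treat the stride as an approximation. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# cache_sets SIZE_KIB WAYS LINE_BYTES: number of sets in the cache.
cache_sets() {
  echo $(( $1 * 1024 / ($2 * $3) ))
}

# same_set_stride SIZE_KIB WAYS: bytes between addresses that share a set,
# assuming simple modulo indexing (an assumption, not a documented property).
same_set_stride() {
  echo $(( $1 * 1024 / $2 ))
}

cache_sets 64 4 64        # L1 on both systems: 256 sets
same_set_stride 64 4      # lines 16384 bytes apart collide in L1
cache_sets 1024 8 64      # Graviton2 L2: 2048 sets
cache_sets 2048 8 64      # Graviton4 L2: 4096 sets
```

Under that assumption, walking an array with a 16 KB stride would touch only one L1 set, so the fifth distinct line already evicts the first.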

Hardware prefetching

Modern Arm cores have sophisticated prefetchers that detect sequential or strided memory access loops and pre-load data into caches before the instructions request it. To measure the true raw latency of the memory hierarchy, the pointer-chase benchmark is randomized to prevent prefetcher interference.

Cycles vs. Nanoseconds

When comparing systems with different clock speeds, cycle counts can be misleading. A 3.0 GHz core with a 12-cycle L2 latency accesses L2 in 4.0 ns. A 2.8 GHz core with the same 12-cycle latency takes 4.3 ns. Measuring in nanoseconds normalizes the comparison when looking at different architectures.
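The conversion is a single division, latency in ns = cycles / frequency in GHz. The helper names below are made up for illustration.

```shell
#!/usr/bin/env bash
# cycles_to_ns CYCLES GHZ: latency in nanoseconds for a given cycle count.
cycles_to_ns() {
  awk -v c="$1" -v ghz="$2" 'BEGIN { printf "%.1f\n", c / ghz }'
}

# ns_to_cycles NS GHZ: the inverse, useful when comparing against a TRM figure.
ns_to_cycles() {
  awk -v ns="$1" -v ghz="$2" 'BEGIN { printf "%.1f\n", ns * ghz }'
}

cycles_to_ns 12 3.0   # 4.0
cycles_to_ns 12 2.8   # 4.3
ns_to_cycles 4.0 3.0  # 12.0
```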

💡 Tip: Arm publishes technical documentation for all cores. You can search the Arm Developer documentation portal for "Neoverse N1 TRM" or "Neoverse V2 TRM" to compare your system observations with the official microarchitecture specifications.

With the cache architecture understood, you can use the Arm System Characterization Tool (ASCT) to measure these latency steps empirically.

Measure AWS Graviton cache and memory latency using the ASCT pointer chase

The pointer-chase technique

Measuring raw memory latency without prefetcher interference requires a technique called pointer chasing.

A pointer chase uses a linked list where each element's value is the memory address of the next element. The CPU cannot resolve the address of node N+1 until the load for node N completes. This dependent chain of loads stops out-of-order execution from overlapping the accesses, and placing the nodes in random order stops the hardware prefetchers from predicting the next address. Together, these expose the true round-trip time to each level of the memory hierarchy.
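The structure of such a chain can be illustrated with a small script. This is only a model of the data structure, not of ASCT's implementation, and it is far too slow to measure latency: interpreter overhead dwarfs any memory access time. It builds a random single cycle over N node indices and follows it for a number of dependent hops. The function name chase and the default seed are illustrative.

```shell
#!/usr/bin/env bash
# chase N STEPS [SEED]
# Build a random single-cycle chain over nodes 0..N-1, follow it for STEPS
# hops starting at node 0, and print the node reached.
chase() {
  awk -v n="$1" -v steps="$2" -v seed="${3:-1}" 'BEGIN {
    srand(seed)
    for (i = 0; i < n; i++) order[i] = i
    # Fisher-Yates shuffle of the visiting order.
    for (i = n - 1; i > 0; i--) {
      j = int(rand() * (i + 1))
      t = order[i]; order[i] = order[j]; order[j] = t
    }
    # Link every node to its successor in the shuffled order: one big cycle.
    for (i = 0; i < n; i++) nxt[order[i]] = order[(i + 1) % n]
    p = 0
    # Each hop needs the result of the previous one, like a dependent load.
    for (s = 0; s < steps; s++) p = nxt[p]
    print p
  }'
}

chase 64 64   # a full lap over all 64 nodes returns to node 0
```

Because the chain is one cycle through every node, a full lap touches the whole buffer exactly once, which is why the buffer size determines the cache level being measured.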

Measure cache and memory latency with ASCT

Before running the benchmark, you need to install the Arm System Characterization Tool (ASCT) on your systems. Follow the ASCT install guide to install it.

Once installed, you can verify it by running:

asct version

ASCT's latency-sweep benchmark automates a pointer chase, sweeping memory allocation sizes from 128 bytes to 1 GiB using randomized allocations. It uses 1 GiB huge pages to prevent TLB page table walks from inflating the results, and automatically outputs optimal data sizes and latencies for each cache tier.

Since ASCT configures huge pages and pins threads to specific CPU cores for consistency, you must run it with sudo.

Run the sweep on both systems and save the results into hostname-labeled folders:

sudo asct run latency-sweep --output-dir latency_results_$(hostname)

The output on the Graviton2 (c6g) instance:

Latencies at different levels of cache
--------------------------------------
Lower Bound Upper Bound Optimum Datasize Latency [ns]
L1 128 64K 32.0625K 1.6
L2 64K 512K 288K 5.4
LLC 1M 32M 16.5M 28.8
DRAM 64M 1G 544M 95.0

Here is the latency graph generated by ASCT for Graviton2:

The Graviton4 (c8g) results are shown below:

Latencies at different levels of cache
--------------------------------------
Lower Bound Upper Bound Optimum Datasize Latency [ns]
L1 128 64K 32.0625K 1.4
L2 256K 1M 640K 4.0
LLC 8M 8M 8M 21.2
DRAM 64M 1G 544M 110.0

Here is the latency graph for Graviton4:

The steps where the latency jumps fall close to the physical cache boundaries mapped earlier. This gives an empirical check that the benchmark measurements are consistent with the hardware specifications.

Cache latency analysis

Analyzing the results shows clear performance profiles:

• L1 Latency: Very close on both systems (1.4 ns on Graviton4 vs 1.6 ns on Graviton2).

• L2 Latency: Graviton4 shows a 26% latency improvement (4.0 ns vs 5.4 ns). Even though Graviton4 has a larger L2 cache (2 MB vs 1 MB), Neoverse V2's updated cache pipeline design makes it faster.

• LLC Latency: Graviton4 shows a 26% improvement here as well (21.2 ns vs 28.8 ns), reflecting a faster interconnect and L3 cache structure.

• DRAM Latency: Graviton4 has a higher unloaded baseline latency than Graviton2 (110 ns vs 95 ns, about 16% more). This is expected because Graviton4 uses DDR5 memory, which trades some access latency for higher throughput.

💡 Note on DDR5 Latency: The higher unloaded DRAM latency on Graviton4 is a known characteristic of the transition from DDR4 to DDR5. In exchange, DDR5 introduces structural improvements such as two independent 32-bit subchannels per DIMM and doubled bank groups. Under real-world multi-threaded workloads, these let the memory subsystem service more concurrent requests under contention, which delays the "latency wall" that occurs when a DDR4 bus becomes saturated.
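The percentages quoted above can be reproduced from the ASCT tables with one formula, (Graviton2 - Graviton4) / Graviton2 × 100. A positive result means Graviton4 is faster. The helper name pct_lower is illustrative.

```shell
#!/usr/bin/env bash
# pct_lower BASE CUR: how much lower CUR is than BASE, in percent.
# A negative result means CUR is higher than BASE.
pct_lower() {
  awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.1f\n", (base - cur) / base * 100 }'
}

pct_lower 5.4 4.0      # L2:   25.9
pct_lower 28.8 21.2    # LLC:  26.4
pct_lower 95.0 110.0   # DRAM: -15.8 (Graviton4 is slower here)
```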

Compare results side-by-side

ASCT has a diff utility that compares two output directories and calculates the delta percentages. Copy both folders onto the same machine and run:

asct diff latency_results_graviton2-c6g/ latency_results_graviton4-c8g/

This prints a table detailing the delta percentages for each cache and system configuration field.

Measure AWS Graviton single-core memory bandwidth with ASCT

Bandwidth vs. latency characteristics

While latency measures the duration of a single memory access, bandwidth represents the volume of data the datapath can sustain when the CPU processes multiple concurrent requests. A slightly higher baseline latency is often acceptable if the core maintains enough outstanding memory requests in flight to saturate the interface. Measuring both metrics provides a complete profile of each cache level.

Measure single-core bandwidth with ASCT

ASCT's bandwidth-sweep benchmark reuses the cache boundary sizes found by the latency run. It sweeps through those data sizes and records the peak throughput (in GB/s) that a single core achieves at each level.

Run the sweep on both systems:

sudo asct run bandwidth-sweep --output-dir bandwidth_results_$(hostname)

The Graviton2 (c6g) output:

Bandwidth at different levels of cache
--------------------------------------
Datasize Used Level Bandwidth [GB/s]
32.0625K L1 159.4
288K L2 73.4
16.5M LLC 35.9
544M DRAM 20.9

The bandwidth graph:

The Graviton4 (c8g) output:

Bandwidth at different levels of cache
--------------------------------------
Datasize Used Level Bandwidth [GB/s]
32.0625K L1 321.0
640K L2 95.1
8M LLC 78.0
544M DRAM 37.0

The bandwidth graph:

Interpret bandwidth benchmark results

Comparing the single-core bandwidth results across the generations shows a clear generational leap:

• L1 Bandwidth: Graviton4 delivers double the throughput of Graviton2 (321.0 GB/s vs 159.4 GB/s). This showcases Neoverse V2's improved load/store execution bandwidth.

• L2 Bandwidth: Graviton4 shows a roughly 30% improvement (95.1 GB/s vs 73.4 GB/s) due to the wider L2 execution paths.

• LLC Bandwidth: Graviton4 more than doubles LLC throughput (78.0 GB/s vs 35.9 GB/s), showing the efficiency of the upgraded mesh interconnect.

• Single-Core DRAM Bandwidth: Graviton4 reaches 37.0 GB/s compared to Graviton2's 20.9 GB/s, about 1.8 times as much. Because Neoverse V2 has a deeper queue structure, it can keep more memory requests active in parallel to take advantage of the faster DDR5 channels.
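The link between bandwidth, latency, and requests in flight can be estimated with Little's law: concurrency = throughput × latency. Dividing by the 64-byte line size gives the number of cache lines that must be outstanding to sustain a given bandwidth. The sketch below plugs in the DRAM figures from the two ASCT runs. It is a rough lower bound, because the pointer-chase latency is unloaded (latency under streaming load is higher) and hardware prefetchers issue part of the traffic. The function name lines_in_flight is hypothetical.

```shell
#!/usr/bin/env bash
# lines_in_flight BW_GBS LATENCY_NS [LINE_BYTES]
# GB/s multiplied by ns gives bytes in flight; divide by the line size.
lines_in_flight() {
  awk -v bw="$1" -v lat="$2" -v line="${3:-64}" 'BEGIN { printf "%.1f\n", bw * lat / line }'
}

lines_in_flight 20.9 95.0    # Graviton2 DRAM: 31.0 lines
lines_in_flight 37.0 110.0   # Graviton4 DRAM: 63.6 lines
```

By this estimate, a single Graviton4 core keeps roughly twice as many lines in flight to DRAM, which is consistent with higher bandwidth despite higher unloaded latency.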

Normalize to Bytes per Cycle

To evaluate the architectural efficiency independent of clock speed, we can convert these raw GB/s metrics into Bytes per Cycle using each processor's nominal clock rate (2.5 GHz for Graviton2 and 2.8 GHz for Graviton4):

Bytes/cycle = (GB/s × 10^9) / Clock (Hz)

Normalizing the data highlights the microarchitectural efficiency gains of the execution ports and cache datapaths across generations, as detailed in the Arm System Characterization Tool User Guide:

Cache Level   Graviton2 (2.5 GHz)   Graviton4 (2.8 GHz)   Architectural Efficiency Delta
-----------------------------------------------------------------------------------------
L1 Data       63.76 B/cycle         114.64 B/cycle        +79.8%
L2 Unified    29.36 B/cycle         33.96 B/cycle         +15.7%
LLC (L3)      14.36 B/cycle         27.86 B/cycle         +94.0%
DRAM          8.36 B/cycle          13.21 B/cycle         +58.0%
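The B/cycle columns can be reproduced from the bandwidth tables. Since both GB/s and GHz carry a factor of 10^9, the formula reduces to GB/s divided by GHz. The sketch assumes the nominal clock rates stated above, and the function name bytes_per_cycle is illustrative.

```shell
#!/usr/bin/env bash
# bytes_per_cycle GBS GHZ: bandwidth normalized to bytes per core clock cycle.
bytes_per_cycle() {
  awk -v gbs="$1" -v ghz="$2" 'BEGIN { printf "%.2f\n", gbs / ghz }'
}

# Columns: level, Graviton2 GB/s, Graviton4 GB/s (from the ASCT output above).
while read -r level g2 g4; do
  printf '%-5s %8s %8s\n' "$level" \
    "$(bytes_per_cycle "$g2" 2.5)" "$(bytes_per_cycle "$g4" 2.8)"
done <<'EOF'
L1 159.4 321.0
L2 73.4 95.1
LLC 35.9 78.0
DRAM 20.9 37.0
EOF
```

If your instances report a different sustained clock, substitute that frequency, since the per-cycle figures scale inversely with it.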

This normalization shows that the performance gains on Graviton4 are not merely from a higher clock speed. For instance, the L1 data path moves nearly 80% more bytes per cycle, consistent with the widened load/store execution capability of the Neoverse V2 core. Similarly, the near-doubling of LLC efficiency indicates that the upgraded mesh interconnect can transport significantly more data per clock cycle than its predecessor.

Conclusion

Analyzing the low-level memory performance of AWS Graviton2 and Graviton4 instances highlights the concrete differences between these two processor generations:

• Cache Architecture: Doubling the private L2 cache on Graviton4 (to 2 MB per core) retains larger per-thread working sets closer to the execution pipeline, reducing traffic and contention on the shared L3 cache.

• Access Latency: Graviton4's updated cache pipeline achieves 26% lower latency in the L2 and L3 tiers, helping offset the slightly higher unloaded DRAM latency introduced by DDR5.

• Streaming Throughput: Single-core bandwidth sweeps show a significant performance increase on Graviton4, including doubled throughput at the L1 and L3 levels.

These metrics offer practical guidance for software development on AWS Graviton. By sizing per-thread data structures to stay within the L2 capacity, or by choosing data layouts that make good use of the shared L3, you can design workloads that run more efficiently on EC2.

The examples here used AWS Graviton instances, but the methodology works on any Arm Linux machine with a NUMA-enabled kernel. Install ASCT and follow the same steps.
