AI × Quant Trader Series — Day 21¶

What is CPU Cache Optimization?¶

Reading time: ~20 minutes
Prerequisites: Basic Computer Architecture, Shared Memory IPC, Lock-Free Programming
Focus: understanding why memory access often dominates performance in High Frequency Trading systems

Part 1: Introduction¶

Modern CPUs are incredibly fast.

Modern memory is not.

A processor can execute billions of instructions every second, yet it often spends a surprising amount of time waiting for data to arrive from memory.

For High Frequency Trading systems, waiting is expensive.

Reducing computation is important.

Reducing memory latency is often even more important.

This is why experienced low-latency engineers spend as much time thinking about memory layout as they do about algorithms.

Part 2: Why Memory Is Slow¶

Suppose a CPU needs one integer.

Where that integer is located determines how long the processor must wait.

Approximate access latency:

CPU Register: < 1 ns
L1 Cache: ~1 ns
L2 Cache: ~3–5 ns
L3 Cache: ~10–20 ns
Main Memory (RAM): ~60–100+ ns

Although exact numbers vary across processors, the pattern is always the same:

Accessing RAM is dramatically slower than accessing cache.

When a CPU stalls, it is rarely waiting for arithmetic.

It is waiting for memory.
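To put the approximate latencies in this section in perspective, the sketch below converts a wait in nanoseconds into clock cycles. The 3 GHz clock speed and the helper name `stall_cycles` are illustrative assumptions, not measurements of any particular processor.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of clock cycles that pass while a core waits
// `latency_ns` nanoseconds, for a core running at `clock_ghz` GHz.
// 1 GHz is one cycle per nanosecond, so cycles = nanoseconds * GHz.
constexpr std::int64_t stall_cycles(double latency_ns, double clock_ghz) {
    return static_cast<std::int64_t>(latency_ns * clock_ghz + 0.5);  // round to nearest
}

// Assumed clock speed (illustrative only).
constexpr double kAssumedClockGhz = 3.0;

// An L1 hit at ~1 ns costs about 3 cycles at this clock speed.
static_assert(stall_cycles(1.0, kAssumedClockGhz) == 3, "L1 hit");

// A RAM access at ~100 ns costs about 300 cycles.
static_assert(stall_cycles(100.0, kAssumedClockGhz) == 300, "RAM access");
```

Under these assumptions, one trip to RAM costs roughly as much time as a few hundred simple instructions, which is why a handful of misses can outweigh a lot of arithmetic.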

Part 3: What Is CPU Cache?¶

CPU cache is a small amount of extremely fast memory located close to the processor.

Instead of reading directly from RAM every time, the CPU keeps recently used data in cache.

Most modern processors contain multiple cache levels:

L1 Cache
L2 Cache
L3 Cache

Each level balances size against speed.

Smaller caches are faster.

Larger caches are slower but can store more data.

Part 4: Why Cache Exists¶

Programs tend to reuse the same data and to access data stored right next to it.

Example:

for (int i = 0; i < 1000000; i++)
    sum += prices[i];

The processor predicts that nearby memory will also be needed.

Instead of loading one integer,

it loads an entire cache line.

Subsequent reads become much faster because the data is already available.

This behavior is known as spatial locality.
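A minimal sketch of spatial locality, assuming a matrix stored row by row in one contiguous `std::vector`. The `Matrix` type and both function names are hypothetical. The two functions compute the same sum, but only the first walks memory in address order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix in one contiguous buffer:
// element (r, c) lives at index r * cols + c.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> data;

    Matrix(std::size_t r, std::size_t c, double fill)
        : rows(r), cols(c), data(r * c, fill) {}

    double at(std::size_t r, std::size_t c) const {
        assert(r < rows && c < cols);
        return data[r * cols + c];
    }
};

// Inner loop moves along a row: consecutive reads are neighbors in memory
// and share cache lines.
double sum_row_major(const Matrix& m) {
    double total = 0.0;
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c)
            total += m.at(r, c);
    return total;
}

// Inner loop moves down a column: each read jumps `cols` elements ahead,
// so on a wide matrix consecutive reads land on different cache lines.
double sum_column_major(const Matrix& m) {
    double total = 0.0;
    for (std::size_t c = 0; c < m.cols; ++c)
        for (std::size_t r = 0; r < m.rows; ++r)
            total += m.at(r, c);
    return total;
}
```

On a matrix much larger than the cache, the row-major version is usually the faster of the two, although the size of the gap depends on the hardware and on compiler optimizations.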

Part 5: Temporal Locality¶

Another important principle is temporal locality.

Recently accessed data is likely to be accessed again.

Example:

price += spread;
price *= multiplier;
price -= fee;

The variable price stays in a register or in L1 cache while the three operations run.

No additional trip to main memory is required.

Efficient software takes advantage of this natural behavior.
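One way to exploit temporal locality is to do all the work on a value while it is still close to the core. The sketch below compares three separate passes over a price array with a single fused pass. `Stats` and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Stats {
    double min;
    double max;
    double sum;
};

// Three passes: on an array larger than the cache, each pass has to bring
// the same prices in from memory again.
Stats stats_three_passes(const std::vector<double>& prices) {
    assert(!prices.empty());
    Stats s{};
    s.min = *std::min_element(prices.begin(), prices.end());
    s.max = *std::max_element(prices.begin(), prices.end());
    s.sum = 0.0;
    for (double p : prices) s.sum += p;
    return s;
}

// One pass: each price is loaded once and used three times while it is
// still in a register.
Stats stats_one_pass(const std::vector<double>& prices) {
    assert(!prices.empty());
    Stats s{prices[0], prices[0], 0.0};
    for (double p : prices) {
        s.min = std::min(s.min, p);
        s.max = std::max(s.max, p);
        s.sum += p;
    }
    return s;
}
```

Both functions return the same result. The fused version simply asks the memory system for each element once instead of three times.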

Part 6: Cache Lines¶

The CPU does not fetch individual bytes from memory.

Instead, it loads fixed-size blocks known as cache lines.

On most modern x86 processors,

a cache line is:

64 Bytes

With 64-byte lines, accessing one 4-byte integer also brings in the 15 neighboring integers that share its cache line.

This explains why sequential memory access is usually much faster than random access.
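The arithmetic behind this can be sketched as follows, assuming 64-byte cache lines and a buffer that starts on a line boundary. `kCacheLineBytes`, `elements_per_line`, and `lines_touched` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLineBytes = 64;

// How many elements of type T fit in one cache line.
template <typename T>
constexpr std::size_t elements_per_line() {
    return kCacheLineBytes / sizeof(T);
}

// Number of distinct cache lines touched when reading `count` elements of
// `elem_bytes` bytes each, placed `stride_bytes` apart, starting at a
// line-aligned address.
std::size_t lines_touched(std::size_t count, std::size_t elem_bytes,
                          std::size_t stride_bytes) {
    assert(elem_bytes > 0 && stride_bytes >= elem_bytes);
    std::size_t lines = 0;
    std::size_t next_uncounted = 0;  // lowest line index not counted yet
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t first = (i * stride_bytes) / kCacheLineBytes;
        const std::size_t last =
            (i * stride_bytes + elem_bytes - 1) / kCacheLineBytes;
        for (std::size_t line = std::max(first, next_uncounted); line <= last; ++line)
            ++lines;
        next_uncounted = std::max(next_uncounted, last + 1);
    }
    return lines;
}
```

For example, `lines_touched(16, 4, 4)` is 1 (sixteen packed 4-byte integers share one line), while `lines_touched(16, 4, 64)` is 16 (the same sixteen integers spaced 64 bytes apart each need their own line).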

Part 7: Cache-Friendly Data Structures¶

Data layout strongly influences cache performance.

Consider two approaches.

Linked List¶

Node → Pointer → Node → Pointer → Node

Every node may reside in a different memory location, and the address of the next node is known only after the current node has been loaded.

The CPU repeatedly waits for memory.

Contiguous Array¶

Value | Value | Value | Value | Value

Neighboring values are loaded together.

The CPU processes data continuously with minimal waiting.

This is why arrays often outperform linked lists for traversal, despite identical O(n) algorithmic complexity.
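A small sketch of the two layouts using standard containers: `std::list` allocates one node per element, while `std::vector` stores elements contiguously. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <list>
#include <numeric>
#include <vector>

// Traversal follows a pointer from node to node; each node may live in a
// different part of the heap.
std::int64_t sum_list(const std::list<std::int64_t>& values) {
    return std::accumulate(values.begin(), values.end(), std::int64_t{0});
}

// Traversal walks one contiguous buffer; eight 8-byte values share each
// 64-byte cache line.
std::int64_t sum_vector(const std::vector<std::int64_t>& values) {
    return std::accumulate(values.begin(), values.end(), std::int64_t{0});
}
```

Both functions do the same amount of arithmetic. In a toy program, freshly allocated list nodes may happen to sit near each other, so the difference tends to show up more clearly in long-running programs where the heap has become fragmented.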

Part 8: Cache Misses¶

A cache miss occurs when requested data is not present in cache.

The processor must retrieve it from a slower level.

Typical sequence:

L1 Miss → L2 Miss → L3 Miss → RAM Access

Each miss adds the latency of the next, slower level.

In High Frequency Trading,

avoiding cache misses is often more valuable than reducing arithmetic operations.
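The cost of this miss chain can be estimated with the standard average-access-time calculation: the time at each level is its own latency plus its miss rate multiplied by the time at the next level. The sketch below is illustrative only. `Level` and `average_access_ns` are hypothetical names, and any miss rates passed in are made-up inputs rather than measurements.

```cpp
#include <cassert>
#include <vector>

// One cache level: how long a lookup takes and how often it misses.
struct Level {
    double latency_ns;
    double miss_rate;  // fraction of lookups that miss, in [0, 1]
};

// Average time per access for a hierarchy listed from fastest (L1) to
// slowest cache, backed by RAM. Evaluated from the bottom up:
//   time(level) = latency(level) + miss_rate(level) * time(next level)
double average_access_ns(const std::vector<Level>& caches, double ram_ns) {
    double t = ram_ns;
    for (auto it = caches.rbegin(); it != caches.rend(); ++it) {
        assert(it->miss_rate >= 0.0 && it->miss_rate <= 1.0);
        t = it->latency_ns + it->miss_rate * t;
    }
    return t;
}
```

With assumed latencies of 1, 4, and 15 ns for L1 to L3, 100 ns for RAM, and assumed miss rates of 10%, 50%, and 50%, the average comes to about 4.65 ns per access, several times the cost of a pure L1 hit.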

Part 9: Cache Optimization Techniques¶

Professional low-latency systems apply several techniques:

Keep hot data together
Minimize pointer chasing
Prefer contiguous memory
Reuse allocated objects
Avoid unnecessary allocations
Align frequently accessed structures
Reduce working set size

The objective is simple:

Keep important data inside the CPU cache as long as possible.
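Several of these techniques can be combined in one layout. The sketch below splits an order record into hot and cold parts, aligns the hot part to an assumed 64-byte cache line, and preallocates contiguous storage. All type names, field names, and the choice of which fields count as hot are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

constexpr std::size_t kCacheLineBytes = 64;  // assumption

// Hot fields: read on every update. Aligned so each record occupies
// exactly one cache line and never straddles two.
struct alignas(kCacheLineBytes) HotOrder {
    std::int64_t price_ticks;
    std::int64_t quantity;
    std::uint64_t order_id;
    std::uint32_t instrument_id;
    std::uint8_t side;
};
static_assert(sizeof(HotOrder) == kCacheLineBytes, "one record per line");

// Cold fields: rarely read, kept out of the hot array.
struct ColdOrder {
    std::string client_tag;
    std::string note;
};

// Parallel arrays: index i in hot_ and cold_ describes the same order.
// Note: over-aligned vector elements rely on C++17 aligned allocation.
class OrderStore {
public:
    explicit OrderStore(std::size_t capacity) {
        hot_.reserve(capacity);   // allocate once, up front
        cold_.reserve(capacity);
    }

    std::size_t add(const HotOrder& hot, ColdOrder cold) {
        hot_.push_back(hot);
        cold_.push_back(std::move(cold));
        return hot_.size() - 1;
    }

    // Hot-path scan: touches only the contiguous hot array.
    std::int64_t total_quantity() const {
        std::int64_t total = 0;
        for (const HotOrder& o : hot_) total += o.quantity;
        return total;
    }

    const ColdOrder& cold(std::size_t index) const { return cold_.at(index); }

private:
    std::vector<HotOrder> hot_;
    std::vector<ColdOrder> cold_;
};
```

The trade-off is that reading a full order now takes two lookups. That is usually acceptable when the cold fields are needed only for logging or reporting.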

Part 10: Cache Optimization in High Frequency Trading¶

Trading systems process enormous volumes of data:

Market updates
Order books
Positions
Risk limits
Execution reports

These structures are accessed continuously.

Poor memory layout can force the processor to wait for RAM many times while handling a single market event, and those stalls repeat on every event.

Good cache utilization allows trading systems to process market events with significantly lower latency.
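As one illustration of cache-friendly layout for order-book data, the sketch below keeps one side of a book in a flat array indexed by price tick, so neighboring price levels sit next to each other in memory. This is a simplified, hypothetical design (`PriceLadder` and its methods are invented for this example), not a description of how any particular trading system stores its book.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One side of an order book as a contiguous array of quantities,
// covering `levels` consecutive ticks starting at `min_tick`.
class PriceLadder {
public:
    PriceLadder(std::int64_t min_tick, std::size_t levels)
        : min_tick_(min_tick), quantity_(levels, 0) {}

    // Adds resting quantity at a price (negative delta removes it).
    // Returns false if the price is outside the preallocated range.
    bool apply(std::int64_t tick, std::int64_t delta) {
        if (!in_range(tick)) return false;
        quantity_[index_of(tick)] += delta;
        return true;
    }

    std::int64_t quantity_at(std::int64_t tick) const {
        return in_range(tick) ? quantity_[index_of(tick)] : 0;
    }

    // Finds the highest price with resting quantity by scanning the array
    // downward. Returns false when the ladder is empty.
    bool highest_nonempty(std::int64_t& tick_out) const {
        for (std::size_t i = quantity_.size(); i > 0; --i) {
            if (quantity_[i - 1] > 0) {
                tick_out = min_tick_ + static_cast<std::int64_t>(i - 1);
                return true;
            }
        }
        return false;
    }

private:
    bool in_range(std::int64_t tick) const {
        return tick >= min_tick_ &&
               static_cast<std::size_t>(tick - min_tick_) < quantity_.size();
    }
    std::size_t index_of(std::int64_t tick) const {
        return static_cast<std::size_t>(tick - min_tick_);
    }

    std::int64_t min_tick_;
    std::vector<std::int64_t> quantity_;  // allocated once, never resized
};
```

Updates touch a single array slot and never allocate. A tree-based book typically follows several pointers per update and can miss the cache at each step.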

Part 11: Common Misconceptions¶

A faster algorithm does not always produce a faster system.

For example,

an algorithm with better theoretical complexity may perform worse if it causes excessive cache misses.

Likewise,

allocating many small objects dynamically often hurts performance more than a few additional arithmetic operations.

Modern performance engineering is frequently limited by memory rather than computation.
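To illustrate the point about small dynamic allocations, here is a minimal sketch of a fixed-capacity object pool that allocates all of its storage once and then hands out and takes back slots. `FixedPool` is a hypothetical name. The sketch is single-threaded, requires a default-constructible `T`, and does not reset objects when they are released.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

template <typename T>
class FixedPool {
public:
    explicit FixedPool(std::size_t capacity) : slots_(capacity) {
        free_.reserve(capacity);
        // Push indices in reverse so slot 0 is handed out first.
        for (std::size_t i = capacity; i > 0; --i) free_.push_back(i - 1);
    }

    // Returns a pointer to a free slot, or nullptr if the pool is exhausted.
    T* acquire() {
        if (free_.empty()) return nullptr;
        const std::size_t index = free_.back();
        free_.pop_back();
        return &slots_[index];
    }

    // Returns a slot to the pool. `obj` must have come from acquire().
    void release(T* obj) {
        assert(obj >= slots_.data() && obj < slots_.data() + slots_.size());
        free_.push_back(static_cast<std::size_t>(obj - slots_.data()));
    }

    std::size_t available() const { return free_.size(); }

private:
    std::vector<T> slots_;           // one contiguous allocation, made once
    std::vector<std::size_t> free_;  // indices of unused slots (LIFO)
};
```

Because the free list is last-in, first-out, the most recently released slot is reused first. That slot is also the one most likely to still be in cache.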

Part 12: Where godzilla.dev Fits¶

The architecture of godzilla.dev emphasizes efficient memory access throughout the trading pipeline.

Market data structures, event queues, and communication channels are designed with cache locality in mind, helping reduce unnecessary memory traffic during high-throughput workloads.

Combined with shared memory, lock-free programming, and event-driven architecture, careful cache-aware design contributes to predictable low-latency performance.

Rather than optimizing isolated algorithms, the framework focuses on optimizing how data moves through the system.

Part 13: Key Takeaways¶

CPU Cache Optimization is the practice of organizing data and algorithms to maximize cache efficiency and minimize slow memory accesses.

Key principles include:

Favor contiguous memory
Exploit spatial locality
Exploit temporal locality
Reduce cache misses
Design cache-friendly data structures

For modern High Frequency Trading systems, memory access patterns often have a greater impact on performance than computational complexity alone.

Understanding the memory hierarchy is therefore an essential skill for building low-latency software.

What's Next?¶

The next article explores one of the most common performance problems caused by modern CPU caches:

What is False Sharing?