AI × Quant Trader Series — Day 22

Memory Layout in C++

Reading time: ~20 minutes
Prerequisites: Basic C++, CPU Cache Optimization, Shared Memory IPC
Focus: understanding how memory layout affects performance in modern trading systems


Part 1: Introduction

Two programs may execute exactly the same algorithm.

Both may have the same computational complexity.

Yet one consistently runs twice as fast.

Why?

In many cases, the answer has nothing to do with the algorithm itself.

It has everything to do with how data is arranged in memory.

For modern processors, memory layout is often one of the largest determinants of performance.

This is especially true in High Frequency Trading, where millions of market events are processed every second.

Good software is not only about writing efficient code.

It is also about organizing data efficiently.


Part 2: What Is Memory Layout?

Memory layout describes how objects are organized in memory.

For example, consider this simple structure:

struct Order {
    int id;
    double price;
    int quantity;
};

Although it appears straightforward, the compiler decides:

  • Where each field is stored
  • How much padding is inserted
  • How the object is aligned in memory

These decisions directly affect cache efficiency and memory bandwidth.
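One way to see these decisions is to ask the compiler directly. The sketch below applies `sizeof` and `alignof` to the `Order` struct shown above. The names `kOrderFieldBytes`, `kOrderPaddingBytes`, and `print_order_layout` are invented for this illustration, and the exact numbers depend on the platform ABI.

```cpp
#include <cstddef>  // std::size_t
#include <cstdio>   // std::printf

// Same fields as the Order struct above.
struct Order {
    int id;
    double price;
    int quantity;
};

// Bytes occupied by the fields themselves (illustrative name).
constexpr std::size_t kOrderFieldBytes =
    sizeof(int) + sizeof(double) + sizeof(int);

// Whatever is left over was inserted by the compiler as padding.
constexpr std::size_t kOrderPaddingBytes = sizeof(Order) - kOrderFieldBytes;

// The size of an object is a multiple of its alignment,
// so every element of an array of Order stays aligned.
static_assert(sizeof(Order) % alignof(Order) == 0,
              "size must be a multiple of alignment");

// Hypothetical helper that prints what the compiler chose.
inline void print_order_layout() {
    std::printf("sizeof(Order)  = %zu\n", sizeof(Order));
    std::printf("alignof(Order) = %zu\n", alignof(Order));
    std::printf("field bytes    = %zu\n", kOrderFieldBytes);
    std::printf("padding bytes  = %zu\n", kOrderPaddingBytes);
}
```

On a typical x86-64 build, this reports a 24-byte object that holds only 16 bytes of fields.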


Part 3: Object Alignment

Modern CPUs access aligned memory more efficiently than unaligned memory.

Suppose a processor expects an 8-byte value to begin at an 8-byte boundary.

Aligned addresses: 0, 8, 16, 24

An aligned value never straddles a cache-line boundary, so the CPU can typically read or write it in a single access.

To guarantee alignment, C++ provides:

struct alignas(64) Order {
    ...
};

Alignment is particularly important for cache-sensitive applications.
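A minimal sketch of `alignas` in use follows. It assumes a 64-byte cache line, which is common on x86-64 but not universal. `AlignedOrder`, `kCacheLine`, and `is_cache_line_aligned` are hypothetical names used only for this example.

```cpp
#include <cassert>
#include <cstddef>  // std::size_t
#include <cstdint>  // std::uintptr_t

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// alignas goes between the struct keyword and the type name.
struct alignas(kCacheLine) AlignedOrder {
    int id;
    double price;
    int quantity;
};

// The alignment request also rounds the size up to a full cache line.
static_assert(alignof(AlignedOrder) == kCacheLine, "type is cache-line aligned");
static_assert(sizeof(AlignedOrder) == kCacheLine, "size is rounded up to 64 bytes");

// Hypothetical helper: true if p sits on a cache-line boundary.
inline bool is_cache_line_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kCacheLine == 0;
}

// Usage: an object with automatic storage honors the requested alignment.
inline void aligned_order_example() {
    AlignedOrder order{1, 101.25, 500};
    assert(is_cache_line_aligned(&order));
}
```

The trade-off is memory: each `AlignedOrder` now occupies a full 64 bytes, so this is usually reserved for objects that benefit from owning a cache line.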


Part 4: Padding

Compilers often insert unused bytes between fields.

Example:

struct Example {
    char flag;
    double price;
    int qty;
};

Memory may actually look like:

flag (1 byte) | padding (7 bytes) | price (8 bytes) | qty (4 bytes) | padding (4 bytes)

On a typical 64-bit platform that is 24 bytes in total, of which only 13 hold data.

The extra bytes improve alignment but increase memory usage.

Understanding padding helps reduce unnecessary memory traffic.
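The padding can be located with the standard `offsetof` macro. This is a sketch for the `Example` struct above. The constant names and `print_example_layout` are illustrative, and the offsets are whatever the target ABI chooses.

```cpp
#include <cstddef>  // offsetof, std::size_t
#include <cstdio>   // std::printf

// Same fields as the Example struct above.
struct Example {
    char flag;
    double price;
    int qty;
};

// Unused bytes between the end of flag and the start of price.
constexpr std::size_t kGapAfterFlag =
    offsetof(Example, price) - (offsetof(Example, flag) + sizeof(char));

// Unused bytes after qty, added so that arrays of Example stay aligned.
constexpr std::size_t kTailPadding =
    sizeof(Example) - (offsetof(Example, qty) + sizeof(int));

// Hypothetical helper that prints where each field landed.
inline void print_example_layout() {
    std::printf("flag  at offset %zu\n", offsetof(Example, flag));
    std::printf("price at offset %zu\n", offsetof(Example, price));
    std::printf("qty   at offset %zu\n", offsetof(Example, qty));
    std::printf("size %zu, gap after flag %zu, tail padding %zu\n",
                sizeof(Example), kGapAfterFlag, kTailPadding);
}
```

On a typical x86-64 build the offsets are 0, 8, and 16, with 7 bytes of padding after `flag` and 4 bytes at the end.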


Part 5: Field Ordering

The order of fields matters.

Example A:

struct A {
    char flag;
    double price;
    int qty;
};

Example B:

struct B {
    double price;
    int qty;
    char flag;
};

Both structures represent the same information.

However, the second layout often contains less padding: on a typical 64-bit platform, struct A occupies 24 bytes while struct B occupies 16.

Better field ordering improves cache utilization.
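The effect of reordering can be checked at compile time. The sketch below reuses structs A and B from above. The constant names are made up for the example.

```cpp
#include <cstddef>  // std::size_t

// Example A: small field first.
struct A {
    char flag;
    double price;
    int qty;
};

// Example B: fields ordered from largest to smallest.
struct B {
    double price;
    int qty;
    char flag;
};

// Both structs carry the same payload.
constexpr std::size_t kFieldBytes = sizeof(char) + sizeof(double) + sizeof(int);

// Padding is whatever the compiler added on top of the payload.
constexpr std::size_t kPaddingA = sizeof(A) - kFieldBytes;
constexpr std::size_t kPaddingB = sizeof(B) - kFieldBytes;

// Holds on common ABIs; the build fails loudly if a platform disagrees.
static_assert(sizeof(B) <= sizeof(A), "reordered struct should not be larger");
```

Assuming 64-byte cache lines and the typical sizes of 24 and 16 bytes, an array of `B` packs four objects into each line, while an array of `A` averages fewer than three.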


Part 6: Array of Structures (AoS)

Many beginners naturally organize data as:

struct Order {
    double price;
    int quantity;
};

Order orders[100000];

Memory looks like:

Price Qty

Price Qty

Price Qty

Price Qty

This approach is intuitive.

However, algorithms that only need prices still load quantities into cache, because both fields share the same cache lines.

Unnecessary data consumes memory bandwidth.
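The cost can be made concrete with a small sketch. It uses `std::vector` instead of the fixed-size array above, purely for convenience. `total_price_aos` and `kUsefulFraction` are hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Same fields as the Order struct above.
struct Order {
    double price;
    int quantity;
};

// Share of each Order that a price-only loop actually uses.
constexpr double kUsefulFraction =
    static_cast<double>(sizeof(double)) / sizeof(Order);

// Sums prices. Every cache line fetched also carries quantities
// (and any padding) that this loop never reads.
inline double total_price_aos(const std::vector<Order>& orders) {
    double total = 0.0;
    for (const Order& order : orders) {
        total += order.price;
    }
    return total;
}

// Usage example.
inline void aos_example() {
    std::vector<Order> orders = {{100.5, 10}, {101.0, 20}};
    assert(total_price_aos(orders) == 201.5);
}
```

With a 16-byte `Order`, which is typical on 64-bit platforms, `kUsefulFraction` is 0.5: half of the memory traffic in this loop is wasted.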


Part 7: Structure of Arrays (SoA)

An alternative organization is:

double prices[100000];
int quantities[100000];

Memory becomes:

Price Price Price Price ...

Qty Qty Qty Qty ...

Now a pricing algorithm loads only prices.

Every byte of each loaded cache line is now a price, so cache efficiency improves significantly.

Many numerical libraries and HFT systems prefer this layout for data-intensive workloads.
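A minimal SoA sketch might look like the following. It swaps the fixed-size arrays above for `std::vector`, and `OrdersSoA` and `total_price_soa` are names invented for this example rather than part of any library.

```cpp
#include <cassert>
#include <cstddef>  // std::size_t
#include <numeric>  // std::accumulate
#include <vector>

// One array per field; index i in each array describes order i.
struct OrdersSoA {
    std::vector<double> prices;
    std::vector<int> quantities;

    // Appends one order, keeping both arrays the same length.
    void add(double price, int quantity) {
        prices.push_back(price);
        quantities.push_back(quantity);
        assert(prices.size() == quantities.size());
    }

    std::size_t size() const { return prices.size(); }
};

// Sums prices while touching only the prices array.
inline double total_price_soa(const OrdersSoA& orders) {
    return std::accumulate(orders.prices.begin(), orders.prices.end(), 0.0);
}
```

The price of this layout is convenience: code that needs all fields of one order now reads from several arrays, so SoA pays off mainly for loops that scan one or two fields across many records.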


Part 8: Contiguous Memory

Processors perform best when data is stored continuously.

Example:

Order | Order | Order | Order | Order

Sequential access allows hardware prefetchers to load future cache lines automatically.

In contrast, pointer-based structures such as linked lists require frequent memory jumps, increasing cache misses.

Whenever possible, contiguous memory should be preferred.
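The difference shows up even when the code looks identical. In this sketch the same loop runs over a `std::vector` and a `std::list`; only the memory behind the container changes. `sum_quantities` and `elements_are_adjacent` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>  // std::size_t
#include <list>
#include <vector>

// Same traversal for any container of ints.
template <typename Container>
long long sum_quantities(const Container& quantities) {
    long long total = 0;
    for (int quantity : quantities) {
        total += quantity;
    }
    return total;
}

// std::vector guarantees that element i lives at data() + i,
// which is what makes sequential access predictable for the prefetcher.
inline bool elements_are_adjacent(const std::vector<int>& values) {
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (&values[i] != values.data() + i) {
            return false;
        }
    }
    return true;
}

// Usage example: both containers give the same answer.
inline void contiguous_example() {
    std::vector<int> contiguous = {10, 20, 30};
    std::list<int> linked = {10, 20, 30};  // each node is allocated separately
    assert(sum_quantities(contiguous) == sum_quantities(linked));
    assert(elements_are_adjacent(contiguous));
}
```

Both calls return the same sum. The list version follows a pointer to reach every element, so its cost depends on where the allocator happened to place each node.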


Part 9: Hot Data and Cold Data

Not every field is accessed equally often.

Example:

#include <cstdint>
#include <string>

struct Order {
    double price;
    int quantity;
    std::uint64_t timestamp;
    std::string comment;
};

A trading strategy may only use:

  • Price
  • Quantity

The comment field is rarely accessed.

Professional systems often separate:

Hot data: frequently accessed, so it must remain cache-friendly.

Cold data: rarely accessed, so it can be stored elsewhere.

Separating hot and cold data reduces cache pollution.
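One possible way to split the `Order` above is sketched here. Which fields count as hot is an assumption: this example treats only price and quantity as hot and moves the timestamp and comment to the cold side. `OrderHot`, `OrderCold`, `OrderStore`, and `total_notional` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>  // std::size_t
#include <cstdint>  // std::uint64_t
#include <string>
#include <utility>  // std::move
#include <vector>

// Hot part: the fields the strategy reads on every event.
struct OrderHot {
    double price;
    int quantity;
};

// Cold part: fields that are rarely needed (assumed for this sketch).
struct OrderCold {
    std::uint64_t timestamp;
    std::string comment;
};

// Parallel arrays: hot[i] and cold[i] describe the same order.
struct OrderStore {
    std::vector<OrderHot> hot;
    std::vector<OrderCold> cold;

    // Stores one order and returns its index.
    std::size_t add(double price, int quantity,
                    std::uint64_t timestamp, std::string comment) {
        hot.push_back(OrderHot{price, quantity});
        cold.push_back(OrderCold{timestamp, std::move(comment)});
        assert(hot.size() == cold.size());
        return hot.size() - 1;
    }
};

// The hot loop scans a dense array and never touches a comment.
inline double total_notional(const OrderStore& store) {
    double total = 0.0;
    for (const OrderHot& order : store.hot) {
        total += order.price * order.quantity;
    }
    return total;
}
```

Cold fields stay reachable through the shared index, for example `store.cold[i].comment`, but they no longer occupy the cache lines that the hot loop scans.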


Part 10: False Sharing

Memory layout also affects multi-threaded performance.

Suppose two threads modify different variables stored inside the same cache line.

Although the variables are unrelated, each write forces the cache-coherence protocol to invalidate the line in the other core's cache, so the line bounces back and forth between cores.

Performance degrades dramatically.

This phenomenon is known as False Sharing.

Proper alignment and padding can eliminate this issue.
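A common fix is to give each thread's variable its own cache line. The sketch below assumes 64-byte lines. C++17 also defines `std::hardware_destructive_interference_size` in `<new>` for this purpose, but compiler support varies, so a plain constant is used here. All type and function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>  // std::size_t
#include <cstdint>  // std::uintptr_t
#include <thread>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Each counter is aligned to, and padded out to, a full cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Two counters that are guaranteed not to share a line.
struct Counters {
    PaddedCounter a;
    PaddedCounter b;
};

// Hypothetical helper: true if the two addresses fall in different lines.
inline bool on_different_cache_lines(const void* x, const void* y) {
    return reinterpret_cast<std::uintptr_t>(x) / kCacheLine !=
           reinterpret_cast<std::uintptr_t>(y) / kCacheLine;
}

// Each thread updates only its own counter, so no line is contended.
inline void count_in_parallel(Counters& counters, long iterations) {
    std::thread first([&counters, iterations] {
        for (long i = 0; i < iterations; ++i) {
            counters.a.value.fetch_add(1, std::memory_order_relaxed);
        }
    });
    std::thread second([&counters, iterations] {
        for (long i = 0; i < iterations; ++i) {
            counters.b.value.fetch_add(1, std::memory_order_relaxed);
        }
    });
    first.join();
    second.join();
}
```

Removing `alignas(kCacheLine)` leaves the program correct but lets both counters share one line, which is the situation described above.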


Part 11: Memory Layout in High Frequency Trading

Trading systems continuously process:

  • Market data
  • Order books
  • Positions
  • Risk limits
  • Execution reports

These structures are accessed millions of times every second.

Poor memory layout increases:

  • Cache misses
  • Memory bandwidth consumption
  • CPU stalls

Professional trading platforms therefore invest significant effort in organizing data efficiently before optimizing algorithms.


Part 12: Where godzilla.dev Fits

The design of godzilla.dev places strong emphasis on data-oriented programming and cache-aware memory layouts.

Core trading structures are organized to reduce unnecessary memory accesses while supporting high-throughput event processing.

Rather than treating memory organization as an implementation detail, the framework considers it a fundamental part of system architecture.

This philosophy complements other low-latency techniques such as shared memory communication, lock-free programming, and event-driven processing.


Part 13: Key Takeaways

Memory layout determines how data is organized inside memory.

Good layouts improve:

  • Cache locality
  • Effective memory bandwidth
  • CPU utilization
  • Overall throughput

Key optimization techniques include:

  • Proper alignment
  • Reducing padding
  • Field reordering
  • Contiguous storage
  • Structure of Arrays (SoA)
  • Separating hot and cold data

For High Frequency Trading systems, memory layout is often as important as algorithm design itself.


Performance Engineering Notes

When optimizing low-latency software, engineers often focus on reducing computation.

In practice, memory movement frequently dominates execution time.

Organizing data to match modern CPU architectures can produce larger performance improvements than replacing one algorithm with another.

Design your data first.

Then optimize your code.


What's Next?

The next article explores one of the most common memory-related performance issues in multi-threaded software:

  • What is False Sharing?