Post by Peter Olcott
Post by Paul
Post by Peter Olcott
I want to understand why I am not getting the memory speed that I am
expecting. I wrote a very memory-bandwidth-intensive C++ program and
it is reporting that the memory speed I am getting is about 121 MB
per second.
Intel Core i5-750 2.67 GHz (quad core)
 32 KB L1    88,893 MB/sec
256 KB L2    37,560 MB/sec
  8 MB L3    26,145 MB/sec
8.0 GB RAM   11,852 MB/sec
Why is my memory-intensive process only getting such a tiny fraction
of the 11,852 MB/sec memory bandwidth?
According to the "CPUID" tab here, the Core i5 has a cache line size
of 64 bytes.
http://www.cpu-world.com/CPUs/Core_i5/Intel-Core%20i5%20I5-750%20BV80605001911AP%20(BX80605I5750%20-%20BXC80605I5750).html
As I understand it, the processor deals in cache-line-sized
transactions. Your memory is dual channel: each DIMM is 8 bytes wide,
so two DIMMs are 16 bytes wide. A burst memory transfer would need to
be 4 cycles' worth to populate a 64-byte cache line, and additional
time is needed to prepare for the next data burst from memory.
If we take 11,852 MB/sec divided by 64 bytes, that gives the number
of burst transfers we could do in one second: about 185 million.
Now, imagine we do random accesses of one byte over the entire memory
space. This is a "cache busting" pattern. None of the data caches
will be effective, because each attempt to access a single byte
results in the least recently used cache line being evicted and then
filled with a cache-line-sized chunk from main memory.
So you're doing a little worse than the "cache busting" rate, and for
that, I haven't a potential explanation.
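To see the effect directly, here is a minimal sketch (my own; the
400 MB buffer size and the crude index scrambler are arbitrary
choices, not anything from your program) that times a sequential pass
against a random one-byte-at-a-time pass over the same buffer:

  // sketch: sequential vs. random access over a buffer much larger than L3
  #include <cstdio>
  #include <cstdlib>
  #include <ctime>
  #include <vector>

  int main()
  {
      const size_t N = 400u * 1024 * 1024;  // 400 MB, far larger than the 8 MB L3
      std::vector<unsigned char> buf(N, 1);
      unsigned long sum = 0;

      clock_t t0 = clock();
      for (size_t i = 0; i < N; i++)        // sequential: all 64 bytes of a line get used
          sum += buf[i];
      clock_t t1 = clock();
      for (size_t i = 0; i < N; i++)        // scattered: about one byte used per line fill
          sum += buf[((size_t)rand() * 65521u) % N];
      clock_t t2 = clock();

      double seq = double(t1 - t0) / CLOCKS_PER_SEC;
      double rnd = double(t2 - t1) / CLOCKS_PER_SEC;
      printf("sequential %.1f MB/s, random %.1f MB/s (checksum %lu)\n",
             N / seq / 1e6, N / rnd / 1e6, sum);
      return 0;
  }

Neither number will be exact (the loop and the rand() calls cost
something too), but the sequential pass should come out a large
multiple of the random one, for exactly the reason above.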
Both Intel and AMD should be able to provide you with architecture
bibles and programmer optimization hints. These can make a great
difference to program performance if followed.
http://www.intel.com/products/processor/manuals/
PDF page 249, Section 5.7.2, "Increasing Bandwidth of Memory Fills":
http://www.intel.com/Assets/PDF/manual/248966.pdf
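If memory serves, the techniques in that part of the manual revolve
around non-temporal ("streaming") stores, which write whole lines to
memory without pulling them through the caches first. A minimal
sketch of the idea (mine, assuming SSE2 and a 16-byte-aligned
destination; note it helps large fills, not your random reads):

  // sketch: fill a large buffer with non-temporal stores (SSE2)
  #include <emmintrin.h>  // _mm_stream_si128, _mm_set1_epi32, _mm_sfence

  void fill_streaming(void *dst, int value, unsigned long bytes)
  {
      __m128i v = _mm_set1_epi32(value);
      __m128i *p = (__m128i *)dst;          // dst must be 16-byte aligned
      for (unsigned long i = 0; i < bytes / 16; i++)
          _mm_stream_si128(p + i, v);       // bypasses the caches on the way out
      _mm_sfence();                         // make the streamed stores globally visible
  }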
But if the things you do are inherently cache busters, such as event
simulation for digital simulators or the like, then there may not be
much of anything you can do.
Except one thing: understand why I am only getting 2/3 of the cache
busting rate instead of 100% of it. Your explanation was very
helpful; before it, I thought that I was only getting 1% of the cache
busting rate.
I am accessing the data randomly across all of allocated memory, in
32-bit integer chunks. Allocated memory is between 400 MB and 1.5 GB.
  uint32 num = 0;
  for (uint32 N = 0; N < Max; N++)
    num = Data[num];  // finite state machine: each step is one random 4-byte read
Data is initialized with
  for (uint32 N = 0; N < Data.size(); N++)
    Data[N] = ((uint32)rand() * (uint32)rand()) % size;  // unsigned casts avoid signed overflow
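One caution about that initializer: because Data[N] is an arbitrary
random value rather than a permutation, the walk must eventually fall
into a repeating cycle, and for a random mapping that cycle is
typically only on the order of the square root of the array size,
small enough to fit in cache. The usual fix in pointer-chasing
benchmarks is to build a single cycle through the whole array with
Sattolo's algorithm, so every element is visited exactly once per
pass. A sketch, reusing the uint32 typedef from above (assumed here
to be unsigned int):

  // sketch: initialize Data as one big cycle (Sattolo's algorithm)
  #include <cstdlib>
  #include <vector>

  typedef unsigned int uint32;

  void init_single_cycle(std::vector<uint32> &Data)
  {
      for (uint32 N = 0; N < Data.size(); N++)
          Data[N] = N;                      // start from the identity permutation
      for (uint32 N = (uint32)Data.size() - 1; N > 0; N--) {
          uint32 R = (uint32)rand() % N;    // 0 <= R < N, never R == N
          uint32 tmp = Data[N];             // swapping keeps it one single cycle
          Data[N] = Data[R];
          Data[R] = tmp;
      }
  }

(rand() is a weak source here; on platforms where RAND_MAX is only
32767 you'd want a wider generator for arrays this big.)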
Post by Paul
If you were writing your own version of Photoshop, then you'd see
instances of the 11,852 MB/sec bandwidth when bringing in portions of
a picture for processing. And if the nearest neighbours needed for
some picture-processing algorithm are in cache, you'll see great
processing speeds there as well. Locality of access is important to
performance. In some cases you have to rewrite or re-order how you do
things with memory for best performance, as in the sketch below. If
you're "cache busting", the processor can't fix that for you.
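A classic small example of that re-ordering (a generic illustration,
not anything from Photoshop): summing a 2-D array row by row walks
memory sequentially and rides the caches, while summing the same
array column by column strides through it and misses constantly.

  // sketch: same arithmetic, different traversal order, very different caching
  #include <cstddef>
  #include <vector>

  const int ROWS = 8192, COLS = 8192;       // 8192 x 8192 ints, about 256 MB

  long sum_row_major(const std::vector<int> &a)  // sequential: cache friendly
  {
      long s = 0;
      for (int r = 0; r < ROWS; r++)
          for (int c = 0; c < COLS; c++)
              s += a[(std::size_t)r * COLS + c];
      return s;
  }

  long sum_col_major(const std::vector<int> &a)  // strided: ~one line fill per element
  {
      long s = 0;
      for (int c = 0; c < COLS; c++)
          for (int r = 0; r < ROWS; r++)
              s += a[(std::size_t)r * COLS + c];
      return s;
  }

  int main()
  {
      std::vector<int> a((std::size_t)ROWS * COLS, 1);
      return (int)(sum_row_major(a) != sum_col_major(a));  // same answer, different speed
  }

Both functions do the same 67 million additions; on hardware like the
above, the row-major version can easily be several times faster.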
Just a guess,
Paul
I took a look at the first source you released. I gave it a try in
Knoppix (a Linux LiveCD distro). That distro is a 32-bit version.
What I find bizarre about the behavior of the code is that when the
size is 100,000,000 32-bit numbers, the time to do the "random walk"
is
20.05 seconds
20.03 seconds
20.05 seconds
20.04 seconds
The random walk completes in precisely the same time, every time.
Now, if I switch to 200,000,000 32-bit numbers and recompile, I get
63.23 seconds (should be 40.1 seconds or so)
65.29 seconds
63.70 seconds
61.97 seconds
The variance is larger. And as you've already noted, the larger array
runs slower. Doubling the array should have taken twice the time, or
about 40.1 seconds.
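For anyone who wants to reproduce this without the original source, a
stand-in harness along the following lines should show the same shape
(a sketch: SIZE and the initializer follow the description upthread,
and Max is assumed equal to the element count):

  // sketch: time one pass of the random walk
  #include <cstdio>
  #include <cstdlib>
  #include <ctime>
  #include <vector>

  typedef unsigned int uint32;

  int main()
  {
      const uint32 SIZE = 100000000;        // 100,000,000 32-bit numbers (~400 MB)
      std::vector<uint32> Data(SIZE);
      for (uint32 N = 0; N < SIZE; N++)
          Data[N] = ((uint32)rand() * (uint32)rand()) % SIZE;

      uint32 num = 0;
      clock_t t0 = clock();
      for (uint32 N = 0; N < SIZE; N++)     // the walk itself
          num = Data[num];
      double secs = double(clock() - t0) / CLOCKS_PER_SEC;
      printf("random walk: %.2f seconds (num = %u)\n", secs, num);
      return 0;
  }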
My hypothesis in this case is that this is related to the TLB and
page tables. The kernel does seem to have some kind of option to
allow reservation of huge pages (but as far as I know, they're not
really that big).
http://lwn.net/Articles/375098/
My processor has up to 32 TLB entries of 4 MB size, which is still
not enough to prevent TLB misses. I expect the page tables are backed
by the L1 and L2 caches. The x86 uses hardware page-table walks, and
my guess would be that if a TLB miss and its associated table walk
hit in the L1 or L2 cache, that might prevent a total disaster. So
while 32 entries doesn't sound like a lot, maybe the TLB still gets
the benefit of being refilled from one of the caches, rather than by
a more expensive table walk in main memory.
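If someone with an installed system wants to test the huge-page idea,
the direct route on newer kernels (2.6.32 and later, if I have the
version right) is mmap with MAP_HUGETLB; it fails unless huge pages
have been reserved beforehand via /proc/sys/vm/nr_hugepages. A
sketch:

  // sketch: back the test array with huge pages to cut TLB misses
  // (Linux-specific; page size is 2 MB or 4 MB depending on paging mode)
  #include <sys/mman.h>
  #include <cstdio>

  int main()
  {
      size_t len = 512UL * 1024 * 1024;     // 512 MB, a multiple of the huge page size
      void *p = mmap(0, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
      if (p == MAP_FAILED) {
          perror("mmap(MAP_HUGETLB)");      // typically ENOMEM if none reserved
          return 1;
      }
      /* ... run the random walk inside [p, p+len) ... */
      munmap(p, len);
      return 0;
  }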
I wanted to experiment some more, but I usually restrict myself to
LiveCD environments; I'd have to install some version of Linux if I
wanted to rebuild the kernel and add stuff. I tried to download a
64-bit version of Ubuntu (on the chance that the wider range of TLB
page sizes might do some good), but when I went to use the Synaptic
Package Manager to install g++ to compile the code, the repository
wasn't wired up correctly. So that's all the testing I've managed to
complete so far: just on a 32-bit OS, and not with any added kernel
features.
So whatever slows it down, there is some variance in the execution
time as a result. The variance with the smaller array is amazingly
small.
Paul