Sometimes it is simply not possible to increase the temporal reuse of an algorithm. Most modern processors have instructions for handling non-temporal data that can be used to optimize cache usage in such cases; see Section 5.3, “Non-Temporal Data”.
Use non-temporal prefetches to tell the processor which cache lines you know will be evicted from the cache before they are reused. This frees up cache space, which may allow other data that was previously evicted to stay in the cache.
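As a minimal sketch of such a hint, the loop below reads a large array once and issues a non-temporal prefetch a fixed distance ahead of the current position. It assumes a GCC- or Clang-compatible compiler, whose `__builtin_prefetch` takes a locality argument where 0 means no temporal locality. The function name `sum_streaming`, the 64-byte cache line size, and the prefetch distance are illustrative assumptions, not values taken from this manual.

```c
#include <stddef.h>

/* Assumed cache line size; 64 bytes is common but not universal. */
#define CACHE_LINE 64
/* Hypothetical prefetch distance in bytes; tune it by measurement. */
#define PREFETCH_AHEAD (8 * CACHE_LINE)

/* Sums an array that is read once and not reused before eviction. */
double sum_streaming(const double *data, size_t n)
{
    const size_t per_line = CACHE_LINE / sizeof(double);
    const size_t ahead = PREFETCH_AHEAD / sizeof(double);
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        /* Once per cache line, request the line a fixed distance ahead.
           rw = 0 means read; locality = 0 means no temporal locality. */
        if (i % per_line == 0 && i + ahead < n)
            __builtin_prefetch(&data[i + ahead], 0, 0);
        sum += data[i];
    }
    return sum;
}
```

Whether the hint helps depends on the processor and on the access pattern, so it is worth measuring fetches and run time before and after adding it.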
When writing contiguous regions of non-temporal data, use non-temporal store instructions instead of non-temporal prefetches. This avoids fetching data from memory only to overwrite it.
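The following sketch shows one way to write a contiguous region with non-temporal stores, assuming an x86 processor with SSE2 and the `_mm_stream_si128` intrinsic from `<emmintrin.h>`. On other targets it falls back to ordinary stores. The function name `fill_streaming` and its alignment requirements are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Fills n 32-bit words starting at dst with value.
   Assumes dst is 16-byte aligned and n is a multiple of 4. */
void fill_streaming(int32_t *dst, size_t n, int32_t value)
{
    assert(((uintptr_t)dst % 16) == 0 && n % 4 == 0);
#if defined(__SSE2__)
    const __m128i v = _mm_set1_epi32(value);
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_si128((__m128i *)&dst[i], v); /* non-temporal store */
    _mm_sfence(); /* order the streaming stores before later stores */
#else
    for (size_t i = 0; i < n; i++)
        dst[i] = value; /* portable fallback: ordinary stores */
#endif
}
```

Because the whole region is overwritten, the old contents never need to be read, which is the case where a non-temporal store is preferable to a prefetch.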
Tip: Having too many active non-temporal store streams will result in partially filled store buffers being written back to memory. This severely impacts the performance of the application.
Tip: Consider blocking algorithms that would otherwise have too many parallel non-temporal store streams in flight.
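One possible way to block such an algorithm is sketched below: interleaved input is split into one output array per channel, but only a limited number of output streams is written in each pass. The limit `MAX_STREAMS`, the helper `store_nt`, and the function `deinterleave_blocked` are hypothetical names chosen for this example. The sketch assumes SSE2 for `_mm_stream_si32` and uses ordinary stores elsewhere.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical limit on concurrent store streams; tune it by measurement. */
#define MAX_STREAMS 4

/* Writes one 32-bit word, non-temporally where SSE2 is available. */
static inline void store_nt(int32_t *p, int32_t v)
{
#if defined(__SSE2__)
    _mm_stream_si32((int *)p, v);
#else
    *p = v;
#endif
}

/* Copies in[f * channels + c] to out[c][f] for every frame f and channel c,
   writing at most MAX_STREAMS output arrays during any one pass. */
void deinterleave_blocked(const int32_t *in, int32_t *const *out,
                          size_t frames, size_t channels)
{
    for (size_t c0 = 0; c0 < channels; c0 += MAX_STREAMS) {
        size_t c1 = c0 + MAX_STREAMS < channels ? c0 + MAX_STREAMS : channels;
        for (size_t f = 0; f < frames; f++)
            for (size_t c = c0; c < c1; c++)
                store_nt(&out[c][f], in[f * channels + c]);
    }
#if defined(__SSE2__)
    _mm_sfence(); /* order the streaming stores before later stores */
#endif
}
```

The trade-off is that the input is read once per pass, so this only pays off when the cost of the extra reads is lower than the cost of flushing partially filled buffers. A suitable limit depends on the processor and should be found by measurement.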
Tip: Apply non-temporal prefetches iteratively to data structures with large reuse distances. Start with the accesses to the data structures that contribute most to the total number of fetches.