8.5. Blocking

8.5. Blocking
	Chapter 8. Issue Reference

Figure 8.7. Blocking Issue

A blocking issue means that there is an opportunity to reduce the number of cache misses or fetches by processing a smaller piece of the data set at a time, thereby reusing cache lines before they are evicted from the cache. For a general description of blocking, see Section 5.2.3.1, “Blocking”.

Freja can suggest three types of blocking:

Temporal blocking.
It may be possible to increase the reuse of the same data that has already been used. Typically occurs when an algorithm performs multiple iterations over a data set that is too large to fit in the cache. By performing multiple iterations at a time on a smaller part of the data that fits in the cache, the cache line reuse can be improved.
Spatial blocking.
It may be possible to increase the reuse of other data in the cache lines that have already been used. Typically occurs within an iteration of an algorithm that touches too many other locations in the data set before reusing a location in the original cache line. By running the algorithm on smaller parts of the data set at a time the cache line reuse can be improved.
Spatial and temporal blocking.
It may be possible improve both kinds of reuse, typically both between iterations and within iterations of an algorithm.

The blocking issues contain several sections describing different loop levels in the loop hierarchy related to the issue, descriptions of these and other issue specific sections follows. The issues contain the following sections:

Statistics for instructions of this issue
Instructions benefiting from blocking
Instructions previously writing to related data (and also: Next instructions to write to related data).
Loops and barriers related to this issue
Loop statistics
Loop instructions

Different loops play different roles for blocking with respect to data reuse of a particular instruction. The following example contains four loop levels, some of which possess important traits to consider when blocking this piece of code. From the inside out:

for (i = 0; i < ITER; i++)
  for (j = 0; j < J_SIZE; j++) {
    a[j][0] = 0;
    for (k = 0; k < K_SIZE; k++)
      for (m = 0; m < M_SIZE; m++)
        b[m] += a[j][k];
  }

Instructions benefiting from blocking
The instructions whose cache line reuse may be possible to improve.
for m is the loop containing the target instruction a[][]. The goal is to have this instruction revisit data more aggressively, even before all its slots have been visited.
Outermost loop without barriers.
The outermost loop without barriers (data dependencies). This may represent a suitable loop level to block at.
for k is the outermost loop not having any blocking write instructions. (In this example, we find such a blocking write instruction in the next outer loop level: a[j][0] = ...).
Next outer loop, introducing barriers.
The innermost loop with data dependencies on some variable (which does not have to be the target instruction of the blocking issue). Blocking cannot normally be performed on this or on outer loop levels. The code may first need to be transformed to remove data dependency barriers. All barriers that have been detected are reported in this section.
for j is the next outer loop level with barriers in this example. (In this case, however, the barriers are inconsequential to the semantics of the program, at least for the example transform below, and would not need to be removed.)
Target loop
The optimal, outer loop level to block around.
for i is the optimal loop level to use as the outer loop. All values in a[][] are used in each iteration of this loop level.
To block this loop hierarchy in an optimal way, this is the loop level which should be moved closer to the instruction under consideration, by making each iteration of the inner loops work on a smaller chunk of data. Another way to say this is to split one or several loop nestings outwards past this loop level.

The following would be one proper blocking of this loop. The for j loop has been split up by the for i loop:

for (jj = 0; jj < J_SIZE; jj += BLOCK)
  for (i = 0; i < ITER; i++)
    for (j = jj; j < min(jj + BLOCK, J_SIZE); j++) {
      a[j][0] = 0;
      for (k = 0; k < K_SIZE; k++)
        for (m = 0; m < M_SIZE; m++)
          b[m] += a[j][k];
    }

The ideas underlying the blocking issue, and the sections presented here represents an ideal situation. In reality, due to effects of the sampling process, Freja may not have complete information about all loop levels. In such cases the closest identified loop level tends to be selected.

To help the programmer to judge how close the identified loop is from the optimal blocking level, Freja displays the number of iterations of that loop that corresponds to one data reuse by the instruction under consideration. The programmer can take this number and, while moving outwards from the target instruction, scan loops until the reported iteration count has been accounted for.


8.4. Loop Fusion		8.6. Software Prefetch Issues