Memory Access Patterns when using parallel programming

    A modern memory system is organized as a hierarchy, where the largest, and also slowest, part of memory is known as main memory. Main memory is organized into pages, a subset of which will be available to a given application. The memory levels closer to the processor are successively smaller and faster and are collectively known as cache. When a program is compiled, the compiler will arrange for its data objects to be stored in main memory; they will be transferred to cache when needed. If a value required for a computation is not already in a cache (we call this a cache “miss”), it must be retrieved from higher levels of the memory hierarchy, a process that can be quite expensive. Program data is brought into cache in chunks called blocks, each of which will occupy a line of cache. Data that is already in cache may need to be removed, or “evicted”, to make space for a new block of data. The memory hierarchy is (with rare exceptions) not explicitly programmable by either the user or the compiler. Rather, data are fetched into cache and evicted from it dynamically as the need arises. Given the penalty paid whenever values must be retrieved from other levels of memory, various strategies have been devised that can help the compiler and the programmer to (indirectly) reduce the frequency with which this situation occurs. A major goal is to organize data accesses so that values are used as often as possible while they are still in cache. The most common strategies for doing so are based on the fact that programming languages typically specify that the elements of arrays be stored contiguously in memory. Thus, if an array element is fetched into cache,"nearby" elements of the array will be in the same cache block and will be fetched as part of the same transaction. If a computation that uses any of these values can be performed while they are still in cache, it will be beneficial for performance. To get good performance, a matrix-based computation should access the elements of the array row by row, not column by column.

  • **Method Recommended**
   1: for (int i=0; i<n; i++)
   2:     for (int j=0; j<n; j++)
   3:         sum += a[i][j];
  • **Method Not Recommended**
   1: for (int j=0; j<n; j++)
   2:     for (int i=0; i<n; i++)
   3:         sum += a[i][j];