PAR_MEM(8)                          LMBENCH                         PAR_MEM(8)

NAME
       par_mem - memory parallelism benchmark

SYNOPSIS
       par_mem [ -L <line size> ] [ -M <len> ] [ -W <warmups> ] [ -N
       <repetitions> ]

DESCRIPTION
       par_mem measures the available parallelism in the memory hierarchy, up
       to len bytes.  Modern processors can often service multiple memory
       requests in parallel, while older processors typically blocked on LOAD
       instructions and had no available parallelism (other than that provided
       by cache prefetching).  par_mem measures the available parallelism at a
       variety of points, since the available parallelism is often a function
       of the data location in the memory hierarchy.

       To measure the available parallelism, par_mem conducts a series of
       experiments at each memory size, one for each level of parallelism.
       It builds a pointer chain of the desired length, then creates an
       array of pointers to chain entries that are evenly spaced across the
       chain, and then runs those pointers forward through the chain in
       parallel.  This yields the average memory latency for each level of
       parallelism.  The available parallelism is the average memory latency
       for parallelism 1 divided by the minimum average memory latency across
       all levels of parallelism.
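
       The chain setup described above can be sketched in C as follows.
       This is a minimal illustration, not the par_mem source: the helper
       names build_chain, spread_pointers, and chase are hypothetical, and
       the chain is linked in simple sequential order, which the real
       benchmark may not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: link n slots into one circular pointer chain.
 * Slot i points at slot (i + 1) % n.  A real benchmark would likely
 * use a less predictable order to limit hardware prefetching. */
static char **build_chain(size_t n)
{
    assert(n > 0);
    char **chain = malloc(n * sizeof(*chain));
    if (chain == NULL)
        return NULL;
    for (size_t i = 0; i < n; ++i)
        chain[i] = (char *)&chain[(i + 1) % n];
    return chain;
}

/* Hypothetical helper: place k start pointers evenly across the chain. */
static void spread_pointers(char **chain, size_t n, char ***p, size_t k)
{
    assert(k > 0 && k <= n);
    for (size_t j = 0; j < k; ++j)
        p[j] = &chain[j * (n / k)];
}

/* Advance every pointer by `steps` hops.  The k chases do not depend
 * on each other, so hardware that supports it may overlap the loads. */
static void chase(char ***p, size_t k, size_t steps)
{
    for (size_t s = 0; s < steps; ++s)
        for (size_t j = 0; j < k; ++j)
            p[j] = (char **)*p[j];
}
```

       A timing harness would call chase() once per level of parallelism
       and divide the elapsed time by the total number of loads.  Keeping
       the pointers in an array is a simplification made here for brevity;
       a measurement loop would more plausibly use separate local variables
       so that each pointer can live in a register.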

       For example, the inner loop which measures parallelism 2 would look
       something like:

              for (i = 0; i < N; ++i) {
                      p0 = (char **)*p0;
                      p1 = (char **)*p1;
              }

       The overhead of the for loop is not significant, because the loop
       body is unrolled to 100 loads.  In this case, if the
       hardware can process two LOAD operations in parallel, then the overall
       latency of the loop should be equivalent to that of a single pointer
       chain, so the measured parallelism would be roughly two.  If, however,
       the hardware can only process a single LOAD operation at once, or if
       there is (significant) resource contention between the two LOAD
       operations, then the loop will be much slower than a loop with a single
       pointer chain, so the measured parallelism will be less than two, but
       probably no smaller than one.
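
       The arithmetic behind this example can be sketched as a small C
       function.  It assumes that the reported parallelism is the
       single-chain latency divided by the lowest per-load latency seen at
       any level; the function name available_parallelism and the input
       layout are hypothetical, not taken from the par_mem source.

```c
#include <assert.h>
#include <stddef.h>

/* per_load_ns[k - 1] is the average time per load, in nanoseconds,
 * when k chains are chased at once (hypothetical input layout).
 * Returns the single-chain latency divided by the best per-load
 * latency across all levels. */
static double available_parallelism(const double *per_load_ns, size_t levels)
{
    assert(levels > 0 && per_load_ns[0] > 0.0);
    double best = per_load_ns[0];
    for (size_t i = 1; i < levels; ++i)
        if (per_load_ns[i] > 0.0 && per_load_ns[i] < best)
            best = per_load_ns[i];
    return per_load_ns[0] / best;
}
```

       With made-up numbers: if one chain costs 100 ns per load and two
       chains together cost 50 ns per load, the function returns 2.  If
       adding the second chain leaves the per-load cost at 100 ns, it
       returns 1.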

OUTPUT
       The output format is intended as input to xgraph or a similar program
       (we use a perl script that produces pic input).  One set of data is
       produced for each stride.  The data set title is the stride size, and
       each data point consists of the array size in megabytes (a floating
       point value) and the load latency over all points in that array.

SEE ALSO
       lmbench(8), line(8), cache(8), tlb(8), par_ops(8).

AUTHOR
       Carl Staelin and Larry McVoy

       Comments, suggestions, and bug reports are always welcome.

(c)2000 Carl Staelin and Larry McVoy
                                    $Date$                          PAR_MEM(8)