Last updates: Tue May 8 19:16:06 2001; Fri Nov 12 15:26:10 2004; Thu Nov 13 18:30:20 2008; Mon Mar 1 16:28:36 2010
            OpenMP is a relatively new (1997) development
            in parallel computing.  It is a language-independent
            specification of multithreading, and implementations are
            available from several vendors (see the OpenMP Web site
            below for links to them).
        
            OpenMP is implemented as directives that take the form of
            structured comments in Fortran, and pragmas in C and C++,
            so that its presence is invisible to compilers lacking
            OpenMP support.  Thus, you can develop code that will run
            everywhere, and that will run even faster when OpenMP is
            available.
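        
            For example, in Fortran both the directives and the
            runtime library are reached through sentinels that a
            non-OpenMP compiler treats as ordinary comments.  Here is
            a minimal sketch (free-form Fortran, not taken from the
            paper or the benchmark below) of a program that compiles
            and runs serially without OpenMP, and reports the
            available thread count when OpenMP is enabled:
        
                program serial_or_parallel
                !$ use omp_lib            ! compiled only under OpenMP
                  implicit none
                  integer :: nthreads
                  ! Without OpenMP, every line starting with !$ is a
                  ! comment, so the program simply runs serially.
                  nthreads = 1
                !$ nthreads = omp_get_max_threads()
                  print *, 'maximum number of threads: ', nthreads
                end program serial_or_parallel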
        
            The OpenMP Consortium maintains a very useful
            Web site at
            
                http://www.openmp.org/,
            
            with links to vendors and resources.
        
            There is an excellent overview of the advantages of
            OpenMP over POSIX threads (pthreads)
            and PVM/MPI in the paper
            
                OpenMP:  A Proposed Industry Standard API for Shared
                Memory Processors,
            
            also available in
            
                HTML
            
            and
            
                PDF.
            
            This is a must-read if you are getting started in
            parallel programming.  It contains two simple examples
            programmed with OpenMP, pthreads,
            and MPI.
        
            The paper also gives a very convenient tabular comparison of
            OpenMP directives with Silicon Graphics
            parallelization directives.
        
            OpenMP can be used on uniprocessor and
            multiprocessor systems with shared memory.  It can also be
            used in programs that run in homogeneous or heterogeneous
            distributed-memory environments, which are typically
            supported by systems like Linda, MPI,
            and PVM, although the OpenMP
            part of the code will provide parallelization only
            on those processors that share memory.
        
            In distributed memory environments, the programmer must
            manually partition data between processors, and make special
            library calls to move the data back and forth.  While that
            kind of code can also be used in shared memory systems,
            OpenMP is much simpler to program.
            Thus, you can start parallelization of an application using
            OpenMP, and then later add MPI or
            PVM calls:  the two forms of parallelization
            can peacefully coexist in your program.
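        
            As an illustration of that coexistence, here is a minimal
            sketch (free-form Fortran, assuming an MPI library and its
            mpif.h header are installed; it is not part of the
            benchmark below) in which each MPI process sums its own
            share of 1..n, and the per-process loop is additionally
            threaded with OpenMP where available:
        
                program hybrid_sum
                  implicit none
                  include 'mpif.h'
                  integer, parameter :: n = 10000000
                  integer :: ierr, rank, nproc, i
                  double precision :: local, total
                  call MPI_Init(ierr)
                  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
                  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
                  local = 0.0d0
                  ! The OpenMP directive threads the local loop; a
                  ! compiler without OpenMP sees it as a comment.
                !$omp parallel do reduction(+:local)
                  do i = rank + 1, n, nproc
                     local = local + dble(i)
                  end do
                  call MPI_Reduce(local, total, 1, MPI_DOUBLE_PRECISION, &
                                  MPI_SUM, 0, MPI_COMM_WORLD, ierr)
                  if (rank == 0) print *, 'sum = ', total
                  call MPI_Finalize(ierr)
                end program hybrid_sum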
        
            An extensive bibliography on multithreading, including
            OpenMP, is available at
            
                http://www.math.utah.edu/pub/tex/bib/index-table-m.html#multithreading.
            
            MPI and PVM are covered in a
            separate bibliography:
            
                http://www.math.utah.edu/pub/tex/bib/index-table-p.html#pvm
            
        
OpenMP benchmark:  computation of pi
        
            This simple benchmark for the computation of pi is taken
            from the paper above.  Its read statement has
            been modified to read from stdin instead of the
            non-redirectable /dev/tty, and an extra final
            print statement has been added to show an
            accurate value of pi.
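        
            The sketch below (free-form Fortran) shows the kind of
            code involved: a midpoint-rule integration of 4/(1+x*x)
            over [0,1], in the spirit of the paper's example, with the
            two modifications just described.  The linked source code
            below is the version actually timed:
        
                program compute_pi
                  implicit none
                  integer :: n, i
                  double precision :: w, x, sum, pi
                  read *, n                     ! number of intervals (stdin)
                  w = 1.0d0 / n                 ! width of each interval
                  sum = 0.0d0
                !$omp parallel do private(x) reduction(+:sum)
                  do i = 1, n
                     x = w * (dble(i) - 0.5d0)  ! midpoint of interval i
                     sum = sum + 4.0d0 / (1.0d0 + x*x)
                  end do
                  pi = w * sum
                  print *, 'computed pi = ', pi
                  print *, 'accurate pi = ', 4.0d0 * atan(1.0d0)
                end program compute_pi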
        
            Follow these links for the
            
                source code,
            
            a
            
                shell script
            
            to run the benchmark, a UNIX
            
                Makefile,
            
            and a small
            
                awk program
            
            to extract the timing results for inclusion in tables like
            the ones below.
        
Compiler options needed to enable OpenMP directives during compilation:
            | Vendor | Compiler | Option | 
            | Compaq/DEC | f90 | -omp | 
| Compaq/DEC | f95 | -omp | 
| IBM | xlf90_r | -qsmp=omp -qfixed | 
| IBM | xlf95_r | -qsmp=omp -qfixed | 
| PGI | pgf77 | -mp | 
| PGI | pgf90 | -mp | 
| PGI | pgcc | -mp | 
| PGI | pgCC | -mp | 
| SGI | f77 | -mp | 
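          For example, with the PGI Fortran 77 compiler the benchmark
          source (here hypothetically called pi.f) would be built with
          something like pgf77 -mp pi.f, and analogously for the other
          rows of the table.
        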
          Once you have compiled with OpenMP support, the
          executable may still not run multithreaded, unless you
          preset an environment variable that defines the number of
          threads to use.  On most of the above systems, this variable
          is called OMP_NUM_THREADS.  This has no effect on the
          IBM systems; I'm still trying to find out what is expected
          there.
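        
          If setting an environment variable is inconvenient, the
          standard OpenMP runtime routine omp_set_num_threads can
          request a thread count from inside the program instead.  A
          minimal sketch follows (again written so that it still
          compiles without OpenMP; the choice of four threads here is
          arbitrary):
        
                program force_threads
                !$ use omp_lib
                  implicit none
                  ! Request four threads before the first parallel
                  ! region; this overrides OMP_NUM_THREADS.
                !$ call omp_set_num_threads(4)
                !$omp parallel
                !$ print *, 'hello from thread ', omp_get_thread_num()
                !$omp end parallel
                end program force_threads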
        
           When the Compaq/DEC benchmark below was run, there was one
           other single-CPU-bound process on the machine, so we should
           expect to have only 3 available CPUs.  As the number of
           threads increases beyond the number of available CPUs, we
           should expect a performance drop, unless those threads have
           idle time, such as from I/O activity.  For this simple
           benchmark, the loop is completely CPU bound.  Evidently,
           3 threads make almost perfect use of the machine, at a cost
           of only two simple OpenMP directives added to
           the original scalar program.
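        
           In the tables below, the speedup for n threads is the
           one-thread wallclock time divided by the n-thread wallclock
           time, so perfect scaling on n idle CPUs would give a
           speedup of n; for example, in the first table,
           8.310 / 4.030 = 2.062 for two threads.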
        
| Number of threads | Wallclock Time (sec) | Speedup | 
| 1 | 8.310 | 1.000 | 
| 2 | 4.030 | 2.062 | 
| 3 | 2.780 | 2.989 | 
| 4 | 2.130 | 3.901 | 
| 5 | 3.470 | 2.395 | 
| 6 | 2.930 | 2.836 | 
| 7 | 2.520 | 3.298 | 
| 8 | 2.280 | 3.645 | 
| Number of threads | Wallclock Time (sec) | Speedup | 
| 1 | 6.210 | 1.000 | 
| 2 | 3.110 | 1.997 | 
| 3 | 4.000 | 1.552 | 
| 4 | 4.390 | 1.415 | 
| Number of threads | Wallclock Time (sec) | Speedup | 
| 1 | 28.61 | 1.000 | 
| 2 | 14.33 | 1.997 | 
| 3 | 9.61 | 2.977 | 
| 4 | 7.63 | 3.750 | 
| 5 | 9.79 | 2.922 | 
| 6 | 9.80 | 2.919 | 
| 7 | 9.85 | 2.905 | 
| 8 | 13.15 | 2.176 | 
The previous two systems were essentially idle when the benchmark was run, and, as expected, the optimal speedup is obtained when the thread count matches the number of CPUs.
The next one is a large shared system on which the load average was about 40 (that is, about 2/3 busy) when the benchmark was run. With a large number of CPUs, the work per thread is reduced, and eventually, communication and scheduling overhead dominates computation. Consequently, the number of iterations was tripled for this benchmark. Since large tables of numbers are less interesting, the speedup is shown graphically as well. At 100% efficiency, the speedup would be a 45-degree line in the plot. With a machine of this size, it is almost impossible to ever find it idle, though it would be interesting to see how well the benchmark would scale without competition from other users for the CPUs.
| Number of threads | Wallclock Time (sec) | Speedup | 
| 1 | 32.651 | 1.000 | 
| 2 | 16.348 | 1.997 | 
| 3 | 10.943 | 2.984 | 
| 4 | 8.272 | 3.947 | 
| 5 | 7.178 | 4.549 | 
| 6 | 5.794 | 5.635 | 
| 7 | 4.927 | 6.627 | 
| 8 | 4.446 | 7.344 | 
| 9 | 4.021 | 8.120 | 
| 10 | 3.577 | 9.128 | 
| 11 | 3.409 | 9.578 | 
| 12 | 3.021 | 10.808 | 
| 13 | 2.928 | 11.151 | 
| 14 | 2.645 | 12.344 | 
| 15 | 2.493 | 13.097 | 
| 16 | 2.414 | 13.526 | 
| 17 | 2.208 | 14.788 | 
| 18 | 2.170 | 15.047 | 
| 19 | 2.051 | 15.920 | 
| 20 | 2.051 | 15.920 | 
| 21 | 2.082 | 15.683 | 
| 22 | 1.791 | 18.231 | 
| 23 | 1.824 | 17.901 | 
| 24 | 2.457 | 13.289 | 
| 25 | 2.586 | 12.626 | 
| 26 | 3.134 | 10.418 | 
| 27 | 5.200 | 6.279 | 
| 28 | 5.454 | 5.987 | 
| 29 | 3.431 | 9.516 | 
| 30 | 2.427 | 13.453 | 
| 31 | 3.021 | 10.808 | 
| 32 | 2.418 | 13.503 | 
| 33 | 5.092 | 6.412 | 
| 34 | 7.601 | 4.296 | 
| 35 | 8.790 | 3.715 | 
| 36 | 6.369 | 5.127 | 
| 37 | 6.232 | 5.239 | 
| 38 | 5.588 | 5.843 | 
| 39 | 6.470 | 5.047 | 
| 40 | 7.166 | 4.556 | 
| 41 | 6.218 | 5.251 | 
| 42 | 7.450 | 4.383 | 
| 43 | 6.298 | 5.184 | 
| 44 | 6.475 | 5.043 | 
| 45 | 15.411 | 2.119 | 
| 46 | 7.466 | 4.373 | 
| 47 | 8.293 | 3.937 | 
| 48 | 6.872 | 4.751 | 
| 49 | 8.884 | 3.675 | 
| 50 | 8.006 | 4.078 | 
| 51 | 9.614 | 3.396 | 
| 52 | 25.223 | 1.294 | 
| 53 | 10.789 | 3.026 | 
| 54 | 32.958 | 0.991 | 
| 55 | 35.816 | 0.912 | 
| 56 | 36.213 | 0.902 | 
| 57 | 8.301 | 3.933 | 
| 58 | 11.487 | 2.842 | 
| 59 | 71.526 | 0.456 | 
| 60 | 10.361 | 3.151 | 
| 61 | 52.518 | 0.622 | 
| 62 | 33.081 | 0.987 | 
| 63 | 32.493 | 1.005 | 
| 64 | 95.322 | 0.343 | 

(4 EV6 21264 CPUs, 500 MHz, 4GB RAM), OSF/1 4.0F:
| Number of threads | Wallclock Time (sec) | Speedup | 
| 1 | 26.470 | 1.000 | 
| 2 | 13.260 | 1.996 | 
| 3 | 8.840 | 2.994 | 
| 4 | 6.650 | 3.980 | 
| 5 | 8.080 | 3.276 | 
| 6 | 6.770 | 3.910 | 
| 7 | 6.850 | 3.864 | 
| 8 | 6.670 | 3.969 | 
| 9 | 7.200 | 3.676 | 
| 10 | 7.130 | 3.712 | 
| 11 | 7.120 | 3.718 | 
| 12 | 6.690 | 3.957 | 
| 13 | 7.180 | 3.687 | 
| 14 | 7.300 | 3.626 | 
| 15 | 7.170 | 3.692 | 
| 16 | 6.710 | 3.945 | 

(32 EV6.7 21264A CPUs, 667 MHz, 8GB RAM):
| Number of threads | Wallclock Time (sec) | Speedup | 
| 1 | 2.500 | 1.000 | 
| 2 | 1.600 | 1.562 | 
| 3 | 1.300 | 1.923 | 
| 4 | 1.500 | 1.667 | 
| 5 | 2.000 | 1.250 | 
| 6 | 2.000 | 1.250 | 
| 7 | 1.800 | 1.389 | 
| 8 | 1.200 | 2.083 | 
| 9 | 1.500 | 1.667 | 
| 10 | 1.900 | 1.316 | 
| 11 | 1.900 | 1.316 | 
| 12 | 1.900 | 1.316 | 
| 13 | 3.200 | 0.781 | 
| 14 | 2.400 | 1.042 | 
| 15 | 1.900 | 1.316 | 
| 16 | 2.200 | 1.136 | 
| 17 | 1.900 | 1.316 | 
| 18 | 1.800 | 1.389 | 
| 19 | 2.100 | 1.190 | 
| 20 | 1.600 | 1.562 | 
| 21 | 2.600 | 0.962 | 
| 22 | 1.500 | 1.667 | 
| 23 | 1.800 | 1.389 | 
| 24 | 1.600 | 1.562 | 
| 25 | 1.500 | 1.667 | 
| 26 | 2.100 | 1.190 | 
| 27 | 1.800 | 1.389 | 
| 28 | 1.700 | 1.471 | 
| 29 | 2.200 | 1.136 | 
| 30 | 2.400 | 1.042 | 
| 31 | 2.100 | 1.190 | 
| 32 | 2.500 | 1.000 | 
| 33 | 2.500 | 1.000 | 
| 34 | 1.900 | 1.316 | 
| 35 | 1.800 | 1.389 | 
| 36 | 2.500 | 1.000 | 
| 37 | 1.600 | 1.562 | 
| 38 | 1.600 | 1.562 | 
| 39 | 2.200 | 1.136 | 
| 40 | 2.500 | 1.000 | 
| 41 | 2.200 | 1.136 | 
| 42 | 1.500 | 1.667 | 
| 43 | 3.100 | 0.806 | 
| 44 | 2.400 | 1.042 | 
| 45 | 2.500 | 1.000 | 
| 46 | 2.400 | 1.042 | 
| 47 | 2.500 | 1.000 | 
| 48 | 1.600 | 1.562 | 
| 49 | 3.300 | 0.758 | 
| 50 | 2.200 | 1.136 | 
| 51 | 2.600 | 0.962 | 
| 52 | 3.200 | 0.781 | 
| 53 | 2.400 | 1.042 | 
| 54 | 1.800 | 1.389 | 
| 55 | 3.000 | 0.833 | 
| 56 | 4.900 | 0.510 | 
| 57 | 1.800 | 1.389 | 
| 58 | 2.700 | 0.926 | 
| 59 | 3.100 | 0.806 | 
| 60 | 2.700 | 0.926 | 
| 61 | 3.600 | 0.694 | 
| 62 | 3.000 | 0.833 | 
| 63 | 2.300 | 1.087 | 
| 64 | 3.700 | 0.676 | 

Additional benchmark tables (not shown here): (two 8-core CPUs, 128 threads, 1200 MHz UltraSPARC T2 Plus, 64GB RAM) Solaris 10; (4 CPUs, 16 threads/CPU) GNU/Linux.