Last updates: Tue May 8 19:16:06 2001; Fri Nov 12 15:26:10 2004; Thu Nov 13 18:30:20 2008; Mon Mar 1 16:28:36 2010
            OpenMP is a relatively new (1997) development
            in parallel computing.  It is a language-independent
            specification of multithreading, and implementations are
            available from several vendors (see the OpenMP Web site
            below for links to them).
        
            OpenMP is implemented as directives that take the form of
            structured comments in Fortran, and pragmas in C and C++,
            so that its presence is invisible to compilers lacking
            OpenMP support.  Thus, you can develop code that will run
            everywhere, and that will run even faster when OpenMP is
            available.
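        
            For example, in Fortran both the directives and the
            runtime library are reached through sentinels that a
            non-OpenMP compiler treats as ordinary comments.  Here is
            a minimal sketch (free-form Fortran, not taken from the
            paper or the benchmark below) of a program that compiles
            and runs serially without OpenMP, and reports the
            available thread count when OpenMP is enabled:
        
                program serial_or_parallel
                !$ use omp_lib            ! compiled only under OpenMP
                  implicit none
                  integer :: nthreads
                  ! Without OpenMP, every line starting with !$ is a
                  ! comment, so the program simply runs serially.
                  nthreads = 1
                !$ nthreads = omp_get_max_threads()
                  print *, 'maximum number of threads: ', nthreads
                end program serial_or_parallel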
        
            The OpenMP Consortium maintains a very useful
            Web site at
            
                http://www.openmp.org/,
            
            with links to vendors and resources.
        
            There is an excellent overview of the advantages of
            OpenMP over POSIX threads (pthreads)
            and PVM/MPI in the paper
            
                OpenMP:  A Proposed Industry Standard API for Shared
                Memory Processors,
            
            also available in
            
                HTML
            
            and
            
                PDF.
            
            This is a must-read if you are getting started in
            parallel programming.  It contains two simple examples
            programmed with OpenMP, pthreads,
            and MPI.
        
            The paper also gives a very convenient tabular comparison of
            OpenMP directives with Silicon Graphics
            parallelization directives.
        
            OpenMP can be used on uniprocessor and
            multiprocessor systems with shared memory.  It can also be
            used in programs that run in homogeneous or heterogeneous
            distributed-memory environments, which are typically
            supported by systems like Linda, MPI,
            and PVM, although the OpenMP
            part of the code will provide parallelization only
            on those processors that share memory.
        
            In distributed memory environments, the programmer must
            manually partition data between processors, and make special
            library calls to move the data back and forth.  While that
            kind of code can also be used in shared memory systems,
            OpenMP is much simpler to program.
            Thus, you can start parallelization of an application using
            OpenMP, and then later add MPI or
            PVM calls:  the two forms of parallelization
            can peacefully coexist in your program.
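        
            As an illustration of that coexistence, here is a minimal
            sketch (free-form Fortran, assuming an MPI library and its
            mpif.h header are installed; it is not part of the
            benchmark below) in which each MPI process sums its own
            share of 1..n, and the per-process loop is additionally
            threaded with OpenMP where available:
        
                program hybrid_sum
                  implicit none
                  include 'mpif.h'
                  integer, parameter :: n = 10000000
                  integer :: ierr, rank, nproc, i
                  double precision :: local, total
                  call MPI_Init(ierr)
                  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
                  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
                  local = 0.0d0
                  ! The OpenMP directive threads the local loop; a
                  ! compiler without OpenMP sees it as a comment.
                !$omp parallel do reduction(+:local)
                  do i = rank + 1, n, nproc
                     local = local + dble(i)
                  end do
                  call MPI_Reduce(local, total, 1, MPI_DOUBLE_PRECISION, &
                                  MPI_SUM, 0, MPI_COMM_WORLD, ierr)
                  if (rank == 0) print *, 'sum = ', total
                  call MPI_Finalize(ierr)
                end program hybrid_sum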
        
            An extensive bibliography on multithreading, including
            OpenMP, is available at
            
                http://www.math.utah.edu/pub/tex/bib/index-table-m.html#multithreading.
            
            MPI and PVM are covered in a
            separate bibliography:
            
                http://www.math.utah.edu/pub/tex/bib/index-table-p.html#pvm
            
        
OpenMP benchmark:  computation of pi
        
            This simple benchmark for the computation of pi is taken
            from the paper above.  Its read statement has
            been modified to read from stdin instead of the
            non-redirectable /dev/tty, and an extra final
            print statement has been added to show an
            accurate value of pi.
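        
            The sketch below (free-form Fortran) shows the kind of
            code involved: a midpoint-rule integration of 4/(1+x*x)
            over [0,1], in the spirit of the paper's example, with the
            two modifications just described.  The linked source code
            below is the version actually timed:
        
                program compute_pi
                  implicit none
                  integer :: n, i
                  double precision :: w, x, sum, pi
                  read *, n                     ! number of intervals (stdin)
                  w = 1.0d0 / n                 ! width of each interval
                  sum = 0.0d0
                !$omp parallel do private(x) reduction(+:sum)
                  do i = 1, n
                     x = w * (dble(i) - 0.5d0)  ! midpoint of interval i
                     sum = sum + 4.0d0 / (1.0d0 + x*x)
                  end do
                  pi = w * sum
                  print *, 'computed pi = ', pi
                  print *, 'accurate pi = ', 4.0d0 * atan(1.0d0)
                end program compute_pi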
        
            Follow these links for the
            
                source code,
            
            a
            
                shell script
            
            to run the benchmark, a UNIX
            
                Makefile,
            
            and a small
            
                awk program
            
            to extract the timing results for inclusion in tables like
            the ones below.
        
Compiler options needed to enable OpenMP directives during compilation:
            | Vendor | Compiler | Option | 
            | Compaq/DEC | f90 | -omp | 
| Compaq/DEC | f95 | -omp | 
| IBM | xlf90_r | -qsmp=omp -qfixed | 
| IBM | xlf95_r | -qsmp=omp -qfixed | 
| PGI | pgf77 | -mp | 
| PGI | pgf90 | -mp | 
| PGI | pgcc | -mp | 
| PGI | pgCC | -mp | 
| SGI | f77 | -mp | 
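          For example, with the PGI Fortran 77 compiler the benchmark
          source (here hypothetically called pi.f) would be built with
          something like pgf77 -mp pi.f, and analogously for the other
          rows of the table.
        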
          Once you have compiled with OpenMP support, the
          executable may still not run multithreaded, unless you
          preset an environment variable that defines the number of
          threads to use.  On most of the above systems, this variable
          is called OMP_NUM_THREADS.  This has no effect on the
          IBM systems; I'm still trying to find out what is expected
          there.
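        
          If setting an environment variable is inconvenient, the
          standard OpenMP runtime routine omp_set_num_threads can
          request a thread count from inside the program instead.  A
          minimal sketch follows (again written so that it still
          compiles without OpenMP; the choice of four threads here is
          arbitrary):
        
                program force_threads
                !$ use omp_lib
                  implicit none
                  ! Request four threads before the first parallel
                  ! region; this overrides OMP_NUM_THREADS.
                !$ call omp_set_num_threads(4)
                !$omp parallel
                !$ print *, 'hello from thread ', omp_get_thread_num()
                !$omp end parallel
                end program force_threads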
        
           When the Compaq/DEC benchmark below was run, there was one
           other single-CPU-bound process on the machine, so we should
           expect to have only 3 available CPUs.  As the number of
           threads increases beyond the number of available CPUs, we
           should expect a performance drop, unless those threads have
           idle time, such as from I/O activity.  For this simple
           benchmark, the loop is completely CPU bound.  Evidently,
           3 threads make almost perfect use of the machine, at a cost
           of only two simple OpenMP directives added to
           the original scalar program.
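        
           In the tables below, the speedup for n threads is the
           one-thread wallclock time divided by the n-thread wallclock
           time, so perfect scaling on n idle CPUs would give a
           speedup of n; for example, in the first table,
           8.310 / 4.030 = 2.062 for two threads.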
        
| Number of threads | Wallclock Time (sec) | Speedup | 
| 1 | 8.310 | 1.000 | 
| 2 | 4.030 | 2.062 | 
| 3 | 2.780 | 2.989 | 
| 4 | 2.130 | 3.901 | 
| 5 | 3.470 | 2.395 | 
| 6 | 2.930 | 2.836 | 
| 7 | 2.520 | 3.298 | 
| 8 | 2.280 | 3.645 | 
| Number of threads | Wallclock Time (sec) | Speedup | 
| 1 | 6.210 | 1.000 | 
| 2 | 3.110 | 1.997 | 
| 3 | 4.000 | 1.552 | 
| 4 | 4.390 | 1.415 | 
| Number of threads | Wallclock Time (sec) | Speedup | 
| 1 | 28.61 | 1.000 | 
| 2 | 14.33 | 1.997 | 
| 3 | 9.61 | 2.977 | 
| 4 | 7.63 | 3.750 | 
| 5 | 9.79 | 2.922 | 
| 6 | 9.80 | 2.919 | 
| 7 | 9.85 | 2.905 | 
| 8 | 13.15 | 2.176 | 
The previous two systems were essentially idle when the benchmark was run, and, as expected, the optimal speedup is obtained when the thread count matches the number of CPUs.
The next one is a large shared system on which the load average was about 40 (that is, about 2/3 busy) when the benchmark was run. With a large number of CPUs, the work per thread is reduced, and eventually, communication and scheduling overhead dominates computation. Consequently, the number of iterations was tripled for this benchmark. Since large tables of numbers are less interesting, the speedup is shown graphically as well. At 100% efficiency, the speedup would be a 45-degree line in the plot. With a machine of this size, it is almost impossible to ever find it idle, though it would be interesting to see how well the benchmark would scale without competition from other users for the CPUs.
| Number of threads | Wallclock Time (sec) | Speedup | 
| 1 | 32.651 | 1.000 | 
| 2 | 16.348 | 1.997 | 
| 3 | 10.943 | 2.984 | 
| 4 | 8.272 | 3.947 | 
| 5 | 7.178 | 4.549 | 
| 6 | 5.794 | 5.635 | 
| 7 | 4.927 | 6.627 | 
| 8 | 4.446 | 7.344 | 
| 9 | 4.021 | 8.120 | 
| 10 | 3.577 | 9.128 | 
| 11 | 3.409 | 9.578 | 
| 12 | 3.021 | 10.808 | 
| 13 | 2.928 | 11.151 | 
| 14 | 2.645 | 12.344 | 
| 15 | 2.493 | 13.097 | 
| 16 | 2.414 | 13.526 | 
| 17 | 2.208 | 14.788 | 
| 18 | 2.170 | 15.047 | 
| 19 | 2.051 | 15.920 | 
| 20 | 2.051 | 15.920 | 
| 21 | 2.082 | 15.683 | 
| 22 | 1.791 | 18.231 | 
| 23 | 1.824 | 17.901 | 
| 24 | 2.457 | 13.289 | 
| 25 | 2.586 | 12.626 | 
| 26 | 3.134 | 10.418 | 
| 27 | 5.200 | 6.279 | 
| 28 | 5.454 | 5.987 | 
| 29 | 3.431 | 9.516 | 
| 30 | 2.427 | 13.453 | 
| 31 | 3.021 | 10.808 | 
| 32 | 2.418 | 13.503 | 
| 33 | 5.092 | 6.412 | 
| 34 | 7.601 | 4.296 | 
| 35 | 8.790 | 3.715 | 
| 36 | 6.369 | 5.127 | 
| 37 | 6.232 | 5.239 | 
| 38 | 5.588 | 5.843 | 
| 39 | 6.470 | 5.047 | 
| 40 | 7.166 | 4.556 | 
| 41 | 6.218 | 5.251 | 
| 42 | 7.450 | 4.383 | 
| 43 | 6.298 | 5.184 | 
| 44 | 6.475 | 5.043 | 
| 45 | 15.411 | 2.119 | 
| 46 | 7.466 | 4.373 | 
| 47 | 8.293 | 3.937 | 
| 48 | 6.872 | 4.751 | 
| 49 | 8.884 | 3.675 | 
| 50 | 8.006 | 4.078 | 
| 51 | 9.614 | 3.396 | 
| 52 | 25.223 | 1.294 | 
| 53 | 10.789 | 3.026 | 
| 54 | 32.958 | 0.991 | 
| 55 | 35.816 | 0.912 | 
| 56 | 36.213 | 0.902 | 
| 57 | 8.301 | 3.933 | 
| 58 | 11.487 | 2.842 | 
| 59 | 71.526 | 0.456 | 
| 60 | 10.361 | 3.151 | 
| 61 | 52.518 | 0.622 | 
| 62 | 33.081 | 0.987 | 
| 63 | 32.493 | 1.005 | 
| 64 | 95.322 | 0.343 | 

(4 EV6 21264 CPUs, 500 MHz, 4GB RAM), OSF/1 4.0F:
| Number of threads | Wallclock Time (sec) | Speedup | 
| 1 | 26.470 | 1.000 | 
| 2 | 13.260 | 1.996 | 
| 3 | 8.840 | 2.994 | 
| 4 | 6.650 | 3.980 | 
| 5 | 8.080 | 3.276 | 
| 6 | 6.770 | 3.910 | 
| 7 | 6.850 | 3.864 | 
| 8 | 6.670 | 3.969 | 
| 9 | 7.200 | 3.676 | 
| 10 | 7.130 | 3.712 | 
| 11 | 7.120 | 3.718 | 
| 12 | 6.690 | 3.957 | 
| 13 | 7.180 | 3.687 | 
| 14 | 7.300 | 3.626 | 
| 15 | 7.170 | 3.692 | 
| 16 | 6.710 | 3.945 | 

(32 EV6.7 21264A CPUs, 667 MHz, 8GB RAM):
| Number of threads | Wallclock Time (sec) | Speedup | 
| 1 | 2.500 | 1.000 | 
| 2 | 1.600 | 1.562 | 
| 3 | 1.300 | 1.923 | 
| 4 | 1.500 | 1.667 | 
| 5 | 2.000 | 1.250 | 
| 6 | 2.000 | 1.250 | 
| 7 | 1.800 | 1.389 | 
| 8 | 1.200 | 2.083 | 
| 9 | 1.500 | 1.667 | 
| 10 | 1.900 | 1.316 | 
| 11 | 1.900 | 1.316 | 
| 12 | 1.900 | 1.316 | 
| 13 | 3.200 | 0.781 | 
| 14 | 2.400 | 1.042 | 
| 15 | 1.900 | 1.316 | 
| 16 | 2.200 | 1.136 | 
| 17 | 1.900 | 1.316 | 
| 18 | 1.800 | 1.389 | 
| 19 | 2.100 | 1.190 | 
| 20 | 1.600 | 1.562 | 
| 21 | 2.600 | 0.962 | 
| 22 | 1.500 | 1.667 | 
| 23 | 1.800 | 1.389 | 
| 24 | 1.600 | 1.562 | 
| 25 | 1.500 | 1.667 | 
| 26 | 2.100 | 1.190 | 
| 27 | 1.800 | 1.389 | 
| 28 | 1.700 | 1.471 | 
| 29 | 2.200 | 1.136 | 
| 30 | 2.400 | 1.042 | 
| 31 | 2.100 | 1.190 | 
| 32 | 2.500 | 1.000 | 
| 33 | 2.500 | 1.000 | 
| 34 | 1.900 | 1.316 | 
| 35 | 1.800 | 1.389 | 
| 36 | 2.500 | 1.000 | 
| 37 | 1.600 | 1.562 | 
| 38 | 1.600 | 1.562 | 
| 39 | 2.200 | 1.136 | 
| 40 | 2.500 | 1.000 | 
| 41 | 2.200 | 1.136 | 
| 42 | 1.500 | 1.667 | 
| 43 | 3.100 | 0.806 | 
| 44 | 2.400 | 1.042 | 
| 45 | 2.500 | 1.000 | 
| 46 | 2.400 | 1.042 | 
| 47 | 2.500 | 1.000 | 
| 48 | 1.600 | 1.562 | 
| 49 | 3.300 | 0.758 | 
| 50 | 2.200 | 1.136 | 
| 51 | 2.600 | 0.962 | 
| 52 | 3.200 | 0.781 | 
| 53 | 2.400 | 1.042 | 
| 54 | 1.800 | 1.389 | 
| 55 | 3.000 | 0.833 | 
| 56 | 4.900 | 0.510 | 
| 57 | 1.800 | 1.389 | 
| 58 | 2.700 | 0.926 | 
| 59 | 3.100 | 0.806 | 
| 60 | 2.700 | 0.926 | 
| 61 | 3.600 | 0.694 | 
| 62 | 3.000 | 0.833 | 
| 63 | 2.300 | 1.087 | 
| 64 | 3.700 | 0.676 | 

Additional benchmark tables (not shown here): (two 8-core CPUs, 128 threads, 1200 MHz UltraSPARC T2 Plus, 64GB RAM) Solaris 10; (4 CPUs, 16 threads/CPU) GNU/Linux.