ANU College of Engineering and Computer Science, School of Computer Science
COMP8320 Laboratory 01 - Week 1, 2011
Introduction to the T2 Multicore Processor

This session provides an introduction to multicore programming on the UltraSPARC T2 using OpenMP and automatic parallelization techniques. The primary objective is to give you a feeling for the performance of a multicore computer and for the more straightforward programming paradigms available for it.

In this session, you will encounter issues, some quite deep, relating to multicore programming and performance. These notes will ask you questions about them as they come up. It is good to think about them, but as time is limited, quickly check your understanding by asking the demonstrator, and then move on.

Logging in and Setting Up

You will first need to customize your command-line environment for this course. To do this, simply add to your ~/.bashrc file the line:
Warning! The T2's /dept/dcs is a `copy' of /dept/dcs on the student system. It may be out-of-sync! Copy files from /dept/dcs/comp8320 on the student system.

Copy the files for the session into your home account area:
Automatic Parallelization: integer arithmetic-intensive programs

The program sum.c initializes an array of integers (its size determined by its command-line argument) and computes its sum. It uses 64-bit integers (long long int) in order to avoid overflow. The time taken to do this is also measured. Open it with a text editor (e.g. using the command emacs sum.c &) and inspect the code. Compile it using the command:
You will see that it did not parallelize the sum loop. This is because the compiler treats sum as an ordinary variable. Now try compiling it again, instructing the compiler to use special techniques (reductions) to parallelize such loops:
Repeat the above for 4 threads. Note that you can use the arrow keys to edit and re-execute previously typed commands.

Finally, comment out the line that prints the total in sum.c, and re-compile. Why has the compiler decided the loops are not `profitable'? Hints: try running the program; what do you observe? Or try re-compiling without -fast. This illustrates a potential pitfall when you are using an aggressively optimizing compiler. Restore the print statement and re-compile before continuing.

Automatic Parallelization: memory-intensive programs

Inspect the program dcopy.c. This program is similar to sum.c except that it copies the array into a second array (which is now double-precision floating point) instead of summing it. Compile it with the command:
Note that while the T2 nominally has 64 CPUs, only 56 have been reserved for the exported virtualized T2 (wallaman) on the student system (execute the command /usr/sbin/psrinfo to verify this). In fact only 7 of the 56 are real CPUs; using a technique called hardware threading, eight sets of registers share each real CPU (and its integer, floating point and load/store execution units). Hardware threading is thus extremely cheap; the question is, how close does it come to emulating the expensive alternative (56 real CPUs)?

Run the sum program for up to 32 threads; you can use:
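To run a program at several thread counts without retyping, a shell loop such as the following can be used (the binary name ./sum and the problem size are assumptions; adapt them to the actual files):

```shell
# Run ./sum with 1, 2, 4, 8, 16 and 32 OpenMP threads in turn.
for t in 1 2 4 8 16 32; do
    export OMP_NUM_THREADS=$t
    echo "=== $t threads ==="
    [ -x ./sum ] && ./sum 10000000   # skipped silently if ./sum is absent
done
```

Recall that OMP_NUM_THREADS also controls the thread count used by automatically parallelized loops under the Sun compilers.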
Automatic Parallelization: memory- and floating-point-intensive programs

Inspect the program dadd.c. This program is similar to dcopy.c except that it adds a multiple of one array to a second, storing the result into a third. Repeat the above exercise for dcopy.c with this program. How do the speedups compare now?

OpenMP: parallel loops and reductions

While automatic parallelization works well for many simple programs, there are situations where the programmer will need to specify the parallelism more explicitly. One of the simplest paradigms, with an established user community, is OpenMP. OpenMP uses directives, annotations to a normal C or Fortran program, which instruct the compiler how to parallelize the code. This enables a program to be parallelized incrementally, which is often a great advantage. Often (but not always!), these directives only affect the speed of the computation, and not its result.

Copy dsum.c (a double version of sum.c) into a new file dsum_omp.c (cp dsum.c dsum_omp.c). Just above the second for loop, add the OpenMP directive:
OpenMP: the issue of atomicity

Copy dsum_omp.c into a new file dsum_omp_atomic.c. In dsum_omp_atomic.c, remove the reduction(+:sum) part of the directive. Compile and re-run with 1, 2, 4 and 8 threads:
export OMP_NUM_THREADS=1; ./dsum_omp_atomic 1000000
export OMP_NUM_THREADS=2; ./dsum_omp_atomic 1000000
export OMP_NUM_THREADS=4; ./dsum_omp_atomic 1000000
export OMP_NUM_THREADS=8; ./dsum_omp_atomic 1000000

We will now look at an alternative way in OpenMP of correcting this problem. We can protect the update of the variable sum by adding the line:
Concluding Remarks

In this session, we have looked at relatively simple techniques to harness the power of multicore computing. In doing so, we have also encountered some non-trivial concepts and seen some pitfalls related to parallel programming. As a review, consider the following questions:
The examples have been oriented to parallelizing simple loops. But the T2 is designed for commercial applications; how are they programmed to harness concurrency? Generally, threads are explicitly programmed, in, for example, Java. The programming is more complex — too complex to cover in a one-hour session — but the issues of data hazards, speedups, and shared and private data apply equally.

Last modified: 31/08/2011, 16:07
Page authorised by: Head of Department, DCS
The Australian National University — CRICOS Provider Number 00120C