ANU College of Engineering and Computer Science, School of Computer Science

COMP8320 Laboratory 01 - week 1, 2011

Introduction to the T2 Multicore Processor


This session will provide an introduction to multicore programming on the UltraSPARC T2 using OpenMP and automatic parallelization techniques. The primary objective is to give you a feeling for the performance of a multicore computer and for the more straightforward programming paradigms available for it.

In this session, you will encounter issues, some quite deep, relating to multicore computer programming and performance. These notes will ask you questions on these as they come up. It's good to think about them, but as time is limited, quickly check your understanding by asking the demonstrator, and then move on.

Logging in and Setting Up

You will first need to customize your command line environment for this course. To do this, simply add to your ~/.bashrc file the line:
    source /dept/dcs/comp8320/login/Bashrc
Make sure the line is properly terminated (press the `Enter' key at the end of the line -- otherwise it won't work!). To ensure that this file always gets sourced when you log in, add to your ~/.profile file the line:
    source ~/.bashrc

Warning! The T2's /dept/dcs is a `copy' of /dept/dcs on the student system. It may be out-of-sync! Copy files from /dept/dcs/comp8320 on the student system.

Copy the files for the session into your home account area:

    cp -r /dept/dcs/comp8320/public/lab01/ .
To access the T2, simply type the command:
    ssh -Y wallaman
and cd to your lab01 sub-directory. The following editors are available on wallaman:
    xemacs, vi
Note that wallaman imports your home directory, so you can edit files on the CSIT labs workstations.

Automatic Parallelization: integer arithmetic-intensive programs

The program sum.c initializes an array of integers (its size determined by its command line argument) and computes its sum. It uses 64-bit integers (long long int) in order to avoid overflow. The time it takes to do this is also measured. Open it with a text editor (e.g. using the command emacs sum.c &) and inspect the code. Compile it using the command:
    cc -fast -xautopar -xloopinfo -o sum sum.c
This compiles the program under full optimization (-fast) and attempts to parallelize any loops (-xautopar), provided that doing so is both safe and worthwhile. It also reports on the parallelization of each loop (-xloopinfo).
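For reference, the core of sum.c is structured roughly as follows. This is only a sketch based on the description above; the variable names, the timing calls and the output format in the real file may well differ.

    /* Sketch only: a timed 64-bit integer summation, as described above. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    int main(int argc, char **argv)
    {
        long n = (argc > 1) ? atol(argv[1]) : 1000000;   /* array size from the command line */
        long long int *array = malloc(n * sizeof *array);
        long long int sum = 0;
        struct timeval t0, t1;
        long i;

        for (i = 0; i < n; i++)          /* initialization loop */
            array[i] = i;

        gettimeofday(&t0, NULL);
        for (i = 0; i < n; i++)          /* summation loop: a reduction on sum */
            sum += array[i];
        gettimeofday(&t1, NULL);

        printf("sum = %lld, time = %g s\n", sum,
               (t1.tv_sec - t0.tv_sec) + 1.0e-6 * (t1.tv_usec - t0.tv_usec));
        free(array);
        return 0;
    }

The summation loop is the one whose parallelization is discussed next.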

You will see that it did not parallelize the sum loop. This is because it treats sum as a normal variable. Now try compiling it again, instructing it to use special techniques (reductions) to parallelize such loops:

    cc -fast -xautopar -xloopinfo -xreduction -o sum sum.c
Run the program:
    ./sum 1000000
and note the time taken. This will be the serial execution time. To run this program using two threads, execute the command:
    export OMP_NUM_THREADS=2; ./sum 1000000
and note the time. The Solaris operating system will try to schedule the two threads on different CPUs (assuming they are available), and if so we now have a parallel program running! How close did this come to the ideal reduction in time (being halved)?

Repeat the above for 4 threads. Note that you can use the arrow keys to edit and re-execute previously typed commands.

Finally, comment out the line that prints out the total in sum.c, and re-compile. Why has the compiler decided the loops are not `profitable'? Hints: try running the program; what do you observe? Or try re-compiling without -fast. This illustrates a potential pitfall when you are using an aggressively optimizing compiler. Restore the print statement and re-compile before continuing.

Automatic Parallelization: memory-intensive programs

Inspect the program dcopy.c. This program is similar to sum.c except that it copies the array into a second array (which is now double precision floating point) instead of summing it. Compile it with the command:
    cc -fast -xautopar -xloopinfo -o dcopy dcopy.c
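The kernel of dcopy.c is presumably a loop along the following lines. This is a hypothetical sketch, assuming the source array is an integer array as in sum.c; check the actual file for the types and names it really uses.

    /* Hypothetical kernel of dcopy.c: copy into a double precision array.
       One load and one store per element, so the loop is memory-bound. */
    void copy_loop(long n, const long long int *a, double *b)
    {
        long i;
        for (i = 0; i < n; i++)
            b[i] = (double) a[i];
    }
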
Run the program with a single thread (serially):
    export OMP_NUM_THREADS=1; ./dcopy 1000000
Repeat the above for 2, 4, 8, 16 and 32 threads and observe the decrease in time (after 4 threads, run the program several times and look for the best time). Note that you can use the command:
    ./threadrun 32 ./dcopy 1000000
to do this for you. How close were these to the ideal speedup (a decrease in time proportional to the number of threads)?

Note that while the T2 nominally has 64 CPUs, only 56 have been reserved for the exported virtualized T2 (wallaman) on the student system (execute the command /usr/sbin/psrinfo to verify this). In fact only 7 of the 56 are real CPUs; using a technique called hardware threading, 8 sets of registers all share a real CPU (and its integer, floating point and load/store execution units). It is thus extremely cheap; the question is, how close does it come to emulating the expensive alternative (56 real CPUs)?

Run the sum program for up to 32 threads; you can use:

    ./threadrun 32 ./sum 1000000
How does its scalability compare to dcopy? Compile the double precision version of the sum program, dsum.c:
    cc -fast -xautopar -xloopinfo -xreduction -o dsum dsum.c
Run the dsum program for up to 32 threads. How does its scalability compare to sum?

Automatic Parallelization: a memory- and floating-point-intensive program

Inspect the program dadd.c. This program is similar to dcopy.c except that it adds a multiple of one array to a second, storing the result in a third. Repeat the exercise you did above for dcopy.c, this time with dadd. How do the speedups compare now?
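The corresponding kernel in dadd.c is presumably a daxpy-style loop along these lines (again a sketch with assumed names, not the exact file):

    /* Hypothetical kernel of dadd.c: c[i] = alpha*a[i] + b[i].
       Two loads, a multiply, an add and a store per element, so both the
       memory system and the floating point units are exercised. */
    void dadd_loop(long n, double alpha, const double *a, const double *b, double *c)
    {
        long i;
        for (i = 0; i < n; i++)
            c[i] = alpha * a[i] + b[i];
    }
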

OpenMP: parallel loops and reductions

While automatic parallelization works well for many simple programs, there are situations where the programmer will need to specify the parallelism more explicitly. One of the simplest paradigms, with an established user community, is OpenMP. OpenMP uses directives, annotations to a normal C or Fortran program, which instruct the compiler how to parallelize the code. This enables a program to be parallelized incrementally, which is often a great advantage. Often (but not always!), these directives only affect the speed of the computation, and not its result.

Copy dsum.c (a double precision version of sum.c) into a new file dsum_omp.c (cp dsum.c dsum_omp.c). Just above the second for loop, add the OpenMP directive:

    #pragma omp parallel for reduction(+:sum)
which instructs the compiler to parallelize the loop, applying the reduction technique to the variable sum. Also, just above the first loop, add the line:
    #pragma omp parallel for
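After both directives are in place, the two loops in dsum_omp.c should look something like the sketch below. The variable names here are assumed; use whatever names the file actually has.

    #include <stdlib.h>

    /* Sketch of the two annotated loops in dsum_omp.c (hypothetical names). */
    double parallel_dsum(long n)
    {
        double *array = malloc(n * sizeof *array);
        double sum = 0.0;
        long i;

    #pragma omp parallel for
        for (i = 0; i < n; i++)            /* initialization loop, run in parallel */
            array[i] = (double) i;

    #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < n; i++)            /* each thread accumulates a private partial sum, */
            sum += array[i];               /* combined automatically at the end of the loop */

        free(array);
        return sum;
    }
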
Now compile the program to be parallelized using OpenMP:
    cc -fast -xopenmp -o dsum_omp dsum_omp.c
Run the program in single threaded mode:
    export OMP_NUM_THREADS=1; ./dsum_omp 1000000
and repeat for 2, 4 and 8 threads. Compare the performance with the auto-parallelized programs. Which is better? Is this surprising?

OpenMP: the issue of atomicity

Copy dsum_omp.c into a new file dsum_omp_atomic.c. In dsum_omp_atomic.c, remove the reduction(+:sum) part of the directive. Compile and re-run with 1, 2, 4 and 8 threads.
    cc -fast -xopenmp -o dsum_omp_atomic dsum_omp_atomic.c
    export OMP_NUM_THREADS=1; ./dsum_omp_atomic 1000000
    export OMP_NUM_THREADS=2; ./dsum_omp_atomic 1000000
    export OMP_NUM_THREADS=4; ./dsum_omp_atomic 1000000
    export OMP_NUM_THREADS=8; ./dsum_omp_atomic 1000000
In particular, run with 8 threads several times and observe the variation in the output. For the reported sum, what do you observe about the reported value as opposed to the correct value (given by the single-threaded execution)? This phenomenon is called a data hazard or race hazard, and is a common pitfall in parallel programming.

We will now look at an alternate way in OpenMP of correcting this problem. We can protect the update of the variable sum by adding the line:

    #pragma omp atomic
just above it (inside the loop). This will force the instructions implementing the statement sum += array[i]; to be executed as if by a single instruction. Re-compile and re-run the program for up to 8 threads. You will observe that the correct result is reported, but what about the time!
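For reference, the modified loop in dsum_omp_atomic.c would then look roughly like this (variable names assumed as before):

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        #pragma omp atomic
        sum += array[i];        /* each update of the shared variable is performed atomically, */
    }                           /* so updates from different threads can no longer be lost */

Correctness is restored, but every iteration now serializes on the shared variable, which explains the time you observe.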

Concluding Remarks

In this session, we have looked at relatively simple techniques to harness the power of multicore computing. In doing so, we have also encountered some non-trivial concepts and seen some pitfalls related to parallel programming. As a review, consider the following questions:
  • How effectively does the T2 scale using separate cores (scaling from 1 to 4 threads)?
  • How effectively does the T2 scale using hardware threading (scaling from 8 to 32 threads)?
  • Is the above different for integer versus floating point? For memory-intensive versus less memory-intensive programs?
  • How effective is automatic parallelization for simple loops?
  • What causes a race hazard? Within a parallelized loop, is forcing an atomic update likely to be useful?

The examples have been oriented to parallelizing simple loops. But the T2 is designed for commercial applications; how are they programmed to harness concurrency? Generally, threads are explicitly programmed, in for example Java. The programming is more complex, too complex to cover in a one hour session, but the issues of data hazards, speedups, and shared and private data apply equally.

Last modified: 31/08/2011, 16:07
