ANU College of Engineering and Computer Science - School of Computer Science

COMP8320 Laboratory 02 - week 2, 2011

OpenMP on Solaris


In this session, we will look at how OpenMP is implemented on SPARC/Solaris. It will also serve as a catch-up from the previous session.

In this session, you will encounter issues, some quite deep, relating to multicore computer programming and performance. These notes will ask you questions on these as they come up. It is good to think about them, but as time is limited, quickly check your understanding by asking the demonstrator, and then move on.

Logging in and Setting Up

Log in to wallaman. You should see a prompt like:
    u8914893@wallaman:~>
If not, your account may not be set up properly for the T2. See the setup instructions, or get help from your demonstrator. Go to the same directory that you used for Lab 01.

Complete Laboratory 01

If you have not done so, complete the previous week's lab.

OpenMP: how the atomic directive is implemented

Recall that we protected the update of the variable sum in dsum_omp_atomic.c by adding the line:
    #pragma omp atomic
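For reference, the protected update has roughly the following shape (a minimal sketch only; the names a, n and sum are illustrative and may differ from those actually used in dsum_omp_atomic.c):

    /* Illustrative sketch -- names are assumptions, not necessarily
       those used in dsum_omp_atomic.c. */
    double sum = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        sum += a[i];        /* each update of the shared sum is now atomic */
    }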
Check that the directive is still there and recompile:
    cc -fast -xopenmp -o dsum_omp_atomic dsum_omp_atomic.c
and verify that the program works correctly (albeit slowly). Compare with the original dsum_omp program:
    ./threadrun 32 ./dsum_omp 100000
    ./threadrun 32 ./dsum_omp_atomic 100000
If you ask the compiler to produce an assembly listing of its compilation:
    cc -fast -xopenmp -S dsum_omp_atomic.c
and then search for the keyword `atomic' in dsum_omp_atomic.s, you will see how the OpenMP atomic directive is implemented by the OpenMP runtime library (mt). What do you think the associated function calls do? (Recall the shared memory slides from the lectures.) The atomic directive is, however, efficient in cases where it is called less frequently.
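Conceptually, the calls you will find bracket the protected update, roughly as follows (the function names below are invented placeholders, purely for illustration; the real entry points are the ones you should identify in dsum_omp_atomic.s):

    /* Conceptual sketch only -- placeholder names, not the libmtsk API. */
    __mt_begin_atomic_();   /* enter the protected section */
    sum += a[i];            /* the guarded update          */
    __mt_end_atomic_();     /* leave the protected section */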

OpenMP: how loops are parallelized

Ask the compiler to produce an assembly listing of its compilation of dsum_omp:
    cc -fast -xopenmp -S dsum_omp.c
Load dsum_omp.s into an editor and go to the bottom of the file. Search backward for the faddd (Floating point ADD Double) instruction and look for a (heavily unrolled) loop (which also has a lot of ldd (LoaD Double) instructions). Note how aggressively the compiler is optimizing the loop. Scroll upwards, and you will see that this loop is actually part of a function (subroutine), called something like _$d1B30.main. Near the entry of the function, you will see that it calls a system function __mt_get_next_chunk...(), which instructs it on which iterations to work on.

Now locate the entry point of main() (near the top of the file). Search for consecutive call instructions. You will see a call to the master function __mt_MasterFunction_rtc_(); go past this till you find the second one. This is for the second loop; you will see (a number of instructions up) that the address of _$d1B30.main is being placed on the stack for this function to use.

So how does this work? The first call to the master function creates the threads and sets them to execute the function for the first parallel loop. The threads idle between this and the second call, which causes them to wake up and execute the function for the second loop.
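To make this concrete, the following self-contained sketch uses POSIX threads directly to show what the compiler has effectively done: the loop body is `outlined' into a function that repeatedly asks for a chunk of iterations, and the main program hands that function to the worker threads. All names here are illustrative; the real outlined function and runtime calls are the ones you found in the assembly listing.

    /* A pthreads analogue of OpenMP loop outlining (illustrative only;
       compile with something like: cc -o outline outline.c -lpthread). */
    #include <pthread.h>
    #include <stdio.h>

    #define N        1000000
    #define CHUNK    10000
    #define NTHREADS 4

    static double a[N];
    static double sum  = 0.0;
    static long   next = 0;                 /* next unclaimed iteration */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Hand out [lo,hi) iteration ranges until the loop is exhausted
       (plays the role of the __mt_get_next_chunk...() call). */
    static int get_next_chunk(long *lo, long *hi)
    {
        pthread_mutex_lock(&lock);
        *lo   = next;
        next += CHUNK;
        pthread_mutex_unlock(&lock);
        if (*lo >= N) return 0;
        *hi = (*lo + CHUNK < N) ? *lo + CHUNK : N;
        return 1;
    }

    /* The outlined loop body (plays the role of _$d1B30.main). */
    static void *outlined_body(void *arg)
    {
        long   lo, hi;
        double local = 0.0;
        (void)arg;
        while (get_next_chunk(&lo, &hi))
            for (long i = lo; i < hi; i++)
                local += a[i];
        pthread_mutex_lock(&lock);          /* combine the local sums */
        sum += local;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < N; i++) a[i] = 1.0;
        /* The "master function" role: create the threads and point
           them at the outlined function. */
        for (int t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, outlined_body, NULL);
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);
        printf("sum = %f (expected %d)\n", sum, N);
        return 0;
    }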

You can verify that the first call to the master function creates the threads, and determine the overhead of thread creation, by removing the first #pragma omp and seeing how that affects the execution time of the second loop.
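If dsum_omp.c does not already time each loop separately, timing code of the following form can be wrapped around each loop. omp_get_wtime() is a standard OpenMP intrinsic; the surrounding function is only an illustrative sketch, not code from dsum_omp.c.

    /* Sketch: timing a parallel loop with omp_get_wtime(); in practice,
       place the two timing calls around the loops in dsum_omp.c. */
    #include <omp.h>
    #include <stdio.h>

    double timed_sum(const double *a, long n)
    {
        double sum = 0.0;
        double t0  = omp_get_wtime();
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < n; i++)
            sum += a[i];
        double t1  = omp_get_wtime();
        printf("loop time: %.6f s\n", t1 - t0);
        return sum;
    }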

OpenMP: how reductions are implemented

We will now look at how reductions are implemented in OpenMP. Not only is this important in itself; the exercise will also uncover more features of OpenMP and issues in parallel programming. The file dsum_omp_psum.c is set up to implement reductions using a technique called partial sums. Inspect this file.

Instead of the single thread of execution in a normal program, when an OpenMP program executes, $OMP_NUM_THREADS threads get created. These are then activated whenever a parallel loop is executed. In this case, each thread is given a segment of the array to sum. Then, in a non-parallel loop, these partial sums are added together to get the total.

The program uses an array psums to do this. Two issues arise: how does the program determine the size of the array, and how do the threads index it? The former can be done by a call to the OpenMP intrinsic function omp_get_max_threads(). The latter can be done by calling the intrinsic omp_get_thread_num(), which returns a unique id for each thread. However, this can only be done in a (parallel) part of the program when all the threads are active!

This brings us to the concept of parallel regions. So far, a parallel region has been a simple loop, but we want each thread to get its thread id outside the loop. In C programs, a region can be defined by a code block ({ ... }). You will see such a code block around the call to omp_get_thread_num() and the subsequent for loop. Just above this block, insert the directive:

    #pragma omp parallel private(thr_id)
The variable thr_id stores the value from omp_get_thread_num(). It must be declared as a `private' variable in order to ensure that each thread gets a unique copy of it (if it were a normal (shared) variable, only a single value could be stored in it!). You could later try removing the private(thr_id) part and see what happens.

So far so good, but we have not actually instructed the compiler to parallelize the loop! To do so, insert the directive:

    #pragma omp for
just above the for loop. Compile the program using:
    cc -fast -xopenmp -o dsum_omp_psum dsum_omp_psum.c
Run the program in single threaded mode:
    export OMP_NUM_THREADS=1; ./dsum_omp_psum 1000000
and repeat for 2, 4 and 8 threads. Compare with dsum_omp; have we achieved the same performance?
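For reference, the pieces described above fit together roughly as follows. This is only a sketch: psums and thr_id follow the text, but the other names, and details such as how the input array is allocated, are illustrative and will differ from dsum_omp_psum.c.

    /* Sketch of the partial-sums reduction pattern (illustrative names). */
    #include <omp.h>
    #include <stdlib.h>

    double psum_reduce(const double *a, long n)
    {
        int     nthr  = omp_get_max_threads();   /* size of psums[]      */
        double *psums = calloc(nthr, sizeof(double));
        double  sum   = 0.0;
        int     thr_id;

        #pragma omp parallel private(thr_id)
        {
            thr_id = omp_get_thread_num();        /* unique id per thread */
            #pragma omp for
            for (long i = 0; i < n; i++)
                psums[thr_id] += a[i];            /* this thread's partial sum */
        }

        for (int t = 0; t < nthr; t++)            /* serial combining loop */
            sum += psums[t];
        free(psums);
        return sum;
    }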

Programming Exercise

For SMP systems (with CPUs on separate chips and cache coherency hardware between them), performance will be highly degraded unless we pad out the psums array so that only one element is used per cache line (typically 8 words). This phenomenon is called false cache line sharing. However, as the T2 is a multicore processor with its CPUs on a single chip, this makes little difference there.
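The padding amounts to spacing the per-thread elements a whole cache line apart, along the following lines (a fragment-style sketch only; with 8-byte doubles and a 64-byte level-2 cache line, a stride of 8 puts each thread's element on its own line):

    /* Padded layout: only every PAD-th element of psums[] is used,
       so each thread's element sits on its own 64-byte cache line. */
    #define PAD 8                          /* 8 doubles per cache line     */
    double *psums = calloc((size_t)nthr * PAD, sizeof(double));

    psums[thr_id * PAD] += a[i];           /* inside the parallel for loop */

    sum += psums[t * PAD];                 /* in the serial combining loop */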

As an exercise, verify this by copying dsum_omp_psum.c to a new file dsum_omp_psum_pad.c and `pad out' the psums[] array by a factor of 8 (i.e. make it 8 times larger, and only use every 8th element). Note that the (level 2) cache line size is 64 bytes, so every element that is used will be on a separate cache line. Compile and run this program and compare it with dsum_omp_psum.

Concluding Remarks

In this session, we have looked at how the relatively simple OpenMP model is implemented using a threaded programming model, in this case Solaris threads (closely related to POSIX pthreads). As a review, consider the following questions:
  • How is OpenMP implemented: how are loops parallelized, how is the atomic directive implemented, and how are reductions implemented?

The examples have been oriented to parallelizing simple loops. But the T2 is designed for commercial applications; how are they programmed to harness concurrency? Generally, threads are explicitly programmed, in, for example, Java. The programming is more complex, too complex to cover in a one-hour session, but the issues of data hazards, speedups, and shared and private data apply equally.

Extra Exercise: Atomic Operations on the SPARC

We suspect that, in the mt runtime library, the atomic directives are ultimately implemented in terms of (SPARC) atomic instructions, which are used to synchronize the VCPUs on the T2. You can investigate this. First, locate the mt shared library that dsum_omp_atomic uses:
    ldd dsum_omp_atomic
You will be able to guess that it must be /lib/libmtsk.so.1. Get a disassembly of this file:
    objdump -d /lib/libmtsk.so.1 > libmtsk.dis
Now search libmtsk.dis for b_atomic to locate the function called at the start of an OpenMP atomic section of code. You will notice that it is more complex than you would expect (ask your demonstrator if you are curious, but a hint is that Solaris uses adaptive spinlocks). You will also notice that it calls a function called atomic_store.

Go to the bottom of the file and search for atomic_store. You will see that it uses the swap instruction, which atomically swaps the contents of a register and a memory location (other SPARC atomic instructions are cas (compare and swap) and ldstub (load-store unsigned byte)). In the same area, you will see other functions that perform primitive synchronization operations, such as the atomic decrement of a memory location.

If you repeat this exercise for the function that is called when you end an atomic region (search for e_atomic), you will see that it similarly uses the atomic_store function.
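To see why an atomic swap is enough to build a lock, consider the following sketch of a simple test-and-set spinlock. It is written with GCC-style __sync builtins (which compile to atomic instructions such as swap or ldstub on SPARC) purely for illustration; it is not the libmtsk implementation, which, as noted above, uses adaptive spinlocks.

    /* Illustrative test-and-set spinlock built on an atomic swap. */
    static void spin_lock(volatile int *l)
    {
        while (__sync_lock_test_and_set(l, 1))   /* atomically swap in 1;        */
            while (*l)                           /* old value 1 => lock was held */
                ;                                /* spin; a real (adaptive) lock
                                                    would back off or sleep here */
    }

    static void spin_unlock(volatile int *l)
    {
        __sync_lock_release(l);                  /* store 0 to release the lock  */
    }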

Last modified: 3/08/2011, 11:27
