ANU College of Engineering and Computer Science, School of Computer Science
COMP8320 Laboratory 02 - week 2, 2011

OpenMP on Solaris

In this session, we will look at how OpenMP is implemented on SPARC/Solaris. It will also serve as a catch-up from the previous session. You will encounter issues, some quite deep, relating to multicore computer programming and performance. These notes will ask you questions on these as they come up. It is good to think about them, but as time is limited, quickly check your understanding by asking the demonstrator, and then move on.

Logging in and Setting Up

Log in to wallaman. You should see a prompt like:
Complete Laboratory 01

If you have not done so, complete the previous week's lab.

OpenMP: How the atomic directive is implemented

Recall that we protected the update of the variable sum in dsum_omp_atomic.c by adding the line:
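#pragma omp atomic

In context, the protected update looks roughly like the following sketch (the loop and variable names here are assumed for illustration rather than copied from dsum_omp_atomic.c):

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        #pragma omp atomic      /* serialize the shared update of sum */
        sum += a[i];
    }

You can then run it as before, for example across 32 threads: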
./threadrun 32 ./dsum_omp_atomic 100000
OpenMP: how loops are parallelized

Ask the compiler to produce an assembly listing of its compilation of dsum_omp:
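With the Sun Studio cc compiler, the -S option writes the generated assembly to dsum_omp.s. Reuse whatever flags dsum_omp was originally built with; the ones below are only indicative:

cc -xopenmp -xO3 -S dsum_omp.c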
Now locate the entry point of main() (near the top of the file). Search for consecutive call instructions. You will see a call to the master function __mt_MasterFunction_rtc_(); go past this till you find the second one. This is for the second loop; you will see (a number of instructions up) that the address of _$d1B30.main is being placed on the stack for this function to use.

So how does this work? The first call to the master function creates the threads and sets them to execute the function for the first parallel loop. The threads idle between this and the second call, which causes them to wake up and execute the function for the second loop. You can verify that the first call creates the threads, and determine the overhead of thread creation, by removing the first #pragma omp and seeing how that affects the execution time of the second loop.

OpenMP: how reductions are implemented

We will now look at how reductions are implemented in OpenMP. Not only is this important in itself, the exercise will uncover more features of OpenMP and issues in parallel programming.

The file dsum_omp_psum.c is set up to implement reductions using a technique called partial sums. Inspect this file. Instead of the single thread of execution in a normal program, when an OpenMP program executes, $OMP_NUM_THREADS threads get created. These are then activated whenever a parallel loop is executed. In this case, each thread is given a segment of the array to sum. Then, in a non-parallel loop, these partial sums are added together to get the total.

The program uses an array psums to do this. Two issues arise: how does the program determine the size of the array, and how do the threads index it? The former can be done by a call to the OpenMP intrinsic function omp_get_max_threads(). The latter can be done by calling the intrinsic omp_get_thread_num(), which returns a unique id for each thread. However, this can only be done in a (parallel) part of the program when all the threads are active!

This brings us to the concept of parallel regions. So far, a parallel region has been a simple loop, but here we want each thread to get its thread id outside the loop. In C programs, a region can be defined over a code block ({ ... }). You will see such a code block around the call to omp_get_thread_num() and the subsequent for loop. Just above this block, insert the directive:
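#pragma omp parallel

This is the directive that turns the block into a parallel region, shown here in its simplest form. A private clause may also be wanted, depending on how dsum_omp_psum.c declares its per-thread variables; variables declared inside the block are private to each thread automatically.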
So far so good, but we have not actually instructed the compiler to parallelize the loop! To do so, insert the directive:
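#pragma omp for

This is the work-sharing for construct, placed immediately above the for loop inside the region. Putting the two steps together, the partial-sums code has roughly the following shape. This is a sketch only, with assumed variable names, not the exact contents of dsum_omp_psum.c:

    #include <stdlib.h>
    #include <omp.h>

    double dsum_psum(const double *a, int n) {
        int nt = omp_get_max_threads();           /* size of the psums array */
        double *psums = calloc(nt, sizeof *psums);
        double sum = 0.0;
        int i, t;

        #pragma omp parallel                      /* threads become active here */
        {
            int id = omp_get_thread_num();        /* this thread's slot in psums[] */
            #pragma omp for                       /* share the loop iterations */
            for (i = 0; i < n; i++)
                psums[id] += a[i];                /* no race: one slot per thread */
        }

        for (t = 0; t < nt; t++)                  /* serial loop: combine partial sums */
            sum += psums[t];
        free(psums);
        return sum;
    }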
Programming Exercise

For SMP systems (with CPUs on separate chips and cache coherency hardware between them), performance will be highly degraded unless we pad out the psums array so that only one element is used per cache line (typically 8 words). The phenomenon is called false cache line sharing. However, because the T2 is a multicore processor with its CPUs on a single chip, this makes little difference there.

As an exercise, verify this by copying dsum_omp_psum.c to a new file dsum_omp_psum_pad.c and "pad out" the psums[] array by a factor of 8 (i.e. make it 8 times larger, and only use every 8th element). Note that the (level 2) cache line size is 64 bytes, so every element that is used will be on a separate cache line. Compile and run this program and compare it with dsum_omp_psum.

Concluding Remarks

In this session, we have looked at how the relatively simple OpenMP model is implemented using a threaded programming model, in this case Solaris threads (closely related to POSIX pthreads). As a review, consider the following questions:
The examples have been oriented to parallelizing simple loops. But the T2 is designed for commercial applications; how are they programmed to harness concurrency? Generally, threads are explicitly programmed, for example in Java. The programming is more complex, too complex to cover in a one-hour session, but the issues of data hazards, speedups, and shared and private data apply equally.

Extra Exercise: Atomic Operations on the SPARC

We have suspected that, in the mt runtime library, the atomic directives are ultimately implemented in terms of (SPARC) atomic instructions, which are used to synchronize the VCPUs on the T2. You can investigate this. First, locate where the mt shared library that dsum_omp_atomic uses is:
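One way is to list the shared libraries the executable links against using ldd; the OpenMP (mt) runtime shipped with Sun Studio is typically libmtsk.so, but check the actual output:

ldd ./dsum_omp_atomic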
If you repeat this exercise for the function that is called when you end an atomic region (search for e_atomic), you will see that it similarly uses the atomic_store function.

Last modified: 3/08/2011, 11:27