Parallelizing the Construction of
Static Single Assignment Form
Jeremy Singer
University of Glasgow, UK
jeremy.singer@dcs.gla.ac.uk
Martin Ward
De Montfort University, UK
martin@gkc.org.uk
ABSTRACT
Data flow analysis and code optimization are generally im-
plemented using sequential algorithms. Given the commodi-
tization of multi-core platforms, it should be possible to use
parallel algorithms instead. This paper examines how the
standard sequential algorithm for constructing static single
assignment form (SSA) may be parallelized. We demon-
strate how this parallel algorithm may be realized in an ex-
isting compiler infrastructure, using light-weight threading
mechanisms for Java. We provide a quantitative evalua-
tion of our parallel SSA construction pass. The maximum
speedups are around 4x when applied to real-world Java
programs on a quad-core hyperthreaded Intel Core i7.
1. INTRODUCTION
The multi-core / many-core architectural paradigm is now
well-established, not only for high performance computing,
but also for commodity end-user machines [10]. The soft-
ware industry recognises the need to adapt existing appli-
cations to make them multi-threaded, in order to take ad-
vantage of the underlying hardware parallelism. One such
application, which will be the focus of attention in this pa-
per, is the compiler. Over the past decade, the use of com-
piler technology has become ubiquitous, driven by the grow-
ing trend for just-in-time (JIT) compilation, i.e. client-side
compilation of bytecodes delivered over the internet.
At present, the compilation process is generally imple-
mented using sequential algorithms. In particular, this ob-
servation is true for the middle-end of compilers, dealing
with data flow analysis and optimization. These middle-end
algorithms are the key parts of a client-side JIT compiler like
Java HotSpot [13]. Parallelization is not entirely straight-
forward due to global data structures such as the symbol
table, the control flow graph, etc. In addition, until the
recent arrival of multi-core on client-side platforms, it was
not necessary to parallelize compilation on the client-side.
For high performance computing environments, paralleliza-
tion of the compilation process commonly takes place. For
instance, software for the Large Hadron Collider [1] is com-
piled in parallel. Common Unix tools for parallel compila-
tion include distcc and parallel make [2]. However these
parallelize on a coarse granularity. Generally each parallel
task sequentially compiles one or more object files or pro-
gram modules.
The challenge we address in this paper is to parallelize
standard client-side data flow analysis, using lightweight
threading constructs to take advantage of hardware paral-
lelism on commodity multi-core platforms.
The conventional program representation for aggressive
compiler optimization is static single assignment form (SSA).
This intermediate representation is used by client-side HotSpot
[13], LLVM [14], and GCC [22, 23], inter alia. Construction
of SSA involves non-trivial data flow analysis and can take a
significant proportion of overall analysis time. Therefore, it
seems that parallelization of the SSA construction algorithm
is an obvious first step, for a case study on the parallelization
of client-side compilation.
Most open-source optimizing compilers implement some
variant of the classical SSA construction algorithm, by Cytron
et al, which is entirely sequential [7]. The chief advantages
of this algorithm are: (a) it is easy to implement, (b) it
is fairly efficient. In fact, the algorithm is not O(N) with
respect to the size N of the input control flow graph; however,
Cytron et al claim it to be ‘linear in practice’, since the
worst-case O(N²) behaviour only occurs with deeply nested,
irreducible control flow structures. There is empirical evi-
dence that such constructs are rare in real-world programs
[12].
In this paper, we show how the classical SSA construction
algorithm can be readily parallelized at a fine granularity.
We present a working implementation in a widely used opti-
mization infrastructure (Soot), discuss the implementation
decisions we had to make, and evaluate its performance.
Of particular concern are the overheads of thread spawning,
and whether these are mitigated by the parallelism obtained
when processing standard Java benchmark programs.
There are two key contributions in this paper.
1. We present an algorithm for the parallel construction
of SSA. To the best of our knowledge this is the first
such parallel algorithm.
2. We give an evaluation of a realistic implementation
using Soot and the Java fork/join framework on two
commodity multi-core platforms.
2. ALGORITHM
Algorithms for the construction of SSA based on data flow
analysis of a control flow graph (following the style of Cytron
et al [7]) can be divided into two distinct phases.
1. Insertion of φ-functions at control flow merge points,
in order to satisfy the SSA property that each variable
use has a single reaching definition, which dominates
the use.
2. Renaming of local variables, in order to satisfy the SSA
property that each variable has a single definition in
the program text.
In this section, we discuss how to adapt the two phases
of the classical SSA construction algorithm to make them
amenable for parallelism.
In passing, we note that there are many other data flow
based SSA construction algorithms, e.g. [28, 3, 8]. These are
more efficient than the classical algorithm, although they
may be more complex to implement. However in general,
they have a similar structure to the classical algorithm, so it
should be possible to parallelize them using similar strategies
to those presented in this paper.
Certain SSA construction algorithms are not data flow
based, such as the syntax-directed algorithm by Brandis and
Mössenböck [6]. In such cases, the strategies given in this
paper would not be directly applicable. We do not consider
such algorithms further in this work.
2.1 Insertion of φ-functions
2.1.1 Commentary on Sequential Algorithm
The classical algorithm, shown in Algorithm 1, iterates
over each variable v from the original program, to determine
where to insert φ-functions for v. This is the outermost for
loop starting at line 2. The algorithm is based on iteratively
processing a work-list W of basic blocks, for each variable
v. Initially, we insert into work-list W all basic blocks con-
taining definition statements for v in the original program,
lines 3–6. One-by-one, each basic block B is removed from
W . We assume that dominance frontier information has
been pre-computed for each basic block in the control flow
graph. If basic block X contains a definition of v, then it
is necessary to insert a φ-function for v at the beginning
of all blocks in the dominance frontier of B, i.e. DF(X),
lines 9–11. Blocks in DF(X) have several incoming defini-
tions of v, and thus require a φ-function for v at the head
of each block to act as a data flow merge operator. Note
that we only insert a φ-function for v if none has been inserted
already. Therefore it is helpful to keep a bitvector record of
where we have inserted φ-functions for v, see line 10. Now,
if we do insert a φ-function, this is a fresh definition of v,
so the enclosing basic block must be inserted into work-list
W since it may trigger the insertion of further φ-functions
at its own dominance frontier, see line 13. Again, each basic
block should only be inserted into W once, so it is helpful
to keep another bitvector record of which basic blocks have
been inserted into W , see line 12.
The φ-function insertion algorithm is guaranteed to ter-
minate. There is a fixed number of variables in the original
program, so the outermost for loop is executed for a fixed
number of iterations. The control flow graph has a fixed
number of basic blocks N , thus the dominance frontier of
any basic block is bounded by N . Again, each basic block
can only be inserted into the work-list W once, so the num-
ber of insertions into W is bounded by N . Also, each itera-
tion of the inner while loop starting at line 7 removes one item
from W . So W will eventually become empty, which is the
termination condition of the while loop.
Algorithm 1 Classical φ-function insertion algorithm
1: W ← {}
2: for all v : variable names in original program do
3:   for all d : definition statements of variable v do
4:     let B be the basic block containing d
5:     W ← W ∪ {B}
6:   end for
7:   while W ≠ {} do
8:     remove a basic block X from W
9:     for all Y : block ∈ DF(X) do
10:      if Y does not contain a φ-function for v then
11:        add v ← φ(...) at start of Y
12:        if Y has not already been inserted into W then
13:          W ← W ∪ {Y}
14:        end if
15:      end if
16:    end for
17:  end while
18: end for
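As a concrete illustration, the per-variable worklist of Algorithm 1 might be coded in Java as follows. This is a minimal sketch: the BasicBlock interface, with its id(), dominanceFrontier() and insertPhiAtHead() members, is a hypothetical stand-in for whatever control flow graph representation the compiler provides, not an existing API.

import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;
import java.util.List;

// Hypothetical minimal CFG interface, standing in for a real compiler IR.
interface BasicBlock {
    int id();                               // dense block index, 0..numBlocks-1
    List<BasicBlock> dominanceFrontier();   // pre-computed DF(X)
    void insertPhiAtHead(String variable);  // add v <- phi(...) at the block entry
}

final class SequentialPhiInsertion {
    // Worklist phi-insertion for a single variable v (lines 3-17 of Algorithm 1).
    static void insertPhisFor(String v, List<BasicBlock> defBlocks, int numBlocks) {
        BitSet hasPhi = new BitSet(numBlocks);         // blocks already given a phi for v
        BitSet everOnWorklist = new BitSet(numBlocks); // blocks ever inserted into W
        Deque<BasicBlock> worklist = new ArrayDeque<BasicBlock>();
        for (BasicBlock b : defBlocks) {               // seed W with blocks defining v
            if (!everOnWorklist.get(b.id())) {
                everOnWorklist.set(b.id());
                worklist.add(b);
            }
        }
        while (!worklist.isEmpty()) {
            BasicBlock x = worklist.remove();
            for (BasicBlock y : x.dominanceFrontier()) {
                if (!hasPhi.get(y.id())) {
                    y.insertPhiAtHead(v);              // a phi is a fresh definition of v,
                    hasPhi.set(y.id());
                    if (!everOnWorklist.get(y.id())) { // so y may trigger further phis
                        everOnWorklist.set(y.id());
                        worklist.add(y);
                    }
                }
            }
        }
    }
}

The two bitvectors in this sketch correspond to the records kept at lines 10 and 12 of Algorithm 1.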
2.1.2 Parallelization Strategy
The presentation of φ-function insertion in Algorithm 1
follows the original description from Cytron et al [7]. As it
stands, this is a sequential algorithm. However, it is clear
to see that each variable v is processed individually in a
single iteration of the outermost for loop. Variables do not
interfere with one another, in terms of data flow analysis
for inserting φ-functions. Therefore it is straightforward to
transform the for loop at line 2 into a parallel doall loop,
where all iterations may be executed in parallel.
The work-list W and all local variables declared within
the loop body must be privatized for each parallel loop it-
eration. Apart from this, the only issue is that parallel loop
iterations may attempt to insert φ-functions for different
variables concurrently into the same basic block.
One solution is to introduce a semaphore associated with
each basic block B to serialize the φ-function insertions at
the head of B. However there is the likelihood of high con-
tention on these semaphores (due to φ-functions often being
located together).
A more scalable solution is to maintain a private queue
of (φ-function, basic block) pairs for each thread, denoting
which basic blocks require particular φ-functions. Thus line
11 of the algorithm changes to:
11: add new Pair (v ← φ(...), Y) to local queue
All these queued φ-functions can be inserted at the end
of the doall loop. Given an appropriate lookup table for
the local queues that can be indexed by basic block ids, the
lazy φ-insertion phase can be parallelized over basic blocks,
thus avoiding the earlier problem of concurrent updates to
individual basic blocks. Algorithm 2 shows this additional
code.
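To make the strategy concrete, the following Java sketch (using the fork/join framework described in Section 3) runs one task per variable in the first pass and one task per affected basic block in the second, lazy pass. The PhiIr interface is a hypothetical hook standing in for the compiler's own IR: phiBlocksFor() is assumed to run the Algorithm 1 worklist for a single variable, and insertPhi() to mutate a single block.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Hypothetical hooks into the compiler's IR.
interface PhiIr {
    List<Integer> phiBlocksFor(String variable);   // Algorithm 1 worklist, lines 3-17
    void insertPhi(int blockId, String variable);  // add v <- phi(...) at block head
}

final class ParallelPhiInsertion {

    // Pass 1 task: computes phi sites for one variable into a task-private queue.
    static final class PerVariableTask extends RecursiveAction {
        final String variable;
        final PhiIr ir;
        final List<Integer> queuedBlocks = new ArrayList<Integer>(); // private queue

        PerVariableTask(String variable, PhiIr ir) { this.variable = variable; this.ir = ir; }

        @Override protected void compute() {
            queuedBlocks.addAll(ir.phiBlocksFor(variable)); // no shared state touched
        }
    }

    static void run(final List<String> variables, final PhiIr ir, ForkJoinPool pool) {
        // Pass 1: doall over variables.
        final List<PerVariableTask> pass1 = new ArrayList<PerVariableTask>();
        for (String v : variables) pass1.add(new PerVariableTask(v, ir));
        pool.invoke(new RecursiveAction() {
            @Override protected void compute() { invokeAll(pass1); }
        });

        // Merge the private queues into a lookup table indexed by basic block id.
        final Map<Integer, List<String>> pendingByBlock = new HashMap<Integer, List<String>>();
        for (PerVariableTask t : pass1) {
            for (int blockId : t.queuedBlocks) {
                List<String> vs = pendingByBlock.get(blockId);
                if (vs == null) { vs = new ArrayList<String>(); pendingByBlock.put(blockId, vs); }
                vs.add(t.variable);
            }
        }

        // Pass 2: lazy insertion, doall over basic blocks (Algorithm 2).
        final List<RecursiveAction> pass2 = new ArrayList<RecursiveAction>();
        for (final Map.Entry<Integer, List<String>> e : pendingByBlock.entrySet()) {
            pass2.add(new RecursiveAction() {
                @Override protected void compute() {
                    for (String v : e.getValue()) ir.insertPhi(e.getKey(), v);
                }
            });
        }
        pool.invoke(new RecursiveAction() {
            @Override protected void compute() { invokeAll(pass2); }
        });
    }
}

In this sketch the task-private queues are merged into the per-block lookup table sequentially; that step is linear in the number of queued φ-functions and does not touch any basic block.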
2.2 Renaming Local Variables
Algorithm 2 Amendment for lazy φ-function insertion
19: doall B : basic blocks in control flow graph
20:   for all v : variable names in original program do
21:     if there is a queued φ-function for v at block B then
22:       insert φ-function at block B
23:     end if
24:   end for
25: end doall
2.2.1 Commentary on Sequential Algorithm
Once the φ-function insertion is complete, the second phase
is to rename all definitions and uses of local variables to
enforce the SSA property of having a single definition per
variable in the program text, while respecting the original
program’s data flow behaviour.
The presentation of the classical renaming algorithm by
Cytron et al [7] is phrased in terms of original variable names
and new variable names. At each control flow point in the
program, we must maintain a mapping between original vari-
able names and new variable names. This mapping changes
as variable definitions enter and exit scope.
The mapping is generated using an array of stacks S, in-
dexed by original variable name. An individual stack data
structure S(V ) contains new names corresponding to orig-
inal variable name V . A new variable name Vi is pushed
onto stack S(V ) at the ith definition of V encountered in
the original program text. At the same time, the new name
Vi replaces the old name V for that particular variable defi-
nition statement. Vi is popped off the stack when its corre-
sponding definition goes out of scope.
At each statement that uses V in the original program,
the new variable name Vi on top of S(V ) is substituted into
the statement in place of the original name V .
The CFG is processed according to a pre-order traversal
of its dominator tree. The set of basic blocks Children(X)
denotes the basic blocks whose immediate dominator is X,
i.e. their parent in the dominator tree is X.
The most complex part of the renaming algorithm involves
φ-function right-hand side operands. The jth operand on
the right-hand side of a φ-function for original variable V
must be renamed according to the new variable name Vi that
is in scope at the end of the jth predecessor basic block. Op-
erationally, a φ-function for V at the beginning of block B
is equivalent to n simple assignments to V at the end of the
n predecessor basic blocks of B. Thus it is most straightfor-
ward to rename φ-function right-hand operands during the
renaming of corresponding predecessor basic blocks. At each
basic block, we must look ahead to check for φ-functions in
successor basic blocks. The set of control flow successors for
basic block B is denoted as Succ(B).
Algorithm 3 gives the pseudo-code for the classical SSA
renaming algorithm due to Cytron et al.
2.2.2 Parallelization Strategy
The sequential renaming algorithm presented in Algorithm
3 cannot easily be parallelized using the same strategy as for
the φ-insertion algorithm, i.e. one thread per variable. Such
per-variable parallelism is not appropriate for the renaming
phase, since each thread would have to process each state-
ment, and would only perform a trivial amount of work at
that statement. Furthermore, many variables may not be in
scope for large parts of the program.
Algorithm 3 Classical SSA renaming algorithm
1: for all V : variables in original program do
2:   Count(V) ← 0
3:   S(V) ← EmptyStack
4: end for
5: call Search(EntryNode)
6:
7: procedure Search(X : BasicBlock)
8:   for all A : statements in X do
9:     if A is not a φ-function then
10:      for all u : variables used in A do
11:        replace use of u with the name on top of S(u) in A
12:      end for
13:    end if
14:    for all v : variables defined in A do
15:      i ← Count(v)
16:      replace definition of v with vi in A
17:      push vi onto S(v)
18:      Count(v) ← i + 1
19:    end for
20:  end for
21:  for all Y ∈ Succ(X) do
22:    let j be the index of the φ-function operands in Y that correspond to basic block X
23:    for all F : φ-function in Y do
24:      let V be the jth operand in F
25:      replace V with the name on top of S(V) at the jth operand in F
26:    end for
27:  end for
28:  for all Z ∈ Children(X) do
29:    call Search(Z)
30:  end for
31:  for all A : statements in X do
32:    for all Vi : variables defined in A do
33:      let V be the original variable corresponding to Vi
34:      pop S(V)
35:    end for
36:  end for
37: end procedure
Thus we adopt a different parallelization strategy for the
variable renaming phase. Note how the renaming algorithm
makes use of the recursive subroutine Search, which is de-
fined in Algorithm 3 from lines 7–37, and called recursively
at line 29. Successive recursive calls of Search perform a
pre-order traversal of the dominator tree of the control flow
graph. Each individual invocation of Search operates on an
individual region of the control flow graph.
Therefore, it is straightforward to parallelize this algo-
rithm based on the dominator tree. We begin by invoking
Search on the entry node for the control flow graph, then
we continue into all children blocks in the dominator tree in
parallel. Note that different definitions of the same original
variable V in sibling basic blocks M and N in the dominator
tree cannot interfere with one another at all. A definition of
V in M is not in scope in N , and is guaranteed not to come
into scope in any nodes reachable from both M and N due
to the previously inserted φ-functions.
This parallelism requires a modification to the data struc-
ture that keeps track of the current new variable name for
each original variable name. Whereas previously the stack
S(V ) for original variable V was a single, global data struc-
ture, now each invocation of the Search routine must main-
tain its own private version of S(V ). If V is not defined
in the current block, then S(V ) could be shared with the
dominating parent block and other siblings, using appropri-
ate shallow cloning techniques. However when V is defined
in the current block, the new name pushed onto S(V ) must
only be visible in this block, so S(V ) must be privatized to
this invocation of Search.
Therefore, when Search is invoked, at lines 5 and 29 in Al-
gorithm 3, we need to supply an array of stacks for the child
function, based on the current invocation’s array of stacks S.
This may be implemented as a direct copy, either a shallow
or deep clone.
There is an additional benefit from the privatization of
the variable renaming stacks: Now there is no need to pop
variable names off the stack at the end of each invocation
of Search, as in lines 31–35 of Algorithm 3. Instead, when
a Search invocation completes, its private stack goes out of
scope and the reference to this private stack does not escape
the subroutine.
With these modifications to the algorithm to support par-
allelism, there are no data conflicts between the concurrent
threads. They have private stacks for renaming. Each block
is processed by a single thread, with the exception of φ-
functions, where each right-hand operand may be processed
by a different thread. Thus the φ-function data structure
must be able to handle concurrent modifications to different
right-hand operands.
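For instance, a φ-function data structure that tolerates such concurrent operand updates might look like the following sketch (hypothetical, not Soot's Shimple representation): each predecessor index is written by exactly one renaming task, and an AtomicReferenceArray guarantees safe publication of the slots.

import java.util.concurrent.atomic.AtomicReferenceArray;

// Sketch: a phi-function whose right-hand operands may be renamed concurrently.
final class PhiFunction {
    private final String target;                         // the defined (left-hand) name
    private final AtomicReferenceArray<String> operands; // one slot per predecessor

    PhiFunction(String target, int numPredecessors) {
        this.target = target;
        this.operands = new AtomicReferenceArray<String>(numPredecessors);
    }

    // Called by the task that renames the j-th predecessor block of this phi's block.
    void renameOperand(int j, String newName) {
        operands.set(j, newName);
    }

    String target() { return target; }

    String operand(int j) { return operands.get(j); }
}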
The only global variable accessed by concurrent threads is
the Count array, where each individual element Count(V )
indicates the integer subscript to use for the next definition
of original variable V . We can deploy an atomic get/in-
crement operation to prevent multiple threads from being
allocated the same integer subscript for V . (In fact, it is
not necessary that new variable names should be formed by
appending a sequential integer subscript to the original vari-
able name. If making the Count array global causes prob-
lems due to contention, then we could append the block
number to the variable to get a unique name, rather than
using the global Count array.)
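A minimal sketch of such an atomic counter array in Java (assuming variables have been given dense integer ids) is:

import java.util.concurrent.atomic.AtomicIntegerArray;

// One counter per original variable; getAndIncrement is the atomic
// get/increment operation referred to above.
final class SubscriptAllocator {
    private final AtomicIntegerArray count;

    SubscriptAllocator(int numVariables) {
        count = new AtomicIntegerArray(numVariables); // all counts start at zero
    }

    // Returns the next free subscript for the given variable, without locking.
    int nextSubscript(int variableId) {
        return count.getAndIncrement(variableId);
    }
}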
This leads to the following modifications to Algorithm 3
to enable parallelization:
The Search subroutine declared at line 7 now takes two
parameters, X is a BasicBlock where we begin processing,
and S is the private array of stacks, indexed by variable
name, to maintain the mapping between old and new vari-
able names for this invocation.
Since we retain a global Count array, we ensure that its
elements are updated atomically (this has not become a bottleneck
in our experiments; if it did, we could instead make each subscript
unique within its basic block and append the basic block identifier,
i.e. variable_subscript_block) by removing line 18, and
modifying line 15 to:
15: i ← GetAndIncrementAtomically(Count(v))
The forall loop at line 28 becomes a doall loop as shown
in Algorithm 4, to enable parallel execution of the recursive
calls to Search. Each separate call gets its own private copy
of the array of stacks S. This could be implemented as a
shallow copy (pointer to parent’s stack) to avoid memory
bloat, or as a deep copy (entirely new array with copied
data) to avoid excessive pointer chasing.
Algorithm 4 Amendment for parallel recursion
28: doall Z ∈ Children(X)
29: call Search(Z, newCopyOf(S))
30: end doall
Finally, we remove lines 31–36, since private stacks are
effectively deallocated automatically when they go out of
scope when the invocation returns. Alternatively, if dynamic
manual memory management is used to handle the stacks, a
single call to free() at the end of the routine body suffices
to deallocate an invocation’s private stack.
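Putting these modifications together, the parallel Search might be expressed with the fork/join framework along the following lines. This is a sketch under stated assumptions: DomTreeBlock and its renameLocally() method are hypothetical stand-ins for the compiler's basic block type and for lines 8–27 of Algorithm 3, and the deep-copy variant of stack privatization is shown.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Hypothetical minimal view of a basic block: its dominator-tree children and a
// renameLocally() hook that performs lines 8-27 of Algorithm 3 on this block.
interface DomTreeBlock {
    List<DomTreeBlock> dominatorTreeChildren();
    void renameLocally(Map<String, Deque<String>> stacks);
}

// One task per dominator-tree node; each child receives a private copy of the stacks.
final class SearchTask extends RecursiveAction {
    private final DomTreeBlock block;
    private final Map<String, Deque<String>> stacks; // original name -> stack of new names

    SearchTask(DomTreeBlock block, Map<String, Deque<String>> stacks) {
        this.block = block;
        this.stacks = stacks;
    }

    @Override
    protected void compute() {
        block.renameLocally(stacks);                   // rename this block's statements
        final List<SearchTask> children = new ArrayList<SearchTask>();
        for (DomTreeBlock child : block.dominatorTreeChildren()) {
            children.add(new SearchTask(child, deepCopy(stacks))); // privatize S
        }
        invokeAll(children);                           // the doall of Algorithm 4
        // No explicit popping: this task's private stacks simply go out of scope here.
    }

    private static Map<String, Deque<String>> deepCopy(Map<String, Deque<String>> s) {
        Map<String, Deque<String>> copy = new HashMap<String, Deque<String>>();
        for (Map.Entry<String, Deque<String>> entry : s.entrySet()) {
            copy.put(entry.getKey(), new ArrayDeque<String>(entry.getValue()));
        }
        return copy;
    }

    static void renameAll(DomTreeBlock entryNode) {
        new ForkJoinPool().invoke(new SearchTask(entryNode, new HashMap<String, Deque<String>>()));
    }
}

A shallow-copy variant would instead share unmodified stacks with the parent and copy a stack only when the child pushes a new name onto it.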
3. IMPLEMENTATION
We chose to implement our parallel SSA construction tech-
niques using the Soot compiler framework [31, 32] v2.4.0.
Soot is an open-source, research-oriented static analysis toolkit
written in Java that operates on Java bytecode with a
wide range of data flow analysis (DFA) passes. Most of the
DFA takes place on a three-address intermediate representa-
tion called Jimple [33]. There is an SSA construction pass,
which transforms Jimple into SSA form (known as Shimple).
This SSA construction pass follows the classical algorithm of
Cytron et al [7] exactly. Thus it provides an ideal vehicle for
demonstrating and evaluating our parallelization strategies
outlined in Section 2.
We use the Java fork/join framework [15] to provide lightweight
threading mechanisms to take advantage of underlying hard-
ware parallelism available on multi-core platforms. The fork/join
classes will be included in Java 7, but since we are work-
ing with Java 6, we use the JSR166y library (available from
http://gee.cs.oswego.edu/dl/concurrency-interest/). This pro-
vides a ForkJoinPool of worker threads, which can execute
ForkJoinTask instances. Each individual worker thread has
a double-ended queue (deque) of tasks. It is assumed that
there are no data dependence conflicts between the tasks. If
a worker thread has an empty deque, then it can steal a
task from the end of another thread’s deque to maintain
load-balancing, à la Cilk [5].
In adapting the existing SSA construction algorithm for
Soot, we took a pragmatic approach. Many design decisions
(e.g. the control flow graph and basic block data structures)
could perhaps have been optimized for parallel access. How-
ever since these data structures cross-cut the entire analysis
framework, any changes would have far-reaching implica-
tions. Instead, we modify the existing codebase as little as
possible, only where absolutely necessary to support parallel
SSA construction.
This involves replacing some non-thread-safe objects, e.g.
HashMap, with thread-safe alternatives, e.g. ConcurrentHashMap.
We also had to introduce a limited number of Semaphore ob-
jects to guard multi-threaded access to data, e.g. the first
and last fields in Soot HashChain objects. The doall loops
from the parallelized versions of Algorithms 1 and 3 are im-
plemented as invokeAll() operations on ArrayLists of Re-
cursiveTask objects.
Most of the rewritten code is localized to the soot.shimple.internal
package. We intend to contribute a patch back to the Soot
maintainers. The whole code took around three man-weeks
of development effort.
4. EVALUATION
4.1 Benchmarks
We use standard Java benchmark programs from the Da-
Capo and Java Grande suites, to evaluate our parallel SSA
construction algorithm.
The DaCapo suite of Java benchmarks [4], version 9.12
(bach), consists of large, widely used, real-world open-source
applications: in fact, typical input for a client-side JIT com-
piler. The Java Grande suite of benchmarks [27] contains
five large-scale scientific applications. We use the sequen-
tial versions of these programs. We observe that the Java
Grande codebase is significantly smaller than DaCapo; how-
ever, it is representative of code from the scientific domain.
We execute each analysed application with a standard
workload (default for DaCapo and SizeA for Java Grande)
and record the classes that the JVM classloader accesses. We
ignore classes that do not belong to the benchmark distri-
bution, e.g. Java standard library classes. Now, given a list
of classes for each application, we run these classes through
the Soot analysis. Table 1 summarizes the applications we
select. For each benchmark, we report (i) the number of
methods that we will analyse in Soot, (ii) the arithmetic
mean of the bytecode instruction lengths of these selected
methods, and (iii) the bytecode instruction length of the
longest method.
4.2 Platform
Our commodity multi-core evaluation platform is described
in Table 2. We use a Java fork-join pool with twice the
number of worker threads as hardware threads in the sys-
tem. This parameter may be tuned; it has some effect on
performance, but we cannot explore tuning due to space re-
strictions.
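For illustration, a pool sized in this way can be constructed as in the following sketch (the class name is ours):

import java.util.concurrent.ForkJoinPool;

final class PoolSetup {
    // Twice as many fork/join worker threads as hardware threads, per the set-up above.
    static ForkJoinPool newAnalysisPool() {
        int hwThreads = Runtime.getRuntime().availableProcessors();
        return new ForkJoinPool(2 * hwThreads);
    }
}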
4.3 Experiments
We compare two different SSA construction techniques.
The sequential version is the default Shimple builder pass
supplied in Soot v2.4.0. The parallel version is our custom
pass that uses Java fork/join parallelism to implement the
parallel SSA construction algorithm outlined in Section 2
earlier in this paper.
Vendor                 Intel
Codename               Nehalem
Architecture           Core i7-920
Cores x SMT Contexts   4x2
Per-core L1 i/d        32KB/32KB
Per-core L2            256KB
Shared L3              8MB
Core freq              2.67GHz
RAM size               6GB
OS                     Linux 2.6.31 (64bit)
JVM (1.6)              14.0-b16
max JVM heap           4GB
# FJ threads           16
Table 2: Evaluation platform for experiments
For all tests, we measure execution times of compilation
phases using the Java library call System.nanoTime(). All
times are the arithmetic means of five measurements, which
have a low coefficient of variation. We reduce timing vari-
ance by using large fixed size JVM heaps. We compute
speedups as: mean sequential time / mean parallel time.
For the benchmark applications, we report the speedup in
overall SSA construction time, i.e. the sum of φ-function in-
sertion and variable renaming. This SSA construction time
is summed over all methods analysed, for each individual
benchmark.
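The measurement and speedup computation described above amount to the following sketch (the helper class is ours, for illustration only; each compilation phase is wrapped as a Runnable):

final class SpeedupMeasurement {
    // Mean wall-clock time of a phase over several runs, measured with System.nanoTime().
    static double meanMillis(Runnable phase, int runs) {
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            phase.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / (runs * 1e6);
    }

    // Speedup as defined above: mean sequential time / mean parallel time.
    static double speedup(double meanSequentialMillis, double meanParallelMillis) {
        return meanSequentialMillis / meanParallelMillis;
    }
}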
4.3.1 Comparison on Standard Benchmarks
Figure 1 shows speedups for parallel SSA construction on
the selected benchmark applications, on the Intel Core i7
platform. The speedup varies with the method size thresh-
old t. For a particular point on a benchmark’s speedup
curve, we apply the sequential SSA construction algorithm
to all methods with size < t, whereas we apply the parallel
SSA construction algorithm to all methods with size ≥ t.
If t is set too small, then almost all methods are analysed in-
discriminately using the parallel algorithm. For many small
methods, the overhead of parallelism outweighs any over-
lapped execution gains. In many cases, this causes an over-
all slowdown (speedup scores below 1 in the graph). On the
other hand, if t is set too large, then hardly any methods are
analysed in parallel, so the SSA construction is almost al-
ways sequential. In all cases, the curves tend to speedup = 1
as t→∞.
Parallel SSA construction is more beneficial for large meth-
ods. The benchmarks with the largest mean method sizes
in Table 1, namely fop and batik in DaCapo and euler and
moldyn in Java Grande, show the best speedups in Figure
1.
We have not investigated alternative heuristics for select-
ing the individual methods for which the parallel SSA
construction algorithm is to be preferred to the sequential
algorithm. Since the parallelism depends on (i) number of
variables, and (ii) dominance properties of the control flow
graph, method size seems like a simple proxy measure for
overall complexity. Other heuristics might include software
metrics such as cyclomatic complexity [20].
4.3.2 Comparison on Inlined Benchmarks
The reason why many benchmarks do not give significant
speedups with parallel SSA construction is that most meth-
suite        app         description                    # methods   mean length   max length
DaCapo       avrora      program simulator              2836        21.9          1083
DaCapo       batik       SVG image processing           7137        34.3          41033
DaCapo       fop         PDF generator                  6749        44.5          33089
DaCapo       jython      Python interpreter             20664       25.4          7846
DaCapo       luindex     text indexing                  1885        30.6          493
DaCapo       lusearch    text search                    1613        26.9          1187
DaCapo       pmd         static analysis                6477        33.6          2881
DaCapo       sunflow     raytracer                      1109        50.4          6308
DaCapo       xalan       XML parser                     6189        29.5          2881
Java Grande  euler       fluid dynamics                 27          295.81        1822
Java Grande  moldyn      molecular dynamics simulation  20          102.20        931
Java Grande  montecarlo  Monte Carlo simulation         178         17.65         211
Java Grande  raytracer   raytracer                      65          30.63         229
Java Grande  search      alpha-beta search              29          86.34         465
Table 1: Analysed Java applications from the DaCapo and Java Grande benchmark suites
[Figure 1 plots speedup (y-axis) against method length threshold (x-axis, 1 to 10000, log scale), one curve per benchmark: avrora, batik, fop, jython, luindex, lusearch, pmd, sunflow and xalan on the left; euler, moldyn, montecarlo, raytracer and search on the right.]
Figure 1: Speedups for parallel SSA construction on the DaCapo (l) and Java Grande (r) benchmark suites
[Figure 2 plots speedup (y-axis) against method length threshold (x-axis, 1 to 10000, log scale) for montecarlo, raytracer and search.]
Figure 2: Speedups for parallel SSA construction on
selected benchmarks with method inlining applied
ods are small, so the overhead of the parallel threads is not
mitigated by sufficient work for these threads to perform.
Method inlining is one optimization technique to reduce
this problem. The code of callee methods is reproduced at
corresponding call-sites in the caller method bodies. This
increases the overall size of the code, but it often reduces
execution time. Thus it is a valuable optimization for JIT
compilers [30] to be applied to the most frequently executed
regions of code.
We apply static method inlining in the Soot framework, to
rewrite the Java bytecode class files for some of our bench-
marks. The parameters used are: expansion-factor 20, max-
container-size 10000, max-inlinee-size 500. Table 3 shows
the new statistics for methods in each inlined benchmark,
and how this relates to the original benchmark code. A neg-
ative % change indicates a reduction relative to the original;
a positive % change indicates an expansion relative to the
original.
Now we compare the sequential versus the parallel imple-
mentation of SSA construction on these inlined benchmarks,
in the same way as before. Figure 2 shows the speedups at
various method thresholds. As before, at method thresh-
old t, all methods with size < t are transformed using the
sequential algorithm, whereas all methods with size ≥ t
are transformed using the parallel algorithm. The bench-
marks shown exhibit some speedups at method thresholds
around 500, whereas in the non-inlined versions (Figure 1)
no speedup was possible.
5. RELATED WORK
To the best of our knowledge, no-one has previously re-
ported on a parallelization strategy for the construction of
SSA.
However, given the sudden abundance of parallel threads
in commodity hardware, there has been significant recent in-
terest in adapting classical data flow analysis frameworks to
take advantage of light-weight threading. Knoop [11] argues
that reverse data flow analysis, or demand-driven data flow
analysis problems (i.e. multiple data flow queries at specific
program points) are embarrassingly parallel, and therefore
good candidates for deployment on multi-core systems. Ed-
vinsson and Lowe [9] motivate parallel data flow analysis
for rapid points-to analysis of an object-oriented program
as part of an integrated development environment. They
analyse the target methods of polymorphic calls in parallel.
They introduce heuristics to avoid spawning new analysis
threads if the thread management overhead is large in com-
parison with the amount of parallel execution. They evalu-
ate a Java implementation of their points-to analysis, on a
range of open-source Java applications. The average paral-
lel speedup is 1.78 with their best heuristic, on an eight-core
IA-32 system.
Rodriguez [26, 25] implements an interprocedural data
flow analysis algorithm using light-weight multithreading
based on the Actor model in Scala. Note that Scala Ac-
tors are implemented using the Java fork/join framework,
which we have used in our implementation work. Rodriguez
presents an object-oriented type analysis algorithm, and shows
how there is abundant potential parallelism when this anal-
ysis is applied to three DaCapo Java applications (up to
1000-way parallelism on an ideal machine).
Méndez-Lojo et al [21] present an Andersen-style points-to
analysis based on constraint graphs. The graphs can be ma-
nipulated in parallel with simple rewrite rules. Their imple-
mentation gives 3x speedup over the best sequential version
on an 8-core machine for substantial C benchmarks.
There is a body of older work on parallelism in data flow
analysis. For instance, Lee et al [18, 17] show how clas-
sical problems such as reaching definitions analysis can be
parallelized. They define three kinds of parallelism for data
flow problems. Our work falls into the third category: al-
gorithmic parallelism. They note that the granularity of
the problem decomposition is critical here. Such paralleliza-
tion is only effective for ‘large procedures or interprocedural
problems.’ Our results confirm that this observation still
holds. They give an empirical study, analysing reaching defi-
nitions in Fortran benchmark programs on a special-purpose
research hypercube-processor-interconnect machine. They
show speedups and some scalability for up to eight proces-
sors.
6. CONCLUSIONS
In this paper we have shown that the standard sequen-
tial algorithm for SSA construction may be parallelized. We
have implemented the parallel algorithm in an existing com-
piler infrastructure using light-weight threading mechanisms
for Java. The implementation has been tested on a rep-
resentative set of Java benchmark programs. For shorter
methods, the gains from parallel execution are outweighed by the over-
head of the fork/join thread library. However the parallel
speedups are significant for larger methods, suggesting that
a threshold size is required to select which methods should
be subject to parallel SSA construction.
We anticipate that commodity hardware with increasingly
large numbers of cores and threads will become common over
the next few years: the challenge for programmers is to make
the best use of these cores.
We have three reasons for believing that the size of SSA-
based representations is also set to grow:
1. Given the current trend towards more aggressive op-
timisation, including loop unrolling and method inlin-
ing, compilers will have large method bodies to analyse
app # methods (% change) mean length (% change) max length
montecarlo 179 (+1%) 20.82 (+18%) 231
raytracer 69 (+6%) 40.31 (+32%) 406
search 32 (+10%) 89.87 (+4%) 484
Table 3: Method statistics for selected benchmarks with inlining applied
even in cases where the program was originally written
as a collection of small functions.
2. Another trend is the increasing amount of source code
that is produced by automatic code generators. Such pro-
grams do not necessarily have the same properties as
human written code. Methods are often larger, and
loops deeper.
3. There is an increasing need for interprocedural data
flow analysis. In the context of this paper, the SSA
construction operates at a single-procedure level. How-
ever SSA does extend naturally to the whole-program
scope [19, 29]. Again, our parallel SSA construction
techniques would be applicable to such large interpro-
cedural control flow graphs.
Exploiting all the available parallel hardware for the anal-
ysis of such large SSA-based program representations is there-
fore of great importance in reducing compile times.
In overall terms, the largest single contributor to compi-
lation time is parsing. Researchers are working on paral-
lelizing the parsing process [24]. We anticipate that every
significant phase of next-generation compilers should be en-
gineered to take advantage of underlying hardware paral-
lelism, to achieve scalability. This includes the data flow
analysis phase, of which SSA construction is generally a key
initial step.
One might argue that it would be better to perform se-
quential (i.e. single-threaded) SSA-construction on multiple
methods in parallel, rather than parallelizing SSA construc-
tion for a single method.
However we note that in a JIT compiler or an interactive
compilation server [16], requests for optimizing compilation
of methods will occur sporadically, unpredictably and not
necessarily concurrently. Where possible, it would be good
to parallelize both individual method SSA construction, and
handle multiple methods concurrently. We have only tackled
the former problem in this paper. Also, in a JIT context, re-
sponse time is often more important than total performance:
so reducing the analysis time for the largest methods will im-
prove worst case response time, even if it does not improve
performance on the small methods (where performance is
already adequate).
7. REFERENCES
[1] Ashby, S., Eulisse, G., Schmid, S., Tuura, L.: Parallel
compilation of CMS software. In: Proc. Computing in
High Energy and Nuclear Physics Conference (CHEP)
(2004)
[2] Baalbergen, E.: Design and implementation of parallel
make. Computing Systems 1(2), 135–158 (1988)
[3] Bilardi, G., Pingali, K.: Algorithms for computing the
static single assignment form. Journal of the ACM
50(3), 375–425 (May 2003)
[4] Blackburn, S.M., Garner, R., Hoffman, C., Khan,
A.M., McKinley, K.S., Bentzur, R., Diwan, A.,
Feinberg, D., Frampton, D., Guyer, S.Z., Hirzel, M.,
Hosking, A., Jump, M., Lee, H., Moss, J.E.B.,
Phansalkar, A., Stefanović, D., VanDrunen, T., von
Dincklage, D., Wiedermann, B.: The DaCapo
benchmarks: Java benchmarking development and
analysis. In: OOPSLA ’06: Proceedings of the 21st
annual ACM SIGPLAN conference on
Object-Oriented Programing, Systems, Languages,
and Applications. pp. 169–190 (Oct 2006)
[5] Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C.,
Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: an
efficient multithreaded runtime system. In:
Proceedings of the fifth ACM SIGPLAN symposium
on Principles and practice of parallel programming.
pp. 207–216. ACM, New York, NY, USA (1995)
[6] Brandis, M.M., Mössenböck, H.: Single-pass
generation of static single-assignment form for
structured languages. ACM Transactions on
Programming Languages and Systems 16(6),
1684–1698 (Nov 1994)
[7] Cytron, R., Ferrante, J., Rosen, B.K., Wegman, M.N.,
Zadeck, F.K.: Efficiently computing static single
assignment form and the control dependence graph.
ACM Trans. Program. Lang. Syst. 13(4), 451–490
(1991)
[8] Das, D., Ramakrishna, U.: A practical and fast
iterative algorithm for φ-function computation using
DJ graphs. ACM Transactions on Programming
Languages and Systems 27(3), 426–440 (2005)
[9] Edvinsson, M., Lowe, W.: A multi-threaded approach
for data-flow analysis. In: Proceedings of the IPDPS
2010 Workshop on Multi-Threaded Architectures and
Applications (2010)
[10] Hill, M., Marty, M.: Amdahl’s law in the multicore
era. IEEE Computer 41(7), 33–38 (2008)
[11] Knoop, J.: Data-flow analysis for multi-core
computing systems: A reminder to reverse data-flow
analysis. In: Martin, F., Nielson, H.R., Riva, C.,
Schordan, M. (eds.) Scalable Program Analysis. No.
08161 in Dagstuhl Seminar Proceedings, Schloss
Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany,
Dagstuhl, Germany (2008), http:
//drops.dagstuhl.de/opus/volltexte/2008/1575
[12] Knuth, D.: An empirical study of FORTRAN
programs. Software: Practice and Experience 1(2),
105–133 (1971)
[13] Kotzmann, T., Wimmer, C., Mössenböck, H.,
Rodriguez, T., Russell, K., Cox, D.: Design of the
Java HotSpot client compiler for Java 6. ACM Trans.
Archit. Code Optim. 5(1), 1–32 (2008)
[14] Lattner, C., Adve, V.: LLVM: A compilation
framework for lifelong program analysis and
transformation. In: Code Generation and
Optimization, IEEE/ACM International Symposium
on. p. 75. IEEE Computer Society, Los Alamitos, CA,
USA (2004)
[15] Lea, D.: A Java fork/join framework. In: Proceedings
of the ACM 2000 conference on Java Grande. pp.
36–43. ACM, New York, NY, USA (2000)
[16] Lee, H.B., Diwan, A., Moss, J.E.B.: Design,
implementation, and evaluation of a compilation
server. ACM Transactions on Programming Languages
and Systems 29 (2007)
[17] Lee, Y.f., Marlowe, T.J., Ryder, B.G.: Performing
data flow analysis in parallel. In: Supercomputing ’90:
Proceedings of the 1990 ACM/IEEE conference on
Supercomputing. pp. 942–951 (1990)
[18] Lee, Y.F., Ryder, B.G.: A comprehensive approach to
parallel data flow analysis. In: Proceedings of the 6th
international conference on Supercomputing. pp.
236–247 (1992)
[19] Liao, S.W., Diwan, A., Bosch, Jr., R.P., Ghuloum, A.,
Lam, M.S.: SUIF Explorer: an interactive and
interprocedural parallelizer. In: Proceedings of the 7th
ACM SIGPLAN Symposium on Principles and
Practice of Parallel Programming. pp. 37–48 (1999)
[20] McCabe, T.: A complexity measure. IEEE
Transactions on Software Engineering 2, 308–320
(1976)
[21] Méndez-Lojo, M., Mathew, A., Pingali, K.: Parallel
inclusion-based points-to analysis. In: Proceedings of
the ACM international conference on Object Oriented
Programming Systems Languages and Applications.
pp. 428–443 (2010)
[22] Novillo, D.: Tree SSA: a new optimization
infrastructure for GCC. In: Proceedings of the 2003
GCC Developers’ Summit. pp. 181–193 (2003)
[23] Novillo, D.: Design and implementation of Tree SSA.
In: GCC Developers’ Summit (2004)
[24] Prabhu, P., Ramalingam, G., Vaswani, K.: Safe
programmable speculative parallelism. In: PLDI ’10:
Proceedings of the 2010 ACM SIGPLAN conference
on Programming language design and implementation.
pp. 50–61 (2010)
[25] Rodriguez, J., Lhoták, O.: Actor-based parallel
dataflow analysis. In: Proceedings of the International
Conference in Compiler Construction (2011), to
appear
[26] Rodriguez, J.D.: A concurrent IFDS dataflow analysis
algorithm using actors (2010),
http://hdl.handle.net/10012/5283
[27] Smith, L., Bull, J., Obdržálek, J.: A parallel Java
Grande benchmark suite. In: Proceedings of the
ACM/IEEE 2001 Conference on Supercomputing. pp.
1–10 (2001)
[28] Sreedhar, V.C., Gao, G.R.: A linear time algorithm
for placing φ-nodes. In: Proceedings of the 22nd ACM
SIGPLAN-SIGACT Symposium on Principles of
Programming Languages. pp. 62–73 (1995)
[29] Staiger, S., Vogel, G., Keul, S., Wiebe, E.:
Interprocedural Static Single Assignment Form. In:
Proceedings of the 14th Working Conference on
Reverse Engineering. pp. 1–10 (2007)
[30] Suganuma, T., Yasue, T., Nakatani, T.: An empirical
study of method inlining for a Java just-in-time
compiler. In: Proceedings of the Java Virtual Machine
Research and Technology Symposium (2002)
[31] Vallée-Rai, R., Co, P., Gagnon, E., Hendren, L., Lam,
P., Sundaresan, V.: Soot—a Java bytecode
optimization framework. In: Proceedings of the 1999
conference of the Centre for Advanced Studies on
Collaborative research. p. 13. IBM Press (1999)
[32] Vallée-Rai, R., Gagnon, E., Hendren, L., Lam, P.,
Pominville, P., Sundaresan, V.: Optimizing Java
bytecode using the Soot framework: Is it feasible? In:
Proceedings of the International Conference on
Compiler Construction. pp. 18–34. Springer (2000)
[33] Vallée-Rai, R., Hendren, L.: Jimple: Simplifying Java
bytecode for analyses and transformations. Tech. Rep.
SABLE-TR-1998-4, McGill University, School of
Computer Science (1998)