Parallelizing the Construction of Static Single Assignment Form

Jeremy Singer, University of Glasgow, UK, jeremy.singer@dcs.gla.ac.uk
Martin Ward, De Montfort University, UK, martin@gkc.org.uk

ABSTRACT

Data flow analysis and code optimization are generally implemented using sequential algorithms. Given the commoditization of multi-core platforms, it should be possible to use parallel algorithms instead. This paper examines how the standard sequential algorithm for constructing static single assignment form (SSA) may be parallelized. We demonstrate how this parallel algorithm may be realized in an existing compiler infrastructure, using light-weight threading mechanisms for Java. We provide a quantitative evaluation of our parallel SSA construction pass. The maximum speedups are around 4x when applied to real-world Java programs on a quad-core hyperthreaded Intel Core i7.

1. INTRODUCTION

The multi-core / many-core architectural paradigm is now well-established, not only for high performance computing, but also for commodity end-user machines [10]. The software industry recognises the need to adapt existing applications to make them multi-threaded, in order to take advantage of the underlying hardware parallelism. One such application, which will be the focus of attention in this paper, is the compiler. Over the past decade, the use of compiler technology has become ubiquitous, driven by the growing trend for just-in-time (JIT) compilation, i.e. client-side compilation of bytecodes delivered over the internet.

At present, the compilation process is generally implemented using sequential algorithms. In particular, this observation is true for the middle-end of compilers, dealing with data flow analysis and optimization. These middle-end algorithms are the key parts of a client-side JIT compiler like Java HotSpot [13]. Parallelization is not entirely straightforward due to global data structures such as the symbol table, the control flow graph, etc. In addition, until the recent arrival of multi-core on client-side platforms, it was not necessary to parallelize compilation on the client side. For high performance computing environments, parallelization of the compilation process commonly takes place. For instance, software for the Large Hadron Collider [1] is compiled in parallel. Common Unix tools for parallel compilation include distcc and parallel make [2]. However, these parallelize at a coarse granularity: generally each parallel task sequentially compiles one or more object files or program modules.

The challenge we address in this paper is to parallelize standard client-side data flow analysis, using lightweight threading constructs to take advantage of hardware parallelism on commodity multi-core platforms. The conventional program representation for aggressive compiler optimization is static single assignment form (SSA). This intermediate representation is used by client-side HotSpot [13], LLVM [14], and GCC [22, 23], inter alia.
Construction of SSA involves non-trivial data flow analysis and can take a significant proportion of overall analysis time. Therefore, parallelization of the SSA construction algorithm is an obvious first step for a case study on the parallelization of client-side compilation.

Most open-source optimizing compilers implement some variant of the classical SSA construction algorithm by Cytron et al, which is entirely sequential [7]. The chief advantages of this algorithm are: (a) it is easy to implement, and (b) it is fairly efficient. In fact, the algorithm is not O(N) with respect to the size N of the input control flow graph; however Cytron et al claim it to be 'linear in practice', since the worst-case O(N²) behaviour only occurs with deeply nested, irreducible control flow structures. There is empirical evidence that such constructs are rare in real-world programs [12].

In this paper, we show how the classical SSA construction algorithm can be readily parallelized at a fine granularity. We present a working implementation in a widely used optimization infrastructure (Soot), discuss the implementation decisions we had to make, and evaluate its performance. Of particular concern are the overheads of thread spawning, and whether these are mitigated by the parallelism obtained when processing standard Java benchmark programs.

There are two key contributions in this paper.

1. We present an algorithm for the parallel construction of SSA. To the best of our knowledge this is the first such parallel algorithm.
2. We give an evaluation of a realistic implementation using Soot and the Java fork/join framework on two commodity multi-core platforms.

2. ALGORITHM

Algorithms for the construction of SSA based on data flow analysis of a control flow graph (following the style of Cytron et al [7]) can be divided into two distinct phases.

1. Insertion of φ-functions at control flow merge points, in order to satisfy the SSA property that each variable use has a single reaching definition, which dominates the use.
2. Renaming of local variables, in order to satisfy the SSA property that each variable has a single definition in the program text.

In this section, we discuss how to adapt the two phases of the classical SSA construction algorithm to make them amenable to parallelism. In passing, we note that there are many other data flow based SSA construction algorithms, e.g. [28, 3, 8]. These are more efficient than the classical algorithm, although they may be more complex to implement. However, in general they have a similar structure to the classical algorithm, so it should be possible to parallelize them using similar strategies to those presented in this paper. Certain SSA construction algorithms are not data flow based, such as the syntax-directed algorithm by Brandis and Mössenböck [6]. In such cases, the strategies given in this paper would not be directly applicable. We do not consider such algorithms further in this work.

2.1 Insertion of φ-functions

2.1.1 Commentary on Sequential Algorithm

The classical algorithm, shown in Algorithm 1, iterates over each variable v from the original program, to determine where to insert φ-functions for v. This is the outermost for loop starting at line 2. The algorithm is based on iteratively processing a work-list W of basic blocks, for each variable v. Initially, we insert into work-list W all basic blocks containing definition statements for v in the original program, lines 3–6. One-by-one, each basic block X is removed from W.
We assume that dominance frontier information has been pre-computed for each basic block in the control flow graph. If basic block X contains a definition of v, then it is necessary to insert a φ-function for v at the beginning of all blocks in the dominance frontier of X, i.e. DF(X), lines 9–11. Blocks in DF(X) have several incoming definitions of v, and thus require a φ-function for v at the head of each block to act as a data flow merge operator. Note that we only insert a φ-function for v if none has been inserted already. Therefore it is helpful to keep a bitvector record of where we have inserted φ-functions for v, see line 10. Now, if we do insert a φ-function, this is a fresh definition of v, so the enclosing basic block must be inserted into work-list W, since it may trigger the insertion of further φ-functions at its own dominance frontier, see line 13. Again, each basic block should only be inserted into W once, so it is helpful to keep another bitvector record of which basic blocks have been inserted into W, see line 12.

The φ-function insertion algorithm is guaranteed to terminate. There is a fixed number of variables in the original program, so the outermost for loop is executed for a fixed number of iterations. The control flow graph has a fixed number of basic blocks N, thus the dominance frontier of any basic block is bounded by N. Again, each basic block can only be inserted into the work-list W once, so the number of insertions into W is bounded by N. Also, each iteration of the inner while loop starting at line 7 removes one item from W. So W will eventually become empty, which is the termination condition of the while loop.

Algorithm 1 Classical φ-function insertion algorithm
1: W ← {}
2: for all v : variable names in original program do
3:   for all d : definition statements of variable v do
4:     let B be the basic block containing d
5:     W ← W ∪ {B}
6:   end for
7:   while W ≠ {} do
8:     remove a basic block X from W
9:     for all Y : block ∈ DF(X) do
10:       if Y does not contain a φ-function for v then
11:         add v ← φ(...) at start of Y
12:         if Y has not already been processed in W then
13:           W ← W ∪ {Y}
14:         end if
15:       end if
16:     end for
17:   end while
18: end for

2.1.2 Parallelization Strategy

The presentation of φ-function insertion in Algorithm 1 follows the original description from Cytron et al [7]. As it stands, this is a sequential algorithm. However, it is clear to see that each variable v is processed individually in a single iteration of the outermost for loop. Variables do not interfere with one another, in terms of data flow analysis for inserting φ-functions. Therefore it is straightforward to transform the for loop at line 2 into a parallel doall loop, where all iterations may be executed in parallel.

The work-list W and all local variables declared within the loop body must be privatized for each parallel loop iteration. Apart from this, the only issue is that parallel loop iterations may attempt to insert φ-functions for different variables concurrently into the same basic block. One solution is to introduce a semaphore associated with each basic block B to serialize the φ-function insertions at the head of B. However, there is the likelihood of high contention on these semaphores (due to φ-functions often being located together). A more scalable solution is to maintain a private queue of (φ-function, basic block) pairs for each thread, denoting which basic blocks require particular φ-functions. Thus line 11 of the algorithm changes to:

11: add new Pair (v ← φ(...), Y) to local queue

All these queued φ-functions can be inserted at the end of the doall loop. Given an appropriate lookup table for the local queues that can be indexed by basic block ids, the lazy φ-insertion phase can be parallelized over basic blocks, thus avoiding the earlier problem of concurrent updates to individual basic blocks. Algorithm 2 shows this additional code.

Algorithm 2 Amendment for lazy φ-function insertion
19: doall B : basic blocks in control flow graph
20:   for all v : variable names in original program do
21:     if there is a queued φ-function for v at block B then
22:       insert φ-function at block B
23:     end if
24:   end for
25: end doall
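To make the shape of the two parallel phases concrete, the following Java sketch is our own illustration rather than the paper's code. The BasicBlock and Variable classes and the df map are hypothetical stand-ins; Java 8 parallel streams are used in place of the JSR166y fork/join tasks targeted by the actual implementation, and a single concurrent map keyed by basic block stands in for the per-thread queues plus the block-indexed lookup table described above.

  import java.util.*;
  import java.util.concurrent.*;

  class ParallelPhiInsertion {
      // Hypothetical program model: blocks, variables, precomputed dominance frontiers.
      static class Variable { final String name; Variable(String n) { name = n; } }
      static class BasicBlock {
          final int id;
          final Set<Variable> defs = new HashSet<>();      // variables defined in this block
          final List<Variable> phis = new ArrayList<>();   // phi-functions inserted at the head
          BasicBlock(int id) { this.id = id; }
      }

      final List<BasicBlock> blocks;                       // all blocks in the CFG
      final Map<BasicBlock, Set<BasicBlock>> df;           // precomputed dominance frontiers

      ParallelPhiInsertion(List<BasicBlock> blocks, Map<BasicBlock, Set<BasicBlock>> df) {
          this.blocks = blocks;
          this.df = df;
      }

      void insertPhis(Collection<Variable> variables) {
          // Phase 1 (doall over variables): each iteration privatizes its work-list and
          // queues required phi-functions instead of mutating blocks directly.
          ConcurrentMap<BasicBlock, Set<Variable>> pending = new ConcurrentHashMap<>();
          variables.parallelStream().forEach(v -> {
              Deque<BasicBlock> worklist = new ArrayDeque<>();  // private work-list W
              Set<BasicBlock> hasPhi = new HashSet<>();         // blocks already given a phi for v
              Set<BasicBlock> everOnWorklist = new HashSet<>(); // blocks ever placed on W
              for (BasicBlock b : blocks)
                  if (b.defs.contains(v)) { worklist.add(b); everOnWorklist.add(b); }
              while (!worklist.isEmpty()) {
                  BasicBlock x = worklist.remove();
                  for (BasicBlock y : df.getOrDefault(x, Collections.emptySet())) {
                      if (hasPhi.add(y)) {
                          // queue the phi rather than inserting it into y immediately
                          pending.computeIfAbsent(y, k -> ConcurrentHashMap.newKeySet()).add(v);
                          if (everOnWorklist.add(y)) worklist.add(y);
                      }
                  }
              }
          });
          // Phase 2 (doall over blocks): lazily materialize the queued phi-functions.
          blocks.parallelStream().forEach(b ->
              pending.getOrDefault(b, Collections.emptySet()).forEach(b.phis::add));
      }
  }

Because phase 1 only reads the CFG and writes into the concurrent pending map, and phase 2 partitions the work by block, no two tasks ever mutate the same basic block, which is exactly the property the lazy insertion scheme is designed to guarantee.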
2.2 Renaming Local Variables

2.2.1 Commentary on Sequential Algorithm

Once the φ-function insertion is complete, the second phase is to rename all definitions and uses of local variables to enforce the SSA property of having a single definition per variable in the program text, while respecting the original program's data flow behaviour.

The presentation of the classical renaming algorithm by Cytron et al [7] is phrased in terms of original variable names and new variable names. At each control flow point in the program, we must maintain a mapping between original variable names and new variable names. This mapping changes as variable definitions enter and exit scope.

The mapping is generated using an array of stacks S, indexed by original variable name. An individual stack data structure S(V) contains new names corresponding to original variable name V. A new variable name Vi is pushed onto stack S(V) at the ith definition of V encountered in the original program text. At the same time, the new name Vi replaces the old name V for that particular variable definition statement. Vi is popped off the stack when its corresponding definition goes out of scope. At each statement that uses V in the original program, the new variable name Vi on top of S(V) is substituted into the statement in place of the original name V.

The CFG is processed according to a pre-order traversal of its dominator tree. The set of basic blocks Children(X) denotes the basic blocks whose immediate dominator is X, i.e. their parent in the dominator tree is X.

The most complex part of the renaming algorithm involves φ-function right-hand side operands. The jth operand on the right-hand side of a φ-function for original variable V must be renamed according to the new variable name Vi that is in scope at the end of the jth predecessor basic block. Operationally, a φ-function for V at the beginning of block B is equivalent to n simple assignments to V at the end of the n predecessor basic blocks of B. Thus it is most straightforward to rename φ-function right-hand operands during the renaming of the corresponding predecessor basic blocks. At each basic block, we must look ahead to check for φ-functions in successor basic blocks. The set of control flow successors for basic block B is denoted as Succ(B). Algorithm 3 gives the pseudo-code for the classical SSA renaming algorithm due to Cytron et al.

Algorithm 3 Classical SSA renaming algorithm
1: for all V : variables in original program do
2:   Count(V) ← 0
3:   S(V) ← EmptyStack
4: end for
5: call Search(EntryNode)
6:
7: procedure Search(X : BasicBlock)
8:   for all A : statement in X do
9:     if A is not a φ-function then
10:       for all u : variables used in A do
11:         replace use of u with S(u) in A
12:       end for
13:     end if
14:     for all v : variables defined in A do
15:       i ← Count(v)
16:       replace definition of v with vi in A
17:       push vi onto S(v)
18:       Count(v) ← i + 1
19:     end for
20:   end for
21:   for all Y ∈ Succ(X) do
22:     let j be the index of the φ-function operands in Y that correspond to basic block X
23:     for all F : φ-function in Y do
24:       let V be the jth operand in F
25:       replace V with S(V) at the jth operand in F
26:     end for
27:   end for
28:   for all Z ∈ Children(X) do
29:     call Search(Z)
30:   end for
31:   for all A : statements in X do
32:     for all Vi : variables defined in A do
33:       let V be the original variable corresponding to Vi
34:       pop S(V)
35:     end for
36:   end for
37: end procedure

2.2.2 Parallelization Strategy

The sequential renaming algorithm presented in Algorithm 3 cannot easily be parallelized using the same strategy as for the φ-insertion algorithm, i.e. one thread per variable. Such per-variable parallelism is not appropriate for the renaming phase, since each thread would have to process each statement, and would only perform a trivial amount of work at that statement.
Furthermore, many variables may not be in scope for large parts of the program.

Thus we adopt a different parallelization strategy for the variable renaming phase. Note how the renaming algorithm makes use of the recursive subroutine Search, which is defined in Algorithm 3 at lines 7–37 and called recursively at line 29. Successive recursive calls of Search perform a pre-order traversal of the dominator tree of the control flow graph. Each individual invocation of Search operates on an individual region of the control flow graph.

Therefore, it is straightforward to parallelize this algorithm based on the dominator tree. We begin by invoking Search on the entry node of the control flow graph, then we continue into all children blocks in the dominator tree in parallel. Note that different definitions of the same original variable V in sibling basic blocks M and N in the dominator tree cannot interfere with one another at all. A definition of V in M is not in scope in N, and is guaranteed not to come into scope in any nodes reachable from both M and N, due to the previously inserted φ-functions.

This parallelism requires a modification to the data structure that keeps track of the current new variable name for each original variable name. Whereas previously the stack S(V) for original variable V was a single, global data structure, now each invocation of the Search routine must maintain its own private version of S(V). If V is not defined in the current block, then S(V) could be shared with the dominating parent block and other siblings, using appropriate shallow cloning techniques. However, when V is defined in the current block, the new name pushed onto S(V) must only be visible in this block, so S(V) must be privatized to this invocation of Search. Therefore, when Search is invoked, at lines 5 and 29 in Algorithm 3, we need to supply an array of stacks for the child invocation, based on the current invocation's array of stacks S. This may be implemented as a direct copy, either a shallow or deep clone.

There is an additional benefit from the privatization of the variable renaming stacks: now there is no need to pop variable names off the stack at the end of each invocation of Search, as in lines 31–35 of Algorithm 3. Instead, when a Search invocation completes, its private stack goes out of scope and the reference to this private stack does not escape the subroutine.
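As an illustration of what handing a child invocation its own view of the renaming stacks might look like, the following sketch shows a deep-copy helper in Java. The RenameStacks class and its method names are our own hypothetical stand-ins, not types from the paper's implementation, and a map of deques takes the place of the "array of stacks" S.

  import java.util.*;

  final class RenameStacks {
      // One stack of new names per original variable name (the array of stacks S).
      private final Map<String, Deque<String>> stacks = new HashMap<>();

      // Deep clone: an entirely new map with copied stacks. The child can push
      // freely without affecting its parent or siblings, at the cost of extra
      // memory per invocation.
      RenameStacks deepCopy() {
          RenameStacks copy = new RenameStacks();
          for (Map.Entry<String, Deque<String>> e : stacks.entrySet())
              copy.stacks.put(e.getKey(), new ArrayDeque<>(e.getValue()));
          return copy;
      }

      // Current new name in scope for an original variable (top of S(V)).
      String top(String var) {
          Deque<String> s = stacks.get(var);
          return (s == null || s.isEmpty()) ? var : s.peek();
      }

      // Record a fresh definition of an original variable (push onto S(V)).
      void push(String var, String newName) {
          stacks.computeIfAbsent(var, k -> new ArrayDeque<>()).push(newName);
      }
  }

The shallow alternative mentioned in the text would instead keep an immutable reference to the parent's map and allocate a fresh stack only for variables actually redefined in the current region, trading the copying cost for extra pointer chasing at lookup time.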
With these modifications to the algorithm to support parallelism, there are no data conflicts between the concurrent threads. They have private stacks for renaming. Each block is processed by a single thread, with the exception of φ-functions, where each right-hand operand may be processed by a different thread. Thus the φ-function data structure must be able to handle concurrent modifications to different right-hand operands.

The only global variable accessed by concurrent threads is the Count array, where each individual element Count(V) indicates the integer subscript to use for the next definition of original variable V. We can deploy an atomic get/increment operation to prevent multiple threads from being allocated the same integer subscript for V. (In fact, it is not necessary that new variable names should be formed by appending a sequential integer subscript to the original variable name. If making the Count array global causes problems due to contention, then we could append the block number to the variable to get a unique name, rather than using the global Count array.)

This leads to the following modifications to Algorithm 3 to enable parallelization. The Search subroutine declared at line 7 now takes two parameters: X, the BasicBlock where we begin processing, and S, the private array of stacks, indexed by variable name, which maintains the mapping between old and new variable names for this invocation. Since we retain a global Count array, we ensure that its elements are updated atomically by removing line 18 and modifying line 15 to:

15: i ← GetAndIncrementAtomically(Count(V))

(In our implementation this atomic update has not become a bottleneck. If it should, we could avoid synchronization by making each subscript unique within its basic block, then appending the basic block identifier onto the subscript, i.e. variable_subscript_block.)

The for all loop at line 28 becomes a doall loop as shown in Algorithm 4, to enable parallel execution of the recursive calls to Search. Each separate call gets its own private copy of the array of stacks S. This could be implemented as a shallow copy (a pointer to the parent's stack) to avoid memory bloat, or as a deep copy (an entirely new array with copied data) to avoid excessive pointer chasing.

Algorithm 4 Amendment for parallel recursion
28: doall Z ∈ Children(X)
29:   call Search(Z, newCopyOf(S))
30: end doall

Finally, we remove lines 31–36, since private stacks are effectively deallocated automatically when they go out of scope as the invocation returns. Alternatively, if dynamic manual memory management is used to handle the stacks, a single call to free() at the end of the routine body suffices to deallocate an invocation's private stack.
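The following Java sketch puts the pieces of the parallel renaming phase together using the fork/join framework. It is our illustration rather than the paper's code: the Block class and the renameStatements/renamePhiOperands hooks are hypothetical placeholders, while the per-invocation deep copy of the stacks, the atomic Count map and the parallel recursion over dominator-tree children follow the modifications described above (Algorithm 4).

  import java.util.*;
  import java.util.concurrent.*;
  import java.util.concurrent.atomic.AtomicInteger;

  class ParallelRename {
      static class Block {
          final List<Block> domChildren = new ArrayList<>();  // Children(X) in the dominator tree
          final List<Block> succs = new ArrayList<>();        // Succ(X) in the CFG
      }

      // Global Count array: atomic get-and-increment prevents duplicate subscripts.
      final ConcurrentMap<String, AtomicInteger> count = new ConcurrentHashMap<>();

      int nextSubscript(String var) {
          return count.computeIfAbsent(var, k -> new AtomicInteger()).getAndIncrement();
      }

      class SearchTask extends RecursiveAction {
          final Block block;
          final Map<String, Deque<String>> stacks;   // private copy of S for this invocation

          SearchTask(Block block, Map<String, Deque<String>> stacks) {
              this.block = block;
              this.stacks = stacks;
          }

          @Override
          protected void compute() {
              renameStatements(block, stacks);     // lines 8-20 of Algorithm 3 (placeholder)
              renamePhiOperands(block, stacks);    // lines 21-27 of Algorithm 3 (placeholder)
              // doall over dominator-tree children: each child gets its own copy of the stacks.
              List<SearchTask> children = new ArrayList<>();
              for (Block child : block.domChildren)
                  children.add(new SearchTask(child, deepCopy(stacks)));
              invokeAll(children);
              // No explicit pops: this invocation's private stacks simply go out of scope here.
          }
      }

      static Map<String, Deque<String>> deepCopy(Map<String, Deque<String>> stacks) {
          Map<String, Deque<String>> copy = new HashMap<>();
          stacks.forEach((var, stack) -> copy.put(var, new ArrayDeque<>(stack)));
          return copy;
      }

      // Placeholder rewriting hooks: would consult stacks for uses and call
      // nextSubscript(var) for fresh definitions, as in Algorithm 3.
      void renameStatements(Block b, Map<String, Deque<String>> stacks) { }
      void renamePhiOperands(Block b, Map<String, Deque<String>> stacks) { }

      void run(Block entry) {
          new ForkJoinPool().invoke(new SearchTask(entry, new HashMap<>()));
      }
  }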
3. IMPLEMENTATION

We chose to implement our parallel SSA construction techniques using the Soot compiler framework [31, 32], v2.4.0. Soot is an open-source, research-oriented static analysis toolkit written in Java that operates on Java bytecode, with a wide range of data flow analysis (DFA) passes. Most of the DFA takes place on a three-address intermediate representation called Jimple [33]. There is an SSA construction pass which transforms the Jimple into SSA (known as Shimple). This SSA construction pass follows the classical algorithm of Cytron et al [7] exactly. Thus it provides an ideal vehicle for demonstrating and evaluating the parallelization strategies outlined in Section 2.

We use the Java fork/join framework [15] to provide lightweight threading mechanisms that take advantage of the underlying hardware parallelism available on multi-core platforms. The fork/join classes will be included in Java 7, but since we are working with Java 6, we use the JSR166y library (http://gee.cs.oswego.edu/dl/concurrency-interest/). This provides a ForkJoinPool of worker threads, which can execute ForkJoinTask instances. Each individual worker thread has a double-ended queue (deque) of tasks. It is assumed that there are no data dependence conflicts between the tasks. If a worker thread has an empty deque, then it can steal a task from the end of another thread's deque to maintain load balancing, à la Cilk [5].

In adapting the existing SSA construction algorithm for Soot, we took a pragmatic approach. Many design decisions (e.g. the control flow graph and basic block data structures) could perhaps have been optimized for parallel access. However, since these data structures cross-cut the entire analysis framework, any changes would have far-reaching implications. Instead, we modify the existing codebase as little as possible, only where absolutely necessary to support parallel SSA construction. This involves replacing some non-thread-safe objects, e.g. HashMap, with thread-safe alternatives, e.g. ConcurrentHashMap. We also had to introduce a limited number of Semaphore objects to guard multi-threaded access to data, e.g. the first and last fields in Soot HashChain objects. The doall loops from the parallelized versions of Algorithms 1 and 3 are implemented as invokeAll() operations on ArrayLists of RecursiveTask objects. Most of the rewritten code is localized to the soot.shimple.internal package. We intend to contribute a patch back to the Soot maintainers. The whole code took around three man-weeks of development effort.

4. EVALUATION

4.1 Benchmarks

We use standard Java benchmark programs from the DaCapo and Java Grande suites to evaluate our parallel SSA construction algorithm. The DaCapo suite of Java benchmarks [4], version 9.12 (bach), consists of large, widely used, real-world open-source applications: in fact, typical input for a client-side JIT compiler. The Java Grande suite of benchmarks [27] contains five large-scale scientific applications. We use the sequential versions of these programs. We observe that the Java Grande codebase is significantly smaller than DaCapo; however it is representative of code from the scientific domain.

We execute each analysed application with a standard workload (default for DaCapo and SizeA for Java Grande) and record the classes that the JVM classloader accesses. We ignore classes that do not belong to the benchmark distribution, e.g. Java standard library classes. Then, given a list of classes for each application, we run these classes through the Soot analysis. Table 1 summarizes the applications we select. For each benchmark, we report (i) the number of methods that we will analyse in Soot, (ii) the arithmetic mean of the bytecode instruction lengths of these selected methods, and (iii) the bytecode instruction length of the longest method.

4.2 Platform

Our commodity multi-core evaluation platform is described in Table 2. We use a Java fork/join pool with twice the number of worker threads as hardware threads in the system. This parameter may be tuned; it has some effect on performance, but we cannot explore tuning due to space restrictions.
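As a concrete illustration of the pool sizing just described, a minimal sketch follows; the factor of two is the paper's default, while the class and field names are our own.

  import java.util.concurrent.ForkJoinPool;

  final class SsaPool {
      // One fork/join pool shared by the SSA construction passes, sized at twice
      // the number of hardware threads reported by the JVM.
      static final ForkJoinPool POOL =
          new ForkJoinPool(2 * Runtime.getRuntime().availableProcessors());
  }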
4.3 Experiments

We compare two different SSA construction techniques. The sequential version is the default Shimple builder pass supplied in Soot v2.4.0. The parallel version is our custom pass that uses Java fork/join parallelism to implement the parallel SSA construction algorithm outlined in Section 2.

Vendor: Intel
Codename: Nehalem
Architecture: Core i7-920
Cores x SMT contexts: 4x2
Per-core L1 i/d: 32KB/32KB
Per-core L2: 256KB
Shared L3: 8MB
Core frequency: 2.67GHz
RAM size: 6GB
OS: Linux 2.6.31 (64-bit)
JVM (1.6): 14.0-b16
Max JVM heap: 4GB
# fork/join threads: 16

Table 2: Evaluation platform for experiments

For all tests, we measure execution times of compilation phases using the Java library call System.nanoTime(). All times are the arithmetic means of five measurements, which have a low coefficient of variation. We reduce timing variance by using large fixed-size JVM heaps. We compute speedups as: mean sequential time / mean parallel time. For the benchmark applications, we report the speedup in overall SSA construction time, i.e. the sum of φ-function insertion and variable renaming. This SSA construction time is summed over all methods analysed, for each individual benchmark.

4.3.1 Comparison on Standard Benchmarks

Figure 1 shows speedups for parallel SSA construction on the selected benchmark applications, on the Intel Core i7 platform. The speedup varies with the method size threshold t. For a particular point on a benchmark's speedup curve, we apply the sequential SSA construction algorithm to all methods with size < t, whereas we apply the parallel SSA construction algorithm to all methods with size ≥ t. If t is set too small, then almost all methods are analysed indiscriminately using the parallel algorithm. For many small methods, the overhead of parallelism outweighs any overlapped execution gains. In many cases, this causes an overall slowdown (speedup scores below 1 in the graph). On the other hand, if t is set too large, then hardly any methods are analysed in parallel, so the SSA construction is almost always sequential. In all cases, the curves tend to speedup = 1 as t → ∞.

Parallel SSA construction is more beneficial for large methods. The benchmarks with the largest mean method sizes in Table 1, namely fop and batik in DaCapo and euler and moldyn in Java Grande, show the best speedups in Figure 1. We have not investigated alternative heuristics for selecting the individual methods for which the parallel SSA construction algorithm is to be preferred to the sequential algorithm. Since the parallelism depends on (i) the number of variables, and (ii) the dominance properties of the control flow graph, method size seems like a simple proxy measure for overall complexity. Other heuristics might include software metrics such as cyclomatic complexity [20].
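A minimal sketch of the size-threshold heuristic follows: methods at or above the threshold t go to the parallel SSA builder, smaller ones stay sequential. MethodBody and SsaBuilder are hypothetical stand-ins, not Soot types.

  interface MethodBody { int instructionCount(); }
  interface SsaBuilder { void transform(MethodBody body); }

  final class ThresholdDispatch {
      private final SsaBuilder sequential, parallel;
      private final int threshold;   // the method length threshold t

      ThresholdDispatch(SsaBuilder sequential, SsaBuilder parallel, int threshold) {
          this.sequential = sequential;
          this.parallel = parallel;
          this.threshold = threshold;
      }

      void buildSsa(MethodBody body) {
          // size >= t: parallel construction; size < t: default sequential builder
          (body.instructionCount() >= threshold ? parallel : sequential).transform(body);
      }
  }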
app         description                       # methods   mean length   max length
DaCapo
avrora      program simulator                 2836        21.9          1083
batik       SVG image processing              7137        34.3          41033
fop         PDF generator                     6749        44.5          33089
jython      Python interpreter                20664       25.4          7846
luindex     text indexing                     1885        30.6          493
lusearch    text search                       1613        26.9          1187
pmd         static analysis                   6477        33.6          2881
sunflow     raytracer                         1109        50.4          6308
xalan       XML parser                        6189        29.5          2881
Java Grande
euler       fluid dynamics                    27          295.81        1822
moldyn      molecular dynamics simulation     20          102.20        931
montecarlo  Monte Carlo simulation            178         17.65         211
raytracer   raytracer                         65          30.63         229
search      alpha-beta search                 29          86.34         465

Table 1: Analysed Java applications from the DaCapo and Java Grande benchmark suites

[Figure 1: Speedups for parallel SSA construction on the DaCapo (left) and Java Grande (right) benchmark suites. Each panel plots speedup against the method length threshold, with one curve per benchmark.]

4.3.2 Comparison on Inlined Benchmarks

The reason why many benchmarks do not give significant speedups with parallel SSA construction is that most methods are small, so the overhead of the parallel threads is not mitigated by sufficient work for these threads to perform. Method inlining is one optimization technique to reduce this problem. The code of callee methods is reproduced at the corresponding call-sites in the caller method bodies. This increases the overall size of the code, but it often reduces execution time. Thus it is a valuable optimization for JIT compilers [30], applied to the most frequently executed regions of code.

We apply static method inlining in the Soot framework to rewrite the Java bytecode class files for some of our benchmarks. The parameters used are: expansion-factor 20, max-container-size 10000, max-inlinee-size 500. Table 3 shows the new statistics for methods in each inlined benchmark, and how they relate to the original benchmark code. A negative % change indicates a reduction in relation to the original; a positive % change indicates an expansion in relation to the original.

app         # methods (% change)   mean length (% change)   max length
montecarlo  179 (+1%)              20.82 (+18%)              231
raytracer   69 (+6%)               40.31 (+32%)              406
search      32 (+10%)              89.87 (+4%)               484

Table 3: Method statistics for selected benchmarks with inlining applied

Now we compare the sequential versus the parallel implementation of SSA construction on these inlined benchmarks, in the same way as before. Figure 2 shows the speedups at various method thresholds. As before, at method threshold t, all methods with size < t are transformed using the sequential algorithm, whereas all methods with size ≥ t are transformed using the parallel algorithm. The benchmarks shown exhibit some speedups at method thresholds around 500, whereas in the non-inlined versions (Figure 1) no speedup was possible.

[Figure 2: Speedups for parallel SSA construction on selected benchmarks with method inlining applied. Speedup is plotted against the method length threshold for montecarlo, raytracer and search.]

5. RELATED WORK

To the best of our knowledge, no-one has previously reported on a parallelization strategy for the construction of SSA. However, given the sudden abundance of parallel threads in commodity hardware, there has been significant recent interest in adapting classical data flow analysis frameworks to take advantage of light-weight threading.
Knoop [11] argues that reverse data flow analysis, or demand-driven data flow analysis problems (i.e. multiple data flow queries at specific program points), are embarrassingly parallel, and therefore good candidates for deployment on multi-core systems. Edvinsson and Lowe [9] motivate parallel data flow analysis for rapid points-to analysis of an object-oriented program as part of an integrated development environment. They analyse the target methods of polymorphic calls in parallel. They introduce heuristics to avoid spawning new analysis threads if the thread management overhead is large in comparison with the amount of parallel execution. They evaluate a Java implementation of their points-to analysis on a range of open-source Java applications. The average parallel speedup is 1.78 with their best heuristic, on an eight-core IA-32 system.

Rodriguez [26, 25] implements an interprocedural data flow analysis algorithm using light-weight multithreading based on the Actor model in Scala. Note that Scala Actors are implemented using the Java fork/join framework, which we have used in our implementation work. Rodriguez presents an object-oriented type analysis algorithm, and shows that there is abundant potential parallelism when this analysis is applied to three DaCapo Java applications (up to 1000-way parallelism on an ideal machine).

Méndez-Lojo et al [21] present an Andersen-style points-to analysis based on constraint graphs. The graphs can be manipulated in parallel with simple rewrite rules. Their implementation gives a 3x speedup over the best sequential version on an 8-core machine for substantial C benchmarks.

There is a body of older work on parallelism in data flow analysis. For instance, Lee et al [18, 17] show how classical problems such as reaching definitions analysis can be parallelized. They define three kinds of parallelism for data flow problems. Our work falls into the third category: algorithmic parallelism. They note that the granularity of the problem decomposition is critical here. Such parallelization is only effective for 'large procedures or interprocedural problems.' Our results confirm that this observation still holds. They give an empirical study, analysing reaching definitions in Fortran benchmark programs on a special-purpose research hypercube-processor-interconnect machine. They show speedups and some scalability for up to eight processors.

6. CONCLUSIONS

In this paper we have shown that the standard sequential algorithm for SSA construction may be parallelized. We have implemented the parallel algorithm in an existing compiler infrastructure using light-weight threading mechanisms for Java. The implementation has been tested on a representative set of Java benchmark programs. For shorter methods, the gains from parallel execution are outweighed by the overhead of the fork/join thread library. However, the parallel speedups are significant for larger methods, suggesting that a threshold size is required to select which methods should be subject to parallel SSA construction.

We anticipate that commodity hardware with increasingly large numbers of cores and threads will become common over the next few years: the challenge for programmers is to make the best use of these cores. We have three reasons for believing that the size of SSA-based representations is also set to grow:
1. Given the current trend towards more aggressive optimisation, including loop unrolling and method inlining, compilers will have large method bodies to analyse even in cases where the program was originally written as a collection of small functions.
2. Another trend is the increasing amount of source code that is produced by automatic code generators. Such programs do not necessarily have the same properties as human-written code. Methods are often larger, and loops deeper.
3. There is an increasing need for interprocedural data flow analysis. In the context of this paper, the SSA construction operates at a single-procedure level. However, SSA does extend naturally to the whole-program scope [19, 29]. Again, our parallel SSA construction techniques would be applicable to such large interprocedural control flow graphs.

Exploiting all the available parallel hardware for the analysis of such large SSA-based program representations is therefore of great importance in reducing compile times.

In overall terms, the largest single contributor to compilation time is parsing. Researchers are working on parallelizing the parsing process [24]. We anticipate that every significant phase of next-generation compilers should be engineered to take advantage of underlying hardware parallelism, to achieve scalability. This includes the data flow analysis phase, of which SSA construction is generally a key initial step.

One might argue that it would be better to perform sequential (i.e. single-threaded) SSA construction on multiple methods in parallel, rather than parallelizing SSA construction for a single method. However, we note that in a JIT compiler or an interactive compilation server [16], requests for optimizing compilation of methods will occur sporadically, unpredictably and not necessarily concurrently. Where possible, it would be good both to parallelize SSA construction within individual methods and to handle multiple methods concurrently. We have only tackled the former problem in this paper. Also, in a JIT context, response time is often more important than total performance, so reducing the analysis time for the largest methods will improve worst-case response time, even if it does not improve performance on the small methods (where performance is already adequate).

7. REFERENCES

[1] Ashby, S., Eulisse, G., Schmid, S., Tuura, L.: Parallel compilation of CMS software. In: Proc. Computing in High Energy and Nuclear Physics Conference (CHEP) (2004)
[2] Baalbergen, E.: Design and implementation of parallel make. Computing Systems 1(2), 135–158 (1988)
[3] Bilardi, G., Pingali, K.: Algorithms for computing the static single assignment form. Journal of the ACM 50(3), 375–425 (May 2003)
[4] Blackburn, S.M., Garner, R., Hoffman, C., Khan, A.M., McKinley, K.S., Bentzur, R., Diwan, A., Feinberg, D., Frampton, D., Guyer, S.Z., Hirzel, M., Hosking, A., Jump, M., Lee, H., Moss, J.E.B., Phansalkar, A., Stefanović, D., VanDrunen, T., von Dincklage, D., Wiedermann, B.: The DaCapo benchmarks: Java benchmarking development and analysis. In: OOPSLA '06: Proceedings of the 21st annual ACM SIGPLAN conference on Object-Oriented Programming, Systems, Languages, and Applications. pp. 169–190 (Oct 2006)
[5] Blumofe, R.D., Joerg, C.F., Kuszmaul, B.C., Leiserson, C.E., Randall, K.H., Zhou, Y.: Cilk: an efficient multithreaded runtime system. In: Proceedings of the fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. pp. 207–216. ACM, New York, NY, USA (1995)
[6] Brandis, M.M., Mössenböck, H.: Single-pass generation of static single-assignment form for structured languages. ACM Transactions on Programming Languages and Systems 16(6), 1684–1698 (Nov 1994)
[7] Cytron, R., Ferrante, J., Rosen, B.K., Wegman, M.N., Zadeck, F.K.: Efficiently computing static single assignment form and the control dependence graph. ACM Trans. Program. Lang. Syst. 13(4), 451–490 (1991)
[8] Das, D., Ramakrishna, U.: A practical and fast iterative algorithm for φ-function computation using DJ graphs. ACM Transactions on Programming Languages and Systems 27(3), 426–440 (2005)
[9] Edvinsson, M., Lowe, W.: A multi-threaded approach for data-flow analysis. In: Proceedings of the IPDPS 2010 Workshop on Multi-Threaded Architectures and Applications (2010)
[10] Hill, M., Marty, M.: Amdahl's law in the multicore era. IEEE Computer 41(7), 33–38 (2008)
[11] Knoop, J.: Data-flow analysis for multi-core computing systems: A reminder to reverse data-flow analysis. In: Martin, F., Nielson, H.R., Riva, C., Schordan, M. (eds.) Scalable Program Analysis. No. 08161 in Dagstuhl Seminar Proceedings, Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2008), http://drops.dagstuhl.de/opus/volltexte/2008/1575
[12] Knuth, D.: An empirical study of FORTRAN programs. Software: Practice and Experience 1(2), 105–133 (1971)
[13] Kotzmann, T., Wimmer, C., Mössenböck, H., Rodriguez, T., Russell, K., Cox, D.: Design of the Java HotSpot client compiler for Java 6. ACM Trans. Archit. Code Optim. 5(1), 1–32 (2008)
[14] Lattner, C., Adve, V.: LLVM: A compilation framework for lifelong program analysis and transformation. In: Code Generation and Optimization, IEEE/ACM International Symposium on. p. 75. IEEE Computer Society, Los Alamitos, CA, USA (2004)
[15] Lea, D.: A Java fork/join framework. In: Proceedings of the ACM 2000 conference on Java Grande. pp. 36–43. ACM, New York, NY, USA (2000)
[16] Lee, H.B., Diwan, A., Moss, J.E.B.: Design, implementation, and evaluation of a compilation server. ACM Transactions on Programming Languages and Systems 29 (2007)
[17] Lee, Y.F., Marlowe, T.J., Ryder, B.G.: Performing data flow analysis in parallel. In: Supercomputing '90: Proceedings of the 1990 ACM/IEEE conference on Supercomputing. pp. 942–951 (1990)
[18] Lee, Y.F., Ryder, B.G.: A comprehensive approach to parallel data flow analysis. In: Proceedings of the 6th International Conference on Supercomputing. pp. 236–247 (1992)
[19] Liao, S.W., Diwan, A., Bosch, Jr., R.P., Ghuloum, A., Lam, M.S.: SUIF Explorer: an interactive and interprocedural parallelizer. In: Proceedings of the 7th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. pp. 37–48 (1999)
[20] McCabe, T.: A complexity measure. IEEE Transactions on Software Engineering 2, 308–320 (1976)
[21] Méndez-Lojo, M., Mathew, A., Pingali, K.: Parallel inclusion-based points-to analysis. In: Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications. pp. 428–443 (2010)
[22] Novillo, D.: Tree SSA: a new optimization infrastructure for GCC. In: Proceedings of the 2003 GCC Developers' Summit. pp. 181–193 (2003)
[23] Novillo, D.: Design and implementation of Tree SSA. In: GCC Developers' Summit (2004)
[24] Prabhu, P., Ramalingam, G., Vaswani, K.: Safe programmable speculative parallelism. In: PLDI '10: Proceedings of the 2010 ACM SIGPLAN conference on Programming Language Design and Implementation. pp. 50–61 (2010)
[25] Rodriguez, J., Lhoták, O.: Actor-based parallel dataflow analysis. In: Proceedings of the International Conference on Compiler Construction (2011), to appear
[26] Rodriguez, J.D.: A concurrent IFDS dataflow analysis algorithm using actors (2010), http://hdl.handle.net/10012/5283
[27] Smith, L., Bull, J., Obdržálek, J.: A parallel Java Grande benchmark suite. In: Proceedings of the ACM/IEEE 2001 Conference on Supercomputing. pp. 1–10 (2001)
[28] Sreedhar, V.C., Gao, G.R.: A linear time algorithm for placing φ-nodes. In: Proceedings of the 22nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. pp. 62–73 (1995)
[29] Staiger, S., Vogel, G., Keul, S., Wiebe, E.: Interprocedural static single assignment form. In: Proceedings of the 14th Working Conference on Reverse Engineering. pp. 1–10 (2007)
[30] Suganuma, T., Yasue, T., Nakatani, T.: An empirical study of method inlining for a Java just-in-time compiler. In: Proceedings of the Java Virtual Machine Research and Technology Symposium (2002)
[31] Vallée-Rai, R., Co, P., Gagnon, E., Hendren, L., Lam, P., Sundaresan, V.: Soot - a Java bytecode optimization framework. In: Proceedings of the 1999 conference of the Centre for Advanced Studies on Collaborative Research. p. 13. IBM Press (1999)
[32] Vallée-Rai, R., Gagnon, E., Hendren, L., Lam, P., Pominville, P., Sundaresan, V.: Optimizing Java bytecode using the Soot framework: Is it feasible? In: Proceedings of the International Conference on Compiler Construction. pp. 18–34. Springer (2000)
[33] Vallée-Rai, R., Hendren, L.: Jimple: Simplifying Java bytecode for analyses and transformations. Tech. Rep. SABLE-TR-1998-4, McGill University, School of Computer Science (1998)