1/2/01

1  Performance Impact of Multithreaded Java Server Applications
   Yue Luo, Lizy K. John
   Laboratory of Computer Architecture
   ECE Department, University of Texas at Austin

2  Outline
• Motivation
• VolanoMark Benchmark
• Methodology
• Results
• Conclusions
• Further Work Needed

3  Motivation
• Performance in the presence of a large number of threads is crucial for a commercial Java server.
  – Java applications are shifting from the client side to the server side.
  – A server must support many simultaneous client connections.
  – Java offers no select(), poll(), or asynchronous I/O.
  – The current Java programming paradigm is therefore one thread per connection (a sketch of this pattern appears at the end of the deck).

4  VolanoMark Benchmark
• VolanoMark is a 100% Pure Java server benchmark characterized by long-lasting network connections and high thread counts.
  – Based on real commercial software.
  – Two threads for each client connection.

5  VolanoMark Benchmark
[Diagram: clients user1–user3 in Chat room 1 and Chat room 2 exchanging messages through the server]

6  Methodology
• Performance counters are used to study OS and user activity on a Pentium III system.
• Monitoring tool: Pmon
  – Developed in our lab, giving us full control.
  – A device driver reads the performance counters.
  – Low overhead.

7  Platform Parameters
• Hardware
  – Uniprocessor
  – CPU frequency: 500 MHz
  – L1 I-cache: 16 KB, 4-way, 32-byte lines, LRU
  – L2 cache: 512 KB, 4-way, 32-byte lines, non-blocking
  – Main memory: 1 GB
• Software
  – Windows NT Workstation 4.0
  – Sun JDK 1.3.0-C with HotSpot server (build 2.0fcs-E, mixed mode)

8  Monitoring Issues
• Synchronize measurement with client connections to skip the startup and shutdown phases.
  – A wrapper is added to the client.
  – The wrapper opens an extra connection immediately before starting the client to trigger measurement (a sketch appears at the end of the deck).
• Avoid counter overflow.
  – Counting interval: 3 s (see the overflow arithmetic at the end of the deck).

9  Results: Decreasing CPI
• The OS has a larger CPI than user code.
• OS CPI decreases significantly as connections are added.
• User CPI shows only small fluctuations.
[Chart: HotSpot CPI (user, OS, overall) vs. number of connections, 20–800]

10  Results: More Instructions Executed
• Regardless of the number of connections, each thread does essentially the same work, so the instruction count per connection should stay constant. Instead, it grows!
• The OS portion increases significantly; the user portion increases slightly.
• Combined with the larger OS CPI, this means an even greater share of execution time is spent in OS mode.
• One guess: overhead in connection and thread management; some OS algorithm may have non-linear complexity (e.g., O(N log N)).
[Chart: HotSpot instructions per connection (user and OS) vs. number of connections, 20–800]

11  Results: Decreasing L1 I-Cache Miss Ratio
• Beneficial interference between threads: they share program code.
• The more threads there are, the more likely a context switch lands in another thread executing the same part of the program, so code in the I-cache and entries in the ITLB are reused.
[Charts: HotSpot L1 I-cache misses per instruction and ITLB misses per instruction (user, OS, overall) vs. number of connections, 20–800]

12  Results: Decreasing I-Stalls per Instruction
• As a result of the falling I-cache and ITLB miss ratios, instruction-fetch stalls drop for both the OS and user portions.
[Chart: HotSpot I-stall cycles per instruction (user, OS, overall) vs. number of connections, 20–800]

13  Results: Increasing L1 D-Cache Miss Ratio
• OS: huge increase.
  – More thread data, so a larger data footprint.
  – More context switches.
• User: slight decrease.
[Chart: HotSpot L1 D-cache misses per data reference (user, OS, overall) vs. number of connections, 20–800]

14  Results: Significant OS L1 D-Cache Misses
• With more connections, the OS does more work and incurs more data misses:
  – Sending and receiving network packets.
  – Thread scheduling and synchronization.
[Chart: HotSpot cache misses per connection (OS D-cache, user D-cache, OS I-cache, user I-cache) vs. number of connections, 20–800]

15  Results: L2 Cache Miss Ratio
[Chart: HotSpot L2 cache miss ratio (user, OS, overall) vs. number of connections, 20–800]

16  Results: Branches
• OS code has a higher branch frequency.
• More branches are taken.
• May be due to more loops in OS code.
[Charts: HotSpot branch frequency and branch taken ratio (user, OS, overall) vs. number of connections, 20–800]

17  Results: More Accurate Branch Prediction
• Branches in loops are easier to predict, so branch prediction becomes more accurate.
• Thanks to the beneficial code sharing among threads, the BTB miss ratio also decreases.
[Charts: HotSpot branch mispredict ratio and BTB miss ratio (user, OS, overall) vs. number of connections, 20–800]

18  Results: More Resource Stalls
• Fewer instruction-fetch stalls and better branch prediction mean the front end delivers instructions faster, producing more resource stalls in the back end.
• This may favor a CPU with more execution resources.
[Chart: HotSpot resource stalls per instruction (user, OS, overall) vs. number of connections, 20–800]

19  Conclusions
• Multithreading is an excellent approach to supporting many simultaneous client connections. Heavy multithreading is all the more crucial for Java server applications because Java lacks I/O-multiplexing APIs.
• Thread creation and synchronization, as well as network connection management, are the responsibility of the operating system. With more concurrent connections, more OS activity is involved in the server's execution.
• Threads usually share program code, so the instruction cache, ITLB, and BTB all benefit when the system context-switches from one thread to another executing the same part of the code. Multithreading also benefits the branch predictor.
• Each thread incurs some code and data overhead, especially in OS mode. Given enough memory resources, these nonlinearly increasing overheads are the biggest impediment to performance scalability. Further tuning of the application and the operating system may alleviate the problem.

20  Further Work Needed
• Validate the results with a more complex benchmark (e.g., SPECjbb2000), to separate the characteristics of multithreaded server applications in general from those specific to VolanoMark.
• Find out why so many more instructions are executed in the OS as connections grow, and try to reduce them.

21  Results on SPARC with Shade (backup slide)
• Also observed a decreasing L1 I-cache miss ratio.
[Chart: L1 I-cache miss ratio (green vs. native threads) vs. number of connections, 0–200]

22  Results on SPARC with Shade (backup slide)
• Also observed better branch prediction.
[Chart: branch mispredict rate (green vs. native threads) vs. number of connections, 0–200]
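Sketch: the thread-per-connection paradigm (backup)
A minimal illustration, not VolanoMark source code, of the programming model described on slide 3: with no select(), poll(), or asynchronous I/O in Java (before java.nio), each blocking read needs its own thread, so N connections require N threads (2N when, as in VolanoMark, each connection also gets a dedicated writer thread). The class name, port, and echo behavior are arbitrary choices for illustration.

    import java.io.*;
    import java.net.*;

    public class ThreadPerConnectionServer {
        public static void main(String[] args) throws IOException {
            ServerSocket listener = new ServerSocket(8080); // arbitrary port
            while (true) {
                final Socket client = listener.accept();    // blocks
                // One dedicated thread per connection; it spends most of
                // its life blocked in read(), waiting on this one client.
                new Thread(new Runnable() {
                    public void run() {
                        try {
                            BufferedReader in = new BufferedReader(
                                new InputStreamReader(client.getInputStream()));
                            PrintWriter out = new PrintWriter(
                                client.getOutputStream(), true);
                            String line;
                            while ((line = in.readLine()) != null) {
                                out.println(line); // echo back to the client
                            }
                        } catch (IOException e) {
                            // connection dropped; fall through and close
                        } finally {
                            try { client.close(); } catch (IOException e) { }
                        }
                    }
                }).start();
            }
        }
    }

With hundreds of connections this model creates hundreds of kernel threads, which is exactly the regime the measurements above explore.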
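Sketch: the measurement wrapper from slide 8 (backup)
A minimal sketch of the synchronization idea only; the host name, port, and client entry point are assumptions, and Pmon's actual trigger mechanism is not shown in the slides. The point is that one extra connection, opened immediately before the real client starts, gives the measurement tool an unambiguous start signal, so JVM startup and shutdown are excluded from the counts.

    import java.net.Socket;

    public class ClientWrapper {
        public static void main(String[] args) throws Exception {
            // Extra connection as the "start measuring" signal; the monitor
            // on the server side begins reading the performance counters
            // when it sees it. (Host and port are assumed for illustration.)
            Socket trigger = new Socket("server-host", 9999);
            trigger.close();

            // Launch the unmodified VolanoMark client (assumed entry point).
            COM.volano.Mark.main(args);
        }
    }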
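Note: why a 3 s counting interval avoids overflow (backup)
A back-of-the-envelope check, assuming the tool reads 32-bit counter values (the slides do not state the read width): on the 500 MHz machine, an event that fires at most once per cycle accumulates at most 5 x 10^8 x 3 = 1.5 x 10^9 counts in 3 s, well below the 32-bit wrap point of 2^32, roughly 4.3 x 10^9. Equivalently, such a counter needs about 2^32 / (5 x 10^8) = 8.6 s to overflow, so sampling every 3 s leaves a comfortable margin.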