1/2/01
Performance Impact of
Multithreaded Java Server
Applications
Yue Luo,  Lizy K. John
Laboratory of Computer Architecture
ECE Department
University of Texas at Austin
Outline
• Motivation
• VolanoMark Benchmark
• Methodology
• Results
• Conclusion
• Further Work Needed
Motivation
• Performance in the presence of a large number of threads is crucial for a commercial Java server.
  – Java applications are shifting from the client side to the server side.
  – Servers need to support many simultaneous client connections.
  – No select(), poll(), or asynchronous I/O in Java.
  – Current Java programming paradigm: one thread per connection.
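As a minimal sketch of that one-thread-per-connection paradigm (the class and method names here are illustrative, not VolanoMark code), a server accepts connections with blocking I/O and dedicates one thread to each:

```java
import java.io.*;
import java.net.*;

// Minimal sketch of the one-thread-per-connection paradigm.
// EchoPerThreadServer and its names are hypothetical, not VolanoMark code.
public class EchoPerThreadServer {
    private final ServerSocket listener;

    public EchoPerThreadServer() throws IOException {
        listener = new ServerSocket(0);     // bind to an ephemeral port
    }

    public int port() { return listener.getLocalPort(); }

    // Accept loop: spawn one dedicated thread per accepted connection.
    public void start() {
        Thread acceptor = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) {
                        final Socket client = listener.accept();  // blocks
                        Thread worker = new Thread(new Runnable() {
                            public void run() { handle(client); }
                        });
                        worker.setDaemon(true);
                        worker.start();
                    }
                } catch (IOException e) { /* listener closed */ }
            }
        });
        acceptor.setDaemon(true);
        acceptor.start();
    }

    // Each connection's thread blocks on reads; here it simply echoes lines.
    static void handle(Socket client) {
        try {
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(client.getInputStream()));
            PrintWriter out = new PrintWriter(client.getOutputStream(), true);
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line);          // a chat server would relay messages instead
            }
        } catch (IOException e) { /* connection closed */ }
    }
}
```

Because every blocking read pins a thread, hundreds of connections mean hundreds of mostly-blocked threads, which is exactly the regime the measurements below explore.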
VolanoMark Benchmark
• VolanoMark is a 100% Pure Java server benchmark characterized by long-lasting network connections and high thread counts.
  – Based on real commercial software.
  – Server benchmark.
  – Long-lasting network connections and high thread counts.
  – Two threads for each client connection.
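The two threads per connection are typically one reader and one writer, so a blocking read never stalls outgoing messages. A minimal sketch under that assumption (the class and field names are hypothetical, and java.util.concurrent is used for brevity even though it postdates the JDK 1.3 used in this study):

```java
import java.io.*;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: one reader thread and one writer thread per connection.
public class ChatConnection {
    private final BufferedReader in;
    private final PrintWriter out;
    private final BlockingQueue<String> outgoing = new LinkedBlockingQueue<String>();
    public final BlockingQueue<String> incoming = new LinkedBlockingQueue<String>();

    public ChatConnection(InputStream is, OutputStream os) {
        in = new BufferedReader(new InputStreamReader(is));
        out = new PrintWriter(os, true);    // autoflush on println
    }

    public void start() {
        // Reader thread: blocks on the stream, queues received messages.
        Thread reader = new Thread(new Runnable() {
            public void run() {
                try {
                    String line;
                    while ((line = in.readLine()) != null) incoming.put(line);
                } catch (Exception e) { /* stream closed */ }
            }
        });
        // Writer thread: drains the send queue, blocks only when idle.
        Thread writer = new Thread(new Runnable() {
            public void run() {
                try {
                    while (true) out.println(outgoing.take());
                } catch (InterruptedException e) { /* shut down */ }
            }
        });
        reader.setDaemon(true);
        writer.setDaemon(true);
        reader.start();
        writer.start();
    }

    public void send(String msg) throws InterruptedException {
        outgoing.put(msg);                  // never blocks the caller on the socket
    }
}
```

With two threads per connection, an 800-connection run already implies on the order of 1,600 server-side threads.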
VolanoMark Benchmark

[Diagram: a central Server relays messages among clients; user1–user3 in Chat room 1 and user1–user3 in Chat room 2, with "Message 1" shown traveling between a Client and the Server.]
Methodology
• Performance counters used to study OS and user activity on a Pentium III system.
• Monitoring tool: Pmon
  – Developed in our lab, so better controlled.
  – Device driver to read the performance counters.
  – Low overhead.
Platform Parameters
• Hardware
– Uni-processor
– CPU Frequency: 500MHz
– L1 I Cache: 16KB, 4-way, 32 Byte/Line, LRU
– L2 Cache: 512KB, 4-way, 32 Byte/Line, Non-blocking
– Main Memory: 1GB
• Software
– Windows NT Workstation 4.0
– Sun JDK 1.3.0-C with HotSpot server (build 2.0fcs-E, mixed mode)
Monitoring Issues
• Synchronize measurements with client connections to skip the startup and shutdown phases.
  – Add a wrapper to the client.
  – The wrapper starts an extra connection immediately before starting the client to trigger measurement.
• Avoid counter overflow
  – Counting interval: 3 sec
Results
• Decreasing CPI
• OS code has a larger CPI than user code
• OS CPI decreases significantly
• User CPI shows only small fluctuations

[Chart: Hotspot CPI (2–5) vs. connections (20–800) for USER, OS, OVERALL.]
Results
• More instructions executed!
• OS part increases significantly
• User part increases slightly
• Even more execution time is spent in OS mode due to the larger OS CPI
• One guess: overhead in connection and thread management; some OS algorithm with non-linear complexity (e.g. O(N log N))

[Chart: Hotspot instructions per connection (0–4.5×10^7) vs. connections (20–800) for USER, OS.]

Regardless of the number of connections, each thread basically does the same thing. Therefore the instruction count per connection should remain the same.
Results
• Decreasing L1 I-cache miss ratio
• Beneficial interference between threads: they share program code.
• The more threads we have, the more likely a context switch lands on another thread executing the same part of the program, so code in the I-cache and entries in the ITLB are reused.

[Chart: Hotspot L1 I-cache misses per instruction (0%–14%) vs. connections (20–800) for USER, OS, OVERALL.]
[Chart: Hotspot ITLB misses per instruction (0.00%–0.80%) vs. connections (20–800) for USER, OS, OVERALL.]
Results
• Decreasing I-stalls per instruction
• As a result of the decreasing I-cache and ITLB miss ratios, instruction-fetch stalls are lowered for both the OS part and the user part.

[Chart: Hotspot I-stall cycles per instruction (0–2.5) vs. connections (20–800) for USER, OS, OVERALL.]
Results
• Increasing L1 D-cache miss ratio
• OS: huge increase
  – More thread data, larger data footprint
  – More context switches
• User: slight decrease

[Chart: Hotspot L1 D-cache misses per data reference (0%–20%) vs. connections (20–800) for USER, OS, OVERALL.]
Results
• Significant OS L1 D-cache misses
• With more connections, the OS is doing more work and incurring more data misses
  – Sending and receiving network packets
  – Thread scheduling and synchronization

[Chart: Hotspot cache misses per connection (0–6×10^6) vs. connections (20–800) for OS D-CACHE, USER D-CACHE, OS I-CACHE, USER I-CACHE.]
Results
• L2 cache miss ratio

[Chart: Hotspot L2 cache miss ratio (0%–16%) vs. connections (20–800) for USER, OS, OVERALL.]
Results
• Branches
• More branches in OS code
• More branches are taken
• May be due to more loops in OS code

[Chart: Hotspot branch frequency (19%–25%) vs. connections (20–800) for USER, OS, OVERALL.]
[Chart: Hotspot branch taken ratio (50%–80%) vs. connections (20–800) for USER, OS, OVERALL.]
Results
• More accurate branch predictions
• Branches in loops are easier to predict, so branch predictions become more accurate.
• Due to the beneficial code sharing among threads, the BTB miss ratio decreases.

[Chart: Hotspot branch mispredict ratio (0%–20%) vs. connections (20–800) for USER, OS, OVERALL.]
[Chart: Hotspot BTB miss ratio (0%–70%) vs. connections (20–800) for USER, OS, OVERALL.]
Results
• More resource stalls
• Lower instruction-fetch stalls and better branch prediction keep the pipeline better fed, resulting in more resource stalls.
• May favor a CPU with more execution resources.

[Chart: Hotspot resource stalls per instruction (0–1.4) vs. connections (20–800) for USER, OS, OVERALL.]
Conclusions
• Multithreading is an excellent approach to supporting multiple simultaneous client connections. Heavy multithreading is especially crucial for Java server applications because Java lacks I/O multiplexing APIs.
• Thread creation and synchronization, as well as network connection management, are the responsibility of the operating system. With more concurrent connections, more OS activity is involved in server execution.
• Threads usually share program code; thus the instruction cache, ITLB, and BTB all benefit when the system context-switches from one thread to another executing the same part of the code. Multithreading also benefits branch predictors.
• Each thread incurs some code and data overhead, especially in operating system mode. Given enough memory resources, the nonlinearly increasing overheads are the biggest impediment to performance scalability. Further tuning of the application and operating system may alleviate this problem.
Further Work Needed
• A more complex benchmark (e.g. SPECjbb2000) is needed to validate the results. We need to distinguish the characteristics of multithreaded server applications in general from those of VolanoMark.
• Find out why many more instructions are executed in the OS with more connections, and try to reduce them.
Results on Sparc With Shade (backup slide)
• Also observed a decreasing L1 I-cache miss ratio

[Chart: L1 I-cache miss ratio (0.0%–4.0%) vs. connections (0–200) for GREEN, NATIVE.]
Results on Sparc With Shade (backup slide)
• Also observed better branch prediction

[Chart: Branch mispredict rate (0%–16%) vs. connections (0–200) for GREEN, NATIVE.]