Java程序辅导

C C++ Java Python Processing编程在线培训 程序编写 软件开发 视频讲解

客服在线QQ:2653320439 微信:ittutor Email:itutor@qq.com
wx: cjtutor
QQ: 2653320439
SCHEDULING 
MACRO-DATAFLOW PROGRAMS 
ON 
TASK-PARALLEL RUNTIME SYSTEMS	

SAĞNAK TAŞIRLAR 
1	

Thesis	

2	

	

 	

Our thesis is that advances in task parallel runtime 
systems can enable a macro-dataflow programming 
model, like Concurrent Collections (CnC), to deliver 
productivity and performance on modern multicore 
processors.	

Approach	

3	

  Macro-dataflow for expressiveness	

  Determinism	

  Race/deadlock freedom	

  Higher level abstraction	

  Task parallel runtimes for performance	

  Portable scalability 	

  Contemporary consensus	

Motivation	

4	

  Parallelism not accessible to those who need it most	

  Imposed serial thinking	

  Parallelism for the masses, not just computer scientists	

  Parallel programming models of today:	

  Hide machine details but expose parallelism details	

  Constrain expressiveness	

Contributions	

5	

  Scheduling CnC on Habanero Java ★	

  Evaluation of scheduling performance for CnC ★	

  Introduction of Data Driven Futures (DDF) construct	

  Implementation of DDF construct	

  Implementation and evaluation of data driven runtime 
with DDFs	

★	

Zoran Budimlić, Michael Burke, Vincent Cavé, Kathleen Knobe, Geoff Lowney, Ryan Newton, 
Jens Palsberg, David Peixotto, Vivek Sarkar, Frank Schlimbach, Sağnak Taşırlar, “The CnC 
Programming Model ”, submitted for publication to the Journal of Supercomputing, 2010	

Outline	

6	

  Background	

  CnC Scheduling	

  Data Driven Futures	

  Results	

  Wrap up	

Dynamic Task Parallelism	

7	

  Properties	

  Over exposure of parallelism	

  Scales up/down with # of cores	

  Scheduling maps sets of tasks to threads at runtime	

  Habanero Java (HJ) employs:	

  Finish/async parallelism	

  Feeds child tasks through lexical scope	

  Work sharing/stealing runtime scheduling	

CnC concepts	

8	

Collection	

Graphical 	

Notation	

Textual	

Notation	

Step	

 (SomeStep)	

Item	

 [SomeItem]	

Tag	

 	

[SomeItem]	

	

(SomeStep)	

  Step	

  Computation abstraction	

  Side effect free	

  Functional w.r.t. input	

  Special step: Environment	

  Item	

  Dynamic single 
assignment	

  Value not storage	

  Tag	

  Data tag to index items	

  Control tag to index steps	

Concurrent Collections model	

9	

  Can be classified as:	

  Declarative	

  Deterministic	

  Dynamic single assignment	

  Macro-dataflow	

  Coordination language	

  Goal: consider only semantic ordering constraints	

  Inherent in the application not the implementation	

  Will be described by the CnC graph	

Example Program Specification 
10	

“aaa”	

“ff”	

“qqq”	

“mmmmmmm”	

“aaa”	

“qqq”	

“mmmmmmm”	

Input 
string	

Sequences of repeated 
characters	

Filtered 
sequences	

  Break up an input string	

  Sequences of repeated single characters	

  Filter allowing only	

  Sequences of odd length	

“aaaffqqqmmmmmmm”	

11	

[input: 1] = “aaaffqqqmmmmmmm”	

[input: 2] = “rrhhhhxxx”	

…	

[input: j]	

 [results: j, s]	

(createSpan: j)	

 (processSpan: j, s)	

	

 	

[span: j, s]	

	

	

…	

[span: 1, 1] = “aaa”	

[span: 1, 2] = “ff”	

[span: 1, 3] = “qqq”	

[span: 1, 4] = “mmmmmmm”	

[results: 1, 1] = “aaa”	

[results: 1, 3] = “qqq”	

[results: 1, 4] = “mmmmmmm”	

	

CnC Implementation of Example 
Program	

	

	

	

Collection	

 Graphical 	

 Textual	

Step	

 ( … )	

Item	

 [ … ]	

Tag	

 < … >	

[ ]	

( )	

<>	

CnC-Habanero Java build model	

12	

Code to invoke the graph	

Code to put initial values in graph	

Code to implement abstract steps	

Abstract classes for all steps	

Definitions for all collections	

Graph definition and initialization	

Concurrent 
Collections 
Library	

Habanero Java 
Runtime 
Library	

Concurrent Collections 
Textual Graph	

Habanero Java 
source files	

Habanero Java 
source files	

.class files	

JAR Builder	

Java application	

User specified	

CnC compiler	

CnC Translator	

CnC Components	

Outline	

13	

  Background	

  CnC Scheduling	

  Data Driven Futures	

  Results	

  Wrap up	

CnC Scheduling Challenges	

14	

  Control & data dependences are first level constructs	

  Task parallel frameworks have them coupled	

  Step instances have multiple predecessors	

  Need to wait for all predecessors	

  Layered readiness concepts	

  Control dependence satisfied	

  Data dependence satisfied	

  Schedulable / Ready	

Eager scheduling	

15	

  Assume control dependence satisfaction is readiness	

  Conforms to task parallel runtime assumption	

  Wait till data dependences satisfaction for safety	

  Block on data prematurely tried to be read	

  Discard task reading prematurely, replay when data arrive	

  Use Java wait/notify for premature data access	

  Blocking granularity	

  Instance level vs Collection level	

  Blocked task blocks whole thread	

  Deadlock possibility	

  Need to create more threads as threads block	

Blocking Eager CnC Schedulers	

16	

ThreadΕ	

step1	

Get (data-tagγ)	

ItemCollectionΘ	

data-tagα	

data-tagβ	

 valueβ	

valueα	

wait	

ThreadΔ	

step2	

Put(data-tagγ,valueγ)	

notify	

time	

data-tagγ	

 placeHolderγ	

valueγ	

  Alternative eager scheduling	

  Blocking scheduler suffers from	

  Expensive recovery from premature read	

  Blocks whole thread	

  Creates new thread	

  Switch context to the new thread on every failure	

  Inform item instance on failed task and discard task	

  Throw an exception to unwind failed task	

  Catch by runtime and continue with another ready task	

  Recreate task when needed item arrives	

Data Driven Rollback & Replay	

17	

Data Driven Rollback & Replay	

18	

ThreadΕ	

step1	

Get (data-tagγ)	

ItemCollectionΘ	

data-tagα	

data-tagβ	

 valueβ	

valueα	

ThreadΔ	

step2	

Put(data-tagγ,valueγ)	

waitlistα	

waitlistβ	

data-tagγ	

 empty	

 waitlistγ	

Insert step1 to 
waitlistγ 	

Throw exception to 
unwind	

step3	

Recreate steps on 
waitlistγ  	

step1	

Get (data-tagγ)	

Get (data-tagδ)	

valueγ	

  Do not create tasks until data dependences satisfied	

  No failure, no recovery	

  Restrict computation frontier to ready tasks	

  Evaluation of data readiness condition	

  Busy waiting on data (delayed async scheduling)	

  Dataflow like readiness (data driven scheduling)	

  Register tasks on data	

  Data notifies consumer tasks when created	

Data Driven Scheduling	

19	

  Delayed async handling for work stealing scheduler	

  Guarded execution construct for HJ	

  Promote to async when guard evaluates to true	

Delayed Asyncs	

20	

Work Sharing Ready Task Queue	

asyncA	

asyncB	

asyncZ	

Delayed?	

Pop a Task	

Assign to thread	

Yes	

No	

Evaluate guard	

Is true?	

Yes	

Requeue	

No	

  Every CnC step is a guarded execution	

  Guard condition is the availability of items to consume	

  Task still created eagerly when provided control	

  Promotes to ready when data provided	

Delayed Async Scheduling	

21	

Outline	

22	

  Background	

  CnC Scheduling	

  Data Driven Futures	

  Results	

  Wrap up	

  Task parallel synchronization construct 	

  Acts as a reference to single assignment value	

  Creation	

  Create a dangling reference object	

  Resolution ( Put )	

  Resolve what value a DDF is referring to	

  Registration ( Await )	

  A task provides a consume list of DDFs on declaration	

  A task can only read DDFs that it is registered to	

Data Driven Futures (DDFs)	

23	

Data Driven Futures (DDFs)	

24	

DataDrivenFuture leftChild = new DataDrivenFuture();	

DataDrivenFuture rightChild = new DataDrivenFuture();	

finish {	

    async leftChild.put(leftChildCreator());	

    async rightChild.put(rightChildCreator());	

    async await ( leftChild ) useLeftChild(leftChild);	

    async await ( rightChild ) useRightChild(rightChild);	

    async await ( leftChild, rightChild ) 	

	

 	

 	

useBothChildren( leftChild, rightChild );	

}	

Contributions of DDFs	

  Non-series-parallel task 
dependency graphs 
support	

  Memory footprint 
reduction	

  Exposes only ready parts of 
the execution frontier	

  Not global lifetime	

  Creator: 	

  feeds consumers	

  gives access to producer	

  Lifetime restricted to	

  Creator lifetime	

  Resolver lifetime	

  Consumers lifetimes	

  Can be garbage collected 
on a managed runtime	

25	

TaskA	

TaskC	

TaskD	

TaskB	

TaskE	

TaskF	

Data Driven Scheduling	

26	

  Steps register self to items wrapped into DDFs	

TaskM	

PlaceHolderβ	

DDFβ	

DDFα	

TaskN	

DDFα	

Valueα	

DDFβ	

 DDFδ	

✕	

DDFδ	

✕	

create DDFα, DDFβ,  DDFδ 	

TaskC	

create TaskA resolving DDFα	

create TaskM reading DDFα, DDFβ	

create TaskN reading DDFβ,  DDFδ 	

PlaceHolderα	

PlaceHolderδ	

✕	

 DDFβ	

TaskA	

resolve DDFα 	

create TaskD resolving DDFδ	

 TaskD	

resolve DDFδ 	

create TaskB resolving DDFβ	

Valueδ	

TaskA	

ready queue	

TaskD	

 TaskB	

TaskB	

r solve DDFβ 	

Valueβ	

TaskM	

 TaskN	

Outline	

27	

  Background	

  CnC Scheduling	

  Data Driven Futures	

  Results	

  Wrap up	

Performance Evaluation Legend	

28	

  Coarse Grain Blocking	

  Eager blocking scheduling on item collections for CnC-HJ	

  Fine Grain Blocking	

  Eager blocking scheduling on item instances for CnC-HJ	

  Delayed Async	

  Data Driven scheduling via HJ delayed asyncs for CnC-HJ	

  Data Driven Rollback & Replay	

  Eager scheduling with replay and notifications for CnC-HJ	

  Data Driven Futures	

  Hand coded CnC application equivalent on HJ with DDFs	

Cholesky Decomposition Introduction	

29	

  Dense linear algebra kernel	

  Three inherent kernels	

  Need to be pipelined for best performance	

  Loop parallelism within some kernels	

  Data parallelism within some kernels	

  CnC shown to beat optimized libraries, like IntelMKL	

Cholesky Decomposition on Xeon	

30	

7,861	

 7,730	

8,818	

 8,789	

7,199	

1,890	

949	

593	

 663	

 578	

0	

1,000	

2,000	

3,000	

4,000	

5,000	

6,000	

7,000	

8,000	

9,000	

Coarse Grain 
Blocking	

Fine Grain 
Blocking	

Delayed Async	

 Data Driven 
Rollback&Replay	

Data Driven 
Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 16-Workers	

✕12.4	

Minimum execution times of 30 runs of single threaded and 16-threaded executions for 
blocked Cholesky decomposition CnC application with Habanero-Java steps on Xeon 
with input matrix size 2000 × 2000 and with tile size 125 × 125	

✕14.8	

 ✕13.2	

Cholesky Decomposition on Xeon	

31	

10,081	

 10,010	

 10,305	

 10,309	

8,748	

2,472	

1,197	

 979	

 853	

 790	

0	

2,000	

4,000	

6,000	

8,000	

10,000	

12,000	

Coarse Grain 
Blocking	

Fine Grain 
Blocking	

Delayed Async	

 Data Driven 
Rollback&Replay	

Data Driven 
Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 16-Workers	

Average execution times and 90% confidence interval of 30 runs of single threaded and 16-
threaded executions for blocked Cholesky decomposition CnC application with Habanero-
Java steps on Xeon with input matrix size 2000 × 2000 and with tile size 125 × 125	

✕10.5	

 ✕12.1	

 ✕11.1	

Cholesky Decomposition with MKL on Xeon	

32	

1,782	

 1,766	

1,834	

 1,821	

1,723	

484	

187	

 156	

 157	

 134	

0	

200	

400	

600	

800	

1,000	

1,200	

1,400	

1,600	

1,800	

2,000	

Coarse Grain 
Blocking	

Fine Grain 
Blocking	

Delayed Async	

 Data Driven 
Rollback&Replay	

Data Driven 
Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 16-Workers	

Minimum execution times of 30 runs of single threaded and 16-threaded executions for 
blocked Cholesky decomposition CnC application with Habanero-Java and Intel MKL 
steps on Xeon with input matrix size 2000 × 2000 and with tile size 125 × 125	

✕11.7	

 ✕11.5	

 ✕12.8	

Cholesky Decomposition with MKL on Xeon	

33	

1,999	

 1,983	

 1,993	

 1,949	

1,775	

616	

250	

 231	

 194	

 156	

0	

200	

400	

600	

800	

1,000	

1,200	

1,400	

1,600	

1,800	

2,000	

2,200	

Coarse Grain 
Blocking	

Fine Grain 
Blocking	

Delayed Async	

 Data Driven 
Rollback&Replay	

Data Driven 
Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 16-Workers	

Average execution times and 90% confidence interval of 30 runs of single threaded and 16-
threaded executions for blocked Cholesky decomposition CnC application with Habanero Java 
and Intel MKL steps on Xeon with input matrix size 2000 × 2000 and with tile size 125 × 125	

✕8.63	

 ✕10.1	

 ✕11.4	

Cholesky Decomposition on UltraSPARC T2	

34	

106,883	

103,631	

 100,573	

 103,297	

96,587	

5,489	

 5,388	

 5,651	

 5,259	

 4,950	

0	

10,000	

20,000	

30,000	

40,000	

50,000	

60,000	

70,000	

80,000	

90,000	

100,000	

110,000	

120,000	

Coarse Grain 
Blocking	

Fine Grain 
Blocking	

Delayed Async	

 Data Driven 
Rollback&Replay	

Data Driven 
Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 64-Workers	

Minimum execution times of 30 runs of single threaded and 64-threaded executions for 
blocked Cholesky decomposition CnC application with Habanero-Java steps on 
UltraSPARC T2 with input matrix size 2000 × 2000 and with tile size 125 × 125	

✕17.8	

 ✕19.6	

 ✕19.5	

Cholesky Decomposition on UltraSPARC T2	

35	

111,204	

 108,863	

104,185	

 106,695	

99,032	

7,035	

 6,993	

 6,958	

 6,339	

 5,681	

0	

10,000	

20,000	

30,000	

40,000	

50,000	

60,000	

70,000	

80,000	

90,000	

100,000	

110,000	

120,000	

Coarse Grain 
Blocking	

Fine Grain 
Blocking	

Delayed Async	

 Data Driven 
Rollback&Replay	

Data Driven 
Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 64-Workers	

Average execution times and 90% confidence interval of 30 runs of single threaded and 64-
threaded executions for blocked Cholesky decomposition CnC application with Habanero-
Java steps on UltraSPARC T2 with input matrix size 2000 × 2000 and with tile size 125 × 125	

✕15.0	

 ✕16.8	

 ✕17.4	

Black-Scholes formula	

36	

  Only one step	

  The Black-Scholes formula	

  Embarrassingly parallel	

  Good indicator of scheduling overhead	

Black-Scholes on Xeon	

37	

33,688	

 33,762	

 34,214	

 33,788	

 34,634	

4,148	

 4,155	

 4,216	

 4,145	

2,164	

0	

5,000	

10,000	

15,000	

20,000	

25,000	

30,000	

35,000	

40,000	

Coarse Grain 
Blocking	

Fine Grain 
Blocking	

Delayed Async	

 Data Driven 
Rollback&Replay	

Data Driven 
Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 16-Workers	

Minimum execution times of 30 runs of single threaded and 16-threaded executions for 
blocked Black-Scholes CnC application with Habanero-Java steps on Xeon with input size 
1,000,000 and with tile size 62,500	

✕8.1	

 ✕8.1	

 ✕16.0	

Black-Scholes on Xeon	

38	

33,871	

 33,966	

 34,311	

 34,121	

 34,729	

4,300	

 4,309	

 4,279	

5,061	

2,353	

0	

5,000	

10,000	

15,000	

20,000	

25,000	

30,000	

35,000	

40,000	

Coarse Grain 
Blocking	

Fine Grain 
Blocking	

Delayed Async	

 Data Driven 
Rollback&Replay	

Data Driven 
Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 16-Workers	

Average execution times and 90% confidence interval of 30 runs of single threaded and 
16-threaded executions for blocked Black-Scholes CnC application with Habanero-Java 
steps on Xeon with input size 1,000,000 and with tile size 62,500	

✕8.0	

 ✕6.7	

 ✕14.7	

Black-Scholes on UltraSPARC T2	

39	

164,252	

 166,342	

179,427	

 181,089	

163,639	

6,032	

 5,944	

 6,006	

 5,973	

 4,019	

0	

20,000	

40,000	

60,000	

80,000	

100,000	

120,000	

140,000	

160,000	

180,000	

200,000	

Coarse Grain 
Blocking	

Fine Grain 
Blocking	

Delayed Async	

 Data Driven 
Rollback&Replay	

Data Driven 
Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 64-Workers	

Minimum execution times of 30 runs of single threaded and 64-threaded executions for 
blocked Black-Scholes CnC application with Habanero-Java steps on UltraSPARC T2 with 
input size 1,000,000 and with tile size 15,625	

✕29.9	

 ✕30.3	

 ✕40.7	

✕27.2	

 ✕28.0	

Black-Scholes on UltraSPARC T2	

40	

165,473	

207,079	

 212,870	

 215,636	

164,545	

6,179	

 6,159	

 6,149	

 6,145	

 4,296	

0	

50,000	

100,000	

150,000	

200,000	

250,000	

Coarse Grain 
Blocking	

Fine Grain 
Blocking	

Delayed Async	

 Data Driven 
Rollback&Replay	

Data Driven 
Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 64-Workers	

Average execution times and 90% confidence interval of 30 runs of single threaded and 
64-threaded executions for blocked Black-Scholes CnC application with Habanero-Java 
steps on UltraSPARC T2 with input size 1,000,000 and with tile size 15,625	

✕34.6	

 ✕35.1	

 ✕38.3	

✕26.8	

 ✕33.6	

Rician Denoising	

41	

  Image processing algorithm	

  More than 4 kernels	

  Mostly stencil computations	

  Non trivial dependency graph	

  Fixed point algorithm	

  Enormous data size	

  CnC schedulers needed explicit memory management	

  DDFs took advantage of garbage collection	

Rician Denoising on Xeon	

42	

470,394	

495,089	

459,328	

345,531	

78,630	

56,892	

 51,632	

 52,208	

0	

100,000	

200,000	

300,000	

400,000	

500,000	

600,000	

Coarse Grain Blocking *	

Fine Grain Blocking *	

 Delayed Async *	

 Data Driven Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 16-Workers	

Minimum execution times of 30 runs of single threaded and 16-threaded executions for 
blocked Rician Denoising CnC application with Habanero-Java steps on Xeon with input 
image size 2937 × 3872 and with tile size 267 × 484	

✕8.7	

 ✕8.9	

 ✕6.6	

Rician Denoising on Xeon	

43	

498,776	

 499,666	

 483,770	

349,051	

81,502	

58,313	

 53,569	

 53,817	

0	

100,000	

200,000	

300,000	

400,000	

500,000	

600,000	

Coarse Grain Blocking *	

Fine Grain Blocking *	

 Delayed Async *	

 Data Driven Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 16-Workers	

Average execution times and 90% confidence interval of 30 runs of single threaded and 
16-threaded executions for blocked Rician Denoising CnC application with Habanero-Java 
steps on Xeon with input image size 2937 × 3872 and with tile size 267 × 484	

✕8.6	

 ✕9.0	

 ✕6.5	

Rician Denoising on UltraSPARC T2	

44	

1,979,340	

 1,932,078	

 1,932,488	

1,273,583	

189,444	

 188,134	

80,458	

 56,621	

0	

500,000	

1,000,000	

1,500,000	

2,000,000	

2,500,000	

Coarse Grain Blocking *	

Fine Grain Blocking *	

 Delayed Async *	

 Data Driven Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 64-Workers	

Minimum execution times of 30 runs of single threaded and 64-threaded executions for 
blocked Rician Denoising CnC application with Habanero-Java steps on UltraSPARC T2 
with input image size 2937 × 3872 and with tile size 267 × 484	

✕10.3	

 ✕24.0	

 ✕22.5	

Rician Denoising on UltraSPARC T2	

45	

1,988,894	

 1,939,976	

 1,943,861	

1,282,031	

192,451	

 190,600	

81,707	

 58,017	

0	

500,000	

1,000,000	

1,500,000	

2,000,000	

2,500,000	

Coarse Grain Blocking *	

Fine Grain Blocking *	

 Delayed Async *	

 Data Driven Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 64-Workers	

Average execution times and 90% confidence interval of 30 runs of single threaded and 
64-threaded executions for blocked Rician Denoising CnC application with Habanero-Java 
steps on UltraSPARC T2 with input image size 2937 × 3872 and with tile size 267 × 484	

✕10.2	

 ✕23.8	

 ✕22.1	

Heart Wall Tracking	

  Medical imaging application	

  Nested kernels	

  First level embarrassingly parallel	

  Second level with intricate dependency graph	

  Memory management	

  Many failures on eager schedulers	

  Blocking schedulers ran out of memory	

Heart Wall Tracking on Xeon	

47	

162,248	

 157,554	

 156,159	

47,989	

11,076	

 9,897	

0	

20,000	

40,000	

60,000	

80,000	

100,000	

120,000	

140,000	

160,000	

180,000	

Delayed Async	

 Data Driven Rollback&Replay	

 Data Driven Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 16-Workers	

Minimum execution times of 13 runs of single threaded and 16-threaded executions for 
Heart Wall Tracking CnC application with C steps on Xeon with 104 frames	

✕3.4	

 ✕14.2	

 ✕15.8	

Heart Wall Tracking on Xeon	

48	

164,806	

158,224	

 156,635	

50,287	

11,351	

 10,097	

0	

20,000	

40,000	

60,000	

80,000	

100,000	

120,000	

140,000	

160,000	

180,000	

Delayed Async	

 Data Driven Rollback&Replay	

 Data Driven Futures	

E
xe
cu
ti
o
n
 i
n
 m
il
li
-s
e
cs
	

1-Worker	

 16-Workers	

Average execution times and 90% confidence interval of 13 runs of single threaded and 
16-threaded executions for Heart Wall Tracking CnC application with C steps on Xeon 
with 104 frames	

✕3.3	

 ✕13.9	

 ✕15.5	

Outline	

49	

  Background	

  CnC Scheduling	

  Data Driven Futures	

  Results	

  Wrap up	

Related work	

50	

  Alternative parallel programming models:	

  Either too verbose or constrained parallelism	

  Alternative futures, promises	

  Creation and resolution are coupled	

  Either lazy or blocking execution semantics	

  Support for unstructured parallelism	

  Nabbit library for Cilk++ allows arbitrary task graphs	

  Immediate successor atomic counter update for notification	

  Does not differentiate between data, control dependences	

Conclusions	

51	

  Macro-dataflow is a viable parallelism model	

  Provides expressiveness hiding parallelism concerns	

  Macro-dataflow can perform competitively	

  Taking advantage of modern task parallel models	

Future Work	

52	

  Compiling CnC to the Data Driven Runtime	

  Currently hand-ported	

  Need finer grain dependency analysis via tag functions	

  Data Driven Future support for Work Stealing	

  Compiler support for automatic DDF registration	

  Hierarchical DDFs	

  Locality aware scheduling support for DDFs	

Acknowledgments	

53	

  Committee	

  Zoran Budimlić, Keith D. Cooper, Vivek Sarkar, Lin Zhong	

  Journal of Supercomputing co-authors	

  Zoran Budimlić, Michael Burke, Vincent Cavé, Kathleen Knobe, Geoff P. Lowney, Ryan R. 
Newton, Jens Palsberg, David Peixotto, Vivek Sarkar, Frank Schlimbach 
  Habanero multicore software research project team-members	

  Zoran Budimlić, Vincent Cavé, Philippe Charles, Vivek Sarkar, Alina Simion Sbîrlea, 
Dragoș Sbîrlea, Jisheng Zhao 
  Intel Technology Pathfinding and Innovation Software and Services Group	

  Mark Hampton, Kathleen Knobe, Geoff P. Lowney, Ryan R. Newton, Frank Schlimbach 
  Benchmarks	

  Aparna Chandramowlishwaran (Georgia Tech.), Zoran Budimlić(Rice) for Cholesky Decomposition 
  Yu-Ting Chen (UCLA) for Rician Denoising 
  David Peixotto (Rice) for Black-Scholes Formula 
  Alina Simion Sbîrlea (Rice) for Heart Wall Tracking  
Feedback and clarifications	

54	

  Thanks for your attention