Wednesday, October 28, 2009

[Lecture] The Future of Computing is Parallel

Tsinghua Overseas Distinguished Scholars Lecture Series, No. 58: Prof. Bill Dally's lecture at Tsinghua University

 

Talk title: The Future of Computing is Parallel

Prof. Bill Dally

Fellow of the American Academy of Arts & Sciences

Chairman of the Computer Science Department at Stanford University

Bell Professor of Engineering, Stanford University

Chief Scientist & Sr. VP of Research, NVIDIA



------------------------------------------------------------------
Moore's Law
more transistors
L^3 power scaling
no performance prediction

Transistor-> Processor->Value
value chain broken for serial computers

Turning more transistors into value.
ILP not in programs

More power is spent moving data.
[Gordon Moore ISSCC 2003]

The energy is not limited by Floating Point Unit.
Chips are power limited.

Performance = Parallelism
Efficiency = Locality

Amdahl's Law doesn't apply to most future applications.
We need:
1. Many efficient processors
2. An exposed storage hierarchy (locality)
3. A programming system that abstracts this
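
As background for the Amdahl's Law remark above: the law bounds the speedup of a program with parallel fraction p on n processors at 1 / ((1 - p) + p / n). The sketch below is ours, not from the talk; the function name `amdahl_speedup` and the sample fractions are illustrative assumptions. It shows that the bound only bites when the serial fraction is significant, which is presumably what the talk means when it says the law does not constrain most future applications.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Upper bound on speedup from Amdahl's Law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # 2400 is the core count quoted later in these notes; the parallel
    # fractions are made-up sample values, not measurements.
    for p in (0.5, 0.9, 0.99, 0.9999):
        bound = amdahl_speedup(p, 2400)
        print(f"p = {p}: speedup on 2400 cores is at most {bound:.1f}")
```

With a serial fraction of 10% the bound stays below 10 no matter how many cores are added; with a nearly fully parallel workload it approaches the core count.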

NASA application
domain expert: 27-169 times performance

Data Movement
scarce resource:
on-chip storage
off-chip bandwidth

Fermi - throughput computing

Avoid Denial Architecture
Single thread processors
serial execution
- Denies parallelism
Flat Memory
- Denies locality
These illusions inhibit performance and efficiency

CUDA abstracts the GPU architecture

Throughput computing must evolve to meet the challenges of Exascale Computing

DARPA Study
Four challenges
Energy and Power
Memory and Storage
Concurrency and Locality
Resiliency
#1 is power

Energy
Heterogeneous architecture
(right core for right job)
Efficient processor
Agile memory system
(keep data and instruction access local)
Optimized
(minimize energy/op)
(minimize data movement)

Locality Challenge
automatically(as a cache) or explicitly(scratchpad)
move computation to data
-fast active messages
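
To make the "explicitly (scratchpad)" option concrete, here is a minimal sketch, not taken from the talk, that counts main-memory loads for a naive matrix multiply versus a tiled one that stages operands into a small local buffer first. The function names and the cost model (only staging copies count as loads, reads from the local buffer are free) are simplifying assumptions for illustration.

```python
def matmul_naive(a, b, n):
    """Multiply n x n matrices, counting every operand read as a memory load."""
    loads = 0
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i][k] * b[k][j]
                loads += 2  # one word of a and one word of b, no reuse
            c[i][j] = acc
    return c, loads


def matmul_tiled(a, b, n, tile):
    """Same product, with tiles staged into a hypothetical scratchpad.

    Only the staging copies are counted as memory loads; reads from the
    staged tiles are treated as free.
    """
    if n % tile != 0:
        raise ValueError("tile must divide n in this simplified sketch")
    loads = 0
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Explicitly copy one tile of a and one tile of b.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                loads += 2 * tile * tile
                for i in range(tile):
                    for j in range(tile):
                        acc = 0.0
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        c[i0 + i][j0 + j] += acc
    return c, loads


if __name__ == "__main__":
    n = 8
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float((i * j) % 5) for j in range(n)] for i in range(n)]
    _, naive_loads = matmul_naive(a, b, n)
    _, tiled_loads = matmul_tiled(a, b, n, tile=4)
    print(naive_loads, tiled_loads)
```

Under this cost model the naive version performs 2n^3 loads and the tiled version 2n^3 / tile, so a larger tile that still fits in local storage reduces data movement proportionally.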

An NVIDIA ExaScale Machine in 2017
* GPU Node ~300W
- 2,400 throughput cores
- 40 TFLOPS (SP), 13 TFLOPS (DP)
- Deep explicit on-chip storage hierarchy
* Node Memory
- 128 GB RAM
- 512 GB Phase-Change Memory for checkpoint and scratch
* Cabinet - 100KW
- Dragonfly network
* System - 10MW
- Dragonfly with optical links
* RAS
- ECC on all memory and links
- self-checking and application-level checking
- Fast local checkpoint
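
The figures above invite some back-of-the-envelope arithmetic. The sketch below uses only the numbers in these notes plus the definition of an exaflop (10^18 floating-point operations per second). The notes do not say which precision the exascale target refers to, how many nodes sit in a cabinet, or how much power goes to memory and network, so those are not modeled; the helper names are ours.

```python
import math

# Figures copied from the notes above.
NODE_POWER_W = 300.0      # GPU node, approximate
NODE_FLOPS_SP = 40e12     # single precision
NODE_FLOPS_DP = 13e12     # double precision
CABINET_POWER_W = 100e3
SYSTEM_POWER_W = 10e6

EXAFLOPS = 1e18  # definition: 10^18 floating-point operations per second


def nodes_for_target(target_flops, node_flops):
    """Smallest whole number of nodes whose peak rate reaches the target."""
    return math.ceil(target_flops / node_flops)


def gpu_power_mw(nodes):
    """Power drawn by the GPU nodes alone, in megawatts."""
    return nodes * NODE_POWER_W / 1e6


def picojoules_per_flop(node_flops):
    """Energy per operation at peak rate, in picojoules."""
    return NODE_POWER_W / node_flops * 1e12


if __name__ == "__main__":
    print("cabinets at full power:", SYSTEM_POWER_W / CABINET_POWER_W)
    for label, rate in (("SP", NODE_FLOPS_SP), ("DP", NODE_FLOPS_DP)):
        nodes = nodes_for_target(EXAFLOPS, rate)
        print(label, "nodes for 1 exaflop:", nodes)
        print(label, "GPU power (MW):", round(gpu_power_mw(nodes), 2))
        print(label, "pJ per flop:", round(picojoules_per_flop(rate), 2))
```

At the quoted single-precision rate, one exaflop of peak throughput takes 25,000 nodes drawing 7.5 MW for the GPUs alone, which is the kind of budget the 10 MW system figure implies.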

Conclusion
* Single thread performance is no longer scaling
* Performance = Parallelism
* Efficiency = Locality
* Applications have lots of both
* Machines need lots of cores (parallelism) and an exposed storage hierarchy (locality)
* A programming system must abstract this
* Reaching ExaScale requires evolving throughput computing.
- Agile memory
- Energy efficient cores and communication
- Efficient parallel mechanism

Q&A
Q: throughput
A: how many problems/time

Q(lhw):
What do you think about the future of dataflow programming model in the parallel computing era?
A:
does not exploit locality well
