Tsinghua Overseas Distinguished Lecture Series, Lecture 58: Prof. Bill Dally's talk at Tsinghua University
Talk title: The Future of Computing is Parallel
Prof. Bill Dally:
Fellow of the American Academy of Arts & Sciences
Chairman of the Computer Science Department at Stanford University
Bell Professor of Engineering, Stanford University
Chief Scientist & Sr. VP of Research, NVIDIA
------------------------------------------------------------------
Moore's Law
more transistors
L^3 power scaling
no performance prediction
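One way to read the "L^3" note, assuming it refers to classic Dennard scaling (the notes do not say so explicitly), is as the scaling of switching energy with feature size L. In that model both gate capacitance C and supply voltage V shrink in proportion to L:

```latex
C \propto L, \qquad V \propto L
\quad\Longrightarrow\quad
E_{\text{switch}} \propto C V^{2} \propto L \cdot L^{2} = L^{3}
```

Under this reading, each process generation made a switching event cheaper by the cube of the linear shrink. That benefit depends on the supply voltage continuing to scale with L.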
Transistor -> Processor -> Value
value chain broken for serial computers
The problem: turning more transistors into value.
ILP not in programs
More power is spent moving data.
[Gordon Moore ISSCC 2003]
The energy is not limited by Floating Point Unit.
Chips are power limited.
Performance = Parallelism
Efficiency = Locality
Amdahl's Law doesn't apply to most future applications.
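For reference, Amdahl's Law bounds the speedup on n processors of a program with serial fraction s by 1 / (s + (1 - s) / n). The sketch below is illustrative only: the function name and the sample serial fractions are assumptions, not figures from the talk. It suggests why the law stops being a practical limit when the serial fraction is very small, which is presumably the sense of the claim above.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Upper bound on speedup under Amdahl's Law."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be >= 1")
    parallel_fraction = 1.0 - serial_fraction
    # The serial part runs at full cost; the parallel part is divided
    # evenly across the processors.
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Hypothetical serial fractions, chosen only to show the trend.
    for s in (0.1, 0.01, 0.0001):
        bound = amdahl_speedup(s, 2400)
        print(f"serial fraction {s}: speedup bound on 2400 cores = {bound:.1f}")
```

With a serial fraction of 0.1 the bound stays below 10 however many cores are added. As the serial fraction approaches zero the bound approaches the core count.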
We need:
1. Many efficient processors
2. An exposed storage hierarchy(locality)
3. A programming system that abstracts this
NASA application
domain expert: 27-169 times performance
Data Movement
scarce resource:
on-chip storage
off-chip bandwidth
Fermi - throughput computing
Avoid Denial Architecture
Single thread processors
serial execution
- Denies parallelism
Flat Memory
- Denies locality
These illusions inhibit performance and efficiency
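The "flat memory denies locality" point can be illustrated with a small sketch. It assumes a hypothetical two-level model in which every operand read from the big matrices counts as one off-chip load, while reads from a staged tile are free. The function names, the load counter, and the tile size are illustrative choices, not material from the talk.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul_flat(a: Matrix, b: Matrix) -> Tuple[Matrix, int]:
    """Flat-memory view: every operand is fetched again for every use."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    loads = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
                loads += 2  # one element of a, one element of b
    return c, loads


def matmul_tiled(a: Matrix, b: Matrix, tile: int) -> Tuple[Matrix, int]:
    """Explicit-locality view: stage tiles once, then reuse them locally."""
    n = len(a)
    if n % tile != 0:
        raise ValueError("matrix size must be a multiple of the tile size")
    c = [[0.0] * n for _ in range(n)]
    loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Copy one tile of each input into the modelled scratchpad.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                loads += 2 * tile * tile
                for i in range(tile):
                    for j in range(tile):
                        acc = 0.0
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        c[i0 + i][j0 + j] += acc
    return c, loads
```

In this model the tiled version issues a factor of `tile` fewer off-chip loads for the same result, because each staged element is reused `tile` times before it is discarded.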
CUDA abstracts the GPU architecture
Throughput computing must evolve to meet the challenges of Exascale Computing
DARPA Study
Four challenges
Energy and Power
Memory and Storage
Concurrency and Locality
Resiliency
#1 is power
Energy
Heterogeneous architecture
(right core for right job)
Efficient processor
(keep data and instruction access local)
Optimized (minimize energy/op)
Agile memory system
(minimize data movement)
Locality Challenge
automatically(as a cache) or explicitly(scratchpad)
move computation to data
-fast active messages
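The "move computation to data" idea can be sketched as a toy in-process simulation. Everything here is hypothetical: the `Node` class, its methods, and the "values moved" counter are illustrative and do not model a real active-message layer, where a message carries a handler reference that runs asynchronously on arrival.

```python
from typing import Callable, List, Tuple


class Node:
    """Hypothetical compute node that owns one partition of the data."""

    def __init__(self, data: List[float]) -> None:
        self.data = data

    def fetch(self) -> List[float]:
        # Data-to-computation: the whole partition crosses the network.
        return list(self.data)

    def run(self, handler: Callable[[List[float]], float]) -> float:
        # Computation-to-data: the handler executes where the data lives
        # and only its scalar result travels back.
        return handler(self.data)


def sum_by_fetching(nodes: List[Node]) -> Tuple[float, int]:
    """Pull every partition to the caller; count the values moved."""
    total, moved = 0.0, 0
    for node in nodes:
        part = node.fetch()
        moved += len(part)
        total += sum(part)
    return total, moved


def sum_by_active_message(nodes: List[Node]) -> Tuple[float, int]:
    """Send the reduction to each node; count the values moved."""
    total, moved = 0.0, 0
    for node in nodes:
        total += node.run(sum)
        moved += 1  # one partial sum per node
    return total, moved
```

Both functions return the same total, but the second moves one value per node instead of one value per element, which is the saving the slide points at.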
An NVIDIA ExaScale Machine in 2017
* GPU Node ~300W
- 2,400 throughput cores
- 40 TFLOPS (SP), 13 TFLOPS (DP)
- Deep explicit on-chip storage hierarchy
* Node Memory
- 128 GB RAM
- 512 GB Phase-Change Memory for checkpoint and scratch
* Cabinet - 100 kW
- Dragonfly network
* System - 10 MW
- Dragonfly with optical links
* RAS
- ECC on all memory and links
- self-checking and application-level checking
- Fast local checkpoint
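A back-of-envelope check of the figures in the list above, using only the node, cabinet, and system numbers from these notes. The helper names are illustrative. The node counts are upper bounds that assume the whole power budget goes to GPU nodes, ignoring network, cooling, and other overheads.

```python
# Figures copied from the notes above.
NODE_POWER_W = 300.0       # GPU node, ~300 W
NODE_SP_FLOPS = 40e12      # 40 TFLOPS single precision
NODE_DP_FLOPS = 13e12      # 13 TFLOPS double precision
CABINET_POWER_W = 100e3    # cabinet, 100 kW
SYSTEM_POWER_W = 10e6      # system, 10 MW


def picojoules_per_flop(power_w: float, flops_per_s: float) -> float:
    """Energy per floating-point operation at peak rate, in picojoules."""
    return power_w / flops_per_s * 1e12


def max_nodes(power_budget_w: float, node_power_w: float) -> int:
    """Most nodes that fit in a power budget, ignoring all overheads."""
    return int(power_budget_w // node_power_w)


if __name__ == "__main__":
    print("pJ per SP flop:", picojoules_per_flop(NODE_POWER_W, NODE_SP_FLOPS))
    print("pJ per DP flop:", picojoules_per_flop(NODE_POWER_W, NODE_DP_FLOPS))
    print("nodes per cabinet (bound):", max_nodes(CABINET_POWER_W, NODE_POWER_W))
    system_nodes = max_nodes(SYSTEM_POWER_W, NODE_POWER_W)
    print("nodes per system (bound):", system_nodes)
    print("peak SP FLOPS (bound):", system_nodes * NODE_SP_FLOPS)
```

On these numbers a node spends about 7.5 pJ per single-precision operation, and a 10 MW budget admits at most about 33,000 such nodes, or roughly 1.3 EFLOPS single precision at peak.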
Conclusion
* Single thread performance is no longer scaling
* Performance = Parallelism
* Efficiency = Locality
* Applications have lots of both
* Machines need lots of cores (parallelism) and an exposed storage hierarchy (locality)
* A programming system must abstract this
* Reaching ExaScale requires evolving throughput computing:
- Agile memory
- Energy efficient cores and communication
- Efficient parallel mechanism
Q&A
Q: throughput
A: how many problems are solved per unit time
Q(lhw):
What do you think about the future of dataflow programming model in the parallel computing era?
A:
Dataflow does not exploit locality well.
1 comment:
Parallel programming is used specifically to serve working software developers, not just computer scientists. It is a complete, highly accessible pattern language that will help any experienced developer "think parallel" and start writing effective parallel code almost immediately. Instead of formal theory, it delivers proven solutions to the challenges faced by parallel programmers, and pragmatic guidance for using today's parallel APIs in the real world.