High Performance Parallelism Pearls Volume One - James Jeffers, James Reinders

High Performance Parallelism Pearls Volume One (eBook)

Multicore and Many-core Programming Approaches
eBook download: EPUB
2014 | 1st edition
600 pages
Elsevier Science (publisher)
978-0-12-802199-6 (ISBN)
High Performance Parallelism Pearls shows how to leverage parallelism on processors and coprocessors with the same programming model, illustrating the most effective ways to tap the computational potential of systems with Intel Xeon Phi coprocessors and Intel Xeon processors or other multicore processors. The book includes examples of successful programming efforts, drawn from across industries and domains such as chemistry, engineering, and environmental science. Each chapter in this edited work includes detailed explanations of the programming techniques used, while showing high-performance results on both Intel Xeon Phi coprocessors and multicore processors. Learn from dozens of new examples and case studies illustrating "success stories" that demonstrate not just the features of these powerful systems, but also how to leverage parallelism across these heterogeneous systems.
  • Promotes consistent standards-based programming, showing in detail how to code for high performance on multicore processors and the Intel® Xeon Phi™ coprocessor
  • Examples from multiple vertical domains illustrating parallel optimizations to modernize real-world codes
  • Source code available for download to facilitate further exploration

James Reinders is a senior engineer who joined Intel Corporation in 1989 and has contributed to projects including the world's first TeraFLOP supercomputer (ASCI Red), as well as compiler and architecture work for a number of Intel processors and parallel systems. James has been a driver behind the development of Intel as a major provider of software development products and serves as its chief software evangelist. He has published numerous articles, contributed to several books, and is widely interviewed on parallelism. He has managed software development groups, customer service and consulting teams, and business development and marketing teams. James is sought after to keynote on parallel programming and is the author or co-author of three books currently in print, including Structured Parallel Programming, published by Morgan Kaufmann in 2012.

Front Cover 1
High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches 4
Copyright 5
Contents 6
Contributors 16
Acknowledgments 40
Foreword 42
Humongous computing needs: Science years in the making 42
Open standards 42
Keen on many-core architecture 42
Xeon Phi is born: Many cores, excellent vector ISA 43
Learn highly scalable parallel programming 44
Future demands grow: Programming models matter 44
Preface 46
Inspired by 61 cores: A new era in programming 46
Chapter 1: Introduction 48
Learning from successful experiences 48
Code modernization 48
Modernize with concurrent algorithms 49
Modernize with vectorization and data locality 49
Understanding power usage 49
ISPC and OpenCL anyone? 49
Intel Xeon Phi coprocessor specific 50
Many-core, neo-heterogeneous 50
No “Xeon Phi” in the title, neo-heterogeneous programming 50
The future of many-core 51
Downloads 51
Chapter 2: From “Correct” to “Correct & Efficient”: A Hydro2D Case Study with Godunov’s Scheme
Scientific computing on contemporary computers 54
Modern computing environments 55
CEA’s Hydro2D 56
A numerical method for shock hydrodynamics 56
Euler’s equation 57
Godunov’s method 57
Where it fits 59
Features of modern architectures 60
Performance-oriented architecture 60
Programming tools and runtimes 61
Our computing environments 61
Paths to performance 62
Running Hydro2D 62
Hydro2D’s structure 62
Computation scheme 64
Data structures 64
Measuring performance 67
Optimizations 67
Memory usage 68
Thread-level parallelism 69
Arithmetic efficiency and instruction-level parallelism 77
Data-level parallelism 79
Summary 86
The coprocessor vs the processor 86
A rising tide lifts all boats 86
Performance strategies 88
Chapter 3: Better Concurrency and SIMD on HBM 90
The application: HIROMB-BOOS Model 90
Key usage: DMI 91
HBM execution profile 91
Overview for the optimization of HBM 92
Data structures: Locality done right 93
Thread parallelism in HBM 97
Data parallelism: SIMD vectorization 102
Trivial obstacles 102
Premature abstraction is the root of all evil 105
Results 108
Profiling details 109
Scaling on processor vs. coprocessor 109
Contiguous attribute 111
Summary 113
References 113
Chapter 4: Optimizing for Reacting Navier-Stokes Equations 116
Getting started 116
Version 1.0: Baseline 117
Version 2.0: ThreadBox 120
Version 3.0: Stack memory 124
Version 4.0: Blocking 124
Version 5.0: Vectorization 127
Intel Xeon Phi coprocessor results 130
Summary 131
Chapter 5: Plesiochronous Phasing Barriers 134
What can be done to improve the code? 136
What more can be done to improve the code? 138
Hyper-Thread Phalanx 138
What is nonoptimal about this strategy? 140
Coding the Hyper-Thread Phalanx 140
How to determine thread binding to core and HT within core? 141
The Hyper-Thread Phalanx hand-partitioning technique 142
A lesson learned 144
Back to work 146
Data alignment 146
Use aligned data when possible 147
Redundancy can be good for you 147
The plesiochronous phasing barrier 150
Let us do something to recover this wasted time 152
A few “left to the reader” possibilities 156
Xeon host performance improvements similar to Xeon Phi 157
Summary 162
Chapter 6: Parallel Evaluation of Fault Tree Expressions 164
Motivation and background 164
Expressions 164
Expression of choice: Fault trees 164
An application for fault trees: Ballistic simulation 165
Example implementation 165
Syntax and parsing results 166
Creating evaluation arrays 166
Evaluating the expression array 168
Using ispc for vectorization 168
Other considerations 173
Summary 175
Chapter 7: Deep-Learning Numerical Optimization 176
Fitting an objective function 176
Objective functions and principal components analysis 181
Software and example data 182
Training data 183
Runtime results 186
Scaling results 188
Summary 188
Chapter 8: Optimizing Gather/Scatter Patterns 190
Gather/scatter instructions in Intel® architecture 192
Gather/scatter patterns in molecular dynamics 192
Optimizing gather/scatter patterns 195
Improving temporal and spatial locality 195
Choosing an appropriate data layout: AoS versus SoA 197
On-the-fly transposition between AoS and SoA 198
Amortizing gather/scatter and transposition costs 201
Summary 203
Chapter 9: A Many-Core Implementation of the Direct N-Body Problem 206
N-Body simulations 206
Initial solution 206
Theoretical limit 209
Reduce the overheads, align your data 211
Optimize the memory hierarchy 214
Improving our tiling 217
What does all this mean to the host version? 219
Summary 221
Chapter 10: N-Body Methods 222
Fast N-body methods and direct N-body kernels 222
Applications of N-body methods 223
Direct N-body code 224
Performance results 226
Summary 229
Chapter 11: Dynamic Load Balancing Using OpenMP 4.0 232
Maximizing hardware usage 232
The N-Body kernel 234
The offloaded version 238
A first processor combined with coprocessor version 240
Version for processor with multiple coprocessors 243
Chapter 12: Concurrent Kernel Offloading 248
Setting the context 248
Motivating example: particle dynamics 249
Organization of this chapter 250
Concurrent kernels on the coprocessor 251
Coprocessor device partitioning and thread affinity 251
Offloading from OpenMP host program 252
Offloading from MPI host program 254
Case study: concurrent Intel MKL dgemm offloading 255
Persistent thread groups and affinities on the coprocessor 257
Concurrent data transfers 257
Case study: concurrent MKL dgemm offloading with data transfers 258
Force computation in PD using concurrent kernel offloading 260
Parallel force evaluation using Newton’s 3rd law 260
Implementation of the concurrent force computation 262
Performance evaluation: before and after 267
The bottom line 268
Chapter 13: Heterogeneous Computing with MPI 272
MPI in the modern clusters 272
MPI task location 273
Single-task hybrid programs 276
Selection of the DAPL providers 278
The first provider, ofa-v2-mlx4_0-1u 278
The second provider, ofa-v2-scif0, and the impact of the intra-node fabric 279
The last provider, also called the proxy 279
Hybrid application scalability 281
Load balance 283
Task and thread mapping 283
Summary 284
Acknowledgments 285
Chapter 14: Power Analysis on the Intel® Xeon Phi™ Coprocessor 286
Power analysis 101 286
Measuring power and temperature with software 288
Creating a power and temperature monitor script 290
Creating a power and temperature logger with the micsmc tool 290
Power analysis using IPMI 292
Hardware-based power analysis methods 293
A hardware-based coprocessor power analyzer 296
Summary 299
Chapter 15: Integrating Intel Xeon Phi Coprocessors into a Cluster Environment 302
Early explorations 302
Beacon system history 303
Beacon system architecture 303
Hardware 303
Software environment 303
Intel MPSS installation procedure 305
Preparing the system 305
Installation of the Intel MPSS stack 306
Generating and customizing configuration files 308
MPSS upgrade procedure 312
Setting up the resource and workload managers 312
Torque 312
Prologue 313
Epilogue 315
TORQUE/coprocessor integration 315
Moab 316
Improving network locality 316
Moab/coprocessor integration 316
Health checking and monitoring 316
Scripting common commands 318
User software environment 320
Future directions 321
Summary 322
Acknowledgments 322
Chapter 16: Supporting Cluster File Systems on Intel® Xeon Phi™ Coprocessors 324
Network configuration concepts and goals 325
A look at networking options 325
Steps to set up a cluster enabled coprocessor 327
Coprocessor file systems support 328
Support for NFS 329
Support for Lustre® file system 329
Support for Fraunhofer BeeGFS® (formerly FHGFS) file system 331
Support for Panasas® PanFS® file system 332
Choosing a cluster file system 332
Summary 332
Chapter 17: NWChem: Quantum Chemistry Simulations at Scale 334
Introduction 334
Overview of single-reference CC formalism 335
NWChem software architecture 338
Global Arrays 338
Tensor Contraction Engine 339
Engineering an offload solution 340
Offload architecture 344
Kernel optimizations 345
Performance evaluation 348
Summary 351
Acknowledgments 352
Chapter 18: Efficient Nested Parallelism on Large-Scale Systems 354
Motivation 354
The benchmark 354
Baseline benchmarking 356
Pipeline approach—flat_arena class 357
Intel® TBB user-managed task arenas 358
Hierarchical approach—hierarchical_arena class 360
Performance evaluation 361
Implication on NUMA architectures 363
Summary 364
Chapter 19: Performance Optimization of Black-Scholes Pricing 366
Financial market model basics and the Black-Scholes formula 367
Financial market mathematical model 367
European option and fair price concepts 368
Black-Scholes formula 369
Options pricing 369
Test infrastructure 370
Case study 370
Preliminary version—Checking correctness 370
Reference version—Choose appropriate data structures 370
Reference version—Do not mix data types 372
Vectorize loops 373
Use fast math functions: erff() vs. cdfnormf() 376
Equivalent transformations of code 378
Align arrays 378
Reduce precision if possible 380
Work in parallel 381
Use warm-up 381
Using the Intel Xeon Phi coprocessor—“No effort” port 383
Use Intel Xeon Phi coprocessor: Work in parallel 384
Use Intel Xeon Phi coprocessor and streaming stores 385
Summary 385
Chapter 20: Data Transfer Using the Intel COI Library 388
First steps with the Intel COI library 388
COI buffer types and transfer performance 389
Applications 393
Summary 395
Chapter 21: High-Performance Ray Tracing 396
Background 396
Vectorizing ray traversal 398
The Embree ray tracing kernels 399
Using Embree in an application 399
Performance 401
Summary 404
Chapter 22: Portable Performance with OpenCL 406
The dilemma 406
A brief introduction to OpenCL 407
A matrix multiply example in OpenCL 411
OpenCL and the Intel Xeon Phi Coprocessor 413
Matrix multiply performance results 415
Case study: Molecular docking 416
Results: Portable performance 420
Related work 421
Summary 422
Chapter 23: Characterization and Optimization Methodology Applied to Stencil Computations 424
Introduction 424
Performance evaluation 425
AI of the test platforms 426
AI of the kernel 427
Standard optimizations 429
Automatic application tuning 433
The auto-tuning tool 439
Results 440
Summary 442
Chapter 24: Profiling-Guided Optimization 444
Matrix transposition in computer science 444
Tools and methods 446
“Serial”: Our original in-place transposition 447
“Parallel”: Adding parallelism with OpenMP 452
“Tiled”: Improving data locality 452
“Regularized”: Microkernel with multiversioning 458
“Planned”: Exposing more parallelism 464
Summary 468
Chapter 25: Heterogeneous MPI application optimization with ITAC 472
Asian options pricing 472
Application design 473
Synchronization in heterogeneous clusters 475
Finding bottlenecks with ITAC 476
Setting up ITAC 477
Unbalanced MPI run 478
Manual workload balance 481
Dynamic “Boss-Workers” load balancing 483
Conclusion 486
Chapter 26: Scalable Out-of-Core Solvers on a Cluster 490
Introduction 490
An OOC factorization based on ScaLAPACK 491
In-core factorization 492
OOC factorization 493
Porting from NVIDIA GPU to the Intel Xeon Phi coprocessor 494
Numerical results 496
Conclusions and future work 501
Acknowledgments 501
Chapter 27: Sparse Matrix-Vector Multiplication: Parallelization and Vectorization 504
Background 504
Sparse matrix data structures 505
Algorithm 1: COO-based SpMV multiplication 505
Compressed data structures 506
Algorithm 2: CRS-based SpMV multiplication 507
Algorithm 3: BICRS-based SpMV multiplication 508
Blocking 509
Parallel SpMV multiplication 509
Partially distributed parallel SpMV 509
Algorithm 4: Partially distributed parallel SpMV multiplication 510
Fully distributed parallel SpMV 510
Vectorization on the Intel Xeon Phi coprocessor 512
Implementation of the vectorized SpMV kernel 514
Evaluation 517
On the Intel Xeon Phi coprocessor 518
On Intel Xeon CPUs 519
Performance comparison 521
Summary 521
Chapter 28: Morton Order Improves Performance 524
Improving cache locality by data ordering 524
Improving performance 524
Matrix transpose 525
Matrix multiply 529
Summary 535
Author Index 538
Subject Index 542

Contributors


Mustafa AbdulJabbar     King Abdullah University of Science and Technology, Saudi Arabia


Mustafa is a PhD candidate in the Extreme Computing Research Center at KAUST. He works on optimization of high-scale algorithms such as FMM and is interested in closing the gap between RMI-based execution models and real applications in molecular dynamics and fluid mechanics.

Jefferson Amstutz     SURVICE Engineering Company, USA


Jefferson is a Software Engineer in the Applied Technology Operation of SURVICE. He explores interactive visualization and high-performance computing in support of applications for the Army Research Laboratory; he works to solve a variety of physics-based simulation problems in domains such as ballistic vulnerability analysis, radio frequency propagation, and soft-body simulation.

Cédric Andreolli     Intel Corporation, France


Cédric is an application engineer in the Energy team at Intel Corporation. He helps optimize applications running on Intel platforms for the Oil and Gas industry.

Edoardo Aprà     Pacific Northwest National Laboratory, USA


Edoardo is a Chief Scientist at the Environmental Molecular Sciences Laboratory within PNNL. His research focus is on high-performance computational algorithm and software development especially for chemical applications. He is the main developer of the molecular density functional theory (DFT) module in the NWChem package.

Nikita Astafiev     Intel Corporation, Russia


Nikita is a senior software engineer in the Numerics team at Intel. He works on highly optimized math functions. His key areas of interest include automated floating-point error analysis and low-level optimizations.

Troy Baer     National Institute for Computational Sciences, The University of Tennessee and Oak Ridge National Laboratory, USA


Troy leads the HPC systems team for the NICS Systems and Operations group. He has been involved in large system deployments including Beacon, Nautilus, and Kraken. In April 2014, Troy received the Adaptive Computing Lifetime Achievement award for contributions in scheduling and resource management using Moab.

Carsten Benthin     Intel Corporation, Germany


Carsten is a Graphics Research Scientist at Intel Corporation. His research interests include all aspects of ray tracing and high-performance rendering, throughput and high-performance computing, low-level code optimization, and massively parallel hardware architectures.

Per Berg     Danish Meteorological Institute, Denmark


Per applies his mathematical modeling and scientific computing education to develop modeling software for applications in water environments (estuaries, ocean). Working for both private companies and public institutes, Per has been involved in numerous projects that apply models to solve engineering and scientific problems.

Vincent Betro     National Institute for Computational Sciences, The University of Tennessee and Oak Ridge National Laboratory, USA


Vincent focuses his research on porting and optimizing applications for several architectures, especially the Intel Xeon Phi, and developing Computational Fluid Dynamics codes. He is also the training manager for the XSEDE project, and he has emphasized Xeon Phi Coprocessor training material development for Stampede and Beacon in this role.

Leonardo Borges     Intel Corporation, USA


Leo is a Senior Staff Engineer and has been engaged with the Intel Many Integrated Core program from its early days. He specializes in HPC applying his background in numerical analysis and in developing parallel numerical math libraries. Leo is focused on optimization work related to the Oil & Gas industry.

Ryan Braby     Joint Institute for Computational Sciences, The University of Tennessee and Oak Ridge National Laboratory, USA


Ryan is the Chief Cyberinfrastructure Officer for JICS. Ryan has been directly involved in the administration and/or deployment of 2 systems that ranked #1 on the Top 500 list, one system that ranked #1 on the Green 500 list, and 18 systems that were ranked in the top 50 on the Top 500 list.

Glenn Brook     Joint Institute for Computational Sciences, The University of Tennessee and Oak Ridge National Laboratory, USA


Glenn currently directs the Application Acceleration Center of Excellence (AACE) and serves as the Chief Technology Officer at JICS. He is the principal investigator for the Beacon Project, which is funded by NSF and UT to explore the impact of emerging computing technologies such as the Intel Xeon Phi coprocessor on computational science and engineering.

Ilya Burylov     Intel Corporation, Russia


Ilya is a senior software engineer in the Numerics team at Intel Corporation. His background is in computation optimizations for statistical, financial, and transcendental math functions algorithms. Ilya focuses on optimization of computationally intensive analytics algorithms and data manipulation steps for Big Data workflows within distributed systems.

Ki Sing Chan     The Chinese University of Hong Kong, Hong Kong


Ki Sing is an undergraduate student at the Chinese University of Hong Kong majoring in Mathematics and Information Engineering with a minor in Computer Science. His first research experience took place in the Oak Ridge National Laboratory in Tennessee during the summer break in 2013. His research focuses on the implementation of a Cholesky factorization algorithm for large dense matrices.

Gilles Civario     Irish Centre for High-End Computing (ICHEC), Ireland


Gilles is a Senior Software Architect focused on designing and implementing tailored hardware and software solutions to users of the National Service and to ICHEC’s technology transfer client companies.

Guillaume Colin de Verdière     Commissariat à l’Energie Atomique et aux Energies Alternatives (CEA), France


Guillaume is a senior expert at CEA. His current focus is on novel architectures, especially the Intel Xeon Phi, a promising technology that might get us to an exascale machine. As a direct consequence of this focus, he is actively studying the impact of such novel technologies on legacy code evolution.

Eduardo D’Azevedo     Computational Mathematics Group at the Oak Ridge National Laboratory, USA


Eduardo is a staff scientist with research interests that include developing highly scalable parallel solvers. He contributes to projects in materials science and fusion in the Scientific Discovery through Advanced Computing (SciDAC) program. He has developed out-of-core and compact storage extensions for the ScaLAPACK library and made fundamental contributions in optimal mesh generation.

Jim Dempsey     QuickThread Programming, LLC, USA


Jim is a consultant specializing in high-performance computing (HPC) and optimization of embedded systems. Jim is the President of QuickThread Programming, LLC. Jim’s expertise includes high efficiency programming and optimization for Intel Xeon and Intel Xeon Phi processors.

Alejandro Duran     Intel Corporation, Spain


Alejandro is an Application Engineer working with customers to help optimize their codes. He has been part of the OpenMP Language committee since 2005.

Manfred Ernst     Intel Corporation, now at Google Incorporated, USA


Manfred is a member of the Chromium team at Google. Prior to joining Google, he was a Research Scientist at Intel Labs, where he developed the Embree Ray Tracing Kernels. His primary research interests are photorealistic rendering, acceleration structures for ray tracing, sampling, and data compression.

Kerry Evans     Intel Corporation, USA


Kerry is a software engineer working primarily with customers on optimization of medical imaging software on Intel Xeon processors and Intel Xeon Phi coprocessors.

Rob Farber     TechEnablement.com, USA


Rob is a consultant with an extensive background in HPC and a long history of...

Publication date (per publisher) 4.11.2014
Language English
Subject areas Mathematics / Computer Science > Computer Science > Software Development
Mathematics / Computer Science > Computer Science > Theory / Studies
ISBN-10 0-12-802199-3 / 0128021993
ISBN-13 978-0-12-802199-6 / 9780128021996
