Par Lab Boot Camp 2009: Short Course on Parallel Programming

Day 1

@08:52 Late waking up :( Going to miss the intro talk by Dave Patterson. Where is the live video feed? Grrr…

@10:46 This is what happens when you are late to an auditorium: I couldn’t find a socket to plug into. There are power sockets only on the walls, so everyone not sitting on the aisles is doomed. I mean everyone like me with a laptop that has poor battery life. I think I should get a Mac soon.

Anyway, John Kubiatowicz is giving an intro to parallel architectures, starting right from the beginning – computer architecture 101. Up to the break just now, it covered the basic stuff – MIPS, then moving from sequential execution to ILP, SIMD/MIMD, etc.

@10:59 Thread-level parallelism starts the 2nd half of the talk: multi-processing using cobegin/coend etc. Also includes h/w threading, issues with memory, the memory hierarchy, caching – hit/miss… Lessons:

  • Cache vastly impacts performance
  • Actual performance is sometimes unpredictable; it depends a lot on the architecture.
  • Common techniques for improving cache performance: tiling and (something I don’t remember :P). See the tiling sketch below.
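
Since I’ll forget what tiling means by next week, here is a rough sketch (my own example, not from the slides): a blocked matrix multiply, where the loops are reorganized so each tile of data stays in cache while it is being reused.

```cpp
// Tiled (blocked) matrix multiply - a common cache-optimization technique.
// Hypothetical example; the tile size would be tuned to the actual cache.
#include <vector>
#include <algorithm>

void matmul_tiled(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int n, int tile = 64) {
    for (int ii = 0; ii < n; ii += tile)
        for (int kk = 0; kk < n; kk += tile)
            for (int jj = 0; jj < n; jj += tile)
                // Work on one tile at a time so the working set fits in cache.
                for (int i = ii; i < std::min(ii + tile, n); ++i)
                    for (int k = kk; k < std::min(kk + tile, n); ++k)
                        for (int j = jj; j < std::min(jj + tile, n); ++j)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```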

@11:21 Parallel architecture begins. The biggest challenge is the communication between the processors.

Types of parallelism:

  1. Superscalar
  2. Fine-grained
  3. Coarse-grained
  4. Multi-threaded (not sure; can anyone correct me?)
  5. Multi-processor

Parallel programming models:

  1. Shared memory model: programs are collections of threads; each thread has its own private state. There are two classes of data – logically shared and logically private. Problems: thread synchronization, locks, race conditions, etc. (transactional memory can fix this, instead of locking). More problems with cache coherence, which can be fixed with a broadcast bus between all the caches; however, that does not scale beyond 64 caches. So we need memory consistency models, e.g., sequential consistency.
  2. Message passing model: each processor has its own memory, and communication happens explicitly through messages. MPI is now the de facto standard message-passing programming model. Requires special h/w support (e.g., a dedicated message processor, as in Alewife) to be viable. (A minimal MPI sketch is below.)
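
To remember what “explicit” means here, a minimal MPI sketch (standard MPI calls; my own example, not from the talk): nothing is shared, so rank 0 has to send its data before rank 1 can see it.

```cpp
// Minimal message-passing sketch with MPI; run with e.g. mpirun -np 2 ./a.out
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;
        // Nothing is shared: rank 0 must explicitly send its data to rank 1.
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```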

@11:56 John is sort of wrapping it up now. He gives a very interesting talk. I would most probably take his OS or Comp. Arch. class if I need one in the future.

@12:39 Lunch break. I think every conference/workshop should give free lunch, even when they are free :P

@01:11 Katherine Yelick is going to start on Pthreads after the lunch; I mean, right now. She is gonna talk about shared memory programming models using POSIX threads. Yay !!!

Good reminders for me. It’s been a while since I looked at all these things. Data, Stack, and Heap (shared); cobegin/coend, fork/join, future; Pthreads…

  • Use a small number of threads (often equal to the number of cores/processors or h/w threads), because thread creation is expensive.
  • Race conditions: need to use atomic operations. One easy symptom of a race condition is non-deterministic behavior of the program, i.e., bugs that are not reproducible.
  • Use locks/mutexes to handle races. Multiple locks can lead to deadlock, and locks are expensive too. Some people use custom locks to work around the performance loss, but that can lead to even more issues; it requires careful coding.
  • Barriers: a global synchronization point where all the threads stop before going forward. Too many barriers can be a performance pitfall – over-synchronization; use arbitrary DAG synchronization instead. (A small Pthreads sketch follows this list.)
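
A tiny Pthreads sketch to remind myself of the basics (my own example): two threads hammer a shared counter, and the mutex is what keeps it from being a race.

```cpp
// Two threads incrementing a shared counter; without the mutex this is a race.
// Compile with e.g. g++ -pthread counter.cpp
#include <pthread.h>
#include <cstdio>

long counter = 0;                                  // logically shared
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void* work(void*) {
    for (int i = 0; i < 1000000; ++i) {
        pthread_mutex_lock(&lock);                 // protect the shared update
        ++counter;
        pthread_mutex_unlock(&lock);
    }
    return nullptr;
}

int main() {
    pthread_t t1, t2;                              // small, fixed number of threads
    pthread_create(&t1, nullptr, work, nullptr);
    pthread_create(&t2, nullptr, work, nullptr);
    pthread_join(t1, nullptr);                     // join = synchronization point
    pthread_join(t2, nullptr);
    std::printf("counter = %ld\n", counter);       // always 2000000 with the lock
    return 0;
}
```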

@01:55 Kathy is summarizing the talk, ending with the fact that OpenMP is used as a better alternative nowadays. My bad, she is not done yet. Now talking about different use cases and examples. She speaks really fast; it’s not hard to follow, but it gives a feeling that we are going to miss a train :-S

  • Memory latency is, in most cases, more responsible for slowing down parallel code than memory b/w.
  • Programs should use large contiguous chunks of memory for better performance (sounds kinda natural!). Software-controlled memory gives programmers more control.
  • The Roofline performance model for measuring the performance of parallel programming models. (A rough sketch of the formula is below.)
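
As far as I understood it (my own summary, the numbers are made up), the roofline bound says you can never exceed either the peak compute rate or the rate at which memory can feed you data:

```cpp
// Roofline bound: attainable GFLOP/s = min(peak GFLOP/s, peak GB/s * flops-per-byte).
// Hypothetical numbers, just to illustrate the idea.
#include <algorithm>
#include <cstdio>

double roofline(double peak_gflops, double peak_gbps, double flops_per_byte) {
    return std::min(peak_gflops, peak_gbps * flops_per_byte);
}

int main() {
    // A machine with 85 GFLOP/s peak compute and 10 GB/s memory bandwidth,
    // running a kernel that does 0.5 flops per byte moved: bandwidth-bound.
    std::printf("bound = %.1f GFLOP/s\n", roofline(85.0, 10.0, 0.5));
    return 0;
}
```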

@02:17 Done. In a nutshell, parallel programming is hard, but it’s the way to go. Maybe :-S

@02:24 Tim Mattson from Intel is going to talk on OpenMP. He is promising that we’ll be able to write parallel code after his talk. Wow, he is one of the authors of OpenMP or OpenCL!!! I already like him :)

  • OpenMP comes from SMP; it’s basically a standardization of SMP programming rather than a new model. It is a shared address space model. Oh, and it is a directive-based language.
  • An OpenMP program starts off sequential and gets parallel threads when you need ’em: one master thread, with many parallel regions in the middle connected by sequential execution.
  • Threads are created with the parallel directive. All the threads execute the same code in the parallel region. Variables declared inside a parallel block are local to each thread. (Minimal sketch below.)
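
The canonical hello-world for this (my own version, pretty much what every OpenMP tutorial shows):

```cpp
// Minimal OpenMP parallel region; compile with e.g. g++ -fopenmp hello.cpp
#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel             // the master thread forks a team here
    {
        int id = omp_get_thread_num();            // local to each thread
        std::printf("hello from thread %d of %d\n",
                    id, omp_get_num_threads());
    }                                // implicit join back to a single thread
    return 0;
}
```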

Stages of parallel programming:

  1. Identify the concurrent tasks,
  2. Expose the concurrent tasks,
  3. Express the concurrency, and
  4. Execute it.

@02:33 We are going through an example program to understand the stages. Btw, any parallel programming model should support an incremental parallelization process.

  • The OS might not give the number of threads requested by the programmer, so check how many threads you actually got.
  • Having each thread update its own slot of a shared array can result in false sharing. Use critical sections instead.
  • There are constructs for one-line loop-splitting and reduction: use the for, private, and reduction clauses.
  • ICVs (Internal Control Variables): define the state of the runtime system for a thread; accessed via the omp_set_*/omp_get_* routines.
  • The schedule clause lets you tell OpenMP how to divide the loop iterations among threads. (A sketch pulling these together is below.)
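
My attempt to pull those points together (loosely based on the classic pi-by-numerical-integration exercise; the details here are mine, not from the slides):

```cpp
// Loop splitting + reduction + schedule, and checking the actual thread count.
#include <omp.h>
#include <cstdio>

int main() {
    const long n = 100000000;
    const double step = 1.0 / n;
    double sum = 0.0;

    omp_set_num_threads(4);                  // a request; the OS may give fewer

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("actually got %d threads\n", omp_get_num_threads());

        // One-line loop splitting with a reduction: no shared array of partial
        // sums, hence no false sharing.
        #pragma omp for schedule(static) reduction(+:sum)
        for (long i = 0; i < n; ++i) {
            double x = (i + 0.5) * step;     // declared in the loop, so private
            sum += 4.0 / (1.0 + x * x);
        }
    }
    std::printf("pi ~= %.6f\n", sum * step);
    return 0;
}
```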

@03:26 Tim is done. Funny guy, great talk! I do have some idea of how to write OpenMP code after this talk; at least now I know what to Google for.

@03:50 Michael Wrinn from Intel is gonna start on TBB. Apparently, it stands for Threading Building Blocks. And it requires a lot of knowledge of C++ features. Good to know that C and C++ are still alive and well after all these years, and hopefully they will come back stronger in the parallel domain.

Key features:

  • It’s a library for C++, not a language or model per se.
  • Provides high-level abstraction for parallelization.
  • Not for i/o-bound or real-time programs; only for C++; desktop-targeted; requires more work than OpenMP.

Task-based programming

  • TBB parallel algorithms map tasks onto threads automatically,
  • An unfair scheduler that favors tasks with the most recent cache usage,
  • Over- and under-subscription of core resources is prevented by task-stealing techniques. (A small parallel_for sketch is below.)
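
A small parallel_for sketch so I remember what the task-based style looks like (my own example, using TBB’s blocked_range interface):

```cpp
// TBB maps the sub-ranges of this loop onto tasks and schedules (and steals)
// them across worker threads automatically.
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>

void scale(std::vector<float>& v, float a) {
    tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            // Each task gets a chunk of the index range.
            for (size_t i = r.begin(); i != r.end(); ++i)
                v[i] *= a;
        });
}
```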

@04:09 Either I am tired or the talk is hard to follow. I am not getting much of whatever is on the slides :(

@04:26 Heidi Pan from MIT is talking on Lithe: Enabling Efficient Composition of Parallel Libraries, that is, how you use all the different techniques simultaneously. Btw, I like her username at MIT: xoxo, kinda cute! And it’s good to hear someone give a talk at a workshop using words that the faint-hearted would avoid :P

Apparently they create a hierarchy of schedulers with parent-child relationships, and each one can use a different model while communicating through a common interface/language.

@04:37 She is done. Damn! That was fast. There is supposed to be a hands-on session in SODA after this. Nope, first we have someone telling us how to log on to the university machines; then we’ll head over to SODA. WOW, we are actually going to code on a supercomputer (Cray XT4)!!! It’s the 11th fastest supercomputer in the world!!!

@05:05 The hands-on thing is gonna start soon. Bill and I are in the Woz lounge. We’ve got to do a particle simulation.

@07:47 Wrapping up. That was one parallel day! In the end, the hands-on didn’t go too well because, for some reason, my password for Franklin was not working after the 1st time. Anyway, we also have the EECS orientation for incoming students. Not sure how much I’ll be able to attend in the midst of everything.
