OMP parallelization
OpenMP (Open Multi-Processing) is a high-level API for parallel programming in C, C++, and Fortran. It's a widely used, open, platform-independent standard that lets developers write parallel code that runs on multiple CPU cores, reducing a program's overall execution time.
OpenMP does the heavy lifting in several areas:
- Thread Management: OpenMP automatically manages thread creation, synchronization, and termination, freeing the developer from the complexity of thread management.
- Work Distribution: OpenMP provides a way to distribute work among threads, ensuring that each thread has a portion of the work to execute.
- Synchronization: OpenMP provides built-in synchronization mechanisms, such as barriers and locks, to ensure that threads access shared data safely.
- Data Management: OpenMP provides a way to manage data sharing between threads, reducing the risk of data corruption and inconsistencies.
OpenMP makes it easy to experiment with parallelism by:
- Incremental Parallelism: OpenMP allows developers to incrementally add parallelism to existing serial code, making it easier to test and refine parallel code (see the sketch after this list).
- Simple Directives: OpenMP uses simple directives (e.g., #pragma omp parallel) to mark parallel regions, making it easy to write parallel code.
- Portability: OpenMP code is portable across different platforms, including Windows, macOS, and Linux.
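To make the incremental-parallelism point concrete, here is a minimal sketch (the array size and loop body are purely illustrative): the serial loop stays exactly as written, and a single directive added above it spreads the iterations across threads.
#include <stdio.h>

#define N 1000000

int main() {
    static double a[N], b[N];
    // Plain serial initialization
    for (int i = 0; i < N; i++) {
        a[i] = i;
    }
    // Adding this one directive parallelizes the existing loop;
    // remove it and the program is back to ordinary serial C.
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        b[i] = 2.0 * a[i];
    }
    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}
Compile with your compiler's OpenMP flag (for example, gcc -fopenmp); without it, the directive is typically ignored and the program simply runs serially.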
Comparison with POSIX Threads (pthreads)
POSIX threads (pthreads) is a low-level, Unix-originated threading API that requires you to create, synchronize, and join threads explicitly, whereas OpenMP expresses parallelism through compiler directives and leaves those details to the runtime.
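To make the comparison concrete, here is roughly the same "hello from every thread" program written with pthreads (the thread count of 4 is arbitrary for the example); compare it with the one-directive OpenMP version shown in the Parallel Directive section below.
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4  // chosen arbitrarily for this example

// With pthreads you write the thread function yourself, pass arguments
// through void pointers, and manage creation and joining explicitly.
void* hello(void* arg) {
    int id = *(int*)arg;
    printf("Hello from thread %d\n", id);
    return NULL;
}

int main() {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, hello, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}
(Compile with -pthread.)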
Note: this post is a code-first introduction to the directives and clauses OpenMP provides. Most of the time you can get away with simple loop parallelization, but when you want to dig deeper into scheduling, private variables, and reductions, this should serve as an all-in-one reference.
OpenMP Directives#
- Parallel Directive
- Sections Directive
- Single Directive
- Master Directive
- Critical Directive
- Barrier Directive
OpenMP Clauses#
- Private Clause
- Firstprivate Clause
- Lastprivate Clause
- Reduction Clause
- Schedule Clause
Parallel Directive#
The parallel directive is used to specify a block of code that should be executed in parallel by multiple threads. This directive is the foundation of OpenMP programming, and it is used to create a team of threads that will execute the code within the parallel region.
#include <omp.h>
#include <stdio.h>
int main() {
#pragma omp parallel
{
printf("Hello from thread %d\n", omp_get_thread_num());
}
return 0;
}
Note: The omp_get_thread_num() function is used to get the thread number of the current thread.
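The size of the thread team is not fixed by the directive itself. As a quick sketch of the usual options (the default depends on your OpenMP implementation): set the OMP_NUM_THREADS environment variable, call omp_set_num_threads(), or attach a num_threads clause to the directive.
#include <omp.h>
#include <stdio.h>

int main() {
    omp_set_num_threads(2);              // request 2 threads for later parallel regions
    #pragma omp parallel num_threads(4)  // the clause overrides that request for this region
    {
        printf("Thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}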
Sections Directive#
The sections directive is used to divide a block of code into multiple sections that can be executed in parallel by different threads. This directive is useful when you have multiple independent tasks that can be executed concurrently.
#include <omp.h>
#include <stdio.h>
int main() {
#pragma omp parallel sections
{
#pragma omp section
{
printf("Section 1: Hello from thread %d\n", omp_get_thread_num());
}
#pragma omp section
{
printf("Section 2: Hello from thread %d\n", omp_get_thread_num());
}
}
return 0;
}
Single Directive#
The Single Directive is used to specify a block of code that should be executed by only one thread in a team. This directive is useful when you want to perform some operation that should only be done once, such as initializing a shared variable or printing a message.
Syntax
#pragma omp single
{
    // code to be executed by a single thread
}
Example
#include <omp.h>
#include <stdio.h>
int main() {
int x = 0;
#pragma omp parallel
{
#pragma omp single
{
x = 10;
printf("Thread %d initialized x to %d\n", omp_get_thread_num(), x);
}
printf("Thread %d sees x as %d\n", omp_get_thread_num(), x);
}
return 0;
}
Output
Thread 0 initialized x to 10
Thread 0 sees x as 10
Thread 1 sees x as 10
Thread 2 sees x as 10
Thread 3 sees x as 10
As you can see, only one thread (here, thread 0, though it could be any thread in the team) initializes the variable x to 10, and all other threads see the updated value of x.
Note
There is an implicit barrier at the end of a single construct, so by default the other threads do wait for the single thread to finish before continuing. You can remove that implicit barrier with the nowait clause; in that case, if you still need a synchronization point, you can place an explicit #pragma omp barrier directive after the single construct.
Here's an updated example that uses nowait together with an explicit barrier:
#include <omp.h>
#include <stdio.h>
int main() {
int x = 0;
#pragma omp parallel
{
#pragma omp single nowait
{
x = 10;
printf("Thread %d initialized x to %d\n", omp_get_thread_num(), x);
}
#pragma omp barrier
printf("Thread %d sees x as %d\n", omp_get_thread_num(), x);
}
return 0;
}
With nowait, threads do not pause at the end of the single construct itself; it is the explicit barrier that ensures every thread sees the initialized x before continuing.
Master Directive#
The master directive is used to specify a block of code that should be executed by the master thread only. The master thread is the thread with thread number 0.
Syntax
#pragma omp master
{
    // code to be executed by the master thread
}
Example
#include <omp.h>
#include <stdio.h>
int main() {
int x = 0;
#pragma omp parallel
{
#pragma omp master
{
x = 10;
printf("Master thread: x = %d\n", x);
}
printf("Thread %d: x = %d\n", omp_get_thread_num(), x);
}
return 0;
}
In this example, the master thread sets the value of x to 10 and prints a message, and all threads then print the value of x. Note that because master provides no synchronization, other threads may read x before the master thread has updated it.
Note
The master directive does not imply a barrier, so the other threads may continue executing without waiting for the master thread to finish its execution.
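Because master provides no synchronization of its own, a common pattern is to pair it with an explicit barrier when the other threads must not run ahead of the master thread's work. A minimal sketch:
#include <omp.h>
#include <stdio.h>

int main() {
    int x = 0;
    #pragma omp parallel
    {
        #pragma omp master
        {
            x = 10;  // only thread 0 executes this
        }
        #pragma omp barrier  // everyone waits here until the master's update is done
        printf("Thread %d sees x = %d\n", omp_get_thread_num(), x);
    }
    return 0;
}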
When to use
The master directive is useful when you need some operation to be performed only once, and specifically by thread 0, such as initializing a shared variable or printing a message.
Comparison with Single Directive
The master directive is similar to the single directive, but a single region can be executed by any thread in the team, while a master region is always executed by the master thread.
Here is an example that demonstrates the difference:
#include <omp.h>
#include <stdio.h>
int main() {
int x = 0;
#pragma omp parallel
{
#pragma omp single
{
x = 10;
printf("Single thread: x = %d\n", x);
}
printf("Thread %d: x = %d\n", omp_get_thread_num(), x);
}
return 0;
}
In this example, the single directive is used to set the value of x to 10. The thread that executes the single region may not be the master thread.
Critical Directive#
The critical directive is used to specify a block of code that should be executed by only one thread at a time. This directive is useful when you need to protect a critical section of code that should not be executed concurrently by multiple threads.
Syntax
#pragma omp critical [(name)]
{
    // code to be executed by only one thread at a time
}
Example
#include <omp.h>
#include <stdio.h>
int main() {
int x = 0;
#pragma omp parallel
{
#pragma omp critical
{
x = x + 1;
printf("Thread %d: x = %d\n", omp_get_thread_num(), x);
}
}
return 0;
}
In this example, the critical directive is used to protect the increment of the variable x. Only one thread can execute the critical section at a time, ensuring that the increment is not interleaved with another thread's update.
Named Critical Sections
You can also give a critical section a name. Critical sections with different names can be executed concurrently, while all unnamed critical sections share a single global lock.
#pragma omp critical (name)
{
    // code to be executed by only one thread at a time
}
Example
#include <omp.h>
#include <stdio.h>
int main() {
int x = 0;
int y = 0;
#pragma omp parallel
{
#pragma omp critical (x_lock)
{
x = x + 1;
printf("Thread %d: x = %d\n", omp_get_thread_num(), x);
}
#pragma omp critical (y_lock)
{
y = y + 1;
printf("Thread %d: y = %d\n", omp_get_thread_num(), y);
}
}
return 0;
}
In this example, we have two named critical sections, x_lock and y_lock. These critical sections can be executed concurrently, but only one thread can execute each critical section at a time.
Best Practices
- Use the critical directive to protect critical sections of code that should not be executed concurrently by multiple threads.
- Use named critical sections to allow independent critical sections to be executed concurrently.
- Minimize the use of critical sections, as they can introduce performance overhead.
- Prefer lighter-weight mechanisms, such as the atomic directive or reduction clauses, for simple updates when possible (see the sketch below).
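For a simple update such as a single increment, the atomic directive mentioned above is usually a lighter-weight alternative to critical. A minimal sketch:
#include <omp.h>
#include <stdio.h>

int main() {
    int counter = 0;
    #pragma omp parallel
    {
        // atomic protects just this one memory update, typically using a
        // hardware atomic instruction rather than acquiring a full lock
        #pragma omp atomic
        counter++;
    }
    printf("counter = %d\n", counter);  // equals the number of threads
    return 0;
}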
Barrier Directive
The barrier directive is used to synchronize all threads in a team. When a thread reaches a barrier, it waits until all other threads in the team have also reached the barrier. Once all threads have reached the barrier, they can all proceed past it.
Syntax
#pragma omp barrier
Example
Here’s an example demonstrating the use of the barrier directive:
#include <omp.h>
#include <stdio.h>
#include <unistd.h>
int main() {
#pragma omp parallel num_threads(4)
{
int thread_num = omp_get_thread_num();
// First part of the work
printf("Thread %d: Starting first part of work\n", thread_num);
sleep(thread_num); // Simulate different work times
#pragma omp barrier
// Second part of the work
printf("Thread %d: Starting second part of work\n", thread_num);
}
return 0;
}
Output
Thread 0: Starting first part of work
Thread 3: Starting first part of work
Thread 2: Starting first part of work
Thread 1: Starting first part of work
Thread 0: Starting second part of work
Thread 1: Starting second part of work
Thread 2: Starting second part of work
Thread 3: Starting second part of work
Explanation
- Four threads are created.
- Each thread prints a message and then sleeps for a duration equal to its thread number (simulating different work times).
- The barrier directive ensures that all threads wait at this point until every thread has reached the barrier.
- Once all threads have reached the barrier, they all proceed to print the second message.
Important Notes
- An implicit barrier exists at the end of parallel, for, sections, and single constructs; for the worksharing constructs (for, sections, single) it can be removed with a nowait clause (see the sketch below).
- Using too many barriers can negatively impact performance, as threads may spend a lot of time waiting for each other.
- Barriers should be used judiciously to ensure correct program behavior while minimizing synchronization overhead.
- Be cautious of potential deadlocks when using barriers, especially in complex parallel regions.
The barrier directive is a powerful tool for synchronizing threads, but it should be used carefully to balance correctness and performance in your parallel programs.
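To illustrate the implicit barriers mentioned above, here is a small sketch in which nowait removes the barrier at the end of the first loop, so a thread that finishes early can move straight on to the second, independent loop:
#include <omp.h>
#include <stdio.h>

#define N 1000

int main() {
    static double a[N], b[N];
    #pragma omp parallel
    {
        // nowait: no implicit barrier at the end of this worksharing loop
        #pragma omp for nowait
        for (int i = 0; i < N; i++) {
            a[i] = i * 2.0;
        }
        // b[] does not depend on a[], so threads may start it immediately
        #pragma omp for
        for (int i = 0; i < N; i++) {
            b[i] = i * 3.0;
        }
    }
    printf("a[10] = %f, b[10] = %f\n", a[10], b[10]);
    return 0;
}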
Private Clause
The private clause is used to specify that each thread should have its own copy of the listed variables. This is useful when you want each thread to work with its own instance of a variable, preventing data races and ensuring thread safety.
Syntax
#pragma omp parallel private(variable1, variable2, …)
or
#pragma omp for private(variable1, variable2, …)
Example
Here’s an example demonstrating the use of the private clause:
#include <omp.h>
#include <stdio.h>
int main() {
int i, sum = 0;
#pragma omp parallel for private(i) reduction(+:sum)
for (i = 1; i <= 100; i++) {
int temp = i * i;
sum += temp;
printf("Thread %d: i = %d, temp = %d\n", omp_get_thread_num(), i, temp);
}
printf("Sum of squares from 1 to 100: %d\n", sum);
return 0;
}
Output
The output might look something like this (note that the order of thread execution may vary):
Thread 0: i = 1, temp = 1
Thread 0: i = 2, temp = 4
...
Thread 1: i = 26, temp = 676
Thread 1: i = 27, temp = 729
...
Thread 2: i = 51, temp = 2601
Thread 2: i = 52, temp = 2704
...
Thread 3: i = 76, temp = 5776
Thread 3: i = 77, temp = 5929
...
Sum of squares from 1 to 100: 338350
Explanation
- The i variable is declared as private. This means each thread gets its own copy of i.
- Inside the parallel for loop, each thread works with its own i, preventing any data races.
- The temp variable is implicitly private because it's declared inside the parallel region.
- Each thread calculates the square of its assigned numbers and adds them to the shared sum variable (using the reduction clause, which is covered later in this post).
Important Notes
- Variables declared as private are not initialized. They contain garbage values when entering the parallel region unless explicitly initialized.
- Changes made to private variables are not visible outside the parallel region.
- If you need to initialize private variables or use their values after the parallel region, consider using the firstprivate or lastprivate clauses (covered next).
- Whole arrays can be listed in a private clause, but be aware that this gives every thread its own full copy of the array.
- The private clause is often used with loop index variables to ensure each thread has its own copy of the loop counter.
The private clause is a fundamental tool in OpenMP for managing data in parallel regions, helping to prevent data races and ensure correct parallel execution.
Firstprivate Clause
The firstprivate clause is similar to the private clause, but it also initializes each private copy with the value the original variable had before entering the parallel region.
Syntax
#pragma omp parallel firstprivate(variable1, variable2, …)
or
#pragma omp for firstprivate(variable1, variable2, …)
Example
Here’s an example demonstrating the use of the firstprivate clause:
#include <omp.h>
#include <stdio.h>
int main() {
int x = 5;
#pragma omp parallel firstprivate(x)
{
printf("Thread %d: x initially = %d\n", omp_get_thread_num(), x);
x += omp_get_thread_num();
printf("Thread %d: x after modification = %d\n", omp_get_thread_num(), x);
}
printf("Outside parallel region: x = %d\n", x);
return 0;
}
Output
The output might look something like this:
Thread 0: x initially = 5
Thread 1: x initially = 5
Thread 2: x initially = 5
Thread 3: x initially = 5
Thread 0: x after modification = 5
Thread 1: x after modification = 6
Thread 2: x after modification = 7
Thread 3: x after modification = 8
Outside parallel region: x = 5
Explanation
- The variable x is initialized to 5 before the parallel region.
- Using firstprivate(x), each thread gets its own copy of x, initialized to 5.
- Each thread then modifies its private copy of x by adding its thread number.
- After the parallel region, the original x remains unchanged (still 5).
Important Notes
- firstprivate is particularly useful when you need to use the initial value of a variable in a parallel region, but also want each thread to have its own copy.
- For complex objects (like C++ objects with constructors), firstprivate ensures that the copy constructor is called to initialize the private copies.
- firstprivate can have a performance overhead compared to private, especially for large objects, as it needs to initialize each private copy.
- Changes made to firstprivate variables are not visible outside the parallel region unless you use other mechanisms (like reduction or lastprivate).
- You can use firstprivate with arrays, but be cautious with large arrays as it will create a full copy for each thread.
Comparison with private
- private creates uninitialized copies for each thread.
- firstprivate creates initialized copies, with the value from before entering the parallel region.
The firstprivate clause is very useful when you need to work with initialized private copies in your parallel regions, combining the benefits of data isolation with proper initialization.
Lastprivate Clause#
The lastprivate clause is used to specify that the value of one or more private variables should be copied back to the original variables after the parallel region or loop. Specifically, it copies the value from the last logical iteration of the loop or the last section of a sections construct.
Syntax
#pragma omp parallel for lastprivate(variable1, variable2, …)
or
#pragma omp sections lastprivate(variable1, variable2, …)
Example
Here’s an example demonstrating the use of the lastprivate clause:
#include <omp.h>
#include <stdio.h>
int main() {
int i, x = 0;
#pragma omp parallel for lastprivate(x)
for (i = 0; i < 100; i++) {
x = i;
printf("Thread %d: i = %d, x = %d\n", omp_get_thread_num(), i, x);
}
printf("After parallel for: x = %d\n", x);
return 0;
}
Output
The output might look something like this (note that the order of thread execution may vary):
Thread 0: i = 0, x = 0
Thread 1: i = 25, x = 25
Thread 2: i = 50, x = 50
Thread 3: i = 75, x = 75
...
Thread 3: i = 99, x = 99
After parallel for: x = 99
Explanation
- The variable x is declared as lastprivate.
- Each thread has its own private copy of x during the parallel for loop.
- After the loop completes, the value of x from the last logical iteration (i = 99) is copied back to the original x.
- Outside the parallel region, x has the value from the last iteration (99).
Important Notes
- lastprivate is particularly useful when you need the final value of a variable after a parallel loop or sections construct.
- For loops, the "last" iteration is determined by the logical order of the loop, not the actual last thread to finish.
- For sections constructs, the last section in lexical order determines the lastprivate value.
- lastprivate can be combined with firstprivate if you need both initialization and final value preservation (see the sketch after the comparison below).
- There's a potential performance overhead with lastprivate, as it requires additional synchronization to determine and copy the final value.
Comparison with private and firstprivate
- private: creates uninitialized copies; final values are not preserved.
- firstprivate: creates initialized copies; final values are not preserved.
- lastprivate: may create uninitialized copies (unless combined with firstprivate), but preserves the final value from the logically last iteration or section.
The lastprivate clause is valuable when you need to capture the final state of a variable after parallel execution, especially in loops where the final iteration produces an important result.
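As noted above, lastprivate can be combined with firstprivate when you want both an initialized private copy and the final value. A small sketch (the loop body is only illustrative, and the initial value is not actually needed for the result; the point is that both clauses can legally name the same variable):
#include <omp.h>
#include <stdio.h>

int main() {
    int x = 100;
    // firstprivate: each thread's copy of x starts at 100
    // lastprivate: the copy from the logically last iteration (i = 9) is written back
    #pragma omp parallel for firstprivate(x) lastprivate(x)
    for (int i = 0; i < 10; i++) {
        x = 100 + i;  // overwrite rather than accumulate so the result is deterministic
    }
    printf("After the loop: x = %d\n", x);  // prints 109
    return 0;
}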
Reduction Clause
The reduction clause is used to perform a reduction operation on one or more variables across all threads in a parallel region. It's particularly useful for operations like summing, finding the maximum or minimum, or performing logical operations across a set of values computed in parallel.
Syntax
#pragma omp parallel for reduction(operator:variable)
Where operator can be:
- +, -, *, &, |, ^, &&, || for built-in types
- max, min for arithmetic types
Example
Here’s an example demonstrating the use of the reduction clause to calculate the sum of squares:
#include <omp.h>
#include <stdio.h>
int main() {
long long sum = 0;
int i, n = 1000000;
#pragma omp parallel for reduction(+:sum)
for (i = 1; i <= n; i++) {
sum += (long long)i * i; // cast so the multiplication doesn't overflow int
}
printf("Sum of squares from 1 to %d: %lld\n", n, sum);
return 0;
}
Output
The output will be:
Sum of squares from 1 to 1000000: 333333833333500000
Explanation
- The sum variable is specified as a reduction variable with the + operator.
- Each thread computes a partial sum for its chunk of the loop.
- OpenMP automatically combines these partial sums using the specified operator (+ in this case).
- The final result is stored in the original sum variable.
Important Notes
- OpenMP creates a private copy of the reduction variable for each thread, initialized according to the reduction operator (e.g., 0 for +, 1 for *, etc.).
- At the end of the parallel region, these private copies are combined using the specified operator to produce the final result.
- Reduction operations are both thread-safe and efficient, as they avoid the need for explicit synchronization.
- You can specify multiple reduction variables in one clause, and use several reduction clauses on the same construct: #pragma omp parallel for reduction(+:sum1,sum2) reduction(max:maxval)
- Custom reduction operations can be defined for user-defined types using the declare reduction directive (a sketch follows below).
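As a hedged sketch of declare reduction (the struct and the reduction name cplx_add are invented for this example; syntax per OpenMP 4.0 and later), here is a user-defined reduction that sums complex numbers stored in a plain C struct:
#include <omp.h>
#include <stdio.h>

typedef struct { double re, im; } complex_t;

// Combiner: how two partial results are merged (omp_out += omp_in).
// Initializer: how each thread's private copy (omp_priv) starts out.
#pragma omp declare reduction(cplx_add : complex_t : omp_out.re += omp_in.re, omp_out.im += omp_in.im) initializer(omp_priv = (complex_t){0.0, 0.0})

int main() {
    complex_t sum = {0.0, 0.0};
    #pragma omp parallel for reduction(cplx_add : sum)
    for (int i = 0; i < 1000; i++) {
        sum.re += 1.0;
        sum.im += 2.0;
    }
    printf("sum = (%.1f, %.1f)\n", sum.re, sum.im);  // expected (1000.0, 2000.0)
    return 0;
}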
Common Use Cases
- Summing values: reduction(+:sum)
- Finding maximum: reduction(max:max_value) (see the sketch after this list)
- Finding minimum: reduction(min:min_value)
- Logical AND of conditions: reduction(&&:all_true)
- Logical OR of conditions: reduction(||:any_true)
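For instance, a max reduction over an array (the array contents are arbitrary here) looks like this:
#include <omp.h>
#include <stdio.h>

int main() {
    int n = 1000;
    double data[1000];
    for (int i = 0; i < n; i++) {
        data[i] = (i * 37) % 1001;  // arbitrary values for the example
    }
    double max_value = data[0];
    #pragma omp parallel for reduction(max:max_value)
    for (int i = 0; i < n; i++) {
        if (data[i] > max_value) {
            max_value = data[i];
        }
    }
    printf("max = %f\n", max_value);
    return 0;
}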
Performance Considerations
Reductions are generally very efficient, as they allow each thread to work independently and combine results only at the end. This often leads to good scalability. However, for very small loops or when the reduction operation is very simple, the overhead of creating and managing threads might outweigh the benefits of parallelization.
The reduction clause is a powerful feature in OpenMP that simplifies parallel aggregation operations and helps avoid common pitfalls like race conditions when accumulating results across threads.
Schedule Clause
The schedule clause is used to specify how iterations of a loop are divided among threads in a parallel region. It allows you to control the workload distribution, which can significantly impact the performance of your parallel program.
Syntax
#pragma omp parallel for schedule(kind[, chunk_size])
Where kind can be:
- static
- dynamic
- guided
- auto
- runtime
Types of Schedules
Static Schedule
- Iterations are divided into chunks of size chunk_size and distributed to threads in a round-robin fashion.
- If chunk_size is not specified, iterations are divided roughly equally among threads.
Dynamic Schedule
- Iterations are divided into chunks of size chunk_size.
- Each thread requests a new chunk when it finishes its current chunk.
Guided Schedule
- Similar to dynamic, but the chunk size starts large and decreases over time.
- The chunk size is proportional to the number of unassigned iterations divided by the number of threads.
Auto Schedule
- The scheduling decision is delegated to the compiler and/or runtime system.
Runtime Schedule
- The schedule is determined at runtime by the OMP_SCHEDULE environment variable or a call to omp_set_schedule().
Example
Here’s an example demonstrating different schedule types:
#include <omp.h>
#include <stdio.h>

// Run a loop with schedule(runtime) after selecting the schedule via omp_set_schedule().
void test_schedule(const char* label, omp_sched_t kind, int chunk) {
    int i, n = 20;
    omp_set_schedule(kind, chunk); // chunk <= 0 means "use the default chunk size"
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++) {
        printf("Thread %d executing iteration %d\n", omp_get_thread_num(), i);
    }
    printf("\nTested with schedule type: %s\n\n", label);
}

int main() {
    test_schedule("static", omp_sched_static, 0);
    test_schedule("static,2", omp_sched_static, 2);
    test_schedule("dynamic", omp_sched_dynamic, 0);
    test_schedule("dynamic,2", omp_sched_dynamic, 2);
    test_schedule("guided", omp_sched_guided, 0);
    test_schedule("auto", omp_sched_auto, 0);
    return 0;
}
Setting the OMP_SCHEDULE environment variable before launching the program has a similar effect on loops that use schedule(runtime).
Choosing the Right Schedule
- Use static when iterations have similar workloads and you want to minimize scheduling overhead.
- Use dynamic or guided when iterations have varying workloads to achieve better load balancing (see the sketch after this list).
- Use auto when you're unsure and want the system to decide.
- Use runtime for flexibility in testing different schedules without recompiling.
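As a rough illustration of why dynamic scheduling helps with uneven work (the workload function below is invented for the example): with schedule(static) the thread that owns the expensive iterations becomes a straggler, while schedule(dynamic) hands out new iterations as threads become free.
#include <omp.h>
#include <stdio.h>

// Simulated uneven workload: later iterations do much more work.
double work(int i) {
    double s = 0.0;
    for (int j = 0; j < i * 1000; j++) {
        s += j * 0.5;
    }
    return s;
}

int main() {
    int n = 1000;
    double total = 0.0;
    double start = omp_get_wtime();
    // Try replacing dynamic with static and compare the elapsed times.
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += work(i);
    }
    printf("total = %e, elapsed = %f s\n", total, omp_get_wtime() - start);
    return 0;
}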
The choice of schedule can significantly affect performance, especially for loops with irregular workloads or when dealing with NUMA (Non-Uniform Memory Access) architectures.