OMP parallelization
OpenMP (Open Multi-Processing) is a high-level API for parallel programming in C, C++, and Fortran. It's a widely used, open, platform-independent standard that lets developers write parallel code that runs on multiple CPU cores, reducing a program's overall execution time.
OpenMP does the heavy lifting in several areas:
- Thread Management: OpenMP automatically manages thread creation, synchronization, and termination, freeing the developer from the complexity of thread management.
- Work Distribution: OpenMP provides a way to distribute work among threads, ensuring that each thread has a portion of the work to execute.
- Synchronization: OpenMP provides built-in synchronization mechanisms, such as barriers and locks, to ensure that threads access shared data safely.
- Data Management: OpenMP provides a way to manage data sharing between threads, reducing the risk of data corruption and inconsistencies.
OpenMP makes it easy to experiment with parallelism by:
- Incremental Parallelism: OpenMP allows developers to incrementally add parallelism to existing serial code, making it easier to test and refine parallel code (see the sketch after this list).
- Simple Directives: OpenMP uses simple directives (e.g., #pragma omp parallel) to mark parallel regions, making it easy to write parallel code.
- Portability: OpenMP code is portable across different platforms, including Windows, macOS, and Linux.
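To make the incremental-parallelism point concrete, here is a minimal sketch (the array size and loop body are purely illustrative): the serial loop stays exactly as written, and a single directive added above it spreads the iterations across threads.
#include <stdio.h>

#define N 1000000

int main() {
    static double a[N], b[N];
    // Plain serial initialization
    for (int i = 0; i < N; i++) {
        a[i] = i;
    }
    // Adding this one directive parallelizes the existing loop;
    // remove it and the program is back to ordinary serial C.
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        b[i] = 2.0 * a[i];
    }
    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}
Compile with your compiler's OpenMP flag (for example, gcc -fopenmp); without it, the directive is typically ignored and the program simply runs serially.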
Comparison with POSIX Threads (pthreads)
POSIX threads (pthreads) is a low-level, Unix-originated threading API that requires you to create, synchronize, and join threads explicitly, whereas OpenMP expresses parallelism through compiler directives and leaves those details to the runtime.
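To make the comparison concrete, here is roughly the same "hello from every thread" program written with pthreads (the thread count of 4 is arbitrary for the example); compare it with the one-directive OpenMP version shown in the Parallel Directive section below.
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4  // chosen arbitrarily for this example

// With pthreads you write the thread function yourself, pass arguments
// through void pointers, and manage creation and joining explicitly.
void* hello(void* arg) {
    int id = *(int*)arg;
    printf("Hello from thread %d\n", id);
    return NULL;
}

int main() {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, hello, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}
(Compile with -pthread.)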
Note: this post is a code-first introduction to the directives and clauses OpenMP provides. Most of the time you can get away with simple loop parallelization, but when you want to dig deeper into scheduling, private variables, and reductions, this should serve as an all-in-one reference.
OpenMP Directives#
- Parallel Directive
- Sections Directive
- Single Directive
- Master Directive
- Critical Directive
- Barrier Directive
OpenMP Clauses#
- Private Clause
- Firstprivate Clause
- Lastprivate Clause
- Reduction Clause
- Schedule Clause
Parallel Directive#
The parallel directive is used to specify a block of code that should be executed in parallel by multiple threads. This directive is the foundation of OpenMP programming, and it is used to create a team of threads that will execute the code within the parallel region.
#include <omp.h>
#include <stdio.h>
int main() {
#pragma omp parallel
{
printf("Hello from thread %d\n", omp_get_thread_num());
}
return 0;
}
Note: The omp_get_thread_num() function is used to get the thread number of the current thread.
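The size of the thread team is not fixed by the directive itself. As a quick sketch of the usual options (the default depends on your OpenMP implementation): set the OMP_NUM_THREADS environment variable, call omp_set_num_threads(), or attach a num_threads clause to the directive.
#include <omp.h>
#include <stdio.h>

int main() {
    omp_set_num_threads(2);              // request 2 threads for later parallel regions
    #pragma omp parallel num_threads(4)  // the clause overrides that request for this region
    {
        printf("Thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}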
Sections Directive#
The sections directive is used to divide a block of code into multiple sections that can be executed in parallel by different threads. This directive is useful when you have multiple independent tasks that can be executed concurrently.
#include <omp.h>
#include <stdio.h>
int main() {
#pragma omp parallel sections
{
#pragma omp section
{
printf("Section 1: Hello from thread %d\n", omp_get_thread_num());
}
#pragma omp section
{
printf("Section 2: Hello from thread %d\n", omp_get_thread_num());
}
}
return 0;
}
Single Directive#
The Single Directive is used to specify a block of code that should be executed by only one thread in a team. This directive is useful when you want to perform some operation that should only be done once, such as initializing a shared variable or printing a message.
Syntax
#pragma omp single
{
    // code to be executed by a single thread
}
Example
#include <omp.h>
#include <stdio.h>
int main() {
int x = 0;
#pragma omp parallel
{
#pragma omp single
{
x = 10;
printf("Thread %d initialized x to %d\n", omp_get_thread_num(), x);
}
printf("Thread %d sees x as %d\n", omp_get_thread_num(), x);
}
return 0;
}
Output
Thread 0 initialized x to 10
Thread 0 sees x as 10
Thread 1 sees x as 10
Thread 2 sees x as 10
Thread 3 sees x as 10
As you can see, only one thread (here, thread 0, though it could be any thread in the team) initializes the variable x to 10, and all other threads see the updated value of x.
Note
There is an implicit barrier at the end of a single construct, so by default the other threads do wait for the single thread to finish before continuing. You can remove that implicit barrier with the nowait clause; in that case, if you still need a synchronization point, you can place an explicit #pragma omp barrier directive after the single construct.
Here's an updated example that uses nowait together with an explicit barrier:
#include <omp.h>
#include <stdio.h>
int main() {
int x = 0;
#pragma omp parallel
{
#pragma omp single nowait
{
x = 10;
printf("Thread %d initialized x to %d\n", omp_get_thread_num(), x);
}
#pragma omp barrier
printf("Thread %d sees x as %d\n", omp_get_thread_num(), x);
}
return 0;
}
With nowait, threads do not pause at the end of the single construct itself; it is the explicit barrier that ensures every thread sees the initialized x before continuing.
Master Directive#
The master directive is used to specify a block of code that should be executed by the master thread only. The master thread is the thread with thread number 0.
Syntax
#pragma omp master
{
    // code to be executed by the master thread
}
Example
#include <omp.h>
#include <stdio.h>
int main() {
int x = 0;
#pragma omp parallel
{
#pragma omp master
{
x = 10;
printf("Master thread: x = %d\n", x);
}
printf("Thread %d: x = %d\n", omp_get_thread_num(), x);
}
return 0;
}
In this example, the master thread sets the value of x to 10 and prints a message, and all threads then print the value of x. Note that because master provides no synchronization, other threads may read x before the master thread has updated it.
Note
The master directive does not imply a barrier, so the other threads may continue executing without waiting for the master thread to finish its execution.
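Because master provides no synchronization of its own, a common pattern is to pair it with an explicit barrier when the other threads must not run ahead of the master thread's work. A minimal sketch:
#include <omp.h>
#include <stdio.h>

int main() {
    int x = 0;
    #pragma omp parallel
    {
        #pragma omp master
        {
            x = 10;  // only thread 0 executes this
        }
        #pragma omp barrier  // everyone waits here until the master's update is done
        printf("Thread %d sees x = %d\n", omp_get_thread_num(), x);
    }
    return 0;
}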
When to use
The master directive is useful when you need some operation to be performed only once, and specifically by thread 0, such as initializing a shared variable or printing a message.
Comparison with Single Directive
The master directive is similar to the single directive, but a single region can be executed by any thread in the team, while a master region is always executed by the master thread.
Here is an example that demonstrates the difference:
#include <omp.h>
#include <stdio.h>
int main() {
int x = 0;
#pragma omp parallel
{
#pragma omp single
{
x = 10;
printf("Single thread: x = %d\n", x);
}
printf("Thread %d: x = %d\n", omp_get_thread_num(), x);
}
return 0;
}
In this example, the single directive is used to set the value of x to 10. The thread that executes the single region may not be the master thread.
Critical Directive#
The critical directive is used to specify a block of code that should be executed by only one thread at a time. This directive is useful when you need to protect a critical section of code that should not be executed concurrently by multiple threads.
Syntax
#pragma omp critical [(name)]
{
    // code to be executed by only one thread at a time
}
Example
#include <omp.h>
#include <stdio.h>
int main() {
int x = 0;
#pragma omp parallel
{
#pragma omp critical
{
x = x + 1;
printf("Thread %d: x = %d\n", omp_get_thread_num(), x);
}
}
return 0;
}
In this example, the critical directive is used to protect the increment of the variable x. Only one thread can execute the critical section at a time, ensuring that the increment is not interleaved with another thread's update.
Named Critical Sections
You can also give a critical section a name. Critical sections with different names can be executed concurrently, while all unnamed critical sections share a single global lock.
#pragma omp critical (name)
{
    // code to be executed by only one thread at a time
}
Example
#include <omp.h>
#include <stdio.h>
int main() {
int x = 0;
int y = 0;
#pragma omp parallel
{
#pragma omp critical (x_lock)
{
x = x + 1;
printf("Thread %d: x = %d\n", omp_get_thread_num(), x);
}
#pragma omp critical (y_lock)
{
y = y + 1;
printf("Thread %d: y = %d\n", omp_get_thread_num(), y);
}
}
return 0;
}
In this example, we have two named critical sections, x_lock and y_lock. These critical sections can be executed concurrently, but only one thread can execute each critical section at a time.
Best Practices
- Use the critical directive to protect critical sections of code that should not be executed concurrently by multiple threads.
- Use named critical sections to allow independent critical sections to be executed concurrently.
- Minimize the use of critical sections, as they can introduce performance overhead.
- Prefer lighter-weight mechanisms, such as the atomic directive or reduction clauses, for simple updates when possible (see the sketch below).
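For a simple update such as a single increment, the atomic directive mentioned above is usually a lighter-weight alternative to critical. A minimal sketch:
#include <omp.h>
#include <stdio.h>

int main() {
    int counter = 0;
    #pragma omp parallel
    {
        // atomic protects just this one memory update, typically using a
        // hardware atomic instruction rather than acquiring a full lock
        #pragma omp atomic
        counter++;
    }
    printf("counter = %d\n", counter);  // equals the number of threads
    return 0;
}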
Barrier Directive
The barrier directive is used to synchronize all threads in a team. When a thread reaches a barrier, it waits until all other threads in the team have also reached the barrier. Once all threads have reached the barrier, they can all proceed past it.
Syntax
#pragma omp barrier
Example
Here’s an example demonstrating the use of the barrier directive:
#include <omp.h>
#include <stdio.h>
#include <unistd.h>
int main() {
#pragma omp parallel num_threads(4)
{
int thread_num = omp_get_thread_num();
// First part of the work
printf("Thread %d: Starting first part of work\n", thread_num);
sleep(thread_num); // Simulate different work times
#pragma omp barrier
// Second part of the work
printf("Thread %d: Starting second part of work\n", thread_num);
}
return 0;
}
Output
Thread 0: Starting first part of work
Thread 3: Starting first part of work
Thread 2: Starting first part of work
Thread 1: Starting first part of work
Thread 0: Starting second part of work
Thread 1: Starting second part of work
Thread 2: Starting second part of work
Thread 3: Starting second part of work
Explanation
- Four threads are created.
- Each thread prints a message and then sleeps for a duration equal to its thread number (simulating different work times).
- The barrier directive ensures that all threads wait at this point until every thread has reached the barrier.
- Once all threads have reached the barrier, they all proceed to print the second message.
Important Notes
- An implicit barrier exists at the end of parallel, for, sections, and single constructs; for the worksharing constructs (for, sections, single) it can be removed with a nowait clause (see the sketch below).
- Using too many barriers can negatively impact performance, as threads may spend a lot of time waiting for each other.
- Barriers should be used judiciously to ensure correct program behavior while minimizing synchronization overhead.
- Be cautious of potential deadlocks when using barriers, especially in complex parallel regions.
The barrier directive is a powerful tool for synchronizing threads, but it should be used carefully to balance correctness and performance in your parallel programs.
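To illustrate the implicit barriers mentioned above, here is a small sketch in which nowait removes the barrier at the end of the first loop, so a thread that finishes early can move straight on to the second, independent loop:
#include <omp.h>
#include <stdio.h>

#define N 1000

int main() {
    static double a[N], b[N];
    #pragma omp parallel
    {
        // nowait: no implicit barrier at the end of this worksharing loop
        #pragma omp for nowait
        for (int i = 0; i < N; i++) {
            a[i] = i * 2.0;
        }
        // b[] does not depend on a[], so threads may start it immediately
        #pragma omp for
        for (int i = 0; i < N; i++) {
            b[i] = i * 3.0;
        }
    }
    printf("a[10] = %f, b[10] = %f\n", a[10], b[10]);
    return 0;
}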
Private Clause
The private clause is used to specify that each thread should have its own copy of the listed variables. This is useful when you want each thread to work with its own instance of a variable, preventing data races and ensuring thread safety.
Syntax
#pragma omp parallel private(variable1, variable2, …)
or
#pragma omp for private(variable1, variable2, …)
Example
Here’s an example demonstrating the use of the private clause:
#include <omp.h>
#include <stdio.h>
int main() {
int i, sum = 0;
#pragma omp parallel for private(i) reduction(+:sum)
for (i = 1; i <= 100; i++) {
int temp = i * i;
sum += temp;
printf("Thread %d: i = %d, temp = %d\n", omp_get_thread_num(), i, temp);
}
printf("Sum of squares from 1 to 100: %d\n", sum);
return 0;
}
Output
The output might look something like this (note that the order of thread execution may vary):
Thread 0: i = 1, temp = 1
Thread 0: i = 2, temp = 4
...
Thread 1: i = 26, temp = 676
Thread 1: i = 27, temp = 729
...
Thread 2: i = 51, temp = 2601
Thread 2: i = 52, temp = 2704
...
Thread 3: i = 76, temp = 5776
Thread 3: i = 77, temp = 5929
...
Sum of squares from 1 to 100: 338350
Explanation
- The i variable is declared as private. This means each thread gets its own copy of i.
- Inside the parallel for loop, each thread works with its own i, preventing any data races.
- The temp variable is implicitly private because it's declared inside the parallel region.
- Each thread calculates the square of its assigned numbers and adds them to the shared sum variable (using the reduction clause, which is covered later in this post).
Important Notes
- Variables declared as private are not initialized. They contain garbage values when entering the parallel region unless explicitly initialized.
- Changes made to private variables are not visible outside the parallel region.
- If you need to initialize private variables or use their values after the parallel region, consider using the firstprivate or lastprivate clauses (covered next).
- Whole arrays can be listed in a private clause, but be aware that this gives every thread its own full copy of the array.
- The private clause is often used with loop index variables to ensure each thread has its own copy of the loop counter.
The private clause is a fundamental tool in OpenMP for managing data in parallel regions, helping to prevent data races and ensure correct parallel execution.
Firstprivate Clause
The firstprivate clause is similar to the private clause, but it also initializes each private copy with the value the original variable had before entering the parallel region.
Syntax
#pragma omp parallel firstprivate(variable1, variable2, …)
or
#pragma omp for firstprivate(variable1, variable2, …)
Example
Here’s an example demonstrating the use of the firstprivate clause:
#include <omp.h>
#include <stdio.h>
int main() {
int x = 5;
#pragma omp parallel firstprivate(x)
{
printf("Thread %d: x initially = %d\n", omp_get_thread_num(), x);
x += omp_get_thread_num();
printf("Thread %d: x after modification = %d\n", omp_get_thread_num(), x);
}
printf("Outside parallel region: x = %d\n", x);
return 0;
}
Output
The output might look something like this:
Thread 0: x initially = 5
Thread 1: x initially = 5
Thread 2: x initially = 5
Thread 3: x initially = 5
Thread 0: x after modification = 5
Thread 1: x after modification = 6
Thread 2: x after modification = 7
Thread 3: x after modification = 8
Outside parallel region: x = 5
Explanation
- The variable x is initialized to 5 before the parallel region.
- Using firstprivate(x), each thread gets its own copy of x, initialized to 5.
- Each thread then modifies its private copy of x by adding its thread number.
- After the parallel region, the original x remains unchanged (still 5).
Important Notes
- firstprivate is particularly useful when you need to use the initial value of a variable in a parallel region, but also want each thread to have its own copy.
- For complex objects (like C++ objects with constructors), firstprivate ensures that the copy constructor is called to initialize the private copies.
- firstprivate can have a performance overhead compared to private, especially for large objects, as it needs to initialize each private copy.
- Changes made to firstprivate variables are not visible outside the parallel region unless you use other mechanisms (like reduction or lastprivate).
- You can use firstprivate with arrays, but be cautious with large arrays as it will create a full copy for each thread.
Comparison with private
- private creates uninitialized copies for each thread.
- firstprivate creates initialized copies, with the value from before entering the parallel region.
The firstprivate clause is very useful when you need to work with initialized private copies in your parallel regions, combining the benefits of data isolation with proper initialization.
Lastprivate Clause#
The lastprivate clause is used to specify that the value of one or more private variables should be copied back to the original variables after the parallel region or loop. Specifically, it copies the value from the last logical iteration of the loop or the last section of a sections construct.
Syntax
#pragma omp parallel for lastprivate(variable1, variable2, …)
or
#pragma omp sections lastprivate(variable1, variable2, …)
Example
Here’s an example demonstrating the use of the lastprivate clause:
#include <omp.h>
#include <stdio.h>
int main() {
int i, x = 0;
#pragma omp parallel for lastprivate(x)
for (i = 0; i < 100; i++) {
x = i;
printf("Thread %d: i = %d, x = %d\n", omp_get_thread_num(), i, x);
}
printf("After parallel for: x = %d\n", x);
return 0;
}
Output
The output might look something like this (note that the order of thread execution may vary):
Thread 0: i = 0, x = 0
Thread 1: i = 25, x = 25
Thread 2: i = 50, x = 50
Thread 3: i = 75, x = 75
...
Thread 3: i = 99, x = 99
After parallel for: x = 99
Explanation
- The variable x is declared as lastprivate.
- Each thread has its own private copy of x during the parallel for loop.
- After the loop completes, the value of x from the last logical iteration (i = 99) is copied back to the original x.
- Outside the parallel region, x has the value from the last iteration (99).
Important Notes
- lastprivate is particularly useful when you need the final value of a variable after a parallel loop or sections construct.
- For loops, the "last" iteration is determined by the logical order of the loop, not the actual last thread to finish.
- For sections constructs, the last section in lexical order determines the lastprivate value.
- lastprivate can be combined with firstprivate if you need both initialization and final value preservation (see the sketch after the comparison below).
- There's a potential performance overhead with lastprivate, as it requires additional synchronization to determine and copy the final value.
Comparison with private and firstprivate
- private: creates uninitialized copies; final values are not preserved.
- firstprivate: creates initialized copies; final values are not preserved.
- lastprivate: may create uninitialized copies (unless combined with firstprivate), but preserves the final value from the logically last iteration or section.
The lastprivate clause is valuable when you need to capture the final state of a variable after parallel execution, especially in loops where the final iteration produces an important result.
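As noted above, lastprivate can be combined with firstprivate when you want both an initialized private copy and the final value. A small sketch (the loop body is only illustrative, and the initial value is not actually needed for the result; the point is that both clauses can legally name the same variable):
#include <omp.h>
#include <stdio.h>

int main() {
    int x = 100;
    // firstprivate: each thread's copy of x starts at 100
    // lastprivate: the copy from the logically last iteration (i = 9) is written back
    #pragma omp parallel for firstprivate(x) lastprivate(x)
    for (int i = 0; i < 10; i++) {
        x = 100 + i;  // overwrite rather than accumulate so the result is deterministic
    }
    printf("After the loop: x = %d\n", x);  // prints 109
    return 0;
}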
Reduction Clause
The reduction clause is used to perform a reduction operation on one or more variables across all threads in a parallel region. It's particularly useful for operations like summing, finding the maximum or minimum, or performing logical operations across a set of values computed in parallel.
Syntax
#pragma omp parallel for reduction(operator:variable)
Where operator can be:
- +, -, *, &, |, ^, &&, || for built-in types
- max, min for arithmetic types
Example
Here’s an example demonstrating the use of the reduction clause to calculate the sum of squares:
#include <omp.h>
#include <stdio.h>
int main() {
long long sum = 0;
int i, n = 1000000;
#pragma omp parallel for reduction(+:sum)
for (i = 1; i <= n; i++) {
sum += (long long)i * i; // cast so the multiplication doesn't overflow int
}
printf("Sum of squares from 1 to %d: %lld\n", n, sum);
return 0;
}
Output
The output will be:
Sum of squares from 1 to 1000000: 333333833333500000
Explanation
- The sum variable is specified as a reduction variable with the + operator.
- Each thread computes a partial sum for its chunk of the loop.
- OpenMP automatically combines these partial sums using the specified operator (+ in this case).
- The final result is stored in the original sum variable.
Important Notes
- OpenMP creates a private copy of the reduction variable for each thread, initialized according to the reduction operator (e.g., 0 for +, 1 for *, etc.).
- At the end of the parallel region, these private copies are combined using the specified operator to produce the final result.
- Reduction operations are both thread-safe and efficient, as they avoid the need for explicit synchronization.
- You can specify multiple reduction variables in one clause, and use several reduction clauses on the same construct: #pragma omp parallel for reduction(+:sum1,sum2) reduction(max:maxval)
- Custom reduction operations can be defined for user-defined types using the declare reduction directive (a sketch follows below).
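As a hedged sketch of declare reduction (the struct and the reduction name cplx_add are invented for this example; syntax per OpenMP 4.0 and later), here is a user-defined reduction that sums complex numbers stored in a plain C struct:
#include <omp.h>
#include <stdio.h>

typedef struct { double re, im; } complex_t;

// Combiner: how two partial results are merged (omp_out += omp_in).
// Initializer: how each thread's private copy (omp_priv) starts out.
#pragma omp declare reduction(cplx_add : complex_t : omp_out.re += omp_in.re, omp_out.im += omp_in.im) initializer(omp_priv = (complex_t){0.0, 0.0})

int main() {
    complex_t sum = {0.0, 0.0};
    #pragma omp parallel for reduction(cplx_add : sum)
    for (int i = 0; i < 1000; i++) {
        sum.re += 1.0;
        sum.im += 2.0;
    }
    printf("sum = (%.1f, %.1f)\n", sum.re, sum.im);  // expected (1000.0, 2000.0)
    return 0;
}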
Common Use Cases
- Summing values: reduction(+:sum)
- Finding maximum: reduction(max:max_value) (see the sketch after this list)
- Finding minimum: reduction(min:min_value)
- Logical AND of conditions: reduction(&&:all_true)
- Logical OR of conditions: reduction(||:any_true)
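For instance, a max reduction over an array (the array contents are arbitrary here) looks like this:
#include <omp.h>
#include <stdio.h>

int main() {
    int n = 1000;
    double data[1000];
    for (int i = 0; i < n; i++) {
        data[i] = (i * 37) % 1001;  // arbitrary values for the example
    }
    double max_value = data[0];
    #pragma omp parallel for reduction(max:max_value)
    for (int i = 0; i < n; i++) {
        if (data[i] > max_value) {
            max_value = data[i];
        }
    }
    printf("max = %f\n", max_value);
    return 0;
}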
Performance Considerations
Reductions are generally very efficient, as they allow each thread to work independently and combine results only at the end. This often leads to good scalability. However, for very small loops or when the reduction operation is very simple, the overhead of creating and managing threads might outweigh the benefits of parallelization.
The reduction clause is a powerful feature in OpenMP that simplifies parallel aggregation operations and helps avoid common pitfalls like race conditions when accumulating results across threads.
Schedule Clause
The schedule clause is used to specify how iterations of a loop are divided among threads in a parallel region. It allows you to control the workload distribution, which can significantly impact the performance of your parallel program.
Syntax
#pragma omp parallel for schedule(kind[, chunk_size])
Where kind can be:
- static
- dynamic
- guided
- auto
- runtime
Types of Schedules
Static Schedule
- Iterations are divided into chunks of size chunk_size and distributed to threads in a round-robin fashion.
- If chunk_size is not specified, iterations are divided roughly equally among threads.
Dynamic Schedule
- Iterations are divided into chunks of size chunk_size.
- Each thread requests a new chunk when it finishes its current chunk.
Guided Schedule
- Similar to dynamic, but the chunk size starts large and decreases over time.
- The chunk size is proportional to the number of unassigned iterations divided by the number of threads.
Auto Schedule
- The scheduling decision is delegated to the compiler and/or runtime system.
Runtime Schedule
- The schedule is determined at runtime by the OMP_SCHEDULE environment variable or a call to omp_set_schedule().
Example
Here’s an example demonstrating different schedule types:
#include <omp.h>
#include <stdio.h>

// Run a loop with schedule(runtime) after selecting the schedule via omp_set_schedule().
void test_schedule(const char* label, omp_sched_t kind, int chunk) {
    int i, n = 20;
    omp_set_schedule(kind, chunk); // chunk <= 0 means "use the default chunk size"
    #pragma omp parallel for schedule(runtime)
    for (i = 0; i < n; i++) {
        printf("Thread %d executing iteration %d\n", omp_get_thread_num(), i);
    }
    printf("\nTested with schedule type: %s\n\n", label);
}

int main() {
    test_schedule("static", omp_sched_static, 0);
    test_schedule("static,2", omp_sched_static, 2);
    test_schedule("dynamic", omp_sched_dynamic, 0);
    test_schedule("dynamic,2", omp_sched_dynamic, 2);
    test_schedule("guided", omp_sched_guided, 0);
    test_schedule("auto", omp_sched_auto, 0);
    return 0;
}
Setting the OMP_SCHEDULE environment variable before launching the program has a similar effect on loops that use schedule(runtime).
Choosing the Right Schedule
- Use static when iterations have similar workloads and you want to minimize scheduling overhead.
- Use dynamic or guided when iterations have varying workloads to achieve better load balancing (see the sketch after this list).
- Use auto when you're unsure and want the system to decide.
- Use runtime for flexibility in testing different schedules without recompiling.
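As a rough illustration of why dynamic scheduling helps with uneven work (the workload function below is invented for the example): with schedule(static) the thread that owns the expensive iterations becomes a straggler, while schedule(dynamic) hands out new iterations as threads become free.
#include <omp.h>
#include <stdio.h>

// Simulated uneven workload: later iterations do much more work.
double work(int i) {
    double s = 0.0;
    for (int j = 0; j < i * 1000; j++) {
        s += j * 0.5;
    }
    return s;
}

int main() {
    int n = 1000;
    double total = 0.0;
    double start = omp_get_wtime();
    // Try replacing dynamic with static and compare the elapsed times.
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += work(i);
    }
    printf("total = %e, elapsed = %f s\n", total, omp_get_wtime() - start);
    return 0;
}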
The choice of schedule can significantly affect performance, especially for loops with irregular workloads or when dealing with NUMA (Non-Uniform Memory Access) architectures.