Streams, events, and synchronization

GPU programming with HIP

2026-05

CSC Training

Motivation

  • So far we have learned a “serial” way of programming for GPUs
  • Questions we now want to answer:
    • Which operations could overlap?
    • Could multiple kernels execute simultaneously?
    • Could me move data between the host and device during kernel execution?

Outline

  • Streams
  • Events
  • Synchronization

Outline

  • Streams
    • Kernels in different streams are asynchronous
    • Kernels in same stream are executed in first-in-first-out order
    • Execute Host-to-Device and Device-to-Host transfers concurrently with kernels
  • Events
    • Synchronize across streams and host
    • Measure time
  • Synchronization
    • host ⇔ stream, host ⇔ event, host ⇔ device
    • stream ⇔ event
    • threads in block

What is a stream?

  • A sequence (queue) of operations that execute in order on the GPU
  • Operations in different streams may run concurrently

  • H-to-D copy runs in a single stream, and the kernel and D-to-H copy are split into 4 streams

  • H-to-D copy, kernel, and D-to-H copy are split into 4 streams

The default stream

  • When you do not specify a stream to your kernel, it is sent to the default stream
    • my_kernel<<<grid, block, 0, 0>>>(...)
  • API functions operate on the default stream: hipMalloc, hipMemcpy, hipFree, ...
  • All operations (Malloc, Memcpy, Kernel execution) in the same stream execute sequentially in submission order
    • Hence, the sequential nature of what you have learnt so far

The order of execution in a stream

  • Operations are sent to the stream and executed in a FIFO manner

Stream creation, synchronization, and destruction

  • Declare a stream variable
hipStream_t stream
  • Create stream
hipError_t hipStreamCreate ( hipStream_t* stream )
  • Destroy stream
hipError_t hipStreamDestroy ( hipStream_t stream )

Using multiple streams

  • H-to-D copy runs in a single stream, and the kernel and D-to-H copy are split into 4 streams

  • H-to-D copy, kernel, and D-to-H copy are split into 4 streams

Asynchronous operations and streams

  • API functions operate on the default stream: hipMalloc, hipMemcpy, hipFree, ...
  • Append Async to name and add hipStream_t as last argument for asynchronous version:
    • hipMalloc(...)hipMallocAsync(..., hipStream_t stream)

Asynchronisity and kernels

  • Running kernels concurrently require placing them in different streams
    • Default stream has special synchronization rules and cannot run concurrently with other streams (applies to all API calls)
  • The stream is supplied to the kernel invocation:
    • my_kernel<<<grid, block, 0, stream>>>(...)
    • hipLaunchKernelGGL(my_kernel, grid, block, 0, stream, ...)
    • Default stream: my_kernel<<<grid, block, 0, 0>>>(...)

// Use the default stream
hipkernel<<<grid, block>>>(args);
// Use the default stream
hipkernel<<<grid, block, bytes, 0>>>(args);
// Use the stream strm[i]
hipkernel<<<grid, block, bytes, strm[i]>>>(args);

Stream example

hipStream_t stream[3];
for (int i = 0; i<3; ++i) 
  hipStreamCreate(&stream[i]);

for (int i = 0; i < 3; ++i) {
  hipMemcpyAsync(d_data[i], h_data[i], bytes, 
    hipMemcpyHostToDevice, stream[i]);

  hipkernel<<<grid, block, 0, stream[i]>>>
    (d_data[i], i);

  hipMemcpyAsync(h_data[i], d_data[i],  bytes, 
    hipMemcpyDeviceToHost, stream[i]);
}

for(int i = 0; i<3; ++i) {
  hipStreamSynchronize(stream[i]);
  hipStreamDestroy(stream[i]); }

Memory caveat:

  • Host memory needs to be page-locked, otherwise memory copies are synchronous
hipError_t hipHostMalloc(void **ptr, size_t size);
hipError_t hipHostFree(void *ptr);

Async memory copy with regular vs page-locked memory

Summary before moving to events

  • When you do not specify a stream to your kernel, it is sent to the default stream
  • Operations in a stream execute in a FIFO manner
  • Multiple streams can execute concurrently on the same GPU

Events

Why events?

  • Cut stream to fragments
    • Useful for inter-stream synchronization and timing asynchronous events
  • Events have a boolean state: occurred / not occurred
    • Query with hipError_t hipEventQuery(hipEvent_t event): hipSuccess/hipErrorNotReady

Measure how fast host places tasks to stream:

  // Start timed GPU kernel
  clock_t start_kernel_clock = clock();
  kernel<<<gridsize, blocksize, 0, stream>>>(d_a, n_total);

  // Start timed device-to-host memcopy
  clock_t start_d2h_clock = clock();
  hipMemcpyAsync(a, d_a, bytes, hipMemcpyDeviceToHost, stream);

  // Stop timing
  clock_t stop_clock = clock();
  hipStreamSynchronize(stream);

Measure duration of tasks on GPU:

  // Start timed GPU kernel
  hipEventRecord(start_kernel_event, stream);
  kernel<<<gridsize, blocksize, 0, stream>>>(d_a, n_total);

  // Start timed device-to-host memcopy
  hipEventRecord(start_d2h_event, stream);
  hipMemcpyAsync(a, d_a, bytes, hipMemcpyDeviceToHost, stream);

  // Stop timing
  hipEventRecord(stop_event, stream);
  hipEventSynchronize(stop_event);

Events: Central API calls

Description API call
Initialize event object hipEventCreate(hipEvent_t* event)
Record an event in the stream hipEventRecord(hipEvent_t event, hipStream_t stream)
Elapsed time (ms) between start and end hipEventElapsedTime(float* ms, hipEvent_t start, hipEvent_t end)
Make stream wait for event hipStreamWaitEvent(hipStream_t stream, hipEvent_t event, unsigned int flags = 0)
Wait for event to complete hipEventSynchronize(hipEvent_t event)
Destroy event object hipEventDestroy(hipEvent_t event)
  • All of the above return hipError_t

Lastly, synchronization

  • GPU operations are asynchronous with respect to the host
    • The CPU may continue executing while kernels are still running
    • Memory copies and kernels may overlap
  • Synchronization is needed when:
    • The host (CPU) needs results produced by the GPU
    • One operation depends on another operation completing
  • Without synchronization:
    • The CPU may access incomplete data
    • Operations may execute in the wrong order

Central synchronization API calls

  • Synchronize the host with a specific stream
hipError_t hipStreamSynchronize ( hipStream_t stream )
  • Synchronize the host with a specific event
hipError_t hipEventSynchronize ( hipEvent_t event )
  • Synchronize a specific stream with a specific event (the event can be in another stream)
hipError_t hipStreamWaitEvent ( hipStream_t stream, hipEvent_t event, unsigned int  flags = 0 )
  • Synchronize the host with the whole device (wait until all device tasks are finished)
hipError_t hipDeviceSynchronize ( void ) 
  • In-kernel blockwise synchronization across threads (not between host/device)
__syncthreads()

Summary

  • Streams provide a mechanism to compute tasks on the GPU concurrently
  • Events provide a mechanism to signal when operations have occurred in a stream
    • Good for inter-stream sychronization and timing events
  • Many host/device synchronizations functions for different purposes
    • The device function __syncthreads() is only for in-kernel synchronization between threads in a same block (does not synch threads across blocks)