GPU programming with HIP
2026-05
CSC Training
malloc returns an address in virtual memory

Memory management can be explicit (malloc/free, hipMalloc/hipFree, etc.) or implicit (unified memory).

Explicit memory management
```cpp
int main() {
    int *A, *d_A;
    A = (int *) malloc(N * sizeof(int));
    hipMalloc((void**)&d_A, N * sizeof(int));
    ...
    /* Copy data to GPU and launch kernel */
    hipMemcpy(d_A, A, N * sizeof(int), hipMemcpyHostToDevice);
    kernel<<<...>>>(d_A);
    hipMemcpy(A, d_A, N * sizeof(int), hipMemcpyDeviceToHost);
    hipFree(d_A);
    // result is in A
    free(A);
}
```

Pros: full control over when and how data is moved
Cons: more code to write, and data transfers must be kept consistent with kernel execution by hand
Unified memory automatically migrates memory pages between CPU and GPU
Without prefetching: pages are migrated on demand when first touched (page fault), which can slow down the first accesses in a kernel
With prefetching: the programmer can proactively move pages to the GPU before execution
Allocate device memory: hipMalloc()
Copy data: hipMemcpy(dst, src, count, kind), where kind is one of:
hipMemcpyDefault, hipMemcpyHostToDevice, hipMemcpyDeviceToHost,
hipMemcpyHostToHost, hipMemcpyDeviceToDevice
Deallocate device memory: hipFree()
Also known as managed memory
Allocate unified memory: hipMallocManaged()
Deallocate unified memory: hipFree() (same as explicitly managed memory)
Prefetch (asynchronously): hipMemPrefetchAsync()
Advise about memory access: hipMemAdvise() (more in the HIP API documentation)
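As a sketch of how the calls above fit together, assuming an N-element array on device 0 (the kernel launch itself is omitted; hipCpuDeviceId is the HIP constant for prefetching back to the host):

```cpp
#include <hip/hip_runtime.h>

int main() {
    const int N = 1 << 20;
    int device = 0;
    hipStream_t stream;
    hipStreamCreate(&stream);

    // Allocate unified (managed) memory, accessible from both CPU and GPU
    int *A;
    hipMallocManaged(&A, N * sizeof(int));

    // Initialize on the host; pages initially reside in CPU memory
    for (int i = 0; i < N; i++) A[i] = i;

    // Advise the runtime that the data is mostly read on the device
    hipMemAdvise(A, N * sizeof(int), hipMemAdviseSetReadMostly, device);

    // Prefetch pages to the GPU before the kernel runs (asynchronous)
    hipMemPrefetchAsync(A, N * sizeof(int), device, stream);

    // kernel<<<..., stream>>>(A);  // kernel launch omitted in this sketch

    // Prefetch back to the host before CPU access
    hipMemPrefetchAsync(A, N * sizeof(int), hipCpuDeviceId, stream);
    hipStreamSynchronize(stream);

    hipFree(A);  // deallocation is the same as for explicit device memory
    hipStreamDestroy(stream);
}
```

Without the prefetch calls the code still works: pages simply migrate on demand at first touch.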

malloc allows swapping, page migration, and page faults
hipHostMalloc() page-locks (pins) the allocation to a physical memory location; it is freed with hipHostFree()
Benefits of page-locking:
Faster transfers between host and device
Enables asynchronous copies: hipMemcpy() calls are blocking (i.e. synchronizing), whereas hipMemcpyAsync() can overlap transfers with other work when the host buffer is page-locked

Page-locked host memory
Allocate/free page-locked host memory: hipHostMalloc() / hipHostFree()
Memory copy functions are the same as with normally allocated memory
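A minimal sketch of page-locked allocation combined with asynchronous copies (kernel launch omitted; N, the stream, and d_A are illustrative assumptions):

```cpp
#include <hip/hip_runtime.h>

int main() {
    const int N = 1 << 20;
    hipStream_t stream;
    hipStreamCreate(&stream);

    // Allocate page-locked (pinned) host memory instead of malloc
    int *A;
    hipHostMalloc((void**)&A, N * sizeof(int));

    int *d_A;
    hipMalloc((void**)&d_A, N * sizeof(int));

    for (int i = 0; i < N; i++) A[i] = i;

    // Asynchronous copies can overlap with host work only when the
    // host buffer is page-locked
    hipMemcpyAsync(d_A, A, N * sizeof(int), hipMemcpyHostToDevice, stream);
    // kernel<<<..., stream>>>(d_A);  // kernel launch omitted in this sketch
    hipMemcpyAsync(A, d_A, N * sizeof(int), hipMemcpyDeviceToHost, stream);
    hipStreamSynchronize(stream);

    hipFree(d_A);
    hipHostFree(A);  // page-locked memory is freed with hipHostFree
    hipStreamDestroy(stream);
}
```

Note that the copy calls are the same as with malloc'd memory; only the allocation and deallocation calls change.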
| Description | API call |
|---|---|
| Allocate memory from the pool. If the pool is too small, more memory is assigned to it. | hipMallocAsync(void** devPtr, size_t size, hipStream_t hStream) |
| Return memory to the pool in the specified stream | hipFreeAsync(void* devPtr, hipStream_t hStream) |
Example 1 - slow
```cpp
for (int i = 0; i < 100; i++) {
    // Allocate memory here (slow)
    hipMalloc(&ptr, size);
    // Run GPU kernel
    kernel<<<..., stream>>>(ptr);
    // Deallocate memory here
    hipFree(ptr);
}
// Synchronize the default stream (no influence on memory allocations)
hipStreamSynchronize(0);
```

Example 2 - fast
```cpp
for (int i = 0; i < 100; i++) {
    // Obtain unused memory from the current memory pool;
    // more memory is allocated for the pool if needed
    hipMallocAsync(&ptr, size, stream);
    // Run GPU kernel
    kernel<<<..., stream>>>(ptr);
    // Return memory to the current memory pool
    hipFreeAsync(ptr, stream);
}
// Synchronize
hipStreamSynchronize(stream);
```