Using Shared Memory in CUDA C/C++
In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride affect coalescing for various generations of CUDA hardware. For recent versions of CUDA hardware, misaligned data accesses are not a big issue. However, striding through global memory is problematic regardless of the generation of the CUDA hardware, and would seem to be unavoidable in many cases, such as when accessing elements in a multidimensional array along the second and higher dimensions. It is possible, however, to coalesce memory access in such cases if we use shared memory. Before I show you how to avoid striding through global memory in the next post, first I need to describe shared memory in some detail.
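To illustrate what striding through global memory means here, the sketch below compares a row access, where consecutive threads touch consecutive addresses, with a column access along the second dimension of a row-major array, where consecutive threads touch addresses a full row apart. The kernel names and launch configuration are my own illustrative assumptions, not from the post.

// Coalesced: adjacent threads read adjacent elements of a row, so a warp
// touches consecutive addresses that can be combined into few transactions.
__global__ void copyRow(float *out, const float *in, int n)
{
    int i = blockIdx.x * n + threadIdx.x;   // stride 1 between adjacent threads
    out[i] = in[i];
}

// Strided: adjacent threads read down a column of a row-major n x n array,
// so their addresses are n floats apart and cannot coalesce well.
__global__ void copyColumn(float *out, const float *in, int n)
{
    int i = threadIdx.x * n + blockIdx.x;   // stride n between adjacent threads
    out[i] = in[i];
}

// Possible launches for an n x n array (one block per row or per column):
//   copyRow<<<n, n>>>(d_out, d_in, n);
//   copyColumn<<<n, n>>>(d_out, d_in, n);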
Shared Memory

Because it is on-chip, shared memory is much faster than local and global memory. In fact, shared memory latency is roughly 100x lower than uncached global memory latency (provided that there are no bank conflicts between the threads, which we will examine later in this post). Shared memory is allocated per thread block, so all threads in the block have access to the same shared memory. Threads can access data in shared memory loaded from global memory by other threads within the same thread block. This capability (combined with thread synchronization) has a number of uses, such as user-managed data caches, high-performance cooperative parallel algorithms (such as parallel reductions), and to facilitate global memory coalescing in cases where it would otherwise not be possible.
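As a concrete sketch of the allocation model (the kernel names and the 64-thread block size are illustrative assumptions): shared memory is declared statically with the __shared__ qualifier when the size is known at compile time, or dynamically with extern __shared__, with the size in bytes passed as the third execution configuration parameter at launch.

// Static allocation: the array size is fixed at compile time.
__global__ void staticShared(float *d)
{
    __shared__ float s[64];       // one element per thread of a 64-thread block
    int t = threadIdx.x;
    s[t] = d[t];                  // stage the element in on-chip shared memory
    d[t] = s[t];                  // each thread reads back only its own element here;
                                  // reading another thread's element needs a barrier (below)
}

// Dynamic allocation: the size is supplied at kernel launch.
__global__ void dynamicShared(float *d)
{
    extern __shared__ float s[];  // unsized; the byte count comes from the launch configuration
    int t = threadIdx.x;
    s[t] = d[t];
    d[t] = s[t];
}

// Launches for one block of 64 threads:
//   staticShared<<<1, 64>>>(d_data);
//   dynamicShared<<<1, 64, 64 * sizeof(float)>>>(d_data);  // third argument: shared memory bytes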
Thread Synchronization

When sharing data between threads, we need to be careful to avoid race conditions, because while threads in a block run logically in parallel, not all threads can execute physically at the same time. Let’s say that two threads A and B each load a data element from global memory and store it to shared memory. Then thread A wants to read B’s element from shared memory, and vice versa. Let’s assume that A and B are threads in two different warps. If B has not finished writing its element before A tries to read it, we have a race condition, which can lead to undefined behavior and incorrect results.
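Here is one way the A/B scenario might look in code, as a sketch under my own naming (the excerpt above does not show this kernel): each thread writes its element to shared memory and then reads the element written by the thread at the mirrored index. The __syncthreads() barrier, which is standard CUDA, ensures every write completes before any thread reads.

__global__ void reverseShared(int *d, int n)
{
    extern __shared__ int s[];    // sized to n ints at launch
    int t  = threadIdx.x;
    int tr = n - t - 1;           // index written by a different thread
    s[t] = d[t];                  // the "store to shared memory" step (B's write)
    __syncthreads();              // barrier: no thread continues until all writes land
    d[t] = s[tr];                 // now it is safe to read another thread's element (A's read)
}

// Launch with a single block of n threads and n * sizeof(int) bytes of shared memory:
//   reverseShared<<<1, n, n * sizeof(int)>>>(d_data, n);

Note that __syncthreads() is a barrier for the whole thread block: every thread must reach it, so it must not be placed in conditional code that only some threads of the block execute.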