c++ - Any CUDA operation after cudaStreamSynchronize blocks until all streams are finished


While profiling my CUDA application with the NVIDIA Visual Profiler, I noticed that any operation after cudaStreamSynchronize blocks until all streams are finished. This is very strange behavior, because if cudaStreamSynchronize returns, that means the stream is finished, right? Here is my pseudo-code:

    std::list<std::thread> waitingThreads;

    void startKernelSync() {
        for (int i = 0; i < 200; ++i) {
            // Stage the input in pinned host memory.
            cudaHostAlloc(&cpuPinnedMemory, size, cudaHostAllocDefault);
            memcpy(cpuPinnedMemory, data, size);

            cudaMalloc(&gpuMemory, size);
            cudaStreamCreate(&stream);
            cudaMemcpyAsync(gpuMemory, cpuPinnedMemory, size, cudaMemcpyHostToDevice, stream);

            runKernel<<<32, 32, 0, stream>>>(gpuMemory);
            cudaMemcpyAsync(cpuPinnedMemory, gpuMemory, size, cudaMemcpyDeviceToHost, stream);

            // One host thread per stream waits for that stream to finish.
            waitingThreads.push_back(std::move(std::thread(waitForFinish, cpuPinnedMemory, stream)));
        }

        while (waitingThreads.size() > 0) {
            waitingThreads.front().join();
            waitingThreads.pop_front();
        }
    }

    void waitForFinish(void* cpuPinnedMemory, cudaStream_t stream, ...) {
        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream); // <== blocks until all streams are finished
        memcpy(data, cpuPinnedMemory, size);
        cudaFreeHost(cpuPinnedMemory);
        cudaFree(gpuMemory);
    }

If I put cudaFreeHost before cudaStreamDestroy, then cudaFreeHost becomes the blocking operation.

Is there anything conceptually wrong here?

EDIT: I found another strange behavior: sometimes it unblocks in the middle of processing the streams and then processes the rest of the streams.

Normal behavior:

(profiler screenshot: normal behavior)

Strange behavior (happens quite often):

(profiler screenshot: weird behavior)

EDIT2: I'm testing this on a Tesla K40c card with compute capability 3.5 on CUDA 6.0.

As suggested in the comments, it may be viable to reduce the number of streams. However, the memory transfers in my application are quite fast, and I want to use streams mainly to schedule work on the GPU dynamically. The problem is that after a stream finishes, I need to download the data from the pinned memory and clean up the allocated memory for further streams, and that cleanup appears to be the blocking operation.

I am using a stream for each data set because every data set

I do not know why the operations are blocking, but I have concluded that I cannot do anything about it. So I decided to implement memory and stream pooling (as suggested in the comments): GPU memory, pinned CPU memory, and streams are reused, which avoids deletion of any kind.

If anyone is interested, here is my solution. Starting a kernel behaves as an asynchronous operation that schedules the kernel, and a callback is called after the kernel finishes.

    std::vector<Instance*> m_idleInstances;
    std::vector<Instance*> m_workingInstances;

    void startKernelAsync(...) {
        // Search for a finished stream.
        while (m_idleInstances.size() == 0) {
            findFinishedInstance();
            if (m_idleInstances.size() == 0) {
                std::chrono::milliseconds dur(10);
                std::this_thread::sleep_for(dur);
            }
        }

        Instance* instance = m_idleInstances.back();
        m_idleInstances.pop_back();

        // Fill CPU pinned memory.

        cudaMemcpyAsync(..., stream);
        runKernel<<<32, 32, 0, stream>>>(gpuMemory);
        cudaMemcpyAsync(..., stream);

        m_workingInstances.push_back(instance);
    }

    void findFinishedInstance() {
        for (auto it = m_workingInstances.begin(); it != m_workingInstances.end();) {
            Instance* inst = *it;
            // Non-blocking check: cudaSuccess means all work in the stream is done.
            cudaError_t status = cudaStreamQuery(inst->stream);
            if (status == cudaSuccess) {
                it = m_workingInstances.erase(it);
                m_callback(inst->clusterGroup);
                m_idleInstances.push_back(inst);
            } else {
                ++it;
            }
        }
    }

And at the end, just wait for everything to finish:

    virtual void waitForFinish() {
        while (m_workingInstances.size() > 0) {
            Instance* instance = m_workingInstances.back();
            m_workingInstances.pop_back();
            m_idleInstances.push_back(instance);
            cudaStreamSynchronize(instance->stream);
            finalizeInstance(instance);
        }
    }

And here is a graph from the profiler; it works like a charm!
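
For readers who want to try the idle/working pool pattern without a GPU, here is a minimal host-only sketch. It is not the code from this post: `std::future` stands in for a CUDA stream, a zero-timeout `wait_for` stands in for `cudaStreamQuery(...) == cudaSuccess`, and `get()` stands in for `cudaStreamSynchronize`. The names `InstancePool`, `startAsync`, `idleCount`, and `dataSetId` are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <future>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for one pooled instance. In real CUDA code this would
// own a cudaStream_t plus its pinned host buffer and its device buffer.
struct Instance {
    std::future<void> pending;  // stands in for work queued on a stream
    int dataSetId = -1;
};

class InstancePool {
public:
    InstancePool(std::size_t count, std::function<void(int)> callback)
        : m_storage(count), m_callback(std::move(callback)) {
        for (Instance& inst : m_storage) m_idle.push_back(&inst);
    }

    // Waits only until one pooled instance is free, then schedules work on it.
    void startAsync(int dataSetId, std::chrono::milliseconds workTime) {
        while (m_idle.empty()) {
            findFinishedInstances();
            if (m_idle.empty())
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        Instance* inst = m_idle.back();
        m_idle.pop_back();
        inst->dataSetId = dataSetId;
        inst->pending = std::async(std::launch::async, [workTime] {
            std::this_thread::sleep_for(workTime);  // simulated copy, kernel, copy
        });
        m_working.push_back(inst);
    }

    // Non-blocking poll over the working set; finished instances are recycled.
    void findFinishedInstances() {
        for (auto it = m_working.begin(); it != m_working.end();) {
            Instance* inst = *it;
            if (inst->pending.wait_for(std::chrono::seconds(0)) ==
                std::future_status::ready) {
                inst->pending.get();
                it = m_working.erase(it);
                m_callback(inst->dataSetId);
                m_idle.push_back(inst);
            } else {
                ++it;
            }
        }
    }

    // Drains all remaining work, blocking on each instance in turn.
    void waitForFinish() {
        while (!m_working.empty()) {
            Instance* inst = m_working.back();
            m_working.pop_back();
            inst->pending.get();
            m_callback(inst->dataSetId);
            m_idle.push_back(inst);
        }
    }

    std::size_t idleCount() const { return m_idle.size(); }

private:
    std::vector<Instance> m_storage;   // fixed set, never freed during the run
    std::vector<Instance*> m_idle;
    std::vector<Instance*> m_working;
    std::function<void(int)> m_callback;
};
```

Constructing `InstancePool pool(2, callback)`, calling `pool.startAsync(i, std::chrono::milliseconds(5))` for each data set, and finishing with `pool.waitForFinish()` invokes the callback once per data set while never allocating or freeing an instance mid-run. In this sketch the callback runs on the calling thread during polling or draining, not on a worker thread.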
