HWBOT Community Forums


Posts posted by _mat_

  1. I would say there is a good chance that a decent chip, although not the best, won't get as much attention as the best samples and can be an overlooked treasure. Some chips (and certainly not all) require a huge time investment to figure out how to get past their handicaps. They might be coldbugging, illogical, bad-mannered bastards, but they can still be winners. :)

  2. That's a CUDA error that happens directly after the calculation kernel, while waiting for the GPU to return the data. Something like this occurs, for example, when something went wrong with the memory (a read or write in unallocated areas). Is the card heavily overclocked? Are you using high batch and reduction sizes? Try stock clocks and the lowest sizes and see if that's the problem.

     

    Btw, you should have also gotten a detailed error message in square brackets right next to the error you posted. Please let me know what it is.
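
    For anyone wondering why the error only shows up after the kernel: kernel launches are asynchronous, so a bad memory access inside the kernel is typically only reported by the next synchronizing call, which here is the copy that fetches the results back from the GPU. A minimal sketch of that pattern (not GPUPI's actual code, all names are made up for illustration):

        // sketch: an out-of-bounds access in the kernel is usually reported
        // only by the NEXT synchronizing call (here the cudaMemcpy), not by
        // the kernel launch itself
        #include <cstdio>
        #include <cuda_runtime.h>

        __global__ void faultyKernel(float* buf)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            buf[(size_t)i * 1000000] = 1.0f;   // writes far outside the allocation
        }

        int main()
        {
            const int n = 1000;
            float* d_buf = nullptr;
            cudaMalloc((void**)&d_buf, n * sizeof(float));

            faultyKernel<<<8, 256>>>(d_buf);   // launch returns immediately, no error yet

            float h_buf[1000];
            cudaError_t err = cudaMemcpy(h_buf, d_buf, n * sizeof(float),
                                         cudaMemcpyDeviceToHost);
            if (err != cudaSuccess)
                printf("error after the kernel: [%s]\n", cudaGetErrorString(err));

            cudaFree(d_buf);
            return 0;
        }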

  3. I have not tested the Legacy version yet. It was compiled with VS 2013 (instead of 2012) which introduces some major changes. I will have a look at it as soon as I have more time.

     

    Btw: I don't know if I will continue to support the Legacy version. It's a lot of work and has very few downloads. This may be the last version.

  4. In the interest of full disclosure, I am posting this here as well. After many hours of bugfixing, version 2.3 is finally bulletproof. Please redownload the newest build before benching: GPUPI 2.3.4

     

    Additionally, the following features were added in the last four minor versions:

     

    • Support for Tesla graphics cards
    • Support for more than 8 devices - theoretically thousands of devices can now be used for calculation!
    • Detection of AMD's RX 480 graphics cards
    • Important bugfixes for the Legacy version and GeForce 200 series cards
    • Source code cleanup

     

    Download: https://www.overclockers.at/news/gpupi-international-support-thread

     

    Many, many thanks to dhenzjhen again; because of his support, GPUPI is now better and more flexible than ever! If you haven't seen his score with 10 Tesla M40s, you'd better do it now: dhenzjhen's GPUPI - 1B score: 2sec 621ms with a GeForce GTX 1080 (it's currently filed under GTX 1080 because the M40s are not in the database)

  5. Just a quick heads-up! Tomorrow I will release GPUPI 2.3 with multiple bugfixes and features. I am very happy with the new CUDA 8.0 support plus several optimizations of the CUDA kernels that finally led to faster scores than the OpenCL implementation.

     

    Have a look at this score with a GTX 1080; it's top 10 on air cooling, and my sample doesn't really clock well:

     

    [screenshot: gpupi-2-3-score-gtx-1080_215326.jpg]

     

    So hold your horses for now if you are benching NVIDIA cards with GPUPI. :)

  6. I don't know what Geekbench really does when benching memory. If it's mostly a bandwidth test, it should be affected as well.

     

    The gap between 4/512 and 1/64 says a lot. The better the batch size is adjusted to the architecture itself, the faster the bench will be. That's because the workload is aligned to the maximum worksize that can be run in parallel. 4M seems to be the best choice for the 6100 with 2 cores/4 threads.

     

    About the same is true for the reduction size. The bigger the better: a reduction size of 512 means that 512 partial sums are added into one, repeatedly, until only a single sum remains. Let's say we want to sum up 8192 single numbers; that would be:

     

    step 1: sum of 512 numbers, sum of 512 numbers, ... (16 times)

    step 2: sum of 16 numbers = result

     

    Whereas a reduction size of 64 would produce:

     

    step 1: sum of 64 numbers, sum of 64 numbers, sum of 64 numbers ... (128 times)

    step 2: sum of 64 numbers, sum of 64 numbers

    step 3: sum of 2 numbers = result

     

    If you consider that GPUPI produces billions of partial results that need to be added up, then 512 also needs far fewer steps overall to sum up the batches after they are calculated. Additionally, the bigger the batch size, the fewer reduction passes have to be made for the calculation. So these two values mean a lot for the whole computation. :)
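
    To make the pass counting above concrete, here is a minimal sketch of such a tree reduction in CUDA (not GPUPI's actual kernel; names and sizes are made up for illustration). Each pass sums groups of REDUCTION_SIZE partial results per thread block, and the host repeats passes until a single value remains; with 8192 inputs you get 2 passes at 512 and 3 passes at 64, just like in the example above:

        // sketch of a tree reduction: each pass sums REDUCTION_SIZE values
        // per thread block; the host loops until only one sum remains
        #include <cstdio>
        #include <utility>
        #include <vector>
        #include <cuda_runtime.h>

        const int REDUCTION_SIZE = 512;        // try 64 to see the extra pass

        __global__ void reducePass(const double* in, double* out, int n)
        {
            extern __shared__ double s[];
            int tid = threadIdx.x;
            int idx = blockIdx.x * blockDim.x + tid;
            s[tid] = (idx < n) ? in[idx] : 0.0;
            __syncthreads();

            // classic shared-memory reduction inside one block
            for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
                if (tid < stride)
                    s[tid] += s[tid + stride];
                __syncthreads();
            }
            if (tid == 0)
                out[blockIdx.x] = s[0];
        }

        int main()
        {
            const int N = 8192;                    // number of partial results
            std::vector<double> host(N, 1.0);      // dummy data, sum should be 8192

            double *a, *b;
            cudaMalloc((void**)&a, N * sizeof(double));
            cudaMalloc((void**)&b, N * sizeof(double));
            cudaMemcpy(a, host.data(), N * sizeof(double), cudaMemcpyHostToDevice);

            int n = N, pass = 0;
            while (n > 1) {
                int blocks = (n + REDUCTION_SIZE - 1) / REDUCTION_SIZE;
                reducePass<<<blocks, REDUCTION_SIZE,
                             REDUCTION_SIZE * sizeof(double)>>>(a, b, n);
                std::swap(a, b);
                n = blocks;
                printf("pass %d: %d sums left\n", ++pass, n);
            }

            double result;
            cudaMemcpy(&result, a, sizeof(double), cudaMemcpyDeviceToHost);
            printf("result = %.0f after %d passes\n", result, pass);

            cudaFree(a);
            cudaFree(b);
            return 0;
        }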

  7. This is so strange, it's the only bench I have ever had an issue with that I couldn't eventually figure out. I guess I should try with a different mobo and cpu.
    Have you tried a memory benchmark yet? Try AIDA64 to test your bandwidth, it should be impacted as well.

     

    Otherwise it's a driver issue, but I doubt it. We never had an efficiency problem with the memory reduction before. The reduction is btw a very common technique for summing up a lot of values in parallel. The pi calculation itself depends on much more to be efficient.
