GPUPI - SuperPI on the GPU

skulstation · November 13, 2014

Did you install the newest drivers? Are you sure they can handle OpenCL? If the device doesn't support double precision but can be detected on the system, it will get listed as ignored when starting the benchmark. I'm not sure about your cards, GTS 250 seems to have double support, I think.

Detection is mostly a driver issue and has not much to with the benchmark itself.

i gone re try it whit the gts 250 but now whit the 344.11 and not the 340.52 drivers. its working on a 560 ti 448 whit the 344.11 drivers

knopflerbruce · November 14, 2014

If you don't mind... please do a 2nd setting which takes a bit longer. What about 32B? Sure, most people prefer short benchmarks, but something heavy is also worth considering.

_mat_ · November 14, 2014

If you don't mind... please do a 2nd setting which takes a bit longer. What about 32B? Sure, most people prefer short benchmarks, but something heavy is also worth considering.

10B already takes about 15 minutes on my GTX 980. 20B would be possible without adapting the algorithm for higher precision. But I have to test.

I am currently working on a CUDA implementation. Just curious to see how good the OpenCL implementation of NVIDIA really is.

knopflerbruce · November 14, 2014

What about 16B, then? 20 would be cool. It would be somewhat interesting in a few years - even wprime 1024m is not what it used to be because it's almost over before it really began

TaPaKaH · November 14, 2014

+1 for really large problem instances. Computing power in GPUs grows a lot faster than "+5%" per year so if a test takes 20s now, in a couple of years it will be faster than 1M.

Massman · November 15, 2014

I concur, it would be nice to have a really demanding test

You know, like how 32M used to take an hour to run.

Calathea · November 15, 2014

I too would like a longer test. Just you guys don't forget how powerful a 980 or 290 can be. Don't want to see 4h+ runs with something like a radeon 270x

_mat_ · November 24, 2014

I'm currently testing the CUDA implementation with 32B digits. Takes about 50 minutes on a GTX 980 with stock clocks. Long enough?

attachment.php?attachmentid=198723

I will release version 1.3 later today.

_mat_ · November 24, 2014

Guys, version 1.3 is here. The whole code was refactored and allows multiple APIs now, that are loaded when the system supports it. The new version also includes a standalone for OpenCL and CUDA. The main reason is that the OpenCL version will run on Windows XP, the CUDA version won't.

The CUDA implementation was pretty easy and also uses less code to work. Especially the part to setup the application and prepare the calculation was a piece of cake compared to OpenCL. That said, I tried to be as fair as possible and implemented both APIs with each of their advantages, but still rely on the same algorithms and the same basic optimizations. I've also adjusted the OpenCL code a little bit to get them closer together, so the new version might differ a few milliseconds from the results of 1.2. Please use the new version as of now.

As requested I added two more digits to reach: 20B and 32B. Smaller graphics cards and CPUs will have to crunch those for days. Karl would have loved it!

Have a fun and let me know your results and what you think!

Download: GPUPI Beta 1.3

_mat_ · November 24, 2014

I've just added "GPUPI - 32B" and "GPUPI for CPU - 1B" to the benchmarks. Let's hope you use it.

Btw, I guess there should be a discussion if it's allowed to use CUDA for NVIDIA cards to compete in the rankings. Well, that's why I have so carefully implemented CUDA and OpenCL so close together. I think it would be fair, because any performance improvement is due to the vendor's implementation and optimization of kernel, which is basicly the same.

Massman · December 2, 2014

Looking good Mat!

tiborrr · December 2, 2014

Great benchmark, love it!

I would vote for 2B and 32B to be accepted as 'retail' benchmarks once benchmark passes through the Hwbot validation:

I would also lock down the benchmark window when splash screen "Pi Calculation is Done!" comes up. Just for the sake of nostalgia

Here's my workstation machine with FirePro W9000:

Wannabe W7000 (flashed 7870, slower DP):

Edited December 2, 2014 by tiborrr

_mat_ · December 2, 2014

Thanks Nico, very much appreciated!

I will change the message box, good idea. I really wanted to implement it as close to SuperPi as possible, but changed what I felt was outdated or not well done in the orginial benchmark. For example the window can be moved after the message box for a successful calculation is shown. The screenshot weirdos will thank me for this - yeah, I am one of those. It was also important for me to have an options file, that rembers what was set last time.

Regarding the default bench settings for ranking, I let you guys decide. I will update the bench to show it as default too.

Edited December 2, 2014 by _mat_

GENiEBEN · December 3, 2014

1B and 32B, since the idea was to make it a Spi-Clone. You can add a poll to the thread if you want to.

tiborrr · December 3, 2014

I fully concur with 1B and 32B. This way we're cool for couple of years (y)

For the sake of nostalgia please:

- use the original font

- remove the cancel button on calculation start notification

Also, here's the scaling of my HD7870 (Pitcairn):

Massman · December 4, 2014

So cool to see this benchmark kick off so nicely.

Let's fast-track it for points :D

_mat_ · December 4, 2014

It would be an honour! I promise to support the benchmark actively in the foreseeable future.

Btw: Next stop is multi gpu support. I just got supported by ASUS with a couple of GTX 980s for an overclocking show in Vienna today. I will use them wisely.

_mat_ · December 6, 2014

I fully concur with 1B and 32B. This way we're cool for couple of years (y)

For the sake of nostalgia please:

- use the original font

- remove the cancel button on calculation start notification

Also, here's the scaling of my HD7870 (Pitcairn):

Was a bit busy with an overclocking show the last few days. If anybody wants the see some pictures, have a look here.

First off, thanks for the nice scaling diagram. It clearly shows that GPUPI is not bandwidth limited in any way. The millions or even billions of parallely calculated values always stay in the graphics memory, never have to be copied to the host. That's because I've implemented not only the pure calculation on the GPU, there is also a two-stage memory reduction done right afterwards using shared memory inside the workgroups. Only two doubles have to be transfered afterwards to the host to accumulate the final sum of each series (there are four of them).

Regarding your suggestions, when writing the bench I initially wanted to use the original font. But it has a lot of kerning and I was not able to put in all the information without resizing the window to a strange and not SuperPi-like proportion. As I don't like the original font that much, I thought I'll better choose something more readable. Additionally I also implemented the text as an editable control in WIN32, because I wanted the text to be selectable for copying information, something I also missed in SuperPI. That comes with the sacrifice to control the spacing of the text itself. SuperPI uses GDI draw calls, where you can state a pixel coordinates to place the text.

I've also but some thought in the cancel button before calculation, it's not a bug.

I thought it was a good idea to be able to cancel it. You know how it is benching with ln2. You open the bench, press calculate and wait for the message box to start. Now focus is back on the temperature, pouring ln2 to the maximum of the component. When ready, press Ok. But sometimes things are not ready or you've forgot something. Now you have a choice to go back, quit the application and do it.

But guys, if you want it to be more nostalgic, I can try. I've just wanted to let you know, that I put some thought into it and didn't change things for nothing.

OLDcomer · December 6, 2014

It would be an honour! I promise to support the benchmark actively in the foreseeable future.

Btw: Next stop is multi gpu support. I just got supported by ASUS with a couple of GTX 980s for an overclocking show in Vienna today. I will use them wisely.

Nice benchmark!

Please add an option to select which GPU device will do the calculations. So in multi-gpu systems we will be able to select the best overclocker like we play with affinity in task manager to select the best overclocking core while benching Super Pi.

Massman · December 7, 2014

I'm not too sure 1B is a good choice. The currently wr is already at 19 seconds, so it looks like we're going to be at the end of the benchmark very soon.

_mat_ · December 9, 2014

First official release version is here! Just a few minor bugfixes, no changes on the GPU code and the results. I recommend to use this version for benching though.

Changelog

Explicit device selection for SLI, Crossfire und systems with multiple sockets and CPUs, implemented for CUDA as well as OpenCL. Important: The sort order of the devices depends on the driver and therefor the vendor implementation. In my tests this is same order in which GPU-Z sorts its devices. Some overclocking tools might order them in their own way.
Bugfix: The previously selected CUDA graphics card was not correctly preselected in the settings dialog
The final message box after the calculations is now exactly shown as in SuperPI

Download: GPUPI 1.4 (723 KB)

buildzoid · December 11, 2014

How much does this benefit from FP64? Because if FP64 makes a big impact the WR will always belong to compute cards and that's just no fun for most people.

GENiEBEN · December 11, 2014

How much does this benefit from FP64? Because if FP64 makes a big impact the WR will always belong to compute cards and that's just no fun for most people.

You mean just like many WR's belong to 2P/4P Xeons that cost more than a car? Get with it, we want to see the absolute record possible, not the down-to-earth-Joe-can-afford-it-too.

EDIT: @mat

Can you implement i18n support?

_mat_ · December 11, 2014

How much does this benefit from FP64? Because if FP64 makes a big impact the WR will always belong to compute cards and that's just no fun for most people.

Good double precision helps, but choosing the right weapon - currently R9 290 - and overclocking it to the maximum is key in this benchmark. So FirePro and Quadro won't have a chance to fetch the cups, if they can't be overclocked.

Can you implement i18n support?

Some of the is already using WIN32's WCHAR, but not all of it. Why, seen any problems with English yet? Anyways, I have no intention to translate it.

GENiEBEN · December 11, 2014

^ No, was hoping to translate it, np. Btw, how exactly do I start it on CPU, ran it on Mobile HD4000 + i5 3320M, says it can't run on IGP (missing DP), but it should run w/o problems on the cpu.

GPUPI - SuperPI on the GPU

Recommended Posts

Link to comment

Share on other sites

Top Posters In This Topic

Popular Days

Top Posters In This Topic

Popular Days

Popular Posts

_mat_

Leeghoofd

Splave

Posted Images

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Link to comment

Share on other sites

Join the conversation