HWBOT Community Forums

Mysticial

Members
  • Posts

    156
  • Joined

  • Last visited

  • Days Won

    2

Everything posted by Mysticial

  1. I've gone ahead and released v0.7.1 to the public. Let me know how the version turns out. Here's the first submission for the new version: http://hwbot.org/submission/3216573_mysticial_y_cruncher_pi_1b_core_i7_4770k_1min_59sec_918ms Download links for the new version are both on my website and the other thread. And here's a 100b run just because I feel like adding an image to this post.
  2. Just to clarify that I'm understanding your response correctly:
       • At 4.75 GHz, neither y-cruncher nor Prime95 will run.
       • At stock, Prime95 will run, but only at 87% CPU usage.
       • Neither GPUPi nor y-cruncher will run.
       • Geekbench runs with no problems with 99% CPU usage.
     If this is the case, then I have no clue. It sounds like something is very wrong with the system. Stuff not working at 4.75 GHz is reasonable. A lot of Haswell-E overclocks at 4.75 GHz are not AVX-stable and will instantly fail on anything that uses AVX (such as y-cruncher or Prime95 28.x). But if things are failing at stock, then something is messed up. The 87% CPU usage thing is usually an indication that you have background programs running.
  3. Are you AVX-stable? Can you run Prime95 Small FFTs (version 28.x) for any amount of time?
  4. I commented on this a couple days ago. I also think that a screenshot with the result window and CPUz (CPU + mem) is sufficient. It is not necessary to have the submitter in the screenshot. With v0.7.1 (which I'm ready to release at any time), there should be no reason to put any OS restrictions. I also want to make it clear that you are free (and encouraged) to change the computation settings. The default settings that you get when running from the submitter are not always the best. And from my experience, they are rarely optimal when you have 32+ cores. Everything is allowed (swap mode, using a different binary) as long as you are computing Pi to the right number of digits and it finishes correctly. I'll keep the submitter app updated to enforce any necessary new restrictions (such as the reference clock stuff).
  5. No, but you can (almost) already do that by saving the validation file. I can't allow manual submissions until I have a mechanism to block submissions into the wrong category. (e.g. submitting a 25m into the 1b.)
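A minimal sketch of the kind of category check described above, not the submitter's actual code. The table `CATEGORY_DIGITS` and the function `matches_category` are hypothetical names; the category labels are taken from the submission URLs elsewhere on this page, and the digit counts are assumed from those labels.

```python
# Hypothetical category table. The labels come from the submission URLs on this
# page; the digit counts are an assumption based on those labels.
CATEGORY_DIGITS = {
    "25m": 25_000_000,
    "1b": 1_000_000_000,
    "10b": 10_000_000_000,
}


def matches_category(digits_computed: int, category: str) -> bool:
    """Return True only if the run computed exactly the digits the category requires."""
    required = CATEGORY_DIGITS.get(category)
    return required is not None and digits_computed == required
```

With a check like this, `matches_category(25_000_000, "1b")` is false, so a 25m result would be rejected from the 1b category before it is sent.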
  6. The beta competition has been winding down for the past couple weeks and it officially ends in two days. So I thought I'd shed some light on what's potentially coming next.

     y-cruncher v0.7.1 has been in feature freeze for the past 3 weeks and I've started handing out a release candidate to a handful of people. In other words, it's almost ready. If any staff members are interested, PM me or shoot me an email.

     The new version has a bunch of changes. The ones that are relevant to HWBOT are:
       • Admin is no longer required to run. But turning it on anyway may give a small speedup. Admin is still required for swap mode.
       • Detection of the operating system.
       • Detection of more hardware components. (HT, # of sockets, motherboard, memory)
       • Detection of the reference clock. Version 0.7.1 will now recognize the TSC, HPET, and ACPI reference clocks. The submitter will refuse to submit benchmarks that are run on Windows 8 or later if the reference clock is not HPET or ACPI.
       • Performance improvements for most processors. And a new binary aimed at Broadwell and Skylake desktop chips.

     This version adds a lot of environment detection mostly for validation and to streamline the submission process. At the very least, the motherboard and # of processors will now auto-fill themselves (if HWBOT recognizes it). On Windows, it will also detect the individual memory modules, but HWBOT currently doesn't take this information. (I couldn't figure out how to do this on Linux.)

     OS and reference clock detection is obviously for validation and to lift the Windows 8 and later restriction. However, I found out today that this might not be 100% reliable. Apparently my laptop's platform clock is neither HPET nor ACPI. y-cruncher doesn't recognize it and the submitter blocks me from submitting anything run on my laptop. After a bunch of Googling I still couldn't figure out what the hell it is. In any case, this is something that can probably be fixed on the submitter side. So I'm fine with rolling out y-cruncher as is.

     Last are the performance improvements. They aren't massive speedups. (Mostly around the 1 - 5% range.) Last time I checked, Sandy/Ivy Bridge with AVX had the biggest improvement. But still nowhere near enough to be competitive with Haswell/AVX2. Those with Broadwell and Skylake will be able to run the new "x64 ADX ~ Kurumi" binary that utilizes the ADX instruction set.

     Unfortunately, this means that v0.7.1 will not be speed consistent with v0.6.9 and older versions. So if y-cruncher is here to stay on HWBOT, this is something we'll have to live with since y-cruncher will probably continue to get incrementally faster with each major release. With respect to that, I have AVX512 binaries lined up and good to go. So expect a potentially "unfair" advantage for Skylake-EP and Cannonlake whenever I can get my hands on them. (Knights Landing also has AVX512, but that's still uncertain since the architecture is so drastically different.)

     ----------------

     In any case, there are some open questions:
       • There's a new version of the submitter that will be released simultaneously with y-cruncher v0.7.1. Right now, that submitter is set to block all v0.7.1 submissions on Windows 8 or later if they aren't using HPET or ACPI. Should I extend this to block earlier versions of y-cruncher that cannot detect the clock? This means that the current version of y-cruncher will no longer be usable for HWBOT.
       • What about Linux? I have no idea how easy it is to tamper with the clocks in Linux. (I've never tried.) But it's certainly possible since Linux is open-sourced. So a capable kernel hacker can modify it in a way to trick y-cruncher's timings.
       • When should I actually release v0.7.1? Should I do it immediately after the competition ends? Or should I wait around a bit? (I'm not sure what usually follows a beta competition.)
  7. Excellent. I'll make the error-message more specific for the next version. You probably won't be the last one to hit this. It just occurred to me that not being able to write to disk will also prevent the submitter from running y-cruncher at all since it uses scripts.
  8. While I'm not 100% sure this is the case, I do agree that it sounds like the submitter can't write to disk at all. Where did you put the y-cruncher folder? If you copied directly into "c:/" there's a chance that it won't have permissions to write there. Try copying it somewhere else on that system. If the problem persists, then I'll have to push out a new version with some additional logging to pinpoint the error.
  9. Just to make sure I know which error message you're getting. The exact text reads, "Unable to create datefile." right? Do you see a file named, "datafile.hwbot" in the path where you are running the app? This error message will show up if the submitter app is unable to write to the directory that it's running from.
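A quick way to test the "can't write to the directory" theory from the post above is to try creating a throwaway file there. This is only a sketch of such a probe in Python; the helper name `can_write_to` is hypothetical and unrelated to the submitter's own code.

```python
import os
import tempfile


def can_write_to(directory: str) -> bool:
    """Try to create (and automatically delete) a temporary file in `directory`."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        # Covers missing directories and permission errors alike.
        return False


# Probe the directory the program is being run from.
print(can_write_to(os.getcwd()))
```

If this prints `False` for the folder the app was copied into, moving the folder somewhere the user can write to is the first thing to try.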
  10. Fine... I hope I don't get yelled at. http://hwbot.org/submission/3207669_mysticial_y_cruncher_pi_25m_2x_xeon_e5_2696_v4_0sec_780ms http://hwbot.org/submission/3207633_mysticial_y_cruncher_pi_1b_2x_xeon_e5_2696_v4_37sec_624ms http://hwbot.org/submission/3207722_mysticial_y_cruncher_pi_10b_2x_xeon_e5_2696_v4_7min_23sec_945ms/
  11. Alright... Someone in China was kind enough to give me remote access to a dual Xeon E5-2696 v4 (Broadwell-EP). It's got 44 cores/88 threads and 768 GB of RAM. And after playing around with it for a few hours, I have some benchmarks for it. Does anyone mind if I lay waste to the leaderboards? I feel very guilty just thinking about it since I wrote the benchmark. Btw, getting the program to run efficiently on these high-end boxes is actually quite difficult. At this level, the program is very sensitive to a lot of things. Combine that with a half-dozen knobs to turn within the program and it's a very large search space to play with.
  12. Thanks. Though on second thought, I'm not sure if I'll be able to run any valid benchmarks on them. They're all running Windows 10, and I don't think I have permission to touch the bootcfg to turn on HPET.
  13. What's the policy on borrowing or loaning hardware? Someone has offered to give me remote-login access to a number of systems including a dual Xeon E5-2696 v4 (Broadwell-EP with 44 cores/88 threads and 768 GB of memory). The purpose of this is to do scalability tuning for y-cruncher. But I am also authorized to run and disclose benchmarks on this thing. Would it be in bad taste to submit benchmarks from hardware that I do not own?
  14. I don't see how a batch file would help anything. What did you have in mind?
  15. I've thought about that. But swap mode is part of the custom compute menu. And that custom compute menu is f***ing complicated. So it will take a lot of work to mirror that menu into the UI. Aside from that, there are some technical roadblocks. The custom compute menu is an interactive menu. Unlike the benchmark menu, you can't just fire a command at it and expect it to always work gracefully. The custom compute menu will show memory calculations and warnings. It will also hide/disable options that aren't applicable or are incompatible with existing settings. And it will automatically adjust things in response to user input. But in order to do that in the UI, y-cruncher needs to be able to send information back to the UI. Unfortunately, that's not possible right now. And I don't know how to do that atm. That said, I can try to design something around this limitation, but no promises.
  16. I was going through the submissions and I noticed a number of multi-socket systems that have seemingly terrible performance. (Especially that 4-socket Magny-Cours Opteron.) I'll go ahead and explain why this is the case. It will probably be obvious to those of you who are familiar with the topic.

      -----

      Why does y-cruncher (sometimes) suck on multi-socket systems?

      This is due to memory access. Specifically, Non-Uniform Memory Access (NUMA). y-cruncher can only run efficiently when the following assumption is true: Every core/processor has fast access to all the memory. This is true for all single-socket systems as well as some of the pre-Nehalem dual-socket servers. But not on modern multi-socket systems.

      On multi-socket systems, each processor socket has its own set of memory banks. A processor has fast access to its own set of memory. But if it needs to access memory that's elsewhere (on a different socket), it needs to go over the interconnect to get it from the other processor. So it's a lot slower. In other words, the assumption that is critical to y-cruncher's performance is no longer valid. Some memory is faster, and some memory is really slow - hence "Non-Uniform Memory Access". If you have two sockets, half the memory will be fast and the other half slow. If you have a lot of sockets, then the vast majority of the memory will be slow with respect to each individual processor.

      If you think that's bad, get ready for more bad news. Operating systems are aware of the NUMA. So they try to be smart about it. When a program allocates memory, the OS biases that memory in favor of the core that asked for it. This maximizes locality so that memory accesses stay within the same NUMA node. While this sounds reasonable for most applications, it actually backfires for y-cruncher. Unlike most programs, y-cruncher wants to use the entire system.

      Some of you might have noticed that y-cruncher's memory usage is static throughout the entire computation. What's happening is that y-cruncher allocates all the memory it needs upfront and reuses it through the computation. And that's where the problem is. That allocation is done by a single thread. So the OS will put all of it on one NUMA node. During the computation, y-cruncher spawns threads that run on all the cores and all the sockets/NUMA nodes. Since all the memory is on one socket, all processors from all the sockets will hammer that one socket. Not only is it overloading the memory bandwidth in that node, it's also swamping the QPI going in and out of that socket. Meanwhile, all the memory on the other nodes is idle. In other words, it's a massive traffic jam where everybody tries to park in one garage while there are 3 others that are empty.

      This is why the performance sucks on those quad-Opteron servers. It affects Intel machines as well, but to a lesser degree since they seem to have better interconnects.

      What can you do about it?

      The biggest problem is the traffic imbalance. If your BIOS has the option to disable the NUMA, then do it. This doesn't actually disable the NUMA since the NUMA is a physical thing, but it tricks the OS into thinking there's no NUMA so it randomizes the memory allocations across all the nodes. In Linux, you have a bit more control. The numactl package lets you run a program with interleaved memory. This also spreads out the memory across the nodes.

      These tweaks will help y-cruncher run faster. But they don't completely solve the NUMA problem. There's still the latency problem, and even when the interconnect traffic is balanced out, it will still be a bottleneck. Solving the NUMA problem can only be done by redesigning the program. That's obviously beyond the scope of benchmarking. That said, it doesn't mean you should avoid multi-socket systems. A high-end dual-socket machine that is properly configured will still beat out all the single-socket setups - LN2 or not.

      What makes y-cruncher different from programs like wPrime?

      y-cruncher actually needs to use memory - and a lot of it. (Not that I needed to say that.)
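The traffic argument in the post above can be put into numbers with a toy model. This is a simplification for illustration, not a measurement: it assumes threads are spread evenly over the nodes and every thread touches all of the program's memory equally. The function names are hypothetical. On Linux, the interleaving the post mentions is typically requested with `numactl --interleave=all <program>`.

```python
from typing import List


def node_load(num_nodes: int, interleaved: bool) -> List[float]:
    """Fraction of total memory traffic served by each node's memory banks."""
    if interleaved:
        # Pages are spread evenly, so every node serves an equal share.
        return [1.0 / num_nodes] * num_nodes
    # Single-threaded allocation: the OS places everything on one node.
    return [1.0] + [0.0] * (num_nodes - 1)


def remote_fraction(num_nodes: int) -> float:
    """Fraction of accesses that cross the interconnect.

    Under the toy assumptions this is the same for both placements: only
    1/num_nodes of all accesses happen to land on the local node.
    """
    return (num_nodes - 1) / num_nodes


for nodes in (2, 4):
    print(nodes, node_load(nodes, False), node_load(nodes, True), remote_fraction(nodes))
```

In this model, interleaving removes the hot spot (one node serving all of the traffic) but leaves the share of remote accesses unchanged, which matches the post's point that the tweaks help without fully solving the NUMA problem.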
  17. Try this one and see how far it gets. Version 0.9.3.95: http://www.numberworld.org/y-cruncher/HWBOT%20Submitter%20v0.9.3.95.jar

      I did manage to hack in a progress counter for the submission. When everything works properly, you should see things in this order:
        • Building datafile. Please wait...
        • Sending datafile. Please wait...
        • Sending: XX.X MiB / XX.X MiB
      That last one will refresh every second until the submission is complete. Then it disappears. If any errors occur, there should be an error-box that pops up. If something crashes, you probably won't see anything and it will hang. Anyway. I need to get some sleep. So I probably won't be able to respond for quite a while.
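One common way to get a counter like "Sending: XX.X MiB / XX.X MiB" is to write the payload in chunks and report after each chunk. The sketch below shows that idea in Python; it is an assumption about the general technique, not the submitter's Java implementation, and it reports per chunk rather than once per second. The names `progress_text` and `send_in_chunks` are hypothetical.

```python
import io

MIB = 1024 * 1024


def progress_text(sent_bytes: int, total_bytes: int) -> str:
    """Format the counter in the same shape as the status line in the post."""
    return f"Sending: {sent_bytes / MIB:.1f} MiB / {total_bytes / MIB:.1f} MiB"


def send_in_chunks(data: bytes, out, chunk_size: int = 64 * 1024, report=print) -> None:
    """Write `data` to the file-like object `out` in chunks, reporting after each one."""
    total = len(data)
    sent = 0
    while sent < total:
        chunk = data[sent:sent + chunk_size]
        out.write(chunk)
        sent += len(chunk)
        report(progress_text(sent, total))


# Demo with an in-memory stream standing in for the network connection.
send_in_chunks(b"\0" * (2 * MIB), io.BytesIO(), chunk_size=MIB)
```

Because the caller owns the loop, progress is known even when the underlying stream offers no callback of its own.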
  18. So you do not see the message, "Sending datafile. Please wait..."? It may take a few seconds before the message pops up. But if you have a slow internet connection or a really large screenshot, then the actual sending may take a while. It takes 20+ seconds for me if I try to upload a 1440p image. And my connection isn't the worst one out there. It sounds like I need to make the send progress more explicit. That way even when it's not working, it's clearer where it gets stuck. Unfortunately, I can't do a progress bar since the Java network API doesn't seem to have any way to relay that information back to the caller.
  19. That's a pretty low bar to be "impressed" by. If I tried to present this to the UI folks back when I was at Google, they would've blown me off and told me to go back to programming. In all seriousness, I'm also a user of the app. So if something doesn't feel right, or is inconvenient, I'll tweak it until it does. After a few iterations of that, this is what it converged to (at least for my workflow). The screenshot one was particularly tricky since I didn't want to over-complicate the UI. Screenshots are very invasive and can capture sensitive stuff. So I wanted the user to know exactly what he/she is about to send before actually sending it. And then there was the horrific side-effect of making the datafile really big and taking 20 seconds to send and freezing the app for the entire time. (My 2560 x 1440 monitor screenshots to a 2 MB png.)
  20. Ugh... Manual submissions have always been disabled for reasons I mentioned in the other thread. In what way is the submission button not working? It doesn't do anything when you click? When you click it, do you at least get the status update on the bottom right corner? The behavior of that changed since the last version. Previously, the entire app would freeze while it sends out the datafile. But with screenshots, the datafile is much larger. So depending on your internet connection, it may freeze for a long time. So I changed the implementation so it wouldn't freeze but would at least display an indicator that it's still sending.
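The change described above (send without freezing the app, but keep showing that it is still sending) is usually done by moving the upload onto a worker thread. A minimal Python sketch of that pattern follows; the submitter itself is a Java app, so this only illustrates the idea. `send_in_background`, `send`, and `on_status` are hypothetical names, and the "Done." text is made up for the example.

```python
import threading


def send_in_background(send, on_status) -> threading.Thread:
    """Run `send()` on a worker thread and report status through `on_status`."""

    def worker() -> None:
        on_status("Sending datafile. Please wait...")
        try:
            send()
            on_status("Done.")  # hypothetical completion text
        except Exception as exc:
            on_status(f"Error: {exc}")

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread  # the caller keeps handling UI events in the meantime
```

In a real GUI toolkit, `on_status` would also have to hand the update back to the UI thread; that part is toolkit-specific and left out here.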
  21. I just pushed an update to the submitter app. The new features are:
        • Integrated support for screenshots.
        • An option to easily override the binary selection.
      Along with this are some miscellaneous UI changes. The top menu bar is largely redundant for now. It's a placeholder for later when there isn't enough space to make a button for everything.

      The screenshots are sent as part of the datafile. Due to the encoding that HWBOT uses, the datafiles would be really large. So I had to turn on compression. In other words, the older versions of the submitter will no longer work. So you will need to update to this one to make any new submissions. Hopefully this won't be too problematic for everyone.

      Please let me know if there are any issues with this new version. Quite a few things changed and it's possible I broke something. Here's what the latest version looks like: http://hwbot.org/submission/3189606_
  22. That's interesting. It's almost as if the AVX unit can get "stuck" in some way that can only be cleared up by power cycling it. I don't know if drastically changing the overclock can actually power cycle the AVX execution units. But it's worth a guess since it is known that they turn off when unused (among other circumstances). I can't say I've ever been able to push a Haswell that far. Both of my Haswell boxes are heat-limited to about 4 GHz. My 4770K will hit 90C under AVX2. And I'm not equipped (nor am I gutsy enough) to delid it. Given that I use these machines to develop this program, I'm basically "stuck" at these lame overclocks. Btw, how much LN2 is the benchmark churning through? The 5960X draws a lot of current under AVX-intensive loads. I'm guessing it isn't so bad for the 25m and 1b runs since they are short.
  23. The new rules that I think I'm going to apply after the beta competition ends will be:
        • Swap mode is allowed. But only if it was done in a contiguous run. So anything that used checkpoint restart will not be allowed. Checkpoint restart is a feature to facilitate extremely long running computations when the computer goes down due to power outage or a planned backup session. For competitive benchmarking it allows you to cheat by changing the hardware in the middle of the run.
        • When running on Windows 8/8.1/10, the reference clock must be set to either HPET or ACPI.
        • The digits need to actually be correct. Right now, y-cruncher will tell you whether or not they are correct. But it will output a validation file even if the digits are wrong. And the submitter will let you submit it regardless of whether the digits are right or wrong. This isn't a big problem right now because most errors will halt the program. So if it manages to finish at all, there's a high probability that the digits will be correct.
      These new rules will apply starting from y-cruncher v0.7.1. And the submitter will enforce them. Technically, the validation files produced by v0.6.9 have enough information in them for the submitter to enforce all of these except the HPET/ACPI. But that's extra work, so I'm just gonna defer it to v0.7.1 which makes the information more explicit in the validation file. I'm in the process of feature-freezing v0.7.1 so that it will hopefully be ready for public release around the end of the beta competition.

      Swap mode can be run efficiently. But you'll need a very specialized setup such as this:
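The three rules listed in this post can be sketched as a single check. This is an illustration only: the posts do not describe the validation file layout, so the `RunInfo` fields, the OS name strings, and the function `rule_violations` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunInfo:
    # Hypothetical fields; the real validation file format is not shown in the posts.
    os_name: str                  # e.g. "Windows 7", "Windows 10", "Linux"
    reference_clock: str          # e.g. "TSC", "HPET", "ACPI"
    used_checkpoint_restart: bool
    digits_correct: bool


# Assumed spellings for the "Windows 8/8.1/10" rule.
RESTRICTED_WINDOWS = {"Windows 8", "Windows 8.1", "Windows 10"}
ALLOWED_CLOCKS = {"HPET", "ACPI"}


def rule_violations(run: RunInfo) -> List[str]:
    """Return the list of rules a run breaks; an empty list means it may be submitted."""
    problems = []
    if run.used_checkpoint_restart:
        problems.append("checkpoint restart was used (run must be contiguous)")
    if run.os_name in RESTRICTED_WINDOWS and run.reference_clock not in ALLOWED_CLOCKS:
        problems.append("reference clock must be HPET or ACPI on Windows 8/8.1/10")
    if not run.digits_correct:
        problems.append("digits are not correct")
    return problems
```

Returning every violation at once, rather than stopping at the first, lets the user fix everything before re-running a long benchmark.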
  24. You can't submit the datafile directly. You need to submit through the app itself. Once you submit it to HWBOT, the app will open up a browser that lets you finish the submission. Somewhere during that process, there will be an option that lets you enter the score into the competition.