Message boards : Webpage and Application Code Discussion : Ryzen bug the second - Can someone confirm?

moe120
Joined: 23 Apr 17
Posts: 4
Combined Credit: 434,566
DNA@Home: 0
SubsetSum@Home: 0
Wildlife@Home: 434,566
Wildlife@Home Watched: 0s
Wildlife@Home Events: 0
Climate Tweets: 0
Images Observed: 0

  
Message 6999 - Posted: 1 May 2017, 13:22:32 UTC

Hi all,

I recently got myself a Ryzen 5 1500X with 4 cores and 8 threads. My mainboard's BIOS (Asus B350 Plus) was updated to the latest available version, 0609, which is supposed to include the AGESA 1.0.0.4 fix for the Ryzen FMA3 bug.
However, if I tell my BOINC client to run Citizen Science Grid tasks on all 8 threads with an 80% usage limit, the system reproducibly crashes after 1-5 minutes: the screen goes black, the mouse LED turns off, the power button does not respond, and I have to press the reset button to get it back.

When I run other projects (NFS, Collatz, ...) or loop benchmarks at 100% CPU usage, this doesn't occur, so it is not a thermal issue. It only happens with the Citizen Science Grid tasks (currently "exact mnist batch cnn trainer").

Can anybody with a Ryzen CPU confirm this?

ARCHspark
Joined: 22 May 17
Posts: 5
Combined Credit: 472,305
DNA@Home: 0
SubsetSum@Home: 0
Wildlife@Home: 472,305
Wildlife@Home Watched: 0s
Wildlife@Home Events: 0
Climate Tweets: 0
Images Observed: 0

  
Message 7068 - Posted: 22 May 2017, 19:22:07 UTC - in response to Message 6999.

Similar problem here.
I have a Ryzen 1800X (X370 Taichi, Linux) running 16 threads of CSG, and it crashes after a few minutes (processes stop working, system unusable, can't even shut down). No problem with other projects.
I thought it was a problem with my undervolting/overclocking, but it's the same issue with stock settings, and even with stock clocks and extra volts.

Running CSG with 15 tasks plus one task from another project works fine.

ARCHspark

  
Message 7069 - Posted: 22 May 2017, 20:54:39 UTC - in response to Message 7068.
Last modified: 22 May 2017, 20:55:06 UTC

can't find an edit button...

Update: 5 * CNN trainer 0.24 + 11 * Norm CNN trainer 0.29 is looking good so far. If it stays stable, it was probably something with version 0.24...

Travis Desell
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 16 Jan 12
Posts: 1795
Combined Credit: 2,265,607
DNA@Home: 293,563
SubsetSum@Home: 349,212
Wildlife@Home: 1,622,832
Wildlife@Home Watched: 212,926s
Wildlife@Home Events: 51
Climate Tweets: 21
Images Observed: 710

              
Message 7073 - Posted: 24 May 2017, 18:20:25 UTC - in response to Message 7068.

That's very strange. Could it be some kind of memory bandwidth issue?

EXACT v0.24 uses almost 4x as much memory as EXACT v0.29, and v0.29 also does its computation in single precision, while v0.24 uses doubles.

Does this still happen if you're running all v0.29s?
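
As a rough check on those figures: switching from doubles to single precision halves the storage per value, so on its own it would account for about 2x of the "almost 4x" difference. The rest would have to come from other changes between the two versions, which are not described here. A minimal Python sketch of that arithmetic follows; the value count is a hypothetical placeholder, not a number taken from EXACT.

```python
from array import array

# Bytes per element for a C float and a C double, as reported by the
# standard-library array module on the machine running this script.
FLOAT_BYTES = array("f").itemsize
DOUBLE_BYTES = array("d").itemsize


def buffer_bytes(n_values: int, bytes_per_value: int) -> int:
    """Size in bytes of a flat buffer holding n_values numbers."""
    return n_values * bytes_per_value


def precision_ratio() -> float:
    """How much larger a double buffer is than a float buffer of equal length."""
    return DOUBLE_BYTES / FLOAT_BYTES


if __name__ == "__main__":
    # Hypothetical number of stored values per task (assumption, not from EXACT).
    n_values = 1_000_000
    as_double = buffer_bytes(n_values, DOUBLE_BYTES)
    as_float = buffer_bytes(n_values, FLOAT_BYTES)
    print(f"double: {as_double} bytes, float: {as_float} bytes")
    print(f"ratio from precision alone: {precision_ratio():.1f}x")
```

If the measured difference really is close to 4x, this suggests the precision change is only part of the picture.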

ARCHspark

  
Message 7074 - Posted: 24 May 2017, 19:25:10 UTC - in response to Message 7073.

It's currently running v0.30 and everything is fine, with no problems even while overclocked and/or undervolted. I had this problem only with v0.24.
Ryzen has many reported problems with RAM speed; that's why I even tried my RAM at 2933 MHz instead of 3200 MHz. I haven't tested the RAM at 2133 MHz.
Temperatures with v0.24 were ~4-5°C higher than with later versions, but always below the mid 60s, so no problem there.
Everything is/was with an up-to-date BIOS version (AGESA 1.0.0.4a).
Total RAM usage of 16 * v0.24 was still a few GB below max system RAM.

I know this info doesn't really help with bug hunting, and since it's apparently fixed now, we probably won't find the reason for the crashes.
I'd like to know if it's fixed for moe120 too.

moe120

  
Message 7087 - Posted: 29 May 2017, 7:39:18 UTC - in response to Message 7074.

For now I can confirm that the jobs I had in my queue were 0.24. I aborted those and now have 7x 0.30 running (stable for the last 10 minutes). I will test over the next few days and report back.

Regarding the system memory: it's supposed to run at 2800 MHz (2x8GB), but the board doesn't support this yet, so I had it clocked at 2400, which was stable (except for CSG, that is :-) but is also only reachable with BIOS-assisted overclocking. I also tried the highest possible memory clock rate without OC (2133), and it also crashed. I'm running at this clock rate now, and it's fine so far with the 0.30 ones.

@ARCHspark: what mainboard are you using?

moe120

  
Message 7089 - Posted: 29 May 2017, 16:35:41 UTC - in response to Message 7087.

Update: it is indeed running stable, tested for 8 hrs now; the longest run with 0.24 job chunks was about 10 minutes. I also increased the memory clock from 2133 to 2400, and again it is as stable as can be.

I wonder if someone could make a standalone .exe that just runs 0.24 jobs, which we could hand over to AMD for further investigation? Maybe it's nothing, and we both just have some unstable memory or a weird mainboard malfunction that shows up only with these calculations, and all other Ryzen owners will be fine? But who knows...

ARCHspark

  
Message 7091 - Posted: 29 May 2017, 18:31:08 UTC - in response to Message 7087.
Last modified: 29 May 2017, 18:32:00 UTC

moe120 wrote: "@ARCHspark: what mainboard are you using?"


I'm on an ASRock X370 Taichi, with 3200 CL14 RAM running its XMP profile at 1.4 V instead of 1.35 V.

I'm currently playing around with the voltages to find my sweet spot, so right now it isn't really stable, but that has nothing to do with CSG.

moe120 wrote: "I wonder if someone could possibly make a standalone .exe that just runs 0.24 jobs"

Well... under your account settings you can choose to run only the 0.24 version...

In the news thread "[wildlife] new app: EXACT MNIST BATCH", the user Peppernrino reports a similar problem when running too many tasks of v0.24 on an AMD FX-8350. Maybe it's not something Ryzen-specific.

moe120

  
Message 7097 - Posted: 30 May 2017, 17:40:51 UTC - in response to Message 7091.


ARCHspark wrote: "In the news thread [wildlife] new app: EXACT MNIST BATCH the user Peppernrino reports a similar problem with running too many tasks of v0.24 with an AMD FX-8350. Maybe it's not something Ryzen-specific."


I would rule out a general programming mistake, like pointers that end up in areas they're not intended to go :-), because with my last CPU, an FX-4100, I could run the 0.24 chunks for four days without a single problem. My guess is there's more to it, and the Ryzen bug topic is not over yet :(

