Message boards : Number Crunching : Heartbeat errors destroy results (rare problem)

Ananas
Joined: 12 Aug 14
Posts: 27
Combined Credit: 1,727,216
DNA@Home: 959,874
SubsetSum@Home: 767,342
Wildlife@Home: 0
Wildlife@Home Watched: 73,544s
Wildlife@Home Events: 10
Climate Tweets: 0
Images Observed: 0

        
Message 4799 - Posted: 5 Nov 2014, 4:34:18 UTC
Last modified: 5 Nov 2014, 4:38:23 UTC

I just found a wingman who has BOINC heartbeat trouble:

hostid=4986

Currently he has 2 invalid, 2 error and 1 inconclusive result; at least 4 of them show restarts caused by heartbeat problems.

From what I can see, this seems to be a rare problem, so it probably doesn't require immediate investigation.


P.S.: This is NOT related to the problem with the strange output like this: <stderr_txt> 0,0,0,0,0,0,0,0,0,

Travis Desell
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 16 Jan 12
Posts: 1813
Combined Credit: 23,514,257
DNA@Home: 293,563
SubsetSum@Home: 349,212
Wildlife@Home: 22,871,482
Wildlife@Home Watched: 212,926s
Wildlife@Home Events: 51
Climate Tweets: 21
Images Observed: 774

              
Message 4801 - Posted: 5 Nov 2014, 16:31:10 UTC - in response to Message 4799.

> I just found a wingman who has BOINC heartbeat trouble:
>
> hostid=4986
>
> Currently he has 2 invalid, 2 error and 1 inconclusive result; at least 4 of them show restarts caused by heartbeat problems.
>
> From what I can see, this seems to be a rare problem, so it probably doesn't require immediate investigation.
>
> P.S.: This is NOT related to the problem with the strange output like this: 0,0,0,0,0,0,0,0,0,


Yeah, that output should be fixed. Right now it looks like the problem is related to checkpointing.

Ananas

        
Message 4812 - Posted: 9 Nov 2014, 11:06:49 UTC - in response to Message 4801.
Last modified: 9 Nov 2014, 11:53:04 UTC

You're right, the checkpoints are supposed to catch and fix the heartbeat problems.
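
To illustrate why a restart should not change a result when checkpointing works: below is a minimal Python sketch, not the project's actual application code (which isn't shown in this thread). It assumes a toy computation (a sum of squares) and hypothetical helper names (`save_checkpoint`, `load_checkpoint`, `run`). The idea is that every piece of state the result depends on goes into the checkpoint, and the checkpoint is written atomically so a kill in the middle of a write cannot leave a half-written file.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the file in one step on the same filesystem,
    # so a reader sees either the old or the new checkpoint, never a mix.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_i": 0, "total": 0}


def run(path, n, stop_after=None):
    """Sum the squares of 0..n-1, checkpointing after every step.

    stop_after simulates the client stopping the task (for example after
    a lost heartbeat): the function returns None after that many steps.
    """
    state = load_checkpoint(path)
    steps = 0
    while state["next_i"] < n:
        i = state["next_i"]
        state["total"] += i * i
        state["next_i"] = i + 1
        save_checkpoint(path, state)
        steps += 1
        if stop_after is not None and steps >= stop_after:
            return None
    return state["total"]
```

Calling `run(path, 10, stop_after=4)` and then `run(path, 10)` on the same file should give the same total as one uninterrupted run. If a restarted task produces a different result, some state is probably missing from the checkpoint or the checkpoint was written incompletely.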

If you need one more sample for checking: hostid=4451

Edit: Here's an invalid one with a regular restart (not heartbeat related): resultid=630168


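@all: If possible, use this setting in your computing preferences:

Leave tasks in memory while suspended? yes

It reduces the risk of invalid results, because a suspended task then stays in memory and does not have to restart from its last checkpoint.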
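
For hosts managed without the web preferences or the BOINC Manager, the same setting can also be placed in the client's local override file. The sketch below is only an illustration: it assumes the preference is stored as a `leave_apps_in_memory` element under a `global_preferences` root in `global_prefs_override.xml`, and the function name `set_leave_in_memory` is made up for this example. Please check the element name and the file location against the BOINC documentation for your client version before relying on it.

```python
import xml.etree.ElementTree as ET


def set_leave_in_memory(override_xml, enabled=True):
    """Return override_xml with leave_apps_in_memory set to 1 or 0.

    override_xml is the text of a preferences override file whose root
    element is assumed to be <global_preferences>.
    """
    root = ET.fromstring(override_xml)
    elem = root.find("leave_apps_in_memory")
    if elem is None:
        # The setting is not present yet, so add it under the root.
        elem = ET.SubElement(root, "leave_apps_in_memory")
    elem.text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(set_leave_in_memory("<global_preferences></global_preferences>"))
```

After writing the changed text back to the override file, the client has to re-read it; restarting the client should do that.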

