Message boards : Number Crunching : Error with new longer tasks.

Author Message
P . P . L .
Joined: 10 Aug 14
Posts: 59
Combined Credit: 336,654
DNA@Home: 336,605
SubsetSum@Home: 0
Wildlife@Home: 49
Wildlife@Home Watched: 0s
Wildlife@Home Events: 0
Climate Tweets: 0
Images Observed: 0

  
Message 4698 - Posted: 17 Oct 2014, 5:38:32 UTC
Last modified: 17 Oct 2014, 5:40:49 UTC

Hi Travis.

This is the first error I've had in a while; run time was over 2 hrs on my i7 2700K.

BOINC Manager just said computation error.

gibbs_test_hg19_1000fa_1_190_0_1

http://volunteer.cs.und.edu/csg/workunit.php?wuid=136799
____________

Profile STE\/E
Joined: 5 Apr 13
Posts: 416
Combined Credit: 29,783,819
DNA@Home: 2,634,206
SubsetSum@Home: 735,231
Wildlife@Home: 26,414,382
Wildlife@Home Watched: 53,380,530s
Wildlife@Home Events: 9,349
Climate Tweets: 0
Images Observed: 0

          
Message 4699 - Posted: 17 Oct 2014, 6:48:02 UTC
Last modified: 17 Oct 2014, 6:50:18 UTC

Seems like I'm getting a lot of errors too on all my boxes with the new WUs ...

Profile [AF>France>IDF]Lic
Joined: 30 Aug 13
Posts: 6
Combined Credit: 11,024,652
DNA@Home: 436,798
SubsetSum@Home: 140,468
Wildlife@Home: 10,447,386
Wildlife@Home Watched: 16,667s
Wildlife@Home Events: 0
Climate Tweets: 381
Images Observed: 256

            
Message 4701 - Posted: 17 Oct 2014, 7:15:02 UTC

Same thing for me for all the new WUs.

http://volunteer.cs.und.edu/csg/result.php?resultid=298008
and the same last line in stderr:
"created sequence '> 'TMEM203 chr9 140099590 140100590' with max sites = '4'"

P . P . L .

  
Message 4702 - Posted: 17 Oct 2014, 7:37:14 UTC

Travis.

I've aborted all the gibbs_test_hg19_1000fa_ tasks; I still have a few of the older ones to run for now.

Hope you can get it sorted out.
____________

Profile STE\/E

          
Message 4703 - Posted: 17 Oct 2014, 9:22:02 UTC - in response to Message 4699.

Seems like I'm getting a lot of errors too on all my boxes with the new WUs ...


I think this is why some/many are failing for me: PBOYZTOY072

57676 Citizen Science Grid 10/17/2014 5:17:38 AM Output file gibbs_test_hg19_1000fa_1_767_0_0_1 for task gibbs_test_hg19_1000fa_1_767_0_0 exceeds size limit.

I see this on all the ones that are failing ...

Travis Desell
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 16 Jan 12
Posts: 1813
Combined Credit: 23,514,257
DNA@Home: 293,563
SubsetSum@Home: 349,212
Wildlife@Home: 22,871,482
Wildlife@Home Watched: 212,926s
Wildlife@Home Events: 51
Climate Tweets: 21
Images Observed: 774

              
Message 4704 - Posted: 17 Oct 2014, 14:41:39 UTC - in response to Message 4703.

Removed these, going to fix the work unit information.

P . P . L .

  
Message 4724 - Posted: 18 Oct 2014, 4:44:17 UTC

I posted in the other thread; I've still got problems.

Sat 18 Oct 2014 15:31:18 EST | Citizen Science Grid | Computation for task gibbs_test_hg19_1000fa_2_347_0_1 finished
Sat 18 Oct 2014 15:31:18 EST | Citizen Science Grid | Output file gibbs_test_hg19_1000fa_2_347_0_1_1 for task gibbs_test_hg19_1000fa_2_347_0_1 exceeds size limit.
Sat 18 Oct 2014 15:31:18 EST | Citizen Science Grid | File size: 8015669.000000 bytes. Limit: 5000000.000000 bytes
____________
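
The client messages quoted above give both the reported size and the limit, so the overage can be worked out straight from the log line. Here is a minimal sketch in Python, assuming the message always has the "File size: ... bytes. Limit: ... bytes" shape seen in this thread. The function name `parse_size_limit` is made up for illustration and is not part of BOINC.

```python
import re

# Pattern for the client message quoted in this thread. Whether other
# BOINC client versions word it the same way is an assumption.
SIZE_RE = re.compile(r"File size: ([\d.]+) bytes\. Limit: ([\d.]+) bytes")


def parse_size_limit(line):
    """Return (size, limit, overage) in bytes, or None if the line does not match."""
    match = SIZE_RE.search(line)
    if match is None:
        return None
    size = int(float(match.group(1)))
    limit = int(float(match.group(2)))
    return size, limit, size - limit


line = ("Sat 18 Oct 2014 15:31:18 EST | Citizen Science Grid | "
        "File size: 8015669.000000 bytes. Limit: 5000000.000000 bytes")
size, limit, overage = parse_size_limit(line)
print(f"{size} bytes is {overage} bytes over the {limit}-byte limit "
      f"({size / limit:.2f}x)")
```

For the line above, the output file is 3,015,669 bytes over the limit, or roughly 1.6 times the allowed size, which fits the idea that the longer tasks simply write more output than the work unit settings allowed.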

Profile STE\/E

          
Message 4725 - Posted: 18 Oct 2014, 5:05:10 UTC

Same here, still getting the File Size Error ...

PBOYZTOY082

138827 Citizen Science Grid 10/18/2014 1:00:08 AM Output file gibbs_test_hg19_1000fa_2_680_0_0_1 for task gibbs_test_hg19_1000fa_2_680_0_0 exceeds size limit.

Travis Desell

              
Message 4728 - Posted: 18 Oct 2014, 14:03:29 UTC - in response to Message 4725.

Same here, still getting the File Size Error ...

PBOYZTOY082

138827 Citizen Science Grid 10/18/2014 1:00:08 AM Output file gibbs_test_hg19_1000fa_2_680_0_0_1 for task gibbs_test_hg19_1000fa_2_680_0_0 exceeds size limit.


These should all be fixed now. I've updated the run to ...1000fa_3 which has fixed output sizes.

Profile STE\/E

          
Message 4731 - Posted: 18 Oct 2014, 20:59:20 UTC

Seems to be okay now Travis ...

P . P . L .

  
Message 4732 - Posted: 18 Oct 2014, 22:30:01 UTC

I've a few of the new ones running now, I'll let you know how they go.
____________

P . P . L .

  
Message 4733 - Posted: 19 Oct 2014, 0:23:40 UTC

Good news: I've had a couple of the bigger ones complete and validate now.

Thanks, Travis.
____________

P . P . L .

  
Message 4820 - Posted: 13 Nov 2014, 5:29:02 UTC

Hi Travis.

Had another long one fail.


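Task: gibbs_test_hg19_1000fa_5_125__177647_260000_1
Work unit: 347128
Computer: 4199
Sent: 10 Nov 2014, 8:12:20 UTC
Reported: 12 Nov 2014, 3:51:38 UTC
Status: Completed, marked as invalid
Run time (sec): 7,201.88
CPU time (sec): 7,115.14
Credit: 0.00
Application: DNA@Home Gibbs Sampler v0.48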
____________

Travis Desell

              
Message 4828 - Posted: 15 Nov 2014, 4:20:36 UTC - in response to Message 4820.

Pretty sure if a task checkpoints it has a good chance of not validating. Hoping to get it sorted out soon.

Alexander
Joined: 11 Aug 14
Posts: 41
Combined Credit: 23,861,254
DNA@Home: 428,269
SubsetSum@Home: 1,125,177
Wildlife@Home: 22,307,809
Wildlife@Home Watched: 0s
Wildlife@Home Events: 0
Climate Tweets: 0
Images Observed: 0

      
Message 4840 - Posted: 25 Nov 2014, 17:41:27 UTC

This task failed after I shut down my PC and restarted:
http://volunteer.cs.und.edu/csg/result.php?resultid=1013063
type_string: reverse, motif_width: 6
starting from checkpoint
reading from samples checkpoint
ERROR: reading samples, reached end of samples before all samples should have been read.
accumulated sample: [279], prev_comma_pos: [558], current_comma_pos: [18446744073709551615]
error on line [270], file [..\..\GitHub\dna_at_home\gibbs_cpp\checkpoint.cpp]
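
A note on the stderr above: 18446744073709551615 is 2^64 - 1, the largest unsigned 64-bit integer, which is the value C++ `std::string::npos` has on a typical 64-bit build. That is consistent with (though it does not prove) a search for the next comma finding nothing because the checkpoint file ended early. The sketch below illustrates the idea in Python. It is not the project's checkpoint.cpp code, and `read_samples` and the comma-terminated layout are assumptions made for illustration.

```python
NPOS = 2**64 - 1  # value of std::string::npos on a typical 64-bit build


def read_samples(text, expected):
    """Read `expected` comma-terminated integers from `text`.

    Hypothetical layout: every sample is followed by a comma. Raises
    ValueError if the text ends before all samples have been read.
    """
    samples = []
    start = 0
    for index in range(expected):
        # Python's find returns -1 where C++ std::string::find returns npos.
        comma = text.find(",", start)
        if comma == -1:
            raise ValueError(
                f"reached end of samples at sample {index} of {expected}"
            )
        samples.append(int(text[start:comma]))
        start = comma + 1
    return samples


print(NPOS == 18446744073709551615)
print(read_samples("3,1,4,", 3))
try:
    read_samples("3,1,4", 3)  # truncated: the last comma was never written
except ValueError as err:
    print("truncated checkpoint:", err)
```

If the real cause is similar, a checkpoint that was only partly written when the PC shut down would produce exactly this kind of "reached end of samples" failure on restart.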

Profile Henk Haneveld
Joined: 25 Dec 14
Posts: 8
Combined Credit: 626,885
DNA@Home: 17,297
SubsetSum@Home: 40,264
Wildlife@Home: 569,324
Wildlife@Home Watched: 0s
Wildlife@Home Events: 0
Climate Tweets: 0
Images Observed: 0

      
Message 4923 - Posted: 27 Dec 2014, 10:58:28 UTC

I joined the project a couple of days ago and I have a problem with results getting called invalid.

It looks like the cause is that I don't run my system 24/7 and that these results were in progress when I shut down for the night and then had to start up the next day from the last checkpoint.

http://volunteer.cs.und.edu/csg/result.php?resultid=2123977

http://volunteer.cs.und.edu/csg/result.php?resultid=2127647

Please fix the problem or advise a way to avoid these longer-running results.
____________

Profile Henk Haneveld

      
Message 4924 - Posted: 27 Dec 2014, 12:20:49 UTC

Correction to my first posting.

I was too quick to think that restarting from a checkpoint is a problem.

I have just returned another result with a checkpoint restart and this one validated just fine.

I still don't understand why the other 2 had problems.

I will be grateful for any help dealing with this.
____________

Ananas
Joined: 12 Aug 14
Posts: 27
Combined Credit: 1,727,216
DNA@Home: 959,874
SubsetSum@Home: 767,342
Wildlife@Home: 0
Wildlife@Home Watched: 73,544s
Wildlife@Home Events: 10
Climate Tweets: 0
Images Observed: 0

        
Message 4925 - Posted: 27 Dec 2014, 15:11:14 UTC - in response to Message 4924.

...
I was too quick to think that restarting from a checkpoint is a problem.

I have just returned another result with a checkpoint restart and this one validated just fine....

I don't think that you're wrong; restarted (not resumed!) tasks are more likely to fail. It might depend on the moment when the checkpoint was taken, though.

Leaving suspended tasks in memory is a good idea here.

Depending on the projects you run concurrently, it might help to reduce the checkpoint frequency (write to disk every ...). Projects with extreme HDD activity on checkpoints (MalariaControl is the worst one I know, but there might be others) can cause heartbeat errors when multiple results checkpoint at the same time. If a checkpoint is used, you lose some more time, but the risk that a checkpoint will be needed to resume work decreases.
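
For anyone who prefers a file over the preferences dialog, the two settings mentioned above can be sketched as a local override. This is a hedged example: the element names `disk_interval` and `leave_apps_in_memory` and the file name `global_prefs_override.xml` are given as I understand the BOINC client's override format, so please check them against the client documentation before relying on them. The helper `build_override` and the 900-second value are purely illustrative.

```python
import xml.etree.ElementTree as ET


def build_override(disk_interval_seconds, leave_in_memory):
    """Build a preferences override document as an XML string.

    Element names are assumptions based on the BOINC override file format.
    """
    root = ET.Element("global_preferences")
    ET.SubElement(root, "disk_interval").text = str(disk_interval_seconds)
    ET.SubElement(root, "leave_apps_in_memory").text = (
        "1" if leave_in_memory else "0"
    )
    return ET.tostring(root, encoding="unicode")


# Illustrative values: checkpoint at most every 900 seconds and keep
# suspended tasks in memory.
print(build_override(900, True))
```

The printed XML would go into the override file in the BOINC data directory, after which the client has to re-read its local preferences. A longer interval means fewer disk writes, at the cost of more lost work whenever a checkpoint does have to be used.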

Profile Henk Haneveld

      
Message 4926 - Posted: 27 Dec 2014, 18:33:27 UTC

Ananas

Leaving suspended tasks in memory is already on, so that is no help.

I have checkpoint writing set at a 5-minute interval instead of the 1-minute default.
I will increase this to 15 minutes, but I run several projects and a higher value is impractical.

However I did locate the source of the validation error.
In the stderr output file there is a value "argument seed (number)" at the start and a value "seeding (number)" at the end.

If these are equal then the result is valid, unequal is invalid.

It looks to me that restarting from a checkpoint can cause this seed value to become corrupt, and that points to some kind of error in the way the application loads checkpoint data.
____________
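
The seed check described above is easy to automate for anyone who wants to test their own results. A minimal sketch in Python follows, with the caveat that the post only says stderr contains "argument seed (number)" near the start and "seeding (number)" near the end. The exact line format, the regular expressions, and the function name `seeds_match` are assumptions, and the two sample texts are invented rather than taken from real output.

```python
import re

# Hypothetical patterns: allow any non-digit characters on the same line
# between the label and the number.
ARG_SEED_RE = re.compile(r"argument seed[^\d\n]*(\d+)")
FINAL_SEED_RE = re.compile(r"seeding[^\d\n]*(\d+)")


def seeds_match(stderr_text):
    """Return True or False if both seeds are found, None if either is missing."""
    first = ARG_SEED_RE.search(stderr_text)
    finals = FINAL_SEED_RE.findall(stderr_text)
    if first is None or not finals:
        return None
    # Compare the seed passed in with the last seed the application reports.
    return int(first.group(1)) == int(finals[-1])


# Invented sample text, for illustration only.
good = "argument seed: 12345\nsampling\nseeding with: 12345\n"
bad = "argument seed: 12345\nstarting from checkpoint\nseeding with: 987\n"

print(seeds_match(good))
print(seeds_match(bad))
print(seeds_match("no seed lines here"))
```

A False result on a task that restarted from a checkpoint would match the pattern reported here. It would not by itself show where the seed gets lost.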

Ananas

        
Message 4927 - Posted: 27 Dec 2014, 21:42:46 UTC - in response to Message 4926.

...

However I did locate the source of the validation error.
In the stderr output file there is a value "argument seed (number)" at the start and a value "seeding (number)" at the end.

If these are equal then the result is valid, unequal is invalid.

It looks to me that restarting from a checkpoint can cause this seed value to become corrupt, and that points to some kind of error in the way the application loads checkpoint data.

Wow, good observation, this should help Travis find the bug and squish it :-)
