Gibson Praise
Joined: 5 Aug 16 Posts: 56 Combined Credit: 50,846,655 DNA@Home: 0 SubsetSum@Home: 73,256 Wildlife@Home: 50,773,399 Wildlife@Home Watched: 0s Wildlife@Home Events: 0 Climate Tweets: 769 Images Observed: 25,032
Hey Travis. Got a couple of MNIST units that are running really long, well outside the normal time frames I get. (normally no more than ~30 hours)
exact_genome_1490477651_4_1417_0 -> currently at 46 hrs+ and 87% complete
and
exact_genome_1490477651_3_1411_1 -> 52 hrs and 64% complete (wingman complete at 68 hours).
Should be interesting to look at.
Travis Desell (Volunteer moderator, Project administrator, Project developer, Project scientist)
Joined: 16 Jan 12 Posts: 1813 Combined Credit: 23,514,257 DNA@Home: 293,563 SubsetSum@Home: 349,212 Wildlife@Home: 22,871,482 Wildlife@Home Watched: 212,926s Wildlife@Home Events: 51 Climate Tweets: 22 Images Observed: 774
Hey Travis. Got a couple of MNIST units that are running really long, well outside the normal time frames I get. (normally no more than ~30 hours)
exact_genome_1490477651_4_1417_0 -> currently at 46 hrs+ and 87% complete
and
exact_genome_1490477651_3_1411_1 -> 52 hrs and 64% complete (wingman complete at 68 hours).
Should be interesting to look at.
Well there's probably a couple of things going on here. I bumped up the number of epochs they're training for from 150 to 200, so that would be a 33% increase in runtime alone. This will give more refined/better results given what I've seen coming in.
Second, the convolutional neural networks that are being trained get progressively more complex as the searches proceed. Compare:
http://csgrid.org/csg/exact/genome.php?id=1124
Which is from a recently started search, the 612th CNN evolved for that search, to:
http://csgrid.org/csg/exact/genome.php?id=1258
Which was the 5660th CNN evolved for that search (and current best performing CNN).
So as things go, the neural networks get more complex and require more time per epoch, and I bumped up the number of epochs -- so that's why you're seeing these long running workunits.
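In other words, total runtime scales roughly as (number of epochs) × (time per epoch of the evolved CNN), and both factors went up at once. A quick back-of-the-envelope sketch (all numbers hypothetical, not measured project values):

```cpp
#include <cstdio>

int main() {
    // assumed, illustrative numbers -- not measured project values
    const double early_epoch_seconds = 600.0;   // per-epoch time for a small, early CNN
    const double late_epoch_seconds  = 1100.0;  // per-epoch time for a larger, evolved CNN

    const double early_hours = 150 * early_epoch_seconds / 3600.0;  // old epoch count
    const double late_hours  = 200 * late_epoch_seconds  / 3600.0;  // new epoch count

    std::printf("early WU: %.1f h, late WU: %.1f h (%.0f%% longer)\n",
                early_hours, late_hours, 100.0 * (late_hours / early_hours - 1.0));
    return 0;
}
```

With those assumed per-epoch times the newer work units come out more than twice as long as the early ones, which is in the same ballpark as the ~30-hour vs. 46-52-hour runtimes reported above.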
I'm hoping to get a GPU version of the application up and going in the next month or so, which should hopefully get some nice performance increases and get results flowing in faster.
Gibson Praise
Joined: 5 Aug 16 Posts: 56 Combined Credit: 50,846,655 DNA@Home: 0 SubsetSum@Home: 73,256 Wildlife@Home: 50,773,399 Wildlife@Home Watched: 0s Wildlife@Home Events: 0 Climate Tweets: 769 Images Observed: 25,032
exact_genome_1490477651_3_1411_1 -> 52 hrs and 64% complete (wingman complete at 68 hours).
Should be interesting to look at.
Well... this one finished and it is interesting, but not quite what I hoped. This 96-hour work unit yielded only 747.82 points (I, of course, was expecting a veritable bonanza of over 10K points!). I know that points aren't awarded on a straight-line basis against time, but this is seriously out of whack no matter what the algorithm. It obviously was not doing the kind of work needed to accrue points.
So could you investigate? If there is something hinky with my machine, then I need to fix it (though other units appear to be proceeding normally). Somewhere, this one went off the rails.
g
Beyond
Joined: 4 Feb 15 Posts: 12 Combined Credit: 16,990,008 DNA@Home: 66,428 SubsetSum@Home: 195,743 Wildlife@Home: 16,727,837 Wildlife@Home Watched: 0s Wildlife@Home Events: 0 Climate Tweets: 0 Images Observed: 0
exact_genome_1490477651_3_1411_1 -> 52 hrs and 64% complete (wingman complete at 68 hours).
Should be interesting to look at.
Well... this one finished and it is interesting, but not quite what I hoped. This 96-hour work unit yielded only 747.82 points (I, of course, was expecting a veritable bonanza of over 10K points!). I know that points aren't awarded on a straight-line basis against time, but this is seriously out of whack no matter what the algorithm. It obviously was not doing the kind of work needed to accrue points.
So could you investigate? If there is something hinky with my machine, then I need to fix it (though other units appear to be proceeding normally). Somewhere, this one went off the rails.
Seems to be par for the course. Huge runtimes and tiny credit on some WUs. Strangely, this seems to be happening mostly on my fastest machines. Just decided to drop those off the project at least until this is sorted out. Have one WU that's been running for 180 hours on a fast box and am expecting small credit when it finishes...
JumpinJohnny
Joined: 24 Sep 13 Posts: 237 Combined Credit: 10,275,610 DNA@Home: 192,548 SubsetSum@Home: 201,740 Wildlife@Home: 9,881,323 Wildlife@Home Watched: 55,997,833s Wildlife@Home Events: 15,584 Climate Tweets: 334 Images Observed: 351
I also have seen these.
Two of them I aborted after 30 hours.
I let a few run only to get 700 credit for 400,000+ cpu sec.
I am aborting another.
I wish I could tell in advance which of these are going to do this, instead of letting a core waste an entire day reaching 30% completion only to be aborted.
The slowness of the WUs does not seem to reflect OS type or CPU speed.
Some wingmen are doing them in comparably long times; some are a lot quicker.
------------------------------------------------------------------------------
Even the rest of the results are really just all over the place.
25,604.48 sec = 1,351.24 credits
68,537.76 = 1,258.08
36,625.87 = 2,257.77
32,292.02 = 880.93
The credits algorithm seems to have taken a turn to the dark side.
Travis Desell (Volunteer moderator, Project administrator, Project developer, Project scientist)
Joined: 16 Jan 12 Posts: 1813 Combined Credit: 23,514,257 DNA@Home: 293,563 SubsetSum@Home: 349,212 Wildlife@Home: 22,871,482 Wildlife@Home Watched: 212,926s Wildlife@Home Events: 51 Climate Tweets: 22 Images Observed: 774
I also have seen these.
Two of them I aborted after 30 hours.
I let a few run only to get 700 credit for 400,000+ cpu sec.
I am aborting another.
I wish I could tell in advance which of these are going to do this, instead of letting a core waste an entire day reaching 30% completion only to be aborted.
The slowness of the WUs does not seem to reflect OS type or CPU speed.
Some wingmen are doing them in comparably long times; some are a lot quicker.
------------------------------------------------------------------------------
Even the rest of the results are really just all over the place.
25,604.48 sec = 1,351.24 credits
68,537.76 = 1,258.08
36,625.87 = 2,257.77
32,292.02 = 880.93
The credits algorithm seems to have taken a turn to the dark side.
So this seems like something weird may be afoot. Any chance you could link me the work units you aborted? Or if you see these again let me know the work unit so I can take a deeper look and see what's going on? There may be some kind of bug going on that's making them run significantly longer than they should.
GLeeM
Joined: 1 Jul 13 Posts: 118 Combined Credit: 47,541,025 DNA@Home: 28,994 SubsetSum@Home: 231,079 Wildlife@Home: 47,280,952 Wildlife@Home Watched: 3,888,714s Wildlife@Home Events: 628 Climate Tweets: 0 Images Observed: 0
So this seems like something weird may be afoot. Any chance you could link me the work units you aborted? Or if you see these again let me know the work unit so I can take a deeper look and see what's going on? There may be some kind of bug going on that's making them run significantly longer than they should.
You can see his WUs by clicking above on his "name", then "view" computers then "tasks" then "error".
Here is one of mine: http://csgrid.org/csg/workunit.php?wuid=979996
If I remember right it started with "Remaining (estimated)" = less than one day. When I aborted, the wingman had finished long before and the "Remaining" was greater than one day and counting up, not down.
Beyond
Joined: 4 Feb 15 Posts: 12 Combined Credit: 16,990,008 DNA@Home: 66,428 SubsetSum@Home: 195,743 Wildlife@Home: 16,727,837 Wildlife@Home Watched: 0s Wildlife@Home Events: 0 Climate Tweets: 0 Images Observed: 0
So this seems like something weird may be afoot. Any chance you could link me the work units you aborted? Or if you see these again let me know the work unit so I can take a deeper look and see what's going on? There may be some kind of bug going on that's making them run significantly longer than they should.
Here's one of many:
http://csgrid.org/csg/workunit.php?wuid=980866
My theory is that the apps are now being compiled with the wrong switches for general use.
Travis Desell (Volunteer moderator, Project administrator, Project developer, Project scientist)
Joined: 16 Jan 12 Posts: 1813 Combined Credit: 23,514,257 DNA@Home: 293,563 SubsetSum@Home: 349,212 Wildlife@Home: 22,871,482 Wildlife@Home Watched: 212,926s Wildlife@Home Events: 51 Climate Tweets: 22 Images Observed: 774
So this seems like something weird may be afoot. Any chance you could link me the work units you aborted? Or if you see these again let me know the work unit so I can take a deeper look and see what's going on? There may be some kind of bug going on that's making them run significantly longer than they should.
Here's one of many:
http://csgrid.org/csg/workunit.php?wuid=980866
My theory is that the apps are now being compiled with the wrong switches for general use.
This is really really odd. I'm looking into it. Both of those are running Windows 7, so I don't know why one would be significantly slower depending on how they were compiled.
What's even weirder is that the one it's running slower on has more cache and memory, so if anything it should be running quite a bit faster.
Travis Desell (Volunteer moderator, Project administrator, Project developer, Project scientist)
Joined: 16 Jan 12 Posts: 1813 Combined Credit: 23,514,257 DNA@Home: 293,563 SubsetSum@Home: 349,212 Wildlife@Home: 22,871,482 Wildlife@Home Watched: 212,926s Wildlife@Home Events: 51 Climate Tweets: 22 Images Observed: 774
Status update on this; I've made a news post as well.
I think I may have found/fixed the issue. My hunch is that when the initial learning rate is very low (quite a few of these "monster" WUs have initial learning rates of 1e-08), it's causing the weights/weight updates of back propagation to get extremely small.
I think on some architectures, depending on how the double-precision math has been implemented, they may run a fair bit slower around these extreme values, which would cause the slowdowns. I've updated things server-side to not generate any WUs with initial learning rates lower than 1e-05, which I'm hoping will fix the issue.
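For anyone curious why tiny values would slow things down: many CPUs handle subnormal (denormal) doubles through a much slower microcoded path. Here is a hypothetical stand-alone micro-benchmark (not EXACT code) that shows the effect on machines where it applies:

```cpp
// Hypothetical micro-benchmark: times the same multiply-add loop with a normal
// operand and with a subnormal (denormal) one. On many x86 CPUs the subnormal
// case runs several times slower, which is the kind of slowdown tiny learning
// rates could trigger as weight updates shrink toward zero.
#include <chrono>
#include <cstdio>

static double run(double x, long iters) {
    volatile double acc = 0.0;           // volatile keeps the loop from being optimized away
    for (long i = 0; i < iters; ++i)
        acc = acc + x * 0.5;             // x * 0.5 stays subnormal when x is subnormal
    return acc;
}

int main() {
    const long iters = 50000000L;
    const double inputs[] = {1e-3, 1e-310};   // 1e-310 is a subnormal double
    for (double x : inputs) {
        auto t0 = std::chrono::steady_clock::now();
        run(x, iters);
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("x = %g: %lld ms\n", x, (long long)ms);
    }
    return 0;
}
```

Flooring the initial learning rate at 1e-05 keeps the weight updates well away from that subnormal range, which is the idea behind the server-side change.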
Beyond
Joined: 4 Feb 15 Posts: 12 Combined Credit: 16,990,008 DNA@Home: 66,428 SubsetSum@Home: 195,743 Wildlife@Home: 16,727,837 Wildlife@Home Watched: 0s Wildlife@Home Events: 0 Climate Tweets: 0 Images Observed: 0
So this seems like something weird may be afoot. Any chance you could link me the work units you aborted? Or if you see these again let me know the work unit so I can take a deeper look and see what's going on? There may be some kind of bug going on that's making them run significantly longer than they should.
Here's one of many:
http://csgrid.org/csg/workunit.php?wuid=980866
My theory is that the apps are now being compiled with the wrong switches for general use.
This is really really odd. I'm looking into it. Both of those are running Windows 7, so I don't know why one would be significantly slower depending on how they were compiled.
What's even weirder, is that the one it's running slower on has more cache and memory, so if anything it should be running quite a bit faster.
It used to be, not long ago, that these machines were very competitive in CSG; now, at least on some WUs, they're dirt slow. It really makes me wonder about the switches used for compiling the last few app versions. All of a sudden my 8-core AMD 83xx machines are abysmal on this project. They used to be relatively fast. They're still fast on the other projects where they're situated for now.
Travis Desell (Volunteer moderator, Project administrator, Project developer, Project scientist)
Joined: 16 Jan 12 Posts: 1813 Combined Credit: 23,514,257 DNA@Home: 293,563 SubsetSum@Home: 349,212 Wildlife@Home: 22,871,482 Wildlife@Home Watched: 212,926s Wildlife@Home Events: 51 Climate Tweets: 22 Images Observed: 774
So this seems like something weird may be afoot. Any chance you could link me the work units you aborted? Or if you see these again let me know the work unit so I can take a deeper look and see what's going on? There may be some kind of bug going on that's making them run significantly longer than they should.
Here's one of many:
http://csgrid.org/csg/workunit.php?wuid=980866
My theory is that the apps are now being compiled with the wrong switches for general use.
This is really really odd. I'm looking into it. Both of those are running Windows 7, so I don't know why one would be significantly slower depending on how they were compiled.
What's even weirder, is that the one it's running slower on has more cache and memory, so if anything it should be running quite a bit faster.
It used to be, not long ago that these machines were very competitive in CSG, now at least on some WUs they're dirt slow. It really makes me wonder about the switches used for compiling the last few app versions. All of a sudden my 8 core AMD 83xx machines are abysmal on this project. They used to be relatively fast. They're still fast on other projects where they're situated for now.
I honestly haven't tweaked any of the compilation settings in quite a few app versions. I'm basically using the standard options for a release build.
The fact that it's only happening on some machines, some of the time, tells me it's some issue with the application (or the initial parameters I'm setting) -- not an issue with how it's compiled.
JumpinJohnny
Joined: 24 Sep 13 Posts: 237 Combined Credit: 10,275,610 DNA@Home: 192,548 SubsetSum@Home: 201,740 Wildlife@Home: 9,881,323 Wildlife@Home Watched: 55,997,833s Wildlife@Home Events: 15,584 Climate Tweets: 334 Images Observed: 351
....
The slowness of the WU's does not seem to reflect OS type or CPU speed.
Some wingmen are doing them in comparably long times; some are a lot quicker.
------------------------------------------------------------------------------
Even the rest of the results are really just all over the place.
25,604.48 sec = 1,351.24 credits
68,537.76 = 1,258.08
36,625.87 = 2,257.77
32,292.02 = 880.93
The credits algorithm seems to have taken a turn to the dark side.
So... it's not just a few of the extra-long ones running weirdly. There are plenty of other examples of inconsistent times compared to credits, which also do NOT always match up with a CPU type or OS.
Travis Desell (Volunteer moderator, Project administrator, Project developer, Project scientist)
Joined: 16 Jan 12 Posts: 1813 Combined Credit: 23,514,257 DNA@Home: 293,563 SubsetSum@Home: 349,212 Wildlife@Home: 22,871,482 Wildlife@Home Watched: 212,926s Wildlife@Home Events: 51 Climate Tweets: 22 Images Observed: 774
....
The slowness of the WU's does not seem to reflect OS type or CPU speed.
Some wingmen are doing them in comparably long times; some are a lot quicker.
------------------------------------------------------------------------------
Even the rest of the results are really just all over the place.
25,604.48 sec = 1,351.24 credits
68,537.76 = 1,258.08
36,625.87 = 2,257.77
32,292.02 = 880.93
The credits algorithm seems to have taken a turn to the dark side.
So... it's not just a few of the extra-long ones running weirdly. There are plenty of other examples of inconsistent times compared to credits, which also do NOT always match up with a CPU type or OS.
Just making sure -- in the first example, you have two different work units giving the same amount of credit but with significantly different runtimes. And in the second one you have two different workunits with similar runtimes giving significantly different credit. Are these from the same system?
If that's the case I need to update the credit calculation. Inconsistent runtime to credit is probably a problem with how I'm calculating credit as the convolutional neural networks get larger. This is an easier fix and not a problem with the application at any rate.
I'll make it a priority and try and get it resolved by the end of the week (unfortunately I'm stuck in some 8 hours of meetings tomorrow which will slow me down).
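Purely as an illustration of that direction (this is not the project's actual credit formula), credit could be tied to the amount of computation a work unit actually represents -- epochs times a per-epoch operation count that grows with the evolved CNN -- rather than to wall-clock time. A hypothetical sketch with made-up constants:

```cpp
// Hypothetical illustration only -- NOT the project's actual credit formula.
// Idea: award credit for computation performed (epochs x per-epoch operation
// count, which grows with the evolved CNN), not for wall-clock time, so large
// and small networks on fast or slow hosts are scored consistently.
#include <cstdint>
#include <cstdio>

// assumed helper: rough multiply-accumulate count for one training epoch
static double epoch_ops(std::uint64_t weights, std::uint64_t training_images) {
    // crude forward + backward pass estimate
    return 6.0 * static_cast<double>(weights) * static_cast<double>(training_images);
}

static double credit_for(std::uint64_t weights, int epochs,
                         std::uint64_t training_images, double credit_per_teraop) {
    const double teraops = epochs * epoch_ops(weights, training_images) / 1e12;
    return teraops * credit_per_teraop;
}

int main() {
    // two hypothetical WUs: a small early CNN and a larger evolved one
    std::printf("small CNN: %.0f credits\n", credit_for(20000, 200, 60000, 1000.0));
    std::printf("large CNN: %.0f credits\n", credit_for(80000, 200, 60000, 1000.0));
    return 0;
}
```

The point of the sketch is only that two hosts finishing the same network would get the same credit regardless of runtime, while a larger evolved network would earn proportionally more.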
JumpinJohnny
Joined: 24 Sep 13 Posts: 237 Combined Credit: 10,275,610 DNA@Home: 192,548 SubsetSum@Home: 201,740 Wildlife@Home: 9,881,323 Wildlife@Home Watched: 55,997,833s Wildlife@Home Events: 15,584 Climate Tweets: 334 Images Observed: 351
....
The slowness of the WU's does not seem to reflect OS type or CPU speed.
Some wingmen are doing them in comparably long times; some are a lot quicker.
------------------------------------------------------------------------------
Even the rest of the results are really just all over the place.
25,604.48 sec = 1,351.24 credits
68,537.76 = 1,258.08
36,625.87 = 2,257.77
32,292.02 = 880.93
The credits algorithm seems to have taken a turn to the dark side.
So... it's not just a few of the extra-long ones running weirdly. There are plenty of other examples of inconsistent times compared to credits, which also do NOT always match up with a CPU type or OS.
Just making sure -- in the first example, you have two different work units giving the same amount of credit but with significantly different runtimes. And in the second one you have two different workunits with similar runtimes giving significantly different credit. Are these from the same system?
If that's the case I need to update the credit calculation. Inconsistent runtime to credit is probably a problem with how I'm calculating credit as the convolutional neural networks get larger. This is an easier fix and not a problem with the application at any rate.
I'll make it a priority and try and get it resolved by the end of the week (unfortunately I'm stuck in some 8 hours of meetings tomorrow which will slow me down).
Yes. Those were 4 quick samples from recent valid WU all done on the same machine.
Here are two sets from today:
exact_genome_1490799770_4_9626
45,742.33 credits: 3,027.08
exact_genome_1490799770_4_9647
47,007.91 credits: 1,979.90
exact_genome_1490799770_5_4087
35,142.55 credits: 916.30
exact_genome_1490799770_3_8159
36,939.77 credits: 3,499.56
Those are just a couple of examples from today. The credits are not predictable from the work done by the CPU, and the deviation seems random.
I thought this might be associated with the "monster"-size WUs that were being sent, because the credits were very, very low on the ones that completed, which has been the subject of this thread. I thought this general credit problem might be related or have the same cause.
Beyond
Joined: 4 Feb 15 Posts: 12 Combined Credit: 16,990,008 DNA@Home: 66,428 SubsetSum@Home: 195,743 Wildlife@Home: 16,727,837 Wildlife@Home Watched: 0s Wildlife@Home Events: 0 Climate Tweets: 0 Images Observed: 0
Are these from the same system?
If that's the case I need to update the credit calculation. Inconsistent runtime to credit is probably a problem with how I'm calculating credit as the convolutional neural networks get larger. This is an easier fix and not a problem with the application at any rate.
I'll make it a priority and try and get it resolved by the end of the week (unfortunately I'm stuck in some 8 hours of meetings tomorrow which will slow me down).
Here are 3 consecutive results from the same system. All EXACT MNIST Convolutional Neural Network Trainer v0.20:
2142450 1000174 3 Apr 2017, 13:42:20 UTC 4 Apr 2017, 13:28:43 UTC 57,110.73 56,810.53 2,418.63
2140525 999371 3 Apr 2017, 4:41:38 UTC 4 Apr 2017, 6:40:51 UTC 54,518.36 54,258.40 1,429.07
2140165 999224 3 Apr 2017, 2:12:39 UTC 3 Apr 2017, 20:46:23 UTC 52,309.10 52,088.45 3,400.23
Travis Desell (Volunteer moderator, Project administrator, Project developer, Project scientist)
Joined: 16 Jan 12 Posts: 1813 Combined Credit: 23,514,257 DNA@Home: 293,563 SubsetSum@Home: 349,212 Wildlife@Home: 22,871,482 Wildlife@Home Watched: 212,926s Wildlife@Home Events: 51 Climate Tweets: 22 Images Observed: 774
Are these from the same system?
If that's the case I need to update the credit calculation. Inconsistent runtime to credit is probably a problem with how I'm calculating credit as the convolutional neural networks get larger. This is an easier fix and not a problem with the application at any rate.
I'll make it a priority and try and get it resolved by the end of the week (unfortunately I'm stuck in some 8 hours of meetings tomorrow which will slow me down).
Here's 3 consecutive results from the same system. All EXACT MNIST Convolutional Neural Network Trainer v0.20:
2142450 1000174 3 Apr 2017, 13:42:20 UTC 4 Apr 2017, 13:28:43 UTC 57,110.73 56,810.53 2,418.63
2140525 999371 3 Apr 2017, 4:41:38 UTC 4 Apr 2017, 6:40:51 UTC 54,518.36 54,258.40 1,429.07
2140165 999224 3 Apr 2017, 2:12:39 UTC 3 Apr 2017, 20:46:23 UTC 52,309.10 52,088.45 3,400.23
Blarg, the output files have gone poof. I'm digging -- I think I might know what the issue may be, but I think it's going to need an application update to sort out.
Travis Desell (Volunteer moderator, Project administrator, Project developer, Project scientist)
Joined: 16 Jan 12 Posts: 1813 Combined Credit: 23,514,257 DNA@Home: 293,563 SubsetSum@Home: 349,212 Wildlife@Home: 22,871,482 Wildlife@Home Watched: 212,926s Wildlife@Home Events: 51 Climate Tweets: 22 Images Observed: 774
I've updated my algorithm for calculating credit, which I think should fix the credit inconsistencies. If you're still seeing this with newly generated work units, let me know.
Conan
Joined: 13 Apr 12 Posts: 151 Combined Credit: 47,672,899 DNA@Home: 399,792 SubsetSum@Home: 1,448,876 Wildlife@Home: 45,824,231 Wildlife@Home Watched: 70,910s Wildlife@Home Events: 0 Climate Tweets: 413 Images Observed: 0
I've updated my algorithm for calculating credit, which I think should fix the credit inconsistencies. If you're still seeing this with newly generated work units, let me know.
I have one at 2 days 10 hours and 48.3% completed with 2 days 14 hours still to go.
As it was sent out on the 3rd I may not get the new credit but we will see how it travels. My wingman must be running a long time too as it is still running for them as well.
Conan
SEARCHER
Joined: 24 Feb 13 Posts: 29 Combined Credit: 1,450,269 DNA@Home: 121,784 SubsetSum@Home: 114,603 Wildlife@Home: 1,213,882 Wildlife@Home Watched: 0s Wildlife@Home Events: 0 Climate Tweets: 3,594 Images Observed: 35,153
Hello Travis,
I now have a problem with a monster WU too. I've been crunching this WU for 4.5 days, and I've been waiting some hours for the okay from my wingman. Now I see I only get 1,092 credits; it's very crazy, Travis.
If you want to check my WU, here is the information:
2136329 979728 32305 2 Apr 2017, 6:12:10 UTC 6 Apr 2017, 9:41:13 UTC Completed and validated 358,028.01 286,172.30 1,091.76 EXACT MNIST Convolutional Neural Network Trainer v0.20
Many Greetz SEARCHER
____________
Member of CHARITY TEAM