Re: [eigen] non-linear optimization test summary
Ok, I have had another look at it. I have attached a patch of my local modifications to study this... it will only compile under GNU/Linux on x86[-64].

Just to check, I enabled SIGFPE signals, and enabled only what was test #7 before I un-split your test. Indeed, it crashed on a SIGFPE inside ColPivHouseholderQR.

So to go ahead, I edited your code to make it use a FullPivHouseholderQR, just to study the problem. (I fully agree that this is not fully satisfactory, as it doesn't offer the same level of performance. I just wanted to eliminate the hypothesis of very bad luck hitting the limitations of ColPivHouseholderQR.)

I then still got a SIGFPE in a different place, namely in the evaluation of your functor:

Program received signal SIGFPE, Arithmetic exception.
0x000000000041c815 in MGH17_functor::operator() (this=0x7fffffffe170, b=..., fvec=...)
    at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1299
1299        fvec[i] = b[0] + b[1]*exp(-b[3]*x[i]) + b[2]*exp(-b[4]*x[i]) - y[i];
(gdb) bt
#0  0x000000000041c815 in MGH17_functor::operator() (this=0x7fffffffe170, b=..., fvec=...)
    at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1299
#1  0x0000000000439260 in Eigen::LevenbergMarquardt<MGH17_functor, double>::minimizeOneStep (this=0x7fffffffdff0, x=...)
    at /home/bjacob/eigen/unsupported/Eigen/src/NonLinearOptimization/LevenbergMarquardt.h:291
#2  0x000000000042339a in Eigen::LevenbergMarquardt<MGH17_functor, double>::minimize (this=0x7fffffffdff0, x=...)
    at /home/bjacob/eigen/unsupported/Eigen/src/NonLinearOptimization/LevenbergMarquardt.h:171
#3  0x000000000040eed7 in testNistMGH17 () at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1338
#4  0x00000000004156af in test_NonLinearOptimization () at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1836
#5  0x0000000000401e18 in main (argc=1, argv=0x7fffffffe478) at /home/bjacob/eigen/test/main.h:529

I then printed a few local variables:

(gdb) print b[1]
$3 = (Eigen::DenseCoeffsBase<Eigen::Matrix<double, -0x00000000000000001, 1, 0, -0x00000000000000001, 1>, true>::Scalar &) @0x6c8128: -23854.038298795142
(gdb) print b[3]
$4 = (Eigen::DenseCoeffsBase<Eigen::Matrix<double, -0x00000000000000001, 1, 0, -0x00000000000000001, 1>, true>::Scalar &) @0x6c8138: -52987.898500712123
(gdb) print x[i]
$5 = 10
(gdb) print b[4]
$6 = (Eigen::DenseCoeffsBase<Eigen::Matrix<double, -0x00000000000000001, 1, 0, -0x00000000000000001, 1>, true>::Scalar &) @0x6c8140: -1750064561.4840834

So what's happening here is that we have huge values in the b vector, leading to overflow when calling exp().

The next thing to do is to come back to a state where your test was successful, and check if there were already SIGFPEs... If there were already SIGFPEs, that would probably mean that there was a preexisting problem and that my commit only exposed it. If there weren't SIGFPEs, that would mean that my commit introduced a serious computational bug.

Benoit

2010/6/11 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
> Can you please remind me with which revision of mine the errors appeared?
>
> I'll try to have another look at it!
>
> Benoit
>
> 2010/6/11 Thomas Capricelli <orzel@xxxxxxxxxxxxxxx>:
>>
>>
>> Yes, we are aware of those failures. They are actually regressions introduced by a change from Benoit which "should not change any behaviour". I've spent a lot of time trying to understand what happens, some of it with Benoit, but I still don't know what the problem is.
>>
>> Please do not change this file yet. We need to fix it.
>>
>> Adding fuzziness is NOT the solution to this. At least not to fix this regression (later, once I have made sure the tests pass on several OSes/compilers... maybe).
>>
>> Notes on the different problems there are:
>> * bad 'info': info is what the algorithm returns to indicate the reason for stopping. This is a huge problem when it changes.
>>
>> * nfev (number of function evaluations): this is very slightly less important, but still important. At least on my computer (the very same one where the tests passed with the revision preceding Benoit's regression) we should get the same number.
>>
>> * error on 'squared norm': this one is tricky to explain. This is not the usual "stuff computed on a computer may differ from one computer/OS/compiler to another". What we check here is the result of an optimization algorithm: the value at the minimum of the function. This is the very purpose of the algorithm, and even if we might need some more steps on another computer, we should get the same result. URLs in NonLinearOptimization.cpp give the source of some (very important) reference tests, and until now we almost always got exactly the same results as those references. If we do not anymore, this is very, very bad (tm). Not just the usual "computers are fuzzy".
>>
>> Note to Benoit: when you got a really smaller nfev, this is probably because the algorithm completely failed and stopped on a wrong value ('info' is probably different too, but it is checked later on in the test file).
>>
>> As a side note, I intend to split the file into several tests, but I want to have this regression fixed first, as splitting does not help while I hunt it. I tend to do a lot of going backward/forward in history, merging changes...
>>
>> So, anyway, I have yet to fix those, I know; I have not (yet?) given up.
>>
>> regards!,
>> --
>> Thomas Capricelli <orzel@xxxxxxxxxxxxxxx>
>> http://www.freehackers.org/thomas
>>
>> On Friday 11 June 2010 12:42:29, Benoit Jacob wrote:
>>> We have discussed this a lot with Thomas already, we're a bit clueless
>>> about them. These failures started to appear with a seemingly
>>> unrelated changeset. If it were just 602->606, I'd say add fuzziness.
>>> But these numbers of iterations can vary a lot more, sometimes much
>>> larger, sometimes much smaller. In this test I have had a 98 (while
>>> 602 was expected) and the worst is that this was not reproducible on
>>> Thomas's machine. Since these numbers of iterations are so erratic, my
>>> guess was that the termination criterion used by this iterative
>>> algorithm was wrong; but a quick look at the code hasn't revealed
>>> anything obvious.
>>>
>>> Benoit
>>>
>>> 2010/6/11 Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx>:
>>> > Hi,
>>> >
>>> > I am just posting this as a summary and to get some idea of which
>>> > tests I should really start looking into and where we simply adapt the
>>> > thresholds.
>>> >
>>> > We have the following tests failing (on all systems):
>>> > NonLinearOptimization_7
>>> > NonLinearOptimization_8
>>> > NonLinearOptimization_10
>>> > NonLinearOptimization_12
>>> >
>>> > NonLinearOptimization_7:
>>> > - number of function evaluations (line 1341, 603-606 where 602 is expected)
>>> >
>>> > My guess is that here something fuzzy with an upper limit of function
>>> > evaluations might be more appropriate.
>>> >
>>> > NonLinearOptimization_8:
>>> > - squared norm (line 1019, 1.42986e-25, 1.42932e-25, 1.42897e-25,
>>> > 1.42977e-25 where 1.4309e-25 expected)
>>> >
>>> > Probably again, we need to be more fuzzy.
>>> >
>>> > NonLinearOptimization_10:
>>> > - info return result (2 where 3 expected)
>>> > - number of function evaluations (on line 1180 we get 289 where 284 is expected)
>>> >
>>> > Maybe here we need to look more deeply into what is going wrong
>>> > because the info value should probably be the same.
>>> >
>>> > NonLinearOptimization_12:
>>> > - number of function evaluations (on line 1428 we get 498, 509 where
>>> > 490 is expected and on line 1429 we get 378 where 378 is expected)
>>> >
>>> > Once again we need fuzziness.
>>> >
>>> > I don't know whether I recall it well, but didn't you (Thomas and
>>> > Benoit) already have a discussion about that topic once on IRC?
>>> >
>>> > - Hauke
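
For readers following the thread, the three quantities under discussion ('info', nfev and the squared norm) and the kind of "fuzzy" check Hauke suggests could be sketched roughly as below. This is only an illustration under assumptions: RunSummary, approx_rel, matches_fuzzy, the nfev_max cap and the tolerance are hypothetical names and parameters, not Eigen's API and not the contents of NonLinearOptimization.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical summary of one solver run (illustrative, not Eigen's API).
struct RunSummary {
    int info;       // reason the algorithm stopped
    int nfev;       // number of function evaluations
    double sqnorm;  // squared norm of the residual at the returned point
};

// Relative comparison: |a - b| <= tol * max(|a|, |b|).
bool approx_rel(double a, double b, double tol) {
    return std::fabs(a - b) <= tol * std::max(std::fabs(a), std::fabs(b));
}

// "Fuzzy" check in the spirit of Hauke's proposal (an assumption about what
// such a check might look like): info must match exactly, nfev is only
// capped from above, and the squared norm is compared with a relative
// tolerance.
bool matches_fuzzy(const RunSummary& got, const RunSummary& expected,
                   int nfev_max, double tol) {
    return got.info == expected.info
        && got.nfev <= nfev_max
        && approx_rel(got.sqnorm, expected.sqnorm, tol);
}

// Small self-check using the squared norms quoted for NonLinearOptimization_8.
void example_checks() {
    assert(approx_rel(1.42986e-25, 1.4309e-25, 1e-2));   // loose tolerance passes
    assert(!approx_rel(1.42986e-25, 1.4309e-25, 1e-4));  // tight tolerance fails
}
```

With the NonLinearOptimization_10 numbers (info 2 where 3 is expected), such a check still fails whatever the tolerance, which matches Hauke's remark that this case needs a closer look. Note also that an upper bound on nfev alone would accept a run that stops after 98 evaluations where 602 are expected, which is exactly the situation Thomas describes as the algorithm probably having failed.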
Attachment:
check-sigfpe
Description: Binary data
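
The attachment itself is not reproduced here. Independently of it, the overflow described in the message can be checked with standard C++ alone: with b[3] at about -52988 and x[i] = 10, the argument of exp() is about 529879, far above the roughly 709.78 beyond which exp() no longer fits in a double. The sketch below is not the attached patch; ExpCheck and checked_exp_term are hypothetical names, and it inspects the <cfenv> overflow flag instead of trapping a signal.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Result of evaluating one exponential term of the model.
struct ExpCheck {
    double value;     // exp(-b * x)
    bool overflowed;  // true if FE_OVERFLOW was raised during the evaluation
};

// Evaluates exp(-b * x) and records whether the floating-point overflow flag
// was raised. The inputs go through volatiles so that the call is evaluated
// at run time rather than folded by the compiler.
// Assumption: on glibc, the GNU extension feenableexcept(FE_OVERFLOW) would
// turn this same condition into a SIGFPE instead of a flag.
ExpCheck checked_exp_term(double b, double x) {
    volatile double vb = b;
    volatile double vx = x;
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double r = std::exp(-vb * vx);
    const bool overflowed = std::fetestexcept(FE_OVERFLOW) != 0;
    return ExpCheck{r, overflowed};
}

// Self-check with the values printed in the gdb session above.
void check_gdb_values() {
    const ExpCheck c = checked_exp_term(-52987.898500712123, 10.0);
    assert(std::isinf(c.value));  // exp(~529879) is not representable
    assert(c.overflowed);
}
```

Calling checked_exp_term with a moderate coefficient, for example b = 1.0 and x = 10.0, yields a finite value with the overflow flag clear, so a check of this kind could help tell whether the functor was already being evaluated at such extreme parameter values before the regression.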