Re: [eigen] non-linear optimization test summary



2010/6/13 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
> 2010/6/13 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>> Yet more news. I tried to find when the testNistMGH17 subtest started
>> failing. And I can't find a single revision where it succeeded!
>>
>> I tried revisions 2700, 2500, 2200 and 2100; all are failing (yes, I know
>> I shouldn't refer to changesets by local rev numbers, but for old
>> enough revs we all have the same ones).
>>
>> I couldn't use my gcc 4.5 with old revs (that was before we supported
>> gcc 4.5) so had to use gcc 3.4 and it has an ICE that prevented me
>> from going back to the origins of testNistMGH17 at r1861.
>>
>> What is perhaps interesting is that old enough revisions of this test
>> are complaining about bad 'info' field:
>>
>> Initializing random number generator with seed 1276407956
>> Repeating each test 10 times
>> Test testNistMGH17() failed in
>> "/home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp" (1334)
>>    1 == info
>>
>> Anyway, from my perspective, it is looking like this test never worked
>> (!!). I am not sure how I failed to notice that. This surprises me as
>> I thought that it used to be working.
>
> This is probably easy to explain: that could mean that GCC 3.4 was
> producing x87 code, and that the failure only happens with x87...
> trying to install some gcc 4.x with x<5 to check...

Bingo, that was it. I installed GCC 4.3, and went back to some
revision that worked (dd588a42f9d6).

Built with GCC 4.3, the NonLinearOptimization test is successful.

Now, still with GCC 4.3, edit the CMakeLists to add these CXXFLAGS:

      -mfpmath=387 -DEIGEN_DONT_VECTORIZE

and the NonLinearOptimization test fails. Comment out all the subtests
except testNistMGH17, and you get this output:

Test testNistMGH17() failed in
"/home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp" (1334)
    1 == info

Finally, enable SIGFPE's (see patch in previous email) and you get:

Initializing random number generator with seed 1276442156
Repeating each test 10 times
Floating point exception

I wanted to check that this SIGFPE was more precisely an overflow, so
enabled only this one:

    newval.__control_word &= ~(FE_OVERFLOW);

and yup, still "Floating point exception". Finally, if you want a backtrace:

Program received signal SIGFPE, Arithmetic exception.
0x0000000000458646 in MGH17_functor::operator() (this=0x7fffffffe060,
b=..., fvec=...)
    at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1292
1292                fvec[i] =  b[0] + b[1]*exp(-b[3]*x[i]) +
b[2]*exp(-b[4]*x[i]) - y[i];
(gdb) bt
#0  0x0000000000458646 in MGH17_functor::operator()
(this=0x7fffffffe060, b=..., fvec=...)
    at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1292
#1  0x0000000000463f23 in Eigen::LevenbergMarquardt<MGH17_functor,
double>::minimizeOneStep (
    this=0x7fffffffdf00, x=..., mode=1)
    at /home/bjacob/eigen/unsupported/Eigen/src/NonLinearOptimization/LevenbergMarquardt.h:338
#2  0x0000000000464946 in Eigen::LevenbergMarquardt<MGH17_functor,
double>::minimize (
    this=0x7fffffffdf00, x=..., mode=1)
    at /home/bjacob/eigen/unsupported/Eigen/src/NonLinearOptimization/LevenbergMarquardt.h:176
#3  0x0000000000431d30 in testNistMGH17 ()
    at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1331
#4  0x0000000000440a90 in test_NonLinearOptimization ()
    at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1830
#5  0x0000000000442473 in main (argc=1, argv=0x7fffffffe428)
    at /home/bjacob/eigen/unsupported/test/../../test/main.h:453
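
As an aside, the overflow at line 1292 can be reproduced outside the test. The following is only a sketch under stated assumptions: mgh17_residual and checked_residual are hypothetical helpers that mirror the expression shown in the backtrace, not the test's actual code, and the overflow is detected by polling the sticky FE_OVERFLOW flag with std::fetestexcept rather than by trapping with SIGFPE. The magnitudes of b[1], b[3], b[4] and x can be taken from the gdb sessions quoted further down; b[0], b[2] and y are not shown in the thread, so any values used for them are made up.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Hypothetical stand-in for one row of the MGH17 residual. It mirrors the
// expression at NonLinearOptimization.cpp:1292 but is not the test's code.
double mgh17_residual(const double b[5], double x, double y) {
    return b[0] + b[1] * std::exp(-b[3] * x) + b[2] * std::exp(-b[4] * x) - y;
}

struct ResidualReport {
    double value;     // residual as computed (may be inf or NaN)
    bool overflowed;  // true if FE_OVERFLOW was raised during evaluation
};

// Evaluate the residual and report whether exp() overflowed, by polling
// the sticky IEEE flag instead of enabling a SIGFPE trap.
ResidualReport checked_residual(const double b[5], double x, double y) {
    assert(b != nullptr);
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double vx = x;                        // force a run-time evaluation
    volatile double r = mgh17_residual(b, vx, y);  // stored before the flag is read
    ResidualReport rep;
    rep.value = r;
    rep.overflowed = std::fetestexcept(FE_OVERFLOW) != 0;
    return rep;
}
```

With b[3] around -52987.9 and x = 10, the argument of the first exp() is roughly 5.3e5, far beyond the point where a double overflows, so checked_residual reports an overflow. If b[1] is negative and the assumed b[2] is positive, the two infinite terms have opposite signs and the residual comes out as NaN.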



*** Conclusion ***

There always was something wrong with this testNistMGH17: it always
failed on 387. It only ever succeeded with SSE (I mean SSE as a whole,
not just vectorization). Looking closer, it always overflowed. The overflow in
itself is not necessarily a bug, but the fact that it always failed on
387 is a bug. Is it a bug in Eigen itself or in NonLinearOptimization?
I am almost sure that it is the latter. It can't be a bug in Eigen's
QR since it already exists at rev. dd588a42f9d6 when you were
apparently not yet using Eigen's QR. So I'd say make sure that you
didn't introduce a subtle bug when porting this code from [C]MINPACK
:)

Benoit



>
> Benoit
>
>
>>
>> Benoit
>>
>> 2010/6/13 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>> More news. In your private e-mail you wrote:
>>>
>>>> Everything happens in Benoit's commit 27fd92b5f533.
>>>> I only noticed it when he did the merge (commit
>>>> 1977e1805c16). Even that is not easy to track down,
>>>> because the merged version does not compile and my
>>>> tests were not yet split. I had to backport
>>>> commits to be able to test (typically 76d6a98eb24e).
>>>>
>>>> So, before commit 27fd92b5f533, all the nonlinear tests
>>>> pass. Afterwards, tests 7, 8, 10 and 12 no longer pass.
>>>
>>> In short: changeset 27fd92b5f533 is where
>>> the nonlinear tests start failing; before this changeset, they all
>>> pass.
>>>
>>> So I checked the revision just before, namely 114c09b9e714. And for
>>> me, the tests are already failing! If I enable only what was later
>>> called SUBTEST_7, in other words the call to testNistMGH17, I get:
>>>
>>> Initializing random number generator with seed 1276406234
>>> Repeating each test 10 times
>>>
>>>    actual   = 98
>>>    expected = 602
>>>
>>> Test testNistMGH17() failed in
>>> "/home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp" (1337)
>>>    test_is_equal(lm.nfev, 602)
>>>
>>>
>>> This is the "too good result" that I mentioned, 98, and you suggested
>>> that this is probably because a big error is making the algorithm
>>> stop early.
>>>
>>> What I did then was the same as in my previous email but for this
>>> changeset: enable SIGFPE's and see if some is raised. See new attached
>>> patch, it's against revision 114c09b9e714.
>>>
>>> Compile like this:
>>>
>>> g++ ~/eigen/unsupported/test/NonLinearOptimization.cpp -o x -I ~/eigen
>>> -I ~/eigen/test -DEIGEN_TEST_FUNC=NonLinearOptimization -g3
>>> -DEIGEN_DONT_VECTORIZE -mfpmath=387
>>>
>>> Again, disabling all SSE, not just the vectorization part of SSE, is
>>> very important.
>>>
>>> run it:
>>>
>>> ##### 01:18:04 ~/build/eigen$ ./x
>>> Initializing random number generator with seed 1276406287
>>> Repeating each test 10 times
>>> Floating point exception
>>>
>>> So let's see what happens in GDB:
>>>
>>> Program received signal SIGFPE, Arithmetic exception.
>>> 0x000000000041be97 in MGH17_functor::operator() (this=0x7fffffffe1b0,
>>> b=..., fvec=...)
>>>    at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1294
>>> 1294                fvec[i] =  b[0] + b[1]*exp(-b[3]*x[i]) +
>>> b[2]*exp(-b[4]*x[i]) - y[i];
>>> (gdb) bt
>>> #0  0x000000000041be97 in MGH17_functor::operator()
>>> (this=0x7fffffffe1b0, b=..., fvec=...)
>>>    at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1294
>>> #1  0x00000000004376e2 in Eigen::LevenbergMarquardt<MGH17_functor,
>>> double>::minimizeOneStep (
>>>    this=0x7fffffffe050, x=...)
>>>    at /home/bjacob/eigen/unsupported/Eigen/src/NonLinearOptimization/LevenbergMarquardt.h:287
>>> #2  0x000000000042258a in Eigen::LevenbergMarquardt<MGH17_functor,
>>> double>::minimize (
>>>    this=0x7fffffffe050, x=...)
>>>    at /home/bjacob/eigen/unsupported/Eigen/src/NonLinearOptimization/LevenbergMarquardt.h:167
>>> #3  0x000000000040e88b in testNistMGH17 ()
>>>    at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1333
>>> #4  0x0000000000414d0d in test_NonLinearOptimization ()
>>>    at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1831
>>> #5  0x0000000000401e06 in main (argc=1, argv=0x7fffffffe478) at
>>> /home/bjacob/eigen/test/main.h:541
>>> (gdb) print b[3]
>>> $1 = (
>>>    Eigen::ei_traits<Eigen::Matrix<double, 33331, 1, 0, 33331, 1>
>>> >::Scalar &) @0x6be138: -52987.898494912632
>>> (gdb) print x[i]
>>> $2 = 10
>>>
>>>
>>>
>>> What we see is that even before my "bad" changeset, there were already
>>> a couple of overflows going on here. From the values of the parameters
>>> here in GDB, it looks like fvec[i] is being assigned the value NaN
>>> (because inf-inf). This is OK if divergence is the expected behavior
>>> in this test, but apparently it's not?
>>>
>>> The next step is to see at which revisions these SIGFPE's actually appeared...
>>>
>>> Benoit
>>>
>>> 2010/6/13 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>>> I forgot to say: build with a command-line like this:
>>>>
>>>> g++ ~/eigen/unsupported/test/NonLinearOptimization.cpp -o x -I ~/eigen
>>>> -I ~/eigen/test -DEIGEN_TEST_FUNC=NonLinearOptimization -mfpmath=387
>>>> -g3 -DEIGEN_DONT_VECTORIZE
>>>>
>>>> It's important to make sure to use only the x87 instructions, not the
>>>> SSE instructions (not even the non-SIMD part of SSE like addss).
>>>> Otherwise you won't see the SIGFPE.
>>>>
>>>> Benoit
>>>>
>>>> 2010/6/13 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>>>> OK, I have had another look at it. I have attached a patch of my local
>>>>> modifications to study this... it will only compile under GNU/Linux on x86[-64].
>>>>>
>>>>> Just to check, I enabled SIGFPE signals, and enabled only what was
>>>>> test #7 before I un-split your test.
>>>>>
>>>>> Indeed, it crashed on a SIGFPE inside of the ColPivHouseholderQR.
>>>>>
>>>>> So to go ahead, I edited your code to make it use a
>>>>> FullPivHouseholderQR, just to study the problem (I fully agree that
>>>>> this is not fully satisfactory as it doesn't offer the same level of
>>>>> performance. I just wanted to eliminate the hypothesis of very bad
>>>>> luck hitting the limitations of ColPivHouseholderQR).
>>>>>
>>>>> I then still got a SIGFPE in a different place, namely in the
>>>>> evaluation of your functor:
>>>>>
>>>>> Program received signal SIGFPE, Arithmetic exception.
>>>>> 0x000000000041c815 in MGH17_functor::operator() (this=0x7fffffffe170,
>>>>> b=..., fvec=...)
>>>>>    at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1299
>>>>> 1299                fvec[i] =  b[0] + b[1]*exp(-b[3]*x[i]) +
>>>>> b[2]*exp(-b[4]*x[i]) - y[i];
>>>>> (gdb) bt
>>>>> #0  0x000000000041c815 in MGH17_functor::operator()
>>>>> (this=0x7fffffffe170, b=..., fvec=...)
>>>>>    at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1299
>>>>> #1  0x0000000000439260 in Eigen::LevenbergMarquardt<MGH17_functor,
>>>>> double>::minimizeOneStep (
>>>>>    this=0x7fffffffdff0, x=...)
>>>>>    at /home/bjacob/eigen/unsupported/Eigen/src/NonLinearOptimization/LevenbergMarquardt.h:291
>>>>> #2  0x000000000042339a in Eigen::LevenbergMarquardt<MGH17_functor,
>>>>> double>::minimize (
>>>>>    this=0x7fffffffdff0, x=...)
>>>>>    at /home/bjacob/eigen/unsupported/Eigen/src/NonLinearOptimization/LevenbergMarquardt.h:171
>>>>> #3  0x000000000040eed7 in testNistMGH17 ()
>>>>>    at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1338
>>>>> #4  0x00000000004156af in test_NonLinearOptimization ()
>>>>>    at /home/bjacob/eigen/unsupported/test/NonLinearOptimization.cpp:1836
>>>>> #5  0x0000000000401e18 in main (argc=1, argv=0x7fffffffe478) at
>>>>> /home/bjacob/eigen/test/main.h:529
>>>>>
>>>>> I then printed a few local variables:
>>>>>
>>>>> (gdb) print b[1]
>>>>> $3 = (
>>>>>    Eigen::DenseCoeffsBase<Eigen::Matrix<double, -0x00000000000000001,
>>>>> 1, 0, -0x00000000000000001, 1>, true>::Scalar &) @0x6c8128:
>>>>> -23854.038298795142
>>>>> (gdb) print b[3]
>>>>> $4 = (
>>>>>    Eigen::DenseCoeffsBase<Eigen::Matrix<double, -0x00000000000000001,
>>>>> 1, 0, -0x00000000000000001, 1>, true>::Scalar &) @0x6c8138:
>>>>> -52987.898500712123
>>>>> (gdb) print x[i]
>>>>> $5 = 10
>>>>> (gdb) print b[4]
>>>>> $6 = (
>>>>>    Eigen::DenseCoeffsBase<Eigen::Matrix<double, -0x00000000000000001,
>>>>> 1, 0, -0x00000000000000001, 1>, true>::Scalar &) @0x6c8140:
>>>>> -1750064561.4840834
>>>>>
>>>>> So what's happening here is that we have huge values in the b vector,
>>>>> leading to overflow when calling exp().
>>>>>
>>>>> The next thing to do is to come back to a state where your test was
>>>>> successful, and check if there were already SIGFPE's...
>>>>>
>>>>> If there already were SIGFPEs, that would probably mean that
>>>>> there was a preexisting problem and that my commit only exposed it. If
>>>>> there weren't SIGFPEs, that would mean that my commit introduced a
>>>>> serious computational bug.
>>>>>
>>>>> Benoit
>>>>>
>>>>>
>>>>>
>>>>> 2010/6/11 Benoit Jacob <jacob.benoit.1@xxxxxxxxx>:
>>>>>> Can you please remind me with which revision of mine the errors appeared?
>>>>>>
>>>>>> I'll try to have another look at it!
>>>>>>
>>>>>> Benoit
>>>>>>
>>>>>> 2010/6/11 Thomas Capricelli <orzel@xxxxxxxxxxxxxxx>:
>>>>>>>
>>>>>>>
>>>>>>> Yes, we are aware of those failures. They are actually regressions introduced by a change from Benoit which "should not change any behaviour". I've spent a lot of time trying to understand what happens, some of it with Benoit, but I still don't know what the problem is.
>>>>>>>
>>>>>>> Please do not change this file yet. We need to fix it.
>>>>>>>
>>>>>>> Adding fuzziness is NOT the solution to this. At least not to fix this regression (later, once I've made sure the tests pass on several OSes/compilers... maybe).
>>>>>>>
>>>>>>> Note on the different problems there are:
>>>>>>> * bad 'info': info is what the algorithm returns to indicate the reason for stopping. It is a huge problem when it changes.
>>>>>>>
>>>>>>> * nfev (number of function evaluations): this is very slightly less important, but still important. At least on my computer (the very same one where the tests passed with the revision preceding Benoit's regression) we should get the same number.
>>>>>>>
>>>>>>> * error on 'squared norm': this one is tricky to explain. This is not the usual "stuff computed on a computer may differ from one computer/os/compiler to another one". What we check here is the result from an optimization algorithm. The value at the minimum of the function. This is the very purpose of the algorithm, and even if we might need some more steps on another computer, we should get the same result. URLs in NonLinearOptimization.cpp give the source of some (very important) reference tests, and until now we got almost always exactly the same results as those references. If we do not anymore, this is very, very bad (tm). Not just the usual "computers are fuzzy"
>>>>>>>
>>>>>>> Note to Benoit: when you got a much smaller nfev, this is probably because the algorithm completely failed and stopped on a wrong value ('info' is probably different too, but it is checked later on in the test file).
>>>>>>>
>>>>>>> As a side note, I intend to split the file into several tests, but I want to have this regression fixed first, as splitting does not help while I hunt it. I do a lot of going backward/forward in history, merging changes ...
>>>>>>>
>>>>>>> So, anyway, i have yet to fix those, i know, i have not (yet?) given up.
>>>>>>>
>>>>>>> regards!,
>>>>>>> --
>>>>>>> Thomas Capricelli <orzel@xxxxxxxxxxxxxxx>
>>>>>>> http://www.freehackers.org/thomas
>>>>>>>
>>>>>>> On Friday 11 June 2010 12:42:29, Benoit Jacob wrote:
>>>>>>>> We have discussed this a lot with Thomas already, we're a bit clueless
>>>>>>>> about them. These failures started to appear with a seemingly
>>>>>>>> unrelated changeset. If it were just 602->606, I'd say add fuzziness.
>>>>>>>> But these numbers of iterations can vary a lot more, sometimes much
>>>>>>>> larger, sometimes much smaller. In this test I have had a 98 (while
>>>>>>>> 602 was expected) and the worst is that this was not reproducible on
>>>>>>>> Thomas's machine. Since these numbers of iterations are so erratic, my
>>>>>>>> guess was that the termination criterion used by this iterative
>>>>>>>> algorithm was wrong; but a quick look at the code hasn't revealed
>>>>>>>> anything obvious.
>>>>>>>>
>>>>>>>> Benoit
>>>>>>>>
>>>>>>>> 2010/6/11 Hauke Heibel <hauke.heibel@xxxxxxxxxxxxxx>:
>>>>>>>> > Hi,
>>>>>>>> >
>>>>>>>> > I am just posting this as a summary and to get some idea of which
>>>>>>>> > tests I should really start looking into and where we simply adapt the
>>>>>>>> > thresholds.
>>>>>>>> >
>>>>>>>> > We have the following tests failing (on all systems):
>>>>>>>> > NonLinearOptimization_7
>>>>>>>> > NonLinearOptimization_8
>>>>>>>> > NonLinearOptimization_10
>>>>>>>> > NonLinearOptimization_12
>>>>>>>> >
>>>>>>>> > NonLinearOptimization_7:
>>>>>>>> > - number of function evaluations(line 1341, 603-606 where 602 is expected)
>>>>>>>> >
>>>>>>>> > My guess is that here something fuzzy with an upper limit of function
>>>>>>>> > evaluations might be more appropriate.
>>>>>>>> >
>>>>>>>> > NonLinearOptimization_8:
>>>>>>>> > - squared norm (line 1019, 1.42986e-25, 1.42932e-25, 1.42897e-25,
>>>>>>>> > 1.42977e-25 where 1.4309e-25 expected)
>>>>>>>> >
>>>>>>>> > Probably again, we need to be more fuzzy.
>>>>>>>> >
>>>>>>>> > NonLinearOptimization_10:
>>>>>>>> > - info return result (2 where 3 expected)
>>>>>>>> > - number of function evaluations (on line 1180 we get 289 where 284 is expected)
>>>>>>>> >
>>>>>>>> > Maybe here we need to look more deeply into what is going wrong
>>>>>>>> > because the info value should probably be the same.
>>>>>>>> >
>>>>>>>> > NonLinearOptimization_12:
>>>>>>>> > - number of function evaluations (on line 1428 we get 498, 509 where
>>>>>>>> > 490 is expected and on line 1429 we get 378 where 378 is expected)
>>>>>>>> >
>>>>>>>> > Once again we need fuzziness.
>>>>>>>> >
>>>>>>>> > I don't know whether I recall it correctly, but didn't you (Thomas and
>>>>>>>> > Benoit) already have a discussion about that topic once on IRC?
>>>>>>>> >
>>>>>>>> > - Hauke
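
A minimal sketch of what the "fuzzy" checks proposed in this summary could look like, assuming C++ and two hypothetical helpers (approx_equal, within_eval_budget) that are not part of the Eigen test suite. The tolerance and the upper limit are illustrative values, not ones anyone in the thread agreed on.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: true when `actual` lies within a relative
// tolerance of `expected`.
bool approx_equal(double actual, double expected, double rel_tol) {
    assert(rel_tol >= 0.0);
    return std::fabs(actual - expected) <= rel_tol * std::fabs(expected);
}

// Hypothetical helper: accept any number of function evaluations up to
// an upper limit, instead of requiring an exact count.
bool within_eval_budget(int nfev, int max_nfev) {
    return nfev <= max_nfev;
}
```

With the squared norms listed for NonLinearOptimization_8, a relative tolerance of 1e-3 accepts 1.42986e-25 but rejects 1.42897e-25 against the expected 1.4309e-25. An upper limit on nfev accepts 606, but it also accepts the suspicious 98, so it cannot stand in for the check on info.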


