Need to review build options for vigra package #14

Open
stuarteberg opened this issue Feb 8, 2016 · 4 comments

Comments

@stuarteberg
Member

At the moment, we build our vigra package with the -O2 flag instead of the -O3 flag. That's because there was at least one function that gcc would miscompile at -O3, triggering segfaults and/or other spurious behavior. Since then, the vigra source has changed, and so has the version of gcc we use, so it's worth giving -O3 a try.

While we're at it, it would be nice to see if there is a performance improvement when switching from -O2 to -O3...
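A minimal sketch of how that comparison could be run, assuming an out-of-source CMake build with the vigra sources in ./vigra and a test target called check (both are assumptions about our setup, not a confirmed recipe):

    # Two build trees that differ only in the optimization level;
    # CMAKE_CXX_FLAGS_RELEASE overrides the default release flags.
    mkdir build-O2 && cd build-O2
    cmake ../vigra -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS_RELEASE="-O2 -DNDEBUG"
    make -j8 && time make check      # 'check' assumed to be the test target
    cd ..

    mkdir build-O3 && cd build-O3
    cmake ../vigra -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS_RELEASE="-O3 -DNDEBUG"
    make -j8 && time make check
    cd ..

Timing the whole test suite is only a rough proxy, of course; a real comparison would time a representative image-processing workload.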

@svenpeter42

Just for reference, -O3 enables the following options in addition to -O2:

    % gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
    % gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
    % diff /tmp/O2-opts /tmp/O3-opts | grep enabled
    >   -fgcse-after-reload                 [enabled]
    >   -finline-functions                  [enabled]
    >   -fipa-cp-clone                      [enabled]
    >   -fpredictive-commoning              [enabled]
    >   -ftree-loop-distribute-patterns     [enabled]
    >   -ftree-loop-vectorize               [enabled]
    >   -ftree-partial-pre                  [enabled]
    >   -ftree-slp-vectorize                [enabled]
    >   -funswitch-loops                    [enabled]

None of those seems dangerous, and the general opinion these days seems to be that -O3 is fine but does not necessarily improve performance all that much [2][3][4]. That only applies to general programs, though; number crunching might benefit more. I did recently run into a -O3 bug that took almost two days to track down [1].
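If -O3 misbehaves again, one way to narrow it down would be to start from -O2 and switch on the flags from the diff above one at a time. A rough sketch; the source file name and the rebuild/rerun step are placeholders:

    # Enable one O3-only optimization at a time on top of -O2 to see which
    # one (if any) reproduces the miscompilation.
    for flag in -fgcse-after-reload -finline-functions -fipa-cp-clone \
                -fpredictive-commoning -ftree-loop-distribute-patterns \
                -ftree-loop-vectorize -ftree-partial-pre \
                -ftree-slp-vectorize -funswitch-loops
    do
        echo "=== -O2 $flag ==="
        g++ -O2 $flag -c suspect_file.cxx -o suspect_file.o   # placeholder source file
        # ...relink and rerun the failing test here...
    done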

More speedups / better optimized code might be possible by using the -mtune and -march flags, though:

       -march=name
           Specify the name of the target architecture, optionally suffixed by one or more feature modifiers.  This option
           has the form -march=arch{+[no]feature}*, where the only permissible value for arch is armv8-a.  The permissible
           values for feature are documented in the sub-section below.

           Where conflicting feature modifiers are specified, the right-most feature is used.

           GCC uses this name to determine what kind of instructions it can emit when generating assembly code.

           Where -march is specified without either of -mtune or -mcpu also being specified, the code will be tuned to
           perform well across a range of target processors implementing the target architecture.


       -mtune=name
           Specify the name of the target processor for which GCC should tune the performance of the code.  Permissible
           values for this option are: generic, cortex-a53, cortex-a57.

           Additionally, this option can specify that GCC should tune the performance of the code for a big.LITTLE system.
           The only permissible value is cortex-a57.cortex-a53.

           Where none of -mtune=, -mcpu= or -march= are specified, the code will be tuned to perform well across a range
           of target processors.

           This option cannot be suffixed by feature modifiers.

Using -march, for example, indirectly enables vector instruction sets like SSE. The downside, of course, is that the compiled binaries would no longer run on very old CPUs.
But we could certainly activate -msse (and probably quite a few more flags) if we are compiling 64-bit binaries anyway:

       -mfpmath=unit
           Generate floating-point arithmetic for selected unit unit.  The choices for unit are:

           387 Use the standard 387 floating-point coprocessor present on the majority of chips and emulated otherwise.
               Code compiled with this option runs almost everywhere.  The temporary results are computed in 80-bit
               precision instead of the precision specified by the type, resulting in slightly different results compared
               to most of other chips.  See -ffloat-store for more detailed description.

               This is the default choice for i386 compiler.

           sse Use scalar floating-point instructions present in the SSE instruction set.  This instruction set is
               supported by Pentium III and newer chips, and in the AMD line by Athlon-4, Athlon XP and Athlon MP chips.
               The earlier version of the SSE instruction set supports only single-precision arithmetic, thus the double
               and extended-precision arithmetic are still done using 387.  A later version, present only in Pentium 4 and
               AMD x86-64 chips, supports double-precision arithmetic too.

               For the i386 compiler, you must use -march=cpu-type, -msse or -msse2 switches to enable SSE extensions and
               make this option effective.  For the x86-64 compiler, these extensions are enabled by default.

               The resulting code should be considerably faster in the majority of cases and avoid the numerical
               instability problems of 387 code, but may break some existing code that expects temporaries to be 80 bits.

               This is the default choice for the x86-64 compiler.


[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68963
[2] https://stackoverflow.com/questions/19689014/gcc-difference-between-o3-and-os
[3] https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html (specifically recommends against -Ofast and considers -O3 stable)
[4] https://www.quora.com/When-should-you-use-the-different-GCC-optimization-flags-e-g-O2

@ukoethe
Member

ukoethe commented Feb 12, 2016

I just got a report from a user that -O3 caused a segfault:

    [ 56%] Building CXX object test/watersheds3d/CMakeFiles/test_watersheds3d.dir/testsuccess.cxx.o
    Linking CXX executable test_watersheds3d
    Running test_watersheds3d
    Entering test suite Watershed3DTestSuite

    Failure in Watersheds3dTest::testDistanceVolumesSix()
    Unexpected signal: memory access violation

    Fatal error - aborting test suite Watershed3DTestSuite.

The segfault went away with -O2. Waiting for more detailed information...

@stuarteberg
Member Author

Maybe this is obvious, but it might be worth trying the clang or gcc "address sanitizer" feature to see if it spots any problems in the watershed code. Same for the "undefined behavior" sanitizer.
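A rough sketch of what that could look like for the failing test, assuming it can be rebuilt stand-alone (vigra's include paths and the rest of the test-suite sources are omitted here, and -fsanitize=undefined needs gcc >= 4.9 or clang):

    # Build the watershed test with AddressSanitizer and UBSan enabled.
    # -g gives readable stack traces; -O2 keeps the optimization level that
    # is known to pass, so any report points at a real bug rather than an
    # optimizer interaction.
    g++ -g -O2 -fno-omit-frame-pointer -fsanitize=address,undefined \
        testsuccess.cxx -o test_watersheds3d
    ./test_watersheds3d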

@ukoethe
Member

ukoethe commented Feb 14, 2016

The segfault occurred on CentOS 7 (64-bit) with GCC 4.8.5.
