Minutes of the GCC for IA-64 Summit

Held June 6, 2001 at Hewlett Packard in Cupertino, California, USA

Minutes were compiled by Janis Johnson based on her notes and updated
with corrections from other summit participants.

Executive Summary
-----------------

This one-day summit brought together a diverse group of people who share
an interest in improving GCC (the GNU Compiler Collection) for IPF (the
Itanium Processor Family, the preferred name for IA-64). Participants
included experienced GCC developers, both with and without experience
supporting IPF; three members of the GCC Steering Committee; new GCC
developers interested in IPF support; developers of proprietary IPF
compilers who want to share their experiences in order to improve IPF
support in GCC; people interested in long-term infrastructure changes to
improve GCC for all platforms; and people from large companies who came
to learn about GCC and its current and potential viability on IPF.
There were 27 people present in Cupertino, and 10 participated by
telephone.

Those of us who organized the summit are extremely pleased with the
wealth of information that was discussed, the lists of short-term and
long-term projects that were proposed, and the opportunity we had to
meet our far-flung colleagues.

Table of Contents
-----------------

Welcome
Who's who
Current state of IA-64 support in GCC
Completed work
Unfinished and/or abandoned work
Current shortcomings
Things tried without success
Discussion
HP IPF Compiler
Talk by Suneel Jain
Discussion
Intel IPF compiler
Talk by Bill Savage
Discussion
SGI Pro64 compiler
IBM IPF compiler
Talk by Jim McInnes
Discussion
Long-term infrastructure improvements to GCC
Talk by Brian Thomson
Discussion
GCC projects to improve IA-64 support
Short-term IPF optimizations
Long-term optimizations or infrastructure changes
Tools: performance tools, benchmarks, etc.
Where do we go from here?

Welcome
-------

Gary Tracy of Hewlett Packard welcomed us to the summit. The goal of
the summit is to discuss optimization ideas and divide them into
short-term (those that can be done without major surgery to GCC) and
long-term (those that require infrastructure changes). We will have
been successful if GCC IPF code is "good enough" by the end of calendar
year 2001.

Gary asked other participants to introduce themselves and for one member
of each group to explain the purpose of the group.

Who's who
---------

[The order here is different from how people introduced themselves at
the summit so that people from the same team can be listed together
and teams from the same company are adjacent.]

Gary Tracy ; HP
Reva Cuthbertson
Steve Ellcey
Jessica Han
Robin Sun
Jeff Okamoto
Gary manages this group that ports software from the open source
community to HP-UX. Unlike most other participants at the summit,
their customers are developers that target HP-UX rather than Linux.
Reva, Steve, and Jessica are porting GCC for IA-64 to HP-UX; unlike
the Linux version of GCC, the HP-UX version of GCC on IA-64 supports
ILP32 as well as LP64. Robin is the group's Quality Engineer and
Jeff handles open source tools other than GCC.

David Mosberger ; HP Labs (via telephone)
Stephane Eranian
Hans Boehm
David and Stephane are involved in Linux kernel changes for IPF.
David is the official maintainer for Linux on IPF and a core member
of the Linux on IA-64 project, and worked on the initial port of GCC
to IA-64. Hans has recently worked on the Java run-time in GCC, some
of which has been IA-64 specific.

Suneel Jain ; HP
Suneel is involved with the HP compiler for IPF and wants to help GCC
developers leverage his experience with IPF.

Vatsa Santhanam ; HP
Vatsa manages compiler work at HP and is interested in helping to
improve GCC for Itanium.

Sassan Hazeghi ; HP
Sassan is the project manager for HP's C++ compiler team.

John Wolfe ; Caldera (via telephone)
Dave Prosser (via telephone)
John and Dave don't yet know just what they'll be doing with GCC or
how much support they have from their employer to do GCC
improvements.

Gerrit Huizenga ; IBM LTC (via telephone)
Gary Hade (via telephone)
Steve Christiansen
Janis Johnson
The purpose of this team within IBM's Linux Technology Center is to
improve Linux on IA-64 in preparation for IBM products based on
IA-64. This team is less constrained in what it works on than some of
the other participants, and is working on GCC because improving the
compiler's generated code was identified as a good way to improve
application performance. The team has a goal of working on overall
system and application performance, beginning with short-term
projects that can show improved performance within 6 months to 2
years. Gerrit, a member of the Linux on IA-64 project (formerly
known as Trillian), leads the team. Steve, Janis, and Gary are
former Sequent compiler people who are coming up to speed on GCC but
already have some experience with IA-64.

David Edelsohn ; IBM Research (via telephone)
David is on the GCC Steering Committee and is the co-maintainer of
the PowerPC port of GCC. He is involved in efforts to improve the
GCC infrastructure.

Brian Thomson ; IBM (via telephone)
Brian manages a team that is responsible for application development
tools for Linux.

Jim McInnes ; IBM (via telephone)
Jim is in IBM's compiler group in Toronto and spent a couple of
years adding IA-64 support to IBM's Visual Age compilers.

John Sias ; University of Illinois (via telephone)
John is involved with the Impact compiler and works as an intern with
Jim McInnes in the IBM compiler group.

Richard Henderson ; Red Hat
Jim Wilson
Richard is in the GCC core maintainer group and has been working on
the GCC IA-64 port for 18-24 months. Jim is the main GCC maintainer
for the IPF port and a member of the Linux on IA-64 project. Jim has
concentrated on support work while Richard has done more feature
work. Jim is a member of the GCC Steering Committee.

Richard Kenner ; Ada Core Technologies
Richard has been heavily involved with GCC work, but not with IA-64.

Suresh Rao ; Intel MicroComputer Software Labs
Bill Savage
Bill is co-manager of the Intel Compiler Lab, which develops C/C++
and Fortran compilers for IA32 and IPF. Suresh manages the Compiler
Product portion of the Intel Compiler Lab and is responsible for the
Intel Linux compilers.

Pete Kronowitt ; Intel
Chuck Piper
Tracey Erway
Dawn Foster
Pete manages a group at Intel that works with Linux distributors.
He attended the summit to learn about GCC issues that will help him
ensure that Intel's resources are distributed correctly. Chuck
manages the Performance Tools Enabling Team, which does marketing for
Intel tools including the Intel compiler. Tracey manages a marketing
group with a focus on 3rd party core tools including compilers. Dawn
works for Tracey on 3rd party tools enabling, particularly Linux and
GCC.

Bruce Korb
Bruce has worked with GCC but not for IA-64. He's the maintainer
of fixincludes.

Ross Towle ; SGI
Ross manages the group that does compilers and tools, including SGI's
Pro64 compiler. His group is currently more interested in C++ and
Fortran than C, and in technical computing.

Con Bradley ; SuperH
Con is interested in short-term changes to GCC that will help IPF and
architectures with similar features.

Waldo Bastian ; SuSE
Waldo is a KDE developer. He attended the summit to take information
back to GCC developers at SuSE.

Mark Mitchell ; Code Sourcery
Mark is the release manager for GCC 3.0 and a member of the GCC
Steering Committee. He has been involved with high-level GCC work,
including the C++ ABI, and is interested in long-term infrastructure
changes to GCC.

Sverre Jarp ; CERN (via telephone)
Sverre is a member of the Linux on IA-64 project.

Current state of IA-64 support in GCC
-------------------------------------

All of this information is from Jim Wilson and Richard Henderson unless
otherwise indicated.

Completed work
--------------

- Initial port of GCC to IA-64

This was done by David Mosberger.

- Constructing control flow graphs (CFGs) and doing optimizations
based on them

This provides an interface between the front end and middle end.
The front end builds an entire function in tree format and then
converts the entire function to RTL. The C front end doesn't yet do
anything to optimize at the high level, but the C++ front end has
function inlining that uses trees.

The C++ inliner should be merged to work with a C front end; that
wouldn't be too much work. Right now, any function that takes a
complex argument won't be inlined, but with this change it could be.

- A little bit of dependence analysis

Red Hat is planning to use this information in loop dependence
analysis.

- New optimization pass for predicated instructions (if-conversion)

- Work on the scheduler, to bundle and emit stop bits at the same time
as scheduling

Unfinished and/or abandoned work
--------------------------------

- A new pipeline description model and a new scheduler that uses it,
with support for software pipelining

This is for resource-constrained software pipelining and helps loops
that can't be modulo scheduled. The new scheduler is in the Cygnus
source tree and has not been merged into the FSF tree. Red Hat
believes that this new scheduler is the right way to go long-term,
but it is not yet ready. It is a very large piece of work that shows
only 1-2% performance improvement for Itanium. Red Hat is using it
from the Cygnus tree for MIPS and PowerPC and for several other
embedded processors.

The new scheduler can do incremental updates of its state. In theory
it should be a lot faster than the Haifa scheduler (for compile
time), but in practice it didn't prove to be. The model was much
larger, so the theoretical speed gains could have been eaten up by
extra work caused by the larger size.

The current pipeline model can indicate the duration of an
instruction, but it does not carry enough information for IPF
instruction scheduling. The new model does have enough information.

The FSF GCC sources use the Haifa scheduler but as a moderately
sophisticated list processor. The IA-64 scheduler has lots of tweaks
outside of that, including bundling.

- A new pass to use more registers to avoid an anti-dependence stall in
Itanium that David Mosberger discovered.

The new pass uses more registers. By default the register allocator
will use as few registers as possible, which isn't always the right
thing to do. This change resulted in very measurable speed-ups of
5-6% when using more registers on IPF.

Current shortcomings
--------------------

- First in line for things that need to happen is language independence
in the tree language. Someone at Red Hat in Toronto is working on
adding SSA form. Multiple IL levels are necessary so we don't lose
so much information so soon.

- The machine description could be improved

It currently doesn't describe dependencies between functional units,
or resources that must be claimed at some cycle other than cycle
zero, e.g. needing the memory unit at cycle two.

Currently GCC tweaks partial redundancy code and adds fixup code.
There's not enough pressure on units to show improvement. The code
knows when it's decreasing the number of instructions but doesn't
know about decreasing the number of bundles instead.

- Missing high-level loop optimization

Everything above the expression level is language-specific; that
needs to be changed.

Vatsa asked if GCC synthesizes post-increments on the fly. Richard
Henderson said that post-increments are done in the loop code of GCC,
generated as part of induction variable manipulation. The second
part is that regmove.c should generate post-increment but doesn't do
a very good job. Some source in the Cygnus tree tries harder and
should do better but hasn't in practice and is buggy. Red Hat will
remove it shortly and use a form of GCSE instead. They tweak partial
redundancy code to not delete certain code in some places, then
optimize later. Post-increment hasn't been seeing much improvement
yet.
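
For illustration only (this sketch was not presented at the summit), the
two C loops below show the source-level shape behind the post-increment
question. The function names are hypothetical. The second loop walks a
pointer, which is the form that maps most directly onto a load with
post-increment addressing; whether a given compiler actually produces
that instruction from either form is not claimed here.

```c
#include <stddef.h>

/* Indexed form: the address of a[i] is recomputed from the base and
   the induction variable i on every iteration. */
long sum_indexed(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Pointer form: each iteration loads through p and then advances p,
   the pattern that a post-increment load can express in one
   instruction. */
long sum_postinc(const int *a, size_t n)
{
    long s = 0;
    const int *p = a;
    const int *end = a + n;
    while (p != end)
        s += *p++;
    return s;
}
```

Both functions return the same sum; induction variable manipulation in
the loop code is what would turn the first form into the second.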

- Weak on loop optimizations

- No pipelining in the FSF sources; there is in the new scheduler in
the Cygnus tree

- No rotating registers

- No prefetching support

- Control, data speculation

Things tried without success
----------------------------

[Gary Tracy: Potentially there is great value in "known failure paths".
Things tried where it was learned that we should never try again.]

- Control speculation tied into Haifa scheduler

The scheduler is supposed to handle this but it exposed the fact that
alias analysis in GCC during scheduling is extremely weak; it can
even lose track of which addresses are supposed to come from the
stack frame and so it would speculate way too much. This project,
though, was tried quickly, and maybe it could be done successfully
with more time, e.g. 2 months rather than a week.

- The new scheduler

Alias analysis is a general infrastructure problem; GCC has no
knowledge of cross-block scheduling.

Richard Henderson thought of a scheme that could be done in 4-6 weeks
using the existing alias code to keep information disambiguated. The
problem is that GCC drops down the representation practically to the
machine level, so the compiler just sees memory with a register base.

Alias analysis in GCC is weak in general and is even weaker in IA-64.

Gary Tracy asked if the new scheduler resulted in slower code rather
than incorrect code; the answer is yes.

Richard Kenner is planning something that could help here, linking
each MEM to the declaration it's from so that alias analysis can know
that two MEMs from different declarations can't conflict. This will
also allow other things to be specified in a MEM, like alignment,
which was his original motivation.
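
As a rough source-level illustration of the point about declarations
(not something shown at the summit), the sketch below contrasts two
loops. All names are hypothetical. In the first loop the stores and the
loop bound come from two different declared objects, so they cannot
conflict. In the second the compiler sees only memory reached through
pointers, so it has to assume the store might change the bound.

```c
/* Two distinct declarations: a store to counts[i] can never modify
   limit, so limit may be kept in a register for the whole loop. */
static int counts[8];
static int limit = 8;

void clear_counts(void)
{
    for (int i = 0; i < limit; i++)
        counts[i] = 0;
}

/* Same loop through pointers: dst[i] and *n have the same type and
   unknown origin, so the store may alias the bound and *n has to be
   reloaded on every iteration unless more is known. */
void clear_through_pointers(int *dst, const int *n)
{
    for (int i = 0; i < *n; i++)
        dst[i] = 0;
}
```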

Vatsa asked if the register allocator is predicate aware; the answer
is no.

- Data speculation work

There was one patch, but it was never reviewed. There currently is
not enough ILP to make completion of this patch worthwhile.

- Control speculation work

Bernd Schmidt might have an unfinished patch that could be picked up.

Discussion
----------

Richard Henderson: Almost all of the things that didn't show
performance improvements didn't work because the GCC infrastructure
didn't support them. A lot of work was done right out of the Intel
optimization guide, and presumably those changes were successful in
other compilers.

Richard Kenner: We should get an idea of how much speedup an
optimization should provide, with a quantitative answer for each
optimization, to know where to spend time if we could get ILP. It
would also help if we could know how to get ILP, but we might not be
able to separate them. We should collect empirical results.

Bill Savage: He and Suneel should be able to help us see what pays off
on Itanium. ILP is only one thing to exploit. He has some numbers,
but they'll vary by infrastructure.

Ross Towle: Reported some experiences SGI customers had using GCC (gcc
and g++ on ia64 and ia32) that were not performance related. They
found it hard to get programs to work other than at -O0, for code
that works on a number of other platforms but doesn't work when
compiled with gcc on ia64. This is proprietary code for which it is
not easy to submit problem reports.

SuSE and HP have successfully built packages other than those that are
included in a Linux distribution, so this is not a generally-seen
problem.

Gary Tracy: This is an important issue but not one to cover here.

someone: What kind of code do we want to improve the performance of?

most: System code and applications that are primarily integer

Ross Towle: SGI is more interested in technical computing with floating
point code.

Bill Savage: IPF excels at floating point code, and Bill put in a strong
plug for supporting good floating point performance.

Suneel Jain: GCC does not support Fortran-90, so technical computing
isn't as much of an issue for GCC.

Richard Henderson: There is lots to do to succeed in technical
computing. GCC is so far off the mark now that it's years away from
being there.

In general, different users and different markets dictate different
priorities.

[break while people figured out how to set up the projector; lunch had
been brought in, so some people started eating]

HP IPF Compiler; talk by Suneel Jain
------------------------------------

[Suneel's slides]

Key IPF Optimizations in HP Compilers

o Predication
- If-conversion before register allocation, scheduling
- Implies predicate aware dataflow and register allocation
o Control and data speculation
- Control speculation more important
- Generation of recovery code
o Cross block instruction scheduling
- Region based or Hyperblocks
o Profile based optimization
- Affects region formation, code layout, predication decisions
o Data prefetching

Other Optimizations

o Accurate machine model for latencies, stalls
- Handling of variable shifts, fcmp
- Register allocation to avoid w/w stall
- Nop padding for multimedia instructions
o Template selection, intra-bundle stops
o Post-increment synthesis, esp. for loops
o Sign-extension, zero-extension removal
o Predicate optimizations

Strong Infrastructure

o High level loop transformations
o Good aliasing and data dependency information
o Cross module optimizations
o SSA based global optimizations
o Accurate machine modelling
o Support for predicated instructions, PQS.

Application domain specific issues

o Commercial Applications
- Instruction and Data cache/TLB misses
- Profile based optimization
- Large pages
- Selective inlining
o Technical Applications
- Loop optimizations
- Data prefetching
- Modulo Scheduling

[notes from his talk]

HP had a compiler for PA-RISC, but decided to write a brand-new back end
because of all the things needed for IPF.

The key to good optimization for IPF is the ability to expose ILP.
Having a scheduler that is able to expose ILP is the reason for having
the other optimizations. The predication and speculation features
significantly increase the number of instructions that can be fed to the
scheduler in parallel.

The compiler is not yet at an optimal point for predication. There's a
danger in doing too much; there's a penalty for i-cache misses if it's
too aggressive, which causes a degradation in performance. Currently it
uses lots of tuning and several tweaking heuristics. Predication is not
an easy feature to use even if you have the infrastructure.

It's best to do predication before scheduling and register allocation.
Loop unrolling in GCC will require infrastructure changes in terms of
allowing the RTL to recognize predicate registers and whether
definitions are killed or not. Lots of things are most important. You
do want to use predication for simple if-then-else cases early, but not
be more aggressive.
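
For illustration only, here is a source-level analogy for if-converting
a simple if-then case. The function names are hypothetical, and the
mask arithmetic merely stands in for a predicate; on IPF the compiler
would guard instructions with a predicate register instead, so this is
not what any of the compilers discussed actually emits.

```c
/* Branchy form: control flow decides whether the assignment runs. */
int clamp_branch(int x, int limit)
{
    if (x > limit)
        x = limit;
    return x;
}

/* If-converted form: the condition becomes a value (the "predicate")
   that selects between the two results, with no control flow. */
int clamp_select(int x, int limit)
{
    unsigned p = (x > limit);   /* 1 if the "then" side applies, else 0 */
    unsigned mask = 0u - p;     /* all ones or all zeros */
    return (int)(((unsigned)limit & mask) | ((unsigned)x & ~mask));
}
```

The second form removes the branch at the cost of always doing the
work for both outcomes, which is one reason aggressive predication can
hurt rather than help.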

Control speculation is more important than data speculation. It needs
cross-block scheduling, since the compiler doesn't see the opportunity
or need within a basic block. Both require generating recovery code,
which introduces new instructions and new register definitions and uses.
It might be difficult to build in.

For cross-block instruction scheduling, the compiler must identify
regions to concentrate on and then schedule the whole region at once.
There are also related optimizations, like if-conversion. You can
extend scheduling for basic blocks to hyperblocks, which is a reasonable
approach for IPF.

Profile-based optimization is key for IPF codes, especially integer
codes, as shown by measurements HP has done.

Data prefetching provided the biggest wins in technical compute
applications. David Mosberger saw some gains in the Linux kernel.
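
A minimal sketch of what prefetching looks like when written by hand,
for illustration only. It assumes a GCC release that provides
__builtin_prefetch, which is newer than the compiler discussed in
these minutes. The function name and the prefetch distance are
hypothetical and untuned.

```c
#include <stddef.h>

enum { PREFETCH_AHEAD = 16 };   /* assumed distance, in elements */

double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Request data a few iterations ahead so that the memory
           access overlaps the additions.  Arguments: address,
           0 = prefetch for reading, 1 = low temporal locality. */
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 1);
        s += a[i];
    }
    return s;
}
```

A compiler that inserts prefetches automatically has to choose the
distance from its machine model rather than from a constant like the
one assumed above.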

It's important to validate that the machine model in GCC is accurate.

Template selection and intra-bundle stops should be integrated into the
scheduler rather than done in a separate pass. This allows a robust
mechanism to find the best layout for templates and minimize the number
of nops.

Cross-module optimizations reduce the overhead of calls, build a larger
scope over which to generate code, and expose more ILP.

Discussion
----------

Richard Kenner: Does HP have any quantitative measures of the benefit
of each optimization?

Suneel Jain: They all go together. It doesn't help to have only one of
them. A compiler needs good aliasing infrastructure. Profile-based
optimization is a key factor. The instruction scheduler must be
aware of predication.

Vatsa Santhanam, Bill Savage: Code locality is even more important for
this architecture than for others where it shows a benefit.

Ross Towle: Aggressive code motion allows data speculation to be more
important than control speculation.

Intel IPF compiler; talk by Bill Savage
---------------------------------------

[Bill's slides]

Compiling for IPF; High Level View

General Purpose Computing

o Characteristics
- Very large, with few or no hot spots.
- Heavily dependent on i-cache / TLB for perf.
o Code size and locality most significant
o To get the best performance, compiler needs
- profile-guided optimization
- code-size sensitive scheduling, bundling
- interprocedural optimization

Profile Guided Optimization

o Instruction cache management
- block ordering -- move cold blocks to end of functions
- function ordering -- move cold functions to end of executable
- function splitting -- move cold blocks out to end of executable
o Branch elimination and branch prediction
- block ordering -- make better use of static heuristics
- indirect branch profiling -- especially for C++
- predication

Interprocedural Optimization

o Improves TLB, i-cache
- Dead static function elimination
- Better function ordering
- Inlining of small functions

Technical Computing

o Characteristics
- Large data sets, loops, hot spots
- Heavily dependent on d-cache / TLB
o Loop scheduling and data locality most important
o To get best performance, compiler needs
- Software pipelining
- Loop transformations and prefetch insertion
- Interprocedural optimizations

Software Pipelining

o Get the most from loop computations
o Overlap multiple iterations
o Works best with moderate-to-large number of iterations
o Requires loop dependence information
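
[Not part of Bill's slides.] As a hand-written sketch of overlapping
iterations, the second function below starts the load for the next
iteration before finishing the current one. The function names are
hypothetical. The restrict qualifiers state the dependence information
as an assumption: if b could overlap a, moving the load above the
earlier store would change the result.

```c
#include <stddef.h>

/* Straightforward loop: load, multiply, store, one iteration at a
   time. */
void scale(const int *a, int *b, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        b[i] = a[i] * k;
}

/* Hand-pipelined loop: the load for iteration i+1 is issued while
   iteration i is still being multiplied and stored. */
void scale_swp(const int *restrict a, int *restrict b, size_t n, int k)
{
    if (n == 0)
        return;
    int cur = a[0];                        /* prologue: first load */
    for (size_t i = 0; i + 1 < n; i++) {   /* kernel */
        int next = a[i + 1];               /* load for next iteration */
        b[i] = cur * k;                    /* finish current iteration */
        cur = next;
    }
    b[n - 1] = cur * k;                    /* epilogue: last store */
}
```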

Loop Transformations, Prefetching, IP Optimization

o Loop transformations
- Enhance data locality by reordering accesses
- Requires data dependence analysis, cost model
- Highest benefits
o interchange, distribution, fusion
o Prefetching
- Generate request to move data that overlaps other computation
o Interprocedural optimizations
- MOD/REF analysis important, data promotion

[notes from his talk; Janis was eating lunch and didn't get many notes.]

The value of profile guided optimization is mostly for code locality.
This is the highest priority.

Interprocedural analysis is important.

The developers looked at how to get all the ILP available. The most
aggressive predication is not the best predication.

There are some low-tech big hitters that can make a difference. You can
compromise in the register allocator between using the fewest and most
registers. GCC could do well for compiling the kernel and some
applications.

Loop transformations enable other things.
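
For illustration only, loop interchange is the simplest of the loop
transformations named on the slides. The sketch below uses hypothetical
names and a hypothetical array size. Both functions compute the same
sum, but the second visits memory with unit stride, which is the data
locality benefit the transformation is after.

```c
enum { DIM = 64 };   /* assumed size, for illustration */

/* Inner loop walks down a column of a row-major array, so successive
   accesses are DIM elements apart. */
long sum_column_order(int m[DIM][DIM])
{
    long s = 0;
    for (int j = 0; j < DIM; j++)
        for (int i = 0; i < DIM; i++)
            s += m[i][j];
    return s;
}

/* After interchange the inner loop walks along a row, so successive
   accesses are adjacent in memory. */
long sum_row_order(int m[DIM][DIM])
{
    long s = 0;
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            s += m[i][j];
    return s;
}
```

Interchange is trivially legal here because the loop body is an integer
reduction; in general the compiler needs data dependence analysis to
prove it safe and a cost model to decide whether it pays off.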

Suneel's discussion gave a complete list of what a good ILP compiler
needs.

[Bill showed a slide that showed the performance impact on SPEC, which
can't be shared because it was prepared with others under NDA. In mail,
he said that the interesting data for the Intel compiler is:
12% improvement from IPO on integer SPEC
58% improvement from IPO on FP SPEC
He didn't have hard data on profile based optimizations, but
observations are that it provides an improvement of 10%-30%.]

Discussion
----------

Mark Mitchell: Some of the work for profile-directed optimization is
independent of the compiler itself and might be able to use
technology from other compilers.

Richard Henderson: Nobody real is using profile-directed optimization
in GCC; he doesn't know how we can address that.

Bill Savage: All database vendors will use profile-directed
optimization if it's available. Other vendors will use it depending
on how much gain there is and how important performance is to
them.

Janis Johnson: Sequent's C compiler supported feedback-directed
ordering of functions and Sequent's database customers used this
extensively.

David Edelsohn: Profiling the Linux kernel and libraries (for profile
directed optimization) would be useful.

Vatsa Santhanam: HP profiles its kernel.

Jim Wilson: Red Hat has some customers who have used profile-directed
optimization, but it takes a long time to do a build, e.g. 16 hours
rather than 4 hours, so it is not very practical for them.

Suneel Jain: The build for a final release of a product might use
different optimizations than development versions, and profile-
directed optimization might be used only at the end.

Mark Mitchell: Is there a way to get "sloppy data" for profile-directed
optimization so that code can change slightly but the same profiling
data can be used?

[this wasn't answered]

Suneel Jain: It's useful to have pragmas for hints and to be used by
static heuristics.

Jim Wilson: We can build that information into the compiler. GCC
allows the developer to flag branch probability via __builtin_expect.

Hans Boehm: That mechanism is limited. If the user says to expect 0
percent, it's not possible to give some other hint or override it.

Richard Henderson: GCC uses the given hints and then unannotated
branches are run through the usual heuristics.

Hans Boehm: For some things it's useful to know that the branch will
be taken 50% of the time, rather than never taken. The problem
really is that if one of the default heuristics says 0%, the user
can't specify 50%.

Richard Henderson: Currently GCC expects either 1% or 99% based on user
information.
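
For illustration only (this sketch was not shown at the summit), a
typical use of __builtin_expect looks like the following. The likely
and unlikely macro names and the checked_sum function are hypothetical.
The builtin takes only an expected value, not a probability, which is
the limitation Hans describes.

```c
#include <stddef.h>

/* Conventional wrapper macros; the names are an assumption, not
   something GCC itself defines. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Sums n integers into *out.  Returns 0 on success, -1 on bad
   arguments.  The error path is flagged as unlikely so that block
   ordering can keep it off the hot path. */
int checked_sum(const int *v, size_t n, long *out)
{
    if (unlikely(v == NULL || out == NULL))
        return -1;

    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    *out = s;
    return 0;
}
```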

Bill Savage: Annotations get out of date quickly and can give a false
sense of security. Intel has given up on using them.

Richard Henderson: This (__builtin_expect) is used within the layout of
a function. GCC does not split a function into multiple regions,
which has been mentioned as a possibility.

The compiler can put each function into a separate section (with the
option -ffunction-sections) so that the linker could rearrange them.

GCC does block ordering within the compiler.

Mark Mitchell: It wouldn't be hard to do function ordering within a
file.

Dave Prosser: Fur [a proprietary tool from SCO pre-Caldera that permits
some editing of code in object files, not ported to IA-64] is worth
looking at for ordering across modules, which is the real challenge
for IA-64. For some applications fur can increase performance by
30-40 percent, while for others it has no real effect.

someone: Would it be better to port fur to IA-64 or to put a
profile-directed optimization framework within GCC?

Dave Prosser: Certainly there are advantages that fur gets over
profile-directed optimization, but we could get more through less
effort by putting a profile-directed optimization framework in GCC.

Bill Savage: Would Fur give the same benefits as profile-directed
optimization, if we didn't change the GCC infrastructure?

[Dave Prosser provided this during the review of the minutes:]
Fur has to figure out the intent of code -- a level or two above the
mere job of disassembly. Unlike with the IA-32 instruction set,
IA-64 makes this job much, much harder. Moreover, doing anything
finer with code than at the function-at-a-time level with IA-64 for
fur will also take a lot of effort. There are tons of stuff to keep
track of, instead of 5 or so registers. For example, with IA-32 we
can insert a call to a function (at the start of a basic block)
without touching "anything" in the calling context. Fur makes use of
this to be able to do its simple instrumenting. With IA-64, there is
no simple instruction sequence that can be used in this way. It's
even possible to imagine the start of a basic block where every
single register and predicate is live in IA-64, thus making it
impossible to drop in any transfer code sequence, let alone a canned
one.

Thus, in the end, it seems *much* more feasible to get the
per-compilation-unit benefits of more localized code by putting
profile-directed optimization into the compiler before one ever
starts to generate code, for IA-64. Yes, fur can still serve as a
function-at-a-time editing tool, but that isn't sufficient for good
IA-64 code generation, and we can still use that level of fur on
IA-64 on top of code generated based on feedback.
[end of Dave's addition]

Vatsa Santhanam: What mechanisms are used in the GCC community to track
performance?

GCC developers: laughter

Richard Henderson: Nightly SPEC runs are posted to watch performance go
up and down. These are not completely ignored, and some people have
fixed performance regressions shown by these results.

[SPEC runs are posted at http://www.cygnus.com/~dnovillo/SPEC/]

Richard Kenner: We need to actually study generated code occasionally.
He does diffs of generated assembly files and tries to understand
the differences.

Mark Mitchell: A key infrastructure problem is that we can't unit test
different parts of the compiler. We feed in source and get assembler
out, and can't tell what changes come from each part of the compiler,
or that the expected transformations were done. There is no
conceptual reason why this is impossible to do.

Vatsa Santhanam: Do GCC developers have tools to help analyze changes?

Mark Mitchell: It would be nice to know which part of the compiler
caused worse performance. Some benchmark loops do well on some chips
but poorly on others.

Richard Kenner: We could have regression tests for specific
optimizations to recognize when they break.

someone: Who would look at results and fix the problems?

Mark Mitchell: Most funding for GCC has been for features, such as
porting to new chips or adding new languages, rather than for
particular optimizations or for ongoing maintenance.

[Vatsa Santhanam said during the review that he got the sense that GCC
developers do not have adequate tools to help analyze performance
regressions and opportunities at the assembly code level.]

[break for lunch, and to switch phone numbers for those dialing in]

SGI Pro64 compiler
------------------

Ross Towle:
If he had brought slides they would look exactly like Suneel's. SGI
has seen the same thing about what is important.

The key point regarding infrastructure, going back to data
dependence and data analysis, is being able to see memory references
from the start; it doesn't work to derive this information later.
In C and C++, the compiler needs information about subscripted
references. If it has that information then data dependence is that
much more correct and less fuzzy. Other optimizations fall out
nicely if data dependence information is as perfect as it can be.

IBM IPF compiler; talk by Jim McInnes
-------------------------------------

[The following is from mail that Jim sent and which Gary Tracy copied
for those at the summit. Comments like this were added by Janis from
her notes of additional things Jim said during his talk.]

Here is a list of things that we feel are important to do well if you
want to do well on IA64. The list is more or less in descending order.
I'm restricting my attention to things that are IA64 specific and am
leaving out things that are platform independent. I know little about
GCC, so these remarks are not targeted at any specific aspects of GCC.

0) Alignment.

You probably already know that misaligned data references cost
plenty. Zero of them is a good number to have.

I haven't had the impression that this is going to get better.

Perhaps this is a bigger problem for us because PPC allows them.
This can affect application performance by a factor of 25 to 100.
This stopped our plans to try to exploit FP load quad word insns
because they require 16 byte alignment and we don't always have the
info to guarantee that.

[Interactions with other optimizations means that this might not
always be known.]

1) Good bundle aware local scheduling after register allocation.

This pretty much has to be the last thing that looks at the code
before it is written out.

The code needs to be arranged into a stream in a way that maximizes
dispatch bandwidth, i.e.

i) Bundles and stop bits need to be correct and the number of stop
bits needs to be minimal.
ii) Pipeline delays are met.
iii) Machine resources are not oversubscribed.

In order to do this the compiler must have knowledge about which
predicate registers are disjoint, and this knowledge must survive the
register allocation process.

I think you will lose a lot by trying to separate the "pipeline
awareness" from the "bundle awareness".

This is good for all code and is essential for code size.

2) Software pipelining

You need to do Modulo Scheduling using the special hardware provided.
All instructions are fully pipelined to facilitate this.

It is less important to get all the funky cases involving predicate
registers used inside the loop etc. We struggled to get every case
correct. This is very important for fp code. While loops are also
important but this you frequently need ...

3) Speculation.

We did control speculation and not data speculation. Both kinds are
moderately dangerous because they can introduce debugging headaches
for users of the compiler. We found this useful - I have no
particular comments about it.

For us the speculation was driven by our Global scheduler [similar
to the Haifa scheduler used in GCC]. We saw code in important
applications that could have benefitted from very local data
speculation. In particular the dependent sequence:

LD
CMP
BR label x
STORE
label x:

repeated many times. In this case it is a large win to data-
speculate the LD up at least one block and replace it with a LD.CHK.
In the instance we saw, all the stores were through *char or *void
and couldn't be disambiguated from the loads. I think that this
kind of local speculation where only one or two stores have a chance
to invalidate the ALAT entry is less likely to incur penalties.

[This kind requires less infrastructure. Code motion out of loops,
for example, couldn't be disambiguated from a load through a pointer;
that might require major infrastructure changes.]

4) Predication.

This is difficult. My understanding is that great performance
benefits can come from predicating, but this understanding didn't
come from my work on IA64. I think that you need to get many things
right before this is effective. In particular:

i) The heuristics for deciding what to predicate need work. Static
heuristics seem hopeless, so I think that it should be disabled
unless PDF is being used. We used a fairly simple minded scheme
from one of the popular papers. It was inadequate, in our view.
[Most customers don't use PDF.]

Real compilers have to pay attention to compile time as well as
optimization - this might hinder efforts in this area.

The biggest danger here is over predicating. The big
predication boosters always fail to mention that predicated
instructions that get squashed are just extra path length.

ii) The register allocator has to be fully with the program and be
able to assign registers optimally in predicated code. I don't
know all the problems here.

iii) You need pretty good technology to represent relations among
predicates. At the very least you need to know accurately if
two predicates are disjoint. You will also need to make some
predicates manifest in the code (that were previously only
implicit) when you predicate. Doing this efficiently is also
hard. [The Intel IPF assembler requires a lot of information
about branches to be explicit in branch expressions.]

5) Branch prediction.

Our efforts in this area were hampered by bugs in early hardware and
we were never able to measure the benefit. My understanding is that it
is important.

[end of Jim's prepared talk.]
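
[Illustration of the predicate knowledge that items 1 and 4 of Jim's
list call for. This is a minimal sketch, not taken from any of the
compilers discussed. The class name PredicateRelations and its methods
are hypothetical. It assumes each predicate register is written once and
that a guarded compare clears both of its targets when the guard is
false, so every predicate can be described as a conjunction of compare
outcomes.]

```python
class PredicateRelations:
    """Hypothetical helper: each predicate is a conjunction of compare outcomes."""

    def __init__(self):
        # predicate name -> set of (compare_id, outcome) facts that must all hold
        self.facts = {"p0": frozenset()}  # p0 stands for the always-true predicate
        self.next_compare = 0

    def define_pair(self, p_true, p_false, guard="p0"):
        """Model a compare, executed under `guard`, that writes a complementary pair."""
        cid = self.next_compare
        self.next_compare += 1
        base = self.facts[guard]
        self.facts[p_true] = base | {(cid, True)}
        self.facts[p_false] = base | {(cid, False)}

    def disjoint(self, a, b):
        """True only when a and b provably can never both be set."""
        fa, fb = self.facts[a], self.facts[b]
        # Disjoint if some compare must be true for one and false for the other.
        return any((cid, not outcome) in fb for cid, outcome in fa)


rel = PredicateRelations()
rel.define_pair("p6", "p7")               # outer condition
rel.define_pair("p8", "p9", guard="p6")   # nested condition under p6
print(rel.disjoint("p8", "p7"))           # nested-true vs. outer-false
print(rel.disjoint("p6", "p8"))           # p8 implies p6, so not disjoint
```

[A scheduler could use such a query to place two instructions that
write the same register in one instruction group when their predicates
are disjoint. The answer is conservative: False means "not proven
disjoint".]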

The IBM compiler targets a lot of platforms, but the IL is lower level.
The developers had trouble redirecting optimizations that deal with
addresses because IA-64 doesn't have base+displacement addressing. It
was difficult to teach optimizations about addressing. To minimize code
size, the compiler must make effective use of post-increment forms; this
was challenging.

The optimizations all interacted with each other, so the people working
on them had to work closely to get the optimizations to all work well
together. Other platforms supported by the compiler allowed the
optimizations to be separate.

Discussion
----------

Richard Henderson: Alignment is not really an issue for GCC; other
platforms that it supports have similar issues, so it already keeps
data aligned.

Mark Mitchell: For GCC, it is an absolute requirement to target almost
any crazy chip. What were Jim's impressions working on IA-64 support
in a retargetable compiler? How much could be done in a target
independent way, and how much was specific to IA-64?

Jim McInnes: The IBM compiler has three phases of optimizing. The
first is inter-procedural and platform independent, and they didn't
have to do much to that in order to get it to work with IA-64. This
phase uses a different IL from other phases. The back end has two
phases. The first of these is bread-and-butter optimizations [a long
list that I didn't record], getting closer and closer to what the
code will be on the machine. The goal was that for IA-64 those two
phases would continue to be the same on all platforms.

For IA-64, the compiler didn't optimize properly because
post-increment had to be done early. There were specific problems
that required getting more platform-specific in the first back-end
phase, but it didn't really affect most optimizations.

The third phase is instruction scheduling and register allocation.
Scheduling was pretty much rewritten, since the existing scheduler
didn't work for IA-64.

Allocating registers was not very hard, although they had trouble
with understanding that you're spilling when doing calls and need to
minimize the use of registers in a function. The allocator tended to
use a different register for every address, so it ended up using too
many registers.

Rotating registers were new to them and caused problems in prolog and
epilog at first.

[John Sias said during the review:] If there are insufficient registers
available on the register stack when a function is invoked, the
machine (at least Itanium) stalls to spill some entries into the
backing store, making room for the new function's register frame.
The register allocator needs to acknowledge that there's some cost in
allocating additional stack registers because there's the danger of
this hidden spilling.
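
[Illustration of the hidden cost John describes. This is a deliberately
simplified sketch: the function name rse_spills is hypothetical, the
capacity of 96 stacked registers (r32-r127) is the architectural
register stack size, and the model ignores the overlap between a
caller's output registers and the callee's inputs as well as the fills
that happen on return.]

```python
STACKED_REGISTERS = 96  # r32-r127


def rse_spills(frame_sizes, capacity=STACKED_REGISTERS):
    """Return, for each nested call, how many registers must be spilled on entry.

    frame_sizes lists the stacked registers requested by each function
    along one chain of nested calls, outermost first.
    """
    resident = 0   # stacked registers currently held by live frames
    spills = []
    for size in frame_sizes:
        if size > capacity:
            raise ValueError("a single frame cannot exceed the register stack")
        overflow = max(0, resident + size - capacity)
        spills.append(overflow)
        resident = resident + size - overflow
    return spills


# Three nested calls that each ask for 40 registers: the third one stalls
# while 24 older registers are written to the backing store.
print(rse_spills([40, 40, 40]))
```

[Under this model, an allocator that asks for fewer stacked registers
per function delays the point at which a deep call chain starts to
spill.]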

Mark Mitchell: What about loop optimization?

Jim McInnes: This was totally unique, and was rewritten for IA-64 to be
specific to that platform.

Mark Mitchell: The optimization passes in GCC are the same for all
platforms now. Might we need to write different versions of some
passes for IA-64?

Richard Henderson: The pipeliner was originally written for RISC chips,
and it's not an issue to use modulo scheduling.

Jim McInnes: IBM found that modulo scheduling is profitable on RISC
chips as well.

Richard Kenner: No matter how something looks, you might find another
platform later on to take advantage of some optimizations.

Jim McInnes: Rotating registers are important. The epilog counter
is not that important. One way or another you'll have a big chunk of
code that is only used on IA-64.

Ross Towle: Yes, modulo scheduling is important on other platforms as
well. It's necessary to use a very different register allocation for
rotating registers; this is very mechanical, very simple.
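
[Illustration for the modulo scheduling discussion. A modulo scheduler
normally starts from a lower bound on the initiation interval (II): the
larger of a resource bound and a recurrence bound. The sketch below
computes that bound. The function names are hypothetical and the unit
counts and latencies in the example are made-up numbers, not Itanium
figures.]

```python
def resource_mii(uses, units):
    """Resource bound: for each resource, ceil(ops per iteration / units per cycle)."""
    return max((-(-n // units[r]) for r, n in uses.items()), default=1)


def recurrence_mii(cycles):
    """Recurrence bound: for each dependence cycle, ceil(latency / iteration distance)."""
    return max((-(-lat // dist) for lat, dist in cycles), default=0)


def minimum_ii(uses, units, cycles):
    """Lower bound on the initiation interval of a software-pipelined loop."""
    return max(resource_mii(uses, units), recurrence_mii(cycles))


uses = {"M": 3, "I": 4, "F": 2}    # operations per loop iteration, by unit type
units = {"M": 2, "I": 2, "F": 2}   # units of each type available per cycle
cycles = [(5, 1), (9, 2)]          # (total latency, total distance) per recurrence
print(minimum_ii(uses, units, cycles))
```

[The scheduler then tries to find a schedule at this II and increases
it on failure. Nothing in the bound itself is specific to IA-64, which
is consistent with the remarks that modulo scheduling also pays off on
RISC chips.]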

[Vatsa Santhanam said during the review:] I think the code example does
not clearly illustrate the point being made in the text.
Specifically, the LD in the code fragment is already positioned above
the STORE and so the need for data speculation is not evident. So
either the LD was below the STORE to begin with and had to be data
(and control) speculated or the above code sequence repeats
*back-to-back* multiple times giving rise to the data speculation
opportunity.
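
[Illustration of the local data speculation Jim described, where a load
is hoisted above stores that cannot be disambiguated and checked later.
This is a toy model, not a description of the hardware: the class Alat
and its method names are hypothetical, memory is a plain bytearray, and
the table simply forgets an entry when a store overlaps its address
range.]

```python
class Alat:
    """Toy model of an advanced-load address table."""

    def __init__(self):
        self.entries = {}  # target register -> (address, size)

    def advanced_load(self, reg, address, size, memory):
        """Load early and remember which bytes the value came from."""
        self.entries[reg] = (address, size)
        return int.from_bytes(memory[address:address + size], "little")

    def store(self, address, size, value, memory):
        """Perform a store and drop every entry whose range it overlaps."""
        memory[address:address + size] = value.to_bytes(size, "little")
        self.entries = {
            reg: (a, s)
            for reg, (a, s) in self.entries.items()
            if a + s <= address or address + size <= a
        }

    def check_load(self, reg, address, size, speculative_value, memory):
        """Keep the early value if the entry survived; otherwise reload."""
        if self.entries.get(reg) == (address, size):
            return speculative_value
        return int.from_bytes(memory[address:address + size], "little")


memory = bytearray(64)
alat = Alat()
alat.store(16, 8, 7, memory)
early = alat.advanced_load("r14", 16, 8, memory)       # hoisted load
alat.store(32, 1, 255, memory)                         # unrelated store
print(alat.check_load("r14", 16, 8, early, memory))    # entry survived
alat.store(20, 1, 1, memory)                           # overlapping store
print(alat.check_load("r14", 16, 8, early, memory))    # forced reload
```

[With only one or two stores between the hoisted load and its check,
the chance of taking the reload path is small, which is the argument
made in the talk for this very local form of data speculation.]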

Long-term infrastructure improvements to GCC; talk by Brian Thomson
-------------------------------------------------------------------

Brian is soliciting support for long-term infrastructure improvements
to GCC. He sees a real synergy between that effort and efforts like
this one.

Brian is working with vendors who have a dependence on GCC generating
good code, to get them signed up to support the effort. They will lay
out requirements and then invite the GCC development community to offer
ideas. This effort is broader than a single platform; the efforts will
probably help all processors, but some more than others.

Some of the changes needed for GCC are more fundamental, with broader
effect, involve more upheaval in existing code, and will take longer to
implement. There will be a natural breakdown of work that comes out of
this summit for the two groups of effort. Some of what comes out of
this summit will be input into Brian's effort with other system vendors.

Discussion
----------

Gerrit Huizenga: Is there any list so far of what changes are proposed?

Brian Thomson: There has been some discussion about specific items.
The intention is to identify targets that we want to see improvement
in; what kind of code, languages, and architectures, and then allow
GCC designers to propose technology to do that. He doesn't want to
be prescriptive, but let people who own the solution provide that.
For example, deciding whether and how to do multiple levels of IL as
in the IBM compiler.

Richard Kenner: There are trees and RTL in GCC. The tree representation
is used for expression folding and inlining.

Richard Henderson: Making tree-level optimizations language-
independent is a high priority. Red Hat has a person in Toronto who
is working on SSA. In general, GCC shouldn't throw away so much
information so quickly. Moving to multiple levels of IL is going to
have to come before any longer-term projects can see any benefit.
The first step there is to create a clean interface between the
front end and the optimizer so we can reuse cool optimization
technology for all languages.

Mark Mitchell: The existing tree representation also needs changes. It
needs to be more regular and clean.

Richard Henderson: This is a fair description of what he had in mind.
The higher level RTL that Brian or Jim McInnes mentioned has been
discussed several times in the last couple of years.

Jeff Law has been doing some serious thinking on that subject and has
started submitting some bits. GCC has an SSA pass but it suffers
from representational problems in the current RTL. In certain
situations it isn't reliable, so it is not turned on by default.
Jeff has started attacking some of those issues so it can be turned
on.

As for longer term directions, Richard is not sure he has any. When
fundamental problems are resolved then the future direction is more
based on desire and whatever else we identify that can best help
performance.

Suneel Jain: Is the goal of having a higher-level tree representation
to do inter-procedural optimizations from information written to
disk?

Mark Mitchell: This question comes up a lot. The sticky issue is that
the FSF is morally opposed to doing this. The aim of the FSF is not
to produce the best compiler, but to convince the world that all
software should be free. The concern is that writing out the
representation to disk would allow a vendor to use a GCC front end,
write the IL to disk, read it back in, and graft proprietary code
into GCC in a sneaky way to get around the GPL. This is a very
firmly held position in the FSF.

Richard Henderson: Pre-compiled headers write some information to disk,
but not in a way that is entirely accessible. If performance
speed-ups for that are as benchmarks have suggested (20x speedup in
C++ compile time) then it will be almost impossible to disallow it.
For this, though, the representation is very close to the source
level and is not really a GCC internal representation.

Mark Mitchell: Other related work to write out parts of the internal
representation is inevitable. Eventually the political issue
might be weakened if this is important for the long-term viability
of the compiler.

Richard Kenner: We can write out some information and still get
inter-module optimizations, e.g. information about register usage
within a function.

Mark Mitchell: We don't want to let vendors leverage GCC in their
products.

Bill Savage: Summary information could be used for analysis.

Richard Kenner: We don't want a vendor to be able to use a GCC
back end or front end. It isn't clear whether an IL used in GCC
could be GPLed, since it's not just the actual code but the methods
that it uses.

David Edelsohn: Is it useful to discuss specific optimizations, with
long-term vs. short-term?

Richard Kenner: There was a private discussion earlier about setting
up a data structure with useful information about a MEM.

Mark Mitchell: He would like to understand the state of the current
profiling code in GCC. We might be able to take advantage of that
in places where GCC is now guessing.

Richard Henderson: GCC currently collects trip counts off a minimal
spanning tree, for how many times you went from this block to
another block. There are a couple of PRs in the GCC bug database,
but it mostly works except for some computed gotos.

He has built SPEC with it and it seemed to function. He doesn't
know how much performance improvement it gave, but it got data,
which went back in and got attached to the right branches.

We could use the information to improve linearization of the code,
and sometimes for if-conversion to decide which side of the branch
should be predicated. It could also be used for delay slots.

For profiling, GCC generates different code and increments counters
inline.

Jim Wilson: Some information can be computed after the fact from
execution counts.
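
[Illustration of how counts that were not measured directly can be
recovered afterwards. The sketch assumes that only the edges off a
spanning tree of the control flow graph carry counters, that a virtual
edge from the exit block back to the entry block closes the graph, and
that flow is conserved at every block. The function name
solve_edge_counts and the example graph are hypothetical; this is not
GCC's actual implementation.]

```python
def solve_edge_counts(edges, measured):
    """Derive counts for uninstrumented edges from flow conservation.

    edges: list of distinct (src, dst) pairs, including the virtual
    exit -> entry edge. measured: {(src, dst): count} for instrumented
    edges. Self-loops are assumed to be instrumented.
    """
    counts = dict(measured)
    nodes = sorted({n for edge in edges for n in edge})
    progress = True
    while len(counts) < len(edges) and progress:
        progress = False
        for node in nodes:
            incident = [e for e in edges if node in e]
            unknown = [e for e in incident if e not in counts]
            if len(unknown) != 1:
                continue
            inflow = sum(counts[e] for e in incident if e in counts and e[1] == node)
            outflow = sum(counts[e] for e in incident if e in counts and e[0] == node)
            e = unknown[0]
            # The single unknown edge must balance inflow against outflow.
            counts[e] = outflow - inflow if e[1] == node else inflow - outflow
            progress = True
    if len(counts) < len(edges):
        raise ValueError("not enough measured edges to determine all counts")
    return counts


edges = [("entry", "A"), ("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"),
         ("D", "A"), ("D", "exit"), ("exit", "entry")]
measured = {("C", "D"): 4, ("D", "A"): 9, ("exit", "entry"): 1}
counts = solve_edge_counts(edges, measured)
print(counts[("A", "B")], counts[("A", "C")])
```

[Three counters are enough here to recover all eight edge counts, and
block execution counts follow by summing the incoming edges of each
block.]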

Bill Savage: What we want is basic block profiling with extra
instrumentation around loops. The IR can be annotated with branch
probabilities, with counts to guide the heuristics of all
optimizations downstream. Code locality is the biggest payoff.

Jim Wilson: Jim wrote this functionality when he first started at
Cygnus about 11 years ago. It might have been six years before it
was approved. It's online and usable but very few people use it.
It can be used for profiling; profile-directed feedback is extra.

Bill Savage: Someone really ought to look into using this.

Richard Henderson: Block reordering is a year and a half old. Before
that it was used for branch hints. He doesn't know what kind of
performance help it gives.

Mark Mitchell: Data prefetching might be simple to tackle.

Richard Henderson: Jan Hubicka at SuSE did this.

Bill Savage: There are two approaches: one for floating point that's
complicated, one that's simple-minded that didn't hurt anything but
didn't buy much. There might be other techniques that are simple for
integer computing. Some techniques work well on linked lists if
they are laid out so elements are contiguous. This showed a good
speedup on SPEC, but real applications might not work that way.

Suneel Jain: He has done comparative SPEC runs and analysis with GCC,
the Intel compiler, and SGI's Pro64 on IPF Linux, using default
optimization levels, -O2, no special options.

The Intel compiler was about 30% higher than GCC. GCC and Pro64 were
comparable at -O2, although Pro64 was really bad at Perl, which
brought down its average.

The Intel and HP compilers are comparable at -O2.

With peak numbers from HP and Intel, GCC gets about half the
performance (GCC 1.0, HP and Intel 1.8). This was two months ago
with a GCC pre-3.0 version.

If we want to focus on application performance then we should improve
-O2 and not require special options.

The difference for GCC runtime (from SPEC) was close to 10-15%,
without profiling feedback.

Mark Mitchell: Was there any analysis of where the differences came
from?

Suneel Jain: No.

Mark Mitchell: We can put anything in -O2 we want.

Suneel Jain: But not profile-based optimization.

Richard Henderson: If our goal is to improve performance of a Linux
system, then profile feedback is not where we should begin looking.

lots: Why?

Richard Henderson: Because we're not going to build the whole
distribution that way.

Waldo Bastian: Lots of people just use the distribution as they receive
it so it would be useful for those people.

Richard Henderson: We need more than a compiler that supports it, we
need representative test cases.

Waldo Bastian: But we can do that. A project like KDE can build its
own test cases if only the profiling tools are easy enough to use.

Richard Henderson: It's not that hard to get the data out of the
kernel.

Reva Cuthbertson: The problem is getting accurate data.

Steve Christiansen: You have to decide what workload to use.

Mark Mitchell: Workload issues are important, but using a workload
that is close wouldn't hurt other workloads.

Bill Savage: Linux kernel performance should be a high priority.
Behind that is general applications and database software, and
behind that is technical computing, which is lower because there
might be other compilers available for those applications.

Hans Boehm: There might not be much opportunity in the kernel, since
much of it is hand tuned. David Mosberger might know whether there
are problems shown by benchmarking.

Mark Mitchell: The shell and the C library could benefit; they both
have a lot of CPU usage.

Richard Kenner: A webserver might be an interesting workload. The
shell is too simple.

Mark Mitchell: It's interesting to profile a workload to see which
processes are running.

Jim Wilson: Itanium's performance monitoring registers let you see a
lot of information.

David Edelsohn: We should focus on uses of the system to guide which
areas would benefit the most from performance improvements.

someone: The kernel is important but has a lower priority than other
parts of the system.

Mark Mitchell: It's reasonable to focus on enterprise applications for
Itanium.

Gary Tracy: We need to have a list of projects and let people sign up
for them.

Mark Mitchell: There has been a project file for GCC for years and he
can't remember any of those projects being done.

Gerrit Huizenga: We should keep track of failed projects so that others
don't go down the same path.

Mark Mitchell: We should get detailed technical information from
developers of other compilers so we don't need to start from scratch.

Richard Kenner: It could slow things down if some of the people
improving GCC are planning to patent their methods. Is anyone
planning to do that?

someone: The various companies which have cross-licensed the compiler
optimization patents could license them to the FSF.

David Edelsohn: The cross-license does not include the right to
sub-license the patents to others, such as the FSF.

Richard Kenner: Companies can't do cross-licensing with the FSF because
it doesn't have its own patents.

David Edelsohn: Daniel Berlin has written a new register allocator
based on a paper that touches on every register allocation patent,
from lots of companies and universities. That work will not go into
GCC until the FSF decides what to do about the relevant patents.

Mark Mitchell: It would be nice if big companies could do patent
searches on behalf of the FSF. Unintentional patent infringement is
a potential risk to the open source community.

[Break]

GCC projects to improve IA-64 support
-------------------------------------

We had a brainstorming session, with lots of sidetracks into the items
being brought up, to divide potential enhancements into three
categories:

short-term IPF optimizations
long-term optimizations
tools: performance tools, benchmarks, etc.

Items were written on large charts as they were brought up. The
information below shows what was written on the charts and some of the
discussion about them.

Short-term IPF optimizations
----------------------------

- alias analysis improvements

Richard Henderson: This work is self-contained and doesn't affect
the rest of the compiler. The idea is to track the origin of the
memory when it is known, despite the memory reference being broken
down. Register+displacement addressing doesn't usually require this
kind of information. With IA-64 we start losing information
immediately.

Richard Kenner is already planning some work on tracking memory
origin.

- prefetching

Richard Henderson: There are existing patches to examine in the
gcc-patches archive. There is dependence distance code already
checked into the compiler that no one uses; that information could be
hooked into the loop unroller and the prefetcher and we might see
improvements.

- prefetch intrinsic

- code locality; function order based on profiling

Bill Savage: Getting functions ordered requires interaction with the
linker.

Richard Henderson: There was some work on such a tool, but it might
be easiest to start from scratch. The [GNU] linker has a scripting
language that can tell it where to place functions. The tool could
almost be a shell script.

Bill Savage: This functionality requires more than a call graph.

Hans Boehm: There might be a problem that profiling tools don't work
with threads.

There is an article by Karl Pettis and Bob Hansen about how to order
functions based on a call graph: "Profile guided code positioning",
http://acm.proxy.nova.edu/pubs/articles/proceedings/pldi/93542/p16-pettis/p16-pettis.pdf

- static function ordering

SGI has a tool called CORD for code ordering that uses either static
or dynamic information.

- machine model

Richard Henderson: There is a good machine model from Vlad [Vladimir
Makarov], but it was not submitted. The current one isn't good
enough for advanced scheduling.

- improve GCC bundling of instructions

Richard Henderson: GCC currently uses an ad-hoc method of bundling;
the machine model should guide it.

Vatsa Santhanam: Look at nop density.

- selective inlining

Mark Mitchell: GCC with -O2 inlines functions that are declared as
inline; -O3 will inline everything "small". GCC could be smarter
about how to inline.

Vatsa Santhanam: GCC could do profile-based inlining.

- hook up to open source KAPI library (machine model description)

Suresh Rao: We can use it to build the machine model rather than
using it directly.

- control speculation for loads only

Suneel Jain: Speculation for loads doesn't need recovery code and is
quite simple, with chk.s.

someone: Recovery mode is not supported in Linux.

Richard Henderson: If you don't care about seg faults you don't even
need the check.

- region formation heuristics

Richard Henderson: We could rip out CFG detection, use regular data
structures, and fix region detection.

[John Sias sent this during the review of the minutes:]
Region formation is a way of coping with either limitations of the
machine or limitations of the compiler / compile time. "Regions" are
control-flow-subgraphs, formed by various heuristics, usually to
perform transformations (i.e. hyperblock formation) or to do register
allocation or other work-intensive things. For hyperblock formation,
for example, region formation heuristics are critical---selecting too
much unrelated code wastes resources; conversely, missing important
paths that interact well with each other defeats the purpose of the
transformation. Large functions are sometimes broken heuristically
into regions for compilation, with the goal of reducing compile time.

- new Cygnus scheduler

Richard Henderson: This scheduler makes the compiler slower and
doesn't always make code faster. It was written by Vlad.

- exploit the PBO (profile based optimization) capability that already
exists in GCC

Make sure it works and improve the documentation.

Try it on the Linux kernel and discuss the information.

Make the instrumentation thread-safe.

Build gcc with feedback; but Mark Mitchell says that the time spent
in gcc is mostly paging because it allocates too much memory.

- straight-line post-increment

non-loop induction variable opportunities

Jeff Law is looking at post-increment work.

- make better use of dependence information in scheduling

Richard Henderson: This is very helpful and very easy.

- enable branch target alignment

It's necessary to measure trade-offs between alignment and code size.

- alignment of procedures
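
[Illustration for the code locality item in the list above: a greedy
ordering of functions from a weighted call graph, loosely in the spirit
of the call-graph-based approach in the article cited there. This is a
simplified sketch. The function name order_functions and the example
weights are hypothetical, and chains are merged by plain concatenation,
with no attempt to choose the best orientation of each merged chain.]

```python
def order_functions(call_counts):
    """Return a function order that keeps heavily connected functions together.

    call_counts: {(caller, callee): number of calls observed in a profile}.
    """
    chains = {}
    for caller, callee in call_counts:
        for name in (caller, callee):
            chains.setdefault(name, [name])
    # Visit edges heaviest first; break ties by name so the result is stable.
    ranked = sorted(call_counts.items(), key=lambda item: (-item[1], item[0]))
    for (a, b), _weight in ranked:
        chain_a, chain_b = chains[a], chains[b]
        if chain_a is chain_b:
            continue  # already placed together
        chain_a.extend(chain_b)
        for name in chain_b:
            chains[name] = chain_a
    order, seen = [], set()
    for chain in chains.values():
        if id(chain) not in seen:
            seen.add(id(chain))
            order.extend(chain)
    return order


profile = {("main", "parse"): 100, ("main", "log"): 1,
           ("parse", "lex"): 90, ("emit", "log"): 50}
print(order_functions(profile))
```

[The resulting list could then be handed to the linker, for example
through a generated linker script, as suggested in the discussion of
function ordering.]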

Long-term optimizations or infrastructure changes
-------------------------------------------------

- language-independent tree optimizations

Richard Henderson: Cool optimizations require more information than
is available in RTL. The C and C++ front ends now render an entire
function into tree format, but it is transformed into RTL before
going to the optimization passes. We need to represent everything
that is needed to be represented from every language. Every
construct doesn't need to be represented; WHIRL (SGI's IL) level 4 is
about what he means.

Mark Mitchell: This is one of the projects he's wanted to do for a
couple of years. The IL needs to maintain machine independence
longer.

- hyperblock scheduling

Richard Henderson: This requires highly predicated code.

- predication

if-conversion, predication, finding longer strings of logical

notion of disjoint predicates

PQS (predicate query system); a database of known relationships
between predicate registers

- data speculation

- control speculation

- modulo scheduling

- rotating registers

- function splitting (moving function into two regions), for locality

Richard Kenner: This is difficult if an exception is involved.

Vatsa Santhanam: There might be synergistic effects with reordering
functions for code locality.

Jim Wilson: Dwarf2 is the only debugging format that can handle it.

- optimization of structures that are larger than a register

The infrastructure doesn't currently handle this. This is related to
memory optimizations.

- make better use of alias information

- instruction prefetching

- use of BBB template for multi-way branches (e.g. switches)

It might be difficult to keep track of this in the machine-
independent part of GCC.

- cross-module optimizations

Avoid reloads of GP when it is not necessary. The compiler needs
more information than is currently available.

- high-level loop optimizations

This requires infrastructure changes.

- C++ optimizations

Jason Merrill invented cool stuff, e.g. thunks for multiple
inheritance, that hasn't been done yet.

It's possible to inline stubs.

- "external" attribute or pragma

This would be for information like DLL import/export; it is not
machine independent.

If GCC defined such an attribute, glibc would probably use it.

- register allocator handling GP as special

Tools: performance tools, benchmarks, etc.
------------------------------------------

- GCC measurements and analysis, comparison with other compilers

Mark Mitchell: It would be useful to compare performance using real
applications, e.g. Apache and MySQL.

- profile the Linux kernel

- dispersal analysis

Steve Christiansen has a dispersal analysis tool. The output is
similar to the comments in GCC assembler output with -O2 or greater,
but it can be used on any object file and prints information at the
end of each function with the number of bundles and nops.
[Currently this uses McKinley rules and so would still be under NDA,
but if there's interest, Steve could use Itanium rules instead.]

- statistics gathering tool

- PMU-based performance monitor

- small test cases and sample codes for examining generated code

These could come from developers of proprietary IPF compilers, who
presumably have used such code fragments to analyze the code that
their compilers generate.

- compiler instrumentation that would cause an application to dump
performance counter information
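
[Illustration for the dispersal analysis and nop density items above: a
minimal sketch of the per-function summary such a tool might print. It
is not Steve's tool. The function name bundle_stats is hypothetical, and
the input is assumed to be already decoded into bundles of three
non-empty slot strings, optionally prefixed by a qualifying predicate
such as "(p6)".]

```python
def bundle_stats(bundles):
    """Return (number of bundles, number of nop slots, nop density)."""
    nops = 0
    slots = 0
    for bundle in bundles:
        for slot in bundle:
            tokens = slot.split()
            if tokens[0].startswith("("):   # skip a qualifying predicate
                tokens = tokens[1:]
            slots += 1
            if tokens and tokens[0].startswith("nop"):
                nops += 1
    density = nops / slots if slots else 0.0
    return len(bundles), nops, density


function_body = [
    ("ld8 r14=[r32]", "add r15=r16,r17", "nop.i 0"),
    ("nop.m 0", "nop.f 0", "br.ret.sptk.many b0"),
]
print(bundle_stats(function_body))
```

[Comparing this figure before and after a change to bundling or
scheduling gives a quick, static indication of whether slots are being
used better, independent of any run-time measurement.]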

Where do we go from here?
-------------------------

Richard Henderson: Any changes can go into the mainline CVS now, but
there's no way to tell when there will be another release.

Mark Mitchell: Perhaps GCC should go to a more schedule-driven release
policy; he'll bring it up at the next steering committee meeting.

Gary Tracy: His group will be making commitments sometime in June.

Further communications should take place on the gcc mailing list
(gcc@gcc.gnu.org, archived at gcc.gnu.org). Use "ia64" in the subject
line to allow people who are only interested in IA-64 work to search for
it.

Janis Johnson will merge items from the list above with the existing GCC
IA-64 wish list (at linuxia64.org) and get someone to add it to the GCC
project list. People planning to work on a project can mail the gcc
list and record their plans in the projects file.