Wish List for IA-64 Improvements to GCC

This document will be updated as new comments are received.

Send updates to Janis Johnson: janis at us dot ibm dot com. Specify if they are meant to be kept confidential or should be attributed to an unnamed source; the assumption is that they can be added to the public list with your name.

Discussions should take place on the general gcc mailing list, gcc at gcc dot gnu dot org, which is archived at http://gcc.gnu.org/ml/gcc/. Please include "ia64" in the subject line.

Control Speculation

Jim Wilson wrote on 1/10/2001 to the gcc mailing list:

There is a patch for control speculation that was written about a year ago, but I don't think anyone other than its author, Stan Cox, ever looked at it. It would likely need a lot of rewriting to be usable again.

Jim Wilson wrote on 2/2/2001 to the gcc mailing list:

Advanced loads are currently not supported. We need data/control speculation optimizations in order to support them. There is some support for control speculation in the current scheduler, but I don't think we are using it in the IA-64 port yet. This is probably the next big optimization project that will need to be worked on for the IA-64 back end.

Data Speculation

Jim Wilson wrote on 1/10/2001 to the gcc mailing list:

There was one aborted attempt to add data speculation by Bernd Schmidt.

Itanium Hardware Description

Jim Wilson wrote on 1/10/2001 to the gcc mailing list:

We don't have a good Itanium hardware description. There is an improved pipeline description scheme that was written by Vladimir Makarov. However, it hasn't been submitted to the FSF yet, and it is unclear when it will be, so for now we aren't using it. When we did try to use it, we didn't get any noticeable speedup. It was unclear whether this was due to limitations with the scheduler, or whether there was something wrong with the Itanium pipeline description.

Bernd Schmidt wrote on 1/10/2001 to the gcc mailing list:

I think Vlad's Itanium description is actually a bit less accurate than what we now do via MD_SCHED_REORDER. I believe it also wasn't as good at placing stop bits.

A source from Intel Microprocessor Research Labs wrote on March 21, 2001:

It is important to have a precise machine model to help the compiler make machine-dependent decisions involving latencies, bundling, bypasses, and so on.

Software Pipelining

Jim Wilson wrote on 1/10/2001 to the gcc mailing list:

We don't have software pipelining. There is code written by Vladimir Makarov for modulo-constrained software pipelining (or something like that), but I don't know how well it works, and it has no IA-64 specific knowledge, so it doesn't support interesting features like register rotation. This code has not been submitted to the FSF, and it requires the new pipeline description scheme, which also has not been submitted to the FSF, so we can't use it.

Richard Henderson wrote on 1/23/2001:

Vlad implemented something akin to "Resource-Constrained Software Pipelining" by A. Aiken, A. Nicolau, and S. Novack, IEEE Transactions on Parallel and Distributed Systems, 6(12), pp. 1248-1270, December 1995.

Note that there are bugs in the algorithm presented in the paper.

This is not the same thing as modulo scheduling, and so it cannot take advantage of the rotating registers even if we described them. It can, IIRC, handle a fractional iteration interval (e.g. a loop with conditional branches), so it is a good fallback for loops that cannot be modulo scheduled.

Predication

Jim Wilson wrote on 1/10/2001 to the gcc mailing list:

The support for predication could be improved. There is currently little or no knowledge of predication outside the if-cvt.c file, so a number of optimization passes are suboptimal when predicated code is present. Register allocation, for instance: I don't think it will reuse registers in mutually exclusive code blocks, because it doesn't know about predication. The scheduler was creating lots of false dependencies for a while, but that may have been fixed already; I'm not sure.

Bernd Schmidt wrote on 1/10/2001 to the gcc mailing list:

That's mostly fixed.

Richard Henderson wrote on 1/23/2001:

Predication currently does not happen until after register allocation, so concerns about reuse of registers are moot. In fact, none of the optimization passes up to and including register allocation know how to handle predication from a correctness standpoint.

Jim Wilson wrote on 2/2/2001 to the gcc mailing list:

Predication support was added by extending the RTL to represent predicated instructions, and adding the if-conversion (ifcvt.c) optimization pass. Also, a register renaming pass was added that helps with register allocation of predicated code.

Branch Prediction

Jim Wilson wrote on 2/2/2001 to the gcc mailing list:

Nothing specific has been done for branch prediction, but gcc already has features added for other targets that can be used here. There is a basic block reordering pass that estimates branch probabilities. This information can be used to set branch prediction bits, and the IA-64 port does use it. There is __builtin_expect, which allows the programmer to indicate branches that are likely or unlikely to be taken. There is support for profile-directed feedback that can drive branch prediction.

Instruction Level Parallelism

Jim Wilson wrote on 1/10/2001 to the gcc mailing list:

I suspect that we aren't generating enough ILP to really take advantage of the architecture. So improving optimizations that create ILP might be a good first step. Perhaps some code replication would be useful. Or perhaps improving predication/speculation support would help. I would expect that scheduler improvements tie into this. We probably need an aggressive cross block scheduler to get much benefit from any of these areas. I doubt we want to go as far as trace scheduling, but there is probably something intermediate that would be useful.

Bernd Schmidt wrote on 1/10/2001 to the gcc mailing list:

The code I've added to schedule across extended basic blocks seems moderately effective (maybe 8% on SPEC with the other improvements to reduce false dependencies).

Robert Dewar wrote on 1/10/2001 to the gcc mailing list:

Conventional wisdom for these kinds of architectures would say that you have no hope of generating enough ILP if you do not do speculation. Trace scheduling + speculation seems really required to extract the promise of EPIC architectures of this kind.

Sounds like there are a LOT of opportunities here. It is probably worth taking a close look at what is going on in the Trimaran project.

The Trimaran project is "An Infrastructure for Research in Instruction-Level Parallelism"; see http://www.trimaran.org/.

Alternate Code Sequences

Jim Wilson wrote on 1/10/2001 to the gcc mailing list:

Using alternate code sequences might be useful in some cases. The cost of moving integer values to/from the FP registers for the multiply (xma) instruction might make other code sequences faster.

General Comments about Optimizations

Jim Wilson wrote on 1/10/2001 to the gcc mailing list:

That is kind of a general wish list based on interesting processor features. So far, I haven't had any time to look at code and figure out what needs to be improved. Just getting the toolchain working well enough, and complete enough, for OS releases has taken almost all of my time so far.

There are ??? comments scattered through the ia64 backend, some of which point out optimization opportunities. Most of these are small local opts though, so we may not get much performance from them.

(the string "???" occurs 161 times in files in gcc/config/ia64)

I also suspect that the, um, idiosyncrasies of the Itanium pipeline are causing problems. Some operations are a lot slower than one would expect. Dynamic shifts, for instance, effectively take 10 cycles unless you schedule them right or emit some nops after them, in which case they take 4 cycles; we don't get this right. There are a lot of cache pipeline flush cases if you put stores too close together; some coalescing of stores might be useful to avoid this. I saw a message today pointing out that we are moving values into branch registers too late. There are likely other things that could be improved in this area.

We recently started emitting entire functions in a high-level IL instead of one statement at a time. This introduces the possibility of adding high-level optimizations such as loop transformations. We have a little bit of dependence analysis, but it isn't really hooked into anything yet. The C++ front end does function inlining on the high-level IL, but the C front end doesn't yet; this needs to be fixed. The C front end still uses a low-level function inliner which can't handle any complicated call sequence, including HFA arguments. This means functions using the complex FP types can't be inlined, which hurts glibc math library performance.

We have a little bit of infrastructure for profile-directed feedback, but we haven't tried to use it yet, and it probably needs some maintenance work before it will be usable again. We should probably rely on it only as a last resort, though, since it is inconvenient for many applications.

A source from Intel Microprocessor Research Labs wrote on January 25, 2001:

As for which IA-64 optimizations are most important for good performance, this depends heavily on the workloads. For floating point benchmarks (e.g. SPEC2000F), software pipelining, loop optimizations, memory prefetching, etc., are crucial. For integer benchmarks (e.g. SPEC2000C), function inlining, good alias analysis, instruction scheduling, control speculation, use of profile information, sign-extension elimination, etc., are important. For OLTP types of programs, any optimization that increases code size should be carefully examined.

David Mosberger reported to the gcc-bugs mailing list on March 23, 2001, http://gcc.gnu.org/ml/gcc-bugs/2001-03/msg00712.html, that interactions between optimizations prevent sibling call optimization when inlining is used. He manually tweaked the generated code to force both optimizations to be used and the performance of his test case was 50% better than with -O3. This was with gcc version 3.0 20010322 (prerelease).

People working on IA-64 code generation

What organizations are quietly working on IA-64 gcc improvements that will be added to a future release of gcc, or would be willing to submit their changes for someone else to integrate into a future release? This information could prevent unnecessary duplication of effort.

People working on other IA-64 compilers could help out a lot by providing feedback about which kinds of compiler enhancements are the most important for performance of system software, and which ones are most likely to fit into gcc's current infrastructure. Another help would be to provide code fragments that demonstrate opportunities for optimization and small benchmarks that demonstrate areas where the compiler is crucial to allowing decent performance on IA-64.

Communications about IA-64 work in gcc

Jim Wilson wrote on 1/10/2001 to the gcc mailing list:

The ia64-linux (formerly Trillian) group has a private mailing list for the toolchain, but I'd rather not use it anymore, because it is private. There is a public list linux-ia64 at linuxia64 dot org but it mostly gets used by kernel developers. It might be feasible to use it for some linux/gcc discussions though. For now, the regular gcc lists are probably the best choice. If we remember to put ia64 in the subject line, it should be possible for people to follow just the ia64 threads if that is all they want to read.

Benchmarks

There is a gcc benchmark suite, but it doesn't appear to be used by many gcc developers (let me know if this isn't true).

Bench++ is open source; all of the tests are in C++.

Al Aburto has a good collection of small C benchmarks available at ftp.nosc.mil/pub/aburto.

Is SPEC CPU2000 widely used by gcc developers, and/or developers of other IA-64 compilers?

Last updated Mon Mar 26 09:49:58 PST 2001.