Apple Developer Connection

Optimizing for the Power Mac G5

Apple’s Power Mac G5 computer has significantly increased speed, capacity and performance capabilities, giving you new opportunities to create powerful, innovative software. But to take full advantage of the G5 platform, you’ll need to understand how the G5 differs from previous processors, which new tools Apple provides to help you analyze your software for the G5, and how to optimize your application.

Optimization is not just for high-end, math and science applications—every application can benefit significantly from modifications that range from very easy to very complex. You will have to decide for yourself how much to optimize; this article can help by guiding you through the decision-making process, and by explaining what’s involved in each set of options.

If you are creating new applications, optimizing can help you get the most out of the G5 from the start; for existing applications, you can learn how to identify the parts of your application that can be tuned to run optimally on the G5. If you have already optimized your current products for the G4 processor, you’ll need to make a similar effort to optimize them for the Power Mac G5.

Apple provides the tools you will need for your optimization efforts, and these are described below. The sections that follow summarize the various levels of optimization that are available, so you can choose the optimizations that are the most appropriate for you, according to your resources. But first, this article provides a quick summary of the Power Mac G5 platform and how it differs from any other personal computer.

The Power of the Power Mac G5

Notice that this article does not focus on the G5 as a processor only, but rather on the entire Power Mac G5 system. Previous efforts at optimization strategy focused primarily on exploiting processor features to maximum advantage. But the scope of optimization work for the Power Mac G5 is broader; that’s because there’s more to the Power Mac G5 than just the G5 processor. Everything in the Power Mac G5 computer contributes to its overall excellence.

In fact, you shouldn’t think of the Power Mac G5 as just a processor—rather, as a developer, you should think of it as a platform. Power Mac G5 computers aren’t just fast, they’re capacious—they have much higher capacities for computation, memory, disk storage, and data transfer. Developers who understand and fully exploit this will be able to design creative new Power Mac G5-based applications that, until now, would have been impossible on any personal computer.

Where does the power of the Power Mac G5 platform reside? A quick tour of the architecture will help you to understand the new approach to optimization and what it means for you.

64-Bit G5 Processor

The G5 processor isn’t simply a faster version of the G4. Instead, it is a redesigned processor that implements the 64-bit architecture built into the original PowerPC when it was designed by the AIM Consortium almost a decade ago. The G5 processor draws on IBM’s considerable experience in processor design. In fact, it is based on the execution core of IBM’s POWER4 processor, which drives IBM’s high-end pSeries 690 servers.

A full description of how the G5 processor contributes to the superior performance and capacity of the Power Mac G5 platform would be an article in itself. The following characteristics of the G5 processor give some indication of its performance and capacity:

  • 64-bit-wide registers, data paths, and internal logic units make it possible for the PowerPC G5 processor to do 64-bit-wide integer and floating-point operations in one clock cycle.
  • The processor contains 12 discrete functional units that can hold up to 215 simultaneous instructions in various stages of execution.
  • Among its 64-bit wide internal logic units are two 64-bit floating-point units, two 64-bit integer units, and two 64-bit load/store units.
  • The architecture of the G5 processor provides full support for symmetric multiprocessing.

The initial Power Mac G5 product line offers models with 1.6GHz, 1.8GHz and Dual 2GHz PowerPC G5 processors; also, note that bus speeds and other components vary depending on the model.

A Faster, Smarter System Controller

The U3 system controller (which connects the G5 processor to the rest of the computer’s components) is a custom chip created using the same IBM 130-nanometer technology as the G5 processor itself. It supports point-to-point routing, which enables multiple subsystems to simultaneously exchange data with main memory without involving the G5 processor.

Faster Memory, and More of It

The U3 system controller also makes it possible for Power Mac G5 computers to use fast 400 MHz, 128-bit DDR (Double Data Rate) SDRAM. Power Mac G5 computers currently have either four or eight DIMM slots, which enables them to hold up to 8 GB of physical memory with today’s 1 GB DIMMs. As higher-capacity DIMMs become available, Power Mac G5 computers will be able to use them.

This significantly higher level of memory capacity will certainly be at the heart of at least some of the future breakthrough applications for the Power Mac G5. In today’s computers, extremely large data sets (for example, video and complex 3-D models) must reside on hard disks, forcing data accesses periodically into the millisecond range. When all the data can reside in physical memory, data accesses will always be in the nanosecond range. In addition, data residing in main memory is easier to manipulate than data stored on a hard disk.

1 GHz Frontside Bus

The Power Mac G5 computer connects the G5 processor to the system controller through a frontside bus that runs at up to 1 GHz. This enables a tremendous increase in data throughput. Dual-processor Power Mac G5 models include a separate frontside bus for each G5 processor; this gives them an extra speed advantage over dual-processor Intel computers, which force both processors to share a single bus.

Why You Should Optimize

With no changes whatsoever, most compiled software will run proportionately faster on Power Mac G5 computers simply because of the G5 processor’s higher clock speed and the computer’s higher data throughput and increased number of execution units. However, you can make your software run even faster by simply recompiling it. If you take further optimization steps (described later in this article), you may be able to get your software to run several times faster on a Power Mac G5 computer than it does on previous Power Mac computers.

Tools for Optimization

The software suite you absolutely must have is the Xcode Tools, the new development tools package from Apple that includes everything from the integrated development environment where you write, build and debug your applications, to human interface design tools, to performance optimization and debugging tools. The Xcode Tools include an updated compiler, gcc (GNU Compiler Collection) version 3.3, which Apple has augmented to work with Mac OS X and the G5 processor. The gcc 3.3 compiler includes a number of changes that are necessary to optimize code for the Power Mac G5 platform, including new compiler flags and much stricter adherence to the established language specifications than previous versions of gcc (see Technical Note TN2086: Tuning for the G5: A Practical Guide for details).

The performance analysis tools that come with Xcode fall into two categories: software-only, non-invasive tools, both command-line and graphical, that operate at the process level; and the Computer Hardware Understanding Development (CHUD) tools, which rely upon dedicated hardware features to operate. Here are the highlights of a couple of them.

For a starting point, investigate Sampler. An exploratory optimization tool, Sampler is a performance-measuring application that analyzes a program’s running behavior and its allocation of memory by stopping the program periodically to examine the function call stack. Sampler displays the functions that were most frequently seen while sampling was taking place. This information can help you locate those functions and sections of your code that are consuming large chunks of CPU time, as well as functions in your applications where excessive memory allocations are occurring.

Sampler is one of the non-invasive tools that operate at the process level. Its features allow you to understand overall running behavior and application state over time, with no modification to your application code.

Sampler’s hardware-based relative in the CHUD set is called Shark. This is an extremely valuable tool that also does time-based sampling of the computer running your software, telling you where the computer spends its time. The difference is that Shark can delve deep into the details of function usage, due to its measurement being linked directly to the hardware. It enables you to find the specific routines that will benefit the most from optimization.

When Shark is used in conjunction with the rest of Xcode, it can also display your routines’ source code and highlight the individual lines of source code that are consuming the most processor time. In many cases, Shark will also suggest what you might try to increase performance. Shark can also display the assembly-language code associated with your source code and show you execution details (for example, instruction groupings and processor stalls) that you can use to make assembly-level optimizations.

Note: An earlier version of Shark was named Shikari. Shark has been substantially upgraded specifically for use with the Power Macintosh G5 platform.

That provides a starting point for optimization tools. You should become familiar with Sampler and Shark as well as all the other performance analysis tools so that you can incorporate them into your standard development, debugging and quality assurance process. Full documentation about using these tools is installed when you install Xcode: see file:///Developer/Documentation/Performance/Performance.html for the top-level entry to this documentation set as well as Performance Fundamentals in the /Developer/Documentation/Performance directory, and on the ADC website.

Some Important Guidelines

There are some general optimization-related guidelines that you should consider first when deciding which level of optimization applies to you.

Re-optimize for Power Mac G5

If you implemented processor-specific optimizations on your software in the past, you’ll need to implement a similar level of G5-specific optimization on your software to ensure comparable performance on all Power Mac G5 computers. This is necessary because the very same code changes that maximize performance on one processor may interact adversely with another processor.

Harness Velocity Engine with vecLib

If your software does any amount of vector, matrix, or signal processing, you should seriously consider rewriting the appropriate code to use Apple’s vecLib framework, which gives you access to several vector-processing libraries, including BLAS (Basic Linear Algebra Subprograms) and the vDSP digital signal processing library. Using vecLib multiplies the benefits resulting from your effort:

  • The vecLib routines give you most of the performance benefits of Apple’s Velocity Engine without requiring the substantial learning and programming efforts needed to write custom code for it. In some situations, using vecLib will produce the most dramatic performance increase of any single optimization you can apply to your software, and the benefit/effort ratio will be extremely high.
  • Code written using vecLib doesn’t have to be rewritten for different processors, giving you both portability and high performance with one development effort. This is possible because Apple’s vecLib framework has separate versions customized for each of the “G-series” processors. This means that one programming effort produces code that is highly optimized for the G3, G4, G5, and future G-series processors.
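
As a minimal sketch of what calling into vecLib can look like, the function below computes a dot product with the standard BLAS routine cblas_sdot. The USE_VECLIB macro is a hypothetical switch invented for this example (it is not part of Apple's tools), and the header path and link option in the comments are assumptions about a Mac OS X build; without the macro, the code falls back to a plain loop so it builds anywhere.

```c
#include <stddef.h>

#ifdef USE_VECLIB
/* Assumption: building on Mac OS X and linking with -framework vecLib.
   USE_VECLIB is a hypothetical macro defined by the build, not by Apple. */
#include <vecLib/vecLib.h>
#endif

/* Dot product of two float arrays of length n. */
float dot_product(const float *x, const float *y, size_t n)
{
#ifdef USE_VECLIB
    /* cblas_sdot is the standard single-precision BLAS dot product;
       the two 1 arguments are the strides through x and y. */
    return cblas_sdot((int)n, x, 1, y, 1);
#else
    /* Portable scalar fallback so the sketch compiles without vecLib. */
    float sum = 0.0f;
    size_t i;

    for (i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
#endif
}
```

The calling code is identical on every processor; only the library behind cblas_sdot changes, which is the portability benefit described above.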

If you are willing and able to write your own code for Velocity Engine, see the Velocity Engine web page for more information.

Keep the Power Mac G5 Well-Fed

When optimizing, remember that Power Mac G5 computers are very hungry, very fast, and very sequential. This means that they consume very large amounts of data at one time, that they process it very quickly, and that nonsequential instructions and data accesses cause significant performance penalties. Many of the optimizations described below and in other Apple-supplied documentation cater to these characteristics. You should keep them in mind when you look for opportunities to optimize your software.
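
To illustrate the "very sequential" point, here is a small sketch (the function names are invented for this example) that sums the same matrix in two ways. Both functions return the same result, but the first walks memory in address order, while the second jumps by a full row between accesses. How large the difference is on your code is something to measure with the profiling tools rather than assume.

```c
#include <stddef.h>

/* The matrix is stored row by row in one contiguous block:
   element (r, c) lives at m[r * cols + c]. */

/* Sequential traversal: the inner loop touches consecutive addresses. */
double sum_row_order(const double *m, size_t rows, size_t cols)
{
    double sum = 0.0;
    size_t r, c;

    for (r = 0; r < rows; r++)
        for (c = 0; c < cols; c++)
            sum += m[r * cols + c];
    return sum;
}

/* Strided traversal: the inner loop skips 'cols' elements each step,
   so successive accesses land far apart in memory. */
double sum_column_order(const double *m, size_t rows, size_t cols)
{
    double sum = 0.0;
    size_t r, c;

    for (c = 0; c < cols; c++)
        for (r = 0; r < rows; r++)
            sum += m[r * cols + c];
    return sum;
}
```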

Be Wise When You Optimize

Optimization does not happen in a vacuum; it produces side effects that may affect your program in other, unacceptable ways. For example, a program optimized for speed alone may be too large on disk, or its larger size may cause additional disk accesses that negate the speed increase of the code itself.

For these and other reasons, you’ll probably find it necessary to optimize different parts of your code separately. The Xcode Tools enable you to add compiler flags on a per-module or per-file basis, thus giving you the ability to control how the compiler optimizes your program. You’ll need to test different optimization combinations to determine which ones enable you to meet your performance goals.

Achieving the best possible performance for your program involves more than just optimizing it for the Power Mac G5 platform. You must also optimize the program itself, including such program optimizations as:

  • ensuring that your program launches quickly
  • limiting your program’s memory and disk footprints
  • minimizing the time taken by disk accesses
  • making drawing operations as efficient as possible.

See the Performance section under For Further Information at the end of this article for resources to help you with this task.

Resist the urge to optimize the code that you intuitively “know” needs it. Profile your code for hot spots, evaluating the effects of optimization not just on time alone but on the benefit that the user will perceive from it.

Optimization, Level by Level

Because every technical task exists within larger technical and business contexts, only you can decide which optimizations you should perform on your software. Your decision will include such factors as what your software does, technical expertise, and what benefits you expect to see from the optimization process.

There are four levels of optimization for you to consider, starting with the easiest and working up to the most complex. In general, the lower the level, the more likely you are to implement the optimizations in that level. However, be aware that, even within the same level, different optimization tasks vary in difficulty, time to completion, and benefit/time ratio.

For details on these and other optimizations, see Technical Note TN2086: Tuning for the G5: A Practical Guide and Technical Note TN2087: PowerPC G5 Performance Primer.

Level 0 Optimizations: Definitely

Recompile your software, using the -O3 flag to incorporate processor-independent optimizations.

Examine your code for opportunities to consolidate multiple operations on small amounts of data into one operation on one large amount of contiguous data. It may make sense to preload larger amounts of data from remote sources (for example, reading an entire file into memory in one operation rather than line by line) or to use larger buffers.
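
One way to read an entire file in a single operation is sketched below using only the standard C library. The function name read_whole_file is invented for this example, and the approach assumes a regular file small enough to fit in memory.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads the whole file at 'path' into a newly allocated buffer using a
   single fread call. The buffer is NUL-terminated for convenience.
   Returns NULL on failure; the caller must free() the result. */
char *read_whole_file(const char *path, size_t *out_size)
{
    FILE *fp = fopen(path, "rb");
    char *buf = NULL;
    long size;

    if (fp == NULL)
        return NULL;

    /* Find the file length by seeking to the end. */
    if (fseek(fp, 0L, SEEK_END) != 0)
        goto fail;
    size = ftell(fp);
    if (size < 0)
        goto fail;
    if (fseek(fp, 0L, SEEK_SET) != 0)
        goto fail;

    buf = malloc((size_t)size + 1);
    if (buf == NULL)
        goto fail;

    /* One large read instead of many small ones. */
    if (fread(buf, 1, (size_t)size, fp) != (size_t)size)
        goto fail;
    buf[size] = '\0';

    fclose(fp);
    if (out_size != NULL)
        *out_size = (size_t)size;
    return buf;

fail:
    free(buf);
    fclose(fp);
    return NULL;
}
```

Once the data is in memory, your code can scan it line by line without returning to the file system for each line.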

Converting data from one type to another (for example, from string to integer) is even more resource intensive on the G5 than on earlier processors. Look for opportunities to minimize the amount of type conversion that your software does. Type conversions that can be done without memory accesses are significantly faster.
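
A simple way to reduce conversion work is to convert each value once and reuse the converted form. The sketch below is only an illustration with invented function names: text fields are parsed into integers a single time, and every later pass works on the integer array.

```c
#include <stddef.h>
#include <stdlib.h>

/* Convert every text field to a long exactly once, up front. */
void parse_fields(const char *const *fields, long *values, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        values[i] = strtol(fields[i], NULL, 10);
}

/* Later passes use the already-converted integers, so no string
   parsing happens inside this loop. */
long weighted_sum(const long *values, size_t n, long weight)
{
    long sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += values[i] * weight;
    return sum;
}
```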

Level 1 Optimizations: Probably

Recompile your software using the flags (as appropriate) that implement G5-specific optimizations.

Use Sampler to discover the routines where your software is spending most of its time (also known as “hot code” or “hot spots”). You may be able to improve this code’s performance by recompiling it using flags that unroll loops and replace subroutines with in-line code. Also look for opportunities to improve performance by rewriting the appropriate source code to be more efficient.
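
The sketch below shows the kind of hot loop this applies to; the function names are invented for the example. The small helper is marked static inline so the compiler can paste its body into the loop, and the loop itself is a candidate for unrolling. The flags named in the comment are standard GCC options; whether they help your code is something to confirm by profiling before and after.

```c
#include <stddef.h>

/* Small helper called once per element. 'static inline' invites the
   compiler to replace each call with the function body. */
static inline int clamp_to_byte(int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return v;
}

/* Hot loop: adjust the brightness of a buffer of 8-bit samples.
   Compiling this file with, for example, -O3 -funroll-loops
   -finline-functions asks GCC to unroll the loop and to inline
   suitable functions even when they are not marked inline. */
void brighten(unsigned char *pixels, size_t n, int delta)
{
    size_t i;

    for (i = 0; i < n; i++)
        pixels[i] = (unsigned char)clamp_to_byte(pixels[i] + delta);
}
```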

If your software makes significant use of the square-root function, a simple recompilation of your code using the appropriate flag will cause the compiler to invoke the G5 processor’s built-in square-root instruction. Depending on your situation, this simple, fast change may give your software a noticeable boost with virtually no effort involved.

Level 2 Optimizations: Possibly

Profile your running application using Shark and follow its recommendations for improving the performance of hot spots.

Some of Shark’s recommended optimizations require some understanding of the G5 processor’s inner workings, but one of its recommendations, improving instruction alignment, is easy to understand and quick to implement. The G5 processor is more sensitive than previous processors to misaligned instructions: performance suffers when certain key addresses are not aligned to 32-byte block boundaries. You can get the compiler to automatically align functions, loops, jumps, and jump targets to 32-byte block boundaries by compiling individual files with the appropriate optimization flags (for example, -falign-loops=32). Because such optimizations increase the size of the resulting binary code, you must apply them sparingly and monitor their overall effect on the size of your software.

If you have previously optimized your software for the G3 or G4 processors, remember to take your original processor-independent code and optimize it for the G5 processor.

Level 3 Optimizations: For Maximum Effect

These optimizations provide the maximum performance increase, but they require significant technical knowledge about the behavior of the G5 processor. In most cases, they also involve a significant programming effort.

By writing the appropriate code, you can maximize your software’s use of the G5 processor’s dual 64-bit floating-point units (FPUs), dual 64-bit integer units, and dual load/store units.
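
One common technique for keeping two floating-point units busy is to split a long dependency chain into independent chains. The sketch below (an illustration with an invented function name, not code from Apple's documentation) sums an array with two accumulators, so that consecutive additions do not have to wait on each other. Note that this changes the order of the additions, which can change the rounding of the result for general floating-point data.

```c
#include <stddef.h>

/* Sum an array using two independent accumulators. The additions into
   s0 and s1 do not depend on each other, so a processor with two
   floating-point units can work on both chains at the same time. */
double sum_two_chains(const double *a, size_t n)
{
    double s0 = 0.0;
    double s1 = 0.0;
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)          /* odd element left over */
        s0 += a[i];

    return s0 + s1;
}
```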

For the appropriate numeric operations, you will achieve the highest possible performance by writing custom code for the built-in Velocity Engine. This can be a demanding process, and also a processor-dependent one that must be revisited for each G5 processor you plan on supporting. Remember that you can get most of the benefits of the Velocity Engine with only a small fraction of the effort by using the vecLib framework.
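
To give a feel for what custom Velocity Engine code involves, here is a minimal sketch of a multiply-add loop written with the AltiVec C intrinsics vec_ld, vec_madd, and vec_st. The function name saxpy_sketch is invented for this example, and the code assumes that x and y are 16-byte aligned when the vector path is compiled in. On compilers without AltiVec support, only the scalar loop is built.

```c
#include <stddef.h>

#ifdef __ALTIVEC__
#include <altivec.h>
#endif

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
   Assumption: on the AltiVec path, x and y are 16-byte aligned,
   because vec_ld and vec_st ignore the low four address bits. */
void saxpy_sketch(float a, const float *x, float *y, size_t n)
{
    size_t i = 0;

#ifdef __ALTIVEC__
    /* Build a vector with 'a' in all four lanes. */
    float splat[4] __attribute__((aligned(16))) = {a, a, a, a};
    vector float va = vec_ld(0, splat);

    /* Process four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        vector float vx = vec_ld(0, x + i);
        vector float vy = vec_ld(0, y + i);

        vy = vec_madd(va, vx, vy);   /* a * x + y, four lanes at once */
        vec_st(vy, 0, y + i);
    }
#endif

    /* Scalar loop: handles the tail, or everything without AltiVec. */
    for (; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Even in this small case you have to manage alignment and leftover elements yourself, which is part of the effort that the vecLib routines take off your hands.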

Packaging your Optimizations

Code that has been optimized for the G5 by simple re-compilation will run without penalty on a G4. If you have done more in-depth, G5-specific tuning (levels 1, 2 and 3) then you will in all likelihood want to provide a separate binary. In extreme cases, you may decide that you need only offer one version of your software that runs on Power Mac G5 computers only. However, you’ll probably want to support most or all of the Macintosh product line, which means that you need to decide how best to deliver the right code to each of your customers. There are several ways to achieve this; the first is:

  • Create different versions of your software for each processor that you support. This requires that you maintain one code base per processor (three if you support the G3, G4, and G5), something you may not want to do.

It is possible for your software to query the computer on which it is running to see which processor-related features are available. You can design your software to isolate processor-dependent code and call the appropriate version as needed. This leads to two additional strategies for packaging your application:

  • For every function that calls processor-dependent binary code, have your code call the appropriate version. If such functions are needed frequently, using this approach may decrease execution speed and make your source code (cluttered with if...then constructs) less readable.
  • Isolate processor-specific functions into frameworks or shared libraries, then have your software load the appropriate version when it starts up. This enables you to write your main code without wrapping function calls in if...then constructs.
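
The second and third strategies share one idea: decide once which implementation to use, then call through that choice. The sketch below shows the idea with a function pointer selected at startup. Everything in it is hypothetical: the cpu_kind enumeration, the function names, and the assumption that your own startup code has already determined the processor type. How the processor is detected is deliberately left out.

```c
#include <stddef.h>

/* Hypothetical processor categories; your detection code would
   produce one of these at launch. */
typedef enum { CPU_GENERIC, CPU_G4, CPU_G5 } cpu_kind;

typedef void (*scale_fn)(float *data, size_t n, float k);

/* Processor-independent version. */
static void scale_generic(float *data, size_t n, float k)
{
    size_t i;

    for (i = 0; i < n; i++)
        data[i] *= k;
}

/* Stand-in for a G5-tuned version. In a real project this would live
   in a separate file or library built with G5-specific options; here
   it performs the same arithmetic so the sketch stays portable. */
static void scale_g5(float *data, size_t n, float k)
{
    size_t i;

    for (i = 0; i < n; i++)
        data[i] *= k;
}

/* Choose the implementation once; callers keep the returned pointer
   and avoid an if...then test on every call. */
scale_fn select_scale(cpu_kind cpu)
{
    switch (cpu) {
    case CPU_G5:
        return scale_g5;
    default:
        return scale_generic;
    }
}
```

With shared libraries or frameworks, the same selection happens when the library is loaded, and the main code simply calls the function by name.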

Summary

This article is designed to get you started on optimizing for the G5; see the Technical Notes and other documents listed at the end of the article for more details.

The G5 platform is important not just because of the G5 processor but also because of the operating system and the hardware around it. Together, these components implement next-level increases in computing power and data capacity. These increases deliver new opportunities to you, the developer—opportunities for new applications and even new categories of applications.

It’s essential that you consider carefully your goals and resources before you begin to modify or design your applications.

Although today’s applications, unchanged, will run faster on Power Mac G5 computers, you can add significant additional performance gains by optimizing your software for the Power Mac G5 platform. There are some optimizations that you should definitely implement on all your Macintosh software and others that you probably should implement. If you have performed processor-specific optimizations on your current software, you will need to implement G5-specific optimizations to make your software ready for the Power Mac G5 platform.

For Further Information

Overview Information

  • ADC Power Mac G5 web page
  • Power Mac G5 Performance White Paper (PDF)
  • Power PC G5: The World’s First 64-Bit Desktop Processor White Paper (PDF)
  • Power PC G5 Developer Note (HTML) (PDF)

Optimization

Performance

Posted: 2005-04-29