# THE INSIDERS' GUIDE TO MICROPROCESSOR HARDWARE

# Motorola Unfolds ColdFire Roadmap: Clock Doubling, Superscalar Cores, MAC Units Lie Ahead

### by Jim Turley

Motorola has started to reveal its future plans for the ColdFire product family, pointing the way to more performance and more integration in the years ahead. The company's roadmap indicates that clock-doubled ColdFire chips will be available by next year, with advanced superscalar derivatives before the end of the decade. A new multiply-accumulate unit gives ASIC customers pseudo-DSP capability for the first time in the long history of the 68K architecture.

The company also released another member of the family, the inexpensive 5204. The new chip drops the entry price of the ColdFire family to less than \$9 in volume for the 16-MHz version. The product line now numbers seven processors, most with application-specific integrated logic. Motorola's plans call for continued development of application-specific consumer and computer devices, leveraging the company's expertise in merging processors and peripherals.

### **Roadmap Plots Four Major Enhancements**

Motorola evidently feels ColdFire chips are old enough now to be classified by processor generation. The original 5102 (*see* **081405.PDF**) is the sole example of the first generation, which offers complete 68EC040 instruction-set compatibility. The remaining members are all part of the second generation, characterized by the subset of the 68K instruction set they implement and their debug support.

Clock-doubled devices will debut early next year, as Figure 1 shows, establishing the third product generation. These are expected to be followed in mid-1998 by devices based on a new two-way superscalar core. This would mark only the second time a 68K CPU has achieved superscalar execution, after the 68060, a fabulously complex three-issue machine. Further developments call for enhancing the superscalar core by lengthening the execution pipeline and reducing the number of clocks per instruction, finally moving to "superpipelining" sometime in the next millennium.

Cache sizes are slated to increase as well, from the small 2K instruction caches in production now to 8K unified caches in the 5300-series parts. Dual 8K caches should debut at the same time as the first superscalar core, with dual 16K caches appearing on the 5500 series around 1999.

If all goes according to plan, Motorola will have a plethora of RISC and CISC cores at its disposal for driving application-specific controllers, with ColdFire, 68K, and PowerPC.

### ColdFire Overtakes 68K in 1998

Currently, ColdFire performance is about equal to that of a 68030. Even the fastest ColdFire chip, the 33-MHz 5102, lags the 68040 on which it is based and is well behind the 68060. Motorola is not planning to upgrade its moribund 680x0 processors, so the 68030, '040, and '060 are stationary targets. According to this new roadmap, ColdFire performance will overtake the '040 sometime in 1997 and surpass the '060 around 1998.

In the near term, ColdFire's gains will come from process improvements as capacity becomes available in Motorola's 0.35-micron fabs for relatively low-margin processors. This change will push clock frequencies up to around 80 MHz, necessitating half-speed (or slower) bus interfaces in 1997 and 1998. It will also force a drop in supply voltage, to 3.3 V, hastening the end of 5-V parts for historical 68K users. When ColdFire encounters 0.25-micron processes, sometime in 1998, voltages will drop again, to 2.5 V.

**Figure 1.** Motorola's roadmap for ColdFire calls for four more major improvements over the next five years, including superscalar execution before the end of the decade. (Source: MicroDesign Resources, Motorola) \*based on Dhrystone 1.1

The superscalar core will initially maintain clock rates and process technologies at 1997 levels, reaping its performance benefits through multiple execution units. The net gain will depend upon the efficiency of the execution units and the extent to which 68K-derived code can take advantage of parallel execution.

Motorola won't be cutting back on 68K production any time soon and is, in fact, projecting 15% annual growth for the next five years. While the company is happy to supply 68K parts into existing embedded systems, it encourages new designers to go with ColdFire. The 5102, for example, already offers a compelling price/performance advantage over the 68030, making the latter part unattractive for new designs that don't need an MMU or floating-point support. Such cannibalization of the parent product line will continue for the rest of the decade.

Motorola's roadmap asserts that ColdFire performance will increase roughly tenfold in the space of five years, an increase of about 60% per year—an average rate of return, by industry standards. This is comforting news for current or prospective ColdFire users, but it won't suddenly thrust ColdFire to the forefront of embedded performance. Presumably, MIPS, SuperH, ARM, and PowerPC vendors will also reap approximately the same performance gains year by year, keeping relative performance positions the same for the foreseeable future.
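The 60% figure is simply the compound rate implied by a tenfold gain over five years. A minimal Python sketch of that arithmetic (the function name is illustrative, not anything from Motorola's roadmap):

```python
def annual_growth(total_gain: float, years: int) -> float:
    """Return the compound annual growth rate implied by a total gain."""
    return total_gain ** (1.0 / years) - 1.0

# Roadmap claim: roughly 10x over five years.
rate = annual_growth(10.0, 5)
print(f"{rate:.1%} per year")  # 58.5% per year
```

Rounding 58.5% up gives the "about 60%" quoted above.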

ColdFire will continue to hold its position below Motorola's embedded PowerPC chips as the performance of both families increases. This strategy allows the company some pricing flexibility and an opportunity to reuse its I/O library.

### MAC Unit Marks First Enhancement to 68K ISA

Although ColdFire is typified as a subset of the 68K instruction set, Motorola is extending the core definition with a MAC (multiply-accumulate) unit. The all-new MAC cell is an optional part of the ColdFire cell library available to the company's FlexCore ASIC customers. The MAC unit is slated to appear in future ColdFire parts sometime next year.

FlexCore customers can choose either the conventional ColdFire core or the ColdFire "2M" core with the MAC unit. The MAC unit adds about 8,500 gates to the ColdFire core, an increase of about 10% in terms of die area.

The MAC unit adds two new instructions, listed in Table 1, that perform $16 \times 16 + 32 \rightarrow 32$-bit multiply-accumulate operations with a single-cycle repeat rate. A second pass through the MAC hardware transparently supports $32 \times 32 + 32 \rightarrow 32$-bit operations. In addition to the new MAC (multiply-add) and MSAC (multiply-subtract) instructions, the MOV instruction was expanded and the MULU and MULS duties shifted to the MAC unit.

These new instructions are the first significant enhancement to the 68K instruction set since 1984, when the 68020 added more than a dozen bit-field, BCD, function-call, and arithmetic instructions to the ISA (many of which were subsequently dropped in later generations).

### Programmers Gain Three Registers

Three programmer-visible registers are added to the architecture with the MAC option: the accumulator (ACC), the MAC status register (MACSR), and the mask register (MASK). The results of all MAC operations are deposited in the 32-bit accumulator; a separate MOV instruction is required to transfer the result into the general-purpose register file, a classic technique that MIPS processors also use.

The MAC status register contains four status bits for overflow, negative, zero, and carry flags. Two additional bits control saturation behavior in case of overflow and whether operands are considered signed. Copying the contents of the MACSR to the condition-code register can make conditional branches dependent on the results of the MAC instructions. Alternatively, the MACSR can be copied to a general-purpose register and examined from there.
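To make the accumulator and status behavior concrete, here is a small Python model of a 16 × 16 + 32 → 32 multiply-accumulate with optional saturation. It is a sketch under stated assumptions, not Motorola's definition: the class and field names are invented, operands are treated as signed only, the carry flag is omitted, and the exact saturation and flag rules are guesses at typical behavior.

```python
from dataclasses import dataclass

INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1

def to_signed16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

@dataclass
class MacUnit:
    acc: int = 0            # 32-bit accumulator (ACC)
    saturate: bool = True   # hypothetical stand-in for the MACSR saturation bit
    overflow: bool = False  # V flag
    negative: bool = False  # N flag
    zero: bool = True       # Z flag

    def _accumulate(self, product: int) -> None:
        total = self.acc + product
        self.overflow = not (INT32_MIN <= total <= INT32_MAX)
        if self.overflow and self.saturate:
            # Clamp to the nearest representable 32-bit value.
            total = INT32_MAX if total > 0 else INT32_MIN
        else:
            # Wrap to 32 bits, as plain two's-complement hardware would.
            total = (total + (1 << 31)) % (1 << 32) - (1 << 31)
        self.acc = total
        self.negative = total < 0
        self.zero = total == 0

    def mac(self, rx: int, ry: int) -> None:
        """ACC += rx * ry (signed 16 x 16)."""
        self._accumulate(to_signed16(rx) * to_signed16(ry))

    def msac(self, rx: int, ry: int) -> None:
        """ACC -= rx * ry (signed 16 x 16)."""
        self._accumulate(-(to_signed16(rx) * to_signed16(ry)))

unit = MacUnit()
unit.mac(300, 200)    # acc = 60000
unit.msac(100, 100)   # acc = 50000
print(unit.acc, unit.overflow)
```

With saturation enabled, an accumulation that would exceed the 32-bit range pins the accumulator at the limit and sets the overflow flag; with it disabled, the model wraps around instead.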

The mask register works with any of the new instructions that reference memory. Programmers can use the 16-bit mask register to limit the magnitude of an address pointer, effectively implementing a circular buffer from a conventional 32-bit address reference. With a properly aligned table of coefficients and the 68K's autoincrement addressing mode, the mask register makes an effective tool for addressing circular queues common to filters, servo code, and other signal-processing applications.
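The circular-buffer trick can be illustrated in a few lines of Python. This is only a sketch: how the hardware combines the mask with the address is an assumption here (the low 16 address bits are ANDed with MASK, the upper 16 bits pass through unchanged), and the function names and addresses are hypothetical.

```python
def masked_address(address: int, mask: int) -> int:
    """Apply a 16-bit mask to the low half of a 32-bit address.

    Assumption: the upper 16 address bits pass through unchanged and
    the lower 16 bits are ANDed with MASK.
    """
    return (address & 0xFFFF0000) | (address & mask & 0xFFFF)

def walk(base: int, mask: int, step: int, count: int) -> list:
    """Return the addresses touched by `count` autoincrementing accesses."""
    pointer = base
    touched = []
    for _ in range(count):
        touched.append(masked_address(pointer, mask))
        pointer += step  # the pointer itself keeps incrementing
    return touched

# A 16-byte table aligned at 0x00020000; a mask of 0x000F keeps every
# access inside the table even as the pointer runs past its end.
addresses = walk(base=0x00020000, mask=0x000F, step=4, count=6)
print([hex(a) for a in addresses])
```

The fifth and sixth accesses land back on the first and second entries, which is why the table must be aligned to a power-of-two boundary: the mask can only clear address bits, never set them.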

| Opcode | Mnemonic                                  | Description                                    |
|--------|-------------------------------------------|------------------------------------------------|
| Axxx   | MAC Rx, Ry, [-1/0/+1]                     | Multiply-add; optional shift left/right 1      |
| Axxx   | MSAC Rx, Ry, [-1/0/+1]                    | Multiply-subtract; optional shift left/right 1 |
| Axxx   | MAC Rx, Ry, [-1/0/+1], \<addr\>, Rz       | Multiply-add and load register                 |
| Axxx   | MSAC Rx, Ry, [-1/0/+1], \<addr\>, Rz      | Multiply-subtract and load register            |
| Cxxx   | MULU \<addr\>, Dn                         | Multiply, unsigned                             |
| Cxxx   | MULS \<addr\>, Dn                         | Multiply, signed                               |
| A1xx   | MOV.L Rx/imm32, ACC                       | Copy register/immediate to accumulator         |
| A18x   | MOV.L ACC, Rx                             | Copy accumulator to register                   |
| ADxx   | MOV.L Rx, MASK                            | Copy register to mask                          |
| AD8x   | MOV.L MASK, Rx                            | Copy mask to register                          |
| A9xx   | MOV.L Rx/imm32, MACSR                     | Copy register/immediate to status register     |
| A98x   | MOV.L MACSR, Rx                           | Copy status register to register               |
| A9C0   | MOV.L MACSR, CCR                          | Copy status register to condition codes        |

**Table 1.** The new ColdFire 2M core adds two multiply-accumulate instructions and modifies a handful of others. Each new MAC operation uses a pair of 16-bit operands from any general-purpose 32-bit register and deposits the result in a new 32-bit accumulator.

The MAC unit also alters the behavior of conventional integer multiplication operations. On all previous ColdFire and 68K implementations (except the 68060), integer multiplication is iterative, with an early-out algorithm. Thus, $16 \times 16$-bit multiplies on the 68000 take 38–70 cycles, depending on the value of the multiplicand. With a ColdFire 2M core, all $16 \times 16$-bit multiplies execute in four clocks, while $32 \times 32$-bit operations take just six clocks.

The old MULU and MULS instructions execute in the same amount of time as MAC and MSAC (three-cycle latency with single-cycle throughput) because they now use the same hardware. However, the multiply instructions appear to be slower because they automatically transfer their results into the general-purpose register file. For MAC and MSAC, transferring the results requires an explicit MOV instruction, incurring a three-clock penalty.

### MAC Expands 68K Architecture

The ColdFire 2M core takes advantage of the architecture's 32-bit registers to feed the 16-bit MAC unit. Programmers can select the high or low half of the two source registers for each MAC operation, effectively doubling the number of registers available. This way, two pairs of operands can be loaded into a single pair of registers, allowing two MAC operations back to back and eliminating an intervening load from memory. For DCTs and other data-intensive functions that depend on traversing large tables, a MOVEM (move multiple) instruction can load the entire register file in a single burst transaction.
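The half-register selection amounts to treating each 32-bit register as two 16-bit slots. A minimal Python sketch of two back-to-back MACs fed from a single register pair (the helper names are invented for illustration, and signed operands are assumed):

```python
def signed16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def pack(upper: int, lower: int) -> int:
    """Pack two 16-bit values into one 32-bit register image."""
    return ((upper & 0xFFFF) << 16) | (lower & 0xFFFF)

def half(register: int, upper: bool) -> int:
    """Select the signed upper or lower 16-bit half of a register."""
    return signed16(register >> 16 if upper else register)

# Two samples in one register, two coefficients in another.
samples = pack(3, -4)
coeffs = pack(10, 20)

acc = 0
acc += half(samples, True) * half(coeffs, True)    # 3 * 10
acc += half(samples, False) * half(coeffs, False)  # -4 * 20
print(acc)  # -50
```

Both products come from registers that were loaded once, which is the point of the scheme: the second MAC needs no intervening memory access.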

The syntax of the MAC and MSAC instructions allows the 32-bit product to be shifted left or right by one bit prior to being added to (or subtracted from) the accumulator. This little quirk allows programmers to squeeze an extra bit of precision from intermediate results without first scaling a table of coefficients.
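One common use of such a shift is fractional (Q15) arithmetic, where the product of two Q15 values is a Q30 number and a one-bit left shift realigns it to Q31. The Python sketch below illustrates that idea only; the mapping of +1 to a left shift and -1 to a right shift is an assumption, since Table 1 does not say which sign selects which direction, and the helper names are hypothetical.

```python
Q15_ONE = 1 << 15

def to_q15(x: float) -> int:
    """Convert a value in [-1, 1) to Q15 fixed point."""
    return int(round(x * Q15_ONE))

def mac_product(a: int, b: int, shift: int = 0) -> int:
    """16 x 16 product with an optional one-bit shift (+1 left, -1 right)."""
    product = a * b
    if shift == 1:
        return product << 1
    if shift == -1:
        return product >> 1
    return product

a, b = to_q15(0.5), to_q15(0.25)
q30 = mac_product(a, b)            # Q15 * Q15 gives Q30
q31 = mac_product(a, b, shift=1)   # left shift realigns to Q31
print(q30 / (1 << 30), q31 / (1 << 31))  # 0.125 0.125
```

Both results represent 0.125, but the shifted product uses the full 32-bit fractional range of the accumulator without rescaling the coefficient table beforehand.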

The MAC and MSAC syntax also supports an unusual MAC-with-load format that dispatches a multiply-accumulate operation and a memory-to-register load simultaneously. This function is helpful for loading a string of coefficients or data points during impulse-response filtering. The MAC executes in one cycle as usual, but the load requires at least two cycles, or more if the data is off-chip. ColdFire's execute and load/store pipelines are interlocked, so execution is tied up until the load completes. This effectively halves (best case) the throughput of the MAC. On the other hand, the loaded data is available to the instruction immediately following the MAC-with-load operation, avoiding a load-use penalty.
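The throughput penalty is easy to estimate. The sketch below is a back-of-the-envelope Python model, not a cycle-accurate one: it assumes each MAC-with-load occupies the interlocked pipelines for the longer of the MAC (one cycle) and the load, ignores pipeline fill and the final accumulator transfer, and uses a made-up five-cycle figure for off-chip memory.

```python
def mac_with_load_cycles(taps: int, load_cycles: int = 2) -> int:
    """Cycles for `taps` MAC-with-load operations.

    Assumes each operation ties up the interlocked pipelines for the
    longer of the MAC (1 cycle) and the load (`load_cycles`).
    """
    return taps * max(1, load_cycles)

def mac_only_cycles(taps: int) -> int:
    """Cycles for `taps` register-only MACs at a single-cycle repeat rate."""
    return taps

taps = 32
best = mac_with_load_cycles(taps)         # on-chip data, two-cycle load
slow = mac_with_load_cycles(taps, 5)      # hypothetical off-chip latency
print(mac_only_cycles(taps), best, slow)  # 32 64 160
```

For a 32-tap filter, the best case doubles the cycle count relative to register-only MACs, matching the "halves the throughput" estimate; slower memory scales the cost linearly.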

### **Dormant A-Line Opcodes Pressed Into Service**

The new MAC and MSAC instructions, as well as the new variants of MOV, are mapped onto opcodes that were undefined in every previous 68K and ColdFire implementation. ColdFire has always used a proper subset of the 68040 instruction set, with "unneeded" instructions removed to reduce core complexity. This characteristic makes it fairly simple for compiler writers to adapt their products to ColdFire by restricting the allowable instructions.

**Price & Availability.** The MCF5204 is sampling now, with production scheduled for 4Q96. In 10,000-unit quantities, the 5204 is priced at \$8.95, \$9.95, and \$11.38 for versions running at 16, 25, and 33 MHz, respectively. For more information, contact Motorola (Austin, Texas) at 512.891.6701 or visit the Web at www.mot.com/coldfire.

The new instructions all use A-line opcodes (so called because of their position in a hexadecimal opcode map), which previously caused faults if executed. Many 68K operating systems and real-time kernels use A-line traps to call system-level functions. Assembly-language programmers will have to evaluate carefully which reserved opcodes, if any, are now legitimate instructions.
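As a starting point for that evaluation, a programmer might scan existing code for opcode words whose top four bits are 1010 (hexadecimal A). The Python sketch below shows the test; the function names and the sample instruction stream are hypothetical, and a real tool would need a proper disassembler, since a raw scan cannot tell opcode words from extension words or data.

```python
def is_a_line(opcode: int) -> bool:
    """True if a 16-bit opcode word falls in the A-line space ($Axxx)."""
    return ((opcode >> 12) & 0xF) == 0xA

def find_a_line(words):
    """Return (index, word) pairs for A-line words in a list of opcode words."""
    return [(i, w) for i, w in enumerate(words) if is_a_line(w)]

# Hypothetical stream: NOP ($4E71), an A-line word ($A9C0), RTS ($4E75).
stream = [0x4E71, 0xA9C0, 0x4E75]
for index, word in find_a_line(stream):
    print(f"word {index}: ${word:04X}")
```

On an older 68K or ColdFire part, the flagged word would raise an A-line exception; on a 2M core, Table 1 lists \$A9C0 as the MOV.L MACSR, CCR instruction, so a kernel that relied on that trap would behave differently.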

### ColdFire Keeps Changing to Meet the Market

The new MAC unit is a welcome and overdue addition to the venerable 68K instruction set. Most microprocessor vendors have been busy the past few years adding pseudo-DSP and multimedia-processing instructions. MIPS, ARM, SuperH, and others were either designed with or have added multiply-accumulate capabilities. Without some concession to signal processing, Motorola would have been left with a serious hole in its product line.

The company hopes ColdFire 2M will displace DSPs in high-volume applications like hard disk drives, which typically use both a microcontroller and a separate DSP for servo control. Benchmarking efforts within Motorola show MAC-equipped ColdFire processors performing well against dedicated DSPs like TI's popular 'C52.

ColdFire 2M's MAC isn't perfect. The accumulator is grafted onto the architecture in an awkward, nonorthogonal manner. Transferring its contents to a normal register takes 1–3 clock cycles, and using the MAC condition codes generally involves trashing the usual condition-code register. It performs best with 16-bit precision, which is probably adequate for many small-signal applications, but several ARM and MIPS vendors offer better throughput with 32-bit precision and higher clock rates. On the plus side, ColdFire's inline shifter is reminiscent of ARM's, and its combination MAC-with-load operation should accelerate filtering operations when used with on-chip memory.

### ColdFire 5204 Lowers Price Point

Motorola's other ColdFire-related announcement is almost anticlimactic, coming as it does at the low end of the family's price/performance range. The new 5204 lowers the cost of entry to less than \$10 for a no-frills ColdFire processor.

| Processor                  | 5102    | 5202    | 5203    | 5204    | 5206    | 5266    | 5267    |
|----------------------------|---------|---------|---------|---------|---------|---------|---------|
| Min clock                  | 20      | 16      | 16      | 16      | 16      | 25      | 27      |
| Max clock                  | 33      | 33      | 33      | 33      | 33      | 25      | 27      |
| I-Cache                    | 2K      | 2K      | 2K      | 512     | 512     | 512     | 512     |
| D-Cache                    | 1K      | unified | unified | none    | none    | none    | none    |
| RAM                        | none    | none    | none    | 512     | 512     | 512     | 512     |
| MIPS*                      | 36      | 27      | 25      | 17      | 13.5    | 10      | 11      |
| Address bus                | 32 bits | 32 bits | 32 bits | 22 bits | 28 bits | 28 bits | 28 bits |
| Data bus                   | 32 bits | 32 bits | 16 bits | 16 bits | 32 bits | 16 bits | 32 bits |
| Sync/async                 | sync    | sync    | sync    | async   | both    | both    | both    |
| Multiplexed?               | yes     | yes     | yes     | no      | no      | no      | no      |
| Burst?                     | yes     |
| Bus sizing?                | no      | yes     | yes     | yes     | yes     | yes     | yes     |
| UARTs                      | none    | none    | none    | 1       | 2       | 2       | 2       |
| Timers                     | none    | none    | none    | 2       | 2       | 2       | 2       |
| Chip selects               | none    | none    | none    | 6       | 8       | 5       | 8       |
| Interrupt ctrl             | no      | no      | no      | yes     | yes     | yes     | yes     |
| I<sup>2</sup>C interface   | no      | no      | no      | no      | yes     | yes     | yes     |
| DRAM ctrl                  | no      | no      | no      | no      | yes     | yes     | yes     |
| MPEG2 I/F                  | no      | no      | no      | no      | no      | yes     | yes     |
| Power (typ)                | 0.8 W   | 0.2 W   | 0.2 W   | 0.28 W  | 0.33 W  | n/a     | n/a     |
| Package                    | TQFP    | TQFP    | TQFP    | TQFP    | PQFP    | PQFP    | PQFP    |
| Pins                       | 144     | 100     | 100     | 100     | 160     | 208     | 208     |
| Process                    | 0.6µ    | 0.65µ   | 0.65µ   | 0.65µ   | 0.65µ   | 0.65µ   | 0.65µ   |
| Metal                      | 3 M     | 3 M     | 3 M     | 3 M     | 3 M     | 3 M     | 3 M     |
| Voltage                    | 3.3 V   | 5 V     | 5 V     | 5 V     | 5 V     | 5 V     | 5 V     |
| Availability               | Now     | Now     | Now     | 4Q96    | 1Q97    | 1Q97    | 2Q97    |
| Price (10K)                | \$15–25 | \$10–13 | \$10–13 | \$9–11  | n/a     | n/a     | n/a     |

**Table 2.** Motorola's ColdFire product line now includes seven processors with varying degrees of integrated logic. The low-cost 5204 merges the data bus and packaging of the 5203 with the cache and on-chip logic of the 5206. \*Dhrystone 1.1

The new part is a cross between the 5203 and the 5206, with the former's 16-bit data bus and small, 100-lead package and the latter's tiny 512-byte instruction cache, on-chip RAM, and peripherals. As such, the 5204 is configured as an inexpensive, entry-level chip, somewhat like IBM's newest PowerPC 401GF (*see* **100802.PDF**) or IDT's R3041 MIPS processor, with integration. Apart from some 386SX chips and Motorola's paleolithic but dirt-cheap 68EC000, there aren't many 32-bit CPUs available for less than \$10. The 33-MHz 5204 will easily outpace a 25-MHz 'EC000 but won't keep up with a 50-MHz 401GF. For designers with an attachment to the 68K seeking moderate performance, easy integration, and a bargain-basement price, the 5204 makes a good first choice.

### **Bus Interface Keeps Changing**

Unfortunately, hardware compatibility is not among ColdFire's virtues. Few of the seven existing ColdFire processors are bus-compatible with any of the others, using different protocols and timings and supporting different transaction types, as Table 2 shows. No two are pin-compatible, even though they are available in similar packages. The 5204 continues this tradition by introducing yet another bus interface.

The 5102, for example, has a synchronous 68040-style bus; the 5202 and 5203 have a synchronous multiplexed bus; the 5206 has a nonmultiplexed bus; and the 5204 has an asynchronous bus with pseudo-burst capability.

Motorola defined each chip for a particular market niche with different needs regarding hardware compatibility; in the case of the 5204, the bus was designed to lure 68340 users looking for higher performance. The company is considering a single, standardized ColdFire bus, but such a move is still far off and would probably introduce yet another interface into the mix. For the time being, sharing PCB designs among ColdFire projects simply is not an option.

IDT is in exactly the opposite position: it offers no fewer than seven different MIPS-compatible chips with exactly the same bus, package, and pinout. This shrewd tactical maneuver allows the company to attract customers with \$10 chips while promising painless upgrades to \$40 parts that have FPUs, MMUs, and larger caches.

### Practicality Over Glamour for the 1990s

As Motorola's value microprocessor line, ColdFire is doomed to lag the industry's—and even its owner's—advances in architecture and first-run process technology. Like Intel's i960 family, ColdFire will have to survive on second-best fab lines, inexpensive packages, and relatively low-budget marketing, with design decisions driven by cost expediency rather than by high-minded performance concerns.

Even so, Motorola has shown it's not willing to let ColdFire languish from benign neglect. Fulfilling the company's roadmap will involve significant engineering and better process technology than is available today.

Certainly other embedded microprocessor architectures will advance during that time, too, perhaps in similar directions. ARM, MIPS, Hitachi, and others are all projecting rapid performance growth. ColdFire shows no signs of overtaking PowerPC, and that is part of Motorola's plan. The company's strategy calls for maintaining full-on 68K chips at the low end for compatibility, positioning ColdFire in the middle for cost-sensitive users, and letting PowerPC push the front of the performance curve. The 68K market is gravy; those designs were amortized years ago and are selling well. ColdFire chips will be the processors of choice for midrange users and ASIC customers.

All of which means ColdFire could well inherit a large portion of the huge 68K legacy in the years to come. Certainly some portion of that customer base will defect to newer RISC designs for performance reasons. But if Motorola fulfills its plans and keeps pace with industry advances, ColdFire will remain the logical choice for thousands of embedded designers.

At this rate, it appears that ColdFire will hold an even better position in 2001 than it does in 1996: a solid workhorse that garners little glory but gets enough of the winnings to keep coming back for more.