3 FPGA Development


3.1 FPGA Architecture

Programmable devices have moved up the technology curve and now boast gate counts well above the half-million range, making even very large projects realizable.
User-programmable logic options for gate counts between 1,000 and 250,000 gates include the programmable devices known as field-programmable gate arrays, laser-programmable gate arrays, and complex PLDs; one-time-programmable devices include mask-programmable gate arrays, more typically referred to as ASICs.

After completing the work at RTL level, part of this thesis was to decide which FPGA technology and product to use, or whether a combination of solutions would make sense. This included identifying the advantages of, and differences between, the available FPGAs. Furthermore, it was necessary to contact several semiconductor distributors for information on availability, prices, and manufacturer lead times.

Programmable devices vary in architecture and technology (see table 3.1). Some are RAM-based and therefore volatile: they lose their data when power is turned off. Others are nonvolatile and keep their programmed data even when powered down. Those devices generally rely on either antifuse or flash technology. The most popular programmable device architectures are the traditional product-term macrocell and the look-up table (LUT). Other devices, like the Actel and Quicklogic FPGAs, use various proprietary architectures.

table 3.1 Characteristics of programmable logic [11]
Device             Type   Architecture            Technology
Actel SX           FPGA   Logic Macro             Antifuse
Altera Apex        FPGA   macrocell + LUT + EAB   SRAM
Altera Flex        FPGA   LUT + EAB               SRAM
Altera Max         CPLD   macrocell               EE
Lucent ORCA        FPGA   LUT                     SRAM
Quicklogic PASIC   FPGA   Logic Macro             Antifuse
Xilinx XC4000      FPGA   LUT                     SRAM
Xilinx XC9500      CPLD   macrocell               Flash

The first important constraint for this project was to use field-programmable logic devices that offer on-site programmability, so that different design configurations and alternatives can be tested and debugged.
Second, the Barracuda is designed for use in the automotive sector. This fixed the required supply voltage at 5 V, since the Barracuda prototype board had to be implemented as close as possible to the real Barracuda MCU.
Third, Motorola already had some experience with Altera FPGAs and the Altera FPGA design software from past projects.
Meanwhile, Altera had introduced the new FLEX 10KE family of PLDs, based on a 0.25-micron process, with new programmable delay and phase-locked loop (PLL) features. Enhancements in the FLEX 10KE family include dual-port RAM in configurations up to 16 bits wide.
Altera devices offer MultiVolt I/O operation for compatibility with devices of different voltages. The 3.3-V FLEX 10KA devices can interface with 2.5-V, 3.3-V, and 5.0-V devices. FLEX 10KE devices, which operate at a 2.5-V supply voltage, interface with 5.0-V, 3.3-V, and 2.5-V devices. This meets the requirement that the board be usable in an automotive environment with a 5-V supply voltage. With 6,656 logic elements, 65,536 bits of on-chip RAM, and 2.5-V operation, the FLEX 10KE device enables higher performance and lower power consumption than are available with 3.3-V, 0.30-micron PLDs. The programmable delay and PLL features enhance I/O performance and improve pin-to-pin timing. With a 2.5-V core and at least 130,000 gates, the EPF10K130E and EPF10K200E best meet the requirements of the Barracuda design. The author's decision to use Altera FLEX 10KE FPGAs for the Barracuda project is based on the arguments listed in table 3.2.

table 3.2. Arguments to use Flex10k Devices
Feature                                            Benefit
100 MHz and above system performance               The Barracuda chips will run at 20 MHz; the ProtoBoard is intended to run at 1 to 10 MHz.
Density from 10,000 to 250,000 gates               The largest module (CORE) requires 150,000 gates, so a self-contained module need not be split across multiple devices.
Embedded array blocks                              RAM, ROM, and FIFO megafunctions are needed for one interface module (MSCAN) that implements on-chip RAM.
MultiVolt I/O operation: 5.0 V, 3.3 V, and 2.5 V   Required for the mixed-voltage ProtoBoard.

In addition, the FLEX 10K family has fast and predictable performance, and in-circuit reconfigurability (ICR) required for the Barracuda project. The chips are programmed using tools that run on personal computers or engineering workstations.

table 3.3. FLEX 10K Device Features
Feature                          EPF10K30E   EPF10K130E   EPF10K200E
Typical gates (logic and RAM)    30,000      130,000      200,000
Logic elements (LEs)             1,728       6,656        9,984
Logic array blocks (LABs)        216         832          1,248
Embedded array blocks (EABs)     6           16           24
Total RAM bits                   24,576      65,536       98,304

As far as the supply voltage of the board is concerned, FLEX 10K devices, which operate at 5.0 V, would be the natural choice for the project. On the author's proposal, FLEX 10KE devices were nevertheless selected, because the 2.5-V, 0.25-micron FLEX 10KE devices are on average 20% to 30% faster than the equivalent 0.35-micron FLEX 10KA devices, which operate at 3.3 V. Similarly, FLEX 10KA devices are on average 20% to 30% faster than the 0.5-micron FLEX 10K devices.
Moreover, the FLEX 10K embedded array can be used to create ROMs, FIFOs, and asynchronous, synchronous, and dual-port RAM, which is necessary to emulate the SRAM of one interface module.

Concerning the timing constraints of the board, Altera's ClockLock programming option uses on-chip PLL circuitry to reduce clock delay and skew, increasing I/O performance by up to 35 percent. The ClockBoost feature provides internal 2x clock multiplication, doubling datapath bandwidth and reducing the use of logic resources.

The EPF10K130E device supports QFP packages, which are preferred at the MCU Design Center Munich because they are ideal for prototyping. Both the EPF10K130E and the EPF10K200E are available in the 240-pin PQFP package required for the Barracuda design.

Concerning availability, according to information from several semiconductor distributors, both the EPF10K130E and the EPF10K200E are available at a reasonable price (see chapter 4 for more details). One restriction emerged: these devices had a manufacturer lead time of at least 12 weeks. It would therefore not be possible to manufacture and assemble the Barracuda PreSilicon emulation board within the time frame of this thesis.

Concerning software support, the EPF10K130E and EPF10K200E devices with PLL circuitry are supported by the MAX+PLUS II development system. After compilation, the MAX+PLUS II software can simulate functionality and timing and provide timing analysis.

Summing up all arguments described above, we decided to use these devices.

3.2 FPGA Place and Route

MAX+PLUS II is an architecture-independent software package for designing logic with Altera programmable logic devices. It offers easy design entry, fast processing, and straightforward device programming. Its features include three design entry methods for hierarchical designs; floorplan editing; logic synthesis; design partitioning; functional and timing simulation; detailed timing analysis; automatic error location; and device programming and verification. MAX+PLUS II reads and writes standard EDIF netlist files and Verilog HDL netlist files. Various other formats (VHDL, OrCAD Schematic Files, and Xilinx Netlist Format Files) are supported, as are a large number of primitives, megafunctions, and macrofunctions.
The tool offers a graphical user interface and provides useful features such as a design editor; Graphic, Text, and Waveform Editors; and a Floorplan Editor and Symbol Editor for tasks such as assigning a pin. The Compiler provides customizable design processing to achieve the best possible silicon implementation. Automatic error location and extensive documentation on error and warning messages make design modifications as simple as possible. Output files can be created in a variety of formats for simulation, timing analysis, and device programming, including EDIF, Verilog HDL, and VHDL files for use with other industry-standard EDA tools. [1]

3.3 Preparing modules for FPGA Synthesis

MAX+PLUS II could read the Barracuda design files only as standard EDIF, because the design included some Verilog HDL constructs that the tool does not support. The design therefore first had to be compiled with Synopsys Design Compiler and saved as an EDIF netlist.
Some of the modules passed compilation the first time without errors. These were mainly modules taken from existing projects (the JUPITER project) that had already been prepared for FPGA emulation. For most of the other modules, some errors had to be located before they passed the FPGA compiler successfully. To correct an error, it first had to be identified at RTL level.
After each correction in the RTL code, the complete design had to be recompiled, running first Synopsys Design Compiler and then the FPGA compiler.
Common problems were duplicate node names, missing inputs and outputs, and outputs that were tied together.
The Barracuda modules used for this project ranged between 10,000 and 150,000 gates. RTL synthesis therefore ran between 30 min and 3 h per module, and the subsequent FPGA synthesis took about the same time again. This makes adapting large designs like the Barracuda time-consuming. It took at least six weeks before all modules had passed FPGA synthesis successfully.
The CORE module available for this project used RS flip-flops that could not be implemented. The RS flip-flops had to be substituted by D flip-flops.
During compilation of the wrapper modules, errors such as tri-state buffer outputs wired together ("wired ORs") occurred frequently, mainly with the old JUPITER modules, because these were self-contained and not designed to be combined in this way.

At this stage, the IPbus interfaces of all peripheral modules had been verified and tested.
Each Barracuda module was first compiled without any device assignment. This allows the Compiler to choose the device that best suits the project; the Fitter uses the smallest available device (in terms of number of pins and logic elements).
A specific device that best met the Barracuda project requirements for package and pin count was then assigned to the wrapper modules.
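The "smallest available device" rule can be illustrated with a minimal Python sketch. It is not the Fitter's actual algorithm: it selects only by logic elements and RAM bits, using the figures from table 3.3, and ignores pin count and package. The function name and the example demands are hypothetical.

```python
from typing import NamedTuple, Optional


class Device(NamedTuple):
    name: str
    logic_elements: int
    ram_bits: int


# Logic element and RAM figures as listed in table 3.3.
DEVICES = [
    Device("EPF10K30E", 1_728, 24_576),
    Device("EPF10K130E", 6_656, 65_536),
    Device("EPF10K200E", 9_984, 98_304),
]


def smallest_fitting_device(required_les: int,
                            required_ram_bits: int = 0) -> Optional[Device]:
    """Return the device with the fewest logic elements that meets both demands.

    Returns None if no listed device is large enough.
    """
    candidates = [
        d for d in DEVICES
        if d.logic_elements >= required_les and d.ram_bits >= required_ram_bits
    ]
    return min(candidates, key=lambda d: d.logic_elements, default=None)


# Hypothetical module demands, for illustration only.
for les, ram in [(1_500, 0), (5_000, 30_000), (20_000, 0)]:
    print(les, ram, smallest_fitting_device(les, ram))
```

A module that fits no single device returns None here; in practice it would have to be partitioned across several devices.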

3.4 FPGA Pin Assignment

A major task of this thesis was the configuration and optimization of the FPGA pin layout. The board-level routing problem is the problem of assigning the internal net pins of each chip to I/O pins such that all pins belonging to the same net are assigned to I/O pins that are connected on the board. In addition, at most one net pin can be assigned to each I/O pin. Assigning signals to traces also fixes the placement of the FPGAs on the board.
The author will refer to the FPGAs on the Barracuda logic emulator simply as chips. Since the assignment of device-specific pins cannot be changed, only user I/O pins are considered.

The quality of a pin assignment is determined by its impact on the internal FPGA routing resources and, more importantly, on the board-level routing effort. The FPGA compiler applies algorithms that minimize the length of connections between cells inside the FPGA, considering only internal logic. While this solves the routing problem for single-FPGA emulators, additional effort is necessary to optimize board-level routing for inter-FPGA signals. Since PCB place and route tools require fixed signals for board-level optimization, the assignment of inter-FPGA signals to specific I/O pins on the FPGAs in the multi-FPGA system had to be done manually. In other words, pin assignment had to minimize the length of connections between chips and remove all crossing traces. The source-to-sink delay of inter-FPGA traces is decisive, especially for the MCU bus signals routed on the board.
The best routing results are achieved if all pins of adjacent parts (FPGAs, connectors, etc.) are connected to each other. The Barracuda design has been partitioned so that no long-distance routing (e.g., a signal from FPGA#2 to FPGA#5) was necessary (see figure 3.1).

Fig.3.1 Symbolic Module Layout

The drawback of this method is the following. Automatic pin assignment gives very good results as long as the placer is largely free in its placement choices. For the chosen FPGA positioning, however, more than 70% of the I/O pins are dictated by the placement of the neighboring chips and by the connectors, so that no attention is paid to the logic to be emulated. It turned out, however, that pin assignment did not influence timing significantly. We therefore accepted a possible additional delay for FPGAs other than FPGA#1.

First, the logic of each FPGA was mapped and assigned to I/O pins by the FPGA compiler during partitioning and global placement, completely independently of the other FPGAs. The CORE module had the poorest timing results and thus represents one limiting factor for the Barracuda emulation system. This FPGA was considered first, since it should benefit most from avoiding extra pin constraints. For this module, timing is determined by the internal FPGA routing between different logic cells, while the internal routing to I/O pins does not influence timing decisively. For FPGA#1 it was therefore assumed that pin assignment has no effect on the quality of the logic element implementation.
Furthermore, all FPGAs are constrained by the system's architecture; e.g., signals assigned to connectors are sorted (write[0..15] must be assigned to CON[1..16]). Some device-specific signals additionally have a fixed source or sink for individual routes, and their assignments cannot be changed. As a result, the logic emulator as a whole is constrained by board-level routing.
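The sorted connector constraint can be written down as a simple mapping. The following Python sketch generates the write[0..15] to CON[1..16] assignment mentioned above; the function name and its parameters are hypothetical and serve only as illustration.

```python
def map_bus_to_connector(bus: str, width: int,
                         connector: str = "CON", first_pin: int = 1) -> dict:
    """Map bus[0..width-1] onto consecutive connector pins, preserving bit order."""
    return {
        f"{bus}[{bit}]": f"{connector}[{first_pin + bit}]"
        for bit in range(width)
    }


# write[0..15] must be assigned to CON[1..16], as stated in the text.
mapping = map_bus_to_connector("write", 16)
for signal, pin in mapping.items():
    print(signal, "->", pin)
```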

The first step was to determine which subset of pins would be used for inter-FPGA routing and for connecting FPGA signals to connectors. Second, the set of signal nets interconnecting the FPGAs was assigned in the way that initially appeared to be advantageous. In other words, the author tried to obtain a feasible assignment of all internal MCU bus signals to one subset of pins available on each chip, in order to minimize the board-level routing effort.
Subset B has the largest number of pins: the PQFP-240 package has 47 available I/O pins in subset B, whereas the PQFP-208 package has a different subset size with 39 I/O pins. Assigning the bus signals addr, read, and write (48 pins) to the respective pin subset B of each FPGA predefined the position of FPGA#3 and FPGA#4, which are adjacent to FPGA#1 and FPGA#2 and rotated by 180° (see figure 3.2). It also determined the architecture of the Barracuda board, so any interchip routing had to conform to this topology.

At this stage, the MCU control signals were not yet constrained, as they can be grouped in any order outside the FPGAs. After placing all FPGAs and assigning the buses to FPGAs and connectors, the designs for each FPGA were recompiled, allowing the FPGA placement tool to determine its own assignment. This procedure takes into consideration the structure of the logic assigned to the FPGAs and maintains the bus assignments made to meet the clean pinout requirement at the connector side. It also avoids poor manual assignments, which would require extra routing inside the FPGAs to compensate for bad pin positioning. In succeeding steps, the pin assignment of the time-critical FPGA#1 was propagated to the adjacent FPGAs, beginning with FPGA#2 and FPGA#5, followed by FPGA#3 and FPGA#4, and ending with the TIMER and the two SRAMs.

Besides the pins that connect to every FPGA (five-terminal nets), each FPGA has some intermodule signals as well as signals routed to connectors (two-terminal nets). Considering the respective I/O pin subsets of two adjacent FPGAs, the assignment process described above was repeated several times for each FPGA.

To sum up, this sequential method gives the best result for FPGA#1, while internal signals that are used together in a neighboring FPGA may be scattered. Moreover, compiling a project after pin assignment takes up to 3 h, since the wrapper modules contain the complete design to be emulated in one FPGA and cannot be processed module by module. As shown above, sequential board-level design optimization is very time-consuming. To find an optimal mapping solution for the Barracuda ProtoBoard, the author performed at least 20 optimization cycles to minimize the degree of imbalance at one chip without increasing the degree of imbalance at any other chip.

For future improvements, the results of internal routing should be made visible using information from the report file generated by the mapping tool. What is needed is a more global approach which optimizes the entire mapping for all FPGAs simultaneously and avoids sequential optimization steps.
So far the author has assumed that all nets interconnecting the FPGAs are routed on the board. However, there are two schemes for connecting signals between the FPGA chips. The first is to connect the FPGAs directly to each other. The other is to route some nets through FPGA-internal nets, so that an FPGA serves two functions: logic and interconnection. The latter has the advantage of higher FPGA utilization and net delay uniformity. This approach has not yet been used for the Barracuda board. For FPGA#2, FPGA#3, and FPGA#5, a complete subset of I/O pins is still available. These pins could be used for interconnection to improve board-level routing. In this case, long-distance signals can be split into multiple shorter interconnected signals, where each short signal represents the passage of the signal between connected FPGAs.
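Splitting a long-distance signal into shorter segments amounts to finding a path through adjacent chips. The Python sketch below uses a breadth-first search for this. The adjacency table is a hypothetical example and does not reproduce the actual Barracuda board topology; the function name is likewise an assumption made for illustration.

```python
from collections import deque

# Hypothetical chip adjacency, for illustration only
# (NOT the actual Barracuda board topology).
ADJACENT = {
    "FPGA#1": ["FPGA#2", "FPGA#3"],
    "FPGA#2": ["FPGA#1", "FPGA#3"],
    "FPGA#3": ["FPGA#1", "FPGA#2", "FPGA#4", "FPGA#5"],
    "FPGA#4": ["FPGA#3"],
    "FPGA#5": ["FPGA#3"],
}


def split_into_hops(source: str, sink: str) -> list:
    """Return the (from, to) segments of a shortest chip-to-chip path.

    Each segment connects two adjacent chips; intermediate chips would
    pass the signal through on spare I/O pins. Returns an empty list if
    source and sink are identical or not connected.
    """
    previous = {source: None}
    queue = deque([source])
    while queue:
        chip = queue.popleft()
        if chip == sink:
            break
        for neighbour in ADJACENT[chip]:
            if neighbour not in previous:
                previous[neighbour] = chip
                queue.append(neighbour)
    if sink not in previous:
        return []
    path = []
    node = sink
    while node is not None:
        path.append(node)
        node = previous[node]
    path.reverse()
    return list(zip(path, path[1:]))


print(split_into_hops("FPGA#2", "FPGA#5"))
```

With this example adjacency, a signal from FPGA#2 to FPGA#5 is split into two segments that pass through FPGA#3.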

It is easy to see that board-level design optimization can be extended further. Where logic emulation is the limiting factor, there are pin assignment methods which require that all logic-bearing FPGAs be connected only through crossbar chips or routing-only FPGAs. The logic-bearing FPGAs can then be placed independently, while the routing chips handle any pin connection pattern equally well. [14]

Fig.3.2 Intended Board Layout

The pin assignments performed had no limiting effect on the timing of the Barracuda modules. On the contrary, grouping the buses and assigning the clock and reset signals to appropriate dedicated input pins slightly improved the timing of time-critical modules such as the CORE. This is also indicated by the fact that the pin assignments made for the Barracuda logic emulator did not slow down the FPGA mapping tool decisively. With poor pin assignments, mapping would take up to four times as long.

Note: The board-level design optimization by pin assignment described above was possible only with the novel PCB design tool developed in the course of this thesis. It translates the board specification (pin files) into script files for the P&R tool Eagle, so that the results of pin assignments can be viewed at board level less than 5 min after recompiling the respective FPGA logic. Chapter 5 describes the PCB design tool in detail.

When making pin assignments for the first time, it proved convenient to use the configuration wizard and enter the pin number for each signal. To change or swap assigned pins, or to shift all assigned pins to another chip location, it proved useful to edit the assignments manually in the respective ASCII file. However, this easily results in illegal assignments, e.g. the assignment of more than one signal to a pin. After any assignment change, the design had to be recompiled for the change to take effect. The Altera placement tool lacks a feature that would allow the user to restrict the locations to which a logic pin can be assigned (e.g. to a pin subset).
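Illegal assignments of this kind can be caught before recompiling with a small check. The Python sketch below assumes the assignments have already been read into (signal, pin) pairs; it does not parse the Altera assignment file, and the signal names and pin numbers are hypothetical.

```python
from collections import defaultdict


def find_pin_conflicts(assignments) -> dict:
    """Return {pin: [signals]} for every pin that has more than one signal."""
    signals_by_pin = defaultdict(list)
    for signal, pin in assignments:
        signals_by_pin[pin].append(signal)
    return {
        pin: signals
        for pin, signals in signals_by_pin.items()
        if len(signals) > 1
    }


# Hypothetical assignments, for illustration only.
assignments = [
    ("addr[0]", 12),
    ("addr[1]", 13),
    ("read[0]", 12),   # illegal: pin 12 is already used by addr[0]
]
for pin, signals in find_pin_conflicts(assignments).items():
    print(f"pin {pin} is assigned to several signals: {', '.join(signals)}")
```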

3.5 Analysis of Performance, Resources

After compiling the design, each module was analyzed to determine its critical speed paths, performance-limiting delay, minimum clock period, and maximum clock frequency.
The Barracuda CORE had the longest timing delay. Since the slowest element determines the clock frequency on the board and the CORE runs at 8.2 MHz, the author estimates the clock on the Barracuda logic emulator to be between 1 MHz and 8 MHz.
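The minimum clock period and the maximum clock frequency are reciprocals of each other. For the CORE figure quoted above this gives the following value (the symbols are introduced here for illustration):

```latex
T_{\min} = \frac{1}{f_{\max}} = \frac{1}{8.2\ \mathrm{MHz}} \approx 122\ \mathrm{ns}
```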
Some modules did not pass the timing analysis successfully. It turned out that compilation with the default logic option settings and without any resource or device assignment results in some signal paths on which the clock skew is longer than the path delay. Clock skew is defined as the maximum difference between the delays from the clock source to the clock pins of the latches in a clock tree. Unlike in traditional ASIC technologies, the geometric structure of the clock trees in an FPGA is usually fixed and cannot be changed for different circuit designs. As a result, the load capacitances of a clock tree may change, depending on the utilization and distribution of logic modules in the FPGA. Since modules with such paths will not work reliably in real time, it is important to reduce clock skew to achieve high performance. [13]
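The skew definition and the failing condition can be illustrated with a short Python sketch. It is a simplified check that ignores setup, hold, and clock-to-output times, and all register names and delay values are hypothetical.

```python
def clock_skew(clock_delay_ns: dict) -> float:
    """Maximum difference of the delays from the clock source to the clock pins."""
    delays = clock_delay_ns.values()
    return max(delays) - min(delays)


def skew_violations(clock_delay_ns: dict, data_paths) -> list:
    """Return the (source, destination) paths whose skew exceeds the data delay.

    A path is flagged when the clock arrives at the destination register
    later than at the source register by more than the data path delay.
    """
    violations = []
    for src, dst, data_delay_ns in data_paths:
        path_skew = clock_delay_ns[dst] - clock_delay_ns[src]
        if path_skew > data_delay_ns:
            violations.append((src, dst))
    return violations


# Hypothetical clock arrival times and data path delays, in nanoseconds.
clock_delay = {"ff_a": 2.0, "ff_b": 6.5, "ff_c": 3.0}
paths = [
    ("ff_a", "ff_b", 3.0),   # skew 4.5 ns exceeds the 3.0 ns data delay
    ("ff_a", "ff_c", 5.0),   # skew 1.0 ns is below the 5.0 ns data delay
    ("ff_b", "ff_a", 1.0),   # clock arrives earlier at the destination
]
print("clock skew:", clock_skew(clock_delay), "ns")
print("violating paths:", skew_violations(clock_delay, paths))
```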
It is possible to minimize clock skew by carefully distributing the logic modules using clique assignments, so that only a minimum number of signals travel between LABs or rows and no unnecessary delays arise on critical timing paths.
A clique is a resource assignment that groups a block of logic functions together. The Compiler attempts to keep clique members together when it fits the project. A clique assignment allows all logic on a speed-critical path to be grouped, thus ensuring optimum speed.
If possible, all clique members are assigned to the same LAB; otherwise they are placed in the same row. Cliques therefore allow a project to be partitioned in such a way that no unnecessary LAB-to-LAB or row-to-row delays exist on critical timing paths.