Programmable devices have moved up the technology curve and now boast
gate counts well above the half-million range, thus making even very large
projects realizable.
User-programmable logic options for gate counts between 1,000 and 250,000 gates
include the programmable devices known as field-programmable gate arrays,
laser-programmable gate arrays, and complex PLDs; one-time-programmable devices
include mask-programmable gate arrays, more typically referred to as ASICs.
After completing the work at RTL level, part of this thesis was to decide which FPGA technology and product to use, or whether a combination of solutions made sense. This included identifying the advantages of and differences between the available FPGAs. Furthermore, it was necessary to contact several semiconductor distributors for information on availability, prices, and manufacturer lead times.
Programmable devices vary in architecture and technology (see table 1). Some are RAM-based and therefore volatile: they lose their configuration data when power is turned off. Others are nonvolatile and keep their programmed data even when powered down; those devices generally rely on either antifuse or flash technology. The most popular programmable device architectures are the traditional product-term macrocell and the look-up table (LUT). Other devices, like the Actel and Quicklogic FPGAs, use various proprietary architectures.
FPGA Type | Architecture | Technology |
---|---|---|
Actel SX | FPGA Logic Macro | Antifuse |
Altera Apex | FPGA macrocell + LUT + EAB | SRAM |
Altera Flex | FPGA LUT + EAB | SRAM |
Altera Max | CPLD macrocell | EE |
Lucent ORCA | FPGA LUT | SRAM |
Quicklogic PASIC | FPGA Logic Macro | Antifuse |
Xilinx XC4000 | FPGA LUT | SRAM |
Xilinx XC9500 | CPLD macrocell | Flash |
Three constraints were important for this project. First, the design had to
use field-programmable logic devices that offer on-site programmability for
test and debug of different design configurations and alternatives.
Second, the Barracuda is designed for automotive use. This fixed the required
supply voltage at 5 V, since the Barracuda prototype board had to be
implemented as close as possible to the real Barracuda MCU.
Third, Motorola already had some experience with Altera FPGAs and the Altera
FPGA design software from past projects.
Meanwhile, Altera introduced the new FLEX 10KE family of PLDs, based on a
0.25-micron process, with new programmable delay and phase-locked loop (PLL)
features. Enhancements to the FLEX 10KE family include dual-port RAM in
configurations up to 16 bits wide.
Altera devices offer MultiVolt I/O operation for compatibility with devices of
different voltages. The 3.3-V FLEX 10KA devices can interface with 2.5-V,
3.3-V, and 5.0-V devices. FLEX 10KE devices, which operate at a 2.5-V
supply voltage, interface with 5.0-V, 3.3-V, and 2.5-V devices. This meets
the requirement that the board be usable in an automotive environment with a
5-V supply voltage. With 6,656 logic elements, 65,536 bits of on-chip RAM, and
2.5-V operation, a FLEX 10KE device such as the EPF10K130E enables higher
performance and lower power consumption than available on 3.3-V, 0.30-micron
PLDs. The programmable delay and PLL features enhance I/O performance and
improve pin-to-pin timing. With a 2.5-V core and 130,000 and 200,000 typical
gates respectively, the EPF10K130E and EPF10K200E best meet the requirements
for the Barracuda design. The author's decision to use Altera FLEX 10KE FPGAs
for the Barracuda project is based on the arguments listed in table 2.
Feature | Benefit |
---|---|
100 MHz and above system performance | The Barracuda chips will run at 20 MHz; the ProtoBoard is intended to run at 1 to 10 MHz |
Density from 10,000 to 250,000 gates | The largest module (CORE) requires 150,000 gates, so a self-contained module need not be split across multiple devices. |
Embedded array blocks | RAM, ROM, FIFO megafunctions are needed for one interface module (MSCAN) that implements on-chip RAM. |
MultiVolt I/O operation: 5.0 V, 3.3 V, and 2.5 V | Required for the mixed-voltage ProtoBoard |
In addition, the FLEX 10K family has fast and predictable performance, and in-circuit reconfigurability (ICR) required for the Barracuda project. The chips are programmed using tools that run on personal computers or engineering workstations.
Feature | EPF10K30E | EPF10K130E | EPF10K200E |
---|---|---|---|
Typical gates (logic and RAM) | 30,000 | 130,000 | 200,000 |
Logic elements (LEs) | 1,728 | 6,656 | 9,984 |
Logic array blocks (LABs) | 216 | 832 | 1,248 |
Embedded array blocks (EABs) | 6 | 16 | 24 |
Total RAM bits | 24,576 | 65,536 | 98,304 |
Concerning the supply voltage of the board, FLEX 10K devices, which operate
at 5.0 V, would be preferable for the project. Following the author's
proposal, FLEX 10KE devices were selected instead, because the 2.5-V,
0.25-micron FLEX 10KE devices are on average 20% to 30% faster than the
equivalent 0.35-micron FLEX 10KA devices, which operate at 3.3 V. Similarly,
FLEX 10KA devices are on average 20% to 30% faster than the 0.5-micron
FLEX 10K devices.
Moreover, the FLEX 10K embedded array can be used to create ROMs, FIFOs, and
asynchronous, synchronous, and dual-port RAM, necessary to emulate SRAM of one
interface module.
Concerning timing constraints of the board, Altera's ClockLock programming option uses on-chip PLL circuitry to reduce clock delay and skew to increase I/O performance up to 35 percent. The ClockBoost feature provides internal 2x-clock multiplication, doubling datapath bandwidth and reducing the use of logic resources.
The EPF10K130E device supports QFP packages, which are preferred at the MCU Design Center Munich because they are ideal for prototyping. The EPF10K130E and EPF10K200E devices are available in the 240-pin PQFP package required for the Barracuda design.
Concerning availability, according to information from several semiconductor distributors, both the EPF10K130E and EPF10K200E devices are available at a reasonable price (see chapter 4 for more details). One restriction emerged: these devices had a manufacturer lead time of at least 12 weeks. It would therefore not be possible to manufacture and assemble the Barracuda PreSilicon emulation board within the time frame of this thesis.
Concerning software support, the EPF10K130E and EPF10K200E devices with PLL circuitry are supported by the MAX+PLUS II development system. After compilation, the MAX+PLUS II software can simulate timing and functionality and provide timing analysis.
Summing up all arguments described above, we decided to use these devices.
The MAX+PLUS II software is an architecture-independent package for designing
logic with Altera programmable logic devices. It allows easy design entry, fast
processing, and straightforward device programming. It offers three design
entry methods for hierarchical designs, together with floorplan editing, logic
synthesis, design partitioning, functional and timing simulation, detailed
timing analysis, automatic error location, and device programming and
verification. MAX+PLUS II reads and writes standard EDIF netlist files and
Verilog HDL netlist files. Various other formats (VHDL, OrCAD Schematic Files,
and Xilinx Netlist Format Files) are supported, as well as a large number of
primitives, megafunctions, and macrofunctions.
The tool offers a graphical user interface and provides a set of design
editors: the Graphic, Text, and Waveform Editors, as well as a Floorplan Editor
and a Symbol Editor, which support tasks such as assigning a pin. The Compiler
provides customizable design processing to achieve the best possible silicon
implementation. Automatic error location and extensive documentation on error
and warning messages make design modifications as simple as possible. The
software can create output files in a variety of formats for simulation,
timing analysis, and device programming, including EDIF, Verilog HDL, and VHDL
files for use with other industry-standard EDA tools. [1]
MAX+PLUS II could read the Barracuda design files only as standard EDIF
netlists, because the design included some Verilog HDL constructs that the
tool does not support. The design therefore first had to be compiled with
Synopsys Design Compiler and saved as an EDIF netlist.
Some of the modules passed compilation the first time without errors. These
were mainly modules taken from existing projects (the JUPITER project) that
had already been prepared for FPGA emulation. For most of the other modules,
some errors had to be located before they successfully passed the FPGA
compiler. To correct an error, it first had to be traced back to the RTL level.
After each change to the RTL code, the complete design had to be recompiled,
running first Synopsys Design Compiler and then the FPGA compiler.
Common problems were duplicate node names, missing inputs and outputs, and
outputs that were tied together.
The Barracuda modules used for this project ranged between 10,000 and 150,000
gates. RTL synthesis therefore runs between 30 min and 3 h per module, and FPGA
synthesis afterwards takes about the same time again. This makes adapting large
designs like the Barracuda time-consuming. It took at least six weeks until
all modules had successfully passed FPGA synthesis.
The CORE module available for this project used RS flip-flops that could not be
implemented. The RS flip-flops had to be substituted by D flip-flops.
During compilation of the wrapper modules, errors such as tri-state buffer
outputs wired together ("wired ORs") occurred frequently, mainly for the old
JUPITER modules, because these were self-contained and not designed to be
combined in this way.
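A check for outputs that are tied together can be sketched in a few lines. The following is only an illustration, not the procedure used in this thesis: it assumes the netlist has already been reduced to (net, driver) pairs, and the function name and the example net and cell names are hypothetical.

```python
from collections import defaultdict


def find_multiply_driven_nets(drivers):
    """Return nets that are driven by more than one output.

    drivers: iterable of (net_name, driving_output) pairs, assumed to be
    extracted from the netlist beforehand (extraction is not shown here).
    """
    sources_by_net = defaultdict(set)
    for net, source in drivers:
        sources_by_net[net].add(source)
    # A net with two or more distinct drivers is a "wired OR" candidate.
    return {
        net: sorted(sources)
        for net, sources in sources_by_net.items()
        if len(sources) > 1
    }


# Hypothetical example: two tri-state buffers drive the same net.
example = [
    ("data_out", "tbuf_a.out"),
    ("data_out", "tbuf_b.out"),
    ("irq", "timer.irq_out"),
]
print(find_multiply_driven_nets(example))
```

Running such a check on the flattened wrapper netlist before FPGA compilation would point to the conflicting drivers without waiting for a full compiler run.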
At this stage, all IPbus interfaces for each peripheral interface have been verified and tested.
Each Barracuda module was first compiled without any device assignment. This
allows the Compiler to choose the device that best suits the project; the
Fitter uses the smallest available device (in terms of number of pins and
logic elements).
Then a specific device that best met the Barracuda project requirements for
package and pin count was assigned to the wrapper modules.
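The "smallest device that fits" rule can be illustrated with the FLEX 10KE figures from the feature table in this chapter. This is a simplified sketch, not the Fitter's actual algorithm: it considers only logic elements and RAM bits (pin count and package are left out), and the function name is hypothetical.

```python
# Logic element and RAM figures copied from the FLEX 10KE feature table.
DEVICES = {
    "EPF10K30E": {"logic_elements": 1728, "ram_bits": 24576},
    "EPF10K130E": {"logic_elements": 6656, "ram_bits": 65536},
    "EPF10K200E": {"logic_elements": 9984, "ram_bits": 98304},
}


def smallest_fitting_device(required_les, required_ram_bits=0):
    """Return the name of the smallest device that fits, or None.

    "Smallest" is judged by logic element count only (an assumption).
    """
    candidates = [
        (spec["logic_elements"], name)
        for name, spec in DEVICES.items()
        if spec["logic_elements"] >= required_les
        and spec["ram_bits"] >= required_ram_bits
    ]
    return min(candidates)[1] if candidates else None


# Hypothetical module needing 2,000 logic elements and no embedded RAM.
print(smallest_fitting_device(2000))
```

A module that exceeds the largest device yields `None`, which corresponds to a design that would have to be split across several devices.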
A major task of this thesis was the configuration and optimization of
the FPGA pin layout. The board-level routing problem is the problem of
assigning the internal net pins in each chip to I/O pins so that all pins
belonging to the same net are assigned to I/O pins that are connected to one
another on the board. In addition, at most one net pin can be assigned to each
I/O pin. By assigning signals to traces, the placement of the FPGAs on the
board is fixed.
The author will refer to the FPGAs on the Barracuda logic emulator simply as
chips. Since the assignment of device-specific pins cannot be changed, only
user I/O pins are considered.
The quality of a pin assignment is determined by its impact on the internal
FPGA routing resources and, more importantly, on the board-level routing
effort. The FPGA compiler applies algorithms that minimize the length of
connections between cells inside the FPGA, considering only internal logic.
While this solves the routing problem for single-FPGA emulators, additional
effort is necessary to optimize the board-level routing of inter-FPGA signals.
Since PCB place and route tools require fixed signals for board-level
optimization, the assignment of inter-FPGA signals to specific I/O pins on the
FPGAs in the multi-FPGA system had to be done manually. In other words, it was
necessary to minimize the length of connections between chips and to remove
all crossing traces through pin assignment. The source-to-sink delay of
inter-FPGA traces is critical, especially for the MCU bus signals routed on
the board.
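The goal of removing crossing traces can be made concrete with a small model. The sketch below assumes two adjacent chips whose facing edges carry pins numbered along the edge, with each net connecting one pin on chip A to one pin on chip B; under this simplified model (not taken from the thesis tool flow), two traces cross when their pin order differs on the two chips. The function name is hypothetical.

```python
from itertools import combinations


def count_crossings(nets):
    """Count crossing trace pairs between two facing chip edges.

    nets: list of (position_on_chip_a, position_on_chip_b) tuples.
    Two nets cross if their order on chip A is the reverse of their
    order on chip B.
    """
    return sum(
        1
        for (a1, b1), (a2, b2) in combinations(nets, 2)
        if (a1 - a2) * (b1 - b2) < 0
    )


# Same ordering on both chips: no crossings.
print(count_crossings([(0, 0), (1, 1), (2, 2)]))  # 0
# Fully reversed ordering: every pair crosses.
print(count_crossings([(0, 2), (1, 1), (2, 0)]))  # 3
```

A pin assignment that brings this count to zero for every pair of adjacent chips lets the bus signals run as parallel traces on the board.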
The best routing results are achieved if all pins of adjacent parts (FPGAs,
connectors, etc.) are connected to each other. The Barracuda design was
partitioned so that no long-distance routing (e.g., a signal from FPGA#2 to
FPGA#5) was necessary (see figure 3.1).
The drawback of this method is the following. Automatic pin assignment gives very good results as long as the placer is mostly free in its placement choices. With the chosen FPGA positioning, however, more than 70% of the I/O pins are dictated by the placement of neighboring chips and by the connectors, so the logic to be emulated is not taken into account. In practice, the pin assignment did not influence timing significantly. We therefore accepted a possible additional delay for all FPGAs other than FPGA#1.
First, the logic of each FPGA was mapped and assigned to I/O pins
completely independently of the others by the FPGA compiler during
partitioning and global placement. The CORE module had the poorest timing
results and thus represents one limiting factor for the Barracuda emulation
system. This FPGA was considered first, since it should benefit most from
avoiding extra pin constraints. For this module, timing is determined by the
internal FPGA routing between logic cells, whereas internal routing to I/O
pins does not influence timing decisively. For FPGA#1 it was therefore assumed
that pin assignment has no effect on the quality of the logic element
implementation.
Furthermore, all FPGAs are constrained by the system's architecture; e.g.,
signals assigned to connectors are sorted (write[0..15] must be assigned to
CON[1..16]). In addition, some device-specific signals have a fixed source or
sink, so their assignments cannot be changed. As a result, the logic emulator
as a whole is constrained by board-level routing.
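The sorted connector constraint amounts to a fixed, order-preserving mapping from bus bits to connector pins. A minimal sketch, assuming the bracket notation used in the text and a hypothetical helper name:

```python
def map_bus_to_connector(bus, width, connector="CON", first_pin=1):
    """Map bus bits to consecutive connector pins in ascending order.

    With the defaults, bit 0 of the bus goes to pin 1 of the connector,
    matching the example write[0..15] -> CON[1..16].
    """
    return {
        f"{bus}[{bit}]": f"{connector}[{first_pin + bit}]"
        for bit in range(width)
    }


mapping = map_bus_to_connector("write", 16)
print(mapping["write[0]"])   # CON[1]
print(mapping["write[15]"])  # CON[16]
```

Because this mapping is fixed by the board architecture, the corresponding FPGA pins are not available to the placement tool as free choices.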
The first step was to determine which subset of pins would be used for
inter-FPGA routing and for connecting FPGA signals to connectors. Second, the
signal nets interconnecting the FPGAs were assigned in the way that initially
appeared advantageous. That is, the author tried to obtain a feasible
assignment of all internal MCU bus signals to one subset of the pins
available on each chip in order to minimize the board-level routing effort.
Subset B has the largest number of pins: the PQFP-240 package has 47 available
I/O pins in subset B, whereas the PQFP-208 package has a smaller subset with
39 I/O pins.
Assigning the bus signals addr, read, and write (48 pins) to the respective
pin subset B of each FPGA predefined the positions of FPGA#3 and FPGA#4, which
are adjacent to FPGA#1 and FPGA#2 and rotated by 180° (see figure 3.2). It
also determined the architecture of the Barracuda board, and thus any
interchip routing had to conform to this topology.
At this stage the MCU control signals were not yet constrained, as they can be grouped in any order outside the FPGAs. After placing all FPGAs and assigning the busses to FPGAs and connectors, the designs for each FPGA were recompiled, allowing the FPGA placement tool to determine its own assignment. This procedure takes into account the structure of the logic assigned to the FPGAs and preserves the bus assignments made to meet the clean pinout requirement at the connector side. It also avoids poor manual assignments that would require extra routing inside the FPGAs to compensate for bad pin positioning. In subsequent steps, the pin assignment of the time-critical FPGA#1 was propagated to the adjacent FPGAs, beginning with FPGA#2 and FPGA#5, followed by FPGA#3 and FPGA#4, and ending with the TIMER and the two SRAMs.
Besides the pins that connect to each FPGA (five-terminal nets), each FPGA has some intermodule signals as well as signals routed to connectors (two-terminal nets). Considering the respective I/O pin subsets of two adjacent FPGAs, the assignment process described above was repeated several times for each FPGA.
To sum up, this sequential method gives the best result for FPGA#1, while internal signals that are used together in a neighboring FPGA may be scattered. Moreover, compiling a project after pin assignments takes up to 3 h, since each wrapper module contains the complete design to be emulated in one FPGA and cannot be processed module by module. As shown above, sequential board-level design optimization is very time-consuming. To find an optimal mapping solution for the Barracuda ProtoBoard, the author performed at least 20 optimization cycles to minimize the degree of imbalance at one chip without increasing the degree of imbalance at any other chip.
For future improvements, the results of internal routing should be made
visible using information from the report file generated by the mapping tool.
What is needed is a more global approach which optimizes the entire mapping
for all FPGAs simultaneously and avoids sequential optimization steps.
So far the author has assumed that all nets interconnecting the FPGAs are
routed on the board. However, there are two schemes for connecting signals
between the FPGA chips. The first is to connect the FPGAs directly to each
other. The second is to route some nets through FPGA-internal nets, so that an
FPGA serves two functions: logic and interconnection. The latter has the
advantage of higher FPGA utilization and more uniform net delays. This
approach has not yet been used for the Barracuda board. For FPGA#2, FPGA#3,
and FPGA#5, a complete subset of I/O pins is still available. These pins could
be used for interconnection to improve board-level routing. In this case,
long-distance signals can be split into multiple shorter interconnected
signals, where each short signal represents the passage of the signal between
connected FPGAs.
It is easy to see that board-level design optimization can be extended further. If logic emulation is the limiting factor, there are pin assignment methods that require all logic-bearing FPGAs to be connected only through crossbar chips or routing-only FPGAs. The logic-bearing FPGAs can then be placed independently, while the routing chips can handle any pin connection pattern equally well. [14]
The pin assignments performed had no limiting effect on the timing of the Barracuda modules. On the contrary, grouping the busses and assigning the clock and reset signals to appropriate dedicated input pins slightly improved the timing of time-critical modules such as the CORE. This is also indicated by the fact that the pin assignments made for the Barracuda logic emulator did not slow down the FPGA mapping tool significantly; with poor pin assignments, mapping would take up to four times as long.
Note: The board-level design optimization by pin assignment described above was possible only by using the novel PCB design tool developed in the course of this thesis. This tool translates the board specification (pin files) into script files for the P&R tool Eagle, so that the results of pin assignments can be viewed at board level less than 5 min after recompiling the respective FPGA logic. Chapter 5 describes the PCB design tool in detail.
When making pin assignments for the first time, it proved convenient to use the configuration wizard and to enter the pin number for each signal. To change or swap assigned pins, or to shift all assigned pins to another chip location, it proved useful to edit the assignments manually in the respective ASCII file. However, this easily leads to illegal assignments, e.g., the assignment of more than one signal to a pin. After any assignment change, the design had to be recompiled for the change to take effect. The Altera placement tool lacks a feature that would allow the user to restrict the locations to which a logic pin can be assigned (e.g., to a pin subset).
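Illegal assignments introduced by manual editing can be caught before a long recompilation with a simple consistency check. The sketch below is an illustration under stated assumptions: the assignments are assumed to be available as (signal, pin) pairs (parsing the ASCII assignment file is not shown), and the function name and the optional pin-subset restriction are hypothetical additions, not features of the Altera software.

```python
from collections import defaultdict


def check_pin_assignments(assignments, allowed_pins=None):
    """Return a list of error messages for a set of pin assignments.

    assignments: list of (signal_name, pin_number) pairs.
    allowed_pins: optional set of pin numbers the signals must stay in.
    """
    errors = []
    signals_by_pin = defaultdict(list)
    pins_by_signal = defaultdict(list)
    for signal, pin in assignments:
        signals_by_pin[pin].append(signal)
        pins_by_signal[signal].append(pin)

    for pin, signals in sorted(signals_by_pin.items()):
        # At most one signal may be assigned to each I/O pin.
        if len(signals) > 1:
            errors.append(f"pin {pin} has several signals: {', '.join(signals)}")
        if allowed_pins is not None and pin not in allowed_pins:
            errors.append(f"pin {pin} is outside the allowed subset")

    for signal, pins in sorted(pins_by_signal.items()):
        if len(pins) > 1:
            errors.append(f"signal {signal} is assigned to several pins: {pins}")
    return errors


# Hypothetical example: two address bits accidentally share pin 10.
print(check_pin_assignments([("addr[0]", 10), ("addr[1]", 10)]))
```

Passing a set of pin numbers as `allowed_pins` emulates the missing subset restriction at the checking stage, although it cannot steer the placement tool itself.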
After compiling the design, each module was analyzed to determine
critical speed paths, performance-limiting delays, the minimum clock period,
and the maximum clock frequency.
The Barracuda CORE had the longest timing delay. Needless to say, the
slowest element determines the clock frequency on the board. The CORE runs at
8.2 MHz, so the author estimates the clock on the Barracuda logic emulator to
be between 1 MHz and 8 MHz.
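The relation between minimum clock period and maximum clock frequency, and the rule that the slowest module limits the board clock, can be written down directly. In this sketch only the 8.2 MHz figure for the CORE comes from the timing analysis; the second module name and its frequency are hypothetical placeholders.

```python
def min_period_ns(f_max_mhz):
    """Minimum clock period in ns for a given maximum frequency in MHz."""
    return 1000.0 / f_max_mhz


def board_clock_limit_mhz(module_f_max_mhz):
    """The slowest module sets the upper bound for the common board clock."""
    return min(module_f_max_mhz.values())


# CORE value from the timing analysis; "OTHER_MODULE" is a placeholder.
f_max = {"CORE": 8.2, "OTHER_MODULE": 20.0}
print(round(min_period_ns(f_max["CORE"]), 1))  # about 122 ns
print(board_clock_limit_mhz(f_max))            # 8.2
```

A maximum frequency of 8.2 MHz thus corresponds to a critical path of roughly 122 ns in the CORE.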
Some modules did not pass the timing analysis successfully. It turned out that
compilation with the default logic option settings and without any resource or
device assignments results in some signal paths on which the clock skew is
longer than the data path delay. Clock skew is defined as the maximum
difference between the delays from the clock source to the clock pins of the
latches in a clock tree. Unlike in traditional ASIC technologies, the
geometric structures of the clock trees in an FPGA are usually fixed and
cannot be changed for different circuit designs. As a result, the load
capacitances of a clock tree may change, depending on the utilization and
distribution of logic modules in the FPGA. Since modules with such paths will
not work correctly in real time, it is important to reduce clock skew to
achieve high performance. [13]
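The skew definition and the failure condition described above can be sketched as follows. This is a simplified model for illustration only: setup and hold times and clock-to-output delays are ignored, and all register names and delay values are hypothetical.

```python
def clock_skew(arrival_ns):
    """Maximum difference between clock arrival times (ns) in a clock tree."""
    times = list(arrival_ns.values())
    return max(times) - min(times)


def paths_with_excess_skew(arrival_ns, paths):
    """Return (src, dst) pairs whose clock skew exceeds the data path delay.

    arrival_ns: register name -> clock arrival time in ns.
    paths: iterable of (src_register, dst_register, data_delay_ns).
    """
    flagged = []
    for src, dst, data_delay in paths:
        # Skew seen by this path: how much later the clock reaches dst.
        skew = arrival_ns[dst] - arrival_ns[src]
        if skew > data_delay:
            flagged.append((src, dst))
    return flagged


# Hypothetical clock arrival times and register-to-register paths.
arrival = {"reg_a": 1.0, "reg_b": 4.5, "reg_c": 2.0}
paths = [("reg_a", "reg_b", 2.0), ("reg_a", "reg_c", 3.0)]
print(clock_skew(arrival))                     # 3.5
print(paths_with_excess_skew(arrival, paths))  # [('reg_a', 'reg_b')]
```

On a flagged path the new data can reach the destination register before its delayed clock edge arrives, which is why such modules fail in real-time operation.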
It is possible to minimize clock skew by carefully distributing the logic
modules using clique assignments, so that only a minimum number of signals
travel between LABs and rows and no unnecessary delays arise on critical
timing paths.
A clique is a resource assignment that groups a block of logic functions
together. The Compiler attempts to keep clique members together when it fits
the project. A clique assignment allows grouping all logic on a speed-critical
path, thus ensuring optimum speed.
If possible, all clique members are assigned to the same LAB, or they are
placed in the same row. Cliques therefore allow a project to be partitioned so
that only a minimum number of signals travel between LABs and rows, ensuring
that no unnecessary LAB-to-LAB or row-to-row delays exist on critical timing
paths.
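The effect of a clique can be illustrated by counting how many connections leave their LAB before and after grouping. The sketch below uses a hypothetical three-cell critical path and hypothetical LAB names; it models only the LAB level, not rows, and says nothing about how the Compiler actually places clique members.

```python
def count_inter_lab_connections(connections, lab_of):
    """Count connections whose source and destination lie in different LABs.

    connections: iterable of (source_cell, destination_cell) pairs.
    lab_of: dict mapping each cell name to its LAB identifier.
    """
    return sum(1 for src, dst in connections if lab_of[src] != lab_of[dst])


# Hypothetical critical path through three logic cells.
critical_path = [("cell_1", "cell_2"), ("cell_2", "cell_3")]

# Without a clique the cells may end up in three different LABs.
scattered = {"cell_1": "LAB_A1", "cell_2": "LAB_C7", "cell_3": "LAB_F2"}
# With a clique assignment all members share one LAB.
grouped = {"cell_1": "LAB_A1", "cell_2": "LAB_A1", "cell_3": "LAB_A1"}

print(count_inter_lab_connections(critical_path, scattered))  # 2
print(count_inter_lab_connections(critical_path, grouped))    # 0
```

Each connection removed from this count avoids one LAB-to-LAB routing delay on the critical path.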