A.
Here's a quick (well, actually it's not as quick as I expected) description
of how I chose between Altera and Xilinx. I was comparing Altera's Flex/APEX
with Xilinx's XC4000/Virtex families. Some comments are thrown in about
Virtex and Apex and so on, but I will be the first to admit I haven't done
any more than read the datasheets. The Flex10KE I have used in anger.
A lot of what I have learnt about the two families has come from reading
comp.arch.fpga, and in particular, a discussion I had with Ray Andraka.
Most of the rest came from reading datasheets and appnotes.
Due to the discursive nature of this bit, it is indented with later
comments being more indented...
Logic
The structure of the Xilinx logic cells is well suited to arithmetic
structures, compared to the Altera Flex/Apex structure, due to the ability
to generate both output and carry from one logic cell. Altera's 4-LUT is
divided into two 3-LUTs for arithmetic.
I think you misunderstand the basic Altera
Logic element. The carry and sum outputs are implemented in a single
logic cell. Just as a Xilinx logic cell can be reconfigured to act
as a 16-bit memory cell, the Altera logic cell has several configurations
optimized for arithmetic, counters, or misc logic. There is no speed
penalty for using these modes and there is no inherent advantage to the
Xilinx logic cell with regard to arithmetic. I hear all kinds of
claims about Xilinx architectural advantages, but I have never heard even
the most ardent Xilinx user claim that the Xilinx architecture has an arithmetic
advantage in the logic cells.
I think the point is that Altera can only do a two-input arithmetic
operation with carry in and out, in Flex at least (ref. figure 11 in the
10KE datasheet). Also, with Flex, if you use a CE, it has to take the
place of one of the arithmetic inputs, leaving a one-input function (unless
you use cascade chains to make it wider, with their admittedly small routing
delay).
Xilinx does have a distinct advantage over Altera
when it comes to arithmetic circuits. For arithmetic, the Altera 4-Lut
does indeed get partitioned into a pair of 3-Luts: one for the 'sum' function,
and one for the 'carry' function. One input to each of these is connected
to the carry out from the previous bit. As a result, you are limited to
a two input arithmetic function if you wish to stay in one level of logic.
Arithmetic functions with more than two inputs, such as adder-subtractors,
multiplier partial products, mux-adds, and accumulators with loads or synchronous
clears (this last one is addressed by improvements in the 20K family) require
two levels of logic to implement. The Xilinx logic cell does not
use the Lut for the carry function; it has dedicated carry logic.
The 4K/Spartan families use one Lut input to connect
the carry chain, leaving three inputs for your function. Virtex,
VirtexII, and SpartanII have a dedicated xor gate after the Lut to do this,
so these devices can handle 4 input arithmetic functions without having
to go to two levels. The relatively limited arithmetic function of the
Altera parts means as much as twice the Luts are used in heavily arithmetic
applications. Two levels of logic also equates to a significant performance
penalty, everything else being equal.
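As a rough illustration of the input-count argument above, here is a small Python sketch that tabulates how many logic levels an arithmetic function needs per family. The per-family limits come straight from the discussion (two data inputs for Flex arithmetic mode, three for 4K/Spartan, four for Virtex). The function name, the example operations and the "anything over the limit costs exactly one more level" rule are simplifying assumptions for illustration, not vendor data.

```python
# Data inputs available to an arithmetic function in ONE logic level,
# not counting the carry-in, as described in the discussion above.
ARITH_INPUTS = {
    "Flex 10K": 2,        # 4-LUT split into two 3-LUTs; carry takes one input
    "XC4000/Spartan": 3,  # one LUT input used to connect the carry chain
    "Virtex": 4,          # dedicated XOR gate after the LUT
}

# Hypothetical per-bit operand counts, chosen only to illustrate the point.
EXAMPLES = {
    "adder (a + b)": 2,
    "adder-subtractor (a, b, add/sub)": 3,
    "mux-add (a, b, c, sel)": 4,
}


def logic_levels(family: str, data_inputs: int) -> int:
    """Return 1 if the function fits in one level, else 2 (simplified model)."""
    return 1 if data_inputs <= ARITH_INPUTS[family] else 2


if __name__ == "__main__":
    for name, n in EXAMPLES.items():
        row = ", ".join(f"{fam}: {logic_levels(fam, n)}" for fam in ARITH_INPUTS)
        print(f"{name:35s} {row}")
```

In this model the adder-subtractor is the first case where Flex needs a second level while both Xilinx families stay at one.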
Xilinx's logic cells can also be used as 16 bit shift registers or 16x1
SRAMs for small amounts of storage. In addition, in Virtex there are BlockRAMs
which are larger blocks of dual-ported memory. Altera only has large blocks
of RAM called EABs which are configurable between 256x16bits and 4096x1bit.
They are also only partially dual-ported (one read and one write port).
The ability to convert the Logic cell
into memory is a neat feature. This is one of the key differences
in the architectures. My only comment on that is that it isn't used
as much as you might think. Xilinx parts have a much lower logic
cell count relative to device size since they include so much RAM (example:
XCV600E: 13.8K logic cells, 288 Kbits RAM; 20K400E: 16.6K logic cells,
208 Kbits RAM). Because of this it doesn't usually make sense to take
away your less abundant resource (logic cells) to create more of something
you already have lots of (memory). Nonetheless, it is a neat and
sometimes quite useful feature.
For DSP designs, the CLB RAM capability is another
significant advantage over the Altera offerings. DSP designs tend to have
many small delay queues (filter tap delays, for example) which use up a
lot of logic cells if implemented as flip-flops, or severely under-utilize
block memories if done there. By using the CLB RAMs (or in the case of
Virtex, the shift register mode), you get up to a 17:1 area reduction over
using Lut flip-flops. Similar reductions come into play for designs
having register files and small fifos. The Virtex SRL16 primitive
also gives you the capability to reload Lut contents without reconfiguring
the device. This makes it possible to have re-programmable coefficients
in a distributed arithmetic filter for instance. There is simply no equivalent
capability in the Altera devices. My Virtex designs typically have more
than half of the Luts configured as SRL16's.
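To make the delay-queue argument concrete, here is a minimal behavioural sketch in Python of a 16-bit addressable shift register held in one LUT, plus two helpers that count resources for a delay line. The class and function names are invented for illustration, and the model ignores timing and initialisation details of the real primitive.

```python
class SRL16:
    """Behavioural model of a 16-bit addressable shift register in one LUT."""

    def __init__(self) -> None:
        self.bits = [0] * 16  # bits[0] is the most recently shifted-in value

    def clock(self, d: int, ce: int = 1) -> None:
        """Shift in one bit on a clock edge when the clock enable is high."""
        if ce:
            self.bits = [d & 1] + self.bits[:-1]

    def q(self, addr: int) -> int:
        """Read the tap selected by the 4-bit address (0 = newest bit)."""
        return self.bits[addr & 0xF]


def luts_for_delay(depth: int, width: int) -> int:
    """LUTs needed for a `width`-bit delay line of `depth` stages."""
    return width * -(-depth // 16)  # ceiling division by 16 stages per LUT


def flipflops_for_delay(depth: int, width: int) -> int:
    """Flip-flops (one per logic cell) needed for the same delay line."""
    return width * depth


if __name__ == "__main__":
    print(luts_for_delay(16, 8), flipflops_for_delay(16, 8))
```

For an 8-bit wide, 16-stage tap delay this gives 8 LUTs against 128 flip-flops. The 17:1 figure quoted above presumably also counts the flip-flop that sits behind each LUT as one more stage; the sketch does not model that.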
(This compares marketing gate counts (600E vs 400E) and actual logic
cells (Xilinx actually claims 15.5K, but 13.8K is the actual number of
4-LUTs). The 288 Kbits of RAM in Xilinx is the block RAM; there can be
up to 216 Kbits more in the LUTs (which would leave zero for logic). The
208K for Altera is for block RAM only. For each user, a better measure
might be to find the product in each vendor's product line that can hold
a given design, and compare actual price. This gets away from inflated
gate and RAM claims, and whether or not it makes sense to trade logic
for RAM.)
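The 216 Kbit figure can be checked with a couple of lines of Python. The exact LUT count of 13,824 is an assumption consistent with the rounded "13.8K" above, and a Kbit is taken as 1024 bits.

```python
KBIT = 1024          # bits per Kbit (assumed binary convention)
BITS_PER_LUT = 16    # a 4-LUT used as a 16x1 RAM


def lut_ram_kbit(luts: int) -> float:
    """Distributed RAM available if every LUT is used as memory."""
    return luts * BITS_PER_LUT / KBIT


if __name__ == "__main__":
    luts = 13_824            # assumed exact count behind the "13.8K" figure
    block_ram_kbit = 288     # block RAM figure quoted in the discussion
    print(lut_ram_kbit(luts))                   # distributed RAM only
    print(block_ram_kbit + lut_ram_kbit(luts))  # upper bound, no logic left
```

The second number is only a theoretical ceiling, since it leaves no LUTs for logic.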
Precisely - as with many of the architectural differences,
if you need the feature, it's brilliant; otherwise, it has no (or even a
negative) impact.
As far as the memory blocks go, the Altera
blocks have built-in circuitry to allow them to be used as CAM (content
addressable memory). The Altera CAMs have a huge performance advantage
over trying to implement CAMs in Xilinx devices using memory blocks and
some logic. The Altera memory blocks can also be used to implement
fast, wide, product-term logic. (Xilinx block
RAMs can too) This is useful, for example,
for implementing a wide address decode in few levels of logic. With that
said, I will agree that the Xilinx dual-port mode is more full-featured
than the APEX 20KE dual-port (although the advantage disappears when you
compare Virtex II vs. Apex II).
This is with APEX and later families. As far as I can tell,
the Flex devices don't have this ability. Again, great if you need it!
On the subject of block memories, the advantages
of one over the other are not as clear. Xilinx does have a true dual port
capability where Altera's memory is at best (depends on the family) a read
only port and a write only port through the 20K family. This is fine
for many designs, so unless you need it, not having it is not a problem.
Altera does have two very nice unique capabilities in the 20K memories:
a CAM mode and the product term mode. The CAM is more than nice to
have for network apps and places where you need to sort data. While you
can do a CAM in Xilinx, the design is neither trivial nor particularly
fast (either the fetch or the write operation has to take multiple clocks;
see the Xilinx app notes for details). The product term capability is reminiscent
of a CPLD, which is very handy when dealing with big combinatorial functions
such as address decodes.
The flipflops in the logic cells differ in that the Xilinx logic cell has
a dedicated clock enable input, whereas Altera use one of the inputs to
the LUT to create a CE signal. In addition, the Altera flip-flops only have
a clear input. If you want a preset, the tools will put NOT gates on the
input and output of the DFF, which means that you can't have a preset flip-flop
implemented in the I/O cell, so your Tco can suffer badly. The
diagram in the datasheet implies a preset input, but on reading the text
you discover the truth!
The Altera does have a true clock enable
on the LE flip-flop but (except for the 20K) it shares an input to the
LE with one of the Lut inputs, so using the clock enable reduces the available
functionality of the Lut. In the case of arithmetic logic, using
the CE limits you to a single input for one level of logic.
FLEX 8000: No clock enable. Software emulates the clock enable by
building it into the logic.
FLEX 6000: No clock enable. Software emulates the clock enable by
building it into the logic.
FLEX 10K, 10KE, ACEX 1K: Clock enable uses one of the LUT's data
inputs (per the author's original comment).
APEX 20K, 20KE, Mercury, APEX II: Regular clock enable.
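For readers unfamiliar with what "building the clock enable into the logic" means, here is a minimal Python sketch. It models the usual trick of a 2:1 mux in front of the register that feeds the current output back when the enable is low. The function names are made up, and whether a given fitter implements it exactly this way is an assumption.

```python
from itertools import product


def dff_with_ce(q: int, d: int, ce: int) -> int:
    """Next state of a flip-flop with a dedicated clock-enable pin."""
    return d if ce else q


def dff_emulated_ce(q: int, d: int, ce: int) -> int:
    """Next state when the enable is folded into the LUT as a 2:1 mux.

    The register is clocked every cycle; the LUT feeds back the current
    output whenever ce is low, so the enable consumes logic that the
    user function could otherwise have used.
    """
    lut_out = (ce & d) | ((1 - ce) & q)
    return lut_out


def equivalent() -> bool:
    """Exhaustively compare both versions over all 1-bit input combinations."""
    return all(
        dff_with_ce(q, d, ce) == dff_emulated_ce(q, d, ce)
        for q, d, ce in product((0, 1), repeat=3)
    )


if __name__ == "__main__":
    print(equivalent())
```

The two are functionally identical; the cost of the emulated version is purely in LUT inputs, which is the point being argued above.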
The logic cells allow you to implement EITHER
an asynchronous clear OR an asynchronous preset. You can't do both
without using additional logic cells, but you can implement either, even
in the I/O cell. By the way, the tco increases by only 0.233ns when
using a register near the periphery rather than a register in the I/O cell
(APEX EP20K30ETC144-1).
Provided you can get the register to be consistently
located adjacent to the IOB (can be difficult as the device gets full).
Depending on registers placed in the core rather than in the IOB leads
to external timing being a function of the place and route solution...not
a good thing. Incidentally, this is also a problem in the 10K if you
need bi-directional I/O since there is only one flip-flop in the IOB.
If I can return to the Flex architecture, which is what I began
the article comparing, according to the 10KE datasheet, an async preset
is implemented in one of two ways:
- Using the clear and inverting both input and output. Inverting the input
  is 'free' but inverting the output requires a LUT between your register
  and the pin. Hence, it's not just a case of moving the register out of
  the I/O element; there's extra logic to consider.
- Admittedly, I missed the other way of doing it, which is to use one of
  the LUT inputs as a preset. But then you've lost a LUT input, so that's
  not always possible either.
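The first method (clear plus two inversions) can be sketched in Python as follows. This is a behavioural toy with invented class names, intended only to show why the output inverter is unavoidable; it says nothing about where the tools actually place that inverter.

```python
class ClearOnlyDFF:
    """A flip-flop whose only asynchronous control is a clear."""

    def __init__(self) -> None:
        self.q = 0

    def clock(self, d: int) -> None:
        self.q = d & 1

    def clear(self) -> None:
        self.q = 0


class EmulatedPresetDFF:
    """Preset flip-flop built from a clear-only one by inverting D and Q."""

    def __init__(self) -> None:
        self._ff = ClearOnlyDFF()

    def clock(self, d: int) -> None:
        self._ff.clock(1 - (d & 1))  # input inverter (absorbed into the LUT)

    def preset(self) -> None:
        self._ff.clear()  # clearing the inner register...

    @property
    def q(self) -> int:
        return 1 - self._ff.q  # ...reads as 1 after the output inverter


if __name__ == "__main__":
    ff = EmulatedPresetDFF()
    ff.clock(0)
    ff.preset()
    print(ff.q)
```

The inversion in the `q` property is the extra logic that ends up between the register and the pin.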
Altera's new Mercury family has a different logic structure, including
two carry chains, so the arguments are probably different. I haven't had
time/inclination to do any detailed analysis.
I/O
Both families offer similar I/O standards. The biggest difference
is that the Altera I/O cell has a single register, which can be used as
an output, input or OE register. The Xilinx I/O has all three available
for use. Note that the diagrams in the Altera datasheet imply that they
have the same capability, but on reading the text you find that the picture
shows all the possibilities at once!
You're right about that diagram in the
datasheet. Also, you can't use the register in the I/O cell for the OE
either - just input or output. However, note the comment above regarding
using nearby registers not in the I/O cell. The performance penalty
in most cases is less than 1ns for using a non-I/O cell register.
Fair comment - I admit to being a bit bitter about what I consider
to be misrepresentation of the truth in the diagram - still, I've learned
not to trust the pictures and read the words now!
This is not true for Mercury, which has three flipflops in the I/O cell,
and ApexII which has six, for DDR applications.
The Virtex/Apex comparison of their respective LVDS implementations
is interesting. As far as I can gather the SerDes function is implemented
in the FPGA fabric for Virtex, and in custom silicon for Apex. This means
that you only get proper SerDes LVDS support with the larger Apex devices.
The dedicated SERDES circuitry in the
APEX devices allows you to move data around inside the device at 105 MHz
and drive it out the LVDS drivers at 840Mbps. The Xilinx solution
requires routing data and clocks around internally at 320 MHz (not simple)
and they use both edges of the clock to drive data at 640Mbps. Also,
the LVDS drivers in the Altera part are balanced (equal rise and fall times)
providing a much better eye-diagram than what you get from the unbalanced
drivers in the Xilinx device. The Xilinx solution also requires an
external resistor network to get the right LVDS voltage levels. Finally,
the Apex 20KE devices have dedicated de-skew circuitry in the LVDS receivers.
This prevents the board designer from having to make all the signal traces
exactly the same length. It's hard to argue that the Altera LVDS
solution isn't significantly superior (Apex 20KE vs Virtex-E), but I do have
to admire the fact that Xilinx was able to coax 640 Mbps LVDS out of drivers
that were never intended to do LVDS. Altera's general-purpose I/Os
have trouble making it to 200 Mbps with a Xilinx-type solution.
As far as Apex II and Virtex II, I have yet to
see details on the Virtex II LVDS. Apex II increased LVDS performance
to 1 Gbps and put it on more channels. Apex II also improved the
clock de-skew circuitry to reduce even further the need to carefully hand-route
the board-level LVDS signals.
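The data rates quoted above follow from simple arithmetic, sketched here in Python. The serialization factor of 8 for the APEX SERDES is inferred from the two numbers given (840 / 105) rather than stated in the discussion, and the function names are made up.

```python
def serdes_rate_mbps(core_clock_mhz: float, serialization_factor: int) -> float:
    """Line rate when a dedicated SERDES multiplies up the core data rate."""
    return core_clock_mhz * serialization_factor


def ddr_rate_mbps(clock_mhz: float) -> float:
    """Line rate when data is driven on both edges of the clock."""
    return clock_mhz * 2


if __name__ == "__main__":
    # Dedicated SERDES: slow internal clock, factor inferred as 8.
    print(serdes_rate_mbps(105, 8))
    # Fabric-based approach: fast internal clock, both edges used.
    print(ddr_rate_mbps(320))
```

The contrast is in the internal clock: 105 MHz in the fabric for the dedicated SERDES against 320 MHz when the serialization is done in general logic.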
Also good comments, from someone who has actually done it,
rather than simply my reading of those datasheets and appnotes!
Routing
The routing structures are also different. Altera's main routing
strategy is to have many lines connecting the entire chip together. This
is in contrast to the Xilinx approach, which consists of a hierarchy of
short, medium and long connections. This makes the job of the place and
route tool harder in the Xilinx devices, unless it is guided. The downside
for Altera is that larger devices get slower as there is more capacitance
to drive.
The routing structures of the Xilinx and Altera
families are very different; each has different abilities. The Altera structure
is a hierarchical structure akin to that of a CPLD. At the lowest
level, there are very fast connections between the logic elements (LE's,
which consist of a flip-flop and a 4-Lut each) within a LAB (logic array
block-with 8 to 10 LE's). These connections are great for very fast
state machines, but are useless for arithmetic because the carry chain
also runs thru the LAB. The next level up in the routing hierarchy
connects the LABs in a row together. The row routes run halfway or
all the way across the chip in 10K, with switches connecting to selected
LAB's. The rows are then interconnected by column routes. A LAB can
drive a row or column route directly, but can only receive input from a
row route. This structure has the advantage of having uniform delays for
any connections using similar hierarchical resources. That in turn makes
placement less critical. Unfortunately, it also means even local connections
incur the delay associated with a cross-chip connection. A bigger
problem appears with heavily arithmetic designs because the routing in
and out of every arithmetic LE is forced onto the row routing. There
are only six row routes for every eight LE's in a row, so even with perfect
routing in a heavily arithmetic data-flow design, the row can only be 75%
occupied. The row interconnect matrix is sparsely populated (any one LAB
can only directly connect to a fraction of the LAB's on the same row). As
the row fills up, some of the connections have to be made via a third LAB,
adding to the delay and further congesting the row routes. In a math intensive
design, system performance often falls off sharply at 50 to 60% device
utilization. The global nature of the row and column routes also
means that performance degrades with increasing device size.
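The 75% ceiling mentioned above is just the ratio of row routes to logic elements, as this small Python sketch shows. It assumes every arithmetic LE needs exactly one row route, which is the simplification implied by the paragraph; the function name is hypothetical.

```python
def max_row_occupancy(row_routes: int, logic_elements: int) -> float:
    """Upper bound on the fraction of LEs usable when each needs a row route."""
    return min(1.0, row_routes / logic_elements)


if __name__ == "__main__":
    # Six row routes for every eight LEs, as stated in the discussion.
    print(max_row_occupancy(6, 8))
```

This is a best case with perfect routing; the sparse interconnect described above pushes the practical figure lower.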
The 20K architecture fixes many of the routing
problems of the earlier families cited above. Another hierarchical
layer is added between the row route and the LAB, which has the effect
of localizing connections that previously had to go on the row tracks.
Since those connections don't have to cross the chip, they are faster.
To fix the arithmetic connections, direct connections have been added from
each LAB to the LE's in the adjacent LAB's in the so called megaLAB.
The Xilinx routing structure is a mix of different
length wires connected by switches. For the more local connections,
very fast single length connections are used. Longer connections use the
longer wires to minimize the number of switch nodes traversed. The
routing delays have a strong dependence on the connection distance, so
placement is critical to performance. This can make performance elusive
to the novice user, but on the other hand, the segmented routing means
extreme performance is available if you are willing to do some work to
get it.
Bottom line is that the Altera routing is more
forgiving for moderate designs at moderate densities, which makes it easier
for users and tools alike. However, the same things that make it easier
for those designs are roadblocks for higher performance.
Tools
Both vendors now ship FPGA Express for compiling/synthesising.
Altera also offer Leonardo Spectrum, which in my opinion is vastly better
than the Synopsys tool. Synplify would still be my synthesiser of choice,
but that isn't likely to be free any time soon!
Altera-specific versions of FPGA Express
and Leonardo Spectrum are offered FREE on the Altera web site. You
do not need a subscription to get them. However, if you do get a subscription,
you also get ModelTech's Modelsim program.
The place and route (Xilinx) and Fitter (Altera) tools both accomplish
the same job. At the time of my investigations (1999) the design I was
benchmarking would take several hours to p&r for Xilinx, rather than
several minutes for the Altera tools. This is mainly due to the difficulties
caused by the Xilinx architecture to the tools. Note that no effort was
made to guide the tools, other than providing timing constraints, as the
environment I work in places a high priority on speed of turnaround. I'm
told (by Xilinx) that things are much improved with the new tools, but
I haven't been able to compare.
It's quite possible that I could have done the job in a smaller/cheaper
Xilinx part, but our production volumes were extremely small, so the time
taken to create/debug the design on the bench was a priority.
Other bits
Xilinx have DLLs, Altera have PLLs. Altera claim PLLs are better
because they give you proper 'analogue' control over the timing of your
clocks. Xilinx claim DLLs are better because they are not analogue and
therefore easier to deal with. Xilinx have an interesting appnote
comparing the two, but they have subtracted the jitter of their source
clock from the Xilinx numbers and not from the Altera measurements. They
didn't measure the jitter of the Altera input, so it's difficult to judge
if the PLLs are the cause of the jitter they measure or not. In the interests
of fairness, you can look at Altera's
jitter comparison; however, it seems to give far less experimental
detail. I feel I could reproduce the Xilinx experiment to verify
the results if I wanted to!
One significant difference between the
PLLs and the DLLs that you missed is the ability of the PLLs to create
non-integer multiples of the input clock. In fact, the Altera PLL can multiply
the input clock by m/(n*k), where m is any value from 1 to 160 and (n*k)
is also any value from 1 to 160. Check out App Note 115 for details
on the PLLs.
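To illustrate what non-integer multiplication buys you, here is a Python sketch that searches for the best m/(n*k) ratio for a target frequency, using the 1 to 160 ranges quoted above. It treats (n*k) as a single divisor and ignores any VCO range or other limits from the datasheet, so it is an idealised search rather than a model of the real PLL; the function name is invented.

```python
from fractions import Fraction

M_RANGE = range(1, 161)    # multiplier m, per the ranges quoted above
DIV_RANGE = range(1, 161)  # combined divisor (n*k), treated as one value


def best_ratio(f_in_mhz, f_target_mhz):
    """Return (m, divisor, f_out_mhz) with f_out closest to the target."""
    target = Fraction(f_target_mhz) / Fraction(f_in_mhz)
    best = None
    for m in M_RANGE:
        for div in DIV_RANGE:
            err = abs(Fraction(m, div) - target)
            if best is None or err < best[0]:
                best = (err, m, div)
    _, m, div = best
    f_out = float(Fraction(f_in_mhz) * Fraction(m, div))
    return m, div, f_out


if __name__ == "__main__":
    # Hypothetical example: derive 33 MHz from a 25 MHz input.
    print(best_ratio(25, 33))
```

A ratio such as 33/25 is exactly the kind of thing an integer-only multiplier or divider cannot produce.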
Summary
Xilinx
- Potentially smaller and cheaper devices
- Good at arithmetic functions
- Flexible I/Os
- Longer compile times
- More complex tools
- More capable tools for the power user
- Both small and large blocks of embedded RAM
- Proper dual port RAM
Altera
- Quick compile
- Simple tools
- Less flexible tools for the power user
- Flex and Apex make it tricky to make fast bi-directional I/O
- Less capable arithmetic
- No small blocks of embedded RAM
- RAM has one read and one write port, not proper dual ported.
The conclusion about compile times does
not hold for all designs. The compile time for dense arithmetic designs
in Altera can literally take days where a similar design in Xilinx can
finish in under an hour with decent floorplanning. Floorplanning in Altera
is not well supported and frankly won't provide as much benefit as it does with
Xilinx.
Because of Altera's row/column architecture, Altera
has been able to design-in redundant rows and columns. If a fab defect
is found, a redundant row can be switched in and the die is saved rather
than thrown away. Since the biggest cost-driver is die size and yield,
I would have to dispute the "potentially... cheaper" devices claim.
As far as smaller goes, I would have to agree that Xilinx has a wider product
offering at the small end of the FPGA size spectrum.
(The reality of whether one vendor's parts are
cheaper than the other is independent of whether the device includes redundancy
logic. The efficiency of the architecture (gates per some metric of silicon
usage such as area or transistors), the implementation geometry, test costs,
volume, package type, and many other factors all affect the manufacturing
cost. The user pays a "Price" not a "Cost", and this price depends on the
cost, as well as the supplier's profit margin, and how good you are at
negotiating lower prices :-) . While redundancy may help reduce the cost,
what matters in the end to the end user is the price they pay for a device
that meets their needs. )
Quite right Philip. And there's more than the piece price to
think about. If the tools/architecture/whatever allow you to get to market
quicker, or your volumes are so low that the development costs outweigh
the FPGA price (as they do in my particular application), different things
become more important.
Regarding "potentially... cheaper", maybe it would be better to say
"in some applications, potentially cheaper". And therefore the same should
apply to Altera!
I'm also going to have to raise an issue with the
"More capable tools for the power user". Just because Altera's tools
have a nicer GUI doesn't mean that the tools are not for the power user.
Quartus II has a built-in TCL console for creating scripts that can do
everything that you can do in the Xilinx tools.
Well... no! Show me where in their tools
you can look at and edit individual wires in the device. You can do that
in Xilinx's FPGA editor. How about specifying placement in your source
(the edif netlist)? It sure would be nice to be able to constrain the two
level arithmetic logic and the registers driving it to lie in the same
row. Cliques gave the tools a *HINT* that you want to keep stuff
together in the max plus tools, but only if there was a small number of
them. Last I checked, Quartus still could not use cliques.
If you don't like to use the menus, ask your local
Altera FAE and he can provide you with a library of TCL functions (ask
for the PowerKit) that will allow you to create constraints like "Real
Men" do rather than use the GUI.
This is probably my fault - I was referring to Maxplus2 which
I have consistently failed to get to do what I want with placing certain
logic cells - due to the Quartus fitter ignoring all my assignments - and
the older fitter not being able to get close to my timing requirements.
Approaches to our local FAE, Altera direct and the c.a.f newsgroup all
hit a brick wall. My cursory inspection of Quartus a while ago did lead
me to the idea that it was much more capable in this area, but as I've
not gone beyond 10K I have no 'real world' comments to make. I do use emacs
to enter my constraints in the acf file though :-)
I can only encourage you to check out the literature
and talk to the FAEs from both Altera and Xilinx to get a more balanced
view of the strengths and weaknesses of the two architectures.
I have read the literature, and spoken to FAEs from both companies.
I think much of our misunderstanding probably stems from the fact that
I initially wrote this piece based on 10K compared with 4000, with comments
thrown in about other architectures just to confuse the issue! Sorry about
that!