FPGA-FAQ 0014

How do I choose between Altera and Xilinx?

Vendor	Both
FAQ Entry Author	Martin Thompson
FAQ Entry Rebuttal/Commentary	Anonymous Altera Fan
FAQ Entry Additional Analysis	Ray Andraka
FAQ Entry Editor	Philip Freidin
FAQ Entry Date	7 June 2001

One of our many readers suggested that the only way to read this particular page was to do it
listening to Arlo Guthrie's Alice's Restaurant, and I agree. Here it is !

Q. How do I choose between Altera and Xilinx?

A.
Here's a quick (well, actually it's not as quick as I expected) description of how I chose between Altera and Xilinx. I was comparing Altera's Flex/APEX with Xilinx's XC4000/Virtex families. Some comments are thrown in about Virtex and Apex and so on, but I will be the first to admit I haven't done any more than read the datasheets. The Flex10KE I have used in anger.
A lot of what I have learnt about the two families has come from reading comp.arch.fpga, and in particular, a discussion I had with Ray Andraka. Most of the rest came from reading datasheets and appnotes.
Due to the discursive nature of this bit, it is indented with later comments being more indented...
Logic

The structure of the Xilinx logic cells is well suited to arithmetic structures, compared to the Altera Flex/Apex structure, due to the ability to generate both output and carry from one logic cell. Altera's 4-LUT is divided into two 3-LUTs for arithmetic.
I think you misunderstand the basic Altera Logic element. The carry and sum outputs are implemented in a single logic cell. Just as a Xilinx logic cell can be reconfigured to act as a 16-bit memory cell, the Altera logic cell has several configurations optimized for arithmetic, counters, or misc logic. There is no speed penalty for using these modes and there is no inherent advantage to the Xilinx logic cell with regard to arithmetic. I hear all kinds of claims about Xilinx architectural advantages, but I have never heard even the most ardent Xilinx user claim that the Xilinx architecture has an arithmetic advantage in the logic cells.
I think the point is that Altera can only do a 2-bit arithmetic operation with carry in and out - in Flex at least (ref. figure 11 in the 10KE datasheet). Also, with Flex, if you use a CE, it has to take the place of an arithmetic input as well, leaving you with a one-input function (unless you use cascade chains to make it wider, with their admittedly small routing delay).
Xilinx does have a distinct advantage over Altera when it comes to arithmetic circuits. For arithmetic, the Altera 4-Lut does indeed get partitioned into a pair of 3-Luts: one for the 'sum' function, and one for the 'carry' function. One input to each of these is connected to the carry out from the previous bit. As a result, you are limited to a two input arithmetic function if you wish to stay in one level of logic. Arithmetic functions with more than two inputs, such as adder-subtractors, multiplier partial products, mux-adds, and accumulators with loads or synchronous clears (this last one is addressed by improvements in the 20K family) require two levels of logic to implement. The Xilinx logic cell does not use the Lut for the carry function; it has dedicated carry logic.
The 4K/Spartan families use one Lut input to connect the carry chain, leaving three inputs for your function. Virtex, VirtexII, and SpartanII have a dedicated xor gate after the Lut to do this, so these devices can handle 4 input arithmetic functions without having to go to two levels. The relatively limited arithmetic function of the Altera parts means as many as twice the Luts are used in heavily arithmetic applications. Two levels of logic also equate to a significant performance penalty, everything else being equal.
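To make the input-count argument concrete, here is a minimal Python sketch of one adder bit built the way the comments above describe the Flex arithmetic mode: the 4-LUT is treated as two 3-input tables, one for 'sum' and one for 'carry', and each table spends one input on the incoming carry. This is only a behavioural illustration, not vendor code, and the names (`make_lut3`, `ripple_add`) are made up for the example.

```python
from itertools import product

def make_lut3(func):
    """Build an 8-entry lookup table from a 3-input boolean function."""
    return [func(a, b, c) & 1 for a, b, c in product((0, 1), repeat=3)]

def read_lut3(table, a, b, c):
    """Look up the table entry addressed by the three input bits."""
    return table[(a << 2) | (b << 1) | c]

# One 3-LUT for the sum bit, one for the carry-out; carry-in uses an input of each.
SUM_LUT = make_lut3(lambda a, b, cin: a ^ b ^ cin)
CARRY_LUT = make_lut3(lambda a, b, cin: (a & b) | (a & cin) | (b & cin))

def ripple_add(x, y, width):
    """Add two unsigned integers bit by bit; returns (sum modulo 2**width, carry_out)."""
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        s = read_lut3(SUM_LUT, a, b, carry)
        carry = read_lut3(CARRY_LUT, a, b, carry)
        result |= s << i
    return result, carry
```

Only `a` and `b` are left as data inputs per bit, so anything that needs a third signal in the same bit (an add/subtract control, a load, a mux select) has no table input left, which is the "two levels of logic" point made above.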
Xilinx's logic cells can also be used as 16 bit shift registers or 16x1 SRAMs for small amounts of storage. In addition, in Virtex there are BlockRAMs which are larger blocks of dual-ported memory. Altera only has large blocks of RAM called EABs which are configurable between 256x16bits and 4096x1bit. They are also only partially dual-ported (one read and one write port).
The ability to convert the Logic cell into memory is a neat feature. This is one of the key differences in the architectures. My only comment on that is that it isn't used as much as you might think. Xilinx parts have a much lower logic cell count relative to device size since they include so much RAM (example: XCV600E: 13.8K logic cells, 288Kbits RAM, 20K400E: 16.6K logic cells, 208Kbits RAM). Because of this it doesn't usually make sense to take away your less abundant resource (logic cells) to create more of something you already have lots of (memory). Nonetheless it is a neat and sometimes quite useful feature.
For DSP designs, the CLB RAM capability is another significant advantage over the Altera offerings. DSP designs tend to have many small delay queues (filter tap delays, for example) which use up a lot of logic cells if implemented as flip-flops, or severely under-utilize block memories if done there. By using the CLB RAMs (or in the case of Virtex, the shift register mode), you get up to a 17:1 area reduction over using Lut flip-flops. Similar reductions come into play for designs having register files and small fifos. The Virtex SRL16 primitive also gives you the capability to reload Lut contents without reconfiguring the device. This makes it possible to have re-programmable coefficients in a distributed arithmetic filter for instance. There is simply no equivalent capability in the Altera devices. My Virtex designs typically have more than half of the Luts configured as SRL16's.
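As a rough illustration of the delay-queue argument, the sketch below models a 16-deep, 1-bit shift register with an addressable tap and compares cell counts for a short delay line. It is a behavioural model only. The class name `SRL16` is borrowed from the primitive mentioned above, `delay_line_cost` is a hypothetical helper, and the reading of "17:1" as 16 shift stages plus the cell's own flip-flop is my assumption, not something the text spells out.

```python
class SRL16:
    """Behavioural model of a 16-deep, 1-bit shift register with an addressable tap."""

    def __init__(self):
        self.bits = [0] * 16

    def clock(self, din):
        """Shift in one bit; the oldest bit falls off the end."""
        self.bits = [din & 1] + self.bits[:15]

    def read(self, addr):
        """Tap the register: addr 0 is the newest bit, addr 15 the oldest."""
        return self.bits[addr]


def delay_line_cost(depth):
    """Cells needed for a 1-bit delay of `depth` clocks, for 1 <= depth <= 17.

    Returns (cells using one flip-flop per stage, cells using one shift-register LUT).
    Assumption: the 17th stage comes from the flip-flop that sits behind the LUT.
    """
    if not 1 <= depth <= 17:
        raise ValueError("this sketch only covers delays of 1 to 17 clocks")
    return depth, 1
```

For a full 17-clock delay this gives 17 cells against 1, which matches the "up to 17:1" figure; shorter delays save proportionally less.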
(This is comparing marketing gate counts (600E vs 400E), actual logic cells (Xilinx actually claims 15.5K, but 13.8K is the actual number of 4LUTS). The 288Kbits of RAM in Xilinx is the block RAM, there can be up to 216Kbits more in the LUTs (which would leave zero for logic). The 208K for Altera is for block RAM only. For each user, a better measure might be to find the product in each vendor's product line that can hold a given design, and compare actual price. This gets away from inflated gate and RAM claims, and whether or not it makes sense to trade logic for RAM)
Precisely - as with many of the architectural differences, if you need the feature, it's brilliant, otherwise, it has no (or even a negative) impact.
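The 216Kbits figure in the editor's note can be checked with a few lines of arithmetic. The sketch below assumes that "13.8K" is a rounding of 13,824 4-LUTs and that 1 Kbit means 1024 bits; both are my assumptions, and the function name is made up for the example.

```python
LUT_BITS = 16  # a 4-input LUT holds 2**4 = 16 configuration bits

def lut_ram_kbits(lut_count, bits_per_lut=LUT_BITS):
    """Distributed RAM in Kbits (1 Kbit = 1024 bits) if every LUT is used as RAM."""
    return lut_count * bits_per_lut / 1024

# Assumption: 13,824 is the exact LUT count behind the rounded "13.8K" in the text.
XCV600E_LUTS = 13_824
XCV600E_BLOCK_RAM_KBITS = 288  # block RAM figure quoted in the text

distributed = lut_ram_kbits(XCV600E_LUTS)
total_if_no_logic = XCV600E_BLOCK_RAM_KBITS + distributed
```

Under those assumptions `distributed` comes out at 216 Kbits and the total at 504 Kbits, but only with every LUT given over to RAM and none left for logic, which is exactly the trade-off being argued about.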

As far as the memory blocks go, the Altera blocks have built-in circuitry to allow them to be used as CAM (content addressable memory). The Altera CAMs have a huge performance advantage over trying to implement CAMs in Xilinx devices using memory blocks and some logic. The Altera memory blocks can also be used to implement fast, wide, product-term logic. (Xilinx block RAMs can too) This is useful, for example, for implementing a wide address decode in few levels of logic. With that said, I will agree that the Xilinx dual-port mode is more full-featured than the APEX 20KE dual-port (although the advantage disappears when you compare Virtex II vs. Apex II).
This is with APEX and later families. As far as I can tell, the Flex devices don't have this ability. Again, great if you need it!
On the subject of block memories, the advantages of one over the other are not as clear. Xilinx does have a true dual port capability where Altera's memory is at best (depends on the family) a read only port and a write only port through the 20K family. This is fine for many designs, so unless you need it, not having it is not a problem. Altera does have two very nice unique capabilities in the 20K memories: a CAM mode and the product term mode. The CAM is more than nice to have for network apps and places where you need to sort data. While you can do a CAM in Xilinx, the design is neither trivial nor particularly fast (either the fetch or the write operation has to take multiple clocks; see the Xilinx app notes for details). The product term capability is reminiscent of a CPLD, which is very handy when dealing with big combinatorial functions such as address decodes.
The flipflops in the logic cells differ in that the Xilinx logic cell has a dedicated clock enable input, whereas Altera use one of the inputs to the LUT to create a CE signal. In addition the Altera flip flops only have a clear input. If you want a preset, the tools will put NOT gates on the input and output of the DFF. Which means that you can't have a preset flipflop implemented in the I/O cell - therefore your Tco can suffer badly. The diagram in the datasheet implies a preset input, but on reading the text you discover the truth!
The Altera does have a true clock enable on the LE flip-flop but (except for the 20K) it shares an input to the LE with one of the Lut inputs, so using the clock enable reduces the available functionality of the Lut. In the case of arithmetic logic, using the CE limits you to a single input for one level of logic.
FLEX 8000: No clock enable. Software emulates the clock enable by building it into the logic
FLEX 6000: No clock enable. Software emulates the clock enable by building it into the logic
FLEX 10K, 10KE, ACEX 1K: Clock enable uses one of the LUT's data inputs (per the author's original comment)
APEX 20K, 20KE, Mercury, APEX II: Regular clock enable.
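For readers who have not met the idea of a clock enable "built into the logic", here is a minimal sketch of what that emulation amounts to: the flip-flop is clocked on every cycle and a multiplexer in front of it either passes the new data or feeds the old output back. The function names are hypothetical and this models behaviour only, not how any particular fitter maps it.

```python
def next_q(q, d, ce):
    """Next flip-flop state when the clock enable is emulated with a 2:1 mux.

    With ce = 1 the new data d is loaded; with ce = 0 the old value q is re-loaded.
    """
    return (ce & d) | ((1 - ce) & q)

def run(d_seq, ce_seq, q0=0):
    """Clock the emulated-CE flip-flop through a sequence; returns Q after each clock."""
    q = q0
    history = []
    for d, ce in zip(d_seq, ce_seq):
        q = next_q(q, d, ce)
        history.append(q)
    return history
```

The mux is ordinary combinational logic, so it has to come out of the same lookup table as the user's function. That is why emulating the enable, or sharing a LUT input with it, eats into the function you can fit in one cell.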

The logic cells allow you to implement EITHER an asynchronous clear OR an asynchronous preset. You can't do both without using additional logic cells, but you can implement either, even in the I/O cell. By the way, the tco increases by only 0.233ns when using a register near the periphery rather than a register in the I/O cell (APEX EP20K30ETC144-1).
Provided you can get the register to be consistently located adjacent to the IOB (can be difficult as the device gets full). Depending on registers placed in the core rather than in the IOB leads to external timing being a function of the place and route solution...not a good thing. Incidentally this is also a problem in the 10K if you need bi-directional I/O since there is only one flip-flop in the IOB.
If I can return to the Flex architecture, which is what I began the article comparing, according to the 10KE datasheet, an async preset is implemented in one of two ways:

Using the clear and inverting both input and output. Inverting the input is 'free' but inverting the output requires a LUT between your register and the pin. Hence, it's not just a case of putting the register not in the I/O element, there's extra logic to consider.

Admittedly, I missed the other way of doing it, which is to use one of the LUT inputs as a preset. But then you've lost a LUT input, so that's not always possible either.
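The first method (clear plus inversion on both sides) is easy to convince yourself of with a small behavioural model. The sketch below is only an illustration with made-up class names: it wraps a clear-only flip-flop with an inverter on the input and another on the output and shows that the wrapper behaves like a flip-flop with a preset.

```python
class ClearOnlyDFF:
    """A D flip-flop whose only asynchronous control is clear (Q forced to 0)."""

    def __init__(self):
        self.q = 0

    def clock(self, d):
        self.q = d & 1

    def clear(self):
        self.q = 0


class PresetViaInversion:
    """Preset behaviour built from a clear-only flip-flop and two inverters."""

    def __init__(self):
        self._ff = ClearOnlyDFF()

    def clock(self, d):
        # Inverter on the D input (described in the text as 'free').
        self._ff.clock(1 - (d & 1))

    def preset(self):
        # Clearing the inner flip-flop makes the inverted output go to 1.
        self._ff.clear()

    @property
    def q(self):
        # Inverter on the Q output (the one the text says costs a LUT before the pin).
        return 1 - self._ff.q
```

The output inverter is the expensive part in the argument above, because it sits between the register and the pin.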

Altera's new Mercury family has a different logic structure, including two carry chains, so the arguments are probably different. I haven't had time/inclination to do any detailed analysis.

I/O

Both families offer similar I/O families. The biggest difference is that the Altera I/O cell has a single register, which can be used as an output, input or OE register. The Xilinx I/O has all three available for use. Note that the diagram in the Altera datasheet implies that they have the same capability, but on reading the text you find that the picture shows all the possibilities at once!
You're right about that diagram in the datasheet. Also, you can't use the register in the I/O cell for the OE either - just input or output. However, note the comment above regarding using nearby registers not in the I/O cell. The performance penalty in most cases is less than 1ns for using a non-I/O cell register.
Fair comment - I admit to being a bit bitter about what I consider to be misrepresentation of the truth in the diagram - still, I've learned not to trust the pictures and read the words now!

This is not true for Mercury, which has three flipflops in the I/O cell, and ApexII which has six, for DDR applications.
The Virtex/Apex comparison of their respective LVDS implementations is interesting. As far as I can gather the SerDes function is implemented in the FPGA fabric for Virtex, and in custom silicon for Apex. This means that you only get proper SerDes LVDS support with the larger Apex devices.
The dedicated SERDES circuitry in the APEX devices allows you to move data around inside the device at 105 MHz and drive it out the LVDS drivers at 840 Mbps. The Xilinx solution requires routing data and clocks around internally at 320 MHz (not simple) and they use both edges of the clock to drive data at 640 Mbps. Also, the LVDS drivers in the Altera part are balanced (equal rise and fall times) providing a much better eye-diagram than what you get from the unbalanced drivers in the Xilinx device. The Xilinx solution also requires an external resistor network to get the right LVDS voltage levels. Finally, the Apex 20KE devices have dedicated de-skew circuitry in the LVDS receivers. This saves the board designer from having to make all the signal traces exactly the same length. It's hard to argue that the Altera LVDS solution is not significantly superior (Apex 20KE vs Virtex-E), but I do have to admire the fact that Xilinx was able to coax 640 Mbps LVDS out of drivers that were never intended to do LVDS. Altera's general-purpose I/Os have trouble making it to 200 Mbps with a Xilinx-type solution.
As far as Apex II and Virtex II, I have yet to see details on the Virtex II LVDS. Apex II increased LVDS performance to 1 Gbps and put it on more channels. Apex II also improved the clock de-skew circuitry to reduce even further the need to carefully hand-route the board-level LVDS signals.
Also good comments, from someone who has actually done it, rather than simply my reading of those datasheets and appnotes!
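The two LVDS data rates quoted above follow from simple arithmetic, sketched below. The serialization factor of 8 is not stated in the text; it is inferred from 840 / 105 and should be treated as an assumption. The function names are made up for the example.

```python
def serdes_rate_mbps(core_clock_mhz, serialization_factor):
    """Line rate when a dedicated serializer shifts out several bits per core clock."""
    return core_clock_mhz * serialization_factor

def ddr_rate_mbps(clock_mhz):
    """Line rate when one bit is driven on each edge of the clock."""
    return clock_mhz * 2

# Assumption: a factor of 8, inferred from the quoted 105 MHz and 840 Mbps.
apex_rate = serdes_rate_mbps(105, 8)
virtex_rate = ddr_rate_mbps(320)
```

Here `apex_rate` works out to 840 and `virtex_rate` to 640. The practical difference is where the fast clock lives: inside dedicated circuitry in one case, on the general-purpose routing in the other.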

Routing

The routing structures are also different. Altera's main routing strategy is to have many lines connecting the entire chip together. This is in contrast to the Xilinx approach, which consists of a hierarchy of short, medium and long connections. This makes the job of the place and route tool harder in the Xilinx devices, unless it is guided. The downside for Altera is that larger devices get slower as there is more capacitance to drive.
The routing structures of the Xilinx and Altera families are very different; each has different abilities. The Altera structure is a hierarchical structure akin to that of a CPLD. At the lowest level, there are very fast connections between the logic elements (LE's, which consist of a flip-flop and a 4-Lut each) within a LAB (logic array block-with 8 to 10 LE's). These connections are great for very fast state machines, but are useless for arithmetic because the carry chain also runs thru the LAB. The next level up in the routing hierarchy connects the LABs in a row together. The row routes run halfway or all the way across the chip in 10K, with switches connecting to selected LAB's. The rows are then interconnected by column routes. A LAB can drive a row or column route directly, but can only receive input from a row route. This structure has the advantage of having uniform delays for any connections using similar hierarchical resources. That in turn makes placement less critical. Unfortunately, it also means even local connections incur the delay associated with a cross-chip connection. A bigger problem appears with heavily arithmetic designs because the routing in and out of every arithmetic LE is forced onto the row routing. There are only six row routes for every eight LE's in a row, so even with perfect routing in a heavily arithmetic data-flow design, the row can only be 75% occupied. The row interconnect matrix is sparsely populated (any one LAB can only directly connect to a fraction of the LAB's on the same row). As the row fills up, some of the connections have to be made via a third LAB, adding to the delay and further congesting the row routes. In a math intensive design, system performance often falls off sharply at 50 to 60% device utilization. The global nature of the row and column routes also means that performance degrades with increasing device size.
The 20K architecture fixes many of the routing problems of the earlier families cited above. Another hierarchical layer is added between the row route and the LAB, which has the effect of localizing connections that previously had to go on the row tracks. Since those connections don't have to cross the chip, they are faster. To fix the arithmetic connections, direct connections have been added from each LAB to the LE's in the adjacent LAB's in the so called megaLAB.
The Xilinx routing structure is a mix of different length wires connected by switches. For the more local connections, very fast single length connections are used. Longer connections use the longer wires to minimize the number of switch nodes traversed. The routing delays have a strong dependence on the connection distance, so placement is critical to performance. This can make performance elusive to the novice user, but on the other hand, the segmented routing means extreme performance is available if you are willing to do some work to get it.
Bottom line is that the Altera routing is more forgiving for moderate designs at moderate densities, which makes it easier for users and tools alike. However, the same things that make it easier for those designs are roadblocks for higher performance.

Tools

Both vendors now ship FPGA Express for compiling/synthesising. Altera also offer Leonardo Spectrum, which in my opinion is vastly better than the Synopsys tool. Synplify would still be my synthesiser of choice, but that isn't likely to be free any time soon!
Altera-specific versions of FPGA Express and Leonardo Spectrum are offered FREE on the Altera web site. You do not need a subscription to get them. However, if you do get a subscription, you also get ModelTech's Modelsim program.
The place and route (Xilinx) and Fitter (Altera) tools both accomplish the same job. At the time of my investigations (1999) the design I was benchmarking would take several hours to p&r for Xilinx, rather than several minutes for the Altera tools. This is mainly due to the difficulties caused by the Xilinx architecture to the tools. Note that no effort was made to guide the tools, other than providing timing constraints, as the environment I work in places a high priority on speed of turnaround. I'm told (by Xilinx) that things are much improved with the new tools, but I haven't been able to compare.
It's quite possible that I could have done the job in a smaller/cheaper Xilinx part, but our production volumes were extremely small, so the time taken to create/debug the design on the bench was a priority.

Other bits

Xilinx have DLLs, Altera have PLLs. Altera claim PLLs are better because they give you proper 'analogue' control over the timing of your clocks. Xilinx claim DLLs are better because they are not analogue and therefore easier to deal with. Xilinx have an interesting appnote comparing the two, but they have subtracted the jitter of their source clock from the Xilinx numbers and not from the Altera measurements. They didn't measure the jitter of the Altera input, so it's difficult to judge if the PLLs are the cause of the jitter they measure or not. In the interests of fairness, you can look at Altera's jitter comparison - however, it seems to have a lot less experimental detail to it. I feel I could reproduce the Xilinx experiment to verify the results if I wanted to!
One significant difference between the PLLs and the DLLs that you missed is the ability of the PLLs to create non-integer multiples of the input clock. In fact, the Altera PLL can multiply the input clock by m/(n*k), where m is any value from 1 to 160 and (n*k) is also any value from 1 to 160. Check out App Note 115 for details on the PLLs.
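As a quick illustration of what the m/(n*k) range buys you, the sketch below checks whether a wanted clock ratio can be hit exactly with a multiplier and a combined divider that each lie between 1 and 160, as stated above. It treats n*k as a single divider and ignores any VCO or lock-range limits a real PLL has (those are not covered here, so check App Note 115). `pll_settings` is a hypothetical helper, not part of any Altera tool.

```python
from fractions import Fraction

def pll_settings(target_ratio, limit=160):
    """Return (m, divider) with both in 1..limit so that m / divider == target_ratio.

    Returns None when the ratio cannot be reached exactly. Reducing the fraction
    gives the smallest possible pair, so if that does not fit, nothing does.
    """
    ratio = Fraction(target_ratio)
    m, divider = ratio.numerator, ratio.denominator
    if 1 <= m <= limit and 1 <= divider <= limit:
        return m, divider
    return None
```

For example, `pll_settings(Fraction(3, 2))` gives a multiplier of 3 and a divider of 2, a ratio that an integer-only multiplier cannot produce.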

Summary

Xilinx

Potentially smaller and cheaper devices

Good at arithmetic functions

Flexible I/Os

Longer compile times

More complex tools

More capable tools for the power user

Both small and large blocks of embedded RAM

Proper dual port RAM

Altera

Quick compile

Simple tools

Less flexible tools for the power user

Flex and Apex make it tricky to make fast bi-directional I/O

Less capable arithmetic

No small blocks of embedded RAM

RAM has one read and one write port, not proper dual ported.

The conclusion about compile times does not hold for all designs. The compile time for dense arithmetic designs in Altera can literally take days where a similar design in Xilinx can finish in under an hour with decent floorplanning. Floorplanning in Altera is not well supported and frankly won't provide as much as it does with Xilinx.
Because of Altera's row/column architecture, Altera has been able to design-in redundant rows and columns. If a fab defect is found, a redundant row can be switched in and the die is saved rather than thrown away. Since the biggest cost-driver is die size and yield, I would have to dispute the "potentially... cheaper" devices claim. As far as smaller goes, I would have to agree that Xilinx has a wider product offering at the small end of the FPGA size spectrum.
(The reality of whether one vendor's parts are cheaper than the other is independent of whether the device includes redundancy logic. The efficiency of the architecture (gates per some metric of silicon usage such as area or transistors), the implementation geometry, test costs, volume, package type, and many other factors all affect the manufacturing cost. The user pays a "Price" not a "Cost", and this price depends on the cost, as well as the supplier's profit margin, and how good you are at negotiating lower prices :-). While redundancy may help reduce the cost, what matters in the end to the end user is the price they pay for a device that meets their needs.)
Quite right Philip. And there's more than the piece price to think about. If the tools/architecture/whatever allow you to get to market quicker, or your volumes are so low that the development costs outweigh the FPGA price (as it does in my particular application) different things become more important.
Regarding "potentially... cheaper", maybe it would be better to say "in some applications, potentially cheaper". And therefore the same should apply to Altera!
I'm also going to have to raise an issue with the "More capable tools for the power user". Just because Altera's tools have a nicer GUI doesn't mean that the tools are not for the power user. Quartus II has a built-in TCL console for creating scripts that can do everything that you can do in the Xilinx tools.
Well... no! Show me where in their tools you can look at and edit individual wires in the device. You can do that in Xilinx's FPGA editor. How about specifying placement in your source (the edif netlist)? It sure would be nice to be able to constrain the two level arithmetic logic and the registers driving it to lie in the same row. Cliques gave the tools a *HINT* that you want to keep stuff together in the max plus tools, but only if there was a small number of them. Last I checked, Quartus still could not use cliques.
If you don't like to use the menus, ask your local Altera FAE and he can provide you with a library of TCL functions (ask for the PowerKit) that will allow you to create constraints like "Real Men" do rather than use the GUI.
This is probably my fault - I was referring to Maxplus2 which I have consistently failed to get to do what I want with placing certain logic cells - due to the Quartus fitter ignoring all my assignments - and the older fitter not being able to get close to my timing requirements. Approaches to our local FAE, Altera direct and the c.a.f newsgroup all hit a brick wall. My cursory inspection of Quartus a while ago did lead me to the idea that it was much more capable in this area, but as I've not gone beyond 10K I have no 'real world' comments to make. I do use emacs to enter my constraints in the acf file though :-)
I can only encourage you to check out the literature and talk to the FAEs from both Altera and Xilinx to get a more balanced view of the strengths and weaknesses of the two architectures.
I have read the literature, and spoken to FAEs from both companies. I think much of our misunderstanding probably stems from the fact that I initially wrote this piece based on 10K compared with 4000, with comments thrown in about other architectures just to confuse the issue! Sorry about that!

Amazing as it may seem, other people have asked to contribute to this page, and editing each person's input (I am expecting more) is getting to be a bit much, so for your enjoyment, here are comments from others on the topic. Good luck with your selecting a vendor of FPGAs :-)

Anonymous Designer:
1. TOOLS
Altera supports AHDL, which is more powerful than ABEL, but much easier to learn than VHDL/Verilog. The Maxplus2 tool allows you to target anything between a 7032 and a 10k200, almost seamlessly. When we were just getting started this was a big advantage.
2. SUPPORT
Altera data sheets have to be read most carefully to check that the device has the features you want, in the package you want. E.G. only some 10K series have PLLs. When Xilinx says a family has DLLs then the whole family has them. The summary data sheet for Altera's APEX 20KE series states that LVDS is supported, but does not mention that it is only really supported in the 20K400E and larger devices. There is no mention in the summary front page; you have to look really hard in the datasheet to find this.
3.
Altera appears to be targeting the router and other network hardware market at the moment. Xilinx seems to be going towards DSP.
4.
Some people are of the opinion that Xilinx appears to be far more innovative and open: there is code to make your FPGA into a DAC with one resistor and one capacitor! The Altera app. notes amount to "Yes, we did it" but do not give sufficient detail for ME to do it. Xilinx app notes are far more helpful, and they will respond to postings in comp.arch.fpga. Altera NEVER do.

An update from Anonymous Altera Fan
The Product-Term mode of the Altera memory blocks is something other than just using the RAMs as a big LUT.
A single memory block can be configured to provide 16 product-term outputs based on 32 inputs. Although this can be duplicated using a generic RAM block as a big LUT, it would take an extremely large memory block (32 address lines = 2^32, 16-bit memory cells) to do it in the brute-force manner.
Note that this is only a feature in Apex, Apex-E, Mercury, and Apex II devices.
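To put a number on "extremely large", the sketch below computes the size of the brute-force lookup table described above (32 address lines, 16-bit words) and, for contrast, evaluates a single product term directly. The function names and the way a product term is represented here are made up for illustration and say nothing about how the Altera hardware stores its terms.

```python
def brute_force_lut_bits(inputs, outputs):
    """Bits needed to store every output for every possible input combination."""
    return (2 ** inputs) * outputs

def product_term(input_bits, true_literals, inverted_literals):
    """AND together the chosen inputs (true or inverted); other inputs are don't-care."""
    return int(all(input_bits[i] == 1 for i in true_literals)
               and all(input_bits[i] == 0 for i in inverted_literals))

lut_bits = brute_force_lut_bits(32, 16)
lut_gib = lut_bits / 8 / 2 ** 30
```

With 32 inputs and 16 outputs the brute-force table needs 2^36 bits, which is 8 GiB. A product term, by contrast, only has to record which inputs it cares about, so its cost grows with the number of inputs rather than with 2 to the power of that number.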
