Austin, > > Brian just seems to be stuck, and is unwilling to >grant that there are ways to make it work just fine > Exactly what part of "requires external back termination and/or input matching scheme when driving FPGA inputs from a modern high speed LVDS driver" didn't you read? The way you "make it work just fine" with a high speed driver is by adding "external back termination and/or input matching" - that's what I've been saying, repeatedly, since my first post. Item 13 from my original post: > >13) Massive 8pf IBIS C_COMP input capacitance value for the > V2 LVDS inputs requires external back termination and/or > input matching scheme to achieve reasonable signaling when > driving FPGA inputs from a modern high speed LVDS driver > Why did I feel it necessary to include this item on the list: Because inexperienced designers wouldn't know any better, and even experienced designers with ECL/GaAs/SiGe high speed digital components may be caught off guard by such a high Cin spec - when first reading the Virtex2 datasheet, I thought it was a tester specification limit until I did initial system SPICE modeling and real world driver/TDR testing on a Virtex2 prototype board. Brian Austin Lesea <Austin.Lesea@xilinx.com> wrote in message news:<3F8578D1.AAB14B8B@xilinx.com>... > Rick, > > Now you, I can have a discussion with. > > Anything that is unlcear or still in doubt about the input C issue that I might explain? > > After all, many posts ago I explained why the C was what it was, how it is documented, and made > comment that there are ways to deal with it, but Brian just seems to be stuck, and is unwilling to > grant that there are ways to make it work just fine, and that perhaps there are valid reasons why > the C input can not be 0.5pF. > > Do you have, or have you run an IBIS simulation of the ORCA-4 IOB and looked at how its input C > affects the signal? > > Austin >Article: 61751
Hi Arkaitz, arkaitz wrote: > Hi Antti, > > I've done a flash loader but I don't know which file do I have to > store in flash in order to enable to execute it. > > I've proved storing the "executable.elf" file which contains the > crt0.o initialization code linked and then I jump to that address from > a program stored in the Block RAMS, but as I supposed it doesn't work. re-reading your messages, it occurs to me - are you hoping to execute the code directly from the flash? In that case, you will need a custom link script, because otherwise your data segment (read/write) will also be located in the flash address space, and of course that won't work at all! In my applications I simply use the flash as somewhere to store the image when the power is off - at bootup I copy the image from flash, down into RAM to the address at which it was originally linked, then jump to it. You will need to modify this sequence somewhat... Regards, JohnArticle: 61752
I need to infer an 8 bit accumulator (acc8) using Verilog on the Xilinx Webpack.
The Library guide seems to contain syntax errors.
I could not get the tool to infer a loadable accumulator, no matter how I play around with the implementation. I get an adder using 10 slices, instead of 5 slices I should get when an accumulator is inferred.
Does anybody know the solution?

Article: 61753
Hi, I am interested in the reliability of modern FPGA/PLD hardware and am surveying groups of users for their experience along with studying reliability data provided by various manufacturers. So, two basic questions: 1. Are there reliability issues for modern devices with the higher clock speeds that we are using today? I shall set, for the sake of discussion, an artificial boundary of 100 MHz clock frequency for the dividing line between high and not high speed. 2. Are there handling/assembly/application issues for modern devices as compared to say devices from 5 years ago? That is, are there observed changes in sensitivity to conditions such as ESD, input voltage excursions, transients on the power supplies, etc. Please categorize the application environment in terms of commercial, industrial, or mil/aerospace and specify clock frequency. If possible, quantities of devices might be helpful for evaluating trends. Posts to the newsgroup are of course fine. If you wish to be anonymous, please demunge and use the e-mail address in the header. Thanks, Richard B. Katz NASAArticle: 61754
Depends on the HDL. VHDL certainly is. There is a link on my links page of my website to an example of some VHDL that does exactly this, which IIRC is a function call. Tim wrote: > Ray Andraka wrote: > > I'd rather use a function or procedure within the HDL so that the > > boolean expression is in the code and is used directly to generate > > the init value. > > Not possible for the HDL which is not a complete > programming language ;-) -- --Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email ray@andraka.com http://www.andraka.com "They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759Article: 61755
John, This was only a snippit of my code. In my final design I accounted for all possible rotations of the code. Using the casez worked quite well in my application. Sorry for any confusion. Sincerely, Jeremy "John_H" <johnhandwork@mail.com> wrote in message news:<9Sjhb.18$jU3.8636@news-west.eli.net>... > I'm afraid I'm lost. The example you give shows a single alignment for a > 6-ones check. There are 10 total alignments that can apply. Once detected, > there doesn't seem to be an indication THAT the detection occurred except > that one of the bits from the SERDES is now a one (by definition of your > pattern, it would have been a zero). > > Were you suggesting that, in general, a casez might produce good results > from the synthesizer for run detection? > > "Jeremy Webb" <jeremywebb@ieee.org> wrote in message > news:4d807c8a.0310091158.5a0ba215@posting.google.com... > > John, > > > > I did something similar to this in a Spartan II. I was searching > > through a 2^7-1 PRBS pattern (at the output of a SERDES, data bus is > > 10-bits wide) for the longest string of zeros. Granted the longest > > string of zeros in a 2^7-1 PRBS pattern is 6, the idea could be > > extrapolated to longer strings like 9 in a 65-bit wide bus. > > > > Here's an example of what I did for searching for 6 zeros in a row. > > You'll notice that in my casez statement, I'm actually searching for 6 > > ones in a row. This is because the BERT that I was using inverted > > it's output PRBS pattern. > > > > always @(posedge clock1) > > begin > > casez (datasi[9:0]) > > 10'b???111111? : Q[9:0] = {datasi[9:8],7'b1111111,datasi[0]}; > > default : Q[9:0] = datasi[9:0]; > > endcase > > end > > > > Once you find the string that you're looking for, you can do what ever > > you'd like. > > > > Hope this helps, > > > > Jeremy > > > > johnhandwork@mail.com (John_H) wrote in message > news:<6c803f5f.0310060552.267dc963@posting.google.com>... 
> > > "Morten Leikvoll" <m-leik@online.nospam> wrote in message > news:<5z9gb.28389$os2.397003@news2.e.nsc.no>... > > > > I just started reading this thread.. Am I correct if you really want > to > > > > detect 9 EQUAL bits in a row from a stream? > > > > Could you not do this just with a 4bits counter and a comparator/zero > > > > detector? > > > > > > Correct, I need "equal" bits, either 9'h000 or 9'h1ff, starting from > > > 0, 8, 16, ... 56. > > > > > > The input is 65 bits per clock with a fast clock, output from BlockRAM > > > which was loaded at full width. > > > > > > Counters require more than one clock.Article: 61756
You have to be a little careful because the carry logic in the slice follows the LUT, so in order to get in one level of logic you need to visualize the load preceding the add. To do that, one input of the adder has an AND gate so that it is forced to zero when load is active, the other input is a mux to select your load value or the addend (could be the same, depending on your design). Note in that case that there is logic in front of both add inputs. That same logic needs to precede the carry mux DI input, which does have an AND gate available (the mult and). In order to use that, your load signal has to be active low. Some synthesis tools will infer the right structure as long as it is realizable in the hardware, while others need you to be more explicit.

Y K wrote:

> I need to infer an 8 bit accumulator (acc8) using Verilog on the
> Xilinx Webpack.
> The Library guide seems to contain syntax errors.
> I could not get the tool to infer a loadable accumulator, no matter
> how I play around with the implementation. I get an adder using 10
> slices, instead of 5 slices I should get when an accumulator is
> inferred.
> Does anybody know the solution?

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759

Article: 61757
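The load-before-add arrangement described in the reply above can be sanity-checked with a small behavioural model. This is a plain Python sketch, not synthesizable code: the function name `acc_next` and the 8-bit default width are assumptions for illustration, and the active-low polarity mentioned for the mult-and is not modelled, only the logic function.

```python
def acc_next(q, b, d, load, width=8):
    """Next accumulator value with the load folded in ahead of the adder."""
    mask = (1 << width) - 1
    # AND gate on the feedback input: forced to zero while load is active.
    a_in = 0 if load else (q & mask)
    # Mux on the other input: load value when loading, addend otherwise.
    b_in = (d if load else b) & mask
    # One adder serves both cases: 0 + d when loading, q + b otherwise.
    return (a_in + b_in) & mask


print(acc_next(200, 100, 0x5A, load=False))  # 44, i.e. (200 + 100) mod 256
print(acc_next(200, 100, 0x5A, load=True))   # 90, i.e. the load value 0x5A
```

Because the load is expressed as "add the load value to zero", the adder itself never needs a bypass path, which is the point of putting the gating logic in front of both adder inputs.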
This pointed me at a workaround: The problem is indeed around the load input. The following code shows three versions: The Xilinx "Verilog" version (self explanatory), the same written in Verilog, and one where reset and load are or'ed together. Only the third one works.

I prefer a pure Verilog solution for future portability reasons. The code is going to a product line that may live for 20 years, and I don't want to rewrite it every time I have to design a board using a new FPGA family.

The workaround is good enough for me, but perhaps Xilinx can fix the documentation and XST code too?

Thank you
Yishai Kagan
yk_four_zeroes_one@hotmail.com (convert to digits to get the correct e-mail).

Here it is:

module accumulator8(input C,R,L,D,CE,ADD, input [7:0] B, output reg [7:0] Q);

/* The following does not work for obvious reasons.
// Verilog Inference Code copied from the Libraries Guide Page 163
always @ (posedge C)
begin
   if (R)
      Q <= 0;
   else if (L)
      Q <= D;
   else if (CE)
   end
      if (ADD)
         Q <= Q + B;
      else
         Q <= Q - B;
   end
*/

// The following does not infer an accumulator for less obvious reasons:
/*
always @ (posedge C)
begin
   if (R)
      Q <= 0;
   else if (~L)
      Q <= D;
   else if (CE)
   begin
      if (ADD)
         Q <= Q + B;
      else
         Q <= Q - B;
   end
end
*/

// The problem is in L, as you claim.
// The following code does infer an accumulator,
// at the expense of losing the distinct reset:
wire RorL = R || L;
always @ (posedge C or posedge RorL)
begin
   if (RorL)
      Q <= D;
   else if (CE)
   begin
      if (ADD)
         Q <= Q + B;
      else
         Q <= Q - B;
   end
end

endmodule

Ray Andraka wrote:

> You have to be a little careful because the carry logic in the slice
> follows the LUT, so in order to get in one level of logic you need to
> visualize the load preceeding the add. To do that, one input of the
> adder has an and gate so that it is forced to zero when load is active,
> the other input is a mux to select your load value or the addend (could
> be the same, depending on your design).
Note in that case that there > is logic in front of both add inputs. That same logic needs to preceed > the carry mux DI input, which does have an AND gate available (the mult > and). In order to use that, your load signal has to be active low. > Some synthesis tools will infer the right structure as long as it is > realizable in the hardware, while others need you to be more explicit. > > > Y K wrote: > > >>I need to infer an 8 bit accumulator (acc8) using Verilog on the >>Xilinx Webpack. >>The Library guide seems to contain syntax errors. >>I could not get the tool to infer a loadable accumulator, no matter >>how I play around with the implementation. I get an adder using 10 >>slices, instead of 5 slices I should get when an accumulator is >>inferred. >>Does anybody know the solution? > > > -- > --Ray Andraka, P.E. > President, the Andraka Consulting Group, Inc. > 401/884-7930 Fax 401/884-7950 > email ray@andraka.com > http://www.andraka.com > > "They that give up essential liberty to obtain a little > temporary safety deserve neither liberty nor safety." > -Benjamin Franklin, 1759 > >Article: 61760
Dear all,

I have one big problem with my CardBus PC Card. I developed this card on my own and a colleague developed the driver. The CardBus interface is included in an FPGA APEX20K100E. Since there were some statements in the specifications about burst read, I adapted everything on this card for burst reads. But when I insert the card in a notebook and read my memory space, no burst happens. All that happens are normal single accesses. What do I have to do to perform a burst read? Are there any settings which have to be made to enable such a burst? My PC Card is burst-read capable and on the PCI-to-CardBus bridge I also set the MBURSTUP and MBURSTDN bits to enable such bursts. The sys driver itself should also be burst-capable I think.

I would be very pleased if anyone could help me solve this problem.
Thanks in advance
Joachim

Article: 61761
Dear Mr Treseler,

thank you for your answer. I have looked at the pdf-file you recommended. On page 50 there is the VHDL description of a single-clock synchronous RAM, but I use two clocks. The problem seems to be the signal writing: When I do not use it, the compiler infers RAM memory. But it should not be such a big problem to combine the write-signal with a writing signal! But the compiler seems to have a problem recognizing the RAM structure when doing so. I am trying to find out why, but without any success yet.

Thanks.
Kind regards
Andres Vazquez
G&D System Development

Mike Treseler <mike.treseler@flukenetworks.com> wrote in message news:<3F85DCD8.60208@flukenetworks.com>...
> Vazquez wrote:
>
> > Why does QuartusII not synthesize it as a RAM structure using
> > the memory bits of Cyclone?
>
> see pg. 50
> http://www.altera.com/literature/an/an238.pdf
>
> -- Mike Treseler

Article: 61762
Great answer Ray - thanks very much. Ken "Ray Andraka" <ray@andraka.com> wrote in message news:3F85D4D1.EFC91BDC@andraka.com... > Bzzzt. The 'pipeline' register in the multiplier is in the middle. the setup > and clock to Q of the 'pipelined' multiplier is substantial. In order to get > the data sheet max performance, you need to add CLB registers to the > multiplier I/O AND you need to place them in the slices where there are direct > connects to the multiplier. If you do this, and as long as you don't have > 'stepping 0' parts, the embedded multipliers can be clocked faster than an 18 > bit carry chain. The advantage of in the fabric multipliers is that you can > make them whatever size you need, and put them where they are convenient > rather than being restricted to the mult/bram columns. In the fabric, you can > also take advantage of cases where you have multiple clocks per sample to > reduce the size of the multiplier. I look at the FPGA sort of like a bin of > different Legos (tm). You use what you have in the box to the best advantage > for your particular project. Sometimes there are more multipliers than you > need, so you can use them for things like shifters or muxes if you get real > cute about it. Other times, there are not enough, so you pick and choose what > goes where. > > Ken wrote: > > > <snip> > > > > > 3: Use of the BlockRAMs. Since the BlockRAMs and multipliers share > > > interconnect, there are limits on when they can be used > > > simultaneously. > > > > > > 4: Pipelined, throughput-optimized performance. The fixed multipliers > > > are unpipelined or single-stage, a LUT multiplier can be much more > > > finely pipelined (higher thorughput). > > > > Ok - it is my understanding that there are registers just before and just > > after the dedicated multipliers that can be used to speed them up. > > > > But what you are saying is that the LUT multipliers will have a higher max > > MHz when both solutions are as pipelined as they can be? 
> > > > Thanks for your time, > > > > Ken > > -- > --Ray Andraka, P.E. > President, the Andraka Consulting Group, Inc. > 401/884-7930 Fax 401/884-7950 > email ray@andraka.com > http://www.andraka.com > > "They that give up essential liberty to obtain a little > temporary safety deserve neither liberty nor safety." > -Benjamin Franklin, 1759 > >Article: 61763
Hi John,

My first idea was to execute it directly from flash memory, but now I will first try copying it to SRAM before I execute it. I will use a specific linker script later to execute it directly.

I have resolved the problem. It was that I was creating the elf file in XMDSTUB mode instead of EXECUTABLE mode.

Thanks a lot for your time.
Arkaitz.

John Williams <jwilliams@itee.uq.edu.au> wrote in message news:<bm564s$d66$1@bunyip.cc.uq.edu.au>...
> Hi Arkaitz,
>
> arkaitz wrote:
> > Hi Antti,
> >
> > I've done a flash loader but I don't know which file do I have to
> > store in flash in order to enable to execute it.
> >
> > I've proved storing the "executable.elf" file which contains the
> > crt0.o initialization code linked and then I jump to that address from
> > a program stored in the Block RAMS, but as I supposed it doesn't work.
>
> re-reading your messages, it occurs to me - are you hoping to execute
> the code directly from the flash? In that case, you will need a custom
> link script, because otherwise your data segment (read/write) will also
> be located in the flash address space, and of course that won't work at all!
>
> In my applications I simply use the flash as somewhere to store the
> image when the power is off - at bootup I copy the image from flash,
> down into RAM to the address at which it was originally linked, then
> jump to it. You will need to modify this sequence somewhat...
>
> Regards,
>
> John

Article: 61764
"Ray Andraka" wrote: > FWIW, you need to put those registers in those spots around the multipliers in > order to achieve the data sheet max performance. Right. I experimented with the XAPP636 placement and studied the routing in and out of the multiplier with FPGA Editor. Makes sense. Can't see a faster way to lay it out. Funny enough, if you let the tools do a layout they will be exceedingly happy to put FF's so far away from multipliers that a monkey with a dart might be able to do better. This, I don't really understand. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Martin Euredjian To send private email: 0_0_0_0_@pacbell.net where "0_0_0_0_" = "martineu"Article: 61765
If you mean putting both Tundra 310 bridges on a single pci-x133 bus I don't think this is electrically supported. As I understand it you can only have one load on pci-x133 bus. Please correct me if I have mis-stated your intention. chad. > If you're looking for an existing silicon solution I believe you could > do it with two Tundra Tsi310 parts. > > -hpaArticle: 61766
On 10 Oct 2003 03:30:40 GMT, "Richard B. Katz" <richard.b.katz@nospamplease.nasa.gov> wrote: >Hi, > >I am interested in the reliability of modern FPGA/PLD hardware and >am surveying groups of users for their experience along with >studying reliability data provided by various manufacturers. So, >two basic questions: > > 1. Are there reliability issues for modern devices with the > higher clock speeds that we are using today? I shall set, > for the sake of discussion, an artificial boundary of 100 MHz > clock frequency for the dividing line between high and not > high speed. > > 2. Are there handling/assembly/application issues for modern > devices as compared to say devices from 5 years ago? That > is, are there observed changes in sensitivity to conditions > such as ESD, input voltage excursions, transients on the > power supplies, etc. You also might like to consider packaging. The high performance we achieve today owes as much to the packaging as to the silicon. New packaging (e.g. BGA) will have new ways to fail. High performance also means lots of power, and the thermal aspects may influence reliability. Regards, Allan.Article: 61767
Eric Crabill <eric.crabill@xilinx.com> wrote in message news:<3F85E01C.A0E99EFA@xilinx.com>...
> Hi,
>
> Logically, what you described can be built with three
> PCI-X to PCI-X bridges.
>
> You can take bridge #1 from PCI-X 133 to PCI-X 66.

Aren't you cutting your bandwidth in half? I would like to have the pci66 busses be able to run at full speed to access the host's memory (primary side of the pcix133 bridge #1). If you drop to 66 MHz here, my two secondary busses can only run at 1/2 their bandwidth _if_ trying to access host memory at the _same_ time.

> On
> that PCI-X 66 bus segment, you put bridge #2a and #2b,
> both of which bridge from PCI-X 66 to PCI 66. So, you
> can actually go buy three of these ASSPs and build
> exactly what you want.
>
> I wouldn't want to turn you away from a Xilinx solution.
> A Xilinx solution could be a one-chip solution, offer
> lower latency, and provide you with the opportunity to
> customize your design in a way you cannot with ASSPs.
> However, you would want to carefully weigh the benefits
> with the downsides -- you will need to put in some design
> effort. Another thing to consider is cost, which will
> be a function of the size of your final design.
>
> Good luck,
> Eric
>

Article: 61768
The tools do the same thing with pipeline registers added to BRAMs. They don't seem to do very well with placement of and around the multipliers and BRAMs. Martin Euredjian wrote: > "Ray Andraka" wrote: > > > FWIW, you need to put those registers in those spots around the > multipliers in > > order to achieve the data sheet max performance. > > Right. I experimented with the XAPP636 placement and studied the routing in > and out of the multiplier with FPGA Editor. Makes sense. Can't see a > faster way to lay it out. > > Funny enough, if you let the tools do a layout they will be exceedingly > happy to put FF's so far away from multipliers that a monkey with a dart > might be able to do better. This, I don't really understand. > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > Martin Euredjian > > To send private email: > 0_0_0_0_@pacbell.net > where > "0_0_0_0_" = "martineu" -- --Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email ray@andraka.com http://www.andraka.com "They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759Article: 61769
Joachim Mann <jogges@web.de> wrote in message news:bm5j3v$h2vjr$1@ID-199325.news.uni-berlin.de... > Dear all, > I have one big problem with my cardbus PC-card. > I developed this card on my own and a collegue developed the driver. The > CardBus-interface is included in an FPGA APEX20K100E. > Since there were some statements in the Specifications about burst read, I > adapted > everything on this card for this burst read. But when I insert the card in a > Notebook and read my memory space, no burst happens. All what is done are > normal single accesses. What do I have to do to perform a burst read? Are > there any settings which have to be made to enable such a burst? > My PC-Card is burst-read capable and on the PCI-to-CARDBUS-Bridge I also set > the MBURSTUP and MBURSTDN-Bit to enable such burst. The sys-driver itself > should also be burst-capable I think. > > I would be very pleased if anyone could help me solving this problem. > Thanks in advance > > Joachim Joachim, I've read that PC hosts won't perform burst reads from target PCI cards. If you want a burst transfer you've got to implement a PCI master/target in your interface so the master can perform the burst transfer. I don't think this is mentioned in the PCI spec. I presume this is the same for Cardbus. Can anyone else confirm this? Nial. ------------------------------------------------ Nial Stewart Developments Ltd FPGA and High Speed Digital Design www.nialstewartdevelopments.co.ukArticle: 61770
Hi, Perhaps I am a bit jaded, but I think you will never actually realize anything close to "full speed" using PCI. (PCI-X has some improvements in protocol). Your statement assumes that both the data source and the data sink have an infinitely sized buffer, nobody uses retries with delayed read requests, and you have huge (kilobytes at a time) bursts. > Aren't you cutting your bandwidth in half? It depends -- are you talking about "theoretical" bandwidth, or bandwidth you are likely to achieve? If you are designing under the assumption that you will achieve every last byte of 533 Mbytes/sec on a PCI64/66 bus, you will have some disappointment coming. :) PCI and PCI-X are not busses that provide guaranteed bandwidth. I've seen bandwidth on a PCI 64/66 bus fall to 40 Mbytes/sec during certain operations because the devices on it were designed poorly (mostly for the reasons I stated in the first paragraph). > like to have the pci66 busses be able to run at full > speed to access the host's memory (primary side of > the pcix133 bridge #1). If you drop to 66 MHz here > now my to secondary busses can only run at 1/2 there > bandwidth _if_ trying to access host memory at the > _same_ time. While the point you raise is theoretically valid, you must consider that the bandwidth you achieve is going to be no greater than the weakest link in the path. What is the actual performance of the PCI-X 133 Host? How about your PCI 66 components? The bridge performance may be moot. An interesting experiment you could conduct would be to plug your PCI 66 component into a PCI 66 host, and see how close to "full speed" you can really get using a PCI/PCI-X protocol analyzer. Then, you could buy two bridge demo boards from a bridge manufacturer (PLX/Hint comes to mind...) and see what you get behind two bridges, configured as I described. I would certainly conduct this experiment as a way to justify the design time and expense of a custom bridge to myself or my manager. 
While I suspect you won't get half of "full speed" in either case, I am very often wrong. That's why I'm suggesting you try it out. I'm not trying to discourage you from using a Xilinx solution. However, I'd prefer that potential customers make informed design decisions that result in the best combination of price/performance/features. Good luck, EricArticle: 61771
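As a rough check on the bandwidth figures in the bridge discussion above, the theoretical peak of each bus segment can be worked out directly. This is only a sketch under stated assumptions: all segments are taken to be 64 bits wide, the clocks are the nominal 66.67 and 133.33 MHz, and the helper name `peak_mbytes_per_s` is made up for this example. As the post stresses, achieved throughput is typically far below these numbers.

```python
def peak_mbytes_per_s(width_bits, clock_mhz):
    """Theoretical peak: one full-width transfer on every clock."""
    return (width_bits / 8) * clock_mhz


pci_66 = peak_mbytes_per_s(64, 66.67)     # each secondary PCI 64/66 bus
pcix_66 = peak_mbytes_per_s(64, 66.67)    # intermediate PCI-X 66 segment
pcix_133 = peak_mbytes_per_s(64, 133.33)  # host-side PCI-X 133 bus

print(f"PCI 64/66 peak:        {pci_66:7.1f} Mbytes/sec")
print(f"Two PCI 64/66 buses:   {2 * pci_66:7.1f} Mbytes/sec")
print(f"PCI-X 66 segment peak: {pcix_66:7.1f} Mbytes/sec")
print(f"PCI-X 133 peak:        {pcix_133:7.1f} Mbytes/sec")
```

On paper, two secondary buses running flat out match the PCI-X 133 peak but are twice what a shared PCI-X 66 segment can carry, which is the "half bandwidth" concern raised earlier; whether that ceiling is ever reached in practice is exactly what the suggested protocol-analyzer experiment would show.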
Hi, For both PCI and CardBus (which is basically point-to-point 3.3v, 33 MHz PCI) the host bridges are traditionally very good targets but not good initiators. This means that if you want to move lots of data, the way you need to do it is by making your add-in card become a bus master, and then read/write host memory. If you are trying to have the "CPU" read/write the add-in card, you'll get poor performance. As the original poster noted, the host won't even burst... EricArticle: 61772
Hi

I am trying to build a prototype Spartan-II board.

1. I am using a XC2S50 TQ144 part with all the mode pins tied to VCCINT
2. I am using Xilinx webpack 4.2
3. The parallel cable is from insight electronics (Model IJC-2)
4. I am using a general purpose PCB and a QFP144 adapter from adapters.com to connect the FPGA to the PCB.

Now when I compile a simple test module and try to download the bit file through the Xilinx iMPACT tool, it gives an error saying "Configuration failed: done pin did not go high".

What could be the cause and how can I debug it?

Thanks
Sumit

Article: 61773
Hi,

I downloaded HOTman from Virtual Computing Corporation around April '03, but when I tried to open the console now it didn't work. I also went through the first-run procedure for HOTman again and still couldn't get the GUI to appear. I got the error message "Could not find main class" when I double clicked on the Hotman.jar file.

Is this a problem with the JRE, or is the HOTman software no longer working?

I also tried to download the evaluation edition again from the VCC website and couldn't do it (got an internal server error). Have they closed the site?

Has anybody faced similar problems with HOTman?

Also, to implement programs from C directly to FPGA, would Celoxica's Handel-C oriented DK1 design suite be the next best option if I can't get HOTman working?

Kindly do help me out on the above.

Thanks,
Sriram

Article: 61774
"John_H" <johnhandwork@mail.com> wrote in message news:<pDYdb.14$XP3.1342@news-west.eli.net>...
> <snip>
> In either the Xilinx or Altera architecture, it's probably most efficient to
> pre-add in groups of 4 bits then add these results in an adder tree. For 32
> bits (for instance) you can get 8 values with counts of 0-4 with simple
> LUTs.
> <snip>
> - John_H

I love coming back to a subject and gaining a little more insight! In the current discussion (and the thread from 3 years ago) I made a mistake (and so has everyone else contributing, a surprising event on usenet).

There is a seeming paradox, in that it is more efficient to use the LUTs to do a three input sum at the leaves of the adder tree than to do a 4 input sum.

I'll show below the most efficient (area wise) FPGA implementations that I know (for 16 bits, and OP's 30 bits), and challenge anyone to come back and show an even better way. After that, some simple words about the 'paradox' of why the three input/LUT solution is better than four input/LUT in this case.

Look first at a _LUT_only_ implementation (16 bits), which applies to any LUT based FPGA...

A full adder (FA3) can be implemented in 2 LUTs. These are 3-LUTs!, not 4-LUTs, although in practice 4-LUTs are used because that is what is available. It is also called a (3,2) counter. See L. Dadda's papers, including "Pipelined Adders" in IEEE Transactions on Computers, Mar 1996, for a good discussion of adders, or Patterson et al. on "Optimal Carry Save Networks", available at Citeseer, for better/deeper discussions than I can give.

This is the full adder:

       ___
   -0-| F |
   -0-| A |-1-
   -0-|_3_|-0-

(the numbers are powers of 2 at a position)

Repeating: The FA3 sums 3 bits, using 2 3-LUTs.
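To make the "two 3-LUTs per full adder" claim concrete, here is a minimal Python model of the FA3 as a (3,2) counter. It is only an illustration: the names `SUM_LUT`, `CARRY_LUT` and `fa3` are invented for this sketch, and each list simply stands in for the 8-entry truth table of one 3-input LUT.

```python
from itertools import product

# Truth tables for the two 3-input LUTs, indexed by (a << 2) | (b << 1) | c.
SUM_LUT = [a ^ b ^ c for a, b, c in product((0, 1), repeat=3)]
CARRY_LUT = [int(a + b + c >= 2) for a, b, c in product((0, 1), repeat=3)]


def fa3(a, b, c):
    """Full adder as a (3,2) counter: returns (weight-2 bit, weight-1 bit)."""
    idx = (a << 2) | (b << 1) | c
    return CARRY_LUT[idx], SUM_LUT[idx]


# Exhaustive check: the two output bits always encode the number of ones.
for a, b, c in product((0, 1), repeat=3):
    carry, total = fa3(a, b, c)
    assert 2 * carry + total == a + b + c
```

The exhaustive loop confirms that three equal-weight input bits are compressed into two output bits using exactly two lookup tables, which is the building block the rest of the post assembles into larger counters.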
Next, 4 full adders are arranged to sum 7 bits, using 8x 3-LUTs:

       ___
   0--| F |
   0--| A |-1--------+
   0--|_3_|-0---+    |        ___
                |    +-----1-| F |
       ___   +--|----------1-| A |---2
   0--| F |  |  |   ___   +1-|_3_|---1
   0--| A |-1+  +0-| F |  |
   0--|_3_|-0----0-| A |-1+
   0-------------0-|___|-0-----------0

Call the above a 7 Adder (7Add):

        ___
       | 7 |
     7 | A |-2-
   0-/-| d |-1-
       |_d_|-0-

(a more correct term might be '(7,3) counter', but '7Add' fits in the ascii drawing)

Two 7Adds can be used to sum 14 bits into two 3-bit numbers, using 16x 3-LUTs.

3x 4-LUTs are used to sum 4 bits to a 3-bit result in a 4Add:

        ___
       | 4 |
     4 | A |---2
   0-/-| d |---1
       |_d_|---0

Produce the final result from the output of the two 7Adds plus the remaining two bits, using two 4Adds and an FA:

        ___
       | 7 |
     7 | A |-2-------------------------------+    ___
   0-/-| d |-1-------------------+           +-2-| 4 |
       |_d_|-0-----+     +-------|-------------2-| A |--4
                   |     |     +-|-------------2-| d |--3
                   |     |     | |    ___    +-2-|_d_|--2
        ___    +---|-----+     | +-1-| F |   |
       | 7 |   | +-|-----------|---1-| A |-2-+
     7 | A |-2-+ | |    ___    | +-1-|___|--------------1
   0-/-| d |-1---+ +-0-| 4 |   | |
       |_d_|-0-------0-| A |-2-+ |
   0-----------------0-| d |-1---+
   0-----------------0-|_d_|----------------------------0

Total LUT count is 24 (3 Virtex CLBs), using 6x 4-LUTs and 18x 3-LUTs. This is the absolute minimum using 3-LUT and 4-LUT based logic alone (that I know of).

Now look at how the VirtexII carry logic may improve things...

At the leaves of the tree (left hand side), two 3-LUTs are still used to build full adders.

The next step is building a 7Add, which sums the output of 2 FAs plus one more bit. Here, either carry logic can be used, at a total cost of 8 LUTs, or a LUT only solution, also costing 8 LUTs.

A 15Add sums two 7Adds and another bit. This costs two 7Adds (16 LUTs), plus a standard 3 bit carry logic based adder with carry-in and carry-out, another 5 LUTs.

Adding in the last bit is a 4 bit increment with carry-out, costing 6 LUTs using carry logic.

Total: 27 LUTs!
Three more than the LUT only circuit.  Frustrating, isn't it?

For this size, a sum of 16 bits, trying to use carry logic does worse
than a LUT only implementation.  For a 15Add, carry logic allows a
marginal improvement of one LUT, using only 21 LUTs instead of 22.  For
larger bit counts, the carry logic becomes increasingly important, as
will be explained below.

Finally, a few simple-minded words about the 'paradox'...

Many operations can be viewed as an exercise in compression, reducing a
number of inputs to a smaller number of outputs through some sort of
multi-level tree structure.  In this case, inputs are the bits to count,
and outputs are the count.  At any tree level, a simple, naive figure of
merit qualifies a circuit:

    (InputBits/OutputBits) * (InputBits/LUTs)

The larger this number, the better the circuit is for that level.  The
first ratio helps the tree converge faster, the second reduces the LUT
count.

For 4-LUT based logic, the highest possible figure of merit would be 16,
indicating four input bits produce a single output bit, with a single
LUT.  Parity trees, "and" trees, "or" trees achieve 16.

Implementing the leaf side initial level of bit counting with 4-LUTs
gives (4/3)*(4/3)=1.778.  Implementing the initial bit counting with
3-LUTs gives (3/2)*(3/2)=2.25.

At any stage where bits of the same weight are aggregated and
compressed, the 3-LUT implementation is better if they can be grouped by
threes.  At the end of the 16Add tree, some 4-LUTs are used to advantage
because there are 4 signals to combine that cannot be cleanly split into
groups of three.  For every other location in the tree, the 3-LUTs work
out better.

When carry logic is used at a combining stage, uniting two addends of
the same size plus an LSB, the metric becomes:

    ((2*Size+1)/(Size+1))*((2*Size+1)/(Size+2))

where Size is the number of bits in each addend.  This gives:

    Size   Merit
     1     1.5
     2     2.083
     3     2.45
     4     2.7
     ...   approaching 4

When 3-LUT based full adders are cascaded to add two numbers and a
carry-in, Size full adders at 2 LUTs each are needed, so the metric
becomes:

    ((2*Size+1)/(Size+1))*((2*Size+1)/(2*Size))

Giving:

    Size   Merit
     1     2.25
     2     2.083
     3     2.042
     4     2.025
     ...   approaching 2

Comparing the tables, for combining three equal weight bits, the LUT
solution is better; for going from full adder to 7Add the two circuits
have equal LUT count; for going from 7Add to 15Add carry logic should be
used.  The optimum (30,5) counter (that I know of) uses a mix of 3-LUTs
and carry-logic, and no 4-LUTs.

Whew!  Talk about being over-anal(ytical).  Hope I haven't bored anyone,
just wanted to get the 'best' circuits public (in hopes someone else has
a better one), and show something not immediately obvious about the
leaves of the adder tree.  Peter's BlockRam implementation is also
slick!

I'll finish with a question: Just what are the best/worst results from
HDL synthesis tools that folks get with this function?  I.e., barring
forcing the mapping, and letting the tool optimize from something like:

    Count <= Bit0 + Bit1 + Bit2 +...

Regards,
John

p.s. Original poster asked about 30 bits...here's the most compact way I
know, but I haven't spent much time on 30 bits.  Here the carry logic
helps.  Note that none of the LUTs are configured as 4-LUT....

   ___
0-| F |
0-| A |-1-------+
0-|_3_|-0-----+ |
   ___        | |    ___
0-| F |       | +-1-| F |
0-| A |-1-----|---1-| A |-2-----------------+
0-|_3_|-0-+ +-|---1-|_3_|-1---------------+ |
   ___    | | |      ___                  | |
0-| F |   +-|-|---0-| F |                 | |
0-| A |-1---+ +---0-| A |-1---+           | |
0-|_3_|-0---------0-|_3_|-0---|---------+ | |
   ___                        |         | | |
0-| F |                       |    ___  | | |
0-| A |-1-------+             +-1-| F | | | |
0-|_3_|-0-----+ |           +---1-| A |-|-|-|-2---+
   ___        | |    ___    | +-1-|_3_|-|-|-|-1-+ |
0-| F |       | +-1-| F |   | |         | | |   | |
0-| A |-1-----|---1-| A |-2-|-|-------+ | | |   | |         ___
0-|_3_|-0-+ +-|---1-|_3_|-1-+ |       | | | |   | |  '0'-4-|   |
   ___    | | |      ___      |       | | | |   | |  '0'-3-|   |
0-| F |   +-|-|---0-| F |     |       | | | |   | +------2-| C |
0-| A |-1---+ +---0-| A |-1---|-----+ | | | |   +--------1-| Y |-4
0-|_3_|-0---------0-|_3_|-0---|---+ | | | | |        '0'-0-| A |-3
   ___                        |   | | | | | |       ___    | d |-2
0-| F |                       |   | | | | | |'0'-3-|   |-4-| d |-1
0-| A |-1-------+             |   | | | | | +----2-| C |-3-|   |-0
0-|_3_|-0-----+ |             |   | | | | +------1-| Y |-2-|   |
   ___        | |    ___      |   | | | +--------0-| A |-1-|   |
0-| F |       | +-1-| F |     |   | | |     ___    | d |-0-|___|
0-| A |-1-----|---1-| A |-2---|-+ | | +--2-| C |-3-| d |     |
0-|_3_|-0-+ +-|---1-|_3_|-1---+ | | +----1-| Y |-2-|   |     0
   ___    | | |      ___        | +------0-| A |-1-|   |     |
0-| F |   +-|-|---0-| F |       +--------2-| d |-0-|___|     |
0-| A |-1---+ +---0-| A |-1--------------1-| d |     |       |
0-|_3_|-0---------0-|_3_|-0--------------0-|___|     0       |
                                             |       |       |
                                             0       |       |
0--------------------------------------------+       |       |
0----------------------------------------------------+       |
0------------------------------------------------------------+

LUT Count: 18 + 12 + 2 + 5 + 6 + 6 = 49

49 LUTs = 6.125 CLBs

(Pipeline to taste)

HTH

(apologies for being late to the thread, suffered a disk crash recently,
could not post)
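
For anyone who wants to sanity-check the 30-bit arrangement, here is a small behavioural sketch. It follows one reading of the schematic above and is not a netlist: each group of 9 bits goes through two levels of full adders, one spare weight-1 bit per group goes to the lone full adder, the remaining group outputs feed the carry-chain adders, and the last three input bits are taken as the carry-ins. The function names (fa3, count9, count30, self_check) and the exact choice of which wire goes where are assumptions made for this illustration, and the carry-chain adders are modelled with plain integer addition.

```python
import random


def fa3(a, b, c):
    """(3,2) counter: three equal-weight bits in, (carry, sum) out."""
    total = a + b + c
    return total >> 1, total & 1


def count9(bits):
    """Two levels of full adders over 9 bits.

    Returns (w2, w1a, w1b, w0) such that
    4*w2 + 2*(w1a + w1b) + w0 == sum(bits).
    """
    leaves = [fa3(*bits[i:i + 3]) for i in (0, 3, 6)]
    w2, w1a = fa3(*(carry for carry, _ in leaves))  # FA over weight-1 bits
    w1b, w0 = fa3(*(s for _, s in leaves))          # FA over weight-0 bits
    return w2, w1a, w1b, w0


def count30(bits):
    """Population count of 30 bits (assumed wiring, see lead-in)."""
    assert len(bits) == 30
    g1, g2, g3 = (count9(bits[i:i + 9]) for i in (0, 9, 18))
    cin_a, cin_b, cin_c = bits[27:]
    # Lone full adder: one spare weight-1 bit from each group.
    m2, m1 = fa3(g1[2], g2[1], g3[1])
    # First carry-chain adder: two 3-bit operands plus carry-in.
    t = (4 * g2[0] + 2 * g2[2] + g2[3]) + (4 * g3[0] + 2 * g3[2] + g3[3]) + cin_a
    # Second carry-chain adder: remaining group-1 bits plus t plus carry-in.
    u = (4 * g1[0] + 2 * g1[1] + g1[3]) + t + cin_b
    # Final carry-chain adder: lone-FA outputs plus u plus carry-in.
    return (4 * m2 + 2 * m1) + u + cin_c


def self_check(trials=1000, seed=0):
    """Compare count30 against a plain sum on random 30-bit vectors."""
    rng = random.Random(seed)
    for _ in range(trials):
        vector = [rng.randint(0, 1) for _ in range(30)]
        if count30(vector) != sum(vector):
            return False
    return True


if __name__ == "__main__":
    print("count30 matches a plain bit count:", self_check())
```

Under this reading the tally is 9 leaf full adders (18 LUTs), 6 second-level full adders (12 LUTs), the lone full adder (2 LUTs) and three carry-chain adders (5 + 6 + 6 LUTs), which is consistent with the count of 49 given in the post.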