In article <Fq31J5.57K@world.std.com>, Joseph H Allen <jhallen@world.std.com> wrote: >In article <38ab3f43.13235782@nntp.best.com>, >Bob Perlman <bobperl@best_no_spam_thanks.com> wrote: >>On Wed, 16 Feb 2000 13:44:04 -0800, Peter Alfke <peter@xilinx.com> >>wrote: > >>>The classical solution to this old problem is to utilize the input flip-flop with >>>its input delay, but configured as a latch, and hold it permanently transparent. > >>>Peter Alfke > >>I tried the latch trick some years ago. It worked, but one >>significant complication was that PPR (this was back in '92-'93) kept >>trying to help by optimizing out the always-transparent latch. I >>don't know if M2.1 has the same problem. > >Actually I think it's map which is doing it. It has the same problem in >M2.1, no matter how many KEEP and NOREDUCE attributes you add to the input >or gate (however I was able to get the Leonardo Verilog synthesizer to keep it >with the 'dont_touch' attribute: //exemplar attribute delay dont_touch >true). > >Then I realized that you can just attach the latch gate to the clock for the >same effect, which is by far the easiest solution: > > ild_1 delay(.Q(original_input), .D(new_input), .G(clk)); > > always @(clk) > begin > ... state machine which uses original_input ... > end Note however that this should only be done on signals which actually have hold time issues (or more exactly, only to signals with a setup time lower than 1/2 the clock period), since it adds a huge delay to the input (while clk is high). It would be better to tie .G low, but I don't know how to prevent map from optimizing the latch out. You can use the fpga editor, but it's a pain. -- /* jhallen@world.std.com (192.74.137.5) */ /* Joseph H. Allen */ int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0) +r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2 ]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);}Article: 20676
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 10th Great Lakes Symposium on VLSI Design March 2-4, 2000, Chicago, Illinois, U.S.A. http://www.glsvlsi.com =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Dear Colleague, On behalf of the organizing committee for the 10th Great Lakes Symposium on VLSI, I would like to draw your attention to this year's very exciting program. Please refer to the new symposium web-page at URL: http://www.glsvlsi.com for the advance program, registration and hotel information. Prospective participants are encouraged to register and make their travel arrangements at their earliest convenience. Please disseminate this announcement further among your colleagues. We look forward to seeing you at the Symposium. Regards, Amir H. Farrahi Publicity Chair =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= 10th Great Lakes Symposium on VLSI Design March 2-4, 2000, Chicago, Illinois, U.S.A. http://www.glsvlsi.com =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Sent via Deja.com http://www.deja.com/ Before you buy.Article: 20677
George wrote: > > Hi Folks, > > I am using Foundation 2.1i-SP4 to target XC4k series. I am having > difficulties with 'RLOC_RANGE' property. I am using schematic design entry. > I attach a property 'RLOC_RANGE=Rr1Cc1:Rr2Cc2' on a symbol if I wanted it to > be placed between Rows r1 and r2, and Columns c1 and c2. However, when I map > the design, I do not get what I expect. This directive seems to be totally > ignored. > Has anybody out there used this property? How did it work for you? > > I appreciate your help. I'm not sure if this is the problem, but remember that RLOC constraints only control relative locations, not absolute. If you are trying to get absolute positioning, use the constraint: LOC = Rr1Cc1 : Rr2Cc2 Or apply an RLOC_ORIGIN constraint. Another concern is getting the RLOC sets right, there are a number of ways to control which elements are part of which RLOC sets. The automatic hierarchy-based processing may not be doing what you want, I can't tell without knowing more about your schematic. -Steve GrossArticle: 20678
Matt Billenstein wrote: > All, I'm interested in purchasing a prototyping board based on a Xilinx FPGA > and I have about $200 to spend. I've looked a little at the boards at > www.xess.com so far. Does anyone have any recommendations? > > thx in advance, > > m > > Matt Billenstein > REMOVEhttp://w3.one.net/~mbillens/ > REMOVEmbillens@one.net I think the XESS board is the best for this price. Additionally, XESS provides excellent support and help, and they have a very good web site with a lot of useful tutorials. -- =================================================================== Sergio A. Cuenca Asensi Dept. Tecnologia Informatica y Computacion (TIC) Escuela Politecnica Superior, Campus de San Vicente Universidad de Alicante Ap. Correos 99, E-03080 ALICANTE ESPAÑA (SPAIN) email : sergio@dtic.ua.es Phone : +34 96 590 39 34 Fax : +34 96 590 39 02 ===================================================================Article: 20679
Hello Joseph, I was tired and didn't realize what I wrote to you before. One way to reduce the hold time is by "advancing the phase of the clock" using a roboclock or any other clock manager. Joseph H Allen wrote: > The M2.1i software now reports hold times on input pads in the data sheet > timing report file, and, of course, I have some significant (up to 2.5 ns) > hold times relative to the system clock. > > This does not happen when using the IOB flip flop, with its delay line. It > does happen when there is small amount of logic between the input and the > first flip flop (so that the IOB flip flop can not be used), and when both > are placed together in a CLB near the pad. > > What is the best (easy+automatic) way to eliminate these hold times? Has > anyone else noticed this? > > -- > /* jhallen@world.std.com (192.74.137.5) */ /* Joseph H. Allen */ > int a[1817];main(z,p,q,r){for(p=80;q+p-80;p-=2*a[p])for(z=9;z--;)q=3&(r=time(0) > +r*57)/7,q=q?q-1?q-2?1-p%79?-1:0:p%79-77?1:0:p<1659?79:0:p>158?-79:0,q?!a[p+q*2 > ]?a[p+=a[p+=q]=q]=q:0:0;for(;q++-1817;)printf(q%79?"%c":"%c\n"," #"[!a[q-1]]);} -- Hernan Javier Saab Western Area Applications Engineer Email: HERNAN@synplicity.com Direct Phone: 408-215-6139 Main Phone: 408-215-6000 Pager: 888-712-5803 FAX: 408-990-0295 Synplicity, Inc. 935 Stewart Drive Sunnyvale, CA 94086 USA Internet <http://www.synplicity.com >Article: 20680
jlamorie@engsoc.carleton.ca wrote: > In article <38AAEE78.9B969870@xess.com>, > Dave Vanden Bout <devb@xess.com> wrote: > > You can take a look at http://www.xess.com/fndmake.pdf. This document > > shows you how to implement Xilinx projects using a makefile and batch > > mode processing. You can store the makefiles and VHDL files in a CVS > > tree and recall them to regenerate your project bit files. > > > > The makefile described in the document is a bit simple, but you can > > probably modify it to make it smarter. > > This is absolutely wonderful!! I'll try it out today, and try to figure > out how to include the source files for FSM, schematics and logiblox. > > I know some of these tools dump out VHDL, so you can just include the VHDL files from those in the makefile. -- Dave Vanden Bout, FPGA Product Manager, XESS Corp. devb@xess.com http://www.xess.comArticle: 20681
Hi Peter, Can you tell me, with a Virtex 1000E, using the STARTUP-VIRTEX, GSR net, and your clever little LUT flip-flop, what is the maximum propagation delay from GSR to the flip-flop outputs? I.e., can I use GSR with a 74MHz system clock? Thanks, Mark. On Wed, 16 Feb 2000 09:19:31 -0800, Peter Alfke <peter@xilinx.com> wrote: > > >Rick Filipkiewicz wrote: > >> Looking at the Virtex data sheet there's a timing parameter for the GSR->IOB/CLB FF >> outputs given. For a -4 part its 12.5nsec. The question is whether this includes GSR >> routing. If it doesn't its got to be the slowest async reset since LS TTL. > >Of course it includes the max routing delay. >But it's a max delay, and some flip-flops are closer to the source and have a much >shorter delay. So this delay ( different from all other delays in the data sheet) has >an enormous spread, you really should assume anywhere between almost zero to the max >value. That's what causes the problems that Ray and I discussed before. > >Peter Alfke > >Article: 20682
hi jeff, that's not possible because I use the synopsys behavioral-compiler. but - we could fix the problem which seems to be a language-problem between coregen-edif-writer and design-compiler-edif-reader. fjz001@email.mot.com wrote: > > Mark, > > Why not directly instantiate the RAMB4_S16_S16 in your HDL? In this > case, Coregen just adds a layer of unnecessary complexity. > > Jeff > > In article <88c4df$4r2@news.Informatik.Uni-Oldenburg.DE>, > "Mark Hillers" <Mark.Hillers@Informatik.Uni-Oldenburg.DE> wrote: > > Hello, > > > > i think i have found a bug in xilinx-tool coregen 2.1i. > > it happens when creating single-port-blockrams with words larger than 16 > > bit. > > > > the resulting ".edn"-file (for synopsys) uses one RAMB4_S16_S16 > > component where the > > lower 16 bit of the 24-bit-word are mapped to port A (DOA[15:0]) and the > > upper 8 bit are mapped to port B (DOA[15:8]). > > But - and here comes the bug - the address of the desired word is simply > > mapped to both address-ports (ADDRA and ADDRB (8 bit wide)) the > > following way: > > > > ADDRA(4 downto 0) = myaddress(4 downto 0) > > ADDRA(7 downto 5) = "000" > > ADDRB(4 downto 0) = myaddress(4 downto 0) > > ADDRB(7 downto 5) = "000" > > > > The problem is now that always both ports load the same address and with > > it the same data. The Result is an output which has the form CDABCD > > where A,B,C,D are hex-ciphers. > > > > In application-note XAPP130 (V1.1) is a solution to this problem. The > > mapping of the address-ports should be: > > > > ADDRA(4 downto 0) = myaddress(4 downto 0) > > ADDRA(7 downto 5) = "000" > > ADDRB(4 downto 0) = myaddress(4 downto 0) > > ADDRB(7 downto 5) = "100" > > > > Now I am looking for a simple patch. The simplest would be a new version > > of coregen because i am not good in writing ".edn"-files :-(. > > > > greetings > > mark > > > > Sent via Deja.com http://www.deja.com/ > Before you buy.Article: 20683
Nova Engineering has two different Altera FPGA development boards: Constellation and Constellation-E. See <http://www.nova- eng.com/constellation.html> Both are very similar, but the "Constellation-E" adds a USB interface and uses the newer 10KE FPGAs from Altera <http://www.altera.com/html/products/f10ke.html>. The Constellation-E can utilize up to an EPF10K200S. For other boards see http://www.optimagic.com/boards Michael Rauf Nova Engineering, Inc. 1.800.341.NOVA (6682) 1.513.860.3456 1.513.860.3535 (fax) mailto:mrauf@nova-eng.com http://www.nova-eng.com 5 Circle Freeway Drive Cincinnati, Ohio, USA 45246 In article <38A1C9F2.9BDDD89B@dtic.ua.es>, "Sergio A. Cuenca Asensi" <sergio@dtic.ua.es> wrote: > Hello all, > I'm looking for a reconfigurable board (PCI, ISA) to develop REAL image > processing projects. > Any ideas? > > -- > =================================================================== > Sergio A. Cuenca Asensi > Dept. Tecnologia Informatica y Computacion (TIC) > Escuela Politecnica Superior, Campus de San Vicente > Universidad de Alicante > Ap. Correos 99, E-03080 ALICANTE > ESPAÑA (SPAIN) > email : sergio@dtic.ua.es > Phone : +34 96 590 39 34 > Fax : +34 96 590 39 02 > =================================================================== > > Sent via Deja.com http://www.deja.com/ Before you buy.Article: 20684
Take a look at Xtensa www.tensilica.com -dave In article <3898D606.6BA00B3E@cmt.co.il>, Irit <irit@cmt.co.il> wrote: >Hello, >I am looking for a CPU core which can be placed in an FPGA. It should >have the following features: > >1. 32-bit registers and ALU, integer only; don't need multiply or >divide. > >2. Fast enough to run at 66 MHz on a Virtex or Apex FPGA. > >3. Not too big (a single instance should fit in less than 100K system >gates, whatever it means). > >4. Possible to have more than one instance in a single chip; not locked >to specific cells or I/O pins. > >5. Must have code development tools (assembler, linker, debugger) >available; C compiler is nice-to-have but not mandatory. > >6. Preferably synthesizable VHDL or Verilog; if available as netlist or >routed block, must have a VHDL simulation model. > >7. Can be converted later to an ASIC cell. > >I would greatly appreciate any pointers; after all replies (if any) have >been sent, I will post a summary in the relevant NGs. > >Please send replies to my email (assaf_sarfati@yahoo.com) as well as >posting them; I suspect my NG server either loses posts or deletes them >after a few minutes. > > Thanks in Advance > Assaf Sarfati > >Article: 20685
Another possibility is the LEON VHDL SPARC compatible processor. It is in VHDL, at http://www.estec.esa.nl/wsmwww/leon/ . It doesn't QUITE meet your timing requirements, but they claim a synthesised performance of 45 MHz on an XCV300E-8 integer only, SPARC compatible, 32 bit memory bus, Icache, dcache, etc. Results in synthesis are 5300 LUTs on a Virtex 300E. It wouldn't surprise me if hand laying out the datapath would improve things (the John Henry FPGA panel discussion showed some impressive results for hand laying out datapaths). It is under the LGPL, so you can use this in a commercial product. -- Nicholas C. Weaver nweaver@cs.berkeley.eduArticle: 20686
If you have a spare I/O pin, configure it with a pullup and then connect this to the strobe input of the latch. That way, the backend tools can't optimize away your transparent latch without changing your design. Using a spare I/O pin to keep the optimizers from being too smart is a technique I use on occasion. Although it's not the prettiest solution, it does have the benefit of being quick, and works across various synthesis tools vendors as well as different revisions of the place-and-route software. Urb Bob Perlman wrote: > > On Wed, 16 Feb 2000 13:44:04 -0800, Peter Alfke <peter@xilinx.com> > wrote: > > >The classical solution to this old problem is to utilize the input flip-flop with > >its input delay, but configured as a latch, and hold it permanently transparent. > > > >Peter Alfke > > I tried the latch trick some years ago. It worked, but one > significant complication was that PPR (this was back in '92-'93) kept > trying to help by optimizing out the always-transparent latch. I > don't know if M2.1 has the same problem. > > If you try this approach, go into FPGA editor after the route and > confirm that the latch(es) didn't disappear. > > Finally, thanks to Joseph for posting the issue. I've been looking > for the simple, automatic solution for a long time. It's easy to say, > "Always go through the IOB FF," but in practice there are those > situations where the additional latency isn't tolerable. > > Good luck, > Bob Perlman > > ----------------------------------------------------- > Bob Perlman > Cambrian Design Works > Digital Design, Signal Integrity > http://www.best.com/~bobperl/cdw.htm > Send e-mail replies to best<dot>com, username bobperl > -----------------------------------------------------Article: 20687
That's well and good as long as you have the budget. Last time I checked, those roboclock thingies cost as much as the FPGA. Hernan, weren't you at Lattice before? Hernan Saab wrote: > Hello Josheph, > > I was tired and didnt realize what I wrote to you before. > One way to reduce the hold time is by "advancing the phase of the clock" using > a roboclock or any other clock manager. > > Hernan Javier Saab > Western Area Applications Engineer > Email: HERNAN@synplicity.com > Direct Phone: 408-215-6139 > Main Phone: 408-215-6000 > Pager: 888-712-5803 > FAX: 408-990-0295 > Synplicity, Inc. > 935 Stewart Drive > Sunnyvale, CA 94086 USA > Internet <http://www.synplicity.com > -- -Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email randraka@ids.net http://users.ids.net/~randrakaArticle: 20688
It depends on your requirements as well as the relative cost of the memories _for your design_. Typical block RAMs are only on the order of 4K bits, so one memory only makes a 4x5 multiplier (9 address bits to 8 data bits). Not too impressive. If you don't need the memory for something else in the design, you can use the memories for partial products, but be aware that it may actually be slower than a pipelined multiplier implemented in CLBs. "Keith Jasinski, Jr." wrote: > The new FPGAs that have RAM that can be configured/initialized like a ROM > are touting the ability to use the RAM as a single clock cycle multiplier by > using it as a look-up table. Maybe that might be your answer. > > -- > Keith F. Jasinski, Jr. > kfjasins@execpc.com > Pradeep Rao <pradeeprao@planetmail.com> wrote in message > news:88fc5e$b84$1@news.vsnl.net.in... > > Hi, > > > > Which would be the best implementation of a multiplier in VHDL > > (synthesisable) in terms of speed/area? > > I know of the array implementation and the register configuration using a single > > adder. Are there any other better ones ? > > Thanks in anticipation, > > > > Pradeep Rao > > > > > > > > -- -Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email randraka@ids.net http://users.ids.net/~randrakaArticle: 20689
You still should be able to instantiate the RAMB4 primitives as a black box, no? Mark Hillers wrote: > hi jeff, > > that's not possible bacause i use the synopsys behavioral-compiler. > but - we could fix the problem which seems to be a language-problem > between coregen-edif-writer and design-compiler-edif-reader. > > fjz001@email.mot.com wrote: > > > > Mark, > > > > Why not directly instantiate the RAMB4_S16_S16 in your HDL? In this > > case, Coregen just adds a layer of unnecessary complexity. -- -Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email randraka@ids.net http://users.ids.net/~randrakaArticle: 20690
I'd pick a Xilinx board for image processing rather than Altera. I think you'll find the altera architecture considerably more limiting, as it is not as adept as the xilinx architectures for arithmetic or delay queues (needed in the imaging filters for example). See my previous posts regarding altera vs xilinx for signal processing applications. mrauf@nova-eng.com wrote: > Nova Engineering has two different Altera FPGA development boards: > Constellation and Constellation-E. See <http://www.nova- > eng.com/constellation.html> > > Both are very similar, but the "Constellation-E" adds a USB interface > and uses the newer 10KE FPGAs from Altera > <http://www.altera.com/html/products/f10ke.html>. The Constellation-E > can utilize up to an EPF10K200S. > > For other boards see http://www.optimagic.com/boards > > Michael Rauf > Nova Engineering, Inc. > 1.800.341.NOVA (6682) > 1.513.860.3456 > 1.513.860.3535 (fax) > mailto:mrauf@nova-eng.com > http://www.nova-eng.com > 5 Circle Freeway Drive > Cincinnati, Ohio, USA 45246 > > In article <38A1C9F2.9BDDD89B@dtic.ua.es>, > "Sergio A. Cuenca Asensi" <sergio@dtic.ua.es> wrote: > > Hello all, > > I'm looking for a reconfigurable board (PCI, ISA) to develop REAL > image > > processing projects. > > Any ideas? > > > > -- > > =================================================================== > > Sergio A. Cuenca Asensi > > Dept. Tecnologia Informatica y Computacion (TIC) > > Escuela Politecnica Superior, Campus de San Vicente > > Universidad de Alicante > > Ap. Correos 99, E-03080 ALICANTE > > ESPAÑA (SPAIN) > > email : sergio@dtic.ua.es > > Phone : +34 96 590 39 34 > > Fax : +34 96 590 39 02 > > =================================================================== > > > > > > Sent via Deja.com http://www.deja.com/ > Before you buy. -- -Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email randraka@ids.net http://users.ids.net/~randrakaArticle: 20691
Ray Andraka (randraka@ids.net) wrote: : Wallace trees are not generally the fastest multipliers in FPGAs. See the If you pipeline them they generally are. However, it depends on how you define speed. If you are referring to the clocking rate, then a fully pipelined Wallace tree multiplier will provide the best results - over vector and array based techniques. However, Wallace trees require a large amount of device resource to do so (CLB count). If you are interested in pipelined structures and associated clocking rates, be prepared to experience an area/time tradeoff for multiplication implementations. That is, the faster you wish to clock the implementation, the more area you will have to use. If you are interested in the functional density of the implementation, I'd say that vector based approaches (which add partial products in parallel - using fast carry logic) provide best utilisation results. -- Mathew WojkoArticle: 20692
The app note is based on a system with an 8051 that has non-volatile storage, rather than just RAM. It could be adapted to your application, though usually you want to store the configuration for the FPGA in some non-volatile storage rather than just RAM. You would re-work the part that reads the data from the EPROM and read it from your in-memory data structures. The JTAG programmer is not used. The programming method is JTAG. The "meat" of the app note is the control of the FPGA programming in JTAG mode using a microcontroller. JTAG programming requires you to send data and control flow in very specific bit lengths and sequences. The majority of the C-code in the app note is the implementation of this and does not need to be modified. Many people are using JTAG on their systems to program FPGAs and CPLDs today. You can chain them up and program many devices using JTAG as well. Hope that helps, -r <elynum@my-deja.com> wrote in message news:88d7j9$h9g$1@nnrp1.deja.com... > I read the app note. It didn't mention about how to program the fpga > with an 8051 with just a buffer and ram. Can you do that? Can you do > it without using the jtag programmer? It was kind of confusing and the > C code was fairly long. It seems it would be faster to program with > just an eeprom. > > n article <2X5q4.52$mjh.185876992@news.frii.net>, > "rodger" <brownsco@frii.com> wrote: > > Try this: > > > > http://www.xilinx.com/xapp/xapp058.pdf > > > > The App Note is titled: > > > > Xilinx In-System Programming Using an Embedded Microcontroller - > XAPP058, > > v2.0 (06/99) > > > > It will get you started. The programming mode is JTAG and the > included code > > example is for a 8051, with minor modifications needed for other > > architectures. > > > > -r > > > > <elynum@my-deja.com> wrote in message > news:881ajg$c3l$1@nnrp1.deja.com... > > > How would I go about programming 2 xilinx fpga's on a single board? > > > Would I need 2 separate EEPROM chips(ATMEl) or just one? 
How would > > > I go about doing it with a microprocessor 8051 or 860? What would I > > > need to do this? > > > > > > > > > Sent via Deja.com http://www.deja.com/ > > > Before you buy. > > > > > > > Sent via Deja.com http://www.deja.com/ > Before you buy.Article: 20693
Try the $150 Atmel starter kit at <http://www.kanda.com>. A bit more on the board than the Xess ones. -John Matt Billenstein wrote: > > All, I'm interested in purchasing a prototyping board based on a Xilinx FPGA > and I have about $200 to spend. I've looked a little at the boards at > www.xess.com so far. Does anyone have any recommendations? > > thx in advance, > > m > > Matt Billenstein > REMOVEhttp://w3.one.net/~mbillens/ > REMOVEmbillens@one.netArticle: 20694
Mathew Wojko wrote: > Ray Andraka (randraka@ids.net) wrote: > : Wallace trees are not generally the fastest multipliers in FPGAs. See the > > If you pipeline them they generally are. > No, they are not. A wallace tree produces a sum vector and a carry vector. Those have to be added together to obtain the full sum. The tree portion of the wallace tree can be clocked quite fast if it is pipelined. However, that final adder determines the maximum clock rate of the multiplier. In an ASIC, that adder can be made quite a bit faster than a ripple carry adder using any of a number of fast adder schemes. A row ripple array multiplier can be made about as fast as a wallace tree if it is rearranged into a tree and all the adders in the tree are the same as the fast adder used to combine the sum and carry vectors from the wallace tree. That, of course, comes at a considerable cost in area. The real advantage to the wallace tree is that it allows you to use cheap full adders for the array, and only one copy of the expensive fast adder. It comes at a cost of a very complicated routing pattern. Now fade to the FPGA. The fast carry chain logic in modern FPGAs is a highly optimized dedicated path that is about an order of magnitude faster than logic implemented in the LUT logic and connected via the general routing resources. That fact makes it extremely difficult to improve upon the performance of the carry chain ripple carry adder. This non-homogenous mix of logic means that the cheap ripple carry adder is about as fast as you're gonna get in the FPGA (short of pipelining the carry) for word widths up to around 24-32 bits. The result is a wallace tree buys you nothing in terms of area, and in fact is twice as big as a row-ripple tree because the ripple carry adders use one LUT per bit (the carry is in dedicated logic in xilinx or splits the lut in altera) where the full adders in the wallace tree need two luts per bit (one for sum, one for carry). 
The larger area costs clock cycle time since the routing in FPGAs has substantial delay. Now pipelining will get back the performance (requires a register immediately in front of the final adder for best clock speed), but the fact of the matter is you are still limited by the speed of that final adder. So a wallace tree gets you at best, the same performance as a row-ripple tree with double the area (more if you use partial product techniques at the front layer). This is why a wallace tree multiplier is not appropriate for an FPGA. That said, the column route delay penalty in Altera 10K devices does make a wallace tree a little more attractive for pipelined trees that cannot fit in one row. The reason for that is the clock period is limited by the delay from the output register on one level of the tree through the carry chain to the msb output register of the next level. If the levels cross a row boundary, there is a significant delay hit which will reduce the clock frequency unless additional registers are added ahead of and in the same row of the carry chain. If the tree extends across several rows, several layers of pipeline registers are needed if the tree is all ripple carry adds. A wallace tree can reduce the hit, but again at the expense of a considerable amount of area...and that is only true for trees that extend across more than two rows. You get the same clock cycle performance in less area by simply adding the extra pipeline registers instead of doing a wallace tree, but at the expense of a little clock latency. Note that this is a special case. The other special case occurs in FPGAs without carry chains, where in order to get an advantage by using a wallace tree, your final adder should use a fast carry scheme. > However, it depends on how you define speed. If you are referring to the > clocking rate, then a fully pipelined Wallace tree multiplier will provide > the best results - over vector and array based techniques. 
However, > Wallace trees require a large amount of device resource to do so (CLB count). > > If you are interested in pipelined structures and associated clocking > rates, be prepared to experience an area/time tradeoff for multiplication > implementations. That is, the faster you wish to clock the implementation, > the more area you will have to use. > > If you are interested in the functional density of the implementation, > I'd say that vector based approaches (which add partial products in parallel > - using fast carry logic) provide best utilisation results. For FPGAs with fast carry chains, these partial product techniques also provide the fastest multipliers short of pipelining the carries. > > > -- > Mathew Wojko -- -Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email randraka@ids.net http://users.ids.net/~randrakaArticle: 20695
Was the movement limited to within a CLB? If not check the placement report to see what and why it happened. Bob Perlman wrote: > Hi - > > I don't know how many of you use the Xilinx M2.1 floorplanner. If you > do, I have a question for you. > > Yesterday I used the floorplanner to place portions of a > schematic-based XCS30XL design, and managed to go from a design that > failed route after 1-1/2 hours (didn't complete route and didn't meet > timing on the routed nets) to a design that routed and met all timing > constraints in 40 minutes. So, I'm happy with the results, but was > puzzled by the fact that the Xilinx tools moved some of the cells that > I'd placed. Any RPMs that I placed stayed put, but cells that I'd > moved individually into the placement window were sometimes in new > places after routing. You could see that the place and route tools > had kept the cells more or less where I'd placed them, but moved some > cells around. > > Is this expected behavior when using the floorplanner? If so, what's > to keep I/O pin assignments from moving? > > Thanks, > Bob Perlman > > > ----------------------------------------------------- > Bob Perlman > Cambrian Design Works > Digital Design, Signal Integrity > http://www.best.com/~bobperl/cdw.htm > Send e-mail replies to best<dot>com, username bobperl > ----------------------------------------------------- -- -Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email randraka@ids.net http://users.ids.net/~randrakaArticle: 20696
Yes, you can significantly accelerate this with an FPGA if I am reading this right. I think the hash function is basically a PN sequence, which you can undo using finite field arithmetic. It more or less reduces to a shift register and xor gates, which can be quite compactly realized in xilinx FPGAs (using the clb ram feature), or a bit less compactly in Altera parts. In current FPGAs, you can do a PN generate at bit rates well over 200MHz. At 64 bits per, that is over 3 million/sec with a single copy, and you have room in an FPGA for lots of these. It will take more logic to distill the results than to do the hashes. The big numbers in your polynomial look a little suspect. I would expect the upper exponents to be more like 64 and 59? Neill Clift wrote: > Could anyone give me an idea if the following would be possible to do with > one of the many > programable logic device I see mentioned here? > > VMS hashes user passwords using a polynomial over Zp. p = 2^64-59 and the > polynomial > looks like this: > > f(x) = x ^16777213 + A * x ^16777153 + B * x ^3 + C * x^2 + D * x + E (mod > p) > > On say a 600Mhz PIII I can evaluate 0.4 million values of this polynomial / > sec. > > Noting that 2 is a primitive root mod p we can make a search of the whole > space much faster by calculating like this: > > f(0), f(1), f(2), f(4), ... f(2^r),..., f(2^(p-2) > > We can use each term calculated for f(2^r) to calculate the terms for > f(2^(r+1)) just by multiplying > by the constants 2^16777213, 2^16777153, 8,4,2 and 1 (mod p). > > So to search then entire 64 bit space of the problem involves doing / point: > > 2 x 64 bit multiplies mod p where the multiplier is a constant with no > special structure and 3 > small constant multiplies (that can even be converted to additions) followed > by 5 additions (all > mod p). > > Once again on a 600Mhz PIII I can do something like 4 million points / sec. > > Is this the kind of problem thats easily done with an FPGA etc? 
I would need > to be able to fit > many of these on a single device to divide the problem space up. > > What current devices are available that would be best suited to such a task? > Are they affordable > by someone who just wants to play about like this? > Thanks. > Neill. -- -Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email randraka@ids.net http://users.ids.net/~randrakaArticle: 20697
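Ray's claim that this maps to simple hardware can be sanity-checked in software first. Below is a minimal Python sketch of the incremental stepping scheme Neill describes: walk x through the powers of the primitive root 2 mod p, updating each monomial term by a constant multiply instead of re-evaluating the huge exponentials. The coefficients A..E are made-up placeholders for illustration, not the real VMS hash constants.

```python
# Sketch (illustrative only) of the incremental polynomial search.
p = 2**64 - 59                      # prime modulus from the post
A, B, C, D, E = 3, 5, 7, 11, 13     # hypothetical coefficients

# Stepping x from 2^r to 2^(r+1) multiplies each monomial x^k by 2^k,
# so precompute the two large per-step multipliers mod p.
m1 = pow(2, 16777213, p)
m2 = pow(2, 16777153, p)

def search(target, steps):
    """Scan f(2^r) for r = 0..steps-1; return x = 2^r mod p on a match."""
    t1 = t2 = t3 = t4 = t5 = 1      # monomial terms at x = 2^0 = 1
    for r in range(steps):
        if (t1 + A*t2 + B*t3 + C*t4 + D*t5 + E) % p == target:
            return pow(2, r, p)
        t1 = (t1 * m1) % p          # x^16777213 term
        t2 = (t2 * m2) % p          # x^16777153 term
        t3 = (t3 * 8) % p           # x^3 term
        t4 = (t4 * 4) % p           # x^2 term
        t5 = (t5 * 2) % p           # x term
    return None
```

Each step costs two large fixed-coefficient multiplies and three cheap shift-like multiplies mod p, matching the operation count Neill gives; in an FPGA each of those becomes a constant-coefficient modular multiplier.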
For devices, depends on how many you want to do at once. Even the small FPGAs can handle several of the PN generators. Perhaps a good starting point would be the XESS board, which is available with a 4000 series xilinx, software and a student manual for about $200. Ray Andraka wrote: > Yes, you can significantly accelerate this with an FPGA if I am reading this > right. I think the hash function is basically a PN sequence, which you can undo > using finite field arithmetic. It more or less reduces to a shift register and > xor gates, which can be quite compactly realized in xilinx FPGAs (using the clb > ram feature), or a bit less compactly in Altera parts. In current FPGAs, you > can do a PN generate at bit rates well over 200MHz. At 64 bits per, that is > over 3 million/sec with a single copy, and you have room in an FPGA for lots of > these. It will take more logic to distill the results than to do the hashes. > The big numbers in your polynomial look a little suspect. I would expect the > upper exponents to be more like 64 and 59? > > Neill Clift wrote: > > > Could anyone give me an idea if the following would be possible to do with > > one of the many > > programable logic device I see mentioned here? > > > > VMS hashes user passwords using a polynomial over Zp. p = 2^64-59 and the > > polynomial > > looks like this: > > > > f(x) = x ^16777213 + A * x ^16777153 + B * x ^3 + C * x^2 + D * x + E (mod > > p) > > > > On say a 600Mhz PIII I can evaluate 0.4 million values of this polynomial / > > sec. > > > > Noting that 2 is a primitive root mod p we can make a search of the whole > > space much faster by calculating like this: > > > > f(0), f(1), f(2), f(4), ... f(2^r),..., f(2^(p-2) > > > > We can use each term calculated for f(2^r) to calculate the terms for > > f(2^(r+1)) just by multiplying > > by the constants 2^16777213, 2^16777153, 8,4,2 and 1 (mod p). 
> > > > So to search then entire 64 bit space of the problem involves doing / point: > > > > 2 x 64 bit multiplies mod p where the multiplier is a constant with no > > special structure and 3 > > small constant multiplies (that can even be converted to additions) followed > > by 5 additions (all > > mod p). > > > > Once again on a 600Mhz PIII I can do something like 4 million points / sec. > > > > Is this the kind of problem thats easily done with an FPGA etc? I would need > > to be able to fit > > many of these on a single device to divide the problem space up. > > > > What current devices are available that would be best suited to such a task? > > Are they affordable > > by someone who just wants to play about like this? > > Thanks. > > Neill. > > -- > -Ray Andraka, P.E. > President, the Andraka Consulting Group, Inc. > 401/884-7930 Fax 401/884-7950 > email randraka@ids.net > http://users.ids.net/~randraka -- -Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email randraka@ids.net http://users.ids.net/~randrakaArticle: 20698
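For anyone who hasn't built one, the "shift register and xor gates" structure Ray keeps referring to is a linear feedback shift register. Here is a minimal Galois-form sketch in Python; the 16-bit tap mask is a common textbook example, not anything taken from VMS or the posts above.

```python
# Minimal Galois-form LFSR, sketching the PN-generator structure.
TAPS = 0xB400  # example 16-bit tap mask (taps at bits 16, 14, 13, 11)

def lfsr_step(state):
    """Advance the 16-bit Galois LFSR by one bit: shift, then XOR taps."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= TAPS
    return state

def pn_bits(seed, n):
    """Emit n bits of the PN sequence starting from a nonzero seed."""
    out, s = [], seed
    for _ in range(n):
        out.append(s & 1)
        s = lfsr_step(s)
    return out
```

In hardware each step is one register clock plus a few XORs, which is why FPGAs run these at hundreds of MHz.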
What are you planning to do with the board? The best choice of devices really depends on the application. John Rible wrote: > Try the $150 Atmel starter kit at <http://www.kanda.com>. A bit more on the > board than the Xess ones. > > -John > > Matt Billenstein wrote: > > > > All, I'm interested in purchasing a prototyping board based on a Xilinx FPGA > > and I have about $200 to spend. I've looked a little at the boards at > > www.xess.com so far. Does anyone have any recommendations? > > > > thx in advance, > > > > m > > > > Matt Billenstein > > REMOVEhttp://w3.one.net/~mbillens/ > > REMOVEmbillens@one.net -- -Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email randraka@ids.net http://users.ids.net/~randraka
Article: 20699
In article <38ACBDDF.CA310F0A@ids.net> you wrote: : Mathew Wojko wrote: : > Ray Andraka (randraka@ids.net) wrote: : > : Wallace trees are not generally the fastest multipliers in FPGAs. See the : > : > If you pipeline them they generally are. : > : No, they are not. A wallace tree produces a sum vector and a carry vector. : Those have to be added together to obtain the full sum. : However, that final adder determines the maximum clock rate of the multiplier. Precisely. The Wallace tree is a carry-save architecture. When pipelined, carry values only ever propagate one bit position within each stage of processing (no carry propagation latencies are experienced). Thus fast clocking rates for this 'tree part' of the multiplier can be achieved. However, when combining the carry and sum vectors, you do not want to compromise the performance obtained thus far from the 'tree part' of the multiplier. A simple ripple adder implemented using fast carry logic will not yield the same performance as achieved by the wallace tree, so overall performance will suffer. : Now fade to the FPGA. The fast carry chain logic in modern FPGAs is a highly : optimized dedicated path that is about an order of magnitude faster than logic : implemented in the LUT logic and connected via the general routing resources. : That fact makes it extremely difficult to improve upon the performance of the : carry chain ripple carry adder. This is the point that I don't necessarily agree on. I agree that you cannot improve on the performance of a ripple carry adder: using the fast-carry logic provides unparalleled results for that implementation. However, there exist other addition techniques that will provide better pipeline performance when implemented on an FPGA. The trick is not to ripple or propagate the carry over great lengths between successive pipeline stages. 
: This non-homogenous mix of logic means that the : cheap ripple carry adder is about as fast as you're gonna get in the FPGA (short : of pipelining the carry) for word widths up to around 24-32 bits. Exactly. If you pipeline the carry then you can achieve performance matching that of the wallace tree. Remember that the Wallace tree pipelines the carry result at every stage of processing; that's why it's called a carry-save technique. Why you would want to use a carry ripple adder after expending the extra logic to implement a Wallace tree to reduce partial products is beyond me. : The result is : a wallace tree buys you nothing in terms of area, and in fact is twice as big as : a a row-ripple tree because the ripple carry adders use one LUT per bit (the : carry is in dedicated logic in xilinx or splits the lut in altera) where the full : adders in the wallace tree need two luts per bit (one for sum, one for carry). I agree that the wallace tree requires more area than the row-ripple tree. As you have pointed out, that's because you do not pipeline the carry values in a row-ripple tree (what I call vector-based computation), whereas in the wallace tree you do. As such, the wallace tree *does* give you added performance for the area. The clocking speed is substantially faster since carry values only propagate one bit position between pipeline stages rather than up to 2n bits as in the row-ripple technique. : The larger area costs clock cycle time since the routing in FPGAs has substantial : delay. Now pipelining will get back the performance (requires a register : immediately in front of the final adder for best clock speed), but the fact of : the matter is you are still limited by the speed of that final adder. But that's my point. Why include a carry ripple adder at the final stage? It is the obvious performance-limiting factor. By using carry-lookahead techniques you can obtain better performance results than the carry ripple adder. 
This holds regardless of the carry ripple adder implemented by the fast-carry logic. : So a wallace tree gets you at best, the same performance as a row-ripple tree with : double the area (more if you use partial product techniques at the front layer). : This is why a wallace tree multiplier is not appropriate for an FPGA. Sorry, but I disagree. A wallace tree multiplier is appropriate for an FPGA *if* you use the appropriate adder to combine the sum and carry results. The BCLA adder is a perfect addition technique to combine with the wallace tree. Using this (implemented correctly), the pipeline latency at every stage of processing will only be from one 4-input LUT output to a register. Thus this technique matches well to both ALTERA and Xilinx FPGA architectures. : That said, the column route delay penalty in Altera 10K devices does make a : wallace tree a little more attractive for pipelined trees that cannot fit in one : row. The reason for that is the clock period is limited by the delay from the : output register on one level of the tree through the carry chain to the msb : output register of the next level. If the levels cross a row boundary, there is : a significant delay hit which will reduce the clock frequency unless additional : registers are added ahead of and in the same row of the carry chain. If the tree : extends across several rows, several layers of pipeline registers are needed if : the tree is all ripple carry adds. A wallace tree can reduce the hit, but again : at the expense of a considerable amount of area...and that is only true for trees : that extend across more than two rows. You get the same clock cycle performance : in less area by simply adding the extra pipeline registers instead of doing a : wallace tree, but at the expense of a little clock latency. Note that this is a : special case. 
: The other special case occurs in FPGAs without carry chains, where : in order to get an advantage by using a wallace tree, your final adder should use : a fast carry scheme. : > However, it depends on how you define speed. If you are referring to the : > clocking rate, then a fully pipelined Wallace tree multiplier will provide : > the best results - over vector and array based techniques. However, : > Wallace trees require a large amount of device resource to do so (CLB count). : > : > If you are interested in pipelined structures and associated clocking : > rates, be prepared to experience an area/time tradeoff for multiplication : > implementations. Thats is, the faster you wish to clock the implementation, : > the more area you will have to use. : > : > If you are interested in the functional density of the implementation, : > I'd say that vector based approaches (which add partial products in parallel : > - using fast carry logic) provide best utilisation results. : For FPGAs with fast carry chains, these partial product techniques also provide : the fastest multipliers short of pipelining the carries. : > : > : > -- : > Mathew Wojko : -- : -Ray Andraka, P.E. : President, the Andraka Consulting Group, Inc. : 401/884-7930 Fax 401/884-7950 : email randraka@ids.net : http://users.ids.net/~randraka -- Mathew Wojko
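The carry-save idea both posters are arguing over can be shown in a few lines. This Python sketch (illustrative only, not anyone's FPGA implementation) reduces operands 3-to-2 with bitwise full adders, so carries are saved rather than propagated, until one final carry-propagate addition remains:

```python
# 3:2 compressor: a full adder at every bit position, carries saved.
def csa(a, b, c):
    s = a ^ b ^ c                               # per-bit sum, no ripple
    carry = ((a & b) | (a & c) | (b & c)) << 1  # saved carries, shifted left
    return s, carry                             # invariant: s + carry == a + b + c

# Wallace-style reduction: compress 3-to-2 until two vectors remain,
# then do the single carry-propagate add that limits the clock rate.
def wallace_sum(operands):
    ops = list(operands)
    while len(ops) > 2:
        s, carry = csa(ops[0], ops[1], ops[2])
        ops = ops[3:] + [s, carry]
    return sum(ops)                             # the final adder in dispute
```

In Ray's argument that final add is the cheap fast-carry ripple adder; in Mathew's it would instead be a pipelined carry-lookahead stage so no stage ripples a carry across the full word.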