In the process of typing this algorithm into some examples, I realized that I can improve the part of the algorithm that is involved with the base conversion. This will speed the circuit up only very slightly, but a few LUTs will get removed, and it's a sweet non-standard use of a Virtex carry chain.

The algorithm involves using a Xilinx carry chain to compute some, but not all, of the carries in the addition of a constant to a variable. The section this applies to is here:

-- From this it is obvious that to convert the number "A" in base 8
-- to base 8-4, I need merely add the octal constant o444444... to "A".
-- This perfectly converts it to the corresponding (i.e. carrying the
-- same numerical value) number in base 8-4.
--
-- This conversion is very convenient when the multiplier has a lot of
-- bits, but it isn't needed for relatively short multipliers. In
-- particular, a multiplier of n x 5 would do well to avoid performing
-- the base conversion explicitly.
--
--
-- After performing the base conversion, I take each digit from B
-- (where B = A + o44...44. = A + "100100100...100100") and use
-- it to create a single partial product. For an n x (3m-1) multiply,
-- I'll end up with m partial products.

What I realized is that instead of sending the complete "Base 8-4" value to the logic that determines the mode lines (each set of mode lines takes 3 bits of the Base 8-4 version of the multiplier and creates 2 mode bits suitable for selecting which partial result to compute, i.e. which of {1M, 2M, 3M, 4M}), I can instead make the mode lines work off of four values. The four values would include the 3 bits of the base 8 version of the multiplier, and another bit which indicates whether or not a carry into these three bits will be caused by the addition of the "4444...444" constant.

Example, 24 bits. Each of the eight partial products will always use its associated three bits. The lowest partial product needs no other input.
The other partial products need one more bit, and this bit is the carry-in to that digit in the + "44..44" calculation. To get this, I need a chain of functions C. Each C is a function of four variables. Three of them are the bits of the previous digit; the fourth is the carry-in to the previous digit. Together, those four values give the carry-in to the given digit. The whole thing connects together like this:

X 222 211 111 111 11
  321 098 765 432 109 876 543 210
  --- --- --- --- --- --- --- ---
   7   6   5   4   3   2   1   0   (digit #)
  ||| ||| ||| ||| ||| ||| ||| |||
  VVV VVV VVV VVV VVV VVV VVV VVV
  C<--C<--C<--C<--C<--C<--C<--C--'0'
  8   7   6   5   4   3   2   1

C(0) = '0';
C(I) = X(3*I-1) or (C(I-1) and X(3*I-2) and X(3*I-3));

(Note: C(1) = X(2).)

The logic for the mode bits for the Ith digit is then a function of X(3*I+2 downto 3*I) & C(I). In that logic, C acts exactly as a carry-in to the three X bits used, modifying the mode lines appropriately.

It turns out that the "C" function can be programmed into the carry structure of a Virtex. (I haven't built this or simulated it yet, so there could be an error, but I'm pretty sure this works.) The carry outputs are brought out of the carry chain; the sum outputs are ignored. This means that the carry chain leaves the flip-flops, along with their CE, CLK, and SR controls, unused.

X(3*I-1) is placed on "I0", and sent to the '0' input of the MUXCY, which gets its select from the LUT, as is standard. Only 3 inputs to the LUT are used, so it's really a LUT3. When the LUT3 is zero, this will send X(3*I-1) to the carry. The '1' input of the MUXCY is connected to C(I-1), as is standard with Virtex carry chains. The other two LUT3 inputs are connected to X(3*I-2 downto 3*I-3).

The LUT3 is programmed to give a '1' only in the case where X(3*I-1 downto 3*I-3) == "011". This is the propagate case. If X is this value, the carry-out needs to be equal to the carry-in, and this is exactly what happens.
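The carry function described above can be checked in software before committing it to a carry chain. The following is only a minimal sketch in Python, not the VHDL of the design: the names carry_chain, reference_carries and check are made up for illustration, and the LUT3 and MUXCY are modelled as plain integer logic on the assumption that the MUXCY passes its carry-in when the select is '1' and its '0' input otherwise.

```python
def carry_chain(x: int, digits: int) -> list[int]:
    """Return [C(0), ..., C(digits)] using the LUT3 + MUXCY arrangement."""
    c = [0]                                  # C(0) = '0'
    for i in range(1, digits + 1):
        y = (x >> (3 * i - 3)) & 0b111       # Y(2 downto 0) = X(3i-1 downto 3i-3)
        select = 1 if y == 0b011 else 0      # LUT3 output: '1' only for "011"
        di = (y >> 2) & 1                    # X(3i-1), the MUXCY '0' input
        c.append(c[i - 1] if select else di) # MUXCY: carry-in if select, else di
    return c


def reference_carries(x: int, digits: int) -> list[int]:
    """Carries into each digit when octal 44...4 is actually added to x."""
    carries = [0]
    for i in range(digits):
        digit = (x >> (3 * i)) & 0b111
        carries.append((digit + 4 + carries[i]) >> 3)
    return carries


def check(digits: int) -> bool:
    """Exhaustively compare both models for every multiplier of this width."""
    return all(carry_chain(x, digits) == reference_carries(x, digits)
               for x in range(1 << (3 * digits)))
```

Calling check(4) walks all 4096 twelve-bit multipliers; a True result only says the recurrence matches the real addition in this model, not that a particular placement in the part is correct.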
The resulting operation in the LUT3 / Carry chain section is as follows, where I've renumbered the three bits of X to be Y(2 downto 0):

 Y
210  CIN | COUT  Description
---  --- + ----  -----------
000   0  |  0
000   1  |  0
001   0  |  0
001   1  |  0
010   0  |  0
010   1  |  0
011   0  |  0
011   1  |  1    Propagate case
100   0  |  1    Generate
100   1  |  1    Generate
101   0  |  1    Generate
101   1  |  1    Generate
110   0  |  1    Generate
110   1  |  1    Generate
111   0  |  1    Generate
111   1  |  1    Generate

It's pretty clear that this is exactly what I need. The result is an extremely efficient way of precomputing the carry-ins. Only one LUT is used per 3 bits of the X multiplier.

I'll be coding this up over the next few days, providing I find the time.

Carl

--
Posted from firewall.terabeam.com [216.137.15.2]
via Mailgate.ORG Server - http://www.Mailgate.ORG

Article: 37826

"Rick Filipkiewicz" <rick@algor.co.uk> wrote in message
news:3C21B681.887ECA2D@algor.co.uk...
> I think I've figured out how these things happen & how the obvious default of
> ``-pr b'' doesn't happen.
>
> It goes like this
> 7. User complains of stupid default value.
> 8. Engineer agrees, changes it, and then gets told ``You can't do that, it's now in
> the manual''.
> Swap h/w for s/w and all the above is based on personal experience.

Don't forget the marketing guy telling everyone that leaving the default as "don't pack IOB FFs" forces the engineers to use an attribute that makes their code less portable, thus locking in Xilinx. Of course the newbie won't make his setup and clock-to-out times, and he'll switch anyway.

Article: 37827
On Fri, 21 Dec 2001 08:47:02 +1300, Jim Granville
<jim.granville@designtools.co.nz> wrote:

>Bob Perlman wrote:
>>
>> On Thu, 20 Dec 2001 18:19:54 GMT, Andy Peters
>> <andy@exponentmedia.nospam.com> wrote:
>>
>> >Bob Perlman wrote:
>> >
>> >> On Wed, 19 Dec 2001 17:19:58 GMT, Andy Peters
>> >> <andy@exponentmedia.nospam.com> wrote:
>> >>
>> >>>Stephen Byrne wrote:
>> >>>
>> >>>>I originally posted this yesterday on google groups, but I'm not seeing it
>> >>>>on my home news server. In case it is not visible to all, I'm reposting.
>> >>>>
>> >>>>Hello All,
>> >>>>
>> >>>>My company is currently comparing 66MHz PCI core solutions from Xilinx
>> >>>>and Altera, as well as debating using a home-spun core. One issue
>> >>>>I've come upon is the PCI requirement for a MAX clock-to-out time of 6
>> >>>>ns and MIN clock-to-out time of 2 ns. Both the Xilinx ISE and Altera
>> >>>>Quartus II tools seem very helpful in supplying MAX (worst-case) Tco
>> >>>>times, but I don't see any info on best-case times. Apparently the
>> >>>>SDF files for back-annotated timing sim have the same worst-case
>> >>>>numbers repeated 3 times, resulting in the same simulation regardless
>> >>>>of case selection. My question is: how is anyone (FPGA vendors
>> >>>>included) guaranteeing a MIN Tco of 2 ns across all conditions and
>> >>>>parts if the design tools don't even yield that information?
>> >>>>
>> >>>
>> >>>You like to live dangerously if you depend on best-case timing information.
>> >>>
>> >>
>> >> What's the alternative?
>> >
>> >Um, worst-case timing information?
>>
>> Worst-case timing information isn't a substitute for best-case timing
>> information; you need both. If you're trying to calculate setup
>> margin, you need the worst-case clock-to-Q time of the driving device.
>> But if you're trying to calculate the hold margin, you need the
>> best-case (i.e., shortest) clock-to-Q time. That's why Stephen Byrne
>> was looking for best-case timing.
>>
>> Bob Perlman
>
> Seems a terminology problem.
> BOTH the shortest and longest time delays can be considered 'worst case',
> from a statistical and tolerance sense.

I don't see how this can be written off as a disagreement on terminology. Take a look at Stephen Byrne's original post, above. It's clear that he already has max timing numbers, which he's called "worst-case", and that he's looking for min timing, which he calls "best-case." We can quibble about terms, but he's done us the favor of defining what he means (besides, 95% of the designers I've worked with use the very same terminology).

It's also clear that he's talking about communications among chips on a PCI bus, not between a chip and itself, where delay tracking might help.

Bob Perlman
Cambrian Design Works

Article: 37828
Thanks a lot for the information, none, you are very helpful.

I want to ask something else. I synthesize my VHDL models using FPGA EXPRESS 3.6, but I can't edit any constraints there, so I first produce the EDIF from my models and later I edit the timing constraints using the ISE4.1i (alliance) constraint editor. I have heard that this is not a good thing; the best option is to use the ucf during synthesis.

In FPGA EXPRESS I can't edit constraints because this option is unavailable (so is the "view schematic" option); Xilinx refers to that and says I need a special license. I e-mailed Xilinx and asked them how I can enable these 2 functions, and they sent me a license.dat file where FPGA EXPRESS is not mentioned in the package declaration, so I can't load the program. I searched everywhere but nothing works. I am sure that I made the correct changes in the license.dat file (hostid, host name, path of the daemon, etc.). The sad thing for me is that Xilinx's support will be closed during the Christmas holidays, and that means I will have a 10-day delay, which is going to be catastrophic.

Is there any way I can fix that problem and use the FPGA EXPRESS "edit constraints" and "view schematic" functions?

Thanks and my best regards,
Harris

"none" <x@y.z> wrote in message news:RIlU7.942$yw1.4862@news.uk.colt.net...
> Hi Harris,
>
> > I have a small question. I work in a Virtex-E FPGA, my model has 4 clocks (3
> > in 155MHz and the other one slower~100MHz) as inputs. In the ucf file I
> > located all clocks in the GCK pins, is that right?
> Yup
>
> > If yes, is this the only constraint that ensures proper distribution of the
> > signals?
> I think it is likely that your synthesis tool will infer clock buffers for
> you (provided you use the dedicated clock pins).
> Alternatively you could instantiate the components directly to remove any
> doubt (I needed to 'cos at the time it was the only way to get the LVPECL
> input standard that I needed).
> I used IBUFG to get the clock on chip, then BUFG to distribute it for simple
> clocking and
> IBUFG then CLKDLLE then BUFG in a more complex setup.
> You could do this if your synthesis tool doesn't infer clock distribution
> directly (try it and see).
>
> > Ah , and another one :) .. one of the processes (VHDL coding) in my model
> > that i want to implement uses the falling edge of the slow clock while the
> > others use the rising edge of all clocks, is this going to be a problem?
> No problem, an entity can contain rising edge clocked processes and falling
> edge clocked processes, you just can't have rising & falling in one process.
>
> Fred

Article: 37829
Hello,

Is that difference in speed owing to "faster" flip-flops? So if you want to buy an FPGA, must you determine the speed grade you want, or does the same FPGA have the ability to operate at different speeds (i.e. 6, 8)? I am a bit confused.

Harris

"Ray Andraka" <ray@andraka.com> wrote in message news:3C216F7E.76102A77@andraka.com...
> Those suffixes are the speed grade. The parts are graded as they come
> through test and "binned" according to their performance scores and the
> relative demand for the various speed grades. This way the vendor can
> sell the faster parts at a higher premium. For the virtex families, the
> higher the number, the higher the performance. In the 4K families, it
> went the other way with the smaller numbers indicating faster parts.
>
> Better is a relative term. If you need the speed, then the faster parts
> are 'better'. If you need to keep your costs down, then the slower parts
> are 'better' because they are significantly cheaper.
>
> Antonio wrote:
>
> > Some hardware question on FPGA :
> >
> > 1) What's the difference between a part with speed -3 and another with
> > speed -4 , the number is the number of metal layers ??
> >
> > 2) I read data sheet of Virtex and Virtex E, I didn't found really
> > much difference, can you explain me which is better and why ??
> >
> > Thanks
>
> --
> --Ray Andraka, P.E.
> President, the Andraka Consulting Group, Inc.
> 401/884-7930 Fax 401/884-7950
> email ray@andraka.com
> http://www.andraka.com
>
> "They that give up essential liberty to obtain a little
> temporary safety deserve neither liberty nor safety."
> -Benjamin Franklin, 1759

Article: 37830

> The recommendation is to wait 500 ms at room temp (longer if cold), to

I have done some more work on this and have found that it behaves itself when the feedback loop samples the CLK0 output directly.

This DCM is being used for deskewing a board-level clock, so I want to connect an IOB to FB, not CLK0. When I do this, the problem reappears.

It's quite consistent and reproducible - it breaks when I use external feedback. Any ideas? The IOBs (both output K and K# and input FB) are set to HSTL_II_DCI.

Thanks for your help.

--
David Miller, BCMS (Hons)  | When something disturbs you, it isn't the
Endace Measurement Systems | thing that disturbs you; rather, it is
Mobile: +64-21-704-djm     | your judgement of it, and you have the
Fax:    +64-21-304-djm     | power to change that. -- Marcus Aurelius

Article: 37831
Kevin Brace wrote:
>
> I must say that if Virtex-E/Spartan-IIE supported 5V PCI I/O, I rather
> have used them instead of Virtex/Spartan-II because newer devices tends
> to be cheaper than the older devices of the same density, or for the
> same amount of money, you get more gates (and features).
> Although not the question you asked, Altera basically did the same thing
> ...

It was with a heavy heart that we dropped 5-V tolerance, but the newer processes do not support it, at least not at a reasonable cost. We are being pushed relentlessly to build faster and cheaper chips, and there is one thing that has to give: it's the oxide thickness and thus the resultant supply voltage and input voltage tolerance.

From our perspective, 5-V really ought to be retired. I remember when it was introduced as the DTL and later TTL supply voltage around 1965. That means it has lived a long and productive life. Let's do our best to retire this standard...

And, watch out, 3.3 V will not live forever, either!

But as you said, you can still buy 5-V Vcc devices, and 3.3-V Vcc devices with 5-V input tolerance. They are just not the fastest, biggest or most cost-effective ones.

Peter Alfke

Article: 37832

"S. Ramirez" wrote:
> Don't forget the marketing guy telling everyone that leaving the default as
> "don't pack IOB FFs" forces the engineers to use an attribute that makes
> their code less portable, thus locking in Xilinx...

We are really not that devious. Instead, we suggest the designer use the DCM (with 50-ps phase-delay stepping), the dual-ported (honestly dual-ported) BlockRAM, the multiplier, the SRL16 and the digitally-controlled output impedance. That should keep the other guys out... :-)

Peter Alfke

Article: 37833
Kevin Brace wrote:

> Hi, I will like to know if someone knows the strategies on how to reduce
> routing (net) delays for Spartan-II.

A few things.

1) Look very hard at how logic on failing paths is designed. Is there a simpler way to do the function? Can you split a complex function into two simple functions? Can you move some of the logic to the other side of registers?

2) Does XST re-order logic? If so, you might make sure that the order of functions is good:

x = f(a,b,c,f(d,e,f,g))

will be faster for a, b and c than for d, e, f and g. Fine if a is the critical signal, bad if g is. Change it to (and I don't know enough about XST to tell you how to do this):

x = f(g,a,b,f(c,d,e,f))

or similar, with the speed-critical net having the fewest levels of logic.

[f(a,b,c) is a three input lookup table with input signals a, b and c]

3) What effort level are you running PAR at? "5" is the highest. Use it.

> Here are some solutions I came up with.
>
> 1) Reduce the signal fanout (Currently at 35 globally, but FRAME# and
> IRDY#'s fanout are 200. What number should I reduce the global fanout
> to?).

If you have a problem with fanout, you may want to control how the fanout is split up. Telling the synthesis tool to reduce fanout isn't good, as the synthesis tool does not have a clue as to how the logic is located, so it may split the net in a way that makes no sense. No, I should say it will split nets in ways that make no sense.

This may mean that you will need to add a module to your design with the buffering for this net. Again, I don't know how to force mapping of logic in XST.

> 3) Floorplan all the LUTs and FFs on the FPGA (currently, I only
> floorplanned the LUTs that violated Tsu, and most of them take inputs
> from FRAME# and IRDY#.).

Logic that is near the critical paths may need to be floorplanned to avoid interaction with the critical path. "Near" can be logical or physical.

> 4) Use Guide file Leverage mode in Map and Par.

This might help. To use this feature, make a sub-design with the critical path and as little else as reasonable, and PAR this design into "my_guidefile.ncd". Then do a guided MAP and PAR with this as a guide file.

> P.S. Considering that I am struggling to meet 33MHz PCI timings with
> Spartan-II speed grade -5, how come Xilinx meet 66MHz PCI timings on
> Virtex/Spartan-II speed grade -6? (I can only barely meet 33MHz PCI
> timings with Spartan-II speed grade -6 using floorplanner.)

They are good, and they cheat. Their design is clever and well done, and they use a "magic_box", a bit of dedicated logic that can only be used from FPGA_editor.

> I know that Xilinx uses the special IRDY and TRDY pin in LogiCORE PCI,
> but that won't seem to help FRAME#, since FRAME# has to be sampled
> unregistered to determine an end of burst transfer.

Question to make you think: What do you NEED to do at the end of a burst transfer? And when?

--
Phil Hays

Article: 37834
Re those 3 155MHz clocks.

SONET?

Carl

--
Posted from firewall.terabeam.com [216.137.15.2]
via Mailgate.ORG Server - http://www.Mailgate.ORG

Article: 37835
I wish block ram had an async read option like distributed ram. How hard would this be to do?

The problem I have is using the same port to perform both read and write operations for a cpu. The cpu always generates an address that is a registered output, registered on the clock edge. So just after the clock edge, the address is available. This works great for writes because the next clock edge can be used to write the data to the block ram. However, it doesn't work for reads, because we want the next clock edge to latch the data into a cpu register. Instead, the read data isn't available until after the next clock edge.

So

1) a wait state could be inserted for read operations (cuts performance in half).

2) we can use the address from the cpu as it is just before it's registered and use a second port of the block ram - means we have two address busses and the block ram can't be shared with another device, or twice as many block rams are required.

Rob

Article: 37836
I found out how to do it (from helpful Altera support).

Use the maxplus2 wizard to generate the dual-port ram vhdl files. Copy the component declaration wizard code into your architecture code, and instantiate it. Leonardo generates an .edf file with the component as a black box. Maxplus2 reads a wizard-generated vhdl file to fill in the black-box component. Make sure the wizard files are in the same directory as the .edf file.

Mike Treseler wrote:
>
> Russell Shaw wrote:
>
> > -- Pre Optimizing Design .work.sync_dpram_8_8.synth
> > -- Boundary optimization.
> > "E:/AAProjs/Bugs/Leonardo/main.vhd", line 34:Info, Inferred ram instance 'ix26409' of type
> > 'ram_dq_da_inclock_outclock_8_8_256'
>
> So it found the right module but . . .
>
> > -- optimize -target acex1 -effort quick -chip -area -hierarchy=auto
> > Using default wire table: STD-1
> > Warning, Dual read ports not supported for FLEX/APEX/MERCURY RAMs; using default implementation.
> > Warning, using default ram implementation for ram_dq_da_inclock_outclock_8_8_256, run time can get large.
>
> . . . it refused to use it.
>
> Since acex1k is not in the unsupported list above,
> it's either a bug or a deliberate dumbing down
> of the oem version.
>
> Note that this ram is inferred properly with
> acex1k technology on the mentor version of leo.

Article: 37837
WDM

Harris

"Carl Brannen" <carl.brannen@terabeam.com> wrote in message
news:94c9d180ad1ec9713e5672513e311ddb.51709@mygate.mailgate.org...
> Re those 3 155MHz clocks.
>
> SONET?
>
> Carl
>
> --
> Posted from firewall.terabeam.com [216.137.15.2]
> via Mailgate.ORG Server - http://www.Mailgate.ORG

Article: 37838
Carl Brannen wrote:
>
> Kevin, Are you at a relatively high utilization on the part (presumably with
> logic from something other than the PCI interface)?
>

The logic utilization of the version for which I posted part of the Timing Analyzer report is about 40% of a Spartan-II 150K system gate part (about 700 Configurable Logic Blocks (CLBs) = about 1,400 Slices). Because developing the PCI IP core is more important to me than spending time on the user side logic, I kept the user side to a really minimal design (the user side takes only single I/O cycles). I don't know exactly, but I am sure that the user side uses less than 10% of Spartan-II's logic resources (I think around 7%).

> If you are, one great strategy is to remove all your logic but a stub, and then
> get the PCI interface to route to perfection. Then you use that result as a
> guide file (or copy the PCI placement into the UCF file).
>
> If this works, then you've just got the tool to do the routing for you. I
> don't know if it will help here, but the technique works great on Altera
> designs especially.
>

Xilinx tools also have a Guide mode, but somehow when I set Guide Mode for P&R to Leverage rather than Exact, the P&R software crashes. When the Guide Mode is at Exact, the P&R software doesn't crash. I haven't checked out Xilinx's support page, so I don't know the status of this bug, but I hope the next release of ISE WebPack will fix this problem.

It is probably because I don't have a good understanding of the Guide Mode, and perhaps because Leverage Guide Mode doesn't work properly in the P&R software, but I haven't seen any major improvement in routing from using Guide Mode. I know that Xilinx uses a Guide file for their 66MHz PCI LogiCore, but not for the 33MHz one, so the Guide file must be doing something, and I will still think about using this method in the future when the P&R software's bug is fixed.

Regarding Altera's tools, does MAX+PLUS II or Quartus II 1.1 support floorplanning like Xilinx does? I am thinking of porting my PCI IP to Altera's FPGAs like ACEX 1K or FLEX10KE, and would like to know if the floorplanning support is as good as Xilinx's, because from my experience of using Xilinx devices, automatic place & route tools just don't do a good job of placing the LUTs in a good location and related ones close to each other. Quartus II 1.1 Web Edition (I only use free tools because I am poor) looks like it is as good as ISE WebPack 4.1, but I personally would rather not deal with MAX+PLUS II-BASELINE because the tool looks hard to use and old.

> The idea is to route the critical logic first, but do it while unloading the
> place and route from having to deal with the uncritical logic.
>
> Best of luck.
>
> Carl
>
> --
> Posted from firewall.terabeam.com [216.137.15.2]
> via Mailgate.ORG Server - http://www.Mailgate.ORG

In my case, I floorplanned timing-critical LUTs by hand, but because ISE WebPack 4.1 doesn't come with FPGA Editor, I don't know how the routing is being done. I used to treat everything (devices and tools) as a blackbox because the design is done in HDL, but I no longer want to treat everything as a blackbox because I no longer totally trust automatic tools.

Kevin Brace (don't respond to me directly, respond within the newsgroup)

Article: 37839
This 16x5 unsigned multiplier uses the algorithm listed above. It has a single register stage, at the end. I'm fairly sure it works, having simulated a lot of numbers through it. If it were to be pipelined, it would be most efficient to put the first register stage at the M input, but at the "MODE" stage in the X input.

It uses 33 slices, with a total of 65 FGs used. Two of the FGs are programmed to be zero so that two carry-outs can be made visible. I don't immediately see how to avoid this.

Despite being a fall through, with no internal registers (though note that the final register is necessary in order to get M x zero = zero), and without being floorplanned (I haven't even gone back through the code to see if I can improve it), it still gets 131MHz in the xcv50e -8.

Were it to be pipelined, it would be natural to bring the "X" input into the logic a clock early. This would allow the mode inputs for the partial products to be registered. This could be done without unbalancing the multiplier by registering the "M" inputs on the inputs, but registering the "X" inputs only after a clock. I haven't taken a look at how to minimize the logic in this case...

I typically over-comment my VHDL. I've removed the comments here to save bandwidth on the internet. If anyone is interested, I can add them back in.

Thanks to Frédéric Rivoallon at Xilinx for e-mailing me a link to instructions on how to instantiate LUT4s inside generate statements without a lot of grief. For those interested, the link is here:
http://tech-www.informatik.uni-hamburg.de/vhdl/doc/faq/FAQ1.html#attributes

This is the first in a series of multipliers. The next, a 16x8, will use 3 partial products, which is not a particularly natural number for this algorithm. But the one after that, 16x11, will use 4 and will be quite sweet.

Total LUT usage with this algorithm will increase by 3 per additional bit beyond 16.
That is, the number of LUTs for a Nx5 multiplier will be about
LUT( multiplier Nx5) = 17 + 3N.

The 3 adders are hooked up as follows:

LUT#1 \                   -- PP0V creates { 1M, 2M, 3M, 4M}
       + LUT#3            -- PS0V creates final result between 0M and 31M
LUT#2 /                   -- PP3V creates { 8M,16M,24M,32M}

The usual algorithm for multiplying by 5 bits on a Virtex will require 4 LUTs per bit. The adder tree will look like this (maybe a slightly different topology will be better):

LUT#1 \                   -- creates { 0M, 1M, 2M, 3M}
       + LUT#3 \          -- creates { 0M ... 15M}
LUT#2 /         \         -- creates { 0M, 4M, 8M,12M}
                 + LUT#4  -- creates final result between 0M and 31M
(M)-------------/         -- creates { 0M,16M} (AND gate absorbed into LUT#4)

For extremely wide multiplies, the savings of the new algorithm approach 25% over the old technique.

-- Multiplier code, 16x5 multiplier
-- Design by Carl Brannen.
-- Uses 3 + 2 bit coding.

-- Multiplier code, 16x5 multiplier
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;

entity MUL16x5S is
  port (
    CLK:  in  STD_LOGIC;
    M:    in  STD_LOGIC_VECTOR(15 downto 0);
    X:    in  STD_LOGIC_VECTOR( 4 downto 0);
    Y:    out STD_LOGIC_VECTOR(20 downto 0);
    TEST: in  STD_LOGIC
  );
end MUL16x5S;

architecture MUL16x5S_arch of MUL16x5S is

  component LUT4 port (
    I0: in  STD_LOGIC;
    I1: in  STD_LOGIC;
    I2: in  STD_LOGIC;
    I3: in  STD_LOGIC;
    O:  out STD_LOGIC);
  end component;
  attribute INIT: string;

  component XORCY port (
    CI: in  STD_LOGIC;
    LI: in  STD_LOGIC;
    O:  out STD_LOGIC);
  end component;

  component MUXCY port (
    DI: in  STD_LOGIC;
    CI: in  STD_LOGIC;
    S:  in  STD_LOGIC;
    O:  out STD_LOGIC);
  end component;

  component MULT_AND port (
    I0: in  STD_LOGIC;
    I1: in  STD_LOGIC;
    LO: out STD_LOGIC);
  end component;

  component FDR port (
    D: in  STD_LOGIC;
    C: in  STD_LOGIC;
    R: in  STD_LOGIC;
    Q: out STD_LOGIC);
  end component;

  signal EM0V:    STD_LOGIC_VECTOR(18 downto 0); -- Extended M[]
  signal PP0PN:   STD_LOGIC_VECTOR( 1 downto 0); -- PP0.P & PP0.N
  signal PP0V:    STD_LOGIC_VECTOR(20 downto 0); -- PP0.V
  signal PP0M:    STD_LOGIC_VECTOR( 1 downto 0); -- Mode control bits
  signal PP0_LUT: STD_LOGIC_VECTOR(17 downto 1); -- LUT
  signal PP0_MA:  STD_LOGIC_VECTOR(17 downto 1); -- MULT_AND
  signal PP0_XC:  STD_LOGIC_VECTOR(17 downto 1); -- XORCY
  signal PP0_CRY: STD_LOGIC_VECTOR(18 downto 1); -- Carry
  signal PP0_SUM: STD_LOGIC_VECTOR(17 downto 1); -- Sum output

  signal EM3V:    STD_LOGIC_VECTOR(20 downto 0); -- Extended M[]
  signal PP3PN:   STD_LOGIC_VECTOR( 1 downto 0); -- PP3.P & PP3.N
  signal PP3V:    STD_LOGIC_VECTOR(20 downto 2); -- PP3.V
  signal PP3M:    STD_LOGIC_VECTOR( 1 downto 0); -- Mode control bits
  signal PP3_LUT: STD_LOGIC_VECTOR(20 downto 3); -- LUT
  signal PP3_MA:  STD_LOGIC_VECTOR(20 downto 3); -- MULT_AND
  signal PP3_XC:  STD_LOGIC_VECTOR(20 downto 3); -- XORCY
  signal PP3_CRY: STD_LOGIC_VECTOR(21 downto 3); -- Carry
  signal PP3_SUM: STD_LOGIC_VECTOR(20 downto 3); -- Sum output

  signal PS2SEL:  STD_LOGIC_VECTOR( 3 downto 0); -- Select bit
  signal PS2PN:   STD_LOGIC_VECTOR( 1 downto 0); -- PS2.P & PS2.N
  signal PS2V:    STD_LOGIC_VECTOR(20 downto 0); -- PS2.V
  signal PS2M:    STD_LOGIC_VECTOR( 1 downto 0); -- Mode control bits
  signal PS2_LUT: STD_LOGIC_VECTOR(20 downto 2); -- LUT
  signal PS2_MA:  STD_LOGIC_VECTOR(20 downto 2); -- MULT_AND
  signal PS2_XC:  STD_LOGIC_VECTOR(20 downto 2); -- XORCY
  signal PS2_CRY: STD_LOGIC_VECTOR(21 downto 2); -- Carry
  signal PS2_SUM: STD_LOGIC_VECTOR(20 downto 2); -- Sum output

  signal YRES:    STD_LOGIC;                     -- Reset final FF
  signal YQ:      STD_LOGIC_VECTOR(20 downto 0); -- Final FF

begin

  EM0V(18 downto 0)  <= "000" & M(15 downto 0);
  PP0V(0)            <= (EM0V(0) and X(0));
  PP0V(1)            <= (EM0V(0) and X(1)) xor (EM0V(1) and X(0));
  PP0V(17 downto 2)  <= PP0_SUM(17 downto 2);
  PP0V(20 downto 18) <= "000";

  PP0PN(0) <= X(2);                              -- Negative bit
  with X(2 downto 0) select                      -- Positive bit
    PP0PN(1) <= '1' when "001" | "010" | "011",
                '0' when others;

  with X(2 downto 0) select
    PP0M(1 downto 0) <= "01" when "111" | "001", -- PP0V <= 1M
                        "00" when "110" | "010", -- PP0V <= 2M
                        "11" when "101" | "011", -- PP0V <= 3M
                        "10" when others;        -- PP0V <= 4M

  PP0_CRY(1) <= '0';
  A0: for I in 1 to 17 generate
    B: block
      attribute INIT of L0: label is "7484";
    begin
      L0: LUT4 port map(
        I0 => PP0M(1), I1 => EM0V(I-1),
        I2 => PP0M(0), I3 => EM0V(I),
        O => PP0_LUT(I));
      MA: MULT_AND port map (
        I0 => PP0M(1), I1 => EM0V(I-1), LO => PP0_MA(I));
      MC: MUXCY port map (
        DI => PP0_MA(I), CI => PP0_CRY(I),
        S => PP0_LUT(I), O => PP0_CRY(I+1));
      XC: XORCY port map (
        CI => PP0_CRY(I), LI => PP0_LUT(I), O => PP0_SUM(I));
    end block b;
  end generate;

  EM3V(20 downto 0) <= "00" & M(15 downto 0) & TEST & TEST & TEST;
  PP3V(2)           <= '0';
  PP3V(20 downto 3) <= PP3_SUM(20 downto 3);

  PP3PN(0) <= '0';                               -- Negative bit (never negative)
  with X(4 downto 2) select                      -- Positive bit
    PP3PN(1) <= '0' when "000",
                '1' when others;                 -- Usually positive

  with X(4 downto 2) select
    PP3M(1 downto 0) <= "01" when "001" | "010", -- PP3V <= 1M
                        "00" when "011" | "100", -- PP3V <= 2M
                        "11" when "101" | "110", -- PP3V <= 3M
                        "10" when others;        -- PP3V <= 4M

  PP3_CRY(3) <= '0';
  A3: for I in 3 to 20 generate
    B: block
      attribute INIT of L3: label is "7484";     -- See PP0V
    begin
      L3: LUT4 port map(
        I0 => PP3M(1), I1 => EM3V(I-1),
        I2 => PP3M(0), I3 => EM3V(I),
        O => PP3_LUT(I));
      MA: MULT_AND port map (
        I0 => PP3M(1), I1 => EM3V(I-1), LO => PP3_MA(I));
      MC: MUXCY port map (
        DI => PP3_MA(I), CI => PP3_CRY(I),
        S => PP3_LUT(I), O => PP3_CRY(I+1));
      XC: XORCY port map (
        CI => PP3_CRY(I), LI => PP3_LUT(I), O => PP3_SUM(I));
    end block b;
  end generate;

  PS2V(1 downto 0)  <= PP0V(1 downto 0);
  PS2V(20 downto 2) <= PS2_SUM(20 downto 2);
  PS2SEL <= PP3PN(1 downto 0) & PP0PN(1 downto 0);

  with PS2SEL select
    PS2PN(1 downto 0) <=                                      -- Result of sum:
      "01" when "0101" | "0100" | "0110" | "0001",            -- Negative
      "10" when "1001" | "1000" | "1010" | "0010",            -- Positive
      "00" when others;

  with PS2SEL select
    PS2M(1 downto 0) <=                                       -- Mode:
      "00" when "0100" | "1000",                              -- A
      "01" when "0001" | "0010",                              -- B
      "10" when "0110" | "1001",                              -- A-B
      "11" when others;                                       -- A+B

  with PS2M(1 downto 0) select
    PS2_CRY(2) <= ( '0' )                 when "00",          -- CIN = 0
                  ( '0' )                 when "01",          -- CIN = 0
                  (PP0V(1) nor PP0V(0))   when "10",          -- CIN = 1
                  ( '0' )                 when others;        -- CIN = 0

  S2: for I in 2 to 20 generate
    B: block
      attribute INIT of L2: label is "7C86";
    begin
      L2: LUT4 port map(
        I0 => PS2M(1), I1 => PP3V(I),
        I2 => PS2M(0), I3 => PP0V(I),
        O => PS2_LUT(I));
      MA: MULT_AND port map (
        I0 => PS2M(1), I1 => PP3V(I), LO => PS2_MA(I));
      MC: MUXCY port map (
        DI => PS2_MA(I), CI => PS2_CRY(I),
        S => PS2_LUT(I), O => PS2_CRY(I+1));
      XC: XORCY port map (
        CI => PS2_CRY(I), LI => PS2_LUT(I), O => PS2_SUM(I));
    end block b;
  end generate;

  YRES <= not PS2PN(1);
  F0: for I in 0 to 20 generate
    FR: FDR port map (
      D => PS2V(I), C => CLK, R => YRES, Q => YQ(I));
  end generate;

  Y <= YQ(20 downto 0);

end MUL16x5S_arch;

Carl

--
Posted from firewall.terabeam.com [216.137.15.2]
via Mailgate.ORG Server - http://www.Mailgate.ORG

Article: 37840
Dear FPGA community,

I have a design that must cope with asynchronous input signals. Basically, I have a WE pulse that gates a data vector into the chip. The WE signal is sampled by two FFs to ensure proper pulse detection. One FF is clocked by the positive edge of the system clock and one by the negative edge (I do not want to go into too much detail about why I must do this). The FFs that sample the pulse connect to the CE (clock enable) of the following FF to prevent a metastable state from actually propagating into the design. Since I have only simulated this so far, I cannot say whether it will really work inside the chip (which will be a XILINX FPGA).

My question is: does anyone have experience with using CE as a means of preventing a metastable state from propagating further?

Tool Setup:
-----------
Simulation & Synthesis: SYNOPSYS Ver 1999.10
Target Technology Mapping: XILINX Design Manager V3.3.08i
Target Part: XILINX VirtexE XCV300E-8-PQ240

I would also be grateful if you could point me to an electronically available article, tech note or app note about this topic.

Thanks in advance,
FRANK

Article: 37841
Austin Franklin wrote:
>
> Something sounds wrong...aren't you registering your PCI signals in the
> IOBs, and are you using the built-in PCI logic? Making 33MHz in an SII
> should be a snap.

You may think I am not being realistic, but I was never really a fan of registered inputs because, as I understand it, registering an input means that the signal incurs one clock cycle of latency, and I consider that being one cycle "stale."

That being said, initially (about a month ago) I was decoding the address on the AD[31:0] lines directly, without registering it, for six BARs (Base Address Registers) and one Expansion ROM BAR, and AD[31:0] (plus FRAME# and IRDY#) were having problems meeting the Tsu < 7ns requirement. After realizing that taking "raw" data off the PCI bus is not a good idea, and that the PCI IP core doesn't have to do a fast DEVSEL# decode, I decided to register AD[31:0] and C/BE#[3:0] during the address phase. That got AD[31:0] and C/BE#[3:0] to meet Tsu.

Given the way the PCI protocol works, I find it hard to use registered inputs in PCI because, again, if I am correct, registered inputs incur one cycle of latency.

Here are situations where registered inputs can be used easily, in my opinion.

- Address decode, assuming that the PCI IP core doesn't have to do a fast DEVSEL# decode (the DEVSEL# decode will be medium or slow). Since all the chipsets (Northbridges) I know of support medium DEVSEL# at best, this is not a problem at all.

- During a single transfer or the first transfer of a burst, where the PCI IP core typically has to initiate the user side bus. In this situation the PCI IP core is inserting wait cycles on the bus. Because of the protocol rule that a signal, once asserted, cannot change until the end of that access (a "microaccess," I think it is called, in a burst transfer), using a registered input signal should not make any difference versus a raw (non-registered) input signal. I would still rather use the "raw" one than the "stale" one, though.

Here are situations where it is difficult to use registered inputs.

- During a burst target transfer where the PCI IP core has to support transfers with no wait cycles. The PCI IP core has to constantly monitor IRDY# in case the initiator inserts wait cycles, and monitor FRAME# to know whether the present microaccess is the last one. Using registered inputs for FRAME# and IRDY# during a burst target transfer will perhaps require the target to insert one wait cycle (deasserting TRDY#) for each microaccess.

- When a PCI IP core asserts STOP#, it has to continuously monitor FRAME# to make sure that the initiator deasserts it. When FRAME# is deasserted, the PCI IP core has to deassert DEVSEL#, TRDY#, and STOP#, and stop driving those signals unless a back-to-back transfer to itself is occurring. If registered inputs are used for FRAME# and IRDY#, the core will miss the correct time to deassert DEVSEL#, TRDY#, and STOP# because of the one cycle of latency.

After thinking about the various suggestions I got, I guess I haven't really used registered inputs that extensively in my design. I think that is because of my resistance to using one cycle "stale" data: I worry that a buggy initiator (a host-to-PCI bridge or a busmaster PCI device) that does not follow the PCI protocol correctly might change the state of FRAME# or IRDY# after asserting it.

Regarding the "built-in PCI logic," I assume you mean Xilinx's special IRDY and TRDY logic. Because the PCI IP core has to be portable across different platforms, I am not interested in using that special IRDY and TRDY logic, and I don't really know how it works.

Thanks,

Kevin Brace (don't respond to me directly, respond within the newsgroup)

Article: 37842
Kevin Brace wrote:

> If I add my two cents to this question, as a Xilinx Spartan-II user (a
> low cost version of Virtex. From what I see, it is sold at 1/3 of the
> price of the equivalent density Virtex) struggling to get my PCI IP core
> to meet mere 33MHz PCI timings (Tsu < 7ns . . . is hard to meet at least
> in my case), Virtex/Spartan-II (manufactured in UMC's 0.22u process) are
> the last and fastest device that supports 5V PCI I/O.
> Virtex-E/Spartan-IIE (manufactured in UMC's 0.18u process) dropped 5V
> PCI I/O support, and it only supports 3V PCI, which is hardly used on
> regular desktop motherboards.
> I must say that if Virtex-E/Spartan-IIE supported 5V PCI I/O, I rather
> have used them instead of Virtex/Spartan-II because newer devices tends
> to be cheaper than the older devices of the same density, or for the
> same amount of money, you get more gates (and features).
> Although not the question you asked, Altera basically did the same thing
> when they moved from APEX 20K (supported 5V PCI according to their
> datasheet, manufactured in TSMC's 0.22u process) to APEX 20KE (dropped
> 5V PCI support, manufactured in TSMC's 0.18u process).
>
> Regards,
>
>

I became a bit worried about using the Virtex-E's 3v3 PCI with 5V PCI cards (the voltage conversion is done via QuickSwitch parts) so I did some investigation.

With a fully loaded board - 4 populated 5V PCI slots and 2 3v3 onboard devices - the VirtexE-thru'-QS outputs easily met the 5V PCI input specs in terms of Vih and the time to get there. The longest delay I was seeing was about 13.5 nsec from the FPGA's clock to the device input. It was very nearly independent of where along a bussed line I looked at a signal (+/- 1 nsec or so).

Also: Even though PCI is an unterminated bus there was only the slightest hint of a reflection step on a couple of signals. Even for those the rise/fall were still monotonic.

The upshot of an afternoon's investigation is that, for our system at least, I'm happy (*) driving 5V PCI devices from the V-E parts.

(*) Definition: Happy = marginally less paranoid than usual.

Article: 37843
Frank Papenfuss wrote:

> Dear FPGA community,
>
> I have a design that must cope with asynchronous input
> signals. Basically I have a WE pulse that gates a data
> vector into the chip. The WE signal is sampled by two
> FFs to ensure proper pulse detection. One FF is clocked
> by the positive edge of the system clock and
> one by the negative edge (I do not want to go
> into too much details about why I must do this). The FFs
> that sample the pulse connect to the CE (clock enable)
> of the following FF to prevent the metastable state from
> propagating actually into the design. Since I have only
> simulated this so far I cannot say if it will really work
> inside the chip (which will be a XILINX FPGA).
>
> My question is: Has anyone experience with using CE as
> a mean to prevent a metastable state from propagating
> further.
>

Frank,

It is an unfortunate fact that if a signal from a source asynchronous to a clock is sampled on that clock, there is always a chance that a metastable state could propagate arbitrarily far into your system. Metastability is a statistical thing, so all you can do is reduce the probability of its affecting your system to some very small number (or make the MTBF >> the time between you changing jobs).

IIRC there is even a paper somewhere that proves metastability cannot be eliminated by purely digital means. BTW: If anyone has that original reference I'd be grateful - I read it in ~1984 and have long since lost it.

Article: 37844
Re the 50% duty cycle divide by 3 counter...

> Because the logic design is correct. The simulator usually
> has a problem with a combinatorial latch because the simulator
> is not intelligent enough to cope with the ambiguity of the
> latch state. In reality the ambiguity resolves itself, and
> is therefore meaningless.

But we all hate getting simulator error messages, so why not use a real latch? That will make the tools happy, and they'll go to the trouble of figuring out our half clock timing for us as well. Nobody wants to dig around a 1M gate FPGA trying to figure that kind of stuff out. I bet we'd get pretty good at it, but why not do it with a real latch?

> And by the way, use the DLL, it does it for you.

DLLs are a limited resource. In addition, they don't work on quite a large variety of clocks:

(1) Clocks that are too fast
(2) Clocks that are too slow
(3) Clocks that don't have a constant period.

The Virtex series has resources that allow the old Xilinx app note to be improved upon a bit:

(1) You have SRL16s that give you multiple FF bits per LUT.
(2) You have real latches so you don't have to use combinatorial ones.
(3) Here's some VHDL code for a 50% divide by 3 that fits in a slice, leaves one of the slice's flip-flops unused, and runs at 186.081MHz even in a Spartan2 -5:

-- Divide by 3 in a slice. 50% duty cycle. No simulator warnings.
-- Carl Brannen
--
library IEEE;
use IEEE.std_logic_1164.all;

entity DIV3V is
  port (
    CLK: in STD_LOGIC;
    Q: out STD_LOGIC);
end DIV3V;

architecture DIV3V_arch of DIV3V is

  component SRL16
    port (
      D: in STD_LOGIC;
      CLK: in STD_LOGIC;
      A0: in STD_LOGIC;
      A1: in STD_LOGIC;
      A2: in STD_LOGIC;
      A3: in STD_LOGIC;
      Q: out STD_LOGIC);
  end component;

  component LD_1
    port (
      D: in STD_LOGIC;
      G: in STD_LOGIC;
      Q: out STD_LOGIC);
  end component;

  component LUT2
    port (
      I0: in STD_LOGIC;
      I1: in STD_LOGIC;
      O: out STD_LOGIC);
  end component;

  signal MYGND: STD_LOGIC;
  signal MYVCC: STD_LOGIC;
  signal RD3B: STD_LOGIC;
  signal FD3B: STD_LOGIC;

  attribute INIT: string;
  attribute INIT of U1: label is "0001";  -- Start out with a "0"
  attribute INIT of U2: label is "R";     -- Not needed.
  attribute INIT of U3: label is "E";     -- What the heck is this?

  -- Fit into a slice
  attribute RLOC: string;
  attribute RLOC of U1: label is "R0C0.S1";
  attribute RLOC of U2: label is "R0C0.S1";
  attribute RLOC of U3: label is "R0C0.S1";

begin

  MYGND <= '0';
  MYVCC <= '1';

  U1: SRL16 port map(
    D => RD3B, CLK => CLK,
    A0 => MYGND, A1 => MYVCC, A2 => MYGND, A3 => MYGND,
    Q => RD3B);

  -- Note that a Rising edge clock on a SRL16x is compatible
  -- with a rising edge clock to a flip-flop on the same slice,
  -- but in addition, it is compatible with an active low latch:
  U2: LD_1 port map(
    D => RD3B, G => CLK, Q => FD3B);

  -- OR gate
  U3: LUT2 port map(
    I0 => RD3B, I1 => FD3B, O => Q);

end DIV3V_arch;

--
Posted from firewall.terabeam.com [216.137.15.2]
via Mailgate.ORG Server - http://www.Mailgate.ORG

Article: 37845
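A quick way to convince yourself that the DIV3V circuit in the divide-by-3 post above really gives a 50% duty cycle is to step it through half clock periods in software. The following is only a rough behavioral sketch in Python, not the author's code: it assumes zero-delay ideal parts, assumes the latch powers up at 0, and models only the three SRL16 stages that the address pins (A1 high, the others low) put in the feedback loop. The function name simulate_div3 is made up for illustration.

```python
def simulate_div3(cycles):
    """Return Q sampled twice per clock: once while CLK is high, once while low."""
    # Stages 0..2 of the SRL16. INIT "0001" puts a single '1' in stage 0.
    srl = [1, 0, 0]
    fd3b = 0  # output of the LD_1 latch; power-up value is an assumption
    q = []
    for _ in range(cycles):
        # Rising edge: the SRL16 shifts, with D fed from its own Q (stage 2).
        srl = [srl[2], srl[0], srl[1]]
        rd3b = srl[2]
        # CLK high: the active-low latch is closed and holds its old value.
        q.append(rd3b | fd3b)
        # CLK low: the latch is transparent and copies RD3B.
        fd3b = rd3b
        q.append(rd3b | fd3b)
    return q


samples = simulate_div3(6)
print(samples)
```

In this model Q repeats every six half periods (three input clocks) and is high for three of them, which is the 50% duty cycle claimed. The sketch says nothing about timing margins or the 186MHz figure; those need the real tools.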
Mr. Andraka revealed himself:

>and a
>bit more time to grok the design,

Welcome to our Planet!

gives me an idea for a book to read over the holidays!

enjoy them - Mike Thomas

Article: 37846
Re very long counter design...

> In my design I need to make a synchronous counter that counts, let's
> say, till 1000000. (Actual aim for counter is to build in a delay). I
> do this by the use of integer type signals and with each clock'event I
> add 1 till I reach the wanted 1000000. When I try to implement this
> in an FPGA it consumes a very high amount of CLBs and it seems very
> disastrous for the maximum reachable clock freq.

Assuming that you don't care about the intervening counts, you can use SRL16s and SRL16Es to create relatively efficient large counters. And you don't have to deal with decoding LFSR values either.

An SRL16 with its Q output brought back to its D input can be initialized (with an INIT attribute) to have only a single bit high. The others (of up to 17 bits) are initialized to zero. As it clocks around, it produces a pulse every 17th clock. This puts a counter with a length of a little over 4 bits (i.e. log2(17)) into a single LUT. That's 4x as efficient as regular counters, and you get a free registered "done" bit.

You can gang these up, either by using the enables, or by ANDing the outputs of counters whose periods have no common divisor.

Example with 5 SRL16/SRL16Es, which gets within 5% of 10^6 clocks and uses only 7 LUTs:

First SRL16 goes high every 17th clock. Its output connects to the enable input of an SRL16E that is also set for 17 clocks. The result: two bits that, when ANDed, produce a pulse every 17^2 = 289 clocks.

Third SRL16 goes high every 15th clock. Its output connects to the enable input of an SRL16E that is also set for 15 clocks. The result: two bits that, when ANDed, produce a pulse every 15^2 = 225 clocks.

Fifth SRL16 goes high every 16th clock.

Since 17^2, 15^2, and 16 have no common divisors, the outputs of the five SRL16 / SRL16Es can be ANDed together to produce a counter that pulses once every 17^2 * 15^2 * 16 = 1040400 clocks.
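The 1040400 figure can be checked with a small software model of the five rings. This is a hedged Python sketch rather than anything from the original design: each SRL16/SRL16E is reduced to "position of the single '1' bit in a ring of n bits", the ring length is a plain parameter (how a 17-bit ring is physically built is not modeled), and the function name done_period is invented for this example.

```python
from math import gcd


def done_period(cascaded, free_running):
    """Count clocks between two 'done' pulses of the ANDed ring counters.

    cascaded:     ring lengths n; each models an n-bit SRL16 ring whose
                  pulse drives the enable of a second n-bit SRL16E ring.
    free_running: ring lengths of stand-alone SRL16 rings.
    """
    # A ring's output is high when its '1' bit sits in the last position.
    # Start with every output high, i.e. right on a 'done' pulse.
    first = [n - 1 for n in cascaded]
    second = [n - 1 for n in cascaded]
    free = [n - 1 for n in free_running]

    clocks = 0
    while True:
        for i, n in enumerate(cascaded):
            if first[i] == n - 1:            # first ring high: enable second
                second[i] = (second[i] + 1) % n
            first[i] = (first[i] + 1) % n
        for i, n in enumerate(free_running):
            free[i] = (free[i] + 1) % n
        clocks += 1

        done = (all(first[i] == n - 1 and second[i] == n - 1
                    for i, n in enumerate(cascaded))
                and all(free[i] == n - 1
                        for i, n in enumerate(free_running)))
        if done:
            return clocks


# The three sub-periods must be pairwise coprime for the periods to multiply.
stages = [17**2, 15**2, 16]
coprime = all(gcd(a, b) == 1
              for k, a in enumerate(stages) for b in stages[k + 1:])

period = done_period([17, 15], [16])
print(coprime, period, period / 1_000_000)
```

The model steps through a little over a million clocks, so it takes a moment to run. It reports a period of 1040400, about 4% above the requested 1000000.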
This is in excess of the 1000000 that was asked for, and it only took 7 LUTs (<2 CLBs). In addition, there are no lines that have a loading of more than 3. The 5-input AND can be implemented with a registered 4-input AND (of the first four SRL16s) and a registered 2-input AND. That means that there are no paths that go through unregistered logic, and the design will clock at a very high rate.

One downside is that the SRLs require so much GND and VCC routing, but you can create all that yourself and prevent the placer from going hog wild with it. Another downside is what happens to the SRL16s if you have glitches on your clock. Unlike most counters, this circuit will not "fix" itself. But let's try not to think too much about that.

You can also play sneaky games with the first layer SRL16s. When that first registered 4-input AND gate goes high, all the SRL16s will have just been in their high state. That means that if you replace those two SRL16s with two SRL16Es, you can hook the registered AND gate output back up to the (inverted logic) enables of those first two SRL16Es. The effect of that modification is to change that registered AND gate from counting to 16^2 * 17^2 to one that counts to 16^2 * 17^2 + 1. Since this is relatively prime to the previous 16^2 * 17^2, you can build two such circuits and AND their outputs together to get a period of 73984 * 73985 with just 11 LUTs. That is a 32.35 bit binary count, with DONE pulse and very high speed performance, for only 11 LUTs, or 2.94 bits per LUT.

I should mention that I've never implemented that last sneaky game, so if it doesn't work I'd not be completely surprised. Sure seems like it would though, and my instincts for this sort of stuff are usually pretty good.

Carl

--
Posted from firewall.terabeam.com [216.137.15.2]
via Mailgate.ORG Server - http://www.Mailgate.ORG

Article: 37847
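The arithmetic behind the "sneaky game" at the end of the long-counter post above is easy to check on its own. This Python snippet only verifies the numbers quoted there (the two periods, their coprimality, the bit count and the bits-per-LUT figure); it does not model the enable feedback itself, which the author says he never built. The variable names are illustrative.

```python
from math import gcd, log2

base = 16**2 * 17**2      # period quoted in the post
tweaked = base + 1        # period claimed after feeding the AND back to the enables
luts = 11                 # LUT count quoted in the post

# Consecutive integers never share a factor, so the periods multiply.
common = gcd(base, tweaked)
combined = base * tweaked
bits = log2(combined)

print(base, tweaked, common)
print(combined, round(bits, 2), round(bits / luts, 2))
```

The results agree with the post: 73984 and 73985 are coprime, and their product is worth about 32.35 bits, or roughly 2.94 bits per LUT at 11 LUTs.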
Virtex meta posts in this group:
http://groups.google.com/groups?as_q=virtex&as_oq=metastable%20metastability&as_ugroup=comp.arch.fpga&num=50&as_scoring=d&hl=en

Meta posts in this group by Peter Alfke:
http://groups.google.com/groups?as_q=metastability&as_ugroup=comp.arch.fpga&as_uauthors=Peter%20Alfke&num=50&as_scoring=d&hl=en

watch the wrap on the links

Article: 37848
On Fri, 21 Dec 2001 14:31:06 +1300, Jim Granville <jim.granville@designtools.co.nz> wrote:

(snip)
>
>It's probably 'good practice' to copy the 80c51 scheme, which uses a
>D-FF on the
>feedback, so the OE width is clock width, not propagation/threshold
>defined.
>
>In a FPGA, D-ff are almost free :-)
>
>- Jim G

It is good practice if the high level output is guaranteed to meet the high level input requirement on the connected device. In the case of using 3.3V Xilinx to drive 5V CMOS, this practice could cause a one clock delay before the input switches, because the driver will prevent the pullup resistor from pulling the input above the high output drive voltage level.

===================================
Greg Neff
VP Engineering
*Microsym* Computers Inc.
greg@guesswhichwordgoeshere.com

Article: 37849
Thanks very much for your Christmas present.

Antonio