Site Home Archive Home FAQ Home How to search the Archive How to Navigate the Archive
Compare FPGA features and resources
Threads starting:
Authors:A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Hello Group,

What is the best way to count 64 incoming simultaneous bit signals to determine the number of 1s (in VHDL)? I have clock cycles to spare but the result must be pipelined so that each clock cycle produces a new count.

Brad Smallridge
b r a d @ a i v i s i o n . c o m

Article: 87226
Add them. Add registers to your path and make your tool retime them. This has been covered in the newsgroup in the past. How many levels of logic you can deal with depends on your device and your clock. Just adding the individual bits together will produce the desired result, and you can pipeline to your heart's content, allowing a new result every clock (after the initial latency) in the time it takes to run through one carry-chain adder.

"Brad Smallridge" <bradsmallridge@dslextreme.com> wrote in message news:11dr115qoteg87b@corp.supernews.com...
> Hello Group,
>
> What is the best way to count 64 incoming simultaneous
> bit signals to determine the number of 1s (in VHDL)?
> I have clock cycles to spare but the result must be pipelined
> so that each clock cycle produces a new count.
>
> Brad Smallridge
> b r a d @ a i v i s i o n . c o m

Article: 87227
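The "add them and register each level" advice in the reply above can be tried out in software before any VHDL is written. What follows is a minimal behavioural sketch in Python, not synthesizable code and not the poster's own design. It assumes a power-of-two input width and adds neighbouring partial sums pairwise, with one register rank per adder level. The names `PipelinedPopcount` and `clock` are made up for this illustration.

```python
class PipelinedPopcount:
    """Behavioural model of a pairwise adder tree with a register after every level."""

    def __init__(self, width=64):
        # Assumption of this sketch: width is a power of two (64 -> 6 levels).
        assert width > 1 and width & (width - 1) == 0
        self.width = width
        self.depth = width.bit_length() - 1
        # regs[k] holds the partial sums registered after adder level k+1.
        self.regs = [[0] * (width >> (k + 1)) for k in range(self.depth)]

    def clock(self, bits):
        """Model one rising clock edge and return the final register's value."""
        assert len(bits) == self.width
        # Each stage sees what the previous stage held BEFORE this edge.
        sources = [list(bits)] + self.regs[:-1]
        self.regs = [
            [src[i] + src[i + 1] for i in range(0, len(src), 2)]
            for src in sources
        ]
        return self.regs[-1][0]


if __name__ == "__main__":
    pipe = PipelinedPopcount(64)
    # Deterministic test vectors, one new 64-bit input per clock.
    vectors = [[int((i * 37 + j * 11) % 5 == 0) for j in range(64)] for i in range(12)]
    for i, vec in enumerate(vectors):
        print(i, "in:", sum(vec), "out:", pipe.clock(vec))
```

For 64 inputs the model has six register ranks, so the count for the vector presented on one call to `clock` is returned five calls later, and a new count comes out on every call after that. In hardware each rank would be one clocked process or one group of registered assignments.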
Yeah, I understand this. But I can't wrap my head around how to code it.

Do you do it like this:

    if (clk'event and clk = '1') then
      -- parentheses are needed: & and + have the same precedence in VHDL
      partial_sum1_2bit <= ('0' & bit0) + ('0' & bit1);
      -- second input pair (bit2 and bit3 assumed here)
      partial_sum2_2bit <= ('0' & bit2) + ('0' & bit3);
      partial_sum1_3bit <= ('0' & partial_sum1_2bit) + ('0' & partial_sum2_2bit);
      -- and so on
    end if;

And then there is the question of how this all synthesizes, probably, for me, at 27 MHz, optimized for area not speed. I could use a little insight from someone who's done this before.

Brad Smallridge
b r a d @ a i v i s i o n . c o m

Article: 87228
I saw the appnote that pertains to older devices, and the Answers Database had a reference to it being in the V4 PCB Designer's Guide, but I am unable to find it. Any help appreciated.

Thanks,
JD

Article: 87229
I also don't understand what you mean by "having your tool retime them". I don't have Precision or any advanced tools here.

Article: 87230
Brad, what you need is called a Wallace-Tree Adder.

Here is a fairly efficient implementation: Divide your input into groups of 12 bits, and use each as address bits to a BlockRAM, loaded to output a 4-bit number that describes the number of 1s in the address. You can treat each port independently, so one BlockRAM handles 24 inputs and generates two 4-bit outputs. Three BlockRAMs handle 72 incoming bits, and produce six sets of 4-bit values. You can combine them with five 6-bit adders on 3 levels, giving you a total of 4 pipeline delays.

This is just one of many ways to solve your design problem... I like to use BlockRAMs for unconventional purposes.

Peter Alfke

Brad Smallridge wrote:
> Yeah, I understand this. But I can't wrap my head around how to code it.
>
> Do you do it like this:
> if (clk'event and clk = '1') then
>   partial_sum1_2bit <= ('0' & bit0) + ('0' & bit1);
>   partial_sum2_2bit <= ('0' & bit2) + ('0' & bit3);
>   partial_sum1_3bit <= ('0' & partial_sum1_2bit) + ('0' & partial_sum2_2bit);
>   -- and so on
> end if;
>
> And then there is the question of how this all synthesizes, probably, for me,
> at 27 MHz, optimized for area not speed. I could use a little insight from
> someone who's done this before.
>
> Brad Smallridge
> b r a d @ a i v i s i o n . c o m

Article: 87231
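Peter Alfke's BlockRAM scheme above can be checked with a small software model. The sketch below is only a behavioural illustration in Python under stated assumptions: the list `ROM` stands in for the contents one would load into each 4K x 4 BlockRAM port, and `popcount_via_rom` is a made-up name. It follows the description in the post: six 12-bit groups, six 4-bit lookups, then five adders arranged on three levels.

```python
ADDR_BITS = 12  # each BlockRAM port is addressed by 12 input bits

# ROM contents: entry a holds the number of 1s in address a (0..12, fits in 4 bits).
ROM = [bin(a).count("1") for a in range(1 << ADDR_BITS)]


def popcount_via_rom(word):
    """Count the 1s in a word of up to 72 bits using six lookups and five adders."""
    mask = (1 << ADDR_BITS) - 1
    # Pipeline stage 1: six independent lookups (two ports on each of three RAMs).
    partial = [ROM[(word >> (g * ADDR_BITS)) & mask] for g in range(6)]
    # Stage 2: three adders.
    level1 = [partial[0] + partial[1], partial[2] + partial[3], partial[4] + partial[5]]
    # Stage 3: one adder; the third value is only delayed by a register.
    level2 = [level1[0] + level1[1], level1[2]]
    # Stage 4: final adder.
    return level2[0] + level2[1]


if __name__ == "__main__":
    sample = 0xF0F0123480000001
    print(popcount_via_rom(sample), bin(sample).count("1"))
```

With a 64-bit input the top eight address bits of the sixth lookup are simply zero. The model ignores timing; in the FPGA each commented stage would correspond to one of the four pipeline delays mentioned in the post.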
64 is 0+63+1
63 is 31+31+1
31 is 15+15+1
15 is 7+7+1
7 is 3+3+1

Simple recursion: a few adder rows should be pretty quick and use far fewer resources than BlockRAM. It takes about 6 levels of small adders.

Article: 87232
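The recursion listed above (each odd width splits into two equal halves plus one extra bit, which can ride on an adder's carry input) can be written out directly. This is a minimal Python sketch of that decomposition, assuming the "+1" in each line means a single bit fed to the carry-in; the function names `count_ones` and `adder_levels` are illustrative only.

```python
def count_ones(bits):
    """Count 1s using the split n = k + k + 1 for odd n, and n = (n-1) + 1 for even n."""
    n = len(bits)
    if n == 0:
        return 0
    if n == 1:
        return bits[0]
    if n % 2 == 0:
        # Even width, e.g. 64 = 63 + 1: peel off one bit and add it in.
        return count_ones(bits[:-1]) + bits[-1]
    half = n // 2
    # Odd width 2k+1: add two k-bit counts, with the remaining bit as carry-in.
    return count_ones(bits[:half]) + count_ones(bits[half:2 * half]) + bits[-1]


def adder_levels(n):
    """Number of adder rows the decomposition needs for n input bits."""
    if n <= 1:
        return 0
    if n % 2 == 0:
        return adder_levels(n - 1) + 1
    return adder_levels(n // 2) + 1


if __name__ == "__main__":
    data = [int(i % 3 == 0) for i in range(64)]
    print(count_ones(data), adder_levels(64))
```

For 64 inputs `adder_levels` gives 6, which agrees with the "about 6 levels" estimate in the post; the innermost level (3 = 1+1+1) is a plain full adder.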
In case you are interested in price and performance:
3 BlockRAMs plus 6 CLBs, four levels of pipelining, running at 200 MHz+
Not too bad :-)

Peter Alfke

Article: 87233
I'm working with FPGA's that use STAPL bitstreams. In addition to the standard configuration bitstream, I'd also like to input custom (proprietary) JTAG commands and bitstreams.

All the STAPL documentation I've looked at implies that a piece of software they call STAPL composer performs this function (i.e. generates the STAPL scripting file). However, the discussion never proceeds beyond this point. No manufacturer, no external website, seems to give details on or even roughly describe the STAPL Composer, much less offer one for download.

It seems as though the FPGA development tools tend to embed functionality which acts in the capacity of a STAPL composer in the bitstream-generation utility, but this functionality is strictly limited to the configuration bitstreams, not to mention the specific FPGA company in question.

So, is there a STAPL Composer out there which will allow me to generate custom STAPL files, inserting user-specified bitstreams and JTAG commands? Ideally this program would also be able to create STAPL files for FPGAs from multiple different companies. I'd like to find some solution for configuration, programming, and test that does more than the dedicated task of loading a configuration bitstream for a specific manufacturer's FPGA's.

--
Alex Rast
ad.rast.7@nwnotlink.NOSPAM.com
(remove d., .7, not, and .NOSPAM to reply)

Article: 87234
Brad Smallridge wrote:

>Hello Group,
>
>What is the best way to count 64 incoming simultaneous
>bit signals to determine the number of 1s (in VHDL)?
>I have clock cycles to spare but the result must be pipelined
>so that each clock cycle produces a new count.
>
>Brad Smallridge
>b r a d @ a i v i s i o n . c o m

Brad,

Basically, you want to gather bits together in small adders. A Wallace tree does that using full adders to compress 3 single-bit inputs, all with the same weight, into two signals, a sum and a carry. The sum has the same weight as the inputs; the carry has twice the weight of the inputs. Then you use another layer to sum all like-weighted bits, and repeat until you are left with two signals of each weight. You combine those with a conventional adder.

What Peter described is going to be more clock-cycle efficient because you use the BRAM in place of a Wallace tree. His description isn't really a Wallace tree because it doesn't have the same structure (no tree of carry-save adders, and the final outputs are complete sums of the bits for those BRAMs, not a carry vector and a sum vector like a Wallace tree). You could use Wallace trees to combine the results from the BRAMs, but it isn't efficient in an FPGA.

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759

Article: 87235
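Ray Andraka's description of the Wallace tree above (full adders compressing three equal-weight bits into a sum and a carry, repeated until at most two bits remain per weight) can be followed step by step in a short model. The Python below is a behavioural sketch under that description only; the names `full_adder` and `wallace_popcount` and the dictionary-of-columns representation are choices made for this illustration, not anything from the post.

```python
def full_adder(a, b, c):
    """3:2 compressor: three bits of one weight in, a sum bit and a carry bit out."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def wallace_popcount(bits):
    """Reduce 1-bit inputs with layers of full adders, then do one ordinary addition.

    Returns (count, number_of_compressor_layers).
    """
    columns = {0: list(bits)}  # weight exponent -> bits currently at that weight
    layers = 0
    while any(len(col) > 2 for col in columns.values()):
        nxt = {}
        for weight, col in columns.items():
            i = 0
            while len(col) - i >= 3:
                s, c = full_adder(col[i], col[i + 1], col[i + 2])
                nxt.setdefault(weight, []).append(s)      # sum keeps the weight
                nxt.setdefault(weight + 1, []).append(c)  # carry has twice the weight
                i += 3
            nxt.setdefault(weight, []).extend(col[i:])    # leftovers pass through
        columns = nxt
        layers += 1
    # At most two bits per weight remain: two vectors for a conventional adder.
    row_a = sum(col[0] << w for w, col in columns.items() if len(col) >= 1)
    row_b = sum(col[1] << w for w, col in columns.items() if len(col) == 2)
    return row_a + row_b, layers


if __name__ == "__main__":
    data = [int(i % 5 != 0) for i in range(64)]
    print(wallace_popcount(data), sum(data))
```

The two rows handed to the final addition correspond to the "sum vector" and "carry vector" mentioned in the post. The model says nothing about FPGA efficiency, which the post argues is poor for this structure.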
Tim,

The only way I trust to compare the logic capacity of two different architectures is to benchmark them against each other. Modern FPGA architectures, like modern processor architectures, are too complex to say which is more area-efficient, and by how much, based on a hand analysis. Think of trying to guess if a P4, P3 or Athlon is faster based purely on the specifications of their pipelines, issue units and clock rates -- it is impossible, so you have to benchmark them. FPGAs have hit that level too.

The best thing to do is to do your own comparison using the circuits of interest to you in the devices you're considering. But that's a lot of work, especially if you want to test it for multiple circuits (as you really should to get statistically valid answers). Next best is to get someone else's benchmark results. And that's one of the things that will be presented in the NetSeminar tomorrow.

In terms of what should be counted in Stratix II vs. Virtex4 -- this is very difficult to do by hand. Stratix II is fundamentally based on a larger LUT (5-LUT with extra circuitry, or a 6-LUT, depending on how you look at it) than Virtex4 (4-LUT plus extra circuitry) so counting LUTs doesn't work. Academic and industrial research long ago showed that bigger LUTs implement more logic, so you can't simply count the number of LUTs in an architecture and ignore their size. But how much more logic can a bigger LUT implement, for a typical circuit? Nobody can tell you accurately, except by running a bunch of benchmark circuits and showing the results.

Regards,
Vaughn
Altera [v b e t z (at) altera.com]

Article: 87236
Vaughn Betz wrote:
[...]
> In terms of what should be counted in Stratix II vs. Virtex4 -- this is very
> difficult to do by hand. Stratix II is fundamentally based on a larger LUT
> (5-LUT with extra circuitry, or a 6-LUT, depending on how you look at it)
> than Virtex4 (4-LUT plus extra circuitry) so counting LUTs doesn't work.
                      ^^^^ ^^^^^ ^^^^^^^^^     ^^^^^^^^ ^^^^ ^^^^^^^

Howdy Vaughn,

No one (except maybe Xilinx) will fault Altera for trying to show how the 2S180 can pack more logic into the device than the LX200. But engineers ARE likely to fault Altera if they do such a comparison with misleading figures. Your response _*completely ignored*_ the hard facts that Tim presented.

Here are Tim's numbers again, since you clipped them:

V4      Slices   Actual LUTs   Logic cells (Xilinx claim)
-----------------------------------------------------------
LX200   89088    178176        200448

S2      ALMs     ALUTs    Equiv_four_input_LUTs (Altera claim)
--------------------------------------------------------------------
2S180   71760    143520   186576

Since the last column is the only one where the funny math comes in, that is the only place Altera has any hope of showing how the 2S180 can pack more logic. To do that, Altera needs to provide a convincing argument that Xilinx's 200k number for their logic isn't just a little overly optimistic, but is so to the tune of at least 7%. And while doing so, it'd probably be good to show why Altera's funny numbers for the S2 are NOT overly optimistic.

Until doing that, might I suggest that Figure 1 be fixed on http://www.altera.com/products/devices/stratix2/features/density/st2-vir-density-compare.html which seems to show that it is valid to compare the 178k number against the 186k number. You admit in your response above (where I underlined) that using the 178k number "doesn't work". So why does Altera use it in their comparisons?
http://www.altera.com/literature/wp/wpstxiixlnx.pdf also uses the 178k number (even going so far as to claim that it is Xilinx's "equivalent" number, when it is most obviously the *actual* number). This paper is also where the 30% better number is presented without any backup data. Readers might trust this number a bit more if design details (especially the number of designs in each size) were published with the white paper.

I have absolutely nothing against Altera, the S2, or the new ALM. But I detest being misled, especially after it has been pointed out a time or two (at which point it becomes obvious that the misleading is being done on purpose rather than it having happened by accident).

Marc

Article: 87237
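The percentages argued over in Marc's post above follow directly from the six figures in Tim's table. The short Python sketch below only redoes that arithmetic; the dictionary keys and variable names are labels chosen for this example, and nothing here says which vendor's "equivalent" figure is the fairer one.

```python
# Figures quoted in the thread (Tim's table, as reposted by Marc).
lx200 = {"slices": 89088, "actual_luts": 178176, "logic_cells": 200448}
s2_180 = {"alms": 71760, "aluts": 143520, "equiv_4luts": 186576}

# How much each vendor inflates its own physical LUT count.
xilinx_inflation = lx200["logic_cells"] / lx200["actual_luts"]
altera_inflation = s2_180["equiv_4luts"] / s2_180["aluts"]

# Claim against claim: how far Xilinx's marketing number exceeds Altera's.
claim_vs_claim = lx200["logic_cells"] / s2_180["equiv_4luts"]

# The mixed comparison Marc objects to: Altera's claim against Xilinx's actual LUTs.
mixed = s2_180["equiv_4luts"] / lx200["actual_luts"]

if __name__ == "__main__":
    for label, ratio in [
        ("Xilinx logic cells over actual LUTs", xilinx_inflation),
        ("Altera equivalent LUTs over ALUTs", altera_inflation),
        ("LX200 claim over 2S180 claim", claim_vs_claim),
        ("2S180 claim over LX200 actual LUTs", mixed),
    ]:
        print(f"{label}: {(ratio - 1) * 100:+.1f}%")
```

The four ratios come out at roughly +12.5%, +30.0%, +7.4% and +4.7%. The third is the "at least 7%" gap in the post, and the fourth is what a chart comparing 186k against 178k would show.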
Hi,

I have some doubts concerning the following problem: In my design I have an 8-bit bidirectional bus "Data_ulpi". When the external module drives data into my FPGA, I have to read that data and respond immediately; that is, I have NO time to synchronize the data with 2-stage FFs or a FIFO... NO time, because the external module is expecting a response on the next clock cycle.

I have read several posts in this newsgroup explaining that the state machine would have to be very tricky to handle unregistered inputs.

So how do I have to place the bidirectional bus and control signals to have at least a chance of doing a good job? What constraints do I have to take into consideration in that special case for tSU/tH?

Thank you for your advice.

Rgds
André

Article: 87238
Hi Jens,

like the others already replied: it is indeed a bug in the ISE tools. In ISE7.1 SP3 you will get this error when the following is true for your design:
* register duplication is on (this also means you use timing driven map)
* your design contains one or more differential buffers (IBUFDS, OBUFDS,..)

My design is in VHDL, target is Virtex4; I don't know if this bug also applies to Verilog and/or other targets. I opened a Webcase => the response was surprise, because they thought it should be fixed in SP3. It isn't, so they say it should be fixed in ISE8.1 *sigh*

best regards,
Bart De Zwaef

Article: 87239
Maybe have a look at SVF too. STAPL has a higher abstraction level than SVF, but SVF has the advantage of being very close to the JTAG chain description layer. It will be easier to use SVF for your own implementation. Both Xilinx and Altera generate STAPL and SVF versions of their bitstreams! Xilinx also provides XSVF, a binary format of SVF, which is a good thing if you have to download to FPGA/FLASH from an embedded processor.

Laurent
www.amontec.com

Alex Rast wrote:
> I'm working with FPGA's that use STAPL bitstreams. In addition to the
> standard configuration bitstream, I'd also like to input custom
> (proprietary) JTAG commands and bitstreams.
>
> All the STAPL documentation I've looked at implies that a piece of software
> they call STAPL composer performs this function (i.e. generates the STAPL
> scripting file) However, the discussion never proceeds beyond this point.
> No manufacturer, no external website, seems to give details on or even
> roughly describe the STAPL Composer, much less offer one for download.
>
> It seems as though the FPGA development tools tend to embed functionality
> which acts in the capacity of a STAPL composer in the bitstream-generation
> utility, but this functionality is strictly limited to the configuration
> bitstreams, not to mention the specific FPGA company in question.
>
> So, is there a STAPL Composer out there which will allow me to generate
> custom STAPL files, inserting user-specified bitstreams and JTAG commands?
> Ideally this program would also be able to create STAPL files for FPGAs
> from multiple different companies. I'd like to find some solution for
> configuration, programming, and test that does more than the dedicated task
> of loading a configuration bitstream for a specific manufacturer's FPGA's.
>
>

Article: 87240
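Laurent's point above, that SVF sits close to the raw JTAG shifts, means custom commands can be emitted as plain text with very little tooling. The Python sketch below generates a few SVF statements (`STATE`, `SIR`, `SDR`, `RUNTEST`) for a user-defined instruction. It is an illustration under assumptions: the helper name `svf_shift`, the 8-bit instruction register length, the opcode `0x3C` and the data values are all hypothetical and do not belong to any real device.

```python
def svf_shift(kind, nbits, tdi, tdo=None, mask=None):
    """Format one SVF SIR or SDR statement; values are integers, written as hex."""
    assert kind in ("SIR", "SDR")
    digits = (nbits + 3) // 4  # hex digits needed for nbits
    parts = [f"{kind} {nbits} TDI ({tdi:0{digits}X})"]
    if tdo is not None:
        parts.append(f"TDO ({tdo:0{digits}X})")
        if mask is not None:
            parts.append(f"MASK ({mask:0{digits}X})")
    return " ".join(parts) + ";"


def custom_command_script():
    """Build a tiny SVF script that loads a made-up instruction and shifts 32 data bits."""
    return "\n".join([
        "STATE RESET;",
        "STATE IDLE;",
        svf_shift("SIR", 8, 0x3C),                     # hypothetical user opcode
        svf_shift("SDR", 32, 0xDEADBEEF, tdo=0, mask=0xFF),
        "RUNTEST 100 TCK;",
    ])


if __name__ == "__main__":
    print(custom_command_script())
```

Statements like these could be placed before or after the vendor-generated SVF for the configuration bitstream, provided the header and trailer settings for the chain are handled consistently. Whether a given SVF player accepts the combined file would need to be checked against that player's documentation.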
Some additional info: The clock of the FSM is provided by the external module.

Rgds
André

Article: 87241
One way to help is to multiply your clock up so that you have several edges to react on. You can register the input with one of these edges, and that should have a fixed minimum relationship to your main clock if your multiplied clock is phase locked to the input clock.

Another similar and crude way is to use the negative edge of the clock, if your timing allows, to sync the input. The state machine then uses it on the positive edge.

Failing that, grey-encoded machines or even one-hot machines, where effectively the input is "looked at" by a single flip-flop, are also a reasonable thing to do.

John Adair
Enterpoint Ltd. - Home of MINI-CAN. The Spartan3 CAN Development Board.
http://www.enterpoint.co.uk

<ALuPin@web.de> wrote in message news:1121846253.756193.223170@g44g2000cwa.googlegroups.com...
Hi,

I have some doubts concerning the following problem: In my design I have an 8-bit bidirectional bus "Data_ulpi". When the external module drives data into my FPGA, I have to read that data and respond immediately; that is, I have NO time to synchronize the data with 2-stage FFs or a FIFO... NO time, because the external module is expecting a response on the next clock cycle.

I have read several posts in this newsgroup explaining that the state machine would have to be very tricky to handle unregistered inputs.

So how do I have to place the bidirectional bus and control signals to have at least a chance of doing a good job? What constraints do I have to take into consideration in that special case for tSU/tH?

Thank you for your advice.

Rgds
André

Article: 87242
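John Adair's last suggestion above relies on a property of Gray encoding: moving between neighbouring codes changes exactly one state bit, so a late-arriving input can only race a single flip-flop. The Python sketch below just demonstrates that property for an 8-state sequence and contrasts it with plain binary. It is a numerical illustration, not a state machine design, and it assumes the machine steps through its states in order; a real transition graph would have to be arranged so that every arc joins adjacent codes.

```python
def gray(n):
    """Reflected binary Gray code of n."""
    return n ^ (n >> 1)


def bits_changed(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


STATES = 8

# Flip-flops that toggle on each step of a wrap-around 8-state sequence.
binary_steps = [bits_changed(i, (i + 1) % STATES) for i in range(STATES)]
gray_steps = [bits_changed(gray(i), gray((i + 1) % STATES)) for i in range(STATES)]

if __name__ == "__main__":
    print("binary:", binary_steps)
    print("gray:  ", gray_steps)
```

In the binary sequence some steps toggle two or three flip-flops at once, which is where an unsynchronized input can leave the register in an unintended state. In the Gray sequence every step, including the wrap-around, toggles one.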
Hi Marc,

Comparing different logic architectures is a difficult exercise, and the only legitimate way to do so is by benchmarking. That's how we architect our new logic architectures -- we build prototype synthesis and place & route tools, and measure each candidate architecture on a large suite of designs. The problem is making people outside the company believe our benchmarking is correct and impartial. I'd suggest tuning in to the Net Seminar, and then discussing here all the flaws you find in it afterwards. I'm sure Vaughn, Alex and I will be happy to discuss any areas of contention!

Perhaps another way of looking at things is by comparing Stratix II to Stratix (let's remove competition for a moment). When we introduced Stratix II, we said that an ALM is equivalent to approximately 2.5 Stratix LEs; this result was based on our own internal benchmarking with the tools and circuit set we had available at that time. We also have shown in previous white papers that the Stratix LE is more efficient than a Virtex half-slice by a margin of ~10% (I'd need to look up the exact number). This difference arises primarily from the increased (routable) register-packing capabilities of the Stratix LE architecture. So *if* you believe these two results, then the results we give for Stratix II vs. Virtex-4 are at least consistent with our previous claims.

Regards,
Paul Leventis
Altera Corp.

Article: 87243
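Paul Leventis's two ratios above can be chained as a rough consistency check against the device figures quoted earlier in the thread. The Python below is only arithmetic, and it rests on assumptions the posts do not state: that the "~10%" Stratix LE advantage over a Virtex half-slice applies unchanged to Virtex-4, that exactly 2.5 and 1.10 are used for the approximate factors, and that one half-slice corresponds to one actual LUT in Tim's table.

```python
# Device figures from Tim's table, quoted elsewhere in this thread.
ALMS_2S180 = 71760
HALF_SLICES_LX200 = 178176  # two per slice, one actual LUT each (assumed)

# Approximate factors from the post; the exact values are assumptions here.
LES_PER_ALM = 2.5
LE_VS_HALF_SLICE = 1.10

# Chain the two claims: ALMs -> Stratix LEs -> half-slice equivalents.
stratix_les = ALMS_2S180 * LES_PER_ALM
half_slice_equiv = stratix_les * LE_VS_HALF_SLICE
ratio = half_slice_equiv / HALF_SLICES_LX200

if __name__ == "__main__":
    print(f"2S180 in Stratix LEs:        {stratix_les:.0f}")
    print(f"2S180 in half-slice equiv.:  {half_slice_equiv:.0f}")
    print(f"relative to LX200:           {ratio:.3f}")
```

Under these assumptions the 2S180 comes out at about 197,000 half-slice equivalents, roughly 11% above the LX200's physical half-slice count. That only shows the two claims are mutually consistent, which is all the post asserts; it does not address Marc's objection that the Xilinx side is counted without any allowance for the logic surrounding its LUTs.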
Hi all,

I have come across the topic "Softcore based Rapid Prototyping". I have no idea what it is, so I would be happy if anyone could give me details on the topic. Moreover, I am planning to do this as a project, so please furnish me with details of what the topic is.

Thanks in advance,
Sarath

Article: 87244
We are in a causal world, so there is always an event which signals valid data, so NO time seems to be an exaggeration. Without more information, the following can be said:

* You can think of using latches. The read event will latch the last data.
* You can delay signals and data appropriately to generate proper setup/hold conditions. Using delays, you can even design circuitry handling negative setup times properly etc.
* The latched data can then be treated using 2-stage FFs or any other possibilities to handle unsynchronized data.

Hubble.

Article: 87245
Hi,

I am trying hard with that 4-bit counter example, with no luck so far. I changed the setup to the following.

In ILA Inserter
* Trigger port:
  - number of ports = 1, width = 1
* Match function:
  - number of match units = 2, match type = basic
* Port connection
  - clock port <= 'clock' signal (CH0)
  - trigger port <= 'count' signal (CH0)
  - data port <= 'Q' signal (CH0,CH1,CH2,CH3)

In Analyzer
* Trigger setup: 2 match function definitions
  - M0 == '0', M1 == '1' /* I am not skillful at this part */
  - Trigger condition sequence (2): M0 -> M1 /* I am not skillful at this part */

My intention for the test vector is that
- we give a stimulus to the 'count' port, using 2 match functions
- "0, 1, ...." (sequence of 2)

The result is that
- in the waveform, data port<0> is always '1' and the other ports are all '0'

What the command window says is:
--------------------------
COMMAND: reset_trigger_settings 2 0
COMMAND: set_window_capture 2 0 0 1 512 0
COMMAND: set_trigger_condition 2 0 1 1 FFFF
COMMAND: set_storage_condition 2 0 FFFF
COMMAND: run 2 0
COMMAND: upload 2 0
INFO - Device 2 Unit 0: Waiting for core to be armed
---------------------------

It seems that the 'analyzer' part is weird. What is the problem?

Thank you in advance

Article: 87246
Paul Leventis (at home) wrote:
[...]
> Perhaps another way of looking at things is by comparing Stratix II to
> Stratix (lets remove competition for a moment). When we introduced Stratix
> II, we said that an ALM is equivalent to approximately 2.5 Stratix LEs; this
> result was based on our own internal benchmarking with the tools and circuit
> set we had available at that time. We also have shown in previous white
> papers that the Stratix LE is more efficient than a Virtex half-slice by a
> margin of ~10% (I'd need to look up the exact number). This difference
> arises primarily from the increased (routable) register-packing capabilities
> of the Stratix LE architecture. So *if* you believe these two results, then
> the results we give for Stratix II vs. Virtex-4 are at least consistent with
> our previous claims.

Howdy Paul,

I thought my post was pretty clear, but let me try one last time.

I'm NOT disputing the very real possibility that the ALM allows for more efficient logic packing than the Slice. The *only* thing I'm disputing is the _fact_ that, as Tim pointed out originally and I tried to explain in different words, Altera is happy to boost the S2 actual LUT count by an additional 30% for the "equivalent" logic that can be implemented, yet increases Xilinx's actual LUT count by exactly 0% for the stuff surrounding their LUT.

As I quoted, even Vaughn admitted that the extra stuff surrounding Xilinx's LUT can't be 0%. To head off Altera claiming that they "don't know how much to add to Xilinx's actual LUT count", either use the inflation figure that Xilinx does (12%), or make up your own and justify it. But it has to be greater than 0%, otherwise all the Altera marketing white papers and websites are comparing apples to oranges.
The columns below provide as-near-as-possible apples-to-apples comparisons:

V4      Slices   Actual LUTs   Logic cells (Xilinx claim)
-----------------------------------------------------------
LX200   89088    178176        200448

S2      ALMs     ALUTs    Equiv_four_input_LUTs (Altera claim)
--------------------------------------------------------------------
2S180   71760    143520   186576

In short, please explain which of the above comparison columns is incorrect, and why.

Regards,
Marc

Article: 87247
JD,

V4 is just like V2, V2P, V2P-X, and S3: it has traditional CMOS output structures that have the diodes to ground and Vcco as part of the nfet and pfet devices themselves.

"Hot swap" means many different things to many people -

1. The most strict: insertion and removal of a device from a parallel bus must not affect data being sent/received by others on the bus. This is really tough. Even if the diodes aren't there (such as in a competitor's part) there is still the power on/off of the IO and its intrinsic capacitive loading (however small). At slow speeds this works if the diodes are not present, but at high speeds the secondary factors become primary, and even the "hot swap" part that claims full compliance fails to meet the requirement of no glitches whatsoever.

2. Less strict: insertion and removal which uses a stepped or sequenced connector. This is achievable. Our app notes detail these solutions. They apply to V4 equally as to V2 or V2P. By sequencing the connections, one can overcome the diode issue of clamping, and the potential glitching issue by control of the pins prior to their mating. Again, some engineering is required, but it does work.

3. Common: insertion and removal on a parallel bus that uses a protocol to recognize insertion, and back off and retry (or ignore). Nice, because you do nothing, and the system is designed to work even if there are glitches.

4. Self-powering: since the diode to Vcco can be forward biased, and the IO bank in the V4 needs 8 mA to power ON completely, the IO bank can be powered from the wide parallel bus itself. A number of customers figured this out (with our help), and their system backplanes work this way. No glitches, as the bus uses a very strong driver on transmitting cards (which all together end up powering ON the IO banks of inserted cards without glitching -- they are guaranteed to power on tri-state before configuration).

5. MGTs, LVDS, or other point to point: here "hot swap" just means that no damage is done when you insert/remove. And, no damage is done to the MGTs on V2, V2P, or V4. Data isn't the issue (when the board is unplugged, there is no point to point link!).

Hope this helps,
Austin

Article: 87248
I have never used latches. What is the problem with them? Does the Fitter treat them predictably?

André

Article: 87249
Hi,

I have programmed my FPGA with an evaluation bitstream file for a DDR-SDRAM controller. When measuring bank address bit 0, I can see that it never goes to '1' during the initialization phase. That means that the EXTENDED mode register is never written; that is, the DLL is not enabled. But the evaluation design seems to work. So how can that be?

Rgds
André