John_H wrote:
>> http://dwb.unl.edu/Teacher/NSF/C01/C01Links/www.ualberta.ca/~bderksen/windowpane.html
>
> Thanks for the link. The information presents some concepts with glass
> deformation that I wasn't familiar with, but the comment "This author
> believes that the correct explanation lies in the process by which window
> panes were manufactured at that time", without support for why the thicker
> end would always be installed on the bottom, leaves me with the issue still
> open. I'll nudge it more toward "maybe."
>
> Apologies to those disturbed by how off topic this got.

Well, someone please go turn one upside down, measure it with a micrometer, and make a note of it, so at least our grandkids will be able to settle this. (Make sure it's not near where said grandkids are likely to be playing ball in the years before they gain an appreciation for experimental science.)
Article: 101926

Symon,

You have to be careful if you have differing thermal coefficients of expansion (if you epoxy it to the pcb). If the part heats up, and the pcb heats up, you would like the coefficients of thermal expansion to all be the same, so you do not shear the solder bumps off the pcb.

It is a much bigger problem with flip chip, to match all the coefficients and mount the die to the substrate such that it will tolerate many years of thermal cycling without cracking of the solder bumps. I suspect this is a wire bond, epoxy/plastic molding compound type of application, where the FPGA package is of similar material to the pcb, so thermal stresses should be minimal.

Why not use an RF link to send the data from sensors in the wheel to somewhere else? How the heck does one power a pcb in a wheel? Sounds like there are a ton of problems to solve. How do you communicate with the wheel? "Yo, wheel..."?

Austin

Symon wrote:
> "Peter Alfke" <peter@xilinx.com> wrote in message
> news:1147110712.154203.47590@i39g2000cwa.googlegroups.com...
>
> > Let's remember that the g-forces have a direction (outward), and it is
> > up to the pc-board designer to take advantage of this.
> > Peter Alfke
>
> Peter,
> You raise an interesting point. I wonder if the yield stress limit is the
> same for compression as for expansion?
> Sod it, just epoxy the damn thing to the board! :-)
> Cheers, Syms.
Article: 101927

Piotr Wyderski wrote:
> JJ wrote:
>
> > I have fantastic disbelief about that 6 ops/clock except in very
> > specific circumstances, perhaps in a video codec using MMX/SSE etc where
> > those units really do the equiv of many tiny integer codes per cycle on
> > 4 or more parallel 8 bit DSP values.
>
> John, of course it is about peak performance, reachable with great effort.

Of course; I don't think we differ much in opinion on the matter. But I prefer to stick to avg throughputs available with C codes. In summary, I think any HW acceleration is justified when it is pretty much busy all the time, embedded, or at least can shrink very significantly the time spent waiting to complete, but few opportunities are going to get done, I fear, since the software experts are far from having the know-how to do this in HW. For many apps where an FPGA might barely be considered, one might also look at the GPUs or the PhysX chip, or maybe wait for ClearSpeed to get on board (esp for flops), so the FPGA will be the least visible option.

> But the existence of every accelerator is explained only when even that
> peak performance is not enough. Otherwise you simply could write better
> code at no additional hardware cost. I know that in most cases the CPU
> sleeps because of lack of load or stalls because of a cache miss, but it
> is completely different song...
>
> > Now that's looking pretty much like what FPGA DSP can do pretty trivially
> > except for the clock ratio 2GHz v 150MHz.
>
> Yes, in my case a Cyclone @ 65MHz (130MHz internally + SDR interface,
> 260 MHz at the critical path with timesharing) is enough. But it is a
> specialized waveforming device, not a generic-purpose computer. As a
> processor, it could reach 180MHz and then stabilize -- not an impressive
> value today, not to mention that it contains no cache, as BRAMs are too
> precious a resource to be wasted that way.

The BRAMs are what define the opportunity: 500-odd BRAMs all whacking data at say 300MHz, dual ported, is orders more bandwidth than any commodity cpu will ever see, so if they can be used independently, FPGAs win hands down. I suspect a lot of poorly executed software-to-hardware conversion combines too many BRAMs into a single large and relatively very expensive SRAM, which gives all the points back to cpus. That is also the problem with soft core cpus: to be useful you want lots of cache, but merging BRAMs into useful-size caches throws all their individual bandwidth away. That's why I propose using RLDRAM, as it allows FPGA cpus to use 1 BRAM each and share RLDRAM bandwidth over many threads, with full associativity of memory lines using a hashed MMU structure, IPT sort of.

> > A while back, Tom's Hardware did a comparison of 3GHz P4s v the P100 1st
> > pentium and all the in-betweens, and the plot was basically linear
>
> Interesting. In fact I don't care about P4, as its architecture is one
> big mistake, but linear speedup would be a shame for a Pentium 3...

Tom's IIRC didn't have AMD in the lineup; must have been 1-2 yrs ago. The P4 end of the curve was still linear, but the tests are IMO bogus as they push linear memory tests rather than the random test I use. I hate when people talk of bandwidth for blasting GB of contiguous large data around and completely ignore pushing millions of tiny blocks around.

> > benchmark performance, it also used perhaps 100x the transistor count
>
> Northwood has 55 million, the old Pentium had 4.5 million.

100x is overstating it a bit, I admit, but the turn to multi cores puts cpus back on the same path as FPGAs: Moore's law for quantity rather than raw clock speed, which keeps the arguments for and against relatively constant.

> > as well and that is all due to the Memory Wall and the necessity to
> > avoid at all costs accessing DRAM.
>
> Yes, that is true. 144 MiB of caches of a POWER5 does help.
> A 1.5GHz POWER5 is as fast as a 3.2GHz Pentium 4 (measured
> on a large memory-hungry application). But you can buy many P4s
> at the price of a single POWER5 MQM.
>
> > Try running a random number generator say R250 which can generate a new
> > rand number every 3ns on an XP2400 (9 ops IIRC). Now use that no. to
> > address a table >> 4MB. All of a sudden my 12Gops Athlon is running at
> > 3MHz, ie every memory access takes 300ns
>
> Man, what 4MiB... ;-) Our application's working set is 200--600MiB. That's
> the PITA! :-/

Actually I ran that test from 32k, doubling until I got to my ram limit of 640MB (no swapping) on a 1GB system, and the speed reduction is sort of a staircase log. At 32K obviously no real slowdown; the step bumps indicate the memory system gradually failing -- L1, L2, TLB. After 16M the drop to 300ns can't get any worse, since the L2 and TLBs have long failed, having so very little associativity. But then again it all depends on temporal locality: how much work gets done per cache line refill, and is all the effort of the cache transfer thrown away every time (trees), or only some of the time (code)?

In the RLDRAM approach I use, the Virtex-2 Pro would effectively see 3ns raw memory issue rates for fully random accesses, but the true latency of 20ns is well hidden, and the issue rate is reduced probably 2x to allow for rehashing and bank collisions. Still, a 6ns issue rate v 300ns for fully random access is something to crow about. Of course the technology would work even better on a full custom cpu. The OS never really gets involved to fix up TLBs since there aren't any; the MMU does the rehash work. The 2 big penalties are that tagging adds 20% to memory cost (1 tag every 32 bytes) and that, with hashing, the store should be left <80% full, but memory is cheap, bandwidth isn't.

> > So on an FPGA cpu, without OoO, no Branch prediction, and with tiny
> > caches, I would expect to see only about .6 to .8 ops/cycle and
> > without caches
>
> In a soft DSP processor it would be much less, as there is much vector
> processing, which omits (or at least should) the funny caches built of
> BRAMs.

DSP has highly predictable data structures and high locality, not much tree walking, so SDRAM bandwidth can be better used directly; still, code should be cached.

> > I have no experience with the Opterons yet, I have heard they might be
> > 10x faster than my old 1GHz TB but I remain skeptical based on past
> > experience.
>
> I like the Cell approach -- no cache => no cache misses => tremendous
> performance. But there are only 256KiB of local memory, so it is
> restricted to specialized tasks.

I suspect Cell will get used to accelerate as many apps as FPGAs or more, but it is so manually cached. I can't say I like it myself: so much theoretical peak, but how to get at it? I much prefer the Niagara approach to cpu design, if only the memory was done the same way.

> Best regards
> Piotr Wyderski

regards

John Jakson
transputer guy
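JJ's bandwidth point is easiest to see in HDL. Below is a minimal sketch, not taken from his design (the module and port names are invented here), of a dual-port RAM coded so that synthesis infers one block RAM per instance; keeping instances separate like this preserves each BRAM's private two-port bandwidth instead of merging them into one large, slower SRAM:

    // Sketch only: one independently addressed dual-port RAM.
    // Each instance should infer a single BRAM (e.g. 512 x 18),
    // keeping its own two ports of bandwidth.
    module dp_bram #(
        parameter AW = 9,    // address width: 512 entries
        parameter DW = 18    // data width: one 18-bit BRAM slice
    )(
        input               clk,
        // Port A: read/write
        input               we_a,
        input  [AW-1:0]     addr_a,
        input  [DW-1:0]     din_a,
        output reg [DW-1:0] dout_a,
        // Port B: read only
        input  [AW-1:0]     addr_b,
        output reg [DW-1:0] dout_b
    );
        reg [DW-1:0] mem [0:(1<<AW)-1];

        always @(posedge clk) begin
            if (we_a) mem[addr_a] <= din_a;
            dout_a <= mem[addr_a];
        end

        always @(posedge clk)
            dout_b <= mem[addr_b];
    endmodule

Hundreds of these, each clocked and addressed independently, is the aggregate bandwidth JJ is comparing against a commodity cpu's memory bus.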
Article: 101928

Andreas Ehliar wrote:
> On 2006-05-07, JJ <johnjakson@gmail.com> wrote:
> > I would say that if we were to see PCIe on chip, even if on a higher $
> > part, we would quickly see a lot more co-pro board activity, even just
> > plain vanilla PC boards.
>
> You might be interested in knowing that Lattice is doing just that in
> some of their LatticeSC parts. On the other hand, you are somewhat
> limited in the kinds of application you are going to accelerate since
> LatticeSC does not have embedded multipliers IIRC. (Lattice is
> targetting communication solutions such as line cards that rarely need
> high performance multiplication in LatticeSC.)
>
> /Andreas

Yeah, I have been following Lattice more closely recently; it will take me some time to evaluate their specs more fully. I may get more interested if they have a free-use tool chain I can redo my work with.

Does anyone have PCIe on chip though?

John Jakson
transputer guy
Article: 101929

On a sunny day (Mon, 08 May 2006 11:11:38 -0700) it happened Austin Lesea
<austin@xilinx.com> wrote in <e3o1kq$rf35@xco-news.xilinx.com>:

> Why not use an RF link to send the data from sensors in the wheel to
> somewhere else?

You do not want the cellphone to jam your brakes....

> How the heck does one power a pcb in a wheel?

Rotary transformer (the classic solution, which can also be used for data transfer): basically 2 halves of a pot core.

One could also make a mechanical generator with some mass; every time the speed changes it would turn, rotating a magnet, like those old self-winding mechanical watches.

I'd go with the transformer.
Article: 101930

Piotr Wyderski wrote:
> Andreas Ehliar wrote:
>
> > One interesting application for most of the people on this
> > newsgroup would be synthesis, place & route and HDL simulation.
> > My guess would be that these applications could be heavily
> > accelerated by FPGA:s.
>
> A car is not the best tool to make other cars.
> It's not a bees & butterflies story. :-) Same with FPGAs.

Well, xyz auto workers do eat their own, usually subsidised by the employer.

I disagree. In a situation where FPGAs developed relatively slowly and P/R jobs took many hours, there would be a good opportunity to use FPGAs for just such a job. But then again, FPGAs and the software are evolving too fast, and P/R jobs in my case have gone from 8-30 hrs a few years ago to a few minutes today, so the incentive has gone. If I were paying $250K like the ASIC guys do for this and that, a hardware coprocessor might look quite cheap, and the EDA software is much more independent of the foundries. DAC usually has a few hardware copro vendors, most of them based on FPGAs. At one time some of those were even done in full custom silicon; that was really eating your own.

> > My second guess that it is far from trivial to actually do this :)
>
> And who actually would need that?

I would be rather amazed if in a few years my 8 core Ulteron x86 chip was still running EDA tools on 1 core.

> Best regards
> Piotr Wyderski

John Jakson
transputer guy
"Symon" <symon_brewer@hotmail.com> wrote in message news:445f8769$0$15784$14726298@news.sunsite.dk... > > Sod it, just epoxy the damn thing to the board! :-) > Cheers, Syms. Epoxy adds mass. Does the improved tolerance to shear stress (assuming lateral g-force here) outweigh the effects of added mass? (I like the term "outmass" versus "outweigh")Article: 101932
Article: 101932

Thomas Womack wrote:
> As you might imagine, I would be ecstatic to see a few wide
> multipliers appearing in FPGAs - a 64x64->128 unit isn't _that_ large
> an IP block

Hi Tom. No, a 64x64->128 integer multiplier isn't that large at all. Here is the device utilization report for mine:

Device utilization summary:
---------------------------
Selected Device: 3s500epq208-4
  Number of Slices:            557 out of 4656   11%
  Number of Slice Flip Flops:  370 out of 9312    3%
  Number of 4 input LUTs:      867 out of 9312    9%
  Number of bonded IOBs:        18 out of  158   11%
  Number of GCLKs:               1 out of   24    4%

Keep in mind though that my primary concern at present is minimizing LUT (gate) count, not speed, so this multiplier requires N clock cycles to multiply two N bit numbers together, yielding a 2N length result.

If you'd like, I'll be happy to send you a copy of the Verilog source code for my multiplier and the combination multiplier/modulo module (which is only slightly larger than the multiplier module). It's 110 lines of Verilog.

Regards,
Ron
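To make the N-cycle trade-off concrete, here is a minimal sketch of the textbook iterative shift-add multiplier that Ron's description implies. To be clear, this is not Ron's 110-line module; the ports and control here are invented. The point is the structure: one WIDTH-bit adder reused for WIDTH clocks, which is why the LUT count stays low:

    // Sketch of a textbook N-cycle shift-add multiplier (not Ron's code).
    // One conditional add per clock, so LUT cost is dominated by a
    // single WIDTH-bit adder; result is ready after WIDTH cycles.
    module shift_add_mult #(parameter WIDTH = 64) (
        input                    clk,
        input                    start,   // pulse high to load operands
        input  [WIDTH-1:0]       a,       // multiplicand
        input  [WIDTH-1:0]       b,       // multiplier
        output reg               done,
        output reg [2*WIDTH-1:0] product
    );
        reg [WIDTH-1:0] mcand;
        reg [7:0]       count;   // wide enough for WIDTH up to 255

        // Add the multiplicand into the high half when the current
        // multiplier bit (the product LSB) is set; keep the carry.
        wire [WIDTH:0] sum = product[2*WIDTH-1:WIDTH]
                           + (product[0] ? mcand : {WIDTH{1'b0}});

        always @(posedge clk) begin
            if (start) begin
                mcand   <= a;
                product <= {{WIDTH{1'b0}}, b};  // low half holds multiplier
                count   <= WIDTH;
                done    <= 1'b0;
            end else if (count != 0) begin
                product <= {sum, product[WIDTH-1:1]};  // add, then shift right
                count   <= count - 1'b1;
                done    <= (count == 1);
            end
        end
    endmodule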
Article: 101933

This news is interesting:

http://www.eet.com/news/latest/showArticle.jhtml?articleID=187200783&pgno=2

You do need a hype filter when reading this, and many claims are extrapolation-gone-wrong, but the base idea already exists in ring osc designs inside FPGAs now.

Seems (with the right tools) you could extend this inside an FPGA by creating a large physical ring (long routes) with the sprinkled buffers. The physical delays would reduce the process variations in the clock, and you get the phase taps 'for free' - but the tools _will_ need to co-operate :)

We have done this inside CPLDs, and get approx 1.3ns granularity. With FPGAs the buffer delays are much lower, and the routing can be made to dominate.

Sounds like a project for Antti :)

-jg
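For anyone wanting to experiment, the "phase taps for free" idea looks roughly like the sketch below. A hedged illustration only (invented module and names): in simulation these assigns have zero delay, the real tap spacing comes entirely from the physical buffers and routes, and tool-specific keep/placement constraints (a generic attribute is shown) are essential to stop the chain from being collapsed:

    // Sketch: a kept buffer chain whose taps are sampled on the next
    // clock edge, yielding a thermometer code of when sig_in arrived.
    // Real tap delays come from placement/routing, not from this source.
    module tapped_delay #(parameter TAPS = 16) (
        input                 clk,
        input                 sig_in,
        output reg [TAPS-1:0] taps_q
    );
        (* keep = "true" *) wire [TAPS-1:0] chain;

        assign chain[0] = sig_in;
        genvar i;
        generate
            for (i = 1; i < TAPS; i = i + 1) begin : dly
                // Each hop is one buffer/route delay on silicon.
                assign chain[i] = chain[i-1];
            end
        endgenerate

        always @(posedge clk)
            taps_q <= chain;   // sub-cycle arrival time, thermometer coded
    endmodule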
Article: 101934

Symon wrote:
> To be pedantic, 400g is a pretty substantial _acceleration_ on a mass!
>
> The upshot is, if you want to put an unclamped FPGA in a tyre, I suggest
> your FPGA has big balls. (With apologies to AC/DC!)

Or, "Lots of Balls" .. :)

-jg
Article: 101935

Anyone know a UK stockist for the Xilinx/Digilent S3 board? (Or anyone have a spare one - I only want the PCB.)

The Xilinx website seems to have discontinued it & only lists Cedar as UK disti, & their website has minimal info, and Avnet only list their own boards as far as I can see.

Digilent want $32 to ship USPS (which would cost $9 in a GP flat-rate envelope).
Article: 101936

Do you have a counter (without clocks)?
Article: 101937

frank wrote:
> uh, guy - why the hell u wanna brute force rsa with an fpga.
> there r quite better (faster and cheaper) methods to do so.

Example please? RSA-640 was solved with a distributed network of something like 80 Opterons doing sieving. I wouldn't call those "cheap."

> hope u calculated the throughput and the years/centurys of trying.

That's one of the shortcomings of ECM that Tom touched on earlier. Unlike traditional factorization methods, ECM doesn't even guarantee any result at all! Because of that, I had to have two status LEDs: one to indicate completion, and another to indicate whether or not a solution was found. The average throughput rate will hopefully be blazingly fast at about one or two bits per day. ;-)

There is no input to the FPGA because the number to be factored is hard coded into the FPGA (although I could easily read it from an external device if needed), and the factor (if found) will be displayed on the board's LCD display, so the only thing connected to the board during operation is power.

Because of the probabilistic nature of ECM, to the best of my knowledge no one has ever been able to calculate how long ECM would require on average for a particular factorization. I wonder if Tom Womack has investigated this in his work with ECM?

Ron
Article: 101938

Ron, it's amazing how nice and patient you can be when you want to...

Greetings
Peter Alfke
Article: 101939

Thomas Womack wrote:
> In article <1147022814.787257.294510@i39g2000cwa.googlegroups.com>,
> Peter Alfke <alfke@sbcglobal.net> wrote:
>
> > Ron wrote:
> > > So to multiply two 704 bit numbers
> > > together (depending upon how it's implemented of course) would require
> > > roughly sixty 64-bit multiplies and a bunch of adds. ...
> >
> > If I remember right, 704 is 11 times 64, so the multiplication would
> > take 121 of those 64-bit multipliers, not "roughly sixty"...
>
> It depends on precise details of the implementation, and you have to
> write moderately ugly code because the x86 multiply instruction
> produces its outputs in fixed registers, but if you apply

If you use the IMUL instruction (signed multiply), you free yourself from that restriction, but you have to make certain that your product will fit in 32 bits and that your values stay in their restricted range (no overflow).
Article: 101940

A counter has a clock by definition. The clock is the signal you are counting. It either comes in from the outside, or you can generate an internal clock by means of a string of buffers plus one inverter, connected back to the input (a ring oscillator).

The frequency stability is bad, + or - 50%, but in some cases (like this one) nobody cares. Anything between 1 kHz and 100 MHz would do the job.

Peter Alfke
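A minimal sketch of what Peter describes, with invented names. Caveats: the keep attribute syntax is tool-dependent, the #1 delays exist only so the loop oscillates in simulation (the real period is set by the physical buffer and routing delays), and enable must start low so the loop initializes out of the unknown state in a 4-state simulator:

    // Sketch: ring oscillator (odd number of inversions) clocking a counter.
    module ring_osc_counter #(
        parameter STAGES = 5,     // buffers after the single inverting stage
        parameter CBITS  = 16
    )(
        input              enable,   // drive low first, then raise to run
        output [CBITS-1:0] count
    );
        (* keep = "true" *) wire [STAGES:0] chain;

        // One inverting stage gated by enable, then a buffer string.
        assign #1 chain[0] = enable & ~chain[STAGES];

        genvar i;
        generate
            for (i = 0; i < STAGES; i = i + 1) begin : buf_chain
                assign #1 chain[i+1] = chain[i];
            end
        endgenerate

        reg [CBITS-1:0] cnt = 0;
        always @(posedge chain[STAGES])
            cnt <= cnt + 1'b1;

        assign count = cnt;
    endmodule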
Article: 101941

Xilinx gives complete data on thermal resistance with and without heatsink and airflow.

We cannot give blanket data for power consumption, because it depends on the aggregate frequency-times-capacitance product of every node (assuming the same Vcc for all of them). That is a problem shared by all programmable devices, but not shared by ASICs and ASSPs, like microprocessors. They usually operate under fairly well-specified internal conditions. FPGAs do not.

"Worst case" would be a shift register running at max frequency. Such a design is not only unrealistic, but would most likely overheat even with the best heatsink. But if you reduce the frequency, you can easily test this "worst-case" design. Just do not overdo the frequency...

Peter Alfke
Article: 101942

Just for fun, here are the figures for a bus-width of 704 bits and 1024 bits.

Device utilization summary: (704 bit bus-width)
---------------------------
Selected Device: 3s500epq208-4
  Number of Slices:           2592 out of 4656   55%
  Number of Slice Flip Flops: 1779 out of 9312   19%
  Number of 4 input LUTs:     4176 out of 9312   44%
  Number of bonded IOBs:        18 out of  158   11%
  Number of GCLKs:               1 out of   24    4%

Device utilization summary: (1024 bit bus-width)
---------------------------
Selected Device: 3s500epq208-4
  Number of Slices:           2975 out of 4656   63%
  Number of Slice Flip Flops: 2099 out of 9312   22%
  Number of 4 input LUTs:     4896 out of 9312   52%
  Number of bonded IOBs:        18 out of  158   11%
  Number of GCLKs:               1 out of   24    4%

The amazing thing is that the slice and LUT counts seem to increase *less* than the bus-width (i.e., the size of the numbers it can multiply) increases. I've taken pains to ensure the optimizer isn't optimizing something away that it shouldn't, so as far as I know these numbers are correct.

The synthesizer reports a maximum frequency of 58MHz for the 64 bit design, 16MHz for the 704 bit design, and 12MHz for the 1024 bit design "as is", without any tweaking to improve the timing, so it should take about 1.1 microseconds to multiply two 64 bit numbers together, and 85 microseconds to multiply two 1024 bit numbers together.

Ron
Article: 101943

On Mon, 08 May 2006 16:07:18 -0700, Ron <News5@spamex.com> wrote:

> Just for fun, here are the figures for a bus-width of 704 bits and 1024
> bits.
>
> [utilization summaries snipped]
>
> The synthesizer reports a maximum frequency of 58MHz for the 64 bit
> design, 16MHz for the 704 bit design, and 12MHz for the 1024 bit design
> "as is", without any tweaking to improve the timing, so it should take
> about 1.1 microseconds to multiply two 64 bit numbers together, and 85
> microseconds to multiply two 1024 bit numbers together.

Presumably you could do it rather quicker using the S3's multiplier blocks.....
Article: 101944

On Tue, 09 May 2006 07:56:10 +1200, Jim Granville <no.spam@designtools.co.nz> wrote:

> This news is interesting:
>
> http://www.eet.com/news/latest/showArticle.jhtml?articleID=187200783&pgno=2
>
> [snip]
>
> Sounds like a project for Antti :)
>
> -jg

Just a silly thought - how about using a very long async delay path as a memory device, like the mercury delay-line memories of olden times? Not useful, but maybe an interesting exercise for those with too much time on their hands....
Article: 101945

In article <e3mq62$k21$2@news.lysator.liu.se>, Andreas Ehliar <ehliar@lysator.liu.se> wrote:
> On 2006-05-06, Piotr Wyderski
> <wyderski@mothers.against.spam-ii.uni.wroc.pl> wrote:
> > What could it accelerate? Modern PCs are quite fast beasts...
> > If you couldn't speed things up by a factor of, say, 300%, your
> > device would be useless. Modest improvements by several tens
> > of percents can be neglected -- Moore's law constantly works
> > for you. FPGAs are good for special-purpose tasks, but there
> > are not many such tasks in the realm of PCs.
>
> One interesting application for most of the people on this
> newsgroup would be synthesis, place & route and HDL simulation.
> My guess would be that these applications could be heavily
> accelerated by FPGA:s. My second guess that it is far from trivial
> to actually do this :)

Certainly on the simulation side of things, various companies like Ikos (are they still around?) have been doing stuff like this for years. To some extent this is what ChipScope and Synplicity's Identify are doing, only using more of a logic analyzer metaphor: breakpoints are set and triggered through JTAG.

As far as synthesis itself and P&R go, I would think that these could be accelerated in a highly parallel architecture like an FPGA. There are lots of algorithms that could be sped up in an FPGA - someone earlier in the thread said that the set of algorithms that could benefit from the parallelism available in FPGAs was small, but I suspect it's actually quite large.

Phil
Article: 101946

In article <e3nvv7$h30$1@atlantis.news.tpi.pl>, Piotr Wyderski <wyderskiREMOVE@ii.uni.wroc.pl> wrote:
> Andreas Ehliar wrote:
>
> > One interesting application for most of the people on this
> > newsgroup would be synthesis, place & route and HDL simulation.
> > My guess would be that these applications could be heavily
> > accelerated by FPGA:s.
>
> A car is not the best tool to make other cars.
> It's not a bees & butterflies story. :-) Same with FPGAs.

Err... well, cars aren't exactly reprogrammable for many different purposes, though, are they?

> > My second guess that it is far from trivial to actually do this :)
>
> And who actually would need that?

Possibly you? What if we could decrease your wait for P&R from hours to minutes? I suspect you'd find that interesting, no?

Phil
Article: 101947

Mike Harrison wrote:
> Presumably you could do it rather quicker using the S3's multiplier blocks.....

Good point, but then I'd be tied to a particular FPGA. The multipliers are very impressive, however. If I ever get my design to fit on something, then I can start taking advantage of things like the built-in multipliers to speed things up.

Let's see: 18x18->36 bits in less than 5 ns. For a 1024 bit multiply, it would take roughly 1,624 eighteen-bit multiplies and a bunch of multi-precision additions, which translates into around 8 microseconds per 1024 bit word! Very impressive indeed.

Ron
Article: 101948

P.S. Before someone catches my error: yes, indeed you could run some of these multiplies in parallel to cut the timing even more. The datasheet says the Spartan-3E devices have between 4 and 36 dedicated multiplier blocks per device, so depending on how many there are on the FPGA, the 8 microseconds I mentioned earlier could be cut to between 1/4 and 1/36 of that - roughly 222ns for 1024 bits!!! I will definitely have to look into this at some point. It would be great if a multiprecision package for the multipliers were already available in Verilog.
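Such a package would be built around exactly this limb decomposition. As a hedged illustration (an invented module, not an existing package, sized at 32x32 to keep it short), here is how four of the embedded multiplier blocks compose one wider product; each 16x16 unsigned partial product fits comfortably in a signed 18x18 block, and a 1024-bit multiplier would tile the same pattern across all of its limbs:

    // Sketch only: 32x32 -> 64 unsigned multiply built from four 16x16
    // partial products (one embedded 18x18 block each). Two-stage
    // pipeline: p is valid two clocks after a and b are presented.
    module mult32x32 (
        input             clk,
        input      [31:0] a, b,
        output reg [63:0] p
    );
        reg [31:0] pp_ll, pp_lh, pp_hl, pp_hh;

        always @(posedge clk) begin
            // Stage 1: the four partial products.
            pp_ll <= a[15:0]  * b[15:0];
            pp_lh <= a[15:0]  * b[31:16];
            pp_hl <= a[31:16] * b[15:0];
            pp_hh <= a[31:16] * b[31:16];

            // Stage 2: align and accumulate.
            // p = (pp_hh << 32) + (pp_lh << 16) + (pp_hl << 16) + pp_ll
            p <= {pp_hh, pp_ll}
               + ({32'b0, pp_lh} << 16)
               + ({32'b0, pp_hl} << 16);
        end
    endmodule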
Alan, We were having issues with the EDK profiling tools for ppc405 also. And we were/are using 7.1. Now I am a little anxious to get back to the lab and see if an upgrade to 8.1 makes things go more smoothly. Thanks for your posts. Oh, and in response to one of your questions: lots of folks are using the ppc405, but I am not sure how many are using the Xilinx profiling tools (for reasons you have already discovered!). Joey