Karl wrote:
>> If anyone knows of any 2C70 boards, do let us know.
>
> You're lucky!
>
> The Lead-Free / RoHS compliant version of the Cyclone II based DSP
> board will be with the EP2C70!

Thanks Karl, that sounds very promising. I couldn't find any info on it, alas. Do you have any idea of when and how much?

Cheers,
Tommy

Article: 100651
The stackup you have looks good, but are 3+ routing layers going to be sufficient? Technically you have four, but on a surface-mount board the top layer is so cluttered with parts that you can't actually route much on it unless there is a lot of nice patterning to how the parts happen to be connected.

One thing left out is probably the MOST important consideration from an EMI perspective: if you have chopped-up planes, make sure that NO signal crosses a split unless it has been adequately filtered. Keep in mind that if the chopped-up plane is providing the AC return path, then where the signal encounters a plane split you've just made yourself a huge loop that will radiate and distort whatever signal you're trying to send across it... so it had best not be any sort of periodic signal or anything of importance. If you do have to jump the gap, bridge it with as large a resistor as possible that still allows your circuit to function... and that's only after you've exhausted all ways to avoid having a signal cross the gap in the first place.

KJ

Article: 100652
Hello,

I am a Python programmer writing neural network code with binary firing and binary weight values. My code will take many days to parse my large data sets. I have no idea how much an FPGA could help, what the cost would be, or how easy it would be to access it from Python. The problem is similar to competitive networks: I must dot-product many million-length bit vectors (which only change occasionally) with one input vector. Anybody want to estimate the cost, speedup, and value an FPGA could offer me?

Seems like this problem shouldn't be so hard, but from the little research I've done I haven't found a good-value product that is ready-made, so I'm looking at (multiple?) FPGAs as a coprocessor.

Article: 100653
rickman wrote:
> I have looked at the data sheet and they say very clearly that the
> Spartan 3 is held in reset until all three power supplies are fully up.
> But the range of voltages is very wide, with reset being released when
> the VCCO on Bank 4 is as low as 0.4 volts.
>
> I get a lot of grief from the FPGA firmware designers on every little
> nitpick that they don't like about the board design. I need to
> know that this will keep the FPGA in reset and all IOs tristated
> whether the various power voltages are above or below the internal
> reset threshold, up to the point of being configured.

IIRC, the I/Os are inputs (or HiZ) with a soft pullup until after configuration. It should be simple enough to delay the start of configuration until after the last supply is up.

Article: 100654
Dini has a new PCIe x8 board. It should work really well for your needs, but nobody has built the DMA engine and drivers for it yet, so that adds to its $10k cost. FPGAs are great for accelerating neural network projects; there are lots of papers with algorithms for this in ACM and IEEE journals.

The Python interface is not a problem. It will come down to IOCTLs at some point. You may have to make a C API and DLL wrapper if your Python cannot make IOCTL calls directly.

Here's my usual plug: I just wish there were a hardware vendor who would put some cheap FPGAs (Spartan-3E 1600s) on a cheap board with some standard DRAM and SRAM slots (unpopulated) and a PCIe x8 (or x4) slot, and then sell the board for < $300. Design the darn thing for acceleration, not prototyping. They could make a killing on a well-made board with 8 or 16 fast DMA channels and a driver that worked really well for it.

Article: 100655
rickman wrote:
> I have looked at the data sheet and they say very clearly that the
> Spartan 3 is held in reset until all three power supplies are fully up.
> But the range of voltages is very wide, with reset being released when
> the VCCO on Bank 4 is as low as 0.4 volts.
>
> I get a lot of grief from the FPGA firmware designers on every little
> nitpick that they don't like about the board design. I need to
> know that this will keep the FPGA in reset and all IOs tristated
> whether the various power voltages are above or below the internal
> reset threshold, up to the point of being configured.

I assume that you are looking at Table 28 on page 54 in the Spartan-3 data sheet:
http://www.xilinx.com/bvdocs/publications/ds099.pdf

These are essentially the trip points for the power-on reset (POR) circuit inside the FPGA. The trip voltage range is somewhat wide due to process variation, etc. The POR circuit prevents configuration from starting until all three power rails are within the trip-point range. The POR can happen as early as the minimum voltage levels or as late as the maximum limits.

Until the POR is released, all I/Os not actively involved in configuration are high-impedance. The HSWAP_EN pin controls whether or not internal pull-ups are applied to these I/Os. When HSWAP_EN = High, the pull-ups are turned off. Also, the pull-ups connect to their associated power rail, so you won't see their effect until VCCO ramps up.

---------------------------------
Steven K. Knapp
Applications Manager, Xilinx Inc.
General Products Division
Spartan-3/-3E FPGAs
http://www.xilinx.com/spartan3e
---------------------------------
The Spartan(tm)-3 Generation: The World's Lowest-Cost FPGAs.

Article: 100656
On a sunny day (14 Apr 2006 14:28:38 -0700) it happened andrewfelch@gmail.com wrote in <1145050118.699722.123650@e56g2000cwe.googlegroups.com>:

>Hello,
>
>I am a Python programmer writing neural network code with binary firing
>and binary weight values. My code will take many days to parse my
>large data sets. I have no idea how much fpga could help, what the
>cost would be, and how easy it would be to access it from Python. The
>problem is similar to competitive networks, where I must dot product
>many million-length bit vectors (which only change occasionally) with 1
>input vector. Anybody want to estimate the cost, speedup, and value an
>fpga could offer me?
>
>Seems like this problem shouldn't be so hard, but from the little
>research I've done I haven't found a good value product that is
>ready-made, so I'm looking at (multiple?) fpga as a coprocessor.

Sounds to me like vector processing. Cray (supercomputers) knew all about it. It can be done in an FPGA, and many hardware units have been designed for neural nets. Isn't Python horribly slow for this? Would C be better? ASM?

Anyway, perhaps (this has been done) you can implement your neurons in hardware. There also exist vector plugin cards for the PC (I designed one once).

Article: 100657
John Larkin wrote:
> Since the max serial-slave configuration rate on things like Spartan3
> chips is, what, 20 MHz or something, you might consider slowing down
> the CCLK input path, and/or adding some serious hysteresis on future
> parts. On a pcb, CCLK is often a shared SPI clock, with lots of loads
> and stubs and vias and such, so may not be as pristine as a system
> clock. CCLK seems to be every bit as touchy as main clock pins, and it
> really needn't be.

Wouldn't one expect this to be 'normal design practise'?

I suppose Xilinx missed that obvious feature because there are no other Schmitt cells on the die, and even though the CPLDs have this, I'm sure their inter-department sharing is like most large companies :)

-jg

Article: 100658
Hi John,

Thank you for the feedback. Fortunately, this is already a planned enhancement on future families.

---------------------------------
Steven K. Knapp
Applications Manager, Xilinx Inc.
General Products Division
Spartan-3/-3E FPGAs
http://www.xilinx.com/spartan3e
---------------------------------
The Spartan(tm)-3 Generation: The World's Lowest-Cost FPGAs.

Article: 100659
On 14 Apr 2006 15:47:28 -0700, "Steve Knapp (Xilinx Spartan-3 Generation FPGAs)" <steve.knapp@xilinx.com> wrote:

>Hi John,
>
>Thank you for the feedback. Fortunately, this is already a planned
>enhancement on future families.

Thank you! Thank you! Maybe I'm not crazy after all. Maybe.

Next, how about making the real clock inputs programmable to be slower and less noise sensitive? Yeah, some people are never satisfied.

John

Article: 100660
Hello,

I am a graduating college senior working on his seminar project. I am trying to get a CM11a controller (X10 technology) to talk to a Spartan-3 Starter Kit. I've written a program in C# that allows me to turn on a light via that controller. My goal is to download this to the Spartan board, but instead of turning on a light, I would just like to display a simple statement through the board saying, "the signal from the CM11a has been received." So my question is: is it possible to run a program written in C# on a Spartan-3 Starter Kit? If so, what will I need to use to run it?

Thanks in advance,

Aaron

Article: 100661
Aaron wrote:
> I am a graduating college senior who is working on his seminar
> project. I am trying to get a CM11a controller (x10 technology) to talk
> to a Spartan 3 starter kit. I've written a programm in C# that allows
> me to cut on a light via that controller. My goal is to download this
> to the Spartan board, but instead of turning on a light, I would just
> to display a simple statement through the board saying, "the signal
> from the CM11a has been receive." So my question is, is it possible to
> run a program written in C# on a Spartan 3 Starter Kit? If so what
> will I need to use to run it.

Hmm... Short answer: no.

It sounds like you want to use an FPGA as a microcontroller. This is normally done using the MicroBlaze soft core (the Xilinx EDK). The EDK toolchain (gcc based) allows you to code in C. No one (at least that I know of) has a .NET engine running on a soft CPU. I am sure it's possible by porting the work done on the Mono project, but this would be a waste of time.

C# is more of a macro-level (like VB) development tool. I (at least in my opinion) don't feel that it is a serious tool for embedded programming, primarily because it was not designed to give you unprotected access to memory: everything goes through managed dynamic allocation, garbage collection, etc. It is critical in embedded systems to be able to talk directly to hardware registers, and C# simply can't do this (at least not in any way that is simple and easy).

If you are still interested in just doing the work in C, get ahold of the EDK. It's not that hard to get a simple system running using the Base System Builder. Writing your own custom peripherals will take you a lot longer if you don't know an HDL.

I think you may be better off using a small 8-bit microcontroller (8051, PIC, etc). You will actually get your project done in time.

-Eli

Article: 100662
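For contrast, the kind of direct register access Eli is talking about is a one-liner in C. This is a generic sketch, not EDK or MicroBlaze code: the register name, bit positions, and the example bus address are all invented for illustration, and a plain variable stands in for the hardware address so the snippet can run on a host machine.

```c
#include <assert.h>
#include <stdint.h>

/* On real hardware this macro would cast a fixed bus address, e.g.
 *   #define CTRL_REG (*(volatile uint32_t *)0x40000000u)
 * Here a file-scope variable stands in so the code runs anywhere. */
static uint32_t fake_ctrl_reg;
#define CTRL_REG (*(volatile uint32_t *)&fake_ctrl_reg)

#define CTRL_ENABLE (1u << 0)   /* hypothetical 'enable' bit      */
#define CTRL_IRQ_EN (1u << 3)   /* hypothetical 'irq enable' bit  */

/* Bring the peripheral up with interrupts masked; returns the
 * final register value (just CTRL_ENABLE). */
uint32_t ctrl_init(void)
{
    CTRL_REG = 0;                           /* known state           */
    CTRL_REG |= CTRL_ENABLE | CTRL_IRQ_EN;  /* set bits 0 and 3      */
    CTRL_REG &= ~CTRL_IRQ_EN;               /* mask interrupts again */
    return CTRL_REG;
}
```

The read-modify-write through a volatile pointer is exactly the kind of raw memory access that C#'s managed model (safely) forbids, which is the point being made above.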
I don't think I've posted anything in years, but I just couldn't resist adding to this one because I played with it for some time. As the previous posters said, it depends on the device. But I'd also add (in some detail) that it also depends even within one device.

To answer the question most directly, an 8:1 mux requires two slices in Virtex IV or 2 Adaptive Logic Modules (ALMs) in Stratix II. But whether you actually get that in a full system depends on the structure of your design. The Virtex IV version is easy to see because it's just the output of the F6 mux provided as dedicated hardware. Spartan III is a cost-reduced Virtex IV, so it should behave identically.

In Stratix II we can do it without the need for dedicated hardware, but it's a bit trickier to synthesize. For Z = mux(d0,d1,d2,d3,d4,d5,d6,d7; s0,s1,s2), synthesis will give you:

  y0 = mux(d0,d1,d2; s0,s1)
  y1 = mux(d4,d5,d6; s0,s1)

which are two 5-input functions that pack into a single ALM. In the second ALM:

  z0 = (s0 & s1 & d3) # !(s0 & s1) & y0
  z1 = (s0 & s1 & d7) # !(s0 & s1) & y1

and Z = mux(z0,z1; s2) will be generated using 7-LUT mode. I attached Verilog at the end if you want to run it through Quartus; look at the result in the equation file and you will see what I just described. Note that depending on what else is in the design the 5-LUTs might get packed or synthesized differently, i.e. Quartus may prefer to pack the two 5-LUTs with two unrelated 2- or 3-LUTs to make two 7-input ALMs rather than one 8-input ALM and a second 6-input ALM, or may synthesize differently at the cost of area to hit a delay constraint.

On older devices (Altera Stratix, Cyclone; Xilinx Spartan I, 4000) and on MAX II and Cyclone II, you can basically use "4-LUT" in the discussion below, though it will depend on other issues in practice. I haven't thought about PTERM devices like MAX 7000. But this brings me to the bigger discussion.
I would stress that in practice it makes a big difference what the surrounding context is, and also whether you have more than one mux in your design, because in a mux system like a barrel shifter or crossbar the amortized cost of k muxes in Stratix II is less than k times the cost of one (which is a benefit over Virtex IV).

In a generic 4-LUT architecture with no dedicated hardware, a simple 2:1 mux is a 3-input function and takes one LUT (with one input going unused). A 4:1 mux would take two LUTs, not three (exercise to the reader; it's easier than the 8:1 above). An 8:1 mux requires five vanilla 4-LUTs because it's two 4:1 muxes and one 2:1. But it's arguably something like 4.5 LUTs (see two paragraphs down).

I already mentioned the Virtex IV hardware. Stratix and some earlier Altera architectures have hardware that facilitates other special cases, e.g. a set of mux(a,b,c,0; s0,s1) can be implemented in a LAB cluster by stealing functionality from the LAB-wide SLOAD hardware before the DFF. So you can fit a restricted 4:1 mux in one LE instead of two (that's the "basically" in the above).

When I said context I meant this: if an 8:1 mux is followed by an AND gate (e.g. Z = mux(d0,...,d7; s0,s1,s2) & e), then the AND gate would be a "free" addition to the five-4-LUT implementation in the vanilla architecture (because there's a leftover input on the last LE), but would cost a new LE using the Virtex IV hardware. So F5 gives a maximal 20% savings for a lone 8:1 mux, but depending on the surrounding logic the relative benefit could disappear. That's not a deficiency, you just can't count on getting the benefit in all cases. Note that if it's a 3-input AND gate, the situation reverses and the dedicated hardware is again ahead by one LE.

In reality, though, you probably don't care about one simple mux; you care about systems of muxes that consume huge numbers of LUTs.
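The "4:1 mux in two 4-LUTs" exercise works by smuggling a select bit through the first LUT. Here is a quick C sketch of the trick (my own illustration, not from the post), with an exhaustive check against a reference mux:

```c
#include <assert.h>

/* First 4-LUT (inputs d0, d1, s0, s1): when s1=0 it computes
 * mux(d0,d1; s0); when s1=1 it simply forwards s0 so the second
 * LUT can reuse it as its select. */
static unsigned lut1(unsigned d0, unsigned d1, unsigned s0, unsigned s1)
{
    return s1 ? s0 : (s0 ? d1 : d0);
}

/* Second 4-LUT (inputs y, d2, d3, s1) completes the 4:1 mux:
 * six logical inputs, two physical 4-LUTs. */
unsigned mux4(unsigned d0, unsigned d1, unsigned d2, unsigned d3,
              unsigned s0, unsigned s1)
{
    unsigned y = lut1(d0, d1, s0, s1);
    return s1 ? (y ? d3 : d2) : y;
}

/* Exhaustively compare against a reference 4:1 mux over all 64
 * input combinations; returns 1 if the decomposition is correct. */
int mux4_check(void)
{
    for (unsigned v = 0; v < 64; v++) {
        unsigned d[4] = { v & 1, (v >> 1) & 1, (v >> 2) & 1, (v >> 3) & 1 };
        unsigned s0 = (v >> 4) & 1, s1 = (v >> 5) & 1;
        if (mux4(d[0], d[1], d[2], d[3], s0, s1) != d[2 * s1 + s0])
            return 0;
    }
    return 1;
}
```

The same routing-a-select-through-a-LUT idea generalizes to the chained F5/F6 structures discussed above.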
For example, a simple 16-bit barrel shifter out[15:0] = in[15:0] << k[3:0] results in 16 16:1 muxes, i.e. 16x5 4:1 muxes = 16x5x2 LUTs = 160 LUTs synthesized in the obvious way, or 16x4 2:1 muxes = 64 LUTs synthesized properly into an n*log(n) shifter network of 2:1 muxes. The Virtex hardware would get some savings from this vs. the vanilla 4-LUT, but it bounces between 0 and 20% based on round-off and arrangement issues in the size of the barrel shifter, and because the advantage is lost for all the shifter bits that source a zero in the shifter network and go non-symmetric.

I should mention that it's also not technically correct to compare #LEs in the presence of any dedicated hardware, because you use fewer LEs but the cost of an LE changes. From the architecture point of view you have to multiply #LEs * sizeof(LE) (even better, #LABs * sizeof(LAB) or #CLBs * sizeof(CLB)) to evaluate whether the hardware is beneficial to put in the device (or simply compare the dollar price of the smallest Virtex IV or Stratix II device your complete design fits in).

Although 64 LEs from a simple one-line statement sounds like a lot, it's actually worse, because usually in and out are w-bit words, so everything gets repeated w times. A properly synthesized w=16, 16x16 barrel shifter, for example, requires 16x64 = 1024 4-LUTs. The dedicated hardware in Virtex gets 16x58 half-slices, or about 9% better than a 4-LUT implementation, and Stratix II can do this in 16x32 ALUTs, or 50% fewer (see full data below).

Note that a rotating barrel shifter (the second version I attached code for) will require more resources in both. This is because of the wrap-around data: none of the muxes collapse due to zeroed inputs. You can see this in an ALU, but the zero-padded version will be more common in commercial designs.

On to crossbars. A crossbar is like a barrel shifter, except that you can't re-synthesize it into a shifter network; you're stuck with the k k:1 muxes.
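The n*log(n) shifter network mentioned above can be modelled in a few lines of C (my sketch, separate from the attached Verilog): each stage is a rank of 16 2:1 muxes that either shifts the word by a power of two or passes it through, so four stages of 16 muxes gives the 16x4 = 64 mux count.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit zero-filling barrel shifter as a 4-stage mux network.
 * Stage i shifts by 2^i when bit i of k is set; in hardware each
 * stage is 16 two-input muxes sharing one select bit. */
uint16_t barrel16_net(uint16_t in, unsigned k)
{
    uint16_t x = in;
    for (unsigned i = 0; i < 4; i++)
        if ((k >> i) & 1u)
            x = (uint16_t)(x << (1u << i));  /* shift by 1, 2, 4, 8 */
    return x;
}
```

For any k in 0..15 the result equals (in << k) truncated to 16 bits; the zero bits shifted in at each stage are what let synthesis collapse muxes in the zero-padded (non-rotating) case.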
So a 16x16 crossbar with 16 4-bit select inputs actually requires 16 independent 16:1 muxes, again times the data width. Because there is no re-expression of this that isn't a plain mux, the F5 and F6 hardware should be more beneficial here on average (closer to the 20%).

When we designed the Stratix II architecture, we spent a lot of time looking at crossbar, barrel shifter and multiplexor structures. But you might have figured that out by now. What we came up with is particularly beneficial for systems with many muxes: the sub-linear growth I mentioned earlier. The Stratix II ALM is an 8-input fracturable logic block that can implement (among other combinations not listed):

a) two independent 4-LUTs
b) an independent 5-LUT and 3-LUT
c) two 5-LUTs that share 2 common inputs
d) a single 6-LUT
e) some 7-LUTs
f) two 6-LUTs that have 4 common inputs and additionally the same LUT-mask

Note that for (a) an ALM is (all other things being equal) equivalent to two Stratix LEs or one Virtex slice, for (b,f) it's always better, and for (c,d,e) usually but not guaranteed to be better. But you can find this in the ALM vs. slice discussion from a year or two ago.

Way off topic, but even the word "better" is a bit abstract: it's dependent on other issues like the tech-mapping algorithm and the relative routability of the device and Si area. For example, though an nxn xbar might fit in f(n) cells, a (2n)x(2n) may not fit in the optimal number f(2n) of cells, because a lack of routability in the device forces the placer to spread the design out. E.g. interconnect doesn't scale as smoothly in older architectures like Altera Apex or Xilinx 4000 (we've gotten better at it, but it's also a function of modern designs).

Since a 4:1 mux is a 6-input function, it can fit in one ALM. With the tricks described above using (c) and (e), an 8:1 fits in two ALMs.
A 16:1 mux requires 4 ALMs + a 2:1 mux, which is 4.5 ALMs (though, again, the 3-input function has two or more additional inputs to absorb more logic, so you could argue this is 4.25 ALMs instead of 4.5).

Item (f) is where the real benefit comes in for muxes. The decomposition of crossbars and barrel shifters into primitive muxes results in large numbers of 4:1 muxes that have either (i) similar data and common select bits, in the case of barrel shifters, or (ii) common data and different select bits, in the case of xbars. By the latter I mean mux(a,b,c,d; s0,s1) and mux(a,b,c,d; t0,t1). Not by coincidence, this fits the template of two 6-input functions with 4 common inputs and the same LUT-mask, so a single ALM can implement two 4:1 muxes arising from such a mux system. That makes it roughly 2X the efficiency for powers of 4 and between 1.5X and 2X for odd powers of 2 (i.e. 8:1).

That's a generalization, because it also depends on whether barrel shifters are rotating or shift in zeros, and whether all the outputs are used (in packet processing you might do a 3n->2n type shifter, so some of the bits get dropped). Same as the discussion above on F5 and F6: as soon as you introduce 0's on the mux inputs you have leftover neighbouring logic to slurp up, and the numbers get fuzzy.

But we can at least look at the bottom line of all this using output from Quartus II and ISE. I ran this more than a year ago, so both tools have newer versions.

16x16 zero-shifting barrel shifter:
  Cyclone, Stratix (4-LUT):  64 LUTs (LEs)
  Virtex IV:                 59 half-slices (packs to 47 slices)
  Stratix II:                32 ALUTs (half-ALMs) (packs to 23 ALMs)

16x16 xbar:
  Cyclone, Stratix (4-LUT):  160 LEs
  Virtex IV:                 128 half-slices
  Stratix II:                88 half-ALMs

(For w-bit datapaths just take all the numbers and multiply by w.)

Again, I included the Verilog below in case someone says I'm cheating, and both ISE and Quartus are available in free versions. So try it yourself.
Note that neither Quartus nor ISE will guarantee perfect packing (half-slice to slice, or ALUT to ALM). This is due to things like the placer choosing to split up two sub-blocks that could be packed, in order to improve delay, or other reasons. For example, ISE used 47 slices to implement the 59 half-slices after placement, but at least some of the 35 unused half-slice partners are likely available to be packed with 2-, 3- or 4-input functions from elsewhere in the design, were the design bigger. Quartus II uses 23 ALMs for the 32 ALUTs, meaning that 6 ALUTs are still potentially available for other logic without consuming further ALMs.

For a common sub-design like a SPI4.2 PHY interface, the component pieces contain modules such as I mentioned above: an M-bit xbar into a 2M-1:1 shifter into a 3M-bit buffer from which 2M bits are selected. I synthesized such a design in each of Stratix, V4 and Stratix II:

  Stratix:    907 LEs
  Virtex IV:  1368 half-slices (741 full slices after placement)
  Stratix II: 536 ALUTs (514 ALMs after placement)

(Sorry, can't provide Verilog for this one because it's part of the IP core.)

You have to treat the synthesis of small designs carefully. The XST solution is non-optimal for Virtex IV: I can hand-map this design into the hardware and use fewer slices. For example, it's nearly trivial to get the 907 that I got in Stratix, though that also uses the 3:1 mux trick I mentioned above, but XST isn't doing it for some reason.

Finally, bus muxes. This is when you have e.g. a simple 8:1 mux where all the inputs are 16 bits wide. Synthesis often re-structures these for delay vs. area tradeoffs, because you can play games with the selects to amortize different structures through the datapath. So be careful trying to analyze these for area out of context. There are a couple of publications on this that I listed below. The FPL paper below also talks about crossbar and barrel shifter synthesis into the ALM.
I also didn't understand the question about sharing LUTs, but I agree with the previous poster that the answer is probably "no" all around. You might mean resource sharing, as in making the mux iterative / multi-cycle, but that would probably be more expensive in area. In terms of delay, you can always pipeline. Also, as someone else said, a multiplier can be used for a barrel shifter (multiply the data by a one-hot decoding of k) if you have no other purpose for the dedicated DSP block.

All this information is in published papers; below are some references. The first three are on the general mux synthesis topic. The other two are on the Stratix II ALM and architecture and discuss some of the barrel-shifter/xbar material I repeated above.

Paul Metzgen and Dominic Nancekievill, "Multiplexor Restructuring for FPGA Implementation Cost Reduction", Design Automation Conference, June 2005.

Dominic Nancekievill and Paul Metzgen, "Factorizing Multiplexers in the Datapath to Reduce Cost in FPGAs", IWLS, June 2005.

Jennifer Stephenson and Paul Metzgen, "Logic Optimization Techniques for Multiplexors", Mentor user2user conference, 2004.

Mike Hutton, Jay Schleicher, David Lewis, Bruce Pedersen, Richard Yuan, Sinan Kaptanoglu, Gregg Baeckler, Boris Ratchev, Ketan Padalia, Mark Bourgeault, Andy Lee, Henry Kim and Rahul Saini, "Improving FPGA Performance and Area Using an Adaptable Logic Module", Proc. 14th International Conference on Field-Programmable Logic, Antwerp, Belgium, pp. 135-144, Sept 2004. LNCS 3203.

David Lewis, Elias Ahmed, Gregg Baeckler, Vaughn Betz, Mark Bourgeault, David Cashman, David Galloway, Mike Hutton, Chris Lane, Andy Lee, Paul Leventis, Sandy Marquardt, Cameron McClintock, Ketan Padalia, Bruce Pedersen, Giles Powell, Boris Ratchev, Srinivas Reddy, Jay Schleicher, Kevin Stevens, Richard Yuan, Richard Cliff and Jonathan Rose, "The Stratix II Routing and Logic Architecture", 2005 Int'l Symposium on FPGAs (FPGA, Feb 2005).
Regards,
Mike Hutton
Altera Corp
San Jose CA
<firstinitial><lastname>@altera.com

Note: Please don't bother sending email to the yahoo account in the header, I won't read it. My real email is in the signature.

------------------------------

Here's the Verilog for the 8:1 mux, barrel shifters and crossbar.

// Simple 8:1 mux
// M. Hutton, Altera Corp, 2006
module mux(in, out, s, clk);
  input  [7:0] in;
  input  [2:0] s;
  input        clk;
  output       out;
  reg          out;

  always @(posedge clk) begin
    case (s)
      3'b000: out <= in[0];
      3'b001: out <= in[1];
      3'b010: out <= in[2];
      3'b011: out <= in[3];
      3'b100: out <= in[4];
      3'b101: out <= in[5];
      3'b110: out <= in[6];
      3'b111: out <= in[7];
    endcase
  end
endmodule

// Simple barrel shifter with no rotation
// M. Hutton, Altera Corp, 2003
module barrel (data_in, data_out, shift_by, clk);
  input  [15:0] data_in;
  input  [15:0] shift_by;
  input         clk;
  output [15:0] data_out;
  reg    [15:0] data_out;
  reg    [15:0] reg_data_in;
  reg    [15:0] reg_shift_by;

  always @(posedge clk) begin
    reg_data_in  <= data_in;
    reg_shift_by <= shift_by;
    data_out     <= reg_data_in << reg_shift_by;
  end
endmodule

// Simple 16-bit barrel shifter with rotation
// Mike Hutton, Altera Corp, 2003
module barrel16 (data_in, data_out, shift_by, clk);
  input  [15:0] data_in;
  input  [15:0] shift_by;
  input         clk;
  output [15:0] data_out;
  reg    [15:0] data_out;
  reg    [15:0] reg_shift_by;
  reg    [15:0] reg_data_in;

  always @(posedge clk) begin
    reg_data_in  <= data_in;
    reg_shift_by <= shift_by;
    case (reg_shift_by)
      4'b0000: data_out <= reg_data_in[15:0];
      4'b0001: data_out <= {reg_data_in[0],    reg_data_in[15:1]};
      4'b0010: data_out <= {reg_data_in[1:0],  reg_data_in[15:2]};
      4'b0011: data_out <= {reg_data_in[2:0],  reg_data_in[15:3]};
      4'b0100: data_out <= {reg_data_in[3:0],  reg_data_in[15:4]};
      4'b0101: data_out <= {reg_data_in[4:0],  reg_data_in[15:5]};
      4'b0110: data_out <= {reg_data_in[5:0],  reg_data_in[15:6]};
      4'b0111: data_out <= {reg_data_in[6:0],  reg_data_in[15:7]};
      4'b1000: data_out <= {reg_data_in[7:0],  reg_data_in[15:8]};
      4'b1001: data_out <= {reg_data_in[8:0],  reg_data_in[15:9]};
      4'b1010: data_out <= {reg_data_in[9:0],  reg_data_in[15:10]};
      4'b1011: data_out <= {reg_data_in[10:0], reg_data_in[15:11]};
      4'b1100: data_out <= {reg_data_in[11:0], reg_data_in[15:12]};
      4'b1101: data_out <= {reg_data_in[12:0], reg_data_in[15:13]};
      4'b1110: data_out <= {reg_data_in[13:0], reg_data_in[15:14]};
      4'b1111: data_out <= {reg_data_in[14:0], reg_data_in[15]};
    endcase
  end
endmodule

// Simple 16-bit crossbar with one-bit width
// M. Hutton, Altera Corp, 2003
module xbar(in, out, s, clk);
  input  [15:0] in;
  input  [63:0] s;
  input         clk;
  output [15:0] out;
  reg    [15:0] out;
  reg    [15:0] out1;
  reg    [15:0] inreg;
  integer       k;

  always @(posedge clk) begin
    inreg <= in;
    for (k = 0; k < 16; k = k + 1) begin
      out1[k] <= inreg[{s[4*k+3], s[4*k+2], s[4*k+1], s[4*k]}];
    end
    out <= out1;
  end
endmodule

Article: 100663
On Sat, 15 Apr 2006 10:45:06 +1200, Jim Granville <no.spam@designtools.co.nz> wrote:

>John Larkin wrote:
>>
>> Since the max serial-slave configuration rate on things like Spartan3
>> chips is, what, 20 MHz or something, you might consider slowing down
>> the CCLK input path, and/or adding some serious hysteresis on future
>> parts. On a pcb, CCLK is often a shared SPI clock, with lots of loads
>> and stubs and vias and such, so may not be as pristine as a system
>> clock. CCLK seems to be every bit as touchy as main clock pins, and it
>> really needn't be.
>
>Wouldn't one expect this to be 'normal design practise'?
>
>I suppose Xilinx missed that obvious feature because there are no
>other Schmitt cells on the die, and even though the CPLDs have this,
>I'm sure their inter-department sharing is like most large companies :)

I have it secondhand (one of my guys tells me) that all S3 inputs have about 100 mV of hysteresis, but that's not enough to improve noise immunity in most practical situations.

It's good that FPGAs keep getting faster, but not all applications need all that speed, and pickiness about clock edge quality can be a real liability in a lot of slower applications.

John

Article: 100664
So how many million-length bit-vector dot products might I be able to do per second? My 3.8 GHz P4 can do 125/sec. I would prefer building a Beowulf cluster if the price/performance were similar (because FPGAs are so foreign to me). Of course, if you tell me 10,000/sec I will become an instant FPGA evangelist, hehe.

Jan: I use a matrix library written in C.

Thanks for your help guys,
AndrewF

Article: 100665
John Larkin wrote:
> On Sat, 15 Apr 2006 10:45:06 +1200, Jim Granville
> <no.spam@designtools.co.nz> wrote:
>
>> John Larkin wrote:
>>
>>> Since the max serial-slave configuration rate on things like Spartan3
>>> chips is, what, 20 MHz or something, you might consider slowing down
>>> the CCLK input path, and/or adding some serious hysteresis on future
>>> parts. On a pcb, CCLK is often a shared SPI clock, with lots of loads
>>> and stubs and vias and such, so may not be as pristine as a system
>>> clock. CCLK seems to be every bit as touchy as main clock pins, and it
>>> really needn't be.
>>
>> Wouldn't one expect this to be 'normal design practise'?
>>
>> I suppose Xilinx missed that obvious feature because there are no
>> other Schmitt cells on the die, and even though the CPLDs have this,
>> I'm sure their inter-department sharing is like most large companies :)
>
> I have it secondhand (one of my guys tells me) that all S3 inputs have
> about 100 mV of hysteresis. But that's not enough to improve noise
> immunity in most practical situations.
>
> It's good that FPGAs keep getting faster, but not all applications
> need all that speed, and pickiness about clock edge quality can be a
> real liability in a lot of slower applications.

True. There is also a slight speed penalty for full Schmitt cells, which is one reason why the speed-at-all-costs FPGA sector ignores the benefits. Still, the news from Steve K is good :)

-jg

Article: 100666
I have not found a decent bit-vector dot product plugin card. I think they usually handle integers or floating point, but not bits in an efficient manner.

Article: 100667
I personally don't like this stackup because you have designated your top and bottom layers as your high-speed layers, when the layers adjacent to GND planes are the best ones. I would move the high-speed layers to the inside instead of your outer two, or even separate the power and GND planes. I have used alternating power and GND and it worked quite well. I do hear that coupling the planes is good for some capacitance-related reasons, however, so I can't comment on that.

Article: 100668
jai.dhar@gmail.com schrieb:
> 1 GND plane in an 8-layer stack? I was under the belief that yes, a
> power plane can serve as a return path for a signal, but it's not
> preferred or equal over a GND plane. I would think partitioning the
> power planes is a safer bet than cutting another GND layer.

Nope. A high-speed circuit needs a low-inductance power supply, and a CMOS circuit is completely symmetric between VCC and GND: the magnitude of electrical effects depends on the maximum of the VCC and GND inductance. Driving a falling output, for example, will make the GND plane bounce up; driving a rising output will make the VCC plane bounce down. Both change the input threshold voltage by the same magnitude, introducing jitter and reducing the noise margin. Therefore design all supplies for the same inductance.

The only difference is that GND usually is common for the whole board, whereas some power supply voltages are only needed in certain areas, so you can have one plane in one half of the board and another in the other half. Islands are possible but very dangerous: remember that you cannot have a high-speed signal cross the island boundary on the adjacent routing layer. Having a separate layer for each supply therefore simplifies routing a lot.

Also consider microvias. They save a lot of area, are great for signal integrity, and do not cost much extra.

Kolja Sulimma

Article: 100669
On a sunny day (14 Apr 2006 21:56:19 -0700) it happened andrewfelch@gmail.com wrote in <1145076979.683557.142540@g10g2000cwb.googlegroups.com>:
>So how many million-length bit-vector dot products might I be able to
>do per second? My 3.8ghz P4 can do 125/sec. I would prefer building a
>beowulf cluster if the price:performance was similar (because fpga is
>so foreign to me). Of course if you tell me 10,000/sec I will become
>an instant fpga evangelist, hehe.
>
>Jan: I use a matrix library written in C.
Ah, OK. You know, this is not a 'saturday afternoon after shopping' thing (it is that now here); I cannot answer just like that. My old boss used to DEMAND to see the whole project, else he did not even want to venture. Because often these things can be broken down, done in different ways. If it was a simple multiply you could see how many n-bit multipliers there are in the largest FPGA, but millions.... And you would somehow have to get the data in and out. Some project. In Virtex (Xilinx knows more) at 500 MHz you can have 512 XtremeDSP blocks with an 18x18 multiplier. 512 x 18 = 9216 bits at a time in 2 nS. Covering a million bits (in a loop) takes about 109 iterations x 2 nS = 218 nS per dot product... Sort of a wild number; you really need to talk to these guys, I have no experience with the Virtex 4. Over to X (or Altera). Budget? Time? All counts.
Article: 100670
andrewfelch@gmail.com schrieb:
> where I must dot product
> many million-length bit vectors (which only change occasionally) with 1
> input vector. Anybody want to estimate the cost, speedup, and value an
> fpga could offer me?
If I understand you correctly, for each vector you want to know at how many places both the vector and the input vector are 1? Your vectors change rarely, the single input vector changes rapidly? I would say that in an FPGA you would not do it one vector at a time but N bits at a time for many vectors in parallel. If your vectors are stored off chip you can process them as fast as you can read them. With the right board, at a rate of a few hundred gigabits per second.

>So how many million-length bit-vector dot products might I be able to
> do per second? My 3.8ghz P4 can do 125/sec.
That's 125 Mbits/s. That is very easy to beat. You probably can get some affordable board with 64-bit 200 MHz SRAM and a small FPGA. This will get you about a factor of ten over the P4. On the other hand, the P4 value seems too low. According to "Hacker's Delight" pages 65ff, counting the number of bits in a 32-bit word takes fewer than 20 instructions. Adding one instruction for the initial AND, plus the loads, index updates and some loop-control instructions, results in about one instruction per bit. This means that a P4 should be able to do a few gigabits per second. And that's without using MMX instructions, which can do the dot product after the first three reductions.

Kolja
Article: 100671
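Kolja's per-word bit counting can be sketched in plain C. The popcount below is the standard divide-and-conquer version discussed in Hacker's Delight, followed by the AND-and-count loop for a binary dot product; the function names are just illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Divide-and-conquer population count of a 32-bit word
   (the classic Hacker's Delight construction). */
static uint32_t popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums  */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums  */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums  */
    return (x * 0x01010101u) >> 24;                   /* total       */
}

/* Binary dot product: the number of positions where both vectors
   are 1.  nwords is the vector length in 32-bit words; a
   million-bit vector is exactly 1000000/32 = 31250 words. */
uint32_t bit_dot(const uint32_t *a, const uint32_t *b, size_t nwords)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++)
        sum += popcount32(a[i] & b[i]);
    return sum;
}
```

With a handful of instructions per 32-bit word, one million-bit dot product is roughly 31250 AND-plus-popcount steps, which is where the "few gigabits per second on a P4" estimate comes from.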
andrewfelch@gmail.com wrote:
> So how many million-length bit-vector dot products might I be able to
> do per second? My 3.8ghz P4 can do 125/sec. I would prefer building a
> beowulf cluster if the price:performance was similar (because fpga is
> so foreign to me). Of course if you tell me 10,000/sec I will become
> an instant fpga evangelist, hehe.
If I understood correctly:
- You have many (what is many for you? 100, 1000, 1000000?) million-bit vectors that are quite 'static' (or at least don't change much compared to your 'input' vector). Let's say you have N of them and that your million-bit vector is in fact 1024*1024 bits long.
- You also have 1 input vector that changes quite often.
- You want the N dot products, which is basically the number of 1s in the bitwise AND of the fixed vectors and the input vector.

To get an estimation of how fast it could be done, N should be known ... or at least a range, because I think the main limitation is gonna be the bandwidth between the host and the card and not the FPGA itself.

An FPGA can do the dot product pretty easily. Imagine you get the vectors 32 bits by 32 bits: first the 32 first bits of 'input', then the 32 first bits of the 'references' one by one. Then the 32 bits after that, and so on. So to enter all the vectors' info for 1 given input vector, you need (N+1)*2^15 cycles. The logic doing the dot product is just an AND bit by bit, a stage that performs the counting of the 32 bits, then a 21-bit adder that stores the result in block RAMs (given that N is sufficiently small to fit the 21-bit results in block RAM; let's say < 16384 for a small FPGA). The logic doing that could easily be pipelined to go at > 100 MHz even in a small cheap Spartan 3, and since you need 2^15 cycles to do a complete vector (if N>>1) that would be 3000/s, and that's in a small FPGA.
Now, use a 128-bit wide DDR2 memory (that's 256 bits in parallel), use a high speed grade to run the whole thing at > 250 MHz, and you get 60,000 of them per second ... But as I said, you need to get the data into the DDR2 memory, organized so that the read is efficient, and that is not so easy: a million-bit vector is 128 KBytes, and getting 60 thousand of them per second is 7.5 GBytes of traffic per second ...

Of course, you need to define N better, and these numbers are just for the first design I can think of with the info you provided. Your mileage may vary. I think it could be done pretty quickly if you hire someone that already has, and has used, a memory controller and whatever controller is needed to input/output the data. And getting data in/out is the real challenge here ...

Sylvain
Article: 100672
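One way to see Sylvain's dataflow is a plain-C model of the streaming scheme he describes: words of the reference vectors arrive one at a time, are ANDed with the matching input word, bit-counted, and accumulated into one running total per vector (the role of the block-RAM accumulators in the FPGA version). The function name and memory layout here are illustrative only, not taken from any actual board design:

```c
#include <stdint.h>
#include <stddef.h>

/* Software model of the streaming AND/count/accumulate scheme:
   input   - the input vector, nwords 32-bit words
   refs    - N reference vectors, vector-major, N*nwords words
   totals  - N running dot-product accumulators (block RAM in HW) */
void stream_dot(const uint32_t *input, const uint32_t *refs,
                uint32_t *totals, size_t n_vectors, size_t nwords)
{
    for (size_t v = 0; v < n_vectors; v++)
        totals[v] = 0;

    for (size_t w = 0; w < nwords; w++) {   /* stream word by word */
        uint32_t in_word = input[w];
        for (size_t v = 0; v < n_vectors; v++) {
            uint32_t x = refs[v * nwords + w] & in_word;
            /* count the ones (a small adder tree in hardware) */
            x = x - ((x >> 1) & 0x55555555u);
            x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
            x = (x + (x >> 4)) & 0x0F0F0F0Fu;
            totals[v] += (x * 0x01010101u) >> 24;
        }
    }
}
```

In the FPGA the inner loop disappears: the AND, the counting stage and the accumulator update are one pipelined operation per memory word, so throughput is set entirely by how fast the reference vectors can be read.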
Steve Knapp (Xilinx Spartan-3 Generation FPGAs) wrote:
> rickman wrote:
> > I have looked at the data sheet and they say very clearly that the
> > Spartan 3 is held in reset until all three power supplies are fully up.
> > But the range of voltages is very wide, with reset being released when
> > the Vcco on Bank four is as low as 0.4 volts.
> >
> > I get a lot of grief from the FPGA firmware designers on every little
> > nit and pick that they don't like about the board design. I need to
> > know that this will keep the FPGA in reset and all IOs tristated
> > whether the various power voltages are above or below the internal
> > reset threshold, up to the point of being configured.
>
> I assume that you are looking at Table 28 on page 54 in the Spartan-3
> data sheet.
> http://www.xilinx.com/bvdocs/publications/ds099.pdf
>
> These are essentially the trip points for the power-on reset (POR)
> circuit inside the FPGA. The trip voltage range is somewhat wide due
> to process variation, etc.
>
> The POR circuit prevents configuration from starting until all three
> power rails are within the trip-point range. The POR can happen
> as early as the minimum voltage levels or as late as the maximum
> limits.
>
> Until the POR is released, all I/Os not actively involved in
> configuration are high-impedance. The HSWAP_EN pin controls whether or
> not internal pull-ups are applied to these I/Os. When HSWAP_EN = High,
> the pull-ups are turned off. Also, the pull-ups connect to their
> associated power rail so you won't see the effect until VCCO ramps up.
Thanks for the info. Yes, I was looking at that table, plus table 30 on the next page. I am concerned about letting the DSP run before the FPGA power is fully up, and also about operating the DSP while the FPGA power has a momentary glitch for whatever reason. The DSP has a separate core voltage from the FPGA and shares the Vcco of 3.3 volts.
The FPGA is configured and operated on the DSP external memory bus, which also connects to the program/data flash memory. I just want to make sure I can defend my power-up and power-glitch operation of the board. When the board is powering up, it is clear that the FPGA is held in reset until the three power rails are somewhere within the trip ranges or above. Then the DSP can hold the PROG_B signal low to continue holding the FPGA in reset until the DSP is happy with the power supplies and is ready to configure the FPGA, without concern that the FPGA will mess up the memory bus. That part seems clear.

But table 30 on page 55 seems to be saying that if Vccint or Vccaux dip below the minimum values, but remain above the reset trip points, the configuration can be corrupted and the FPGA will not be put in reset. In this case should I assume that the IOs can then be in any state and may hang the DSP memory bus? If so, I need to use the PowerOK on the LDO regulators to either halt the DSP or make sure it gets an NMI and runs only from internal memory. I would prefer to be able to keep the DSP running normally and record the power event in memory. I have some concerns about the system power supply design and would like to be able to show clear evidence that the power is not stable rather than having to extrapolate from processor resets.
Article: 100673
On Sat, 15 Apr 2006 17:09:11 +1200, Jim Granville <no.spam@designtools.co.nz> wrote:
>John Larkin wrote:
>> On Sat, 15 Apr 2006 10:45:06 +1200, Jim Granville
>> <no.spam@designtools.co.nz> wrote:
>>
>>>John Larkin wrote:
>>>
>>>>Since the max serial-slave configuration rate on things like Spartan3
>>>>chips is, what, 20 MHz or something, you might consider slowing down
>>>>the CCLK input path, and/or adding some serious hysteresis on future
>>>>parts. On a pcb, CCLK is often a shared SPI clock, with lots of loads
>>>>and stubs and vias and such, so may not be as pristine as a system
>>>>clock. CCLK seems to be every bit as touchy as main clock pins, and it
>>>>really needn't be.
>>>
>>>Wouldn't one expect this to be 'normal design practise'?
>>>
>>>I suppose Xilinx missed that obvious feature, because there are no
>>>other Schmitt cells on the die, and even though the CPLDs have this,
>>>I'm sure their inter-department sharing is like most large companies :)
>>>
>>
>> I have it secondhand (one of my guys tells me) that all S3 inputs have
>> about 100 mV of hysteresis. But that's not enough to improve noise
>> immunity in most practical situations.
>>
>> It's good that FPGAs keep getting faster, but not all applications
>> need all that speed, and pickiness about clock edge quality can be a
>> real liability in a lot of slower applications.
>
> True - there is also a slight speed penalty for the full Schmitt cells,
>so that's a reason why the speed-at-all-costs FPGA sector ignores the
>benefits.
> Still, the news from Steve K is good :)
>
>-jg
>
Right. As noted in another thread, one can always add a deglitch circuit to any input, including clock pins, except for CCLK. So if that's the only one they slow down, we may elect to routinely deglitch system clock inputs except when we really need the speed. I suppose they'll Schmitt the JTAG pins, too; I don't use them, but they seem like great candidates for noise problems.
Purists will argue that once the sacred word "clock" is voiced, we are obliged to drive it appropriately. But it's getting so that a 5 ns rise with a couple hundred mV of noise is not a reliable clock any more, and designing brutally fast, star-distributed clocks into a slow industrial-environment product really doesn't make a lot of sense. I'd hazard that the majority of FPGAs are used at a fraction of their speed capability.

John
Article: 100674
In article <1145050118.699722.123650@e56g2000cwe.googlegroups.com>, <andrewfelch@gmail.com> wrote:
> Hello,
>
> I am a Python programmer writing neural network code with binary firing
> and binary weight values. My code will take many days to parse my
> large data sets. I have no idea how much fpga could help, what the
> cost would be, and how easy it would be to access it from Python. The
> problem is similar to competitive networks, where I must dot product
> many million-length bit vectors (which only change occasionally) with 1
> input vector. Anybody want to estimate the cost, speedup, and value an
> fpga could offer me?
I assume you have looked for algorithmic speed-ups? (Also, FPGAs have different algorithmic speed-ups available than conventional computers do.) Algorithmic speed-ups might be available if:
a) the bit vectors are sparse (i.e. only a small fraction are ones, or a small fraction are zeros)
b) the bit vectors are non-random (e.g. you are matching to shift register sequences, or to highly-compressible sequences that can be described in considerably less data than the raw bit stream)
c) the bit vectors are related (e.g. you are using the neural net to listen to a data stream for a pattern: you don't find it, so you shift by one bit and try again)
d) you can do pruning (e.g. if you don't find any evidence of a match after doing 10% of the sequence, you can abandon that vector and try the next)
e) you can match multiple input vectors instead of just 1 (since most of your conventional processor time is going to be spent waiting around for slow DRAM to get the next memory fetch of megabit matching vectors, you may as well compare it to a few dozen inputs, rather than just one)

As a Python programmer, you will probably find it easier to use C than to learn VHDL/Verilog to the extent you need to implement this. If a single order-of-magnitude speed-up will solve your problems, then changing to a language closer to the metal may be enough and is easy enough to try.

--
David M. Palmer dmpalmer@email.com (formerly @clark.net, @ematic.com)
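The pruning idea (point d) is easy to try in C before touching an FPGA. The sketch below looks for the best-matching reference vector and abandons a candidate as soon as even a perfect score on the remaining bits could not beat the best seen so far; the function names and layout are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* Divide-and-conquer 32-bit population count. */
static uint32_t pc32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (x * 0x01010101u) >> 24;
}

/* Return the index of the reference vector with the largest binary
   dot product against `input`.  A candidate is abandoned (pruned)
   once its partial count plus all remaining bits cannot beat the
   best total seen so far.  refs is vector-major: N * nwords words. */
size_t best_match(const uint32_t *input, const uint32_t *refs,
                  size_t n_vectors, size_t nwords)
{
    size_t best_v = 0;
    uint32_t best = 0;
    for (size_t v = 0; v < n_vectors; v++) {
        uint32_t sum = 0;
        for (size_t w = 0; w < nwords; w++) {
            sum += pc32(refs[v * nwords + w] & input[w]);
            /* the remaining words can add at most 32 bits each */
            if (sum + 32u * (uint32_t)(nwords - w - 1) <= best)
                break;                      /* cannot win: prune */
        }
        if (sum > best) { best = sum; best_v = v; }
    }
    return best_v;
}
```

Whether this wins in practice depends on the data: the bound used here is loose, so the pruning mostly pays off when candidate scores differ strongly, and the worst case still scans everything.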