I have done successful designs which laid memory out in the application to avoid having hardware refresh. One such machine just laid the instruction and data fetches already executed by the clock/timer routine out in a straight line. To make this work, you must put low order address lines on the row portion of the address multiplexor. With memories that have a burst mode this may not always be the highest performance option. CPUs with caches can also cause problems. But in general for small CPUs, this isn't a problem. You may find that the client is willing to take a slightly slower memory, with slightly higher software service latencies, or you may find they won't accept this tradeoff at all. Just depends how cost sensitive the design is. If the cost of a slightly larger and faster fpga isn't a budget stopper, it's probably best not to do this, as it can cause other problems if not careful .... like memory randomly disappearing because some software bug occurred.
Article: 91676
"Subhasri Krishnan" <subhasri.krishnan@gmail.com> wrote > if I read faster than I need to refresh, then I can avoid > refresh altogether. i.e if the refresh period is 64ms and if i access > the data every, say, 20ms then I don't have to refresh. Please tell me > if this is true or if I am getting confused. Asserting RAS causes a row of capacitors to have their charge topped up. If they are above the voltage sense threshold then the have at least some charge, and they are given a full charge. Asserting CAS causes one capacitor to be connected to a column line, and this either drives charge in/out for writing, or senses it for a read. Capacitors are not completely discharged when read, of course. From the points above you can deduce what is going to happen and what needs to be done. So long as every row gets strobed at least once every 64 ms, every capacitor is refreshed. It does not matter if this is done by a refresh cycle (RAS only), or a read/write cycles (RAS then CAS). The original IBM PC had an interrupt routine to do a series of DRAM accesses to refresh the DRAM. It had no DRAM controller at all. If you can arrange your system software so that every row is accessed every refresh period, that should do the trick. If you are doing a non-PC embedded system, the CPU may be running code from ROM most of the time. You could try refreshing the rows with those cycles: i.e. the DRAM gets a RAS during DRAM _and_ ROM cycles, but CAS only for RAM access. The row address will be whatever the CPU address bus is driven to, so obviously you have to make the ROM cycles cover every row. For this reason, it is easier to do if you use the least-significant address bits for the row address. Note that the number of accesses needed is the square root of the DRAM chip size. I don't think refreshing 64K DRAM chips is too bad (256 accesses), but you might not like doing a 16Mbit DRAM chip (2048 accesses).Article: 91677
You should be aware that every DRAM is different in the number of rows that must be accessed and the max period between accesses/refresh.
Article: 91678
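A minimal VHDL sketch of the distributed refresh pacing discussed in the last two posts. The row count (4096), 12-bit row address, and divide value (roughly 64 ms / 4096 rows at an assumed 100 MHz clock) are illustrative, not taken from any particular part's datasheet; a real controller would also arbitrate these requests against normal accesses, or suppress them for rows already hit by ordinary reads/writes:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity refresh_pacer is
  generic (
    -- 64 ms / 4096 rows at 100 MHz is roughly 1562 clocks per row
    REFRESH_DIVIDE : integer := 1562
  );
  port (
    clk         : in  std_logic;
    rst         : in  std_logic;
    refresh_req : out std_logic;               -- pulse: issue a RAS-only cycle
    refresh_row : out unsigned(11 downto 0)    -- 12 bits covers 4096 rows
  );
end refresh_pacer;

architecture rtl of refresh_pacer is
  signal divide_cnt : integer range 0 to REFRESH_DIVIDE - 1 := 0;
  signal row_cnt    : unsigned(11 downto 0) := (others => '0');
begin
  process(clk)
  begin
    if rising_edge(clk) then
      refresh_req <= '0';
      if rst = '1' then
        divide_cnt <= 0;
        row_cnt    <= (others => '0');
      elsif divide_cnt = REFRESH_DIVIDE - 1 then
        divide_cnt  <= 0;
        refresh_req <= '1';            -- time to strobe the next row
        row_cnt     <= row_cnt + 1;    -- wraps naturally after row 4095
      else
        divide_cnt <= divide_cnt + 1;
      end if;
    end if;
  end process;

  refresh_row <= row_cnt;
end rtl;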
So if it's a 64Mb chip (4096 accesses and 64ms between accesses) and I can do some kind of serial reading, then it's better to skip the refresh? I am looking to push the SDRAM to the limit and to get the highest bandwidth. Is there anything other than bank interleaving and getting rid of refresh that can be done to maximize performance? This is my first controller and any suggestion is greatly appreciated.
Article: 91679
motty wrote:
> Mike--
>
> Seems you are telling me to sample the data on the rising edge. This
> is the same clock that the external part is seeing. The external part
> changes data on the rising edge. I can't be sure that data is valid
> then.

The external part can't change its outputs immediately with the rising edge of the clock -- there's always some clock-to-out time. RTFDS. While you're at it, add in some prop delay between the external device and the FPGA. Capturing the "previous" data on the rising edge of the clock is basically how all synchronous systems work.

-a
Article: 91680
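A minimal sketch of what that capture looks like in VHDL; the entity and signal names are illustrative. The external part's clock-to-out plus board propagation delay, rather than zero, is the launch delay that must fit within the clock period minus the capture register's setup time:

-- Capture data launched by the external device on the shared clock.
-- The value registered here is the "previous" data word, i.e. the one
-- the external part drove after the prior rising edge.
library ieee;
use ieee.std_logic_1164.all;

entity ext_capture is
  port (
    clk      : in  std_logic;                     -- shared system clock
    ext_data : in  std_logic_vector(7 downto 0);  -- from the external part
    data_q   : out std_logic_vector(7 downto 0)
  );
end ext_capture;

architecture rtl of ext_capture is
begin
  process(clk)
  begin
    if rising_edge(clk) then
      data_q <= ext_data;  -- clock-to-out + trace delay must be < Tclk - Tsetup
    end if;
  end process;
end rtl;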
You might look at other mfgrs' devices, as timings and setup for multibank accesses can make a huge difference if concurrent reads/writes are to the same device.
Article: 91681
air_bits@yahoo.com wrote:
> A number of the various papers fail to search out the best space time
> tradeoffs. Mistakes like doing 64bit floating point multipliers the hard
> way in an fpga, or doing an FFT/IFFT as wide parallel which isn't
> always the best space time tradeoff.
>
> There are MANY other architectures that can be developed to optimize
> the performance of a particular application to FPGA, beside brute force
> implementation of wide RISC/CISC processor core elements here.
> Frequently bit serial will yield a higher clocking rate (as it doesn't need
> a long carry chain), and doesn't need extra logic for partial sums or
> carry lookahead, so it also delivers more functional units per part, but
> at the cost of latency which can frequently be hidden with the faster
> clock rate and high function density per part. It can also remove
> memory as a staging area for wide parallel functional units, and thus
> remove a serialization imposed by the solution's architecture.
>
> Bit serial operations using Xilinx LUT fifo's can be expensive in both
> power and clock rate reductions, but that is not the only way to use
> LUTs for bit serial memory. Consider using some greycode counters
> and using the LUT's simply as 16x1 rams instead ... faster and less
> dynamic power.
>
> There are lots of ways to get unexpected performance from FPGAs,
> but not by doing it the worst possible way.
>
> Be creative. $30M US of FPGAs and memories can easily build a
> 1-10 Petaflop super computer that would smoke existing RISC/CISC
> designs ... we just don't have good software tools and compilers to
> run applications on these machines, or have developed enough
> programming talent used to getting good/excellent performance
> from these devices.
>
> There are a few dozen better ideas about how to make FPGAs
> as we know them today, into the processor chip of tomorrow,
> but that is another discussion.
>
> Consider distributed arithmetic made FPGA's popular for high
> performance integer applications, and it's not even a basic type
> available from any of the common compilers or HDL's. Consider
> the space time performance of three variable floating point multiply
> accumulate (MAC) algorithms using this approach for large matrix
> operations.
>
> Consider this approach for doing high end energy/force/weather
> simulations using a traditional Red/Black interleave as you would
> use for these applications under MPI. 3, 6, 9, 12 variable MAC's
> are a piece of cake with distributed arithmetic, and highly space
> time efficient. The core algorithms of many of these simulations
> are little more than MAC's, frequently with constants, or near
> constants that seldom need to be changed.
>
> Consider for many applications the dynamic range needed during
> most of the simulation is very limited, allowing systems to be
> built with FP on both ends of the run, and scaled integers in the
> middle of the run, even simplifying the hardware and improving the
> space time fit even more.
>
> The big advantage to FPGAs is breaking the serialization that
> memory creates in RISC/CISC architectures. Memoryless
> computing using pipelined distributed arithmetic is the ultimate
> speedup for many applications, including a lot of computer
> vision and pattern recognition applications.
>
> So read the papers carefully, and consider if there might not be
> a better architecture to solve the problem. If so, take the numbers
> and conclusions presented with a grain of salt.
It can't quite be 'memoryless', but I understand your point; I'm waiting for stacked die FPGAs that have fast/wide memory interfaces to Mbytes of fast xRAM... There is quite a speed/Icc cost to driving all the pin buffers/pcb traces in more normal memories.

Meanwhile, I see more 'opening' of the Cell processor, which could revise some of these FPGA/CPU benchmarks. The Cell might even make a half-decent FPGA simulation engine, for development?

-jg
Article: 91682
g.wall wrote:
> has anyone in the dig. design and reconfig. computing community looked
> seriously at open source hardware design libraries, working toward a
> hardware paradigm similar to that in the open source software community?

Problem 1.

There are ten times as many software designers as digital hardware designers. The average software guy is much better at setting up repositories, web sites and running regression tests than the average hardware guy. The average hardware guy knows enough HDL to get by and maybe enough C language to turn on a circuit board. Standard software development processes like source control and code reuse are much less evolved in the hardware area.

Problem 2.

The average software designer couldn't describe two gates and a flip flop in vhdl or verilog.

-- Mike Treseler
Article: 91683
Hi, Davy -

You may want to browse a number of papers on my web page for coding guidelines and coding styles related to multi-clock design and asynchronous FIFO design. At the web page:

www.sunburst-design.com/papers

Look for the San Jose SNUG 2001 paper: Synthesis and Scripting Techniques for Designing Multi-Asynchronous Clock Designs

Look for the San Jose SNUG 2002 paper: Simulation and Synthesis Techniques for Asynchronous FIFO Design

Look for the second San Jose SNUG 2002 paper (co-authored with Peter Alfke of Xilinx): Simulation and Synthesis Techniques for Asynchronous FIFO Design with Asynchronous Pointer Comparisons

Peter likes the second FIFO style better but the asynchronous nature of the design does not lend itself well to timing analysis and DFT. I prefer the more synchronous style of the first FIFO paper. I hope to have another FIFO paper on my web page soon that uses Peter's clever quadrant-based full-empty detection with a more synchronous coding style.

We spend hours covering multi-clock and Async FIFO design in my Advanced Verilog Class. These are non-trivial topics that are poorly covered in undergraduate training. I have had engineers email me to tell me that their manager told them to run all clock-crossing signals through a pair of flip-flops and everything should work! WRONG!

Regards - Cliff Cummings
Verilog & SystemVerilog Guru
www.sunburst-design.com
Article: 91684
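For reference, the "pair of flip-flops" mentioned above is the single-bit synchronizer sketched below (VHDL, names illustrative). It is reasonable for an individual level signal crossing clock domains, but it does not make a multi-bit bus or a binary FIFO pointer safe, which is exactly why the Gray-code pointer techniques in those papers exist:

-- Two-stage synchronizer for a single asynchronous level signal.
-- Each bit of a multi-bit value would be synchronized independently,
-- so related bits can arrive in different destination clock cycles;
-- that is why this is NOT sufficient for buses or binary FIFO pointers.
library ieee;
use ieee.std_logic_1164.all;

entity sync_2ff is
  port (
    clk_dst : in  std_logic;   -- destination clock domain
    d_async : in  std_logic;   -- signal from the other clock domain
    q_sync  : out std_logic
  );
end sync_2ff;

architecture rtl of sync_2ff is
  signal meta : std_logic := '0';  -- first stage, may go metastable
  signal sync : std_logic := '0';  -- second stage, assumed settled
begin
  process(clk_dst)
  begin
    if rising_edge(clk_dst) then
      meta <= d_async;
      sync <= meta;
    end if;
  end process;

  q_sync <= sync;
end rtl;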
>Meanwhile, I see more 'opening' of the Cell processor

The Cell processor architecture does have some interesting uses, and strong memory bandwidth, which delivers better than impressive performance for its target markets. Architecturally its strengths are also some of its worst weaknesses for building high end machines that would scale well for applications which assume distributed memory. The Cell processor is a next generation CPU to continue Moore's Law.

The FPGAs which follow to target the same high performance computing market will also come with application specific cores and multiple memory interfaces to kick butt in the same markets. These FPGAs with the same die size and production volumes will have the same cost. The large FPGAs today which have similar die sizes are produced in lower volumes at a higher cost, which currently skews the cost effectiveness equation toward traditional CPUs.

Missing are good compiler tools and libraries to even the playing field. Cell will suffer some from that too.
Article: 91685
>Problem 2.
> The average software designer couldn't describe
> two gates and flip flop in vhdl or verilog.

does that even matter for "reconfig. computing"?
Article: 91686
Should have noted that the FpgaC project is still looking for additional developers, and the long term results of this project are still very open to change. It would be great to be able to build a comprehensive set of libraries that allow typical MPI and posix-threaded applications to build and dynamically load/run on multiple FPGA platforms. And to mature the compiler to handle a full traditional C syntax transparently.

I personally would like to see it handle distributed arithmetic transparently, so that it handles the data pipelining of high performance applications well using data flow like strategies. But that is open to the team as a whole, with inputs from the user community.
Article: 91687
air_bits@yahoo.com wrote:
>>Problem 2.
>>The average software designer couldn't describe
>>two gates and flip flop in vhdl or verilog.
>
> does that even matter for "reconfig. computing"?

The OP asked about open source hardware design libraries, not reconfig. computing.

-- Mike Treseler
Article: 91688
Mike Treseler <mike_treseler@comcast.net> writes:
> Problem 2.
>
> The average software designer couldn't describe
> two gates and flip flop in vhdl or verilog.

Problem 3.

The average software designer couldn't describe two gates and a flip-flop in C (or any other programming language), but would instead describe something that synthesizes to a large collection of gates and flip-flops.
Article: 91689
Subhasri krishnan wrote:
>Hey all,
>I am designing (trying to design) an sdram controller (for a PC133
>module) to work as fast as it is possible and as I understand from the
>datasheet, if I read faster than I need to refresh, then I can avoid
>refresh altogether. i.e if the refresh period is 64ms and if i access
>the data every, say, 20ms then I don't have to refresh. Please tell me
>if this is true or if I am getting confused.
>Thanks in Advance.

This is true provided you access every single row, well at least every row you have data in, within the refresh time. This can be used to advantage in video frame buffers, for example, as long as the frame time does not exceed the refresh time. So yes, it can be useful. It doesn't save a lot of memory bandwidth or time, but it can substantially simplify the DRAM controller in your design.

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759
Article: 91690
Hi Anthony,

Data bus signals' tri-state FF not being included in an IOB is for timing reasons. Even when Address Stepping is used, those FFs still do rely on several unregistered signals, and when larger devices are used, the unregistered signals (i.e., FRAME#, IRDY#, etc.) will have to travel a longer distance, thus making it harder to meet the PCI's stringent setup time requirement (3ns for 66MHz PCI and 7ns for 33MHz PCI). Instead, by not including a tri-state FF in IOBs, it allows the tri-state FFs to be placed near the unregistered signals, making it easier to meet setup time. Once those unregistered signals go through LUTs and get captured by a FF, they become registered, and once registered, the registered signal has much more timing margin (15ns for 66MHz PCI and 30ns for 33MHz PCI).

Kevin Brace

Anthony Ellis wrote:
> Hi Kevin,
>
> I can't figure out your explanation. Even if you wanted to step (in clock cycles) the IO enable could still be in the IOB! Using an internal FF, with defined placement and routing, gives control of skew within the same cycle - if you wanted it!
>
> Anthony.

--
Brace Design Solutions
Xilinx (TM) LogiCORE (TM) PCI compatible BDS XPCI PCI IP core available for as little as $100 for non-commercial, non-profit, personal use.
http://www.bracedesignsolutions.com

Xilinx and LogiCORE are registered trademarks of Xilinx, Inc.
Article: 91691
>Problem 3.
>
>The average software designer couldn't describe two gates
>and a flip-flop in C (or any other programming language), but
>would instead describe something that synthesizes to a large
>collection of gates and flip-flops.

in TMCC/FpgaC (and Celoxica, and a number of other C HDL like tools) what you just asked for is pretty easy, and comments otherwise are pretty egocentric bigotry that just isn't justified.

int 1 a,b,c,d;     // four 1-bit values, possibly mapped to input pins
int 1 singlebit;   // describes a single register, possibly an output pin

singlebit = (a&b) | (c&d);   // combinatorial sum of products for ab+cd

I can train most kids older than about 6-10 to understand this process and the steps to produce it. It doesn't take an EE degree to understand or implement. So beating your chest here is pretty childish, at best.
Article: 91692
There is a small setup overhead for the main, but for example this certainly does NOT synthesize "to a large collection of gates and flip-flops" as you so errantly assert cluelessly:

main()
{
    int a:1, b:1, c:1, d:1;
    #pragma inputport (a);
    #pragma inputport (b);
    #pragma inputport (c);
    #pragma inputport (d);

    int sum_of_products:1;
    #pragma outputport (sum_of_products);

    while(1) {
        sum_of_products = (a&b) | (c&d);
    }
}

Produces the following default output (fpgac -S example.c) as example.xnf:

LCANET, 4
PWR, 1, VCC
PWR, 0, GND
PROG, fpgac, 4.1, "Thu Nov 10 19:42:27 2005"
PART, xcv2000ebg560-8
SYM, CLK-AA, BUFGS
PIN, I, I, CLKin
PIN, O, O, CLK
END
SYM, FFin-0_1_0Running, INV
PIN, I, I, 0_1_0Zero
PIN, O, O, FFin-0_1_0Running
END
SYM, 0_1_0Running, DFF
PIN, D, I, FFin-0_1_0Running
PIN, C, I, CLK
PIN, CE, I, VCC
PIN, Q, O, 0_1_0Running
END
SYM, FFin-0_1_0Zero, BUF
PIN, I, I, 0_1_0Zero
PIN, O, O, FFin-0_1_0Zero
END
SYM, 0_1_0Zero, DFF
PIN, D, I, FFin-0_1_0Zero
PIN, C, I, CLK
PIN, CE, I, VCC
PIN, Q, O, 0_1_0Zero
END
SYM, 0_4__a, IBUF
PIN, I, I, a
PIN, O, O, 0_4__a
END
EXT, a, I
SYM, 0_4__b, IBUF
PIN, I, I, b
PIN, O, O, 0_4__b
END
EXT, b, I
SYM, 0_4__c, IBUF
PIN, I, I, c
PIN, O, O, 0_4__c
END
EXT, c, I
SYM, 0_4__d, IBUF
PIN, I, I, d
PIN, O, O, 0_4__d
END
EXT, d, I
SYM, 0_10__sum_of_products-OBUF, OBUF
PIN, I, I, 0_10__sum_of_products
PIN, O, O, sum_of_products
END
EXT, sum_of_products, O
SYM, FFin-0_10__sum_of_products, BUF
PIN, I, I, T0_15L49_0_10__sum_of_products
PIN, O, O, FFin-0_10__sum_of_products
END
SYM, 0_10__sum_of_products, DFF
PIN, D, I, FFin-0_10__sum_of_products
PIN, C, I, CLK
PIN, CE, I, 0_13_L21looptop
PIN, Q, O, 0_10__sum_of_products
END
SYM, FFin-0_13_L21looptop, EQN, EQN=((~I1)+(I0))
PIN, I1, I, 0_1_0Running
PIN, I0, I, 0_13_L21looptop
PIN, O, O, FFin-0_13_L21looptop
END
SYM, 0_13_L21looptop, DFF
PIN, D, I, FFin-0_13_L21looptop
PIN, C, I, CLK
PIN, CE, I, VCC
PIN, Q, O, 0_13_L21looptop
END
SYM, SYMT0_15L49_0_10__sum_of_products, EQN, EQN=((I0*I1)+(I2*I3))
PIN, I3, I, 0_4__a
PIN, I2, I, 0_4__b
PIN, I1, I, 0_4__c
PIN, I0, I, 0_4__d
PIN, O, O, T0_15L49_0_10__sum_of_products
END
EOF
Article: 91693
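For comparison, a roughly equivalent registered sum of products (the "two gates and a flip-flop" from the earlier posts) written in VHDL; this sketch is illustrative and not part of the original FpgaC example:

-- Registered sum of products: (a and b) or (c and d), captured on clk.
-- Two AND gates, one OR gate, one flip-flop.
library ieee;
use ieee.std_logic_1164.all;

entity sop_reg is
  port (
    clk             : in  std_logic;
    a, b, c, d      : in  std_logic;
    sum_of_products : out std_logic
  );
end sop_reg;

architecture rtl of sop_reg is
begin
  process(clk)
  begin
    if rising_edge(clk) then
      sum_of_products <= (a and b) or (c and d);
    end if;
  end process;
end rtl;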
Go back and read the first line of the first post, and you will clearly see the author included reconfigurable computing in the discussion.
Article: 91694
Eric Smith wrote:
> Mike Treseler <mike_treseler@comcast.net> writes:
>
>>Problem 2.
>>
>>The average software designer couldn't describe
>>two gates and flip flop in vhdl or verilog.
>
> Problem 3.
>
> The average software designer couldn't describe two gates
> and a flip-flop in C (or any other programming language), but
> would instead describe something that synthesizes to a large
> collection of gates and flip-flops.

3b, Without realising it.

-jg
Article: 91695
john wrote:
>
> It is for a bidirectional signal: input is registered into IOB, output is also registered there,
> but the duplicated tristate_enable registers don't want to go inside the OLOGIC (Virtex 4).
> Each of them is not that far, but not into the IOB!
>

Last time I tried this with XST 6.3 / Spartan-3, I had to try a few coding variants before all the data registers and tristate controls were properly stuffed into the IOBs from non-structural HDL code.

Below are some simplified (hand edited, uncompiled!!) code snippets from a S3 eval kit RAM test that I posted last fall; for the whole thing see:

ftp://members.aol.com/fpgastuff/ram_test.zip

Code Snippets:

<ports>

ram_addr : out   std_logic_vector(17 downto 0);
ram_dat  : inout std_logic_vector(15 downto 0);

<signals>

--
-- internal ram signals
--
signal addr        : std_logic_vector(17 downto 0);
signal din         : std_logic_vector(15 downto 0);
signal ram_dat_reg : std_logic_vector(15 downto 0);
signal wdat_oe_l   : std_logic;

--
-- IOB attribute needed to replicate tristate enable FFs in each IOB
--
attribute iob of wdat_oe_l : signal is "true";

<code>

--
-- output data bus tristate
--
-- XST seems to want tristates coded like this to push both
-- the tristate control register and the data register into IOB
-- ( had previously been coded as clocked tristate assignment )
--
ram_dat <= ram_dat_reg when wdat_oe_l = '0' else ( others => 'Z' );

--
-- registered RAM I/O
--
process(clk)
begin
  if rising_edge(clk) then

    --
    -- IOB registers
    --
    ram_dat_reg <= tdat(15 downto 0);
    ram_addr    <= taddr;

    --
    -- registered tristate control signal
    -- coded this way, with IOB attribute on wdat_oe_l, so
    -- XST will replicate tristate control and push into IOBs
    --
    if (done_p1 = '0') and ( read_write_p1 = '0') then
      wdat_oe_l <= '0';
    else
      wdat_oe_l <= '1';
    end if;

    --
    -- register input data
    --
    din <= ram_dat;

  end if;
end process;
Article: 91696
> 3b, Without realising it.

The interesting point in this process is that the tools are evolving to hide design issues that are seldom a worry for typical cases like reconfigurable computing on FPGA compute engines.

Every programmer decides how big each and every variable should be. For a machine that has a few gigabytes of memory and a 64bit native word size, using a 64bit variable may be either free, or faster, as it may not take an extra step to sign extend the memory on register load. When programmers move to smaller processors, they quickly learn that when programming a PIC micro, 64bit word sizes just don't work well.

When programmers encounter FPGA compute engines the same processes quickly come into play, and a short mentoring of the newbies to size variables by the bit, or be careful and use char, int, long, and long long properly, isn't that difficult, or even unexpected. If the fpga is a single toy sized fpga, it's no different than programming a PIC micro, as resources are tight, and the programmer will adapt. If the fpga system is 4,096 tightly interconnected XC4VLX200's and the application isn't particularly large, I suspect the programmer writing applications for this fpga based super computer will not have to worry about fit. If they are fine tuning the bread and butter simulations at places like Sandia Labs, I suspect the programmers will have more than enough experience and skill to size variables properly and be very much in tune with space time tradeoffs for applications far more complex than even a typical programmer would consider.

It's reconfigurable computing projects where libraries of designs become very useful, particularly for SoC designs that used to be an EE design task and are rapidly becoming mainstreamed, so that software engineers are the most likely target as the market continues to mature and expand. There will be some dinos that stand in the tar pits admiring the bits as the sun sets on that segment of their employment history.
Article: 91697
vssumesh wrote:
> just one question (not directly related to the topic)... if i write A =
> C + D in the verilog and choose optimize for speed will the tool
> generate the CLA adder ???

Try it and see. Synthesis only guarantees to match a netlist to your code.

-- Mike Treseler
Article: 91698
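A VHDL version of the experiment suggested above, as an illustrative sketch; whether the tool builds a ripple, carry-chain, or carry-lookahead structure depends on the synthesizer, the target architecture, and the optimization goal, not on anything visible in the source:

-- A registered adder; the netlist structure (ripple, carry chain, CLA)
-- is chosen by the synthesis tool for the given speed/area goals.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity add_reg is
  port (
    clk  : in  std_logic;
    c, d : in  unsigned(15 downto 0);
    a    : out unsigned(16 downto 0)
  );
end add_reg;

architecture rtl of add_reg is
begin
  process(clk)
  begin
    if rising_edge(clk) then
      a <= resize(c, 17) + resize(d, 17);  -- keep the carry out
    end if;
  end process;
end rtl;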
air_bits@yahoo.com writes:
> There is a small setup overhead for the main, but for example
> this certainly does NOT synthesize "to a large collection of
> gates and flip-flops" as you so errantly assert cluelessly:
>
> main()
> {
>     int a:1, b:1, c:1, d:1;
>     #pragma inputport (a);
>     #pragma inputport (b);
>     #pragma inputport (c);
>     #pragma inputport (d);
>
>     int sum_of_products:1;
>     #pragma outputport (sum_of_products);
>
>     while(1) {
>         sum_of_products = (a&b) | (c&d);
>     }
> }

Why should a C programmer expect that to synthesize any flip-flops at all? It looks purely combinatorial. How would you write it if you did NOT want a flip-flop, but only a combinatorial output?

Anyhow, I wasn't suggesting that the language couldn't represent a few gates and a flip-flop. My point is that C programmers don't think in those terms, so anything they write is likely to result in really inefficient hardware designs.

For example, typical C code for a discrete cosine transform can be found here:

http://www.bath.ac.uk/elec-eng/pages/sipg/resource/c/fastdct.c

But I suspect that code will synthesize to something at least an order of magnitude larger and an order of magnitude slower than a typical HDL implementation. That doesn't mean that you couldn't write a DCT in C that would synthesize to something efficient; it just means that a normal C programmer *wouldn't* do that. You'd have to train the C programmer to be a hardware designer first, and by the time you've done that there's little point to using C as the HDL, since the whole point of using C as an HDL was to take advantage of the near-infinite pool of C programmers.

Eric
Article: 91699
I wrote:
> Problem 3.
> The average software designer couldn't describe two gates
> and a flip-flop in C (or any other programming language), but
> would instead describe something that synthesizes to a large
> collection of gates and flip-flops.

Jim Granville <no.spam@designtools.co.nz> writes:
> 3b, Without realising it.

Exactly so. It's perhaps less commonly seen in C, since C *only* has low-level constructs, but the vast majority of C++ and Java programmers seem to have no conception of what the compiler is likely to emit for the programming constructs they use.

A former coworker once tried to write C++ code to talk to Dallas one-wire devices. He spent days trying to debug it before someone took pity on him and pointed out that by the time the constructor for one of his objects executed, the entire transaction had timed out.

Eric