Hello All,

While you really should use a custom symbol with all your signal names, it's foolish to hand-generate schematic symbols. Fortunately, I have just the tool for you. I have written a Xilinx .pad to Viewdraw symbol translator. It automatically generates a complete, accurate symbol that contains all the pins on the package. Grounds and core supplies are added as SIGNAL= attributes to keep the size of the symbol within reason.

I lobbied Xilinx for many years to standardize on a common .pad format for all their devices, and I think they have done that now for everything except the PLDs.

Tonight, from my home account, I will post a URL where you can download the translator. It's written in C++, and if someone wants to modify it for other applications I would probably share it in the spirit of an open internet.

Later,
Pete Dudley

"Peter Fenn" <Peter.Fenn@avnet.com> wrote in message news:ee73c6a.-1@WebX.sUN8CHnE...
> Spartan-IIE: I am urgently looking for a (board-level) schematic symbol (preferably ORCAD or VIEWLOGIC) for an XC2S100E-6FT256C Xilinx FPGA. Is anyone in a position to help on this?
> -Thanks in advance :-)
Article: 37676
i want to know if the way does.Article: 37677
Hi,

I would like to know if someone knows strategies for reducing routing (net) delays on Spartan-II. So far, I have treated the synthesis tool (XST), Map, and Par as a black box, but because my design (a PCI IP core) was not meeting Tsu (Tsu < 7 ns), I started to take a closer look at how LUTs are placed on the FPGA. Using Floorplanner, I saw the LUTs being placed all over the FPGA, so I decided to hand place the LUTs using the UCF flow. That was the most effective thing I did to reduce interconnect delay (it reduced the worst Tsu by about 2.7 ns, from 11 ns down to 8.3 ns), but unfortunately I still have to reduce the interconnect delay by another 1.3 ns (worst Tsu is currently 8.3 ns).

Basically, I have two input signals, FRAME# and IRDY#, that are not meeting timing. Here are the two worst violators for FRAME# and IRDY#, respectively.

________________________________________________________________________________

================================================================================
Timing constraint: COMP "frame_n" OFFSET = IN 7 nS BEFORE COMP "clk" ;
 503 items analyzed, 61 timing errors detected.
 Minimum allowable offset is 8.115ns.
--------------------------------------------------------------------------------
Slack:             -1.115ns (requirement - (data path - clock path - clock arrival))
  Source:            frame_n
  Destination:       PCI_IP_Core_Instance_ad_Port_2
  Destination Clock: clk_BUFGP rising at 0.000ns
  Requirement:       7.000ns
  Data Path Delay:   10.556ns (Levels of Logic = 6)
  Clock Path Delay:  2.441ns (Levels of Logic = 2)
  Timing Improvement Wizard

  Data Path: frame_n to PCI_IP_Core_Instance_ad_Port_2
    Delay type         Delay(ns)  Logical Resource(s)
    ----------------------------  -------------------
    Tiopi                 1.224   frame_n
                                  frame_n_IBUF
    net (fanout=45)       0.591   frame_n_IBUF
    Tilo                  0.653   PCI_IP_Core_Instance_I_25_LUT_7
    net (fanout=3)        0.683   N21918
    Tbxx                  0.981   PCI_IP_Core_Instance_I_XXL_1357_1
    net (fanout=15)       2.352   PCI_IP_Core_Instance_I_XXL_1357_1
    Tilo                  0.653   PCI_IP_Core_Instance_I_125_LUT_17
    net (fanout=1)        0.749   PCI_IP_Core_Instance_N3059
    Tilo                  0.653   PCI_IP_Core_Instance_I__n0055
    net (fanout=1)        0.809   PCI_IP_Core_Instance_N3069
    Tioock                1.208   PCI_IP_Core_Instance_ad_Port_2
    ----------------------------  ------------------------------
    Total                10.556ns (5.372ns logic, 5.184ns route)
                                  (50.9% logic, 49.1% route)

  Clock Path: clk to PCI_IP_Core_Instance_ad_Port_2
    Delay type         Delay(ns)  Logical Resource(s)
    ----------------------------  -------------------
    Tgpio                 1.082   clk
                                  clk_BUFGP/IBUFG
    net (fanout=1)        0.007   clk_BUFGP/IBUFG
    Tgio                  0.773   clk_BUFGP/BUFG
    net (fanout=423)      0.579   clk_BUFGP
    ----------------------------  ------------------------------
    Total                 2.441ns (1.855ns logic, 0.586ns route)
                                  (76.0% logic, 24.0% route)
--------------------------------------------------------------------------------

================================================================================
Timing constraint: COMP "irdy_n" OFFSET = IN 7 nS BEFORE COMP "clk" ;
 698 items analyzed, 74 timing errors detected.
 Minimum allowable offset is 8.290ns.
--------------------------------------------------------------------------------
Slack:             -1.290ns (requirement - (data path - clock path - clock arrival))
  Source:            irdy_n
  Destination:       PCI_IP_Core_Instance_ad_Port_2
  Destination Clock: clk_BUFGP rising at 0.000ns
  Requirement:       7.000ns
  Data Path Delay:   10.731ns (Levels of Logic = 6)
  Clock Path Delay:  2.441ns (Levels of Logic = 2)
  Timing Improvement Wizard

  Data Path: irdy_n to PCI_IP_Core_Instance_ad_Port_2
    Delay type         Delay(ns)  Logical Resource(s)
    ----------------------------  -------------------
    Tiopi                 1.224   irdy_n
                                  irdy_n_IBUF
    net (fanout=138)      0.766   irdy_n_IBUF
    Tilo                  0.653   PCI_IP_Core_Instance_I_25_LUT_7
    net (fanout=3)        0.683   N21918
    Tbxx                  0.981   PCI_IP_Core_Instance_I_XXL_1357_1
    net (fanout=15)       2.352   PCI_IP_Core_Instance_I_XXL_1357_1
    Tilo                  0.653   PCI_IP_Core_Instance_I_125_LUT_17
    net (fanout=1)        0.749   PCI_IP_Core_Instance_N3059
    Tilo                  0.653   PCI_IP_Core_Instance_I__n0055
    net (fanout=1)        0.809   PCI_IP_Core_Instance_N3069
    Tioock                1.208   PCI_IP_Core_Instance_ad_Port_2
    ----------------------------  ------------------------------
    Total                10.731ns (5.372ns logic, 5.359ns route)
                                  (50.1% logic, 49.9% route)

  Clock Path: clk to PCI_IP_Core_Instance_ad_Port_2
    Delay type         Delay(ns)  Logical Resource(s)
    ----------------------------  -------------------
    Tgpio                 1.082   clk
                                  clk_BUFGP/IBUFG
    net (fanout=1)        0.007   clk_BUFGP/IBUFG
    Tgio                  0.773   clk_BUFGP/BUFG
    net (fanout=423)      0.579   clk_BUFGP
    ----------------------------  ------------------------------
    Total                 2.441ns (1.855ns logic, 0.586ns route)
                                  (76.0% logic, 24.0% route)
--------------------------------------------------------------------------------

Timing summary:
---------------
Timing errors: 135  Score: 55289
Constraints cover 27511 paths, 0 nets, and 4835 connections (92.1% coverage)
________________________________________________________________________________

Locations of various resources:

FRAME#: pin 23
IRDY#: pin 24
AD[2]: pin 62
PCI_IP_Core_Instance_I_25_LUT_7: CLB_R12C1.s1
PCI_IP_Core_Instance_I_XXL_1357_1: CLB_R12C2
PCI_IP_Core_Instance_I_125_LUT_17: CLB_R23C9.s0
PCI_IP_Core_Instance_I__n0055: CLB_R24C9.s0

Input signals other than FRAME# and IRDY# are all meeting the Tsu < 7 ns requirement, and because I have now figured out how to use IOB FFs, I can easily meet Tval < 11 ns (Tco) for all output signals.

I am using Xilinx ISE WebPack 4.1 (which doesn't come with FPGA Editor), and the PCI IP core is written in Verilog. The device I am targeting is a Xilinx Spartan-II 150K system gate, speed grade -5 part (XC2S150-5CPQ208). I did meet all 33MHz PCI timings with the speed grade -6 part (XC2S150-6CPQ208) when I resynthesized the PCI IP core for speed grade -6 and basically reused the same UCF file with the floorplan (I had to make small modifications to the UCF file because some of the LUT names changed). The reason I really care about the speed grade -5 part is that it is the chip on the PCI prototype board of the Insight Electronics Spartan-II Development Kit. Yes, I wish the PCI prototype board came with speed grade -6 . . .

Because I want the PCI IP core to be portable across different platforms (most notably Xilinx and Altera FPGAs), I am not really interested in making any vendor-specific modifications to my Verilog RTL code, but I won't mind using various tricks in the .UCF file (for Xilinx) or .ACF file (I believe that is the Altera equivalent of the Xilinx .UCF file).

Here are some solutions I came up with.

1) Reduce the signal fanout (currently at 35 globally, but FRAME# and IRDY#'s fanout are 200. What number should I reduce the global fanout to?).

2) Use USELOWSKEWLINES in a UCF file (already tried on some long routes, but it didn't seem to help. I will try to play around with this option a little more with different signals.).

3) Floorplan all the LUTs and FFs on the FPGA (currently, I have only floorplanned the LUTs that violated Tsu, and most of them take inputs from FRAME# and IRDY#.).

4) Use Guide file Leverage mode in Map and Par.

5) Try routing my design 2000 times (that will take several days . . . I once routed my design about 20 times. Beyond 20 iterations, Par seems to get stuck in a certain Timing Score range.).

6) Pay for ISE Foundation 4.1 (I don't want to pay for tools because I am poor), and use FPGA Editor (I wish ISE WebPack came with FPGA Editor.). At least from FPGA Editor, I can see how the signals are actually getting routed.

7) Use a synthesis tool other than XST (I am poor, so I doubt that I can afford one.).

I would like to hear from anyone who can comment on the solutions I just wrote, or who has other suggestions on what I can do to reduce the delays to meet 33MHz PCI's Tsu < 7 ns requirement.

Thanks,
Kevin Brace (don't respond to me directly, respond within the newsgroup)

P.S. Considering that I am struggling to meet 33MHz PCI timings with Spartan-II speed grade -5, how does Xilinx meet 66MHz PCI timings on Virtex/Spartan-II speed grade -6? (I can only barely meet 33MHz PCI timings with Spartan-II speed grade -6 using Floorplanner.) Is it possible to take a signal from an input pin like FRAME# or IRDY# (pin 23 and pin 24 respectively for Spartan-II PQ208), go through a few levels of LUTs, and reach a far away IOB output FF and tri-state control FF like pin 67 (AD[0]) or pin 203 (AD[31]) in 5 ns? (3 ns + 1.9 to 2 ns natural clock skew = 4.9 ns to 5.0 ns realistic Tsu) Can a signal move that fast on Virtex/Spartan-II speed grade -6? (I sort of doubt it, from my experience.) I know that Xilinx uses the special IRDY and TRDY pins in LogiCORE PCI, but that doesn't seem to help FRAME#, since FRAME# has to be sampled unregistered to determine the end of a burst transfer. What kind of tricks is Xilinx using in their LogiCORE PCI other than the special IRDY and TRDY pins? Does anyone know?
Article: 37678
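As a cross-check of the timing report in the Spartan-II PCI post above, the slack formula the report prints, requirement - (data path - clock path - clock arrival), can be evaluated directly. The following is only a minimal Python sketch: the function and variable names are illustrative (they are not produced by any Xilinx tool), and the numbers are copied from the FRAME# and IRDY# entries of that report.

```python
# Sketch: re-derive the OFFSET IN slack figures quoted in the timing report.
# All names here are hypothetical helpers; only the numbers come from the post.

def offset_in_slack(requirement_ns, data_path_ns, clock_path_ns, clock_arrival_ns=0.0):
    """Slack = requirement - (data path - clock path - clock arrival)."""
    return requirement_ns - (data_path_ns - clock_path_ns - clock_arrival_ns)

# FRAME# data path components (ns), in the order listed in the report.
frame_logic = [1.224, 0.653, 0.981, 0.653, 0.653, 1.208]   # Tiopi, Tilo, Tbxx, Tilo, Tilo, Tioock
frame_route = [0.591, 0.683, 2.352, 0.749, 0.809]          # the five net delays

frame_data_path = sum(frame_logic) + sum(frame_route)      # report says 10.556 ns
frame_slack = offset_in_slack(7.000, frame_data_path, 2.441)
irdy_slack = offset_in_slack(7.000, 10.731, 2.441)

print(f"FRAME# data path: {frame_data_path:.3f} ns "
      f"({sum(frame_logic):.3f} logic, {sum(frame_route):.3f} route)")
print(f"FRAME# slack: {frame_slack:.3f} ns")
print(f"IRDY#  slack: {irdy_slack:.3f} ns")
```

Read this way, the "minimum allowable offset" in the report is just data path minus clock path (10.556 - 2.441 = 8.115 ns for FRAME#), and roughly half of each failing data path is routing delay, which is consistent with the post's focus on placement.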
I don't claim to be an expert at all, but according to this EE Times article, if the IP core came from an FPGA vendor like Xilinx or Altera, you are pretty much stuck with their devices, unless the FPGA vendor offers a conversion service (like Altera's HardCopy (started recently) or Xilinx's HardWire (which, if I am correct, was discontinued in 1999)).

http://www.eetimes.com/story/OEG20010907S0103

More bad news for conversion services is that Clear Logic recently lost a key ruling against Altera.

http://www.altera.com/corporate/press_box/releases/corporate/pr-wins_clear_logic.html

I find the ruling somewhat troubling because, assuming that no Altera-made IP is included in the customer's design, should anyone else have any control over the bit stream file you generated with Altera's software? I suppose what Altera wants to say is that because the customer had to agree to a license before using Altera software (like MAX+PLUS II or Quartus), the customer has to use the generated bit stream file in the way agreed to in the software licensing agreement.

However, Clear Logic recently won a patent on their business model of converting a bit stream file directly to an ASIC, and that business model seems to be very similar to Altera's HardCopy, so I expect Clear Logic to sue Altera soon.

http://www.ebnews.com/story/OEG20011108S0031

So, seeing that IP cores from FPGA vendors have strings attached to them, I think it will be safer to use a third party (non-device vendor) IP core if FPGA-to-ASIC conversion is part of the requirements of your application.

Kevin Brace (don't respond to me directly, respond within the newsgroup)

arlington_sade@yahoo.com (arlington) wrote in message news:<63d93f75.0112160047.77f9982e@posting.google.com>...
> Hello all,
>
> If you were to use IP cores, such as Logicore/Alliance cores from
> Xilinx, Megafunction cores from Altera, Inventra cores from Mentor,
> etc. how can you get the RTL verilog/VHDL when you want to convert to
> ASIC ?
>
> Thanks.
Article: 37679
Additionally, I find that a good floorplan will get you to darned close to the performance you can reach doing hand routing with considerably less effort. I'm lazy, I don't want to do more work than is necessary to obtain a desired result (and my customers surely don't want to pay for that last little bit unless there is a darned good reason for it). Nice thing about stopping at floorplanning is that you can still have everything in the mainstream flow. That said, it sure would be nice to be able to lock routing in a hard macro for those few times when you really need it. Austin Lesea wrote: > Bryan, > > Reminds me of the Dilbert Cartoon where they are telling tales of their early > programming years... > > "I remember using assembly code..." > > "That is nothing, I remember using 1's and 0's...." > > "You had zeroes? Wow, we had to use 'lower case l's and upper case 'ohs'..." > > "Bucnh of babies, I only had 1's!" > > Why we as engineers would enjoy pain, and brag about it still amazes me. > > A design that is well architected, self documented, commented, and reliable is > more important to many customers. I prefer to throw all of my energies into > supporting those designs (in hdl's) which now account for 99% of what is being > done out there. > > Austin > > Bryan wrote: > > > So lets talk controversial.... > > > > If Lucent can support hard macros in Epic with hard routing, then why can't > > Xilinx. My application requires it and Xilinx doesn't support it in FPGA > > editor(which was programmed by the same softies as Epic). Oh, I remember > > why they don't support it. Because nobody cares about designs that push the > > limitations of FPGAs. Because everybody else that is making designs for > > Xilinx parts is still in kindergarten finger painting with verilog and hdl. > > Ha, I didn't get my EE degree to be a soft weirdo. Anybody can throw code > > together and get poor performance. 
> > > > flame away kindergarten kids > > > > Bryan > > > > "Peter Alfke" <peter.alfke@xilinx.com> wrote in message > > news:3C1F8AEC.BFD2E067@xilinx.com... > > > This is a friendly and helpful newsgroup, but let's make sure that it does > > not > > > get abused. > > > Lots of textbooks explain how to divide by a power of 2, where the > > remainder is, > > > and how you sign-extend the MSB. Explaining that is not the purpose of > > this > > > newsgroup. > > > > > > Let's use our "bandwidth" for more complex and perhaps controversial > > questions > > > that are not explained in textbooks and data books. > > > > > > Peter Alfke, Xilinx Applications > > > > > > -- --Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email ray@andraka.com http://www.andraka.com "They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759Article: 37680
> A design that is well architected, self documented, commented, and reliable is
> more important to many customers. I prefer to throw all of my energies into
> supporting those designs (in hdl's) which now account for 99% of what is being
> done out there.

Well, Austin, that's all well and good...but one can do, in my opinion, an equally well documented, if not better documented, and certainly more reliable design in schematics. When the synthesis tools change, the design changes. That's hardly reliability. That is, unless you instantiate everything...but that's hardly synthesis...that's just a netlister.

I also believe it's erroneous to claim that HDL code is "self documenting"; that's not reality. It takes care and time to correctly document a design.

If you make the FPGAs big enough and fast enough, synthesis works. If you make the CPUs fast enough, and have enough memory and disk space, Microsoft code works. Same thing. I'm not saying it's right or wrong, but it is a fact.

Regards,
Austin
Article: 37681
Ray:

>2) The carry chain can also be used for a free doubler circuit.
>   However, watch the timing. There exist false paths (that are
>   also quite slow comparatively speaking) introduced by the
>   non-standard use of the carry chain (the chain connections
>   are only used to the next neighbor, not all the way up the
>   chain). Timingwise, the conventional approach seems to yield
>   better propagation delays in combinatorial only shifters, and
>   considerably better times in fully pipelined shifters. This
>   is a good trick to put in your back pocket for those times
>   where the need for density outweighs the needs of the clock
>   cycle.

This is true. I have another barrel shift design that's based on the use of the carry chain. I'll post it to the thread later, but under a new topic heading. But the way I used the carry chain, the false path was a true one. That is, under certain (rare?) circumstances, the carry chain may have to propagate a signal from one end to the other. I'll post an explanation when I put up the VHDL for that barrel shifter (later tonight, maybe). I think this is enough cool circuits for one thread.

> 3) I'd be interested in seeing your layout solution. The layout
>    is not trivial to making this perform well.

No floor planning involved. Here are the statistics when it's implemented by itself with flip-flops on all inputs and outputs. The design is a 16-input barrel shifter, with 3 select inputs giving shifts from 0 to 7 bits. The design is a fall through, as is probably most efficient for this type of barrel shifter. It's placed and routed into a small VirtexE-8, which is a speedy little part. I put a clock period constraint of 5ns on it, but it only got to 165MHz. Still, this isn't bad for a fall through 16-wide barrel shifter with no floor planning and no buffering on the control lines. If I get around to it, I'll convert the source from schematic to (readable) VHDL. The reason it's in schematic form is that I hate to deal with RLOCs in VHDL:

<<<
Design Information
------------------
Command Line   : map -p xcv50e-8-cs144 -o map.ncd arith.ngd arith.pcf
Target Device  : xv50e
Target Package : cs144
Target Speed   : -8
Mapper Version : virtexe -- D.27
Mapped Date    : Tue Dec 18 19:57:47 2001

Design Summary
--------------
   Number of errors:      0
   Number of warnings:    1
   Number of Slices:                  36 out of    768    4%
   Number of Slices containing
      unrelated logic:                 0 out of     36    0%
   Number of Slice Flip Flops:        51 out of  1,536    3%
   Number of 4 input LUTs:            36 out of  1,536    2%
   Number of bonded IOBs:             35 out of     94   37%
   Number of GCLKs:                    1 out of      4   25%
   Number of GCLKIOBs:                 1 out of      4   25%
Total equivalent gate count for design:  696
Additional JTAG gate count for IOBs:  1,728
>>>

<<<
The Number of signals not completely routed for this design is: 0

   The Average Connection Delay for this design is:        0.885 ns
   The Maximum Pin Delay is:                               2.310 ns
   The Average Connection Delay on the 10 Worst Nets is:   1.645 ns
...
--------------------------------------------------------------------------------
  Constraint                                | Requested  | Actual     | Logic
                                            |            |            | Levels
--------------------------------------------------------------------------------
* NET "CLK" PERIOD =  5 nS   LOW 50.000 %   | 5.000ns    | 6.036ns    | 4
--------------------------------------------------------------------------------
>>>

<<<
Constraints cover 276 paths, 0 nets, and 184 connections (92.0% coverage)

Design statistics:
   Minimum period:   6.036ns (Maximum frequency: 165.673MHz)

Analysis completed Tue Dec 18 19:58:13 2001
--------------------------------------------------------------------------------
>>>

Carl

--
Posted from firewall.terabeam.com [216.137.15.2]
via Mailgate.ORG Server - http://www.Mailgate.ORG
Article: 37682
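For readers who want to experiment with the 16-input, 3-select barrel shifter described in the post above, here is a small behavioral model in Python. It is only a sketch of the conventional three-stage arrangement (one 2:1 mux stage each for shifts of 1, 2, and 4 bits). It is not Carl's schematic, it says nothing about carry-chain tricks or placement, and the shift direction (left) and zero fill are assumptions, since the post does not state them. The name `barrel_shift` is hypothetical.

```python
# Behavioral sketch of a conventional logarithmic barrel shifter.
# Assumptions (not from the post): left shift, zero fill, result truncated to WIDTH bits.

WIDTH = 16  # data width given in the post

def barrel_shift(data: int, sel: int, width: int = WIDTH) -> int:
    """Shift `data` left by `sel` (0..7) using three cascaded mux stages."""
    if not 0 <= sel <= 7:
        raise ValueError("sel must fit in 3 bits (0..7)")
    mask = (1 << width) - 1
    value = data & mask
    for stage in range(3):
        # Stage k either passes the value through or shifts it by 2**k,
        # depending on bit k of the select input.
        if (sel >> stage) & 1:
            value = (value << (1 << stage)) & mask
    return value

if __name__ == "__main__":
    for s in range(8):
        print(s, format(barrel_shift(0x0001, s), "016b"))
```

Each stage in the loop corresponds to one level of multiplexers in hardware, which is why a 0 to 7 bit shift needs only three such levels regardless of the data width.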
I agree: one can get large improvements over the placer by hand-placement, but hand-routing rarely provides an improvement and the tiny gains are hardly worth the very painful effort. "Ray Andraka" <ray@andraka.com> wrote in message news:3C2004E5.C4F9C8F8@andraka.com... > Additionally, I find that a good floorplan will get you to darned close to the > performance you can reach doing hand routing with considerably less effort. I'm > lazy, I don't want to do more work than is necessary to obtain a desired result > (and my customers surely don't want to pay for that last little bit unless there > is a darned good reason for it). Nice thing about stopping at floorplanning is > that you can still have everything in the mainstream flow. > > That said, it sure would be nice to be able to lock routing in a hard macro for > those few times when you really need it. > > Austin Lesea wrote: > > > Bryan, > > > > Reminds me of the Dilbert Cartoon where they are telling tales of their early > > programming years... > > > > "I remember using assembly code..." > > > > "That is nothing, I remember using 1's and 0's...." > > > > "You had zeroes? Wow, we had to use 'lower case l's and upper case 'ohs'..." > > > > "Bucnh of babies, I only had 1's!" > > > > Why we as engineers would enjoy pain, and brag about it still amazes me. > > > > A design that is well architected, self documented, commented, and reliable is > > more important to many customers. I prefer to throw all of my energies into > > supporting those designs (in hdl's) which now account for 99% of what is being > > done out there. > > > > Austin > > > > Bryan wrote: > > > > > So lets talk controversial.... > > > > > > If Lucent can support hard macros in Epic with hard routing, then why can't > > > Xilinx. My application requires it and Xilinx doesn't support it in FPGA > > > editor(which was programmed by the same softies as Epic). Oh, I remember > > > why they don't support it. 
Because nobody cares about designs that push the > > > limitations of FPGAs. Because everybody else that is making designs for > > > Xilinx parts is still in kindergarten finger painting with verilog and hdl. > > > Ha, I didn't get my EE degree to be a soft weirdo. Anybody can throw code > > > together and get poor performance. > > > > > > flame away kindergarten kids > > > > > > Bryan > > > > > > "Peter Alfke" <peter.alfke@xilinx.com> wrote in message > > > news:3C1F8AEC.BFD2E067@xilinx.com... > > > > This is a friendly and helpful newsgroup, but let's make sure that it does > > > not > > > > get abused. > > > > Lots of textbooks explain how to divide by a power of 2, where the > > > remainder is, > > > > and how you sign-extend the MSB. Explaining that is not the purpose of > > > this > > > > newsgroup. > > > > > > > > Let's use our "bandwidth" for more complex and perhaps controversial > > > questions > > > > that are not explained in textbooks and data books. > > > > > > > > Peter Alfke, Xilinx Applications > > > > > > > > > > -- > --Ray Andraka, P.E. > President, the Andraka Consulting Group, Inc. > 401/884-7930 Fax 401/884-7950 > email ray@andraka.com > http://www.andraka.com > > "They that give up essential liberty to obtain a little > temporary safety deserve neither liberty nor safety." > -Benjamin Franklin, 1759 > >Article: 37683
I don't know if we should discourage legitimate questions, but it seems like there is an inordinate amount of traffic from lazy students, of the form: "Hi, I need to make a VHDL program that divides by 4, how exactly would that look, line for line? Hurry because it's due Friday." I hope such people aren't actually getting degrees by having the older kids do their homework. -Kevin "Peter Alfke" <peter.alfke@xilinx.com> wrote in message news:3C1F8AEC.BFD2E067@xilinx.com... > This is a friendly and helpful newsgroup, but let's make sure that it does not > get abused. > Lots of textbooks explain how to divide by a power of 2, where the remainder is, > and how you sign-extend the MSB. Explaining that is not the purpose of this > newsgroup. > > Let's use our "bandwidth" for more complex and perhaps controversial questions > that are not explained in textbooks and data books. > > Peter Alfke, Xilinx Applications > >Article: 37684
"Bryan" <bryan@srccomp.com> wrote in message news:3c1fc73b$0$22747$724ebb72@reader2.ash.ops.us.uu.net... > So lets talk controversial.... > > If Lucent can support hard macros in Epic with hard routing, then why can't > Xilinx. My application requires it and Xilinx doesn't support it in FPGA > editor(which was programmed by the same softies as Epic). Just curious, which constraint is it that is driving you to hard macros? I've long wished for better support for the little things. I haven't got it, so I've drifted towards optimizing VHDL to get what I want. > Oh, I remember > why they don't support it. Because nobody cares about designs that push the > limitations of FPGAs. Because everybody else that is making designs for > Xilinx parts is still in kindergarten finger painting with verilog and hdl. My guess is that the volume users of Xilinx chips do care a lot about performance. If all we're doing is emulating what will be full custom at 1/10th speed, then the kindergarten is the way to go. But for designs that go out in volume and want to capture that incredible ease of reprogrammability, but have to worry about a BOM, performance is the only thing. Remember back in the days of XACT and XC2064s when it was still possible to implement your logic in FPGA editor? Ah, those were the days. Carl -- Posted from firewall.terabeam.com [216.137.15.2] via Mailgate.ORG Server - http://www.Mailgate.ORGArticle: 37685
Hi,

Here's a problem that does not qualify as kindergarten stuff. A DCM in an xc2v1000 engineering sample is behaving oddly.

I am using this DCM to generate and deskew a clock and antiphased clock (0 & 180 degrees) for a DDR RAM. It is being fed by a 100MHz crystal buffered with a CY2308 zero-delay clock buffer (i.e., a PLL).

The period of the clock seen at the board is 10 ns, usually. Occasionally, the clock period is lengthened by between 2.5 and 3 ns. The stretch is always in the low half of the cycle. What could be causing this?

Relevant facts:
- Chip is a xc2v1000bg456-4 -ES
- Software is Xilinx Alliance 3.3.08i with the VirtexII device update files and the bitgen patch applied.
- SelectIO type HSTL class II with DCI is being used

I used a 1 gigahertz (5 gigasample) oscilloscope to look at this clock pin, so sampling error isn't it. The oscilloscope showed that sometimes (perhaps 1 of 4 stretched periods) the clock starts to go high at the right time, then changes its mind literally half way through the rise time: it gets to half way between Vol and Voh (or Vref; this is HSTL) and then returns to Vol again. Between 2.5 and 3 ns later, it does a proper rising edge.

I've even tried locking the DCM to different parts of the chip and it made no difference. I've searched the Xilinx answers and found no clues.

Any suggestions gratefully received!

--
David Miller, BCMS (Hons)  | When something disturbs you, it isn't the
Endace Measurement Systems | thing that disturbs you; rather, it is
Mobile: +64-21-704-djm     | your judgement of it, and you have the
Fax: +64-21-304-djm        | power to change that. -- Marcus Aurelius
Article: 37686
Performance of above divide by 3 circuit. Period objective is 5ns, all inputs and outputs are registered (with internal flip-flops). The thing actually achieved 103.907MHz, which is very good for a fall through divide by 3 circuit. No floor planning. Part is a small VirtexE-8:

<<<
Xilinx Mapping Report File for Design 'divide'
Copyright (c) 1995-2000 Xilinx, Inc. All rights reserved.

Design Information
------------------
Command Line   : map -p xcv50e-8-cs144 -o map.ncd divide.ngd divide.pcf
Target Device  : xv50e
Target Package : cs144
Target Speed   : -8
Mapper Version : virtexe -- D.27
Mapped Date    : Tue Dec 18 20:51:33 2001

Design Summary
--------------
   Number of errors:      0
   Number of warnings:    1
   Number of Slices:                  76 out of    768    9%
   Number of Slices containing
      unrelated logic:                 0 out of     76    0%
   Number of Slice Flip Flops:        98 out of  1,536    6%
   Number of 4 input LUTs:            85 out of  1,536    5%
   Number of bonded IOBs:             65 out of     94   69%
   Number of GCLKs:                    1 out of      4   25%
   Number of GCLKIOBs:                 1 out of      4   25%
Total equivalent gate count for design:  1,294
Additional JTAG gate count for IOBs:  3,168
>>>

<<<
The Number of signals not completely routed for this design is: 0

   The Average Connection Delay for this design is:        0.836 ns
   The Maximum Pin Delay is:                               2.583 ns
   The Average Connection Delay on the 10 Worst Nets is:   1.968 ns
--------------------------------------------------------------------------------
  Constraint                                | Requested  | Actual     | Logic
                                            |            |            | Levels
--------------------------------------------------------------------------------
* NET "CLK" PERIOD =  5 nS   LOW 50.000 %   | 5.000ns    | 9.624ns    | 7
--------------------------------------------------------------------------------

1 constraint not met.

Dumping design to file divide.ncd.
>>>

<<<
Constraints cover 8188 paths, 0 nets, and 447 connections (93.1% coverage)

Design statistics:
   Minimum period:   9.624ns (Maximum frequency: 103.907MHz)

Analysis completed Tue Dec 18 20:52:40 2001
>>>

Carl

--
Posted from firewall.terabeam.com [216.137.15.2]
via Mailgate.ORG Server - http://www.Mailgate.ORG
Article: 37687
> ALWAYS > >>want my designs to use IOB flip flops if possible. It seems to me that > That's what you get for using Design Mangler...er...Manager ;-) heh. I find that make does a fair job of managing builds. But then, I always did find CLIs more user friendly than GUIs. Even if you invoke map from the commandline or means other than through DM, packing flops into I/Os is not done unless the -pr flag is supplied. So I suppose DM is following the defaults of map. M. Ramirez's question still holds good -- is there ever a reason not to pack flops into IOBs? -- David Miller, BCMS (Hons) | When something disturbs you, it isn't the Endace Measurement Systems | thing that disturbs you; rather, it is Mobile: +64-21-704-djm | your judgement of it, and you have the Fax: +64-21-304-djm | power to change that. -- Marcus AureliusArticle: 37688
Oops, I meant to say it would be nice to be able to lock routing within the normal design flow for those few times when it is needed. Ray Andraka wrote: > > That said, it sure would be nice to be able to lock routing in a hard macro for > those few times when you really need it. > --Ray Andraka, P.E. President, the Andraka Consulting Group, Inc. 401/884-7930 Fax 401/884-7950 email ray@andraka.com http://www.andraka.com "They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759Article: 37689
We don't use them in a fall through configuration very often. As a point of reference, we have one in a 160 MHz design in a VirtexE-6 that does a 0 to 15 position shift with a 2 clock latency, including rounding at the output. It is 19 bits wide at the output. It is not the critical path in the design. IIRC, there are 3 levels of conventional shifter, a register, then the last layer along with a carry chain for the round. That one is VHDL with the RLOCs in the code. We prefer VHDL for placed designs now because of the capability of the generate statement (the design I was describing is a parameterized generate so it can take arbitrary input and output widths as well as number of clocks of latency). Off-hand, I think with careful layout you could get a 3 layer fall through design using the conventional approach well above 200 MHz in an E-8. Carl Brannen wrote: > Ray: > > >2) The carry chain can also be used for a free doubler circuit. > > However, watch the timing. There exist false paths (that are > > also quite slow comparatively speaking) introduced by the > > non-standard use of the carry chain (the chain connections > > are only used to the next neighbor, not all the way up the > > chain). Timingwise, the conventional approach seems to yield > > better propagation delays in combinatorial only shifters, and > > considerably better times in fully pipelined shifters. This > > is a good trick to put in your back pocket for those times > > where the need for density outweighs the needs of the clock > > cycle. > > This is true. I have another barrrel shift design that's based on the use > of the carry chain. I'll post it to the thread later, but as a new topic > head. But the way I used the Carry chain the false path was a true one. > That is, under certain (rare?) circumstances, the carry chain may have to > propagate a signal from one end to the other. I'll post an explanation when > I put up the VHDL for that barrel shifter (later tonight, maybe). 
> I think this is enough cool circuits for one thread.
>
> > 3) I'd be interested in seeing your layout solution. The layout
> > is not trivial to making this perform well.
>
> No floor planning involved. Here's the statistics when it's implemented by
> itself with flip-flops on all inputs and outputs. The design is a 16-input
> barrel shifter, with 3 select inputs giving shifts from 0 to 7 bits. The
> design is a fall through, as is probably most efficient for this type of
> barrel shifter. It's placed and routed into a small VirtexE-8, which is a
> speedy little part. I put a clock period constraint on it of 5ns, but it only
> got to 165MHz. Still, this isn't bad for a fall through 16-wide barrel
> shifter with no floor planning and no buffering on the control lines. If I
> get around to it, I'll convert the source from schematic to (readable) VHDL.
> The reason it's in schematic form is because I hate to deal with RLOCs in
> VHDL:
>
> <<<
> Design Information
> ------------------
> Command Line   : map -p xcv50e-8-cs144 -o map.ncd arith.ngd arith.pcf
> Target Device  : xv50e
> Target Package : cs144
> Target Speed   : -8
> Mapper Version : virtexe -- D.27
> Mapped Date    : Tue Dec 18 19:57:47 2001
>
> Design Summary
> --------------
>    Number of errors:      0
>    Number of warnings:    1
>    Number of Slices:                  36 out of    768    4%
>    Number of Slices containing
>       unrelated logic:                 0 out of     36    0%
>    Number of Slice Flip Flops:        51 out of  1,536    3%
>    Number of 4 input LUTs:            36 out of  1,536    2%
>    Number of bonded IOBs:             35 out of     94   37%
>    Number of GCLKs:                    1 out of      4   25%
>    Number of GCLKIOBs:                 1 out of      4   25%
> Total equivalent gate count for design:  696
> Additional JTAG gate count for IOBs:  1,728
> >>>
>
> <<<
> The Number of signals not completely routed for this design is: 0
>
>    The Average Connection Delay for this design is:        0.885 ns
>    The Maximum Pin Delay is:                               2.310 ns
>    The Average Connection Delay on the 10 Worst Nets is:   1.645 ns
> ...
>
> --------------------------------------------------------------------------------
>   Constraint                                | Requested  | Actual     | Logic
>                                             |            |            | Levels
> --------------------------------------------------------------------------------
> * NET "CLK" PERIOD = 5 nS LOW 50.000 %      | 5.000ns    | 6.036ns    | 4
> --------------------------------------------------------------------------------
> >>>
>
> <<<
> Constraints cover 276 paths, 0 nets, and 184 connections (92.0% coverage)
>
> Design statistics:
>    Minimum period:   6.036ns (Maximum frequency: 165.673MHz)
>
> Analysis completed Tue Dec 18 19:58:13 2001
> --------------------------------------------------------------------------------
> >>>
>
> Carl
>
> --
> Posted from firewall.terabeam.com [216.137.15.2]
> via Mailgate.ORG Server - http://www.Mailgate.ORG

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com
"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759

Article: 37690
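For readers following the exchange above, the conventional fall-through structure (one layer of 2-to-1 muxes per select bit) can be sketched as a small behavioural model. This is only an illustration in Python, not either poster's design: the function names are hypothetical, and modelling a left rotation rather than a zero-filling shift is an assumption, since the posts do not say which the real circuits do. The second helper simply redoes the report's period-to-frequency conversion.

```python
def barrel_rotate_left(word, shift, width=16, select_bits=3):
    """Rotate `word` left by `shift` using one 2-to-1 mux layer per select bit.

    Assumption: wrap-around rotation is modelled; the posts do not say
    whether the hardware rotates or zero-fills.
    """
    mask = (1 << width) - 1
    word &= mask
    for stage in range(select_bits):
        amount = 1 << stage                  # this layer shifts by 0 or 2**stage
        if (shift >> stage) & 1:             # select bit picks the shifted input
            word = ((word << amount) | (word >> (width - amount))) & mask
    return word


def max_frequency_mhz(min_period_ns):
    """Convert a minimum clock period in ns to a maximum frequency in MHz."""
    return 1000.0 / min_period_ns


if __name__ == "__main__":
    # Three layers cover shifts 0..7, as in the quoted 16-input design.
    print(f"{barrel_rotate_left(0x1234, 5):#06x}")
    # Period taken from the quoted timing report.
    print(f"{max_frequency_mhz(6.036):.3f} MHz")
```

Each loop iteration corresponds to one level of logic, which is why a 0 to 7 shift needs three LUT levels in this style and a 0 to 15 shift needs four.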
Hi,

I would like to know if anyone knows strategies for reducing routing (net) delays on Spartan-II. So far, I have treated the synthesis tool (XST)/Map/Par as a black box, but because my design (a PCI IP core) was not meeting Tsu (Tsu < 7 ns), I started to take a closer look at how LUTs are placed on the FPGA. Using Floorplanner, I saw the LUTs being placed all over the FPGA, so I decided to hand place the LUTs using the UCF flow. That was the most effective thing I did to reduce interconnect delay (it reduced the worst interconnect delay by about 2.7 ns (11 ns down to 8.3 ns)), but unfortunately, I still have to reduce the interconnect delay by 1.3 ns (worst Tsu currently at 8.3 ns). Basically, I have two input signals, FRAME# and IRDY#, that are not meeting timing. Here are the two worst violators for FRAME# and IRDY#, respectively.

________________________________________________________________________________

================================================================================
Timing constraint: COMP "frame_n" OFFSET = IN 7 nS BEFORE COMP "clk" ;
 503 items analyzed, 61 timing errors detected.
 Minimum allowable offset is 8.115ns.
--------------------------------------------------------------------------------
Slack:                  -1.115ns (requirement - (data path - clock path - clock arrival))
  Source:               frame_n
  Destination:          PCI_IP_Core_Instance_ad_Port_2
  Destination Clock:    clk_BUFGP rising at 0.000ns
  Requirement:          7.000ns
  Data Path Delay:      10.556ns (Levels of Logic = 6)
  Clock Path Delay:     2.441ns (Levels of Logic = 2)
  Timing Improvement Wizard

  Data Path: frame_n to PCI_IP_Core_Instance_ad_Port_2
    Delay type         Delay(ns)  Logical Resource(s)
    ----------------------------  -------------------
    Tiopi                 1.224   frame_n
                                  frame_n_IBUF
    net (fanout=45)       0.591   frame_n_IBUF
    Tilo                  0.653   PCI_IP_Core_Instance_I_25_LUT_7
    net (fanout=3)        0.683   N21918
    Tbxx                  0.981   PCI_IP_Core_Instance_I_XXL_1357_1
    net (fanout=15)       2.352   PCI_IP_Core_Instance_I_XXL_1357_1
    Tilo                  0.653   PCI_IP_Core_Instance_I_125_LUT_17
    net (fanout=1)        0.749   PCI_IP_Core_Instance_N3059
    Tilo                  0.653   PCI_IP_Core_Instance_I__n0055
    net (fanout=1)        0.809   PCI_IP_Core_Instance_N3069
    Tioock                1.208   PCI_IP_Core_Instance_ad_Port_2
    ----------------------------  ------------------------------
    Total                10.556ns (5.372ns logic, 5.184ns route)
                                  (50.9% logic, 49.1% route)

  Clock Path: clk to PCI_IP_Core_Instance_ad_Port_2
    Delay type         Delay(ns)  Logical Resource(s)
    ----------------------------  -------------------
    Tgpio                 1.082   clk
                                  clk_BUFGP/IBUFG
    net (fanout=1)        0.007   clk_BUFGP/IBUFG
    Tgio                  0.773   clk_BUFGP/BUFG
    net (fanout=423)      0.579   clk_BUFGP
    ----------------------------  ------------------------------
    Total                 2.441ns (1.855ns logic, 0.586ns route)
                                  (76.0% logic, 24.0% route)
--------------------------------------------------------------------------------

================================================================================
Timing constraint: COMP "irdy_n" OFFSET = IN 7 nS BEFORE COMP "clk" ;
 698 items analyzed, 74 timing errors detected.
 Minimum allowable offset is 8.290ns.
--------------------------------------------------------------------------------
Slack:                  -1.290ns (requirement - (data path - clock path - clock arrival))
  Source:               irdy_n
  Destination:          PCI_IP_Core_Instance_ad_Port_2
  Destination Clock:    clk_BUFGP rising at 0.000ns
  Requirement:          7.000ns
  Data Path Delay:      10.731ns (Levels of Logic = 6)
  Clock Path Delay:     2.441ns (Levels of Logic = 2)
  Timing Improvement Wizard

  Data Path: irdy_n to PCI_IP_Core_Instance_ad_Port_2
    Delay type         Delay(ns)  Logical Resource(s)
    ----------------------------  -------------------
    Tiopi                 1.224   irdy_n
                                  irdy_n_IBUF
    net (fanout=138)      0.766   irdy_n_IBUF
    Tilo                  0.653   PCI_IP_Core_Instance_I_25_LUT_7
    net (fanout=3)        0.683   N21918
    Tbxx                  0.981   PCI_IP_Core_Instance_I_XXL_1357_1
    net (fanout=15)       2.352   PCI_IP_Core_Instance_I_XXL_1357_1
    Tilo                  0.653   PCI_IP_Core_Instance_I_125_LUT_17
    net (fanout=1)        0.749   PCI_IP_Core_Instance_N3059
    Tilo                  0.653   PCI_IP_Core_Instance_I__n0055
    net (fanout=1)        0.809   PCI_IP_Core_Instance_N3069
    Tioock                1.208   PCI_IP_Core_Instance_ad_Port_2
    ----------------------------  ------------------------------
    Total                10.731ns (5.372ns logic, 5.359ns route)
                                  (50.1% logic, 49.9% route)

  Clock Path: clk to PCI_IP_Core_Instance_ad_Port_2
    Delay type         Delay(ns)  Logical Resource(s)
    ----------------------------  -------------------
    Tgpio                 1.082   clk
                                  clk_BUFGP/IBUFG
    net (fanout=1)        0.007   clk_BUFGP/IBUFG
    Tgio                  0.773   clk_BUFGP/BUFG
    net (fanout=423)      0.579   clk_BUFGP
    ----------------------------  ------------------------------
    Total                 2.441ns (1.855ns logic, 0.586ns route)
                                  (76.0% logic, 24.0% route)
--------------------------------------------------------------------------------

Timing summary:
---------------
Timing errors: 135  Score: 55289
Constraints cover 27511 paths, 0 nets, and 4835 connections (92.1% coverage)
________________________________________________________________________________

Locations of various resources:
FRAME#: pin 23
IRDY#: pin 24
AD[2]: pin 62
PCI_IP_Core_Instance_I_25_LUT_7: CLB_R12C1.s1
PCI_IP_Core_Instance_I_XXL_1357_1: CLB_R12C2
PCI_IP_Core_Instance_I_125_LUT_17: CLB_R23C9.s0
PCI_IP_Core_Instance_I__n0055: CLB_R24C9.s0

Input signals other than FRAME# and IRDY# all meet the Tsu < 7 ns requirement, and because I have now figured out how to use IOB FFs, I can easily meet Tval < 11 ns (Tco) for all output signals. I am using Xilinx ISE WebPack 4.1 (which doesn't come with FPGA Editor), and the PCI IP core is written in Verilog. The device I am targeting is the Xilinx Spartan-II 150K system gate speed grade -5 part (XC2S150-5CPQ208). I did meet all 33MHz PCI timings with the Spartan-II 150K system gate speed grade -6 part (XC2S150-6CPQ208) when I resynthesized the PCI IP core for the speed grade -6 part and basically reused the same UCF file with the floorplan (I had to make small modifications to the UCF file because some of the LUT names changed). The reason I really care about the speed grade -5 part is that it is the chip on the PCI prototype board of the Insight Electronics Spartan-II Development Kit. Yes, I wish the PCI prototype board came with speed grade -6 . . .

Because I want the PCI IP core to be portable across different platforms (most notably Xilinx and Altera FPGAs), I am not really interested in making any vendor-specific modifications to my Verilog RTL code, but I don't mind using various tricks in the .UCF file (for Xilinx) or .ACF file (I believe that is the Altera equivalent of the Xilinx .UCF file). Here are some solutions I came up with.

1) Reduce the signal fanout (currently at 35 globally, but FRAME# and IRDY#'s fanout are 200. What number should I reduce the global fanout to?).

2) Use USELOWSKEWLINES in a UCF file (already tried on some long routes, but it didn't seem to help. I will try to play around with this option a little more with different signals.).

3) Floorplan all the LUTs and FFs on the FPGA (currently, I have only floorplanned the LUTs that violated Tsu, and most of them take inputs from FRAME# and IRDY#.).

4) Use Guide file Leverage mode in Map and Par.

5) Try routing my design 2000 times (that will take several days . . . I once routed my design about 20 times. Par seems to get stuck in a certain Timing Score range beyond 20 iterations.).

6) Pay for ISE Foundation 4.1 (I don't want to pay for tools because I am poor), and use FPGA Editor (I wish ISE WebPack came with FPGA Editor.). At least from FPGA Editor, I can see how the signals are actually getting routed.

7) Use a synthesis tool other than XST (I am poor, so I doubt that I can afford one.).

I would like to hear from anyone who can comment on the solutions I just listed, or who has other suggestions on what I can do to reduce the delays to meet 33MHz PCI's Tsu < 7 ns requirement.

Thanks,

Kevin Brace (don't respond to me directly, respond within the newsgroup)

P.S. Considering that I am struggling to meet 33MHz PCI timings with Spartan-II speed grade -5, how does Xilinx meet 66MHz PCI timings on Virtex/Spartan-II speed grade -6? (I can only barely meet 33MHz PCI timings with Spartan-II speed grade -6 using Floorplanner.) Is it possible to take a signal from an input pin like FRAME# or IRDY# (pin 23 and pin 24 respectively for Spartan-II PQ208), go through a few levels of LUTs, and reach a far away IOB output FF and tri-state control FF like pin 67 (AD[0]) or pin 203 (AD[31]) in 5 ns? (3 ns + 1.9 to 2 ns natural clock skew = 4.9 ns to 5.0 ns realistic Tsu) Can a signal move that fast on Virtex/Spartan-II speed grade -6? (I sort of doubt it from my experience.) I know that Xilinx uses the special IRDY and TRDY pins in LogiCORE PCI, but that doesn't seem to help FRAME#, since FRAME# has to be sampled unregistered to determine the end of a burst transfer. What kind of tricks is Xilinx using in their LogiCORE PCI other than the special IRDY and TRDY pins? Does anyone know?

Article: 37691
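The arithmetic behind the FRAME# violation in the report above can be reproduced in a few lines. This is only a sketch: the delay values are copied from the posted data path, while the dictionary keys and the helper name `offset_in_slack` are made up for illustration. The slack formula is the one printed in the report header.

```python
# Delay values (ns) copied from the FRAME# data path in the posted report.
# The key names are illustrative labels, not tool output.
LOGIC_DELAYS = {
    "Tiopi": 1.224,
    "Tilo_1": 0.653,
    "Tbxx": 0.981,
    "Tilo_2": 0.653,
    "Tilo_3": 0.653,
    "Tioock": 1.208,
}
NET_DELAYS = {
    "frame_n_IBUF (fanout=45)": 0.591,
    "N21918 (fanout=3)": 0.683,
    "I_XXL_1357_1 (fanout=15)": 2.352,
    "N3059 (fanout=1)": 0.749,
    "N3069 (fanout=1)": 0.809,
}


def offset_in_slack(requirement_ns, data_path_ns, clock_path_ns, clock_arrival_ns=0.0):
    """Slack as printed in the report: requirement - (data - clock path - clock arrival)."""
    return requirement_ns - (data_path_ns - clock_path_ns - clock_arrival_ns)


if __name__ == "__main__":
    logic = sum(LOGIC_DELAYS.values())
    route = sum(NET_DELAYS.values())
    data_path = logic + route
    print(f"logic {logic:.3f} ns, route {route:.3f} ns, data path {data_path:.3f} ns")
    print(f"slack {offset_in_slack(7.0, data_path, 2.441):.3f} ns")
    worst_net = max(NET_DELAYS, key=NET_DELAYS.get)
    print(f"largest net delay: {worst_net} at {NET_DELAYS[worst_net]:.3f} ns")
```

Laid out this way, the single fanout-15 net accounts for close to half of the routing delay, so it is the most obvious candidate for placement or fanout work.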
I don't claim to be an expert at all, but according to this EE Times article, if the IP core came from an FPGA vendor like Xilinx or Altera, you are pretty much stuck with their devices, unless the FPGA vendor offers a conversion service (like Altera's HardCopy (started recently) or Xilinx's HardWire (which, if I am correct, was discontinued in 1999)).

http://www.eetimes.com/story/OEG20010907S0103

More bad news for conversion services is that Clear Logic recently lost a key ruling against Altera.

http://www.altera.com/corporate/press_box/releases/corporate/pr-wins_clear_logic.html

I find the ruling somewhat troubling because, assuming that no Altera-made IP is included in the customer's design, should anyone else have any control over the bit stream file you generated with Altera's software? I suppose what Altera wants to say is that because the customer had to agree to a license before using Altera software (like MAX+PLUS II or Quartus), the customer has to use the generated bit stream file in the way agreed to in the software licensing agreement. However, Clear Logic recently received a patent on their business model of converting a bit stream file directly to an ASIC, and that business model seems to be very similar to Altera's HardCopy, so I expect Clear Logic to sue Altera soon.

http://www.ebnews.com/story/OEG20011108S0031

So, seeing that IP cores from FPGA vendors have strings attached to them, I think it will be safer to use a third-party (non-device-vendor) IP core if FPGA-to-ASIC conversion is part of the requirements of your application.

Kevin Brace (don't respond to me directly, respond within the newsgroup)

arlington_sade@yahoo.com (arlington) wrote in message news:<63d93f75.0112160047.77f9982e@posting.google.com>...
> Hello all,
>
> If you were to use IP cores, such as Logicore/Alliance cores from
> Xilinx, Megafunction cores from Altera, Inventra cores from Mentor,
> etc. how can you get the RTL verilog/VHDL when you want to convert to
> ASIC ?
>
> Thanks.

Article: 37692
I know this topic is not directly related to this newsgroup, but I noticed that some postings I made to comp.arch.fpga through Google Groups showed up in Google Groups' comp.arch.fpga archive, but weren't there when I checked for my postings through mailgate.org's service (http://www.mailgate.org). Has anyone else noticed this problem?

Kevin Brace (don't respond to me directly, respond within the newsgroup)

Article: 37693
I think that depends on the type of degree. Wouldn't finding a way to get someone else to do the hard work for you be considered reasonable grounds for being awarded an MBA? Kevin Neilson wrote: > I don't know if we should discourage legitimate questions, but it seems like > there is an inordinate amount of traffic from lazy students, of the form: > > "Hi, I need to make a VHDL program that divides by 4, how exactly would that > look, line for line? Hurry because it's due Friday." > > I hope such people aren't actually getting degrees by having the older kids > do their homework. > > -KevinArticle: 37694
Hi everyone,

I have a newbie question: I added /*synthesis syn_maxfan=20*/ to many wire, reg, output, and input declarations. It takes effect on some of them but not on others. Why? Also, if I add this attribute to a wire that drives loads at a higher level of the hierarchy, will the attribute apply to those higher-level loads?

Article: 37695
It is very useful (and necessary) for truly modular design. It is one of the key pieces missing in the current PAR flow. There are times when a way to keep routing in or out of an area would be handy, especially if it were accessible hierarchically in the source. Yes, it is on the wish list, but I get the feeling Santa won't be bringing it this year. Guess I haven't been good enough.

Falk Brunner wrote:
> "Christian Plessl" <plessl@remove.tik.ee.ethz.ch> wrote in message
> news:3c1f4e5f@pfaff.ethz.ch...
>
> > Is it possible to (completely) prohibit the use of routing resources on a
> > specific area of the FPGA?
>
> Why do you want to do so?
>
> --
> Regards
> Falk

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com
"They that give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, 1759

Article: 37696
Hi to all,

I wrote some functions for a CDMA receiver and I want to find the number of MIPS required by each function. How do I calculate it? And which is the more accurate measure, MIPS or MOPS?

More info:
data rate: 2 Mbps
system clock: 50 MHz
oversampling: 4 times
spreading factor: 16

Thanks.

Article: 37697
Hi Ray, No question that messing around with the carries is going to slow down a barrel shifter. It's been a very long time since I used one (okay, it was actually a "funnel shifter", but I hate that term, so I call them all barrel shifters). I just got the one that uses the carries to effectively fit a 3 to 1 mux into a LUT to place and route correctly. It uses 38 LUTs to do a two stage shift of 0 to 8 bits (i.e. a 9-bit barrel shift) on a 16-bit input. It's very efficient when you need a power of 3 shift size. You get a 9-bit shift with only 2 stages of logic instead of the usual 4. I'll post it to the thread in a minute. The reason I made it two stage was simply to prevent the synthesizer from messing with my logic. It's slow enough (due to the full length carry) that it would make more engineering sense as a space saving circuit rather than a highly pipelined design. But I'm doing this for fun, so what the heck. The last rough spot was getting the synthesizer to recognize that a LUT4 wasn't interfering with a MULT_AND. I had to instantiate the LUT4s. I haven't figured out how to apply an attribute to a generated component (i.e. like your RLOC usage on generated components). I'm assuming here that "generated" means the use of the "generate" command in VHDL. The basic problem is that I haven't figured out how to properly address the components. So I went ahead and instantiated them individually. If you have a way around that, I'd appreciate the secret. Carl -- Posted from firewall.terabeam.com [216.137.15.2] via Mailgate.ORG Server - http://www.Mailgate.ORGArticle: 37698
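The idea of covering shifts of 0 to 8 with two 3-way stages can be checked with a short behavioural model. This is a sketch under stated assumptions, not the posted circuit: it assumes the first stage shifts by 0, 1, or 2 and the second by 0, 3, or 6, it models a left rotation, and the function names are hypothetical. It says nothing about how the muxes map onto LUTs or the carry chain.

```python
def rotate_left(word, amount, width=16):
    """Plain left rotation of a `width`-bit word, used as the reference."""
    mask = (1 << width) - 1
    word &= mask
    amount %= width
    return ((word << amount) | (word >> (width - amount))) & mask


def two_stage_rotate(word, shift, width=16):
    """Rotate by 0..8 using two 3-to-1 selections instead of four 2-to-1 layers.

    Assumed split: stage A rotates by shift mod 3 (0, 1 or 2) and
    stage B rotates by 3 * (shift div 3) (0, 3 or 6).
    """
    if not 0 <= shift <= 8:
        raise ValueError("shift must be in the range 0..8")
    stage_a = shift % 3
    stage_b = 3 * (shift // 3)
    return rotate_left(rotate_left(word, stage_a, width), stage_b, width)


if __name__ == "__main__":
    word = 0xBEEF
    for shift in range(9):
        ok = two_stage_rotate(word, shift) == rotate_left(word, shift)
        print(shift, hex(two_stage_rotate(word, shift)), "match" if ok else "MISMATCH")
```

Because rotations add, the order of the two stages does not matter; only the sum of the two selected amounts does.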
This 16x9 barrel shifter uses only 38 LUTs. There are two stages of 3 to 1 muxes, giving a total mux of 9 to 1. The 3 to 1 muxes are implemented using a "trick" that involves the use of the carry logic. This dimension barrel shifter is rather inconvenient for conventional barrel shifters, which will require 4 stages of logic. This one does it in only two stages, but I have to add 2 LUTs for the two mode pins required by each stage, and 1 LUT for the carry input to each stage. Thus the total LUT count is 2 * (16 + 3) = 38. This is considerably below the conventional barrel shifter requirement for this size barrel shifter of 4 * 16 = 64.

I'm putting this up for people to admire and critique. If you're going to use it, I'd note that getting stuff like this to synthesize correctly is not particularly easy. I've tried to include design notes to illustrate the pitfalls that arise when the design is modified. Also, I'd note that it's been a few days since I simulated this logic, and it's possible that in the process of forcing it into its designed number of LUTs I messed something up. This was written for fun, I am not using it in any work. So beware.

library IEEE;
use IEEE.std_logic_1164.all;

-- Efficient Barrel Shifter, fully pipelined.
--
-- 16-input x 9 barrel shifter. Uses two stages of logic.
-- Requires only 39FGs, 30CYs, and 34DFFs.
--
-- Designer: Carl Brannen
--
-- Feel free to modify this circuit and use it in
-- your own designs. I am aware of no patents that
-- it infringes on, but you will have to make your
-- own determination of this. My only request is
-- that you leave a comment to the effect that your
-- knowledge of the algorithm is through me. Of course
-- this is freeware and I make no guarantees that it
-- will be at all helpful to you.
--
-- Synthesize with optimize set for "low", and
-- "area".
-- This circuit is already optimized,
-- the computer will only waste its time (and likely
-- increase the size and delay of the result) if it
-- tries to optimize further.

entity BRL16_8 is
    port (
        CLK:   in  STD_LOGIC;
        DIN:   in  STD_LOGIC_VECTOR(15 downto 0);
        SHIFT: in  STD_LOGIC_VECTOR( 3 downto 0);
        Y:     out STD_LOGIC_VECTOR(15 downto 0)
    );
end BRL16_8;

architecture BRL16_8_arch of BRL16_8 is
--
-- The standard version of a 16x8 barrel shifter uses
-- 48FGs and 51DFFs, and a 16x9 barrel shifter would use
-- 64FGs.
--
-- The usual barrel shifter uses stages that shift by either 0
-- or 2^n bits. This barrel shifter uses stages that shift
-- by three different amounts instead of two.
--
-- The usual barrel shifter stage consists of a vector of 2 to 1
-- muxes. A stage in this barrel shifter is instead, effectively,
-- a 3 to 1 mux. It nevertheless takes up about the same space that
-- the 2 to 1 stage in a regular barrel shifter takes.
--
-- The 3 to 1 muxes can be used in pairs to give a 9 to 1 mux,
-- which is equivalent to three stages of the usual barrel shifter.
--
-- The area reduction is therefore up to 33% of the area used by a
-- regular 8^n size barrel shifter, but it can be more or less for
-- other sizes.
--
-- This barrel shifter type is particularly efficient when the
-- number of bits to be shifted includes a power of 3. For instance,
-- a 9-bit barrel shifter could be accomplished in only 2 stages,
-- while a regular barrel shifter of that length would require
-- a full 4 stages. The example here, however, only does shifts
-- by 8, not by 9. The reason for so restricting it is to get
-- the control circuitry simpler.
--
-- The usual barrel shifter consists of 2 to 1 muxes. These are
-- implemented in LUT3 (i.e. two data inputs and one select input).
-- Since there is a free control pin, it's clear that more functionality
-- can be packed into the LUT. In addition, none of the arithmetic
-- functionality of the slice is being used.
--
-- A 4 to 1 mux would allow 2 "bits" of barrel shifting to happen
-- in a single stage, but it's pretty obvious that that goal is
-- beyond us. There would be 6 inputs required for such a mux, but
-- there are only 4 LUT inputs, and one CIN input.
--
-- But 4 LUT and 1 CIN does give 5 inputs per bit, and this is
-- enough to implement a 3 to 1 mux. So we need to look at designs
-- that use 3 to 1 muxes.
--
-- I'll combine consecutive 3 to 1 mux stages so that together they
-- implement three bits of a standard barrel shifter. The overall
-- functionality will be as follows:
--
-- SHFT[2..0]  0 1 2 3 4 5 6 7 8
--             - - - - - - - - -
-- Shift Amt:  0 1 2 3 4 5 6 7 8
--
-- This functionality can be implemented in two stages using
-- 3 to 1 muxes as follows:
--
-- SHFT[2:0]   0 1 2 3 4 5 6 7 8
--             - - - - - - - - -
-- SHFA[1:0]   0 1 2 0 1 2 0 1 2   (Shift by 0, 1, or 2)
-- SHFB[2:1]   0 0 0 3 3 3 6 6 6   (Shift by 0, 3, or 6)
--             - - - - - - - - -
-- Shift Amt:  0 1 2 3 4 5 6 7 8
--
-- Into each LUT I'll bring 2 mode control bits and two data
-- bits. Two of the shifts will correspond to selecting the
-- two data bits. I'll call these the "mux" shifts. The 3rd
-- shift is passed through the CIN / COUT lines, so it's called
-- the "arithmetic" shift.
--
-- Before continuing, I need to comment on some mathematics.
-- Barrel shifting, as implemented by stages of shifts, for
-- a barrel shifter of width "n", is equivalent to Z_n. That
-- is, it is equivalent to addition in the integers modulo n.
-- Another way of putting this is to say that shifting is additive.
-- If you shift by "S" and then shift by "T", the result is
-- equivalent to a shift by "S+T". The usual barrel shifter
-- uses the binary representation of the shift amount, with
-- stage "m" shifting by either 0 or 2^m. But the same shift
-- result could be accomplished by shifting by other amounts.
-- The thing to remember is that it is additive.
--
-- Because of this, it's useful to know a little bit about Z_n.
-- I'm going to have to use the CIN to bring in one of the
-- shift quantities, so these CIN to COUT chains are going to
-- depend on the structure of the "orbits" of the shift amounts.
-- "Orbit" has a specific meaning to the pure mathematicians,
-- in this case I'm going to redefine it to mean something
-- similar. Don't show this to any mathematicians, they'll
-- undoubtedly be disgusted with my abuse of the language.
--
-- A given shift amount creates an orbit out of a start bit
-- by the sequence of places it shifts that bit to as the shift
-- is repeated. For example, with a barrel shifter width of 8,
-- if the shift amount is 3 bits, the "0" bit is taken through
-- the following sequence: (0, 3, 6, 1, 4, 7, 2, 5), and then
-- back to "0". This "orbit" has 8 elements. The orbit of the
-- other bits, under a shift of 3 bits, is going to look very
-- much the same (i.e. be "isomorphic".) Because of this
-- similarity, I'm only going to look at the orbits of "0".
--
-- The orbits due to other shift amounts may or may not be
-- isomorphic. For example, the orbit of "0" with a shift
-- amount of 1 is (0, 1, 2, 3, 4, 5, 6, 7), which is indeed
-- isomorphic to the orbit of "3". But the orbit of "2" is
-- shorter: (0, 2, 4, 6). Mathematically, the orbit of a
-- shift by n bits in a barrel shifter of width w is of maximum
-- length if n and w have no common divisors.
--
-- The isomorphisms of the orbits is not of significance for
-- the shifts that are performed with the usual multiplexers,
-- but are very important for shifts performed using the
-- arithmetic CIN / COUT wires. The carry logic for that
-- shift forms a chain. A shift of 1 bit always has the same
-- orbit, the one with maximum length, no matter what the
-- size of the barrel shifter: (0, 1, 2, ... WIDTH-1), where
-- WIDTH is the width of the barrel shifter.
-- This orbit is
-- preferred to other possible orbits because I have to make
-- sure that CIN = 0 in order to avoid an unwanted arithmetic
-- operation (i.e. an increment) when performing the regular
-- mux type shifts. This means that I have to control the CIN
-- by forcing it to zero for the two mux shifts, and connecting
-- it to COUT for the one arithmetic shift. If I select an
-- orbit of less than the full Barrel Shift width for the
-- arithmetic shift, I'll have to build more than one copy of
-- the CIN control logic.
--
-- There's another reason for analyzing the orbit structure of
-- barrel shifters. The two mode bits routed to each LUT have
-- some freedom. It's possible that I might be able to arrange
-- for those mode bits to be one of the 3 select inputs, thereby
-- reducing the number of LUTs needed to compute the control
-- lines. I suppose there is also a chance that I might be
-- able to share a control line between the SHFA and SHFB shifts,
-- though it surely seems unlikely.
--
-- The arithmetic shift is accomplished through the CIN input.
-- This means I'll have to use the XORCY output, and therefore
-- I will have to make sure that COUT = 0 (and therefore CIN = 0
-- for the next bit in the orbit) during the mux shift modes.
-- Now one of the data inputs will have to connect to I0 in
-- order for it to be sent out the COUT. This will be the
-- arithmetic shift. When that same data input is to be used
-- as a mux shift (to the current bit), the COUT will automatically
-- be cleared by the arithmetic logic. (If "I0" is the mux /
-- arithmetic data input, the LUT4 will be programmed to output
-- I0 when selecting I0 as the shift amount. The MUXCY, which
-- has the LUT4 output as its selector, will then select I0
-- when the LUT4 output is 0, and CIN when the LUT4 output is
-- 1. Since the LUT4 output in this case is I0, the I0 input
-- will only be selected when it is zero.)
-- But I have to make
-- sure that I0 input to the MUXCY is zero when selecting the
-- other mux shift. The only way to do this for free is to use
-- the MULT_AND.
--
-- This gives the structure of one stage of a general purpose
-- 3-way shifter as the following. XA and XB are shifts that
-- use the multiplexers, and C is the arithmetic shift. "M1"
-- and "M0" are the two mode inputs to the LUT, while "XA" and
-- "XB" are the two data inputs. The mode values "s" need to
-- be selected later, hopefully in a manner that reduces the
-- need for decoding logic.
--
-- SHFA  M1  M0  LUT4  CIN  COUT  Result
-- ----  --  --  ----  ---  ----  ------
--  XA   s   s    XA    0    0      XA
--  XB   s   0    XB    0    0      XB
--  C    s   1    0     C    XA     C
--
-- There's one remaining bit of mathematical gibberish.
-- In addition to performing isomorphisms between shifts,
-- I can also add a constant shift to a stage for free.
-- That is, I can renumber the outputs of a stage by shifting
-- them around. This is just a wire change, and it means
-- that given a stage that, for instance, shifts by 2,3, or
-- 4 bits, I can rearrange the output bits on the same stage
-- and make it shift by 4,5, or 6 bits. (Or 6,7, or 0 bits.)
-- This gives me some freedom in how I assign mode pins,
-- and freedom is very important for minimizing logic functions.
--
-- The effect of the above mathematical note is simple. I
-- can add an arbitrary shift amount to each stage, in terms
-- of what shift is associated with a particular pattern on
-- the SHFT inputs. Then I can remove that shift by doing
-- a "wire" barrel shift for free.
--
-- The XB shift is completely arbitrary. By that I mean
-- it could connect up any way without respect to how the
-- XA and C shifts are done. But the XA and C shifts are
-- related. In order for the carry structure to have only
-- one overall CIN, I need to have that XA and C shifts
-- differ (in terms of how many bits they shift) by an amount
-- that corresponds to a full length orbit.
--
-- For the example given, with a barrel shifter width of 8,
-- this means that I have to have the XA and C shifters
-- differ by {1,3,5, or 7} bits. (Note all arithmetic is to
-- be done modulo the width of the barrel shifter.) This
-- is rather liberal, as it means that I need only ensure
-- that not all the shift amounts be even or odd for either
-- SHFA or SHFB.
--
-- The other thing to notice is that I only need 3 different
-- shifts but with two mode pins I'm going to have 4 codes.
-- This means that I can map two codes to the same mode. This
-- may help reduce the amount of control logic. But I'll use
-- the extra degree of freedom to define two states for SHFB == 3.
-- This way, if SHIFT[3] is tied low (and the design reduced
-- from a shift by "0 to 8" to a shift by "0 to 7") an FG will
-- be saved. The resulting truth table is:
--
--
--                      A0_MODE   A1_MODE
-- SHIFT   SHFA  SHFB    1   0     1   0
-- -----   ----  ----    -   -     -   -
-- 0000     0     0      0   0     0   0
-- 0001     1     0      1   0     0   0
-- 0010     2     0      0   1     0   0
-- 0011     0     3      0   0     1   0
-- 0100     1     3      1   0     1   1
-- 0101     2     3      0   1     1   1
-- 0110     0     6      0   0     0   1
-- 0111     1     6      1   0     0   1
-- 1000     2     6      0   1     0   1

-- Also note that the original version of this code got the arithmetic
-- mux by connecting the Carry-out back around to the Carry-input. This
-- generates an apparent "cycle", and causes a "post layout timing report"
-- warning something like the following:
--
-- ----------------------------------------------------------------------
-- ! Warning: The following connections close cycles, and some paths    !
-- !          through these connections may not be analyzed.            !
-- !                                                                    !
-- ! Signal                           Driver           Load             !
-- ! -------------------------------- ---------------- ---------------- !
-- ! U3/A0_CRY2                       LB_R13C8.S1.COUT CLB_R12C8.S1.CIN !
-- ! U3/A1_CRY8                       CLB_R9C4.S0.COUT CLB_R8C4.S0.CIN  !
-- ----------------------------------------------------------------------
--
-- First of all, this warning may be ignored for this particular circuit.
-- The reason is that the circuit is operated in two modes.
-- In the non
-- arithmetic shifts, the "0th" carry is forced to be zero by the CIN selector.
-- This propagates through the rest of the circuit, so there is no cycle.
-- In the arithmetic shift mode the carry is always equal to the applied A[]
-- input, and no carries propagate through the circuit at all.
--
-- But it's best to avoid warnings, so the circuitry shown here simply ignores
-- the final carry-out and consequently is manifestly free of cycles.
--
-- There are some useful arithmetic circuits where the COUT has to be connected
-- back to the CIN input but this is not one of them. A great example of
-- an arithmetic circuit where cycles have to be dealt with is in the addition
-- circuitry of an ALU designed to add floating point numbers in CRAY notation.
-- Ah, for the days when I was a CPU designer!

component XORCY
    port (
        CI: in  STD_LOGIC;
        LI: in  STD_LOGIC;
        O:  out STD_LOGIC);
end component;

component MUXCY
    port (
        DI: in  STD_LOGIC;
        CI: in  STD_LOGIC;
        S:  in  STD_LOGIC;
        O:  out STD_LOGIC);
end component;

component MULT_AND
    port (
        I0: in  STD_LOGIC;
        I1: in  STD_LOGIC;
        LO: out STD_LOGIC);
end component;

component LUT4
    port (
        I0: in  STD_LOGIC;
        I1: in  STD_LOGIC;
        I2: in  STD_LOGIC;
        I3: in  STD_LOGIC;
        O:  out STD_LOGIC);
end component;

-- Define LUT4s to the correct function. For some reason the synthesizer
-- couldn't figure out that this was what I wanted.
attribute INIT: string;
attribute INIT of L00: label is "0E04";
attribute INIT of L01: label is "0E04";
attribute INIT of L02: label is "0E04";
attribute INIT of L03: label is "0E04";
attribute INIT of L04: label is "0E04";
attribute INIT of L05: label is "0E04";
attribute INIT of L06: label is "0E04";
attribute INIT of L07: label is "0E04";
attribute INIT of L08: label is "0E04";
attribute INIT of L09: label is "0E04";
attribute INIT of L10: label is "0E04";
attribute INIT of L11: label is "0E04";
attribute INIT of L12: label is "0E04";
attribute INIT of L13: label is "0E04";
attribute INIT of L14: label is "0E04";
attribute INIT of L15: label is "0E04";

-- Stage 0 declarations
signal A0_MODE:  STD_LOGIC_VECTOR( 1 downto 0); -- SHFA
-- Arithmetic inputs
signal A0_A:     STD_LOGIC_VECTOR(15 downto 0); -- A[] signal input
signal A0_B:     STD_LOGIC_VECTOR(15 downto 0); -- B[] signal input
-- Arithmetic internal signals
signal A0_LUT:   STD_LOGIC_VECTOR(15 downto 0); -- LUT (internal use)
signal A0_MA:    STD_LOGIC_VECTOR(15 downto 0); -- MULT_AND (internal use)
signal A0_XC:    STD_LOGIC_VECTOR(15 downto 0); -- XORCY (internal use)
signal A0_CRY:   STD_LOGIC_VECTOR(16 downto 0); -- Carry (internal use)
-- Arithmetic outputs
signal A0_COUT:  STD_LOGIC;                     -- Arithmetic Carry output
signal A0_SUMQ:  STD_LOGIC_VECTOR(15 downto 0); -- Arithmetic Sum output
signal A0_SUMD:  STD_LOGIC_VECTOR(15 downto 0); -- Arithmetic Sum output

-- Stage 1 declarations
signal A1_MODEQ: STD_LOGIC_VECTOR( 1 downto 0); -- SHFB
signal A1_MODED: STD_LOGIC_VECTOR( 1 downto 0); -- SHFB
-- Arithmetic inputs
signal A1_A:     STD_LOGIC_VECTOR(15 downto 0); -- A[] signal input
signal A1_B:     STD_LOGIC_VECTOR(15 downto 0); -- B[] signal input
-- Arithmetic internal signals
signal A1_LUT:   STD_LOGIC_VECTOR(15 downto 0); -- LUT (internal use)
signal A1_MA:    STD_LOGIC_VECTOR(15 downto 0); -- MULT_AND (internal use)
signal A1_XC:    STD_LOGIC_VECTOR(15 downto 0); -- XORCY (internal use)
signal A1_CRY:   STD_LOGIC_VECTOR(16 downto 0); -- Carry (internal use)
-- Arithmetic outputs
signal A1_COUT:  STD_LOGIC;                     -- Arithmetic Carry output
signal A1_SUMQ:  STD_LOGIC_VECTOR(15 downto 0); -- Arithmetic Sum output
signal A1_SUMD:  STD_LOGIC_VECTOR(15 downto 0); -- Arithmetic Sum output

begin

-- The selector needed is a type of arithmetic circuit with
-- two mode pins. In addition, I have to be able to force the "A"
-- input to be ignored completely for at least one mode. That means
-- that the mode will require a MULT_AND, so the template required is
-- the MODE3-0 MULT_AND template (which see).
--
-- The three operations required will be A[], B[], and A[]+A[]+COUT.
-- B[] will have to correspond to AR_MODE(1) "low", and A[]+A[] will
-- need to have AR_MODE(1) "high". Also, I want a shift by "0" to
-- correspond to the A[] version, while a shift by "1" or "3" to correspond
-- to the arithmetic shift (i.e. A[]+A[]+COUT). In addition, it's
-- possible to save a LUT by fiddling with N0 so that two different codes
-- correspond to the arithmetic shift:
--
--                     A0_MODE   A1_MODE
--   SHIFT  SHFA SHFB    1 0       1 0
--   -----  ---- ----    - -       - -
--   0000     0    0     0 0       0 0
--   0001     1    0     1 0       0 0
--   0010     2    0     0 1       0 0
--   0011     0    3     0 0       1 0
--   0100     1    3     1 0       1 1
--   0101     2    3     0 1       1 1
--   0110     0    6     0 0       0 1
--   0111     1    6     1 1       0 1
--   1000     2    6     0 1       0 1
--
-- The bizarre modes for "10" and "11" are there to possibly save the
-- A1_MODE(0) LUT.

-- SHFA modes
with SHIFT(3 downto 0) select
   A0_MODE(1 downto 0) <=
      "00" when "0000" | "0011" | "0110", -- Shift by 1*0   A   {0,3,6}
      "10" when "0001",                   -- Shift by 1     0   {1,
      "11" when "0100" | "0111",          -- Shift by 1          4,7}
      "01" when others;                   -- Shift by 2     B   {2,5,8}

-- SHFB modes
with SHIFT(3 downto 0) select
   A1_MODED(1 downto 0) <=
      "00" when "0000" | "0001" | "0010", -- Shift by 0
      "10" when "0011",                   -- Shift by 3
      "11" when "0100" | "0101",          -- Shift by 3
      "01" when others;                   -- Shift by 6

------------------------------------------------------
-- SHFA Shifts by 0,1, or 2 bits
------------------------------------------------------
-- Arithmetic functions required:
--
--   A0_MODE  Arithmetic
--     1 0    Function      Shift
--     - -    ------------  --------
--     0 0    A[]           0
--     0 1    B[]           2
--     1 0    A[]+A[]+COUT  1
--     1 1    A[]+A[]+COUT  1
--
-- This is all I need to know to define the carry logic. Because
-- this is sort of a complicated substitution, I'll do the bit substitution
-- explicitly, using assignments, rather than try to substitute the text
-- in the arithmetic. The other substitutions are as follows:
--
-- Substitutions:
--   AR_MAXBIT <= 15
--   AR_MODE   <= A0_MODE(1 downto 0)
--   AR_xxx    <= A0_xxx

-- Bit assignments, stage 0
--
-- Note that the bit assignments for stage 0 are trivial. This is because
-- the orbit of a bit in this length barrel shifter, when shifted by 1
-- is the sequence (0,1,2,3,4 ... 15), and this is a trivial sequence.
-- the bit assignment for stage 1 is more complicated.
A0_A(15 downto 0) <= DIN(15 downto 0);                        -- Shift by 0

-- A0_B is the same as A0_A, but shifted by two places:
A0_B(15 downto 0) <= A0_A(13 downto 0) & A0_A(15 downto 14);  -- Shift by 2

-- Carry-in control
-- Note that when the circuit is in A[]+A[]+COUT mode, the COUT will
-- be precisely equal to A0_A(15), so I choose that as the CIN instead
-- of the carry-out A0_CRY(16). This uses no more gates and is kinder
-- and gentler to the tools.
with A0_MODE(1 downto 0) select
   A0_CRY(0) <=
      (   '0'  ) when "00",    -- CIN = 0     A[]
      (   '0'  ) when "01",    -- CIN = 0     B[]
      (A0_A(15)) when "10",    -- CIN = COUT  A[]+A[]+COUT
      (A0_A(15)) when others;  -- CIN = COUT  A[]+A[]+COUT

-- Unfortunately, I couldn't get the synthesizer to clue in that
-- A0_LUT would fit into a LUT. I hate to instantiate these things,
-- but here goes:
--
-- with A0_MODE(1 downto 0) select
--    A0_LUT(I) <=
--       (             A0_A(I)) when "00",    -- A[]      A[]+1      A[]+CIN
--       (             A0_B(I)) when "01",    -- B[]      B[]+1      B[]+CIN (Note 1.)
--       (A0_A(I) xor A0_A(I)) when "10",     -- A[]+A[]  A[]+A[]+1  A[]+A[]+CIN
--       (A0_A(I) xor A0_A(I)) when others;   -- A[]+A[]  A[]+A[]+1  A[]+A[]+CIN
--
-- INIT calculation:
--
--   3  1111111100000000  B
--   2  1111000011110000  MODE(0)
--   1  1100110011001100  A
--   0  1010101010101010  MODE(1)
--   -  ----------------
--      1 0 1 0           A0_A when "00",
--      1 1 0 0           A0_B when "01",
--      0000 0000         '0' when others (note A xor A == '0')
--      ----------------
--      0000111000000100 = 0x0E04 (assigned as an attribute near signals)
--
-- LUT4 ugliness...
L00: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A( 0),
                     I2 => A0_MODE(0), I3 => A0_B( 0), O => A0_LUT( 0));
L01: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A( 1),
                     I2 => A0_MODE(0), I3 => A0_B( 1), O => A0_LUT( 1));
L02: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A( 2),
                     I2 => A0_MODE(0), I3 => A0_B( 2), O => A0_LUT( 2));
L03: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A( 3),
                     I2 => A0_MODE(0), I3 => A0_B( 3), O => A0_LUT( 3));
L04: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A( 4),
                     I2 => A0_MODE(0), I3 => A0_B( 4), O => A0_LUT( 4));
L05: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A( 5),
                     I2 => A0_MODE(0), I3 => A0_B( 5), O => A0_LUT( 5));
L06: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A( 6),
                     I2 => A0_MODE(0), I3 => A0_B( 6), O => A0_LUT( 6));
L07: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A( 7),
                     I2 => A0_MODE(0), I3 => A0_B( 7), O => A0_LUT( 7));
L08: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A( 8),
                     I2 => A0_MODE(0), I3 => A0_B( 8), O => A0_LUT( 8));
L09: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A( 9),
                     I2 => A0_MODE(0), I3 => A0_B( 9), O => A0_LUT( 9));
L10: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A(10),
                     I2 => A0_MODE(0), I3 => A0_B(10), O => A0_LUT(10));
L11: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A(11),
                     I2 => A0_MODE(0), I3 => A0_B(11), O => A0_LUT(11));
L12: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A(12),
                     I2 => A0_MODE(0), I3 => A0_B(12), O => A0_LUT(12));
L13: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A(13),
                     I2 => A0_MODE(0), I3 => A0_B(13), O => A0_LUT(13));
L14: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A(14),
                     I2 => A0_MODE(0), I3 => A0_B(14), O => A0_LUT(14));
L15: LUT4 port map ( I0 => A0_MODE(1), I1 => A0_A(15),
                     I2 => A0_MODE(0), I3 => A0_B(15), O => A0_LUT(15));

-- Generate command
A0: for I in 0 to 15 generate
   -- Carry chain instantiation
   MA: MULT_AND port map ( I0 => A0_A(I),
                           I1 => A0_MODE(1),
                           LO => A0_MA(I));
   MC: MUXCY    port map ( DI => A0_MA(I),
                           CI => A0_CRY(I),
                           S  => A0_LUT(I),
                           O  => A0_CRY(I+1));
   XC: XORCY    port map ( CI => A0_CRY(I),
                           LI => A0_LUT(I),
                           O  => A0_SUMD(I));
end generate;

------------------------------------------------------
-- SHFB Shifts by 0,3, or 6 bits
------------------------------------------------------
--
--
-- The logic for the second level shifts is similar to the logic for
-- the first level, but the shifts are by amounts 3x as much. This
-- means that I have to scramble bits to get the right results.
--
--   A1_MODE  Arithmetic
--     1 0    Function      Shift
--     - -    ------------  --------
--     0 0    A[]           0
--     0 1    B[]           6
--     1 0    A[]+A[]+COUT  3
--     1 1    A[]+A[]+COUT  3
--
--
-- Substitutions:
--   AR_MAXBIT <= 15
--   AR_MODE   <= A1_MODE(1 downto 0)
--   AR_xxx    <= A1_xxx

-- Bit assignments, stage 1
--
-- The result from the previous stage is A0_SUM(15 downto 0), and
-- the bits are positioned just as they appear. But I'm going to have
-- to juggle the ordering for this stage.
--
-- I'm going to choose to start the selector with the selected
-- bit 0. That is, A1_SUM(0) will have a position of "0" in the
-- (15 downto 0) set of bits.
--
-- As soon as I make that choice, I know that A1_A(0) will be given
-- A0_SUM(0) in order to make the A[] choice correspond to a shift by
-- 0. Then A1_B(0) will have to be A0_SUM(6) to get the shift by 6.
--
-- The carry-out from this stage (during the arithmetic mux) will have
-- a value of A0_SUM(0), and since the arithmetic shift is supposed to
-- have a shift amount of "3", it must be that the next bit higher
-- than bit "0" must be bit "3".
-- This rule continues through all
-- 16 bits of input bits, and determines the bits for A1_A:
A1_A(15 downto 0) <=
   A0_SUMQ(13) & A0_SUMQ(10) & A0_SUMQ( 7) & A0_SUMQ( 4) &
   A0_SUMQ( 1) & A0_SUMQ(14) & A0_SUMQ(11) & A0_SUMQ( 8) &
   A0_SUMQ( 5) & A0_SUMQ( 2) & A0_SUMQ(15) & A0_SUMQ(12) &
   A0_SUMQ( 9) & A0_SUMQ( 6) & A0_SUMQ( 3) & A0_SUMQ( 0);

-- A1_B is the same as A1_A, but shifted by two places:
A1_B(15 downto 0) <= A1_A(13 downto 0) & A1_A(15 downto 14);  -- Shift by 6

-- Note: The following carry chain logic may seem mysterious, but
-- with the use of a template it's easy to implement. It's my
-- intention to publish the templates I use as freeware, but I haven't
-- got good enough comments written into them. Half the problem is
-- that after you've used the templates a few times you just remember
-- how to get the arithmetic functions you want, so I don't bother
-- with them much.
--
-- The template that covers this case allows for a very large number
-- of 2-mode bit functions of 2 arithmetic vectors (and constants and
-- stuff) to be implemented in a single slice.

-- Carry-in control
with A1_MODEQ(1 downto 0) select
   A1_CRY(0) <=
      (   '0'  ) when "00",    -- CIN = 0     A[]
      (   '0'  ) when "01",    -- CIN = 0     B[]
      (A1_A(15)) when "10",    -- CIN = COUT  A[]+A[]+COUT
      (A1_A(15)) when others;  -- CIN = COUT  A[]+A[]+COUT

-- Generate command
A1: for I in 0 to 15 generate
   with A1_MODEQ(1 downto 0) select
      A1_LUT(I) <=
         (             A1_A(I)) when "00",    -- A[]      A[]+1      A[]+CIN
         (             A1_B(I)) when "01",    -- B[]      B[]+1      B[]+CIN (Note 1.)
         (A1_A(I) xor A1_A(I)) when "10",     -- A[]+A[]  A[]+A[]+1  A[]+A[]+CIN
         (A1_A(I) xor A1_A(I)) when others;   -- A[]+A[]  A[]+A[]+1  A[]+A[]+CIN

   -- Carry chain instantiation
   MA: MULT_AND port map ( I0 => A1_A(I),
                           I1 => A1_MODEQ(1),
                           LO => A1_MA(I));
   MC: MUXCY    port map ( DI => A1_MA(I),
                           CI => A1_CRY(I),
                           S  => A1_LUT(I),
                           O  => A1_CRY(I+1));
   XC: XORCY    port map ( CI => A1_CRY(I),
                           LI => A1_LUT(I),
                           O  => A1_SUMD(I));
end generate;

process (CLK)
begin
   if CLK'event and CLK='1' then  -- CLK rising edge
      A0_SUMQ  <= A0_SUMD(15 downto 0);
      A1_MODEQ <= A1_MODED(1 downto 0);
      A1_SUMQ  <= A1_SUMD(15 downto 0);
   end if;
end process;

-- The result is A1_SUM, but the bits are not in the correct order.
-- To get the bits in correct order, I simply reverse the operation
-- that gave A1_A:
Y(15 downto 0) <=
   A1_SUMQ( 5) & A1_SUMQ(10) & A1_SUMQ(15) & A1_SUMQ( 4) &
   A1_SUMQ( 9) & A1_SUMQ(14) & A1_SUMQ( 3) & A1_SUMQ( 8) &
   A1_SUMQ(13) & A1_SUMQ( 2) & A1_SUMQ( 7) & A1_SUMQ(12) &
   A1_SUMQ( 1) & A1_SUMQ( 6) & A1_SUMQ(11) & A1_SUMQ( 0);

end BRL16_8_arch;

--
Posted from firewall.terabeam.com [216.137.15.2]
via Mailgate.ORG Server - http://www.Mailgate.ORG

Article: 37699
Design reports:

<<<
Design Summary
--------------
   Number of errors:      0
   Number of warnings:    1
   Number of Slices:                                38 out of    768    4%
   Number of Slices containing unrelated logic:      0 out of     38    0%
   Number of Slice Flip Flops:                      70 out of  1,536    4%
   Number of 4 input LUTs:                          38 out of  1,536    2%
   Number of bonded IOBs:                           36 out of     94   38%
   Number of GCLKs:                                  1 out of      4   25%
   Number of GCLKIOBs:                               1 out of      4   25%
Total equivalent gate count for design:  1,034
Additional JTAG gate count for IOBs:     1,776
>>>

Note that the 38 slices mentioned above include slices that contain only
DFFs. The number of slices that contain FGs or carry logic is 20. There are
extra DFFs included on the inputs and outputs in order to "free" the design
from the IO region.

<<<
The Average Connection Delay for this design is:        0.923 ns
The Maximum Pin Delay is:                               1.982 ns
The Average Connection Delay on the 10 Worst Nets is:   1.430 ns
...
--------------------------------------------------------------------------------
  Constraint                                | Requested  | Actual     | Logic
                                            |            |            | Levels
--------------------------------------------------------------------------------
* NET "CLK" PERIOD = 5 nS LOW 50.000 %      | 5.000ns    | 5.272ns    | 11
--------------------------------------------------------------------------------
<<<
Constraints cover 2038 paths, 0 nets, and 234 connections (93.6% coverage)

Design statistics:
   Minimum period:   5.272ns (Maximum frequency: 189.681MHz)
>>>

I should here mention that the worst case timing does involve the long carry
chain, but the carry logic itself only increases by 0.097ns per stage (i.e.
half that per bit). Consequently, other than the additional routing
congestion, the carry chain, per se, is not really much of a limit to it:

<<<
================================================================================
Timing constraint: NET "CLK" PERIOD = 5 nS LOW 50.000 % ;

 2038 items analyzed, 6 timing errors detected.
 Minimum period is 5.272ns.
--------------------------------------------------------------------------------
Slack:    -0.272ns path B<3> to U14/A1_A<10> relative to
           5.000ns delay constraint

Path B<3> to U14/A1_A<10> contains 11 levels of logic:
Path starting from Comp: CLB_R14C5.S1.CLK (from CLK)
To                   Delay type         Delay(ns)  Physical Resource
                                                   Logical Resource(s)
-------------------------------------------------  --------
CLB_R14C5.S1.XQ      Tcko                  0.772R  B<3>
                                                   U2
CLB_R15C6.S1.G2      net (fanout=4)        0.613R  B<3>
CLB_R15C6.S1.Y       Tilo                  0.398R  U14/C1/N5
                                                   U14/C586
CLB_R15C6.S1.F3      net (fanout=17)       0.476R  U14/C17/N16
CLB_R15C6.S1.X       Tilo                  0.398R  U14/C1/N5
                                                   U14/C582
CLB_R16C6.S0.BX      net (fanout=1)        0.680R  U14/C1/N5
CLB_R16C6.S0.COUT    Tbxcy                 0.357R  U14/A1_A<0>
                                                   U14/MC_0
                                                   U14/MC_1
CLB_R15C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<2>
CLB_R15C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<6>
                                                   U14/MC_2
                                                   U14/MC_3
CLB_R14C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<4>
CLB_R14C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<12>
                                                   U14/MC_4
                                                   U14/MC_5
CLB_R13C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<6>
CLB_R13C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<2>
                                                   U14/MC_6
                                                   U14/MC_7
CLB_R12C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<8>
CLB_R12C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<8>
                                                   U14/MC_8
                                                   U14/MC_9
CLB_R11C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<10>
CLB_R11C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<14>
                                                   U14/MC_10
                                                   U14/MC_11
CLB_R10C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<12>
CLB_R10C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<4>
                                                   U14/MC_12
                                                   U14/MC_13
CLB_R9C6.S0.CIN      net (fanout=1)        0.000R  U14/A0_CRY<14>
CLB_R9C6.S0.CLK      Tcckx                 0.996R  U14/A1_A<10>
                                                   U14/XC_14
                                                   U14/A0_SUMQ_reg<14>
-------------------------------------------------
Total (3.503ns logic, 1.769ns route)       5.272ns (to CLK)
      (66.4% logic, 33.6% route)
>>>

Even though the carry is only used in two modes, one where it is zero
throughout the carry chain, and the other where each carry bit propagates
only to the next stage, it is nevertheless the case that the worst case speed
of the circuit includes the full timing chain.

It's not easy to get a delay as long as the whole carry chain. To do it,
you'll have to apply '1's to the DIN[] input, and tool around with the SHIFT[]
input. When the carry chain is in the "pass data" mode, it will all be '1's.
When the carry chain is switched to the all '0' mode, the '0' will be supplied
at the least significant bit, and will propagate through the chain. (Note that
the chain does not propagate evenly from least significant to most significant
because of the remapping that makes the barrel shifter work.)

Carl

--
Posted from firewall.terabeam.com [216.137.15.2]
via Mailgate.ORG Server - http://www.Mailgate.ORG
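
Two things in the BRL16_8 listing can be checked outside the tools: that the bit scrambling around stage 1 really turns a one-place rotate into a rotate by 3, and which INIT word the described LUT function needs. The Python sketch below is only a behavioural model, not the author's code. The names rotl, barrel, lut4_init and select are made up for illustration, the pipeline registers are ignored, and the INIT convention (bit 8*I3 + 4*I2 + 2*I1 + I0 of the word holds the LUT output) is an assumption that matches the row patterns in the listing's INIT table.

```python
WIDTH = 16

def rotl(bits, n):
    """Rotate left by n places; bits[k] is bit k, so out[k] = in[k - n]."""
    return [bits[(k - n) % WIDTH] for k in range(WIDTH)]

def barrel(din, shift):
    """Behavioural model of the two stages (pipeline registers ignored)."""
    shfa, shfb = shift % 3, 3 * (shift // 3)      # SHFA in {0,1,2}, SHFB in {0,3,6}
    s0 = rotl(din, shfa)                          # stage 0 works in plain bit order
    # Scramble as in the listing: A1_A(k) = A0_SUMQ(3k mod 16)
    a1_a = [s0[(3 * k) % WIDTH] for k in range(WIDTH)]
    # In scrambled order a rotate by 3 or 6 becomes a rotate by 1 or 2 places
    s1 = rotl(a1_a, shfb // 3)
    # Unscramble as in the listing: Y(j) = A1_SUMQ(11j mod 16), since 3*11 = 33 = 1 mod 16
    return [s1[(11 * j) % WIDTH] for j in range(WIDTH)]

def lut4_init(func):
    """INIT word of a LUT4 under the assumed convention (see lead-in)."""
    word = 0
    for idx in range(16):
        i0, i1, i2, i3 = idx & 1, (idx >> 1) & 1, (idx >> 2) & 1, (idx >> 3) & 1
        if func(i0, i1, i2, i3):
            word |= 1 << idx
    return word

def select(mode1, mode0, a, b):
    """LUT function described in the listing: A when MODE = "00", B when "01", else 0."""
    if mode1 == 0:
        return b if mode0 else a
    return 0

# Pin order as written in the port maps: I0 = MODE(1), I1 = A, I2 = MODE(0), I3 = B.
init_as_wired = lut4_init(lambda i0, i1, i2, i3: select(i0, i2, i1, i3))
# Same function with the two mode pins exchanged: I0 = MODE(0), I2 = MODE(1).
init_swapped = lut4_init(lambda i0, i1, i2, i3: select(i2, i0, i1, i3))

if __name__ == "__main__":
    din = list(range(WIDTH))                      # distinct values make bit moves visible
    for shift in range(9):
        assert barrel(din, shift) == rotl(din, shift)
    print("rotate check passed for SHIFT 0..8")
    print(f"INIT as wired: {init_as_wired:04X}, mode pins swapped: {init_swapped:04X}")
```

Under these assumptions the model rotates correctly for every SHIFT from 0 to 8, and the INIT words come out as 5404 for the pin order in the port maps and 0E04 with the two mode pins exchanged. If the assumed convention matches the tools, the posted 0E04 corresponds to the second ordering, so it may be worth confirming which pin order the LUT4 instances are meant to use before relying on the listing.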