Messages from 37675

Article: 37675
Subject: Re: Spartan-IIE schematic symbol?
From: "Pete Dudley" <padudle@sandia.gov>
Date: Tue, 18 Dec 2001 17:16:34 -0700

Hello All,

While you really should use a custom symbol with all your signal names, it's
foolish to hand-generate schematic symbols. Fortunately, I have just the tool
for you. I have written a Xilinx .pad to Viewdraw symbol translator.

It automatically generates a complete, accurate symbol that contains all the
pins on the package. Grounds and core supplies are added as SIGNAL=
attributes to keep the size of the symbol within reason.

I lobbied Xilinx for many years to standardize on a common .pad format for
all their devices and I think they have done that now for everything except
the PLDs.

Tonight from my home account I will post a URL where you can download the
translator. It's written in C++, and if someone wants to modify it for other
applications I would probably share it in the spirit of an open internet.

Later,

  Pete Dudley

"Peter Fenn" <Peter.Fenn@avnet.com> wrote in message
news:ee73c6a.-1@WebX.sUN8CHnE...
> Spartan-IIE: I am urgently looking for a (board-level) schematic symbol
> (preferably ORCAD or VIEWLOGIC) for an XC2S100E-6FT256C Xilinx FPGA. Is
> anyone in a position to help on this?
> -Thanks in advance :-)

Article: 37676
Subject: Re: is it OK?
From: Kenily <chensw20@hotmail.com>
Date: Tue, 18 Dec 2001 17:13:39 -0800

i want to know if the way does.

Article: 37677
Subject: How can I reduce Spartan-II routing delays to meet 33MHz PCI's Tsu < 7 ns requirement?
From: kevinbraceusenet@hotmail.com (Kevin Brace)
Date: 18 Dec 2001 17:43:28 -0800

Hi, I would like to know if anyone knows strategies for reducing
routing (net) delays on Spartan-II.
So far I have treated the synthesis tool (XST), Map, and Par as a black
box, but because my design (a PCI IP core) was not meeting Tsu
(Tsu < 7 ns), I started to take a closer look at how LUTs are placed
on the FPGA.
Using Floorplanner, I saw the LUTs being placed all over the FPGA, so
I decided to hand-place the LUTs using the UCF flow.
That was the most effective thing I did to reduce interconnect delay
(it reduced the worst interconnect delay by about 2.7 ns (11 ns down
to 8.3 ns)), but unfortunately I still have to reduce the interconnect
delay by 1.3 ns (worst Tsu currently at 8.3 ns).
Basically, I have two input signals, FRAME# and IRDY#, that are not
meeting timing.
Here are the two worst violators for FRAME# and IRDY#,
respectively.



________________________________________________________________________________


================================================================================
 Timing constraint: COMP "frame_n" OFFSET = IN 7 nS  BEFORE COMP "clk" ;

 503 items analyzed, 61 timing errors detected.
 Minimum allowable offset is   8.115ns.
 --------------------------------------------------------------------------------
Slack:                  -1.115ns (requirement - (data path - clock path - clock arrival))
  Source:               frame_n
  Destination:          PCI_IP_Core_Instance_ad_Port_2
  Destination Clock:    clk_BUFGP rising at 0.000ns
  Requirement:          7.000ns
  Data Path Delay:      10.556ns (Levels of Logic = 6)
  Clock Path Delay:     2.441ns (Levels of Logic = 2)
  Data Path: frame_n to PCI_IP_Core_Instance_ad_Port_2
    Delay type         Delay(ns)  Logical Resource(s)
    ----------------------------  -------------------
    Tiopi                 1.224   frame_n
                                  frame_n_IBUF
    net (fanout=45)       0.591   frame_n_IBUF
    Tilo                  0.653   PCI_IP_Core_Instance_I_25_LUT_7
    net (fanout=3)        0.683   N21918
    Tbxx                  0.981   PCI_IP_Core_Instance_I_XXL_1357_1
    net (fanout=15)       2.352   PCI_IP_Core_Instance_I_XXL_1357_1
    Tilo                  0.653   PCI_IP_Core_Instance_I_125_LUT_17
    net (fanout=1)        0.749   PCI_IP_Core_Instance_N3059
    Tilo                  0.653   PCI_IP_Core_Instance_I__n0055
    net (fanout=1)        0.809   PCI_IP_Core_Instance_N3069
    Tioock                1.208   PCI_IP_Core_Instance_ad_Port_2
    ----------------------------  ------------------------------
    Total                10.556ns (5.372ns logic, 5.184ns route)
                                  (50.9% logic, 49.1% route)

   Clock Path: clk to PCI_IP_Core_Instance_ad_Port_2
   Delay type         Delay(ns)  Logical Resource(s)
   ----------------------------  -------------------
    Tgpio                 1.082   clk
                                  clk_BUFGP/IBUFG
    net (fanout=1)        0.007   clk_BUFGP/IBUFG
    Tgio                  0.773   clk_BUFGP/BUFG
    net (fanout=423)      0.579   clk_BUFGP
    ----------------------------  ------------------------------
    Total                 2.441ns (1.855ns logic, 0.586ns route)
                                  (76.0% logic, 24.0% route)

 --------------------------------------------------------------------------------


 ================================================================================
Timing constraint: COMP "irdy_n" OFFSET = IN 7 nS  BEFORE COMP "clk" ;

 698 items analyzed, 74 timing errors detected.
 Minimum allowable offset is   8.290ns.
 --------------------------------------------------------------------------------
Slack:                  -1.290ns (requirement - (data path - clock path - clock arrival))
  Source:               irdy_n
  Destination:          PCI_IP_Core_Instance_ad_Port_2
  Destination Clock:    clk_BUFGP rising at 0.000ns
  Requirement:          7.000ns
  Data Path Delay:      10.731ns (Levels of Logic = 6)
  Clock Path Delay:     2.441ns (Levels of Logic = 2)
  Data Path: irdy_n to PCI_IP_Core_Instance_ad_Port_2
    Delay type         Delay(ns)  Logical Resource(s)
    ----------------------------  -------------------
    Tiopi                 1.224   irdy_n
                                  irdy_n_IBUF
    net (fanout=138)      0.766   irdy_n_IBUF
    Tilo                  0.653   PCI_IP_Core_Instance_I_25_LUT_7
    net (fanout=3)        0.683   N21918
    Tbxx                  0.981   PCI_IP_Core_Instance_I_XXL_1357_1
    net (fanout=15)       2.352   PCI_IP_Core_Instance_I_XXL_1357_1
    Tilo                  0.653   PCI_IP_Core_Instance_I_125_LUT_17
    net (fanout=1)        0.749   PCI_IP_Core_Instance_N3059
    Tilo                  0.653   PCI_IP_Core_Instance_I__n0055
    net (fanout=1)        0.809   PCI_IP_Core_Instance_N3069
    Tioock                1.208   PCI_IP_Core_Instance_ad_Port_2
    ----------------------------  ------------------------------
    Total                10.731ns (5.372ns logic, 5.359ns route)
                                  (50.1% logic, 49.9% route)

  Clock Path: clk to PCI_IP_Core_Instance_ad_Port_2
    Delay type         Delay(ns)  Logical Resource(s)
    ----------------------------  -------------------
    Tgpio                 1.082   clk
                                  clk_BUFGP/IBUFG
    net (fanout=1)        0.007   clk_BUFGP/IBUFG
    Tgio                  0.773   clk_BUFGP/BUFG
    net (fanout=423)      0.579   clk_BUFGP
    ----------------------------  ------------------------------
    Total                 2.441ns (1.855ns logic, 0.586ns route)
                                  (76.0% logic, 24.0% route)

 --------------------------------------------------------------------------------


Timing summary:
---------------

Timing errors: 135  Score: 55289

Constraints cover 27511 paths, 0 nets, and 4835 connections (92.1%
coverage)

________________________________________________________________________________
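
The slack figures in the report follow directly from the formula it prints, slack = requirement - (data path - clock path - clock arrival). As a rough cross-check, here is a minimal Python sketch that re-adds the frame_n delays and recomputes both slack values. The variable and function names are illustrative only and are not part of any Xilinx tool.

```python
# Delays from the frame_n data path above, in picoseconds (integers avoid
# floating-point rounding). Each entry is (kind, delay_ps).
FRAME_N_DATA_PATH = [
    ("logic", 1224),  # Tiopi
    ("route", 591),   # net, fanout=45
    ("logic", 653),   # Tilo
    ("route", 683),   # net, fanout=3
    ("logic", 981),   # Tbxx
    ("route", 2352),  # net, fanout=15 (largest single contribution)
    ("logic", 653),   # Tilo
    ("route", 749),   # net, fanout=1
    ("logic", 653),   # Tilo
    ("route", 809),   # net, fanout=1
    ("logic", 1208),  # Tioock
]
CLOCK_PATH_PS = 2441   # clock path delay reported for both constraints
REQUIREMENT_PS = 7000  # OFFSET = IN 7 ns BEFORE clk


def total(path, kind=None):
    """Sum the delays in a path, optionally only 'logic' or only 'route'."""
    return sum(delay for k, delay in path if kind is None or k == kind)


def slack_ps(requirement, data_path, clock_path, clock_arrival=0):
    """Slack as defined in the report header."""
    return requirement - (data_path - clock_path - clock_arrival)


data_ps = total(FRAME_N_DATA_PATH)
logic_ps = total(FRAME_N_DATA_PATH, "logic")
route_ps = total(FRAME_N_DATA_PATH, "route")
min_offset_ps = data_ps - CLOCK_PATH_PS
frame_slack_ps = slack_ps(REQUIREMENT_PS, data_ps, CLOCK_PATH_PS)
# The irdy_n path differs only in its first net, so its reported total is used.
irdy_slack_ps = slack_ps(REQUIREMENT_PS, 10731, CLOCK_PATH_PS)

print(data_ps, logic_ps, route_ps, min_offset_ps, frame_slack_ps, irdy_slack_ps)
```

On these numbers the frame_n path has to lose 1.115 ns and the irdy_n path 1.290 ns, while routing accounts for 5.184 ns of the frame_n data path.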


Locations of various resources:

FRAME#: pin 23
IRDY#:  pin 24
AD[2]:  pin 62
PCI_IP_Core_Instance_I_25_LUT_7: CLB_R12C1.s1
PCI_IP_Core_Instance_I_XXL_1357_1: CLB_R12C2
PCI_IP_Core_Instance_I_125_LUT_17: CLB_R23C9.s0
PCI_IP_Core_Instance_I__n0055: CLB_R24C9.s0
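
One way to read the placement list above is to count how many CLB rows and columns separate consecutive elements of the failing path. The sketch below does that in Python. Manhattan distance between CLB sites is only a crude proxy for routing delay (it ignores which routing resources Par actually uses), and the helper names are made up for this illustration.

```python
import re

# Path order taken from the timing report: LUT_7 -> XXL_1357_1 -> LUT_17 -> _n0055
PATH_LOCATIONS = [
    ("PCI_IP_Core_Instance_I_25_LUT_7", "CLB_R12C1.s1"),
    ("PCI_IP_Core_Instance_I_XXL_1357_1", "CLB_R12C2"),
    ("PCI_IP_Core_Instance_I_125_LUT_17", "CLB_R23C9.s0"),
    ("PCI_IP_Core_Instance_I__n0055", "CLB_R24C9.s0"),
]


def clb_row_col(location):
    """Extract (row, column) from a site name such as 'CLB_R23C9.s0'."""
    match = re.match(r"CLB_R(\d+)C(\d+)", location)
    if match is None:
        raise ValueError("not a CLB location: " + location)
    return int(match.group(1)), int(match.group(2))


def manhattan(loc_a, loc_b):
    """Row distance plus column distance between two CLB sites."""
    (row_a, col_a), (row_b, col_b) = clb_row_col(loc_a), clb_row_col(loc_b)
    return abs(row_a - row_b) + abs(col_a - col_b)


sites = [location for _, location in PATH_LOCATIONS]
hops = [manhattan(a, b) for a, b in zip(sites, sites[1:])]
print(hops)
```

The middle hop, from CLB_R12C2 to CLB_R23C9, spans 11 rows and 7 columns, and it is the same net that shows the 2.352 ns delay in the report.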



Input signals other than FRAME# and IRDY# all meet the Tsu < 7 ns
requirement, and because I have now figured out how to use IOB FFs, I
can easily meet Tval < 11 ns (Tco) for all output signals.
I am using Xilinx ISE WebPack 4.1 (which doesn't come with FPGA
Editor), and the PCI IP core is written in Verilog.
The device I am targeting is Xilinx Spartan-II 150K system gate speed
grade -5 part (XC2S150-5CPQ208), and I did meet all 33MHz PCI timings
with Spartan-II 150K system gate speed grade -6 part (XC2S150-6CPQ208)
when I resynthesized the PCI IP core for speed grade -6 part, and
basically reused the same UCF file with the floorplan (I had to make
small modifications to the UCF file because some of the LUT names
changed).
The reason I really care about Xilinx Spartan-II 150K system gate
speed grade -5 part is because that is the chip that is on the PCI
prototype board of Insight Electronics Spartan-II Development Kit.
Yes, I wish the PCI prototype board came with speed grade -6 . . .
Because I want the PCI IP core to be portable across different
platforms (most notably Xilinx and Altera FPGAs), I am not really
interested in making any vendor specific modification to my Verilog
RTL code, but I won't mind using various tricks in the .UCF file (for
Xilinx) or .ACF file (I believe that is the Altera equivalent of
Xilinx .UCF file).
Here are some solutions I came up with.


1) Reduce the signal fanout (Currently at 35 globally, but FRAME# and
IRDY#'s fanout are 200. What number should I reduce the global fanout
to?).

2) Use USELOWSKEWLINES in a UCF file (already tried on some long
routings, but didn't seem to help. I will try to play around with this
option a little more with different signals.).

3) Floorplan all the LUTs and FFs on the FPGA (currently, I only
floorplanned the LUTs that violated Tsu, and most of them take inputs
from FRAME# and IRDY#.).

4) Use Guide file Leverage mode in Map and Par.

5) Try routing my design 2000 times (That will take several days . . .
I once routed my design about 20 times. After routing my design 20
times, Par seems to get stuck in certain Timing Score range beyond 20
iterations.).

6) Pay for ISE Foundation 4.1 (I don't want to pay for tools because I
am poor), and use FPGA Editor (I wish ISE WebPack came with FPGA
Editor.). At least from FPGA Editor, I can see how the signals are
actually getting routed.

7) Use a different synthesis tool other than XST (I am poor, so I
doubt that I can afford.).


I would like to hear from anyone who can comment on the solutions I
just listed, or who has other suggestions on what I can do to reduce the
delays to meet 33MHz PCI's Tsu < 7 ns requirement.




Thanks,



Kevin Brace (don't respond to me directly, respond within the
newsgroup)




P.S.  Considering that I am struggling to meet 33MHz PCI timings with
Spartan-II speed grade -5, how does Xilinx meet 66MHz PCI timings on
Virtex/Spartan-II speed grade -6? (I can only barely meet 33MHz PCI
timings with Spartan-II speed grade -6 using Floorplanner.)
Is it possible for a signal to enter through an input pin like FRAME#
or IRDY# (pin 23 and pin 24 respectively for Spartan-II PQ208), go
through a few levels of LUTs, and reach a far away IOB output FF or
tri-state control FF like pin 67 (AD[0]) or pin 203 (AD[31]) in 5 ns?
(3 ns + 1.9 to 2 ns natural clock skew = 4.9 ns to 5.0 ns realistic
Tsu)
Can a signal move that fast on Virtex/Spartan-II speed grade -6? (I
rather doubt it, from my experience.)
I know that Xilinx uses the special IRDY and TRDY pins in LogiCORE PCI,
but that doesn't seem to help FRAME#, since FRAME# has to be sampled
unregistered to determine the end of a burst transfer.
What kind of tricks is Xilinx using in their LogiCORE PCI other than
the special IRDY and TRDY pins?
Does anyone know?
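
The 4.9 ns to 5.0 ns figure in the P.S. is simple budget arithmetic: the setup requirement plus the clock delay gives the longest pad-to-flip-flop data path that still meets Tsu. A minimal Python sketch of that sum follows. It compares the budget with the logic-only delay of the speed grade -5 path reported earlier in the post. That comparison is for orientation only, since it says nothing about speed grade -6 parts or about how LogiCORE PCI is built.

```python
def data_path_budget_ps(tsu_ps, clock_delay_ps):
    """Longest data path delay that still meets Tsu, in picoseconds."""
    return tsu_ps + clock_delay_ps


# Figures from the P.S.: 3 ns plus 1.9 to 2.0 ns of clock delay.
budget_low_ps = data_path_budget_ps(3000, 1900)
budget_high_ps = data_path_budget_ps(3000, 2000)

# Logic-only delay (no routing at all) of the -5 frame_n path in the report.
LOGIC_ONLY_MINUS5_PS = 5372
overshoot_ps = LOGIC_ONLY_MINUS5_PS - budget_high_ps

print(budget_low_ps, budget_high_ps, overshoot_ps)
```

With the -5 numbers, the six levels of logic alone exceed the larger budget by 0.372 ns before any routing is added, which suggests a path of that depth would need fewer logic levels or faster cells to fit.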

Article: 37678
Subject: Re: FPGA-Conversion. IP Cores
From: kevinbraceusenet@hotmail.com (Kevin Brace)
Date: 18 Dec 2001 18:18:36 -0800

I don't claim to be an expert at all, but according to this EE Times
article, if the IP core came from an FPGA vendor like Xilinx or
Altera, you are pretty much stuck with their devices, unless the FPGA
vendor offers a conversion service (like Altera's HardCopy (started
recently) or Xilinx's HardWire (which if I am correct was discontinued
in 1999)).

http://www.eetimes.com/story/OEG20010907S0103


Another piece of bad news for conversion services is that Clear Logic
recently lost a key ruling against Altera.

 http://www.altera.com/corporate/press_box/releases/corporate/pr-wins_clear_logic.html


I sort of find the ruling troubling because assuming that an
Altera-made IP is not included in the customer's design, should anyone
have any control of the bit stream file you generated from Altera's
software?
I suppose that what Altera wants to say is that because the customer
had to agree to a license before using Altera software (like MAX+PLUS II
or Quartus), the customer has to use the generated bit stream file in a
way agreed in the software licensing agreement. However, recently
Clear Logic won a patent on their business model of converting a bit
stream file directly to an ASIC, and that business model seems to be
very similar to Altera's HardCopy, so I expect Clear Logic to sue
Altera soon.

http://www.ebnews.com/story/OEG20011108S0031



So seeing that IP cores from FPGA vendors have strings attached to
them, I think it will be safer to use a third party (non-device
vendor) IP core if FPGA-ASIC conversion is part of the requirement of
your application.




Kevin Brace (don't respond to me directly, respond within the
newsgroup)




arlington_sade@yahoo.com (arlington) wrote in message news:<63d93f75.0112160047.77f9982e@posting.google.com>...
> Hello all, 
> 
> If you were to use IP cores, such as Logicore/Alliance cores from
> Xilinx, Megafunction cores from Altera, Inventra cores from Mentor,
> etc. how can you get the RTL verilog/VHDL when you want to convert to
> ASIC ?
> 
> Thanks.

Article: 37679
Subject: Re: You take the low road and I'll ......
From: Ray Andraka <ray@andraka.com>
Date: Wed, 19 Dec 2001 03:08:02 GMT

Additionally, I find that a good floorplan will get you to darned close to the
performance you can reach doing hand routing with considerably less effort.  I'm
lazy, I don't want to do more work than is necessary to obtain a desired result
(and my customers surely don't want to pay for that last little bit unless there
is a darned good reason for it).  Nice thing about stopping at floorplanning is
that you can still have everything in the mainstream flow.

That said, it sure would be nice to be able to lock routing in a hard macro for
those few times when you really need it.

Austin Lesea wrote:

> Bryan,
>
> Reminds me of the Dilbert Cartoon where they are telling tales of their early
> programming years...
>
> "I remember using assembly code..."
>
> "That is nothing, I remember using 1's and 0's...."
>
> "You had zeroes?  Wow, we had to use 'lower case l's and upper case 'ohs'..."
>
> "Bunch of babies, I only had 1's!"
>
> Why we as engineers would enjoy pain, and brag about it still amazes me.
>
> A design that is well architected, self documented, commented, and reliable is
> more important to many customers.  I prefer to throw all of my energies into
> supporting those designs (in hdl's) which now account for 99% of what is being
> done out there.
>
> Austin
>
> Bryan wrote:
>
> > So lets talk controversial....
> >
> > If Lucent can support hard macros in Epic with hard routing, then why can't
> > Xilinx.  My application requires it and Xilinx doesn't support it in FPGA
> > editor(which was programmed by the same softies as Epic).  Oh, I remember
> > why they don't support it.  Because nobody cares about designs that push the
> > limitations of FPGAs.  Because everybody else that is making designs for
> > Xilinx parts is still in kindergarten finger painting with verilog and hdl.
> > Ha, I didn't get my EE degree to be a soft weirdo.  Anybody can throw code
> > together and get poor performance.
> >
> > flame away kindergarten kids
> >
> > Bryan
> >
> > "Peter Alfke" <peter.alfke@xilinx.com> wrote in message
> > news:3C1F8AEC.BFD2E067@xilinx.com...
> > > This is a friendly and helpful newsgroup, but let's make sure that it
> > > does not get abused.
> > > Lots of textbooks explain how to divide by a power of 2, where the
> > > remainder is, and how you sign-extend the MSB. Explaining that is not
> > > the purpose of this newsgroup.
> > >
> > > Let's use our "bandwidth" for more complex and perhaps controversial
> > > questions that are not explained in textbooks and data books.
> > >
> > > Peter Alfke, Xilinx Applications
> > >
> > >

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930     Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

 "They that give up essential liberty to obtain a little
  temporary safety deserve neither liberty nor safety."
                                          -Benjamin Franklin, 1759

Article: 37680
Subject: Re: You take the low road and I'll ......
From: "Austin Franklin" <austin@dark09room.com>
Date: Tue, 18 Dec 2001 22:23:56 -0500

> A design that is well architected, self documented, commented, and
> reliable is more important to many customers.  I prefer to throw all of
> my energies into supporting those designs (in hdl's) which now account
> for 99% of what is being done out there.

Well, Austin, that's all well and good...but one can do, in my opinion, an
equally well documented, if not better documented, and certainly a more
reliable design in schematics.  When the synthesis tools change, the design
changes.  That's hardly reliability.  That is, unless you instantiate
everything...but that's hardly synthesis...that's just a netlister.

I also believe it's erroneous to claim that HDL code is "self documenting";
that's not reality.  It takes care and time to correctly document a design.

If you make the FPGAs big enough and fast enough, synthesis works.  If you
make the CPUs fast enough, and have enough memory and disk space, Microsoft
code works.  Same thing.  I'm not saying it's right or wrong, but it is a
fact.

Regards,

Austin

Article: 37681
Subject: Re: Barrel shifter puts three 2->1 muxes / slice in Xilinx
From: "Carl Brannen" <carl.brannen@terabeam.com>
Date: Wed, 19 Dec 2001 04:05:32 +0000 (UTC)

Ray:

>2) The carry chain can also be used for a free doubler circuit.
> However, watch the timing.  There exist false paths (that are
> also quite slow comparatively speaking) introduced by the
> non-standard use of the carry chain (the chain connections
> are only used to the next neighbor, not all the way up the
> chain).  Timingwise, the conventional approach seems to yield
> better propagation delays in combinatorial only shifters, and
> considerably better times in fully pipelined shifters.  This
> is a good trick to put in your back pocket for those times
> where the need for density outweighs the needs of the  clock
> cycle.

This is true.  I have another barrel shift design that's based on the use
of the carry chain.  I'll post it to the thread later, but under a new topic
heading.  But the way I used the carry chain, the false path was a true one.
That is, under certain (rare?) circumstances, the carry chain may have to
propagate a signal from one end to the other.  I'll post an explanation when
I put up the VHDL for that barrel shifter (later tonight, maybe).  I think
this is enough cool circuits for one thread.

> 3) I'd be interested in seeing your layout solution.  The layout
> is not trivial to making this perform well.

No floor planning involved.  Here's the statistics when it's implemented by
itself with flip-flops on all inputs and outputs.  The design is a 16-input
barrel shifter, with 3 select inputs giving shifts from 0 to 7 bits.  The
design is a fall through, as is probably most efficient for this type of
barrel shifter.  It's placed and routed into a small VirtexE-8, which is a
speedy little part.  I put a clock period constraint on it of 5ns, but it only
got to 165MHz.  Still, this isn't bad for a fall through 16-wide barrel
shifter with no floor planning and no buffering on the control lines.  If I
get around to it, I'll convert the source from schematic to (readable) VHDL.
The reason it's in schematic form is because I hate to deal with RLOCs in
VHDL:

<<<
Design Information
------------------
Command Line   : map -p xcv50e-8-cs144 -o map.ncd arith.ngd arith.pcf 
Target Device  : xv50e
Target Package : cs144
Target Speed   : -8
Mapper Version : virtexe -- D.27
Mapped Date    : Tue Dec 18 19:57:47 2001

Design Summary
--------------
   Number of errors:      0
   Number of warnings:    1
   Number of Slices:                 36 out of    768    4%
   Number of Slices containing
      unrelated logic:                0 out of     36    0%
   Number of Slice Flip Flops:       51 out of  1,536    3%
   Number of 4 input LUTs:           36 out of  1,536    2%
   Number of bonded IOBs:            35 out of     94   37%
   Number of GCLKs:                   1 out of      4   25%
   Number of GCLKIOBs:                1 out of      4   25%
Total equivalent gate count for design:  696
Additional JTAG gate count for IOBs:  1,728
>>>

<<<
The Number of signals not completely routed for this design is: 0

   The Average Connection Delay for this design is:        0.885 ns
   The Maximum Pin Delay is:                               2.310 ns
   The Average Connection Delay on the 10 Worst Nets is:   1.645 ns
...

--------------------------------------------------------------------------------
  Constraint                                | Requested  | Actual     | Logic 
                                            |            |            | Levels

--------------------------------------------------------------------------------
* NET "CLK" PERIOD =  5 nS   LOW 50.000 %   | 5.000ns    | 6.036ns    | 4    

--------------------------------------------------------------------------------
>>>

<<<
Constraints cover 276 paths, 0 nets, and 184 connections (92.0% coverage)

Design statistics:
   Minimum period:   6.036ns (Maximum frequency: 165.673MHz)

Analysis completed Tue Dec 18 19:58:13 2001

--------------------------------------------------------------------------------
>>>
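
For readers who want to see what a 16-input shifter with 3 select inputs computes, here is a minimal behavioural sketch in Python. It uses three cascaded stages of 2:1 selection that shift by 1, 2 and 4 bits, which is one common way to build such a shifter. It is not the schematic described in the post. The post does not state the shift direction or the fill bits, so a left shift with zero fill is assumed here, and the function name is hypothetical.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def barrel_shift_left(value, shift):
    """Shift a 16-bit value left by 0..7 bits using three 2:1 mux stages.

    Assumption: logical left shift with zero fill.
    """
    if not 0 <= shift <= 7:
        raise ValueError("shift must be in 0..7 (3 select inputs)")
    value &= MASK
    for stage in range(3):
        select = (shift >> stage) & 1            # one select input per stage
        shifted = (value << (1 << stage)) & MASK  # this stage shifts by 1, 2 or 4
        value = shifted if select else value     # 2:1 mux on every bit
    return value


print(hex(barrel_shift_left(0x0081, 5)))
```

Each stage is purely combinational, which matches the "fall through" description: the result appears after the three mux delays, with no registers in between.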

Carl


-- 
Posted from firewall.terabeam.com [216.137.15.2] 
via Mailgate.ORG Server - http://www.Mailgate.ORG

Article: 37682
Subject: Re: You take the low road and I'll ......
From: "Kevin Neilson" <kevin_neilson@removethis-yahoo.com>
Date: Wed, 19 Dec 2001 04:19:32 GMT

I agree:  one can get large improvements over the placer by hand-placement,
but hand-routing rarely provides an improvement and the tiny gains are
hardly worth the very painful effort.

"Ray Andraka" <ray@andraka.com> wrote in message
news:3C2004E5.C4F9C8F8@andraka.com...
> Additionally, I find that a good floorplan will get you to darned close to
> the performance you can reach doing hand routing with considerably less
> effort.  I'm lazy, I don't want to do more work than is necessary to obtain
> a desired result (and my customers surely don't want to pay for that last
> little bit unless there is a darned good reason for it).  Nice thing about
> stopping at floorplanning is that you can still have everything in the
> mainstream flow.
>
> That said, it sure would be nice to be able to lock routing in a hard macro
> for those few times when you really need it.
>
> Austin Lesea wrote:
>
> > Bryan,
> >
> > Reminds me of the Dilbert Cartoon where they are telling tales of their
> > early programming years...
> >
> > "I remember using assembly code..."
> >
> > "That is nothing, I remember using 1's and 0's...."
> >
> > "You had zeroes?  Wow, we had to use 'lower case l's and upper case
> > 'ohs'..."
> >
> > "Bunch of babies, I only had 1's!"
> >
> > Why we as engineers would enjoy pain, and brag about it still amazes me.
> >
> > A design that is well architected, self documented, commented, and
> > reliable is more important to many customers.  I prefer to throw all of
> > my energies into supporting those designs (in hdl's) which now account
> > for 99% of what is being done out there.
> >
> > Austin
> >
> > Bryan wrote:
> >
> > > So lets talk controversial....
> > >
> > > If Lucent can support hard macros in Epic with hard routing, then why
> > > can't Xilinx.  My application requires it and Xilinx doesn't support it
> > > in FPGA editor(which was programmed by the same softies as Epic).  Oh, I
> > > remember why they don't support it.  Because nobody cares about designs
> > > that push the limitations of FPGAs.  Because everybody else that is
> > > making designs for Xilinx parts is still in kindergarten finger painting
> > > with verilog and hdl.  Ha, I didn't get my EE degree to be a soft
> > > weirdo.  Anybody can throw code together and get poor performance.
> > >
> > > flame away kindergarten kids
> > >
> > > Bryan
> > >
> > > "Peter Alfke" <peter.alfke@xilinx.com> wrote in message
> > > news:3C1F8AEC.BFD2E067@xilinx.com...
> > > > This is a friendly and helpful newsgroup, but let's make sure that it
> > > > does not get abused.
> > > > Lots of textbooks explain how to divide by a power of 2, where the
> > > > remainder is, and how you sign-extend the MSB. Explaining that is not
> > > > the purpose of this newsgroup.
> > > >
> > > > Let's use our "bandwidth" for more complex and perhaps controversial
> > > > questions that are not explained in textbooks and data books.
> > > >
> > > > Peter Alfke, Xilinx Applications
> > > >
> > > >
>
> --
> --Ray Andraka, P.E.
> President, the Andraka Consulting Group, Inc.
> 401/884-7930     Fax 401/884-7950
> email ray@andraka.com
> http://www.andraka.com
>
>  "They that give up essential liberty to obtain a little
>   temporary safety deserve neither liberty nor safety."
>                                           -Benjamin Franklin, 1759
>
>

Article: 37683
Subject: Re: Kindergarten Stuff
From: "Kevin Neilson" <kevin_neilson@removethis-yahoo.com>
Date: Wed, 19 Dec 2001 04:38:30 GMT

I don't know if we should discourage legitimate questions, but it seems like
there is an inordinate amount of traffic from lazy students, of the form:

"Hi, I need to make a VHDL program that divides by 4, how exactly would that
look, line for line?  Hurry because it's due Friday."

I hope such people aren't actually getting degrees by having the older kids
do their homework.

-Kevin

"Peter Alfke" <peter.alfke@xilinx.com> wrote in message
news:3C1F8AEC.BFD2E067@xilinx.com...
> This is a friendly and helpful newsgroup, but let's make sure that it does
> not get abused.
> Lots of textbooks explain how to divide by a power of 2, where the
> remainder is, and how you sign-extend the MSB. Explaining that is not the
> purpose of this newsgroup.
>
> Let's use our "bandwidth" for more complex and perhaps controversial
> questions that are not explained in textbooks and data books.
>
> Peter Alfke, Xilinx Applications
>
>

Article: 37684
Subject: Re: Kindergarten Stuff
From: "Carl Brannen" <carl.brannen@terabeam.com>
Date: Wed, 19 Dec 2001 04:41:39 +0000 (UTC)

"Bryan" <bryan@srccomp.com> wrote in message
news:3c1fc73b$0$22747$724ebb72@reader2.ash.ops.us.uu.net...

> So lets talk controversial....
> 
> If Lucent can support hard macros in Epic with hard routing, then why can't
> Xilinx.  My application requires it and Xilinx doesn't support it in FPGA
> editor(which was programmed by the same softies as Epic).

Just curious, which constraint is it that is driving you to hard macros?  I've
long wished for better support for the little things.  I haven't got it, so
I've drifted towards optimizing VHDL to get what I want.

> Oh, I remember
> why they don't support it.  Because nobody cares about designs that push the
> limitations of FPGAs.  Because everybody else that is making designs for
> Xilinx parts is still in kindergarten finger painting with verilog and hdl.

My guess is that the volume users of Xilinx chips do care a lot about
performance.  If all we're doing is emulating what will be full custom at
1/10th speed, then the kindergarten is the way to go.  But for designs that go
out in volume and want to capture that incredible ease of reprogrammability,
but have to worry about a BOM, performance is the only thing.  Remember back
in the days of XACT and XC2064s when it was still possible to implement your
logic in FPGA editor?  Ah, those were the days.

Carl


-- 
Posted from firewall.terabeam.com [216.137.15.2] 
via Mailgate.ORG Server - http://www.Mailgate.ORG

Article: 37685
Subject: DCM stability in Virtex2 -ES
From: David Miller <spam@quartz.net.nz>
Date: Wed, 19 Dec 2001 17:49:28 +1300

Hi,

Here's a problem that does not qualify as kindergarten stuff.

A DCM in a xc2v1000 engineering sample is behaving oddly.  I am using 
this DCM to generate and deskew a clock and antiphased clock (0 & 180 
degrees) for a DDR RAM.  It is being fed by a 100MHz crystal buffered 
with a CY2308 zero-delay clock buffer (ie, a pll.)

The period of the clock seen at the board is 10 ns - usually. 
Occasionally, the clock period is lengthened by between 2.5 and 3 ns. 
The stretch is always in the low half of the cycle.

What could be causing this?

Relevant facts:
	- Chip is a xc2v1000bg456-4 -ES
	- Software is Xilinx Alliance 3.3.08i with the VirtexII device update
	  files and the bitgen patch applied.
	- SelectIO type HSTL class II with DCI is being used
	
I used a 1 gigahertz (5 gigasample) oscilloscope to look at this clock 
pin, so sampling error isn't it.  The oscilloscope showed that sometimes 
(perhaps 1 of 4 stretched periods), the clock starts to go high at the 
right time, then changes its mind literally halfway through the rise 
time -- it gets to halfway between Vol and Voh (or Vref; this is HSTL) 
and then returns to Vol again.  Between 2.5 and 3 ns later, it does a 
proper rising edge.
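
If the scope can export rising-edge timestamps, stretched periods like the ones described above can be picked out mechanically. The following is a minimal Python sketch under that assumption. The function name, the tolerance and the sample timestamps are all invented for illustration and are not measurements from this board.

```python
def stretched_periods(rising_edges_ns, nominal_ns=10.0, tolerance_ns=0.5):
    """Return (index, period_ns) for every period outside nominal +/- tolerance."""
    periods = [later - earlier
               for earlier, later in zip(rising_edges_ns, rising_edges_ns[1:])]
    return [(index, period) for index, period in enumerate(periods)
            if abs(period - nominal_ns) > tolerance_ns]


# Invented example: the third period is stretched by 2.75 ns.
example_edges_ns = [0.0, 10.0, 20.0, 32.75, 42.75]
flagged = stretched_periods(example_edges_ns)
print(flagged)
```

Logging the index of each flagged period would also show whether the stretches recur at a regular interval, which may help narrow down the cause.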

I've even tried locking the DCM to different parts of the chip and it 
made no difference.

I've searched the Xilinx Answers database and found no clues.  Any
suggestions gratefully received!

-- 
David Miller, BCMS (Hons)  | When something disturbs you, it isn't the
Endace Measurement Systems | thing that disturbs you; rather, it is
Mobile: +64-21-704-djm     | your judgement of it, and you have the
Fax:    +64-21-304-djm     | power to change that.  -- Marcus Aurelius

Article: 37686
Subject: Re: Divide by 3, with remainder, efficient and fast, for Altera or Xilinx
From: "Carl Brannen" <carl.brannen@terabeam.com>
Date: Wed, 19 Dec 2001 04:58:05 +0000 (UTC)

Performance of the above divide-by-3 circuit.  The period objective is 5ns,
and all inputs and outputs are registered (with internal flip-flops).
The thing actually achieved 103.907MHz, which is very good for a
fall-through divide-by-3 circuit.  No floor planning.  The part is
a small VirtexE-8:

<<<
Xilinx Mapping Report File for Design 'divide'
Copyright (c) 1995-2000 Xilinx, Inc.  All rights reserved.

Design Information
------------------
Command Line   : map -p xcv50e-8-cs144 -o map.ncd divide.ngd divide.pcf 
Target Device  : xv50e
Target Package : cs144
Target Speed   : -8
Mapper Version : virtexe -- D.27
Mapped Date    : Tue Dec 18 20:51:33 2001

Design Summary
--------------
   Number of errors:      0
   Number of warnings:    1
   Number of Slices:                 76 out of    768    9%
   Number of Slices containing
      unrelated logic:                0 out of     76    0%
   Number of Slice Flip Flops:       98 out of  1,536    6%
   Number of 4 input LUTs:           85 out of  1,536    5%
   Number of bonded IOBs:            65 out of     94   69%
   Number of GCLKs:                   1 out of      4   25%
   Number of GCLKIOBs:                1 out of      4   25%
Total equivalent gate count for design:  1,294
Additional JTAG gate count for IOBs:  3,168
>>>

<<<
The Number of signals not completely routed for this design is: 0

   The Average Connection Delay for this design is:        0.836 ns
   The Maximum Pin Delay is:                               2.583 ns
   The Average Connection Delay on the 10 Worst Nets is:   1.968 ns



--------------------------------------------------------------------------------
  Constraint                                | Requested  | Actual     | Logic 
                                            |            |            | Levels

--------------------------------------------------------------------------------
* NET "CLK" PERIOD =  5 nS   LOW 50.000 %   | 5.000ns    | 9.624ns    | 7    

--------------------------------------------------------------------------------

1 constraint not met.
Dumping design to file divide.ncd.
>>>

<<<
Constraints cover 8188 paths, 0 nets, and 447 connections (93.1% coverage)

Design statistics:
   Minimum period:   9.624ns (Maximum frequency: 103.907MHz)

Analysis completed Tue Dec 18 20:52:40 2001
>>>
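
For anyone checking the figures in the reports above, here is a minimal sketch of the arithmetic: the maximum frequency is just the reciprocal of the minimum period, and the miss is the difference between the actual and requested period. The function names are illustrative only and do not come from any Xilinx tool.

```python
def max_frequency_mhz(min_period_ns):
    """Convert a minimum clock period in ns to a maximum frequency in MHz."""
    return 1000.0 / min_period_ns


def period_miss_ns(requested_ns, actual_ns):
    """Positive result: the PERIOD constraint was missed by that many ns."""
    return actual_ns - requested_ns


if __name__ == "__main__":
    # Numbers taken from the place-and-route and timing reports above.
    print(f"{max_frequency_mhz(9.624):.3f} MHz")
    print(f"{period_miss_ns(5.000, 9.624):.3f} ns over the 5 ns target")
```

So the 5 ns target (200 MHz) was missed by a wide margin, even though the resulting clock rate is respectable for seven logic levels.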

Carl


-- 
Posted from firewall.terabeam.com [216.137.15.2] 
via Mailgate.ORG Server - http://www.Mailgate.ORG

Article: 37687
Subject: Re: Default Should Be "Inputs and Outputs" For IOBs
From: David Miller <spam@quartz.net.nz>
Date: Wed, 19 Dec 2001 18:14:21 +1300
Links: << >> << T >> << A >>

> ALWAYS
> 
>>want my designs to use IOB flip flops if possible.  It seems to me that


> That's what you get for using Design Mangler...er...Manager ;-)


Heh.  I find that make does a fair job of managing builds.  But then, I 
always did find CLIs more user friendly than GUIs.

Even if you invoke map from the command line, or by any means other than 
through DM, flops are not packed into I/Os unless the -pr flag is supplied.
So I suppose DM is following the defaults of map.

M. Ramirez's question still holds good -- is there ever a reason not to 
pack flops into IOBs?


Article: 37688
Subject: Re: You take the low road and I'll ......
From: Ray Andraka <ray@andraka.com>
Date: Wed, 19 Dec 2001 05:24:50 GMT
Links: << >> << T >> << A >>

Oops, I meant to say it would be nice to be able to lock routing within the normal
design flow for those few times when it is needed.

Ray Andraka wrote:

>
> That said, it sure would be nice to be able to lock routing in a hard macro for
> those few times when you really need it.
>

--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930     Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

 "They that give up essential liberty to obtain a little
  temporary safety deserve neither liberty nor safety."
                                          -Benjamin Franklin, 1759

Article: 37689
Subject: Re: Barrel shifter puts three 2->1 muxes / slice in Xilinx
From: Ray Andraka <ray@andraka.com>
Date: Wed, 19 Dec 2001 05:37:11 GMT
Links: << >> << T >> << A >>

We don't use them in a fall through configuration very often.  As a point of
reference, we have one in a 160 MHz design in a VirtexE-6 that does a 0 to 15
position shift with a 2 clock latency, including rounding at the output.  It is 19
bits wide at the output.  It is not the critical path in the design.  IIRC, there
are 3 levels of conventional shifter, a register, then the last layer along with a
carry chain for the round.  That one is VHDL with the RLOCs in the code.  We prefer
VHDL for placed designs now because of the capability of the generate statement (the
design I was describing is a parameterized generate so it can take arbitrary input
and output widths as well as number of clocks of latency).  Off-hand, I think with
careful layout you could get a 3 layer fall through design using the conventional
approach well above 200 MHz in an E-8.

Carl Brannen wrote:

> Ray:
>
> >2) The carry chain can also be used for a free doubler circuit.
> > However, watch the timing.  There exist false paths (that are
> > also quite slow comparatively speaking) introduced by the
> > non-standard use of the carry chain (the chain connections
> > are only used to the next neighbor, not all the way up the
> > chain).  Timingwise, the conventional approach seems to yield
> > better propagation delays in combinatorial only shifters, and
> > considerably better times in fully pipelined shifters.  This
> > is a good trick to put in your back pocket for those times
> > where the need for density outweighs the needs of the  clock
> > cycle.
>
> This is true.  I have another barrel shift design that's based on the use
> of the carry chain.  I'll post it to the thread later, but as a new topic
> head.  But the way I used the Carry chain the false path was a true one.
> That is, under certain (rare?) circumstances, the carry chain may have to
> propagate a signal from one end to the other.  I'll post an explanation when
> I put up the VHDL for that barrel shifter (later tonight, maybe).  I think
> this is enough cool circuits for one thread.
>
> > 3) I'd be interested in seeing your layout solution.  The layout
> > is not trivial to making this perform well.
>
> No floor planning involved.  Here's the statistics when it's implemented by
> itself with flip-flops on all inputs and outputs.  The design is a 16-input
> barrel shifter, with 3 select inputs giving shifts from 0 to 7 bits.  The
> design is a fall through, as is probably most efficient for this type of
> barrel shifter.  It's placed and routed into a small VirtexE-8, which is a
> speedy little part.  I put a clock period constraint on it of 5ns, but it only
> got to 165MHz.  Still, this isn't bad for a fall through 16-wide barrel
> shifter with no floor planning and no buffering on the control lines.  If I
> get around to it, I'll convert the source from schematic to (readable) VHDL.
> The reason it's in schematic form is because I hate to deal with RLOCs in
> VHDL:
>
> <<<
> Design Information
> ------------------
> Command Line   : map -p xcv50e-8-cs144 -o map.ncd arith.ngd arith.pcf
> Target Device  : xv50e
> Target Package : cs144
> Target Speed   : -8
> Mapper Version : virtexe -- D.27
> Mapped Date    : Tue Dec 18 19:57:47 2001
>
> Design Summary
> --------------
>    Number of errors:      0
>    Number of warnings:    1
>    Number of Slices:                 36 out of    768    4%
>    Number of Slices containing
>       unrelated logic:                0 out of     36    0%
>    Number of Slice Flip Flops:       51 out of  1,536    3%
>    Number of 4 input LUTs:           36 out of  1,536    2%
>    Number of bonded IOBs:            35 out of     94   37%
>    Number of GCLKs:                   1 out of      4   25%
>    Number of GCLKIOBs:                1 out of      4   25%
> Total equivalent gate count for design:  696
> Additional JTAG gate count for IOBs:  1,728
> >>>
>
> <<<
> The Number of signals not completely routed for this design is: 0
>
>    The Average Connection Delay for this design is:        0.885 ns
>    The Maximum Pin Delay is:                               2.310 ns
>    The Average Connection Delay on the 10 Worst Nets is:   1.645 ns
> ...
>
> --------------------------------------------------------------------------------
>   Constraint                                | Requested  | Actual     | Logic
>                                             |            |            | Levels
>
> --------------------------------------------------------------------------------
> * NET "CLK" PERIOD =  5 nS   LOW 50.000 %   | 5.000ns    | 6.036ns    | 4
>
> --------------------------------------------------------------------------------
> >>>
>
> <<<
> Constraints cover 276 paths, 0 nets, and 184 connections (92.0% coverage)
>
> Design statistics:
>    Minimum period:   6.036ns (Maximum frequency: 165.673MHz)
>
> Analysis completed Tue Dec 18 19:58:13 2001
>
> --------------------------------------------------------------------------------
> >>>
>
> Carl
>
> --
> Posted from firewall.terabeam.com [216.137.15.2]
> via Mailgate.ORG Server - http://www.Mailgate.ORG


Article: 37690
Subject: How can I reduce Spartan-II routing delays to meet 33MHz PCI's Tsu < 7
From: Kevin Brace <kevinbraceusenetkillspam@hotmail.com.killspam>
Date: Tue, 18 Dec 2001 23:48:10 -0600
Links: << >> << T >> << A >>

Hi, I would like to know if anyone knows strategies for reducing
routing (net) delays on Spartan-II.
So far, I have treated the synthesis tool (XST)/Map/Par as a black box,
but because my design (a PCI IP core) was not meeting Tsu (Tsu < 7 ns), I
started to take a closer look at how LUTs are placed on the FPGA.
Using Floorplanner, I saw the LUTs being placed all over the FPGA, so I
decided to hand place the LUTs using the UCF flow.
That was the most effective thing I did to reduce interconnect delay
(it reduced the worst interconnect delay by about 2.7 ns (11 ns down to
8.3 ns)), but unfortunately, I still have to reduce the interconnect
delay by 1.3 ns (worst Tsu currently at 8.3 ns).
Basically, I have two input signals, FRAME# and IRDY#, that are not
meeting timing.
Here are the worst violators for FRAME# and IRDY#, respectively.



________________________________________________________________________________


================================================================================
 Timing constraint: COMP "frame_n" OFFSET = IN 7 nS  BEFORE COMP "clk" ;

 503 items analyzed, 61 timing errors detected.
 Minimum allowable offset is   8.115ns.

--------------------------------------------------------------------------------
Slack:                  -1.115ns (requirement - (data path - clock path
- clock arrival))
  Source:               frame_n
  Destination:          PCI_IP_Core_Instance_ad_Port_2
  Destination Clock:    clk_BUFGP rising at 0.000ns
  Requirement:          7.000ns
  Data Path Delay:      10.556ns (Levels of Logic = 6)
  Clock Path Delay:     2.441ns (Levels of Logic = 2)
  Data Path: frame_n to PCI_IP_Core_Instance_ad_Port_2
    Delay type         Delay(ns)  Logical Resource(s)
    ----------------------------  -------------------
    Tiopi                 1.224   frame_n
                                  frame_n_IBUF
    net (fanout=45)       0.591   frame_n_IBUF
    Tilo                  0.653   PCI_IP_Core_Instance_I_25_LUT_7
    net (fanout=3)        0.683   N21918
    Tbxx                  0.981   PCI_IP_Core_Instance_I_XXL_1357_1
    net (fanout=15)       2.352   PCI_IP_Core_Instance_I_XXL_1357_1
    Tilo                  0.653   PCI_IP_Core_Instance_I_125_LUT_17
    net (fanout=1)        0.749   PCI_IP_Core_Instance_N3059
    Tilo                  0.653   PCI_IP_Core_Instance_I__n0055
    net (fanout=1)        0.809   PCI_IP_Core_Instance_N3069
    Tioock                1.208   PCI_IP_Core_Instance_ad_Port_2
    ----------------------------  ------------------------------
    Total                10.556ns (5.372ns logic, 5.184ns route)
                                  (50.9% logic, 49.1% route)

   Clock Path: clk to PCI_IP_Core_Instance_ad_Port_2
   Delay type         Delay(ns)  Logical Resource(s)
   ----------------------------  -------------------
    Tgpio                 1.082   clk
                                  clk_BUFGP/IBUFG
    net (fanout=1)        0.007   clk_BUFGP/IBUFG
    Tgio                  0.773   clk_BUFGP/BUFG
    net (fanout=423)      0.579   clk_BUFGP
    ----------------------------  ------------------------------
    Total                 2.441ns (1.855ns logic, 0.586ns route)
                                  (76.0% logic, 24.0% route)


--------------------------------------------------------------------------------



================================================================================
Timing constraint: COMP "irdy_n" OFFSET = IN 7 nS  BEFORE COMP "clk" ;

 698 items analyzed, 74 timing errors detected.
 Minimum allowable offset is   8.290ns.

--------------------------------------------------------------------------------
Slack:                  -1.290ns (requirement - (data path - clock path
- clock arrival))
  Source:               irdy_n
  Destination:          PCI_IP_Core_Instance_ad_Port_2
  Destination Clock:    clk_BUFGP rising at 0.000ns
  Requirement:          7.000ns
  Data Path Delay:      10.731ns (Levels of Logic = 6)
  Clock Path Delay:     2.441ns (Levels of Logic = 2)
  Data Path: irdy_n to PCI_IP_Core_Instance_ad_Port_2
    Delay type         Delay(ns)  Logical Resource(s)
    ----------------------------  -------------------
    Tiopi                 1.224   irdy_n
                                  irdy_n_IBUF
    net (fanout=138)      0.766   irdy_n_IBUF
    Tilo                  0.653   PCI_IP_Core_Instance_I_25_LUT_7
    net (fanout=3)        0.683   N21918
    Tbxx                  0.981   PCI_IP_Core_Instance_I_XXL_1357_1
    net (fanout=15)       2.352   PCI_IP_Core_Instance_I_XXL_1357_1
    Tilo                  0.653   PCI_IP_Core_Instance_I_125_LUT_17
    net (fanout=1)        0.749   PCI_IP_Core_Instance_N3059
    Tilo                  0.653   PCI_IP_Core_Instance_I__n0055
    net (fanout=1)        0.809   PCI_IP_Core_Instance_N3069
    Tioock                1.208   PCI_IP_Core_Instance_ad_Port_2
    ----------------------------  ------------------------------
    Total                10.731ns (5.372ns logic, 5.359ns route)
                                  (50.1% logic, 49.9% route)

  Clock Path: clk to PCI_IP_Core_Instance_ad_Port_2
    Delay type         Delay(ns)  Logical Resource(s)
    ----------------------------  -------------------
    Tgpio                 1.082   clk
                                  clk_BUFGP/IBUFG
    net (fanout=1)        0.007   clk_BUFGP/IBUFG
    Tgio                  0.773   clk_BUFGP/BUFG
    net (fanout=423)      0.579   clk_BUFGP
    ----------------------------  ------------------------------
    Total                 2.441ns (1.855ns logic, 0.586ns route)
                                  (76.0% logic, 24.0% route)


--------------------------------------------------------------------------------


Timing summary:
---------------

Timing errors: 135  Score: 55289

Constraints cover 27511 paths, 0 nets, and 4835 connections (92.1%
coverage)

________________________________________________________________________________
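
The slack formula printed in the report can be checked by hand. The sketch below simply plugs the reported numbers into "requirement - (data path - clock path - clock arrival)" and also sums the net delays to show how much of each path is routing. The function and variable names are made up for illustration and are not part of the timing analyzer.

```python
def offset_in_slack(requirement_ns, data_path_ns, clock_path_ns,
                    clock_arrival_ns=0.0):
    """Slack as defined in the report: negative means the OFFSET IN fails."""
    return requirement_ns - (data_path_ns - clock_path_ns - clock_arrival_ns)


def route_share(net_delays_ns, total_path_ns):
    """Fraction of the data path spent in routing (net) delays."""
    return sum(net_delays_ns) / total_path_ns


# Net delays copied from the two data paths in the report above.
FRAME_NETS = [0.591, 0.683, 2.352, 0.749, 0.809]
IRDY_NETS = [0.766, 0.683, 2.352, 0.749, 0.809]

if __name__ == "__main__":
    for name, data_path, nets in (("frame_n", 10.556, FRAME_NETS),
                                  ("irdy_n", 10.731, IRDY_NETS)):
        slack = offset_in_slack(7.0, data_path, 2.441)
        share = route_share(nets, data_path)
        print(f"{name}: slack {slack:.3f} ns, routing {share:.1%}, "
              f"largest net {max(nets):.3f} ns")
```

In both paths the single largest contributor is the 2.352 ns net with a fanout of 15, which by itself is larger than the shortfall on either path.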


Locations of various resources:

FRAME#: pin 23
IRDY#:  pin 24
AD[2]:  pin 62
PCI_IP_Core_Instance_I_25_LUT_7: CLB_R12C1.s1
PCI_IP_Core_Instance_I_XXL_1357_1: CLB_R12C2
PCI_IP_Core_Instance_I_125_LUT_17: CLB_R23C9.s0
PCI_IP_Core_Instance_I__n0055: CLB_R24C9.s0



Input signals other than FRAME# and IRDY# all meet the Tsu < 7 ns
requirement, and because I have now figured out how to use IOB FFs, I
can easily meet Tval < 11 ns (Tco) for all output signals.
I am using Xilinx ISE WebPack 4.1 (which doesn't come with FPGA Editor),
and the PCI IP core is written in Verilog.
The device I am targeting is a Xilinx Spartan-II 150K system gate, speed
grade -5 part (XC2S150-5CPQ208). I did meet all 33MHz PCI timings with
the speed grade -6 part (XC2S150-6CPQ208) when I resynthesized the PCI
IP core for that part and basically reused the same UCF file with the
floorplan (I had to make small modifications to the UCF file because
some of the LUT names changed).
The reason I really care about the speed grade -5 part is that it is the
chip on the PCI prototype board of the Insight Electronics Spartan-II
Development Kit.
Yes, I wish the PCI prototype board came with speed grade -6 . . .
Because I want the PCI IP core to be portable across different platforms
(most notably Xilinx and Altera FPGAs), I am not really interested in
making any vendor-specific modifications to my Verilog RTL code, but I
don't mind using various tricks in the .UCF file (for Xilinx) or .ACF
file (I believe that is the Altera equivalent of the Xilinx .UCF file).
Here are some solutions I came up with.


1) Reduce the signal fanout (Currently at 35 globally, but FRAME# and
IRDY#'s fanout are 200. What number should I reduce the global fanout
to?).

2) Use USELOWSKEWLINES in a UCF file (already tried on some long
routings, but didn't seem to help. I will try to play around with this
option a little more with different signals.).

3) Floorplan all the LUTs and FFs on the FPGA (currently, I only
floorplanned the LUTs that violated Tsu, and most of them take inputs
from FRAME# and IRDY#.).

4) Use Guide file Leverage mode in Map and Par.

5) Try routing my design 2000 times (That will take several days . . . I
once routed my design about 20 times, and Par seemed to get stuck in a
certain Timing Score range.).

6) Pay for ISE Foundation 4.1 (I don't want to pay for tools because I
am poor), and use FPGA Editor (I wish ISE WebPack came with FPGA
Editor.). At least from FPGA Editor, I can see how the signals are
actually getting routed.

7) Use a synthesis tool other than XST (I am poor, so I doubt that I can
afford one.).


I would like to hear from anyone who can comment on the solutions I just
listed, or who has other suggestions on what I can do to reduce the
delays to meet 33MHz PCI's Tsu < 7 ns requirement.




Thanks,



Kevin Brace (don't respond to me directly, respond within the newsgroup)




P.S.  Considering that I am struggling to meet 33MHz PCI timings with
Spartan-II speed grade -5, how does Xilinx meet 66MHz PCI timings on
Virtex/Spartan-II speed grade -6? (I can only barely meet 33MHz PCI
timings with Spartan-II speed grade -6 using Floorplanner.)
Is it possible for a signal to come in through an input pin like FRAME#
or IRDY# (pin 23 and pin 24 respectively for Spartan-II PQ208), go
through a few levels of LUTs, and reach a far away IOB output FF or
tri-state control FF like pin 67 (AD[0]) or pin 203 (AD[31]) in 5 ns?
(3 ns + 1.9 to 2 ns natural clock skew = 4.9 ns to 5.0 ns realistic Tsu)
Can a signal move that fast on Virtex/Spartan-II speed grade -6? (I sort
of doubt it from my experience.)
I know that Xilinx uses the special IRDY and TRDY pins in LogiCORE PCI,
but that doesn't seem to help FRAME#, since FRAME# has to be sampled
unregistered to determine the end of a burst transfer.
What kind of tricks is Xilinx using in their LogiCORE PCI other than the
special IRDY and TRDY pins?
Does anyone know?

Article: 37691
Subject: Re: FPGA-Conversion. IP Cores
From: Kevin Brace <kevinbraceusenetkillspam@hotmail.com.killspam>
Date: Tue, 18 Dec 2001 23:53:13 -0600
Links: << >> << T >> << A >>

I don't claim to be an expert at all, but according to this EE Times
article, if the IP core came from an FPGA vendor like Xilinx or Altera,
you are pretty much stuck with their devices, unless the FPGA vendor
offers a conversion service (like Altera's HardCopy (started recently)
or Xilinx's HardWire (which if I am correct was discontinued in 1999)).

http://www.eetimes.com/story/OEG20010907S0103


More bad news for conversion services is that Clear Logic recently
lost a key ruling against Altera.


http://www.altera.com/corporate/press_box/releases/corporate/pr-wins_clear_logic.html


I find the ruling somewhat troubling because, assuming that no
Altera-made IP is included in the customer's design, should anyone else
have any control over the bit stream file you generated with Altera's
software?
I suppose that what Altera wants to say is that because the customer had
to agree to a license before using Altera software (like MAX+PLUS II or
Quartus), the customer has to use the generated bit stream file in the
way the software licensing agreement allows.
However, Clear Logic recently received a patent on their business model
of converting a bit stream file directly to an ASIC, and that business
model seems to be very similar to Altera's HardCopy, so I expect Clear
Logic to sue Altera soon.

http://www.ebnews.com/story/OEG20011108S0031



So seeing that IP cores from FPGA vendors have strings attached to them,
I think it will be safer to use a third party (non-device vendor) IP
core if FPGA-ASIC conversion is part of the requirement of your
application.




Kevin Brace (don't respond to me directly, respond within the newsgroup)




arlington_sade@yahoo.com (arlington) wrote in message
news:<63d93f75.0112160047.77f9982e@posting.google.com>...
> Hello all,
>
> If you were to use IP cores, such as Logicore/Alliance cores from
> Xilinx, Megafunction cores from Altera, Inventra cores from Mentor,
> etc. how can you get the RTL verilog/VHDL when you want to convert to
> ASIC ?
>
> Thanks.

Article: 37692
Subject: Google Groups problems?
From: Kevin Brace <kevinbraceusenetkillspam@hotmail.com.killspam>
Date: Tue, 18 Dec 2001 23:59:24 -0600
Links: << >> << T >> << A >>

I know this topic is not directly related to this newsgroup, but I
noticed that some postings I made to comp.arch.fpga through Google Groups
showed up in Google Groups' comp.arch.fpga archive, but weren't there
when I checked for them through mailgate.org's service
(http://www.mailgate.org).
Has anyone else noticed this problem?



Kevin Brace (don't respond to me directly, respond within the newsgroup)

Article: 37693
Subject: Re: Kindergarten Stuff
From: Steve Underwood <steveu@dis.org>
Date: Wed, 19 Dec 2001 14:16:27 +0800
Links: << >> << T >> << A >>

I think that depends on the type of degree. Wouldn't finding a way to 
get someone else to do the hard work for you be considered reasonable 
grounds for being awarded an MBA?

Kevin Neilson wrote:

> I don't know if we should discourage legitimate questions, but it seems like
> there is an inordinate amount of traffic from lazy students, of the form:
> 
> "Hi, I need to make a VHDL program that divides by 4, how exactly would that
> look, line for line?  Hurry because it's due Friday."
> 
> I hope such people aren't actually getting degrees by having the older kids
> do their homework.
> 
> -Kevin

Article: 37694
Subject: the effect of syn_maxfan
From: shengyu_shen@hotmail.com (ssy)
Date: 18 Dec 2001 23:21:19 -0800
Links: << >> << T >> << A >>

Hi everyone

I have a newbie question:

I added /*synthesis syn_maxfan=20*/ to many wire, reg, output, and input
declarations. Some of them take effect, but some do not. Why?

If I add this attribute to a wire that also drives loads at a higher
level of the hierarchy, will the attribute take effect for those
higher-level loads?

Article: 37695
Subject: Re: Xilinx Foundation - Routing constraints/prohibit
From: Ray Andraka <ray@andraka.com>
Date: Wed, 19 Dec 2001 07:54:04 GMT
Links: << >> << T >> << A >>

It is very useful (and necessary) for truly modular design.  It is one of
the key pieces missing in the current PAR flow.  There are times when a way
to keep routing in or out of an area would be handy, especially if it were
accessible hierarchically in the source.  Yes, it is on the wish list, but I
get the feeling Santa won't be bringing it this year.  Guess I haven't been
good enough.

Falk Brunner wrote:

> "Christian Plessl" <plessl@remove.tik.ee.ethz.ch> schrieb im Newsbeitrag
> news:3c1f4e5f@pfaff.ethz.ch...
>
> > Is it possible to (completely) prohibit the use of routing resources on a
> > specific area of the FPGA?
>
> Why do you want to do so?
>
> --
> MfG
> Falk

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930     Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

 "They that give up essential liberty to obtain a little
  temporary safety deserve neither liberty nor safety."
                                          -Benjamin Franklin, 1759

Article: 37696
Subject: MIPS or MOPS?
From: "AAP3" <aams@dr.com>
Date: Wed, 19 Dec 2001 08:03:40 GMT
Links: << >> << T >> << A >>

Hi to all,
I wrote some functions for a CDMA receiver and I want to find the number of
MIPS required by each function. How do I calculate it?
And which is the more accurate measure, MIPS or MOPS?
More info:
Data rate: 2 Mbps.
System clock: 50 MHz.
Oversampling: 4 times.
Spreading factor: 16.

Thanks.
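
One rough way to start on such an estimate is to work out how many samples per second each function has to process, then multiply by the number of operations that function performs per sample. The sketch below does only that bookkeeping. It assumes the 2 Mbps figure is the bit rate before spreading and that each bit maps to one symbol; both are assumptions, and `ops_per_sample` is a hypothetical per-function count that has to come from the actual design.

```python
def cdma_rates(bit_rate_hz, spreading_factor, oversampling, clock_hz):
    """Derive chip and sample rates from the figures given in the post.

    Assumes one bit per symbol, so chip rate = bit rate * spreading factor.
    """
    chip_rate = bit_rate_hz * spreading_factor
    sample_rate = chip_rate * oversampling
    return {
        "chip_rate_hz": chip_rate,
        "sample_rate_hz": sample_rate,
        "samples_per_clock": sample_rate / clock_hz,
    }


def mops(ops_per_sample, sample_rate_hz):
    """Millions of operations per second for a function that performs
    ops_per_sample operations on every input sample (hypothetical count)."""
    return ops_per_sample * sample_rate_hz / 1e6


if __name__ == "__main__":
    rates = cdma_rates(2e6, 16, 4, 50e6)
    print(rates)
    # Example only: a function doing 2 operations per sample.
    print(mops(2, rates["sample_rate_hz"]), "MOPS")
```

Under these assumptions the sample rate is higher than the 50 MHz system clock, so more than one sample would have to be handled per clock. For hardware, counting operations (MOPS) is probably the easier figure to defend, since there is no instruction stream to count.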

Article: 37697
Subject: Re: Barrel shifter puts three 2->1 muxes / slice in Xilinx
From: "Carl Brannen" <carl.brannen@terabeam.com>
Date: Wed, 19 Dec 2001 08:31:45 +0000 (UTC)
Links: << >> << T >> << A >>

Hi Ray,

No question that messing around with the carries is going to slow down a barrel
shifter. It's been a very long time since I used one (okay, it was actually a
"funnel shifter", but I hate that term, so I call them all barrel shifters).  I
just got the one that uses the carries to effectively fit a 3 to 1 mux into a
LUT to place and route correctly.  It uses 38 LUTs to do a two stage shift of 0
to 8 bits (i.e. a 9-bit barrel shift) on a 16-bit input.

It's very efficient when you need a power of 3 shift size.  You get a 9-bit
shift with only 2 stages of logic instead of the usual 4.  I'll post it to the
thread in a minute.

The reason I made it two stage was simply to prevent the synthesizer from
messing with my logic.  It's slow enough (due to the full length carry) that it
would make more engineering sense as a space saving circuit rather than a
highly pipelined design.  But I'm doing this for fun, so what the heck.

The last rough spot was getting the synthesizer to recognize that a LUT4 wasn't
interfering with a MULT_AND.  I had to instantiate the LUT4s.  I haven't
figured out how to apply an attribute to a generated component (i.e. like your
RLOC usage on generated components).  I'm assuming here that "generated" means
the use of the "generate" command in VHDL.  The basic problem is that I haven't
figured out how to properly address the components.  So I went ahead and
instantiated them individually.

If you have a way around that, I'd appreciate the secret.

Carl



Article: 37698
Subject: Low area barrel shift puts 3 to 1 mux in a Xilinx LUT:
From: "Carl Brannen" <carl.brannen@terabeam.com>
Date: Wed, 19 Dec 2001 08:42:45 +0000 (UTC)
Links: << >> << T >> << A >>

This 16x9 barrel shifter uses only 38 LUTs.  There are
two stages of 3 to 1 muxes, giving a total mux of 9 to 1.

The 3 to 1 muxes are implemented using a "trick" that
involves the use of the carry logic.

A shifter of this size is rather inconvenient for the
conventional approach, which would require 4 stages
of logic.  This one does it in only two stages, but I have
to add 2 LUTs for the two mode pins required by each stage,
and 1 LUT for the carry input to each stage.  Thus the
total LUT count is 2 * (16 + 3) = 38.

This is considerably below the conventional barrel shifter
requirement for this size barrel shifter of 4 * 16 = 64.
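
The LUT counts above can be reproduced with a few lines. This is only a sketch of the counting argument: one LUT per bit per stage, plus the 3 extra LUTs per stage (2 mode pins and 1 carry input) for the carry-chain version. The function names are illustrative.

```python
def stages_needed(shift_positions, radix):
    """Number of mux stages needed to select among shift_positions shifts
    when each stage chooses between `radix` alternatives."""
    stages, reach = 0, 1
    while reach < shift_positions:
        reach *= radix
        stages += 1
    return stages


def conventional_luts(width, shift_positions):
    """2-to-1 mux stages, one LUT per bit per stage."""
    return stages_needed(shift_positions, 2) * width


def carry_trick_luts(width, shift_positions, overhead_per_stage=3):
    """3-to-1 mux stages, one LUT per bit plus mode/carry LUTs per stage."""
    return stages_needed(shift_positions, 3) * (width + overhead_per_stage)


if __name__ == "__main__":
    print(conventional_luts(16, 9), "LUTs conventional")
    print(carry_trick_luts(16, 9), "LUTs with 3-to-1 stages")
```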

I'm putting this up for people to admire and critique.
If you're going to use it, I'd note that getting stuff like
this to synthesize correctly is not particularly easy.

I've tried to include design notes to illustrate the pitfalls
that arise when the design is modified.

Also, I'd note that it's been a few days since I simulated
this logic, and it's possible that in the process of forcing
it into its designed number of LUTs I messed something up.
This was written for fun, I am not using it in any work.  So
beware.
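
Before reading the VHDL, it may help to see the two-stage idea as a behavioral model. The sketch below is not derived from the VHDL that follows. It assumes the shifter rotates a 16-bit word to the left, that the first stage shifts by 0, 1, or 2 and the second by 0, 3, or 6 (as the design notes in the source describe), and it says nothing about how the carry chain implements each 3 to 1 mux. The direction and the rotate-versus-shift behavior are assumptions.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def rotate_left(word, amount):
    """Rotate a WIDTH-bit word left by `amount` bit positions."""
    amount %= WIDTH
    return ((word << amount) | (word >> (WIDTH - amount))) & MASK


def split_shift(shift):
    """Split a shift of 0..8 into (stage A, stage B) amounts:
    stage A is 0, 1, or 2; stage B is 0, 3, or 6."""
    if not 0 <= shift <= 8:
        raise ValueError("shift must be in 0..8")
    return shift % 3, 3 * (shift // 3)


def two_stage_shift(word, shift):
    """Apply the two 3-to-1 stages in sequence; shifts are additive."""
    shfa, shfb = split_shift(shift)
    return rotate_left(rotate_left(word, shfa), shfb)


if __name__ == "__main__":
    word = 0x8001
    for s in range(9):
        assert two_stage_shift(word, s) == rotate_left(word, s)
        print(s, split_shift(s), hex(two_stage_shift(word, s)))
```

Because shifts add, applying the two stages in either order gives the same result as a single shift by the full amount.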

library IEEE;
use IEEE.std_logic_1164.all;

-- Efficient Barrel Shifter, fully pipelined.
--
-- 16-input x 9 barrel shifter.  Uses two stages of logic.
-- Requires only 39FGs, 30CYs, and 34DFFs.
--
-- Designer:  Carl Brannen
--
-- Feel free to modify this circuit and use it in
-- your own designs. I am aware of no patents that
-- it infringes on, but you will have to make your
-- own determination of this.  My only request is
-- that you leave a comment to the effect that your
-- knowledge of the algorithm is through me.  Of course
-- this is freeware and I make no guarantees that it
-- will be at all helpful to you.
--
-- Synthesize with optimize set for "low", and 
-- "area".  This circuit is already optimized,
-- the computer will only waste its time (and likely
-- increase the size and delay of the result) if it
-- tries to optimize further.

entity BRL16_8 is
   port (
      CLK:    in  STD_LOGIC;
      DIN:    in  STD_LOGIC_VECTOR(15 downto 0);
      SHIFT:  in  STD_LOGIC_VECTOR( 3 downto 0);
      Y:      out STD_LOGIC_VECTOR(15 downto 0)
   );
end BRL16_8;

architecture BRL16_8_arch of BRL16_8 is

--
-- The standard version of a 16x8 barrel shifter uses
-- 48FGs and 51DFFs, and a 16x9 barrel shifter would use
-- 64FGs.
--
-- The usual barrel shifter uses stages that shift by either 0
-- or 2^n bits.  This barrel shifter uses stages that shift
-- by three different amounts instead of two.
--
-- The usual barrel shifter stage consists of a vector of 2 to 1
-- muxes.  A stage in this barrel shifter is instead, effectively,
-- a 3 to 1 mux.  It nevertheless takes up about the same space that
-- the 2 to 1 stage in a regular barrel shifter takes.
--
-- The 3 to 1 muxes can be used in pairs to give a 9 to 1 mux,
-- which is equivalent to three stages of the usual barrel shifter.
--
-- The area reduction is therefore up to 33% of the area used by a
-- regular 8^n size barrel shifter, but it can be more or less for
-- other sizes.
--
-- This barrel shifter type is particularly efficient when the
-- number of bits to be shifted includes a power of 3.  For instance,
-- a 9-bit barrel shifter could be accomplished in only 2 stages,
-- while a regular barrel shifter of that length would require
-- a full 4 stages.  The example here, however, only does shifts
-- by 8, not by 9.  The reason for so restricting it is to get
-- the control circuitry simpler.
--
-- The usual barrel shifter consists of 2 to 1 muxes.  These are
-- implemented in LUT3 (i.e. two data inputs and one select input).
-- Since there is a free control pin, it's clear that more functionality
-- can be packed into the LUT.  In addition, none of the arithmetic
-- functionality of the slice is being used.
--
-- A 4 to 1 mux would allow 2 "bits" of barrel shifting to happen
-- in a single stage, but it's pretty obvious that that goal is
-- beyond us.  There would be 6 inputs required for such a mux, but
-- there are only 4 LUT inputs, and one CIN input.
--
-- But 4 LUT and 1 CIN does give 5 inputs per bit, and this is
-- enough to implement a 3 to 1 mux.  So we need to look at designs
-- that use 3 to 1 muxes.
--
-- I'll combine consecutive 3 to 1 mux stages so that together they
-- implement three bits of a standard barrel shifter.  The overall
-- functionality will be as follows:
--
-- SHFT[2..0]  0 1 2 3 4 5 6 7 8
--             - - - - - - - - -
-- Shift Amt:  0 1 2 3 4 5 6 7 8
--
-- This functionality can be implemented in two stages using
-- 3 to 1 muxes as follows:
--
-- SHFT[2:0]   0 1 2 3 4 5 6 7 8
--             - - - - - - - - -
-- SHFA[1:0]   0 1 2 0 1 2 0 1 2 (Shift by 0, 1, or 2)
-- SHFB[2:1]   0 0 0 3 3 3 6 6 6 (Shift by 0, 3, or 6)
--             - - - - - - - - -
-- Shift Amt:  0 1 2 3 4 5 6 7 8
--
-- Into each LUT I'll bring 2 mode control bits and two data
-- bits.  Two of the shifts will correspond to selecting the
-- two data bits.  I'll call these the "mux" shifts.  The 3rd
-- shift is passed through the CIN / COUT lines, so it's called
-- the "arithmetic" shift.
--
-- Before continuing, I need to comment on some mathematics.
-- Barrel shifting, as implemented by stages of shifts, for
-- a barrel shifter of width "n", is equivalent to Z_n.  That
-- is, it is equivalent to addition in the integers modulo n.
-- Another way of putting this is to say that shifting is additive.
-- If you shift by "S" and then shift by "T", the result is
-- equivalent to a shift by "S+T".  The usual barrel shifter
-- uses the binary representation of the shift amount, with
-- stage "m" shifting by either 0 or 2^m.  But the same shift
-- result could be accomplished by shifting by other amounts.
-- The thing to remember is that it is additive.
--
-- Because of this, it's useful to know a little bit about Z_n.
-- I'm going to have to use the CIN to bring in one of the
-- shift quantities, so these CIN to COUT chains are going to
-- depend on the structure of the "orbits" of the shift amounts.
-- "Orbit" has a specific meaning to the pure mathematicians,
-- in this case I'm going to redefine it to mean something
-- similar.  Don't show this to any mathematicians, they'll
-- undoubtedly be disgusted with my abuse of the language.
--
-- A given shift amount creates an orbit out of a start bit
-- by the sequence of places it shifts that bit to as the shift
-- is repeated.  For example, with a barrel shifter width of 8,
-- if the shift amount is 3 bits, the "0" bit is taken through
-- the following sequence: (0, 3, 6, 1, 4, 7, 2, 5), and then
-- back to "0".  This "orbit" has 8 elements.  The orbit of the
-- other bits, under a shift of 3 bits, is going to look very
-- much the same (i.e. be "isomorphic".)  Because of this
-- similarity, I'm only going to look at the orbits of "0".
--
-- The orbits due to other shift amounts may or may not be
-- isomorphic.  For example, the orbit of "0" with a shift
-- amount of 1 is (0, 1, 2, 3, 4, 5, 6, 7), which is indeed
-- isomorphic to the orbit of "3".  But the orbit of "2" is
-- shorter: (0, 2, 4, 6).  Mathematically, the orbit of a
-- shift by n bits in a barrel shifter of width w is of maximum
-- length if n and w have no common divisors (other than 1).
--
-- The isomorphisms of the orbits are not of significance for
-- the shifts that are performed with the usual multiplexers,
-- but are very important for shifts performed using the
-- arithmetic CIN / COUT wires.  The carry logic for that
-- shift forms a chain.  A shift of 1 bit always has the same
-- orbit, the one with maximum length, no matter what the
-- size of the barrel shifter: (0, 1, 2, ... WIDTH-1), where
-- WIDTH is the width of the barrel shifter.  This orbit is
-- preferred to other possible orbits because I have to make
-- sure that CIN = 0 in order to avoid an unwanted arithmetic
-- operation (i.e. an increment) when performing the regular
-- mux type shifts.  This means that I have to control the CIN
-- by forcing it to zero for the two mux shifts, and connecting
-- it to COUT for the one arithmetic shift.  If I select an
-- orbit of less than the full Barrel Shift width for the
-- arithmetic shift, I'll have to build more than one copy of
-- the CIN control logic.
--
-- There's another reason for analyzing the orbit structure of
-- barrel shifters.  The two mode bits routed to each LUT have
-- some freedom.  It's possible that I might be able to arrange
-- for those mode bits to be one of the 3 select inputs, thereby
-- reducing the number of LUTs needed to compute the control
-- lines.  I suppose there is also a chance that I might be
-- able to share a control line between the SHFA and SHFB shifts,
-- though it surely seems unlikely.
--
-- The arithmetic shift is accomplished through the CIN input.
-- This means I'll have to use the XORCY output, and therefore
-- I will have to make sure that COUT = 0 (and therefore CIN = 0
-- for the next bit in the orbit) during the mux shift modes.
-- Now one of the data inputs will have to connect to I0 in
-- order for it to be sent out the COUT.  This will be the
-- arithmetic shift.  When that same data input is to be used
-- as a mux shift (to the current bit), the COUT will automatically
-- be cleared by the arithmetic logic.  (If "I0" is the mux /
-- arithmetic data input, the LUT4 will be programmed to output
-- I0 when selecting I0 as the shift amount.  The MUXCY, which
-- has the LUT4 output as its selector, will then select I0
-- when the LUT4 output is 0, and CIN when the LUT4 output is
-- 1.  Since the LUT4 output in this case is I0, the I0 input
-- will only be selected when it is zero.)  But I have to make
-- sure that I0 input to the MUXCY is zero when selecting the
-- other mux shift. The only way to do this for free is to use
-- the MULT_AND.
-- 
-- This gives the structure of one stage of a general purpose
-- 3-way shifter as the following.  XA and XB are shifts that
-- use the multiplexers, and C is the arithmetic shift.  "M1"
-- and "M0" are the two mode inputs to the LUT, while "XA" and
-- "XB" are the two data inputs.  The mode values "s" need to
-- be selected later, hopefully in a manner that reduces the
-- need for decoding logic.
--
-- SHFA  M1 M0 LUT4  CIN COUT   Result
-- ----  -- -- ----  --- ----   ------
--  XA    s  s  XA     0   0      XA
--  XB    s  0  XB     0   0      XB
--   C    s  1   0     C  XA       C
--
-- There's one remaining bit of mathematical gibberish.
-- In addition to performing isomorphisms between shifts,
-- I can also add a constant shift to a stage for free.
-- That is, I can renumber the outputs of a stage by shifting
-- them around.  This is just a wire change, and it means
-- that given a stage that, for instance, shifts by 2,3, or
-- 4 bits, I can rearrange the output bits on the same stage
-- and make it shift by 4,5, or 6 bits.  (Or 6,7, or 0 bits.)
-- This gives me some freedom in how I assign mode pins,
-- and freedom is very important for minimizing logic functions.
--
-- The effect of the above mathematical note is simple.  I
-- can add an arbitrary shift amount to each stage, in terms
-- of what shift is associated with a particular pattern on
-- the SHFT inputs.  Then I can remove that shift by doing
-- a "wire" barrel shift for free.
--
-- The XB shift is completely arbitrary.  By that I mean
-- it could connect up any way without respect to how the
-- XA and C shifts are done.  But the XA and C shifts are
-- related.  In order for the carry structure to have only
-- one overall CIN, I need to have that XA and C shifts
-- differ (in terms of how many bits they shift) by an amount
-- that corresponds to a full length orbit.
--
-- For the example given, with a barrel shifter width of 8,
-- this means that I have to have the XA and C shifters
-- differ by {1,3,5, or 7} bits.  (Note all arithmetic is to
-- be done modulo the width of the barrel shifter.)  This
-- is rather liberal, as it means that I need only ensure
-- that not all the shift amounts be even or odd for either
-- SHFA or SHFB.
--
-- The other thing to notice is that I only need 3 different
-- shifts but with two mode pins I'm going to have 4 codes.
-- This means that I can map two codes to the same mode.  This
-- may help reduce the amount of control logic. But I'll use
-- the extra degree of freedom to define two states for SHFB == 3.
-- This way, if SHIFT[3] is tied low (and the design reduced
-- from a shift by "0 to 8" to a shift by "0 to 7") an FG will
-- be saved.  The resulting truth table is:
--
--
--                 A0_MODE  A1_MODE
-- SHIFT SHFA SHFB   1  0     1  0
-- ----- ---- ----   -  -     -  -
-- 0000    0    0    0  0     0  0
-- 0001    1    0    1  0     0  0
-- 0010    2    0    0  1     0  0
-- 0011    0    3    0  0     1  0
-- 0100    1    3    1  0     1  1
-- 0101    2    3    0  1     1  1
-- 0110    0    6    0  0     0  1
-- 0111    1    6    1  0     0  1
-- 1000    2    6    0  1     0  1


-- Also note that the original version of this code got the arithmetic
-- mux by connecting the Carry-out back around to the Carry-input.  This
-- generates an apparent "cycle", and causes a "post layout timing report"
-- warning something like the following:
--
--  ----------------------------------------------------------------------
-- ! Warning: The following connections close cycles, and some paths      !
-- !          through these connections may not be analyzed.              !
-- !                                                                      !
-- ! Signal                            Driver            Load             !
-- ! --------------------------------  ----------------  ---------------- !
-- ! U3/A0_CRY2                        LB_R13C8.S1.COUT  CLB_R12C8.S1.CIN !
-- ! U3/A1_CRY8                        CLB_R9C4.S0.COUT  CLB_R8C4.S0.CIN  !
--  ---------------------------------------------------------------------- 
--
-- First of all, this warning may be ignored for this particular circuit.
-- The reason is that the circuit is operated in two modes.  In the non
-- arithmetic shifts, the "0th" carry is forced to be zero by the CIN selector.
-- This propagates through the rest of the circuit, so there is no cycle.
-- In the arithmetic shift mode the carry is always equal to the applied A[]
-- input, and no carries propagate through the circuit at all.
--
-- But it's best to avoid warnings, so the circuitry shown here simply ignores
-- the final carry-out and consequently is manifestly free of cycles.
--
-- There are some useful arithmetic circuits where the COUT has to be connected
-- back to the CIN input but this is not one of them.  A great example of
-- an arithmetic circuit where cycles have to be dealt with is in the addition
-- circuitry of an ALU designed to add floating point numbers in CRAY notation.
-- Ah, for the days when I was a CPU designer!

component XORCY port (
      CI:     in  STD_LOGIC;
      LI:     in  STD_LOGIC;
      O:      out STD_LOGIC);
end component;

component MUXCY port (
      DI:     in  STD_LOGIC;
      CI:     in  STD_LOGIC;
      S:      in  STD_LOGIC;
      O:      out STD_LOGIC);
end component;

component MULT_AND port (
      I0:      in  STD_LOGIC;
      I1:      in  STD_LOGIC;
      LO:      out STD_LOGIC);
end component;

component LUT4 port (
      I0:      in  STD_LOGIC;
      I1:      in  STD_LOGIC;
      I2:      in  STD_LOGIC;
      I3:      in  STD_LOGIC;
      O:       out STD_LOGIC);
end component;

-- Define LUT4s to the correct function.  For some reason the synthesizer
-- couldn't figure out that this was what I wanted.
attribute INIT: string;
attribute INIT of L00: label is "0E04";
attribute INIT of L01: label is "0E04";
attribute INIT of L02: label is "0E04";
attribute INIT of L03: label is "0E04";
attribute INIT of L04: label is "0E04";
attribute INIT of L05: label is "0E04";
attribute INIT of L06: label is "0E04";
attribute INIT of L07: label is "0E04";
attribute INIT of L08: label is "0E04";
attribute INIT of L09: label is "0E04";
attribute INIT of L10: label is "0E04";
attribute INIT of L11: label is "0E04";
attribute INIT of L12: label is "0E04";
attribute INIT of L13: label is "0E04";
attribute INIT of L14: label is "0E04";
attribute INIT of L15: label is "0E04";

-- Stage 0 declarations
signal A0_MODE:  STD_LOGIC_VECTOR( 1 downto 0);     -- SHFA
-- Arithmetic inputs
signal A0_A:     STD_LOGIC_VECTOR(15 downto 0);     -- A[] signal input
signal A0_B:     STD_LOGIC_VECTOR(15 downto 0);     -- B[] signal input
-- Arithmetic internal signals
signal A0_LUT:   STD_LOGIC_VECTOR(15 downto 0);     -- LUT      (internal use)
signal A0_MA:    STD_LOGIC_VECTOR(15 downto 0);     -- MULT_AND (internal use)
signal A0_XC:    STD_LOGIC_VECTOR(15 downto 0);     -- XORCY    (internal use)
signal A0_CRY:   STD_LOGIC_VECTOR(16 downto 0);     -- Carry    (internal use)
-- Arithmetic outputs
signal A0_COUT:  STD_LOGIC;                         -- Arithmetic Carry output
signal A0_SUMQ:  STD_LOGIC_VECTOR(15 downto 0);     -- Arithmetic Sum output
signal A0_SUMD:  STD_LOGIC_VECTOR(15 downto 0);     -- Arithmetic Sum output

-- Stage 1 declarations
signal A1_MODEQ: STD_LOGIC_VECTOR( 1 downto 0);     -- SHFB
signal A1_MODED: STD_LOGIC_VECTOR( 1 downto 0);     -- SHFB
-- Arithmetic inputs
signal A1_A:     STD_LOGIC_VECTOR(15 downto 0);     -- A[] signal input
signal A1_B:     STD_LOGIC_VECTOR(15 downto 0);     -- B[] signal input
-- Arithmetic internal signals
signal A1_LUT:   STD_LOGIC_VECTOR(15 downto 0);     -- LUT      (internal use)
signal A1_MA:    STD_LOGIC_VECTOR(15 downto 0);     -- MULT_AND (internal use)
signal A1_XC:    STD_LOGIC_VECTOR(15 downto 0);     -- XORCY    (internal use)
signal A1_CRY:   STD_LOGIC_VECTOR(16 downto 0);     -- Carry    (internal use)
-- Arithmetic outputs
signal A1_COUT:  STD_LOGIC;                         -- Arithmetic Carry output
signal A1_SUMQ:  STD_LOGIC_VECTOR(15 downto 0);     -- Arithmetic Sum output
signal A1_SUMD:  STD_LOGIC_VECTOR(15 downto 0);     -- Arithmetic Sum output


begin


-- The selector needed is a type of arithmetic circuit with 
-- two mode pins.  In addition, I have to be able to force the "A"
-- input to be ignored completely for at least one mode.  That means
-- that the mode will require a MULT_AND, so the template required is
-- the MODE3-0 MULT_AND template (which see).
--
-- The three operations required will be A[], B[], and A[]+A[]+COUT.
-- B[] will have to correspond to AR_MODE(1) "low", and A[]+A[] will
-- need to have AR_MODE(1) "high".  Also, I want a shift by "0" to
-- correspond to the A[] version, while a shift by "1" or "3" to correspond
-- to the arithmetic shift (i.e. A[]+A[]+COUT).  In addition, it's
-- possible to save a LUT by fiddling with N0 so that two different codes
-- correspond to the arithmetic shift:
--
--                 A0_MODE  A1_MODE
-- SHIFT SHFA SHFB   1  0     1  0
-- ----- ---- ----   -  -     -  -
-- 0000    0    0    0  0     0  0
-- 0001    1    0    1  0     0  0
-- 0010    2    0    0  1     0  0
-- 0011    0    3    0  0     1  0
-- 0100    1    3    1  1     1  1
-- 0101    2    3    0  1     1  1
-- 0110    0    6    0  0     0  1
-- 0111    1    6    1  1     0  1
-- 1000    2    6    0  1     0  1

-- The bizarre modes for "10" and "11" are there to possibly save the A1_MODE(0) LUT.

-- SHFA modes
with SHIFT(3 downto 0) select
    A0_MODE(1 downto 0) <=
      "00" when "0000" | "0011" | "0110",      -- Shift by 0   A {0,3,6}
      "10" when "0001"                  ,      -- Shift by 1   0 {1,
      "11" when          "0100" | "0111",      -- Shift by 1        4,7}
      "01" when others;                        -- Shift by 2   B {2,5,8}

-- SHFB modes
with SHIFT(3 downto 0) select
    A1_MODED(1 downto 0) <=
      "00" when "0000" | "0001" | "0010",      -- Shift by 0
      "10" when "0011"                  ,      -- Shift by 3
      "11" when          "0100" | "0101",      -- Shift by 3
      "01" when others;                        -- Shift by 6


------------------------------------------------------
-- SHFA   Shifts by 0,1, or 2 bits
------------------------------------------------------

-- Arithmetic functions required:
--
-- A0_MODE  Arithmetic
--   1 0     Function      Shift
--   - -    ------------   --------
--   0 0    A[]              0
--   0 1    B[]              2
--   1 0    A[]+A[]+COUT     1
--   1 1    A[]+A[]+COUT     1
--
-- This is all I need to know to define the carry logic. Because
-- this is sort of a complicated substitution, I'll do the bit substitution
-- explicitly, using assignments, rather than try to substitute the text
-- in the arithmetic.  The other substitutions are as follows:
--
-- Substitutions:
-- AR_MAXBIT <= 15
-- AR_MODE   <= A0_MODE(1 downto 0)
-- AR_xxx    <= A0_xxx

-- Bit assignments, stage 0
--
-- Note that the bit assignments for stage 0 are trivial.  This is because
-- the orbit of a bit in this length barrel shifter, when shifted by 1
-- is the sequence (0,1,2,3,4 ... 15), and this is a trivial sequence.
-- The bit assignment for stage 1 is more complicated.

A0_A(15 downto 0) <= DIN(15 downto 0);      -- Shift by 0

-- A0_B is the same as A0_A, but shifted by two places:
A0_B(15 downto 0) <= A0_A(13 downto 0) & A0_A(15 downto 14);      -- Shift by 2

-- Carry-in control
-- Note that when the circuit is in A[]+A[]+COUT mode, the COUT will
-- be precisely equal to A0_A(15), so I choose that as the CIN instead
-- of the carry-out A0_CRY(16).  This uses no more gates and is kinder
-- and gentler to the tools.
with A0_MODE(1 downto 0) select
    A0_CRY(0) <=
    (  '0'   ) when "00",     -- CIN = 0      A[]
    (  '0'   ) when "01",     -- CIN = 0      B[]
    (A0_A(15)) when "10",     -- CIN = COUT   A[]+A[]+COUT
    (A0_A(15)) when others;   -- CIN = COUT   A[]+A[]+COUT

-- Unfortunately, I couldn't get the synthesizer to clue in that
-- A0_LUT would fit into a LUT. I hate to instantiate these things,
-- but here goes:
--
-- with A0_MODE(1 downto 0) select
--   A0_LUT(I) <=
--     (             A0_A(I)) when "00",      --   A[]        A[]+1      A[]+CIN
--     (             A0_B(I)) when "01",      --   B[]        B[]+1      B[]+CIN   (Note 1.)
--     (A0_A(I)  xor A0_A(I)) when "10",      --   A[]+A[]    A[]+A[]+1  A[]+A[]+CIN
--     (A0_A(I)  xor A0_A(I)) when others;    --   A[]+A[]    A[]+A[]+1  A[]+A[]+CIN
--
-- INIT calculation:
--
-- 3 1111111100000000  B
-- 2 1111000011110000  MODE(1)
-- 1 1100110011001100  A
-- 0 1010101010101010  MODE(0)
-- - ----------------
--        1 0     1 0  A0_A when "00",
--       1 1     0 0   A0_B when "01",
--   0000    0000       '0' when others (note A xor A == '0')
--   ----------------
--   0000111000000100 = 0x0E04  (assigned as an attribute near signals)
--
-- LUT4 ugliness...  Note the pin order: 0x0E04 needs MODE(0) on I0 and
-- MODE(1) on I2, so that "01" selects B and "1x" forces the LUT to '0'.
L00: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A( 0),
    I2  => A0_MODE(1),
    I3  => A0_B( 0),
    O   => A0_LUT( 0));
L01: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A( 1),
    I2  => A0_MODE(1),
    I3  => A0_B( 1),
    O   => A0_LUT( 1));
L02: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A( 2),
    I2  => A0_MODE(1),
    I3  => A0_B( 2),
    O   => A0_LUT( 2));
L03: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A( 3),
    I2  => A0_MODE(1),
    I3  => A0_B( 3),
    O   => A0_LUT( 3));
L04: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A( 4),
    I2  => A0_MODE(1),
    I3  => A0_B( 4),
    O   => A0_LUT( 4));
L05: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A( 5),
    I2  => A0_MODE(1),
    I3  => A0_B( 5),
    O   => A0_LUT( 5));
L06: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A( 6),
    I2  => A0_MODE(1),
    I3  => A0_B( 6),
    O   => A0_LUT( 6));
L07: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A( 7),
    I2  => A0_MODE(1),
    I3  => A0_B( 7),
    O   => A0_LUT( 7));
L08: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A( 8),
    I2  => A0_MODE(1),
    I3  => A0_B( 8),
    O   => A0_LUT( 8));
L09: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A( 9),
    I2  => A0_MODE(1),
    I3  => A0_B( 9),
    O   => A0_LUT( 9));
L10: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A(10),
    I2  => A0_MODE(1),
    I3  => A0_B(10),
    O   => A0_LUT(10));
L11: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A(11),
    I2  => A0_MODE(1),
    I3  => A0_B(11),
    O   => A0_LUT(11));
L12: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A(12),
    I2  => A0_MODE(1),
    I3  => A0_B(12),
    O   => A0_LUT(12));
L13: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A(13),
    I2  => A0_MODE(1),
    I3  => A0_B(13),
    O   => A0_LUT(13));
L14: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A(14),
    I2  => A0_MODE(1),
    I3  => A0_B(14),
    O   => A0_LUT(14));
L15: LUT4 port map (
    I0  => A0_MODE(0),
    I1  => A0_A(15),
    I2  => A0_MODE(1),
    I3  => A0_B(15),
    O   => A0_LUT(15));

-- Generate command
A0: for I in 0 to 15 generate
-- Carry chain instantiation
  MA: MULT_AND port map (
    I0  => A0_A(I),
    I1  => A0_MODE(1),
    LO  => A0_MA(I));
  MC: MUXCY port map (
    DI  => A0_MA(I),
    CI  => A0_CRY(I),
    S   => A0_LUT(I),
    O   => A0_CRY(I+1));
  XC: XORCY port map (
    CI  => A0_CRY(I),
    LI  => A0_LUT(I),
    O   => A0_SUMD(I));
end generate;


------------------------------------------------------
-- SHFB   Shifts by 0,3, or 6 bits
------------------------------------------------------
--
--
-- The logic for the second level shifts is similar to the logic for
-- the first level, but the shifts are by amounts 3x as much.  This
-- means that I have to scramble bits to get the right results.
--
-- A1_MODE  Arithmetic
--   1 0     Function      Shift
--   - -    ------------   --------
--   0 0    A[]              0
--   0 1    B[]              6
--   1 0    A[]+A[]+COUT     3
--   1 1    A[]+A[]+COUT     3
--
--
-- Substitutions:
-- AR_MAXBIT <= 15
-- AR_MODE   <= A1_MODE(1 downto 0)
-- AR_xxx    <= A1_xxx

-- Bit assignments, stage 1
--
-- The results from the previous stage are A0_SUM(15 downto 0)
-- (registered as A0_SUMQ), and the bits are positioned just as they
-- appear.  But I'm going to have to juggle the ordering for this stage.
--
-- I'm going to choose to start the selector with the selected
-- bit 0.  That is, A1_SUM(0) will have a position of "0" in the
-- (15 downto 0) set of bits.
--
-- As soon as I make that choice, I know that A1_A(0) will be given
-- A0_SUM(0) in order to make the A[] choice correspond to a shift by
-- 0.  Then A1_B(0) will have to be the bit 6 places away, which with
-- the wiring below works out to A0_SUM(10), to get the shift by 6.
--
-- The carry-out from this stage (during the arithmetic mux) will have
-- a value of A0_SUM(0), and since the arithmetic shift is supposed to
-- have a shift amount of "3", it must be that the next bit higher
-- than bit "0" must be bit "3".  This rule continues through all
-- 16 bits of input bits, and determines the bits for A1_A:

A1_A(15 downto 0)
      <= A0_SUMQ(13) & A0_SUMQ(10) & A0_SUMQ( 7) & A0_SUMQ( 4)
      &  A0_SUMQ( 1) & A0_SUMQ(14) & A0_SUMQ(11) & A0_SUMQ( 8)
      &  A0_SUMQ( 5) & A0_SUMQ( 2) & A0_SUMQ(15) & A0_SUMQ(12)
      &  A0_SUMQ( 9) & A0_SUMQ( 6) & A0_SUMQ( 3) & A0_SUMQ( 0);

-- A1_B is the same as A1_A, but shifted by two places.  Because
-- neighbouring A1_A positions are 3 bits apart, this is a shift by 6:
A1_B(15 downto 0) <= A1_A(13 downto 0) & A1_A(15 downto 14);      -- Shift by 6

-- Note: The following carry chain logic may seem mysterious, but
-- with the use of a template it's easy to implement.  It's my
-- intention to publish the templates I use as freeware, but I haven't
-- got good enough comments written into them.  Half the problem is
-- that after you've used the templates a few times you just remember
-- how to get the arithmetic functions you want, so I don't bother
-- with them much.
--
-- The template that covers this case allows for a very large number
-- of 2-mode bit functions of 2 arithmetic vectors (and constants and
-- stuff) to be implemented in a single slice.

-- Carry-in control
with A1_MODEQ(1 downto 0) select
    A1_CRY(0) <=
    (  '0'   ) when "00",     -- CIN = 0      A[]
    (  '0'   ) when "01",     -- CIN = 0      B[]
    (A1_A(15)) when "10",     -- CIN = COUT   A[]+A[]+COUT
    (A1_A(15)) when others;   -- CIN = COUT   A[]+A[]+COUT

-- Generate command
A1: for I in 0 to 15 generate
with A1_MODEQ(1 downto 0) select
    A1_LUT(I) <=
    (             A1_A(I)) when "00",      --   A[]        A[]+1      A[]+CIN
    (             A1_B(I)) when "01",      --   B[]        B[]+1      B[]+CIN   (Note 1.)
    (A1_A(I)  xor A1_A(I)) when "10",      --   A[]+A[]    A[]+A[]+1  A[]+A[]+CIN
    (A1_A(I)  xor A1_A(I)) when others;    --   A[]+A[]    A[]+A[]+1  A[]+A[]+CIN

-- Carry chain instantiation
  MA: MULT_AND port map (
    I0  => A1_A(I),
    I1  => A1_MODEQ(1),
    LO  => A1_MA(I));
  MC: MUXCY port map (
    DI  => A1_MA(I),
    CI  => A1_CRY(I),
    S   => A1_LUT(I),
    O   => A1_CRY(I+1));
  XC: XORCY port map (
    CI  => A1_CRY(I),
    LI  => A1_LUT(I),
    O   => A1_SUMD(I));
end generate;


process (CLK)
begin
    if CLK'event and CLK='1' then  --CLK rising edge
      A0_SUMQ  <= A0_SUMD(15 downto 0);
      A1_MODEQ <= A1_MODED(1 downto 0);
      A1_SUMQ  <= A1_SUMD(15 downto 0);
    end if;
end process;


-- The result is A1_SUM, but the bits are not in the correct order.
-- To get the bits in correct order, I simply reverse the operation
-- that gave A1_A:

Y(15 downto 0)
      <= A1_SUMQ( 5) & A1_SUMQ(10) & A1_SUMQ(15) & A1_SUMQ( 4)
      &  A1_SUMQ( 9) & A1_SUMQ(14) & A1_SUMQ( 3) & A1_SUMQ( 8)
      &  A1_SUMQ(13) & A1_SUMQ( 2) & A1_SUMQ( 7) & A1_SUMQ(12)
      &  A1_SUMQ( 1) & A1_SUMQ( 6) & A1_SUMQ(11) & A1_SUMQ( 0);

end BRL16_8_arch;
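
One way to check that the slice wiring above really rotates the data is to model it in software. What follows is a minimal Python sketch and not part of the original design. The names `slice_chain`, `rot` and `barrel` are made up for illustration, the pipeline registers are ignored, and the LUT is assumed to implement the intended function ("00" selects A, "01" selects B, "1x" outputs 0).

```python
WIDTH = 16

def slice_chain(a, b, mode):
    """One 3-way stage. a and b are lists of WIDTH bits (index 0 = LSB);
    mode is the 2-bit code written MSB first: "00", "01", "10" or "11"."""
    arith = mode[0] == "1"                 # MODE(1) = 1: arithmetic (carry) shift
    carry = a[WIDTH - 1] if arith else 0   # CIN: wrapped top bit, or forced to 0
    out = []
    for i in range(WIDTH):
        if arith:
            lut = 0                        # A xor A
        elif mode == "01":
            lut = b[i]
        else:
            lut = a[i]
        di = a[i] & int(arith)             # MULT_AND: A and MODE(1)
        out.append(lut ^ carry)            # XORCY
        carry = carry if lut else di       # MUXCY: S=1 passes CI, S=0 selects DI
    return out

def rot(v, k):
    """Element i takes v[i - k], i.e. a rotate by k places (the B[] wiring)."""
    return [v[(i - k) % WIDTH] for i in range(WIDTH)]

STAGE0 = {0: "00", 1: "10", 2: "01"}       # SHFA amount -> A0_MODE
STAGE1 = {0: "00", 3: "10", 6: "01"}       # SHFB amount -> A1_MODE

def barrel(din, shift):
    """Two-stage shifter for shift = 0..8, mirroring the A0/A1 stages."""
    shfa, shfb = shift % 3, 3 * (shift // 3)
    a0 = list(din)
    s0 = slice_chain(a0, rot(a0, 2), STAGE0[shfa])
    a1 = [s0[(3 * i) % WIDTH] for i in range(WIDTH)]      # scramble along the orbit of 3
    s1 = slice_chain(a1, rot(a1, 2), STAGE1[shfb])
    return [s1[(11 * j) % WIDTH] for j in range(WIDTH)]   # undo it: 3 * 11 = 1 mod 16

din = [(0xB5C3 >> i) & 1 for i in range(WIDTH)]
print(all(barrel(din, s) == rot(din, s) for s in range(9)))
```

In this model the second stage's "shift by two places" on the scrambled vector comes out as a 6-bit rotate once the output is unscrambled, which is the reason for the bit juggling.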

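
The INIT word for the hand-instantiated LUT4s depends on which LUT pin each signal lands on, so it is worth recomputing it against the port map actually in use. The sketch below is only an illustration (the helper names `lut4_init` and `mux3` are invented here). It assumes the usual convention that INIT bit k holds the output for inputs (I3, I2, I1, I0) equal to the binary digits of k, and it evaluates the same 3-way function for two possible pin assignments.

```python
def lut4_init(func):
    """Build the 16-bit INIT word: bit k is func(I0, I1, I2, I3) for address k."""
    init = 0
    for k in range(16):
        i0, i1, i2, i3 = k & 1, (k >> 1) & 1, (k >> 2) & 1, (k >> 3) & 1
        if func(i0, i1, i2, i3):
            init |= 1 << k
    return init

def mux3(mode1, mode0, a, b):
    # Mode "00" -> A, mode "01" -> B, mode "1x" -> 0 (A xor A)
    if mode1:
        return 0
    return b if mode0 else a

# Wiring 1: I0 = MODE(0), I1 = A, I2 = MODE(1), I3 = B
init_w1 = lut4_init(lambda i0, i1, i2, i3: mux3(i2, i0, i1, i3))
# Wiring 2: I0 = MODE(1), I1 = A, I2 = MODE(0), I3 = B
init_w2 = lut4_init(lambda i0, i1, i2, i3: mux3(i0, i2, i1, i3))

print(f"{init_w1:04X} {init_w2:04X}")   # 0E04 5404
```

Swapping the two mode pins changes the INIT value, so the attribute and the port map have to agree.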

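
The "orbit" argument in the header comments is easy to experiment with. Here is a small Python sketch (the function name `orbit` is just for illustration) that lists the positions bit 0 visits under a repeated shift, and checks that the orbit length is the width divided by the greatest common divisor of the shift and the width.

```python
from math import gcd

def orbit(shift, width):
    """Positions visited by bit 0 when a shift by `shift` is repeated."""
    pos, seen = 0, []
    while pos not in seen:
        seen.append(pos)
        pos = (pos + shift) % width
    return seen

print(orbit(3, 8))    # [0, 3, 6, 1, 4, 7, 2, 5]
print(orbit(2, 8))    # [0, 2, 4, 6]
print(orbit(3, 16))   # bit order of A1_A, from A1_A(0) upwards

# Full-length orbits occur exactly when shift and width share no factor.
for w in (8, 9, 16):
    for n in range(1, w):
        assert len(orbit(n, w)) == w // gcd(n, w)
```

For a width of 16, any odd shift gives a full-length orbit, which is why a single carry chain is enough for the shift-by-3 stage.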
-- 
Posted from firewall.terabeam.com [216.137.15.2] 
via Mailgate.ORG Server - http://www.Mailgate.ORG

Article: 37699
Subject: Re: Low area barrel shift puts 3 to 1 mux in a Xilinx LUT:
From: "Carl Brannen" <carl.brannen@terabeam.com>
Date: Wed, 19 Dec 2001 08:57:42 +0000 (UTC)
Links: << >> << T >> << A >>

Design reports:

<<<
Design Summary
--------------
   Number of errors:      0
   Number of warnings:    1
   Number of Slices:                 38 out of    768    4%
   Number of Slices containing
      unrelated logic:                0 out of     38    0%
   Number of Slice Flip Flops:       70 out of  1,536    4%
   Number of 4 input LUTs:           38 out of  1,536    2%
   Number of bonded IOBs:            36 out of     94   38%
   Number of GCLKs:                   1 out of      4   25%
   Number of GCLKIOBs:                1 out of      4   25%
Total equivalent gate count for design:  1,034
Additional JTAG gate count for IOBs:  1,776
>>>

Note that the 38 slices mentioned above include slices that contain
only DFFs. The number of slices that contain FGs or carry logic is
20.  There are extra DFFs included on the inputs and outputs in order
to "free" the design from the IO region.

<<<

   The Average Connection Delay for this design is:        0.923 ns
   The Maximum Pin Delay is:                               1.982 ns
   The Average Connection Delay on the 10 Worst Nets is:   1.430 ns
...

--------------------------------------------------------------------------------
  Constraint                                | Requested  | Actual     | Logic 
                                            |            |            | Levels

--------------------------------------------------------------------------------
* NET "CLK" PERIOD =  5 nS   LOW 50.000 %   | 5.000ns    | 5.272ns    | 11   

--------------------------------------------------------------------------------

<<<
Constraints cover 2038 paths, 0 nets, and 234 connections (93.6% coverage)

Design statistics:
   Minimum period:   5.272ns (Maximum frequency: 189.681MHz)
>>>

I should mention here that the worst case timing does
involve the long carry chain, but the carry logic itself
only adds .097ns per slice (i.e. half that per bit).

Consequently, other than the additional routing congestion,
the carry chain, per se, is not really much of a limit on the speed:

<<<

================================================================================
Timing constraint: NET "CLK" PERIOD =  5 nS   LOW 50.000 % ;
 2038 items analyzed, 6 timing errors detected.
 Minimum period is   5.272ns.

--------------------------------------------------------------------------------
Slack:    -0.272ns path B<3> to U14/A1_A<10> relative to
           5.000ns delay constraint

Path B<3> to U14/A1_A<10> contains 11 levels of logic:
Path starting from Comp: CLB_R14C5.S1.CLK (from CLK)
To                   Delay type         Delay(ns)  Physical Resource
                                                   Logical Resource(s)
-------------------------------------------------  --------
CLB_R14C5.S1.XQ      Tcko                  0.772R  B<3>
                                                   U2
CLB_R15C6.S1.G2      net (fanout=4)        0.613R  B<3>
CLB_R15C6.S1.Y       Tilo                  0.398R  U14/C1/N5
                                                   U14/C586
CLB_R15C6.S1.F3      net (fanout=17)       0.476R  U14/C17/N16
CLB_R15C6.S1.X       Tilo                  0.398R  U14/C1/N5
                                                   U14/C582
CLB_R16C6.S0.BX      net (fanout=1)        0.680R  U14/C1/N5
CLB_R16C6.S0.COUT    Tbxcy                 0.357R  U14/A1_A<0>
                                                   U14/MC_0
                                                   U14/MC_1
CLB_R15C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<2>
CLB_R15C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<6>
                                                   U14/MC_2
                                                   U14/MC_3
CLB_R14C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<4>
CLB_R14C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<12>
                                                   U14/MC_4
                                                   U14/MC_5
CLB_R13C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<6>
CLB_R13C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<2>
                                                   U14/MC_6
                                                   U14/MC_7
CLB_R12C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<8>
CLB_R12C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<8>
                                                   U14/MC_8
                                                   U14/MC_9
CLB_R11C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<10>
CLB_R11C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<14>
                                                   U14/MC_10
                                                   U14/MC_11
CLB_R10C6.S0.CIN     net (fanout=1)        0.000R  U14/A0_CRY<12>
CLB_R10C6.S0.COUT    Tbyp                  0.097R  U14/A1_A<4>
                                                   U14/MC_12
                                                   U14/MC_13
CLB_R9C6.S0.CIN      net (fanout=1)        0.000R  U14/A0_CRY<14>
CLB_R9C6.S0.CLK      Tcckx                 0.996R  U14/A1_A<10>
                                                   U14/XC_14
                                                   U14/A0_SUMQ_reg<14>
-------------------------------------------------
Total (3.503ns logic, 1.769ns route)       5.272ns (to CLK)
      (66.4% logic, 33.6% route)
>>>
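
As a sanity check on the report above, the individual delays can be added up by hand. This short Python sketch simply copies the numbers from the path listing (nothing here comes from the tools themselves) and recomputes the logic/route split, the share taken by the carry bypass delays, and the resulting maximum frequency.

```python
# Delays in ns, copied from the path report above.
logic = [0.772, 0.398, 0.398, 0.357] + [0.097] * 6 + [0.996]   # Tcko, Tilo x2, Tbxcy, Tbyp x6, Tcckx
route = [0.613, 0.476, 0.680]                                  # the three routed nets

total = sum(logic) + sum(route)
carry = 6 * 0.097                                              # Tbyp contribution only

print(f"logic {sum(logic):.3f} ns, route {sum(route):.3f} ns, total {total:.3f} ns")
print(f"carry bypass {carry:.3f} ns = {100 * carry / total:.1f}% of the path")
print(f"fmax {1000 / total:.3f} MHz")
```

The six bypass delays add up to about 0.58 ns of the 5.272 ns path, which supports the point that the carry chain itself is a small part of the total.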

Even though the carry is only used in two modes, one where it is
zero throughout the carry chain, and the other where each carry
bit propagates only to the next stage, it is nevertheless the case
that the worst case speed of the circuit includes the full timing
chain.

It's not easy to get a delay as long as the whole carry chain.
To do it, you'll have to apply '1's to the DIN[] input, and
tool around with the SHIFT[] input.  When the carry chain is
in the "pass data" mode, it will all be '1's.  When the carry
chain is switched to the all '0' mode, the '0' will be supplied
at the least significant bit, and will propagate through the
chain.  (Note that the chain does not propagate evenly from least
significant to most significant because of the remapping that
makes the barrel shifter work.)
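
The worst case described above can be reproduced with a simple model of one carry chain. This is a hedged Python sketch with an invented helper name (`carries`); it treats the chain as ideal logic with no timing and only shows the steady-state carry values in each mode.

```python
WIDTH = 16

def carries(a, arith):
    """Return the WIDTH + 1 carry values of one chain (index 0 is CIN).
    a is the A[] input, index 0 = LSB; arith selects the arithmetic shift."""
    carry = a[-1] if arith else 0     # CIN: wrapped top bit, or forced to 0
    out = [carry]
    for bit in a:
        lut = 0 if arith else bit     # arithmetic mode: A xor A; mux mode "00": A
        di = bit if arith else 0      # MULT_AND output
        carry = carry if lut else di  # MUXCY
        out.append(carry)
    return out

ones = [1] * WIDTH
print(carries(ones, arith=True))      # every carry is 1 ("pass data" mode)
print(carries(ones, arith=False))     # every carry is 0 once things settle
print(sum(ones))                      # 16 MUXCYs with select = 1, all passing CIN
```

With all-ones data in mux mode every MUXCY select is 1, so the new '0' at CIN has to ripple through all 16 stages before the last carry settles. That is the full-length path the timing analyzer reports.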

Carl


-- 
Posted from firewall.terabeam.com [216.137.15.2] 
via Mailgate.ORG Server - http://www.Mailgate.ORG
