Messages from 37825

Article: 37825
Subject: An algorithm improvement...
From: "Carl Brannen" <carl.brannen@terabeam.com>
Date: Fri, 21 Dec 2001 01:49:04 +0000 (UTC)

In the process of typing this algorithm into some examples, I realized
that I can improve the part of the algorithm that is involved with the
base conversion.  This will speed the circuit up only very slightly, but
a few LUTs will get removed, and it's a sweet non-standard use of a Virtex
carry chain.  The algorithm involves using a Xilinx carry chain to compute
some, but not all, of the carries in the addition of a constant to a variable.

The section this applies to is here:

-- From this it is obvious that to convert the number "A" in base 8
-- to base 8-4, I need merely add the octal constant o444444... to "A".
-- This perfectly converts it to the corresponding (i.e. carrying the
-- same numerical value) number in base 8-4.
--
-- This conversion is very convenient when the multiplier has a lot of
-- bits, but it isn't needed for relatively short multipliers.  In
-- particular, a multiplier of n x 5 would do well to avoid performing
-- the base conversion explicitly.
--
--
-- After performing the base conversion, I take each digit from B
-- (where B = A + o44...44. = A + "100100100...100100") and use
-- it to create a single partial product.  For an n x (3m-1) multiply,
-- I'll end up with m partial products.

What I realized is that I don't need to send the complete "Base 8-4"
value to the logic that determines the mode lines.  (Each set of mode
lines takes 3 bits of the Base 8-4 version of the multiplier and
creates 2 mode bits suitable for selecting which partial result to
compute, i.e. which of {1M, 2M, 3M, 4M}.)  Instead, I can make the
mode lines work off of four values: the 3 bits of the base 8 version
of the multiplier, plus one more bit which indicates whether or not a
carry into these three bits will be caused by the addition of the
"4444...444" constant.

Example, 24 bits.  Each of the eight partial products will always use
its associated three bits.  The lowest partial product needs no other
input.  The other partial products need one more bit, and this bit
is the carry-in to that digit in the + "44..44" calculation.  To get
this, I need a chain of functions C.  Each C is a function of four
variables.  Three of them are the bits of the previous digit, the fourth
is the carry-in to the previous digit.  Together, those four values give
the carry-in to the given digit.  The whole thing connects together like
this:

                 X
   222 211 111 111 11
   321 098 765 432 109 876 543 210
   --- --- --- --- --- --- --- ---
    7   6   5   4   3   2   1   0   (digit #)
   ||| ||| ||| ||| ||| ||| ||| |||
   VVV VVV VVV VVV VVV VVV VVV VVV
    C<--C<--C<--C<--C<--C<--C<--C--'0'
    8   7   6   5   4   3   2   1

C(0) = '0';

C(I) = X(3*I-1) or (C(I-1) and X(3*I-2) and X(3*I-3));

(Note: C(1) = X(2).)

The logic for the mode bits for the Ith digit is then a function
of X(3*I+2 downto 3*I) & C(I).  In that logic, C acts exactly as
a carry-in to the three X bits used, modifying the mode lines
appropriately.
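
As a sanity check, the C recurrence can be modeled in software.  The
following is only a sketch in Python, not part of the design.  It assumes
that a "base 8-4" digit b stands for the signed value b - 4 (so digits run
from -4 to 3), and the function names are made up for illustration.

```python
def carry_chain(x, digits):
    """Return [C(0), C(1), ..., C(digits)] using the recurrence above."""
    c = [0]                                   # C(0) = '0'
    for i in range(1, digits + 1):
        y = (x >> (3 * (i - 1))) & 0b111      # X(3*I-1 downto 3*I-3)
        y2, y1, y0 = (y >> 2) & 1, (y >> 1) & 1, y & 1
        c.append(y2 | (c[i - 1] & y1 & y0))   # C(I)
    return c


def recode(x, digits):
    """Signed digits (assumed range -4..3), lowest first, plus the top carry."""
    c = carry_chain(x, digits)
    signed = []
    for i in range(digits):
        y = (x >> (3 * i)) & 0b111
        b = (y + c[i] + 4) & 0b111            # digit of X + o44...4
        signed.append(b - 4)                  # assumed meaning of a base 8-4 digit
    return signed, c[digits]


def value(signed, top_carry):
    """Numerical value represented by the signed digits and the final carry."""
    total = sum(d * 8 ** i for i, d in enumerate(signed))
    return total + top_carry * 8 ** len(signed)


print(recode(0o17, 2))   # ([-1, 2], 0), i.e. 2*8 - 1 = 15
```

Each digit only needs its own three X bits and C(I); the full sum
X + o44...4 is never formed.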

It turns out that the "C" function can be programmed into the
carry structure of a Virtex.  (I haven't built this or simulated
it yet, so there could be an error, but I'm pretty sure this
works.)  The carry outputs are brought out of the carry chain, the
sum outputs are ignored.  This means that the carry chain leaves
the flip-flops, along with their CE, CLK, and SR controls unused.

X(3*I-1) is placed on "I0", and sent to the '0' input of the MUXCY,
which gets its select from the LUT, as is standard.  Only 3 inputs
to the LUT are used, so it's really a LUT3.

When the LUT3 is zero, this will send X(3*I-1) to the carry.  The
'1' input of the MUXCY is connected to C(I-1), as is standard with
Virtex carry chains.  The other two LUT3 inputs are connected to
X(3*I-2 downto 3*I-3).  The LUT3 is programmed to give a '1' only
to the case where X(3*I-1 downto 3*I-3) == "011".  This is the
propagate case.  If X is this value, the carry-out needs to be equal
to the carry-in, and this is exactly what happens.

The resulting operation in the LUT3 / Carry chain section is as follows,
where I've renumbered the three bits of X to be Y(2 downto 0):

   Y
  210 CIN | COUT  Description
  --- --- + ----  -----------
  000  0  |  0
  000  1  |  0
  001  0  |  0
  001  1  |  0
  010  0  |  0
  010  1  |  0
  011  0  |  0
  011  1  |  1    Propagate case
  100  0  |  1    Generate
  100  1  |  1    Generate
  101  0  |  1    Generate
  101  1  |  1    Generate
  110  0  |  1    Generate
  110  1  |  1    Generate
  111  0  |  1    Generate
  111  1  |  1    Generate

It's pretty clear that this is exactly what I need.
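
The table can also be checked mechanically.  This is a small behavioral
sketch in Python (the function names are hypothetical, and it models only
the logic, not the Virtex primitives' timing or placement).  It compares the
LUT3 + MUXCY cell against both the C(I) equation and the plain arithmetic
carry of Y + 4 + CIN.

```python
def muxcy(s, di, ci):
    """Carry mux: passes the carry-in when S is 1, otherwise the '0' input."""
    return ci if s else di


def lut3_propagate(y):
    """LUT3 contents: '1' only for Y = "011", the propagate case."""
    return 1 if y == 0b011 else 0


def carry_cell(y, cin):
    """One stage of the chain: Y(2) feeds the '0' input of the MUXCY."""
    return muxcy(lut3_propagate(y), (y >> 2) & 1, cin)


def carry_equation(y, cin):
    """C(I) = Y(2) or (CIN and Y(1) and Y(0))."""
    return ((y >> 2) & 1) | (cin & ((y >> 1) & 1) & (y & 1))


def truth_table():
    """List of (Y, CIN, COUT) rows in the same order as the table above."""
    return [(y, cin, carry_cell(y, cin)) for y in range(8) for cin in (0, 1)]


for y, cin, cout in truth_table():
    assert cout == carry_equation(y, cin)
    assert cout == (y + 4 + cin) >> 3      # carry out of the digit after adding 4
```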

The result is an extremely efficient way of precomputing the carry-ins.
Only one LUT is used per 3-bits of the X multiplier.

I'll be coding this up over the next few days, providing I find the time.

Carl


-- 
Posted from firewall.terabeam.com [216.137.15.2] 
via Mailgate.ORG Server - http://www.Mailgate.ORG

Article: 37826
Subject: Re: Defauolt Should Be "Inputs and Outputs" For IOBs
From: "S. Ramirez" <sramirez@cfl.rr.com>
Date: Fri, 21 Dec 2001 02:56:09 GMT

"Rick Filipkiewicz" <rick@algor.co.uk> wrote in message
news:3C21B681.887ECA2D@algor.co.uk...
> I think I've figured out how these things happen & how the obvious default
> of ``-pr b'' doesn't happen.
>
> It goes like this
> 7. User complains of stupid default value.
> 8. engineer agrees, changes it, and then gets told ``You can't do that,
> it's now in the manual''.
> Swap h/w for s/w and all the above is based on personal experience.

Don't forget the marketing guy telling everyone that leaving the default as
"don't pack IOB FFs" forces the engineers to use an attribute that makes
their code less portable, thus locking in Xilinx.  Of course the newbie
won't make his setup and clock to out times, and he'll switch anyway.

Article: 37827
Subject: Re: Best-case timing?
From: Bob Perlman <no-spam@sonic.net>
Date: Fri, 21 Dec 2001 03:32:20 GMT

On Fri, 21 Dec 2001 08:47:02 +1300, Jim Granville
<jim.granville@designtools.co.nz> wrote:

>Bob Perlman wrote:
>> 
>> On Thu, 20 Dec 2001 18:19:54 GMT, Andy Peters
>> <andy@exponentmedia.nospam.com> wrote:
>> 
>> >Bob Perlman wrote:
>> >
>> >> On Wed, 19 Dec 2001 17:19:58 GMT, Andy Peters
>> >> <andy@exponentmedia.nospam.com> wrote:
>> >>
>> >>
>> >>>Stephen Byrne wrote:
>> >>>
>> >>>
>> >>>>I originally posted this yesterday on google groups, but I'm not seeing it
>> >>>>on my home news server.  In case it is not visible to all, I'm reposting.
>> >>>>
>> >>>>Hello All,
>> >>>>
>> >>>>My company is currently comparing 66MHz PCI core solutions from Xilinx
>> >>>>and Altera, as well as debating using a home-spun core.  One issue
>> >>>>I've come upon is the PCI requirement for a MAX clock-to-out time of 6
>> >>>>ns and MIN clock-to-out time of 2 ns.  Both the Xilinx ISE and Altera
>> >>>>Quartus II tools seem very helpful in supplying MAX (worst-case) Tco
>> >>>>times, but I don't see any info on best-case times.  Apparently the
>> >>>>SDF files for back-annotated timing sim have the same worst-case
>> >>>>numbers repeated 3 times, resulting in the same simulation regardless
>> >>>>of case selection.  My question is: how is anyone (FPGA vendors
>> >>>>included) guaranteeing a MIN Tco of 2 ns across all conditions and
>> >>>>parts if the design tools don't even yield that information?
>> >>>>
>> >>>
>> >>>You like to live dangerously if you depend on best-case timing information.
>> >>>
>> >>
>> >> What's the alternative?
>> >
>> >
>> >
>> >Um, worst-case timing information?
>> 
>> Worst-case timing information isn't a substitute for best-case timing
>> information; you need both.  If you're trying to calculate setup
>> margin, you need the worst-case clock-to-Q time of the driving device.
>> But if you're trying to calculate the hold margin, you need the
>> best-case (i.e., shortest) clock-to-Q time.  That's why Stephen Byrne
>> was looking for best-case timing.
>> 
>> Bob Perlman
>
> Seems a terminology problem.
> BOTH the shortest and longest time delays can be considered 'worst
> case', from a statistical and tolerance sense.

I don't see how this can be written off as a disagreement on
terminology.  Take a look at Stephen Byrne's original post, above.
It's clear that he already has max timing numbers, which he's called
"worst-case", and that he's looking for min timing, which he calls
"best-case."  We can quibble about terms, but he's done us the favor
of defining what he means (besides, 95% of the designers I've worked
with use the very same terminology).  It's also clear that he's
talking about communications among chips on a PCI bus, not between a
chip and itself, where delay tracking might help.
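
To make the max/min distinction concrete, here is a rough Python sketch of
the two margin calculations.  Only the 6 ns and 2 ns clock-to-out figures
come from this thread; the cycle time, propagation, setup, hold and skew
values below are illustrative assumptions, and the function names are made
up for the example.

```python
def setup_margin(t_cycle, tco_max, t_prop_max, t_su, t_skew):
    """Setup check uses the slowest (max) clock-to-out of the driver."""
    return t_cycle - (tco_max + t_prop_max + t_su + t_skew)


def hold_margin(tco_min, t_prop_min, t_hold, t_skew):
    """Hold check uses the fastest (min) clock-to-out of the driver."""
    return (tco_min + t_prop_min) - (t_hold + t_skew)


# Clock-to-out limits quoted in the original post (ns).
TCO_MAX = 6.0
TCO_MIN = 2.0

# Illustrative assumptions (ns), not taken from the thread.
T_CYCLE = 15.0      # roughly a 66 MHz clock period
T_PROP_MAX = 5.0    # longest board propagation delay
T_PROP_MIN = 0.0    # shortest board propagation delay
T_SU = 3.0          # receiver setup time
T_HOLD = 0.0        # receiver hold time
T_SKEW = 1.0        # clock skew between the two chips

print(setup_margin(T_CYCLE, TCO_MAX, T_PROP_MAX, T_SU, T_SKEW))
print(hold_margin(TCO_MIN, T_PROP_MIN, T_HOLD, T_SKEW))
```

Without a guaranteed minimum clock-to-out, the second calculation simply
cannot be done, whatever the max numbers say.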

Bob Perlman
Cambrian Design Works

Article: 37828
Subject: Re: Clock pins in Virtex-E
From: "H.L" <alphaboran@yahoo.com>
Date: Thu, 20 Dec 2001 19:33:06 -0800

Thanks a lot for the information, none, you are very helpful.
I want to ask something else. I synthesize my VHDL models using FPGA EXPRESS
3.6, but I can't edit any constraints there, so I first produce the EDIF from
my models and later edit the timing constraints using the ISE4.1i (alliance)
constraint editor. I have heard that this is not a good thing, and that the
best option is to use the ucf during synthesis. In FPGA EXPRESS I can't edit
constraints because this option is unavailable (so is the "view schematic"
option); xilinx refers to that and says I need a special license. I e-mailed
xilinx and asked them how I can enable these 2 functions, and they sent me a
license.dat file where FPGA EXPRESS is not mentioned in the package
declaration, so I can't load the program. I searched everywhere but nothing
works. I am sure that I made the correct changes in the license.dat file
(hostid, host name, path of the daemon etc). The sad thing for me is that
xilinx's support will be closed during the Christmas holidays, and that means I
will have a 10 day delay, which is going to be catastrophic. Is there any way I
can fix that problem and use FPGA EXPRESS "edit constraints" and "view
schematic"?

Thanks and my best regards,

Harris

"none" <x@y.z> wrote in message news:RIlU7.942$yw1.4862@news.uk.colt.net...
> Hi Harris,
>
> > I have a small question. I work in a Virtex-E FPGA, my model has 4 clocks
> > (3 in 155MHz and the other one slower~100MHz) as inputs. In the ucf file I
> > located all clocks in the GCK pins, is that right?
> Yup
>
> > If yes, is this the only constraint that ensures proper distribution of
> > the signals?
> I think it is likely that your synthesis tool will infer clock buffers for
> you (provided you use the dedicated clock pins).
> Alternatively you could instantiate the components directly to remove any
> doubt (I needed to 'cos at the time it was the only way to get the LVPECL
> input standard that I needed).
> I used IBUFG to get the clock on chip, then BUFG to distribute it for
> simple clocking and
> IBUFG then CLKDLLE then BUFG in a more complex setup.
> You could do this if your synthesis tool doesn't infer clock distribution
> directly (try it and see).
>
> > Ah , and another one :) .. one of the processes (VHDL coding) in my model
> > that i want to implement uses the falling edge of the slow clock while the
> > others use the rising edge of all clocks, is this going to be a problem?
> No problem, an entity can contain rising edge clocked processes and falling
> edge clocked processes, you just can't have rising & falling in one process.
>
> Fred
>
>
>

Article: 37829
Subject: Re: Hardware FPGA questions
From: "H.L" <alphaboran@yahoo.com>
Date: Thu, 20 Dec 2001 19:43:10 -0800

Hello,

Is that difference in speed owing to "faster" flip-flops?
So if you want to buy an FPGA, must you determine the speed grade you want, or
does the same FPGA have the ability to operate at different speeds (e.g. 6, 8)?
I am a bit confused.
Harris

"Ray Andraka" <ray@andraka.com> wrote in message
news:3C216F7E.76102A77@andraka.com...
> Those suffixes are the speed grade.  The parts are graded as they come
> through test and "binned" according to their performance scores and the
> relative demand for the various speed grades.  This way the vendor can
> sell the faster parts at a higher premium.  For the virtex families, the
> higher the number, the higher the performance.  In the 4K families, it
> went the other way with the smaller numbers indicating faster parts.
>
> Better is a relative term.  If you need the speed, then the faster parts
> are 'better'.  If you need to keep your costs down, then the slower parts
> are 'better' because they are significantly cheaper.
>
> Antonio wrote:
>
> > Some hardware question on FPGA :
> >
> > 1) What's the difference between a part with speed -3 and another with
> > speed -4 , the number is the number of metal layers ??
> >
> > 2) I read data sheet of Virtex and Virtex E, I didn't find really
> > much difference, can you explain to me which is better and why ??
> >
> > Thanks
>
> --
> --Ray Andraka, P.E.
> President, the Andraka Consulting Group, Inc.
> 401/884-7930     Fax 401/884-7950
> email ray@andraka.com
> http://www.andraka.com
>
>  "They that give up essential liberty to obtain a little
>   temporary safety deserve neither liberty nor safety."
>                                           -Benjamin Franklin, 1759
>
>

Article: 37830
Subject: Re: DCM stability in Virtex2 -ES
From: David Miller <spam@quartz.net.nz>
Date: Fri, 21 Dec 2001 17:27:59 +1300

> The recommendation is to wait 500 ms at room temp (longer if cold), to 


I have done some more work on this and have found that it behaves itself 
when the feedback loop samples the CLK0 output directly.  This DCM is 
being used for deskewing a board level clock, so I want to connect an 
IOB to FB, not CLK0.  When I do this, the problem reappears.  It's quite 
consistent and reproducible - it breaks when I use external feedback.

Any ideas?

The IOBs (both output K and K# and input FB) are set to HSTL_II_DCI.

Thanks for your help.

-- 
David Miller, BCMS (Hons)  | When something disturbs you, it isn't the
Endace Measurement Systems | thing that disturbs you; rather, it is
Mobile: +64-21-704-djm     | your judgement of it, and you have the
Fax:    +64-21-304-djm     | power to change that.  -- Marcus Aurelius

Article: 37831
Subject: Re: Hardware FPGA questions
From: Peter Alfke <palfke@earthlink.net>
Date: Fri, 21 Dec 2001 04:29:28 GMT

Kevin Brace wrote:

>
> I must say that if Virtex-E/Spartan-IIE supported 5V PCI I/O, I would rather
> have used them instead of Virtex/Spartan-II because newer devices tend
> to be cheaper than the older devices of the same density, or for the
> same amount of money, you get more gates (and features).
> Although not the question you asked, Altera basically did the same thing
> ...

It was with a heavy heart that we dropped 5-V tolerance, but the newer
processes do not support it, at least not at a reasonable cost.
We are being pushed relentlessly to build faster and cheaper chips, and
there is one thing that has to give: it's the oxide thickness and thus the
resultant supply voltage and input voltage tolerance.
From our perspective, 5-V really ought to be retired. I remember when it
was introduced as DTL and later TTL supply voltage around 1965. That means
it has lived a long and productive life. Let's do our best to retire this
standard...
And, watch out, 3.3 V will not live forever, either!
But as you said, you can still buy 5-V Vcc devices, and 3.3-V Vcc devices
with 5-V input tolerance. They are just not the fastest, biggest or most
cost-effective ones.

Peter Alfke

Article: 37832
Subject: Re: Defauolt Should Be "Inputs and Outputs" For IOBs
From: Peter Alfke <palfke@earthlink.net>
Date: Fri, 21 Dec 2001 04:40:55 GMT

"S. Ramirez" wrote:

> Don't forget the marketing guy telling everyone that leaving the default as
> "don't pack IOB FFs" forces the engineers to use an attribute that makes
> their code less portable, thus locking in Xilinx...

We are really not that devious.
Instead, we suggest the designer use the DCM (with 50-ps phase-delay stepping),
the dual-ported ( honestly dual-ported ) BlockRAM, the multiplier, the SRL16
and the digitally-controlled output impedance.
That should keep the other guys out...  :-)

Peter Alfke

Article: 37833
Subject: Re: How can I reduce Spartan-II routing delays to meet 33MHz PCI's Tsu < 7 ns requirement?
From: Phil Hays <spampostmaster@home.com>
Date: Fri, 21 Dec 2001 04:42:13 GMT

Kevin Brace wrote:

> Hi, I would like to know if someone knows the strategies on how to reduce
> routing (net) delays for Spartan-II.

A few things.

1) Look very hard at how logic on failing paths is designed.  Is there a simpler
way to do the function?  Can you split a complex function into two simple
functions?  Can you move some of the logic to the other side of registers?

2) Does XST re-order logic?  If so, you might make sure that the order of
functions is good:

x = f(a,b,c,f(d,e,f,g)) will be faster for a,b and c than for d,e,f and g.  Fine
if a is the critical signal, bad if g is.
Change it to (and I don't know enough about XST to tell you how to do this):

x = f(g,a,b,f(c,d,e,f)) or similar with the speed critical net having the fewest
levels of logic.

[f(a,b,c) is a three input lookup table with input signals a, b and c]

3) What effort level are you running PAR at?  "5" is the highest.  Use it.
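
The ordering argument in point 2 can be illustrated by counting lookup table
levels per input.  This is only a toy Python sketch with a made-up
representation (a LUT is a tuple of its inputs, a signal is a string); it
says nothing about how XST actually maps logic.

```python
def levels(expr, depth=0, out=None):
    """Map each signal name to the number of LUT levels between it and x.

    expr is either a signal name (str) or a tuple holding the inputs of
    one lookup table; nested tuples are LUTs feeding other LUTs.
    """
    if out is None:
        out = {}
    if isinstance(expr, str):
        out[expr] = max(out.get(expr, 0), depth)   # keep the longest path
    else:
        for arg in expr:
            levels(arg, depth + 1, out)
    return out


# x = f(a,b,c,f(d,e,f,g)): g passes through two LUTs.
g_is_slow = ("a", "b", "c", ("d", "e", "f", "g"))

# x = f(g,a,b,f(c,d,e,f)): g passes through one LUT.
g_is_fast = ("g", "a", "b", ("c", "d", "e", "f"))

print(levels(g_is_slow)["g"])   # 2
print(levels(g_is_fast)["g"])   # 1
```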

> Here are some solutions I came up with.
> 
> 1) Reduce the signal fanout (Currently at 35 globally, but FRAME# and
> IRDY#'s fanout are 200. What number should I reduce the global fanout
> to?).

If you have a problem with fanout, you may want to control how the fanout is
split up.  Telling the synthesis tool to reduce fanout isn't good, as the
synthesis tool does not have a clue as to how the logic is located, so it may
split the net in a way that makes no sense.  No, I should say it will split nets
in ways that make no sense.  This may mean that you will need to add a module to
your design with the buffering for this net.  Again, I don't know how to force
mapping of logic in XST.

> 3) Floorplan all the LUTs and FFs on the FPGA (currently, I only
> floorplanned the LUTs that violated Tsu, and most of them take inputs
> from FRAME# and IRDY#.).

Logic that is near the critical paths may need to be floorplanned to avoid
interaction with the critical path.  "Near" can be logical or physical.

> 4) Use Guide file Leverage mode in Map and Par.

This might help.  To use this feature, make a sub-design with the critical path
and as little else as reasonable, and PAR this design into "my_guidefile.ncd". 
Then do a guided MAP and PAR with this as a guide file.

> P.S.  Considering that I am struggling to meet 33MHz PCI timings with
> Spartan-II speed grade -5, how come Xilinx meet 66MHz PCI timings on
> Virtex/Spartan-II speed grade -6? (I can only barely meet 33MHz PCI
> timings with Spartan-II speed grade -6 using floorplanner.)

They are good, and they cheat.  Their design is clever and well done, and they
use a "magic_box", a bit of dedicated logic that can only be used from
FPGA_editor.

> I know that Xilinx uses the special IRDY and TRDY pin in LogiCORE PCI,
> but that won't seem to help FRAME#, since FRAME# has to be sampled
> unregistered to determine an end of burst transfer.

Question to make you think:  What do you NEED to do at the end of a burst
transfer?  And when?

-- 
Phil Hays

Article: 37834
Subject: Re: Clock pins in Virtex-E
From: "Carl Brannen" <carl.brannen@terabeam.com>
Date: Fri, 21 Dec 2001 04:50:23 +0000 (UTC)

Re those 3 155MHz clocks.

SONET?

Carl




-- 
Posted from firewall.terabeam.com [216.137.15.2] 
via Mailgate.ORG Server - http://www.Mailgate.ORG

Article: 37835
Subject: A ram wish
From: "Rob Finch" <robfinch@sympatico.ca>
Date: Fri, 21 Dec 2001 00:04:28 -0500

I wish block ram had an async read option like distributed ram. How hard
would this be to do? The problem I have is using the same port to perform
both read and write operations for a cpu. The cpu always generates an
address that is a registered output, registered on the clock edge. So just
after the clock edge, the address is available. This works great for writes
because the next clock edge can be used to write the data to the block ram.
However it doesn't work for reads, because we want the next clock edge to
latch the data into a cpu register. Instead, the read data isn't available
until after the next clock edge. So 1) a wait state could be inserted for
read operations (cuts performance in half), or 2) we can use the address from
the cpu as it is just before it's registered and use a second port of the
block ram, which means we have two address busses and the block ram can't be
shared with another device, or twice as many block rams are required.
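
The cost of option 1 can be seen in a cycle-level toy model.  The Python
sketch below is only an illustration under simplifying assumptions (one read
after another, the cpu registers the next address on the same edge that
latches the data); the function and variable names are invented for the
example.

```python
def run_reads(mem, addrs, sync_read):
    """Return (clock edge, data) for each read, counting edges from the one
    on which the cpu registers the first address (edge 0)."""
    if not addrs:
        return []
    pending = list(addrs)
    cpu_addr = pending.pop(0)     # edge 0: cpu registers the first address
    ram_q = None                  # registered output of a synchronous ram
    data_valid = False            # True once ram_q holds mem[cpu_addr]
    results = []
    edge = 0
    while True:
        edge += 1
        if sync_read:
            if not data_valid:
                ram_q = mem[cpu_addr]   # ram samples the address on this edge
                data_valid = True       # cpu spends this edge in a wait state
                continue
            data = ram_q
        else:
            data = mem[cpu_addr]        # async read: data is combinational
        results.append((edge, data))    # cpu latches the data on this edge
        if not pending:
            return results
        cpu_addr = pending.pop(0)       # same edge registers the next address
        data_valid = False


mem = [10, 11, 12]
print(run_reads(mem, [0, 1, 2], sync_read=False))  # [(1, 10), (2, 11), (3, 12)]
print(run_reads(mem, [0, 1, 2], sync_read=True))   # [(2, 10), (4, 11), (6, 12)]
```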

Rob

Article: 37836
Subject: Re: Dual-port ram templates
From: Russell Shaw <rjshaw@iprimus.com.au>
Date: Fri, 21 Dec 2001 16:19:31 +1100

I found out how to do it (from helpful altera support).

Use the maxplus2 wizard to generate the dual-port ram vhdl files.

Copy the component declaration wizard code into your architecture
code, and instantiate it.

Leonardo generates an .edf file with the component as a black box.

Maxplus2 reads a wizard-generated vhdl file to fill in the
black-box component. Make sure the wizard files are in the
same directory as the .edf file.

Mike Treseler wrote:
> 
> Russell Shaw wrote:
> 
> > -- Pre Optimizing Design .work.sync_dpram_8_8.synth
> > -- Boundary optimization.
> > "E:/AAProjs/Bugs/Leonardo/main.vhd", line 34:Info, Inferred ram instance 'ix26409' of type
> > 'ram_dq_da_inclock_outclock_8_8_256'
> 
> So it found the right module but .  .  .
> 
> > -- optimize -target acex1 -effort quick -chip -area -hierarchy=auto
> > Using default wire table: STD-1
> > Warning, Dual read ports not supported for FLEX/APEX/MERCURY RAMs; using default implementation.
> > Warning, using default ram implementation for ram_dq_da_inclock_outclock_8_8_256, run time can get large.
> 
> . . . it refused to use it.
> 
> Since acex1k is not in the unsupported list above,
> it's either a bug or a deliberate dumbing down
> of the oem version.
> 
> Note that this ram is inferred properly with
> acex1k technology on the mentor version of leo.

Article: 37837
Subject: Re: Clock pins in Virtex-E
From: "H.L" <alphaboran@yahoo.com>
Date: Fri, 21 Dec 2001 08:57:00 +0200


WDM

Harris

"Carl Brannen" <carl.brannen@terabeam.com> wrote in message
news:94c9d180ad1ec9713e5672513e311ddb.51709@mygate.mailgate.org...
> Re those 3 155MHz clocks.
>
> SONET?
>
> Carl
>
>
>
>
> --
> Posted from firewall.terabeam.com [216.137.15.2]
> via Mailgate.ORG Server - http://www.Mailgate.ORG

Article: 37838
Subject: Re: How can I reduce Spartan-II routing delays to meet 33MHz PCI's Tsu <
From: Kevin Brace
Date: Fri, 21 Dec 2001 01:37:07 -0600

Carl Brannen wrote:
> 
> Kevin,  Are you at a relatively high utilization on the part (presumably with
> logic from something other than the PCI interface)?
> 

        The logic utilization of the version for which I posted part of
the Timing Analyzer report is about 40% of a Spartan-II 150K system gate
part (about 700 Configurable Logic Blocks (CLBs) = about 1,400 Slices).
Because developing the PCI IP core is more important to me than spending
time on the user side logic, I kept the user side to a really
minimal design (the user side takes only single I/O cycles).
I don't know exactly, but I am sure that the user side uses less than
10% of Spartan-II's logic resources (I think around 7%).


> If you are, one great strategy is to remove all your logic but a stub, and then
> get the PCI interface to route to perfection.  Then you use that result as a
> guide file (or copy the PCI placement into the UCF file).
> 
> If this works, then you've just got the tool to do the routing for you.  I
> don't know if it will help here, but the technique works great on Altera
> designs especially.
> 

        Xilinx tools also have a Guide mode, but somehow when I set
Guide Mode for P&R to Leverage rather than Exact, the P&R software
crashes.
When the Guide Mode is at Exact, the P&R software doesn't crash.
I haven't checked out Xilinx's support page, so I don't know the status
of this bug, but I hope the next release of ISE WebPack will fix this
problem.
It is probably because I don't have a good understanding of the Guide
Mode, and perhaps because Leverage Guide Mode doesn't work properly in
P&R software, but I haven't seen any major improvement in routing from
using Guide Mode.
I know that Xilinx uses a Guide file for their 66MHz PCI LogiCore, but
not for 33MHz one, so Guide file must be doing something, so I will
still think about using this method in the future when the P&R
software's bug is fixed.
        Regarding Altera's tools, does MAX+PLUS II or Quartus II 1.1
support floorplanning like Xilinx does?
I am thinking of porting my PCI IP to Altera's FPGAs like ACEX 1K or
FLEX10KE, and would like to know if the floorplanning support is as good
as Xilinx's because, from my experience of using Xilinx devices,
automatic place & route tools just don't do a good job of placing the
LUTs in a good location and related ones close to each other.
Quartus II 1.1 Web Edition (I only use free tools because I am poor)
looks like it is as good as ISE WebPack 4.1, but I personally would
rather not deal with MAX+PLUS II-BASELINE because the tool looks hard to
use and old.


> The idea is to route the critical logic first, but do it while unloading the
> place and route from having to deal with the uncritical logic.
> 
> Best of luck.
> 
> Carl
> 
> --
> Posted from firewall.terabeam.com [216.137.15.2]
> via Mailgate.ORG Server - http://www.Mailgate.ORG

        In my case, I floorplanned timing critical LUTs by hand, but
because ISE WebPack 4.1 doesn't come with FPGA Editor, I don't know how
the routing is being done.
I used to treat everything (devices and tools) as a blackbox because the
design is done on HDL, but I no longer want to treat everything as a
blackbox because I no longer totally trust automatic tools.



Kevin Brace (don't respond to me directly, respond within the newsgroup)

Article: 37839
Subject: 16x5 multiplier uses new multiply algorithm
From: "Carl Brannen" <carl.brannen@terabeam.com>
Date: Fri, 21 Dec 2001 09:18:17 +0000 (UTC)

This 16x5 unsigned multiplier uses the algorithm listed above.  It has a
single register stage, at the end.  I'm fairly sure it works, having simulated
a lot of numbers through it.

If it were to be pipelined, it would be most efficient to put the first
register stage at the M input, but at the "MODE" stage in the X input.

It uses 33 slices, with a total of 65 FGs used.  Two of the FGs are programmed
to be zero so that two carry-outs can be made visible.  I don't immediately
see how to avoid this.

Despite being a fall through, with no internal registers, (though note that
the final register is necessary in order to get M x zero = zero), and without
being floorplanned, (I haven't even gone back through the code to see if I
can improve it), it still gets 131MHz in the xcv50e -8.

Were it to be pipelined, it would be natural to bring the "X" input into the
logic a clock early.  This would allow the mode inputs for the partial
products to be registered.  This could be done without unbalancing the
multiplier by registering the "M" inputs on the inputs, but registering the
"X" inputs only after a clock.  I haven't taken a look at how to minimize
the logic in this case...

I typically over-comment my VHDL. I've removed the comments here to save
bandwidth on the internet.  If anyone is interested, I can add them back in.

Thanks to Frédéric Rivoallon at Xilinx for e-mailing me a link to instructions
on how to instantiate LUT4s inside generate statements without a lot of grief.
For those interested, the link is here:
http://tech-www.informatik.uni-hamburg.de/vhdl/doc/faq/FAQ1.html#attributes

This is the first in a series of multipliers.  The next, a 16x8, will use
3 partial products, which is not a particularly natural number for this
algorithm.  But the one after that, 16x11, will use 4 and will be quite
sweet.

Total LUT usage with this algorithm will increase by 3 per additional bit
beyond 16.  That is, the number of LUTs for a Nx5 multiplier will be about
LUT( multiplier Nx5) = 17 + 3N.

The 3 adders are hooked up as follows:

LUT#1 \           -- PP0V creates { 1M, 2M, 3M, 4M}
      + LUT#3     -- PS0V creates final result between 0M and 31M
LUT#2 /           -- PP3V creates { 8M,16M,24M,32M}
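
To see how one LUT per bit can produce any of { 1M, 2M, 3M, 4M}, here is a
behavioral Python sketch of the PP0 adder.  It uses the LUT4 INIT value
"7484" and the port mapping from the listing below, and models LUT4,
MULT_AND, MUXCY and XORCY as plain logic.  The handling of bit 0 is a
shortcut of this sketch (the listing derives the two lowest bits directly
from X), and all Python names are made up.

```python
INIT = 0x7484   # LUT4 contents taken from the listing below


def lut4(init, i3, i2, i1, i0):
    """Standard LUT4 lookup: the inputs form the index into the INIT bits."""
    return (init >> ((i3 << 3) | (i2 << 2) | (i1 << 1) | i0)) & 1


def bit(v, i):
    return (v >> i) & 1


# (PP0M(1), PP0M(0)) for each multiple, as in the PP0M select statement.
MODES = {1: (0, 1), 2: (0, 0), 3: (1, 1), 4: (1, 0)}


def partial_product(m, mode1, mode0):
    """Magnitude produced by the chain for a 16-bit unsigned m."""
    em = m & 0xFFFF                     # EM0V = "000" & M
    result = bit(em, 0) & mode0         # sketch shortcut for bit 0
    carry = 0                           # PP0_CRY(1) <= '0'
    for i in range(1, 18):
        lut = lut4(INIT, bit(em, i), mode0, bit(em, i - 1), mode1)
        di = mode1 & bit(em, i - 1)     # MULT_AND
        result |= (lut ^ carry) << i    # XORCY: sum bit
        carry = carry if lut else di    # MUXCY: next carry
    return result


for k, (mode1, mode0) in MODES.items():
    assert partial_product(0xA5A5, mode1, mode0) == k * 0xA5A5
```

In the 4M mode the LUT outputs zero for every bit, so each stage forwards
its MULT_AND output into the carry and the sum bit is just the incoming
carry, which gives the two-bit shift.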

The usual algorithm for multiplying by 5 bits on a Virtex will require 4 LUTs
per bit.  The adder tree will look like this (maybe a slightly different
topology will be better):

LUT#1 \                  -- creates { 0M, 1M, 2M, 3M}
       + LUT#3 \         -- creates { 0M ... 15M}
LUT#2 /         \        -- creates { 0M, 4M, 8M,12M}
                 + LUT#4 -- creates final result between 0M and 31M
(M)-------------/        -- creates { 0M,16M} (AND gate absorbed into LUT#4)

For extremely wide multiplies, the savings of the new algorithm approach
25% over the old technique.
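
The 25% figure follows from the two LUT counts.  A quick Python sketch,
assuming the conventional multiplier costs exactly 4 LUTs per bit of M with
no constant term (that simplification is an assumption of this sketch, so
the comparison is only meaningful for large N):

```python
def luts_new(n):
    """LUT count for an Nx5 multiplier with the new algorithm: 17 + 3N."""
    return 17 + 3 * n


def luts_conventional(n):
    """Assumed count for the usual adder tree: 4 LUTs per bit."""
    return 4 * n


def savings(n):
    """Fraction of LUTs saved relative to the conventional count."""
    return 1 - luts_new(n) / luts_conventional(n)


print(luts_new(16))                 # 65, matching the FG count reported above
for n in (64, 256, 1024):
    print(n, round(savings(n), 3))  # tends toward 0.25 as N grows
```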


-- Multiplier code, 16x5 multiplier
-- Design by Carl Brannen.
-- Uses 3 + 2 bit coding.

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;

entity MUL16x5S is
    port (
        CLK:  in  STD_LOGIC;
        M:    in  STD_LOGIC_VECTOR(15 downto 0);
        X:    in  STD_LOGIC_VECTOR( 4 downto 0);
        Y:    out STD_LOGIC_VECTOR(20 downto 0);
        TEST: in  STD_LOGIC
    );
end MUL16x5S;

architecture MUL16x5S_arch of MUL16x5S is
component LUT4 port (
  I0: in  STD_LOGIC;
  I1: in  STD_LOGIC;
  I2: in  STD_LOGIC;
  I3: in  STD_LOGIC;
  O:  out STD_LOGIC);
end component;
attribute INIT: string;
component XORCY port (
	CI:	in  STD_LOGIC;
	LI:	in  STD_LOGIC;
	O:	out STD_LOGIC);
end component;
component MUXCY port (
	DI:	in  STD_LOGIC;
	CI:	in  STD_LOGIC;
	S:	in  STD_LOGIC;
	O:	out STD_LOGIC);
end component;
component MULT_AND port (
	I0:	in  STD_LOGIC;
	I1:	in  STD_LOGIC;
	LO:	out STD_LOGIC);
end component;
component FDR port (
	D:	in  STD_LOGIC;
	C:	in  STD_LOGIC;
	R:	in  STD_LOGIC;
	Q:	out STD_LOGIC);
end component;
signal EM0V:      STD_LOGIC_VECTOR(18 downto 0);   -- Extended M[]
signal PP0PN:     STD_LOGIC_VECTOR( 1 downto 0);   -- PP0.P & PP0.N
signal PP0V:      STD_LOGIC_VECTOR(20 downto 0);   -- PP0.V
signal PP0M:      STD_LOGIC_VECTOR( 1 downto 0);   -- Mode control bits
signal PP0_LUT:   STD_LOGIC_VECTOR(17 downto 1);   -- LUT      
signal PP0_MA:    STD_LOGIC_VECTOR(17 downto 1);   -- MULT_AND
signal PP0_XC:    STD_LOGIC_VECTOR(17 downto 1);   -- XORCY
signal PP0_CRY:   STD_LOGIC_VECTOR(18 downto 1);   -- Carry
signal PP0_SUM:   STD_LOGIC_VECTOR(17 downto 1);   -- Sum output
signal EM3V:      STD_LOGIC_VECTOR(20 downto 0);   -- Extended M[]
signal PP3PN:     STD_LOGIC_VECTOR( 1 downto 0);   -- PP3.P & PP3.N
signal PP3V:      STD_LOGIC_VECTOR(20 downto 2);   -- PP3.V
signal PP3M:      STD_LOGIC_VECTOR( 1 downto 0);   -- Mode control bits
signal PP3_LUT:   STD_LOGIC_VECTOR(20 downto 3);   -- LUT      
signal PP3_MA:    STD_LOGIC_VECTOR(20 downto 3);   -- MULT_AND
signal PP3_XC:    STD_LOGIC_VECTOR(20 downto 3);   -- XORCY
signal PP3_CRY:   STD_LOGIC_VECTOR(21 downto 3);   -- Carry
signal PP3_SUM:   STD_LOGIC_VECTOR(20 downto 3);   -- Sum output
signal PS2SEL:    STD_LOGIC_VECTOR( 3 downto 0);   -- Select bit
signal PS2PN:     STD_LOGIC_VECTOR( 1 downto 0);   -- PS2.P & PS2.N
signal PS2V:      STD_LOGIC_VECTOR(20 downto 0);   -- PS2.V
signal PS2M:      STD_LOGIC_VECTOR( 1 downto 0);   -- Mode control bits
signal PS2_LUT:   STD_LOGIC_VECTOR(20 downto 2);   -- LUT      
signal PS2_MA:    STD_LOGIC_VECTOR(20 downto 2);   -- MULT_AND
signal PS2_XC:    STD_LOGIC_VECTOR(20 downto 2);   -- XORCY
signal PS2_CRY:   STD_LOGIC_VECTOR(21 downto 2);   -- Carry
signal PS2_SUM:   STD_LOGIC_VECTOR(20 downto 2);   -- Sum output
signal YRES:      STD_LOGIC;                       -- Reset final FF
signal YQ:        STD_LOGIC_VECTOR(20 downto 0);   -- Final FF

begin
EM0V(18 downto 0) <= "000" & M(15 downto 0);
PP0V(0) <= (EM0V(0) and X(0));
PP0V(1) <= (EM0V(0) and X(1)) xor (EM0V(1) and X(0));
PP0V(17 downto 2) <= PP0_SUM(17 downto 2);
PP0V(20 downto 18) <= "000";
PP0PN(0) <= X(2);      -- Negative bit
with X(2 downto 0) select  -- Positive bit
  PP0PN(1) <=
    '1' when "001" | "010" | "011",
    '0' when others;
with X(2 downto 0) select
  PP0M(1 downto 0) <=
    "01" when "111" | "001",  -- PP0V <= 1M
    "00" when "110" | "010",  -- PP0V <= 2M
    "11" when "101" | "011",  -- PP0V <= 3M
    "10" when others;         -- PP0V <= 4M
PP0_CRY(1) <= '0';
A0: for I in 1 to 17 generate B: block
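  -- LUT4 truth table 0x7484 (I3 = M(I), I2 = mode(0), I1 = M(I-1), I0 = mode(1)):
  --   mode "01" -> M(I), "00" -> M(I-1), "11" -> M(I) xor M(I-1), "10" -> '0'
  --   (in mode "10" the MULT_AND/MUXCY pair passes M(I-1) up the carry chain)
  attribute INIT of L0: label is "7484";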
begin
L0: LUT4 port map(
  I0 => PP0M(1),
  I1 => EM0V(I-1),
  I2 => PP0M(0),
  I3 => EM0V(I),
  O  => PP0_LUT(I));
  MA: MULT_AND port map (
    I0  => PP0M(1),
    I1  => EM0V(I-1),
    LO  => PP0_MA(I));
  MC: MUXCY port map (
    DI  => PP0_MA(I),
    CI  => PP0_CRY(I),
    S   => PP0_LUT(I),
    O   => PP0_CRY(I+1));
  XC: XORCY port map (
    CI  => PP0_CRY(I),
    LI  => PP0_LUT(I),
    O   => PP0_SUM(I));
end block b; end generate;
EM3V(20 downto 0) <= "00" & M(15 downto 0) & TEST & TEST & TEST;
PP3V(2) <= '0';
PP3V(20 downto 3) <= PP3_SUM(20 downto 3);
PP3PN(0) <= '0';           -- Negative bit (never negative)
with X(4 downto 2) select  -- Positive bit
  PP3PN(1) <=
    '0' when "000",
    '1' when others;       -- Usually positive
with X(4 downto 2) select
  PP3M(1 downto 0) <=
    "01" when "001" | "010",  -- PP3V <= 1M
    "00" when "011" | "100",  -- PP3V <= 2M
    "11" when "101" | "110",  -- PP3V <= 3M
    "10" when others;         -- PP3V <= 4M
PP3_CRY(3) <= '0';
A3: for I in 3 to 20 generate B: block
  attribute INIT of L3: label is "7484"; -- See PP0V
begin
L3: LUT4 port map(
    I0 => PP3M(1),
    I1 => EM3V(I-1),
    I2 => PP3M(0),
    I3 => EM3V(I),
    O  => PP3_LUT(I));
  MA: MULT_AND port map (
    I0  => PP3M(1),
    I1  => EM3V(I-1),
    LO  => PP3_MA(I));
  MC: MUXCY port map (
    DI  => PP3_MA(I),
    CI  => PP3_CRY(I),
    S   => PP3_LUT(I),
    O   => PP3_CRY(I+1));
  XC: XORCY port map (
    CI  => PP3_CRY(I),
    LI  => PP3_LUT(I),
    O   => PP3_SUM(I));
end block b; end generate;
PS2V(1 downto 0) <= PP0V(1 downto 0);
PS2V(20 downto 2) <= PS2_SUM(20 downto 2);
PS2SEL <= PP3PN(1 downto 0) & PP0PN(1 downto 0);
with PS2SEL select
  PS2PN(1 downto 0) <=                            -- Result of sum:
    "01" when "0101" | "0100" | "0110" | "0001",  --  Negative
    "10" when "1001" | "1000" | "1010" | "0010",  --  Positive
    "00" when others;
with PS2SEL select
  PS2M(1 downto 0) <=           -- Mode:
    "00" when "0100" | "1000",  --  A
    "01" when "0001" | "0010",  --  B
    "10" when "0110" | "1001",  -- A-B
    "11" when others;           -- A+B
with PS2M(1 downto 0) select
    PS2_CRY(2) <=
    (        '0'        ) when "00",      -- CIN = 0
    (        '0'        ) when "01",      -- CIN = 0
    (PP0V(1) nor PP0V(0)) when "10",      -- CIN = 1
    (        '0'        ) when others;    -- CIN = 0
S2: for I in 2 to 20 generate B: block
  attribute INIT of L2: label is "7C86";
begin
L2: LUT4 port map(
    I0 => PS2M(1),
    I1 => PP3V(I),
    I2 => PS2M(0),
    I3 => PP0V(I),
    O  => PS2_LUT(I));
  MA: MULT_AND port map (
    I0  => PS2M(1),
    I1  => PP3V(I),
    LO  => PS2_MA(I));
  MC: MUXCY port map (
    DI  => PS2_MA(I),
    CI  => PS2_CRY(I),
    S   => PS2_LUT(I),
    O   => PS2_CRY(I+1));
  XC: XORCY port map (
    CI  => PS2_CRY(I),
    LI  => PS2_LUT(I),
    O   => PS2_SUM(I));
end block b; end generate;
YRES <= not PS2PN(1);
F0: for I in 0 to 20 generate
FR: FDR port map (
  D => PS2V(I),
  C => CLK,
  R => YRES,
  Q => YQ(I));
end generate;
Y <= YQ(20 downto 0);
end MUL16x5S_arch;
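
For readers who want to check the "7484" INIT string and the mode tables without a VHDL simulator, here is a rough bit-level model in Python. It is only a sketch and not part of the posted design: the function names are made up for illustration, TEST is assumed to be '0', bit 0 of each multiple is filled in directly (the VHDL computes the two low product bits outside the chain), and the final add/subtract is done with ordinary integer arithmetic instead of the PS2 carry chain.

```python
# Bit-level sketch of the MUL16x5S partial products (illustrative names only).
INIT_7484 = 0x7484
# multiple k -> (mode bit 1, mode bit 0), copied from the PP0M/PP3M comments
MODE = {1: (0, 1), 2: (0, 0), 3: (1, 1), 4: (1, 0)}


def lut4(init, i0, i1, i2, i3):
    """LUT4 output: bit number (i3 i2 i1 i0) of the 16-bit INIT value."""
    return (init >> (i0 | (i1 << 1) | (i2 << 2) | (i3 << 3))) & 1


def multiple(m, k, width=16):
    """Return k*m (k in 1..4) built the way the LUT4/MULT_AND/MUXCY/XORCY chain does."""
    mode1, mode0 = MODE[k]

    def bit(i):
        return (m >> i) & 1 if 0 <= i < width else 0

    result = bit(0) & (k & 1)  # assumption: bit 0 is filled in directly
    carry = 0                  # PP0_CRY(1) <= '0'
    for i in range(1, width + 2):
        sel = lut4(INIT_7484, mode1, bit(i - 1), mode0, bit(i))
        gen = mode1 & bit(i - 1)        # MULT_AND
        result |= (carry ^ sel) << i    # XORCY
        carry = carry if sel else gen   # MUXCY
    return result


def recode(x):
    """Split a 5-bit X into digits d0 (-4..3) and d1 (0..4) with X = d0 + 8*d1."""
    low = x & 7
    d0 = low - 8 if low & 4 else low
    d1 = (x >> 3) + ((x >> 2) & 1)
    return d0, d1


def mul16x5(m, x):
    """Combine the two partial products; a zero digit drops its partial product."""
    d0, d1 = recode(x)
    pp0 = multiple(m, abs(d0)) if d0 else 0
    pp3 = multiple(m, d1) if d1 else 0
    return 8 * pp3 - pp0 if d0 < 0 else 8 * pp3 + pp0
```

Under those assumptions, mul16x5(m, x) can be compared against m * x for every x from 0 to 31, which exercises all four modes of the table.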

Carl


-- 
Posted from firewall.terabeam.com [216.137.15.2] 
via Mailgate.ORG Server - http://www.Mailgate.ORG

Article: 37840
Subject: CE on XILINX FFs and Metastability
From: "Frank Papenfuss" <frank.papenfuss@uni-rostock.de>
Date: Fri, 21 Dec 2001 10:49:15 +0100

Dear FPGA community,

I have a design that must cope with asynchronous input
signals. Basically I have a WE pulse that gates a data
vector into the chip. The WE signal is sampled by two
FFs to ensure proper pulse detection. One FF is clocked
by the positive edge of the system clock and
one by the negative edge (I do not want to go
into too much detail about why I must do this). The FFs
that sample the pulse connect to the CE (clock enable)
of the following FF to prevent the metastable state from
actually propagating into the design. Since I have only
simulated this so far I cannot say if it will really work
inside the chip (which will be a XILINX FPGA).

My question is: Does anyone have experience with using CE as
a means to prevent a metastable state from propagating
further?

Tool Setup:
-----------
Simulation & Synthesis: SYNOPSYS Ver 1999.10
Target Technology Mapping: XILINX Design Manager V3.3.08i
Target Part: XILINX VirtexE XCV300E-8-PQ240

I would also be grateful if you could point me to some
electronically available article, technote or appnote
about this topic, if available.

Thanks in advance,
FRANK

Article: 37841
Subject: Re: How can I reduce Spartan-II routing delays to meet 33MHz PCI's Tsu <
From: Kevin Brace
Date: Fri, 21 Dec 2001 03:57:16 -0600



Austin Franklin wrote:
> 
> Something sounds wrong...aren't you registering your PCI signals in the
> IOBs, and are you using the built-in PCI logic?  Making 33MHz in an SII
> should be a snap.


        You may think I am not being realistic, but I was never
really a fan of using registered inputs because, from what I understand,
registering an input means that the signal incurs one clock cycle of
latency, and I consider such a signal to be one cycle "stale."
That being said, initially (about a month ago) I was decoding the address
on the AD[31:0] lines directly, without registering it, for six BARs (Base
Address Registers) and one Expansion ROM BAR, and AD[31:0] was having
problems meeting the Tsu < 7ns requirement (plus FRAME# and IRDY#).
After realizing that taking "raw" data off the PCI bus is not a good
idea, and that the PCI IP core doesn't have to do fast DEVSEL# decode, I
decided to register AD[31:0] and C/BE#[3:0] during an address phase.
That got AD[31:0] and C/BE#[3:0] to meet Tsu. 
        Considering the way the PCI protocol works, I find that it is
hard to use registered inputs in PCI because, again, if I am correct,
registered inputs incur one cycle of latency.
Here are situations where registered inputs can be used easily, in my
opinion.

- Address decode assuming that the PCI IP core doesn't have to do a fast
DEVSEL# decode (DEVSEL# decode will be medium or slow decode). Since all
chipsets (Northbridges) I know of support medium DEVSEL# at best, this
is not a problem at all.

- During a single transfer or the first transfer of a burst, where
typically the PCI IP core has to initiate the user side bus. In this
situation, the PCI IP core is inserting wait cycles on the bus, and,
taking advantage of a protocol rule that once a signal is asserted it
cannot be changed until the end of that access (a "microaccess," I think
it is called in a burst transfer), using a registered input signal should
not make any difference versus a raw (non-registered) input signal.
Although I would still rather use the "raw" one than the "stale"
one.


Here are situations where it is difficult to use registered inputs.

- During a burst target transfer where the PCI IP core has to support
transfers with no wait cycles. The PCI IP core has to constantly monitor
IRDY# in case the initiator inserts wait cycles, and monitor FRAME# to
know whether the present microaccess is the last one. Using
registered inputs for FRAME# and IRDY# during a burst target transfer
would perhaps require the target to insert one wait cycle (deasserting
TRDY#) for each microaccess.

- When a PCI IP core asserts STOP#, it has to continuously monitor
FRAME# to make sure that the initiator deasserts FRAME#, and when FRAME#
is deasserted, the PCI IP core has to deassert DEVSEL#, TRDY#, and
STOP#, and stop driving those signals unless a back-to-back transfer to
itself is occurring. If registered inputs are used for FRAME# and IRDY#,
the core will miss the correct timing to deassert DEVSEL#, TRDY#, and
STOP# because of the one cycle latency.
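
The one cycle of latency described in the cases above can be pictured with a toy model. This is only a sketch: the FRAME# waveform is invented for illustration, and the helper names are hypothetical rather than taken from any PCI core.

```python
def registered(pin_values, reset_value=1):
    """What logic behind an input flip-flop sees: the pin value from the previous clock."""
    return [reset_value] + pin_values[:-1]


def deassert_cycle(samples):
    """First clock at which an active-low signal is seen going from 0 back to 1."""
    for t in range(1, len(samples)):
        if samples[t - 1] == 0 and samples[t] == 1:
            return t
    return None


frame_n = [1, 0, 0, 0, 1, 1]          # hypothetical FRAME# samples, one per PCI clock
raw_cycle = deassert_cycle(frame_n)   # seen directly at the pin
reg_cycle = deassert_cycle(registered(frame_n))
print(raw_cycle, reg_cycle)           # the registered view is one clock later
```

In this made-up trace the core sees FRAME# deasserted at clock 4 when it reads the pin directly and at clock 5 through the input register, which is the "stale" cycle in question.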


        After thinking about the various suggestions I got, I realize I
haven't really used registered inputs extensively in my design. I think
that is because of my resistance to using one cycle "stale" data: a buggy
initiator (a host-to-PCI bridge or a busmaster PCI device) that does not
follow the PCI protocol correctly might change the state of FRAME# or
IRDY# after asserting it.
        Regarding the "built-in PCI logic," I assume you mean
Xilinx's special IRDY and TRDY logic.
Because the PCI IP core has to be portable across different platforms, I
am not interested in using that special IRDY and TRDY logic, and I don't
really know how it works.




Thanks,



Kevin Brace (don't respond to me directly, respond within the newsgroup)

Article: 37842
Subject: Re: Hardware FPGA questions
From: Rick Filipkiewicz <rick@algor.co.uk>
Date: Fri, 21 Dec 2001 10:17:01 +0000

Kevin Brace wrote:

> If I add my two cents to this question, as a Xilinx Spartan-II user (a
> low cost version of Virtex. From what I see, it is sold at 1/3 of the
> price of the equivalent density Virtex) struggling to get my PCI IP core
> to meet mere 33MHz PCI timings (Tsu < 7ns . . . is hard to meet at least
> in my case), Virtex/Spartan-II (manufactured in UMC's 0.22u process) are
> the last and fastest device that supports 5V PCI I/O.
> Virtex-E/Spartan-IIE (manufactured in UMC's 0.18u process) dropped 5V
> PCI I/O support, and it only supports 3V PCI, which is hardly used on
> regular desktop motherboards.
> I must say that if Virtex-E/Spartan-IIE supported 5V PCI I/O, I would rather
> have used them instead of Virtex/Spartan-II because newer devices tend
> to be cheaper than the older devices of the same density, or for the
> same amount of money, you get more gates (and features).
> Although not the question you asked, Altera basically did the same thing
> when they moved from APEX 20K (supported 5V PCI according to their
> datasheet, manufactured in TSMC's 0.22u process) to APEX 20KE (dropped
> 5V PCI support, manufactured in TSMC's 0.18u process).
>
> Regards,
>
>

I became a bit worried about using the Virtex-E's 3v3 PCI with 5V PCI cards
(the voltage conversion is done via QuickSwitch parts) so I did some
investigation.

With a fully loaded board - 4 populated 5V PCI slots and 2 3v3 onboard
devices - the VirtexE-thru'-QS outputs easily met the 5V PCI input specs in
terms of Vih and the time to get there. The longest delay I was seeing was
about 13.5 nsec from the FPGA's clock to the device input.

It was very nearly independent  of where along a bussed line I looked at a
signal (+/- 1 nsec or so).

Also: Even though PCI is an unterminated bus there was only the slightest
hint of a reflection step on a couple of signals. Even for those the
rise/fall were still monotonic.

The upshot of an afternoon's investigation is that, for our system at least,
I'm happy (*) driving 5V PCI devices from the V-E parts.

(*) Definition: Happy = marginally less paranoid than usual.

Article: 37843
Subject: Re: CE on XILINX FFs and Metastability
From: Rick Filipkiewicz <rick@algor.co.uk>
Date: Fri, 21 Dec 2001 10:27:20 +0000

Frank Papenfuss wrote:

> Dear FPGA community,
>
> I have a design that must cope with asynchronous input
> signals. Basically I have a WE pulse that gates a data
> vector into the chip. The WE signal is sampled by two
> FFs to ensure proper pulse detection. One FF is clocked
> by the positive edge of the system clock and
> one by the negative edge (I do not want to go
> into too much detail about why I must do this). The FFs
> that sample the pulse connect to the CE (clock enable)
> of the following FF to prevent the metastable state from
> actually propagating into the design. Since I have only
> simulated this so far I cannot say if it will really work
> inside the chip (which will be a XILINX FPGA).
>
> My question is: Does anyone have experience with using CE as
> a means to prevent a metastable state from propagating
> further?
>

Frank,

It is an unfortunate fact that if a signal from a source async to a
clock is sampled on that clock then there is always a chance that a
metastable state could propagate arbitrarily far into your system.

Metastability is a statistical thing and so all you can do is reduce the
probability of its affecting your system to some very small number (or
the MTBF >> time between you changing jobs).
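
To put a rough shape on "reduce the probability," the commonly quoted synchronizer model is MTBF = exp(t / tau) / (T0 * f_clk * f_data), where t is the settling time allowed before the sampled value is used. The sketch below assumes that model; the tau and T0 values are placeholders chosen only to show the trend, not data for any Xilinx part.

```python
from math import exp


def synchronizer_mtbf(settle_time, tau, t0, f_clk, f_data):
    """MTBF in seconds: exp(settle_time / tau) / (t0 * f_clk * f_data)."""
    return exp(settle_time / tau) / (t0 * f_clk * f_data)


# Placeholder numbers only, not measured values for any real device.
TAU = 100e-12   # resolution time constant, seconds
T0 = 100e-12    # metastability window, seconds
one_ns = synchronizer_mtbf(1e-9, TAU, T0, f_clk=50e6, f_data=1e6)
two_ns = synchronizer_mtbf(2e-9, TAU, T0, f_clk=50e6, f_data=1e6)
print(two_ns / one_ns)   # each extra tau of settling time buys a factor of e
```

The point of the exercise is the exponential: extra settling time improves the odds very quickly, but the failure probability never reaches zero.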

IIRC there is even a paper somewhere that proves metastability cannot be
eliminated by purely digital means.

BTW: If anyone has that original reference I'd be grateful - I read it
in ~1984 and have long since lost it.

Article: 37844
Subject: Re: Michelangelo's Counter
From: "Carl Brannen" <carl.brannen@terabeam.com>
Date: Fri, 21 Dec 2001 10:41:10 +0000 (UTC)

Re the 50% duty cycle divide by 3 counter...

> Because the logic design is correct. The simulator usually
> has a problem with a combinatorial latch because the simulator
> is not intelligent enough to cope with the ambiguity of the
> latch state. In reality the ambiguity resolves itself, and
> is therefore meaningless.

But we all hate getting simulator error messages, so why not
use a real latch?  That will make the tools happy, and they'll
go to the trouble of figuring out our half clock timing for us
as well.  Nobody wants to dig around a 1M gate FPGA trying to figure
that kind of stuff out.  I bet we'd get pretty good at it, but
why not do it with a real latch?

> And by the way, use the DLL, it does it for you.

DLLs are a limited resource.  In addition, they don't work on
quite a large variety of clocks:
(1) Clocks that are too fast
(2) Clocks that are too slow
(3) Clocks that don't have a constant period.

The Virtex series has resources that allow the old Xilinx app note
to be improved upon a bit:

(1) You have SRL16s that give you multiple FF bits per LUT.
(2) You have real latches so you don't have to use combinatorial ones.
(3)

Here's some VHDL code for a 50% divide by 3 that fits in a slice,
leaves one of the slice's flip-flops unused, and runs at 186.081MHz
even in a Spartan2 -5:

-- Divide by 3 in a slice.  50% duty cycle.  No simulator warnings.
-- Carl Brannen
--
library IEEE;
use IEEE.std_logic_1164.all;

entity DIV3V is
  port (
     CLK: in  STD_LOGIC;
     Q:   out STD_LOGIC);
end DIV3V;

architecture DIV3V_arch of DIV3V is

component SRL16 port (
	D:   in  STD_LOGIC;
	CLK: in  STD_LOGIC;
	A0:  in  STD_LOGIC;
	A1:  in  STD_LOGIC;
	A2:  in  STD_LOGIC;
	A3:  in  STD_LOGIC;
	Q:   out STD_LOGIC);
end component;

component LD_1 port (
	D: in  STD_LOGIC;
	G: in  STD_LOGIC;
	Q: out STD_LOGIC);
end component;

component LUT2 port (
	I0: in STD_LOGIC;
	I1: in STD_LOGIC;
	O:  out STD_LOGIC);
end component;

signal MYGND: STD_LOGIC;
signal MYVCC: STD_LOGIC;
signal RD3B:  STD_LOGIC;
signal FD3B:  STD_LOGIC;

attribute INIT: string;
attribute INIT of U1: label is "0001";  -- Start out with a "0"
attribute INIT of U2: label is "R";     -- Not needed.
attribute INIT of U3: label is "E";     -- LUT2 truth table "1110": O = I0 or I1

-- Fit into a slice
attribute RLOC: string;
attribute RLOC of U1: label is "R0C0.S1";
attribute RLOC of U2: label is "R0C0.S1";
attribute RLOC of U3: label is "R0C0.S1";

begin

MYGND <= '0';
MYVCC <= '1';

U1: SRL16 port map(
	D => RD3B,
	CLK => CLK,
	A0 => MYGND,
	A1 => MYVCC,
	A2 => MYGND,
	A3 => MYGND,
	Q => RD3B);

-- Note that a Rising edge clock on a SRL16x is compatible
-- with a rising edge clock to a flip-flop on the same slice,
-- but in addition, it is compatible with an active low latch:
U2: LD_1 port map(
	D => RD3B,
	G => CLK,
	Q => FD3B);

-- OR gate
U3: LUT2 port map(
	I0 => RD3B,
	I1 => FD3B,
	O => Q);

end DIV3V_arch;
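
For a quick sanity check of the waveform without a simulator, the circuit can be modelled at half-clock resolution. This is a behavioral sketch only, with made-up names: the SRL16 is treated as a 3-bit ring (address "0010") loaded with a single 1, the LD_1 latch is assumed transparent while CLK is low and to start at 0, and no delays are modelled.

```python
def div3_waveform(clock_cycles):
    """Return Q sampled twice per clock: once with CLK high, once with CLK low."""
    srl = [1, 0, 0]   # srl[0] is the bit nearest D; address 2 taps srl[2]
    fd = 0            # latch output FD3B
    samples = []
    for _ in range(clock_cycles):
        # Rising edge: shift one place with Q fed back to D.
        srl = [srl[2], srl[0], srl[1]]
        rd = srl[2]                # RD3B
        samples.append(rd | fd)    # CLK high: latch holds its old value
        fd = rd                    # CLK low: latch is transparent
        samples.append(rd | fd)
    return samples


print(div3_waveform(3))   # [0, 0, 1, 1, 1, 0]
```

Three of every six half-cycles are high and they are contiguous, which is the 50% duty cycle divide by 3.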



Article: 37845
Subject: Re: You take the low road and I'll ......
From: mikeandmax@aol.com (Mikeandmax)
Date: 21 Dec 2001 11:24:36 GMT

Mr. Andraka revealed himself:
>and a
>bit more time to grok the design,

Welcome to our Planet!
gives me an idea for a book to read over the holidays!

enjoy them -
Mike Thomas

Article: 37846
Subject: Re: How to make an implementable big counter?
From: "Carl Brannen" <carl.brannen@terabeam.com>
Date: Fri, 21 Dec 2001 11:41:31 +0000 (UTC)

Re very long counter design...

> In my design I need to make a synchronous counter that counts, let's
> say, till 1000000. (Actual aim for counter is to built in a delay).  I
> do this by the use of integer type signals and with each clock'event I
> add 1 till I reach the wanted 1000000.  When I try to implement this
> in an FPGA it consumes a very high amount of CLBs and it seems very
> disastrous for the maximum reachable clock freq.

Assuming that you don't care about the intervening counts, you can use SRL16s
and SRL16Es to create relatively efficient large counters.  And you don't have
to deal with decoding LFSR values either.

An SRL16 with its Q output brought back to its D input can be initialized (with
an INIT attribute) to have only a single bit high.  The other bits (of up to
17) are initialized to zero.  As it clocks around, it produces a pulse
every 17th clock.

This puts a counter with a length of up to a little over 4 bits (i.e. log2(17))
into a single LUT.  That's 4x as efficient as regular counters, and you get a
free registered "done" bit.  You can gang these up, either by using the 
enables, or by ANDing the outputs of counters whose periods have no common
divisor.

Example with 5 SRL16/SRL16Es, gets within 5% of 10^6 clocks, uses only 7 LUTs:

First SRL16 goes high every 17th clock.  Its output connects to the enable
input of an SRL16E that also is set for 17 clocks.  The result: Two bits, that
when ANDed, produce a pulse every 17^2 = 289 clocks.

Third SRL16 goes high every 15th clock.  Its output connects to the enable
input of an SRL16E that also is set for 15 clocks.  The result: Two bits, that
when ANDed, produce a pulse every 15^2 = 225 clocks.

Fifth SRL16 goes high every 16th clock.

Since 17^2, 15^2, and 16 share no common factors (they are pairwise relatively
prime), the outputs of the five SRL16 / SRL16Es can be ANDed together to
produce a counter that pulses once every
17^2 * 15^2 * 16 = 1040400 clocks.  This is in excess of the 1000000 (as was
asked for), and it only took 7 LUTs (<2 CLBs).  In addition, there are no lines
that have a loading of more than 3.  The 5-input AND can be implemented with
a registered 4-input AND (of the first four SRL16s), and a registered 2-input
AND.  That means that there are no paths that go through unregistered logic,
and the design will clock at a very high rate.
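
The period arithmetic above can be checked with a small behavioral model. This is a sketch with invented names, not the hardware: each ring is represented by the position of its single 1 bit, and the second ring of a pair only advances while the first ring's output is high, as with the SRL16E enable.

```python
from functools import reduce
from math import gcd


def pair_period(first_len, second_len, clocks=2000):
    """Clocks between pulses of (ring A output AND ring B output)."""
    a = b = 0
    pulses = []
    for t in range(clocks):
        a_q = a == first_len - 1
        b_q = b == second_len - 1
        if a_q and b_q:
            pulses.append(t)
        if a_q:                      # enable input of the second ring
            b = (b + 1) % second_len
        a = (a + 1) % first_len      # first ring shifts on every clock
    return pulses[1] - pulses[0]


def lcm(x, y):
    return x * y // gcd(x, y)


periods = [pair_period(17, 17), pair_period(15, 15), 16]
total = reduce(lcm, periods)
print(periods, total)   # [289, 225, 16] 1040400
```

Because the three periods are pairwise relatively prime, their least common multiple is simply their product, about 4% over the requested 1000000.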

One downside is that the SRLs require so much GND and VCC routing, but
you can create all that yourself and prevent the placer from going hog wild
with it.

Another downside is what happens to the SRL16s if you have glitches on your
clock.  Unlike most counters, this circuit will not "fix" itself.  But let's try
to not think too much about that.

You can also play sneaky games with the first layer SRL16s.  When that first
registered 4-input AND gate goes high, all the SRL16s will have just been in
their high state.  That means that if you replace those two SRL16s with two
SRL16Es you can hook the registered AND gate output back up to the (inverted
logic) enables of those first two SRL16Es.  The effect of that modification
will be to change that registered AND gate from counting to (16^2 * 17^2)
to one that counts to (16^2*17^2 + 1).  Since this is relatively prime
to the previous 16^2*17^2, that means that you can build two such circuits
and AND their outputs together to get a period of 73984 * 73985 with just
11 LUTs.  This is getting a 32.35 bit binary count, with DONE pulse, and very
high speed performance for only 11 LUTs or  2.94 bits per LUT.

I should mention that I've never implemented that last sneaky game, so if it
doesn't work I'd not be completely surprised.  Sure seems like it would
though, and my instincts for this sort of stuff are usually pretty good.
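
The stall trick can at least be tried on paper with a scaled-down model. The sketch below is only that: it uses ring lengths 3 and 4 instead of 16 and 17 so it runs instantly, the names are invented, and it assumes the registered AND output holds the two first-layer rings for exactly one clock. The last lines redo the 32.35-bit arithmetic for the full-size figures quoted above.

```python
from math import gcd, log2


def and_period(len_a, len_b, stall_feedback, clocks=2000):
    """Clocks between pulses of the 4-input AND over two ring pairs."""
    lengths = [len_a, len_a, len_b, len_b]   # first and second ring of each pair
    pos = [0, 0, 0, 0]                       # position of the single 1 bit in each ring
    reg = False                              # registered copy of the AND output
    pulses = []
    for t in range(clocks):
        q = [p == n - 1 for p, n in zip(pos, lengths)]
        and4 = all(q)
        if and4:
            pulses.append(t)
        stall = stall_feedback and reg
        for first, second in ((0, 1), (2, 3)):
            if q[first]:                     # second ring enabled by the first ring
                pos[second] = (pos[second] + 1) % lengths[second]
            if not stall:                    # first-layer rings hold while stalled
                pos[first] = (pos[first] + 1) % lengths[first]
        reg = and4
    return pulses[-1] - pulses[-2]


print(and_period(3, 4, False), and_period(3, 4, True))   # 144 145

n = 16**2 * 17**2                 # 73984
bits = log2(n * (n + 1))
print(gcd(n, n + 1), round(bits, 2), round(bits / 11, 2))   # 1 32.35 2.94
```

In this simplified model the feedback does lengthen the period by exactly one clock, which supports the idea but says nothing about timing or enable polarity in a real part.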

Carl



Article: 37847
Subject: Re: CE on XILINX FFs and Metastability
From: "none" <x@y.z>
Date: Fri, 21 Dec 2001 13:39:59 -0000

Virtex meta posts in this group:
http://groups.google.com/groups?as_q=virtex&as_oq=metastable%20metastability
&as_ugroup=comp.arch.fpga&num=50&as_scoring=d&hl=en

Meta posts in this group by Peter Alfke:
http://groups.google.com/groups?as_q=metastability&as_ugroup=comp.arch.fpga&
as_uauthors=Peter%20Alfke&num=50&as_scoring=d&hl=en

watch the wrap on the links

Article: 37848
Subject: Re: annoying problem and "simple and clever solution"
From: Greg Neff <gregeneff@yahoo.com>
Date: Fri, 21 Dec 2001 08:47:19 -0500

On Fri, 21 Dec 2001 14:31:06 +1300, Jim Granville
<jim.granville@designtools.co.nz> wrote:

(snip)
>
>It's probably 'good practice' to copy the 80c51 scheme, which uses a
>D-FF on the
>feedback, so the OE width is clock width, not propogation/threshold
>defined.
>
>In a FPGA, D-ff are almost free :-)
>
>- Jim G

It is good practice if the high level output is guaranteed to meet the
high level input requirement on the connected device. 

In the case of using 3.3V Xilinx to drive 5V CMOS, this practice could
cause a one clock delay before the input switches, because the driver
will prevent the pullup resistor from pulling the input above the high
output drive voltage level. 

===================================
Greg Neff
VP Engineering
*Microsym* Computers Inc.
greg@guesswhichwordgoeshere.com

Article: 37849
Subject: Re: Michelangelo's Counter
From: dottavio@ised.it (Antonio)
Date: 21 Dec 2001 06:37:49 -0800

Really Thanks 
for your Christmas present

Antonio
