Karl wrote:
>> If anyone knows of any 2C70 boards, do let us know.
>
> You're lucky!
>
> The Lead-Free / RoHS compliant version of the Cyclone II based DSP
> board will be with the EP2C70!

Thanks Karl, that sounds very promising. I couldn't find any info on it, alas. Do you have any idea of when and how much?

Cheers,
Tommy

Article: 100651
The stackup you have looks good, but are 3+ routing layers going to be sufficient? Technically you have four, but on a surface-mount board the top layer is so cluttered with parts that you can't actually route much on it unless there is a lot of nice patterning to how the parts happen to be connected.

One thing left out is probably the MOST important consideration from an EMI perspective: if you have chopped-up planes, make sure that NO signal crosses a split unless it has been adequately filtered. Keep in mind that if the chopped-up plane is providing the AC return path, then where the signal encounters a plane split you've just made yourself a huge loop that will radiate and distort whatever signal you're trying to send across it... so it had best not be any sort of periodic signal or anything of importance. If you do have to jump the gap, bridge it with as large a resistor as possible that still allows your circuit to function... and that's only after you've exhausted all ways to avoid having a signal cross the gap in the first place.

KJ

Article: 100652
Hello,

I am a Python programmer writing neural network code with binary firing and binary weight values. My code will take many days to parse my large data sets. I have no idea how much an FPGA could help, what the cost would be, or how easy it would be to access it from Python. The problem is similar to competitive networks: I must dot-product many million-length bit vectors (which only change occasionally) with one input vector. Anybody want to estimate the cost, speedup, and value an FPGA could offer me?

Seems like this problem shouldn't be so hard, but from the little research I've done I haven't found a good-value product that is ready-made, so I'm looking at (multiple?) FPGAs as a coprocessor.

Article: 100653
rickman wrote:
> I have looked at the data sheet and they say very clearly that the
> Spartan 3 is held in reset until all three power supplies are fully up.
> But the range of voltages is very wide, with reset being released when
> the VCCO on Bank 4 is as low as 0.4 volts.
>
> I get a lot of grief from the FPGA firmware designers on every little
> nitpick that they don't like about the board design. I need to
> know that this will keep the FPGA in reset and all IOs tristated
> whether the various power voltages are above or below the internal
> reset threshold, up to the point of being configured.

IIRC, the I/Os are inputs (or HiZ) with a soft pullup until after configuration. It should be simple enough to delay the start of configuration until after the last supply is up.

Article: 100654
Dini has a new PCIe x8 board. It should work really well for your needs, but nobody has built the DMA engine and drivers for it yet, so that adds to its $10k cost. FPGAs are great for accelerating neural network projects; there are lots of papers with algorithms for this in ACM and IEEE journals.

The Python interface is not a problem. It will come down to IOCTLs at some point. You may have to make a C API and DLL wrapper if your Python cannot make IOCTL calls directly.

Here's my usual plug: I just wish there were a hardware vendor who would put some cheap FPGAs (Spartan-3E 1600s) on a cheap board with some standard DRAM and SRAM slots (unpopulated) and a PCIe x8 (or x4) slot, and then sell the board for < $300. Design the darn thing for acceleration, not prototyping. They could make a killing on a well-made board with 8 or 16 fast DMA channels and a driver that worked really well for it.

Article: 100655
rickman wrote:
> I have looked at the data sheet and they say very clearly that the
> Spartan 3 is held in reset until all three power supplies are fully up.
> But the range of voltages is very wide, with reset being released when
> the VCCO on Bank 4 is as low as 0.4 volts.
>
> I get a lot of grief from the FPGA firmware designers on every little
> nitpick that they don't like about the board design. I need to
> know that this will keep the FPGA in reset and all IOs tristated
> whether the various power voltages are above or below the internal
> reset threshold, up to the point of being configured.

I assume that you are looking at Table 28 on page 54 in the Spartan-3 data sheet:
http://www.xilinx.com/bvdocs/publications/ds099.pdf

These are essentially the trip points for the power-on reset (POR) circuit inside the FPGA. The trip voltage range is somewhat wide due to process variation, etc. The POR circuit prevents configuration from starting until all three power rails are within the trip-point range. The POR can happen as early as the minimum voltage levels or as late as the maximum limits.

Until the POR is released, all I/Os not actively involved in configuration are high-impedance. The HSWAP_EN pin controls whether or not internal pull-ups are applied to these I/Os. When HSWAP_EN = High, the pull-ups are turned off. Also, the pull-ups connect to their associated power rail, so you won't see their effect until VCCO ramps up.

---------------------------------
Steven K. Knapp
Applications Manager, Xilinx Inc.
General Products Division
Spartan-3/-3E FPGAs
http://www.xilinx.com/spartan3e
---------------------------------
The Spartan(tm)-3 Generation: The World's Lowest-Cost FPGAs.

Article: 100656
On a sunny day (14 Apr 2006 14:28:38 -0700) it happened andrewfelch@gmail.com wrote in <1145050118.699722.123650@e56g2000cwe.googlegroups.com>:

>Hello,
>
>I am a Python programmer writing neural network code with binary firing
>and binary weight values. My code will take many days to parse my
>large data sets. I have no idea how much fpga could help, what the
>cost would be, and how easy it would be to access it from Python. The
>problem is similar to competitive networks, where I must dot product
>many million-length bit vectors (which only change occasionally) with 1
>input vector. Anybody want to estimate the cost, speedup, and value an
>fpga could offer me?
>
>Seems like this problem shouldn't be so hard, but from the little
>research I've done I haven't found a good value product that is
>ready-made, so I'm looking at (multiple?) fpga as a coprocessor.

Sounds to me like vector processing. Cray (supercomputers) knew all about it. It can be done in an FPGA, and many hardware units have been designed for neural nets. Isn't Python horribly slow for this? Would C be better? ASM?

Anyway, perhaps (this has been done) you can implement your neurons in hardware. There also exist vector plugin cards for the PC (I designed one once).

Article: 100657
John Larkin wrote:
> Since the max serial-slave configuration rate on things like Spartan3
> chips is, what, 20 MHz or something, you might consider slowing down
> the CCLK input path, and/or adding some serious hysteresis on future
> parts. On a pcb, CCLK is often a shared SPI clock, with lots of loads
> and stubs and vias and such, so may not be as pristine as a system
> clock. CCLK seems to be every bit as touchy as main clock pins, and it
> really needn't be.

Wouldn't one expect this to be 'normal design practise'?

I suppose Xilinx missed that obvious feature because there are no other Schmitt cells on the die, and even though the CPLDs have this, I'm sure their inter-department sharing is like most large companies :)

-jg

Article: 100658
Hi John,

Thank you for the feedback. Fortunately, this is already a planned enhancement on future families.

---------------------------------
Steven K. Knapp
Applications Manager, Xilinx Inc.
General Products Division
Spartan-3/-3E FPGAs
http://www.xilinx.com/spartan3e
---------------------------------
The Spartan(tm)-3 Generation: The World's Lowest-Cost FPGAs.

Article: 100659
On 14 Apr 2006 15:47:28 -0700, "Steve Knapp (Xilinx Spartan-3 Generation FPGAs)" <steve.knapp@xilinx.com> wrote:

>Hi John,
>
>Thank you for the feedback. Fortunately, this is already a planned
>enhancement on future families.

Thank you! Thank you! Maybe I'm not crazy after all. Maybe.

Next, how about making the real clock inputs programmable to be slower and less noise sensitive? Yeah, some people are never satisfied.

John

Article: 100660
Hello,

I am a graduating college senior working on his seminar project. I am trying to get a CM11a controller (X10 technology) to talk to a Spartan-3 Starter Kit. I've written a program in C# that allows me to turn on a light via that controller. My goal is to download this to the Spartan board, but instead of turning on a light, I would just like to display a simple statement through the board saying, "the signal from the CM11a has been received." So my question is: is it possible to run a program written in C# on a Spartan-3 Starter Kit? If so, what will I need to use to run it?

Thanks in advance,

Aaron

Article: 100661
Aaron wrote:
> I am a graduating college senior who is working on his seminar
> project. I am trying to get a CM11a controller (x10 technology) to talk
> to a Spartan 3 starter kit. I've written a programm in C# that allows
> me to cut on a light via that controller. My goal is to download this
> to the Spartan board, but instead of turning on a light, I would just
> to display a simple statement through the board saying, "the signal
> from the CM11a has been receive." So my question is, is it possible to
> run a program written in C# on a Spartan 3 Starter Kit? If so what
> will I need to use to run it.

Hmm... Short answer: no.

It sounds like you want to use an FPGA as a microcontroller. This is normally done using the MicroBlaze soft core (the Xilinx EDK). The EDK toolchain (gcc based) allows you to code in C. No one (at least that I know of) has a .NET engine running on a soft CPU. I am sure it's possible by porting the work done on the Mono project, but this would be a waste of time.

C# is more of a macro-level (like VB) development tool. I (at least in my opinion) don't feel that it is a serious tool for embedded programming, primarily because it was not designed to give you unprotected access to memory: everything goes through managed dynamic allocation, garbage collection, etc. It is critical in embedded systems to be able to talk directly to hardware registers, and C# simply can't do this (at least not in any way that is simple and easy).

If you are still interested in just doing the work in C, get ahold of the EDK. It's not that hard to get a simple system running using the Base System Builder. Writing your own custom peripherals will take you a lot longer if you don't know an HDL.

I think you may be better off using a small 8-bit microcontroller (8051, PIC, etc). You will actually get your project done in time.

-Eli

Article: 100662
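For contrast, the kind of direct register access Eli is talking about is a one-liner in C. This is a generic sketch, not EDK or MicroBlaze code: the register name, bit positions, and the example bus address are all invented for illustration, and a plain variable stands in for the hardware address so the snippet can run on a host machine.

```c
#include <assert.h>
#include <stdint.h>

/* On real hardware this macro would cast a fixed bus address, e.g.
 *   #define CTRL_REG (*(volatile uint32_t *)0x40000000u)
 * Here a file-scope variable stands in so the code runs anywhere. */
static uint32_t fake_ctrl_reg;
#define CTRL_REG (*(volatile uint32_t *)&fake_ctrl_reg)

#define CTRL_ENABLE (1u << 0)   /* hypothetical 'enable' bit      */
#define CTRL_IRQ_EN (1u << 3)   /* hypothetical 'irq enable' bit  */

/* Bring the peripheral up with interrupts masked; returns the
 * final register value (just CTRL_ENABLE). */
uint32_t ctrl_init(void)
{
    CTRL_REG = 0;                           /* known state           */
    CTRL_REG |= CTRL_ENABLE | CTRL_IRQ_EN;  /* set bits 0 and 3      */
    CTRL_REG &= ~CTRL_IRQ_EN;               /* mask interrupts again */
    return CTRL_REG;
}
```

The read-modify-write through a volatile pointer is exactly the kind of raw memory access that C#'s managed model (safely) forbids, which is the point being made above.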
I don't think I've posted anything in years, but I just couldn't resist adding to this one because I played with it for some time. As the previous posters said, it depends on the device. But I'd also add (in some detail) that it also depends even within one device.

To answer the question most directly, an 8:1 mux requires two slices in Virtex IV or 2 Adaptive Logic Modules (ALMs) in Stratix II. But whether you actually get that in a full system depends on the structure of your design. The Virtex IV version is easy to see because it's just the output of the F6 mux provided as dedicated hardware. Spartan III is a cost-reduced Virtex IV, so it should behave identically.

In Stratix II we can do it without the need for dedicated hardware, but it's a bit trickier to synthesize. For Z = mux(d0,d1,d2,d3,d4,d5,d6,d7; s0,s1,s2), synthesis will give you:

  y0 = mux(d0,d1,d2; s0,s1)
  y1 = mux(d4,d5,d6; s0,s1)

which are two 5-input functions that pack into a single ALM. In the second ALM:

  z0 = (s0 & s1 & d3) # !(s0 & s1) & y0
  z1 = (s0 & s1 & d7) # !(s0 & s1) & y1

and Z = mux(z0,z1; s2) will be generated using 7-LUT mode. I attached Verilog at the end if you want to run it through Quartus; look at the result in the equation file and you will see what I just described. Note that depending on what else is in the design the 5-LUTs might get packed or synthesized differently, i.e. Quartus may prefer to pack the two 5-LUTs with two unrelated 2- or 3-LUTs to make two 7-input ALMs rather than one 8-input ALM and a second 6-input ALM, or may synthesize differently at the cost of area to hit a delay constraint.

On older devices (Altera Stratix, Cyclone; Xilinx Spartan I, 4000) and on MAX II and Cyclone II, you can basically use "4-LUT" in the discussion below, though it will depend on other issues in practice. I haven't thought about PTERM devices like MAX 7000. But this brings me to the bigger discussion.
I would stress that in practice it makes a big difference what the surrounding context is, and also whether you have more than one mux in your design, because in a mux system like a barrel shifter or crossbar the amortized cost of k muxes in Stratix II is less than k times the cost of one (which is a benefit over Virtex IV).

In a generic 4-LUT architecture with no dedicated hardware, a simple 2:1 mux is a 3-input function and takes one LUT (with one input going unused). A 4:1 mux would take two LUTs, not three (exercise to the reader; it's easier than the 8:1 above). An 8:1 mux requires five vanilla 4-LUTs because it's two 4:1 muxes and one 2:1. But it's arguably something like 4.5 LUTs (see two paragraphs down).

I already mentioned the Virtex IV hardware. Stratix and some earlier Altera architectures have hardware that facilitates other special cases, e.g. a set of mux(a,b,c,0; s0,s1) can be implemented in a LAB cluster by stealing functionality from the LAB-wide SLOAD hardware before the DFF. So you can fit a restricted 4:1 mux in one LE instead of two (that's the "basically" in the above).

When I said context I meant this: if an 8:1 mux is followed by an AND gate (e.g. Z = mux(d0,...,d7; s0,s1,s2) & e), then the AND gate would be a "free" addition to the five-4-LUT implementation in the vanilla architecture (because there's a leftover input on the last LE), but would cost a new LE using the Virtex IV hardware. So F5 gives a maximal 20% savings for a lone 8:1 mux, but depending on the surrounding logic the relative benefit could disappear. That's not a deficiency, you just can't count on getting the benefit in all cases. Note that if it's a 3-input AND gate, the situation reverses and the dedicated hardware is again ahead by one LE.

In reality, though, you probably don't care about one simple mux; you care about systems of muxes that consume huge numbers of LUTs.
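The "4:1 mux in two 4-LUTs" exercise works by smuggling a select bit through the first LUT. Here is a quick C sketch of the trick (my own illustration, not from the post), with an exhaustive check against a reference mux:

```c
#include <assert.h>

/* First 4-LUT (inputs d0, d1, s0, s1): when s1=0 it computes
 * mux(d0,d1; s0); when s1=1 it simply forwards s0 so the second
 * LUT can reuse it as its select. */
static unsigned lut1(unsigned d0, unsigned d1, unsigned s0, unsigned s1)
{
    return s1 ? s0 : (s0 ? d1 : d0);
}

/* Second 4-LUT (inputs y, d2, d3, s1) completes the 4:1 mux:
 * six logical inputs, two physical 4-LUTs. */
unsigned mux4(unsigned d0, unsigned d1, unsigned d2, unsigned d3,
              unsigned s0, unsigned s1)
{
    unsigned y = lut1(d0, d1, s0, s1);
    return s1 ? (y ? d3 : d2) : y;
}

/* Exhaustively compare against a reference 4:1 mux over all 64
 * input combinations; returns 1 if the decomposition is correct. */
int mux4_check(void)
{
    for (unsigned v = 0; v < 64; v++) {
        unsigned d[4] = { v & 1, (v >> 1) & 1, (v >> 2) & 1, (v >> 3) & 1 };
        unsigned s0 = (v >> 4) & 1, s1 = (v >> 5) & 1;
        if (mux4(d[0], d[1], d[2], d[3], s0, s1) != d[2 * s1 + s0])
            return 0;
    }
    return 1;
}
```

The same routing-a-select-through-a-LUT idea generalizes to the chained F5/F6 structures discussed above.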
For example, a simple 16-bit barrel shifter out[15:0] = in[15:0] << k[3:0] results in 16 16:1 muxes, i.e. 16x5 4:1 muxes = 16x5x2 LUTs = 160 LUTs synthesized in the obvious way, or 16x4 2:1 muxes = 64 LUTs synthesized properly into an n*log(n) shifter network of 2:1 muxes. The Virtex hardware would get some savings from this vs. the vanilla 4-LUT, but it bounces between 0 and 20% based on round-off and arrangement issues in the size of the barrel shifter, and because the advantage is lost for all the shifter bits that source a zero in the shifter network and go non-symmetric.

I should mention that it's also not technically correct to compare #LEs in the presence of any dedicated hardware, because you use fewer LEs but the cost of an LE changes. From the architecture point of view you have to multiply #LEs * sizeof(LE) (even better, #LABs * sizeof(LAB) or #CLBs * sizeof(CLB)) to evaluate whether the hardware is beneficial to put in the device (or simply compare the dollar price of the smallest Virtex IV or Stratix II device your complete design fits in).

Although 64 LEs from a simple one-line statement sounds like a lot, it's actually worse, because usually in and out are w-bit words, so everything gets repeated w times. A properly synthesized w=16, 16x16 barrel shifter, for example, requires 16x64 = 1024 4-LUTs. The dedicated hardware in Virtex gets 16x58 half-slices, or about 9% better than a 4-LUT implementation, and Stratix II can do this in 16x32 ALUTs, or 50% fewer (see full data below).

Note that a rotating barrel shifter (the second version I attached code for) will require more resources in both. This is because of the wrap-around data: none of the muxes collapse due to zeroed inputs. You can see this in an ALU, but the zero-padded version will be more common in commercial designs.

On to crossbars. A crossbar is like a barrel shifter, except that you can't re-synthesize it into a shifter network; you're stuck with the k k:1 muxes.
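The n*log(n) shifter network mentioned above can be modelled in a few lines of C (my sketch, separate from the attached Verilog): each stage is a rank of 16 2:1 muxes that either shifts the word by a power of two or passes it through, so four stages of 16 muxes gives the 16x4 = 64 mux count.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit zero-filling barrel shifter as a 4-stage mux network.
 * Stage i shifts by 2^i when bit i of k is set; in hardware each
 * stage is 16 two-input muxes sharing one select bit. */
uint16_t barrel16_net(uint16_t in, unsigned k)
{
    uint16_t x = in;
    for (unsigned i = 0; i < 4; i++)
        if ((k >> i) & 1u)
            x = (uint16_t)(x << (1u << i));  /* shift by 1, 2, 4, 8 */
    return x;
}
```

For any k in 0..15 the result equals (in << k) truncated to 16 bits; the zero bits shifted in at each stage are what let synthesis collapse muxes in the zero-padded (non-rotating) case.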
So a 16x16 crossbar with 16 4-bit select inputs actually requires 16 independent 16:1 muxes, again times the data width. Because there is no re-expression of this that isn't a plain mux, the F5 and F6 hardware should be more beneficial here on average (closer to the 20%).

When we designed the Stratix II architecture, we spent a lot of time looking at crossbar, barrel shifter and multiplexor structures. But you might have figured that out by now. What we came up with is particularly beneficial for systems with many muxes: the sub-linear growth I mentioned earlier. The Stratix II ALM is an 8-input fracturable logic block that can implement (among other combinations not listed):

a) two independent 4-LUTs
b) an independent 5-LUT and 3-LUT
c) two 5-LUTs that share 2 common inputs
d) a single 6-LUT
e) some 7-LUTs
f) two 6-LUTs that have 4 common inputs and additionally the same LUT-mask

Note that for (a) an ALM is (all other things being equal) equivalent to two Stratix LEs or one Virtex slice, for (b,f) it's always better, and for (c,d,e) usually but not guaranteed to be better. But you can find this in the ALM vs. slice discussion from a year or two ago.

Way off topic, but even the word "better" is a bit abstract: it's dependent on other issues like the tech-mapping algorithm and the relative routability of the device and Si area. For example, though an nxn xbar might fit in f(n) cells, a (2n)x(2n) may not fit in the optimal number f(2n) of cells, because a lack of routability in the device forces the placer to spread the design out. E.g. interconnect doesn't scale as smoothly in older architectures like Altera Apex or Xilinx 4000 (we've gotten better at it, but it's also a function of modern designs).

Since a 4:1 mux is a 6-input function, it can fit in one ALM. With the tricks described above using (c) and (e), an 8:1 fits in two ALMs.
A 16:1 mux requires 4 ALMs + a 2:1 mux, which is 4.5 ALMs (though, again, the 3-input function has two or more additional inputs to absorb more logic, so you could argue this is 4.25 ALMs instead of 4.5).

Item (f) is where the real benefit comes in for muxes. The decomposition of crossbars and barrel shifters into primitive muxes results in large numbers of 4:1 muxes that have either (i) similar data and common select bits, in the case of barrel shifters, or (ii) common data and different select bits, in the case of xbars. By the latter I mean mux(a,b,c,d; s0,s1) and mux(a,b,c,d; t0,t1). Not by coincidence, this fits the template of two 6-input functions with 4 common inputs and the same LUT-mask, so a single ALM can implement two 4:1 muxes arising from such a mux system. That makes it roughly 2X the efficiency for powers of 4 and between 1.5X and 2X for odd powers of 2 (i.e. 8:1).

That's a generalization, because it also depends on whether barrel shifters are rotating or shift in zeros, and whether all the outputs are used (in packet processing you might do a 3n->2n type shifter, so some of the bits get dropped). Same as the discussion above on F5 and F6: as soon as you introduce 0's on the mux inputs you have leftover neighbouring logic to slurp up, and the numbers get fuzzy.

But we can at least look at the bottom line of all this using output from Quartus II and ISE. I ran this more than a year ago, so both tools have newer versions.

16x16 zero-shifting barrel shifter:
  Cyclone, Stratix (4-LUT):  64 LUTs (LEs)
  Virtex IV:                 59 half-slices (packs to 47 slices)
  Stratix II:                32 ALUTs (half-ALMs) (packs to 23 ALMs)

16x16 xbar:
  Cyclone, Stratix (4-LUT):  160 LEs
  Virtex IV:                 128 half-slices
  Stratix II:                88 half-ALMs

(For w-bit datapaths just take all the numbers and multiply by w.)

Again, I included the Verilog below in case someone says I'm cheating, and both ISE and Quartus are available in free versions. So try it yourself.
Note that neither Quartus nor ISE will guarantee perfect packing (half-slice to slice, or ALUT to ALM). This is due to things like the placer choosing to split up two sub-blocks that could be packed, in order to improve delay, or other reasons. For example, ISE used 47 slices to implement the 59 half-slices after placement, but at least some of the 35 unused half-slice partners are likely available to be packed with 2-, 3- or 4-input functions from elsewhere in the design, were the design bigger. Quartus II uses 23 ALMs for the 32 ALUTs, meaning that 6 ALUTs are still potentially available for other logic without consuming further ALMs.

For a common sub-design like a SPI4.2 PHY interface, the component pieces contain modules such as I mentioned above: an M-bit xbar into a 2M-1:1 shifter into a 3M-bit buffer from which 2M bits are selected. I synthesized such a design in each of Stratix, V4 and Stratix II:

  Stratix:    907 LEs
  Virtex IV:  1368 half-slices (741 full slices after placement)
  Stratix II: 536 ALUTs (514 ALMs after placement)

(Sorry, can't provide Verilog for this one because it's part of the IP core.)

You have to treat the synthesis of small designs carefully. The XST solution is non-optimal for Virtex IV: I can hand-map this design into the hardware and use fewer slices. For example, it's nearly trivial to get the 907 that I got in Stratix, though that also uses the 3:1 mux trick I mentioned above, but XST isn't doing it for some reason.

Finally, bus muxes. This is when you have e.g. a simple 8:1 mux where all the inputs are 16 bits wide. Synthesis often re-structures these for delay vs. area tradeoffs, because you can play games with the selects to amortize different structures through the datapath. So be careful trying to analyze these for area out of context. There are a couple of publications on this that I listed below. The FPL paper below also talks about crossbar and barrel shifter synthesis into the ALM.
I also didn't understand the question about sharing LUTs, but I agree with the previous poster that the answer is probably "no" all around. You might mean resource sharing, as in making the mux iterative / multi-cycle, but that would probably be more expensive in area. In terms of delay, you can always pipeline. Also, as someone else said, a multiplier can be used for a barrel shifter (multiply the data by a one-hot decoding of k) if you have no other purpose for the dedicated DSP block.

All this information is in published papers; below are some references. The first three are on the general mux synthesis topic. The other two are on the Stratix II ALM and architecture and discuss some of the barrel-shifter/xbar material I repeated above.

Paul Metzgen and Dominic Nancekievill, "Multiplexor Restructuring for FPGA Implementation Cost Reduction", Design Automation Conference, June 2005.

Dominic Nancekievill and Paul Metzgen, "Factorizing Multiplexers in the Datapath to Reduce Cost in FPGAs", IWLS, June 2005.

Jennifer Stephenson and Paul Metzgen, "Logic Optimization Techniques for Multiplexors", Mentor user2user conference, 2004.

Mike Hutton, Jay Schleicher, David Lewis, Bruce Pedersen, Richard Yuan, Sinan Kaptanoglu, Gregg Baeckler, Boris Ratchev, Ketan Padalia, Mark Bourgeault, Andy Lee, Henry Kim and Rahul Saini, "Improving FPGA Performance and Area Using an Adaptable Logic Module", Proc. 14th International Conference on Field-Programmable Logic, Antwerp, Belgium, pp. 135-144, Sept 2004. LNCS 3203.

David Lewis, Elias Ahmed, Gregg Baeckler, Vaughn Betz, Mark Bourgeault, David Cashman, David Galloway, Mike Hutton, Chris Lane, Andy Lee, Paul Leventis, Sandy Marquardt, Cameron McClintock, Ketan Padalia, Bruce Pedersen, Giles Powell, Boris Ratchev, Srinivas Reddy, Jay Schleicher, Kevin Stevens, Richard Yuan, Richard Cliff and Jonathan Rose, "The Stratix II Routing and Logic Architecture", 2005 Int'l Symposium on FPGAs (FPGA, Feb 2005).
Regards,
Mike Hutton
Altera Corp
San Jose CA
<firstinitial><lastname>@altera.com

Note: Please don't bother sending email to the yahoo account in the header, I won't read it. My real email is in the signature.

------------------------------

Here's the Verilog for the 8:1 mux, barrel shifters and crossbar.

// Simple 8:1 mux
// M. Hutton, Altera Corp, 2006
module mux(in, out, s, clk);
  input  [7:0] in;
  input  [2:0] s;
  input        clk;
  output       out;
  reg          out;

  always @(posedge clk) begin
    case (s)
      3'b000: out <= in[0];
      3'b001: out <= in[1];
      3'b010: out <= in[2];
      3'b011: out <= in[3];
      3'b100: out <= in[4];
      3'b101: out <= in[5];
      3'b110: out <= in[6];
      3'b111: out <= in[7];
    endcase
  end
endmodule

// Simple barrel shifter with no rotation
// M. Hutton, Altera Corp, 2003
module barrel (data_in, data_out, shift_by, clk);
  input  [15:0] data_in;
  input  [15:0] shift_by;
  input         clk;
  output [15:0] data_out;
  reg    [15:0] data_out;
  reg    [15:0] reg_data_in;
  reg    [15:0] reg_shift_by;

  always @(posedge clk) begin
    reg_data_in  <= data_in;
    reg_shift_by <= shift_by;
    data_out     <= reg_data_in << reg_shift_by;
  end
endmodule

// Simple 16-bit barrel shifter with rotation
// Mike Hutton, Altera Corp, 2003
module barrel16 (data_in, data_out, shift_by, clk);
  input  [15:0] data_in;
  input  [15:0] shift_by;
  input         clk;
  output [15:0] data_out;
  reg    [15:0] data_out;
  reg    [15:0] reg_shift_by;
  reg    [15:0] reg_data_in;

  always @(posedge clk) begin
    reg_data_in  <= data_in;
    reg_shift_by <= shift_by;
    case (reg_shift_by)
      4'b0000: data_out <= reg_data_in[15:0];
      4'b0001: data_out <= {reg_data_in[0],    reg_data_in[15:1]};
      4'b0010: data_out <= {reg_data_in[1:0],  reg_data_in[15:2]};
      4'b0011: data_out <= {reg_data_in[2:0],  reg_data_in[15:3]};
      4'b0100: data_out <= {reg_data_in[3:0],  reg_data_in[15:4]};
      4'b0101: data_out <= {reg_data_in[4:0],  reg_data_in[15:5]};
      4'b0110: data_out <= {reg_data_in[5:0],  reg_data_in[15:6]};
      4'b0111: data_out <= {reg_data_in[6:0],  reg_data_in[15:7]};
      4'b1000: data_out <= {reg_data_in[7:0],  reg_data_in[15:8]};
      4'b1001: data_out <= {reg_data_in[8:0],  reg_data_in[15:9]};
      4'b1010: data_out <= {reg_data_in[9:0],  reg_data_in[15:10]};
      4'b1011: data_out <= {reg_data_in[10:0], reg_data_in[15:11]};
      4'b1100: data_out <= {reg_data_in[11:0], reg_data_in[15:12]};
      4'b1101: data_out <= {reg_data_in[12:0], reg_data_in[15:13]};
      4'b1110: data_out <= {reg_data_in[13:0], reg_data_in[15:14]};
      4'b1111: data_out <= {reg_data_in[14:0], reg_data_in[15]};
    endcase
  end
endmodule

// Simple 16-bit crossbar with one-bit width
// M. Hutton, Altera Corp, 2003
module xbar(in, out, s, clk);
  input  [15:0] in;
  input  [63:0] s;
  input         clk;
  output [15:0] out;
  reg    [15:0] out;
  reg    [15:0] out1;
  reg    [15:0] inreg;
  integer       k;

  always @(posedge clk) begin
    inreg <= in;
    for (k = 0; k < 16; k = k + 1) begin
      out1[k] <= inreg[{s[4*k+3], s[4*k+2], s[4*k+1], s[4*k]}];
    end
    out <= out1;
  end
endmodule

Article: 100663
On Sat, 15 Apr 2006 10:45:06 +1200, Jim Granville <no.spam@designtools.co.nz> wrote:

>John Larkin wrote:
>>
>> Since the max serial-slave configuration rate on things like Spartan3
>> chips is, what, 20 MHz or something, you might consider slowing down
>> the CCLK input path, and/or adding some serious hysteresis on future
>> parts. On a pcb, CCLK is often a shared SPI clock, with lots of loads
>> and stubs and vias and such, so may not be as pristine as a system
>> clock. CCLK seems to be every bit as touchy as main clock pins, and it
>> really needn't be.
>
>Wouldn't one expect this to be 'normal design practise'?
>
>I suppose Xilinx missed that obvious feature because there are no
>other Schmitt cells on the die, and even though the CPLDs have this,
>I'm sure their inter-department sharing is like most large companies :)

I have it secondhand (one of my guys tells me) that all S3 inputs have about 100 mV of hysteresis, but that's not enough to improve noise immunity in most practical situations.

It's good that FPGAs keep getting faster, but not all applications need all that speed, and pickiness about clock edge quality can be a real liability in a lot of slower applications.

John

Article: 100664
So how many million-length bit-vector dot products might I be able to do per second? My 3.8 GHz P4 can do 125/sec. I would prefer building a Beowulf cluster if the price/performance were similar (because FPGAs are so foreign to me). Of course, if you tell me 10,000/sec I will become an instant FPGA evangelist, hehe.

Jan: I use a matrix library written in C.

Thanks for your help guys,
AndrewF

Article: 100665
John Larkin wrote:
> On Sat, 15 Apr 2006 10:45:06 +1200, Jim Granville
> <no.spam@designtools.co.nz> wrote:
>
>> John Larkin wrote:
>>
>>> Since the max serial-slave configuration rate on things like Spartan3
>>> chips is, what, 20 MHz or something, you might consider slowing down
>>> the CCLK input path, and/or adding some serious hysteresis on future
>>> parts. On a pcb, CCLK is often a shared SPI clock, with lots of loads
>>> and stubs and vias and such, so may not be as pristine as a system
>>> clock. CCLK seems to be every bit as touchy as main clock pins, and it
>>> really needn't be.
>>
>> Wouldn't one expect this to be 'normal design practise'?
>>
>> I suppose Xilinx missed that obvious feature because there are no
>> other Schmitt cells on the die, and even though the CPLDs have this,
>> I'm sure their inter-department sharing is like most large companies :)
>
> I have it secondhand (one of my guys tells me) that all S3 inputs have
> about 100 mV of hysteresis. But that's not enough to improve noise
> immunity in most practical situations.
>
> It's good that FPGAs keep getting faster, but not all applications
> need all that speed, and pickiness about clock edge quality can be a
> real liability in a lot of slower applications.

True. There is also a slight speed penalty for full Schmitt cells, which is one reason why the speed-at-all-costs FPGA sector ignores the benefits. Still, the news from Steve K is good :)

-jg

Article: 100666
I have not found a decent bit-vector dot product plugin card. I think they usually handle integers or floating point, but not bits in an efficient manner.

Article: 100667
I personally don't like this stackup because you have designated your top and bottom layers as your high-speed layers, when the layers adjacent to GND planes are the best ones. I would move the high-speed layers to the inside instead of your outer two, or even separate the power and GND planes. I have used alternating power and GND and it worked quite well. I do hear that coupling the planes is good for some capacitance-related reasons, however, so I can't comment on that.

Article: 100668
jai.dhar@gmail.com schrieb:
> 1 GND plane in an 8-layer stack? I was under the belief that yes, a
> power plane can serve as a return path for a signal, but it's not
> preferred or equal over a GND plane. I would think partitioning the
> power planes is a safer bet than cutting another GND layer.

Nope. A high-speed circuit needs a low-inductance power supply, and a CMOS circuit is completely symmetric between VCC and GND: the magnitude of electrical effects depends on the maximum of the VCC and GND inductance. Driving a falling output, for example, will make the GND plane bounce up; driving a rising output will make the VCC plane bounce down. Both change the input threshold voltage by the same magnitude, introducing jitter and reducing the noise margin. Therefore design all supplies for the same inductance.

The only difference is that GND usually is common for the whole board, whereas some power supply voltages are only needed in certain areas, so you can have one plane in one half of the board and another in the other half. Islands are possible but very dangerous: remember that you cannot have a high-speed signal cross the island boundary on the adjacent routing layer. Having a separate layer for each supply therefore simplifies routing a lot.

Also consider microvias. They save a lot of area, are great for signal integrity, and do not cost much extra.

Kolja Sulimma

Article: 100669
On a sunny day (14 Apr 2006 21:56:19 -0700) it happened andrewfelch@gmail.com wrote in <1145076979.683557.142540@g10g2000cwb.googlegroups.com>:
>So how many million-length bit-vector dot products might I be able to
>do per second? My 3.8ghz P4 can do 125/sec. I would prefer building a
>beowulf cluster if the price:performance was similar (because fpga is
>so foreign to me). Of course if you tell me 10,000/sec I will become
>an instant fpga evangelist, hehe.
>
>Jan: I use a matrix library written in C.
Ah, OK. You know, this is not a 'saturday afternoon after shopping' thing (it is that now here); I cannot answer just like that. My old boss used to DEMAND to see the whole project, else he did not even want to venture. Because often these things can be broken down, done in different ways. If it was a simple multiply you could see how many n-bit multipliers there are in the largest FPGA, but millions.... And you would somehow have to get the data in and out. Some project. In Virtex (Xilinx knows more) at 500 MHz you can have 512 XtremeDSP blocks with an 18x18 multiplier. 512 x 18 = 9216 bits at a time in 2 nS. Covering a million bits (in a loop) takes about 109 iterations x 2 nS = 218 nS per dot product... Sort of a wild number; you really need to talk to these guys, I have no experience with the Virtex 4. Over to X (or Altera). Budget? Time? All counts.
Article: 100670
andrewfelch@gmail.com schrieb:
> where I must dot product
> many million-length bit vectors (which only change occasionally) with 1
> input vector. Anybody want to estimate the cost, speedup, and value an
> fpga could offer me?
If I understand you correctly, for each vector you want to know at how many places both the vector and the input vector are 1? Your vectors change rarely, the single input vector changes rapidly? I would say that in an FPGA you would not do it one vector at a time but N bits at a time for many vectors in parallel. If your vectors are stored off chip you can process them as fast as you can read them. With the right board, at a rate of a few hundred gigabits per second.

>So how many million-length bit-vector dot products might I be able to
> do per second? My 3.8ghz P4 can do 125/sec.
That's 125 Mbits/s. That is very easy to beat. You probably can get some affordable board with 64-bit 200 MHz SRAM and a small FPGA. This will get you about a factor of ten over the P4. On the other hand, the P4 value seems too low. According to "Hacker's Delight" pages 65ff, counting the number of bits in a 32-bit word takes fewer than 20 instructions. Adding one instruction for the initial AND, plus the loads, index updates and some loop-control instructions, results in about one instruction per bit. This means that a P4 should be able to do a few gigabits per second. And that's without using MMX instructions, which can do the dot product after the first three reductions.

Kolja
Article: 100671
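Kolja's per-word bit counting can be sketched in plain C. The popcount below is the standard divide-and-conquer version discussed in Hacker's Delight, followed by the AND-and-count loop for a binary dot product; the function names are just illustrative:

```c
#include <stdint.h>
#include <stddef.h>

/* Divide-and-conquer population count of a 32-bit word
   (the classic Hacker's Delight construction). */
static uint32_t popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums  */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums  */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums  */
    return (x * 0x01010101u) >> 24;                   /* total       */
}

/* Binary dot product: the number of positions where both vectors
   are 1.  nwords is the vector length in 32-bit words; a
   million-bit vector is exactly 1000000/32 = 31250 words. */
uint32_t bit_dot(const uint32_t *a, const uint32_t *b, size_t nwords)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++)
        sum += popcount32(a[i] & b[i]);
    return sum;
}
```

With a handful of instructions per 32-bit word, one million-bit dot product is roughly 31250 AND-plus-popcount steps, which is where the "few gigabits per second on a P4" estimate comes from.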
andrewfelch@gmail.com wrote:
> So how many million-length bit-vector dot products might I be able to
> do per second? My 3.8ghz P4 can do 125/sec. I would prefer building a
> beowulf cluster if the price:performance was similar (because fpga is
> so foreign to me). Of course if you tell me 10,000/sec I will become
> an instant fpga evangelist, hehe.
If I understood correctly:
- You have many (what is many for you? 100, 1000, 1000000?) million-bit vectors that are quite 'static' (or at least don't change much compared to your 'input' vector). Let's say you have N of them and that your million-bit vector is in fact 1024*1024 bits long.
- You also have 1 input vector that changes quite often.
- You want the N dot products, which is basically the number of 1s in the bitwise AND of the fixed vectors and the input vector.

To get an estimation of how fast it could be done, N should be known ... or at least a range, because I think the main limitation is gonna be the bandwidth between the host and the card and not the FPGA itself.

An FPGA can do the dot product pretty easily. Imagine you get the vectors 32 bits by 32 bits: first the 32 first bits of 'input', then the 32 first bits of the 'references' one by one. Then the 32 bits after that, and so on. So to enter all the vectors' info for 1 given input vector, you need (N+1)*2^15 cycles. The logic doing the dot product is just an AND bit by bit, a stage that performs the counting of the 32 bits, then a 21-bit adder that stores the result in block RAMs (given that N is sufficiently small to fit the 21-bit results in block RAM; let's say < 16384 for a small FPGA). The logic doing that could easily be pipelined to go at > 100 MHz even in a small cheap Spartan 3, and since you need 2^15 cycles to do a complete vector (if N>>1) that would be 3000/s, and that's in a small FPGA.
Now, use a 128-bit wide DDR2 memory (that's 256 bits in parallel), use a high speed grade to run the whole thing at > 250 MHz, and you get 60,000 of them per second ... But as I said, you need to get the data into the DDR2 memory, organized so that the read is efficient, and that is not so easy: a million-bit vector is 128 KBytes, and getting 60 thousand of them per second is 7.5 GBytes of traffic per second ...

Of course, you need to define N better, and these numbers are just for the first design I can think of with the info you provided. Your mileage may vary. I think it could be done pretty quickly if you hire someone that already has, and has used, a memory controller and whatever controller is needed to input/output the data. And getting data in/out is the real challenge here ...

Sylvain
Article: 100672
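One way to see Sylvain's dataflow is a plain-C model of the streaming scheme he describes: words of the reference vectors arrive one at a time, are ANDed with the matching input word, bit-counted, and accumulated into one running total per vector (the role of the block-RAM accumulators in the FPGA version). The function name and memory layout here are illustrative only, not taken from any actual board design:

```c
#include <stdint.h>
#include <stddef.h>

/* Software model of the streaming AND/count/accumulate scheme:
   input   - the input vector, nwords 32-bit words
   refs    - N reference vectors, vector-major, N*nwords words
   totals  - N running dot-product accumulators (block RAM in HW) */
void stream_dot(const uint32_t *input, const uint32_t *refs,
                uint32_t *totals, size_t n_vectors, size_t nwords)
{
    for (size_t v = 0; v < n_vectors; v++)
        totals[v] = 0;

    for (size_t w = 0; w < nwords; w++) {   /* stream word by word */
        uint32_t in_word = input[w];
        for (size_t v = 0; v < n_vectors; v++) {
            uint32_t x = refs[v * nwords + w] & in_word;
            /* count the ones (a small adder tree in hardware) */
            x = x - ((x >> 1) & 0x55555555u);
            x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
            x = (x + (x >> 4)) & 0x0F0F0F0Fu;
            totals[v] += (x * 0x01010101u) >> 24;
        }
    }
}
```

In the FPGA the inner loop disappears: the AND, the counting stage and the accumulator update are one pipelined operation per memory word, so throughput is set entirely by how fast the reference vectors can be read.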
Steve Knapp (Xilinx Spartan-3 Generation FPGAs) wrote:
> rickman wrote:
> > I have looked at the data sheet and they say very clearly that the
> > Spartan 3 is held in reset until all three power supplies are fully up.
> > But the range of voltages is very wide, with reset being released when
> > the Vcco on Bank four is as low as 0.4 volts.
> >
> > I get a lot of grief from the FPGA firmware designers on every little
> > nit and pick that they don't like about the board design. I need to
> > know that this will keep the FPGA in reset and all IOs tristated
> > whether the various power voltages are above or below the internal
> > reset threshold, up to the point of being configured.
>
> I assume that you are looking at Table 28 on page 54 in the Spartan-3
> data sheet.
> http://www.xilinx.com/bvdocs/publications/ds099.pdf
>
> These are essentially the trip points for the power-on reset (POR)
> circuit inside the FPGA. The trip voltage range is somewhat wide due
> to process variation, etc.
>
> The POR circuit prevents configuration from starting until all three
> power rails are within the trip-point range. The POR can happen
> as early as the minimum voltage levels or as late as the maximum
> limits.
>
> Until the POR is released, all I/Os not actively involved in
> configuration are high-impedance. The HSWAP_EN pin controls whether or
> not internal pull-ups are applied to these I/Os. When HSWAP_EN = High,
> the pull-ups are turned off. Also, the pull-ups connect to their
> associated power rail so you won't see the effect until VCCO ramps up.
Thanks for the info. Yes, I was looking at that table, plus table 30 on the next page. I am concerned about letting the DSP run before the FPGA power is fully up, and also about operating the DSP while the FPGA power has a momentary glitch for whatever reason. The DSP has a separate core voltage from the FPGA and shares the Vcco of 3.3 volts.
The FPGA is configured and operated on the DSP external memory bus, which also connects to the program/data flash memory. I just want to make sure I can defend my power-up and power-glitch operation of the board. When the board is powering up, it is clear that the FPGA is held in reset until the three power rails are somewhere within the trip ranges or above. Then the DSP can hold the PROG_B signal low to continue holding the FPGA in reset until the DSP is happy with the power supplies and is ready to configure the FPGA, without concern that the FPGA will mess up the memory bus. That part seems clear.

But table 30 on page 55 seems to be saying that if Vccint or Vccaux dip below the minimum values, but remain above the reset trip points, the configuration can be corrupted and the FPGA will not be put in reset. In this case should I assume that the IOs can then be in any state and may hang the DSP memory bus? If so, I need to use the PowerOK on the LDO regulators to either halt the DSP or make sure it gets an NMI and runs only from internal memory. I would prefer to be able to keep the DSP running normally and record the power event in memory. I have some concerns about the system power supply design and would like to be able to show clear evidence that the power is not stable rather than having to extrapolate from processor resets.
Article: 100673
On Sat, 15 Apr 2006 17:09:11 +1200, Jim Granville <no.spam@designtools.co.nz> wrote:
>John Larkin wrote:
>> On Sat, 15 Apr 2006 10:45:06 +1200, Jim Granville
>> <no.spam@designtools.co.nz> wrote:
>>
>>>John Larkin wrote:
>>>
>>>>Since the max serial-slave configuration rate on things like Spartan3
>>>>chips is, what, 20 MHz or something, you might consider slowing down
>>>>the CCLK input path, and/or adding some serious hysteresis on future
>>>>parts. On a pcb, CCLK is often a shared SPI clock, with lots of loads
>>>>and stubs and vias and such, so may not be as pristine as a system
>>>>clock. CCLK seems to be every bit as touchy as main clock pins, and it
>>>>really needn't be.
>>>
>>>Wouldn't one expect this to be 'normal design practise'?
>>>
>>>I suppose Xilinx missed that obvious feature, because there are no
>>>other Schmitt cells on the die, and even though the CPLDs have this,
>>>I'm sure their inter-department sharing is like most large companies :)
>>>
>>
>> I have it secondhand (one of my guys tells me) that all S3 inputs have
>> about 100 mV of hysteresis. But that's not enough to improve noise
>> immunity in most practical situations.
>>
>> It's good that FPGAs keep getting faster, but not all applications
>> need all that speed, and pickiness about clock edge quality can be a
>> real liability in a lot of slower applications.
>
> True - there is also a slight speed penalty for the full Schmitt cells,
>so that's a reason why the speed-at-all-costs FPGA sector ignores the
>benefits.
> Still, the news from Steve K is good :)
>
>-jg
>
Right. As noted in another thread, one can always add a deglitch circuit to any input, including clock pins, except for CCLK. So if that's the only one they slow down, we may elect to routinely deglitch system clock inputs except when we really need the speed. I suppose they'll Schmitt the JTAG pins, too; I don't use them, but they seem like great candidates for noise problems.
Purists will argue that once the sacred word "clock" is voiced, we are obliged to drive it appropriately. But it's getting so that a 5 ns rise with a couple hundred mV of noise is not a reliable clock any more, and designing brutally fast, star-distributed clocks into a slow industrial-environment product really doesn't make a lot of sense. I'd hazard that the majority of FPGAs are used at a fraction of their speed capability.

John
Article: 100674
In article <1145050118.699722.123650@e56g2000cwe.googlegroups.com>, <andrewfelch@gmail.com> wrote:
> Hello,
>
> I am a Python programmer writing neural network code with binary firing
> and binary weight values. My code will take many days to parse my
> large data sets. I have no idea how much fpga could help, what the
> cost would be, and how easy it would be to access it from Python. The
> problem is similar to competitive networks, where I must dot product
> many million-length bit vectors (which only change occasionally) with 1
> input vector. Anybody want to estimate the cost, speedup, and value an
> fpga could offer me?
I assume you have looked for algorithmic speed-ups? (Also, FPGAs have different algorithmic speed-ups available than conventional computers do.) Algorithmic speed-ups might be available if:
a) the bit vectors are sparse (i.e. only a small fraction are ones, or a small fraction are zeros)
b) the bit vectors are non-random (e.g. you are matching to shift register sequences, or to highly-compressible sequences that can be described in considerably less data than the raw bit stream)
c) the bit vectors are related (e.g. you are using the neural net to listen to a data stream for a pattern: you don't find it, so you shift by one bit and try again)
d) you can do pruning (e.g. if you don't find any evidence of a match after doing 10% of the sequence, you can abandon that vector and try the next)
e) you can match multiple input vectors instead of just 1 (since most of your conventional processor time is going to be spent waiting around for slow DRAM to get the next memory fetch of megabit matching vectors, you may as well compare it to a few dozen inputs, rather than just one)

As a Python programmer, you will probably find it easier to use C than to learn VHDL/Verilog to the extent you need to implement this. If a single order-of-magnitude speed-up will solve your problems, then changing to a language closer to the metal may be enough and is easy enough to try.

--
David M. Palmer dmpalmer@email.com (formerly @clark.net, @ematic.com)
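The pruning idea (point d) is easy to try in C before touching an FPGA. The sketch below looks for the best-matching reference vector and abandons a candidate as soon as even a perfect score on the remaining bits could not beat the best seen so far; the function names and layout are hypothetical:

```c
#include <stdint.h>
#include <stddef.h>

/* Divide-and-conquer 32-bit population count. */
static uint32_t pc32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (x * 0x01010101u) >> 24;
}

/* Return the index of the reference vector with the largest binary
   dot product against `input`.  A candidate is abandoned (pruned)
   once its partial count plus all remaining bits cannot beat the
   best total seen so far.  refs is vector-major: N * nwords words. */
size_t best_match(const uint32_t *input, const uint32_t *refs,
                  size_t n_vectors, size_t nwords)
{
    size_t best_v = 0;
    uint32_t best = 0;
    for (size_t v = 0; v < n_vectors; v++) {
        uint32_t sum = 0;
        for (size_t w = 0; w < nwords; w++) {
            sum += pc32(refs[v * nwords + w] & input[w]);
            /* the remaining words can add at most 32 bits each */
            if (sum + 32u * (uint32_t)(nwords - w - 1) <= best)
                break;                      /* cannot win: prune */
        }
        if (sum > best) { best = sum; best_v = v; }
    }
    return best_v;
}
```

Whether this wins in practice depends on the data: the bound used here is loose, so the pruning mostly pays off when candidate scores differ strongly, and the worst case still scans everything.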