Messages from 105175

Article: 105175
Subject: Re: FPGA consultants
From: "Bryan Hackney" <bbhack@gmail.com>
Date: 16 Jul 2006 21:57:07 -0700


hrocarina@gmail.com wrote:
> We are looking for a consultant for a project that involves an FPGA to
> implement our data manipulation algorithm to scale up to 5000
> simultaneous sessions and interface with our Opteron system via either
> PCI or Hypertransport.
>
> Please contact me if interested as soon as possible.
>
> hrocarina@gmail.com
>
> Regards,
> HR

I'll give you a call next week - NOT.

Get a marketing consultant, yesterday.

Article: 105176
Subject: Re: 2048 input or gate ?
From: "John Adair" <g1@enterpoint.co.uk>
Date: 16 Jul 2006 23:55:51 -0700

I am rusty on Verilog so I can't remember if you have a generate
statement available, but another way to cut the work is to have a layered
component such that the bottom level has, say, four 4-i/p OR gates in it.
The layer above has 4 of the super gate, and so on. If you start
at the bottom with an OR gate instantiation and do the same all the way
up with component instantiations, the synthesiser won't be able to do
much to insert other gates.

The MUXCY is probably being used because the carry chain is a fast route
compared to general routing and can be used to make a wide OR function
with 2 or more LUTs. To a degree this may be the fastest way to get your
OR, but probably tempered with some imposed structure. As a guess, the
synthesiser is currently generating a number of 220-228 i/p OR gates
and then putting the outputs together in another OR function.
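
The size of a purely layered 4-input OR structure can be worked out with a little arithmetic. The following is a minimal Python sketch, not anything a synthesiser produces; the helper name `or_tree_shape` is made up for illustration.

```python
def or_tree_shape(n_inputs, lut_inputs=4):
    """Return the number of LUTs in each layer of a pure OR tree.

    Illustrative helper only: each layer ORs groups of `lut_inputs`
    signals until a single signal remains.
    """
    layers = []
    signals = n_inputs
    while signals > 1:
        luts = -(-signals // lut_inputs)  # ceiling division
        layers.append(luts)
        signals = luts
    return layers


layers = or_tree_shape(2048)
print(layers)        # LUTs per layer, bottom layer first
print(len(layers))   # number of LUT levels
print(sum(layers))   # total LUT4 count
```

For 2048 inputs this gives six layers, in line with the estimate in the original question; the top layer is a LUT with only two of its inputs used.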

John Adair
Enterpoint Ltd.

mk wrote:
> Hi everyone,
> I am trying to 'or' a 2K vector in Virtex4. Looking at the problem as
> a first approximation, it would need 6 levels of 4 input lookup
> tables. So far I have tried XST but it seems to be using the initial
> 512 LUT4s and then 56 levels of MUXCY. Any ideas why it's using the
> MUXCYs? They seem to be quite fast at 45ns each but number of levels
> is quite high. I'm curious what the timing would look like if I could
> force it to use only LUT4s but I really don't want to code it by hand
> and I am too lazy to write a perl script to do it either. Any
> suggestions ?
>
> Thanks.
>
> PS Here is what I am using as a test module. I am trying to map it to
> a virtex4-10.
>
> module orlt(clk, in, out);
> input clk;
> input [2047:0] in;
> output out;
>
> reg [2047:0] inr;
> reg out;
> wire outw;
>
> orl u0(inr, outw);
>
> always @(posedge clk)
> begin
>         out <= outw;
>         inr <= in;
> end
>
> endmodule
>
> module orl(in, out);
> input [2047:0] in;
> output out;
> 
> wire out = |in[2047:0];
> endmodule

Article: 105177
Subject: OpenFire - public domain MicroBlaze clone in verilog
From: "Antti" <Antti.Lukats@xilant.com>
Date: 17 Jul 2006 00:57:32 -0700

Hi

google really finds things:

http://www.ccm.ece.vt.edu/~amarschn/openfire/tools.html

at the above link you can download OpenFire verilog source code

a small writeup with synthesis results

http://xilant.com/content/view/43/2/

Antti

Article: 105178
Subject: Re: Development Boards -Your chance to suggest features
From: Martin Thompson <martin.j.thompson@trw.com>
Date: 17 Jul 2006 09:20:32 +0100

"John Adair" <g1@enterpoint.co.uk> writes:

> It's now been christened and had the obligatory bottle of Champers
> smashed. Darnaw1 is the name to look for.
> 

Go on John, enlighten us :-) Where do you get your names from?  Do you
just open a random OS (that's Ordnance Survey for non-Brits - not
operating system!) map and stick your finger on a remote village or
something?

Cheers,
Martin

-- 
martin.j.thompson@trw.com 
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.trw.com/conekt

Article: 105179
Subject: Re: Need for reset in FPGAs
From: Martin Thompson <martin.j.thompson@trw.com>
Date: 17 Jul 2006 09:28:31 +0100

Thomas Reinemann <thomas.reinemann@aucotronics.de> writes:

> Hello,
> 
> usually a reset signal is applied to put the FFs of an FPGA into a known
> state. Just some days ago I had a discussion. Someone's point of view
> is, that a reset is not necessary, since the FF's output will be always
> zero, after applying the voltage. Does this happen in FPGAs really,
> especially in a Spartan3?
> 

The FFs always come up in a guaranteed way after power-up - the
bitstream you feed into the FPGA defines their power-up state.

Normally this will be a zero, but there are situations where this can
change.  Either because you tell the tools you want a '1' there
instead - or because the tools decided it would make their life easier
for it to power-up to '1'.  If you're lucky they'll even tell you they
did it :-)

The old Altera 10K series FFs could only reset to '0', so if you asked
for a preset FF, the mapper (or whatever it was called back in those
days) would stick NOT gates either side of it to make it behave how
you asked it to.  This had the side-effect of power-up to '1' also,
which was *usually* what you wanted...  Anyway, those days are past,
I'll be quiet now :-)
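
The inverter trick can be shown with a toy behavioural model. This is a sketch in Python with made-up class names (`ResetToZeroFF`, `EmulatedPresetFF`); it only illustrates the logic and says nothing about how any particular mapper implemented it.

```python
class ResetToZeroFF:
    """Toy model of a flip-flop that can only power up / reset to 0."""

    def __init__(self):
        self.q = 0  # power-up state is always 0 in this model

    def clock(self, d, reset=False):
        self.q = 0 if reset else d
        return self.q


class EmulatedPresetFF:
    """Preset behaviour built from a reset-to-0 FF with an inverter on D and Q."""

    def __init__(self):
        self._ff = ResetToZeroFF()

    @property
    def q(self):
        return 1 - self._ff.q  # inverter on the output

    def clock(self, d, preset=False):
        self._ff.clock(1 - d, reset=preset)  # inverter on the input
        return self.q


ff = EmulatedPresetFF()
print(ff.q)                      # state straight after power-up
print(ff.clock(0))               # normal operation: stores the D value
print(ff.clock(1))
print(ff.clock(0, preset=True))  # preset overrides D
```

Because the inner flop powers up to 0 and its output is inverted, the wrapped flop appears to power up to '1', which is the side effect described above.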

Martin

-- 
martin.j.thompson@trw.com 
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.trw.com/conekt

Article: 105180
Subject: Re: 2048 input or gate ?
From: "Symon" <symon_brewer@hotmail.com>
Date: 17 Jul 2006 11:01:38 +0200

"mk" <kal*@dspia.*comdelete> wrote in message 
news:574jb2h3o7cv3viul4ghq7j3gt2pen6l2d@4ax.com...
> Hi everyone,
> I am trying to 'or' a 2K vector in Virtex4. Looking at the problem as
> a first approximation, it would need 6 levels of 4 input lookup
> tables. So far I have tried XST but it seems to be using the initial
> 512 LUT4s and then 56 levels of MUXCY. Any ideas why it's using the
> MUXCYs? They seem to be quite fast at 45ns each but number of levels
> is quite high. I'm curious what the timing would look like if I could
> force it to use only LUT4s but I really don't want to code it by hand
> and I am too lazy to write a perl script to do it either. Any
> suggestions ?
>
> Thanks.
>
Your synthesiser is using the MUXCYs because it uses less resource (about 
75% of the tree method) and is faster. If the MUXCY propagation delay was 
45ns, I'd be worried, but it's really only 45ps! :-) If you build a tree, 
it'll be slower. It's not just the LUT delay, it's all that routing you need 
for a wide OR gate. To show it, you could try synthesising a 2k XOR gate. 
Your synthesiser might struggle to implement that with a carry structure.
HTH, Syms.

Article: 105181
Subject: Re: OpenFire - public domain MicroBlaze clone in verilog
From: "Sandro" <sdroamt@netscape.net>
Date: 17 Jul 2006 04:23:45 -0700

Antti wrote:
> Hi
>
> google really finds things:
>
> http://www.ccm.ece.vt.edu/~amarschn/openfire/tools.html
>

Another ( maybe the same ;-) ) is at
   http://www.opencores.org/projects.cgi/web/aemb/overview

Sandro

Article: 105182
Subject: Re: OpenFire - public domain MicroBlaze clone in verilog
From: "Antti" <Antti.Lukats@xilant.com>
Date: 17 Jul 2006 04:50:28 -0700

Sandro schrieb:

> Antti wrote:
> > Hi
> >
> > google really finds things:
> >
> > http://www.ccm.ece.vt.edu/~amarschn/openfire/tools.html
> >
>
> Another ( maybe the same ;-) ) is at
>    http://www.opencores.org/projects.cgi/web/aemb/overview
>
> Sandro

It's not the same at all - the OpenFire docs explain why they did not
use the aeMB but rather designed a new core.

The author of the aeMB has other priorities (university study) and has
dropped the development, AFAIK.

Antti

Article: 105183
Subject: Re: Where are you heading?
From: Eli Hughes <emh203@psu.edu>
Date: Mon, 17 Jul 2006 09:02:31 -0400

Austin Lesea wrote:
> Gee,
> 
> Thanks. (Austin for Austin)
> 
> Peter is on vacation, so I will say thanks for him as well:
> 
> Danke,
> 
> Austin (for Peter)

Add me to the list.

While I have not always agreed with Austin, don't come here asking for 
free help and then start crying 'Xilinx is the evil empire' when you 
don't like the answer.  Anytime there has been a serious problem on the 
Xilinx side (like with incorrect documentation, bad chip, etc.) they 
provide help.  When it comes to design advice or something that is 
screwed up (for any reason), don't expect a group of engineers to form a 
consensus that you were the helpless victim of a bad chip or poor 
documentation.  Anytime you get lots of engineers together, there is 
never agreement.  Everyone thinks their way is better.

-Eli

Article: 105184
Subject: Re: Need for reset in FPGAs
From: "Andy" <jonesandy@comcast.net>
Date: 17 Jul 2006 06:15:48 -0700

It is important to have a reset that is synchronously deasserted
relative to every clock used.  These may be fully synchronous resets or
asynchronous resets that have the trailing edge synchronized.

The reason for this is that if a reset input is not synchronized to the
same clock as the circuitry being reset, then not all flops in the circuit
will come out of reset on the same clock edge, which, unless it is
handled very carefully, will cause problems that can be very hard to
debug.

Whether or not you have a separate reset, or are only resetting on
configuration, the above requirements hold true.
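
One common way to get an asynchronously asserted, synchronously deasserted reset is a short flop chain per clock domain. The following is a cycle-level Python sketch of that idea under simple assumptions (two stages, active-high reset); the class name `ResetSynchronizer` is made up for illustration and the model ignores metastability.

```python
class ResetSynchronizer:
    """Two-flop synchroniser: reset asserts immediately, releases on a clock edge."""

    def __init__(self, stages=2):
        self.chain = [1] * stages  # 1 = reset active; starts asserted

    def async_assert(self):
        # Asynchronous path: every stage is forced active without a clock.
        self.chain = [1] * len(self.chain)

    def clock_edge(self, reset_in):
        # Synchronous path: shift a 0 in from the first stage only while
        # the raw reset input is inactive.
        if reset_in:
            self.async_assert()
        else:
            self.chain = [0] + self.chain[:-1]
        return self.reset_out

    @property
    def reset_out(self):
        return self.chain[-1]


sync = ResetSynchronizer()
raw_reset = [1, 1, 0, 0, 0, 0]   # raw reset released before the third edge
trace = [sync.clock_edge(r) for r in raw_reset]
print(trace)
```

The synchronised reset stays active for one extra edge after the raw reset is released, so every flop in that clock domain sees the release on the same edge. A design with several clocks would need one such chain per clock.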

Andy

Nial Stewart wrote:
> "Thomas Reinemann" <thomas.reinemann@aucotronics.de> wrote in message
> news:e981ph$ur5$1@news.boerde.de...
> > Hello,
> > usually a reset signal is applied to put the FFs of an FPGA into a known
> > state. Just some days ago I had a discussion. Someone's point of view
> > is, that a reset is not necessary, since the FF's output will be always
> > zero, after applying the voltage. Does this happen in FPGAs really,
> > especially in a Spartan3?
> > Bye Tom
>
>
> If you use any form of PLL/DLL in your design I don't think you can
> be sure of what's going to happen until it's locked. This can throw
> logic/state machines into complete disarray.
>
> I generate a synchronous reset which de-activates some period
> after all my PLLs have locked.
> 
> 
> 
> Nial

Article: 105185
Subject: Re: design partition across multiple FPGAs
From: "Andy" <jonesandy@comcast.net>
Date: 17 Jul 2006 06:18:32 -0700

Synplicity has a product designed for ASIC verification using FPGAs
that can semi-automate the partitioning problem.  I have no experience
with the product.

Andy


Brannon wrote:
> > I am interested to learn more about techniques for design partition
> > across multiple FPGAs.
>
> Traditionally people have tried to come up with auto-partitioners that
> are somehow smart enough to split up connections between chips. The
> scope of that problem is too large. I propose you do it this way:
>
> First, you have to define a dataset as partitionable. You cannot break
> apart objects unless they are connected by this specific dataset that
> is allowed to be broken. You'll need some communication core that goes
> with that dataset on both ends of the transfer. Then your partition
> software will automatically insert those communication cores in after
> it decides to separate a certain line with the breakable dataset. Um.
> I'm not sure I'm describing this very well. Does that make sense?
>
> So for example, suppose you have a dataset that is made of some data
> bits, an enable bit, a clock, and a busy signal going the opposite
> direction. That dataset is breakable because you can send the data,
> clock, and enable to a fifo on the far chip; that fifo can send back an
> almost busy signal to stop data from being sent. A simpler case would
> be a control line that is stable ages before it is needed; your
> separation objects for those will just be buffers and pads.
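
The FIFO example in the quoted text depends on the "almost busy" threshold leaving enough headroom for words already in flight between the chips. Below is a rough Python sketch of that effect; the function `simulate_link`, its parameters, and the one-word-per-cycle sender are all assumptions made for illustration, not part of any partitioning tool.

```python
from collections import deque


def simulate_link(cycles, depth, almost_busy_margin, link_delay, drain_every=3):
    """Return the peak FIFO occupancy seen on the receiving chip.

    Hypothetical model: data and the busy flag each take `link_delay`
    (>= 1) cycles to cross between chips; the receiver drains one word
    every `drain_every` cycles.
    """
    fifo = 0
    peak = 0
    data_in_flight = deque([0] * link_delay)   # words travelling to the FIFO
    busy_in_flight = deque([0] * link_delay)   # busy flag travelling back
    for cycle in range(cycles):
        busy_at_sender = busy_in_flight.popleft()
        data_in_flight.append(0 if busy_at_sender else 1)
        fifo += data_in_flight.popleft()
        if cycle % drain_every == 0 and fifo > 0:
            fifo -= 1
        peak = max(peak, fifo)
        busy_in_flight.append(1 if fifo >= depth - almost_busy_margin else 0)
    return peak


depth = 16
print(simulate_link(200, depth, almost_busy_margin=4, link_delay=2))
print(simulate_link(200, depth, almost_busy_margin=0, link_delay=4))
```

In this model the FIFO stays within its depth when the margin covers the round trip of the link, and overruns when the flag is raised only once the FIFO is already full.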

Article: 105186
Subject: Re: 2048 input or gate ?
From: "rickman" <spamgoeshere4@yahoo.com>
Date: 17 Jul 2006 07:27:43 -0700

Symon wrote:
> "mk" <kal*@dspia.*comdelete> wrote in message
> news:574jb2h3o7cv3viul4ghq7j3gt2pen6l2d@4ax.com...
> > Hi everyone,
> > I am trying to 'or' a 2K vector in Virtex4. Looking at the problem as
> > a first approximation, it would need 6 levels of 4 input lookup
> > tables. So far I have tried XST but it seems to be using the initial
> > 512 LUT4s and then 56 levels of MUXCY. Any ideas why it's using the
> > MUXCYs? They seem to be quite fast at 45ns each but number of levels
> > is quite high. I'm curious what the timing would look like if I could
> > force it to use only LUT4s but I really don't want to code it by hand
> > and I am too lazy to write a perl script to do it either. Any
> > suggestions ?
> >
> > Thanks.
> >
> Your synthesiser is using the MUXCYs because it uses less resource (about
> 75% of the tree method) and is faster. If the MUXCY propagation delay was
> 45ns, I'd be worried, but it's really only 45ps! :-) If you build a tree,
> it'll be slower. It's not just the LUT delay, it's all that routing you need
> for a wide OR gate. To show it, you could try synthesising a 2k XOR gate.
> Your synthesiser might struggle to implement that with a carry structure.

Is that 45 ps per LUT of the carry or 45 ps per CLB in the carry chain?
 If I use the 56 elements that the OP said, I get 2.52 ns total carry
delay.  That is pretty remarkable if it is correct.

Increasing that to 45 ps per each of the 512 LUTs the carry delay is
still only 23.04 ns.  A combination approach combining say 16 LUTs with
the carry then using an 8 input OR gate should be a bit faster.  16
carries is about the same speed as a LUT.  I have not looked at the
Virtex 4 architecture so I don't know for sure if this is needed or if
the carry delay is 45 ps per CLB.
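
The two carry-delay figures above are simple products and can be reproduced as follows. This is a small Python sketch; the 45 ps per stage is the figure quoted in the thread and the helper name is made up.

```python
MUXCY_DELAY_PS = 45  # per-stage figure quoted in the thread


def chain_delay_ns(stages, per_stage_ps=MUXCY_DELAY_PS):
    """Total carry-chain delay in ns for a given number of stages."""
    return stages * per_stage_ps / 1000.0


print(chain_delay_ns(56))    # 56 stages as reported by the OP
print(chain_delay_ns(512))   # one stage for each of the 512 LUTs
```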

Article: 105187
Subject: Re: 2048 input or gate ?
From: "Symon" <symon_brewer@hotmail.com>
Date: 17 Jul 2006 16:50:02 +0200

"rickman" <spamgoeshere4@yahoo.com> wrote in message 
news:1153146462.943222.181680@b28g2000cwb.googlegroups.com...
> Symon wrote:
>
> Is that 45 ps per LUT of the carry or 45 ps per CLB in the carry chain?
> If I use the 56 elements that the OP said, I get 2.52 ns total carry
> delay.  That is pretty remarkable if it is correct.
>
Hi Rick,
Yes, that's 45ps per LUT. I believe the carry is actually implemented as a 
two bit look ahead, so that each CLB is a two bit carry with delay of 90ps. 
But, now you mention it, I don't understand the 56 levels thing.
>
> Increasing that to 45 ps per each of the 512 LUTs the carry delay is
> still only 23.04 ns.  A combination approach combining say 16 LUTs with
> the carry then using an 8 input OR gate should be a bit faster.  16
> carries is about the same speed as a LUT.  I have not looked at the
> Virtex 4 architecture so I don't know for sure if this is needed or if
> the carry delay is 45 ps per CLB.
>
Thinking about it a bit harder, and after reading your post, I reckon the 
synthesiser must be doing what you suggest, dividing the chain up into 
sections, and ORing together the outputs.
Cheers, Syms.

Article: 105188
Subject: Re: 2048 input or gate ?
From: "John_H" <johnhandwork@mail.com>
Date: Mon, 17 Jul 2006 16:39:53 GMT

"Symon" <symon_brewer@hotmail.com> wrote in message 
news:44bba39a$1_2@x-privat.org...
> "rickman" <spamgoeshere4@yahoo.com> wrote in message 
> news:1153146462.943222.181680@b28g2000cwb.googlegroups.com...
>> Symon wrote:
>>
>> Is that 45 ps per LUT of the carry or 45 ps per CLB in the carry chain?
>> If I use the 56 elements that the OP said, I get 2.52 ns total carry
>> delay.  That is pretty remarkable if it is correct.
>>
> Hi Rick,
> Yes, that's 45ps per LUT. I believe the carry is actually implemented as a 
> two bit look ahead, so that each CLB is a two bit carry with delay of 
> 90ps. But, now you mention it, I don't understand the 56 levels thing.
>>
>> Increasing that to 45 ps per each of the 512 LUTs the carry delay is
>> still only 23.04 ns.  A combination approach combining say 16 LUTs with
>> the carry then using an 8 input OR gate should be a bit faster.  16
>> carries is about the same speed as a LUT.  I have not looked at the
>> Virtex 4 architecture so I don't know for sure if this is needed or if
>> the carry delay is 45 ps per CLB.
>>
> Thinking about it a bit harder, and after reading your post, I reckon the 
> synthesiser must be doing what you suggest, dividing the chain up into 
> sections, and oring together the output.
> Cheers, Syms.

More specifically, the synthesizer is probably splitting into two levels of 
carry chains.  Rather than 512 LUTs feeding a carry chain that's 128 rows 
high (there are 2 carry chain paths in a CLB, 4 LUTs per carry chain), it can 
use 2 levels of carry chains, with the first at 5 MUXCY stages (32 inputs) and 
the second at 6 MUXCY stages (64 inputs, specifying 64 initial carry chains), 
so the delay ends up being shorter still.  The Tbyp value, by the way, is about 
103 ps in the Spartan3E (-5 speed grade) and corresponds to 2 LUTs' worth of 
carry chain, since the bypass is on a slice-by-slice basis.

*****
Dadgummit.  The 8.2.01i speedprint numbers for Tbyp don't match my Timing 
Analyzer numbers (which did seem to correspond in speedprint 8.1.03i).  I've 
submitted a case to Xilinx on this.
*****

In the Spartan3E -5 speed grade, for instance, using timing numbers from my 
8.2.01i Timing Analyzer (a mixed bag of SliceM and SliceL values, so the actual 
numbers will vary), the 6-level OR would end up as

Tcko+5*(Tnet+Tilo)+Tnet+Tfck
 = 0.567+6*Tnet+5*0.660+0.776
 = 4.643+6*Tnet
An average Tnet of 1 ns (a routing-to-logic split of 56% to 44%, which is much 
better than what I'd expect for a wide distribution of inputs) gives
 = 10.643 ns

While a single carry chain across 128 CLB rows would be
Tcko+Tnet+Topcyf+255*Tbyp+Tcinck
= 0.567+Tnet+1.011+255*(0.103)+0.518
= 28.361+Tnet
or probably under
= 29.361 ns

That is much worse than either the tree or 2 levels of carry chains, which 
would be

Tcko+Tnet+Topcyf+2*Tbyp+Tnet+Topcyf+2*Tbyp+Tcinck
= 0.567+Tnet+1.011+2*0.103+Tnet+1.011+2*0.103+0.518
= 3.519 + 2*Tnet
or around
= 5.519 ns

Two levels of carry chains use significantly fewer resources than an OR tree 
while the delay is about half what the tree would need.
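
The three delay estimates can be evaluated side by side. The following Python sketch simply encodes the sums as written in this post, using the Spartan-3E -5 figures quoted here; the function names are made up, and the 1 ns average Tnet is the same assumption used in the text.

```python
# Spartan-3E -5 figures quoted in the post, in ns.
TCKO, TILO, TFCK = 0.567, 0.660, 0.776
TOPCYF, TBYP, TCINCK = 1.011, 0.103, 0.518


def lut_tree(tnet, levels=6):
    # (levels - 1) LUT+net hops, then a final net into the LUT in front of the FF
    return TCKO + (levels - 1) * (tnet + TILO) + tnet + TFCK


def single_chain(tnet, bypasses=255):
    # One long carry chain: a single entry (Topcyf) plus a bypass per slice
    return TCKO + tnet + TOPCYF + bypasses * TBYP + TCINCK


def two_level_chain(tnet, bypasses_1=2, bypasses_2=2):
    # Two carry chains in series, joined by one net
    return (TCKO + tnet + TOPCYF + bypasses_1 * TBYP
            + tnet + TOPCYF + bypasses_2 * TBYP + TCINCK)


for name, model in [("LUT tree", lut_tree),
                    ("single carry chain", single_chain),
                    ("two-level carry chains", two_level_chain)]:
    print(f"{name}: {model(1.0):.3f} ns at Tnet = 1 ns")
```

Changing `tnet` shows how strongly the LUT tree depends on routing: it pays for six nets, while the two-level carry structure pays for two.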

The key to the number of carry chains the tool generates for the longest 
delay would be the number of Topcyf (or Topcyg) values in the path as 
reported by Timing Analyzer.

Ain't optimization fun?

Article: 105189
Subject: Re: 2048 input or gate ?
From: "Ben Jones" <ben.jones@xilinx.com>
Date: Mon, 17 Jul 2006 17:54:44 +0100


"Symon" <symon_brewer@hotmail.com> wrote in message
news:44bba39a$1_2@x-privat.org...

> Yes, that's 45ps per LUT. I believe the carry is actually implemented as a
> two bit look ahead, so that each CLB is a two bit carry with delay of 90ps.
> But, now you mention it, I don't understand the 56 levels thing.
> >
> > Increasing that to 45 ps per each of the 512 LUTs the carry delay is
> > still only 23.04 ns.  A combination approach combining say 16 LUTs with
> > the carry then using an 8 input OR gate should be a bit faster.  16
> > carries is about the same speed as a LUT.  I have not looked at the
> > Virtex 4 architecture so I don't know for sure if this is needed or if
> > the carry delay is 45 ps per CLB.
> >
> Thinking about it a bit harder, and after reading your post, I reckon the
> synthesiser must be doing what you suggest, dividing the chain up into
> sections, and oring together the output.

If you think about it just a tiny bit harder, the structure of the optimal
circuit comes down to an assessment of the relative performance of the LUT
delay + routing, and the carry chain delay.  Intuitively, the best circuit
will have minimal disparity between the fastest and slowest path. Say for
the sake of argument that four stages of carry-OR takes as long as one
LUT-OR. Then an extremely coarse rendition of the fastest circuit to do a
big OR will look a bit like this (L = LUT, ^ = carry-mux OR, inputs [not
shown] on left):

  L-L-L-^ (top (result))
  L-L-L-^
  L-L-L-^
  L-L-L-^
    L-L-^
    L-L-^
    L-L-^
    L-L-^
      L-^
      L-^
      L-^
      L-^ (bottom)

The further up the carry chain you get, the more the inputs to the carry-mux
elements are just "waiting around" for the carry propagation. Eventually it
reaches the point where you can squeeze in an extra level of LUTs in these
higher stages, and thus reduce the total size of the carry chain. Go further
up, and you can afford two extra levels, and so on. I'd hope that at least
some tools are clever enough to exploit this.

(Note: in reality, the ratio of LUT:CY speed in this context is somewhere in
the 12:1 to 16:1 ballpark for most Xilinx architectures.)

Hope this makes sense... perhaps someone can take it a step further and work
out where the 56 levels thing really comes from (and thus deduce what this
particular synthesis tool believes the LUT:CY speed ratio is!).
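
The staggered structure sketched in the diagram can be turned into a rough sizing model. This Python sketch is a deliberate simplification (routing delay is ignored, and each carry stage is assumed to merge the output of one LUT tree); the function names and the model itself are assumptions for illustration, not a description of what any synthesis tool does.

```python
def lut_levels_at(position, ratio):
    """LUT levels that fit in front of carry stage `position` (0 = bottom).

    `ratio` is the number of carry stages that take as long as one LUT level.
    """
    return 1 + position // ratio


def chain_length_for(n_inputs, ratio, lut_inputs=4):
    """Carry stages needed to OR `n_inputs` signals in this simplified model."""
    covered = 0
    position = 0
    while covered < n_inputs:
        covered += lut_inputs ** lut_levels_at(position, ratio)
        position += 1
    return position


for ratio in (4, 12, 16):
    print(ratio, chain_length_for(2048, ratio))
```

With a ratio of 4 the model reproduces the diagram: twelve stages cover 4*4 + 4*16 + 4*64 = 336 inputs. Larger ratios leave less slack per stage, so the chain has to be longer before extra LUT levels fit.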

Cheers,

        -Ben-

Article: 105190
Subject: EDK PowerPC ISS : download errors?
From: Sean <>
Date: Mon, 17 Jul 2006 10:36:08 -0700

Has anyone successfully used the PowerPC Instruction-Set Simulator packaged with EDK (8.1)? I can set it up and launch XMD, but whenever I try to download an elf file it lists all the parts of the file and then reports

"Failed to download ELF file

Unable to write to Sim"

Any subsequent attempts to download or run result in the error

"Error in Resetting Target"

I seem to be able to download just fine to hardware with XMD, but not everyone involved in the research I'm doing has access to the actual board, so it would be nice to have the simulator working, even if its functionality is limited.

Article: 105191
Subject: Re: 2048 input or gate ?
From: "Symon" <symon_brewer@hotmail.com>
Date: 17 Jul 2006 19:55:09 +0200

"Ben Jones" <ben.jones@xilinx.com> wrote in message 
news:e9gfcl$4mg1@cliff.xsj.xilinx.com...
>
>
> Hope this makes sense... perhaps someone can take it a step further and
> work out where the 56 levels thing really comes from (and thus deduce what
> this particular synthesis tool believes the LUT:CY speed ratio is!).
>
> Cheers,
>
>        -Ben-
>
>
Hi Ben,
Thanks for that, it made sense to me. I think we might need to know what 
part the design was in because the carry chain length is limited by the 
number of rows in the FPGA. Smaller parts have smaller maximum length 
chains. Also, as a BTW, I see from the datasheet that the ORCY structure 
that was in V2PRO has been dropped from the V4. That made wide gates even 
faster.
Cheers, Syms.

Article: 105192
Subject: ISE 8.2 WebPack does not support Virtex-5 at all?
From: "Antti Lukats" <antti@openchip.org>
Date: Mon, 17 Jul 2006 19:57:13 +0200

WebPack and SP1 are available, but it looks like the ISE WebPack does not 
support any Virtex-5 devices at all? I had assumed the smallest Virtex-5 would 
be supported. What a pity.

Antti

Article: 105193
Subject: Re: EDK PowerPC ISS : download errors?
From: "Antti Lukats" <antti@openchip.org>
Date: Mon, 17 Jul 2006 19:58:47 +0200

<Sean> schrieb im Newsbeitrag news:ee9d076.-1@webx.sUN8CHnE...
> Has anyone successfully used the PowerPC Instruction-Set Simulator 
> packaged with EDK (8.1)? I can set it up and launch XMD, but whenever I 
> try to download an elf file it lists all the parts of the file and then 
> reports
>
> "Failed to download ELF file
>
> Unable to write to Sim"
>
> Any subsequent attempts to download or run result in the error
>
> "Error in Resetting Target"
>
> I seem to be able to download just fine to hardware with XMD, but not 
> everyone involved in the research I'm doing has access to the actual 
> board, so it would be nice to have the simulator working, even if its 
> functionality is limited.

Same here: I tried once, got the same error, and gave up.

Antti

Article: 105194
Subject: Re: 2048 input or gate ?
From: "John_H" <johnhandwork@mail.com>
Date: Mon, 17 Jul 2006 18:04:33 GMT

"John_H" <johnhandwork@mail.com> wrote in message 
news:t3Pug.5676$Oh1.1853@news01.roc.ny...
<snip>

> Which is much worse than the tree or for 2 levels of carry chains which 
> would be
>
> Tcko+Tnet+Topcyf+2*Tbyp+Tnet+Topcyf+2*Tbyp+Tcinck
> = 0.567+Tnet+1.011+2*0.103+Tnet+1.011+2*0.103+0.518
> = 3.519 + 2*Tnet
> or around
> = 5.519 ns
>
> Two levels of carry chains use significantly fewer resources than an OR 
> tree while the delay is about half what the tree would need.
>
> The key to the number of carry chains the tool generates for the longest 
> delay would be the number of Topcyf (or Topcyg) values in the path as 
> reported by Timing Analyzer.
>
> Ain't optimization fun?

I thought through this too quickly.  The first stage in the example I was 
drawing out could do 64-wide ORs with the first carry chain which is 8 
slices or 7*Tbyp, not 2*Tbyp.  The second stage would be from 32 carry 
chains for 4 slices of MUXCY-based OR for 3*Tbyp, not 2*Tbyp so the timing 
would be more like 6.137 ns, still significantly better than the LUT tree.
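
The corrected bypass counts follow from the chain widths. A small Python sketch of that bookkeeping, assuming 4-input LUTs, two LUTs per slice, one Tbyp per slice boundary, and the same 1 ns average Tnet as before (the helper names are made up):

```python
TCKO, TOPCYF, TBYP, TCINCK = 0.567, 1.011, 0.103, 0.518  # ns, from the post


def bypass_count(or_width, lut_inputs=4, luts_per_slice=2):
    """Tbyp hops for a MUXCY-based OR of `or_width` signals."""
    slices = or_width // (lut_inputs * luts_per_slice)
    return slices - 1


def two_level_delay(width_1, width_2, tnet):
    """Delay of two carry-chain ORs in series, joined by one net."""
    return (TCKO + tnet + TOPCYF + bypass_count(width_1) * TBYP
            + tnet + TOPCYF + bypass_count(width_2) * TBYP + TCINCK)


print(bypass_count(64), bypass_count(32))
print(round(two_level_delay(64, 32, 1.0), 3))
```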

I missed the 56 elements mentioned initially; this is probably just poor 
partitioning, relying instead on a "maximum carry width" value.

I'd manually partition the OR into two sets based on the 2 levels of 
carries.  The generate can be used to shorthand the 32 intermediate values. 
The KEEP attribute may be what's needed in XST - I use syn_keep=1 in the 
Synplicity synthesizer.  This synthesized okay but I didn't put a wrapper 
around it to get into a physical part (2k I/O is too much for me).


module orlt(clk, in, out);
input clk;
input [2047:0] in;
output out;

reg [2047:0] inr;
reg out;
wire outw;

orl u0(inr, outw);

always @(posedge clk)
begin
        out <= outw;
        inr <= in;
end

endmodule

module orl(in, out);
input [2047:0] in;
output out;

(* KEEP *) wire [31:0] mid;
generate
  genvar i;
  for( i=0; i<32; i=i+1)
  begin : MUXCYtree
    assign mid[i] = |in[i*64 +: 64];
  end
endgenerate

wire out = |mid[31:0];

endmodule
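
As a quick sanity check on the partitioning, the two-stage OR can be modelled in software and compared with a flat OR. This is a behavioural Python sketch only (the function names are made up); it mirrors the `in[i*64 +: 64]` slices but says nothing about how the logic is mapped.

```python
import random


def flat_or(bits):
    """Reference: OR of every bit in the vector."""
    return int(any(bits))


def two_stage_or(bits, group=64):
    """Equivalent of mid[i] = |in[i*group +: group]; out = |mid."""
    mid = [int(any(bits[i:i + group])) for i in range(0, len(bits), group)]
    return int(any(mid)), mid


random.seed(0)
vector = [0] * 2048
vector[random.randrange(2048)] = 1   # a single asserted input
out, mid = two_stage_or(vector)
print(len(mid), out == flat_or(vector))
```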

Article: 105195
Subject: Re: ISE 8.2 WebPack does not support Virtex-5 at all?
From: "Tommy Thorn" <tommy.thorn@gmail.com>
Date: 17 Jul 2006 11:05:47 -0700

Antti, it's not nice to hijack Sean's thread.

Anyway, perhaps Austin's attitude, "I still have no idea why this
matters whatsoever", is the official Xilinx position (cf.
http://groups.google.com/group/comp.arch.fpga/tree/browse_frm/thread/d3a75da111b452a3/a852a6a48db9a88b?rnum=1&q=new+largest+&_done=%2Fgroup%2Fcomp.arch.fpga%2Fbrowse_frm%2Fthread%2Fd3a75da111b452a3%2F462b1ea94d885aa4%3Fq%3Dnew+largest+%26rnum%3D1%26#doc_a0535aeea2a09638)

Maybe ISE 9.0 will be better.

Tommy

Antti Lukats wrote:
> WebPack and SP1 are available but it looks like the ISE WebPack does not
> support any Virtex-5 devices at all? I did assume smallest Virtex-5 would be
> supported. what a pity.
> 
> Antti

Article: 105196
Subject: Re: ISE 8.2 WebPack does not support Virtex-5 at all?
From: "Antti Lukats" <antti@openchip.org>
Date: Mon, 17 Jul 2006 20:19:24 +0200


"Tommy Thorn" <tommy.thorn@gmail.com> schrieb im Newsbeitrag 
news:1153159547.039604.67790@b28g2000cwb.googlegroups.com...
> Antti, not nice to hijack Sean thread.
>
> Anyway is perhaps Austin's attitude "I still have no idea why this
> matters whatsoever" is the official Xilinx position (cf.
> http://groups.google.com/group/comp.arch.fpga/tree/browse_frm/thread/d3a75da111b452a3/a852a6a48db9a88b?rnum=1&q=new+largest+&_done=%2Fgroup%2Fcomp.arch.fpga%2Fbrowse_frm%2Fthread%2Fd3a75da111b452a3%2F462b1ea94d885aa4%3Fq%3Dnew+largest+%26rnum%3D1%26#doc_a0535aeea2a09638)
>
> Maybe ISE 9.0 will be better.
>
> Tommy
>
>
> Antti Lukats wrote:
>> WebPack and SP1 are available but it looks like the ISE WebPack does not
>> support any Virtex-5 devices at all? I did assume smallest Virtex-5 would
>> be supported. what a pity.
>>
>> Antti
>

I already apologized!

I haven't been able to post with Outlook Express for a while and I had 
forgotten that by hitting reply and changing the subject to a completely new 
one, the post still goes out as a reply. Silly, stupid me. Sorry again, it 
wasn't intentional.

Antti

Article: 105197
Subject: Re: ISE 8.2 WebPack does not support Virtex-5 at all?
From: "Antti Lukats" <antti@openchip.org>
Date: Mon, 17 Jul 2006 20:20:41 +0200

"Antti Lukats" <antti@openchip.org> schrieb im Newsbeitrag 
news:e9gj1o$pqj$1@online.de...
> WebPack and SP1 are available but it looks like the ISE WebPack does not 
> support any Virtex-5 devices at all? I did assume smallest Virtex-5 would 
> be supported. what a pity.
>
> Antti
>
Oops, I posted incorrectly as a reply. Sorry.
And another oops: 5 seconds ago I claimed that I had already said sorry, but 
that sorry was sent to the OP only, not as a reply to my wrong posting.

Antti

Article: 105198
Subject: Re: 2048 input or gate ?
From: mk <kal*@dspia.*comdelete>
Date: Mon, 17 Jul 2006 19:54:23 GMT

On Mon, 17 Jul 2006 18:04:33 GMT, "John_H" <johnhandwork@mail.com>
wrote:
...
>> Ain't optimization fun?
>
>I thought through this too quickly.  The first stage in the example I was 
>drawing out could do 64-wide ORs with the first carry chain which is 8 
>slices or 7*Tbyp, not 2*Tbyp.  The second stage would be from 32 carry 
>chains for 4 slices of MUXCY-based OR for 3*Tbyp, not 2*Tbyp so the timing 
>would be more like 6.137 ns, still significantly better than the LUT tree.
>
>I missed the 56 elements mentioned initially; this is probably just poor 
>partitioning, relying instead on a "maximum carry width" value.
>
>I'd manually partition the OR into two sets based on the 2 levels of 
>carries.  The generate can be used to shorthand the 32 intermediate values. 
>The KEEP attribute may be what's needed in XST - I use the syn_keep=1 in the 
>synplicity synthesizer.  This synthesized okay but I didn't put a wrapper 
>around it to get into a physiacl part (2k I/O is too much for me).
...

Thanks John and everyone else,
So far I tried all three options. It turns out a LUT4 tree is slightly
faster at 6.26ns than what XST comes up with (6.613ns), whereas the
number of LUT4s goes from 515 to 811. John's two-level LUT4+MUXCY, on the
other hand, has a delay of 4.94ns at 648 LUT4s.
For generating the LUT4 tree by hand, I used 5 different
generate statements with keeps on the outputs, which convinces XST to
give me what I wanted. By the way, the 32x64 vs 64x32 partition does not
make a difference, but 64x32 is very slightly larger.
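
The three reported results can be put on a common scale. A small Python sketch that normalises the figures from this post against the XST default (the dictionary labels are made up; the numbers are the ones reported above):

```python
results = {
    # name: (delay in ns, LUT4 count), as reported in the post
    "XST default": (6.613, 515),
    "LUT4 tree": (6.26, 811),
    "two-level LUT4+MUXCY": (4.94, 648),
}

base_delay, base_luts = results["XST default"]
for name, (delay, luts) in results.items():
    print(f"{name}: {delay / base_delay:.2f}x delay, {luts / base_luts:.2f}x LUT4s")

fastest = min(results, key=lambda name: results[name][0])
print("fastest:", fastest)
```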

Article: 105199
Subject: Re: 2048 input or gate ?
From: "John_H" <johnhandwork@mail.com>
Date: Mon, 17 Jul 2006 20:19:37 GMT

"mk" <kal*@dspia.*comdelete> wrote in message 
news:rs5lb2ltpdofqci81sr6hnvpjraub3rduc@4ax.com...
> Thanks John and everyone else,
> So far I tried all three options. It turns out a LUT4 tree is slightly
> faster at 6.26ns  than what XST comes up with (6.613ns) where as the
> number of LUT4s go from 515 to 811. John's two level LUT4+muxcy on the
> other hand has a delay of 4.94ns at 648 LUT4s.
> In terms of generating the LUT4 tree by hand, I used 5 different
> generate statements with keeps on the outputs which convinces XST to
> give me what I wanted. By the way 32x64 vs 64x32 partition does not
> make a difference but 64x32 is very slightly larger.

I would have thought the result would be 512+16+1 LUTs -- 2048/4 LUTs 
feeding 64 carry chains, 64/4 LUTs feeding the final carry chain, and 1 to 
register the carry at the top of the chain -- for 529 total, not 648.
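
The expected LUT count for a two-level carry structure can be checked mechanically. A minimal Python sketch of the counting rule described above, with a made-up helper name; it assumes every LUT uses all four inputs.

```python
def two_level_lut_count(n_inputs, n_chains, lut_inputs=4):
    """Expected LUTs for a two-level carry-chain OR, per the reasoning above."""
    first_level = n_inputs // lut_inputs    # LUTs feeding the first-level chains
    second_level = n_chains // lut_inputs   # LUTs feeding the final chain
    register_lut = 1                        # LUT used to register the carry out
    return first_level + second_level + register_lut


print(two_level_lut_count(2048, 64))   # 64 first-level chains
print(two_level_lut_count(2048, 32))   # 32 first-level chains
```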
_____

For the OR tree, rather than 5 generates you could be creative with one big 
wire and do one generate loop:

(*KEEP*) wire [681:0] ORs;  // 512+128+32+8+2 intermediate OR results
wire [2729:0] XtraWideOR = {ORs,inr};
generate
  genvar i;
  for( i=0; i<682; i=i+1)
  begin : ORtree
    assign ORs[i] = |XtraWideOR[i*4 +: 4];
  end
endgenerate
assign outw = |XtraWideOR[682*4 +: 2];
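
The indexing in the single-loop tree is easy to get wrong, so it may help to check it in software. This Python sketch mirrors the bit layout of `XtraWideOR = {ORs, inr}` (inputs in the low bits, OR results above them); the function name is made up and the loop order is only a convenience of the model, since in hardware these are just wires.

```python
def or_tree_single_loop(inputs):
    """Mirror the one-loop generate: XtraWideOR = {ORs, inr}."""
    n_or = 682                                  # 512+128+32+8+2
    wide = list(inputs) + [None] * n_or         # ORs occupy bits 2048..2729
    for i in range(n_or):
        group = wide[i * 4:i * 4 + 4]           # XtraWideOR[i*4 +: 4]
        assert None not in group                # every source bit is already defined
        wide[len(inputs) + i] = int(any(group))
    return int(any(wide[n_or * 4:n_or * 4 + 2]))  # XtraWideOR[682*4 +: 2]


zeros = [0] * 2048
print(or_tree_single_loop(zeros))
for bit in (0, 1000, 2047):
    vec = list(zeros)
    vec[bit] = 1
    print(bit, or_tree_single_loop(vec))
```

The internal assertion confirms that each 4-bit group only ever reads bits produced by an earlier layer, which is what makes the single flat vector work as a tree.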
