"Kevin Becker" <starbugs@gmx.net> wrote in message news:bbc55e92.0311070813.507ea083@posting.google.com... > Francisco Rodriguez wrote: > > As your numbers are 32-bit wide, and overflow condition is a flag > > about the result of the _whole_ v := v + i operation, it can be determined > > by the MSB bits only (that is, from columns 31 and 30 of the addition). > > Thank you! I think I used the wrong word "underflow". What I mean is: > if "i" is negative, it might be that abs(lo(i)) is greater than lo(v). Yes, underflow is the right word. It isn't used very often, though. > What happens then? An example: > > v = 1234 0100 hex > i = 0000 0101 hex > > The operation for the lo bits will result in FFFF and the carry will > be set. Then when I add the high bits, it will be 1235 when it should > actually be 1233. How do I save that the first operation was not an > overflow but a borrow? (I don't know how to call this. That's what I > mistakenly called underflow). Hmm. X'0100' + X'0101' is X'0201' with no overflow. If you want to add a small twos complement negative number, then that number will have F's in the high bits which will take care of the 1233 part. In that case, borrow means that there is no carry, otherwise there is a carry. Say v=X'12340100' and i is negative 257. Negative 257 is X'FFFFFEFF' X'0100' + X'FEFF' is X'FFFF' with no carry. X'1234'+X'FFFF' is X'1233' with carry. The carry into the high bit is also 1, so that there is no overflow or underflow. > Glen Herrmannsfeld wrote: > > After the high halfword add, you compare the carry out to the carry out of > > the sign bit to the carry in of the sign bit. If they are different then it > > is overflow or underflow. The value of such bit tells you which one. > > So does that mean I have to modify my architecture and set TWO flags? > A carry flag and a negative flag (if sign of last operation was > negative), and then the Add-With-Carry instruction would look at both? No, add with carry doesn't need to know. You only need the two flags at the end if you want to detect overflow and underflow. -- glenArticle: 62801
Took care of the dumb problem.. It seemed XST (5.2) does not like a 'wire
type' output port driven by a reg output (e.g. 'Q'). A 'wire type' output
driven by combinational logic is okay though. For example, I have

output [127:0] a;
reg [31:0] c,d,e,f;

a[31:0]  = c;
a[63:32] = d;
etc.

will cause some of the 'a' wires to get 'stuck'..

Article: 62802
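What follows is only a guess at the structure the poster describes (the
register updates are invented), but it shows one legal way to drive a
wire-type output port from regs: continuous assigns outside the always
block. The alternative is to declare the port itself as 'output reg' and
write it procedurally.

module concat_out (
    input  wire         clk,
    input  wire  [31:0] data_in,
    output wire [127:0] a
);
    reg [31:0] c, d, e, f;

    // the regs are updated procedurally, as in the original design
    always @(posedge clk) begin
        c <= data_in;
        d <= c;
        e <= d;
        f <= e;
    end

    // the port is a wire, so it is driven by continuous assigns,
    // not by procedural statements
    assign a[31:0]   = c;
    assign a[63:32]  = d;
    assign a[95:64]  = e;
    assign a[127:96] = f;
endmodule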
Hi,

I have a PCI-X core in an FPGA, so after power-on reset it takes some time
(the FPGA configuration) before the PCI-X core logic exists. Let's say it
is 2 seconds, for example. Only after this time is the logic realised and
able to respond to PCI configuration cycles.

My question is: after power on, when does the PCI-X host controller start
its enumeration process, i.e., reading configuration space?

Thanks in advance.

Regards,
Muthu

Article: 62803
I've always been amazed that at a big company there can be two coders sitting next to each other with outputs that vary by a factor of ten, and their pay varies by a factor of 5%. Companies seem to be very good at laying off large swaths of workers, but not at firing really useless ones. -Kevin "Ken Land" <kland1@neuralog1.com> wrote in message news:vqnf5oatba4n85@news.supernews.com... > I've been a programmer for over 15 yrs. I'm still a programmer and I > employ programmers in my company. > > Programmer output can vary (easily) by a factor of 10 from programmer to > programmer. (This is documented BTW - see "Rapid Development") > > If you are an average or above programmer and you are *actually writing > code*, your output is so incredibly high that overtime will almost always be > unecessary. Also, average to above average programmers *love* to write code > and would work extra hours just for the enjoyment if they didn't have > families to go home to. > > One more thing. As an employer/business owner - we have no incentive or > inherent desire for people to work unpaid overtime. We just need the work > done to keep the business moving forward. If you can do your part in 10 > hours/wk. great, if not then whatever it takes is what it takes. > > Ken > > > "Nial Stewart" <nial@spamno.nialstewart.co.uk> wrote in message > news:3fab93a1$0$12691$fa0fcedb@lovejoy.zen.co.uk... > > > > Phil Hays <SpamPostmaster@attbi.com> wrote in message > > news:3FAA5342.B1F91A03@attbi.com... > > > > > The current law makes salaried people not get paid overtime. If you > don't > > think > > > that is fair, you need to convince voters to elect people that will > change > > the > > > laws. > > > > Surely all the law says is that if you sign a contract of employment > > which say you don't get paid overtime, then you can't expect to get > > paid for overtime? > > > > It's up to you whether you sign in the first place. > > > > ? > > > > > > Nial > > > > > > > > > >Article: 62804
Hi, It is not quite as simple as that. In case you are using a conservative wire-load model, provided by the silicon vendor, and a healthy margin for clock jitter, scan flip-flop timing overhead and second order effects, as well as a conservative setting for environmental parameters (for example 100+ deg. celsius temperature and voltage 15% lower than nominal for the process you are using) then the results could be quite realistic. In case you are running DC with an optimistic setup than you could be off by way more than 20%. You need to provide further info about your setup in order to get a realistic answer to your question. Ljubisa Bajic ATI Technologies -------------- My opinions do not represent those of my employer -------------- jon@beniston.com (Jon Beniston) wrote in message news:<e87b9ce8.0311070140.5bc4afb@posting.google.com>... > > Now my question is: Is the ASIC speed result reliable? > > If it's from DC, then no. > > Since we didn't > > do P&R( we don't have tools and experiences ), I really doubt the > > timing report may be over optimistically estimated and not reliable. I > > was told something about "wire load model" and ours is automatically > > selected by the compiler. > > Knock off 20%, as you're likely to have a more realistic figure. > > If you're working at .13, you probably want to be using physical > synthesis rather than synthesis based on wire load models. > > JonArticle: 62805
How fast can you really get data in and out of an FPGA?
With current pin layouts it is possible to hook four (or maybe even five)
DDR memory DIMM modules to a single chip.

Let's say you can create memory controllers that run at 200MHz (as claimed
in an Xcell article), for a total bandwidth of

  5 (modules/FPGA) * 64 (bits/word) * 200e6 (cycles/sec) * 2 (words/cycle)
    * (1 byte / 8 bits) = 5 * 3.2GB/s = 16GB/s

Assuming an application that needs more BW than this, does anyone know a
way around this bottleneck? Is this a physical limit with current memory
technology?

Fernando

Article: 62806
"Denis Gleeson" <dgleeson-2@utvinternet.com> wrote in message news:184c35f9.0311070333.7a6acaae@posting.google.com... > Hi Chuck > > Many thanks for your input on my question. > I have used your code and it leaves me with just one problem in my > simulator that you may be able to advise me on. > > It is a warning that Net "/clear" does not set/reset > "/".../Store_trigger_Acquisition_Count_reg<0> > all other bits for Store_trigger_Acquisition_Count get the same > warning. > > The result is that the synthesis tool warns that no global set/reset > (GSR) net could be used in the design as there is not a unique net > that sets or resets all the sequential cells. > > I have modified your code to include the use of the clear signal to > set Store_Trigger_Acquisition_Count > to 0. This has had no effect. > > Any suggestions? > > always @ (ACB_Decade_Count_Enable or OUT_Acquisition_Count or clear) > if(clear) > Store_Trigger_Acquisition_Count <= 14'b0; > else Shouldn't that be 15'b0? Best regards, BenArticle: 62807
Hi Kevin

I would recommend you to search and study the instruction set of a
microprocessor with the support you're trying to implement in your
processor. Many processors have two different add instructions (with and
without carry) to support large integer arithmetic. The simplest I know of
is the 8031 8-bit microcontroller from Intel, Infineon, Dallas and many
other manufacturers. The first hit I found in google points to the page
http://www.rehn.org/YAM51/51set/instruction.shtml
It contains the description and some numeric examples for arithmetic
operations. Of interest are:

ADD  (A = A + x, carry is not used)
ADDC (A = A + x + carry)
SUBB (A = A - x - carry)

"Kevin Becker" <starbugs@gmx.net> wrote in message
news:bbc55e92.0311070813.507ea083@posting.google.com...
> Francisco Rodriguez wrote:
> > As your numbers are 32-bit wide, and overflow condition is a flag
> > about the result of the _whole_ v := v + i operation, it can be determined
> > by the MSB bits only (that is, from columns 31 and 30 of the addition).
>
> Thank you! I think I used the wrong word "underflow". What I mean is:
> if "i" is negative, it might be that abs(lo(i)) is greater than lo(v).
> What happens then? An example:
>
> v = 1234 0100 hex
> i = 0000 0101 hex
>
> The operation for the lo bits will result in FFFF and the carry will
> be set. Then when I add the high bits, it will be 1235 when it should
> actually be 1233. How do I save that the first operation was not an

No. I assume you're subtracting the numbers, as i is positive and the
result you mention is v-i. Then, take into account that subtraction is
performed by an adder in 2's complement arithmetic as follows:

v - i = v + 2's complement(i) = v + not(i) + 1

So the v-i operation is converted to 12340100 + FFFFFEFE + 1 = 1233FFFF
The low part is 0100 + FEFF = FFFF, the carry from the low part is _not_
set, so the high part is 1234 + FFFF = 1233

> overflow but a borrow? (I don't know how to call this. That's what I
> mistakenly called underflow).

Go to the mentioned page, you'll see the different descriptions for carry
(or borrow) and the overflow.

Carry/borrow is the cy-16 out of the add operation (remember there's no
subtraction circuit). It is the overflow flag if and only if you're using
unsigned arithmetic.

Overflow for signed arithmetic is cy-16 xor cy-15. When this xor gives you
1 it means you've obtained a positive result adding two negative numbers,
or a negative result adding two positives.

> Glen Herrmannsfeld wrote:
> > After the high halfword add, you compare the carry out to the carry out of
> > the sign bit to the carry in of the sign bit. If they are different then it
> > is overflow or underflow. The value of such bit tells you which one.
>
> So does that mean I have to modify my architecture and set TWO flags?
> A carry flag and a negative flag (if sign of last operation was
> negative), and then the Add-With-Carry instruction would look at both?

Never look at both. Your ALU must provide two flags, carry and overflow,
and two different add instructions, x+y and x+y+carry. Of course, every
add must update both flags. If you also provide set-carry/clear-carry
instructions, the pseudocode of the 32-bit operations would be:

for 32-bit additions:
  add  x-low,  y-low
  addc x-high, y-high

for 32-bit subtractions:
  set carry
  addc x-low,  not(y-low)
  addc x-high, not(y-high)

When the whole operation is finished, check carry/borrow if the 32-bit
numbers are unsigned, or the overflow flag if the 32-bit numbers are
signed. But not both. Don't check the flags after the low part.

> Thanks a lot!

Best regards
Francisco

Article: 62808
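To see the worked example end to end, here is a small self-checking Verilog
snippet (my own demo, not code from the thread) that runs
v - i = 12340100 - 00000101 through two 16-bit add-with-carry steps with
the carry set first, exactly as in the subtraction sequence above:

module sub32_demo;
    reg [31:0] v = 32'h12340100;
    reg [31:0] i = 32'h00000101;
    reg [16:0] lo, hi;   // 16-bit partial results plus carry out

    initial begin
        // set carry, addc low halves, then addc high halves
        lo = {1'b0, v[15:0]}  + {1'b0, ~i[15:0]}  + 1'b1;    // carry in = 1
        hi = {1'b0, v[31:16]} + {1'b0, ~i[31:16]} + lo[16];  // carry in = low carry out
        $display("result = %h%h, borrow = %b", hi[15:0], lo[15:0], ~hi[16]);
    end
endmodule

Any Verilog simulator should print result = 1233ffff with borrow = 0,
matching the hand calculation, and the flags are only looked at after the
high halfword.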
> > Doesn't really matter, good enough in this case, lets any potential
> > commercial user get the message loud & clear. If its a 600 target,
> > thats one very expensive Arm compared to real thing. For an opensource
> > cpu to be useable, it must be competitive in size, speed, power with
> > commercial cpus.
> >
> > johnjaksonATusaDOTcom
>
> Does anyone know when the arm license is going to expire?

I think the exception processing patent was filed around 90-92, so it's got
quite a bit of time left in it..

JonB

Article: 62809
Amontec Team <laurent.gauch@www.DELALLCAPSamontec.com> writes:

> Petter Gustad wrote:
> > In SVF files generated by impact there will be delay statements on the
> > form:
> >
> > // Loading device with a 'ferase' instruction.
> > ...
> > RUNTEST 15000000 TCK;
> >
> > What is the minimum delay as a result of this statement, i.e. what is
> > the assumed TCK frequency for impact generated SVF files?
> >
> > TIA
> > Petter
>
> In the Xilinx SVF file, the assumed TCK is the maximum TCK frequency of
> the device. Look the datasheet of the FPGA or your CPLD (between 10 to
> 40 MHz).

Hmmm. But when there's a chain of different devices, or even other brand
names than Xilinx... I guess impact will use the lowest speed in the chain
based upon the attribute in the BSDL files:

attribute TAP_SCAN_CLOCK of TCK : signal is (10.00e6,BOTH);

Is my assumption correct?

Petter
--
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Article: 62810
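For a sense of scale (my arithmetic, not from the thread): RUNTEST n TCK
specifies a cycle count, not a time, so the wall-clock delay depends on the
TCK rate the SVF player actually drives. 15,000,000 TCK comes to about
1.5 s at 10 MHz, roughly 0.45 s at 33 MHz, and a full 15 s at 1 MHz, which
is why the TCK frequency impact assumed when it sized the count matters if
your cable runs TCK slower than the device maximum.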
Hello,

Once the power becomes "good" there is a minimum 100 ms delay before the
RST# signal deasserts. Unless you are designing a 32-bit PCI card, you MUST
have the FPGA finished with bitstream loading before RST# deasserts. This
applies to any compliant PCI-X design regardless of bus width, and also to
any PCI design that is 64-bits wide.

The reason for this is that your FPGA design MUST be loaded so that it can
detect the busmode initialization pattern, which is broadcast at the
deassertion of RST#. If you miss this, you are in big trouble...

Once RST# is deasserted, you then have 2^25 or 2^27 cycles, depending on
the bus frequency, until the first configuration access to your device.

Eric

Article: 62811
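To put rough numbers on that last window (my arithmetic, not Eric's): 2^25
clocks is about 0.5 s at 66 MHz and 2^27 clocks is about 1 s at 133 MHz, so
the gap before the first configuration access is on the order of a second.
As Eric notes, though, a 2-second configuration time is still too long,
because the bitstream has to be finished before RST# deasserts, which may
come as soon as 100 ms after power good.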
If I remember right, this is a counter application.
Forget the 16-bit ALU and use a 32-bit counter instead, and avoid all this
headache. KISS.

Peter Alfke
==========================
Glen Herrmannsfeldt wrote:
>
> "Kevin Becker" <starbugs@gmx.net> wrote in message
> news:bbc55e92.0311070813.507ea083@posting.google.com...
> > Francisco Rodriguez wrote:
> > > As your numbers are 32-bit wide, and overflow condition is a flag
> > > about the result of the _whole_ v := v + i operation, it can be determined
> > > by the MSB bits only (that is, from columns 31 and 30 of the addition).
> >
> > Thank you! I think I used the wrong word "underflow". What I mean is:
> > if "i" is negative, it might be that abs(lo(i)) is greater than lo(v).
>
> Yes, underflow is the right word. It isn't used very often, though.
>
> > What happens then? An example:
> >
> > v = 1234 0100 hex
> > i = 0000 0101 hex
> >
> > The operation for the lo bits will result in FFFF and the carry will
> > be set. Then when I add the high bits, it will be 1235 when it should
> > actually be 1233. How do I save that the first operation was not an
> > overflow but a borrow? (I don't know how to call this. That's what I
> > mistakenly called underflow).
>
> Hmm. X'0100' + X'0101' is X'0201' with no overflow. If you want to add a
> small twos complement negative number, then that number will have F's in
> the high bits which will take care of the 1233 part. In that case, borrow
> means that there is no carry, otherwise there is a carry. Say v=X'12340100'
> and i is negative 257. Negative 257 is X'FFFFFEFF'
>
> X'0100' + X'FEFF' is X'FFFF' with no carry. X'1234'+X'FFFF' is X'1233'
> with carry. The carry into the high bit is also 1, so that there is no
> overflow or underflow.
>
> > Glen Herrmannsfeld wrote:
> > > After the high halfword add, you compare the carry out to the carry out of
> > > the sign bit to the carry in of the sign bit. If they are different then it
> > > is overflow or underflow. The value of such bit tells you which one.
> >
> > So does that mean I have to modify my architecture and set TWO flags?
> > A carry flag and a negative flag (if sign of last operation was
> > negative), and then the Add-With-Carry instruction would look at both?
>
> No, add with carry doesn't need to know. You only need the two flags at
> the end if you want to detect overflow and underflow.
>
> -- glen

Article: 62812
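For the record, a fabric counter of that width is tiny -- something like
the sketch below (signal names invented), with the dedicated carry chain
handling the full 32 bits:

module count32 (
    input             clk,
    input             clear,
    input             enable,
    output reg [31:0] count
);
    always @(posedge clk)
        if (clear)       count <= 32'b0;
        else if (enable) count <= count + 1;
endmodule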
Kevin Neilson wrote: > > I've always been amazed that at a big company there can be two coders > sitting next to each other with outputs that vary by a factor of ten, and > their pay varies by a factor of 5%. Companies seem to be very good at > laying off large swaths of workers, but not at firing really useless ones. > -Kevin And some companies are very good at promoting and throwing great fistfuls of cash at coders with outputs of 100x the average who can also solve other technical problems. It's really hard to fire a useless person without being able to prove in court that they guy really IS useless, was given the appropriate number of chances to remedy his uselessness, and that the company bent over backwards to keep him gainfully employed in spite of his limitations, especially if said useless person is a member of some EEO "protected" class. You have problems even if you give such a person a charity layoff and a few months of severance pay. Carry on... -- Cheers, Bev +++++++++++++++++++++++++++++++++++++++++++++++++ "I don't care who your father is! Drop that cross one more time and you're out of the parade!"Article: 62813
Hi,

I'm trying to capture a video frame with the camera included on the RC200E
board from Celoxica. I'm basing my design on the PAL example VideoIn.

1. I capture a block of 640x9 pixels and store it in a RAM. (I capture the
   pixels with PalVideoInRead.)
2. I copy this block to a PalFrameBuffer.
3. I return to step 1 to capture the next block of the video-in frame.

In simulation it works, but when I download the bit stream to the RC200E
board, the result is wrong. I lose a few lines: the image displayed on the
monitor is shorter than it should be, so 3 or 4 frames end up displayed on
the monitor, each displaced upwards.

The idea is to capture a block of rows, do some processing, and then move
the resulting pixels to the PalFrameBuffer. I have read the PAL API
Reference manual many times and studied the examples, and the only cause I
can think of is that the PalVideoInRead function should be called
repeatedly without delay in order to be sure of not missing pixels.

Any ideas or suggestions? I'll appreciate whatever comments.

Thanks
Gerardo Sosa

P.D. If somebody needs to see my code, e-mail me.

Article: 62814
David Gesswein wrote:
> I tried the Xilinx support line Case # 503586 and haven't gotten a good
> answer so though I would try here.
>
> I am trying to interface to a ZBT SRAM from a Virtex II and was trying to
> do like in xapp 136 which used 2 DCM's to generate a internal FPGA and an
> external board clock using external DCM feedback that are aligned. That
> configuration in simulation (5.2i sp3) shows the external clock is .5ns
> delayed from the internal clock. We actually need to use a third DCM FX
> output to generate the clock for the SRAM. When we do that the external
> clock is now leading the internal clock by 1 ns. I didn't understand why
> what clock feeds both DCM's would change the timing and since our timing
> is tight I need them to be closely aligned.
>
> The external clock is output using a DDR FF and I used the DCM wizard which
> should of put in all the problem bufg/ibufg etc which the V2 users guide
> says are needed if it is going to compensate for the pad to DCM delay.
>
> I also think only 2 DCM's are needed, 1 to generate the internal clock using
> FX and a second to generate the deskewed external clock. That configuration
> seems to generate the same timing as the 3 DCM version.
>
> Anybody know the correct solution?

Howdy David,

I'm not quite clear on what you are using the FX output for, but I've used
ZBT SRAMs on a number of designs over the past couple of years. Here is how
I do it: feed the input/reference clock into two DCMs. The CLKFB of one is
just the output of the global buffer. The output of the other DCM goes off
chip (using a DDR FF doesn't get you anything... the deskew function takes
the delay out). Put two resistors on the output pin (keeping the resistors
as close as possible to the FPGA), and route from one resistor to the ZBT.
The other resistor routes to a GCLK input pin and feeds the CLKFB pin. As
long as you keep the two "long" traces (the outputs of the two resistors)
close to the same length, you won't get reflections, and the DCM will
remove the skew so that the rising edge of the clock arrives at the ZBT
around the same time as the rising edge occurs inside the FPGA.

As another poster mentioned, if you use external feedback, Xilinx
recommends (or used to) that you hold the second DCM in reset for a long
while so that the clock has time to propagate off chip and back into the
feedback pin.

As for how you can compare the output clocks of the two, do you really need
to? What you care about is alignment of the ZBT clock and the address or
data bus transitions coming from the FPGA (which are a function of the
internal clock).

Good luck,
Marc

Article: 62815
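A rough Verilog sketch of the two-DCM arrangement Marc describes is below.
It assumes the Virtex-II DCM/BUFG/IBUFG/OBUF primitives from the unisim
library; the pin names, parameters and port hookup are from memory and
invented for illustration, so check the Virtex-II user guide (and Marc's
note about holding the externally-fed-back DCM in reset longer) before
copying anything.

module zbt_clocking (
    input  clk_pad,      // reference clock input pin
    input  rst,
    input  clkfb_pad,    // board-level feedback trace back into a GCLK pin
    output zbt_clk_pad,  // clock output pin routed (via resistor) to the ZBT
    output clk_int,      // deskewed internal clock for the FPGA logic
    output locked
);
    wire clk_ibufg, clkfb_ext, clk0_int, clk0_ext;
    wire locked_int, locked_ext;

    IBUFG ibufg_clk (.I(clk_pad),   .O(clk_ibufg));
    IBUFG ibufg_fb  (.I(clkfb_pad), .O(clkfb_ext));

    // DCM 1: internal feedback, gives a zero-delay copy of the clock
    // to the fabric
    DCM #(.CLK_FEEDBACK("1X")) dcm_int (
        .CLKIN(clk_ibufg), .CLKFB(clk_int), .RST(rst),
        .CLK0(clk0_int), .LOCKED(locked_int),
        .DSSEN(1'b0), .PSEN(1'b0), .PSINCDEC(1'b0), .PSCLK(1'b0)
    );
    BUFG bufg_int (.I(clk0_int), .O(clk_int));

    // DCM 2: external feedback through the board trace, so the clock
    // edge arriving at the ZBT lines up with clk_int inside the FPGA
    DCM #(.CLK_FEEDBACK("1X")) dcm_ext (
        .CLKIN(clk_ibufg), .CLKFB(clkfb_ext), .RST(rst),
        .CLK0(clk0_ext), .LOCKED(locked_ext),
        .DSSEN(1'b0), .PSEN(1'b0), .PSINCDEC(1'b0), .PSCLK(1'b0)
    );
    OBUF obuf_zbt (.I(clk0_ext), .O(zbt_clk_pad));

    assign locked = locked_int & locked_ext;
endmodule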
News info below. Automotive customers tend to be tough on reliability, and
on standard supply voltages.

Of interest in this release are
+ 0.13u/150MHz core, but they manage to deliver 5V I/O, ADCs etc
  [ FPGA vendors could learn from this ]
+ Comment on error correcting FLASH

Not mentioned here, but also noted, is the trend to require a Vpp or PGM
enable pin on Automotive FLASH parts. Seems to be a concern about shipping
a part that MIGHT be able to re-program its own flash ?

- jg

Motorola news item :

"Based on 0.13-micron design rules, the MPC5554 chip operates at speeds of
50 to 150MHz. Though the design rules are advanced, Motorola made the part
so that its I/O and ADC will run at 5V, which automakers often prefer. The
company also said it designed the flash memory to be more reliable by
adding error correcting code. The flash is built to retain data for 20
years and withstand 100,000 read/erase cycles. The first MPC5554 will
include 2 Mbytes of flash, and the company is planning to come out with a
4Mbyte version next year, Cornyn said."

Article: 62816
Followup to: <3FAC46F0.31F9B374@myrealbox.com> By author: The Real Bev <bashley@myrealbox.com> In newsgroup: comp.arch.fpga > > Kevin Neilson wrote: > > > > I've always been amazed that at a big company there can be two coders > > sitting next to each other with outputs that vary by a factor of ten, and > > their pay varies by a factor of 5%. Companies seem to be very good at > > laying off large swaths of workers, but not at firing really useless ones. > > -Kevin > > And some companies are very good at promoting and throwing great > fistfuls of cash at coders with outputs of 100x the average who can also > solve other technical problems. > > It's really hard to fire a useless person without being able to prove in > court that they guy really IS useless, was given the appropriate number > of chances to remedy his uselessness, and that the company bent over > backwards to keep him gainfully employed in spite of his limitations, > especially if said useless person is a member of some EEO "protected" > class. You have problems even if you give such a person a charity > layoff and a few months of severance pay. > What's much worse than deadwood are people who are active obstructionists. They can also be really hard to get rid of, unfortunately. -hpa -- <hpa@transmeta.com> at work, <hpa@zytor.com> in private! If you send me mail in HTML format I will assume it's spam. "Unix gives you enough rope to shoot yourself in the foot." Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64Article: 62817
Followup to: <bbc55e92.0311061129.28af9d44@posting.google.com> By author: starbugs@gmx.net (Kevin Becker) In newsgroup: comp.arch.fpga > > I'm designing a processor for one specific application and in my > software I have need a counter. I have a problem figuring out how to > make Add-with-carry work for this. > > I want to do v := v + i. > v and i are both 32 bit values, my ALU is 16 bits wide. > Everything is 2-complement. > > I would add the lower 16 bits, then add the higher 16 bits with carry. > My problem: "i" may be positive or negative, so there are 3 things > that can occur: > - overflow > - underflow > - none of those > > If I have only one carry bit, those 3 possibilities cannot be > represented. Am I right that in such an architecture it is impossible > to achieve what I want? How do I have to change my ALU in order to do > that? And how do I handle the sign bits in the "middle" of the 32 bit > values? If possible, I would like to avoid an additional comparison > and use only flags. > No, you're not correct. What you're doing wrong is simply failing to recognize the fundamental reason why 2's complement is so ubiquitous: ADDITION AND SUBTRACTION OF 2'S COMPLEMENT NUMBERS IS IDENTICAL TO THE SAME OPERATIONS ON UNSIGNED NUMBERS Therefore, you don't care if you got overflow or underflow -- they are both represented by carry out. In other words, build your ALU just as if "v" and "i" were unsigned numbers, and everything is good. -hpa -- <hpa@transmeta.com> at work, <hpa@zytor.com> in private! If you send me mail in HTML format I will assume it's spam. "Unix gives you enough rope to shoot yourself in the foot." Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64Article: 62818
fortiz80@tutopia.com (Fernando) wrote in message news:<2658f0d3.0311071117.3bf6eaea@posting.google.com>... > How fast can you really get data in and out of an FPGA? > With current pin layouts it is possible to hook four (or maybe even > five) DDR memory DIMM modules to a single chip. > > Let's say you can create memory controllers that run at 200MHz (as > claimed in an Xcell article), for a total bandwidth of > 5(modules/FPGA) * 64(bits/word) * 200e6(cycles/sec) * (2words/cycle) * > (1byte/8bits)= > 5*3.2GB/s=16GB/s > > Assuming an application that needs more BW than this, does anyone know > a way around this bottleneck? Is this a physical limit with current > memory technology? > > Fernando OTOH If you want more bandwidth than DDR DRAM, you could go for RamBus, RLDRAM or the other NetRam or whatever its called. The RLDRAM devices separate the I/Os for pure bandwidth, no turning the bus or clock around nonsense and reduce latency from 60-80ns range down to 20ns or so, that is true RAS cycle. Micron & Infineon do the RLDRAM, another group does the NetRam (Hynix, Samsung maybe). The RLDRAM can run the bus upto 400MHz, double pumped to 800MHz and can use most every cycle to move data 2x and receive control 1x. It is 8 ways banked so every 2.5ns another true random access can start to each bank once every 20ns. The architecture supports 8,16,32-36 bit width IOs IIRC. Sizes are 256M now. I was quoted price about $20 something, cheap for the speed, but far steeper than PC ram. Data can come out in 1,2 or 4 words per address. Think I got all that right. Details on Micron.com. I was told there are Xilinx interfaces for them, I got docs at Xilinx but haven't eaten them yet. They also have interfaces for the RamBus & NetRam. AVNET (??) also has a dev board with couple of RLDRAM parts on them connected to a Virtex2 part, but I think these are the 1st gen RLDRAM parts which are 250MHz 25ns cycle so the interface must work. Anyway, I only wish my PC could use them, I'd willingly pay mucho $ for a mobo that would use them but that will never happen. I quite fancy using one for FPGA cpu, only I could probably keep 8 nested cpus busy 1 bank each since cpus will be far closer to 20ns cycle than 2.5ns. The interface would then be a mux-demux box on my side. The total BW would far surpass any old P4, but the latency is the most important thing for me. Hope that helps johnjakson_usa_comArticle: 62819
"H. Peter Anvin" <hpa@zytor.com> wrote in message news:bohuvm$4v4$1@cesium.transmeta.com... > Followup to: <bbc55e92.0311061129.28af9d44@posting.google.com> > By author: starbugs@gmx.net (Kevin Becker) > In newsgroup: comp.arch.fpga > > > > I'm designing a processor for one specific application and in my > > software I have need a counter. I have a problem figuring out how to > > make Add-with-carry work for this. (snip) > No, you're not correct. What you're doing wrong is simply failing to > recognize the fundamental reason why 2's complement is so ubiquitous: > > ADDITION AND SUBTRACTION OF 2'S COMPLEMENT NUMBERS IS IDENTICAL TO > THE SAME OPERATIONS ON UNSIGNED NUMBERS > > Therefore, you don't care if you got overflow or underflow -- they are > both represented by carry out. > > In other words, build your ALU just as if "v" and "i" were unsigned > numbers, and everything is good. This is true, except for generating the flags on the final add. Well, you can either generate all the flags, or only the signed or unsigned flags. For the intermediate adds only the carry, or lack of carry, from the high bit is important. To detect signed overflow or underflow (more negative than can be represented) requires comparing the carry into and out of the sign bit. -- glenArticle: 62820
Austin Lesea <Austin.Lesea@xilinx.com> wrote in message news:<3FABC4A2.E78A6D86@xilinx.com>... > Yu Jun, > > Knock off 20% for .13u from schematic to RC extracted. > > Also depends on what the foundry actually supports: is this based on lo-k > dieletric? > > If not, that will take you down another 5%. > > The Virtex II Pro IBM405PPC runs at 450 MHz, so I would expect any well > designed and semi-custom layout uP to be at least that fast in .13u. > > Austin > > Yu Jun wrote: > > > I'm working on a cpu core and intend to embed it into ASIC circuits, > > with the aim to do some network processing. Now the FPGA prototype is > > running and a 66M speed is achieved( xilinx virtexII-4 ). Wondering > > how fast it can run in ASIC, we had our ASIC guys to synthesize the > > codes and the result was shocking, it reached 400M! Far beyond our > > expectation of 150M. The library we used was of 0.13u, from TI, fairly > > fast, in which a NAND gate is around 0.03ns. > > > > Now my question is: Is the ASIC speed result reliable? Since we didn't > > do P&R( we don't have tools and experiences ), I really doubt the > > timing report may be over optimistically estimated and not reliable. I > > was told something about "wire load model" and ours is automatically > > selected by the compiler. > > > > Anybody can give me some hints or direct me to some documents will be > > very appreciated! Thank you very much. > > > > yu jun > > > > yujun@huawei.com Your surprise really reflects that your design is not Blockram limited but gate/logic level limited where ASICs will stay about 5x faster or more. If you were not going to ASIC, your design might be considered slow since you could push any Blockrams to 200MHz or so, but then it is very difficult to do much cpu logic with only a few LUT levels per cycle. MicroBlaze (at 120MHz)is probably limited to multiplier delay as well as cpu logic levels long before hitting BlockRam limit, and I am sure its hand placed where needed to boot. For those designs that are truly Blockram limited, an ASIC memory won't be much faster than BlockRams for the same architecture spec & process, they are also likely made by same foundry on similar process. Ofcourse ASICs can offer custom compiled SRAMs to get a bit more speed and they do allow 5x more logic layers in that cycle. The note of 30ps nand gates, that compares to 3GHz P4 cycle of 330ps or about 10 gate delays. Although I am sure Intel doesn't use many gates as we know them but various high speed pass logic schemes so they are using much shorter transit times. Also SRAMs have for decades had access times of about 10 gate delays too. And the old supercomputer designers used to clock cpus in 10 ECL layers of dotted logic, so I figure 10 Lut levels is fair enough cycle target. Luckily the carry chains we need are not done by Lut level logic or we would be really ____ed, but then we deal with switched wires instead. johnjakson_usa_comArticle: 62821
Hi Goran > > The new instruction in MicroBlaze for handling these locallinks are > simple but there is no HW scheduler in MicroBlaze. I have done processor > before with complete Ada RTOS in HW but it would be an overkill in a FPGA: > .. now that sounds like something we could chat about for some time. An Ada RTOS in HW certainly would be heavy, but the Occam model is very light. The Burns book on Occam compares them, the jist being that ADA has something for everybody, and Occam is maybe too light. Anyway they both rendezvous. At the beginning of my Inmos days we were following ADA and the iAPX32 very closely to see where concurrency on other cpus might go (or not as the case turned out). Inmos went for simplicity, ADA went for complexity. Thanks for all the gory details. > The Locallinks for MicroBlaze is 32-bit wide so they are not serial. > They can handle a new word every clock cycle. > > You could also connect up a massive array of MicroBlaze over FSL ala > transputer but I think that the usage of the FPGA logic as SW > accelarators will be a more popular way since FPGA logic can be many > magnitudes faster than any processor and with the ease of interconnect > as the FSL provides it will be the most used case. > I am curious what the typ power useage of MicroBlaze is per node, and has anybody actually tried to hook any no of them up. If I wanted large no of cpus to work on some project that weren't Transputers, I might also look at PicoTurbo, Clearspeed or some other BOPSy cpu array, but they would all be hard to program and I wouldn't be able to customize them. Having lots of cpus in FPGA brings up the issue of how to organize memory hierarchy. Most US architects seem to favor the complexity of shared memory and complicated coherent caches, Europeans seem to favor strict message passing (as I do). We agree that if SW can be turned into HW engines quickly and obviously, for the kernals, sure they should be mapped right onto FPGA fabric for whatever speed up. That brings up some points, 1st P4 outruns typ FPGA app maybe 50x on clockspeed. 2nd converting C code to FPGA is likely to be a few x less efficient than an EE designed engine, I guess 5x. IO bandwidth to FPGA engine from PC is a killer. It means FPGAs best suited to continuous streaming engines like real time DSP. When hooked to PC, FPGA would need to be doing between 50-250x more work in parallel just to be even. But then I thinks most PCs run far slower than Intel/AMD would have us believe because they too have been turned into streaming engines that stall on cache misses all too often. But SW tends to follow 80/20 (or whatever xx/yy) rule, some little piece of code takes up most of the time. What about the rest of it, it will still be sequential code that interacts with the engine(s). We would still be forced to rewrite the code and cut it with an axe and keep one side in C and one part in HDL. If C is used as a HDL, we know thats already very inefficient compared to EE HDL code. The Transputer & mixed language approach allows a middle road between the PC cluster and raw FPGA accelerator. It uses less resources than cluster but more than the dedicated accelerator. Being more general means that code can run on an array of cpus can leave decision to commit to HW for later or never. The less efficient approach also sells more FPGAs or Transputer nodes than one committed engine. In the Bioinformatics case, a whole family of algorithms need to be implemented, all in C, some need FP. 
An accelerator board that suits one problem may not suit others, so does Bio guy get another board, probably not. TimeLogic is an interesting case study, the only commercial FPGA solution left for Bio. My favourite candidate for acceleration is in our own backyard, EDA, esp P/R, I used to spend days waiting for it to finish on much smaller ASICs and FPGAs. I don't see how it can get better as designs are getting bigger much faster than pentium can fake up its speed. One thing EDA SW must do is to use ever increasingly complex algorithms to make up the short fall, but that then becomes a roadblock to turning it to HW so it protects itself in clutter. Not as important as the Bio problem (growing at 3x Moores law), but its in my backyard. rant_mode_off Regards johnjakson_usa_comArticle: 62822
Mario Trams <Mario.Trams@informatik.tu-chemnitz.de> wrote in message
>
> Hi John,
>
> do you know about this nice stuff developed by Cradle
> (http://www.cradle.com) ?
>
> They have developed something like an FPGA. But the PFUs
> do not consist of generic logic blocks but small processors.
> That's perhaps something you would like :-)
>
> Regards,
> Mario

Thanks for pointer, I hadn't seen it yet, will take a peek.

Article: 62823
Aaaah, now I got it!

The problem was that I read somewhere that a SUB instruction generates a
carry flag when subtracting a negative number and the result becomes too
big. So I automatically assumed somehow that an ADD instruction also
generates a carry flag when adding a negative number (which is wrong). I
also forgot that when I have a negative number, the high halfword will also
be FFFF. I thought it would be zero because abs(i) is small enough to fit
into the low halfword, but due to the sign it is NOT zero.

Peter: I am not using the counter macro because this operation is one of
many operations in an algorithm and the value needs to be in the RAM which
is only connected to the processor.

Thanks to everybody who helped me out with this.

Article: 62824
Lots of good points in your reply, here is why I think these technologies don't apply to problem that requires large and fast memory. RLDRAM: very promising, but the densities do not seem to increase significantly over time (500Mbits now ~ 64MB). To the best of my knowledge, nobody is making DIMMS with these chips, so they're stuck as cache or network memory. RDRAM (RAMBUS): as you said, only the slowest parts can be used with FPGAs because of the very high frequency of the serial protocol. The current slowest RDRAMs run at 800 MHz, a forbidden range for FPGAs (Xilinx guys, please jump in and correct me if I'm wrong) Am I missing something? Are there any ASICs out there that interface memory DIMMS and FPGAs? Is there any way to use the rocket I/Os to communicate with memory chips? or maybe a completely different solution to the memory bottleneck not mentioned here? johnjakson@yahoo.com (john jakson) wrote in message news:<adb3971c.0311072139.6dab6951@posting.google.com>... > fortiz80@tutopia.com (Fernando) wrote in message news:<2658f0d3.0311071117.3bf6eaea@posting.google.com>... > > How fast can you really get data in and out of an FPGA? > > With current pin layouts it is possible to hook four (or maybe even > > five) DDR memory DIMM modules to a single chip. > > > > Let's say you can create memory controllers that run at 200MHz (as > > claimed in an Xcell article), for a total bandwidth of > > 5(modules/FPGA) * 64(bits/word) * 200e6(cycles/sec) * (2words/cycle) * > > (1byte/8bits)= > > 5*3.2GB/s=16GB/s > > > > Assuming an application that needs more BW than this, does anyone know > > a way around this bottleneck? Is this a physical limit with current > > memory technology? > > > > Fernando > > OTOH > > If you want more bandwidth than DDR DRAM, you could go for RamBus, > RLDRAM or the other NetRam or whatever its called. The RLDRAM devices > separate the I/Os for pure bandwidth, no turning the bus or clock > around nonsense and reduce latency from 60-80ns range down to 20ns or > so, that is true RAS cycle. > > Micron & Infineon do the RLDRAM, another group does the NetRam (Hynix, > Samsung maybe). > > The RLDRAM can run the bus upto 400MHz, double pumped to 800MHz and > can use most every cycle to move data 2x and receive control 1x. It is > 8 ways banked so every 2.5ns another true random access can start to > each bank once every 20ns. The architecture supports 8,16,32-36 bit > width IOs IIRC. Sizes are 256M now. I was quoted price about $20 > something, cheap for the speed, but far steeper than PC ram. Data can > come out in 1,2 or 4 words per address. Think I got all that right. > Details on Micron.com. I was told there are Xilinx interfaces for > them, I got docs at Xilinx but haven't eaten them yet. They also have > interfaces for the RamBus & NetRam. AVNET (??) also has a dev board > with couple of RLDRAM parts on them connected to a Virtex2 part, but I > think these are the 1st gen RLDRAM parts which are 250MHz 25ns cycle > so the interface must work. > > Anyway, I only wish my PC could use them, I'd willingly pay mucho $ > for a mobo that would use them but that will never happen. I quite > fancy using one for FPGA cpu, only I could probably keep 8 nested cpus > busy 1 bank each since cpus will be far closer to 20ns cycle than > 2.5ns. The interface would then be a mux-demux box on my side. The > total BW would far surpass any old P4, but the latency is the most > important thing for me. > > Hope that helps > > johnjakson_usa_com