I'm pretty sure I'll be time-multiplexing the register file. I at least have to multiplex the write port. Depending on timing and space constraints, I may multiplex the read ports, clone the register file, or do some of both. This shouldn't be _too_ much of a problem. The Distributed RAM in the Xilinx FPGAs is pretty fast, and the register file shouldn't be on a critical path. The register file write data will come straight from the Re-Order Buffer with little or no logic in-between. The register file read data, however, will likely have more logic between it and a flop. If multiplexing the read ports puts it on the critical path, then I can always clone the register file for each read port. It's most likely that the bypass logic will be on the critical path, so this shouldn't be an issue. (On second thought: I may need to register the output of the register file, as it will essentially be clocked at 4x the system clock...) I think that's the only data structure that I will need to time-multiplex. I've been analyzing the various data structures for instruction scheduling and it looks like I will be able to neatly partition their write ports into segments based on instruction issue slots. This will take up marginally more logic than regular Distributed RAM (just a couple of LUTs if it's placed well and uses built-in MUXes), and will be about the same speed. On the other hand, the write-back bus will need to properly route results to the correct partitions, potentially limiting throughput.Article: 102026
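As a rough illustration of the read-port cloning the post above talks about, here is a generic Verilog sketch (module and signal names are made up, not from the actual design): two identical distributed-RAM copies share one write port, and each copy serves one read port, so no read-port multiplexing is needed.

// Two-read, one-write register file built from two cloned copies.
// Each copy infers a small distributed RAM; both see the same writes,
// so their contents stay identical and each one feeds one read port.
module regfile_2r1w #(
    parameter WIDTH = 32,
    parameter DEPTH = 32,
    parameter AW    = 5
) (
    input                  clk,
    input                  we,      // write enable, e.g. driven by the ROB
    input      [AW-1:0]    waddr,
    input      [WIDTH-1:0] wdata,
    input      [AW-1:0]    raddr0,
    input      [AW-1:0]    raddr1,
    output     [WIDTH-1:0] rdata0,
    output     [WIDTH-1:0] rdata1
);
    reg [WIDTH-1:0] bank0 [0:DEPTH-1];
    reg [WIDTH-1:0] bank1 [0:DEPTH-1];

    always @(posedge clk) begin
        if (we) begin
            bank0[waddr] <= wdata;   // identical write into both copies
            bank1[waddr] <= wdata;
        end
    end

    // Asynchronous distributed-RAM reads; register these outputs if the
    // array ends up effectively clocked faster than the surrounding logic.
    assign rdata0 = bank0[raddr0];
    assign rdata1 = bank1[raddr1];
endmodule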
Jim Granville wrote: > Bringing this back into the FPGA domain: > > The idea is to build the closest thing a FPGA fabric allows. Use the > routing path-lengths to dominate the delays, and place the (series) > buffers only sparingly. > The result should be a Physical Ring Osc, where the Physical > ring dominates, and thus gives better precision. > With each FPGA generation, the buffer effects will decrease. > 65nm FPGAs are in the labs now ? Hmm. At the moment the RC delay of the wires will probably dominate. At least R is not well controlled. But in large chips you start seeing inductive effects, and then you might be right. KoljaArticle: 102027
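For readers wondering what such a routing-dominated ring oscillator looks like in the fabric, here is a minimal sketch assuming Xilinx LUT1 primitives and the usual KEEP attribute; the placement (LOC/RLOC) constraints that would actually set the routing lengths are omitted, and a zero-delay simulator will not run a combinational loop like this, so treat it as a synthesis-only sketch.

// Ring oscillator sketch: a chain of LUT buffers closed by a single
// inversion. KEEP stops synthesis from collapsing the chain, so the
// period is dominated by how far apart the stages are placed, i.e. by
// routing delay, which is the point of the discussion above.
module ring_osc #(
    parameter STAGES = 15
) (
    input  enable,
    output osc_out
);
    (* KEEP = "TRUE" *) wire [STAGES:0] chain;

    // Single inverting feedback makes the loop oscillate while enabled.
    assign chain[0] = enable & ~chain[STAGES];

    genvar i;
    generate
        for (i = 0; i < STAGES; i = i + 1) begin : buf_stage
            LUT1 #(.INIT(2'b10)) u_buf (   // INIT 2'b10 = plain buffer
                .I0(chain[i]),
                .O (chain[i+1])
            );
        end
    endgenerate

    assign osc_out = chain[STAGES];
endmodule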
I actually did build a CPU for pure MHz speed. It was a super-fast dual-issue CPU, but in order to get the high clock rates, I had to make some serious trade-offs. Number one tradeoff: the execution stage is two stages and there is at least one delay slot after every instruction before the result can be used. This version runs at 150MHz. I have another version with not much bypass hardware that makes it up to 180MHz. But with three delay slots and only 8 registers per issue slot, scheduling becomes a major issue. Number two: 16-bit architecture. Addition actually takes a long time using ripple-carry in an FPGA, and there's really no way around it. 16-bit is pretty easy to add, so that's what it gets. It's also 16-bit to cut down on utilization. Number three: Some arithmetic instructions are split in two. For example, shift instructions and 32-bit additions are split into two instructions. I simply could not afford the logic and delay of doing these with one instruction. Number four: 16-bit addressing. Same deal with addition, it takes too long, and I don't want to extend the delay slots any further, so I have 16-bit addressing only. Also, instruction sizes were 16 bits to cut down on logic and keep things "simple". So besides being a total pain in the butt to schedule and program, it really is rocket-fast. It is at its very worst 2 times faster than a 32-bit pipelined processor I designed, and at its best, it is 10 times faster. With decent scheduling and a superb compiler or hand coding, it should be able to sustain 5-8 times faster. The other advantage is that I could put 12 of them on a Spartan-3 1000. Theoretically, I could get the performance of a really crappy modern computer with these things. And now I come back to reality. It's such a specialized system, and the memory architecture, ISA and whole system all around are a mess. Yes, it's super fast, but so what? I would be so much better off just designing custom pipelined logic to do something rather than this gimp of a CPU. So that's why I'm designing a "modern" processor. It's a general-purpose CPU that could run real software such as Linux. It's that much more useful ;) Also, anyone can create a fast, simple processor. What's important is proper balance. I do agree that OOO and all that is not suitable for an FPGA. But it sure is fun!Article: 102028
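A hedged sketch of what "splitting 32-bit addition into two instructions" can look like at the datapath level; the module name, mnemonics and carry-flag convention below are illustrative assumptions, not taken from the design described above.

// 16-bit add-with-carry execution unit: a 32-bit add in software becomes
//   ADD  rlo_a, rlo_b    (carry_in forced to 0, sets the carry flag)
//   ADDC rhi_a, rhi_b    (carry_in = carry flag saved by the ADD)
module add16c (
    input             clk,
    input             use_carry,   // 0 = ADD, 1 = ADDC
    input      [15:0] a,
    input      [15:0] b,
    output reg [15:0] sum,
    output reg        carry_flag   // saved for a following ADDC
);
    wire        cin  = use_carry & carry_flag;
    wire [16:0] full = {1'b0, a} + {1'b0, b} + cin;

    always @(posedge clk) begin
        sum        <= full[15:0];
        carry_flag <= full[16];
    end
endmodule

The ripple-carry chain is only 16 bits long, which is what keeps the cycle time down, at the cost of an extra instruction and a dependency through the carry flag.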
The assumption may have been made (but not thoroughly communicated) that shifting would be through SRLs to deliver the max power. Transition of every *register* from 1s to 0s would be one heck of a strain but perhaps not as much of a strain as half that many SRLs (SLICEM vs SLICEL these days) feeding an all-1 to all-0 transition on the registers fed by those SRLs. "Austin Lesea" <austin@xilinx.com> wrote in message news:e3qvsb$rel14@xco-news.xilinx.com... > lenz, > > You really have to draw yourself a picture. > > I don't think anyone has really thought this through, unless they are > doing it in reality. > > For example, if I place a 1,1,1,1,... in a shift register, and clock it, I > get a transition from a 1 to a 1, and no charging or discharging, so no > current! > > If I place 1,0,1,0,1,0 ... in the shift register, then I maximize my > average dynamic current, as on every clock, I make a node change from 0 to > 1, or 1 to 0. > > If I place an isolated 0 to 1 transition I can see the effective impulse > response for a single transition of 0 to 1. This would have to be done > with all the DFF tied to the same D input, and not a giant shift register, > however. > > One has to examine how the skew across the global clock will affect the > outcome (nothing is really synchronous in reality - never all the exact > same phase). > > So, there are many experiments one can perform, and as Peter points out, > many of them are degenerate cases (unlikely to exist in reality). > > These are exacly the kinds of patterns we use in verification and > charaterization. And, we have been doing this for many years now. > > AustinArticle: 102029
Hi Clark, Anonymous wrote: > Has anyone used USB from linux on the ml-403 board? I'd like to get some > peripherals like usb memory or bluetooth adapter to work on it but usb is > not in the kernel they provide. The hardware does appear to be on the board > though. We (PetaLogix) recently ported the Cypress Reference drivers for the Cypress-EZUSB devices to the ML40x boards. We targeted MicroBlaze but they'll work on PPC (PetaLogix auto-config, not MontaVista) without difficulty. The drivers are in the uClinux public CVS tree. Regards, JohnArticle: 102030
John_H, I don't doubt that one can construct an ultra worst-case scenario. We have done that in the past. Part configures, DONE goes high, and then after a few clocks, the part configures, DONE goes high.... You get the picture. The thump from the switching resets the entire power-on reset circuit, and the part starts all over again. A very, very expensive relaxation oscillator, whose output is the DONE pin. What is far more important is for the customer to know how much system jitter there will be for their PCB, their bypass network, and their bitstream. THAT is a real problem! Austin John_H wrote: > The assumption may have been made (but not thoroughly communicated) that > shifting would be through SRLs to deliver the max power. Transition of > every *register* from 1s to 0s would be one heck of a strain but perhaps not > as much of a strain as half that many SRLs (SLICEM vs SLICEL these days) > feeding an all-1 to all-0 transition on the registers fed by those SRLs. > > "Austin Lesea" <austin@xilinx.com> wrote in message > news:e3qvsb$rel14@xco-news.xilinx.com... > >>lenz, >> >>You really have to draw yourself a picture. >> >>I don't think anyone has really thought this through, unless they are >>doing it in reality. >> >>For example, if I place a 1,1,1,1,... in a shift register, and clock it, I >>get a transition from a 1 to a 1, and no charging or discharging, so no >>current! >> >>If I place 1,0,1,0,1,0 ... in the shift register, then I maximize my >>average dynamic current, as on every clock, I make a node change from 0 to >>1, or 1 to 0. >> >>If I place an isolated 0 to 1 transition I can see the effective impulse >>response for a single transition of 0 to 1. This would have to be done >>with all the DFF tied to the same D input, and not a giant shift register, >>however. >> >>One has to examine how the skew across the global clock will affect the >>outcome (nothing is really synchronous in reality - never all the exact >>same phase). >> >>So, there are many experiments one can perform, and as Peter points out, >>many of them are degenerate cases (unlikely to exist in reality). >> >>These are exacly the kinds of patterns we use in verification and >>charaterization. And, we have been doing this for many years now. >> >>Austin > > >Article: 102031
I'm working on a design that needs to be able to write to a DDR ram at 133MHz but only needs to read back the data at a slower rate. I thought I could greatly ease the design by slowing down the clock on reads to say 66MHz. This really opens up my read timing budget. Doing fast writes is easier because I can use a 90 degree shifted clock to drive the DQS lines. The problem is I'm not sure how to create a constraints file that enforces the timing required for different read and write clock speeds. Anybody have any ideas? I'm using a Spartan-3[E] and ISE 8.1. Thanks, David CarrArticle: 102032
Doesn't DDR have a system clock that runs a DLL? The physical interface will have to maintain the same clock. Your reads can come into DDR IOB registers in bursts of 2 rather than 4 or 8 allowing the input registers to be read at your slower clock speed. Look at the system level clocking that includes the DDR and the FPGA clocks. "DC" <dc@dcarr.org> wrote in message news:1147214548.342501.9200@v46g2000cwv.googlegroups.com... > I'm working on a design that needs to be able to write to a DDR ram at > 133MHz but only needs to read back the data at a slower rate. I > thought I could greatly ease the design by slowing down the clock on > reads to say 66MHz. This really opens up my read timing budget. Doing > fast writes is easier because I can use a 90 degree shifted clock to > drive the DQS lines. > > The problem is I'm not sure how to create a constraints file that > enforces the timing required for different read and write clock speeds. > Anybody have any ideas? I'm using a Spartan-3[E] and ISE 8.1. > > Thanks, > David CarrArticle: 102033
I did some of the same things as your 16b design but with some differences. I used 2 clocks per microcycle at 300MHz or so on V2Pro and <<200MHz on SP3 so that the BlockRam or a 16bit add were the critical paths or 3 levels of Lut logic in the control. The 32b ISA therefore gets 4 ports out of a BlockRam with a large register set up to 64 regs but encodes the ops in variable-length 16b opcodes using prefixes to stretch address widths 3b at a time for 3 Reg fields or to set up a literal in 8b chunks. I was always partial to the old 9900 and 68000. The pipeline is quite long since it runs 4-way threads and therefore has no hazard or forwarding logic. The MTA logic allows the instruction decode to occur over many pipelines since only every 4th pair is per thread. The BlockRam also holds the instruction queue reminiscent of the 8086 per thread and the IFetch unit also has another tiny queue to reassemble 16b words into 1..4 x 16b final opcode. Since the PEs only run RRR codes in 2 clocks and average Bcc or Ld,St in 4 or sometimes 6 clocks, the branch latency covers the cc decode so no prediction is needed. Bcc inside the queue are also 2 clocks. The Ld, St codes go out to an MMU that hashes a 32b address with a 32b object id into RLDRAM on 8-word line blocks. The idea there is that the RLDRAM interleaved throughput can be shared over 10 or so of these PEs, which leaves me with 40-odd threads. The RLDRAM 20ns latency is easily covered by the MTA pipeline so Ld,St have a slight Memory Wall effect over effectively the whole DRAM address space, so no SRAM Dcache is needed. So instead of a Memory Wall, I have a Thread Wall; use em or lose em, ~40*~25mips. The MMU also implements the remainder of the Transputer process scheduler and DMA, message passing and links, so that's where most of the real fun is, and much remains to be done. Of course it only really makes sense if you want to run lots of processes as any Transputer user would. The basic version was done 25yrs ago so I guess it's not a modern processor yet. Post your results when you have some; you might be the 1st to tackle OoO SS, I haven't heard of anything else. John Jakson transputer guyArticle: 102034
Isaac Bosompem wrote: > Are you attempting a 100% hardware solution or are you doing a mix of > both hardware and PC software? Hi Isaac. Yes, as I mentioned to Tom elsewhere in this thread, I have written a 100% Verilog implementation of ECM (as well as Fermat's method and Pollard-Rho). It runs completely standalone and is not connected to anything except power. It would display the answer (if found) on the development board's LCD display. I have no objection to using an FPGA as an accelerator for a program running on a conventional computer, but was hoping to fit the entire thing (ECM) on one or more FPGA development boards because that keeps things fast and simple. For me, it's at least as easy to code in Verilog as it is in a typical assembly language, so I see no reason to clutter things up by going off-board or adding a micro-CPU core. My reasoning is that eventually (hopefully within my lifetime, ha!) FPGAs will become huge and cheap, and I'm hoping that the LUT count of FPGAs increases faster than the performance of traditional computers. That hope may not be justified of course, because some of the same type of technology is used in both CPUs and FPGAs; however, there is no chance that I'll ever be able to afford a cluster of Opterons, but if I can find a way around the exorbitant prices FPGA vendors charge for proprietary software design tools (that can only be used with their own products, no less), I could probably afford a fairly high-performance FPGA. In any case, writing ECM in Verilog has been fun and I've learned a lot about Verilog. I now have a working Verilog ECM design, and I'll spend the time until development s/w gets cheap enough for me to afford on tweaking and improving the design. I've been forced to slow my design down to a crawl in order to try to get it to fit into something I can afford, but whenever FPGAs and the requisite design s/w become plentiful and cheap, I hope to be able to take full advantage of all the opportunities for parallelization and pipelining that ECM offers. Regards, RonArticle: 102035
That's pretty slick! I like it. I've got a blog where I write random ideas and notes about my projects: http://bitstuff.blogspot.com You'll either find it incredibly dull, or it'll pique your interest. Most people I know couldn't care less about this stuff, so I appreciate the dialogue and comments.Article: 102036
When you shift a 1010101 pattern, there is a lot of power consumption, but on each transition half the nodes go Low-to-High, and the adjacent other half goes High-to-Low. I call that benign. If you switch on every even clock cycle from 1111111 to 0000000 and on every odd clock cycle back to 11111111, you do not get the compensation effect, although the total average power consumption is the same. I hope this is clearer. Peter AlfkeArticle: 102037
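For what it's worth, the two stimulus patterns Peter describes are easy to put side by side in RTL. Below is a generic sketch (sizes and names are arbitrary, and a real power-characterization design would add placement and anti-trim constraints): one shift register carries the benign 1,0,1,0,... pattern, while a second register bank slams between all-ones and all-zeros every cycle.

// Two dynamic-power stimulus patterns from the discussion above.
module power_patterns #(
    parameter N = 1024
) (
    input  clk,
    input  rst,
    output toggle_tap,
    output slam_tap
);
    // Pattern 1: 1,0,1,0,... rotating through a shift register. Every
    // flop toggles each clock, but half rise while the adjacent half
    // fall, so the supply sees the "benign" compensated case.
    reg [N-1:0] shifter;
    always @(posedge clk)
        if (rst) shifter <= {N/2{2'b10}};
        else     shifter <= {shifter[N-2:0], shifter[N-1]};

    // Pattern 2: every flop goes 000...0 -> 111...1 -> 000...0, so all
    // transitions line up in the same direction on each clock edge.
    reg [N-1:0] slam;
    reg         phase;
    always @(posedge clk)
        if (rst) begin slam <= {N{1'b0}}; phase <= 1'b0; end
        else     begin slam <= {N{~phase}}; phase <= ~phase; end

    // Reduction XORs keep synthesis from trimming the registers away.
    assign toggle_tap = ^shifter;
    assign slam_tap   = ^slam;
endmodule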
"Luke" <lvalenty@gmail.com> writes: > Number one tradeoff, the execution stage is two stages and there is at > least one delay slot after every instruction before the result can be > used. This version runs at 150MHz. I have another version with not > much bypass hardware that makes it up to 180MHz. I must be confused. Since your two-stage 150 MHz processor needed a delay slot for all access to results, it must not have used any bypassing. So are you telling us that adding some ("not much") bypass hardware sped it up by 20%? That seems counterintuitive. > Number two: 16-bit architecture. Addition actually takes a long time > using ripple-carry in an FPGA, and there's reall no way around it. Actually, I've implemented wide carry-select adders using carry lookahead in the Spartan 3. So there is a "way around it", but it probably won't help for a 16-bit adder. EricArticle: 102038
Joel wrote: > I recently finished my Masters Thesis on Algorithm Acceleration in > FPGA. Part of my research and experiementation was running some > algorithms on the PPC405 core in V2PRO. I used both ISE/EDK 7.1 and > EDK 8.1 (and then developing IP in the FPGA and attaching to PLB). I > got software profiling to work using PIT and I used PLM BRAM memory for > storing the profiling information. Initially I ran into a lot of > problems, but eventually got it to work on 7.1. I was using latest ISE > SP and EDK SP for 7.1. Most of my research was infact done on ISE/EDK > 7.1, and towards the end I repeated same experiements using EDK 8.1. > Profiling worked on EDK 8.1 also for PPC406 using -pg gcc and setting > PIT for software profiling in software platform settings. Thank you for your response. I finally got profiling working with edk 8.1. But since you said edk 7.1 worked, I went back and tried it again, and it worked too! I think I mixed up the compilers since I have edk 6.3 and edk 7.1 installed on the same computer. Alan NishiokaArticle: 102039
I must not have been very clear. The 180MHz version had no bypassing logic whatsoever. It had three delay slots. The 150MHz version did have bypassing logic, it had one delay slot. I read up on carry lookahead for the spartan 3, and you're correct, it wouldn't help for 16-bits. In fact, it's slower than just using the dedicated carry logic.Article: 102040
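Since wide carry-select adders came up a couple of posts back, here is a rough generic sketch of the idea (the block split and names are illustrative, not Eric's implementation; a tuned Spartan-3 version would map each block onto the dedicated carry chain): the upper half is computed twice, once per possible carry-in, and the lower half's carry-out picks the winner, roughly halving the longest ripple path at the cost of a duplicated adder and a mux.

// One-level carry-select adder sketch.
module csel_add #(
    parameter W  = 32,   // total width
    parameter LO = 16    // width of the lower (ripple) block
) (
    input  [W-1:0] a,
    input  [W-1:0] b,
    output [W-1:0] sum,
    output         cout
);
    wire [LO:0]   lo  = {1'b0, a[LO-1:0]} + {1'b0, b[LO-1:0]};
    wire [W-LO:0] hi0 = {1'b0, a[W-1:LO]} + {1'b0, b[W-1:LO]};          // assumes carry-in 0
    wire [W-LO:0] hi1 = {1'b0, a[W-1:LO]} + {1'b0, b[W-1:LO]} + 1'b1;   // assumes carry-in 1
    wire          sel = lo[LO];   // carry-out of the lower block

    assign sum  = { (sel ? hi1[W-LO-1:0] : hi0[W-LO-1:0]), lo[LO-1:0] };
    assign cout = sel ? hi1[W-LO] : hi0[W-LO];
endmodule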
Good point about the DLLs in the RAM itself. In this application (a digital scope) I do all writes in essentially one long burst and then go back and read the acquired waveforms. I could potentially pause while the DLLs relock at the new clock rate for the reads. The physical interface to the DDR RAM itself is presenting the problem. At 133MHz you really need to use the DQS strobes to latch the data. Unfortunately, in the Spartan-3 it's difficult to use the DQS as a latch signal, and as a result most Spartan-3 DDR designs only use the system clock for reads. -DCArticle: 102041
SongDragon wrote: > 1) device driver (let's say for linux 2.6.x) requests some (snip snip) > writes a zero to a register ("serviced descriptor"), telling the PCIe > device the interrupt has been fielded. > I have a number of questions regarding this. First and foremost, is > this view of the transaction correct? Is this actually "bus > mastering"? It seems like for PCIe, since there is no "bus", there is > no additional requirements to handle other devices "requesting" the > bus. So I shouldn't have to perform any bus arbitration (listen in to > see if any of the other INT pins are being triggered, etc). Is this > assumption correct? Your description of events is pretty much correct. The exact registers and sequencing will of course depend on your implementation of a DMA controller. You'll need a source register too unless the data is being supplied by a FIFO or I/O "pipe" on the device. "Bus mastering" is a PCI term and refers to the ability to initiate a PCI transfer - which also implies the capability to request the bus. In PCIe nomenclature, an entity that can initiate a transfer is referred to as a "requestor" and you're right, there's no arbitration involved as such. But this is the equivalent of a PCI bus master I suppose. The target of the request is called the "completer". This is where my knowledge of PCIe becomes thinner, as I'm currently in the process of ramping up for a PCIe project myself. But I have worked on several PCI projects so I think my foundations are valid. For example, using a (bus-mastering) PCI core you wouldn't have to 'worry about' requesting the bus etc - initiating a request via the back-end of the core would trigger that functionality in the core transparently for you. As far as your device is concerned, you have "exclusive" use of the bus - you may just have to wait a bit to get to use it (and you may get interrupted occasionally). Arbitration etc is not your problem. > In PCI Express, you have to specify a bunch of things in the TLP > header, including bus #, device #, function #, and tag. I'm not sure > what these values should be. If the CPU were requesting a MEMREAD32, > the values for these fields in the MEMREAD32_COMPLETION response > would would be set to the same values as were included in the > MEMREAD32. However, since the PCIe device is actually sending out a > MEMWRITE32 command, the values for these fields are not clear to me. This is where I'll have to defer to others... Regards, -- Mark McDougall, Engineer Virtual Logic Pty Ltd, <http://www.vl.com.au> 21-25 King St, Rockdale, 2216 Ph: +612-9599-3255 Fax: +612-9599-3266Article: 102042
Mark McDougall wrote: > SongDragon wrote: > >> 1) device driver (let's say for linux 2.6.x) requests some BTW if you're writing Linux device drivers as opposed to Windows drivers, you're in for a *much* easier ride! :) Regards, -- Mark McDougall, Engineer Virtual Logic Pty Ltd, <http://www.vl.com.au> 21-25 King St, Rockdale, 2216 Ph: +612-9599-3255 Fax: +612-9599-3266Article: 102043
DC wrote: > Good point about the DLLs in RAM itself. In this application (a > digital scope) I do all writes in essentially one long burst and then > go back and read the aquired waveforms. I could potentially pause > while the DLLs relock at the new clock rate for the reads. The > physical interface to the DDR RAM itself is presenting the problem. At > 133MHz you really need to use the DQS strobes to latch the data. > Unfortunately in the Spartan 3, its difficult to use the DQS as a latch > signal and as a result most Spartan 3 DDR designs only use the system > clock for reads. > > -DC By matching the DDR clock to the data, I'm convinced that the read interface is doable without using the DQS, though lane-to-lane skew matching has to tighten up to achieve that goal. Generating the DDR clock with the FPGA and routing a copy of that clock out to your memories and back will give you a matched round trip such that the clock copy can return to the FPGA at the same time as the read data would return. I started the design process and got great timing results but never got to where working silicon showed all the timings were precise. So I'm convinced it's straightforward; I'm just not certain of the numbers I was achieving.Article: 102044
> > The huge DRAM and the fast FPGA seem to make this board ideal for video > > and sound processing. But.. why on earth does the board come with a > > 3bit VGA output? We do not live in the 80ies anymore. Adding a couple > > of resistors to get 8 or 12bit color resolution would hardly have > > changed the BOM. Even a video DAC is not expensive. > > It would, however, have used up more IO pins. I don't know if that was > a consideration, but they do seem to share pins for the devices on the > SPI bus. > > Does anyone know if doing PWM on the VGA output pins would cause Bad > Things to happen to a typical monitor? In some ways this is very similar to dithering. I already used ordered dithering to improve the color resolution from 8 to 14 bits on another board. Unfortunately, 3-bit color resolution is extremely low to start with. One way out could be to use a PAL/NTSC encoder. Is there any free core out there?Article: 102045
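As a sketch of the ordered-dither idea mentioned above, here is what reducing an 8-bit internal channel value to the single bit per colour of a 3-bit VGA port can look like; the 4x4 Bayer matrix and the port names are illustrative, and the pixel/line counters are assumed to exist elsewhere in the design.

// Ordered (Bayer) dithering of one 8-bit colour channel down to 1 bit.
// The threshold depends on the low bits of the pixel position, so the
// average duty cycle of the output pin tracks the 8-bit value and the
// eye integrates it into intermediate shades.
module bayer_dither_1bit (
    input        clk,
    input  [7:0] value,    // internal channel intensity
    input  [1:0] px,       // low bits of the horizontal pixel counter
    input  [1:0] py,       // low bits of the vertical line counter
    output reg   out_bit   // drives one VGA colour pin
);
    reg [3:0] threshold;

    always @(*) begin
        case ({py, px})                       // classic 4x4 Bayer order
            4'd0 : threshold = 4'd0;  4'd1 : threshold = 4'd8;
            4'd2 : threshold = 4'd2;  4'd3 : threshold = 4'd10;
            4'd4 : threshold = 4'd12; 4'd5 : threshold = 4'd4;
            4'd6 : threshold = 4'd14; 4'd7 : threshold = 4'd6;
            4'd8 : threshold = 4'd3;  4'd9 : threshold = 4'd11;
            4'd10: threshold = 4'd1;  4'd11: threshold = 4'd9;
            4'd12: threshold = 4'd15; 4'd13: threshold = 4'd7;
            4'd14: threshold = 4'd13; 4'd15: threshold = 4'd5;
            default: threshold = 4'd0;
        endcase
    end

    always @(posedge clk)
        out_bit <= (value > {threshold, 4'b0000});
endmodule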
Luke wrote: > I must not have been very clear. The 180MHz version had no bypassing > logic whatsoever. It had three delay slots. The 150MHz version did > have bypassing logic, it had one delay slot. > > I read up on carry lookahead for the spartan 3, and you're correct, it > wouldn't help for 16-bits. In fact, it's slower than just using the > dedicated carry logic. I also used a 32b CSA design in the design prior to the one described above. It worked but was pretty expensive; IIRC it gave me 32b adds in the same cycle time as a 16b ripple add, but it used up 7 8b ripple add blocks and needed 2 extra pipe stages to put the CSA select results back together and combine that with the CC status logic. The 4-way MTA though just about took it without too much trouble, but it also needed hazard and forwarding logic. A funny thing I saw was that the doubling of adders added more C load to the pipeline FFs and those had to be duplicated as well, so the final cost of that 32b datapath was probably 2 or 3x bigger than a plain ripple datapath and much harder to floorplan the P/R. What really killed that design was an interlock mechanism designed to prevent the IFetch and IExec blocks from ever running code from the same thread at the same time; that 1 little path turned out to be 3 or 4x longer than the 32b add time and no amount of redesign could make it go away, all that trouble for nought. The lesson learned was that complex architecture with any interlocks usually gets hammered on these paths that don't show up till the major blocks are done. The final state of that design was around 65MHz when I thought I would hit 300MHz on the datapath, and the total logic was about 3x the current design. Not much was wasted though, much of the conceptual design got rescued in simpler form. In an ASIC this is much less of a problem since transistor logic is relatively much faster per clock freq than FPGAs; it would have been more like 20 gates, and of course the major CPU designers can throw bodies at such problems. I wonder how you will get the performance you want without finding an Achilles' heel till the later part is done. You have to finish the overall logic design before committing to design specific blocks, and it ends up taking multiple iterations. When I said 25MHz in the 1st post I meant that to reflect these sorts of critical paths that can't be foreseen till you're done, rather than the datapaths. That's why I went to the extreme MTA solution to abolish or severely limit almost all variables; make it look like a DSP engine and you can't fail. Curiously, how do you prototype the architecture: in cycle C, or go straight to HDL simulation? Anyway have fun John Jakson transputer guy the details are at wotug.org if interestedArticle: 102046
Hi all, I am working on a chip design where the frontend is interfaced to the PCI bus and the backend has asynchronous FIFOs and a UART. I load the data into the FIFO at 33 MHz and then read it at 40 MHz and pass it to the UART module to transmit the data out. The design works great in simulation, but it is giving me entirely different results when I check it through a logic analyzer on the actual chip. I don't know what I am doing wrong. Please advise. Thank you SandeepArticle: 102047
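For the asynchronous FIFO crossing described above (33 MHz writes, 40 MHz reads), here is a generic sketch, not the poster's code, of how a write pointer is usually carried into the read-clock domain: the binary pointer is gray-coded so only one bit changes per increment, then passed through a two-flop synchronizer. Behaviour that is clean in simulation but flaky in hardware is a common symptom when this kind of crossing is missing.

// Write-pointer transfer into the read-clock domain of a dual-clock FIFO.
module wptr_sync #(
    parameter AW = 4                    // FIFO address width
) (
    input             wclk,             // e.g. 33 MHz PCI-side write clock
    input             rclk,             // e.g. 40 MHz UART-side read clock
    input             rst,
    input             winc,             // push strobe
    output reg [AW:0] wptr_bin,         // addresses the FIFO memory
    output     [AW:0] wptr_gray_rclk    // safe to compare in the read domain
);
    reg [AW:0] wptr_gray;
    reg [AW:0] sync1, sync2;

    wire [AW:0] bin_next  = wptr_bin + (winc ? 1'b1 : 1'b0);
    wire [AW:0] gray_next = (bin_next >> 1) ^ bin_next;    // binary -> gray

    always @(posedge wclk or posedge rst)
        if (rst) begin wptr_bin <= 0; wptr_gray <= 0; end
        else     begin wptr_bin <= bin_next; wptr_gray <= gray_next; end

    // Two-flop synchronizer: at most one gray bit is in flight per edge.
    always @(posedge rclk or posedge rst)
        if (rst) begin sync1 <= 0; sync2 <= 0; end
        else     begin sync1 <= wptr_gray; sync2 <= sync1; end

    assign wptr_gray_rclk = sync2;
endmodule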
Luke wrote: > I actually did build a CPU for pure MHz speed. It was a super fast > dual-issue cpu, but in order to get the high-clock rates, I had to make > some serious trade-offs. > > Number one tradeoff, the execution stage is two stages and there is at > least one delay slot after every instruction before the result can be > used. This version runs at 150MHz. I have another version with not > much bypass hardware that makes it up to 180MHz. But with three delay > slots and only 8 registers per issue slot, scheduling becomes a major > issue. > > Number two: 16-bit architecture. Addition actually takes a long time > using ripple-carry in an FPGA, and there's reall no way around it. > 16-bit is pretty easy to add, so that's what it gets. It's also 16-bit > to cut down on utilization. > > Number three: Some arithmetic instructions are split in two. For > example, shift instructions and 32-bit addition is split into two > instructions. I simply could not afford the logic and delay of doing > these with one instruction. > > Number four: 16 bit addressing. Same deal with addition, it takes too > long, and i don't want to extend the delay slots any further, so I have > 16 bit addressing only. Also, instruction sizes were 16 bits to cut > down on logic and keep things "simple". > > So besides being a total pain in the butt to schedule and program, it > really is rocket-fast. It is at it's very worst 2 times faster than a > 32-bit pipleined processor i designed, and at it's best, it is 10 times > faster. With decent scheduling and a superb compiler or hand coding, > it should be able to sustain 5-8 times faster. > > The other advantage is that I could put 12 of them on a spartan 3 1000. > Theoretically, I could get the performance of a really crappy modern > computer with these things. > > And now I come back to reality. It's such a specialized system, and > the memory architecture, ISA and all around whole system is a mess. > Yes, it's super fast, but so what? Well, to some users, that is important. > I would be so much better off just > designing custom pipelined logic to do something rather than this gimp > of a cpu. > > So that's why I'm designing a "modern" processor. It's a general > purpose CPU that could run real software such as Linux. It's that much > more useful ;) Sounds more like a microprocessor, whereas the first one is more like a microcontroller. There is room for both, so don't throw the first one away! With a small/nimble core, you have the option to deploy more than one, and in an FPGA, that's where soft-cpu's can run rings around other solutions. How much Ram/Registers could the 16 bit one access ? -jgArticle: 102048
In article <1147153450.028603.66700@u72g2000cwu.googlegroups.com>, JJ <johnjakson@gmail.com> wrote: > >Phil Tomson wrote: >> In article <1146981253.226901.102660@i39g2000cwa.googlegroups.com>, >> JJ <johnjakson@gmail.com> wrote: >> >I always hated that the PCI cores were so heavily priced compared to >> >the FPGA they might go into. The pricing seemed to reflect the value >> >they once added to ASICs some 10 or 15 years ago and not the potential >> >of really low cost low volume applications. A $100 FPGA in small vol >> >applications doesn't support $20K IP for a few $ worth of fabric it >> >uses. It might be a bargain compared to the cost of rolling your own >> >though, just as buying an FPGA is a real bargain compared to rolling my >> >own FPGA/ASIC too. >> >> That's why OpenCores is so important. (http://opencores.org) As FPGAs >> become cheaper we're going to need an open source ecosystem of cores. >> They've got a PCI bridge design at Open cores, for example. >> >> BTW: it would also be nice to have an open source ecosystem of FPGA >> design tools... but that's a bit tougher at this point. >> >> Phil > >Yes but open source and closed source are also like oil and water esp >together in a commercial environment. If I were doing commercial work I >doubt I'd ever use opencores but I might peek at it for an >understanding of how it might be done or ask someone else to. On a >hobbyist level, What's the hesitation? > well I have mixed feelings about gpl too. There are many more open source licenses besides gpl, though gpl is pretty commonly used. > I suspect the >software world does far better with it since enough people support the >gpl movement and there is a large user base for it. Hardware ultimately >can't be made for free so it can't be the same model. > Hardware itself cannot be made for free, however various cores (such as a PCI bridge that sparked this) can be created for free as it's pretty much the same process as software development: code it up in synthesizable HDL, simulate it to make sure it does what you want, synthesize it and try it out in an FPGA. Computers aren't free either, but there is plenty of open source software being created to run on them. PhilArticle: 102049
I have a fairly large Altera-based design that will soon be updated to Cyclone II and Quartus (from Flex10K and Max+II). Has anyone else been through this migration who would be willing to share any gotchas? Is the migration tool in Quartus worthwhile? Thanks, Keith