I apply a set of multi-cycle constraints to a module and it works fine, in both the timing analyzer and timing simulation. Then I incorporate this module into a larger design and apply the same constraints again. This time the timing analyzer reports OK, but the timing sim is wrong. Any idea how to solve the problem? Thanks. Ron
Article: 37576
Hello all, If you were to use IP cores, such as Logicore/Alliance cores from Xilinx, Megafunction cores from Altera, Inventra cores from Mentor, etc. how can you get the RTL verilog/VHDL when you want to convert to ASIC ? Thanks.Article: 37577
Anyone tried Xilinx "Modular Design tool" ? Any good / bad experience to share ? Thanks, Rotem Gazit Design Engineer High-speed board & FPGA design MystiCom LTD mailto:rotemg@mysticom.com http://www.mysticom.com/Article: 37578
Hi, I'm currently trying to adjust the Leonardo Spectrum OEM Edition from ALTERA's website to my needs. As I have a "smaller" screen, I'd like to reduce the font in the HDL Editor (opened with a click on the VHDL file) to 8pt Courier New. I had good experience with this size in my Max+PlusII designs. Unfortunately, I didn't manage to find any configuration for this editor (just for the information window) except adding line numbers... Any help appreciated, Carlhermann Schlehaus
Article: 37579
I'm not sure if people are doing this already, but I couldn't find a reference on the Xilinx web site. Block RAMs make more efficient squaring circuits than they do multipliers, and you can get multipliers out of squarers.

An explanation of the arithmetic. Let A and B be the numbers to be multiplied.

Compute C = (A+B) * (A+B) = A**2 + 2AB + B**2
Compute D = (A-B) * (A-B) = A**2 - 2AB + B**2
Then C - D = 4AB.

This is particularly efficient in Xilinx Spartan2, Virtex, and Virtex2 architectures because the block RAM is dual port. That means you can use one side for the (A-B)**2 calculation and the other side for the (A+B)**2 calculation.

With the Xilinx Spartan2, Virtex or VirtexE, use the RAMB4_16_16. It has 8 inputs and 16 outputs in two sections. Each section can conveniently compute the square of an 8-bit number. Note that the lowest two bits of the two squares are going to have to be equal (i.e. C-D = 4AB, so C and D have to match in two bits), so you don't have to subtract bits 1 and 0 of the two squares.

If "A" and "B" are both 7-bit, their sum will be no worse than 8-bit, so you can compute a 7x7 multiply using only the 8 LUTs for each of "A+B" and "A-B", and another 14 LUTs for the result, a total of 30 LUTs (i.e. 15 slices) and one block RAM. Maybe there's a way to get the bit back and let A and B be 8-bit numbers; I haven't looked at it long enough to conclude there isn't.

The circuit uses about half the LUTs required by the standard algorithm, at the expense of one block RAM. Similarly, you should be able to get a 15x15 multiplier with around 62 LUTs (31 slices) and four RAMB4_8_8 block RAMs. In addition, these multipliers are naturally pipelined with no need to register low results.
To put the LUT utilization in perspective, the Xilinx 8x8 multiply takes 39 slices, while the 16x16 takes 143: http://www.xilinx.com/ipcenter/reference_designs/vmult/vmult_v1_4.pdf Using RAMB4s alone to implement even a 7x7 multiply would require 28 of them, though you could reduce that somewhat by being properly sneaky... Carl -- Posted from firewall.terabeam.com [216.137.15.2] via Mailgate.ORG Server - http://www.Mailgate.ORGArticle: 37580
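The quarter-square arithmetic above is easy to sanity-check in software. The sketch below is a hypothetical software model (not Xilinx code): the list SQ stands in for one block-RAM section preloaded with x**2 for every 8-bit input.

```python
# Model of the quarter-square trick: one 8-bit-in / 16-bit-out
# block-RAM section per port, preloaded with x**2.
SQ = [x * x for x in range(256)]

def quarter_square_mult(a, b):
    """7x7 multiply via two squaring lookups: C - D = 4ab."""
    assert 0 <= a < 128 and 0 <= b < 128  # 7-bit operands keep a+b within 8 bits
    c = SQ[a + b]          # (a+b)**2 from one RAM port
    d = SQ[abs(a - b)]     # (a-b)**2 from the other port
    return (c - d) >> 2    # the low two bits of c and d always cancel
```

Exhaustively checking all 7-bit operand pairs against ordinary multiplication confirms the identity.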
I'm not sure if people are doing this already, but I couldn't find a reference on the Xilinx web site. Block RAMs make more efficient squaring circuits than they do multipliers, and you can get multipliers out of squarers.

An explanation of the arithmetic. Let A and B be the numbers to be multiplied.

Compute C = (A+B) * (A+B) = A**2 + 2AB + B**2
Compute D = (A-B) * (A-B) = A**2 - 2AB + B**2
Then C - D = 4AB.

This is particularly efficient in Xilinx Spartan2, Virtex, and Virtex2 architectures because the block RAM is dual port. That means you can use one side for the (A-B)**2 calculation and the other side for the (A+B)**2 calculation.

With the Xilinx Spartan2, Virtex or VirtexE, use the RAMB4_16_16. It has 8 inputs and 16 outputs in two sections. Each section can conveniently compute the square of an 8-bit number. Note that the lowest two bits of the two squares are going to have to be equal (i.e. C-D = 4AB, so C and D have to match in two bits), so you don't have to subtract bits 1 and 0 of the two squares.

If "A" and "B" are both 7-bit, their sum will be no worse than 8-bit, so you can compute a 7x7 multiply using only the 8 LUTs for each of "A+B" and "A-B", and another 14 LUTs for the result, a total of 30 LUTs (i.e. 15 slices) and one block RAM. Maybe there's a way to get the bit back and let A and B be 8-bit numbers; I haven't looked at it long enough to conclude there isn't.

The circuit uses about half the LUTs required by the standard algorithm, at the expense of one block RAM. To put the LUT utilization in perspective, the Xilinx 8x8 multiply takes 39 slices: http://www.xilinx.com/ipcenter/reference_designs/vmult/vmult_v1_4.pdf

Using RAMB4s alone to implement even a 7x7 multiply would require a huge number of them, as multiplies require twice as many address inputs as squares.

You can iterate on the calculation of the square. That is, if A is too big to square in a single operation, then break A into two parts.
With A broken into two parts, say A = AH + AL, you can compute AH**2, AL**2 with block RAM, and compute 2*AH*AL by computing the difference between (AH+AL)**2 and (AH-AL)**2. Breaking A and B into more than 3 parts may be worth exploring, for certain bit sizes. Carl -- Posted from firewall.terabeam.com [216.137.15.2] via Mailgate.ORG Server - http://www.Mailgate.ORGArticle: 37581
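The split-operand squaring can be modelled the same way. One detail worth noting: if AH is taken as the shifted high part, then AH + AL is just A again, so the sketch below (my assumption about the intended scheme, not the poster's exact circuit) squares the unshifted halves and applies the binary weights explicitly.

```python
# Squaring table wide enough for the sum of two 8-bit halves (indices up to 510).
SQ = [x * x for x in range(512)]

def wide_square(a, k=8):
    """Square a 2k-bit number using only narrow squaring lookups."""
    ah, al = a >> k, a & ((1 << k) - 1)                    # unshifted high/low halves
    twice_cross = (SQ[ah + al] - SQ[abs(ah - al)]) >> 1    # 2*ah*al via the identity
    # a**2 = ah**2 * 2**(2k)  +  2*ah*al * 2**k  +  al**2
    return (SQ[ah] << (2 * k)) + (twice_cross << k) + SQ[al]
```

This composes: each SQ lookup could itself be another level of the same decomposition for still wider operands.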
"S. Ramirez" <sramirez@cfl.rr.com> wrote in message news:AXvS7.84755$oj3.14571085@typhoon.tampabay.rr.com... > Did you check into this and is the special still going on? I'd like to > know, because things are always different in various parts of the > country/world. I'm in Florida, where are you? In the UK. I checked with our local Synplicity distributor (with whom I have a good working relationship), and the reply was: "...the promotion wasn't run outside the US, and expired at the end of November anyway". However, they went on to say: "If you are serious about another seat we're always happy to consider a deal!" So it sounds like some dickering / haggling is in order.... MH.Article: 37582
After making the mistake of getting involved in the current ECCp109 distributed computing project (see URL below), I'm now casting around to determine if there's a possibility of finding a PCI board with an FPGA co-processor capable of handling a small set of modular math functions.

http://www.nd.edu/~cmonico/eccp109/

The main points to be aware of are:
1. The client requires 128-bit integer math.
2. The client uses modular math for nearly all of the math functions.
3. The majority of compute time is spent in a single function that performs 128-bit modulo multiplication.
4. This project will move to the next challenge following completion. The next project will be a 131-bit challenge, requiring a word size larger than 128 bits.

If there existed an FPGA based PCI card capable of doing 128-bit modulo multiplication, I would be very interested. But after investing a week of searching, I'm unable to find an off-the-shelf solution, or the IP core to provide this capability.

Q1: Does anyone know of an existing FPGA based PCI card capable of 128-bit integer modulo multiplication?

Q2: Does anyone know of existing FPGA IP that supports 128-bit integer modulo multiplication?

Q3: If Q1 and Q2 are both no, would there be anyone interested in creating such a core, or possibly even a PCI board to accomplish 128-bit integer modulo multiplication? I would be willing to sponsor development costs, but I'm not flush enough to pay for the labor (time) involved.

Note that I have multiple decades of SW development background (C and assembler) with very minimal HW background, and this includes zero FPGA development experience. But given my SW experience, I can easily build any driver(s) and do the client porting.

BTW, the current client executes approximately 190,000 iterations per second on a 450p2. This includes several (1-3) modulo multiplications per iteration.

Jay Berg
jberg@eCompute.org
Article: 37583
On Sun, 16 Dec 2001 01:48:45 GMT, Peter Alfke <palfke@earthlink.net> wrote:

>Mohap wrote:
>
>> Why is the xilinx 4005xl 3.3v device? what if i want to output signals to an
>> external 5v device? is there something i can do to get around this problem?
>
>XC4005XL uses 3.3 V supply voltage because that cuts dynamic power consumption
>in half, compared to 5 V. Also, modern high-performance processes do not
>tolerate high voltages like 5 V.
>Today, 3.3 V is already obsolete; the most modern FPGAs use 1.5 V for the core
>logic, but retain 2.5 V tolerance on all outputs, 3.3 V on some.
>
>For new designs, 5 V is definitely out. It served us well from 1965 to 1995,
>for one third of the previous century, but its days are over. Rest In Peace!
>
>Now to your problem:
>The XC4005 inputs tolerate 5 V, if you select this (and thus disable the clamp
>diode to Vcc that is there because PCI demands it).
>Now you can drive the inputs up to 5.5 V.
>The outputs can obviously not drive higher than their own Vcc, and 3.3 V may be
>high enough for driving 5 V logic with so-called TTL input thresholds of ~1.5 V.
>(Forget the 2.4 V spec for Voh; that's a 30-year-old left-over from the days of
>bipolar TTL.)
>
>What if you have to drive 5-V logic that has a CMOS input threshold of up to 3.5 V?
>Then you need a pull-up resistor to the 5 V, and you should configure the
>XC4005 output as "open collector" (really: open drain). That costs you speed,
>since you now have a 1 kilohm pull-up and a, say, 100 pF load, which creates a
>100 ns delay time constant.
>There is a simple and clever way around that, and I can send you the circuit
>description on Monday when I am back at work.
>
>Peter Alfke, Xilinx Applications

Peter,

5 volts was actually a temporary aberration. We actually started with RTL at 3.6. We do 3.3V FPGA to 5V PECL with a pullup to +4.2; FPGA hi-Z becomes a PECL '1', and tristate ON of a hard logic HIGH becomes 3.3V = PECL '0'. Oh, please send me the clever circuit too.
Despam my email address or fax to 4-1-5/7-5-3-3-3-0-1. Thanks! JohnArticle: 37584
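As an aside, the 100 ns figure in the quoted post is just the RC time constant of the suggested pull-up arrangement:

```python
# RC time constant of an open-drain output with an external pull-up.
R = 1_000      # 1 kilohm pull-up resistor, in ohms
C = 100e-12    # 100 pF load, in farads
tau = R * C    # time constant in seconds: 100 ns
```

A stiffer (smaller) pull-up or a lighter load shortens this rise time proportionally.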
Russell Shaw wrote: > -- Pre Optimizing Design .work.sync_dpram_8_8.synth > -- Boundary optimization. > "E:/AAProjs/Bugs/Leonardo/main.vhd", line 34:Info, Inferred ram instance 'ix26409' of type > 'ram_dq_da_inclock_outclock_8_8_256' So it found the right module but . . . > -- optimize -target acex1 -effort quick -chip -area -hierarchy=auto > Using default wire table: STD-1 > Warning, Dual read ports not supported for FLEX/APEX/MERCURY RAMs; using default implementation. > Warning, using default ram implementation for ram_dq_da_inclock_outclock_8_8_256, run time can get large. . . . it refused to use it. Since acex1k is not in the unsupported list above, it's either a bug or a deliberate dumbing down of the oem version. Note that this ram is inferred properly with acex1k technology on the mentor version of leo. -- Mike TreselerArticle: 37585
"C.Schlehaus" wrote: > > Hi, > I'm currently trying to adjust the Leonardo Spectrum OEM > Edition from ALTERA's website to my needs. As I have a > "smaller" screen I'd like to size the font in the HDL > Editor (opened with click on the VHDL file) down to 8pt. > Courier New. I made good experience with this size with > my Max+PlusII designs. Unfortunately I didn't manage to > find any configuration for this editor (just for the > information window) except adding the line numbers... > > Any help appreciated, Carlhermann Schlehaus The leonardo editor is hopeless. I use an external editor (ultraedit), and an external compiler (vhdl-simili), before feeding anything into leonardo.Article: 37586
Mike Treseler wrote: > > Russell Shaw wrote: > > > -- Pre Optimizing Design .work.sync_dpram_8_8.synth > > -- Boundary optimization. > > "E:/AAProjs/Bugs/Leonardo/main.vhd", line 34:Info, Inferred ram instance 'ix26409' of type > > 'ram_dq_da_inclock_outclock_8_8_256' > > So it found the right module but . . . > > > -- optimize -target acex1 -effort quick -chip -area -hierarchy=auto > > Using default wire table: STD-1 > > Warning, Dual read ports not supported for FLEX/APEX/MERCURY RAMs; using default implementation. > > Warning, using default ram implementation for ram_dq_da_inclock_outclock_8_8_256, run time can get large. > > . . . it refused to use it. > > Since acex1k is not in the unsupported list above, > it's either a bug or a deliberate dumbing down > of the oem version. > > Note that this ram is inferred properly with > acex1k technology on the mentor version of leo. I've reported it as a bug/feature-request. I've worked out a 2x clk method to get simultaneous read and write address ports from a lpm_ram_dq.Article: 37587
Hello again. I'm curious to know if anyone out there knows where there are some examples of an SPI interface coded in VHDL. Just curious, as I have to code one in the near future and I always like to compare the various approaches taken by others. Thanks, Jason
Article: 37588
Hi, I would like to know if someone knows strategies for reducing routing (net) delays for Spartan-II. So far, I treated the synthesis tool (XST)/Map/Par as a black box, but because my design (a PCI IP core) was not meeting Tsu (Tsu < 7ns), I started to take a close look at how LUTs are placed on the FPGA. Using Floorplanner, I saw the LUTs being placed all over the FPGA, so I decided to hand place the LUTs using the UCF flow. That was the most effective thing I did to reduce interconnect delay (it reduced the worst interconnect delay by about 2.7 ns, from 11 ns down to 8.3 ns), but unfortunately, I still have to reduce the interconnect delay by another 1.3 ns (worst Tsu currently at 8.3 ns). Basically, I have two input signals, FRAME# and IRDY#, that are not meeting timing. Here are two of the worst violators for FRAME# and IRDY#, respectively.

________________________________________________________________________________
================================================================================
Timing constraint: COMP "frame_n" OFFSET = IN 7 nS BEFORE COMP "clk" ;
503 items analyzed, 61 timing errors detected.
Minimum allowable offset is 8.115ns.
--------------------------------------------------------------------------------
Slack:                  -1.115ns (requirement - (data path - clock path - clock arrival))
Source:                 frame_n
Destination:            PCI_IP_Core_Instance_ad_Port_2
Destination Clock:      clk_BUFGP rising at 0.000ns
Requirement:            7.000ns
Data Path Delay:        10.556ns (Levels of Logic = 6)
Clock Path Delay:       2.441ns (Levels of Logic = 2)

Timing Improvement Wizard
Data Path: frame_n to PCI_IP_Core_Instance_ad_Port_2
  Delay type                    Delay(ns)  Logical Resource(s)
  ----------------------------  ---------  -------------------
  Tiopi                         1.224      frame_n
                                           frame_n_IBUF
  net (fanout=45)               0.591      frame_n_IBUF
  Tilo                          0.653      PCI_IP_Core_Instance_I_25_LUT_7
  net (fanout=3)                0.683      N21918
  Tbxx                          0.981      PCI_IP_Core_Instance_I_XXL_1357_1
  net (fanout=15)               2.352      PCI_IP_Core_Instance_I_XXL_1357_1
  Tilo                          0.653      PCI_IP_Core_Instance_I_125_LUT_17
  net (fanout=1)                0.749      PCI_IP_Core_Instance_N3059
  Tilo                          0.653      PCI_IP_Core_Instance_I__n0055
  net (fanout=1)                0.809      PCI_IP_Core_Instance_N3069
  Tioock                        1.208      PCI_IP_Core_Instance_ad_Port_2
  ----------------------------  ---------  -------------------
  Total                         10.556ns   (5.372ns logic, 5.184ns route)
                                           (50.9% logic, 49.1% route)

Clock Path: clk to PCI_IP_Core_Instance_ad_Port_2
  Delay type                    Delay(ns)  Logical Resource(s)
  ----------------------------  ---------  -------------------
  Tgpio                         1.082      clk
                                           clk_BUFGP/IBUFG
  net (fanout=1)                0.007      clk_BUFGP/IBUFG
  Tgio                          0.773      clk_BUFGP/BUFG
  net (fanout=423)              0.579      clk_BUFGP
  ----------------------------  ---------  -------------------
  Total                         2.441ns    (1.855ns logic, 0.586ns route)
                                           (76.0% logic, 24.0% route)
--------------------------------------------------------------------------------
================================================================================
Timing constraint: COMP "irdy_n" OFFSET = IN 7 nS BEFORE COMP "clk" ;
698 items analyzed, 74 timing errors detected.
Minimum allowable offset is 8.290ns.
--------------------------------------------------------------------------------
Slack:                  -1.290ns (requirement - (data path - clock path - clock arrival))
Source:                 irdy_n
Destination:            PCI_IP_Core_Instance_ad_Port_2
Destination Clock:      clk_BUFGP rising at 0.000ns
Requirement:            7.000ns
Data Path Delay:        10.731ns (Levels of Logic = 6)
Clock Path Delay:       2.441ns (Levels of Logic = 2)

Timing Improvement Wizard
Data Path: irdy_n to PCI_IP_Core_Instance_ad_Port_2
  Delay type                    Delay(ns)  Logical Resource(s)
  ----------------------------  ---------  -------------------
  Tiopi                         1.224      irdy_n
                                           irdy_n_IBUF
  net (fanout=138)              0.766      irdy_n_IBUF
  Tilo                          0.653      PCI_IP_Core_Instance_I_25_LUT_7
  net (fanout=3)                0.683      N21918
  Tbxx                          0.981      PCI_IP_Core_Instance_I_XXL_1357_1
  net (fanout=15)               2.352      PCI_IP_Core_Instance_I_XXL_1357_1
  Tilo                          0.653      PCI_IP_Core_Instance_I_125_LUT_17
  net (fanout=1)                0.749      PCI_IP_Core_Instance_N3059
  Tilo                          0.653      PCI_IP_Core_Instance_I__n0055
  net (fanout=1)                0.809      PCI_IP_Core_Instance_N3069
  Tioock                        1.208      PCI_IP_Core_Instance_ad_Port_2
  ----------------------------  ---------  -------------------
  Total                         10.731ns   (5.372ns logic, 5.359ns route)
                                           (50.1% logic, 49.9% route)

Clock Path: clk to PCI_IP_Core_Instance_ad_Port_2
  Delay type                    Delay(ns)  Logical Resource(s)
  ----------------------------  ---------  -------------------
  Tgpio                         1.082      clk
                                           clk_BUFGP/IBUFG
  net (fanout=1)                0.007      clk_BUFGP/IBUFG
  Tgio                          0.773      clk_BUFGP/BUFG
  net (fanout=423)              0.579      clk_BUFGP
  ----------------------------  ---------  -------------------
  Total                         2.441ns    (1.855ns logic, 0.586ns route)
                                           (76.0% logic, 24.0% route)
--------------------------------------------------------------------------------

Timing summary:
---------------
Timing errors: 135  Score: 55289
Constraints cover 27511 paths, 0 nets, and 4835 connections (92.1% coverage)
________________________________________________________________________________

Locations of various resources:
FRAME#: pin 23
IRDY#: pin 24
AD[2]: pin 62
PCI_IP_Core_Instance_I_25_LUT_7: CLB_R12C1.s1
PCI_IP_Core_Instance_I_XXL_1357_1: CLB_R12C2
PCI_IP_Core_Instance_I_125_LUT_17: CLB_R23C9.s0
PCI_IP_Core_Instance_I__n0055: CLB_R24C9.s0

Input signals other than FRAME# and IRDY# are all meeting the Tsu < 7 ns requirement, and because I now figured out how to use IOB FFs, I can easily meet Tval < 11 ns (Tco) for all output signals. I am using Xilinx ISE WebPack 4.1 (which doesn't come with FPGA Editor), and the PCI IP core is written in Verilog. The device I am targeting is the Xilinx Spartan-II 150K system gate speed grade -5 part (XC2S150-5CPQ208). I did meet all 33MHz PCI timings with the Spartan-II 150K system gate speed grade -6 part (XC2S150-6CPQ208) when I resynthesized the PCI IP core for the speed grade -6 part and basically reused the same UCF file with the floorplan (I had to make small modifications to the UCF file because some of the LUT names changed). The reason I really care about the Xilinx Spartan-II 150K system gate speed grade -5 part is that it is the chip on the PCI prototype board of the Insight Electronics Spartan-II Development Kit. Yes, I wish the PCI prototype board came with speed grade -6 . . .

Because I want the PCI IP core to be portable across different platforms (most notably Xilinx and Altera FPGAs), I am not really interested in making any vendor specific modifications to my Verilog RTL code, but I won't mind using various tricks in the .UCF file (for Xilinx) or .ACF file (I believe that is the Altera equivalent of the Xilinx .UCF file). Here are some solutions I came up with.

1) Reduce the signal fanout (currently at 35 globally, but FRAME# and IRDY#'s fanout are 200. What number should I reduce the global fanout to?).
2) Use USELOWSKEWLINES in a UCF file (already tried on some long routings, but it didn't seem to help. I will try to play around with this option a little more with different signals.).
3) Floorplan all the LUTs and FFs on the FPGA (currently, I only floorplanned the LUTs that violated Tsu, and most of them take inputs from FRAME# and IRDY#.).
4) Use Guide file Leverage mode in Map and Par.
5) Try routing my design 2000 times (that will take several days . . . I once routed my design about 20 times; Par seems to get stuck in a certain Timing Score range beyond 20 iterations.).
6) Pay for ISE Foundation 4.1 (I don't want to pay for tools because I am poor), and use FPGA Editor (I wish ISE WebPack came with FPGA Editor.). At least from FPGA Editor, I can see how the signals are actually getting routed.
7) Use a different synthesis tool other than XST (I am poor, so I doubt that I can afford one.).

I would like to hear from anyone who can comment on the solutions I just wrote, or has other suggestions on what I can do to reduce the delays to meet 33MHz PCI's Tsu < 7 ns requirement.

Thanks,

Kevin Brace (don't respond to me directly, respond within the newsgroup)

P.S. Considering that I am struggling to meet 33MHz PCI timings with Spartan-II speed grade -5, how come Xilinx can meet 66MHz PCI timings on Virtex/Spartan-II speed grade -6? (I can only barely meet 33MHz PCI timings with Spartan-II speed grade -6 using Floorplanner.) Is it possible to move a signal through an input pin like FRAME# or IRDY# (pin 23 and pin 24, respectively, for Spartan-II PQ208), go through a few levels of LUTs, and reach a far away IOB output FF and tri-state control FF like pin 67 (AD[0]) or pin 203 (AD[31]) in 5 ns? (3 ns + 1.9 to 2 ns natural clock skew = 4.9 ns to 5.0 ns realistic Tsu) Can a signal move that fast on Virtex/Spartan-II speed grade -6? (I sort of doubt it from my experience.) I know that Xilinx uses the special IRDY and TRDY pins in LogiCORE PCI, but that won't seem to help FRAME#, since FRAME# has to be sampled unregistered to determine the end of a burst transfer.
What kind of tricks is Xilinx using in their LogiCORE PCI other than the special IRDY and TRDY pins? Does anyone know?
Article: 37589
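For concreteness, the constraint mechanics described in this post look roughly like the following UCF fragment. The OFFSET lines are copied from the timing report above; the LOC and USELOWSKEWLINES lines are my sketch of the hand-placement and routing tricks mentioned, using the instance names and slice from the report, and their exact syntax should be checked against the Constraints Guide for your ISE version:

```
# Input setup constraints, as echoed in the timing report:
COMP "frame_n" OFFSET = IN 7 nS BEFORE COMP "clk" ;
COMP "irdy_n"  OFFSET = IN 7 nS BEFORE COMP "clk" ;

# Hand placement of one critical LUT into a specific slice:
INST "PCI_IP_Core_Instance_I_25_LUT_7" LOC = "CLB_R12C1.S1" ;

# Request low-skew routing resources for a high-fanout net:
NET "frame_n_IBUF" USELOWSKEWLINES ;
```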
I don't claim to be an expert at all, but according to this EE Times article, if the IP core came from an FPGA vendor like Xilinx or Altera, you are pretty much stuck with their devices, unless the FPGA vendor offers a conversion service (like Altera's HardCopy (started recently) or Xilinx's HardWire (which, if I am correct, was discontinued in 1999)).

http://www.eetimes.com/story/OEG20010907S0103

Another piece of bad news about the conversion service is that Clear Logic recently lost a key ruling against Altera.

http://www.altera.com/corporate/press_box/releases/corporate/pr-wins_clear_logic.html

I find the ruling somewhat troubling because, assuming no Altera-made IP is included in the customer's design, should anyone have any control over the bit stream file you generated from Altera's software? I suppose what Altera wants to say is that because the customer had to agree to a license prior to using Altera software (like MAX+PLUS II or Quartus), the customer has to use the generated bit stream file in a way agreed to in the software licensing agreement. However, Clear Logic recently won a patent on their business model of converting a bit stream file directly to an ASIC, and that business model seems to be very similar to Altera's HardCopy, so I expect Clear Logic to sue Altera soon.

http://www.ebnews.com/story/OEG20011108S0031

So, seeing that IP cores from FPGA vendors have strings attached to them, I think it will be safer to use a third party (non-device-vendor) IP core if FPGA-to-ASIC conversion is part of the requirements of your application.

Kevin Brace (don't respond to me directly, respond within the newsgroup)

arlington_sade@yahoo.com (arlington) wrote in message news:<63d93f75.0112160047.77f9982e@posting.google.com>...
> Hello all,
>
> If you were to use IP cores, such as Logicore/Alliance cores from
> Xilinx, Megafunction cores from Altera, Inventra cores from Mentor,
> etc. how can you get the RTL verilog/VHDL when you want to convert to
> ASIC ?
>
> Thanks.
Article: 37590
Hi Jay... Excuse the question: what is n-bit modulo multiplication? I'm reasonably well experienced at FPGAs and logic, and have never knowingly used n-bit modular mults, so I don't appreciate the difficulty of working with n=128 or more. When I do, I may be able to answer your questions.

Eric Pearson

"Jay Berg" <admin@eCompute.org> wrote in message news:3c1cfff8$0$34821$9a6e19ea@news.newshosting.com...
> After making the mistake of getting involved in the current ECCp109
> distributed computing project (see URL below), I'm now casting around to
> determine if there's a possibility of finding a PCI board with an FPGA
> co-processor capable of handling a small set of modular math functions.
>
> http://www.nd.edu/~cmonico/eccp109/
>
> The main points to be aware of are:
> 1. The client requires 128-bit integer math.
> 2. The client uses modular math for nearly all of the math functions.
> 3. The majority of compute time is spent in a single function that
> performs 128-bit modulo multiplication.
> 4. This project will move to the next challenge following completion.
> The next project will be a 131-bit challenge, requiring a word size
> larger than 128-bits).
>
> If there existed an FPGA based PCI card capable of doing 128-bit modulo
> multiplication, I would be very interested. But after investing a week of
> searching, I'm unable to find an off the shelf solution, or the IP core to
> provide this capability.
>
> Q1: Does anyone know of an existing FPGA based PCI card capable of 128-bit
> integer modulo multiplication?
>
> Q2: Does anyone know of existing FPGA IP that supports 128-bit integer
> modulo multiplication?
>
> Q3: If Q1 and Q2 are both no, would there be anyone interested in creating
> such a core, or possibly even a PCI board to accomplish 128-bit integer
> modulo multiplication? I would be willing to sponsor development costs. But
> I'm not flush enough to pay for the labor (time) involved.
> > Note that I've multiple decades of SW development background (C and > assembler) with very minimal HW background. And this includes zero FPGA > development experience. But given my SW experience, I can easily build any > driver(s) and do the client porting. > > BTW, the current client executes approximately 190,000 iterations per second > on a 450p2. This includes several (1-3) modulo multiplications per > iteration. > > Jay Berg > jberg@eCompute.org > >Article: 37591
Let me see if I can explain this. But given that I'm not a math expert, bear with me.

Modulo math (also known as "clock arithmetic") can be thought of as using remainders. Imagine the following numbers:

17 32 65

Now if we reduce these numbers modulo 10, they become:

7 2 5

The idea is that following the multiplication, the remainder is calculated modulo N. To do this, you can divide the result of the multiplication by N; the remainder is the modular result.

7 * 5 = 35
(7 * 5) mod 10 = 5

Since the need is for 128-bit multiplication (128x128=256), the result of the multiplication can be 256 bits in size. Following the multiplication, the 256-bit result is reduced by the modulus value N. This translates the result into a number between 0 and (N-1). With the assumption that N is 128 bits (or less), the final result of the modulo multiplication will be 128 bits (or smaller).

There are numerous methods to compute modulo remainders. But the simplest is to envision a division with remainder, where the remainder is the desired result - with the quotient being discarded. Remember also that all numbers are in integer form.

result = (X*Y) mod N

That's what I'm trying to achieve. The value of 'result'.

Jay Berg

"Eric Pearson" <ecp@mgl.ca> wrote in message news:u1quq4p7nq0l86@corp.supernews.com...
> Hi Jay...
>
> Excuse the question: what is n-bit modulo multiplication? I'm a resonably
> well experience at fpga's and logic, and have never knowingly used n-bit
> modular mults so I don't appreciate the difficulty of working with n=128
> or more. When I do, I may be able to answer your questions.
>
> Eric Pearson
>
> "Jay Berg" <admin@eCompute.org> wrote in message
> news:3c1cfff8$0$34821$9a6e19ea@news.newshosting.com...
> > After making the mistake of getting involved in the current ECCp109 > > distributed computing project (see URL below), I'm now casting around to > > determine if there's a possibility of finding a PCI board with an FPGA > > co-processor capable of handling a small set of modular math functions. > > > > http://www.nd.edu/~cmonico/eccp109/ > > > > The main points to be aware of are: > > 1. The client requires 128-bit integer math. > > 2. The client uses modular math for nearly all of the > > math functions. > > 3. The majority of compute time is spent in a single > > function that performs 128-bit modulo multiplication. > > 4. This project will move to the next challenge following > > completion. The next project will be a 131-bit challenge, > > requiring a word size larger than 128-bits). > > > > If there existed an FPGA based PCI card capable of doing 128-bit modulo > > multiplication, I would be very interested. But after investing a week of > > searching, I'm unable to find an off the shelf solution, or the IP core to > > provide this capability. > > > > Q1: Does anyone know of an existing FPGA based PCI card capable of 128-bit > > integer modulo multiplication? > > > > Q2: Does anyone know of existing FPGA IP that supports 128-bit integer > > modulo multiplication? > > > > Q3: If Q1 and Q2 are both no, would there be anyone interested in creating > > such a core, or possibly even a PCI board to accomplish 128-bit integer > > modulo multiplication? I would be willing to sponsor development costs. > But > > I'm not flush enough to pay for the labor (time) involved. > > > > Note that I've multiple decades of SW development background (C and > > assembler) with very minimal HW background. And this includes zero FPGA > > development experience. But given my SW experience, I can easily build any > > driver(s) and do the client porting. > > > > BTW, the current client executes approximately 190,000 iterations per > second > > on a 450p2. 
This includes several (1-3) modulo multiplications per > > iteration. > > > > Jay Berg > > jberg@eCompute.org > > > > > >Article: 37592
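The clock-arithmetic explanation above translates directly into software; Python's arbitrary-precision integers make the 128-bit case a one-liner, which is handy as a golden reference model for any hardware implementation. The specific operand values below are chosen arbitrarily for illustration:

```python
def mod_mul(a, b, n):
    """(a * b) mod n; the intermediate product may be up to twice the word width."""
    return (a * b) % n

# The worked example from the post: 7 * 5 = 35, and 35 mod 10 = 5.
assert mod_mul(7, 5, 10) == 5

# A 128-bit-scale case: the product is ~256 bits, the reduced result under 128.
a = (1 << 127) + 12345
b = (1 << 126) + 67890
n = (1 << 128) - 159                # an arbitrary 128-bit modulus
q, r = divmod(a * b, n)             # division with remainder; quotient discarded
assert mod_mul(a, b, n) == r and 0 <= r < n
```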
That sounds like fun. Tell us more about the number you're taking all this modulo to. Is it a constant, a rarely changing parameter, or a variable? It would be very handy if it were a power of two, but is there any other form it has to have? For the general problem, the worst part is the division. But there are some cute tricks for division by constants, particularly when you want only the remainder. You could improve the algorithm in that area. Also, 128-bit arithmetic is into the region where FPGAs' ripple carries are slower than more complicated carry schemes. Carl -- Posted from firewall.terabeam.com [216.137.15.2] via Mailgate.ORG Server - http://www.Mailgate.ORGArticle: 37593
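One standard example of the "cute tricks for division by constants" mentioned above is Barrett reduction: precompute a scaled reciprocal of the fixed modulus once, then replace every runtime division by a multiply, a shift, and at most two subtractions. This is the textbook algorithm, not something from the thread, and the 109-bit modulus below is an arbitrary illustration:

```python
def barrett_setup(n):
    """One-time precomputation for a fixed modulus n."""
    k = n.bit_length()
    mu = (1 << (2 * k)) // n   # scaled reciprocal, floor(4**k / n)
    return k, mu

def barrett_reduce(x, n, k, mu):
    """x mod n for 0 <= x < n**2, with no runtime division."""
    q = (x * mu) >> (2 * k)    # estimate of x // n, low by at most 2
    r = x - q * n
    while r >= n:              # at most two corrective subtractions
        r -= n
    return r

# Demo with a 109-bit modulus (value chosen arbitrarily for illustration).
N = (1 << 109) + 7
K, MU = barrett_setup(N)
```

In hardware, the attraction is that the wide division disappears entirely; only multipliers, shifts, and subtractors remain, all of which pipeline well.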
For each project, the modulus N value remains constant. But following the completion of the current project, another project will be starting with a new N. To complicate the matter, the next project will probably require word widths in excess of 128 bits. But to answer your question, once initialized, the N value remains constant through all iterations. Only A and B must be reloaded, with only the modulus result needing to be read.

Remember that the result of A*B will be 256 bits, since A and B are both 128 bits in size. Only after reduction by the modulus value N will the result diminish to 128 bits. So think of it as:

(A * B) mod N = result

Where A, B, N, and result are all 128 bits, but with the intermediate value of the multiplication being 256 bits. Thus:

(128 bits x 128 bits) = 256 bits
(256 bits MOD 128 bits) = 128 bits

And as you point out, the MOD value can be computed as the remainder of the division of (A*B) by N, so the result will range from 0 to N-1.

Jay Berg
jberg@eCompute.org

"Carl Brannen" <carl.brannen@terabeam.com> wrote in message news:891312b39c3743b0261e9a27121dc2e9.51709@mygate.mailgate.org...
> That sounds like fun. Tell us more about the number you're taking all this
> modulo to. Is it a constant, a rarely changing parameter, or a variable? It
> would be very handy if it were a power of two, but is there any other form it
> has to have?
>
> For the general problem, the worst part is the division. But there are some
> cute tricks for division by constants, particularly when you want only the
> remainder. You could improve the algorithm in that area.
>
> Also, 128-bit arithmetic is into the region where FPGAs' ripple carries are
> slower than more complicated carry schemes.
>
> Carl
>
> --
> Posted from firewall.terabeam.com [216.137.15.2]
> via Mailgate.ORG Server - http://www.Mailgate.ORG
Article: 37594
I had someone point out to me that I made no mention as to the required speed. Each iteration of the SW requires 3-4 modulo multiplications. On a 450p2 system using SW only, the current SW achieves approximately 190,000 iterations per second. This equates to approximately 632,700 modulo multiplications per second. On an (overclocked) 975p3, it equates to approximately 1,332,000 iterations per second. Therefore I would hope to achieve at least 600,000 iterations per second (2 million modulo multiplications per second). Note that this assumes some degree of SW time with the modulo multiplications occurring upon demand. Sorry, I'm a coding pig and not a HW type. So bear with me on these approximations of performance. Jay Berg "Jay Berg" <admin@eCompute.org> wrote in message news:3c1cfff8$0$34821$9a6e19ea@news.newshosting.com... > After making the mistake of getting involved in the current ECCp109 > distributed computing project (see URL below), I'm now casting around to > determine if there's a possibility of finding a PCI board with an FPGA > co-processor capable of handling a small set of modular math functions. > > http://www.nd.edu/~cmonico/eccp109/ > > The main points to be aware of are: > 1. The client requires 128-bit integer math. > 2. The client uses modular math for nearly all of the > math functions. > 3. The majority of compute time is spent in a single > function that performs 128-bit modulo multiplication. > 4. This project will move to the next challenge following > completion. The next project will be a 131-bit challenge, > requiring a word size larger than 128-bits). > > If there existed an FPGA based PCI card capable of doing 128-bit modulo > multiplication, I would be very interested. But after investing a week of > searching, I'm unable to find an off the shelf solution, or the IP core to > provide this capability. > > Q1: Does anyone know of an existing FPGA based PCI card capable of 128-bit > integer modulo multiplication? 
> > Q2: Does anyone know of existing FPGA IP that supports 128-bit integer > modulo multiplication? > > Q3: If Q1 and Q2 are both no, would there be anyone interested in creating > such a core, or possibly even a PCI board to accomplish 128-bit integer > modulo multiplication? I would be willing to sponsor development costs. But > I'm not flush enough to pay for the labor (time) involved. > > Note that I've multiple decades of SW development background (C and > assembler) with very minimal HW background. And this includes zero FPGA > development experience. But given my SW experience, I can easily build any > driver(s) and do the client porting. > > BTW, the current client executes approximately 190,000 iterations per second > on a 450p2. This includes several (1-3) modulo multiplications per > iteration. > > Jay Berg > jberg@eCompute.org > >
Article: 37595
Hello, On a typical PCI FPGA board, it is likely that your performance is limited by the PCI bandwidth rather than by the FPGA processing power. Assuming N fixed, you need 3*128 bits (48 bytes: 2 Wr, 1 Rd) of I/O per multiplication. If you use PCI MMAP I/Os, you will hardly get more than 15 Mbytes/sec between the host and the board. This puts a bound on your achievable performance (15/48*10^6 ~ 310,000 Mul/sec), which is less than what you already get in software. If your algorithm has no data dependencies between the different multiplication results (which I doubt), you could use blocked I/O (or DMA) operations, and maybe reach 60-80 Mbytes/sec, but even then you would get at most about 1.5 million multiplications per second, still short of your 2 million target. The only solution would be to implement a larger part of the algorithm (like a whole loop nest) on the FPGA board, which is much more difficult (unless your algorithm is very regular and requires little control), but this generally reduces the amount of I/O operations on the PCI bus. Steven Jay Berg wrote: > I had someone point out to me that I made no mention as to the required > speed. > > Each iteration of the SW requires 3-4 modulo multiplications. On a 450p2 > system using SW only, the current SW achieves approximately 190,000 > iterations per second. This equates to approximately 632,700 modulo > multiplications per second. On an (overclocked) 975p3, it equates to > approximately 1,332,000 iterations per second. > > Therefore I would hope to achieve at least 600,000 iterations per second (2 > million modulo multiplications per second). Note that this assumes some > degree of SW time with the modulo multiplications occurring upon demand. > > Sorry, I'm a coding pig and not a HW type. So bear with me on these > approximations of performance. > > Jay Berg > > "Jay Berg" <admin@eCompute.org> wrote in message > news:3c1cfff8$0$34821$9a6e19ea@news.newshosting.com...
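[Editor's note: the back-of-envelope bus bound is worth redoing explicitly. Each multiplication moves two 128-bit operands in and one 128-bit result out, i.e. 3 x 16 = 48 bytes across the bus; the transfer rates below are the rough figures from the post, not measurements.]

```python
bytes_per_mul = 3 * (128 // 8)     # write A, write B, read result: 48 bytes

mmap_rate = 15e6                   # ~15 Mbytes/sec, memory-mapped PCI I/O
dma_rate = 70e6                    # ~60-80 Mbytes/sec, blocked I/O or DMA

mmap_bound = mmap_rate / bytes_per_mul   # ~312,000 mul/sec: below the
                                         # ~632,700 mul/sec the SW already gets
dma_bound = dma_rate / bytes_per_mul     # ~1.46 million mul/sec: still short
                                         # of Jay's 2 million mul/sec target
assert mmap_bound < 632_700
assert dma_bound < 2_000_000
```

This is why moving a whole loop nest onto the FPGA, rather than single multiplications, is the only way the board pays off: it cuts the per-result PCI traffic.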
> > After making the mistake of getting involved in the current ECCp109 > > distributed computing project (see URL below), I'm now casting around to > > determine if there's a possibility of finding a PCI board with an FPGA > > co-processor capable of handling a small set of modular math functions. > > > > http://www.nd.edu/~cmonico/eccp109/ > > > > The main points to be aware of are: > > 1. The client requires 128-bit integer math. > > 2. The client uses modular math for nearly all of the > > math functions. > > 3. The majority of compute time is spent in a single > > function that performs 128-bit modulo multiplication. > > 4. This project will move to the next challenge following > > completion. The next project will be a 131-bit challenge, > > requiring a word size larger than 128-bits). > > > > If there existed an FPGA based PCI card capable of doing 128-bit modulo > > multiplication, I would be very interested. But after investing a week of > > searching, I'm unable to find an off the shelf solution, or the IP core to > > provide this capability. > > > > Q1: Does anyone know of an existing FPGA based PCI card capable of 128-bit > > integer modulo multiplication? > > > > Q2: Does anyone know of existing FPGA IP that supports 128-bit integer > > modulo multiplication? > > > > Q3: If Q1 and Q2 are both no, would there be anyone interested in creating > > such a core, or possibly even a PCI board to accomplish 128-bit integer > > modulo multiplication? I would be willing to sponsor development costs. > But > > I'm not flush enough to pay for the labor (time) involved. > > > > Note that I've multiple decades of SW development background (C and > > assembler) with very minimal HW background. And this includes zero FPGA > > development experience. But given my SW experience, I can easily build any > > driver(s) and do the client porting. > > > > BTW, the current client executes approximately 190,000 iterations per > second > > on a 450p2. 
This includes several (1-3) modulo multiplications per > > iteration. > > > > Jay Berg > > jberg@eCompute.org > > > >
Article: 37596
I have just noticed that there exists some work on ECC implementation on FPGAs. http://citeseer.nj.nec.com/leung00fpga.html Good luck Steven Jay Berg wrote: > After making the mistake of getting involved in the current ECCp109 > distributed computing project (see URL below), I'm now casting around to > determine if there's a possibility of finding a PCI board with an FPGA > co-processor capable of handling a small set of modular math functions. > > http://www.nd.edu/~cmonico/eccp109/ > > The main points to be aware of are: > 1. The client requires 128-bit integer math. > 2. The client uses modular math for nearly all of the > math functions. > 3. The majority of compute time is spent in a single > function that performs 128-bit modulo multiplication. > 4. This project will move to the next challenge following > completion. The next project will be a 131-bit challenge, > requiring a word size larger than 128-bits). > > If there existed an FPGA based PCI card capable of doing 128-bit modulo > multiplication, I would be very interested. But after investing a week of > searching, I'm unable to find an off the shelf solution, or the IP core to > provide this capability. > > Q1: Does anyone know of an existing FPGA based PCI card capable of 128-bit > integer modulo multiplication? > > Q2: Does anyone know of existing FPGA IP that supports 128-bit integer > modulo multiplication? > > Q3: If Q1 and Q2 are both no, would there be anyone interested in creating > such a core, or possibly even a PCI board to accomplish 128-bit integer > modulo multiplication? I would be willing to sponsor development costs. But > I'm not flush enough to pay for the labor (time) involved. > > Note that I've multiple decades of SW development background (C and > assembler) with very minimal HW background. And this includes zero FPGA > development experience. But given my SW experience, I can easily build any > driver(s) and do the client porting.
> > BTW, the current client executes approximately 190,000 iterations per second > on a 450p2. This includes several (1-3) modulo multiplications per > iteration. > > Jay Berg > jberg@eCompute.org
Article: 37597
Jay, I'd be stunned if a typical million gate FPGA like an XCV1000 didn't beat any home computer by a factor of 1000 in performing these calculations. I'm going to try and figure out what's actually going on in this link: http://www.certicom.com/research/ch32.html In the unlikely event that I do figure it out, I'll post a performance estimate here. Carl -- Posted from firewall.terabeam.com [216.137.15.2] via Mailgate.ORG Server - http://www.Mailgate.ORG
Article: 37598
There's another set of ECC challenges, but those use F_2^m/P instead of F_p/P. The arithmetic in that field would be a lot easier on an FPGA than arithmetic on F_p/P because carries are eliminated. Carl P.S. I said that an XCV1000 should beat a home computer by a factor of 1000. That would assume some very good FPGA design work. A factor of 100 should be fairly easy to achieve. -- Posted from firewall.terabeam.com [216.137.15.2] via Mailgate.ORG Server - http://www.Mailgate.ORG
Article: 37599
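[Editor's note: the carry-free arithmetic Carl refers to is polynomial multiplication over GF(2): the shift-and-add of an ordinary multiplier becomes shift-and-XOR, so no carry chain is needed at all, which is why it maps so well onto FPGA LUTs. A quick reference model in Python, a sketch rather than code from the challenge.]

```python
def clmul(a, b):
    # Carry-less multiply: GF(2) polynomials packed into integer bits.
    # Schoolbook shift-and-add with XOR replacing addition (no carries).
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

# (x + 1)^2 = x^2 + 1 in GF(2)[x]: the cross terms cancel, nothing carries.
assert clmul(0b11, 0b11) == 0b101
# (x^2 + x + 1)(x^2 + 1) = x^4 + x^3 + x + 1
assert clmul(0b111, 0b101) == 0b11011
```

A full F_2^m field multiply then reduces the 2m-bit product modulo the field polynomial P, again with XORs only.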
Hello, This might be slightly off-topic, but I guess several people in this NG have had to face this kind of problem. We are using a BurchEd board, with a parallel port download cable. However, because we need to communicate with the board once it is configured, we use two parallel ports: the one on the motherboard (for communication in EPP mode) and another one connected on a PCI parallel port extension board (using the netmos 9705 chip) for configuration*. The PCI // port does not work properly when it comes to configuring the FPGA board (I managed to make it work for a week or so, but now for some mysterious reason the FPGA DONE signal does not behave correctly). BTW configuration with the motherboard // port works fine. The general PCI // port behavior is correct (checked by feeding the CTRL signals back to STATUS), so I really don't understand where this problem is coming from. Has anybody faced the same kind of problem? * We have no choice, since the PCI board does not seem to allow anything other than SPP. Thank you for your help, Steven