I apply a set of multi-cycle constraints to a module and it works fine, in both the timing analyzer and timing simulation. Then I incorporate this module into a larger design and apply the same constraints again. This time the timing analyzer reports OK, but the timing sim is wrong. Any idea how to solve the problem? Thanks. Ron
Article: 37576
Hello all, If you were to use IP cores, such as Logicore/Alliance cores from Xilinx, Megafunction cores from Altera, Inventra cores from Mentor, etc. how can you get the RTL verilog/VHDL when you want to convert to ASIC ? Thanks.Article: 37577
Anyone tried Xilinx "Modular Design tool" ? Any good / bad experience to share ? Thanks, Rotem Gazit Design Engineer High-speed board & FPGA design MystiCom LTD mailto:rotemg@mysticom.com http://www.mysticom.com/Article: 37578
Hi, I'm currently trying to adjust the Leonardo Spectrum OEM Edition from ALTERA's website to my needs. As I have a "smaller" screen, I'd like to reduce the font in the HDL Editor (opened with a click on the VHDL file) to 8pt Courier New. I had good experience with this size in my Max+PlusII designs. Unfortunately, I didn't manage to find any configuration for this editor (just for the information window) except adding line numbers... Any help appreciated, Carlhermann Schlehaus
Article: 37579
I'm not sure if people are doing this already, but I couldn't find a reference on the Xilinx web site. Block RAMs make more efficient squaring circuits than they do multipliers, and you can get multipliers out of squarers.

An explanation of the arithmetic. Let A and B be the numbers to be multiplied.

Compute C = (A+B) * (A+B) = A**2 + 2AB + B**2
Compute D = (A-B) * (A-B) = A**2 - 2AB + B**2
Then C - D = 4AB.

This is particularly efficient in Xilinx Spartan2, Virtex, and Virtex2 architectures because the block RAM is dual port. That means you can use one side for the (A-B)**2 calculation and the other side for the (A+B)**2 calculation.

With the Xilinx Spartan2, Virtex or VirtexE, use the RAMB4_16_16. It has 8 inputs and 16 outputs in two sections. Each section can conveniently compute the square of an 8-bit number. Note that the lowest two bits of the two squares are going to have to be equal (i.e. C-D = 4AB, so C and D have to match in two bits), so you don't have to subtract bits 1 and 0 of the two squares.

If "A" and "B" are both 7-bit, their sum will be no worse than 8-bit, so you can compute a 7x7 multiply using only the 8 LUTs for each of "A+B" and "A-B", and another 14 LUTs for the result, a total of 30 LUTs (i.e. 15 slices) and one block RAM. Maybe there's a way to get the bit back and let A and B be 8-bit numbers; I haven't looked at it long enough to conclude there isn't.

The circuit uses about half the LUTs required by the standard algorithm, at the expense of one block RAM. Similarly, you should be able to get a 15x15 multiplier with around 62 LUTs (31 slices) and four RAMB4_8_8 block RAMs. In addition, these multipliers are naturally pipelined with no need to register low results.
To put the LUT utilization in perspective, the Xilinx 8x8 multiply takes 39 slices, while the 16x16 takes 143: http://www.xilinx.com/ipcenter/reference_designs/vmult/vmult_v1_4.pdf Using RAMB4s alone to implement even a 7x7 multiply would require 28 of them, though you could reduce that somewhat by being properly sneaky... Carl -- Posted from firewall.terabeam.com [216.137.15.2] via Mailgate.ORG Server - http://www.Mailgate.ORGArticle: 37580
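The quarter-square arithmetic above is easy to sanity-check in software. The sketch below is a hypothetical software model (not Xilinx code): the list SQ stands in for one block-RAM section preloaded with x**2 for every 8-bit input.

```python
# Model of the quarter-square trick: one 8-bit-in / 16-bit-out
# block-RAM section per port, preloaded with x**2.
SQ = [x * x for x in range(256)]

def quarter_square_mult(a, b):
    """7x7 multiply via two squaring lookups: C - D = 4ab."""
    assert 0 <= a < 128 and 0 <= b < 128  # 7-bit operands keep a+b within 8 bits
    c = SQ[a + b]          # (a+b)**2 from one RAM port
    d = SQ[abs(a - b)]     # (a-b)**2 from the other port
    return (c - d) >> 2    # the low two bits of c and d always cancel
```

Exhaustively checking all 7-bit operand pairs against ordinary multiplication confirms the identity.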
I'm not sure if people are doing this already, but I couldn't find a reference on the Xilinx web site. Block RAMs make more efficient squaring circuits than they do multipliers, and you can get multipliers out of squarers.

An explanation of the arithmetic. Let A and B be the numbers to be multiplied.

Compute C = (A+B) * (A+B) = A**2 + 2AB + B**2
Compute D = (A-B) * (A-B) = A**2 - 2AB + B**2
Then C - D = 4AB.

This is particularly efficient in Xilinx Spartan2, Virtex, and Virtex2 architectures because the block RAM is dual port. That means you can use one side for the (A-B)**2 calculation and the other side for the (A+B)**2 calculation.

With the Xilinx Spartan2, Virtex or VirtexE, use the RAMB4_16_16. It has 8 inputs and 16 outputs in two sections. Each section can conveniently compute the square of an 8-bit number. Note that the lowest two bits of the two squares are going to have to be equal (i.e. C-D = 4AB, so C and D have to match in two bits), so you don't have to subtract bits 1 and 0 of the two squares.

If "A" and "B" are both 7-bit, their sum will be no worse than 8-bit, so you can compute a 7x7 multiply using only the 8 LUTs for each of "A+B" and "A-B", and another 14 LUTs for the result, a total of 30 LUTs (i.e. 15 slices) and one block RAM. Maybe there's a way to get the bit back and let A and B be 8-bit numbers; I haven't looked at it long enough to conclude there isn't.

The circuit uses about half the LUTs required by the standard algorithm, at the expense of one block RAM. To put the LUT utilization in perspective, the Xilinx 8x8 multiply takes 39 slices: http://www.xilinx.com/ipcenter/reference_designs/vmult/vmult_v1_4.pdf

Using RAMB4s alone to implement even a 7x7 multiply would require a huge number of them, as multiplies require twice as many address inputs as squares.

You can iterate on the calculation of the square. That is, if A is too big to square in a single operation, then break A into two parts.
With A broken into two parts, say A = AH + AL, you can compute AH**2, AL**2 with block RAM, and compute 2*AH*AL by computing the difference between (AH+AL)**2 and (AH-AL)**2. Breaking A and B into more than 3 parts may be worth exploring, for certain bit sizes. Carl -- Posted from firewall.terabeam.com [216.137.15.2] via Mailgate.ORG Server - http://www.Mailgate.ORGArticle: 37581
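The split-operand squaring can be modelled the same way. One detail worth noting: if AH is taken as the shifted high part, then AH + AL is just A again, so the sketch below (my assumption about the intended scheme, not the poster's exact circuit) squares the unshifted halves and applies the binary weights explicitly.

```python
# Squaring table wide enough for the sum of two 8-bit halves (indices up to 510).
SQ = [x * x for x in range(512)]

def wide_square(a, k=8):
    """Square a 2k-bit number using only narrow squaring lookups."""
    ah, al = a >> k, a & ((1 << k) - 1)                    # unshifted high/low halves
    twice_cross = (SQ[ah + al] - SQ[abs(ah - al)]) >> 1    # 2*ah*al via the identity
    # a**2 = ah**2 * 2**(2k)  +  2*ah*al * 2**k  +  al**2
    return (SQ[ah] << (2 * k)) + (twice_cross << k) + SQ[al]
```

This composes: each SQ lookup could itself be another level of the same decomposition for still wider operands.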
"S. Ramirez" <sramirez@cfl.rr.com> wrote in message news:AXvS7.84755$oj3.14571085@typhoon.tampabay.rr.com... > Did you check into this and is the special still going on? I'd like to > know, because things are always different in various parts of the > country/world. I'm in Florida, where are you? In the UK. I checked with our local Synplicity distributor (with whom I have a good working relationship), and the reply was: "...the promotion wasn't run outside the US, and expired at the end of November anyway". However, they went on to say: "If you are serious about another seat we're always happy to consider a deal!" So it sounds like some dickering / haggling is in order.... MH.Article: 37582
After making the mistake of getting involved in the current ECCp109 distributed computing project (see URL below), I'm now casting around to determine if there's a possibility of finding a PCI board with an FPGA co-processor capable of handling a small set of modular math functions.

http://www.nd.edu/~cmonico/eccp109/

The main points to be aware of are:
1. The client requires 128-bit integer math.
2. The client uses modular math for nearly all of the math functions.
3. The majority of compute time is spent in a single function that performs 128-bit modulo multiplication.
4. This project will move to the next challenge following completion. The next project will be a 131-bit challenge, requiring a word size larger than 128 bits.

If there existed an FPGA based PCI card capable of doing 128-bit modulo multiplication, I would be very interested. But after investing a week of searching, I'm unable to find an off-the-shelf solution, or the IP core to provide this capability.

Q1: Does anyone know of an existing FPGA based PCI card capable of 128-bit integer modulo multiplication?

Q2: Does anyone know of existing FPGA IP that supports 128-bit integer modulo multiplication?

Q3: If Q1 and Q2 are both no, would there be anyone interested in creating such a core, or possibly even a PCI board to accomplish 128-bit integer modulo multiplication? I would be willing to sponsor development costs, but I'm not flush enough to pay for the labor (time) involved.

Note that I have multiple decades of SW development background (C and assembler) with very minimal HW background, and this includes zero FPGA development experience. But given my SW experience, I can easily build any driver(s) and do the client porting.

BTW, the current client executes approximately 190,000 iterations per second on a 450p2. This includes several (1-3) modulo multiplications per iteration.

Jay Berg
jberg@eCompute.org
Article: 37583
On Sun, 16 Dec 2001 01:48:45 GMT, Peter Alfke <palfke@earthlink.net> wrote:

>Mohap wrote:
>
>> Why is the xilinx 4005xl 3.3v device? what if i want to output signals to an
>> external 5v device? is there something i can do to get around this problem?
>
>XC4005XL uses 3.3 V supply voltage because that cuts dynamic power consumption
>in half, compared to 5 V. Also, modern high-performance processes do not
>tolerate high voltages like 5 V.
>Today, 3.3 V is already obsolete; the most modern FPGAs use 1.5 V for the core
>logic, but retain 2.5 V tolerance on all outputs, 3.3 V on some.
>
>For new designs, 5 V is definitely out. It served us well from 1965 to 1995,
>for one third of the previous century, but its days are over. Rest In Peace!
>
>Now to your problem:
>The XC4005 inputs tolerate 5 V, if you select this (and thus disable the clamp
>diode to Vcc that is there because PCI demands it).
>Now you can drive the inputs up to 5.5 V.
>The outputs can obviously not drive higher than their own Vcc, and 3.3 V may be
>high enough for driving 5 V logic with so-called TTL input thresholds of ~1.5 V.
>(Forget the 2.4 V spec for Voh; that's a 30-year-old left-over from the days of
>bipolar TTL.)
>
>What if you have to drive 5-V logic that has a CMOS input threshold of up to 3.5 V?
>Then you need a pull-up resistor to the 5 V, and you should configure the
>XC4005 output as "open collector" (really: open drain). That costs you speed,
>since you now have a 1 kilohm pull-up and a, say, 100 pF load, which creates a
>100 ns delay time constant.
>There is a simple and clever way around that, and I can send you the circuit
>description on Monday when I am back at work.
>
>Peter Alfke, Xilinx Applications

Peter,

5 volts was actually a temporary aberration. We actually started with RTL at 3.6. We do 3.3V FPGA to 5V PECL with a pullup to +4.2; FPGA hi-Z becomes a PECL '1', and tristate ON of a hard logic HIGH becomes 3.3V = PECL '0'. Oh, please send me the clever circuit too.
Despam my email address or fax to 4-1-5/7-5-3-3-3-0-1. Thanks! JohnArticle: 37584
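As an aside, the 100 ns figure in the quoted post is just the RC time constant of the suggested pull-up arrangement:

```python
# RC time constant of an open-drain output with an external pull-up.
R = 1_000      # 1 kilohm pull-up resistor, in ohms
C = 100e-12    # 100 pF load, in farads
tau = R * C    # time constant in seconds: 100 ns
```

A stiffer (smaller) pull-up or a lighter load shortens this rise time proportionally.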
Russell Shaw wrote: > -- Pre Optimizing Design .work.sync_dpram_8_8.synth > -- Boundary optimization. > "E:/AAProjs/Bugs/Leonardo/main.vhd", line 34:Info, Inferred ram instance 'ix26409' of type > 'ram_dq_da_inclock_outclock_8_8_256' So it found the right module but . . . > -- optimize -target acex1 -effort quick -chip -area -hierarchy=auto > Using default wire table: STD-1 > Warning, Dual read ports not supported for FLEX/APEX/MERCURY RAMs; using default implementation. > Warning, using default ram implementation for ram_dq_da_inclock_outclock_8_8_256, run time can get large. . . . it refused to use it. Since acex1k is not in the unsupported list above, it's either a bug or a deliberate dumbing down of the oem version. Note that this ram is inferred properly with acex1k technology on the mentor version of leo. -- Mike TreselerArticle: 37585
"C.Schlehaus" wrote: > > Hi, > I'm currently trying to adjust the Leonardo Spectrum OEM > Edition from ALTERA's website to my needs. As I have a > "smaller" screen I'd like to size the font in the HDL > Editor (opened with click on the VHDL file) down to 8pt. > Courier New. I made good experience with this size with > my Max+PlusII designs. Unfortunately I didn't manage to > find any configuration for this editor (just for the > information window) except adding the line numbers... > > Any help appreciated, Carlhermann Schlehaus The leonardo editor is hopeless. I use an external editor (ultraedit), and an external compiler (vhdl-simili), before feeding anything into leonardo.Article: 37586
Mike Treseler wrote: > > Russell Shaw wrote: > > > -- Pre Optimizing Design .work.sync_dpram_8_8.synth > > -- Boundary optimization. > > "E:/AAProjs/Bugs/Leonardo/main.vhd", line 34:Info, Inferred ram instance 'ix26409' of type > > 'ram_dq_da_inclock_outclock_8_8_256' > > So it found the right module but . . . > > > -- optimize -target acex1 -effort quick -chip -area -hierarchy=auto > > Using default wire table: STD-1 > > Warning, Dual read ports not supported for FLEX/APEX/MERCURY RAMs; using default implementation. > > Warning, using default ram implementation for ram_dq_da_inclock_outclock_8_8_256, run time can get large. > > . . . it refused to use it. > > Since acex1k is not in the unsupported list above, > it's either a bug or a deliberate dumbing down > of the oem version. > > Note that this ram is inferred properly with > acex1k technology on the mentor version of leo. I've reported it as a bug/feature-request. I've worked out a 2x clk method to get simultaneous read and write address ports from a lpm_ram_dq.Article: 37587
Hello again. I'm curious to know if anyone out there knows where there are some examples of an SPI interface coded in VHDL. Just curious, as I have to code one in the near future and I always like to compare the various approaches taken by others. Thanks, Jason
Article: 37588
Hi, I would like to know if someone knows strategies for reducing routing (net) delays for Spartan-II. So far, I treated the synthesis tool (XST)/Map/Par as a black box, but because my design (a PCI IP core) was not meeting Tsu (Tsu < 7ns), I started to take a close look at how LUTs are placed on the FPGA. Using Floorplanner, I saw the LUTs being placed all over the FPGA, so I decided to hand place the LUTs using the UCF flow. That was the most effective thing I did to reduce interconnect delay (it reduced the worst interconnect delay by about 2.7 ns, from 11 ns down to 8.3 ns), but unfortunately, I still have to reduce the interconnect delay by another 1.3 ns (worst Tsu currently at 8.3 ns). Basically, I have two input signals, FRAME# and IRDY#, that are not meeting timing. Here are two of the worst violators for FRAME# and IRDY#, respectively.

________________________________________________________________________________
================================================================================
Timing constraint: COMP "frame_n" OFFSET = IN 7 nS BEFORE COMP "clk" ;
503 items analyzed, 61 timing errors detected.
Minimum allowable offset is 8.115ns.
--------------------------------------------------------------------------------
Slack:                  -1.115ns (requirement - (data path - clock path - clock arrival))
Source:                 frame_n
Destination:            PCI_IP_Core_Instance_ad_Port_2
Destination Clock:      clk_BUFGP rising at 0.000ns
Requirement:            7.000ns
Data Path Delay:        10.556ns (Levels of Logic = 6)
Clock Path Delay:       2.441ns (Levels of Logic = 2)

Timing Improvement Wizard
Data Path: frame_n to PCI_IP_Core_Instance_ad_Port_2
  Delay type                    Delay(ns)  Logical Resource(s)
  ----------------------------  ---------  -------------------
  Tiopi                         1.224      frame_n
                                           frame_n_IBUF
  net (fanout=45)               0.591      frame_n_IBUF
  Tilo                          0.653      PCI_IP_Core_Instance_I_25_LUT_7
  net (fanout=3)                0.683      N21918
  Tbxx                          0.981      PCI_IP_Core_Instance_I_XXL_1357_1
  net (fanout=15)               2.352      PCI_IP_Core_Instance_I_XXL_1357_1
  Tilo                          0.653      PCI_IP_Core_Instance_I_125_LUT_17
  net (fanout=1)                0.749      PCI_IP_Core_Instance_N3059
  Tilo                          0.653      PCI_IP_Core_Instance_I__n0055
  net (fanout=1)                0.809      PCI_IP_Core_Instance_N3069
  Tioock                        1.208      PCI_IP_Core_Instance_ad_Port_2
  ----------------------------  ---------  -------------------
  Total                         10.556ns   (5.372ns logic, 5.184ns route)
                                           (50.9% logic, 49.1% route)

Clock Path: clk to PCI_IP_Core_Instance_ad_Port_2
  Delay type                    Delay(ns)  Logical Resource(s)
  ----------------------------  ---------  -------------------
  Tgpio                         1.082      clk
                                           clk_BUFGP/IBUFG
  net (fanout=1)                0.007      clk_BUFGP/IBUFG
  Tgio                          0.773      clk_BUFGP/BUFG
  net (fanout=423)              0.579      clk_BUFGP
  ----------------------------  ---------  -------------------
  Total                         2.441ns    (1.855ns logic, 0.586ns route)
                                           (76.0% logic, 24.0% route)
--------------------------------------------------------------------------------
================================================================================
Timing constraint: COMP "irdy_n" OFFSET = IN 7 nS BEFORE COMP "clk" ;
698 items analyzed, 74 timing errors detected.
Minimum allowable offset is 8.290ns.
--------------------------------------------------------------------------------
Slack:                  -1.290ns (requirement - (data path - clock path - clock arrival))
Source:                 irdy_n
Destination:            PCI_IP_Core_Instance_ad_Port_2
Destination Clock:      clk_BUFGP rising at 0.000ns
Requirement:            7.000ns
Data Path Delay:        10.731ns (Levels of Logic = 6)
Clock Path Delay:       2.441ns (Levels of Logic = 2)

Timing Improvement Wizard
Data Path: irdy_n to PCI_IP_Core_Instance_ad_Port_2
  Delay type                    Delay(ns)  Logical Resource(s)
  ----------------------------  ---------  -------------------
  Tiopi                         1.224      irdy_n
                                           irdy_n_IBUF
  net (fanout=138)              0.766      irdy_n_IBUF
  Tilo                          0.653      PCI_IP_Core_Instance_I_25_LUT_7
  net (fanout=3)                0.683      N21918
  Tbxx                          0.981      PCI_IP_Core_Instance_I_XXL_1357_1
  net (fanout=15)               2.352      PCI_IP_Core_Instance_I_XXL_1357_1
  Tilo                          0.653      PCI_IP_Core_Instance_I_125_LUT_17
  net (fanout=1)                0.749      PCI_IP_Core_Instance_N3059
  Tilo                          0.653      PCI_IP_Core_Instance_I__n0055
  net (fanout=1)                0.809      PCI_IP_Core_Instance_N3069
  Tioock                        1.208      PCI_IP_Core_Instance_ad_Port_2
  ----------------------------  ---------  -------------------
  Total                         10.731ns   (5.372ns logic, 5.359ns route)
                                           (50.1% logic, 49.9% route)

Clock Path: clk to PCI_IP_Core_Instance_ad_Port_2
  Delay type                    Delay(ns)  Logical Resource(s)
  ----------------------------  ---------  -------------------
  Tgpio                         1.082      clk
                                           clk_BUFGP/IBUFG
  net (fanout=1)                0.007      clk_BUFGP/IBUFG
  Tgio                          0.773      clk_BUFGP/BUFG
  net (fanout=423)              0.579      clk_BUFGP
  ----------------------------  ---------  -------------------
  Total                         2.441ns    (1.855ns logic, 0.586ns route)
                                           (76.0% logic, 24.0% route)
--------------------------------------------------------------------------------

Timing summary:
---------------
Timing errors: 135  Score: 55289
Constraints cover 27511 paths, 0 nets, and 4835 connections (92.1% coverage)
________________________________________________________________________________

Locations of various resources:
FRAME#: pin 23
IRDY#: pin 24
AD[2]: pin 62
PCI_IP_Core_Instance_I_25_LUT_7: CLB_R12C1.s1
PCI_IP_Core_Instance_I_XXL_1357_1: CLB_R12C2
PCI_IP_Core_Instance_I_125_LUT_17: CLB_R23C9.s0
PCI_IP_Core_Instance_I__n0055: CLB_R24C9.s0

Input signals other than FRAME# and IRDY# are all meeting the Tsu < 7 ns requirement, and because I now figured out how to use IOB FFs, I can easily meet Tval < 11 ns (Tco) for all output signals. I am using Xilinx ISE WebPack 4.1 (which doesn't come with FPGA Editor), and the PCI IP core is written in Verilog. The device I am targeting is the Xilinx Spartan-II 150K system gate speed grade -5 part (XC2S150-5CPQ208). I did meet all 33MHz PCI timings with the Spartan-II 150K system gate speed grade -6 part (XC2S150-6CPQ208) when I resynthesized the PCI IP core for the speed grade -6 part and basically reused the same UCF file with the floorplan (I had to make small modifications to the UCF file because some of the LUT names changed). The reason I really care about the Xilinx Spartan-II 150K system gate speed grade -5 part is that it is the chip on the PCI prototype board of the Insight Electronics Spartan-II Development Kit. Yes, I wish the PCI prototype board came with speed grade -6 . . .

Because I want the PCI IP core to be portable across different platforms (most notably Xilinx and Altera FPGAs), I am not really interested in making any vendor specific modifications to my Verilog RTL code, but I won't mind using various tricks in the .UCF file (for Xilinx) or .ACF file (I believe that is the Altera equivalent of the Xilinx .UCF file). Here are some solutions I came up with.

1) Reduce the signal fanout (currently at 35 globally, but FRAME# and IRDY#'s fanout are 200. What number should I reduce the global fanout to?).
2) Use USELOWSKEWLINES in a UCF file (already tried on some long routings, but it didn't seem to help. I will try to play around with this option a little more with different signals.).
3) Floorplan all the LUTs and FFs on the FPGA (currently, I only floorplanned the LUTs that violated Tsu, and most of them take inputs from FRAME# and IRDY#.).
4) Use Guide file Leverage mode in Map and Par.
5) Try routing my design 2000 times (that will take several days . . . I once routed my design about 20 times; Par seems to get stuck in a certain Timing Score range beyond 20 iterations.).
6) Pay for ISE Foundation 4.1 (I don't want to pay for tools because I am poor), and use FPGA Editor (I wish ISE WebPack came with FPGA Editor.). At least from FPGA Editor, I can see how the signals are actually getting routed.
7) Use a different synthesis tool other than XST (I am poor, so I doubt that I can afford one.).

I would like to hear from anyone who can comment on the solutions I just wrote, or has other suggestions on what I can do to reduce the delays to meet 33MHz PCI's Tsu < 7 ns requirement.

Thanks,

Kevin Brace (don't respond to me directly, respond within the newsgroup)

P.S. Considering that I am struggling to meet 33MHz PCI timings with Spartan-II speed grade -5, how come Xilinx can meet 66MHz PCI timings on Virtex/Spartan-II speed grade -6? (I can only barely meet 33MHz PCI timings with Spartan-II speed grade -6 using Floorplanner.) Is it possible to move a signal through an input pin like FRAME# or IRDY# (pin 23 and pin 24, respectively, for Spartan-II PQ208), go through a few levels of LUTs, and reach a far away IOB output FF and tri-state control FF like pin 67 (AD[0]) or pin 203 (AD[31]) in 5 ns? (3 ns + 1.9 to 2 ns natural clock skew = 4.9 ns to 5.0 ns realistic Tsu) Can a signal move that fast on Virtex/Spartan-II speed grade -6? (I sort of doubt it from my experience.) I know that Xilinx uses the special IRDY and TRDY pins in LogiCORE PCI, but that won't seem to help FRAME#, since FRAME# has to be sampled unregistered to determine the end of a burst transfer.
What kind of tricks is Xilinx using in their LogiCORE PCI other than the special IRDY and TRDY pins? Does anyone know?
Article: 37589
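For concreteness, the constraint mechanics described in this post look roughly like the following UCF fragment. The OFFSET lines are copied from the timing report above; the LOC and USELOWSKEWLINES lines are my sketch of the hand-placement and routing tricks mentioned, using the instance names and slice from the report, and their exact syntax should be checked against the Constraints Guide for your ISE version:

```
# Input setup constraints, as echoed in the timing report:
COMP "frame_n" OFFSET = IN 7 nS BEFORE COMP "clk" ;
COMP "irdy_n"  OFFSET = IN 7 nS BEFORE COMP "clk" ;

# Hand placement of one critical LUT into a specific slice:
INST "PCI_IP_Core_Instance_I_25_LUT_7" LOC = "CLB_R12C1.S1" ;

# Request low-skew routing resources for a high-fanout net:
NET "frame_n_IBUF" USELOWSKEWLINES ;
```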
I don't claim to be an expert at all, but according to this EE Times article, if the IP core came from an FPGA vendor like Xilinx or Altera, you are pretty much stuck with their devices, unless the FPGA vendor offers a conversion service (like Altera's HardCopy (started recently) or Xilinx's HardWire (which, if I am correct, was discontinued in 1999)).

http://www.eetimes.com/story/OEG20010907S0103

Another piece of bad news about the conversion service is that Clear Logic recently lost a key ruling against Altera.

http://www.altera.com/corporate/press_box/releases/corporate/pr-wins_clear_logic.html

I find the ruling somewhat troubling because, assuming no Altera-made IP is included in the customer's design, should anyone have any control over the bit stream file you generated from Altera's software? I suppose what Altera wants to say is that because the customer had to agree to a license prior to using Altera software (like MAX+PLUS II or Quartus), the customer has to use the generated bit stream file in a way agreed to in the software licensing agreement. However, Clear Logic recently won a patent on their business model of converting a bit stream file directly to an ASIC, and that business model seems to be very similar to Altera's HardCopy, so I expect Clear Logic to sue Altera soon.

http://www.ebnews.com/story/OEG20011108S0031

So, seeing that IP cores from FPGA vendors have strings attached to them, I think it will be safer to use a third party (non-device-vendor) IP core if FPGA-to-ASIC conversion is part of the requirements of your application.

Kevin Brace (don't respond to me directly, respond within the newsgroup)

arlington_sade@yahoo.com (arlington) wrote in message news:<63d93f75.0112160047.77f9982e@posting.google.com>...
> Hello all,
>
> If you were to use IP cores, such as Logicore/Alliance cores from
> Xilinx, Megafunction cores from Altera, Inventra cores from Mentor,
> etc. how can you get the RTL verilog/VHDL when you want to convert to
> ASIC ?
>
> Thanks.
Article: 37590
Hi Jay... Excuse the question: what is n-bit modulo multiplication? I'm reasonably well experienced at FPGAs and logic, and have never knowingly used n-bit modular mults, so I don't appreciate the difficulty of working with n=128 or more. When I do, I may be able to answer your questions.

Eric Pearson

"Jay Berg" <admin@eCompute.org> wrote in message news:3c1cfff8$0$34821$9a6e19ea@news.newshosting.com...
> After making the mistake of getting involved in the current ECCp109
> distributed computing project (see URL below), I'm now casting around to
> determine if there's a possibility of finding a PCI board with an FPGA
> co-processor capable of handling a small set of modular math functions.
>
> http://www.nd.edu/~cmonico/eccp109/
>
> The main points to be aware of are:
> 1. The client requires 128-bit integer math.
> 2. The client uses modular math for nearly all of the math functions.
> 3. The majority of compute time is spent in a single function that
> performs 128-bit modulo multiplication.
> 4. This project will move to the next challenge following completion.
> The next project will be a 131-bit challenge, requiring a word size
> larger than 128-bits).
>
> If there existed an FPGA based PCI card capable of doing 128-bit modulo
> multiplication, I would be very interested. But after investing a week of
> searching, I'm unable to find an off the shelf solution, or the IP core to
> provide this capability.
>
> Q1: Does anyone know of an existing FPGA based PCI card capable of 128-bit
> integer modulo multiplication?
>
> Q2: Does anyone know of existing FPGA IP that supports 128-bit integer
> modulo multiplication?
>
> Q3: If Q1 and Q2 are both no, would there be anyone interested in creating
> such a core, or possibly even a PCI board to accomplish 128-bit integer
> modulo multiplication? I would be willing to sponsor development costs. But
> I'm not flush enough to pay for the labor (time) involved.
> > Note that I've multiple decades of SW development background (C and > assembler) with very minimal HW background. And this includes zero FPGA > development experience. But given my SW experience, I can easily build any > driver(s) and do the client porting. > > BTW, the current client executes approximately 190,000 iterations per second > on a 450p2. This includes several (1-3) modulo multiplications per > iteration. > > Jay Berg > jberg@eCompute.org > >Article: 37591
Let me see if I can explain this. But given that I'm not a math expert, bear with me.

Modulo math (also known as "clock arithmetic") can be thought of as using remainders. Imagine the following numbers:

17 32 65

Now if we reduce these numbers modulo 10, they become:

7 2 5

The idea is that following the multiplication, the remainder is calculated modulo N. To do this, you can divide the result of the multiplication by N; the remainder is the modular result.

7 * 5 = 35
(7 * 5) mod 10 = 5

Since the need is for 128-bit multiplication (128x128=256), the result of the multiplication can be 256 bits in size. Following the multiplication, the 256-bit result is reduced by the modulus value N. This translates the result into a number between 0 and (N-1). With the assumption that N is 128 bits (or less), the final result of the modulo multiplication will be 128 bits (or smaller).

There are numerous methods to compute modulo remainders. But the simplest is to envision a division with remainder, where the remainder is the desired result - with the quotient being discarded. Remember also that all numbers are in integer form.

result = (X*Y) mod N

That's what I'm trying to achieve. The value of 'result'.

Jay Berg

"Eric Pearson" <ecp@mgl.ca> wrote in message news:u1quq4p7nq0l86@corp.supernews.com...
> Hi Jay...
>
> Excuse the question: what is n-bit modulo multiplication? I'm a resonably
> well experience at fpga's and logic, and have never knowingly used n-bit
> modular mults so I don't appreciate the difficulty of working with n=128
> or more. When I do, I may be able to answer your questions.
>
> Eric Pearson
>
> "Jay Berg" <admin@eCompute.org> wrote in message
> news:3c1cfff8$0$34821$9a6e19ea@news.newshosting.com...
> > After making the mistake of getting involved in the current ECCp109 > > distributed computing project (see URL below), I'm now casting around to > > determine if there's a possibility of finding a PCI board with an FPGA > > co-processor capable of handling a small set of modular math functions. > > > > http://www.nd.edu/~cmonico/eccp109/ > > > > The main points to be aware of are: > > 1. The client requires 128-bit integer math. > > 2. The client uses modular math for nearly all of the > > math functions. > > 3. The majority of compute time is spent in a single > > function that performs 128-bit modulo multiplication. > > 4. This project will move to the next challenge following > > completion. The next project will be a 131-bit challenge, > > requiring a word size larger than 128-bits). > > > > If there existed an FPGA based PCI card capable of doing 128-bit modulo > > multiplication, I would be very interested. But after investing a week of > > searching, I'm unable to find an off the shelf solution, or the IP core to > > provide this capability. > > > > Q1: Does anyone know of an existing FPGA based PCI card capable of 128-bit > > integer modulo multiplication? > > > > Q2: Does anyone know of existing FPGA IP that supports 128-bit integer > > modulo multiplication? > > > > Q3: If Q1 and Q2 are both no, would there be anyone interested in creating > > such a core, or possibly even a PCI board to accomplish 128-bit integer > > modulo multiplication? I would be willing to sponsor development costs. > But > > I'm not flush enough to pay for the labor (time) involved. > > > > Note that I've multiple decades of SW development background (C and > > assembler) with very minimal HW background. And this includes zero FPGA > > development experience. But given my SW experience, I can easily build any > > driver(s) and do the client porting. > > > > BTW, the current client executes approximately 190,000 iterations per > second > > on a 450p2. 
This includes several (1-3) modulo multiplications per > > iteration. > > > > Jay Berg > > jberg@eCompute.org > > > > > >Article: 37592
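The clock-arithmetic explanation above translates directly into software; Python's arbitrary-precision integers make the 128-bit case a one-liner, which is handy as a golden reference model for any hardware implementation. The specific operand values below are chosen arbitrarily for illustration:

```python
def mod_mul(a, b, n):
    """(a * b) mod n; the intermediate product may be up to twice the word width."""
    return (a * b) % n

# The worked example from the post: 7 * 5 = 35, and 35 mod 10 = 5.
assert mod_mul(7, 5, 10) == 5

# A 128-bit-scale case: the product is ~256 bits, the reduced result under 128.
a = (1 << 127) + 12345
b = (1 << 126) + 67890
n = (1 << 128) - 159                # an arbitrary 128-bit modulus
q, r = divmod(a * b, n)             # division with remainder; quotient discarded
assert mod_mul(a, b, n) == r and 0 <= r < n
```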
That sounds like fun. Tell us more about the number you're taking all this modulo to. Is it a constant, a rarely changing parameter, or a variable? It would be very handy if it were a power of two, but is there any other form it has to have? For the general problem, the worst part is the division. But there are some cute tricks for division by constants, particularly when you want only the remainder. You could improve the algorithm in that area. Also, 128-bit arithmetic is into the region where FPGAs' ripple carries are slower than more complicated carry schemes. Carl -- Posted from firewall.terabeam.com [216.137.15.2] via Mailgate.ORG Server - http://www.Mailgate.ORGArticle: 37593
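One standard example of the "cute tricks for division by constants" mentioned above is Barrett reduction: precompute a scaled reciprocal of the fixed modulus once, then replace every runtime division by a multiply, a shift, and at most two subtractions. This is the textbook algorithm, not something from the thread, and the 109-bit modulus below is an arbitrary illustration:

```python
def barrett_setup(n):
    """One-time precomputation for a fixed modulus n."""
    k = n.bit_length()
    mu = (1 << (2 * k)) // n   # scaled reciprocal, floor(4**k / n)
    return k, mu

def barrett_reduce(x, n, k, mu):
    """x mod n for 0 <= x < n**2, with no runtime division."""
    q = (x * mu) >> (2 * k)    # estimate of x // n, low by at most 2
    r = x - q * n
    while r >= n:              # at most two corrective subtractions
        r -= n
    return r

# Demo with a 109-bit modulus (value chosen arbitrarily for illustration).
N = (1 << 109) + 7
K, MU = barrett_setup(N)
```

In hardware, the attraction is that the wide division disappears entirely; only multipliers, shifts, and subtractors remain, all of which pipeline well.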
For each project, the modulus N value remains constant. But following the completion of the current project, another project will be starting with a new N. To complicate the matter, the next project will probably require word widths in excess of 128 bits. But to answer your question, once initialized, the N value remains constant through all iterations. Only A and B must be reloaded, with only the modulus result needing to be read.

Remember that the result of A*B will be 256 bits, since A and B are both 128 bits in size. Only after reduction by the modulus value N will the result diminish to 128 bits. So think of it as:

(A * B) mod N = result

Where A, B, N, and result are all 128 bits, but with the intermediate value of the multiplication being 256 bits. Thus:

(128 bits x 128 bits) = 256 bits
(256 bits MOD 128 bits) = 128 bits

And as you point out, the MOD value can be computed as the remainder of the division of (A*B) by N, so the result will range from 0 to N-1.

Jay Berg
jberg@eCompute.org

"Carl Brannen" <carl.brannen@terabeam.com> wrote in message news:891312b39c3743b0261e9a27121dc2e9.51709@mygate.mailgate.org...
> That sounds like fun. Tell us more about the number you're taking all this
> modulo to. Is it a constant, a rarely changing parameter, or a variable? It
> would be very handy if it were a power of two, but is there any other form it
> has to have?
>
> For the general problem, the worst part is the division. But there are some
> cute tricks for division by constants, particularly when you want only the
> remainder. You could improve the algorithm in that area.
>
> Also, 128-bit arithmetic is into the region where FPGAs' ripple carries are
> slower than more complicated carry schemes.
>
> Carl
>
> --
> Posted from firewall.terabeam.com [216.137.15.2]
> via Mailgate.ORG Server - http://www.Mailgate.ORG
Article: 37594
I had someone point out to me that I made no mention as to the required speed. Each iteration of the SW requires 3-4 modulo multiplications. On a 450p2 system using SW only, the current SW achieves approximately 190,000 iterations per second. This equates to approximately 632,700 modulo multiplications per second. On an (overclocked) 975p3, it equates to approximately 1,332,000 iterations per second. Therefore I would hope to achieve at least 600,000 iterations per second (2 million modulo multiplications per second). Note that this assumes some degree of SW time with the modulo multiplications occurring upon demand. Sorry, I'm a coding pig and not a HW type. So bear with me on these approximations of performance. Jay Berg "Jay Berg" <admin@eCompute.org> wrote in message news:3c1cfff8$0$34821$9a6e19ea@news.newshosting.com... > After making the mistake of getting involved in the current ECCp109 > distributed computing project (see URL below), I'm now casting around to > determine if there's a possibility of finding a PCI board with an FPGA > co-processor capable of handling a small set of modular math functions. > > http://www.nd.edu/~cmonico/eccp109/ > > The main points to be aware of are: > 1. The client requires 128-bit integer math. > 2. The client uses modular math for nearly all of the > math functions. > 3. The majority of compute time is spent in a single > function that performs 128-bit modulo multiplication. > 4. This project will move to the next challenge following > completion. The next project will be a 131-bit challenge, > requiring a word size larger than 128-bits). > > If there existed an FPGA based PCI card capable of doing 128-bit modulo > multiplication, I would be very interested. But after investing a week of > searching, I'm unable to find an off the shelf solution, or the IP core to > provide this capability. > > Q1: Does anyone know of an existing FPGA based PCI card capable of 128-bit > integer modulo multiplication? 
> > Q2: Does anyone know of existing FPGA IP that supports 128-bit integer > modulo multiplication? > > Q3: If Q1 and Q2 are both no, would there be anyone interested in creating > such a core, or possibly even a PCI board to accomplish 128-bit integer > modulo multiplication? I would be willing to sponsor development costs. But > I'm not flush enough to pay for the labor (time) involved. > > Note that I've multiple decades of SW development background (C and > assembler) with very minimal HW background. And this includes zero FPGA > development experience. But given my SW experience, I can easily build any > driver(s) and do the client porting. > > BTW, the current client executes approximately 190,000 iterations per second > on a 450p2. This includes several (1-3) modulo multiplications per > iteration. > > Jay Berg > jberg@eCompute.org > >
Article: 37595
Hello, On a typical PCI FPGA board, it is likely that your performance is limited by the PCI bandwidth rather than by the FPGA processing power. Assuming N fixed, you need 3*128 bits (48 bytes: 2 Wr, 1 Rd) of I/O per multiplication. If you use PCI MMAP I/Os, you will hardly get more than 15 Mbytes/sec between the host and the board. This puts a bound on your achievable performance (15/48*10^6 ~ 310,000 Mul/sec), which is less than what you already get in software. If your algorithm has no data dependencies between the different multiplication results (which I doubt), you could use blocked I/O (or DMA) operations, and maybe reach 60-80 Mbytes/sec, but even then you would get at most about 1.5 million multiplications per second, still short of your 2 million target. The only solution would be to implement a larger part of the algorithm (like a whole loop nest) on the FPGA board, which is much more difficult (unless your algorithm is very regular and requires little control), but this generally reduces the amount of I/O operations on the PCI bus. Steven Jay Berg wrote: > I had someone point out to me that I made no mention as to the required > speed. > > Each iteration of the SW requires 3-4 modulo multiplications. On a 450p2 > system using SW only, the current SW achieves approximately 190,000 > iterations per second. This equates to approximately 632,700 modulo > multiplications per second. On an (overclocked) 975p3, it equates to > approximately 1,332,000 iterations per second. > > Therefore I would hope to achieve at least 600,000 iterations per second (2 > million modulo multiplications per second). Note that this assumes some > degree of SW time with the modulo multiplications occurring upon demand. > > Sorry, I'm a coding pig and not a HW type. So bear with me on these > approximations of performance. > > Jay Berg > > "Jay Berg" <admin@eCompute.org> wrote in message > news:3c1cfff8$0$34821$9a6e19ea@news.newshosting.com...
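[Editor's note: the back-of-envelope bus bound is worth redoing explicitly. Each multiplication moves two 128-bit operands in and one 128-bit result out, i.e. 3 x 16 = 48 bytes across the bus; the transfer rates below are the rough figures from the post, not measurements.]

```python
bytes_per_mul = 3 * (128 // 8)     # write A, write B, read result: 48 bytes

mmap_rate = 15e6                   # ~15 Mbytes/sec, memory-mapped PCI I/O
dma_rate = 70e6                    # ~60-80 Mbytes/sec, blocked I/O or DMA

mmap_bound = mmap_rate / bytes_per_mul   # ~312,000 mul/sec: below the
                                         # ~632,700 mul/sec the SW already gets
dma_bound = dma_rate / bytes_per_mul     # ~1.46 million mul/sec: still short
                                         # of Jay's 2 million mul/sec target
assert mmap_bound < 632_700
assert dma_bound < 2_000_000
```

This is why moving a whole loop nest onto the FPGA, rather than single multiplications, is the only way the board pays off: it cuts the per-result PCI traffic.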
> > After making the mistake of getting involved in the current ECCp109 > > distributed computing project (see URL below), I'm now casting around to > > determine if there's a possibility of finding a PCI board with an FPGA > > co-processor capable of handling a small set of modular math functions. > > > > http://www.nd.edu/~cmonico/eccp109/ > > > > The main points to be aware of are: > > 1. The client requires 128-bit integer math. > > 2. The client uses modular math for nearly all of the > > math functions. > > 3. The majority of compute time is spent in a single > > function that performs 128-bit modulo multiplication. > > 4. This project will move to the next challenge following > > completion. The next project will be a 131-bit challenge, > > requiring a word size larger than 128-bits). > > > > If there existed an FPGA based PCI card capable of doing 128-bit modulo > > multiplication, I would be very interested. But after investing a week of > > searching, I'm unable to find an off the shelf solution, or the IP core to > > provide this capability. > > > > Q1: Does anyone know of an existing FPGA based PCI card capable of 128-bit > > integer modulo multiplication? > > > > Q2: Does anyone know of existing FPGA IP that supports 128-bit integer > > modulo multiplication? > > > > Q3: If Q1 and Q2 are both no, would there be anyone interested in creating > > such a core, or possibly even a PCI board to accomplish 128-bit integer > > modulo multiplication? I would be willing to sponsor development costs. > But > > I'm not flush enough to pay for the labor (time) involved. > > > > Note that I've multiple decades of SW development background (C and > > assembler) with very minimal HW background. And this includes zero FPGA > > development experience. But given my SW experience, I can easily build any > > driver(s) and do the client porting. > > > > BTW, the current client executes approximately 190,000 iterations per > second > > on a 450p2. 
This includes several (1-3) modulo multiplications per > > iteration. > > > > Jay Berg > > jberg@eCompute.org > > > >
Article: 37596
I have just noticed that there exists some work on ECC implementation on FPGAs. http://citeseer.nj.nec.com/leung00fpga.html Good luck Steven Jay Berg wrote: > After making the mistake of getting involved in the current ECCp109 > distributed computing project (see URL below), I'm now casting around to > determine if there's a possibility of finding a PCI board with an FPGA > co-processor capable of handling a small set of modular math functions. > > http://www.nd.edu/~cmonico/eccp109/ > > The main points to be aware of are: > 1. The client requires 128-bit integer math. > 2. The client uses modular math for nearly all of the > math functions. > 3. The majority of compute time is spent in a single > function that performs 128-bit modulo multiplication. > 4. This project will move to the next challenge following > completion. The next project will be a 131-bit challenge, > requiring a word size larger than 128-bits). > > If there existed an FPGA based PCI card capable of doing 128-bit modulo > multiplication, I would be very interested. But after investing a week of > searching, I'm unable to find an off the shelf solution, or the IP core to > provide this capability. > > Q1: Does anyone know of an existing FPGA based PCI card capable of 128-bit > integer modulo multiplication? > > Q2: Does anyone know of existing FPGA IP that supports 128-bit integer > modulo multiplication? > > Q3: If Q1 and Q2 are both no, would there be anyone interested in creating > such a core, or possibly even a PCI board to accomplish 128-bit integer > modulo multiplication? I would be willing to sponsor development costs. But > I'm not flush enough to pay for the labor (time) involved. > > Note that I've multiple decades of SW development background (C and > assembler) with very minimal HW background. And this includes zero FPGA > development experience. But given my SW experience, I can easily build any > driver(s) and do the client porting.
> > BTW, the current client executes approximately 190,000 iterations per second > on a 450p2. This includes several (1-3) modulo multiplications per > iteration. > > Jay Berg > jberg@eCompute.org
Article: 37597
Jay, I'd be stunned if a typical million gate FPGA like an XCV1000 didn't beat any home computer by a factor of 1000 in performing these calculations. I'm going to try and figure out what's actually going on in this link: http://www.certicom.com/research/ch32.html In the unlikely event that I do figure it out, I'll post a performance estimate here. Carl -- Posted from firewall.terabeam.com [216.137.15.2] via Mailgate.ORG Server - http://www.Mailgate.ORG
Article: 37598
There's another set of ECC challenges, but those use F_2^m/P instead of F_p/P. The arithmetic in that field would be a lot easier on an FPGA than arithmetic on F_p/P because carries are eliminated. Carl P.S. I said that an XCV1000 should beat a home computer by a factor of 1000. That would assume some very good FPGA design work. A factor of 100 should be fairly easy to achieve. -- Posted from firewall.terabeam.com [216.137.15.2] via Mailgate.ORG Server - http://www.Mailgate.ORG
Article: 37599
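[Editor's note: the carry-free arithmetic Carl refers to is polynomial multiplication over GF(2): the shift-and-add of an ordinary multiplier becomes shift-and-XOR, so no carry chain is needed at all, which is why it maps so well onto FPGA LUTs. A quick reference model in Python, a sketch rather than code from the challenge.]

```python
def clmul(a, b):
    # Carry-less multiply: GF(2) polynomials packed into integer bits.
    # Schoolbook shift-and-add with XOR replacing addition (no carries).
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

# (x + 1)^2 = x^2 + 1 in GF(2)[x]: the cross terms cancel, nothing carries.
assert clmul(0b11, 0b11) == 0b101
# (x^2 + x + 1)(x^2 + 1) = x^4 + x^3 + x + 1
assert clmul(0b111, 0b101) == 0b11011
```

A full F_2^m field multiply then reduces the 2m-bit product modulo the field polynomial P, again with XORs only.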
Hello, This might be slightly off-topic, but I guess several people in this NG have had to face this kind of problem. We are using a BurchEd board, with a parallel port download cable. However, because we need to communicate with the board once it is configured, we use two parallel ports: the one on the motherboard (for communication in EPP mode) and another one connected on a PCI parallel port extension board (using the netmos 9705 chip) for configuration*. The PCI // port does not work properly when it comes to configuring the FPGA board (I managed to make it work for a week or so, but now for some mysterious reason the FPGA DONE signal does not behave correctly). BTW configuration with the motherboard // port works fine. The general PCI // port behavior is correct (checked by feeding the CTRL signals back to STATUS), so I really don't understand where this problem is coming from. Has anybody faced the same kind of problem? * We have no choice, since the PCI board does not seem to allow anything other than SPP. Thank you for your help, Steven