To put things into practical perspective: my company sells an FPGA-based solution that applies our video enhancement technology in real time to video streams up to 1080p60 (our consumer product handles HDMI in and out). It's a world-class algorithm with complex calculations, generating 3D information and saliency maps on the fly. I crammed that beast into a Cyclone IV with 40K LEs.
It's hard to translate the "System Logic Cells" metric that Xilinx uses to measure these FPGAs, but a pessimistic calculation puts it at about 1.1 million LEs. That's over 27 times the logic my real-time video enhancement algorithm uses. With just one of these FPGAs we could run our algorithm on 6 4K60 4:4:4 streams at once. That's insane.
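For what it's worth, here's the back-of-the-envelope math behind those numbers (all figures approximate, and the linear-scaling-with-pixel-rate assumption is mine):

```python
# Back-of-the-envelope math for the claims above (all figures approximate).
my_design_les = 40_000           # the Cyclone IV design
aws_fpga_les = 1_100_000         # pessimistic LE-equivalent of the AWS part
headroom = aws_fpga_les / my_design_les      # ~27.5x

# A 4K60 stream carries 4x the pixels of 1080p60, so assume (my assumption)
# the design scales roughly linearly with pixel rate.
scale = (3840 * 2160) / (1920 * 1080)        # = 4.0
streams = headroom / scale
print(round(headroom, 1), int(streams))      # 27.5 6
```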
As another estimate, my rough calculations show that each FPGA would be able to do about 7 GH/s mining Bitcoin. Not an impressive figure by today's standards, but back when FPGA mining was a thing, the best I ever got out of an FPGA was 500 MH/s per chip (on commercially viable devices).
I'm very curious what Amazon is going to charge for these instances. FPGAs of that size are incredibly expensive (5 figures each). Xilinx no doubt gave them a special deal, in exchange for the opportunity to participate in what could be a very large market. AWS has the potential to push a lot of volume for FPGAs that traditionally had very poor volume. IntelFPGA will no doubt fight exceptionally hard to win business from Azure or Google Cloud.
* Take all these estimates with a grain of salt. Most recent "advancements" in FPGA density are the result of trickier architectures. FPGAs today are still homogeneous logic, but they don't tend to be as fine-grained as they used to be. In other words, they're basically moving from RISC to CISC. So it's always up in the air how well all the logic cells can be utilized for a given algorithm.
My guess is that Amazon will have to be very careful not to price themselves out of the market for mid-range deep-learning-based cloud apps.
Wild guesstimate, but I think it'll cost more than $20/hr per instance.
Amazon is betting that they can get better pricing than anyone else. They probably can: no one else will be buying these FPGAs in the quantities Amazon will if these instances become popular (within their niche). So for medium-sized players it'll be cheaper to rent the FPGAs from Amazon, even with the AWS markup, than to buy the boards themselves. Especially for dynamic workloads, where you save money by renting instead of owning (which is generally the advantage of cloud resources).
That's my guess anyway.
Once upon a time I thought seriously about going into hardware design. I took a couple of different courses in college (over 10 years ago now... sigh) dealing with VHDL and/or Verilog and absolutely loved it. If not for a chance encounter with web programming during my co-op, my career would have been entirely different. With AWS offering this in the cloud, if it is not prohibitively expensive, I'll be looking into toying with it and hopefully discovering uses for it in my work.
How many NOT operations can this do per cycle (and per second)? I realise FPGAs aren't the most suited for this, but the raw number is useful when thinking about how much better the FPGA is compared to a GPU for simple ops.
Anyway, the FPGAs being used here are, I believe, based on a 6-LUT (6 input, 2 output). So you'd get about 1.25 million 6-LUTs to work with, and some combination of MUXes, flip-flops, distributed RAM, block RAM, DSP blocks, etc.
Supposing Xilinx isn't doing any trickery and you really can use all those LUTs freely, you'd be able to cram ~2.5 million binary NOTs into the thing (2 NOTs per LUT, since they're two-output LUTs). So 2.5 million NOTs per cycle. I don't know what speed it'd run at for such a simple operation. Their mid-range 7-series FPGAs were able to do 32-bit additions plus a little extra logic at ~450 MHz, consuming 16 LUTs per adder.
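Putting numbers on that guess (the LUT count is the estimate above; the clock rate is my optimistic assumption, borrowing the adder figure):

```python
# Putting numbers on the NOT-per-second guess above.
# Assumption (mine): the design closes timing at the ~450 MHz
# adder speed quoted for mid-range 7-series parts.
luts = 1_250_000                 # ~6-input, 2-output LUTs on the part
nots_per_cycle = luts * 2        # one inverter per LUT output
clock_hz = 450e6
nots_per_second = nots_per_cycle * clock_hz
print(nots_per_cycle, nots_per_second)       # 2500000 1.125e+15
```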
HDK here: https://github.com/aws/aws-fpga
(I work for AWS)
I hope the growing popularity of FPGAs for general-purpose computing will help push the vendors to open up bitstreams and invest in open-source design tools.
[background: many years of writing VHDL specifically for FPGAs, using various dev boards and custom boards]
To the best of our knowledge, state-of-the-art performance for forward propagation of CNNs on FPGAs was achieved by a team at Microsoft. Ovtcharov et al. have reported a throughput of 134 images/second on the ImageNet 1K dataset [28], which amounts to roughly 3x the throughput of the next closest competitor, while operating at 25 W on a Stratix V D5 [30]. This performance is projected to increase by using top-of-the-line FPGAs, with an estimated throughput of roughly 233 images/second while consuming roughly the same power on an Arria 10 GX1150. This is compared to high-performing GPU implementations (Caffe + cuDNN), which achieve 500-824 images/second while consuming 235 W. Interestingly, this was achieved using Microsoft-designed FPGA boards and servers, an experimental project which integrates FPGAs into datacenter applications.
It doesn't look like there's much AWS-proprietary stuff here, though we'd have to wait for the SDK to be opened properly to be sure. I imagine it's mostly making everything prepackaged and easily consumable, maybe some extra IP cores for common stuff, and lots of examples. If you're already using Vivado, I imagine using F1 won't introduce any major changes to what you expect.
"This AMI includes a set of developer tools that you can use in the AWS Cloud at no charge. You write your FPGA code using VHDL or Verilog and then compile, simulate, and verify it using tools from the Xilinx Vivado Design Suite (you can also use third-party simulators, higher-level language compilers, graphical programming tools, and FPGA IP libraries)."
So basically, buying a copy of Vivado is the minimum. There aren't any open source tools that directly output Xilinx FPGA bitstreams that I know of.
Wow. An app store for FPGA IPs and the infrastructure to enable anyone to use it. That's really cool.
I see people making video transcoder instances on day 1, and MPEG LA bankrupting Amazoners with lawsuits on day 2.
Only if they distribute it through Amazon. Just put the code up in a torrent; anyone can run it without MPEG LA knowing.
But I also think this is FPGAs for the Rest of Us. Suddenly, FPGAs are available without having to buy some development board from Xilinx, install a toolchain, use said (shitty) toolchain ...
Me, I was thinking of FPGAs as being something I'd use down the road a few years, eventually, etc. But instead, I'm looking at this right now. This morning. Waiting for the damn 404 to go away on:
https://github.com/aws/aws-fpga
This reduces the barrier to entry. It also reduces the transaction cost (h/t Ronald Coase).
The thing is, people always seem to invest heavily in dedicated hardware when using FPGAs. I'll be interested to see what people actually end up using this service for.
via http://www.bittware.com/xilinx/product/xupp3r/
Thanks OP
Pretty interesting read. Also, kudos to AWS!
It's more an issue of being able to reproduce an existing build later on. You can't delegate ownership of the toolchain to the "cloud" (read: somebody else's computer) if you think you'll ever need to maintain the design in the future.
I do think the issue with cloud is the concern over IP. There are not a lot of EDA vendors, so the chances that your competitor is using the same EDA vendor are pretty high. I think companies are pretty wary of using a cloud-hosted service where you could literally be running simulations on the same machines as your competitors. Can you imagine some cloud/hosting snafu resulting in your codebase being accessible to your competitors?
EDA companies also sell ASIC/FPGA IP and VIP (verification IP), so there's a pretty clear conflict of interest if they have access to your IP. If you're really paranoid, imagine the EDA vendors themselves picking through your IP and repackaging/reselling it to other customers (encrypted, of course, so you can't readily identify the source code).
You do, however, potentially expose your source code to Amazon. But possibly not, if you do your design/testing on EDA tools under your control and then deploy FPGA build packages to the F1 instances for hardware testing.
If anyone can suggest links or books, please do.
Thank you!
To get some basic ideas, I always recommend the book Code by Charles Petzold: https://www.amazon.com/Code-Language-Computer-Hardware-Softw...
It walks you through everything from the transistor to the operating system.
(Apparently I need to add that I work for AWS on every message so yes I work for AWS)
I'm surprised; where did you get the idea of using C to program FPGAs? Are you thinking of SystemC or OpenCL? (They're vastly different from each other.)
I'm really surprised a sibling comment recommended the Code book. It's really meant to be a layman's read about tech. It's a great book, but it won't teach you to program FPGAs.
[1]: https://www.amazon.com/Digital-Design-Introduction-Verilog-H...
One key difference to keep in mind for digital design is that everything happens in parallel unless explicitly serialized, which is the opposite of the software development most people know.
You can start with the EDA Playground tutorial and practice with HDLBits, while going through a book (e.g., Harris & Harris) alongside for examples, exercises, and best practices.
Similarly to a sibling thread, I'd also go with a free and open source flow, IceStorm (for the cheaply available iCE40 FPGAs): http://www.clifford.at/icestorm/
You can follow-up from the aforementioned tutorial and continue testing the designs on an iCE40 board -- starting here: http://hackaday.com/2015/08/19/learning-verilog-on-a-25-fpga...
Here are some really great presentations about it (slides & videos) by the creator (which can also serve in part as a general introduction):
- http://www.clifford.at/papers/2015/icestorm-flow/
- http://www.clifford.at/papers/2015/yosys-icestorm-etc/
Have fun!
A C programmer will just spend a lot of time learning why the things they already know how to do are not useful.
It was a very pleasant surprise! The JVM world usually does not have a great interface to the heterogeneous world. I think it would yield tremendous benefits: FPGA-accelerated matrix multiplication, sorting, and graph operations all sound very appealing.
And then, as you mentioned, there's the possibility of JITting things. HTTP header parsing ends up on the FPGA, which routes things to a message queue an actor can read. Or FPGA-based actors; does that make sense?
----
I have been unable to follow this development at all, however. Do you have any news about this project? I've been looking for a blog, a github or a mailing list, but can't find any.
These Xilinx 16nm Virtex FPGAs are beasts, but Altera has some compelling choices as well. Perhaps some of the hardened IP in the Xilinx parts tipped the scales, such as the H.265 encode/decode, 100G EMAC, or PCIe Gen 4?
OK, here's a concrete question: I have a vector of 64 floats. I want to multiply it with a matrix of size 64xN, where N is on the order of 1 billion. How fast can I do this multiplication, and find the top K elements of the resulting N-dimensional array?
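For a baseline before reaching for an FPGA, here's the same workload on a CPU with NumPy, scaled down from N = 1e9 (the sizes and variable names here are mine):

```python
import numpy as np

# y = x @ M, then take the indices of the top-K entries of y.
# Scaled down to N = 100,000 so it runs in a second; the question asks N ~ 1e9.
rng = np.random.default_rng(0)
N, K = 100_000, 10
x = rng.standard_normal(64).astype(np.float32)
M = rng.standard_normal((64, N)).astype(np.float32)

y = x @ M                                    # 64*N multiply-accumulates
topk = np.argpartition(y, -K)[-K:]           # O(N) partial select, unsorted
topk = topk[np.argsort(y[topk])[::-1]]       # sort only the K winners
print(topk.shape)                            # (10,)
```

Back of the envelope: the full-size problem is 64 × 1e9 ≈ 6.4e10 multiply-accumulates, but it also streams 256 GB of float32 weights per pass. So on any device, FPGA or GPU, the runtime is dominated by memory bandwidth rather than arithmetic, unless the matrix can be held in on-chip or on-board memory.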
Basically, you can create a custom "CPU" for your particular workflow. Imagine the GPU didn't exist and you couldn't multiply vectors of floats in parallel on your CPU. You could use an FPGA to write something that multiplies a vector of floats in parallel without developing a GPU. It would probably not be as fast as a GPU or the equivalent CPU, but it would be faster than doing it serially.
Another way to put it: you can build a GPU out of an FPGA, but not vice versa.
Hopefully, by browsing that list, you can see how FPGAs aren't really directly comparable to something like a GPU.
I don't see this resulting in a large-scale FPGA movement anytime soon, since industry and academia are heavily experienced with GPUs. The software and libraries for GPUs, like CUDA, TensorFlow, and other open-source libraries, are very mature and optimized for GPUs. Equivalent libraries will have to be written in Verilog (for one, I've been hoping to be part of this movement for some time now, so I'd love it if anyone can point me to anything going on).
There are some major to minor hurdles. Although some of them might not seem like much[0], here they are:
1. Until now, deep learning/machine learning researchers have been okay with learning the software stack around GPUs, and there are widespread tutorials on how to get started, etc. Verilog/VHDL is a whole different ball game and a very different thought process. (I will address using OpenCL later.)
2. The toolchain is not open source and not really hackable. Although that may not seem important here, since you're writing gates from scratch, there will be problems with licensing and bugs fixed at a snail's pace (if ever) until there is a performant open-source toolchain (if ever, though I have hope in the community). You'll have to learn to give up with a customer service rep when you try to get help, unlike open-source libraries, where you head to GitHub's issues page and get help quickly from the main devs.
3. Although this move will make getting into the game a lot easier, it still doesn't change the fact that people want control over their devices; it will take time for people to realize they have to start buying FPGAs for their data centers and use them in production, which has to happen sometime soon. Using AWS's services won't be cost-effective for long-term usage, just like GPU instances (I don't know how the spot instance situation is going to look with the FPGA instances).
This comes with its own slew of software problems, and good luck trying to understand what's breaking what, given the much slower compilation times and terribly unhelpful debugging messages.
4. OpenCL to FPGA is a mess. Only a handful of FPGAs support OpenCL, which has led to little to no open-source development around OpenCL with FPGAs in mind. And no, the OpenCL libraries written for GPUs cannot be used on FPGAs; it's more likely a from-scratch rewrite, and a LOT of tweaking has to be done to get them to work. OpenCL to FPGA is not as seamless as one might think and is riddled with problems. This will, again, take time and energy from people familiar with FPGAs, who have been largely outside the OSS movement.
Although I might come off as pessimistic, I'm largely hopeful for the future of the FPGA space. This move isn't great news just because it lowers the barrier, but because it introduces a chip that will be much more popular; now there is a single chip for libraries to focus their support on, compared to before, when each dev had a different board. So you'll have to get familiar with this one: the Virtex UltraScale+ XCVU9P [1].
Also, what might be interesting to you: Microsoft is doing a LOT of research on this.
I think all of the articles on MS's use of FPGAs can explain better than I can in this comment.
Some links to get you started: MS's blog post: http://blogs.microsoft.com/next/2016/10/17/the_moonshot_that...
Papers: https://www.microsoft.com/en-us/research/publication/acceler...
Media outlet links: https://www.top500.org/news/microsoft-goes-all-in-for-fpgas-... https://www.wired.com/2016/09/microsoft-bets-future-chip-rep...
I'd suggest starting with the Wired article or MS's blog post. Exciting stuff.
[0]: Remember that academia moves at a much slower pace than your average developer in adjusting to the latest and greatest software. The reason CUDA is still so popular, although it is closed source and only works on Nvidia's GPUs, is that it got into the game first and wooed researchers with performance. Although OpenCL is comparably performant (with some rare exceptions), I still see CUDA regarded as the de facto language to learn in the GPGPU space.
[1]: https://www.xilinx.com/support/documentation/selection-guide...
Maybe someone will finally find the Triple DES key used at Adobe for password hashing.
The possibilities are endless :)
So much easier than buying hardware. Deep learning sometimes works similarly: for many use cases, it's easier to play with on AWS with hourly billing than to buy hardware.
> 64 GiB of ECC-protected memory on a 288-bit wide bus (four DDR4 channels).
> Dedicated PCIe x16 interface to the CPU.
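As an aside, the 288-bit figure is consistent with four 72-bit ECC channels (64 data + 8 ECC bits each). A quick sanity check, where the speed grade is my assumption since AWS doesn't state it:

```python
# Check the quoted 288-bit bus width and estimate peak DRAM bandwidth.
# Assumption (mine): DDR4-2133, a common speed grade; AWS doesn't state it.
channels = 4
data_bits, ecc_bits = 64, 8
bus_width = channels * (data_bits + ecc_bits)
print(bus_width)                             # 288, matching the quote

transfers_per_sec = 2133e6                   # DDR4-2133: 2133 MT/s
peak_gb_per_sec = channels * (data_bits // 8) * transfers_per_sec / 1e9
print(round(peak_gb_per_sec, 1))             # 68.3
```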
Does anyone know whether this is likely to be a plug-in card? And can I buy one to plug into a local machine for testing?
Even better, maybe Amazon (and others getting into this space, like Intel and Microsoft) will put their weight behind an open-source VHDL/Verilog simulator. A few exist, but they are pretty slow and way behind the curve in language support. Heck, maybe they can drive adoption of one of the up-and-coming HDLs like Chisel, or create an even better one. A guy can dream...
What's the use of a simulator when you can spin up an AWS instance and run your program on a real FPGA?
That being said, you are far from alone as an FPGA developer in skipping simulation and going straight to hardware. Tools like Xilinx's ChipScope help with the visibility problem in real hardware, too.
It's now called QuestaSim, I believe. But are you sure it can't handle simulating large designs? If so, what is the full-featured software from Mentor that can?
> Heck, maybe they can drive adoption of one of the up-and-coming HDL's like chisel
Chisel isn't a full-blown HDL from what I understand; it's only a DSL that compiles to Verilog. In other words, you'd still need a Verilog simulator to actually run your design.
What could YOU use this for professionally?
(I certainly always wanted to play around with an FPGA for fun...)
A fast JPEG encoder/decoder would be useful as well.
But seriously, I'm open to ideas for technologies that you or anyone else needs implemented for these instances. Would make an interesting side business for me.
EDIT: I should point out that I'm an experienced "full-stack" engineer when it comes to FPGAs. I've implemented both the FPGA code and the software that drives it. None of this software-developed-by-"hardware guys" garbage.
I've been planning a NIC that directly serves web apps via HDL for a while now...
For JPEG the GPU instances might be better.
Let this day be known as the beginning of the end of general-purpose compute infrastructure for internet-scale services.
The analogue of the higher-level object file or assembly language would be a netlist: essentially a digital representation of a schematic. The HDL is transformed into a netlist, then the netlist is optimized and its components converted from generics to device-specific components, then placement and routing are determined, and finally a 'bit' file is generated for actually configuring the FPGA. This process can take several hours for a large design.
I'm not sure I agree with them that this is the right path forward (but they're smart and know their stuff, so I'm probably wrong), but it's absolutely for real.