Asynchronous CPU, MIPS

kelnos · on Jan 24, 2010

This stuff is really cool. One of my college professors, Rajit Manohar, was one of the guys who did a lot of the major research into async VLSI in the 90s. He and a few of his colleagues founded a company in San Jose (Achronix) that designs and builds async FPGAs.

It's an amazingly different way of thinking about computing. Generally async designs take up more physical die space (because of things like dual-rail logic), but the power savings can be pretty significant since you're not running a clock... basically the logic runs as fast or as slow as it needs to, and doesn't do much of anything when idle.

I took a class where we designed and did a little layout for some CPU-subset blocks. It was fun, but often very challenging. At least at the time, there weren't many (any?) good test/validation tools. A friend of mine has been working on one for the past 8 years or so (http://www.csl.cornell.edu/~fang/hackt/), first as a part of his PhD thesis, later as a part of his job.

It's been years since I've touched this stuff... if I'd continued down the EE path (like many others, I switched to software post-college) I'd (ideally) be doing async stuff now.

jacquesm · on Jan 24, 2010

You may find that talent much in demand in the years to come.

sketerpot · on Jan 24, 2010

This feels like a lot of my class projects: fascinating idea, some good work on it, and then "we ran out of time, here's a pretty web page and a report". Ah, memories. :-)

I've really enjoyed asynchronous design. Did you know that it's possible to make an async n-bit adder that works in O(lg lg n) average time? Or that you can make a low-power CPU that does all its operations one bit at a time, and just stops sending bits when it comes to the leading zeros? It's fun stuff.

(My only contribution to the world of async design was a smaller dual-rail completion detector. It was pretty sweet, and I'm eternally proud of it.)

jacquesm · on Jan 24, 2010

I ran in to all this because of some talks with other HN'ers about computing fabrics, one of the big questions is 'clocked' or 'free running'.

That brought me to post my ideas on using 'life' as a very basic computing fabric, it's already proven to be Turing complete, so now to make it run without a global clock.

That would prove that a clock-free computing fabric of arbitrary size is a real life possibility.

I think it can be done, I'm not 100% sure yet.

That adder sounds like a bit of magic, do you have a link for that?

I can picture the cpu that 'sleeps on 0', basically the transitions are what clocks it forward so as long as you keep sending it 0's it simply never changes state right? (no state changes -> minimal power consumption).

sketerpot · on Jan 24, 2010

Someone's already figured out a way to run Conway's Game of Life asynchronously. Check it out:

http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=10298...

I didn't read the paper carefully enough to be sure how it works, but it looks like each cell (or region of cells) maintains a count of how many generations have passed, and it blocks until adjacent cells have caught up to it. So time propagates across the board like waves on a pond.

> That adder sounds like a bit of magic, do you have a link for that?

Sure! Here's the paper:

http://www1.cs.columbia.edu/async/publications/cheng-unger-t...

I'll summarize it briefly. You know about ripple carry adders, where the carry bits propagate across the adder and it takes O(n) time? Well, the adder for each bit can sometimes precompute its carry output. If both inputs are 1, it knows that its carry output will be 1. There are some other tricks like that. Anyway, the upshot of all this is that an asynchronous ripple carry adder can operate in O(lg n) average time, and an asynchronous carry-lookahead adder can work in O(lg lg n) average time. Or about twice as fast as a synchronous carry lookahead adder, according to transistor-level simulations.

> I can picture the cpu that 'sleeps on 0', basically the transitions are what clocks it forward so as long as you keep sending it 0's it simply never changes state right? (no state changes -> minimal power consumption).

Asynchronous design is like ideal clock gating: when part of a processor doesn't need to do anything, it just stays idle and doesn't dissipate dynamic power. In the example of a bit-serial length-adaptive datapath that I mentioned, the leading zero bits are simply ignored, and no processor work needs to be done to deal with them. Here's a paper on it:

http://vlsi.cornell.edu/~rajit/ps/bitsnap.pdf

jacquesm · on Jan 24, 2010

> it looks like each cell (or region of cells) maintains a count of how many generations have passed,

Hah!

That's exactly what I came up with :)

I finished it at 5 am this morning and it works.

The code is here: http://ww.com/asynclife.php.txt

Thanks for the other papers, I'll be reading.

edit:

I've written a blog post about all this:

http://jacquesmattheij.com/asynchronous+life

I can't read the IEEE paper without paying for it, so I think I'll forego the honour of doing so, pity because I would like to see what they came up with exactly.

Especially whether or not they clued in to the fact that if you have a 'generation' counter you only need the lower two bits of it to keep going forever.

sketerpot · on Jan 24, 2010

Oh man, that's awesome! Have you considered modeling a more "real-world" computing fabric? Your cells would have randomly-distributed computation and communication delays, and you could make some sort of visualization of the time waves propagating, and it would just be really cool.

I initially didn't think this would be possible to implement in simple physical hardware, but your insight that you only need a two-bit time counter changes things.

Of course, ultimately, I think what you'd want to look at is amorphous computing. It's a fascinating new branch of computer science that looks at organizing a large number of unreliable small processors, sensors, and actuators, with only local communication, into a working system. Here's the big intro paper (Abelson and Sussman are among the authors, BTW):

http://groups.csail.mit.edu/mac/projects/amorphous/papers/ai...

I think you'll really enjoy amorphous computing. It's already getting some use in wireless sensor networks, especially the stuff having to do with establishing reliable communication channels, or figuring out the spatial distribution of the sensors based on ping times, or keeping clocks more-or-less synchronized.

jacquesm · on Jan 24, 2010

> Oh man, that's awesome!

Thanks :) It was a lot of fun.

> Have you considered modeling a more "real-world" computing fabric?

That's what this is leading in to, the simplest unit is just a way for me to know that it is doable. If I can't do 'life' then there is no point in going any further with it.

One of the HN members here is very far in coming up with a higher level concept but I feel that it needs to be done from the 'gate level' up.

> Your cells would have randomly-distributed computation and communication delays, and you could make some sort of visualization of the time waves propagating, and it would just be really cool.

I did that for a bit, it is easy enough to show the generations as colours.

> I initially didn't think this would be possible to implement in simple physical hardware, but your insight that you only need a two-bit time counter changes things.

Yes, that's what made me really happy :) I realized that with a 'full' generation count you could make it work but that would always limit the maximum generation count to the number of bits in the counters. So why not turn it around and see how few bits you can use and handle the 'wrapping' properly. The number is 2, and that's low enough to use a simple 4-to-1 demultiplexer to select the right state line (current or previous generation).

> Of course, ultimately, I think what you'd want to look at is amorphous computing.

I had not heard the term before, more reading to do!

> I think you'll really enjoy amorphous computing.

Yes, that sounds like it is exactly the way to go.

I will write a fair sized article at the end of this month about where I feel computing is headed, this as a follow up on the new years day posting about what we think the future will bring.

I thought the question was an important one and I'm hitting two birds with one stone here, first the interest in cell computing and second the information I'm gathering to answer that question about the future.

Fascinating stuff.

boots · on Jan 24, 2010

I worked at Achronix Semiconductor (San Jose startup) for over a year and a half designing Asynchronous Circuits. I can answer any questions about the specifics of designing these circuits.

Anyone interested in this topic should definitely look into the Cornell University research ( http://vlsi.cornell.edu/ ). Unlike most researchers in asynchronous research, they still tape out chips.

sketerpot · on Jan 24, 2010

Cool, there are a few things I've been wondering about for a while now.

Is it possible to make practical async logic on ordinary FPGAs? I've seen some very conservatively-designed async stuff that ran on an Altera FPGA, but I remember hearing that it was kind of flaky. (It was a Java-to-hardware synthesis program -- there's a remarkably direct mapping between the usual program structures like loops, if statements, etc., and hardware if you're willing to use a naive design.)

What are the prospects for putting asynchronous IP cores into a system-on-chip with mostly clocked parts?

What kind of async logic style do you usually use in practice? Dual-rail quasi-delay-insensitive, or bundled data with matched delays, or something so exotic that I've yet to hear of it?

I've heard some people speculating about how async circuits can be used to make fault-tolerant logic. For example, if you're using dual-rail coding, then a single bit flip is going to be detectable as a fault, and you can trigger some self-checking-and-repair logic without breaking your timing assumptions. How important/plausible would you say this is?

blasdel · on Jan 24, 2010

Years ago I tried to implement a simple CPU architecture in Xilinx Verilog on a Spartan3 where the aim was for it to be pedagogically simple inside and out, with extremely concise and expository HDL source.

I failed pretty miserably: It took me several rewrites before I got one to even synthesize, and dozens before I got it to place+route onto the physical chip (fucking traveling salesman problem!). I never got my CPU's JMP instruction to work correctly within my self-imposed ideological constraints, making it rather useless.

It took me a while to realize that the fundamental problem with my design was hidden asynchrony, despite having a clock.

boots · on Jan 24, 2010

If you wanted to map asynchronous designs to an ordinary FPGA, I think there would be two major hurdles: 1) Choosing an asynchronous methodology (There are more then one), and mapping that to the LUT. This shouldn't be too difficult, but you are likely to have a high area overhead for this. 2) Getting the place and router to behave as you would expect it to. It depends if the router uses the clock to propagate it's information, although I expect it would NOT.

Getting the asynchronous IP core into a SoC system is not that difficult. You'll have to convert synchronous signals to asynchronous signals and vise versa, but once you figure that out (it's actually very simple) you can plug in the asynchronous core anywhere.

I used dual rail QDI in my circuit design, with a mix of four phase handshakes and two phase handshakes, depending on the application.

Fault-tolerant logic is definitely feasible - I worked on this for one project! The technique is patented by Achronix, but it definitely is almost bullet proof and only requires an area overhead similar to a voting schema.

jacquesm · on Jan 24, 2010

This is a very interesting area. You're really lucky to have worked for a company busy in this field, that's golden knowledge and experience.

I think that there are a few factors that will eventually drive asynchronous designs to become the way forward.

briansmith · on Jan 24, 2010

Is there difficulty implementing constant-time algorithms for async processors? For example, it is very desirable that every crypto operation takes exactly the same amount of time, to avoid timing attacks.

sketerpot · on Jan 24, 2010

You can always use a clock and a counter to delay for a period of time that you know is greater than the worst-case time that the operation could take. Or randomize it up a bit. Or design the logic to take similar times for all crypto operations.

Here's another cool thing: by using a dual-rail encoding (one wire being high means a bit is 0, the other wire being high means the bit is 1) you can design logic that consumes roughly constant power no matter what the operands are. This makes power-analysis attacks a lot more difficult.

Here's a paper on all this that you may enjoy:

http://www.gemplus.com/smart/rd/publications/pdf/MAM03as.pdf

boots · on Jan 24, 2010

There is definitely a very bound on the time that it could take, but one of the advantages of asynchronous is that you can design for the average case, not the worst case. Each circuit forms part of a data pipeline, and with enough data in this pipeline you can guarantee a maximum time limit.

Unfortunately, to match the minimum to the maximum, you would likely have to design your implementation to be very slow. In the constant time case you remove all the pipelining advantages, as you would be required to 'fill' the pipeline.

achronixinsider · on Jan 25, 2010

Do you think that your former employer (Achronix) would be pleased to hear that you are looking to disclose "specifics of designing these circuits" on a public message board?

petewarden · on Jan 23, 2010

Great stuff. I've been fascinated by the potential of async CPUs ever since I studied under Steve Furber, who was working on the clockless version of the ARM processor, Amulet: http://en.wikipedia.org/wiki/AMULET_microprocessor

jacquesm · on Jan 23, 2010

yes, I posted that one earlier today.

Quite the inspiration all these designs doing the 'impossible' (according to some).

It's always nice to see people pursue presumed dead ends then turn up finding gold.

kd0amg · on Jan 23, 2010

The professor I TA'ed for last semester said he's had a student turn in a clockless CPU for the final project in the introductory architecture class.

david927 · on Jan 23, 2010

very cool

jacquesm · on Jan 23, 2010

You're the inspiration for the current crop of 'clock-free' articles, I really got interested in it that particular aspect of our conversation after M's visit here.

I think I can see at least one way to crack that 'nut' of doing without a global clock, which would scale to arbitrary large systems.

The downside is that it is a purely theoretical solution for the moment, I'll try to do some simulations to see if it is actually viable.

The tricky bit seems to be to not fall in to the 'pit' of thinking that you have it working without synchronization when actually the test system has some underwater property that makes the likelihood of a failure vanishingly small, but which when duplicated 'in real life' would lead to an error almost immediately.

The classical example is two circuits that are 'independently' clocked from some master clock that gets divided by different factors.

You'd need to test this with local clocks that are varying wildly comparing the output of the system with 'known good' output over many cycles to gain some confidence.

It's amazing how the devil really is in the details with this stuff.

rikthevik · on Jan 24, 2010

Agreed. Especially when timings errors happen between different clocks in a deployed chip at different temperatures.