Solving Spectre and Meltdown may ultimately require a new type of processor (pcworld.com)
104 points by tosh on Aug 26, 2018 | 72 comments


Solving Meltdown absolutely does not require a new type of processor. Performing out-of-order execution of privileged instructions is something you can simply not do; nobody except Intel is doing it right now, in fact. Lots of Spectre variants likewise don't require entirely new processor designs, such as the lazy floating-point state restore vulnerability or the new L1 Terminal Fault vulnerabilities. The only one that possibly requires a new type of processor is the original Spectre, which is quite hard to completely stamp out all possibility of while retaining the concept of speculative execution.


Indeed, AMD CPUs don’t speculate loads until the page protection is checked. That alone eliminates whole classes of vulnerabilities.
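
To make that concrete, the core access pattern from the published Meltdown paper looks roughly like this (a simplified sketch; probe_array and the final timing step are the standard published technique, and the function name is made up):

    #include <stdint.h>

    /* Simplified Meltdown access pattern. Step 1 is architecturally forbidden
     * and will fault, but on affected CPUs its result is transiently forwarded
     * to step 2, which leaves a secret-dependent line of probe_array in the
     * cache. If page protection is checked before the load's result is
     * forwarded (the AMD behaviour described above), step 2 never sees the
     * secret and the channel disappears. */
    static volatile uint8_t probe_array[256 * 4096];

    void meltdown_gadget(const volatile uint8_t *kernel_addr)
    {
        uint8_t secret = *kernel_addr;        /* 1: privileged load, faults      */
        (void)probe_array[secret * 4096];     /* 2: transient, secret-dependent  */
        /* 3: after the fault, time accesses to probe_array to see which
         *    line got cached and thereby recover 'secret'.                      */
    }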

Cache updates will need to be staged and not “retired” (aka committed and made visible to other cores) until the speculated instructions are retired. That will have some perf impact and cost some die space, but it is hardly fatal.

Whether AVX survives in its current form is an open question. Those 512-bit vector units cost a lot of power. Intel gets away with powering them down when unused, leading to a large delay when the first vector instruction is encountered after an idle period. Powering them up continuously blows their TurboBoost strategy. If they can eliminate the penalty or make the units more power efficient it might not require many changes, otherwise AVX may need to shrink back down to narrower units, or maybe force them through a shutdown cycle on every context switch? Not sure what the solution is here.


On the AVX thing, I think maybe you can eliminate it as a channel for speculative execution attacks by just not speculatively executing AVX-512 instructions when the units are powered down, which also sounds more efficient. It doesn't sound good that, apparently, if (expression_that_is_false) { do_avx_instruction() } can drop your clock speed for several milliseconds when branch prediction guesses wrong! The cost of powering up the AVX units is, I think, much greater than the cost of failing to speculate once.
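
For concreteness, the pattern I mean is roughly this (a sketch; the function and variable names are made up, and it needs AVX-512 support to build, e.g. -mavx512f):

    #include <immintrin.h>

    /* Hypothetical illustration: use_avx512 is always 0 here, so the AVX-512
     * body never executes architecturally. If the branch predictor guesses
     * "taken", though, the wide loads/adds can still be dispatched
     * speculatively and trigger the 512-bit unit power-up (and the clock
     * penalty) described above. */
    void maybe_wide_add(const double *a, const double *b, double *out,
                        int use_avx512 /* always 0 in this scenario */)
    {
        if (use_avx512) {                            /* mispredicted -> speculation */
            __m512d va = _mm512_loadu_pd(a);
            __m512d vb = _mm512_loadu_pd(b);
            _mm512_storeu_pd(out, _mm512_add_pd(va, vb));
        } else {
            for (int i = 0; i < 8; i++)              /* the path actually taken */
                out[i] = a[i] + b[i];
        }
    }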

In general, I think the problem is that we probably don't know about all possible side channels, and might not for many years. So the approach you suggest - eliminating side channels one by one so that you can't extract information from speculative execution that way - is inherently risky.


> ...by just not speculatively executing AVX-512 instructions...

See this suggestion a lot. At any given time, a CPU is trying to execute 5-8 u-ops in different execution units every cycle. These 5-8 u-ops must come from somewhere. To have something to do, CPUs need to load dozens, even hundreds of instructions from "the future" using branch prediction. As such, there could literally be hundreds of instructions in the reorder buffer at once. Of these, 5-10 might represent branches which cannot yet be executed due to dependencies and have instead simply been guessed at using branch prediction.

TL;DR: that pretty much means that the CPU is always speculating--perhaps 95% of all cycles.

Typical CPU designs do not explicitly track branch dependencies in the reorder buffer. Instead, they either rely on annulling at commit time or clearing the reorder buffer when a mispredicted branch commits. That means the CPU literally has no way of knowing whether it is currently "speculating". To fix this, one would have to add control dependencies to AVX instructions so that they could never execute while branch instructions on which they depend (i.e. earlier on their proper architected path) are in flight. That would almost certainly annihilate performance, which, after all, is the point of AVX instructions.
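
To illustrate, here is roughly what a reorder-buffer entry conceptually holds in such a design (a sketch in C; the fields are invented for illustration and real ROBs differ):

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch of a reorder-buffer entry as described above: nothing records
     * which older, still-unresolved branches this entry depends on, so the
     * core can't cheaply ask "is this particular u-op speculative right now?"
     * It only finds out when a mispredicted branch reaches commit and
     * everything younger gets thrown away. */
    struct rob_entry {
        uint64_t pc;             /* instruction address                        */
        uint16_t dest_physreg;   /* renamed destination register               */
        bool     completed;      /* executed, waiting to retire in order       */
        bool     is_branch;
        bool     mispredicted;   /* only acted on at/near commit               */
        /* note: no list of older in-flight branches this entry depends on     */
    };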


Powering up the full AVX-512 datapath supposedly takes something like 500 microseconds, and then slows the clock speed for something like 2000 microseconds. If so, you could literally just make every AVX-512 instruction fault when the units are powered down, let the operating system (after an appropriate fence) explicitly authorize powering up the AVX-512 datapath, and have no noticeable performance degradation relative to these enormous costs. And once the units are powered up, there is no side channel problem.
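
As a rough sketch of what that OS-side flow could look like (all names are invented; this is illustrative C pseudocode of the scheme, not real kernel code):

    /* All names below are invented for illustration. */
    struct task;
    void speculation_fence(void);              /* e.g. a serializing barrier    */
    int  task_may_use_avx512(struct task *t);  /* OS policy decision            */
    void power_up_avx512_units(void);          /* pay the power-up cost once    */
    void resume_faulting_instruction(struct task *t);
    void deliver_sigill(struct task *t);

    void handle_avx512_unavailable_fault(struct task *t)
    {
        speculation_fence();                   /* nothing speculative past here  */
        if (task_may_use_avx512(t)) {
            power_up_avx512_units();           /* non-speculative, OS-authorized */
            resume_faulting_instruction(t);    /* retry the AVX-512 instruction  */
        } else {
            deliver_sigill(t);
        }
    }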


I don't know if Intel has done this, but you can design AVX-512 to run on 256-bit AVX ALUs at half the IPC. You could feasibly design a CPU which keeps its full AVX-512 units powered down and runs those instructions at half speed until the full ALUs are powered up.

Once you have a CPU of that design, you can eliminate that AVX-512 side channel by not sending the signal to power up the full AVX-512 ALUs until an AVX-512 instruction has fully executed.


I'm not sure you could do that in a completely secure processor, though; at least not the way you're describing it. Ideally, the delay would at least be /simulated/ even IF the units were already powered up because of prior instructions. It would require /per thread/ tracking of slow/fast/previously-powered paths.

Edit: About 10 min after posting the above, I realize that this MIGHT be what your second paragraph is describing, though it isn't entirely unambiguous.


No. Simply providing two ways to execute, with 256-bit or 512-bit ALUs, is not enough to prevent side channels; you can still measure the execution time to read the side channel.

The key to closing the side channel is not allowing speculated instructions to trigger the activation of the 512-bit ALUs. Hold off for a few cycles until execution of those instructions is confirmed.


Would there still be a thermal side channel, if one had access to high enough resolution power monitoring?


Never speculatively executing AVX instructions will, in bad but not uncommon cases, drop your performance by something like a factor of 10; for example, when you perform some AVX operation in a loop until a flag is set. I don't think there's any chance of that happening.
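
The kind of loop I have in mind looks something like this (a sketch using real AVX intrinsics; the function itself is made up):

    #include <immintrin.h>   /* AVX; build with -mavx */
    #include <stddef.h>

    /* Each iteration's AVX work sits behind a data-dependent branch on *flag,
     * so a CPU that refuses to execute AVX instructions speculatively would
     * stall every iteration waiting for the branch to resolve. */
    void scale_until_flag(float *data, size_t n, const volatile int *flag)
    {
        __m256 k = _mm256_set1_ps(0.5f);
        for (size_t i = 0; i + 8 <= n && !*flag; i += 8) {   /* flag checked every pass */
            __m256 v = _mm256_loadu_ps(data + i);
            _mm256_storeu_ps(data + i, _mm256_mul_ps(v, k)); /* the AVX body */
        }
    }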


I said "not speculatively executing AVX-512 instructions _when the units are powered down_".


Since gruez mentioned it, I'll note Intel also pushed Itanium, which had security benefits almost nobody talks about in x86 vs Itanium discussions. Secure64, co-founded by an Itanium designer, uses them in its SourceT OS, which they claimed got a positive analysis from Matasano Security.

https://www.intel.com/content/dam/www/public/us/en/documents...

They also claimed the Itanium-based solutions were immune to Spectre and Meltdown. I'm not a CPU expert. I'll let others review that. A lot of attacks are showing up.

https://secure64.com/not-vulnerable-intel-itanium-secure64-s...

So, the processor that was about stronger reliability and security that the market and mainstream security ignored seems to mitigate some of the risks both are now griping about. Maybe folks who don't depend specifically on x86 might buy some Itaniums to signal they'll pay for security-enhanced processors. Be sure to tell the sales rep why so they can pass it up the chain. :)

For embedded stuff, there's also Microsemi's CodeSEAL and Dover's CoreGuard which have advanced protections. The first has encrypted/authenticated RAM plus control-flow integrity at CPU level. The second has a flexible, metadata unit that enforces many types of security policies at CPU level on per-instruction basis.


Itanium is dead, after 15 years on its deathbed. To my knowledge there is no plan to launch any new Itanium CPU. The only reason the 9700 was launched in 2017 was to serve as a drop-in replacement for old CPUs, most likely due to contractual obligations.

There's simply no "killer feature" for Itanium. It's not coming back.


That supports my point that introducing a security-enhanced processor didn't work for Intel. The iAPX 432 and i960 also failed, at a loss of over a billion dollars so far, when Intel tried to make better, safer CPUs. Customers wanted backward compatibility with x86's warts, the highest performance per dollar, and decreased energy use. That's what Intel gave them, at billions in profit. They're making patches for the security weaknesses and/or fixing them in newer CPUs. Vendors of insecure processors still dominate while the few processors with better security sell almost nothing.

So, Intel should continue to make deliberately insecure processors and patch them since that's what people pay for. It's the market's fault for rarely buying anything better. Plus, voters haven't done anything about patent reform, which would've led to more x86 competitors. The Chinese supplier is an interesting development; we might get a secure-ish x86 that way.

"There's simply no "killer feature" for Itanium. It's not coming back."

I just named its killer feature: security enhancements that make it hard to inject code into it. Most security audits of OSes find problems. The one of SourceT on Itanium didn't find a way to inject code into it, since it used hardware to mitigate and contain most of that. The protections on x86, by contrast, are themselves a source of vulnerabilities at the moment.

Unfortunately, that was the only company that I'm aware of which saw that potential. The market's direction means they're porting the product to insecure hardware right now. Who knows what they'll come up with since it's impossible to secure software on insecure hardware. I'm so grateful to the markets for buying such things so little that they basically don't exist in the mainstream space.

Note: another example was the Cell processor. Green Hills, who make the INTEGRITY-178B OS for high security, did an assessment of the Cell processor for baking security into systems. It had some nice capabilities for that. The market wasn't going to buy it even to protect their systems, though. So, they're not pushing that.


> Intel should continue to make deliberately insecure processors and patch them since that's what people pay for.

Unfortunately... What's more disappointing is that while you expect the average Joe to go for shiny-new-fast, enterprises should have been a little more restrained.

But hey, it's only an issue when it's found and with the power of hindsight all decisions are easy.


It wasn't hindsight: we've been publishing and describing these risks along with techniques to mitigate them for decades.

https://news.ycombinator.com/item?id=16092183

Mainstream security ignores it for political reasons. They're like a clique that pretends all prior work that's not like theirs, or that comes from outside their social groups, didn't happen. Enterprise ignores it since management gets rewarded for either high-growth or low-cost deployments. Those tend to favor insecure deployments. Consumers don't care enough, for, I guess, psychological reasons. They don't even adopt usable, free stuff like Signal most of the time, much less paid options.

The high-assurance security field has steadily done its part, producing secure versions of hardware and software to address these risks. They usually knock out entire classes of attack. Almost nobody uses them outside of aerospace and defense. For hardware, few companies will build the secure prototypes since prior examples went bankrupt. The Spectre/Meltdown situation gives us the potential to use it as a buzzword pushing new hardware. That hardware needs to be totally compatible with existing OSes/RTOSes and the C language, though, with almost no performance or cost hit. I think secure ARMs or MIPSes are the best start. CoreGuard is an example targeting them plus RISC-V. We'll see what happens.


Because the regular timing attacks were always relatively easy to mitigate in software with low performance penalties?

Unless I'm misreading your comment, you're saying Meltdown and Spectre are just the "run of the mill" cache timing attacks about which tens or hundreds of papers were written in the past years. In which case the same mitigations should still apply since they have been used in cryptography for decades.
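
For example, the classic software mitigation from cryptography is constant-time code, such as a comparison without a data-dependent early exit (a minimal sketch of the idea, not a vetted crypto-library routine):

    #include <stddef.h>
    #include <stdint.h>

    /* Constant-time comparison: no early return on the first mismatching byte,
     * so the run time doesn't depend on where (or whether) the inputs differ. */
    int ct_equal(const uint8_t *a, const uint8_t *b, size_t len)
    {
        uint8_t diff = 0;
        for (size_t i = 0; i < len; i++)
            diff |= a[i] ^ b[i];
        return diff == 0;
    }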

If this is the case and there's nothing more to them, and since you appear to be working in the field, I have one question: what is it that triggered this kind of response now that wasn't already triggered by the swath of already existing papers?


Now you're putting words in my mouth. I'm saying almost all of the advice and techniques were ignored by the security community for years, whether they incurred a performance penalty or not. That's covert channels in general plus cache-based timing channels. I mean, when is the last time you saw someone writing FOSS with crypto or for security say they did a covert-channel analysis of their software? It pretty much never happens, even though it was a mandated step in older and current evaluations, with every certified product at EAL5 or above getting one. Heck, we couldn't even get a lot of developers to stop using C when safer languages like Ada or Modula-3 were available that fit their use case. Companies that claimed to take security seriously and had money for secure products wouldn't buy them. People who had money, or whose uses could fit safer processors, often didn't buy them.

Hell, one company whose products got lots of analysis was selling a secure RTOS, file system, and network stack for a $50,000 license with no royalties. A router company could've greatly reduced its attack surface that way, on top of reliability/predictability improvements. The big players had huge piles of money, too. IIRC, only one company, not a big vendor, put a separation kernel in a network switch. And it withdrew it later since nobody bought it. Sirrix is still selling stuff with a similar design (Google the Mikro-SINA architecture) but not the same assurance or amount of evaluation. GENU is using an OpenBSD variant. Those are the only two I can think of that probably sell a lot of product and will be around a while.

It's pretty systematic that strong methods and products are ignored. As far as security folks go, most of them don't know anything about this stuff despite it being covered in colleges, at conferences, here on HN, and so on for a long time. Those that hear about it mostly dismiss it, often without even knowing what's in it. It's a cultural thing, given they're dismissing stuff that resists attack well in favor of stuff that often fails. If they didn't, they'd have known about cache-based timing channels in 1992. A paper somewhere in the 1992-1995 range, referencing that work and analyzing either a VAX or x86, noted potential leaks in many components and modes. It called for every component to be made leak-proof to maintain the security policy. As usual, both the private sector and security folks acted like that stuff didn't exist. They patted themselves on the back when rediscovering the same shit in 2005. Others digging into it more found Meltdown/Spectre many years later. NICTA, doing a systematic analysis more like the 1990's work, found all kinds of attack surface in one pass:

https://ts.data61.csiro.au/publications/nicta_full_text/9074...

Meanwhile, those in CompSci and industry that paid attention were working on hardware architectures that minimized these problems. Partitioning caches, asynchronous execution with randomization (I independently invented this), and masking are examples. A product example is Rockwell Collins' AAMP7G CPU, which was mathematically verified for correctness, embedded a separation kernel that does time/space partitioning between processes (with a proof it was correct), and triplicated the registers with voting to increase reliability in the face of cosmic rays or individual failures. So, the CPU itself mitigates timing channels plus separates processes, with maybe stronger assurance than seL4 given a microcode-level proof on a verified CPU. They also have tools to prove your algorithm and assembly code match, to prevent abstraction attacks. They sell those CPUs commercially in aerospace and defense. They use the tools to build stuff for their customers, esp. crypto for the NSA.

Cambridge is doing CHERI, which runs FreeBSD; Microsemi's CodeSEAL has control-flow integrity with encrypted, authenticated RAM for FreeRTOS; the Softbound/Hardbound/Watchdog team has theirs for safety; several teams have done non-interference for hardware, with one proven down to the gates; Edmison did encrypted/authenticated RAM in a way that preserved the legacy processor architecture and blocked some software attacks; Dover is doing the SAFE architecture (crash-safe.org) for ARM/MIPS/RISC-V, called CoreGuard; SCOMP, the first system certified by the NSA, had a security kernel, leak analysis, and an IO/MMU in 1985; the AS/400's predecessor had capability-based security at the hardware level; and the Burroughs B5000, from 1961, was immune at the CPU level to most forms of code injection.

These methods were published widely enough that all kinds of people were building on them. Most security professionals ignore them, even claiming nothing was ever achieved (just "red tape"). That's straight-up slander. Industry and consumers mostly didn't buy them since they either wanted to minimize cost (security is an externality) or wanted specific benefits of insecure stuff (and often got hacked). If they want secure CPUs, they can just buy some of the ones I mentioned that still exist and/or (on the supply side) build them based on the highly detailed descriptions of the others. CHERI is even open source, running the C language and FreeBSD on an FPGA board. You can bet a MIPS licensee hasn't built a CHERI SoC since they believe the market won't pay for it.

" I have one question: what is it that triggered this kind of response now that wasn't already triggered by the swath of already existing papers?"

You'll find that most things in IT, including INFOSEC, are driven by mass psychology more than technical reasons. I'll demonstrate instead of explain using Heartbleed. It drove a lot of security discussion and fixes. Yet, it would've been prevented by pre-existing advice on using safe languages like Ada or a combo of tools plus development practices on C code. David A. Wheeler illustrates all the stuff that was available which developers were ignoring here:

https://www.dwheeler.com/essays/heartbleed.html

So, someone finds the attack, they publish it with a catchy name, and suddenly people are all over this issue. It drove even more activity in that part of the field. They can either adopt the advice of the people who managed to dodge all that, or they can focus only on the narrow problem affecting them right now. Most responses to Heartbleed did the latter, with people still using unsafe practices and underusing available tooling. The baseline did increase in some places, though. That was good.

I think Spectre/Meltdown is just another example of that effect in action. It probably has a name already like herd behavior, getting on bandwagons, etc. Like before, they have at least two choices: listen to the people who used specific techniques to discover everything from cache-based channels to operating system leaks, using those same techniques that worked again; ignore all that to focus on every way you can squeeze a side channel out of the same or other components of x86. The correct answer is the first since it will do the second as a side effect plus show you all the other channels elsewhere. Kemmerer's and Wray's methods were universally applicable. Most work following Meltdown/Spectre is ignoring that to react to the specifics of the attack with narrower focus. Most, hopefully not all, of the solutions being designed will fail since they clean up a problem here while ignoring those over there. And we'll get another fix and another. Avoidable damage will keep happening.

I am hoping that a better scenario plays out with things like CoreGuard coming to market right as hardware is getting attacked. The better scenario is it sells profitably due to heightened awareness which fuels further development of secure hardware/software. We'll see.


I wasn't trying to put words in your mouth, it's just what I could conclude from this put in context:

> It wasn't hindsight: we've been publishing and describing these risks along with techniques to mitigate them for decades.

>https://news.ycombinator.com/item?id=16092183

It worked for decades, and the issues were "ignored" (not given much of a second thought) because everybody assumed that was all there is to the exploits: corner cases that can be easily mitigated in software. It's only when everybody realized that the whole implementation in hardware was broken that the point was driven home.

Performance is what sold those CPUs when security just couldn't. Without the critical mass of revelations recently I doubt security would have been considered too much today. Just like privacy became a concern only with a similarly shocking volume of revelations.

So looking back it's hard to say if Intel would have chosen another way. Yeah, in hindsight you could say "if they just fixed it back then..." but in reality nobody would have cared.


"Performance is what sold those CPUs when security just couldn't. Without the critical mass of revelations recently I doubt security would have been considered too much today. "

I meant to respond to this earlier. Yeah, it seems the mainstream acts when it's in their face, there's working exploits, they're dressed up somehow, and there's an immediate action to take. I guess the route to get secure hardware/software architectures is a series of highly-publicized breaks in every component with a Right Thing solution and alternative suppliers that used it. They also move fastest when the damage is high.

So there's a recipe for change. Of course, there's possibly some ethical and legal issues in there. ;)


Correct me if I am wrong, but I think later Itaniums got OoO and speculative execution too.


Of course, simply solving Meltdown can be done on an existing CPU.

However, VLIW-like architectures could give us back the performance of out-of-order execution without the drawbacks since the compiler has much more direct control over the caches and registers used in the CPU as well as branch prediction and lots of other internals.

Of course, you also need much smarter compilers. Since the Itanium failure we've come a long way, and I'm convinced that Rust and friends would be able to get the maximum out of a VLIW-like architecture with some additional work.


> However, VLIW-like architectures could give us back the performance of out-of-order execution without the drawbacks since the compiler has much more direct control over the caches and registers used in the CPU as well as branch prediction and lots of other internals.

> Of course, you also need much smarter compilers. Since the Itanium failure we've come a long way, and I'm convinced that Rust and friends would be able to get the maximum out of a VLIW-like architecture with some additional work.

- What kind of compiler magic do you have up your sleeve that will enable that kind of magic, but circumvents the problem of the non-existence of a "sufficiently smart compiler" that plagued Itanium? In particular: What kind of magic does Rust offer for this?

- How do you intend to solve the problem (which also plagued Itanium) that mostly scientific code has the kind of parallelism that is well suited to VLIW, which led to the problem that "lots of ordinary, existing code" did not benefit so much from the potential speed that Itanium offered?

- How do you intend to solve the problem that putting deep microarchitecture details into the instruction set is usually a bad idea, because it "cements" these details (I'll just say: MIPS' delay slots), while having a good instruction set makes deep changes in the underlying microarchitecture easy?


Compilers have advanced a lot since Itanium, as have languages; Rust has a lot of information to properly track short-lived objects and optimize for that.

Lots of code runs fairly well in CPU pipelines with multiple execution units. If the pipeline is too wide that sucks, of course, and is a bit of a waste, but, say, an Intel-sized pipeline would run most current code with the same efficiency.

This isn't purely about parallelizing threads; this is about micro-ops, which can be parallelized very easily: pull an instruction from the program, put it in the first free execution unit if there is no conflict, and if a conflict occurs resolve it (WAW: drop it or assign a new register; RAR: free reordering; WAR: assign a new register; RAW: strictly no reordering before the producer), then repeat. A compiler should easily be able to do that.
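
Roughly, in code (a toy sketch; the Insn type and every helper function are invented for illustration, and real renaming/scheduling is far more involved than this):

    /* Toy per-instruction issue loop matching the description above. */
    typedef struct { int dst, src1, src2; } Insn;

    int  raw_hazard(const Insn *in);         /* reads a value not yet produced    */
    int  waw_hazard(const Insn *in);         /* writes a register written earlier */
    int  war_hazard(const Insn *in);         /* writes a register read earlier    */
    int  alloc_phys_reg(void);               /* renaming: hand out a fresh reg    */
    void wait_for_producer(const Insn *in);
    void issue_to_free_unit(const Insn *in);

    void schedule(Insn *prog, int n)
    {
        for (int i = 0; i < n; i++) {
            Insn *in = &prog[i];
            if (raw_hazard(in))               /* RAW: true dependency, must wait  */
                wait_for_producer(in);
            if (waw_hazard(in) || war_hazard(in))
                in->dst = alloc_phys_reg();   /* WAW/WAR: resolved by renaming    */
            /* RAR is not a hazard: reads may be freely reordered */
            issue_to_free_unit(in);
        }
    }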

Of course, having the microarchitecture and instruction set tightly married is a problem, but considering we have a fairly alive ARM marketplace with varying instruction sets from ARM, I think it's not unsolvable from the compilation direction: you can abstract out some details so porting is easier, or you can at least emulate the other CPU with high efficiency.


VLIW CPUs are being built today. While they do quite well for specialized workloads, they make pretty poor general-purpose CPUs. If compiler magic were available, it would already be in use.

Transmeta-style VLIW+JIT frontends seem to be viable, but I think they are as vulnerable to Spectre as OoO CPUs.


> Transmeta-style VLIW+JIT frontends seem to be viable, but I think they are as vulnerable to Spectre as OoO CPUs.

Transmeta's CPUs were in-order VLIW. The in-order property at least makes them less prone to Spectre-like vulnerabilities, just like the CPU used in the Raspberry Pi, which is also in-order:

> https://www.raspberrypi.org/blog/why-raspberry-pi-isnt-vulne...


Nvidia Denver, which is basically an evolution of Transmeta technology, appears to be vulnerable to Spectre. It doesn't matter that the underlying CPU is in-order if the CPU+runtime pair implements a deep out-of-order speculative engine.


> Of course, having the microarchitecture and instruction set tightly married is a problem, but considering we have a fairly alive ARM marketplace with varying instruction sets from ARM, I think it's not unsolvable from the compilation direction: you can abstract out some details so porting is easier, or you can at least emulate the other CPU with high efficiency.

At least the first Itanium also had hardware emulation for x86 code. It was really slow (which gave AMD a strong competitive advantage at that time and enabled them to establish x86-64).


> What kind of magic does Rust offer for this?

Not much compared to C; they are both too low-level for that, just like almost all the programming languages widely in use. Except maybe Haskell.

It's just impossible to optimize C code beyond some (not very impressive) level. And every CPU hoping for wide adoption is pretty much a "CPU for the C compiler" - and that's a problem.


That's precisely what the article says: the first version of Spectre is the problem.


But it didn't tell you that such processors do exist; you just need a 2nd c3 register. It's not rocket math.


Why can't Intel sell drop in replacements for the existing cpus out there? Can of worms?


> Why can't Intel sell drop in replacements for the existing cpus out there? Can of worms?

Even if this were financially suitable for Intel (it is not), a prerequisite is that Intel has a microarchitecture available that "really fixes these problems" - but there is none. I consider it plausible that the old in-order Atoms are less prone to these problems - but do you really want such a much slower CPU?


> which is quite hard to completely stamp out all possibility of while retaining the concept of speculative execution

So... why not do exactly that? Let compilers access cache and stop this nonsense where we pretend computer memory is flat in order to let C developers believe they're "close to the metal [sic]"


That's because compilers are terrible at predicting the dynamic behaviour of a program, be it cache access patterns, branch prediction, etc.


They don't have to be. If cache access has finer controls, the compiler can use them.


That worked well for Cell. It is not a matter of having control; the issue is that the working set of a program is data-dependent and not statically predictable.


Something very relevant to this discussion was this article[0] from a few weeks back. A bit of a clickbait title, but the points are very relevant: C, based on the PDP-11, has a computation model that no longer really tracks how a computer actually works today, and those gaps are where people insert stuff like speculative execution in the first place, to be faster while maintaining the guise to app developers of a "fast PDP-11".

[0] https://queue.acm.org/detail.cfm?id=3212479

Old discussion (skip the first reply for meat!)

https://news.ycombinator.com/item?id=16967675


Correct me if I'm wrong, but assembly doesn't offer many more entry points into this computation model either, does it, especially when considering some non-standard extensions to C? So they've added a lot of processor features with no API.

Is there a reason speculative execution had to be added to the processor and not compilers? Would we have been better off if the true memory model was exposed to compilers?


The ground is littered with the dead bodies of CPU designers who said “we’ll just make a smart compiler”.

The CPU knows what the software is actually doing and has done in the past. The compiler does not.

The CPU can change implementations; the compiler must bake any assumptions in at build time.

Speculative execution is required so long as memory latencies are so large. That trend isn’t reversing itself anytime soon.


Not that people making the smart-compiler assumption in the days before JITs were justified... but that sounds like an argument for using JITs.


>The CPU knows what the software is actually doing and has done in the past. The compiler does not.

profile guided optimizations?


Profiles capture only the program's behaviour under test data and assume that that pattern is prevalent. If the actual runtime behaviour is bimodal or worse, then at most only one mode will be optimized. A CPU can keep statistics at runtime (the branch predictor does) and thus adapt, ever so slightly, to the actual current program behaviour.


Runtime optimizations have their own problems. Sometimes the rare case will also be the most sensitive, sometimes the statistics will take too long to turn around, etc.

At least profile guided optimization gives you a degree of control.


It seems like the main problem is that the compiler doesn’t know the specific target it will run on. We actually do have our smart-enough compilers. They just come in the form of microcode that we can’t audit or even opt out of because the true instruction set of the machine is a trade secret and we’ve been locked out anyway by code signing.


Well, you can target a specific CPU model and its specific cache sizes and functional units, even the microcode revision it uses. Some cryptographic code certainly does that. But it is a major overhead to comprehend all that detail for even a small section of code. There are undocumented micro-ops present, but they are not really relevant to this; they are not the cause of speculative-execution attacks.

The compiler is poor at anticipating the many types of dynamic program behaviour. But the CPU is also bad at dealing with it. The very simple statistical model for branch prediction is used because space on the die is precious. Cycle-accurate simulation is done routinely; if anyone had come up with a much better (and equally general) predictor that could fit on the die, it would already be in use. People have tried many variants of dynamically recompiling machine code. For example, early MIPS processors (R3000) recompiled everything to be slightly closer to the specific CPU, and exposed pipeline stages (like delayed branches, delayed loads), software page-fault handling, etc.

In some ISAs you can even encode the branch prediction hints the compiler makes (from analysis or profiling) into the opcode. But the gains are limited.
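
For example, with GCC/Clang you can feed the compiler such a hint via __builtin_expect (a real builtin, mostly used for code layout; some ISAs also have hint bits in the branch encoding). The function below is made up for illustration:

    #include <stddef.h>

    long sum_mostly_valid(const long *v, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++) {
            if (__builtin_expect(v[i] < 0, 0))   /* branch marked "unlikely" */
                continue;                        /* rare bad entry: skip it  */
            s += v[i];
        }
        return s;
    }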


> Is there a reason speculative execution had to be added to the processor and not compilers?

Wasn't this supposed to be the lesson of Itanium and VLIW?

I have read (without understanding it for myself) that VLIW had two problems: (a) no one can actually write an appropriate compiler, and (b) that's not surprising, because the non-literalness of assembly is a useful hinge point allowing CPU designers to evolve their implementations while retaining a stable interface.

But this thread is talking about going even further and exposing new CPU features to programmers, and not just compilers. Does anyone have thoughts about what abstractions they would like to see?


> But this thread is talking about going even further and exposing new CPU features to programmers, and not just compilers.

You could argue they already do: __builtin_prefetch and its associated assembly instructions offer a way to manipulate the cache, but at the same time they seem to hedge their bets and pretend the cache isn't there. On the other end we've got programmers who know it's there and jump through hoops to try to keep things in contiguous memory, tricking the CPU into doing the correct thing.
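
For example (a sketch; __builtin_prefetch is a real GCC/Clang builtin, but the function is made up and the prefetch distance of 8 is a guess that would need tuning per machine, which is exactly the hedging problem):

    #include <stddef.h>

    double sum_with_prefetch(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&a[i + 8]);   /* hint: we'll need this soon */
            s += a[i];
        }
        return s;
    }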

That is truly an elephant-in-the-room situation: everyone knows there is a magic black box there but does their best to ignore it, because there is no other option.

> Wasn't this supposed to be the lesson of Itanium and VLIW?

It was the lesson, but that doesn't mean it was the correct one. In an alternate universe where there was no x86_64 we might have learned the opposite lesson.


I think I'm with you about caches.

I sometimes wish we wrote modern software for swarms of little 16-bit CPUs with 64kB of RAM and dealt with larger data structures by passing messages to other CPUs and to a DRAM-backed database.

And yes, a lot of software architecture is actually about emulating the above picture on a machine that pretends to have a flat address space, even though it actually only has 64kB of truly random-access memory.


> the non-literalness of assembly is a useful hinge point allowing CPU designers to evolve their implementations while retaining a stable interface

Case in point, we have already seen a few times how unanticipated NUMA or thread-scheduling weirdness can totally tank performance - and these are finer details!


> Is there a reason speculative execution had to be added to the processor and not compilers? Would we have been better off if the true memory model was exposed to compilers?

Binary compatibility is the reason. You can't break the memory model or the instruction set unless you actually have the ability to re-compile everything that your users want to run. This almost never happens in the personal computing world, and emulation is only a partial solution. Intel managed to transition from the 8080 to the 8086 without backwards compatibility by preserving assembly language compatibility, but every architecture shift since then has required either emulation or a backwards-compatible processor mode.


> Binary compatibility is the reason. You can't break the memory model or the instruction set unless you actually have the ability to re-compile everything that your users want to run.

I fail to see how having shared speculation is incompatible with existing binaries. AMD processors for example check security before speculating, instead of just running things in parallel and hoping they don't need to reset registers later.

Can you explain how speculative execution is systemic to x86? I can't see it to be honest.


During World War I, in order to get a decent working radio, the US government suspended lots of IP restrictions and made companies pool their patents. Perhaps in the future we could see something like this.

Considering that radio stocks were the "hot item" in the '20s, to the point where it became something of a bubble, it doesn't look like businesses made out too badly.


I gave a talk on this topic at a recent RISC-V conference (slides here: [1]).

I think this can entirely be hidden below the ISA abstraction, but it will be a lot of development work to do this and unfortunately getting new hardware into the field is a slow process. One part of this is buffering and killing speculative updates (or flushing on "time domain" switches). The other, and harder part, is to manage/avoid bandwidth contentions caused by speculation.

I think the better, long-term solutions will involve address IDs/capabilities to better denote security boundaries to the hardware. But that requires a lot of buy-in across many layers of the compute stack (OS/platform changes, for starters).

[1] https://content.riscv.org/wp-content/uploads/2018/05/13.00-1...


I heard rumors that Intel set up a research group to fix this in hardware as soon as the Spectre exploit showed up.

Mill may be cool and all but x86 isn't going anywhere.

Although I think it'd be hilarious if Intel decided to bring back Itanium with explicit per task cache indexing and no speculative execution. They already have the technology while Mill has never had a silicon implementation.


Yeah - AMD!

(I kid)

Although seriously with the "invisible hand" pushing platform lockdowns across the industry, it's time to stockpile some G34's next to the canned goods.


The Alternate Mitigation Device cpu?


Two of the most important things about security are understanding how valuable the information is and how likely an attack is. Spectre is very hard to pull off, so it will probably only be used to go after highly valuable information.


I keep thinking, surely Intel and AMD were internally aware of Spectre. How could they not foresee it?


The same way most security bugs happen: people are focused on making the expected results happen. The skill of thinking about how something could creatively be abused is hard to cultivate – theoretically this class of bugs has been around for decades but it wasn’t publicly exploited until now.


Am I the only one with the impression that Spectre and Meltdown vulnerabilities are way overblown? At the company I work for, nobody cries bloody murder about them. I don't see other organizations scrambling to update their infrastructure either. For the amount of press these vulnerabilities get there is very little action taken.


RISC is good.


The "new processor" is already being developed. Mill computing is designing the Mill CPU and made 13 videos so far on Youtube explaining most characteristics of the new CPU. The CPU is on par with the latest Xeon, uses 10x less power, and has many great features that enhance performance and security a lot. For example, portals calls are function calls from protection domains to another protection domain and automagically only allow the parameters to be read by the callee and the results be read by the caller. The rest of the data of the other protection domain cannot be accessed. The stack is also secure since there are 2 stacks: one holds all data that one normally find on a stack and is modifiable by the program and the second stack is a hardware-managed stack with the return addresses which is hidden from software. Also the protection on the stack changes with each call and each return such that a stack overflow results in an immediate fault. The CPU has many more features, too many to name here.


Extraordinary claims require extraordinary proof. Being on par with the latest Xeon, while using 10x less power, is an absurd claim.


Moore's law is well and truly dead.

That's one design cycle == 10x improvement.

So not unreasonable within the context of the history of the industry.

The proof will be in the product when it ships.


So there's no proof and the persistent pro-Mill posts are wishcasting?

Glad we cleared that up.


> The CPU is on par with the latest Xeon, uses 10x less power, and has many great features that enhance performance and security a lot.

Here are the last two companies to spout this line of garbage: Transmeta and SiByte. Remember them?

Your statement implies that the Intel engineers have left a lot of design on the table. I assure you, this hasn't been true for quite a long time (circa 1996, arguably, but certainly not since 2000--that's why both Transmeta and SiByte failed).

The simple answer to this is: graph the active area vs power consumed for all the chips of a particular technology node. Unless Intel is significantly below that line, you're not going to beat them by much.


> Your statement implies that the Intel engineers have left a lot of design on the table.

The Mill folks would rather argue that Intel's problem is they aren't leaving things on the table—instead, they're shoving tons of extra machinery into their core designs while the Mill tries to avoid extremely power-hungry performance enhancing techniques like speculative out of order execution.

> graph the active area vs power consumed for all the chips of a particular technology node

Setting aside the difficulties of accounting for differences between different manufacturer's processes at the same node, this still isn't a very informative measure. Active areas of the chip don't all make the same contribution to performance. Speculative execution can light up a lot of silicon executing instructions that don't actually need to be executed.

I'm quite skeptical that the Mill guys will ever produce a core that beats Intel on raw performance, but they have plenty of convincing arguments for why their power consumption will be much lower, and I think it's plausible that they will be able to beat Intel on performance per watt for some workloads.


> Mill tries to avoid extremely power-hungry performance enhancing techniques like speculative out of order execution.

Go look at Alpha 21164 vs x86 in the timeframe. You don't save as much as you think (and Intel's designs were much shoddier back then).

> Setting aside the difficulties of accounting for differences between different manufacturer's processes at the same node, this still isn't a very informative measure.

You would think so, but the graph is remarkably consistent. Just because a measure has noise does not make it a useless measure.

Intel could be significantly below the curve. But beating the curve is ferociously difficult.


Although the mythical improvements never materialized, at the very least Transmeta managed to ship a commercially viable CPU using some truly innovative technology, so I think it is unfair to bucket them together with the rest.


> The CPU is on par with the latest Xeon, uses 10x less power (...)

When you say "the CPU", do you mean a working ASIC? A simulation? Or a plan to make one?

And how do you compare them if they have completely different instruction sets?


A quick skim on wikipedia says it's yet another VLIW uarch. What's preventing it from being another Itanium?



