Linux RNG flaws (chromium.org)
182 points by tbodt on May 1, 2018 | hide | past | favorite | 65 comments


The flaws only applied to keys generated during early boot. Jann and I looked, but we didn't find a way that this could be turned into a practical exploit, which is why we considered, but decided against, doing any kind of coordinated disclosure. The potential weakness applies before you see the crng_init=2 message, and in practice on many machines this happens well before the one-minute mark --- on my laptop, in under 10 seconds.

The main problem with the fix is that there are some userspace applications which assumed they could get cryptographic randomness super-early during system startup, and with a patched kernel, those userspace applications would block --- and in some cases, block the boot altogether. With no activity, there is no entropy to harness, and the boot scripts essentially deadlock waiting for the random pool to be initialized.

So for some hardware, and some distributions, we're getting some boot hangs that we now need to try to work around or fix somehow. In general, the best thing to do is not to rely on cryptographic-strength random number generation during early boot. Key generation should be done lazily, and deferred for as long as possible.


What reasons are there not to implement the OpenBSD model where the bootloader passes the kernel a seed that the kernel generated previously?


The reason why this is hard is because Linux is supported on a great number of architectures, and some architectures have more than one boot loader that is used. The approach of letting the bootloader pass seed entropy is the right answer in general, I agree, but unlike OpenBSD we can't assume that it will always be present. (OpenBSD only supports a limited number of architectures, and a single bootloader.) Hence, no matter what, we have to have fallback mechanisms for dealing with the case where we have a bootloader which doesn't pass a seed to the kernel.

The other issue is that I don't get paid to work on the random driver in Linux. I've been looking for volunteers to work on grub, syslinux, and efistub, not to mention all of the various signed bootloaders used by different Android devices, but because we have a fallback mechanism there is less motivation for people to want to work on this.

This would actually be a great intern project or GSOC project, but things have been so hectic this year, personally and professionally, that I didn't have the time to commit to hosting an intern or GSOC student this summer. :-(


There is no reason not to try it, but you would still want to cater for the initial boot when you have no previous entropy to read.


Couldn't the installer write the initial seed?


There's not always an installer; I make heavy use of live images. On other machines of mine, the first boot is made from a copied image.


You may know this but for others who may not: there are tools like virt-sysprep that can (among other things) inject a random seed into disk images.

If you clone/generate VMs from one "master" image, running virt-sysprep (or similar) is one of the steps you should do right before launching a new instance (inject a new random seed, wipe out any SSH host keys, etc.)

Even DigitalOcean missed doing this step a while back.


I was about to mention the same thing. A pass to make an image unique before bootstrapping it is, or should be, a well-known thing. Remember the hoo-hah over unique machine SIDs in Windows NT. And there is of course systemd-firstboot in the Freedesktop world, for giving images unique D-Bus/systemd machine IDs.


If you have network access early enough in the install and/or boot process, could you not pluck some random bits from a web service to pump more entropy into the RNG's pool?

For paranoia you could fairly easily host your own instead of using a public one: have a publicly visible machine respond with random digits from its own entropy pool; have it use one or more of various methods to keep that pool topped up (local interrupt timings, a hardware RNG, ...); and use request hashing with a "secret" key if you are worried about the general public draining your entropy pool or available bandwidth.

And if it can't see the entropy service for any reason (it is booting in an environment where the rest of the network is completely cut off, or perhaps the entropy service is down) then fall back to just doing what-ever is done now.



Exactly that. I'd not spotted that a nice convenient option (pollen, with the pollinate client) already existed, rather than rolling your own. If the machine you host pollen on, to provide entropy for elsewhere via pollinate, needs its own pool topped up, try running haveged.


It's seeds all the way down.


Some distros dump the entropy pool to disk during shutdown and feed it back in during early boot. Would this mitigate the issue?


Part of this flaw was that that process wasn't working correctly.


Ah, I admit I didn't read far enough into the write-up.


In cloud environments, SSH host keys are generated during a new machine's first boot, well under a minute in. Say the host machine that runs the virtual machines writes the disk image to its local disk; the image is then in the host's cache memory, and the virtual machine boots from what is essentially a file in RAM cache or on a fast SSD --- so there is very little device activity to harvest for entropy.


Cloud hosting is actually the easiest case, because you have to trust the host environment anyway. (If you can't, you're sunk.) So we can just trust the use of virtio-rng. Setting this up for qemu is pretty simple.


I think that problem could be handled by having the RNG return an error until crng_init=2, and introducing a way to check the state of the kernel RNG.

Applications can then check if the kernel is ready to do RNG and if not, handle the delay themselves.

But I guess that would break some applications so there might be another way (extra device for early boot randomness? ioctl?)


When the device blocks, the application can always do a non-blocking read to determine if data is available.


Why doesn't the device simply block until it has collected sufficient randomness?


Chicken and egg; it won't properly boot until it has enough entropy, and you don't get enough entropy until it is booted.

The proper fix is to change the boot scripts to not assume that the CRNG will be available directly after boot.


It's better to deadlock than to revert to a weak RNG. Deadlocking guarantees the application gets fixed. Reminds me of setting an init system to restart crashed daemons.


> It's better to deadlock than to revert to a weak RNG

...for you. Other people have a different threat model, where availability is ranked above RNG strength in the first minute of booting. Making security decisions without any sort of threat-model considerations is zealotry/cargo-culting, IMO.


The kernel doesn't do that.

Above all else, the rule is that the kernel shouldn't cause userspace regressions. Something that works now must continue working later.


Or how about it blocks, and then if the user starts rambling on the keyboard, it collects entropy?


Right, that would be fine, I think. The point is it's got to break in a way that requires attention. In the IoT setting there is often no user input easily available. The OS should not lie to the apps when it hands them numbers that are supposed to be cryptographically random.


/dev/random kinda does, but a lot of applications rely on /dev/urandom, which doesn't --- which is why it's a problem to some extent.

At least, from what I know.


I don't understand why it's not possible to make /dev/random block until there is enough entropy in the pool for the seed, and never block afterwards when it can securely do key-stretching.

After seeding, there is no point in blocking on a lack of entropy, which is why it is recommended to use /dev/urandom in the first place.


This is what getrandom(2) does, and using getrandom is the recommended path. However, if you use getrandom(2) in early boot --- as systemd's journald does, for some pointless HMAC mechanism with a randomly generated key, no specified security goal, and no clearly articulated threat model --- then you can end up hanging the boot, leading to the entropy deadlock situation I described.

So getrandom(2) will block until the pool is seeded; all newly written applications should use it, and existing applications should switch to it. But you still need to generate random keys lazily, and think very hard about whether you really need to generate cryptographic-grade random numbers before the user logs in.


Is there a way to review the different paths and come up with answers for all of them?

E.g., if on a platform with a hardware RNG then it should never be necessary to block: just read 128 bits out of the hardware RNG & generate a stream of random numbers.

On a platform without a hardware RNG, could one require a previous seed? An installer could install the seed in some persistent storage somewhere, so there's no need to do this even on boot. A VM system needs some way to atomically read the seed and then write a new one.

On a platform without a hardware RNG and without a previous seed (i.e., the very first boot after a hand-install or something), could the system require the user to type keys until it has collected 128 bits of entropy?

I'm not certain if there's a good answer on systems with no hardware RNG, no previous seed and no input capability. Maybe CPU timing loops or somesuch?

It'd also be nice were there a filesystem interface to getrandom(2) …


How do you decide whether or not you trust the hardware RNG? Do you trust RDRAND? Some people do; other people are convinced it may be backdoored by Intel at the request of the NSA. Worst of all, there is no way to tell which belief is true. So there are potential real problems with hardware RNGs if you are worried about state-sponsored attackers who are willing to intercept hardware shipments.

Requiring a previous seed requires a way to get access to the seed, early enough in the boot that it is available to kernel users who are trying to use randomness for address space randomization and for stack canaries. But in early boot the kernel may not be sufficiently initialized to read from persistent storage, and there are many, many bootloaders.

The reason why there is no file system interface to getrandom(2) is that a file system interface is subject to file descriptor exhaustion attacks. It was OpenBSD which designed the getentropy(2) system call, and getrandom(2) was modelled after it. Basically, getrandom(2) is getentropy(2) with an extra flags parameter added.


> I'm not certain if there's a good answer on systems with no hardware RNG, no previous seed and no input capability.

Typically, the crypto is needed for network communication. It can provide some randomness too...



A perverse sense of ABI stability, I think, in this case: don't change expectations for any userspace program, ever. getrandom(2) theoretically provides the actually useful behavior.


What makes you think that that's impossible? It's what the BSDs do.


> So for some hardware, and some distributions, we're getting some boot hangs that we now need to try to workaround or fix somehow.

I recently updated to a 4.x series kernel and had this issue during boot. I had to interact with the VM to get it to unblock and boot.


What kind of applications need cryptographic randomness during early boot?


The classic application is generating the ssh server keys.

I don’t know if that solely uses the system RNG though.


Pretty silly. If you can network, you have access to at least three decent sources of entropy: network traffic timing, hardware clock randomness, and the network device's clock; potentially power-supply monitoring and bus clocks; and finally thermal sources. And then you can always bake the key into flash.

The problem here is that you would have to probe the RNG's state manually, since it reported readiness too early.


You don’t need network to start the SSH server, it just listens on all interfaces. Also, although it doesn’t matter that much in this case, trusting the network for entropy is not such a great idea because it isn’t trusted.


That is only relevant if your entropy estimator and mixer are broken and/or you have too few entropy sources. Linux entropy handling is quite well tested. (As opposed to initialization of the RNG.)


I wonder what sshd could turn to in order to get entropy if the system doesn't think it has enough?


If it can tell if the system doesn’t have enough, it can just wait for a while longer.


On RHEL 7 (and derivatives), by default, SSH keys are generated on first boot and with entropy acquired from /dev/urandom.

It is possible, however, to force sshd-keygen, sshd, et al. to use /dev/random instead (by setting "SSH_USE_STRONG_RNG" in /etc/sysconfig/sshd to a value >= 14). In this case, would keys still potentially be at risk?

---

For a while now, I've been generating my SSH host keys at the end of the kickstart installation process (for physical hosts) with SSH_USE_STRONG_RNG=32 (in a "%post" script).

Virtual guests (KVM) still generate theirs on first boot, but they have the benefit of access to the host's /dev/random and also get a random seed "injected" into their disk image just before first boot -- although I'm not sure if that helps with this issue.


Just to be sure I understand fully: take a small embedded router running OpenWrt, where it maybe takes ~1 min for the urandom pool to be ready. https://git.openwrt.org/?p=openwrt/openwrt.git;a=commitdiff;... On the first boot we save 512 bytes from getrandom(). On the second boot we restore the seed (cat seed > /dev/urandom) before almost all of the user space starts. Many daemons (dropbear/openvpn) will read from /dev/urandom before the urandom pool is considered ready, but after the seed is restored.

With the new patches, are the daemons benefiting from the seed?


Note that Dropbear will read urandom upon each incoming connection, so that should be mitigated in many circumstances. (Hopefully some systems are using the delayed hostkey generation option too.)


Seeing that `i % len` in https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin... makes me uncomfortable. This is performing a non-constant modulus on every loop iteration, which seems quite excessive even for a slow path. Performing wraparounds using a separate branch would be more efficient.


You should submit a patch.


The amount of data this is used for is vanishingly small. Of course it could be eliminated entirely by copying shorter inputs to a local CHACHA20_KEY_SIZE stack buffer first, but why bother? Typical input size is on the order of 5.

Additionally, CHACHA20_KEY_SIZE is a power of 2, and optimizing compilers like GCC will convert the modulus into a bitwise AND.


It's worth noting that FreeBSD also had some RNG troubles quite recently (2017)[0] (talk: [1]). Personally, I don't think they were as severe as this. (But I am biased, I do FreeBSD development.) What I'm trying to say is, getting random right is hard. It's easy to produce something random-looking that isn't good enough.

TL;DR of JMG's 2017 FreeBSD RNG talk:

1. Boot-time entropy might leak on either UFS or ZFS; might be re-used in a byzantine ZFS environment. (Not addressed, as far as I know.)

2. Input entropy data (pre-whitening) had only about 0.18 bits of entropy per byte of input. NIST likes to see 4-6 bits per byte. (Not really an issue due to sufficient input volume.) Partially due to:

3. All the fast, high-quality sources of random entropy were accidentally disabled(!). (All the PURE_* ones, including x86 RDRAND.) Fixed in r324394.

4. Other low entropy structures were getting mixed in. Probably harmless, but reduces entropy-per-byte measure that NIST cares about. Fixed in r324372.

The earlier 2015 FreeBSD RNG issues were probably as severe as this Linux issue[2]:

> URGENT: RNG broken for last 4 months

> If you are running a current kernel r273872 or later, please upgrade your kernel to r278907 or later immediately and regenerate keys.

> I discovered an issue where the new framework code was not calling randomdev_init_reader, which means that read_random(9) was not returning good random data. read_random(9) is used by arc4random(9) which is the primary method that arc4random(3) is seeded from.

> This means most/all keys generated may be predictable and must be regenerated. This includes, but not limited to, ssh keys and keys generated by openssl. This is purely a kernel issue, and a simple kernel upgrade w/ the patch is sufficient to fix the issue.

[0]: https://www.funkthat.com/~jmg/vbsdcon_2017_ddfreebsdrng_slid...

[1]: https://www.youtube.com/watch?v=A41cDCE6pTc

[2]: https://lists.freebsd.org/pipermail/freebsd-current/2015-Feb...


4795 b2242bd 30230fc 9fb6ae d7 7d 69 69 73 69 69 69 69 69 69 69 73 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 6e 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 69 5140 f a a a 5 a a a a 5 a

Nice.


Tldr?


Getting randomness early in a system's life, just after boot, is hard, and Linux wasn't handling it ideally. Consequently, things which relied on a cryptographically safe RNG were likely executing before that could be guaranteed.

Sequencing was modified to better align expectations with actual system operation.


From TFA:

    > == Discarded early randomness, including device randomness ==
    > == RNG is treated as cryptographically safe too early ==
    > == Interaction between kernel and entropy-persisting userspace is broken ==
    > == No entropy is fed into NUMA CRNGs between rand_initialize() initcall and crng_init==2 ==
    > == initcall can propagate entropy into primary and NUMA CRNGs while crng_init==1 ==
and

    > == Impact ==
    > I have spent a few days attempting to figure out how bad these issues are.
    > I believe that on an Intel Grass Canyon system, with RDRAND disabled,
    > ASLR disabled, fast boot enabled, no connected devices, with boot on power,
    > some frequency scaling options disabled, and the fan set to maximum,
    > it should be possible to express the entropy in the used RDTSC samples in around
    > 105 bits or less. (I'm not sure which parts of this configuration actually
    > influence the amount of entropy; but ASLR certainly does influence it, since the
    > one interrupt sample that is fed into the RNG before the RNG initialization
    > contains an instruction pointer.)


Due to some unfortunate complications of somewhat recent changes:

> The worst part of this (one device entropy sample being enough to move to crng_init==1) was AFAICS introduced in commit ee7998c50c26 ("random: do not ignore early device randomness"), first in v4.14.

confirmed by the fix for that bit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...


"105 bits" isn't too shabby.

I'd only start to worry when it gets to, say, 70 bits or less. That's the kind of size where a state actor could, for example, precompute ssh host keys for each of the initial random-number states in a common distro.

If a state actor could compute ssh keys in 50 milliseconds on custom hardware, and they had 100,000 machines generating those keys, they could generate enough keys to have a 1 in 1 million chance for 70 bits of entropy after 20 years.


Emphasis on "or less." Anyway, far far below expected entropy.


Read each of the headings and see the bottom part for impact. The final update (added today) mentions the resolution.


man urandom

> When read during early boot time, /dev/urandom may return data prior to the entropy pool being initialized.

I see no bugs here. Just repeating what is written in the man page.


If you continue to the next sentence in that man page:

> If this is of concern in your application, use getrandom(2) or /dev/random instead.

These bugs affect getrandom too.


Where is it written? Could you pinpoint it? I see urandom everywhere.


I got it from this part, unless I'm misreading:

> Multiple callers, including sys_getrandom(..., flags=0), attempt to wait for the RNG to become cryptographically safe before reading from it by checking for crng_ready() and waiting if necessary. However, crng_ready() only checks for `crng_init > 0`, and `crng_init==1` does not imply that the RNG is cryptographically safe.


Please don't post shallow dismissals.

None of the reported bugs are about /dev/urandom returning data too early.


Keep in mind that a lot of high profile security people on HN and elsewhere have spent many years now telling everybody that the linux urandom man page is wrong, so don't be surprised when people ignore it.

https://hn.algolia.com/?query=urandom%20manpage&sort=byPopul...


Wasn't there a popular article with the author vehemently supporting the use of /dev/urandom because of its nonblocking characteristics? I vividly remember it.

Also I think it's common knowledge that there's little entropy available during startup but developers have no control over it regardless.

I didn't know it was that bad though. That's really bad.



