Another fun thing with SMP: The x86 multiprocessor spec says that to start an AP you need to send an IPI, wait 10 ms, then send another IPI (an INIT followed by a Startup IPI, aka SIPI). On large systems, this adds up!
Except that you don't need to wait 10 ms for each AP -- you can start up the APs in parallel. There's just one small problem: All of the APs start up in the same state -- executing from the same CS:IP, and also the same stack pointer. Good luck having hundreds of CPUs stomping over each other's stacks.
Except that if you're careful, it doesn't matter -- you can even make a function call if you want because all of the CPUs will push the same return address onto the stack.
Implementing this in FreeBSD is on my "speeding up the boot" to-do list. I know it's possible though, because someone told me that they had already done exactly this in a different (non open source) system.
Lexicon for the non-x86 people: SMP = Symmetric MultiProcessing, aka more than one "virtual CPU". AP = Application Processor, any CPU other than the one which the BIOS starts up for you. IPI = InterProcessor Interrupt, how CPUs wake each other up. CS:IP = Code Segment + Instruction Pointer, where the CPU is reading instructions from.
Also, on any modern system, you really don't need the second SIPI. The CPU will come up with the first SIPI, and then ignore the second SIPI. So you can just send a pile of INITs and then a pile of SIPIs (or in theory one broadcast SIPI), and expect the CPUs to come up.
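To make the "pile of INITs, then a pile of SIPIs" idea concrete, here is a rough sketch of the command words involved, assuming the classic xAPIC ICR layout (delivery mode in bits 8-10, level-assert in bit 14, destination shorthand in bits 18-19, destination APIC ID in bits 24-31 of the high dword). The macro and function names are made up for illustration, and the actual MMIO writes to the local APIC are left out.

```c
#include <stdint.h>

/* Field values for the low dword of the local APIC ICR (xAPIC layout). */
#define ICR_DELIVERY_INIT     (5u << 8)   /* delivery mode 101b = INIT */
#define ICR_DELIVERY_STARTUP  (6u << 8)   /* delivery mode 110b = Start-up */
#define ICR_LEVEL_ASSERT      (1u << 14)
#define ICR_DEST_ALL_BUT_SELF (3u << 18)  /* shorthand 11b: everyone but the sender */

/* ICR low dword for an INIT IPI; broadcast != 0 targets all other CPUs. */
static uint32_t icr_init(int broadcast) {
    uint32_t v = ICR_DELIVERY_INIT | ICR_LEVEL_ASSERT;
    return broadcast ? (v | ICR_DEST_ALL_BUT_SELF) : v;
}

/* ICR low dword for a SIPI; the AP begins executing at vector * 0x1000. */
static uint32_t icr_sipi(uint8_t vector, int broadcast) {
    uint32_t v = ICR_DELIVERY_STARTUP | ICR_LEVEL_ASSERT | vector;
    return broadcast ? (v | ICR_DEST_ALL_BUT_SELF) : v;
}

/* ICR high dword for a single target: APIC ID goes in bits 24-31. */
static uint32_t icr_dest(uint8_t apic_id) {
    return (uint32_t)apic_id << 24;
}
```

In this sketch the BSP would write `icr_dest(id)` and then `icr_init(0)` for every AP, wait once, and then do the same with `icr_sipi(vector, 0)`, or use the broadcast variants and skip the per-CPU loop.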
For the startup code, you shouldn't need to make a function call. A few lines of memory-less stack-less assembly could get the CPU number and then change the stack, assuming you have a global value that gives the base of a preallocated array of stacks.
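The address arithmetic behind "change the stack" is small enough to show. This is only a sketch in C of what the assembly would compute; `AP_STACK_SIZE` and `ap_stack_top` are hypothetical names, and the 16 KiB size is an arbitrary assumption.

```c
#include <stdint.h>

#define AP_STACK_SIZE 0x4000u  /* assumption: 16 KiB per AP stack */

/*
 * Stacks grow downward on x86, so the initial stack pointer for CPU n
 * is the END of slot n in the preallocated array, not its start.
 */
static uintptr_t ap_stack_top(uintptr_t stacks_base, unsigned cpu) {
    return stacks_base + ((uintptr_t)cpu + 1) * AP_STACK_SIZE;
}
```

In assembly this is one multiply (or shift) and one add on registers, so it needs neither memory nor a stack beyond reading the global base.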
Core ID in TSC_AUX is essentially a concession to userspace. As an OS or firmware you are supposed to identify CPU cores by their LAPIC ID (as read from the APIC configuration MSR or from CPUID). One small issue there is that APIC IDs are structured according to HT/NUMA topology and thus not necessarily consecutive.
On the other hand, as an OS on a PC-like platform you know beforehand how many cores there are supposed to be and what their APIC IDs are, because they were already enumerated by firmware (which is the reason why you can do the one-by-one AP startup sequence in the first place).
Exactly: either read the APIC ID and use that to look up the CPU number in a table you already have, or arrange a location in memory to use xadd to assign a sequential CPU number, whichever your OS prefers.
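A minimal sketch of the table-lookup option, assuming the OS has already collected the firmware-reported APIC IDs into an array. The IDs in `example_apic_ids` are invented to show the non-consecutive case, and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical firmware-enumerated APIC IDs; note the gaps from topology encoding. */
static const uint32_t example_apic_ids[] = { 0, 2, 16, 18 };

/* Returns the dense CPU number for apic_id, or -1 if it was never enumerated. */
static int cpu_index_for_apic_id(const uint32_t *apic_ids, size_t n,
                                 uint32_t apic_id) {
    for (size_t i = 0; i < n; i++) {
        if (apic_ids[i] == apic_id)
            return (int)i;
    }
    return -1;
}
```

A linear scan is fine for a sketch; with hundreds of CPUs a direct-indexed table keyed by APIC ID would avoid the loop at the cost of some sparse memory.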
> For newer CPUs (P6, Pentium 4) one SIPI is enough, but I'm not sure if older Intel CPUs (Pentium) or CPUs from other manufacturers need a second SIPI or not.
When (and if) that became officially sanctioned behaviour is another question.
> All of the APs start up in the same state -- executing from the same CS:IP, and also the same stack pointer. Good luck having hundreds of CPUs stomping over each other's stacks.
I'm away from my hobby OS to double check, but isn't it the case that the Start IPI includes a page number which drives CS? If you send those out one by one, you can give each AP its own code page and set the stack page based on that (either using the CS value to index into something, or as an immediate value in the code that you modify as you copy it to the page). Of course, if you do a broadcast SIPI, then all of those are going to have the same CS. Depending on how much early boot code you fancy writing in assembler, you could maybe jump into protected/long mode, find the current CPU ID, and look up the proper stack pointer without using the stack at all, and only then jump into C code? Of course, one probably has nice C functions for some of those things, so it doesn't seem nice to also have them in assembly.
> I'm away from my hobby OS to double check, but isn't it the case that the Start IPI includes a page number which drives CS?
I double checked, and as I understand it, with the traditional APIC Start IPI you get to pick the CS address to be (0-255) * 0x1000, although how much of the first 1 MB of physical address space is actually available depends on the system memory map. I just start one AP at a time and use the top of the code page as stack space until it switches to the intended kernel stack; then that AP starts the next one. That's not time efficient, though. You could pretty easily start as many APs in parallel as you've got low pages available. The option where everybody starts from the same place and they figure it out among themselves is probably simpler, because there's never a need to wait for an AP to finish starting before starting more APs. Just saying: you've got options, and they don't all have to start at the same CS:IP.
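For reference, the vector-to-address mapping works out as below. This is a sketch with made-up names (`struct ap_entry`, `sipi_entry`); the underlying rule is that the AP starts in real mode with CS = vector << 8 and IP = 0, which is the same thing as physical address vector * 0x1000.

```c
#include <stdint.h>

/* Where an AP begins executing after a SIPI with the given vector. */
struct ap_entry {
    uint16_t cs;    /* real-mode code segment */
    uint16_t ip;    /* always 0 */
    uint32_t phys;  /* resulting physical address, cs * 16 + ip */
};

static struct ap_entry sipi_entry(uint8_t vector) {
    struct ap_entry e;
    e.cs = (uint16_t)(vector << 8);   /* vector 0x08 -> CS 0x0800 */
    e.ip = 0;
    e.phys = (uint32_t)vector << 12;  /* vector 0x08 -> 0x8000 */
    return e;
}
```

So giving each AP its own code page just means copying the trampoline to a different 4 KiB page below 1 MB and sending a different vector.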
If you know how many CPUs you are bringing up, then you can allocate a bunch of stacks contiguously and have the CPUs race to pick up the next one, say
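mov rsp, STACKSIZE            ; amount each CPU claims
lock xadd [currstack], rsp    ; rsp = old currstack; currstack += STACKSIZE
; (currstack must start out pointing at the top of the first stack)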
Of course, the contention on that xadd is going to cost you (if not 10ms per CPU... probably?), and this presumes you aren’t using the kernel stack pointer for anything (like a stable CPU number). To fix that, you probably will need to traverse a CPU -> startup data map in assembly. But it’s a start (no pun intended), and is not as horrendous a hack as having multiple CPUs push the same return address onto the same stack.
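The same race can be written with C11 atomics, which may be easier to reason about than the two-instruction version. A sketch only: `stacks_init` and `claim_stack` are hypothetical names, the stack size is an assumption, and `atomic_fetch_add` is used because it returns the old value, which is the xadd behaviour (compilers targeting x86-64 typically emit `lock xadd` for it).

```c
#include <stdatomic.h>
#include <stdint.h>

#define STACKSIZE 0x4000u  /* assumption: 16 KiB per stack */

/* Top of the next unclaimed stack in a contiguous array of stacks. */
static _Atomic uintptr_t currstack;

/* Run once on the boot CPU before any AP is started. */
static void stacks_init(uintptr_t base) {
    /* Stacks grow down, so the first claimable stack pointer is base + STACKSIZE. */
    atomic_store(&currstack, base + STACKSIZE);
}

/* Each AP calls this once; every caller gets a distinct stack top. */
static uintptr_t claim_stack(void) {
    return atomic_fetch_add(&currstack, STACKSIZE);
}
```

Note that which CPU gets which stack depends on who wins the race, which is exactly why the stack pointer cannot double as a stable CPU number here.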
As an order of magnitude point, my experience has been that a bunch of CPUs trying to xadd has a throughput bottleneck on the scale of once per 50 to 100 nanoseconds.
But even if you allow an entire extra order of magnitude, at one per microsecond, that's still 10000 over the course of 10 milliseconds, which is plenty for this use case, at least for now.
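The back-of-envelope numbers, written out as a tiny helper (hypothetical name, integer nanoseconds assumed):

```c
#include <stdint.h>

/* How many serialized operations fit in a time window, both in nanoseconds. */
static uint64_t ops_in_window(uint64_t window_ns, uint64_t per_op_ns) {
    return window_ns / per_op_ns;
}
```

With a 10 ms window, `ops_in_window(10000000, 1000)` gives 10000 at one xadd per microsecond, and `ops_in_window(10000000, 100)` gives 100000 at the pessimistic end of the 50 to 100 ns estimate.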
As far as the startup process is concerned, SMT is two CPUs. I don't actually know how SMT works when one "CPU" has been started and the other hasn't... I guess it just pretends that it hit a hlt instruction on the unstarted "CPU"?