NVMe confusion

Ethin · Post by **Ethin** » Thu Dec 17, 2020 12:06 am

So, I'm working on my NVMe driver again (now that I have more time since the college semester is over, yay!). I'm trying to figure out how to properly send commands and acknowledge responses. I've been reading the NVMe specification -- section 7.2 -- and it describes the general process. For reference:

This section describes command submission and completion processing. Figure 432 shows the steps that are followed to submit and complete a command. The steps are:
1. The host places one or more commands for execution in the next free Submission Queue slot(s) in memory;
2. The host updates the Submission Queue Tail Doorbell register with the new value of the Submission Queue Tail entry pointer. This indicates to the controller that a new command(s) is submitted for processing;
3. The controller transfers the command(s) in the Submission Queue slot(s) into the controller for future execution. Arbitration is the method used to determine the Submission Queue from which the controller starts processing the next candidate command(s), refer to section 4.13;
4. The controller then proceeds with execution of the next command(s). Commands may complete out of order (the order submitted or started execution);
5. After a command has completed execution, the controller places a completion queue entry in the next free slot in the associated Completion Queue. As part of the completion queue entry, the controller indicates the most recent Submission Queue entry that has been consumed by advancing the Submission Queue Head pointer in the completion entry. Each new completion queue entry has a Phase Tag inverted from the previous entry to indicate to the host that this completion queue entry is a new entry;
6. The controller optionally generates an interrupt to the host to indicate that there is a new completion queue entry to consume and process. In the figure, this is shown as an MSI-X interrupt, however, it could also be a pin-based or MSI interrupt. Note that based on interrupt coalescing settings, an interrupt may or may not be generated for each new completion queue entry;
7. The host consumes and then processes the new completion queue entries in the Completion Queue. This includes taking any actions based on error conditions indicated. The host continues consuming and processing completion queue entries until a previously consumed entry with a Phase Tag inverted from the value of the current completion queue entries is encountered; and
8. The host writes the Completion Queue Head Doorbell register to indicate that the completion queue entry has been consumed. The host may consume many entries before updating the associated Completion Queue Head Doorbell register.

I'm confused about the doorbells, though. The spec notes (in sections 3.1.24 and 3.1.25) that "The difference between the last SQT write and the current SQT write indicates the number of commands added to the Submission Queue" and "The difference between the last CQH write and the current CQH entry pointer write indicates the number of entries that are now available for re-use by the controller in the Completion Queue". Does this mean that I need to calculate the difference between the old SQT and the new SQT, and the same for the CQHs? I.e.: say I write 6 commands in the admin queue. Would I write 6 to the submission queue tail doorbell, read all six responses, then write 6 to the completion queue head doorbell? And then if I write another command in, do I write 1 to the submission doorbell and 1 to the completion doorbell to acknowledge the response?

Octocontrabass · Post by **Octocontrabass** » Thu Dec 17, 2020 10:28 am

Ethin wrote: Does this mean that I need to calculate the difference between the old SQT and the new SQT and same for the CQHs?

No, it means the hardware will calculate the difference. To use your example, you'd write a 7 to indicate one additional command after the initial 6.
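A minimal sketch of that point, assuming a hypothetical `SubmissionQueue` type in which a plain `doorbell` field stands in for the memory-mapped Submission Queue Tail Doorbell (a real driver would do a volatile MMIO write instead). The names are illustrative and not taken from anyone's driver, and the sketch does not check for a full queue. It only shows that the value written is the new tail index, not the number of commands added:

```rust
/// Hypothetical submission queue state; names are illustrative only.
pub struct SubmissionQueue {
    entries: u32,      // number of slots in the queue
    tail: u32,         // index of the next free slot
    pub doorbell: u32, // stand-in for the SQ Tail Doorbell register
}

impl SubmissionQueue {
    pub fn new(entries: u32) -> Self {
        assert!(entries >= 2, "a queue needs at least two slots");
        Self { entries, tail: 0, doorbell: 0 }
    }

    /// Place `count` commands, then ring the doorbell once.
    pub fn submit(&mut self, count: u32) {
        for _ in 0..count {
            // A real driver would copy the 64-byte command into slot `self.tail` here.
            self.tail = (self.tail + 1) % self.entries;
        }
        // The doorbell receives the absolute tail index. The controller works
        // out how many commands were added from the previous value it saw.
        self.doorbell = self.tail;
    }
}
```

With a 32-entry queue, `submit(6)` leaves `doorbell` at 6, and a following `submit(1)` leaves it at 7, matching the example above.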

Ethin · Post by **Ethin** » Thu Dec 17, 2020 11:44 am

Aha, thanks. As for the queue head code, do I just update that whenever I read a new queue entry, or do I update that at every entry I read? Also, I note that the spec indicates that rollover has to be accounted for. For now, when I increment the tail or head pointers, I use wrapping addition modulo the number of entries in the queue (`self.qtail = self.qtail.wrapping_add(1) % self.entries` and `self.qhead = self.qhead.wrapping_add(1) % self.entries`). Is there anything special I need to do along with this to account for rollover?

Octocontrabass · Post by **Octocontrabass** » Thu Dec 17, 2020 12:17 pm

Ethin wrote: Aha, thanks. As for the queue head code, do I just update that whenever I read a new queue entry, or do I update that at every entry I read?

You can update it for every new entry you consume, but it may be faster to consume all of the available new entries and then update it when you reach a stale entry.

Ethin wrote: Is there anything special I need to do along with this to account for rollover?

Nope, that's all there is to it.
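For illustration, the rollover handling can be written as a few free functions. This is a sketch with hypothetical helper names (`next_index`, `is_full`, `is_empty`); the full and empty conditions follow the usual circular-queue convention in which one slot is always left unused. As long as the index is always smaller than the number of entries, the plain addition cannot overflow, so `wrapping_add` is harmless but not strictly required:

```rust
/// Advance a queue index by one slot, wrapping to 0 after the last slot.
pub fn next_index(index: u32, entries: u32) -> u32 {
    debug_assert!(entries >= 2 && index < entries);
    (index + 1) % entries
}

/// The queue is full when advancing the tail would make it equal to the head.
pub fn is_full(head: u32, tail: u32, entries: u32) -> bool {
    next_index(tail, entries) == head
}

/// The queue is empty when head and tail point at the same slot.
pub fn is_empty(head: u32, tail: u32) -> bool {
    head == tail
}
```

For a 4-entry queue, `next_index(3, 4)` gives 0, and a queue with head 0 and tail 3 is already full even though slot 3 is unused.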

Ethin · Post by **Ethin** » Thu Dec 17, 2020 12:25 pm

Octocontrabass wrote:
Ethin wrote: Aha, thanks. As for the queue head code, do I just update that whenever I read a new queue entry, or do I update that at every entry I read?
You can update it for every new entry you consume, but it may be faster to consume all of the available new entries and then update it when you reach a stale entry.

Don't you mean a new entry? Stale would imply an old entry, which would mean I'd need to check to see if the phase bit is inverted from what my code believes it to be.

Octocontrabass wrote:
Ethin wrote: Is there anything special I need to do along with this to account for rollover?
Nope, that's all there is to it.

Thanks! This makes things really easy (compared to other disk storage mechanisms anyway...)!

Octocontrabass · Post by **Octocontrabass** » Thu Dec 17, 2020 1:00 pm

Ethin wrote: Don't you mean a new entry? Stale would imply an old entry, which would mean I'd need to check to see if the phase bit is inverted from what my code believes it to be.

No, I mean a stale entry. One interrupt per command completion is pretty expensive for something as fast as NVMe, so you may want to process all available entries each time you receive an interrupt (and set up interrupt coalescing to reduce the number of interrupts you receive). If you choose to do that, you'll need to check the phase bit before consuming each entry.

Ethin · Post by **Ethin** » Thu Dec 17, 2020 1:26 pm

Octocontrabass wrote:
Ethin wrote: Don't you mean a new entry? Stale would imply an old entry, which would mean I'd need to check to see if the phase bit is inverted from what my code believes it to be.
No, I mean a stale entry. One interrupt per command completion is pretty expensive for something as fast as NVMe, so you may want to process all available entries each time you receive an interrupt (and set up interrupt coalescing to reduce the number of interrupts you receive). If you choose to do that, you'll need to check the phase bit before consuming each entry.

So you're saying then that when reading entries, I should update the head pointer if the phase bit is *not* inverted (i.e., it's not a new entry)? I get that one interrupt per completion entry is ridiculously expensive considering how fast NVMe is, and I'll definitely enable interrupt coalescing. But I check the phase bit already to determine if the entry I'm reading is "new" and if it is, I store it for future processing. For non-"new" entries I don't consume them though (just to avoid me consuming the same entry repeatedly by accident).

Octocontrabass · Post by **Octocontrabass** » Thu Dec 17, 2020 2:36 pm

Ethin wrote: So you're saying then that when reading entries, I should update the head pointer if the phase bit is *not* inverted (i.e., it's not a new entry)?

When you consume entries, you have to update the head pointer to tell the controller that you're done reading them. You can update the head pointer each time you consume one entry, but that involves writing to the head pointer many times. It's faster to read all of the new entries in one go, and then update the head pointer after you've read all of them. The phase bit is only tangentially related: you can tell that you've run out of new entries to read when you find a stale entry, and you can tell that an entry is stale when its phase bit hasn't been inverted.
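A rough sketch of that loop, assuming a hypothetical `CompletionQueue` type where a `Vec` stands in for the DMA buffer and a plain `doorbell` field stands in for the Completion Queue Head Doorbell (a real driver would use volatile reads and an MMIO write). It assumes the queue memory starts zeroed, so entries written on the controller's first pass carry a Phase Tag of 1, and that the Phase Tag is bit 0 of the 16-bit status word. The type and field names are illustrative:

```rust
/// 16-byte completion queue entry; the field split is for illustration.
#[derive(Clone, Copy, Default)]
#[repr(C)]
pub struct CompletionEntry {
    pub dw0: u32,
    pub dw1: u32,
    pub sq_head: u16,
    pub sq_id: u16,
    pub command_id: u16,
    pub status: u16, // bit 0: Phase Tag, bits 15:1: status field
}

pub struct CompletionQueue {
    pub slots: Vec<CompletionEntry>, // stand-in for the DMA buffer
    head: usize,                     // next slot the host will read
    phase: bool,                     // Phase Tag value that marks a new entry
    pub doorbell: u32,               // stand-in for the CQ Head Doorbell
    pub doorbell_writes: u32,        // counts doorbell writes, for demonstration
}

impl CompletionQueue {
    pub fn new(entries: usize) -> Self {
        Self {
            slots: vec![CompletionEntry::default(); entries],
            head: 0,
            phase: true,
            doorbell: 0,
            doorbell_writes: 0,
        }
    }

    /// Consume every new entry, then write the head doorbell once.
    pub fn drain(&mut self) -> Vec<CompletionEntry> {
        let mut done = Vec::new();
        loop {
            let entry = self.slots[self.head];
            // A Phase Tag that does not match the expected value means the
            // entry is stale: there is nothing new left to read.
            if (entry.status & 1 != 0) != self.phase {
                break;
            }
            done.push(entry);
            self.head += 1;
            if self.head == self.slots.len() {
                // Wrapped around: new entries now carry the inverted tag.
                self.head = 0;
                self.phase = !self.phase;
            }
        }
        if !done.is_empty() {
            self.doorbell = self.head as u32;
            self.doorbell_writes += 1;
        }
        done
    }
}
```

Calling `drain()` from the interrupt handler returns however many completions are pending and touches the doorbell at most once per call.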

Ethin · Post by **Ethin** » Thu Dec 17, 2020 2:54 pm

Octocontrabass wrote:
Ethin wrote: So you're saying then that when reading entries, I should update the head pointer if the phase bit is *not* inverted (i.e., it's not a new entry)?
When you consume entries, you have to update the head pointer to tell the controller that you're done reading them. You can update the head pointer each time you consume one entry, but that involves writing to the head pointer many times. It's faster to read all of the new entries in one go, and then update the head pointer after you've read all of them. The phase bit is only tangentially related: you can tell that you've run out of new entries to read when you find a stale entry, and you can tell that an entry is stale when its phase bit hasn't been inverted.

Oh. What I do now is update the head pointer in memory and then once I've consumed all the entries I then write the new pointer to the doorbell. I don't ring the doorbell every time I consume an entry.

Ethin · Post by **Ethin** » Fri Dec 18, 2020 1:56 pm

Okay, so now I'm getting this weird page fault when my driver tries reading and writing to NVMe submission queues. When my driver initializes it uses the RDRAND instruction to generate a random memory address for the queues. For instance, on this latest boot it generated 2DE354779000h for the ASQ and 3ACF7E2D8000h for the ACQ (masking bits 47-63, of course). My driver then goes to try sending the identify command. The queue handler updates the internal queue tail counter (from zero to one), then tries writing byte zero of the command to 2DE354779004h. This generates a page not present page fault, which doesn't seem right because my memory manager notes the allocation (addresses 2DE354779000h-2DE3547797FFh and 3ACF7E2D8000h-3ACF7E2D87FFh, respectively), and didn't fail the allocation, so clearly it succeeded. Yet the processor now is telling me otherwise, and I'm quite confused. The memory isn't being freed anywhere, so it (should) still be mapped.... Right?

Octocontrabass · Post by **Octocontrabass** » Fri Dec 18, 2020 3:36 pm

Are those virtual or physical addresses?

How are you mapping half of a page?

Ethin · Post by **Ethin** » Fri Dec 18, 2020 4:38 pm

These are physical allocations, not virtual ones. I'm not mapping half a page (to my knowledge that's not possible). When I allocate a memory range, I give it a start and an end address, and then I ensure my code never accesses memory beyond that range -- even if it's less than a full page. My code maps the whole page but I ensure that it never reads/writes beyond that end address; I could probably do it, but I'd rather not test it to find out.

Octocontrabass · Post by **Octocontrabass** » Fri Dec 18, 2020 6:57 pm

I'm still confused: are those virtual or physical addresses? (Or both, if you're identity-mapping everything.)

A page fault indicating page not present means you've written something incorrectly to your page tables. For other types of page faults, it could indicate a stale TLB entry, but as far as I know Intel and AMD require pages to be present to be stored in the TLB.
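When debugging a fault like this, it can help to decode the error code the CPU pushes for the page fault exception. A small sketch, using a hypothetical `PageFaultInfo` struct and `decode_page_fault` function; the bit positions are the architectural x86 ones (bit 0 present, bit 1 write, bit 2 user, bit 3 reserved-bit violation, bit 4 instruction fetch). The faulting address itself is in CR2, which this sketch does not read:

```rust
/// Decoded x86 page fault error code (low five bits only).
#[derive(Debug, PartialEq, Eq)]
pub struct PageFaultInfo {
    pub page_present: bool,      // false: the fault hit a non-present page
    pub write: bool,             // true: the access was a write
    pub user_mode: bool,         // true: the access came from user mode
    pub reserved_bit: bool,      // true: a reserved bit was set in a paging entry
    pub instruction_fetch: bool, // true: the access was an instruction fetch
}

pub fn decode_page_fault(error_code: u64) -> PageFaultInfo {
    PageFaultInfo {
        page_present: error_code & (1 << 0) != 0,
        write: error_code & (1 << 1) != 0,
        user_mode: error_code & (1 << 2) != 0,
        reserved_bit: error_code & (1 << 3) != 0,
        instruction_fetch: error_code & (1 << 4) != 0,
    }
}
```

An error code of `0b00010`, for example, decodes as a supervisor-mode write to a page that is not present, which would point at the page table entries rather than at permissions.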

Ethin · Post by **Ethin** » Fri Dec 18, 2020 7:10 pm

They're physical addresses. I'm identity mapping everything -- I don't like other mapping strategies (like recursive mapping, I find that incredibly confusing). I always map physical addresses with the present, writable, and no cache bits set. My memory manager panics if a frame allocation failure occurs, and that's (not) happening, so I really am lost here.
My Rust toolchain is fucked at the moment; Cargo is failing to read vendor paths for the Rust source code, and I've submitted an issue about that here, so hopefully that gets fixed soon -- though I might just reinstall my entire Rust toolchain.

Octocontrabass · Post by **Octocontrabass** » Fri Dec 18, 2020 7:31 pm

Does memory exist at those addresses? The NVMe controller doesn't provide memory to store the queues.
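One way to answer that question in code is to check a candidate queue address against the firmware-provided memory map before using it. A minimal sketch, assuming a hypothetical `Region` type and a list of usable RAM regions that the kernel has already collected from the bootloader; the function name and the requirement that the range fit inside a single region are assumptions made for illustration:

```rust
/// A usable RAM region from the memory map (hypothetical representation).
#[derive(Clone, Copy)]
pub struct Region {
    pub start: u64,
    pub len: u64,
}

/// Returns true only if [start, start + len) lies inside one usable region.
pub fn is_backed_by_ram(usable: &[Region], start: u64, len: u64) -> bool {
    let end = match start.checked_add(len) {
        Some(end) => end,
        None => return false, // the range wraps the address space
    };
    usable
        .iter()
        .any(|r| start >= r.start && end <= r.start.saturating_add(r.len))
}
```

On a machine whose only usable region is a couple of gigabytes starting at 1 MiB (a made-up map), a randomly generated address such as 2DE354779000h would fail this check, while an address inside the region would pass.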

It's not really on topic, but I'm curious what your other plans are without paging.

OSDev.org
