Qemu: NVMe controller gets stuck during controller config

Ethin · Post by **Ethin** » Mon Sep 28, 2020 9:36 am

Thanks, Octocontrabass -- I forgot about that.

Ethin · Post by **Ethin** » Mon Sep 28, 2020 1:37 pm

So I think qemu is broken. My log2 function is like this:

(8 * size_of::<u64>() - (n.leading_zeros() as usize) - 1) as u64

Which is equivalent to, in C:

Code: Select all

#define LOG2(X) ((unsigned) (8*sizeof (unsigned long long) - __builtin_clzll((X)) - 1))

This gets me 10 (MQES is 2047). Using math.log2 in Python yields 10.99 (so my implementation is correct). This doesn't handle zero, but the likelihood of MQES being zero is nil anyway. However, I still get this same error, no matter what binary logarithm algorithm I try:

pci_nvme_err_startfail_cqent_too_large nvme_start_ctrl failed because the completion queue entry size is too large: log2size=10, max=15
pci_nvme_err_startfail setting controller enable bit failed

Thoughts?

BenLunt · Post by **BenLunt** » Mon Sep 28, 2020 5:34 pm

Ethin wrote:This gets me 10 (MQES is 2047). Using math.log2 in Python yields 10.99 (so my implementation is correct).

Remember that the value is zero-based. The CAP.MQES value is the index of the highest allowed entry, not the count, so you need to add 1 to get a count before you calculate the logarithm.
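The zero-based point can be sketched in C as follows. This is only an illustration, not code from anyone's driver: the helper names `nvme_max_queue_entries` and `floor_log2_u64` are made up for this example, it assumes the 64-bit CAP register has already been read (MQES is bits 15:0), and `__builtin_clzll` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stdint.h>

/* CAP.MQES occupies bits 15:0 of the 64-bit CAP register and is zero-based. */
static inline uint32_t nvme_max_queue_entries(uint64_t cap)
{
    uint32_t mqes = (uint32_t)(cap & 0xFFFFu);
    return mqes + 1u;   /* turn "highest index" into "number of entries" */
}

/* Floor of log2(n), same formula as the LOG2 macro; n must be non-zero. */
static inline unsigned floor_log2_u64(uint64_t n)
{
    assert(n != 0);
    return (unsigned)(8 * sizeof(unsigned long long) - __builtin_clzll(n) - 1);
}
```

With MQES = 2047 the count is 2048, whose base-2 logarithm is exactly 11, whereas taking the logarithm of 2047 directly gives 10. Either way, this number describes how many entries a queue may hold, not how large each entry is.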

Ethin wrote:
pci_nvme_err_startfail_cqent_too_large nvme_start_ctrl failed because the completion queue entry size is too large: log2size=10, max=15
pci_nvme_err_startfail setting controller enable bit failed
Thoughts?

I am getting the same results.

Here are my thoughts, though don't take them as fact. I haven't researched it enough yet.

Starting with line 1707 is where QEMU makes the checks you are showing above.

However, after initialization and before you get the IDENTIFY block, you have no clue what the value is to compare to. Yet, QEMU is already comparing these values.
It is the chicken before the egg.

You need to send the IDENTIFY command to get the minimum and maximum values before you can place these values, yet QEMU is comparing these values before you have a chance to send the IDENTIFY command.

However, since QEMU displays the values it compared with, even after placing these values in my initialization, QEMU still complains at this point.

If I figure something out, I will post here. I ask that you do the same.

Thanks,
Ben
- http://www.fysnet.net/osdesign_book_series.htm

Octocontrabass · Post by **Octocontrabass** » Mon Sep 28, 2020 6:06 pm

Ethin wrote:Thoughts?

QEMU is broken. The error message is printing garbage instead of the actual maximum allowed value.

As for the bug in your code, MQES indicates the maximum number of queue entries, not the maximum size of each entry. For that, you need to look at SQES and CQES in the Identify Controller data structure, which you can retrieve using the Identify command.

In order to send the Identify command, you still need to set the entry size; lucky for you, the minimum entry sizes are fixed values, so you can just use those to start out.

I'm not sure you actually need to know the maximum entry size. The spec seems to imply that the minimum is good enough for typical use.

BenLunt · Post by **BenLunt** » Mon Sep 28, 2020 9:41 pm

Octocontrabass wrote:
Ethin wrote:Thoughts?
QEMU is broken.

Yep, after a little more looking (I don't know why I didn't catch this before) QEMU is broken.

Look at lines 2357 and 2358.

Code: Select all

    id->sqes = (0x6 << 4) | 0x6;
    id->cqes = (0x4 << 4) | 0x4;

This is where QEMU sets its internal values for the I/O queue entry sizes. It sets SQES (both max and min) to 6 (2^6 = 64 bytes) and CQES (both max and min) to 4 (2^4 = 16 bytes).
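For readers decoding these bytes themselves, here is a minimal sketch of how one SQES or CQES byte splits into its two nibbles. The struct and function names are hypothetical; the only facts relied on are that the low nibble (bits 3:0) holds the required size, the high nibble (bits 7:4) holds the maximum, and both are powers of two.

```c
#include <stdint.h>

/* One decoded entry-size byte from the Identify Controller data. */
struct nvme_entry_size_range {
    unsigned min_log2;   /* bits 3:0, required entry size as log2(bytes) */
    unsigned max_log2;   /* bits 7:4, maximum entry size as log2(bytes)  */
};

static inline struct nvme_entry_size_range nvme_decode_entry_size(uint8_t raw)
{
    struct nvme_entry_size_range r;
    r.min_log2 = raw & 0x0Fu;
    r.max_log2 = (raw >> 4) & 0x0Fu;
    return r;
}

/* Convert a log2 size into bytes, e.g. 6 -> 64 and 4 -> 16. */
static inline unsigned nvme_entry_bytes(unsigned log2_size)
{
    return 1u << log2_size;
}
```

Feeding in `(0x6 << 4) | 0x6` yields a minimum and maximum of 6 (64 bytes), and `(0x4 << 4) | 0x4` yields 4 and 4 (16 bytes).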

Now look at the code starting at line 1707: (I have removed the inside part of the block for clarity here)

Code: Select all

    if (unlikely(NVME_CC_IOCQES(n->bar.cc) < NVME_CTRL_CQES_MIN(n->id_ctrl.cqes))) {
        // return error      
    }
    if (unlikely(NVME_CC_IOCQES(n->bar.cc) > NVME_CTRL_CQES_MAX(n->id_ctrl.cqes))) {
        // return error      
    }
    if (unlikely(NVME_CC_IOSQES(n->bar.cc) <  NVME_CTRL_SQES_MIN(n->id_ctrl.sqes))) {
        // return error      
    }
    if (unlikely(NVME_CC_IOSQES(n->bar.cc) >  NVME_CTRL_SQES_MAX(n->id_ctrl.sqes))) {
        // return error      
    }

Since the values are hard coded to be the same values for the MAX and MIN, the above might as well have been:

Code: Select all

    if (unlikely(NVME_CC_IOCQES(n->bar.cc) != NVME_CTRL_CQES_MIN(n->id_ctrl.cqes))) {
        // return error      
    }
    if (unlikely(NVME_CC_IOSQES(n->bar.cc) !=  NVME_CTRL_SQES_MIN(n->id_ctrl.sqes))) {
        // return error      
    }

Since the first block of code, shown again here--

Code: Select all

    id->sqes = (0x6 << 4) | 0x6;
    id->cqes = (0x4 << 4) | 0x4;

--is hard-coded to have the same MAX and MIN values for each, there is no range. Second, since we have not received the IDENTIFY block yet, we don't know these values yet. The current code assumes we have researched and found that it requires these values to be 6 and 4 respectively.

IMHO, QEMU needs to do one of two things:
1) skip the check above if the IDENTIFY command has not been called, ignoring the current settings of CC.IOSQES and CC.IOCQES.
or
2) set the code to be

Code: Select all

    id->sqes = (0xF << 4) | 0x0;
    id->cqes = (0xF << 4) | 0x0;

by default and once the IDENTIFY command has been called, then set the internal values to valid values (the same values found in the IDENTIFY block).

As a work-around, if you initialize your driver to use 6 and 4 (respectively), the emulation will enable the controller and continue on.
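To see why only 6 and 4 get through, here is a simplified model of the enable-time range check. It is not QEMU's code; the function name and parameters are assumptions for illustration, and the inputs are the CC.IOSQES and CC.IOCQES values the driver wrote plus the raw SQES and CQES bytes the controller holds internally.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model: each requested entry size (as log2) must lie between
 * the low nibble (minimum) and high nibble (maximum) of the matching byte.
 */
static inline bool entry_sizes_acceptable(unsigned iosqes, unsigned iocqes,
                                          uint8_t id_sqes, uint8_t id_cqes)
{
    unsigned sq_min = id_sqes & 0x0Fu, sq_max = (unsigned)(id_sqes >> 4);
    unsigned cq_min = id_cqes & 0x0Fu, cq_max = (unsigned)(id_cqes >> 4);

    return iosqes >= sq_min && iosqes <= sq_max &&
           iocqes >= cq_min && iocqes <= cq_max;
}
```

With `id_sqes = 0x66` and `id_cqes = 0x44`, the pair (6, 4) passes, while a completion entry size of 10 or a swapped pair (4, 6) is rejected.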

Ben

Octocontrabass · Post by **Octocontrabass** » Mon Sep 28, 2020 10:13 pm

BenLunt wrote:QEMU needs to do one of two things:
1) skip the check above if the IDENTIFY command has not been called.
2) set the code to be
Code: Select all
    id->sqes = (0xF << 4) | 0x0;
    id->cqes = (0xF << 4) | 0x0;
by default and once the IDENTIFY command has been called, then set the valid values.

Both of these suggestions violate the NVMe spec and would cause QEMU to behave differently from real hardware. Only the error messages are broken.

BenLunt · Post by **BenLunt** » Mon Sep 28, 2020 10:23 pm

Octocontrabass wrote:
BenLunt wrote:QEMU needs to do one of two things:
1) skip the check above if the IDENTIFY command has not been called.
2) set the code to be
Code: Select all
    id->sqes = (0xF << 4) | 0x0;
    id->cqes = (0xF << 4) | 0x0;
by default and once the IDENTIFY command has been called, then set the valid values.
Both of these suggestions violate the NVMe spec and would cause QEMU to behave differently from real hardware. Only the error messages are broken.

Hi Octocontrabass,

Please explain, because the way I see it, the current code requires you to set the CC.IOSQES and CC.IOCQES values to 6 and 4 (respectively) just to enable the controller (set the CC.EN bit).

Where are these numbers coming from? Does the specification state that these are to be used as defaults? I don't see where it does.

One cannot request and receive the IDENTIFY block, which contains these MAX and MIN values, without first enabling the controller.

Therefore, using the current QEMU code, you are required to use 6 and 4 simply to enable the controller to then send the IDENTIFY command to get these values. Chicken and egg.

Am I wrong here? If so, please correct me.

Ben

P.S. I agree that the error messages need to be fixed. It is comparing one set of values and then displaying a second set of values. However, fixing the error messages doesn't fix the problem I describe here.

Octocontrabass · Post by **Octocontrabass** » Mon Sep 28, 2020 10:57 pm

BenLunt wrote:Where are these numbers coming from? Does the specification state that these are to be used as defaults? I don't see where it does.

Those numbers come from page 186 of the current NVMe spec (1.4a), where it's mandatory for all devices to use those values as their minimum supported queue entry sizes. Since those values are mandatory, you don't need to read the Identify Controller block to know what they are.

You do need to read the Identify Controller block to find the maximum queue entry sizes, but I'm not sure why you would want to use queue entries bigger than the minimum size.

Ethin · Post by **Ethin** » Tue Sep 29, 2020 12:14 am

Octocontrabass is right; I had completely forgotten to re-read that part of the specification. It does indeed define the absolute minimums as 6 and 4, respectively, in bits 3:0 of bytes 512 and 513 of Fig. 249 (it says figure 247, but it's actually 249).
Edit: setting bits 23:20 and 19:16 to 6 and 4 does not actually fix this issue. QEMU still fails to initialize the controller, and the trace events are no help. It appears that Linux also sets IOSQES and IOCQES to 6 and 4, so I'm not sure how my code differs from theirs (although mine is significantly less evolved).

BenLunt · Post by **BenLunt** » Tue Sep 29, 2020 5:52 pm

Octocontrabass wrote:
BenLunt wrote:Where are these numbers coming from? Does the specification state that these are to be used as defaults? I don't see where it does.
Those numbers come from page 186 of the current NVMe spec (1.4a), where it's mandatory for all devices to use those values as their minimum supported queue entry sizes. Since those values are mandatory, you don't need to read the Identify Controller block to know what they are.

You do need to read the Identify Controller block to find the maximum queue entry sizes, but I'm not sure why you would want to use queue entries bigger than the minimum size.

Indeed. I guess it is all how you interpret the specification.

After reading all the comments here, I now understand the interpretation that was meant. The minimum values are, by requirement, set at 6 and 4 respectively, but the maximum values can be higher.

Therefore, when initializing (enabling) the controller for the first time, since we have not yet sent the IDENTIFY command, we can set the values to 6 and 4. Then, after receiving the IDENTIFY command's response, we can switch to a new set of values as long as the values we choose are within the range it reports. My interpretation of the specification is that these values can be changed while CC.EN is 1 as long as they are within the range given, i.e., CC.IOCQES and CC.IOSQES are writable while CC.EN is 1.

Ethin wrote:Edit: setting bits 23:20 and 19:16 to 6 and 4 does not actually fix this issue. QEMU still fails to initialize the controller, and the trace events are no help. It appears that Linux also sets IOSQES and IOCQES to 6 and 4, so I'm not sure how my code differs from theirs (although mine is significantly less evolved).

After setting my values to 6 and 4, I get the controller to enable. Here is my set up:

Code: Select all

mem_write_io_regs(addr, SSS_HC_CC, 
    SSS_HC_CC_SET_IOCQES(4) |                // (1 << 4) = 16  Defined minimum
    SSS_HC_CC_SET_IOSQES(6) |                // (1 << 6) = 64  Defined minimum
    SSS_HC_CC_SET_SHN(0) |                   // no shutdown notification
    SSS_HC_CC_SET_AMS(0) |                   // round robin arbitration
    SSS_HC_CC_SET_MPS(0) |                   // 0 = 4096 page size
    SSS_HC_CC_SET_CSS(SSS_HC_CC_CSS_NVM)|    // NVM command set
    SSS_HC_CC_EN);                           // enable the controller

Thanks everyone for your comments.

Similar to Ethin's comment, I had completely missed the two statements where it defined a required minimum of 6 and 4 so my interpretation was off a bit.

Thanks,
Ben

Ethin · Post by **Ethin** » Wed Sep 30, 2020 9:23 am

Idk what I'm doing wrong... It's definitely not working for me. Pushed my latest code just now. Lines 314-326 now (I restructured my NVMe code).
Edit: So I got it to work. Using Ben's setup, I found out that I had reversed IOCQES and IOSQES.
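Since the two fields are easy to swap, here is a small sketch that spells out the bit positions in the CC register: IOSQES is bits 19:16 and IOCQES is bits 23:20. The macro and function names are invented for this example and are not taken from Ben's headers or from QEMU.

```c
#include <stdint.h>

/* Controller Configuration (CC) fields; names are illustrative only. */
#define NVME_CC_EN          (1u << 0)                        /* bit 0      */
#define NVME_CC_CSS(x)      (((uint32_t)(x) & 0x7u) << 4)    /* bits 6:4   */
#define NVME_CC_MPS(x)      (((uint32_t)(x) & 0xFu) << 7)    /* bits 10:7  */
#define NVME_CC_AMS(x)      (((uint32_t)(x) & 0x7u) << 11)   /* bits 13:11 */
#define NVME_CC_SHN(x)      (((uint32_t)(x) & 0x3u) << 14)   /* bits 15:14 */
#define NVME_CC_IOSQES(x)   (((uint32_t)(x) & 0xFu) << 16)   /* bits 19:16 */
#define NVME_CC_IOCQES(x)   (((uint32_t)(x) & 0xFu) << 20)   /* bits 23:20 */

/* Value to write when enabling with the required entry sizes. */
static inline uint32_t nvme_cc_initial(void)
{
    return NVME_CC_IOCQES(4) |   /* 2^4 = 16-byte completion entries */
           NVME_CC_IOSQES(6) |   /* 2^6 = 64-byte submission entries */
           NVME_CC_SHN(0)    |   /* no shutdown notification         */
           NVME_CC_AMS(0)    |   /* round robin arbitration          */
           NVME_CC_MPS(0)    |   /* 4096-byte memory pages           */
           NVME_CC_CSS(0)    |   /* NVM command set                  */
           NVME_CC_EN;
}
```

The correct value comes out as 0x00460001. Putting 6 in bits 23:20 and 4 in bits 19:16 instead gives 0x00640001, which is the reversed case described above.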

BenLunt · Post by **BenLunt** » Wed Oct 07, 2020 8:06 pm

(See update at end of post)

Just out of curiosity, have you got your driver to the point where you can read sectors from the disk?

I have a working driver and can read up to 16 sectors just fine. However, when I try to read more than 16 sectors at a time, it fails.

It fails with the error: (Specs: v1.2, page 65, Figure 30)

02: Invalid Field in Command: An invalid or unsupported field specified in the command parameters.

Therefore I thought it might be the maximum allowed transfer size per transfer. (Specs: v1.2, page 100, Figure 90, Byte 77)

If a command is submitted that exceeds the transfer size, then the command is aborted with a status of Invalid Field in Command.

However, the value at Byte 77 returns 0. (Specs: v1.2, page 100, Figure 90, Byte 77)

A value of 0h indicates no restrictions on transfer size.

I am using a Scatter/Gather list (not PRPs) with a single Data Block Segment Entry since the buffer used is physically contiguous.

Here are my concerns.
1) Sixteen 512-byte sectors is 8192 bytes, exactly two (2) pages of data. Doesn't mean anything at the moment, that I can tell.

2) Does QEMU's emulation actually have a transfer limit, and they just forgot to update Byte 77 in the Identify block?
The check is at: https://github.com/qemu/qemu/blob/maste ... vme.c#L575
I believe that the mdts member is 7 (https://github.com/qemu/qemu/blob/maste ... me.c#L2453) which using the test at Line 575 is well above the 8192+ bytes I am trying to transfer.
Line 18 shows that I can add a parameter to set this value, though my version of QEMU barks at the parameter stating it is not a member of nvme.

(A note if you haven't thought of it already. The value at Byte 77 is:

The value is in units of the minimum memory page size (CAP.MPSMIN) and is reported as a power of two (2^n).

Therefore, the limit is calculated with the minimum page size, NOT the current page size. Therefore, if you use a page size other than the Minimum, remember that this limit is calculated on the Minimum page size, not the current used page size you specify in CC.MPS)

I can't find where QEMU actually sets Byte 77 of the (Controller) Identify block, but I am reading a value of zero from that byte, whereas I believe Line 2453 is setting it to 7 (4096 << 7 = 524,288).
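The Byte 77 arithmetic can be written out as a short sketch. The function names are hypothetical; it assumes the 64-bit CAP register has been read, that CAP.MPSMIN is bits 51:48 with a minimum page size of 2^(12 + MPSMIN) bytes, and it uses a return value of 0 to mean "no limit reported".

```c
#include <stdint.h>

/* Minimum memory page size in bytes, from CAP.MPSMIN (bits 51:48). */
static inline uint64_t nvme_min_page_bytes(uint64_t cap)
{
    unsigned mpsmin = (unsigned)((cap >> 48) & 0xFu);
    return 1ull << (12 + mpsmin);
}

/*
 * Maximum data transfer size in bytes for a given Byte 77 (MDTS) value.
 * Returns 0 when MDTS is 0, meaning the controller reports no limit.
 */
static inline uint64_t nvme_max_transfer_bytes(uint64_t cap, uint8_t mdts)
{
    unsigned mpsmin = (unsigned)((cap >> 48) & 0xFu);

    if (mdts == 0)
        return 0;
    if (12u + mpsmin + mdts >= 64u)
        return UINT64_MAX;           /* would not fit in 64 bits */
    return nvme_min_page_bytes(cap) << mdts;
}
```

With MPSMIN = 0 and a value of 7 this gives 524,288 bytes, which matches the figure above. Note that CC.MPS does not appear anywhere in the calculation.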

Just wondering if anyone has any thoughts about this. I am sure I am missing something simple. Just can't pin-point it at the moment.

Ben

P.S. I guess one thing I need to mention and it has a bit to do with it (though I didn't think of it until just now), I am using the Windows version of QEMU which has a different source listing still reporting version 1.2. I will have to study this code instead.

Update: At a glance, it looks like version 1.2 (of the QEMU code) doesn't support Scatter/Gather, so it was taking my SGL address as the PRP1 address and treating the SGL data length, which sits where PRP2 would be, as an actual address. Since PRP1 points to the first page and PRP2 points to the second page, this accounts for the 8192 bytes it will transfer.
.
.
.
Proof: Patch for version 1.3 states

- adds support for scatter gather lists (SGLs)
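For comparison, here is a rough sketch of how PRP1 and PRP2 would be filled in for a physically contiguous buffer, which also shows why two PRP entries cover at most two pages. The names are made up for this example, and it assumes 4096-byte memory pages (CC.MPS = 0). Anything longer than what two entries can describe needs a PRP list, which this sketch only flags rather than builds.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BYTES 4096ull   /* assumes CC.MPS = 0 */

struct prp_pair {
    uint64_t prp1;        /* start of the buffer (may have a page offset) */
    uint64_t prp2;        /* second page, or 0 when unused                */
    bool     needs_list;  /* true: PRP2 must point to a PRP list instead  */
};

/* phys is the physical address of a physically contiguous buffer. */
static inline struct prp_pair build_prps(uint64_t phys, uint64_t len)
{
    struct prp_pair p = { phys, 0, false };
    uint64_t offset = phys & (PAGE_BYTES - 1);
    uint64_t in_first_page = PAGE_BYTES - offset;

    if (len <= in_first_page)
        return p;                               /* fits in PRP1 alone    */
    if (len - in_first_page <= PAGE_BYTES)
        p.prp2 = (phys - offset) + PAGE_BYTES;  /* spills into one page  */
    else
        p.needs_list = true;                    /* more than two pages   */
    return p;
}
```

A page-aligned read of sixteen 512-byte sectors (8192 bytes) fits exactly in PRP1 plus PRP2; a seventeenth sector pushes it over, and a PRP list becomes necessary.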

Ethin · Post by **Ethin** » Wed Oct 07, 2020 11:21 pm

I haven't gotten there yet. In general I'm going to strive to use PRPs as much as possible; I'm not very good with SGLs, and I'm not exactly sure how to construct one (and more things seem to work with PRPs than SGLs). I'm getting stuck just sending identify. For some reason, my memory allocator goes rogue when my NVMe driver starts. At first I thought that my math was wrong, so I switched it to just allocate a 16 KiB ring buffer that I can reuse over and over (is that a bad idea, by the way?). I'm using 4 KiB pages, so that should equal four memory frames of size 4096, right? Yet the last time I ran my code, my memory allocator allocated more than a thousand frames (actually it was more like 5 thousand and rising). The addresses of those higher frames exceeded the 16 KiB I'd requested too. And I've no idea exactly how to debug it either -- because there's no error, there's no condition... there's not much for me to go on. I've pushed my commit -- I would appreciate some help because I'm at a complete and utter loss.

BenLunt · Post by **BenLunt** » Thu Oct 08, 2020 9:03 pm

Ethin wrote:I haven't gotten there yet. In general I'm going to strive to use PRPs as much as possible; I'm not very good with SGLs, and I'm not exactly sure how to construct one (and more things seem to work with PRPs than SGLs). I'm getting stuck just sending identify. For some reason, my memory allocator goes rogue when my NVMe driver starts. At first I thought that my math was wrong, so I switched it to just allocate a 16 KiB ring buffer that I can reuse over and over (is that a bad idea, by the way?). I'm using 4 KiB pages, so that should equal four memory frames of size 4096, right? Yet the last time I ran my code, my memory allocator allocated more than a thousand frames (actually it was more like 5 thousand and rising). The addresses of those higher frames exceeded the 16 KiB I'd requested too. And I've no idea exactly how to debug it either -- because there's no error, there's no condition... there's not much for me to go on. I've pushed my commit -- I would appreciate some help because I'm at a complete and utter loss.

I guess I don't understand what your issue is.

You need to allocate physically contiguous memory for your Submission and Completion rings. The CAP.MQES value will give you a limit on how many entries each ring may have, though I only use 64 each.

Therefore, since the Submission Ring uses 64-byte entries, 64 of them would occupy a single 4k page. The Completion Ring uses 16-byte entries, 64 of them would occupy less than a single 4k page.
This is the same for the I/O ring(s) as well.

The IDENTIFY blocks (CNS values 0, 1, and 2), all require a single 4k block, no matter the page size you use.

So to keep it simple, you need the following:
1) One 4k block for the Admin Submission Ring
2) One 4k block for the Admin Completion Ring
3) One 4k block for returning IDENTIFY data
4) One 4k block for each I/O Submission Ring
5) One 4k block for each I/O Completion Ring
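The sizes in this list can be checked with a few lines of arithmetic. This is just a sketch with invented helper names, assuming 64-byte submission entries, 16-byte completion entries, and 4096-byte frames.

```c
#include <stddef.h>

#define NVME_SQE_BYTES 64u     /* submission queue entry size */
#define NVME_CQE_BYTES 16u     /* completion queue entry size */
#define FRAME_BYTES    4096u   /* one 4k frame                */

/* Bytes needed for a ring with the given number of entries. */
static inline size_t sq_bytes(size_t entries) { return entries * NVME_SQE_BYTES; }
static inline size_t cq_bytes(size_t entries) { return entries * NVME_CQE_BYTES; }

/* Number of whole 4k frames needed to hold `bytes`, rounding up. */
static inline size_t frames_for(size_t bytes)
{
    return (bytes + FRAME_BYTES - 1) / FRAME_BYTES;
}
```

A 64-entry submission ring is exactly 4096 bytes (one frame), a 64-entry completion ring is 1024 bytes (still one frame), and a 16 KiB buffer is four frames. If an allocator hands out thousands of frames for a request of that size, the arithmetic here is probably not the culprit.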

Since you haven't gotten past the IDENTIFY command yet, you don't need to worry about the I/O rings yet.

From previous posts, you have been able to enable the controller. Did you create your Admin rings before or after enabling the controller? You should have done this before enabling it.

At this point, just after enabling the controller, you should have an empty Admin Submission Ring and Completion Ring.
You can now send the IDENTIFY command.

CDW0 = CID, USE_PRP's, FUSE_NORMAL, OPCODE_IDENTIFY;
NSID = NSID_NONE;
MPTR = NULL;
PRP1 = 4k page aligned pointer to the 4k page of physical memory to store the data
PRP2 = 0
CDW10 = CNS (0, 1, or 2)
CDW11 = 0;
etc = 0;
Insert the Submission into the Admin Submission Queue (Ring) and ring the Admin Doorbell.
Wait for the interrupt
Process the Admin Completion Ring (Using the Phase Bit)
Return

You should now have the 4k data you are looking for.
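As a concrete illustration of the fields listed above, here is a minimal sketch of building the 64-byte IDENTIFY submission entry. The struct and function names are assumptions for this example, not Ben's code. It assumes a little-endian host, uses PRPs (PSDT = 0) with no fused operation, and takes the identify buffer's physical address as a parameter; the NSID is passed in because it depends on which CNS value is requested.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Generic 64-byte submission queue entry layout. */
struct nvme_sqe {
    uint32_t cdw0;    /* opcode 7:0, FUSE 9:8, PSDT 15:14, CID 31:16 */
    uint32_t nsid;
    uint32_t cdw2;
    uint32_t cdw3;
    uint64_t mptr;
    uint64_t prp1;
    uint64_t prp2;
    uint32_t cdw10;
    uint32_t cdw11;
    uint32_t cdw12;
    uint32_t cdw13;
    uint32_t cdw14;
    uint32_t cdw15;
};
static_assert(sizeof(struct nvme_sqe) == 64, "submission entry must be 64 bytes");

#define NVME_ADMIN_OPC_IDENTIFY 0x06u

/* buf_phys must be the physical address of a 4k-aligned, 4k-sized buffer. */
static inline struct nvme_sqe nvme_build_identify(uint16_t cid, uint32_t nsid,
                                                  uint8_t cns, uint64_t buf_phys)
{
    struct nvme_sqe sqe;

    assert((buf_phys & 0xFFFu) == 0);
    memset(&sqe, 0, sizeof sqe);           /* MPTR, PRP2, CDW11.. stay 0 */
    sqe.cdw0  = NVME_ADMIN_OPC_IDENTIFY | ((uint32_t)cid << 16);
    sqe.nsid  = nsid;
    sqe.prp1  = buf_phys;
    sqe.cdw10 = cns;                       /* CNS in the low byte of CDW10 */
    return sqe;
}
```

The returned entry would then be copied into the next free slot of the Admin Submission Ring before ringing the doorbell; that part depends on the driver's own ring bookkeeping and is left out here.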

Does this help?
Ben

Ethin · Post by **Ethin** » Thu Oct 08, 2020 11:20 pm

Yes, and that shows me what I need to do. However, I can't even queue the command. As I said, my memory allocation routine goes rogue when I ask it to allocate the buffer for the PRP. And yes, I enable the controller after allocating queues.
