
Re: Bug while creating the first i/o completion queue (NVMe over PCIe)

Posted: Fri Aug 09, 2024 9:14 am
by osdev199
Hello Octocontrabass, I do not want to try to make it run on QEMU, since I am writing my Core (think of it as a kernel) for my real machine, and I suspect that fixing the bug on QEMU might still not make it run on the real machine, as has happened to me before. I don't want to waste time on QEMU; instead I want to keep the fix as simple as possible (and, obviously, learn to handle such future errors directly, I guess :) ). My intuition says that the fix is very, very simple, so rather than getting it running on QEMU I just want to find the silly mistake that is hiding right in front of my eyes. Plus, I think making it run on the real machine will help me identify issues without using any debugger (I really don't know whether using gdb and other debuggers is a bad thing; I'm just following Linus Torvalds' advice and trying to keep everything on the real machine in order to be a real Core developer).

From my past experience with the error where the admin completion queue never got any response from the controller, you had suggested on Reddit that I write a real memory allocator. My intuition was that the error was very, very simple and that I should not waste time writing the memory allocator. It may sound silly, but I spent two months waiting to disprove my intuition and politely declining all of your suggestions in order to get to where the error actually was. When I finally got fed up, I decided to rewrite the driver from "scratch" by following the specs and keeping track of every minute detail, and guess what: it worked. That approach also helped me learn about the NVMe driver itself, of which I only had a rough idea from the assembly code I had been translating to C, and it satisfied my desire to know what I was doing in those translations somewhere along the journey of writing the Core (again, think of it as a kernel). I would have let that desire go if it were not so important to me (that was the intuition with which I began writing (copying and translating) the NVMe driver from BareMetal OS's GitHub repo), but when this admin completion queue error occurred, I intuitively settled on this "scratch" thing: knowing by hand and by myself (reading specs, searching for information) what each and every instruction was doing in my own new code. To be honest, it was an hour of work and the error was gone. The approach worked! It was that simple... :)

Hence, I'm still following this intuition of mine to solve the above bug. I'm just thinking about how I can minimize the "thinking" time (please read: thinking intuitively) as far as I can :)

Thank you.

Re: Bug while creating the first i/o completion queue (NVMe over PCIe)

Posted: Fri Aug 09, 2024 10:03 am
by osdev199
Thanks Octocontrabass, your suggestion worked on my real machine :)

I checked the CQES field in the Identify Controller structure and it reported 16-byte entries (both the required and the maximum size, i.e. 2^4). Then I set CC.IOCQES to 4 (16-byte completion queue entries), and finally cdw10 to 0xf0001 (queue size = 16 entries, qid = 1), and it worked!
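In case it helps anyone else, here is a minimal C sketch of how those pieces fit together. It assumes 16-byte CQ entries (CQES = 4) as reported above; the struct layout, helper names, and parameters (struct nvme_cmd, create_io_cq, int_vec, etc.) are placeholders for whatever your own driver uses, not code from this thread.

Code: Select all

#include <stdint.h>

/* Sketch only: field names follow the NVMe base spec, but the struct
 * layout and helpers are hypothetical. */
struct nvme_cmd { uint8_t opc; uint16_t cid; uint32_t nsid;
                  uint64_t prp1, prp2; uint32_t cdw10, cdw11; };

void create_io_cq(volatile uint32_t *cc_reg, struct nvme_cmd *cmd,
                  uint64_t cq_phys, uint16_t qid, uint16_t entries,
                  uint16_t int_vec /* 0 for pin-based INTx, or MSI-X table index */)
{
    /* CC.IOCQES (bits 23:20) holds log2 of the CQ entry size: 4 -> 16 bytes,
     * matching the required/maximum CQES reported by Identify Controller. */
    *cc_reg = (*cc_reg & ~(0xFu << 20)) | (4u << 20);

    /* Create I/O Completion Queue (admin opcode 0x05). */
    cmd->opc   = 0x05;
    cmd->prp1  = cq_phys;                               /* contiguous queue memory   */
    cmd->cdw10 = ((uint32_t)(entries - 1) << 16) | qid; /* QSIZE is 0's based        */
    cmd->cdw11 = ((uint32_t)int_vec << 16)              /* IV: interrupt vector      */
               | (1u << 1)                              /* IEN: interrupts enabled   */
               | (1u << 0);                             /* PC: physically contiguous */
}
With entries = 16 and qid = 1 this produces exactly the cdw10 = 0xf0001 mentioned above.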

Re: Bug while creating the first i/o completion queue (NVMe over PCIe)

Posted: Wed Aug 14, 2024 12:51 am
by mleks
Octocontrabass wrote: Thu Aug 08, 2024 8:21 pm
mleks wrote: Thu Aug 08, 2024 5:27 am sc=0x08
Invalid Interrupt Vector. You're using pin-based interrupts, so the interrupt vector needs to be 0. If you switch to MSI or MSI-X, the interrupt vector is the index in the list of MSI(-X) interrupts you've configured on the NVMe controller. The interrupt vector you need to use here is not the same as a CPU interrupt vector.
Yes, the VEC_NO was due to my misunderstanding of the mechanism at that time. I've started working on MSI-X and am almost at the end of that process. However, now I am getting sc=0x1, sct=0x1 for the first CQ (cqid=1), and I truly do not know why.
Regarding your comment, are you suggesting that we should use the Message Table Index for MSI-X? If so, it could also be 0, 1, 2... up to 2047. Right?

EDITED: Regarding my incorrect qid (sc=0x1, sct=0x1): if the Get Features (opc=0xa, fid=0x7) command returns ncqr=63, nsqr=63 for sel=0b000 (current), it means that 'something' has already created 63 queues (I use this QEMU device configuration:

Code: Select all

"-device pcie-root-port,id=pcie_port0,multifunction=on,bus=pcie.0,addr=0x10 -device nvme,drive=drv0,serial=1,bus=pcie_port0,use-intel-id=on"
So, let me start by deleting them and recreating them.

EDITED2: OK, I can't delete the submission queues, nor any of the completion queues (sc=0, sct=1). I am also unable to use Set Features (fid=0x7) to configure my ncqr/nsqr, because the result always shows 63/63 (ncqa/nsqa) even though the command succeeds. I'm lost. QEMU's max_ioqpairs can be set to 1, but then Get Features fid=0x7 yields 0/0 and it's possible to create a CQ but not an SQ.

FINAL SCENE: Everything is fine. There's no need to delete anything or adjust settings in QEMU. I was trying to create queues with the same ID for each Command Set Vector :/ Yep, it took a while ;) BTW, the question about the vector number for MSI-X is still valid.

Re: Bug while creating the first i/o completion queue (NVMe over PCIe)

Posted: Wed Aug 14, 2024 8:38 pm
by Octocontrabass
mleks wrote: Wed Aug 14, 2024 12:51 am Regarding your comment, are you suggesting that we should use the Message Table Index for MSI-X? If so, it could also be 0, 1, 2... up to 2047. Right?
Right.
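For illustration, here is a rough sketch of that relationship, assuming an x86 machine; the MSI-X table mapping and all helper/parameter names are made up, and only the bit layouts come from the PCI and NVMe specs.

Code: Select all

#include <stdint.h>

/* One MSI-X table entry (16 bytes, PCI spec layout). */
struct msix_entry {
    uint32_t addr_lo;
    uint32_t addr_hi;
    uint32_t data;
    uint32_t vector_ctrl;   /* bit 0 = mask */
};

/* Program MSI-X table entry 'idx' (the table lives in the BAR named by the
 * MSI-X capability's BIR/offset), then use the same 'idx' as the IV field
 * of the Create I/O Completion Queue command. */
void route_cq_interrupt(volatile struct msix_entry *msix_table,
                        uint16_t idx, uint8_t cpu_vector,
                        uint32_t *cdw11_out)
{
    msix_table[idx].addr_lo     = 0xFEE00000u;  /* x86 LAPIC message address   */
    msix_table[idx].addr_hi     = 0;
    msix_table[idx].data        = cpu_vector;   /* IDT vector the CPU will see */
    msix_table[idx].vector_ctrl = 0;            /* unmask this entry           */

    /* The NVMe "interrupt vector" is this table index (0..2047),
     * not the CPU interrupt vector. */
    *cdw11_out = ((uint32_t)idx << 16) | (1u << 1) | (1u << 0);  /* IV | IEN | PC */
}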
mleks wrote: Wed Aug 14, 2024 12:51 am I am unable to use Set Features (fid=0x7) to configure my ncqr/nsqr, because the result always shows 63/63 (ncqa/nsqa) even though the command succeeds.
It's a performance hint. As long as the command is valid, it will always succeed, but it might not change anything.
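To illustrate, here's a sketch of the Number of Queues feature exchange; submit_admin_cmd and the command struct are placeholders for your own admin queue path, and the bit positions come from the spec's Set Features description.

Code: Select all

#include <stdint.h>

struct nvme_cmd { uint8_t opc; uint32_t cdw10, cdw11; };

/* Placeholder: submits an admin command and returns completion dword 0. */
uint32_t submit_admin_cmd(struct nvme_cmd *cmd);

/* Set Features, FID 0x7 (Number of Queues). The requested counts (NSQR/NCQR)
 * are 0's based hints; the completion reports what was actually allocated
 * (NSQA/NCQA), which may differ from what you asked for. */
void request_io_queues(uint16_t want_sq, uint16_t want_cq,
                       uint16_t *nsqa, uint16_t *ncqa)
{
    struct nvme_cmd cmd = {0};
    cmd.opc   = 0x09;                              /* Set Features          */
    cmd.cdw10 = 0x07;                              /* FID: Number of Queues */
    cmd.cdw11 = ((uint32_t)(want_cq - 1) << 16)    /* NCQR, 0's based       */
              | (uint16_t)(want_sq - 1);           /* NSQR, 0's based       */

    /* Completion dword 0: NCQA in bits 31:16, NSQA in bits 15:0 (0's based). */
    uint32_t dw0 = submit_admin_cmd(&cmd);
    *ncqa = (uint16_t)(dw0 >> 16) + 1;
    *nsqa = (uint16_t)(dw0 & 0xFFFF) + 1;
}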