Bare-metal ARM + qemu - issue with floating point division

ajxs · Post by **ajxs** » Sun Jan 07, 2018 6:10 am

My apologies in advance that this is a cross-post from Stackoverflow, but my problem isn't getting much attention there so I'll see if anyone here has any insight!
If anyone would rather earn points on SO, the original question is here.

I'm currently teaching myself bare-metal ARM kernel development. I've settled on the Raspberry Pi 2 as a target platform because it is well documented, and I'm currently emulating the device using qemu.
In a function called by my kernel I'm required to divide a numerical constant by a function argument and store the result as a floating point number for future calculations. Calling this function causes qemu to go off the rails. Here's the function itself ( setting PL011 baud rate ):

Code:

void pl011_set_baud_rate(pl011_uart_t *uart, uint32_t baud_rate) {
    float divider = PL011_UART_CLOCK / (16.0f * baud_rate);
    uint16_t integer_divider = (uint16_t)divider;
    uint8_t fractional_divider = ((divider - integer_divider) * 64) + 0.5;
    mmio_write(uart->IBRD, integer_divider);        // Integer baud rate divider
    mmio_write(uart->FBRD, fractional_divider);     // Fractional baud rate divider
}

( Forgive me for saying "go off the rails", but it's actually a little hard to see what qemu is doing; it lands in a non-functional state. Checking where 'pc' currently is in gdb shows that it's at address 0x8, which is my temporary 'halt' routine for when the kernel 'main' returns. If this is accurate, it means that it has broken out of the function, unwound the whole call stack and returned to the bootstrap code. )

I'd post a minimal verifiable example, but just about anything will trigger the issue. If you even use:

Code:

void test(uint32_t test_var) {
    float test_div = test_var / 16;
    (void)test_div;    // squash [-Wunused-variable] warnings
    // goes off the rails here
}

You'll get the same result.

Stepping through the function in gdb, stepping past "float divider..." causes qemu to jump out of the function and head straight to the halt loop in my bootloader code ( for when the kernel main returns ).
Checking "info args" in gdb shows the correct arguments. Checking "info locals" will show the correct value for float divider. Checking "info stack" shows the correct stack trace and arguments. Initially I suspected 'sp' might be in the wrong place, but that doesn't check out since the stack trace looks normal enough.

Code:

(gdb) info stack
#0  pl011_set_baud_rate (uart=0x3f201000, baud_rate=115200) at kernel/uart/pl011.c:23
#1  0x0000837c in pl011_init (uart=0x3f201000) at kernel/uart/pl011.c:49
#2  0x0000806c in uart_init () at kernel/uart/uart.c:12
#3  0x00008030 in kernel_init (r0=0, r1=0, atags=0) at kernel/boot/start.c:10
#4  0x00008008 in _start () at kernel/boot/boot.S:6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)

Here's the register dump from right before the line that causes the unpredictable behavior:

Code:

r0             0x3f201000       1059065856
r1             0x1c200  115200
r2             0x7ff    2047
r3             0x0      0
r4             0x0      0
r5             0x0      0
r6             0x0      0
r7             0x0      0
r8             0x0      0
r9             0x0      0
r10            0x0      0
r11            0x7fcc   32716
r12            0x0      0
sp             0x7fb0   0x7fb0
lr             0x837c   33660
pc             0x8248   0x8248 <pl011_set_baud_rate+20>
cpsr           0x600001d3       1610613203

My Makefile is:

Code:

INCLUDES=include
INCLUDE_PARAMS=$(foreach d, $(INCLUDES), -I$d)

CC=arm-none-eabi-gcc

C_SOURCES:=kernel/boot/start.c kernel/uart/uart.c kernel/uart/pl011.c
AS_SOURCES:=kernel/boot/boot.S

SOURCES=$(C_SOURCES)
SOURCES+=$(AS_SOURCES)

OBJECTS=
OBJECTS+=$(C_SOURCES:.c=.o)
OBJECTS+=$(AS_SOURCES:.S=.o)


CFLAGS=-std=gnu99 -Wall -Wextra -fpic -ffreestanding -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard
LDFLAGS=-ffreestanding -nostdlib

LIBS=-lgcc

DEBUG_FLAGS=

BINARY=kernel.bin

.PHONY: all clean debug

all: $(BINARY)

debug: DEBUG_FLAGS += -ggdb
debug: $(BINARY)

$(BINARY): $(OBJECTS)
    $(CC) -T linker.ld $(LDFLAGS) $(OBJECTS) $(LIBS) -o $(BINARY)

%.o: %.c
    $(CC) $(INCLUDE_PARAMS) $(CFLAGS) $(DEBUG_FLAGS) -c $< -o $@

%.o: %.S
    $(CC) $(INCLUDE_PARAMS) $(CFLAGS) $(DEBUG_FLAGS) -c $< -o $@

clean:
    rm $(BINARY) $(OBJECTS)

As you can see, I'm linking against -lgcc and using -mfpu=neon-vfpv4 -mfloat-abi=hard, so at the very least gcc should supply its own floating point division functions from libgcc.
Can anyone point me in the right direction for debugging this issue? I suspect I'm either using incorrect compiler arguments and not loading the correct function for floating-point division, or there's some issue with the stack.

Can anyone shed any insight here?
Even if someone could point me in the right direction for how to debug the issue, or where to look I'd really appreciate it! Thank you.

bluemoon · Post by **bluemoon** » Sun Jan 07, 2018 6:19 am

Post an assembly dump of the offending function so we can see what is actually being called. Something from objdump is good enough.

ajxs · Post by **ajxs** » Sun Jan 07, 2018 6:25 am

Here's the full assembly dump.
I have to admit, I'm new to ARM assembly, and the asm that gcc has generated is a little beyond me right now. I can't get much insight from this just yet. I've disabled compiler optimization right now so I can get more predictable results. I found that using '-O2' caused different behavior entirely, which I suspect is due to the compiler "inlining" the function, for the lack of a better word, or substituting constants for args.

Code:

.global	pl011_set_baud_rate
	.syntax unified
	.arm
	.fpu neon-vfpv4
	.type	pl011_set_baud_rate, %function
pl011_set_baud_rate:
	@ args = 0, pretend = 0, frame = 24
	@ frame_needed = 1, uses_anonymous_args = 0
	push	{fp, lr}
	add	fp, sp, #4
	sub	sp, sp, #24
	str	r0, [fp, #-16]
	str	r1, [fp, #-20]
	ldr	r3, [fp, #-20]
	vmov	s15, r3	@ int
	vcvt.f32.u32	s15, s15
	vmov.f32	s14, #1.6e+1
	vmul.f32	s14, s15, s14
	vldr.32	s13, .L11
	vdiv.f32	s15, s13, s14
	vstr.32	s15, [fp, #-8]
	vldr.32	s15, [fp, #-8]
	vcvt.u32.f32	s15, s15
	vmov	r3, s15	@ int
	strh	r3, [fp, #-10]	@ movhi
	ldrh	r3, [fp, #-10]
	vmov	s15, r3	@ int
	vcvt.f32.s32	s15, s15
	vldr.32	s14, [fp, #-8]
	vsub.f32	s15, s14, s15
	vldr.32	s14, .L11+4
	vmul.f32	s15, s15, s14
	vcvt.f64.f32	d16, s15
	vmov.f64	d17, #5.0e-1
	vadd.f64	d7, d16, d17
	vcvt.u32.f64	s15, d7
	vstr.32	s15, [fp, #-24]	@ int
	ldrb	r3, [fp, #-24]
	strb	r3, [fp, #-11]
	ldr	r3, [fp, #-16]
	ldr	r3, [r3, #36]
	ldrh	r2, [fp, #-10]
	mov	r1, r2
	mov	r0, r3
	bl	mmio_write(PLT)
	ldr	r3, [fp, #-16]
	ldr	r3, [r3, #40]
	ldrb	r2, [fp, #-11]	@ zero_extendqisi2
	mov	r1, r2
	mov	r0, r3
	bl	mmio_write(PLT)
	nop
	sub	sp, fp, #4
	@ sp needed
	pop	{fp, pc}
.L12:
	.align	2
.L11:
	.word	1245125376
	.word	1115684864
	.size	pl011_set_baud_rate, .-pl011_set_baud_rate

I know that posting this asm dump is pretty brutal, but I'm hoping there's something obvious that will just jump out at someone a little more knowledgeable in this area. Something that I'm missing due to my lack of experience with ARM.
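For anyone reading the dump: the two `.word` values at `.L11` are single-precision constants stored as raw bit patterns, not integers. A minimal host-side sketch for decoding them follows; the helper name `f32_from_bits` is made up for illustration and is not part of the original code.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as an IEEE-754 single-precision float.
 * Hypothetical helper, intended to be run on a development host. */
static float f32_from_bits(uint32_t bits)
{
    float value;
    /* memcpy is the portable way to type-pun without aliasing issues */
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

Decoded this way, 1245125376 (0x4A371B00) is 3000000.0f and 1115684864 (0x42800000) is 64.0f. That suggests PL011_UART_CLOCK is 3 MHz in this build and that the second constant is the `* 64` from the fractional divider line. Note also that the listing contains no `bl` to a division helper: with -mfloat-abi=hard the division is a plain `vdiv.f32` instruction.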

xenos · Post by **xenos** » Sun Jan 07, 2018 1:08 pm

I have not used the FPU on a Raspberry Pi / QEMU so far (and it's not really needed to configure the baud rate), but I would assume that the FPU must be initialized before it can be used, and that the CPU simply throws an exception when you use FPU instructions without proper initialization.
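To illustrate the point that the FPU is not needed for the baud rate, here is a rough integer-only sketch of the same calculation. The type and function names are hypothetical, and it assumes a non-zero baud rate and a clock small enough that `uart_clock_hz * 4` fits in 32 bits.

```c
#include <stdint.h>

/* Hypothetical helper type: the two PL011 divisor register values. */
typedef struct {
    uint16_t ibrd;  /* integer part (IBRD) */
    uint8_t  fbrd;  /* 6-bit fractional part (FBRD) */
} pl011_divisors_t;

/* BAUDDIV = clock / (16 * baud). Working in units of 1/64 turns this
 * into clock * 4 / baud; adding baud / 2 first rounds to nearest.
 * Assumes baud_rate != 0 and uart_clock_hz * 4 does not overflow. */
static pl011_divisors_t pl011_divisors(uint32_t uart_clock_hz, uint32_t baud_rate)
{
    uint32_t div64 = (uart_clock_hz * 4u + baud_rate / 2u) / baud_rate;
    pl011_divisors_t d;

    d.ibrd = (uint16_t)(div64 >> 6);     /* upper bits: integer divider */
    d.fbrd = (uint8_t)(div64 & 0x3Fu);   /* low 6 bits: fractional divider */
    return d;
}
```

For a 3 MHz clock and 115200 baud this gives IBRD = 1 and FBRD = 40, the same values the floating point version produces. One difference: a fraction that rounds up to 64 carries into the integer part here instead of overflowing the 6-bit field.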

zaval · Post by **zaval** » Sun Jan 07, 2018 1:29 pm

The first question that arises: why are you using the FPU in the kernel? For baud rate calculations?

It may have started out that way, but it's better to stop doing this. It's wrong. The second point:

From reset, the Cortex-A7 FPU is disabled. Any attempt to execute a VFP instruction results in
an Undefined Instruction exception being taken. To enable software to access VFP features
ensure that:
<RTFM>

PS. I was late. won't remove it though, for the higher persuasiveness.

zaval · Post by **zaval** » Sun Jan 07, 2018 1:49 pm

OSwhatever wrote:
0x8 is normally the interrupt vector "prefetch abort". This usually means that the CPU cannot fetch the instruction from memory. "Prefetch aborts" are used extensively when you use demand paging in order to page in code.

Well, offset 8 is "hypervisor call", not prefetch abort, but this is hardly relevant, since these are offsets from the base, not full addresses.

Who knows what qemu does when an undefined instruction exception occurs? I don't. On real machines, for example the BeagleBone Black, the ROM code, when it is in charge of handling exceptions, falls into a so-called "dead loop" until the watchdog resets the board.

But for sure, the author is trying to use something that isn't initialized.

OSwhatever · Post by **OSwhatever** » Sun Jan 07, 2018 1:53 pm

zaval wrote:
0x8 is normally the interrupt vector "prefetch abort". This usually means that the CPU cannot fetch the instruction from memory. "Prefetch aborts" are used extensively when you use demand paging in order to page in code.
Well, offset 8 is "hypervisor call", not prefetch abort, but this is hardly relevant, since these are offsets from the base, not full addresses.

Who knows what qemu does when an undefined instruction exception occurs? I don't. On real machines, for example the BeagleBone Black, the ROM code, when it is in charge of handling exceptions, falls into a so-called "dead loop" until the watchdog resets the board.

But for sure, the author is trying to use something that isn't initialized.

No, I was wrong and calculated from the wrong offset. 0x8 is really "undefined instruction abort"; prefetch is 0xc. Everything adds up, and even the interrupt was right.

zaval · Post by **zaval** » Sun Jan 07, 2018 1:58 pm

^ just for clarity, again - 8 is hyper/super/puper visor call, system service/call in short. Undefined instruction is 4. But this doesn't matter here!

ajxs · Post by **ajxs** » Sun Jan 07, 2018 2:24 pm

XenOS wrote: I have not used the FPU on a Raspberry Pi / QEMU so far (and it's not really needed to configure the baud rate), but I would assume that the FPU must be initialized before it can be used, and that the CPU simply throws an exception when you use FPU instructions without proper initialization.

Thanks for pointing this out! I wasn't aware that you needed to initialize the FPU prior to use. RTFM-ing aside, I guess you learn not to take basic things for granted with bare-metal development.
I also learned something about the undefined instruction exception along the way. That was a misunderstanding on my part regarding ARM.
Regarding XenOS's comment about not needing floating point calculations here: this is true. For the sake of being methodical, I was just following the PL011 TRM's formula for calculating the baud rate dividers to the letter until everything else was in place. I just got stuck there and never progressed any further.
Initialising the FPU has solved the issue!
Thanks everyone for taking the time to help!
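For later readers, the thread does not show the initialization code that was used, so the following is only a sketch of a commonly used VFP enable sequence on ARMv7-A. The function and macro names are hypothetical. It assumes the core is in a privileged mode that is allowed to write CPACR, and that the file is built with an -mfpu setting so the assembler accepts `vmsr`.

```c
#include <stdint.h>

#define CPACR_CP10_CP11_FULL (0xFu << 20) /* full access to coprocessors 10 and 11 */
#define FPEXC_EN             (1u << 30)   /* FPEXC.EN: global VFP/NEON enable bit */

/* Pure helper so the bit manipulation can be checked on any host. */
static inline uint32_t cpacr_with_vfp_access(uint32_t cpacr)
{
    return cpacr | CPACR_CP10_CP11_FULL;
}

#if defined(__arm__)
/* Hypothetical name. Call once, before any code that uses floats. */
void fpu_enable(void)
{
    uint32_t cpacr;

    __asm__ volatile("mrc p15, 0, %0, c1, c0, 2" : "=r"(cpacr));  /* read CPACR */
    cpacr = cpacr_with_vfp_access(cpacr);
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 2" : : "r"(cpacr)); /* write CPACR */
    __asm__ volatile("isb" : : : "memory");                       /* make the new access rights visible */
    __asm__ volatile("vmsr fpexc, %0" : : "r"(FPEXC_EN));         /* set FPEXC.EN */
}
#endif
```

Because the kernel here is compiled with -mfloat-abi=hard, the call has to happen before any function that touches VFP registers, so doing the equivalent in boot.S before jumping to C is probably the safer place.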

OSwhatever · Post by **OSwhatever** » Sun Jan 07, 2018 3:51 pm

zaval wrote:^ just for clarity, again - 8 is hyper/super/puper visor call, system service/call in short. Undefined instruction is 4. But this doesn't matter here!

Yes, it is 4, I messed up again with numbers. That's twice this day.
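Since the offsets caused some confusion above, here is a short summary of the ARMv7-A exception vector layout as a C sketch. The enum and function names are made up for illustration, and the offsets are relative to whatever vector base is in use.

```c
#include <stdint.h>

/* Offsets from the vector base address (ARMv7-A, ordinary non-Hyp table). */
enum arm_vector_offset {
    VEC_RESET           = 0x00,
    VEC_UNDEFINED       = 0x04,  /* undefined instruction, e.g. VFP used while disabled */
    VEC_SUPERVISOR_CALL = 0x08,  /* SVC; the Hyp-mode table has Hypervisor Call here */
    VEC_PREFETCH_ABORT  = 0x0C,
    VEC_DATA_ABORT      = 0x10,
    VEC_RESERVED        = 0x14,  /* Hyp Trap in the Hyp-mode table */
    VEC_IRQ             = 0x18,
    VEC_FIQ             = 0x1C
};

/* Returns the vector offset that pc corresponds to, or -1 if pc is not
 * a word-aligned address inside the 8-entry table at vector_base. */
static int vector_offset_of(uint32_t vector_base, uint32_t pc)
{
    uint32_t offset = pc - vector_base;  /* wraps to a large value if pc < base */

    if (offset > VEC_FIQ || (offset & 3u) != 0u)
        return -1;
    return (int)offset;
}
```

With a vector base of 0, `vector_offset_of(0, 0x4)` is the undefined instruction entry and `vector_offset_of(0, 0x8)` is the supervisor call entry, which matches the final correction in the thread.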
