Funny (NOT!) piece of C strangeness.
Posted: Tue Sep 15, 2009 8:34 am
Have some patience, this will take a while as I prepare the ground, but I am sure the results will be as surprising for you as they were for me.
The definition of the not-too-exotic function unsigned long int strtoul( const char * restrict nptr, char ** restrict endptr, int base ), as given by the standard (skip the quote if you like, I'll paraphrase later):
Quote:
[...] If the value of base is between 2 and 36 (inclusive), the expected form of the subject sequence is a sequence of letters and digits representing an integer with the radix specified by base, optionally preceded by a plus or minus sign, but not including an integer suffix. [...] If the value of base is 16, the characters 0x or 0X may optionally precede the sequence of letters and digits, following the sign if present.
The subject sequence is defined as the longest initial subsequence of the input string, starting with the first non-white-space character, that is of the expected form. The subject sequence contains no characters if the input string is empty or consists entirely of white space, or if the first non-white-space character is other than a sign or a permissible letter or digit.
If the subject sequence has the expected form and the value of base is between 2 and 36, it is used as the base for conversion, ascribing to each letter its value as given above. [...] A pointer to the final string is stored in the object pointed to by endptr, provided that endptr is not a null pointer.
[...]
If the subject sequence is empty or does not have the expected form, no conversion is performed; the value of nptr is stored in the object pointed to by endptr, provided that endptr is not a null pointer.
The strtol, strtoll, strtoul, and strtoull functions return the converted value, if any. If no conversion could be performed, zero is returned.
In layman's terms, given a call strtoul( string, &endptr, 16 ), string is parsed as a hexadecimal number (optionally prefixed "0x") until the first non-hexadecimal character is encountered. The return value is the value parsed, and endptr will point to the first unparsed character.
Nice.
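To make the paraphrase concrete, here is a minimal sketch of the uncontroversial case, where the prefix is followed by real hex digits. The helper name parse_hex() is my own invention for illustration and is not part of the standard or of PDCLib.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (illustration only): parses text as hexadecimal with
   strtoul() and reports how many characters were consumed. */
static unsigned long parse_hex( const char * text, size_t * consumed )
{
    char * endptr = NULL;
    unsigned long value;
    assert( text != NULL && consumed != NULL );
    value = strtoul( text, &endptr, 16 );
    /* endptr points to the first character that was not part of the number */
    *consumed = (size_t)( endptr - text );
    return value;
}
```

For "0x1Fz" this should give the value 31 with four characters consumed, and for "zz" the value 0 with nothing consumed, since no conversion is performed.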
Now, the definition of fscanf(), more precisely the %x conversion specifier, emphasis mine:
Quote:
Matches an optionally signed hexadecimal integer, whose format is the same as expected for the subject sequence of the strtoul function with the value 16 for the base argument. The corresponding argument shall be a pointer to unsigned integer.
Easy enough.
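For comparison, the same well-formed input run through %x might look like the sketch below. scan_hex() is a hypothetical wrapper of mine, nothing more. The point worth noting is that sscanf() returns the number of items assigned (%n does not count), so the return value tells you whether %x matched at all.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (illustration only): scans text with "%x%n%c".
   Returns sscanf()'s result, i.e. the number of assigned items. */
static int scan_hex( const char * text, unsigned int * value, int * consumed, char * next )
{
    assert( text != NULL && value != NULL && consumed != NULL && next != NULL );
    /* %x stores the number, %n the count of characters read so far,
       %c the next character after the number */
    return sscanf( text, "%x%n%c", value, consumed, next );
}
```

With "0x1Fz" I would expect a return value of 2, the value 31, four characters consumed and 'z' as the next character. With "zz" the return value is 0 and none of the output arguments are written.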
One of my references for working on PDCLib is P.J. Plauger's book "The Standard C Library". In there, Plauger notes that, because ungetc() is only guaranteed to push one read character back into the input stream, scanf() (and brethren) might not give exactly the same results as strtol() (and brethren).
That got me curious, and I wanted to find out what exactly those differences might be, so that PDCLib could handle them gracefully.
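As I understand the pushback limitation, it can be sketched like this: a scanner working on a stream has to read '0', 'x' and one more character before it knows whether a digit follows, and by then only that last character can be portably un-read. scan_hex_prefix() is a made-up helper for illustration, not how any particular library does it.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (illustration only): consumes an optional "0x" / "0X"
   prefix from stream, using one character of lookahead, and returns the
   number of characters that stay consumed. */
static int scan_hex_prefix( FILE * stream )
{
    int c;
    assert( stream != NULL );
    c = getc( stream );
    if ( c != '0' )
    {
        if ( c != EOF ) ungetc( c, stream );
        return 0;
    }
    c = getc( stream );
    if ( c != 'x' && c != 'X' )
    {
        if ( c != EOF ) ungetc( c, stream );
        return 1;
    }
    /* Look ahead: does a hex digit follow the prefix? */
    c = getc( stream );
    if ( c != EOF ) ungetc( c, stream ); /* the one guaranteed pushback */
    /* Whatever c was, the 'x' read before it cannot be portably pushed back
       as well, so both '0' and 'x' remain consumed. */
    return 2;
}
```

Fed a stream containing "0xz", this returns 2 and the next getc() yields 'z'. A string-based function like strtoul() has no such problem, because it can simply move its pointer back as far as it likes.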
After some testing, I came up with this, quite simple I think, test program:
Code:
#include <stdio.h>
#include <stdlib.h>

int main( void )
{
    const char * string = "0xz"; // valid hex prefix, followed by invalid digit
    int i = -1;                  // result value, initialized to impossible value
    int count = -1;              // count of sscanf()-parsed characters, likewise initialized
    char c;                      // holds first sscanf()-unparsed character; stays uninitialized if sscanf() stops early
    char * endptr = NULL;        // points to first strtoul()-unparsed character

    // scan string with sscanf(), putting result in i, characters parsed in 'count',
    // and the next character to be parsed in 'c'.
    // %x expects an unsigned int *, hence the cast.
    sscanf( string, "%x%n%c", (unsigned int *)&i, &count, &c );
    // print results
    printf( "sscanf(): Value %d - Consumed %d - Next char %c\n", i, count, c );

    // scan string with strtoul(), putting result in i, and pointer to next
    // character to be parsed in 'endptr'.
    i = (int)strtoul( string, &endptr, 16 );
    // print results
    printf( "strtoul(): Value %d - Consumed %d - Next char %c\n", i, (int)( endptr - string ), *endptr );
    return 0;
}
Now think for a minute what kind of output you would expect. After all, the specs are clear, no?
Here is what I got.
Cygwin:
Code:
sscanf(): Value 0 - Consumed 1 - Next char x
strtoul(): Value 0 - Consumed 0 - Next char 0
Gentoo:
Code:
sscanf(): Value 0 - Consumed 2 - Next char z
strtoul(): Value 0 - Consumed 1 - Next char x
MSVC 2005:
Code:
sscanf(): Value -1 - Consumed -1 - Next char ╠
strtoul(): Value 0 - Consumed 0 - Next char 0
Isn't that fun?
Right now I feel like hopping around and blowing soap bubbles. Obviously, it is very easy to get numerical parsing into undefined country without actually doing anything wrong. The only halfway sane result is from MSVC (both parsings failing completely), which is something to mark in red on the calendar in any case...
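If you want to probe your own library for this kind of disagreement, a check along the following lines might do. This is only a sketch: parsers_agree() is a name I made up, and it assumes the value fits into an unsigned int.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (illustration only): returns 1 if sscanf( "%x" ) and
   strtoul( ..., 16 ) agree on value and consumed length for text, else 0. */
static int parsers_agree( const char * text )
{
    unsigned int scanned = 0;
    int scan_count = 0;
    char * endptr = NULL;
    unsigned long converted;
    int matched;
    assert( text != NULL );
    matched = sscanf( text, "%x%n", &scanned, &scan_count );
    converted = strtoul( text, &endptr, 16 );
    if ( matched != 1 )
    {
        /* sscanf() matched nothing, so agreement means that strtoul()
           performed no conversion either */
        return endptr == text;
    }
    return converted == scanned && ( endptr - text ) == scan_count;
}
```

For well-formed input such as "0x1Fz" this should return 1 everywhere. Going by the outputs above, I would expect it to return 0 for "0xz" on the Cygwin and Gentoo installations I tested, but I have not run this exact code there.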