Funny (NOT!) piece of C strangeness.

Posted: Tue Sep 15, 2009 8:34 am
by Solar
Have some patience: this will take a while as I lay the groundwork, but I am sure the results will be as surprising for you as they were for me.

The not-too-exotic function unsigned long int strtoul( const char * restrict nptr, char ** restrict endptr, int base ) is defined by the standard as follows (skip the quote if you like; I'll paraphrase later):
[...] If the value of base is between 2 and 36 (inclusive), the
expected form of the subject sequence is a sequence of letters and digits representing an
integer with the radix specified by base, optionally preceded by a plus or minus sign,
but not including an integer suffix. [...] If the value of base is 16, the characters 0x or 0X may
optionally precede the sequence of letters and digits, following the sign if present.

The subject sequence is defined as the longest initial subsequence of the input string,
starting with the first non-white-space character, that is of the expected form. The subject
sequence contains no characters if the input string is empty or consists entirely of white
space, or if the first non-white-space character is other than a sign or a permissible letter
or digit.

If the subject sequence has the expected form and the value of base
is between 2 and 36, it is used as the base for conversion, ascribing to each letter its value
as given above. [...] A pointer to the final string is stored in the
object pointed to by endptr, provided that endptr is not a null pointer.

[...]

If the subject sequence is empty or does not have the expected form, no conversion is
performed; the value of nptr is stored in the object pointed to by endptr, provided
that endptr is not a null pointer.

The strtol, strtoll, strtoul, and strtoull functions return the converted
value, if any. If no conversion could be performed, zero is returned.
In layman's terms, given a call strtoul( string, &endptr, 16 ), string is parsed as a hexadecimal number (optionally prefixed with "0x") until the first non-hexadecimal character is encountered. The return value is the value parsed, and endptr will point to the first unparsed character.
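As a minimal sketch of that paraphrase (the helper name parse_hex_prefix is made up for illustration and is not part of any library), the consumed length falls out of the difference between endptr and the start of the string. The inputs in the usage note are deliberately uncontroversial ones:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: parses `text` as base 16 with strtoul() and reports
 * through *consumed how many characters were used. */
static unsigned long parse_hex_prefix(const char *text, size_t *consumed)
{
    char *end = NULL;
    unsigned long value = strtoul(text, &end, 16);

    /* end points at the first unparsed character (or at text itself
     * if no conversion was performed). */
    *consumed = (size_t)(end - text);
    return value;
}
```

For "1Fz" this yields 31 with 2 characters consumed, for "0x1Fz" it yields 31 with 4 consumed, and for "zzz" it yields 0 with 0 consumed.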

Nice.

Now, the definition of fscanf(), more precisely the %x conversion specifier, emphasis mine:
Matches an optionally signed hexadecimal integer, whose format is the same
as expected for the subject sequence of the strtoul function with the value
16 for the base argument. The corresponding argument shall be a pointer to
unsigned integer.
Easy enough.
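For comparison, here is a small sketch of the %x side, with %n recording how many characters were consumed. The helper name scan_hex is an assumption for illustration only:

```c
#include <stdio.h>

/* Hypothetical helper: scans one hexadecimal integer with %x and records
 * the number of characters consumed with %n. Returns sscanf()'s result,
 * i.e. the number of assigned input items (%n does not count). */
static int scan_hex(const char *text, unsigned int *value, int *consumed)
{
    *consumed = -1;  /* stays -1 if %n is never reached */
    return sscanf(text, "%x%n", value, consumed);
}
```

For well-formed input such as "ff rest" this returns 1 with value 255 and 2 characters consumed; for "zz" it returns 0 and leaves the count at -1.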

One of my references for working on PDCLib is P.J. Plauger's book "The Standard C Library". In it, Plauger notes that, because ungetc() is only guaranteed to push one character back into the input stream, scanf() (and brethren) might not give exactly the same results as strtol() (and brethren).

That got me curious, and I wanted to find out what exactly those differences might be, so that PDCLib could handle them gracefully.

After some testing, I came up with this test program, which I think is quite simple:

Code:

#include <stdio.h>
#include <stdlib.h>

int main( void )
{
    const char * string = "0xz";  // valid hex prefix, followed by invalid digit
    int i = -1;             // result value, initialized to impossible value
    int count = -1;         // count of sscanf()-parsed characters, likewise initialized
    char c;                 // holds first sscanf()-unparsed character; left
                            // uninitialized, so garbage is printed if %c is never reached
    char * endptr = NULL;   // points to first strtoul()-unparsed character

    // scan string with sscanf(), putting result in i, characters parsed in 'count',
    // and the next character to be parsed in 'c'.
    // %x expects a pointer to unsigned int, hence the cast.
    sscanf( string, "%x%n%c", (unsigned int *)&i, &count, &c );
    // print results
    printf( "sscanf():  Value %d - Consumed %d - Next char %c\n", i, count, c );

    // scan string with strtoul(), putting result in i, and pointer to next
    // character to be parsed in 'endptr'.
    i = (int)strtoul( string, &endptr, 16 );
    // print results; the pointer difference is a ptrdiff_t, so cast it for %d
    printf( "strtoul(): Value %d - Consumed %d - Next char %c\n", i, (int)( endptr - string ), *endptr );
    return 0;
}
Now think for a minute what kind of output you would expect. After all, the specs are clear, no?

Here is what I got.

Cygwin:

Code:

sscanf():  Value 0 - Consumed 1 - Next char x
strtoul(): Value 0 - Consumed 0 - Next char 0
Gentoo:

Code:

sscanf():  Value 0 - Consumed 2 - Next char z
strtoul(): Value 0 - Consumed 1 - Next char x
MSVC 2005:

Code:

sscanf():  Value -1 - Consumed -1 - Next char ╠
strtoul(): Value 0 - Consumed 0 - Next char 0
Isn't that fun?

Right now I feel like hopping around and blowing soap bubbles. Obviously, it is very easy to get numerical parsing into undefined territory without actually doing anything wrong. The only halfway sane result is from MSVC (both parsings failing completely), which is something to mark in red on the calendar in any case...

Re: Funny (NOT!) piece of C strangeness.

Posted: Tue Sep 15, 2009 12:36 pm
by NickJohnson
It seems like the only thing needed to make the standard consistent is to have both functions return -1 on error, so the caller can disregard the other information, which would be undefined. The only issue is that the properly read number could actually be -1... perhaps a good place for ssize_t :lol: ?

Re: Funny (NOT!) piece of C strangeness.

Posted: Tue Sep 15, 2009 1:20 pm
by Kevin
My winner is...
Solar wrote:Gentoo:

Code:

sscanf():  Value 0 - Consumed 2 - Next char z
strtoul(): Value 0 - Consumed 1 - Next char x
The strtoul() case is obvious: 0 is a valid hex number, so it reads one character and returns 0, just as expected. The sscanf() behaviour took me a while to understand (I had expected that it should read exactly one character as well), but I think it's right too:
An input item shall be defined as the longest sequence of input bytes [...] which is an initial subsequence of a matching sequence.
0x is an initial subsequence of a matching sequence, even though it cannot completely be parsed into a hex number. It may be debatable if this definition makes a lot of sense, but reading two characters seems to be the right thing.

And now I'd better check my strtoul implementation... ;)

Re: Funny (NOT!) piece of C strangeness.

Posted: Tue Sep 15, 2009 2:14 pm
by Solar
NickJohnson wrote:It seems like the only thing needed to make the standard consistent is to have both functions return -1 on error...
Behaviour in case of error is defined by the standard; changing it is not an option.
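For what it's worth, the standard text quoted earlier already lets a caller tell "no conversion" apart from a legitimate 0: in that case endptr is set back to nptr. A rough sketch of such a check follows. The helper name parse_hex_checked is invented here, and the ERANGE test relies on strtoul() setting errno on overflow:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: returns true and stores the result in *out only if
 * at least one character was converted and the value did not overflow. */
static bool parse_hex_checked(const char *text, unsigned long *out)
{
    char *end = NULL;
    unsigned long value;

    errno = 0;
    value = strtoul(text, &end, 16);
    if (end == text)        /* no conversion: endptr was set to nptr */
        return false;
    if (errno == ERANGE)    /* value out of range for unsigned long */
        return false;
    *out = value;
    return true;
}
```

With this, "0" is accepted with value 0, while "" and "zz" are rejected, so no special error return value is needed.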

I just looked it up: standard chapter 6.4.4.1 defines a "hexadecimal number" to be either a sequence of hex digits, or "0x" / "0X" followed by hexadecimal digits. Which means "0x" is not a hex number at all, and correct behaviour should be value 0, consumed 1. For both functions.

Cygwin and Gentoo get 1 out of 2, Microsoft - while being the only lib giving consistent results across both functions - gets 0 out of 2, which somehow sets my concept of the computing world back on its feet. ;-)

Re: Funny (NOT!) piece of C strangeness.

Posted: Tue Sep 15, 2009 2:19 pm
by Kevin
Solar wrote:Which means "0x" is not a hex number at all, and correct behaviour should be value 0, consumed 1. For both functions.
So you disagree with my interpretation of the standard for sscanf? How would you interpret the "initial subsequence of a matching sequence" thing then?

Re: Funny (NOT!) piece of C strangeness.

Posted: Tue Sep 15, 2009 2:42 pm
by Velko
More fun!

FreeBSD:

Code:

sscanf():  Value 0 - Consumed 1 - Next char x
strtoul(): Value 0 - Consumed 1 - Next char x
Winner? :P

Re: Funny (NOT!) piece of C strangeness.

Posted: Tue Sep 15, 2009 4:15 pm
by tarrox
Kevin wrote:
Solar wrote:Which means "0x" is not a hex number at all, and correct behaviour should be value 0, consumed 1. For both functions.
So you disagree with my interpretation of the standard for sscanf? How would you interpret the "initial subsequence of a matching sequence" thing then?
After half an hour of reading, I came to the same conclusion as Solar. The critical part of it all is:
The subject sequence is defined as the longest initial subsequence of the input string,
starting with the first non-white-space character, that is of the expected form.
And:
If the value of base is zero, the expected form of the subject sequence is that of an
integer constant as described in 6.4.4.1, optionally preceded by a plus or minus sign, but
not including an integer suffix. [...]
Part of 6.4.4.1:

Code:

hexadecimal-constant:
    hexadecimal-prefix hexadecimal-digit
    hexadecimal-constant hexadecimal-digit
hexadecimal-prefix: one of
    0x 0X
hexadecimal-digit: one of
    0 1 2 3 4 5 6 7 8 9
    a b c d e f
    A B C D E F
And "0x" isn't an expected form of a hexadecimal integer, as Solar said.
Also, the part you mention isn't in the standard I got from open-std.org (which is from 2005; the newest is from 2007, AFAIK), so I don't see any problem. Or are there new or special parts I don't know of, or are you maybe just mistaken in thinking of it as part of the standard?
Velko wrote:More fun!

FreeBSD:

Code:

sscanf():  Value 0 - Consumed 1 - Next char x
strtoul(): Value 0 - Consumed 1 - Next char x
Winner? :P
Yep, this implementation is right.

Re: Funny (NOT!) piece of C strangeness.

Posted: Tue Sep 15, 2009 11:08 pm
by Solar
Seems like BSD is the only library to get this right. Nota bene: This is a bug in glibc, newlib, IBM's AIX library, and MSVC (which are the libs I tested so far).

Fun... :twisted:

Re: Funny (NOT!) piece of C strangeness.

Posted: Wed Sep 16, 2009 1:46 am
by Kevin
tarrox wrote:
Kevin wrote:
Solar wrote:Which means "0x" is not a hex number at all, and correct behaviour should be value 0, consumed 1. For both functions.
So you disagree with my interpretation of the standard for sscanf? How would you interpret the "initial subsequence of a matching sequence" thing then?
After half an hour of reading, I came to the same conclusion as Solar. The critical part of it all is:
The subject sequence is defined as the longest initial subsequence of the input string,
starting with the first non-white-space character, that is of the expected form.
And:
If the value of base is zero, the expected form of the subject sequence is that of an
integer constant as described in 6.4.4.1, optionally preceded by a plus or minus sign, but
not including an integer suffix. [...]
You are quoting the wrong thing: this is the description of strtoul. We all agree that the right result for strtoul is one consumed character. No problems here, though it's sad that so many libs have bugs here (including my own, by the way ;))

When discussing the sscanf behaviour however, you should look at the text for sscanf. And this text says, as I mentioned above: "An input item shall be defined as the longest sequence of input bytes (up to any specified maximum field width, which may be measured in characters or bytes dependent on the conversion specifier) which is an initial subsequence of a matching sequence. The first byte, if any, after the input item shall remain unread." This is different from strtoul in that it doesn't demand that the read characters constitute a valid sequence. They only need to be a prefix of a valid sequence, and 0x definitely is such a prefix.
Velko wrote:FreeBSD:

Code:

sscanf():  Value 0 - Consumed 1 - Next char x
strtoul(): Value 0 - Consumed 1 - Next char x
Winner? :P
Yep, this implementation is right.
Nope, against common sense, this is yet another broken implementation. ;)

Re: Funny (NOT!) piece of C strangeness.

Posted: Wed Sep 16, 2009 1:54 am
by Solar
Kevin wrote:...an initial subsequence of a matching sequence.
"0x" is not a matching sequence, because "0x" is not a valid hexadecimal number. "0", on the other hand, is.

Re: Funny (NOT!) piece of C strangeness.

Posted: Wed Sep 16, 2009 2:08 am
by Kevin
Solar wrote:
Kevin wrote:...an initial subsequence of a matching sequence.
"0x" is not a matching sequence, because "0x" is not a valid hexadecimal number. "0", on the other hand, is.
Right, this is why strtoul consumes one character. But "0x123" is a matching sequence and "0x" is a prefix of "0x123", so sscanf consumes both characters of "0x".

Edit: I shouldn't change terminology during the discussion... "prefix" and "initial subsequence" are the same. The former is used in the C99 standard (which I was reading while posting this), the latter in the POSIX manpage (which I used yesterday).
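To make the two rules being compared here concrete, this is a rough sketch of each as a length computation. Both function names are made up. The second function encodes the "prefix of a matching sequence" reading argued for in this post, which is exactly the point under dispute, so treat it as an illustration of that reading and not as a statement of what the standard requires. Leading white space is ignored in both for brevity:

```c
#include <ctype.h>
#include <stddef.h>

/* strtoul()-style subject sequence for base 16: the longest initial part
 * of `s` that is itself a complete hexadecimal number. */
static size_t subject_len_hex(const char *s)
{
    size_t i = 0;
    size_t digits_start;

    if (s[i] == '+' || s[i] == '-')
        i++;
    /* The 0x/0X prefix only counts if a hex digit follows it. */
    if (s[i] == '0' && (s[i + 1] == 'x' || s[i + 1] == 'X')
        && isxdigit((unsigned char)s[i + 2]))
        i += 2;
    digits_start = i;
    while (isxdigit((unsigned char)s[i]))
        i++;
    return (i == digits_start) ? 0 : i;  /* a bare sign is not a number */
}

/* fscanf()-style input item under the "prefix of a matching sequence"
 * reading: the longest initial part of `s` that could still be extended
 * into a hexadecimal number. */
static size_t input_item_len_hex(const char *s)
{
    size_t i = 0;

    if (s[i] == '+' || s[i] == '-')
        i++;
    /* "0x" alone already qualifies, since e.g. "0x1" would match. */
    if (s[i] == '0' && (s[i + 1] == 'x' || s[i + 1] == 'X'))
        i += 2;
    while (isxdigit((unsigned char)s[i]))
        i++;
    return i;
}
```

For "0xz" the first function gives 1 and the second gives 2; for well-formed input such as "0x1Fz" both give 4.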

Re: Funny (NOT!) piece of C strangeness.

Posted: Wed Sep 16, 2009 2:35 am
by Solar
You can read it like that.

But you can also read it as: The sequence is "0xz", the sequence does not match, and thus the "0x" is not the prefix of a matching sequence.

Consider it from the other side: There is no way for printf() to print a hexadecimal number "0x".

printf( "%x", 0 ) -> "0"
printf( "%#x", 0 ) -> "0" (because "0x" is only prefixed for non-zero values)

So I see logic to be on my side: Why should I parse "0x" as being a hex number, when I cannot print it like that?
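The printf() side of this is easy to check. A small sketch, with format_hex being a made-up helper name:

```c
#include <stdio.h>

/* Hypothetical helper: formats `value` as hexadecimal into `buf`, using
 * the alternate form (%#x) if `alternate` is nonzero. Returns the number
 * of characters snprintf() reports. */
static int format_hex(char *buf, size_t size, unsigned int value, int alternate)
{
    return alternate ? snprintf(buf, size, "%#x", value)
                     : snprintf(buf, size, "%x", value);
}
```

For the value 0 both forms produce "0"; for 255 the plain form produces "ff" and the alternate form produces "0xff".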

Re: Funny (NOT!) piece of C strangeness.

Posted: Wed Sep 16, 2009 2:52 am
by Kevin
Hm, I see. What you're reading from it isn't completely unreasonable either. However, I wouldn't see the point in writing it this way if they really meant what you read. I mean, if the whole thing matches, I'm reading the complete matching sequence anyway - no need to mention prefixes or matching subsequences. And it says "a matching sequence", not "the matching sequence".

What really makes me sure that my interpretation is the right one, is a footnote in the standard that says "fscanf pushes back at most one input character onto the input stream." So it reads "0" and then "0x" which could still be the prefix of a hex number. It reads on to "0xz" which is invalid, but can only push the "z" back to the input stream.

However, reading the text over and over again, now I think this leads to a matching failure and sscanf must return immediately. It would return 0 then and the value of all your variables would be either undefined or not changed (can't be bothered to read up on this detail). I think we need to extend your test program to print the return values. ;)
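A possible shape for that extension, as a sketch only: the struct and function names are invented, and every target is given a known starting value so that "not assigned" is visible in the result instead of showing up as garbage.

```c
#include <stdio.h>

/* Hypothetical result record for one sscanf() probe. */
struct scan_result {
    int ret;             /* return value of sscanf() */
    unsigned int value;  /* target of %x, 0 if never assigned */
    int consumed;        /* target of %n, -1 if never reached */
    char next;           /* target of %c, '?' if never assigned */
};

/* Runs the same "%x%n%c" scan as the test program and captures everything,
 * including the return value. */
static struct scan_result probe_sscanf(const char *text)
{
    struct scan_result r = { 0, 0u, -1, '?' };

    r.ret = sscanf(text, "%x%n%c", &r.value, &r.consumed, &r.next);
    return r;
}
```

On uncontroversial input the outcome is fixed: "ff!" gives return value 2, value 255, consumed 2 and next character '!', while "zz" gives return value 0 and leaves the other fields at their defaults. What comes back for "0xz" is the open question.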

Re: Funny (NOT!) piece of C strangeness.

Posted: Wed Sep 16, 2009 3:11 am
by Solar
Matching failure in %x would leave the -1 in count untouched.

I've been pondering the one-char-pushback myself. If you are serious about that limitation, you cannot opt for matching failure, because you've already read the 0xz and can only push back the z...

Re: Funny (NOT!) piece of C strangeness.

Posted: Wed Sep 16, 2009 3:19 am
by Kevin
Solar wrote:Matching failure in %x would leave the -1 in count untouched.
It would. The expected output would be "sscanf(): Value 0 - Consumed -1 - Next char <uninitialized> - Return value 0". (Edit: Not sure about the value, -1 could be right as well)
Solar wrote:I've been pondering the one-char-pushback myself. If you are serious about that limitation, you cannot opt for matching failure, because you've already read the 0xz and can only push back the z...
Why not? Consume two characters and then fail. Or does it say somewhere that a failing fscanf must leave the input stream completely unread?
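A rough sketch of what "consume two characters and then fail" would look like with only one character of pushback. Everything here is hypothetical: the string-backed input type stands in for a FILE on which only one ungetc() is guaranteed, sign handling, white space and overflow are left out, and this is not how any of the tested libraries is known to work.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical string-backed input with a single pushback slot. */
struct one_back {
    const char *text;
    size_t pos;        /* index of the next unread character */
    int may_unget;     /* nonzero only right after a successful read */
};

/* Reads one character, or returns -1 at the end of the input. */
static int ob_get(struct one_back *in)
{
    unsigned char ch = (unsigned char)in->text[in->pos];

    if (ch == '\0') {
        in->may_unget = 0;
        return -1;
    }
    in->pos++;
    in->may_unget = 1;
    return ch;
}

/* Pushes back the most recently read character; a second pushback in a
 * row is refused. Returns 1 on success, 0 otherwise. */
static int ob_unget(struct one_back *in)
{
    if (!in->may_unget)
        return 0;
    in->pos--;
    in->may_unget = 0;
    return 1;
}

/* Scans an unsigned hexadecimal number.
 * Returns 1 on success, 0 on matching failure. */
static int scan_hex_one_back(struct one_back *in, unsigned long *value)
{
    unsigned long v = 0;
    int digits = 0;
    int ch = ob_get(in);

    if (ch == '0') {
        digits = 1;              /* "0" is a complete number so far */
        ch = ob_get(in);
        if (ch == 'x' || ch == 'X') {
            digits = 0;          /* committed to the prefix: digits must follow */
            ch = ob_get(in);
        }
    }
    while (ch != -1 && isxdigit(ch)) {
        v = v * 16 + (unsigned long)(isdigit(ch) ? ch - '0'
                                                 : tolower(ch) - 'a' + 10);
        digits++;
        ch = ob_get(in);
    }
    if (ch != -1)
        ob_unget(in);            /* only the last character can go back */
    if (digits == 0)
        return 0;                /* "0xz": 0 and x are gone, z was pushed back */
    *value = v;
    return 1;
}
```

Run on "0xz", this reports a matching failure with the read position left at the z, so two characters have been consumed and nothing has been stored. On "0x1F " it succeeds with value 31 and leaves the position at the space.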