2 C runtime libraries

nullplan · Post by **nullplan** » Sat Nov 14, 2020 2:29 pm

Solar wrote:It should be mentioned that the musl implementation the author is so impressed with is assuming ASCII-7, which I am very much not impressed with.

isalpha() is a function working on single bytes. In a UTF-8 environment, the only legal single-byte characters are the ASCII characters. All others are multibyte, and therefore no valid inputs into this function. If you want to ask the libc about Unicode, you have to convert from multibyte to wide character and then use iswalpha(). And, since you are the author of a libc, I am a bit perplexed that you do not know this.
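For illustration, here is a minimal sketch of the multibyte-to-wide route described above, using the standard mbrtowc() and iswalpha(). The helper name first_char_is_alpha is hypothetical, and decoding follows whatever LC_CTYPE locale is currently active.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: classify the first character of a multibyte string.
 * Returns 1 if it is alphabetic, 0 if it is not, and -1 if the string is
 * empty or does not start with a valid, complete multibyte sequence in the
 * current LC_CTYPE locale. */
static int first_char_is_alpha(const char *s)
{
    assert(s != NULL);

    mbstate_t st;
    memset(&st, 0, sizeof st);          /* start in the initial shift state */

    wchar_t wc;
    size_t n = mbrtowc(&wc, s, strlen(s), &st);
    if (n == 0 || n == (size_t)-1 || n == (size_t)-2)
        return -1;                      /* empty, invalid, or incomplete */

    return iswalpha((wint_t)wc) != 0;
}
```

In the default "C" locale only ASCII is guaranteed to decode. Calling setlocale(LC_CTYPE, "") from <locale.h> first selects the environment's encoding; whether that is UTF-8 depends on the system.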

bzt wrote:This made me wonder. Is this really work for non-English locales?

Yes, because isdigit() only works on single bytes. isdigit(L'一') is undefined behavior. The larger question would be how character classifications are locale-dependent at all. The only reason I can see is character sets. musl only supports ASCII and UTF-8 (I don't quite remember the reasoning but it was something to the effect of a single-byte character set being required in the C locale, and the only such character set Rich found supportable was ASCII, since that is a subset of UTF-8), therefore the implementation of isdigit() is sufficient. I just looked it up, and iswdigit() has essentially the same implementation. So apparently, Chinese number signs are not digits. iswalpha(), on the other hand, classifies tons of Unicode codepoints as alphabetic, depending on their classification in Unicode. The encoding for this is pretty horrible, tho:

```c
static const unsigned char table[] = {
#include "alpha.h"
};

int iswalpha(wint_t wc)
{
	if (wc<0x20000U)
		return (table[table[wc>>8]*32+((wc&255)>>3)]>>(wc&7))&1;
	if (wc<0x2fffeU)
		return 1;
	return 0;
}
```

And alpha.h is just a long list of seemingly random numbers.
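To make the lookup less opaque, here is a toy sketch of the same two-level bitmap idea. It is not musl's table: it assumes two separate arrays instead of one packed table, covers only code points below 0x200, and marks only the ASCII letters. All names (page_index, blocks, toy_init, toy_iswalpha) are made up for illustration.

```c
#include <assert.h>

enum { TOY_PAGES = 2, BLOCK_BYTES = 32 };

/* Level 1: one entry per 256-code-point page, giving a block number. */
static unsigned char page_index[TOY_PAGES];
/* Level 2: 32-byte bitmaps, one bit per code point. Block 0 stays empty. */
static unsigned char blocks[2][BLOCK_BYTES];

static void toy_init(void)
{
    assert(BLOCK_BYTES * 8 == 256);     /* one bitmap covers one page */

    page_index[0] = 1;  /* page 0 (0x00..0xFF) uses block 1 */
    page_index[1] = 0;  /* page 1 (0x100..0x1FF) shares the empty block */

    /* Set the bit for each ASCII letter; everything else stays 0. */
    for (unsigned c = 0; c < 128; c++)
        if ((c | 32u) - 'a' < 26u)
            blocks[1][c >> 3] |= (unsigned char)(1u << (c & 7));
}

static int toy_iswalpha(unsigned wc)
{
    if (wc >= TOY_PAGES * 256u)
        return 0;
    /* High bits pick the block, bits 3..7 the byte, bits 0..2 the bit. */
    return (blocks[page_index[wc >> 8]][(wc & 255) >> 3] >> (wc & 7)) & 1;
}
```

Pages with identical contents can share one block, which is presumably where the space saving comes from. In the quoted musl code the first-level index and the bitmaps live in the same array, hence the table[table[...]*32 + ...] indexing.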

PeterX wrote:But does that explain why the GNUs use such a code, full of preprocessor instructions?

That's another topic altogether. The glibc code is pretty horrible and full of code smells. However, the crash spoken about in the blog article you linked in the OP did not come about because of them, but because isalpha() was called with an invalid input.

PeterX · Post by **PeterX** » Sat Nov 14, 2020 3:33 pm

nullplan wrote:
PeterX wrote:But does that explain why the GNUs use such a code, full of preprocessor instructions?
That's another topic altogether. The glibc code is pretty horrible and full of code smells. However, the crash spoken about in the blog article you linked in the OP did not come about because of them, but because isalpha() was called with an invalid input.

I quote from the blog article:

Maybe, just maybe, the behavior of this function should not depend on five macros, whether or not you’re using a C++ compiler, the endianness of your machine, a look-up table, thread-local storage, and two pointer dereferences.

Greetings
Peter

bzt · Post by **bzt** » Sat Nov 14, 2020 6:50 pm

nullplan wrote:
bzt wrote:This made me wonder. Is this really work for non-English locales?
Yes, because isdigit() only works on single bytes.

Exactly my point. So I repeat my question, what is the purpose of

```c
return ( c >= _PDCLIB_lc_ctype->digits_low && c <= _PDCLIB_lc_ctype->digits_high );
```

when there's only one interval available for unsigned char (no matter the charset, ASCII, ISO-LATIN-x, KOI-8, UNICODE etc.)? All the other non-English digits require more than 8 bits, and all codepages have those digits at the same fixed position (0x30-0x39). So is there really a need for "_PDCLIB_lc_ctype->" here?

Cheers,
bzt

nullplan · Post by **nullplan** » Sat Nov 14, 2020 11:57 pm

PeterX wrote:I quote from the blog article:
Maybe, just maybe, the behavior of this function should not depend on five macros, whether or not you’re using a C++ compiler, the endianness of your machine, a look-up table, thread-local storage, and two pointer dereferences.

Yes. Drew is wrong here. It happens. Drew thinks that isalpha() should tell you whether the argument is an element of a few predetermined intervals. No, that is not what it does, and as I explained, if you allow character sets other than ASCII and UTF-8, then the question is indeed dependent on runtime state, namely locale. And so the thing that broke his neck here, the two pointer dereferences, cannot be avoided under the assumptions glibc makes, and musl only avoids them by limiting their feature set. And TLS is required to support uselocale(). The rest of the stuff, tho, yes that is horrible, but not what caused his crash. You can write a pristine version of that code with the same feature set, and it would still crash with invalid inputs.
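As a sketch of what a valid input looks like: the C standard requires the argument of the <ctype.h> functions to be representable as unsigned char or to equal EOF. Where plain char is signed, bytes of 0x80 and above turn negative when passed directly, so the usual fix is a cast at the call site. The helper name count_alpha is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count the bytes of s that isalpha() accepts in the
 * current locale. The cast keeps every argument in 0..UCHAR_MAX, which is
 * the range the <ctype.h> functions are defined for (besides EOF). */
static size_t count_alpha(const char *s)
{
    assert(s != NULL);

    size_t n = 0;
    for (; *s != '\0'; s++)
        if (isalpha((unsigned char)*s))
            n++;
    return n;
}
```

Without the cast, isalpha(*s) on a negative char value other than EOF is undefined behavior, regardless of how clean the library's table lookup is.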

Solar · Post by **Solar** » Sun Nov 15, 2020 1:35 pm

bzt wrote:
Solar wrote:
```c
int isdigit( int c )
{
    return ( c >= _PDCLIB_lc_ctype->digits_low && c <= _PDCLIB_lc_ctype->digits_high );
}
```
This made me wonder. Is this really work for non-English locales? For Chinese locale for example, isdigit(L'一') should return true...

The C standard defines isdigit() to return "true" for 0 through 9 specifically (to the exclusion of any locale-dependent digits), and requires that they be consecutive values.
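A minimal sketch of what that guarantee permits, with no locale data involved. The names my_isdigit and digit_value are hypothetical, and this is not PDCLib's code.

```c
#include <assert.h>

/* '0'..'9' are consecutive in every conforming execution character set,
 * so one unsigned range check is enough. Negative inputs (such as EOF)
 * wrap to large unsigned values and are rejected. */
static int my_isdigit(int c)
{
    return (unsigned)c - '0' < 10;
}

/* Numeric value of a digit character; the caller must pass a digit. */
static int digit_value(int c)
{
    assert(my_isdigit(c));
    return c - '0';
}
```

The subtraction trick folds the lower and upper bound into a single comparison.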

So even if you had a machine where CHAR_BIT would allow you to have a non-multibyte Chinese encoding (because isdigit() is not for multibyte encodings like UTF-8, nor for wide-character encodings like UTF-32), then isdigit( '一' ) would still return zero.

C++ has a std::isalpha() in <locale>, which is defined to be locale-dependent.

nullplan wrote:
Solar wrote:It should be mentioned that the musl implementation the author is so impressed with is assuming ASCII-7, which I am very much not impressed with.
isalpha() is a function working on single bytes. In a UTF-8 environment...

I was not talking about UTF-8. I was talking about me expecting a production-level libc to be able to tell me the truth about isalpha() with Latin-9, CP-1252 and EBCDIC, IBM Codepage 437 and 850, Latin-2 and whatnot. (And no, PDCLib isn't there yet, but it has the plumbing in place to be there when it hits 1.0.)

Of course a libc that decides to ignore anything not ASCII-7 has it pretty easy when showing off a really simple isalpha() implementation...

nullplan wrote:If you want to ask the libc about Unicode, you have to convert from multibyte to wide character and then use iswalpha(). And, since you are the author of a libc, I am a bit perplexed that you do not know this.

I know that full well, it was just not relevant to this discussion, because isdigit() is not working on UTF-8. (Unicode is the character set. The encoding is UTF-8.) Note that neither the C nor the C++ standard are actually equipped to handle Unicode (or any "real" internationalization really) in the first place, since toupper() and tolower() (as well as their wide-char counterparts) work on single characters only. One, there are locales in which such a conversion should result in more than one character (toupper( 'ß' ) should give "SS"), and there are conversions where the result depends on the position of the character in the word (Greek Sigma for example), and the C/C++ standard functions are ignorant of that.

Then there are normalization levels (is a "1" in a circle a "1" or not?), combining characters etc... You want real Unicode support, you need to go to the ICU lib. It has a not-very-comfortable API, but it does cover all the issues mentioned above properly.

Can we stop showing off our relative encoding knowledge now?

eekee · Post by **eekee** » Wed Nov 18, 2020 6:12 pm

PeterX wrote:BTW They have made things like bison and GNUstep and mono and much more. And some folks like Emacs, which is part of GNU, too. (No editor wars, please!)
But that doesn't mean they have to write that kind of code.

My thoughts exactly! ... except I could never get the hang of GNUstep.

(I did like WindowMaker, but that wasn't a GNU project in itself.)

My favourite account about GNU is when they organized a little protest against software patents on the occasion of Rob Pike giving a talk. Pike talked with them, they seemed reasonable and nice enough. Afterward, he looked at the web page they'd used to organize their protest. When he blogged about it years later, he asked, "What kind of people write detailed instructions on making a cardboard protest sign?" I love it!

Disclaimer: I may have misremembered exactly what he asked.

I'm not going to comment on character encodings; I'll end up ranting against all possible sides of the issues.
