Re: 2 C runtime libraries
Posted: Sat Nov 14, 2020 2:29 pm
Solar wrote: It should be mentioned that the musl implementation the author is so impressed with is assuming ASCII-7, which I am very much not impressed with.
isalpha() is a function working on single bytes. In a UTF-8 environment, the only legal single-byte characters are the ASCII characters. All others are multibyte, and therefore not valid inputs to this function. If you want to ask the libc about Unicode, you have to convert from multibyte to wide character and then use iswalpha(). And, since you are the author of a libc, I am a bit perplexed that you do not know this.
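To make the multibyte-to-wide-character route concrete, here is a minimal sketch. It assumes the caller has already selected a suitable locale (for example a UTF-8 one); count_alpha_mb is a hypothetical helper name, not anything from musl or glibc.

```c
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the alphabetic characters in a multibyte
 * string according to the current LC_CTYPE locale.
 * Returns -1 if the string contains an invalid or truncated sequence. */
long count_alpha_mb(const char *s)
{
    mbstate_t st;
    size_t left = strlen(s);
    long count = 0;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */

    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                  /* invalid or incomplete sequence */
        if (n == 0)
            break;                      /* decoded a NUL; defensive only */
        if (iswalpha((wint_t)wc))       /* classify the whole character */
            count++;
        s += n;
        left -= n;
    }
    return count;
}
```

Call setlocale(LC_CTYPE, "") once at startup before using it; without that the program runs in the "C" locale, and what happens to non-ASCII bytes there differs between libcs. The point is that iswalpha() sees one decoded character at a time, never an isolated byte from the middle of a sequence.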
bzt wrote: This made me wonder. Is this really work for non-English locales?
Yes, because isdigit() only works on single bytes. isdigit(L'一') is undefined behavior. The larger question would be why character classifications are locale-dependent at all. The only reason I can see is character sets. musl only supports ASCII and UTF-8 (I don't quite remember the reasoning, but it was something to the effect of a single-byte character set being required in the C locale, and the only such character set Rich found supportable was ASCII, since that is a subset of UTF-8), so the implementation of isdigit() is sufficient. I just looked it up, and iswdigit() has essentially the same implementation. So apparently, Chinese numerals are not digits. iswalpha(), on the other hand, classifies tons of Unicode codepoints as alphabetic, depending on their classification in Unicode. The encoding for this is pretty horrible, though:
Code:
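#include <wctype.h>

/* Holds both the per-page index and the 32-byte bitmap blocks it points to. */
static const unsigned char table[] = {
#include "alpha.h"
};

int iswalpha(wint_t wc)
{
    /* Below U+20000: table[wc>>8] picks a block, then one bit per code point. */
    if (wc<0x20000U)
        return (table[table[wc>>8]*32+((wc&255)>>3)]>>(wc&7))&1;
    /* U+20000..U+2FFFD is treated as alphabetic throughout. */
    if (wc<0x2fffeU)
        return 1;
    return 0;
}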
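For anyone who finds that index expression hard to read, here is a toy version of the same two-level bitmap idea. It is only a sketch: the table contents are made up for illustration (not real Unicode data), it covers just three 256-code-point pages, and it keeps the page index and the blocks in separate arrays, whereas the code above indexes a single array for both. The names toy_isalpha, page_block and blocks are hypothetical.

```c
#include <stdint.h>

/* Block numbers for the toy table. */
enum { BLOCK_NONE = 0, BLOCK_ALL = 1, BLOCK_ASCII = 2 };

/* Level 1: one entry per 256-code-point page (wc >> 8), naming a block. */
static const unsigned char page_block[3] = {
    BLOCK_ASCII,    /* page 0: only A-Z and a-z marked (toy data) */
    BLOCK_ALL,      /* page 1: every code point marked (toy data) */
    BLOCK_NONE      /* page 2: nothing marked (toy data) */
};

/* Level 2: 32 bytes = 256 bits per block, one bit per code point. */
static const unsigned char blocks[3][32] = {
    [BLOCK_NONE]  = { 0 },
    [BLOCK_ALL]   = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
                      0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
                      0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
                      0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF },
    [BLOCK_ASCII] = { [8]  = 0xFE, [9]  = 0xFF, [10] = 0xFF, [11] = 0x07,  /* A-Z */
                      [12] = 0xFE, [13] = 0xFF, [14] = 0xFF, [15] = 0x07 } /* a-z */
};

int toy_isalpha(uint32_t wc)
{
    if (wc >= 0x300)
        return 0;                                   /* outside the toy table */
    const unsigned char *block = blocks[page_block[wc >> 8]];
    /* (wc & 255) >> 3 selects the byte, wc & 7 the bit within it. */
    return (block[(wc & 255) >> 3] >> (wc & 7)) & 1;
}
```

The space saving comes from sharing: every page that is entirely alphabetic, or entirely not, can point at the same 32-byte block, so only the mixed pages need a block of their own.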
PeterX wrote: But does that explain why the GNUs use such a code, full of preprocessor instructions?
That's another topic altogether. The glibc code is pretty horrible and full of code smells. However, the crash spoken about in the blog article you linked in the OP did not come about because of them, but because isalpha() was called with an invalid input.
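As a reminder of what "valid input" means here: the C standard requires the argument of isalpha() to be either EOF or a value representable as unsigned char, and anything else is undefined behavior. One common way to violate that is passing a plain char that happens to be negative; whether that is exactly what the blog's code did is not something I am claiming. A minimal sketch of the usual guard, with is_alpha_char as a hypothetical wrapper name:

```c
#include <ctype.h>

/* Hypothetical wrapper: safe to call with any char value, including
 * negative ones on platforms where plain char is signed. */
int is_alpha_char(char c)
{
    /* The cast maps the value into 0..UCHAR_MAX, which is the range
     * isalpha() is defined for. */
    return isalpha((unsigned char)c) != 0;
}
```

This only makes the call well-defined. It does not make isalpha() understand UTF-8: a byte from the middle of a multibyte sequence is still just a byte.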