Using -1 with unsigned integers only as missing marker

~ · Post by ~ » Tue Mar 06, 2018 6:24 pm

I think that it would be much better to use -1 with unsigned int or unsigned long instead of using their signed variants.

The idea is to reserve the value -1 (all bits set to 1) solely to indicate a missing offset, but use unsigned integers for the rest of values as it should be, to avoid dividing the capacity of an offset to its half.

It just takes away 1 single unsigned number to be reserved similarly to NULL, so why isn't it normally implemented like this. I think that I will implement all of my offsets as unsigned integers and I will only reserve the unsigned value of -1 for missing offsets.

Nobody said to use unsigned offsets, all they said was to use -1 (all bits to 1) as the marker for a missing offset.

Brendan · Post by **Brendan** » Tue Mar 06, 2018 7:33 pm

Hi,

~ wrote:I think that it would be much better to use -1 with unsigned int or unsigned long instead of using their signed variants.

The idea is to reserve the value -1 (all bits set to 1) solely to indicate a missing offset, but use unsigned integers for the rest of values as it should be, to avoid dividing the capacity of an offset to its half.

It just takes away 1 single unsigned number to be reserved similarly to NULL, so why isn't it normally implemented like this. I think that I will implement all of my offsets as unsigned integers and I will only reserve the unsigned value of -1 for missing offsets.

Nobody said to use unsigned offsets, all they said was to use -1 (all bits to 1) as the marker for a missing offset.

C is designed to be portable to different computers; where some computers use "two's complement" for signed numbers (where -1 == 0xFFFFFFFF), some computers use "sign and magnitude" (where -1 == 0x80000001), and some computers use something else (where -1 == something else). More correctly, something like "unsigned int x = -1;" relies implementation defined behaviour, where implementation defined behaviour is only slightly better than undefined behaviour.

For this reason, a good C programmer won't use -1 for unsigned integers - they'd use (e.g.) UINT_MAX instead to ensure that it's portable (and ensure that everyone doesn't think their code is bad even when portability doesn't matter).

The next problem is that this value doesn't survive promotion. For example, if "bar()" returns an unsigned int and you do "unsigned long foo = bar(); if (foo == ULONG_MAX) {..." then you might or might not get bugs depending on the actual sizes (e.g. if "int" is 32-bit and "long" is 64-bit; then the 32-bit value 0xFFFFFFFF might be converted to the 64-bit value 0x00000000FFFFFFFF and won't match ULONG_MAX; but if "int" and "long" are both 32-bit the code will work). In other words; this can easily become another "implementation defined" portability disaster.

The next problem is that a good interface may return more than just one kind of error (note: admittedly the C standard library is pure trash, doesn't use good interfaces and ends up relying on the horrible "errno" mess for the same purpose). In this case, for extensibility, it'd make sense to use a range of values for "errors of any kind" and another range for "no error"; and the easiest way to do that is to use signed integer where negative values indicate errors. For example, something like -1 = bad parameters, -2 = not enough memory, -3 = permission denied, -4 = file not found, etc.

The final problem is that nobody has a clue what this is actually for. Something that's ideal for one thing might be completely inappropriate for something else. Sometimes you might want errors to cause immediate termination, or cause a signal, or throw an exception; so that the programmer can assume that errors are never returned from functions. Sometimes negative values may be valid and can't be used to indicate errors. Sometimes you might want to return "status" even when there was no error (e.g. "succeeded without problem" and "succeeded, but an error occurred and was corrected"). It's a little bit like saying "I think people should smile and laugh more" - it sounds nice if you don't spend any time thinking about the corner-cases (e.g. funerals).

Cheers,

Brendan

StudlyCaps · Post by **StudlyCaps** » Tue Mar 06, 2018 10:58 pm

~ wrote:I think that it would be much better to use -1 with unsigned int or unsigned long instead of using their signed variants.

The idea is to reserve the value -1 (all bits set to 1) solely to indicate a missing offset, but use unsigned integers for the rest of values as it should be, to avoid dividing the capacity of an offset to its half.

It just takes away 1 single unsigned number to be reserved similarly to NULL, so why isn't it normally implemented like this. I think that I will implement all of my offsets as unsigned integers and I will only reserve the unsigned value of -1 for missing offsets.

Nobody said to use unsigned offsets, all they said was to use -1 (all bits to 1) as the marker for a missing offset.

What if you had something at offset of MAX_INT? Cause valid ranges often go 0 to MAX_<something>

alexfru · Post by **alexfru** » Wed Mar 07, 2018 12:01 am

Brendan wrote:More correctly, something like "unsigned int x = -1;" relies implementation defined behaviour, where implementation defined behaviour is only slightly better than undefined behaviour.

For this reason, a good C programmer won't use -1 for unsigned integers - they'd use (e.g.) UINT_MAX instead to ensure that it's portable (and ensure that everyone doesn't think their code is bad even when portability doesn't matter).

(unsigned int)-1 equals UINT_MAX. See the "Representations of types" and "Conversions" sections of the C standard.

Solar · Post by **Solar** » Wed Mar 07, 2018 3:59 am

@alexfru is correct -- "unsigned int x = -1" is a portable way to get all bits set in x. "unsigned int x = UINT_MAX" is, of course, more expressive.

As for what ~ originally wrote...

C++ does that (e.g. "std::string::npos" is defined as "static_cast<size_t>(-1)").
If you're using "signed int" as an index, you're doing it wrong already -- as e.g. strlen() returns the unsigned type size_t, and you (hopefully) start getting signed / unsigned comparison warnings from your compiler, which you will (hopefully) not ignore.
An offset is a different thing from an index, and can be negative.

Mind your vocabulary. Say what you mean, mean what you say.

~ · Post by ~ » Wed Mar 07, 2018 6:06 am

Doesn't the compiler generate the right sequence for -1 and the other values whenever it's specified anyway, regardless of using it as an unsigned bit pattern or as an actual signed integer?

Solar · Post by **Solar** » Wed Mar 07, 2018 7:22 am

Errr... I beg your pardon?

Could you re-phrase that question? Because I have no idea what you just asked.

Schol-R-LEA · Post by **Schol-R-LEA** » Wed Mar 07, 2018 1:43 pm

~ wrote:Doesn't the compiler generate the right sequence for -1 and the other values whenever it's specified anyway, regardless of using it as an unsigned bit pattern or as an actual signed integer?

You don't actually understand how two's complement works, do you? Not that it really matters; as Brendan stated, the C standard is written in a way that tries to avoid making assumptions about the underlying representation of integers, signed or unsigned, other than that there is one.

In a two's complement system - which is nearly universal today for system-word addressable integers, though exceptions exist (and floating-point numbers are entirely different, as are any bignums which a language or library might support) - there is no 'unsigned bit pattern' that the hardware can differentiate from a 'signed bit pattern'. Indeed, this is one of the main advantages of two's complement over, say, sign and magnitude, or one's complement: the exact same addition operation suffices for both signed and unsigned addition and subtraction.

In 2's complement, you negate a signed value by taking the complement of the value (the bitwise negation, that is to say, toggling the bits - ones goes to zeros and vice versa; this can be accomplished simply by XORing the value to a mask of all one's, though many ISAs including x86 have a built-in NOT instruction) and then adding one. You can also get the negated integer value by subtracting one from it's current value, then taking the complement of the result - the effect is the same.

For example, in a 4-bit signed value, -2 = (~0010) + 1 = 1101 + 1 = 1110.

This means that you don't need a separate subtraction system in hardware; you simply perform an arithmetic negation on the value being subtracted, and then add the two values normally.

It also mean that addition is the same for signed or unsigned values; whether the value is signed or not is entirely in the programmer's (or compiler's) interpretation of the value. The hardware doesn't have to even know which it is.

It also means that all the NEG instruction in the x86 has to do is simply apply the NOT and INC hardware in sequence - the circuit (or more likely, microcode/firmware) is made just but directing the output of the first operation to the input of the second.

Two's complement also has the advantage over one's complement that zero has the same representation for both signed and unsigned values. In one's complement (in negation is just the complement), a 4-bit zero is either 0000 or 1111; in two's complement, negating 0000 becomes 1111 + 1 = 0000 (due to wrapping overflow).

Schol-R-LEA · Post by **Schol-R-LEA** » Wed Mar 07, 2018 7:52 pm

Sorry for the sequential posts; I had to leave for an appointment earlier, before I had a chance to finish what i mean to say, and since I am uncertain if ~ read the previous post or not (the timing is really close to when i last edited it, meaning he may have seen the first posted version but not the fixes) thought it would be better to bite the bullet and post again.

Anyway, the rest of the story here is that, as Brendan said earlier, it isn't entirely clear what you are trying to accomplish in the first place. What is the use case for this? What is the context of the question? What 'missing offsets' are you concerned about?

I mean, honestly, you didn't ask a question, or even make a coherent statement really, you just sort of started into a something as if you were halfway through a conversation. What do you expect us to make of that? (A paper hat, perhaps... I may be dating myself with that reference, but whatev'.)

In the absence of other information I would say, bundle the value into a struct with a Boolean flag to indicate a valid/invalid value,

Code: Select all

typedef struct GOFFSET {
    bool valid;
    unsigned int offset;          // note that the exact size may not matter here        
} guarded_offset;

but I have no idea if that's even an applicable solution. You might use a struct with a bit field instead, if you know how the compiler will lay the values out (or don't need to know), but that can lead to a number of unexpected compiler dependencies.

Code: Select all

typedef struct BF_GOFFSET {
    bool valid: 1;
    uint32_t offset: 31;              // ... but it definitely does here
} bitfield_guarded_offset;

~ · Post by ~ » Thu Mar 08, 2018 5:35 am

The use case is an offset as big as possible, on disk, memory, etc...

I wouldn't like to have, say, 31 bits available for a file offset just to reserve -1 (specifically unsigned 0xFFFFFFFF or 0xFFFFFFFFFFFFFFFF, depending on the register width) as an "offset not found". I could just use an unsigned integer, use the full 32 or 64-bit range and just reserve -1 as a simple bit pattern generated by the compiler to signal "offset not found". 0 is a valid offset, -1 would be like NULL unless we return more than a value at once from a function to check an error/success flag (probably cleaner than reserving special values, like the CPU or BIOS do on error (CF, etc...) and more portable down to the machine level?)

Solar · Post by **Solar** » Thu Mar 08, 2018 5:44 am

This is already being done, using unsigned as index (not offset!) and -1 ("all bits set") as "fault" indicator.

Have you read what was written by others?

No. As usual, you just keep harping on.

bluemoon · Post by **bluemoon** » Thu Mar 08, 2018 7:05 am

~ wrote:31 bits available for a file offset

Wait, this put a 2GB limit on file size. Consider use full 64-bit.

Pype.Clicker · Post by **Pype.Clicker** » Thu Mar 08, 2018 8:58 am

I'd be tempted to use

Code: Select all

INVALID_WHATEVER=~0U/*LL*/

for unsigned /* long long */ types.

Solar · Post by **Solar** » Thu Mar 08, 2018 10:29 am

Pype, consider the StackOverflow link I posted earlier. ~0 works for two's complement, but not for others...

Schol-R-LEA · Post by **Schol-R-LEA** » Thu Mar 08, 2018 11:20 am

~ wrote:The use case is an offset as big as possible, on disk, memory, etc...

OK, first off, as Solar said, it seems you probably mean an 'index' (that is to say, an absolute location within the storage unit, whether that unit is primary memory or a disk partition) rather than an offset (which is relative to some point in the storage, such as the start of a file, the 200th element of the array, what have you). While one can view an index as a special case of an offset¹, the use of arbitrary offset² has very different implications about how you can use them effectively.

Second, I gather you are trying to have a common 'offset type' for all kinds of storage units - that is to say, you would use a variable a type, let's call it offset_t, for both memory and disk structure.

Uhm... I can see this working if you were using a language which supported either inheritance or prototyping³, and had an abstract type/class which the specific cases could inherit a type interface from, but trying to use one index type to rule them all is going to bite you down the road, hard. If nothing else, you would be committing yourself to a specific size for that index on a given system (even if you defined the size itself in some standards document as being system-specific , a la the basic integer types in C), meaning that you will either waste memory where it is too large, or have to work around the offsets when they are too small (or more likely still, both).

~ wrote:I wouldn't like to have, say, 31 bits available for a file offset just to reserve -1 (specifically unsigned 0xFFFFFFFF or 0xFFFFFFFFFFFFFFFF, depending on the register width) as an "offset not found". I could just use an unsigned integer, use the full 32 or 64-bit range and just reserve -1 as a simple bit pattern generated by the compiler to signal "offset not found". 0 is a valid offset, -1 would be like NULL unless we return more than a value at once from a function to check an error/success flag

I know that this sort of in-band error checking and exception flagging is still common - all too common, if you ask me - but it is one of those things that only sounds like a good idea but never really is. It was, unfortunately, enshrined in systems such as Unix back in the 1970s⁴, and from there carried into the C standard, but IMAO that's just one more thing Ken and Dennis are to blame for. I strongly advise not using a bare index or offset at all, and use a separate flag bundled with it in a struct (or use some other out-of-band mechanism if you can) if you can - and since this is your own OS design, and don't intend to adhere to POSIX or some other external standard, there is no reason I can see that you can't.

~ wrote:portable down to the machine level?)

WTF is this supposed to even mean? It's a contradiction in terms!

Footnotes:
1. One that is 'relative' to the start of the whole storage unit.
2. By which I mean an offset that isn't an index, which doesn't necessarily imply 'any possible index', for you rules lawyers out there.
3. With or without polymorphism, though that could be a value too in this.
4. At the time, one could - and many did - argue that this was more efficient due to the limited memory available. However, even at the time this was (IMAO) false fire, as it required a lot more additional code to handle it correctly than you saved. this was a known issue even before Multics was designed, having cropped up in both CTSS and OS/360. In both Unix and MS-DOS (and indeed most contemporary OSes), this was resolved by simply ignoring the problems it caused, a decision we are still paying for today.

OSDev.org

Using -1 with unsigned integers only as missing marker

Using -1 with unsigned integers only as missing marker

Re: Using -1 with unsigned integers only as missing marker

Re: Using -1 with unsigned integers only as missing marker

Re: Using -1 with unsigned integers only as missing marker

Re: Using -1 with unsigned integers only as missing marker

Re: Using -1 with unsigned integers only as missing marker

Re: Using -1 with unsigned integers only as missing marker

Re: Using -1 with unsigned integers only as missing marker

Re: Using -1 with unsigned integers only as missing marker

Re: Using -1 with unsigned integers only as missing marker

Re: Using -1 with unsigned integers only as missing marker

Re: Using -1 with unsigned integers only as missing marker

Re: Using -1 with unsigned integers only as missing marker

Re: Using -1 with unsigned integers only as missing marker

Re: Using -1 with unsigned integers only as missing marker