Similarity detection

DavidCooper · Post by **DavidCooper** » Tue Dec 23, 2014 3:29 pm

Is there any kind of processor available (graphics card, perhaps) with instructions that can compare two registers and return a count of all the bits that they have in common to avoid the need to rotate 32 times and count the surviving bits after ANDing them? I'm trying to find a way to speed up searches for misspelt words in order to find all the most similar ones quickly, so I'm thinking of having a bitmap of letters used in each word to make this easier, but often there will not be an exact match, so I'm looking for a quick way to identify similarity rather than matches. (It's further complicated by the need to work with morphemes [a word list is never complete as new words can be built by adding endings to them], which makes identifying similarity rather than exact matches even more important.) I will be using other methods to reduce the size of the search (not unlike locality-sensitive hashing), but this kind of similarity test will still be needed a lot when going through buckets of similar items. It occurs to me that an instruction of this kind would be useful for other purposes too, such as with machine vision and hearing, but does such an instruction exist?

Combuster · Post by **Combuster** » Tue Dec 23, 2014 5:16 pm

All the algorithms (and some SSE instructions)

DavidCooper · Post by **DavidCooper** » Wed Dec 24, 2014 1:14 pm

Combuster wrote:All the algorithms (and some SSE instructions)

Thanks Combuster - you've put me on the right track yet again. POPCNT (bit population count) is exactly what I need, and on processors that don't support it I'll experiment with using a lookup table to do eight bits at a time. (I've also found an interesting page which makes some helpful suggestions about how to get better performance out of POPCNT, so that may help anyone else who now realises they could make use of this instruction: http://danluu.com/assembly-intrinsics/.)

Arto · Post by **Arto** » Wed Dec 24, 2014 1:38 pm

See also Sean Eron Anderson's well-known Bit Twiddling Hacks page for an exposition of techniques to use when the instruction isn't available. A good resource to bookmark for bit-fiddling in general.

DavidCooper · Post by **DavidCooper** » Wed Dec 24, 2014 3:15 pm

Arto wrote:See also Sean Eron Anderson's well-known Bit Twiddling Hacks page for an exposition of techniques to use when the instruction isn't available. A good resource to bookmark for bit-fiddling in general.

Yes, that was linked on to from Combuster's link along with a few other pages like it, but thanks anyway as I might easily have missed it. There are clearly a lot of good tricks available for doing many things. It looks as if I should use Brian Kernighan's method (count repetitions of "v &= v - 1" until zero flag = 0, I think - haven't tried it yet) rather than a lookup table as the number of set bits will always be fairly low when identifying misspelt words. It might even be faster than POPCNT on average for this task, so I'll have to do some tests to find out.

SpyderTL · Post by **SpyderTL** » Thu Dec 25, 2014 12:06 am

I'm not quite sure how counting bits is going to help you find misspelled words...

What exactly is the plan once you have a super fast bit counting function?

embryo · Post by **embryo** » Thu Dec 25, 2014 7:32 am

SpyderTL wrote:What exactly is the plan once you have a super fast bit counting function?

As I see it there should be some word matching loop which marks a bit in a long machine word if two letters of two words at the same position are equal. Next step is to count match number. If match is more than a threshold then we can consider the words as similar (but misspelled or with different endings/suffixes or in another form as is the case for some verbs). But of course, such algorithm is useful for only one type of word differences - when there are just a few (or just one) letters replaced with some others.

Brendan · Post by **Brendan** » Thu Dec 25, 2014 11:37 am

Hi,

embryo wrote:
SpyderTL wrote:What exactly is the plan once you have a super fast bit counting function?
As I see it there should be some word matching loop which marks a bit in a long machine word if two letters of two words at the same position are equal. Next step is to count match number. If match is more than a threshold then we can consider the words as similar (but misspelled or with different endings/suffixes or in another form as is the case for some verbs). But of course, such algorithm is useful for only one type of word differences - when there are just a few (or just one) letters replaced with some others.

That sounds less than ideal to me. For example, if someone types in "succesful", then both "successful" and "sucked" have 4 letters that match.

I'd be tempted to use something like enumerated morphemes (and not letters). For example:

1 = "suc", "succ" and "suck"
2 = "ess" and "es"
3 = "ful" and "full"

Then I'd have some sort of "for each morphene in word { next = get_code_for_next_morphene; value = value * MAX + next};" calculation. In that case all of the permutations ("sucessful", "successful", "suckessful", "sucesful", "succesful", "suckesful", ... ) end up with the exact same value; and that value would be used to find a list of valid words for that value ("successful" and no others). Of course for some cases (e.g. maybe someone typed "baer" and the list of valid words is "beer", "bear" and "bare") you might want an additional step to determine the probability of each word being the desired word (but in that case I'd expect the list/s to be short and performance to matter less).

Cheers,

Brendan

DavidCooper · Post by **DavidCooper** » Thu Dec 25, 2014 3:05 pm

SpyderTL wrote:I'm not quite sure how counting bits is going to help you find misspelled words...

What exactly is the plan once you have a super fast bit counting function?

There are 26 letters in English plus apostrophe (') and hyphen (-), so each one has a bit dedicated to it in a 32-bit space. The word "alphabet" would have associated with it the bitmap 00000000,00001000,10001000,10010011. If someone was to write "laphabet" by mistake, the bitmaps would still match. If they wrote "alphabt", the bitmaps would be out by one, but there would be very few words that come this close to being correct: "phablet" would be one of them. For short words like "the", any word containing those letters (such as "alphabet" will look like a perfect match for it using this method, so there needs to be a letter count for each word as well to eliminate all the ones that are two or more letters away from the right length (a second scan can be made allowing for a bigger difference if nothing useful comes out of the first scan). [Another part of the process will be guessing at the meaning from the context and then looking up possible concepts to see if any of them map to a word that looks similar, and that should probably be done first.]

The simple way to apply this method would be to run through the entire lexicon comparing the bitmap generated from the misspelt word (or just part of the word if you can recognise and remove any prefixes/suffixes), but a hashing method would speed it up. I don't know how to go about that at the moment, but that can be bolted on later when there's much more to be gained from doing it that way (i.e. by the time the system's ready to be put to work reading through astronomical numbers of social media posts). For now, what matters is making lexicon scans as fast as they can be by minimising the processing done when looking at each entry, so efficient counting of matching bits in bitmaps is a good way to improve performance.

(It was rclentey dreveisocd taht txed lkie tihs is frilay esay for polepe to raed - most people can read that kind of thing amazingly well, even if they're normally no good at solving anagrams. That was what originally led me to the idea of using a bitmap, but it should extend well to dealing with spelling errors too. 7h3r3 4r3 07h3r w4y5 0f 7ry1n9 70 533 1f y0u'r3 741k1ng 70 4 p3r50n 0r 4 m45h1n3, and different approaches have to be used to get round these attempts. I'm more concerned at the moment though with handling typing errors, but there's no harm in thinking further ahead.)

DavidCooper · Post by **DavidCooper** » Thu Dec 25, 2014 3:15 pm

Brendan wrote:I'd be tempted to use something like enumerated morphemes (and not letters). For example:
1 = "suc", "succ" and "suck"
2 = "ess" and "es"
3 = "ful" and "full"
Then I'd have some sort of "for each morphene in word { next = get_code_for_next_morphene; value = value * MAX + next};" calculation. In that case all of the permutations ("sucessful", "successful", "suckessful", "sucesful", "succesful", "suckesful", ... ) end up with the exact same value; and that value would be used to find a list of valid words for that value ("successful" and no others). Of course for some cases (e.g. maybe someone typed "baer" and the list of valid words is "beer", "bear" and "bare") you might want an additional step to determine the probability of each word being the desired word (but in that case I'd expect the list/s to be short and performance to matter less).

Sum spelling iz sew far a way from the norm that an other app roach iz required all two gether. I'm going to tackle it by coming up with rules for converting writing to a phonetic stream using a more consistent phonetic spelling system (tied to a phonetic lexicon) and then interpret the stream in the same way you would if you were handling speech input (from a microphone), but that's some way down the track at the minute.

embryo · Post by **embryo** » Fri Dec 26, 2014 6:48 am

Brendan wrote:That sounds less than ideal to me.

There's no absolute ideal in our world

My guess was about possible use case, but even guessed case was wrong

OSDev.org

Similarity detection

Similarity detection

Re: Similarity detection

Re: Similarity detection

Re: Similarity detection

Re: Similarity detection

Re: Similarity detection

Re: Similarity detection

Re: Similarity detection

Re: Similarity detection

Re: Similarity detection

Re: Similarity detection