Similarity detection

DavidCooper
Member
Posts: 1150
Joined: Wed Oct 27, 2010 4:53 pm
Location: Scotland

Similarity detection

Post by DavidCooper »

Is there any kind of processor available (graphics card, perhaps) with instructions that can compare two registers and return a count of all the bits that they have in common, to avoid the need to rotate 32 times and count the surviving bits after ANDing them?

I'm trying to find a way to speed up searches for misspelt words in order to find all the most similar ones quickly, so I'm thinking of having a bitmap of letters used in each word to make this easier. Often there will not be an exact match, though, so I'm looking for a quick way to identify similarity rather than matches. (It's further complicated by the need to work with morphemes [a word list is never complete as new words can be built by adding endings to them], which makes identifying similarity rather than exact matches even more important.) I will be using other methods to reduce the size of the search (not unlike locality-sensitive hashing), but this kind of similarity test will still be needed a lot when going through buckets of similar items.

It occurs to me that an instruction of this kind would be useful for other purposes too, such as with machine vision and hearing - but does such an instruction exist?
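
To make the operation concrete, here it is as a C sketch (my real code is machine code, but the logic is the same) - the counting loop is the part I'd like a single instruction to replace:

    #include <stdint.h>

    /* Count the bits two 32-bit values have in common: AND them,
       then count the surviving set bits one at a time. */
    static int common_bits(uint32_t a, uint32_t b)
    {
        uint32_t v = a & b;   /* bits set in both registers */
        int count = 0;
        int i;
        for (i = 0; i < 32; i++) {
            count += v & 1;   /* test the lowest bit */
            v >>= 1;          /* bring the next bit down */
        }
        return count;
    }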
Help the people of Laos by liking - https://www.facebook.com/TheSBInitiative/?ref=py_c

MSB-OS: http://www.magicschoolbook.com/computing/os-project - direct machine code programming
Combuster
Member
Posts: 9301
Joined: Wed Oct 18, 2006 3:45 am
Libera.chat IRC: [com]buster
Location: On the balcony, where I can actually keep 1½m distance

Re: Similarity detection

Post by Combuster »

"Certainly avoid yourself. He is a newbie and might not realize it. You'll hate his code deeply a few years down the road." - Sortie
[ My OS ] [ VDisk/SFS ]
DavidCooper
Member
Posts: 1150
Joined: Wed Oct 27, 2010 4:53 pm
Location: Scotland

Re: Similarity detection

Post by DavidCooper »

Thanks Combuster - you've put me on the right track yet again. POPCNT (bit population count) is exactly what I need, and on processors that don't support it I'll experiment with using a lookup table to do eight bits at a time. (I've also found an interesting page which makes some helpful suggestions about how to get better performance out of POPCNT, so that may help anyone else who now realises they could make use of this instruction: http://danluu.com/assembly-intrinsics/.)
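
For anyone following along, the table-driven fallback I have in mind would look something like this in C (a sketch only - I work in machine code, and the names are my own):

    #include <stdint.h>

    static uint8_t popcnt8[256];   /* set-bit count for every byte value */

    /* Fill the table once at startup: each entry is built from the
       entry for the value shifted right by one, plus its lowest bit. */
    static void init_popcnt8(void)
    {
        int i;
        for (i = 1; i < 256; i++)
            popcnt8[i] = (uint8_t)(popcnt8[i >> 1] + (i & 1));
    }

    /* Four table lookups cover a 32-bit word, eight bits at a time. */
    static int popcount32(uint32_t v)
    {
        return popcnt8[v & 0xFF] + popcnt8[(v >> 8) & 0xFF]
             + popcnt8[(v >> 16) & 0xFF] + popcnt8[v >> 24];
    }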
Help the people of Laos by liking - https://www.facebook.com/TheSBInitiative/?ref=py_c

MSB-OS: http://www.magicschoolbook.com/computing/os-project - direct machine code programming
Arto
Member
Posts: 44
Joined: Wed May 15, 2013 5:49 pm
Location: Berlin

Re: Similarity detection

Post by Arto »

See also Sean Eron Anderson's well-known Bit Twiddling Hacks page for an exposition of techniques to use when the instruction isn't available. A good resource to bookmark for bit-fiddling in general.
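
For instance, the parallel (SWAR) count from that page goes like this in C - no table and no branches, just a handful of arithmetic operations:

    #include <stdint.h>

    /* Fold the word in halves: count bits in each 2-bit field,
       then 4-bit fields, then bytes, then sum the four byte counts. */
    static int popcount32(uint32_t v)
    {
        v = v - ((v >> 1) & 0x55555555u);                  /* 2-bit sums */
        v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);  /* 4-bit sums */
        v = (v + (v >> 4)) & 0x0F0F0F0Fu;                  /* byte sums  */
        return (int)((v * 0x01010101u) >> 24);             /* total ends up in the top byte */
    }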
Developer of libc11
DavidCooper
Member
Posts: 1150
Joined: Wed Oct 27, 2010 4:53 pm
Location: Scotland

Re: Similarity detection

Post by DavidCooper »

Arto wrote:See also Sean Eron Anderson's well-known Bit Twiddling Hacks page for an exposition of techniques to use when the instruction isn't available. A good resource to bookmark for bit-fiddling in general.
Yes, that was linked to from Combuster's link, along with a few other pages like it, but thanks anyway as I might easily have missed it. There are clearly a lot of good tricks available for doing many things. It looks as if I should use Brian Kernighan's method (count the repetitions of "v &= v - 1" until v reaches zero and the zero flag is set - I think; I haven't tried it yet) rather than a lookup table, as the number of set bits will always be fairly low when identifying misspelt words. It might even be faster than POPCNT on average for this task, so I'll have to do some tests to find out.
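
In C, the loop I'm picturing is just this (an untested sketch on my part, as I say):

    /* Kernighan's method: v &= v - 1 clears the lowest set bit, so
       the loop runs once per set bit - cheap when few bits are set. */
    static int popcount_sparse(unsigned int v)
    {
        int count = 0;
        while (v != 0) {
            v &= v - 1;   /* knock out the lowest set bit */
            count++;
        }
        return count;
    }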
Help the people of Laos by liking - https://www.facebook.com/TheSBInitiative/?ref=py_c

MSB-OS: http://www.magicschoolbook.com/computing/os-project - direct machine code programming
SpyderTL
Member
Posts: 1074
Joined: Sun Sep 19, 2010 10:05 pm

Re: Similarity detection

Post by SpyderTL »

I'm not quite sure how counting bits is going to help you find misspelled words...

What exactly is the plan once you have a super fast bit counting function?
Project: OZone
Source: GitHub
Current Task: LIB/OBJ file support
"The more they overthink the plumbing, the easier it is to stop up the drain." - Montgomery Scott
embryo

Re: Similarity detection

Post by embryo »

SpyderTL wrote:What exactly is the plan once you have a super fast bit counting function?
As I see it, there would be some word-matching loop which sets a bit in a long machine word whenever the two words have the same letter at the same position. The next step is to count the matches. If the count is above a threshold, we can consider the words similar (misspelled, or with different endings/suffixes, or in another form, as is the case for some verbs). But of course, such an algorithm is only useful for one type of word difference - when just a few letters (or only one) have been replaced with others.
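
A sketch of that in C, just to make the loop concrete (the names and the threshold test are mine):

    #include <stdint.h>
    #include <string.h>

    /* Mark bit i of a mask when the two words have the same letter at
       position i, then count the marks and compare with a threshold. */
    static int words_similar(const char *a, const char *b, int threshold)
    {
        uint32_t mask = 0;
        size_t la = strlen(a), lb = strlen(b);
        size_t n = la < lb ? la : lb;       /* compare the overlap only */
        size_t i;
        int count = 0;

        for (i = 0; i < n && i < 32; i++)
            if (a[i] == b[i])
                mask |= (uint32_t)1 << i;   /* letters agree here */

        while (mask != 0) {                 /* count the set bits */
            mask &= mask - 1;
            count++;
        }
        return count >= threshold;
    }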
Brendan
Member
Posts: 8561
Joined: Sat Jan 15, 2005 12:00 am
Location: At his keyboard!

Re: Similarity detection

Post by Brendan »

Hi,
embryo wrote:
SpyderTL wrote:What exactly is the plan once you have a super fast bit counting function?
As I see it, there would be some word-matching loop which sets a bit in a long machine word whenever the two words have the same letter at the same position. The next step is to count the matches. If the count is above a threshold, we can consider the words similar (misspelled, or with different endings/suffixes, or in another form, as is the case for some verbs). But of course, such an algorithm is only useful for one type of word difference - when just a few letters (or only one) have been replaced with others.
That sounds less than ideal to me. For example, if someone types in "succesful", then "sucked" still matches it in four positions (the s, u, c and e), so an unrelated word can score nearly as well as the intended one.

I'd be tempted to use something like enumerated morphemes (and not letters). For example:
  • 1 = "suc", "succ" and "suck"
  • 2 = "ess" and "es"
  • 3 = "ful" and "full"
Then I'd have some sort of "for each morpheme in word { next = get_code_for_next_morpheme; value = value * MAX + next};" calculation. In that case all of the permutations ("sucessful", "successful", "suckessful", "sucesful", "succesful", "suckesful", ... ) end up with the exact same value; and that value would be used to find a list of valid words for that value ("successful" and no others). Of course for some cases (e.g. maybe someone typed "baer" and the list of valid words is "beer", "bear" and "bare") you might want an additional step to determine the probability of each word being the desired word (but in that case I'd expect the list/s to be short and performance to matter less).
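
A rough sketch of that in C, with a toy morpheme table covering just the example above (the table and all the names are made up for illustration):

    #include <stddef.h>
    #include <string.h>

    #define MAX 1000   /* must exceed the number of morpheme codes */

    /* Toy table: every spelling variant maps to the same code.
       Longer variants come first so matching is greedy. */
    static const struct { const char *text; int code; } morphemes[] = {
        { "succ", 1 }, { "suck", 1 }, { "suc", 1 },
        { "ess",  2 }, { "es",   2 },
        { "full", 3 }, { "ful",  3 },
    };

    /* Return the code of the morpheme at *s and advance past it,
       or 0 if nothing in the table matches. */
    static int get_code_for_next_morpheme(const char **s)
    {
        size_t i;
        for (i = 0; i < sizeof morphemes / sizeof morphemes[0]; i++) {
            size_t n = strlen(morphemes[i].text);
            if (strncmp(*s, morphemes[i].text, n) == 0) {
                *s += n;
                return morphemes[i].code;
            }
        }
        return 0;
    }

    /* "sucessful", "succesful" and "successful" all come out as
       1002003, so one lookup finds the list of valid words. */
    static int word_value(const char *word)
    {
        int value = 0;
        while (*word) {
            int next = get_code_for_next_morpheme(&word);
            if (next == 0)
                return -1;   /* unknown morpheme: no value */
            value = value * MAX + next;
        }
        return value;
    }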


Cheers,

Brendan
For all things; perfection is, and will always remain, impossible to achieve in practice. However; by striving for perfection we create things that are as perfect as practically possible. Let the pursuit of perfection be our guide.
DavidCooper
Member
Posts: 1150
Joined: Wed Oct 27, 2010 4:53 pm
Location: Scotland

Re: Similarity detection

Post by DavidCooper »

SpyderTL wrote:I'm not quite sure how counting bits is going to help you find misspelled words...

What exactly is the plan once you have a super fast bit counting function?
There are 26 letters in English plus apostrophe (') and hyphen (-), so each one has a bit dedicated to it in a 32-bit space. The word "alphabet" would have associated with it the bitmap 00000000,00001000,10001000,10010011. If someone were to write "laphabet" by mistake, the bitmaps would still match. If they wrote "alphabt", the bitmaps would be out by one, but there would be very few words that come this close to being correct: "phablet" would be one of them. For short words like "the", any word containing those letters (such as "alphabet") will look like a perfect match using this method, so there needs to be a letter count for each word as well, to eliminate all the ones that are two or more letters away from the right length (a second scan can be made allowing for a bigger difference if nothing useful comes out of the first scan). [Another part of the process will be guessing at the meaning from the context and then looking up possible concepts to see if any of them map to a word that looks similar, and that should probably be done first.]
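
In C, for clarity (my own version is machine code), building the bitmap and measuring how far out two bitmaps are looks like this:

    #include <stdint.h>
    #include <ctype.h>

    /* Bit 0 = 'a' ... bit 25 = 'z', bit 26 = apostrophe, bit 27 = hyphen.
       "alphabet" produces 00000000,00001000,10001000,10010011. */
    static uint32_t letter_bitmap(const char *word)
    {
        uint32_t map = 0;
        for (; *word != '\0'; word++) {
            unsigned char c = (unsigned char)*word;
            if (isalpha(c))
                map |= (uint32_t)1 << (tolower(c) - 'a');
            else if (c == '\'')
                map |= (uint32_t)1 << 26;
            else if (c == '-')
                map |= (uint32_t)1 << 27;
        }
        return map;
    }

    /* Number of letters in one word but not the other: 0 means the
       bitmaps match, 1 means "out by one", and so on. */
    static int bitmap_distance(uint32_t a, uint32_t b)
    {
        uint32_t v = a ^ b;
        int count = 0;
        while (v != 0) {
            v &= v - 1;   /* Kernighan count - few bits will differ */
            count++;
        }
        return count;
    }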

The simple way to apply this method would be to run through the entire lexicon comparing the bitmap generated from the misspelt word (or just part of the word if you can recognise and remove any prefixes/suffixes), but a hashing method would speed it up. I don't know how to go about that at the moment, but that can be bolted on later when there's much more to be gained from doing it that way (i.e. by the time the system's ready to be put to work reading through astronomical numbers of social media posts). For now, what matters is making lexicon scans as fast as they can be by minimising the processing done when looking at each entry, so efficient counting of matching bits in bitmaps is a good way to improve performance.
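
The scan itself then only needs a length check, an XOR and a bit count per entry - something like this sketch (the names are made up, and bitmap_distance is the helper from the sketch above):

    /* Each lexicon entry carries its precomputed bitmap and letter
       count, so the per-entry work stays minimal. */
    struct entry {
        const char *word;
        uint32_t map;
        int letters;
    };

    static const struct entry *closest(const struct entry *lex, int n,
                                       uint32_t map, int letters)
    {
        const struct entry *best = 0;
        int best_d = 3;   /* accept at most two differing bits */
        int i;
        for (i = 0; i < n; i++) {
            int dl = lex[i].letters - letters;
            if (dl < -1 || dl > 1)
                continue;    /* two or more letters off: skip it */
            int d = bitmap_distance(map, lex[i].map);
            if (d < best_d) {
                best_d = d;
                best = &lex[i];
            }
        }
        return best;
    }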

(It was rclentey dreveisocd taht txed lkie tihs is frilay esay for polepe to raed - most people can read that kind of thing amazingly well, even if they're normally no good at solving anagrams. That was what originally led me to the idea of using a bitmap, but it should extend well to dealing with spelling errors too. 7h3r3 4r3 07h3r w4y5 0f 7ry1n9 70 533 1f y0u'r3 741k1ng 70 4 p3r50n 0r 4 m45h1n3, and different approaches have to be used to get round these attempts. For now, though, I'm more concerned with handling typing errors - but there's no harm in thinking further ahead.)
Help the people of Laos by liking - https://www.facebook.com/TheSBInitiative/?ref=py_c

MSB-OS: http://www.magicschoolbook.com/computing/os-project - direct machine code programming
DavidCooper
Member
Posts: 1150
Joined: Wed Oct 27, 2010 4:53 pm
Location: Scotland

Re: Similarity detection

Post by DavidCooper »

Brendan wrote:I'd be tempted to use something like enumerated morphemes (and not letters). For example:
  • 1 = "suc", "succ" and "suck"
  • 2 = "ess" and "es"
  • 3 = "ful" and "full"
Then I'd have some sort of "for each morpheme in word { next = get_code_for_next_morpheme; value = value * MAX + next};" calculation. In that case all of the permutations ("sucessful", "successful", "suckessful", "sucesful", "succesful", "suckesful", ... ) end up with the exact same value; and that value would be used to find a list of valid words for that value ("successful" and no others). Of course for some cases (e.g. maybe someone typed "baer" and the list of valid words is "beer", "bear" and "bare") you might want an additional step to determine the probability of each word being the desired word (but in that case I'd expect the list/s to be short and performance to matter less).
Sum spelling iz sew far a way from the norm that an other app roach iz required all two gether. I'm going to tackle it by coming up with rules for converting writing to a phonetic stream using a more consistent phonetic spelling system (tied to a phonetic lexicon) and then interpret the stream in the same way you would if you were handling speech input (from a microphone), but that's some way down the track at the minute.
Help the people of Laos by liking - https://www.facebook.com/TheSBInitiative/?ref=py_c

MSB-OS: http://www.magicschoolbook.com/computing/os-project - direct machine code programming
embryo

Re: Similarity detection

Post by embryo »

Brendan wrote:That sounds less than ideal to me.
There's no absolute ideal in our world :)

My guess was about a possible use case, but even that guess turned out to be wrong :)