Using Rabin-Karp's Algorithm to Find All Occurrences of Substring

Use Case:

The Rabin-Karp algorithm is an algorithm that helps to find all occurrences of a substring in a given text based on hash values. The substring is hashed initially to produce a hash value (a given number in range) and that value is then compared against when iterating through the given text.

There are ways to improve this algorithm but, for now, below is the standard implementation:

JavaScript

function areEqual(S1, S2) {
	if (S1 !== S2) {
		return false
	}
	for (i = 0; i < S1.length; i++) {
		if (S1[i] !== S2[i]) {
			return false
		}
	}
	return true
}

We start by declaring an 'areEqual' function that will take 2 string inputs, iterate through them both, and then return true if they match and false if they don't.

We'll only need this function if the substring and text hash values match to verify a match (in the case of possible hash collision). Otherwise, we'll just ignore this function.

JavaScript

function polyHash(S, p, x) {
	var hash = 0
	for (i = 0; i <= S.length - 1; i++) {
		hash = (hash * x + S.charCodeAt(i)) % p
	}
	return hash
}

Our 'polyHash' function is what we'll use to create hash values of our strings. The function will receive a string (S), a large prime number (p) and a value of x, then use the values to create a hash value along with the native JavaScript function 'charCodeAt' to return the unicode of the current character.

JavaScript

function rabinKarp(T, P) {
	var p = 1019
	x = 34
	var positions = []
	pHash = polyHash(P,p,x)
var text
	var tHash

	// Loop through text
	for (k = 0; k <= (T.length - P.length); k++) {
		text = T.slice(k, (k + P.length))

		tHash = polyHash(text,p,x)

		// If hashes don't match, continue to next loop
		if (pHash !== tHash) {
			continue
		}

		// If hashes do match, push locations to positions list
		if (areEqual(text,P)) {
			positions.push(k)
		}
	}
	return positions
}

var str = 'magicword Lorem ipsum dolor sit magicword amet, an theophrastus deterruisset est. Luptatum consectetuer ex nam. Mea ei blandit reprimique, has at agam adipiscing. Ea odio habeo honestatis duo. Tibique iudicabit corrumpit sed at. Ei mei ullum ornatus magicword corrumpit, te nec quodsi imperdiet euripidis magicword'
console.log(rabinKarp(str, 'magicword')) // [0,32,250,305]

Our 'rabinKarp' function is what we'll call to run the search. This function will receive the body text and the substring that we're looking to find. Here is also where we'll declare our large prime number (p) - the larger the value the less likely we experience hash collision - and our value for x (any random number from 1 to the maximum size of the input).

We'll also declare an empty 'positions' array that will hold the starting character of every substring occurrence and a 'pHash' variable that will hold the hash value of our substring.

The function will start by looping through the body text - character by character. With each loop, it will reestablish the 'text' variable based on the location of the currently iterated character in body text and the length of the substring. It will then hash this value using our 'polyHash' function.

Then it will compare the 'tHash' value with the 'pHash' value. If they don't match, we want to avoid running the 'areEqual' function and instead continue on with looping through the body text string. If they do match, they we want to verify that the strings do indeed match by checking with the 'areEqual' function. If 'areEqual' returns true than we'll add the character location to our 'positions' array. Finally, we'll return the 'positions' array.

Then we'll set 'str' equal to the string to be used as the body text and call the 'rabinKarp' function inside of a console.log function.

Tweet me @tylerewillis

Or send me an email: