This week, someone on Discord was looking for a way to identify a particular word in a block of Arabic text, both with and without Arabic diacritics.

So searching for the string: مِمَّا

Would also return the string: مما

It’s not immediately obvious how this can be achieved. A straight comparison does not match the characters with diacritics to the characters without them, because the two strings are made up of different sequences of Unicode code points.
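To make the problem concrete, here is a minimal sketch of the failing comparison. The strings are written as \u escapes, and the exact order of the shadda and fatha in the pointed word is an assumption on my part; the variable names are purely illustrative.

```javascript
// Code points are written as \u escapes so the example does not depend on
// how an editor stores or displays the combining marks.
// Assumed spelling: meem + kasra, meem + shadda + fatha, alef.
const withMarks = '\u0645\u0650\u0645\u0651\u064E\u0627';
// The bare word: meem, meem, alef.
const plain = '\u0645\u0645\u0627';

console.log(withMarks === plain);            // false
console.log(withMarks.includes(plain));      // false
console.log(withMarks.length, plain.length); // 6 3
```

The pointed word is twice as long as the bare one, so neither an equality check nor a substring search will pair them up.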

However, after a little bit of digging into the Arabic Unicode character tables, I realised it was actually quite trivial to do the matching with a couple of lines of JavaScript.

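The trick is to build a regular expression from ranges of Unicode values (written as UTF-16 escapes) and pass it to String.prototype.replace(), replacing every match with an empty string. The ranges list the characters to keep: the Arabic letters, the Arabic-Indic digits, Latin letters, ASCII digits and spaces. The leading ^ negates the set, so everything else, including the diacritics, is matched and removed.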

This is the regex you need to pass into the String.prototype.replace() function as the first argument.

/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g

The second argument to the String.prototype.replace() function needs to be an empty string. Here’s the full line:

searchText.replace(/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g, '');
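To see what that call actually removes, it can help to list the code points before and after. This is only a sketch: toCodePoints is a hypothetical helper written for this post, and the sample word is spelled out with \u escapes in an assumed mark order.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function toCodePoints(text) {
    return Array.from(text, (ch) =>
        'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
    );
}

// The same pattern as in the line above.
const keepPattern = /([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g;

// Assumed spelling: meem + kasra, meem + shadda + fatha, alef.
const searchText = '\u0645\u0650\u0645\u0651\u064E\u0627';
const cleaned = searchText.replace(keepPattern, '');

console.log(toCodePoints(searchText));
// [ 'U+0645', 'U+0650', 'U+0645', 'U+0651', 'U+064E', 'U+0627' ]
console.log(toCodePoints(cleaned));
// [ 'U+0645', 'U+0645', 'U+0627' ]
```

The three values that disappear (U+0650, U+0651 and U+064E) are the kasra, shadda and fatha; the letters themselves are untouched.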

This works because, when broken down into their Unicode code points, Arabic characters with diacritics aren’t different characters; each one is the base character followed by one or more combining marks. That means we can strip the marks (by replacing them with an empty string) without affecting the rest of the word:

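function lookUpText(searchText){
    // Remove everything except Arabic letters, Arabic-Indic digits, Latin letters, ASCII digits and spaces
    let searchTextClean = searchText.replace(/([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g, '');
    // Keep both forms: the text as typed and the text without diacritics
    let searchArray     = [searchText, searchTextClean];

    console.log(searchArray);
    return searchArray;
}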
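The function above only prepares the search term. One possible way to use the same idea against a block of text is to clean both sides before comparing, as in the sketch below. containsIgnoringDiacritics and the sample block are hypothetical, and the check is a plain substring match rather than a whole-word match.

```javascript
// The same pattern as in lookUpText().
const NON_BASE = /([^\u0621-\u063A\u0641-\u064A\u0660-\u0669a-zA-Z 0-9])/g;

// Hypothetical helper: true if the cleaned search term appears anywhere
// in the cleaned block of text (substring match, not whole-word match).
function containsIgnoringDiacritics(blockText, searchText) {
    const cleanBlock = blockText.replace(NON_BASE, '');
    const cleanSearch = searchText.replace(NON_BASE, '');
    // An empty search term would match everything, so treat it as no match.
    return cleanSearch.length > 0 && cleanBlock.includes(cleanSearch);
}

// Illustrative block: the pointed word between two Latin placeholders.
const block = 'abc \u0645\u0650\u0645\u0651\u064E\u0627 xyz';

console.log(containsIgnoringDiacritics(block, '\u0645\u0645\u0627'));                   // true
console.log(containsIgnoringDiacritics(block, '\u0645\u0650\u0645\u0651\u064E\u0627')); // true
console.log(containsIgnoringDiacritics('abc xyz', '\u0645\u0645\u0627'));               // false
```

Cleaning both sides means it no longer matters whether the diacritics are in the search term, in the text, or in both.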

That’s it! Pass the search word into the function and you’ll get back an array containing the Arabic text with and without diacritics. In principle, the same approach should work with any language that has diacritic characters. You just need to adjust the Unicode ranges in the regex that is passed into the String.prototype.replace() function as the first argument, to make sure you’re targeting the right Unicode characters.
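As a rough sketch of what adjusting the ranges could look like, the helper below takes the pattern as a parameter. Two caveats apply. The pattern used in this post is a keep-list, so it also drops punctuation; a narrower pattern that names only the marks avoids that. Also, in some scripts an accented letter can be stored as a single precomposed code point, in which case the string has to be decomposed with String.prototype.normalize('NFD') first. stripMarks and both constants are hypothetical names, and the Arabic range shown (U+064B to U+0652) covers only the common marks and is not exhaustive.

```javascript
// Hypothetical helper: remove whatever the supplied pattern matches.
function stripMarks(text, markPattern) {
    return text.replace(markPattern, '');
}

// Assumed range: the common Arabic marks from fathatan (U+064B) to sukun (U+0652).
const ARABIC_MARKS = /[\u064B-\u0652]/g;

// The Combining Diacritical Marks block used with Latin letters (U+0300 to U+036F).
const LATIN_COMBINING_MARKS = /[\u0300-\u036F]/g;

// Unlike the keep-list pattern, this leaves the exclamation mark in place.
const arabic = stripMarks('\u0645\u0650\u0645\u0651\u064E\u0627!', ARABIC_MARKS);
console.log(arabic === '\u0645\u0645\u0627!'); // true

// U+00E9 is a precomposed "e with acute", so decompose it before stripping.
const latin = stripMarks('caf\u00E9'.normalize('NFD'), LATIN_COMBINING_MARKS);
console.log(latin); // cafe
```

Which style suits better depends on the input: the keep-list is stricter about what survives, while the narrow pattern leaves punctuation and characters from other scripts alone.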

Written by Thomas

18th June 2020