Sitemap

Dealing with string matching gracefully

3 min readJan 13, 2024

Aho-Corasick

It is a word that I remember vividly since the university time. The first time I noticed, it was written in one of the lecturer’s slides. This name intrigued me not only because of its unusual name, but also what it had to do with computer science. Since then, I tried to find out what it actually does and turns out it is an algorithm that combine the ideas in KMP algorithm with those of finite state machines.

“Sigh, another string matching algoritm” ━━ my thought at that time

Since I found string matching was very hard to understand, I don’t bother not to put myself into more confusion by learning this, at least I know what it is used for, my naive thought at that time.

Long story short, in my latest job, I am required to perform an efficient string matching in an all-limited environment. This might provoke some of you to think why an iOS engineer had to put all the hardwork in the client side? this is due to the current circumstances and constraints that we had, that, we don’t have any better choices at the moment.

The job might seemed very intimidating at the moment, and to make things even worse, the constraints given were quite high, which means naive solution is not guaranteed to be working all the time, I had to come up with something better.

This is when I thought of the Aho-Corasick. After researching and reading the original paper. I was amazed of how Alfred V. Aho and Margaret J. Corasick’s intuition works for them to be able to deduct the string matching process into such a beautiful, mathematically proven and efficient process. The algorithm is able to find any words form a finite set of dictionary that is also a substring in a given string in a linear time.

In short, this algorithm utilizes a trie where each node have this property

  • Symbolizes a character
  • Has a reference to another node which have the longest suffix of the node’s character, this reference is often called failure transitions or failure links.
  • If the node is leaf node, it stores an array of the the current word.

This failure transitions enable the algorithm to perform the a linear time search, that is, if the current character is not matching, we can follow the failure link to start matching again from the failure node, this way, we do not retry the search from the node and save a lot of time.

The matching process is described as we traverse the trie simply like an automaton, If yes, continue, otherwise follow the failure node. and along the node visit we collect the stored word in a node if it exists.

How’s the Performance?

Performance is what we truly care even after we are able to understand how it works. We did some benchmarking on Pixel 4 phone and the result is not dissapointing

Unfortunately I cannot share the code since our version has been modified to support our business need😛, but maybe I can share the pseudocode, which is

1. Loop the dictionary of words, constructing prefix trie
2. In each node, link the node to its failure node
3. When matching, collects the associated words of each node if exists.

Until next time👋

--

--

Arya Surya
Arya Surya

Written by Arya Surya

Mostly writes about frontend, including but not limited to iOS and web development. Go follow @agustinustheoo for other stuff

No responses yet