Fast pattern search in Swift... since 1974

the fastest known ways to find one string of characters in another 1

Stay with me. At the beginning brief introduction, later on Swift implementation, tests and conclusion.

Halt and Catch Fire thing

I really enjoy Halt and Catch Fire tv show. Season 1 was about hardware, while season 2 is all about software. All of the characters are imaginary, with some reference to widely known individuals from the business back in '80. I like the show because it reminds me that so much cool stuff was invented back then, and is just re-invented or implemented in the new way today. I watch the episode and think: this could happen today, regardless we work on MacBooks not on Commodore. When Cameron tap on keyboard I can see myself doing the very same thing, only in different way.

Today I want to tell a story of one of the idea from '74 we all use today. Meet the Robert S. Boyer (aka Bob Boyer) and J Strother Moore, two bright individuals who invented The Boyer-Moore Fast String Searching Algorithm which "is the basis of the fastest known ways to find one string of characters in another"1

Robert S. Boyer and J Strother Moore

History and theory

In 1974. It is 41 years ago heroes of this post started with the first idea, they finish their work two years later in 1976 after gather some valuable feedback, from great people like Donald E. Knuth already worked on his search algorithm, published later as Knuth–Morris–Pratt string searching algorithm joint work. Interesting fact, at that time J S. Moore worked at Xerox Palo Alto Research Center, the same place where invented GUI and all that cool stuff. What a great place to work it had to be. And they don't even had iPhones.

So it seems Boyer and Moore heard about Knuth idea and came up with something better, by reversing the way Knuth algorithm work. That is the disruption we talking about! Take something and think about it in different way, resulting with something a way better! This is Silicon Valley spirit or our time. It always been like this. 41 year ago! In the computer history i seems like they had dinosaurs in the backyard.

The thing is, you and me, we all use this idea, in 2015. Can you imagine that, reading this text sitting in the San Francisco caffe?

Reading original publication, when authors mention details about PDP-10 implementation caveats, is nothing but icing on the cake. Not much of us really cares that the crucial part could be reduced to three assembly instructions (uh!).

In theory

The Boyer-Moore Fast String Searching Algorithm was considered as the most efficient text-matching algorithm. It has been formed as opposite to naive approach. Over the time it become basis of fast searching. Few people play with the idea and came up with tuned algorithms that have very good characteristics for example Quick Search by Daniel M. Sunday, to name at least one.

Naive

For given substring and text, naive search approach is to start with the first character in text and move forward matching agains substring. Repeat in the loop until end of text or all characters from substring match. In the worst case I have to examine all the characters in the text and substring at least once.

Naive search

Knuth–Morris–Pratt

Then there is Knuth–Morris–Pratt (KPM) algorithm. The naive searching algorithm would move the pattern down by one all the time but KMP introduce observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

Knuth–Morris–Pratt

Boyer-Moore

Finally Boyer-Moore algorithm that is build around the idea of denial natural left-to-right scan, this time we scan pattern from the right of the left, while text is scanned from the left to the right. This clever idea allows us to determine if pattern match, based on single character from pattern, and as a result move forward of the length of pattern at once. Actually there are 4 observations that make this an efficient algorithm, however this is the main differentiator.

this is fast, isn't it?

Implementation

Time to implement this algorithm with Swift. This is what what bring me to this point in the first place.

I defined struct BoyerMoore and implemented function for "bad-character rule" (observation 3a from the paper, aka delta1). I did not implement "good-suffix" (observation 3b from the paper, aka delta2) part of the algorithm at this point.

struct BoyerMoore {

    enum BoyerMooreError: ErrorType {
        case NotFound
    }

    private let data:[Int]

    init(data: [Int]) {
        self.data = data
    }

    init(stringLiteral value: String) {
        self.data = value.unicodeScalars.map { Int($0.value) }
    }

Start with the data, which is array of integers (UInt8 type would be more appropriate type, though Int is more readable). I go with the Int.

Bad character rule (aka delta1) observation is the information how far we can move forward matching (or not) given character from the pattern (it has no sense to check further because pattern won't match anyway). delta1 is constant for pattern, it don't have to be calculated over and over. These values can be pre-calculated, however I didn't do it this way, instead decided to go with the function that for given input return the value:

    /// If char does not occur in pat, then patlen;
    /// else patlen - j, where j is the maximum integer
    /// such that pat(j) = char
    func delta1(pat: [Int], char: Int) -> Int {
        var i = 0
        // skip the last one, doesn't matter.
        for (i = pat.count - 1; i > 0 && pat[i - 1] != char; i--) { /* noop */ }
        return pat.count - i
    }

and main loop:

    func search(pat: String) throws -> Range<Int> {
        return try search(pat.unicodeScalars.map { Int($0.value) })
    }

    func search(pat:[Int]) throws -> Range<Int> {
        guard pat.count > 0 else {
            return Range(start:0, end:pat.count)
        }

        var i = pat.count - 1
        var j = 0
        while (i < data.count) {
            for (j = pat.count - 1; j >= 0 && data[i] == pat[j]; --i, --j) { /* noop */ }
            if (j < 0) {
                return Range(start:i + 1, end:i + 1 + pat.count)
            }

            i = i + delta1(pat, char: data[i])
        }

        throw BoyerMooreError.NotFound
    }
}

The result of search() function is Range<Int> value with the position of found pattern or throws error when not found.

Search in array of Integers

let data = [72,69,82,69,32,73,83,32, 65,32,83,73, 77,80,76,69,32,69,88,65,77,80,76,69]  
let range = BoyerMoore(data: data).search([65,32,83,73])  
// Range(startIndex: 8, endIndex: 12)

Search in String (technically in array of integers)

let range = BoyerMoore(stringLiteral: "HERE IS A SIMPLE EXAMPLE").search("A SI")  
// Range(startIndex: 8, endIndex: 12)

Performance

This is the best part. Algorithm is proven to be fast.

How fast it is in Swift?

I mentioned earlier that this sample implementation has no delta2() part - that means it's not that performant as it can be, however from my tests I can see that Swift code is at least as fast as C code from NSData, OR FASTER, and a way faster that NSString.rangeOfString()

Tests

For every test I took the very same part of "lorem ipsum" string and search for the very same substring: "imentum nibh,". Because search in such small sample is fast in general, I repeat search operation 10,000 times to get noticeable times.

#Test1 - Swift

Average time: 0.012 sec (7% STDEV)

func testPerformanceBoyerMoore() {  
    let loremText = "Maecenas sed diam eget risus varius blandit sit amet non magna. Cras mattis consectetur purus sit amet fermentum. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet risus. Cras mattis consectetur purus sit amet fermentum. Maecenas faucibus mollis interdum."
    let loremData = loremText.unicodeScalars.map { Int($0.value) }
    self.measureMetrics([XCTPerformanceMetric_WallClockTime], automaticallyStartMeasuring: true) { () -> Void in
        for _ in 0...9999 {
            try! BoyerMoore(data: loremData).search([105, 109, 101, 110, 116, 117, 109, 32, 110, 105, 98, 104, 44, 32])
        }
    }
}
#Test2 - NSData

Average time: 0.014 sec (7% STDEV)

func testPerformanceNSData() {  
    let loremText = "Maecenas sed diam eget risus varius blandit sit amet non magna. Cras mattis consectetur purus sit amet fermentum. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet risus. Cras mattis consectetur purus sit amet fermentum. Maecenas faucibus mollis interdum."
    let loremData = loremText.unicodeScalars.map { Int($0.value) }
    let data = NSData(bytes: loremData, length: loremData.count)
    let searchData = NSData(bytes: [65,32,83,73], length: 4)
    self.measureMetrics([XCTPerformanceMetric_WallClockTime], automaticallyStartMeasuring: true) { () -> Void in
        for _ in 0...9999 {
            data.rangeOfData(searchData, options: NSDataSearchOptions(rawValue: 0), range: NSRange(location: 0, length: data.length))
        }
    }
}
#Test3 - CFData

Average time: 0.012 sec (8% STDEV)

func testPerformanceCFData() {  
    let loremText = "Maecenas sed diam eget risus varius blandit sit amet non magna. Cras mattis consectetur purus sit amet fermentum. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet risus. Cras mattis consectetur purus sit amet fermentum. Maecenas faucibus mollis interdum."
    let loremData = loremText.unicodeScalars.map { UInt8($0.value) }
    let data = CFDataCreate(nil, loremData, loremData.count)
    let searchData = CFDataCreate(nil, [65,32,83,73], 4)

    self.measureMetrics([XCTPerformanceMetric_WallClockTime], automaticallyStartMeasuring: true) { () -> Void in
        for _ in 0...9999 {
            CFDataFind(data, searchData, CFRangeMake(0, loremData.count), CFDataSearchFlags(rawValue: 0))
        }
    }
}
#Test4 - NSString

Average time: 0.051 sec (3% STDEV)

func testPerformanceFoundationNSString() {  
    let loremText = "Maecenas sed diam eget risus varius blandit sit amet non magna. Cras mattis consectetur purus sit amet fermentum. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet risus. Cras mattis consectetur purus sit amet fermentum. Maecenas faucibus mollis interdum."
    self.measureMetrics([XCTPerformanceMetric_WallClockTime], automaticallyStartMeasuring: true) { () -> Void in
        for _ in 0...9999 {
            loremText.rangeOfString("imentum nibh,")
        }
    }
}

Let's summarize tests results:

Swift CFData NSData NSString
Language Swift C Objective-C Objective-C
Time (s) 0.012 0.012 0.014 0.051

Values for this sample shows that Swift implementation is the fastest code to find given pattern in the string. However don't have to be true for any set of data.

One more thing... I intentionally tested against NSData because CFData (from CoreFoundation) has implemented Boyer-Moore algorithm (fully, not only partially).

Conclusion

Turn out Swift code can be fast as long as any of Swift method is not used. Whenever I try to use map() or filter() etc... performance reduce dramatically (like 200x slower).

I personally think performance is not the priority of Swift, yet, nonetheless I hope it become priority in the post Swift 2.0 release.

We will see, meantime... enjoy what you have learned today. It's been pleasure for me to learn this too.

Code

Presented code can be found on my Github account: krzyzanowskim/BoyerMoore. Give a star if you found it helpful.