Constricted development with reticulate

I’ve been using the reticulate package occasionally for a while now, so I was surprised to see that it had only just been officially released.

It’s a brilliant piece of work, allowing python and R to coexist in the same workflow.

Another opportunity came up today to use it so I thought it might be nice to do a very quick blog post to show just how easy it is to take external python code and have it callable directly from R. In this case, @coolbutuseless posed a challenge on Twitter to write a fast ‘needle in a haystack’ search of a small vector inside a larger one. I looked over the existing candidates and figured some sort of Sieve of Eratosthenes-esque algorithm might have a chance (though the name eluded me entirely at the time).

My proposal was to search for the first digit using which(), and use this reduced vector of possible-matches in additional tests on the remaining parts of the ‘needle’. @coolbutuseless refactored my attempt allowing for arbitrary length needles and found it to do quite well against the current offerings. What he still wanted though was a Boyer–Moore string search algorithm implementation. This is apparently what GNU grep uses, so it’s probably pretty okay.

That algorithm is pretty clever about how it goes about the search, starting in a similar way to what I did (the sieve approach was apparently the leading string match method prior to Boyer-Moore). It’s much more complicated though, so I wasn’t about to write one of those myself in R. Nowadays, people think of C/C++ when there’s functionality they want to grab from elsewhere. There’s a C implementation on the Wikipedia site, so that seems like a nice place to start. I saved the text to a new boyermoore.c file and ran

R CMD SHLIB boyermoore.c

from a terminal to compile it into boyermore.so. This could then be loaded into R with dyn.load("boyermore.so") and in theory called with .C("boyer_moore", <something>, <something>). I tried a couple of <something>s (which weren’t pointers) and promptly crashed RStudio.

The python implementation is also listed on Wikipedia, so I figured that’s another route to try. I saved the text to a new boyermoor.py file (also embedded below) and started about loading the functions from R. This is actually much simpler than for C:

library(reticulate)
bm <- py_run_file("boyermoor.py")

This executes the python file and creates a new named list with each exported python function as an element. How easy is that!?! Calling the function would be as easy as

bm$string_search(needle, haystack)

Not quite that easy of course… The implementation assumes that both the ‘needle’ and the ‘haystack’ are text, not numbers. To solve this, I converted my numbers (in the range 0 to 12) to letters using the built-in LETTERS vector. After testing that it worked as expected, a benchmark test showed that it was nowhere near as fast as my R approach. I can’t say this is due to the algorithm itself, which should be fairly fast, but probably has more to do with the fact that I’m using two different languages.

The entire call from R looks pretty neat and tidy

despite a lot of python code in the background

I’d certainly recommend having reticulate functions in your arsenal next time you need to attack a problem using python from within R. There’s a whole heap of useful ways to interact between R and python with this including importing python modules and calling python scripts, etc…

As a side-note: keep an eye on the ergo project to connect the go language in just as easily.

One Response to “Constricted development with reticulate”

  1. jonocarroll says:

    For posterity, the right way to use the C version in R, courtesy of Jim Hester: https://gist.github.com/jimhester/2d02d63d0459d9fdadcf15e08c3afc0a

Leave a Reply