# Adding strings in R

This started out as a *“hey, I wonder…”* sort of thing, but as usual, these things tend to end up as interesting voyages into the deepest depths of code, so I thought I’d write it up and share. Shoutout to @coolbutuseless for proving that a little curiosity can go a long way and inspiring me to keep digging into interesting topics.

This post came across my feed last week, referring to the roperators package on CRAN. In that post, the author introduces an infix operator from that package which ‘adds’ (concatenates/pastes) strings:

    "using infix (%) operators " %+% "R can do simple string addition"
    #> [1] "using infix (%) operators R can do simple string addition"

This might be familiar if you use Python

    >>> "python " + "adds " + "strings"
    'python adds strings'

or JavaScript

    "javascript " + "also adds " + "strings"
    "javascript also adds strings"

or perhaps even Go

    package main

    import "fmt"

    func main() {
        fmt.Println("go " + "even adds " + "strings")
    }
    // prints: go even adds strings

but this is not something natively available in R

    "this doesn't" + "work"
    #> Error in "this doesn't" + "work" :
    #>   non-numeric argument to binary operator

Could we make it work, though? That got me wondering. My first guess was to just create a new `+` function which *does* allow for this. The normal addition operator is

    `+`
    #> function (e1, e2)  .Primitive("+")

so a first attempt might be

    `+` <- function(e1, e2) {
      if (is.character(e1) | is.character(e2)) {
        paste0(e1, e2)
      } else {
        base::`+`(e1, e2)
      }
    }

This checks to see if the left or right side of the operator is a character-classed object, and if either is, it pastes the two together. Otherwise it just uses the ‘regular’ addition operator between the two arguments. This works for simple cases, e.g.

    "a" + "b"
    #> [1] "ab"
    "a" + 2
    #> [1] "a2"
    2 + 2
    #> [1] 4
    2 + "a"
    #> [1] "2a"

But we hit an important snag if we try to add two character-represented numbers

    "200" + "200"
    #> [1] "200200"

That’s probably going to be an issue if we read in unformatted data (e.g. from a CSV) as characters and try to treat it like numbers. Normally this would throw the above error about not being numeric, but now we get a silent weird number-character. That’s no good.
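Since a definition like this masks base `+` wherever it lives, one cautious way to experiment with it is to keep it inside `local()`, so the rest of the session still sees the primitive. This is only a sketch: the names `res` and `outside` are mine, and it assumes a fresh session in which `+` has not already been redefined globally.

```r
# Define the string-aware `+` only inside a local scope.
res <- local({
  `+` <- function(e1, e2) {
    if (is.character(e1) | is.character(e2)) {
      paste0(e1, e2)
    } else {
      base::`+`(e1, e2)
    }
  }
  list(strings = "a" + "b", numbers = 2 + 2)
})
res$strings  # "ab"
res$numbers  # 4

# Outside the local block (assuming a fresh session), base `+` is
# untouched and still refuses strings; we capture the error message.
outside <- tryCatch("a" + "b", error = function(e) conditionMessage(e))
outside
```

Nothing leaks out of the `local()` block, which makes it easier to poke at the edge cases without breaking arithmetic elsewhere.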

An extension to the `+` override checks whether we have the number-as-a-character situation and falls back to the numeric interpretation in that case

    `+` <- function(e1, e2) {
      ## unary
      if (missing(e2)) return(e1)
      if (!is.na(suppressWarnings(as.numeric(e1))) & !is.na(suppressWarnings(as.numeric(e2)))) {
        ## both arguments numeric-like but characters
        return(base::`+`(as.numeric(e1), as.numeric(e2)))
      } else if ((is.character(e1) & is.na(suppressWarnings(as.numeric(e1)))) |
                 (is.character(e2) & is.na(suppressWarnings(as.numeric(e2))))) {
        ## at least one true character
        return(paste0(e1, e2))
      } else {
        ## both numeric
        return(base::`+`(e1, e2))
      }
    }

    "a" + "b"
    #> [1] "ab"
    "a" + 2
    #> [1] "a2"
    2 + 2
    #> [1] 4
    2 + "a"
    #> [1] "2a"
    "2" + "2"
    #> [1] 4
    2 + "edgy" + 4 + "me"
    #> [1] "2edgy4me"
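The "is this really a number?" test is doing most of the work in that function, so it is worth looking at on its own. Here is a small sketch that pulls it out into a helper; the name `is_number_like` is hypothetical and not part of the function above or of any package.

```r
# Hypothetical helper: TRUE when a value can be read as a number,
# using the same as.numeric() test as the `+` override.
is_number_like <- function(x) !is.na(suppressWarnings(as.numeric(x)))

is_number_like("200")  # TRUE
is_number_like(200)    # TRUE
is_number_like("a")    # FALSE
is_number_like("1e3")  # TRUE: scientific notation parses as 1000
is_number_like(" 2 ")  # TRUE: surrounding whitespace is ignored
is_number_like("")     # FALSE: the empty string converts to NA
```

Under this test, a string such as `"1e3"` counts as a number, so the override would add it numerically rather than paste it. Whether that is the behaviour you want is exactly the kind of edge case that makes this tricky.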

So, that’s one option for string addition in R. Is it the right one? The idea of actually dispatching on a character class is inviting. Can we just add a `+.character` method (since there doesn’t seem to already be one)? Normally when we have S3 dispatch we need a generic function which calls `UseMethod()` with its own name, but we don’t have that in this case. `+` is an internal generic, which is probably the first sign that we’re going to have trouble. If we try to define the method

    `+.character` <- function(e1, e2) {
      paste0(e1, e2)
    }
    "a" + "b"
    #> Error in "a" + "b" : non-numeric argument to binary operator

It seems to fail. What went wrong? Is dispatch not working?

We want to dispatch on “character” — is that what we have?

    class("a")
    #> [1] "character"

What if we explicitly create an object with that class?

    structure("a", class = "character") + 2
    #> [1] "a2"
    2 + structure("a", class = "character")
    #> [1] "2a"

What if we try to dispatch on some new class?

    `+.foo` <- function(e1, e2) {
      paste0(e1, e2)
    }
    structure("a", class = "foo") + 2
    #> [1] "a2"

but no dice for just a regular atomic character object. Time to revisit the help pages.

In R, addition is limited to particular classes of objects, defined by the Ops group (there are also Math, Summary, and Complex groups). The methods for the Ops group members describe which classes can be involved in operations involving any of the Ops group members:

"+", "-", "*", "/", "^", "%%", "%/%"

"&", "|", "!"

"==", "!=", "<", "<=", ">=", ">"

These methods are:

    methods("Ops")
     [1] Ops,array,array-method
     [2] Ops,array,structure-method
     [3] Ops,nonStructure,nonStructure-method
     [4] Ops,nonStructure,vector-method
     [5] Ops,structure,array-method
     [6] Ops,structure,structure-method
     [7] Ops,structure,vector-method
     [8] Ops,vector,nonStructure-method
     [9] Ops,vector,structure-method
    [10] Ops.data.frame
    [11] Ops.data.table*
    [12] Ops.Date
    [13] Ops.difftime
    [14] Ops.factor
    [15] Ops.numeric_version
    [16] Ops.ordered
    [17] Ops.POSIXt
    [18] Ops.raster*
    [19] Ops.roman*
    [20] Ops.ts*
    [21] Ops.unit*

What’s missing from this list, in order for us to be able to just use `"string" + "string"`, is a character method. What’s perhaps even more surprising is that there *is* a `roman` method! Whaaaat?

    as.roman("1") + as.roman("5")
    #> [1] VI
    as.roman("2000") + as.roman("18")
    #> [1] MMXVIII

Since the operations need to be defined for all the members of the Ops group, we would also need to define what to do with, say, `*` between strings. When one side is a string and the other is a number, a reasonable approach might be that which was taken in the original post (using a new infix `%s*%`)

    "a" %s*% 3
    #> [1] "aaa"

There is, of course, a function to do this already

    strrep("a", 3)
    #> [1] "aaa"

but I could see creating `"a" * 3` as a shortcut to this. I don’t know what one would expect `"a" * "b"` to produce.
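To get a feel for what "defining the whole Ops group" involves, here is a rough sketch using a made-up S3 class. The class name `strcat`, its constructor, and the choice to refuse `"a" * "b"` are all my own assumptions for illustration; none of this comes from roperators or base R.

```r
# Hypothetical constructor and class name, for illustration only.
strcat <- function(x) structure(as.character(x), class = "strcat")

# One group method covers every operator in the Ops group;
# .Generic holds the name of the operator that was actually called.
Ops.strcat <- function(e1, e2) {
  if (missing(e2)) stop("unary ", .Generic, " is not defined for <strcat>")
  a <- unclass(e1)
  b <- unclass(e2)
  switch(.Generic,
    "+" = strcat(paste0(a, b)),
    "*" = {
      # Only string-times-number has an obvious meaning; refuse the rest.
      if (is.character(a) && is.numeric(b)) {
        strcat(strrep(a, b))
      } else if (is.numeric(a) && is.character(b)) {
        strcat(strrep(b, a))
      } else {
        stop("`*` needs one string and one number")
      }
    },
    # Comparisons fall through to the base operator on the bare values.
    "==" = , "!=" = , "<" = , "<=" = , ">" = , ">=" =
      get(.Generic, mode = "function")(a, b),
    stop("operator ", .Generic, " is not defined for <strcat>")
  )
}

unclass(strcat("a") + "b" + "c")  # "abc"
unclass(2 + strcat("a"))          # "2a"
unclass(strcat("a") * 3)          # "aaa"
```

Returning a `strcat` from `+` and `*` is what lets the chain keep dispatching. The catch is that the very first value has to be wrapped by hand, which is precisely the convenience a plain `"a" + "b"` was supposed to give us.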

The problem with where this is heading is that we aren’t allowed to create the method for an atomic class, as Joris Meys and Brodie Gaslam point out on Twitter

Yes, you're right. Below is what I remembered, which suggested that if it were not sealed, it could be defined, but that isn't true b/c `do_arith` only dispatches on objects (as you point out), although in theory it could dispatch on atomics, but probably doesn't for speed. pic.twitter.com/UXk6Tdm3lW

— BrodieG (@BrodieGaslam) October 4, 2018

    setMethod("+", c("character", "character"), function(e1, e2) paste0(e1, e2))
    #> Error in setMethod("+", c("character", "character"), function(e1, e2) paste0(e1, :
    #>   the method for function ‘+’ and signature e1="character", e2="character" is sealed and cannot be re-defined

so no luck there. Brodie also links to a Stack Overflow discussion on this very topic where it is pointed out by Martin Mächler that this has been discussed on r-devel — that makes for some interesting historical weigh-ins on why this isn’t a thing in R. Incidentally, the small-world effect comes into play regarding that Stack Overflow post as one of the three answers happens to be a former work colleague of mine.
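The "only dispatches on objects" point is easy to poke at from the console. As far as I can tell from `?InternalMethods`, internal generics only look for methods when `is.object()` is `TRUE`, which in practice means the value carries an explicit class attribute. A quick sketch (the class name `foo` is arbitrary):

```r
plain  <- "a"
tagged <- structure("a", class = "foo")

is.object(plain)   # FALSE: only the implicit class "character"
is.object(tagged)  # TRUE: carries an explicit class attribute

class(plain)           # "character"
attr(plain, "class")   # NULL, so there is nothing for `+` to dispatch on
attr(tagged, "class")  # "foo"
```

So `class("a")` reporting `"character"` is a little misleading here: that is the implicit class, and the internal dispatch for `+` never gets as far as looking for a `+.character` method.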

So, in the end, it seems the best we can do is the rather long-winded overwrite of `+` which checks if the arguments really are characters. I don’t mind this, and would probably use it if it was in base R or a package. The biggest issue that people seem to have with this is that it ‘looks like’ addition, but it’s not commutative. If that word is new to you, it just means that `x + y` should give the same answer as `y + x`. For numbers, the regular `+` satisfies this:

    2 + 3
    #> [1] 5
    3 + 2
    #> [1] 5

but when we try to do this with strings… not so much

    "a" + "b"
    #> [1] "ab"
    "b" + "a"
    #> [1] "ba"

This doesn’t particularly bother me, because I’m okay with this not actually being ‘mathematical addition’. The fun turn this then took was the suggestion from Joris Meys that Julia’s non-associative operators are a strength of the language. There, the way that you group values matters

a + b + c is parsed as +(a, b, c) not +(+(a, b), c).

I’ll eventually get around to learning more Julia, but this is already hurting my brain.

That distinction may be of interest, however, to Miles McBain, whose concern was more about repeated applications of `+` being a bottleneck

I hate + for string concatenation. "a" + "b" + "c" is paste("a", paste("b","c")). So you end up copying the data in "b" and "c" twice due to the data being immutable. That can really add up fast with more +'s if you are careless. Like I was in my first programming job.

— Miles McBain (@MilesMcBain) October 4, 2018

In that case, parsing as `+("a", "b", "c")` is exactly what would be desired.
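R’s `paste0()` already accepts any number of pieces in one call, so the contrast can be sketched without touching `+` at all. Here `Reduce()` stands in for the pairwise chain; the variable names are mine, and I haven’t benchmarked this, so treat it as an illustration of the shape of the work rather than a performance claim.

```r
parts <- c("a", "b", "c", "d")

# Pairwise, left to right: roughly what a chained "a" + "b" + "c" + "d"
# would do, building the intermediate strings along the way.
steps <- Reduce(paste0, parts, accumulate = TRUE)
steps  # "a" "ab" "abc" "abcd"

# One call over all the pieces at once.
single <- paste0(parts, collapse = "")
single  # "abcd"

identical(steps[length(steps)], single)  # TRUE
```

Both routes end at the same string, but the pairwise version has to materialise every intermediate result on the way there.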

So, what’s the conclusion of all of this? I’ve learned (and re-learned) a heap more about how the Ops group works, I’ve played a lot with dispatch, and I’ve thought deeply about edge-cases for adding strings. I’ve also been exposed to a bit more Julia. All in all, a worthwhile dive into something potentially silly, but a lot of fun. If you have some thoughts on the matter, leave a comment here or reply on Twitter — I’d love to hear about another angle to this story.

`stringi` has had `%s+%` for a while now 🙂

Hahaha reading this reminds me of an afternoon me and my coworker spent while making roperators.

The reason we don’t overwrite + is threefold:

1) CRAN has a policy against packages changing base R functionality

2) It’d add overhead to numerical operations

3) It’s a bad idea to have + work on strings in a dynamically typed language: if your data comes in as 1 + ‘1’ when it should be 1 + 1, you’ll end up with ’11’ instead of 2. If that variable goes into a report or a function that converts it into, say, an integer, then you have the worst kind of error possible – the kind that gives a result.

There were some other things to do with keeping the package as small and independent as possible for production environments (eg AWS), but those are the three main reasons.

Cheerio!

Hi! You’re tricky to track down — your site could use an ‘About Me’ page. Do you have a Twitter handle? GitHub? FYI, there are some rendering issues in your code (e.g.my_string < - class="st" span=""> ‘using infix (%) operators ‘). Neat package, btw!

Good points. 1 is a showstopper of course. 3 is addressed in my longer version which checks if coercion works, but given that `"2":"5"` currently works (R >= 3.4) I would (contentiously) propose that `"2" + "5"` should perhaps follow suit.

Not messing with something as fundamental as + is a good idea. However, you did a good service by providing your `+` function. I changed it to `%+%`, and I will use it for the rare occasions when I want to add strings. Thanks