forcats::fct_match

This journey started almost exactly a year ago, but it’s finally been sufficiently worked through and merged! Yay, I’ve officially contributed to the tidyverse (minor as it may be).

It began with a tweet, recalling a surprise I encountered that day during some routine data processing

For those of you not so comfortable with pipes and dplyr, I was trying to subset a data.frame, data (with a column g having values "W", "X_Y" and "Z"), to only those rows for which the column g had the value "X_Y" or "Z" (not the actual values, of course, but that’s the idea). Without dplyr this might simply be

data[data$g %in% c("X Y", "Z"), ]

To make that more concrete, let’s actually show it in action

data <- data.frame(a = 1:5, g = c("X_Y", "W", "Z", "Z", "W"))
data
#>   a   g
#> 1 1 X_Y
#> 2 2   W
#> 3 3   Z
#> 4 4   Z
#> 5 5   W

data %>% 
   filter(g %in% c("X Y", "Z"))
#>   a g
#> 1 3 Z
#> 2 4 Z

filter isn’t at fault here (the same issue would arise with [). I have mis-specified the values I wish to match, so only the rows that do match are returned. %in% is also performing its job: it returns a logical vector, the result of comparing the values in the column g to the vector c("X Y", "Z"). Both of these functions are behaving as they should, but the intent of what I was trying to achieve (subset to exactly these values) was lost.
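To see what %in% hands back, here is a minimal base R sketch using the same toy data. The names wanted, row_hits and wanted_found are made up for illustration. Flipping the comparison around is one way to spot requested values that never occur in the column.

```r
# Same toy data as above; stringsAsFactors = FALSE keeps g as character
data <- data.frame(a = 1:5, g = c("X_Y", "W", "Z", "Z", "W"),
                   stringsAsFactors = FALSE)
wanted <- c("X Y", "Z")  # hypothetical name; note the typo "X Y"

# Which rows of g match something in `wanted`?
row_hits <- data$g %in% wanted
row_hits
#> [1] FALSE FALSE  TRUE  TRUE FALSE

# Flip the question: which requested values never appear in g?
wanted_found <- wanted %in% data$g
wanted[!wanted_found]
#> [1] "X Y"
```

Nothing in the first result hints that "X Y" matched zero rows; only the second comparison reveals it.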

Now, in some instances, that is exactly the behaviour you want — subset this vector to any of these values… where those values may not be present in the vector to begin with

data %>% 
   filter(values %in% all_known_values)

The problem, for me, is that there isn’t a way to say “all of these should be there”. The lack of matching happens silently. If you make a typo, you don’t get that level, and you aren’t told that it’s been skipped

simpsons_characters %>% 
   filter(first_name %in% c("Homer", "Marge", "Bert", "Lisa", "Maggie"))

Technically this is a double-post because I also want to sidenote this with something I am amazed I have not known about yet (I was approximately today years old when I learned about this)… I’ve used regex matching for a while, and have been surprised at how well I’ve been able to make it work occasionally. I’m familiar with counting patterns ((A){2} to match two occurrences of A) and ranges of counts ((A){2,4} to match between two and four occurrences of A) but I was not aware that you can specify a number of mistakes that can be included to still make a match…

grep("Bart", c("Bart", "Bort", "Brat"), value = TRUE)
#> [1] "Bart"

grep("(Bart){~1}", c("Bart", "Bort", "Brat"), value = TRUE)
#> [1] "Bart" "Bort"

(“Are you matching to me?”… “No, my regex also matches to ‘Bort'”)

Use (pattern){~n} to allow up to n errors (insertions, deletions or substitutions) in the pattern matching. Refer here and here.
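As a rough cross-check of why "Bort" gets through with one allowed mistake while "Brat" does not, the edit distances can be computed with adist() from the utils package. This is only a sketch: adist() compares whole strings, whereas grep() looks for a match anywhere within each string, so the two are not equivalent in general. The names candidates and edits are made up for illustration.

```r
candidates <- c("Bart", "Bort", "Brat")

# Levenshtein (edit) distance from "Bart" to each whole string
edits <- as.vector(adist("Bart", candidates))
edits
#> [1] 0 1 2

# Only strings within one edit of "Bart"
candidates[edits <= 1]
#> [1] "Bart" "Bort"
```

Swapping two adjacent letters, as in "Brat", counts as two edits here, which is why it falls outside a budget of one.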

Back to the original problem: filter and %in% are doing their jobs, but we aren’t getting the result we want because we made a typo, and we aren’t told that we’ve done so.

Enter a new PR to forcats (originally to dplyr, but forcats does make more sense) which implements fct_match(f, lvls). This checks that all of the values in lvls are actually present in f before returning the logical vector of which entries they correspond to. With this, the pattern becomes (after loading the development version of forcats from GitHub)

data %>% 
   filter(fct_match(g, c("X Y", "Z")))
#> Error in filter_impl(.data, quo): Evaluation error: Levels not present in factor: "X Y".

Yay! We’re notified that we’ve made an error. "X Y" isn’t actually in our column g. If we don’t make the error, we get the result we actually wanted in the first place. We can now use this successfully

data %>% 
   filter(fct_match(g, c("X_Y", "Z")))
#>   a   g
#> 1 1 X_Y
#> 2 3   Z
#> 3 4   Z
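The idea behind the check can be sketched in a few lines of base R. To be clear, this is not the forcats source: match_strict is a hypothetical helper written for illustration, and it checks against the values actually present in the vector rather than against factor levels.

```r
# Hypothetical helper, NOT the forcats implementation: a base R sketch
match_strict <- function(x, values) {
  # Requested values that never occur in x (NA is ignored)
  missing_values <- setdiff(values, unique(as.character(x)))
  missing_values <- missing_values[!is.na(missing_values)]
  if (length(missing_values) > 0) {
    stop("Values not present in x: ",
         paste0('"', missing_values, '"', collapse = ", "),
         call. = FALSE)
  }
  # Everything requested exists, so the usual logical vector is safe
  x %in% values
}

g <- c("X_Y", "W", "Z", "Z", "W")
match_strict(g, c("X_Y", "Z"))
#> [1]  TRUE FALSE  TRUE  TRUE FALSE

# The typo now fails loudly instead of silently dropping rows
res <- tryCatch(match_strict(g, c("X Y", "Z")),
                error = function(e) conditionMessage(e))
res
#> [1] "Values not present in x: \"X Y\""
```

Because the function still returns a plain logical vector on success, it can stand in anywhere %in% was used inside filter() or [.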

It took a while for the PR to be addressed (the tidyverse crew have plenty of backlog, no doubt) but after some minor requested changes and a very neat cleanup by Hadley himself, it’s been merged.

My original version had a few bells and whistles that the current implementation has put aside. The first was inverting the matching with fct_exclude to make it easier to negate the matching without having to create a new anonymous function, i.e. ~!fct_match(.x). I find this particularly useful since a pipe expects a call/named function, not a lambda/anonymous function, which is actually quite painful to construct

data %>%
   pull(g) %>%
   (function(x) !fct_match(x, c("X_Y", "Z")))
#> [1] FALSE  TRUE FALSE FALSE  TRUE

whereas if we defined

fct_exclude <- function(f, lvls, ...) !fct_match(f, lvls, ...)

we can use

data %>%
   pull(g) %>%
   fct_exclude(c("X_Y", "Z"))
#> [1] FALSE  TRUE FALSE FALSE  TRUE

The other was specifying whether or not to include missing levels when considering if lvls is a valid value in f since unique(f) and levels(f) can return different answers.
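One way the two can differ is when a factor declares a level that never occurs in the data. A small sketch with a made-up factor f, assuming unused levels are the case of interest:

```r
# "W" is declared as a level but does not occur in the values
f <- factor(c("X_Y", "Z", "Z"), levels = c("W", "X_Y", "Z"))

declared <- levels(f)                # every declared level, used or not
observed <- as.character(unique(f))  # only values that actually occur

declared
#> [1] "W"   "X_Y" "Z"
observed
#> [1] "X_Y" "Z"

# Levels that would pass a levels() check yet match no rows
setdiff(declared, observed)
#> [1] "W"
```

A check against levels(f) would accept "W" here, while a check against unique(f) would reject it, so the choice changes what counts as a typo.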

The cleanup really made me think about how much ‘fluff’ some of my code can have. Sure, it’s nice to encapsulate some logic in a small additional function, but sometimes you can actually replace all of that with a one-liner and not need all that. If you’re ever in the mood to see how compact internal code can really be, check out the source of forcats.

Hopefully this pattern of filter(fct_match(f, lvls)) is useful to others. It’s certainly going to save me overlooking some typos.

6 Responses to “forcats::fct_match”

  1. Congrats!!! Useful enhancement 🙂

  2. Nice work Jonathan!
    FYI, the post slips into an *unmatched* italics somewhere around “No, my regex also matches to ‘Bort'”.
    Perhaps that was a deep pun 😉

  3. Thanks for this Jono! Love the regex trick aside, was new to me too.

    It would be useful to extend this to non-factors as well, such as characters. Instead of checking against the factor levels, check if all the values used for filtering are indeed in the variable and break if they aren’t. Shouldn’t be too hard to implement, but not sure what the best place is for such a function.

    • Sorry for the late approval — I’ve been getting lots of trackback spam so comments are locked until approved.

      This works fine even with characters

      d <- data.frame(a = 1:5, g = c("X_Y", "W", "Z", "Z", "W"), stringsAsFactors = FALSE)
      filter(d, forcats::fct_match(g, c("X_Y", "Z")))

      with an error if the requested values aren't present. forcats converts any character vectors to factors on the way in. Since it returns a logical vector, there's no more conversion needed.

  4. Ah, I wondered where it went 😉

    That is wonderful! (I honestly don’t know much about forcats since I tend to avoid factors). This is going to be my new default %in%. Thanks again.
