Posts on Irregularly Scheduled Programming

I Patched R to Solve an Exercism Problem

website@jcarroll.com.au (Jonathan Carroll) — Mon, 26 Feb 2024 00:00:00 +0000

With a serious yak shaving deviation, I have a really short “cheat” solution to one of the featured Exercism problems. It’s been a really insightful journey getting to this point, and as always I’ve learned a lot along the way. The fact that I was able to understand the required changes and propose them is thanks to the open-source nature of programming languages.

This all started out innocently enough - I’ve been using Exercism.org to get a feel for a lot of other programming languages. Last year they hosted the 12in23 challenge where they invited users to try out one programming language each month, solving 5 non-trivial (beyond printing "Hello, World!") exercises in any of a handful of featured languages, with a different ‘theme’ (functional, object-oriented, low-level, …) each month.

I earned the “polyglot” badge for completing this, but I actually ended up doing a lot more.

I found this to be an extremely useful exercise in finding out what I did and didn’t like about different languages, and I’m diving a lot deeper into the ones I enjoyed the most. Worthwhile, but certainly an involved challenge as I solved all the exercises locally on my own machine, which means I had to install each of the languages and get their test suites working. My laptop is now capable of running code in more than 20 languages.

This year, Exercism are hosting a 48in24 challenge where rather than just needing to use one language to solve several problems in a month, it’s one problem in three languages in a week. Spending a month to get up to speed with a language I’d never seen (or installed) was one thing, but now I potentially need to do that three times over in a week when the languages are all new to me. Still, I persist, and so far I’m keeping up.

One of the featured problems was ‘Roman Numerals’ where we need to write an algorithm to convert integers into Roman numerals. The featured languages were Julia (I’m already familiar, nice), Elixir (fine, I get to learn what that looks like), and (Pharo) Smalltalk, which needs its own IDE and is a workflow entirely distinct from anything I’ve done before. I won’t be willingly writing any more Smalltalk in the near future.

When I saw that this was the challenge problem I was delighted - I know that R already has as.roman() in the standard library!

as.roman(2024)

## [1] MMXXIV

It even does math with these values - I base most of my real-life conversations with my wife around Simpsons quotes so I’m all-too familiar with this bit

Bart enters the backstage of the lion enclosure at a zoo and sees a sign:

“Caution: exit through door 7 only. All other rooms contain man-eating tigers.”

“Think, Bart. Where have you seen Roman numerals before? I know… Rocky V! That was the fifth one. So, Rocky five plus Rocky two equals… Rocky VII! ‘Adrian’s Revenge’!”

R can do this math just fine

(five <- as.roman(5))

## [1] V

(two <- as.roman(2))

## [1] II

five + two

## [1] VII

It’s more than just concatenating those, of course

(five + two) * 2

## [1] XIV

I downloaded the exercise and fired up RStudio. I edited the exercise template to just pass the input argument to as.roman() and hit ‘Run Tests’… all tests failed 😢

Ah, as.roman() returns an object of class roman

class(as.roman(321))

## [1] "roman"

and the tests have an expect_equal() (not expect_equivalent())

test_that("48 is XLVIII", {
  expect_equal(roman(48), "XLVIII")
})

Okay, so it won’t be perfectly clean, but it just needs to be re-classed

as.character(as.roman(2024))

## [1] "MMXXIV"

Re-run the tests, and…

── Failure ('test_roman-numerals.R:105:3'): 3999 is MMMCMXCIX ──────────────────
roman(3999) not equal to "MMMCMXCIX".
1/1 mismatches
x[1]: NA
y[1]: "MMMCMXCIX"

[ FAIL 1 | WARN 0 | SKIP 0 | PASS 25 ]

What? Twenty-five tests pass, one fails? And produces NA?

Back to the problem statement, it says

For this exercise, we are only concerned about traditional Roman numerals, in which the largest number is MMMCMXCIX (or 3,999).

So, why does that fail? I checked the docs for as.roman() and it says

Only numbers between 1 and 3899 have a unique representation as roman numbers, and hence others result in as.roman(NA).

Odd.

I checked out some other languages. The intro video for the problem mentions another “cheat” solution which is to use Common Lisp, because that has a formatter. Using GNU Common Lisp…

(format nil "~@R" 2024)

"MMXXIV"

Trying out this last value works

(format nil "~@R" 3999)

"MMMCMXCIX"

and a too-large value produces an informative error

(format nil "~@R" 4000)

 The ~@R format directive requires an integer in the range 1 - 3999, not 4000

So, the limit does seem to be 3999, not 3899. Checking Wikipedia, that also seems to be the case. I also found a python implementation which goes up to 3999.
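For reference, the conversion over the full 1 to 3999 range can be sketched in a few lines of Python. This is a generic greedy approach, not R’s as.roman() and not the Python implementation mentioned above; the names ROMAN_SYMBOLS and to_roman are made up for this illustration.

```python
# Value/symbol pairs in descending order, including the subtractive forms.
ROMAN_SYMBOLS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(n: int) -> str:
    """Convert an integer in 1..3999 to a traditional Roman numeral."""
    if not 1 <= n <= 3999:
        raise ValueError(f"expected an integer in the range 1 - 3999, not {n}")
    parts = []
    for value, symbol in ROMAN_SYMBOLS:
        # Take as many copies of this symbol as fit, then carry on with the rest.
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


print(to_roman(2024))  # MMXXIV
print(to_roman(3999))  # MMMCMXCIX
```

Nothing in this approach treats 3899 specially; the values from 3900 to 3999 simply start with MMMCM.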

So, why won’t R convert those last 100 numbers?

I dug around the source code (an accessible copy is on GitHub but the “real” copy is on SVN) and found several references to the 3899 limit, none of which could be circumvented.

I posted a quick summary of the issue I saw on my side-blog hoping that it was something I just misunderstood.

I asked around on Mastodon, and no one was sure why this was the case. The next step was to email the R-devel mailing list. I don’t do this lightly, as it has a reputation for not being the friendliest place. Kurt Hornik, who originally wrote the implementation back in 2006, replied and agreed that it was probably just an oversight.

He asked if I could file the bug in Bugzilla, the R bug tracker, … and asked if I could also add a patch. At this point, I remembered what I was doing in the first place - solving an Exercism problem.

The term “Yak Shaving” means performing some task that appears entirely unrelated to what you were originally trying to do, usually as a result of several other related tasks. It’s summarised well in this clip from the TV show ‘Malcolm in the Middle’

where the character goes to change a lightbulb but finds that the shelf is coming loose, so he gets a screwdriver from a drawer and notices that the drawer isn’t running smoothly, so he gets some lubricant but it’s empty so he gets in his car to go to the store to buy more but the car sounds funny, so he tries fixing that… When his wife asks him “are you going to replace that lightbulb” he yells - appearing from underneath the car covered in grease - “what does it look like I’m doing???”.

The absurdity being that “before you can complete this task, you need to shave a yak.”

In this scenario, someone is asking me “are you solving that Exercism problem?” and I’m emerging from a console typing make check-devel shouting “what does it look like I’m doing?”.

So, how do I submit a patch to R itself? I recalled that I detailed some instructions for that the last time I was trying to resolve an R-related bug. I pulled the bleeding-edge SVN commit and built R locally, which means

./configure --with-x=no --without-recommended-packages --disable-byte-compiled-packages --disable-java

and

make -j4

to build the source (using 4 cores).

Then, I searched for the values I might need to change by running

grep -R 3899

This turns up a bunch of files

src/nmath/qnorm.c:                         2.04426310338993978564e-15 + 1.4215117583164458887e-7)*
src/library/utils/all.R:        if(upper > 3899L)
src/library/utils/all.R:            stop(gettextf("too many list items (at most up to %d)", 3899L),
src/library/utils/man/roman.Rd:  Only numbers between 1 and 3899 have a unique representation as roman
src/library/utils/man/roman.Rd:## simple consistency checks -- arithmetic when result is in  {1,2,..,3899} :
src/library/utils/man/format.Rd:    numerals (with the number of the last item maximally 3899).
src/library/utils/R/format.R:        if(upper > 3899L)
src/library/utils/R/format.R:            stop(gettextf("too many list items (at most up to %d)", 3899L),
src/library/utils/po/R-de.po:#~ msgid "too many list items (at most up to number 3899)"
src/library/utils/po/R-de.po:#~ msgstr "zu viele Listenelemente (höchstens 3899!)"
src/library/utils/po/R-fr.po:#~ msgid "too many list items (at most up to number 3899)"
src/library/utils/po/R-fr.po:#~ msgstr "trop d'entrées de liste (3899 maximum)"
src/library/utils/po/R-ja.po:#~ msgid "too many list items (at most up to number 3899)"
src/library/utils/po/R-ja.po:#~ msgstr " リストの項目が多すぎます（最大でも3899です） "
src/library/utils/po/R-zh_CN.po:#~ msgid "too many list items (at most up to number 3899)"
src/library/utils/po/R-zh_CN.po:#~ msgstr "串列项太多(最多只能有3899项("
src/library/utils/po/R-ru.po:#~ msgid "too many list items (at most up to number 3899)"
grep: src/library/utils/po/R-ru.po: binary file matches
src/library/stats/tests/ppr_test.csv:"153",0.0669524872438994,1,0.970846742038665
src/main/g_her_glyph.c:  /******** Hershey Glyphs 3800 to 3899 ********/
src/main/g_her_glyph.c:  /******** Oriental Hershey Glyphs 3800 to 3899 ********/
tests/d-p-q-r-tests.Rout: [1] 0.340823726 0.100413165 0.293668976 0.389968983 0.124439520 0.207270198
tests/reg-examples2.Rout:Zambia          0.16361 -0.07917 -0.33899  0.09406  0.228232  0.7482 0.512
tests/lm-tests.Rout.save:Zambia          0.16361 -0.07917 -0.33899  0.09406  0.228232  0.7482 0.512
tests/lm-tests.Rout:Zambia          0.16361 -0.07917 -0.33899  0.09406  0.228232  0.7482 0.512
tests/d-p-q-r-tests.Rout.save: [1] 0.340823726 0.100413165 0.293668976 0.389968983 0.124439520 0.207270198
tests/Examples/stats-Ex.Rout.save:[49]  0.29918521  0.05938999 -0.20355761 -0.02439309 -1.14548572 -0.94045141
tests/Examples/stats-Ex.Rout.save: 0.05510098  0.59958858  1.14407618  1.68856379  2.23305139  2.77753899 
tests/Examples/stats-Ex.Rout.save:9   0.7335126 -1.3468740  2.8138992
tests/Examples/stats-Ex.Rout:[49]  0.29918521  0.05938999 -0.20355761 -0.02439309 -1.14548572 -0.94045141
tests/Examples/stats-Ex.Rout: 0.05510098  0.59958858  1.14407618  1.68856379  2.23305139  2.77753899 
tests/Examples/stats-Ex.Rout:9   0.7335126 -1.3468740  2.8138992

Many of these I could ignore - float values which happen to match. But many were familiar from when I was digging through the source.
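One way to thin out the float false positives is to match 3899 only when it is not part of a longer number. The following is a rough Python sketch rather than what I actually ran; the pattern name STANDALONE_3899 is hypothetical and the sample lines are copied from the output above.

```python
import re

# Match 3899 only when it is not embedded in a longer number:
# no digit or decimal point immediately before, no digit immediately after.
STANDALONE_3899 = re.compile(r"(?<![\d.])3899(?!\d)")

hits = [
    "src/library/utils/R/format.R:        if(upper > 3899L)",
    "src/library/utils/man/roman.Rd:  Only numbers between 1 and 3899 have a unique representation as roman",
    "tests/lm-tests.Rout:Zambia          0.16361 -0.07917 -0.33899  0.09406  0.228232  0.7482 0.512",
    "src/main/g_her_glyph.c:  /******** Hershey Glyphs 3800 to 3899 ********/",
]

relevant = [line for line in hits if STANDALONE_3899.search(line)]
for line in relevant:
    print(line)
```

The float from lm-tests.Rout is dropped, while the Hershey glyph comment still gets through, so the remaining hits need to be read by a human either way.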

I took the first of these, src/library/utils/all.R, edited the value from 3899 to 3999, and ran svn status: a lot of files were marked ? (untracked) while none were M (modified). This confused me, until I realised that R creates these all.R files as part of the build process, so they aren’t the source-of-truth for the code. When I changed src/library/utils/R/format.R instead, the changes were reflected in svn status.

The format changes were from formatting functions I wasn’t previously familiar with, namely formatOL for “format ordered list” which prefixes a list of items

formatOL(paste0("Chapter", 1:5))

## [1] "1. Chapter1" "2. Chapter2" "3. Chapter3" "4. Chapter4" "5. Chapter5"

and can do the conversion

formatOL(paste0("Chapter", 1:5), type = "Roman")

## [1] "  I. Chapter1" " II. Chapter2" "III. Chapter3" " IV. Chapter4"
## [5] "  V. Chapter5"

and which produces an error if you try to pass in too many items (more than the critical value)

formatOL(10:4010, type = "roman")

Error in formatOL(10:4010, type = "roman") : 
  too many list items (at most up to 3899)

In addition to the logic of the code itself, the .Rd files needed updating (base R does not use Roxygen and automatic generation of man files) and the .po files (language translations) with the translated error messages (which I hopefully corrected accurately; my French is okay-ish and that looks to be correct, and my Japanese is very beginner-level and I haven’t yet learned those words).

Since I had dug through the source, I knew that “3899” was not the only critical value; there were some tests for x >= 3900 I needed to also take care of. This was a useful point to note - if you’re going to change behaviour at a numeric boundary, it’s probably a good idea to stick to either strictly greater than or greater-than-or-equal if you want to make searching for that value straightforward.
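To make the point concrete, here is a small Python sketch (the R source does not look like this, and ROMAN_MAX is a hypothetical name). Both checks behave identically for integers, but only one of them contains the limit as a literal that a text search would find.

```python
ROMAN_MAX = 3999  # hypothetical named constant holding the upper limit


def out_of_range_gt(x: int) -> bool:
    # "strictly greater than" form: the literal is the limit itself
    return x <= 0 or x > ROMAN_MAX


def out_of_range_ge(x: int) -> bool:
    # "greater than or equal" form: the literal is the limit plus one,
    # so a text search for the limit would not find this line
    return x <= 0 or x >= ROMAN_MAX + 1


# For integers the two forms agree everywhere, including at the boundary.
print(all(out_of_range_gt(x) == out_of_range_ge(x) for x in range(-10, 4100)))  # True
```

Keeping the limit in a single named value is another way to sidestep the search problem, since there is then only one place to change.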

Searching for 3900 again turns up a lot of false positives, but also

src/library/utils/R/roman.R:    if(check.range) x[x <= 0L | x >= 3900L] <- NA
src/library/utils/R/roman.R:    ind <- is.na(x) | (x <= 0L) | (x >= 3900L)

so I edited those as well.

Producing the requisite patch is then just

svn diff > as.roman-upper-limit.patch

creating the file I eventually attached to my Bugzilla report.

Kurt Hornik approved these changes and merged them on my behalf (only R-core members can merge into the source) so my changes should be reflected in the next release.

To prove to myself that this was done and dusted, I wanted to pull the latest version again, but I didn’t want to risk that my local changes were what I was seeing. Docker is a good option in this case (and may have been for the development itself).

Beyond my previous instructions, I figured there were more up-to-date instructions out there. This is a slightly newer post about the process, but I eventually just copied the Dockerfile from rocker’s drd image and built the most up-to-date image locally. This means that I was definitely running an independent installation of R (in Docker) and not my local copy (still distinct from my installed copy that RStudio sees).

Using this Docker image, I could confirm that my changes had been merged

as.roman(3999)

[1] MMMCMXCIX

as well as the necessary translations; I noticed that a simple Sys.setenv(LANGUAGE="ru") worked once, but after that I couldn’t change the language again. The {and} package seemed to do the trick

formatOL(1:4000, type = "roman")

Error in formatOL(1:4000, type = "roman") : 
  too many list items (at most up to 3999)

and::set_language("ru")
formatOL(1:4000, type = "roman")

Ошибка в formatOL(1:4000, type = "roman") :
  слишком много элементов в списке (самое большее 3999)

and::set_language("fr")
formatOL(1:4000, type = "roman")

Erreur dans formatOL(1:4000, type = "roman") : 
  trop d'entrées de listes (3999 maximum)

and::set_language("ja")
formatOL(1:4000, type = "roman")

 formatOL(1:4000, type = "roman") でエラー: 
   リストの項目が多すぎます (最大でも 3999 です)

and::set_language("de")
formatOL(1:4000, type = "roman")

Fehler in formatOL(1:4000, type = "roman") : 
  zu viele Listenpunkte (höchstens 3999)

and::set_language("zh_CN")
formatOL(1:4000, type = "roman")

Error in formatOL(1:4000, type = "roman") : 列表项太多(最多只能有3999项)

With all looking good on the R side, now I just need to wait for the changes to be officially released and then for Exercism to update their version to that so that I can submit my “cheat” one-line solution to the Roman Numerals exercise.

I’ll be waiting!

I’m hoping that the process described here can be of use to someone else who finds a similar issue in the language; the process can be fairly straightforward, but I believe success was dependent on a few key points:

don’t assume that the language is broken, it may be something you did wrong or misunderstood
look through the source to confirm what you’re seeing and do your best to understand it
ask around; someone may be able to resolve the confusion or point you in the right direction
“it takes a village” - if you want it to get fixed, you might need to put in some work.

And who knows, maybe it is a bug that’s been there for the last 15+ years?

This definitely ended up being more of an exercise than I planned, but I’m grateful for Exercism prompting me to try out both languages I do and don’t know; for R being open-source so I could find this bug; and for the wealth of knowledge out there to figure out how to do all this.

If you have any comments, feel free to use the comment section below, or hit me up on Mastodon.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.2 (2023-10-31)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2024-02-26
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.18    2023-06-19 [1] CRAN (R 4.3.2)
##  bookdown      0.36    2023-10-16 [1] CRAN (R 4.3.2)
##  bslib         0.6.0   2023-11-21 [3] CRAN (R 4.3.2)
##  cachem        1.0.8   2023-05-01 [3] CRAN (R 4.3.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.6.1   2023-03-23 [3] CRAN (R 4.2.3)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.3.2)
##  digest        0.6.33  2023-07-07 [3] CRAN (R 4.3.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.23    2023-11-01 [3] CRAN (R 4.3.2)
##  fastmap       1.1.1   2023-02-24 [3] CRAN (R 4.2.2)
##  fs            1.6.3   2023-07-20 [3] CRAN (R 4.3.1)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  htmltools     0.5.7   2023-11-03 [3] CRAN (R 4.3.2)
##  htmlwidgets   1.6.2   2023-03-17 [1] CRAN (R 4.3.2)
##  httpuv        1.6.12  2023-10-23 [1] CRAN (R 4.3.2)
##  icecream      0.2.1   2023-09-27 [1] CRAN (R 4.3.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.7   2023-06-29 [3] CRAN (R 4.3.1)
##  knitr         1.45    2023-10-30 [3] CRAN (R 4.3.2)
##  later         1.3.1   2023-05-02 [1] CRAN (R 4.3.2)
##  lifecycle     1.0.4   2023-11-07 [3] CRAN (R 4.3.2)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.3.2)
##  pkgbuild      1.4.2   2023-06-26 [1] CRAN (R 4.3.2)
##  pkgload       1.3.3   2023-09-22 [1] CRAN (R 4.3.2)
##  prettyunits   1.2.0   2023-09-24 [3] CRAN (R 4.3.1)
##  processx      3.8.2   2023-06-30 [3] CRAN (R 4.3.1)
##  profvis       0.3.8   2023-05-02 [1] CRAN (R 4.3.2)
##  promises      1.2.1   2023-08-10 [1] CRAN (R 4.3.2)
##  ps            1.7.5   2023-04-18 [3] CRAN (R 4.3.0)
##  purrr         1.0.2   2023-08-10 [3] CRAN (R 4.3.1)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.11  2023-07-06 [1] CRAN (R 4.3.2)
##  remotes       2.4.2.1 2023-07-18 [1] CRAN (R 4.3.2)
##  rlang         1.1.2   2023-11-04 [3] CRAN (R 4.3.2)
##  rmarkdown     2.25    2023-09-18 [3] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [3] CRAN (R 4.3.1)
##  sass          0.4.7   2023-07-15 [3] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.2)
##  shiny         1.7.5.1 2023-10-14 [1] CRAN (R 4.3.2)
##  stringi       1.8.2   2023-11-23 [3] CRAN (R 4.3.2)
##  stringr       1.5.1   2023-11-14 [3] CRAN (R 4.3.2)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.3.2)
##  usethis       2.2.2   2023-07-06 [1] CRAN (R 4.3.2)
##  vctrs         0.6.4   2023-10-12 [3] CRAN (R 4.3.1)
##  xfun          0.41    2023-11-01 [3] CRAN (R 4.3.2)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.3.2)
##  yaml          2.3.7   2023-01-23 [3] CRAN (R 4.2.2)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.3
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

My First Julia Package - TriangulArt.jl

website@jcarroll.com.au (Jonathan Carroll) — Sun, 04 Feb 2024 00:00:00 +0000

I’ve tried to get this same image transformation working at least three times now, but I can finally celebrate that it’s working! I’ve been (re-)learning Julia and I still love the language, so it was time to take my learning to the next level and actually build a package.

For those not familiar, Julia is a much newer language than my daily-driver R, and with that comes the freedom to take a lot of good features from other languages and implement them. There are some features that R just won’t ever get, but they’re available in Julia and they’re very nice to use.

I’ve written solutions to the first 20 or so Project Euler problems in Julia … wow, 5 years ago.

More recently, I have solved the first 18 days of Advent of Code 2023 in Julia (my solutions are in a fork of a package that I’m not using, so they more or less run independently).

With those under my belt, I revisited a project I’ve tried to implement several times. I like the low-poly look and wanted to recreate it - it’s just an image transformation, right? I’m even somewhat familiar with Delaunay Triangulation, or at least its dual, the Voronoi Tessellation, from my days building spatial maps of fishing areas.

It sounds like a simple enough problem; choose some points all over the image, triangulate between all of them, then shade the resulting triangles with the average colour of the pixels they enclose.

I found this nice image of a rainbow lorikeet (these frequent my backyard)

so I got to work trying to chop it up into triangles.

Well, the naive approach is simple enough, but it produces some terrible results. I’ve built that version into what I did eventually get working, and it’s… not what I want

The problem is that by randomly selecting points across the image, you lose all the structure. With enough triangles you might recover some of that, but then you have a lot of triangles and lose that low-poly vibe.

After much searching for a better way to do this, I found this article from 2017. It’s a Python approach, but I figured I knew enough Julia and Python now that I could try to make a 1:1 translation.

The first step is to get the random sampling working, because it allows me to start testing the triangulation parts quickly. Generating those is pretty clean

function generate_uniform_random_points(image::Matrix{RGB{N0f8}}, n_points::Integer=100)
    ymax, xmax = size(image)[1:2]
    rand(n_points, 2) .* [xmax ymax]
end

The triangulation itself is handled by DelaunayTriangulation::triangulate() - for once I’m happy that there’s so much scientific/statistical support in Julia

rng = StableRNG(2)
tri = triangulate(all_points; rng)

Slightly trickier is figuring out which points are in which triangle. For that, I am thankful for PolygonOps::inpolygon(). With the pixels for each triangle identified, it was only a matter of averaging the R, G, and B channels to get the average colour.
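The idea of that step, sketched in plain Python rather than the package’s Julia: this swaps PolygonOps’ inpolygon for a hand-written sign test, treats the image as a nested list of (r, g, b) tuples, and uses made-up function names throughout, so take it as an illustration of the technique and not the actual implementation.

```python
def point_in_triangle(p, a, b, c):
    """True if point p lies inside (or on the edge of) triangle abc."""
    def cross(o, u, v):
        # z-component of (u - o) x (v - o); its sign says which side v is on
        return (u[0] - o[0]) * (v[1] - o[1]) - (u[1] - o[1]) * (v[0] - o[0])

    d1, d2, d3 = cross(a, b, p), cross(b, c, p), cross(c, a, p)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    # inside when p is on the same side of all three edges
    return not (has_neg and has_pos)


def average_colour(image, triangle):
    """Mean (r, g, b) of the pixels whose centres fall inside the triangle."""
    a, b, c = triangle
    inside = [
        image[y][x]
        for y in range(len(image))
        for x in range(len(image[0]))
        if point_in_triangle((x + 0.5, y + 0.5), a, b, c)
    ]
    if not inside:
        return None  # triangle too small to contain any pixel centre
    n = len(inside)
    return tuple(sum(px[i] for px in inside) / n for i in range(3))
```

Returning None for a triangle that contains no pixel centre is one way to make the “fails to identify an entire triangle” case explicit when pixels are subsampled.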

I got that working, but with the results above - far from pleasant. The next, much harder step, was to weight the points towards the ‘edges’ of the image. I couldn’t find an easy way to translate the Python code for locally sampling the entropy (via skimage)

filters.rank.entropy(im2, morphology.disk(entropy_width))

so I tried to build something of my own. I tried edge-detection algorithms but I was clearly doing something wrong with it. Partly, I suspect, not doing the down-weighting that the python version includes.

Since the pixels we want to up-weight are all along lines, choosing these at random can end up with several right next to each other, which we don’t want. The python version does something a little clever - it selects one point, then reduces the weighting of the entire image with a Gaussian around that point, so that nearby points are unlikely to also be selected.
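My reading of that trick, as a standard-library Python sketch: pick a cell with probability proportional to its weight, then multiply every weight by one minus a Gaussian centred on the pick. The function name, the sigma and seed parameters, and the exact damping formula are assumptions for illustration; the article’s code may differ in the details.

```python
import math
import random


def sample_spread_points(weights, n_points, sigma=2.0, seed=0):
    """Pick n_points (x, y) cells from a 2D weight grid, damping the weights
    around each pick with a Gaussian so later picks tend to land elsewhere."""
    rng = random.Random(seed)
    w = [row[:] for row in weights]  # work on a copy
    height, width = len(w), len(w[0])
    cells = [(x, y) for y in range(height) for x in range(width)]
    points = []
    for _ in range(n_points):
        flat = [w[y][x] for x, y in cells]
        if sum(flat) <= 0:
            break  # nothing left to sample from
        px, py = rng.choices(cells, weights=flat, k=1)[0]
        points.append((px, py))
        for x, y in cells:
            d2 = (x - px) ** 2 + (y - py) ** 2
            # factor is 0 at the picked cell and approaches 1 far away from it
            w[y][x] *= 1.0 - math.exp(-d2 / (2.0 * sigma ** 2))
    return points
```

Because the factor is exactly zero at the chosen cell, the same cell cannot be picked twice, and its neighbours become much less likely.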

In the end, I failed to find a good Julia alternative, but calling python code is (almost) as simple as using PyCall; @pyimport skimage as skimage (with slight modifications to use in a package, as I would later discover).

With that in place, I was able to successfully weight towards high-entropy regions; regions where a larger number of bits is required to encode a histogram of the grayscale pixels, i.e. where there’s a lot going on. The results are much more pleasing

Along the way I added some debug features, such as plotting the vertices and edges of the triangulation on top of the image

With the workflow more or less working, I ran some profiling to see if I could speed it up. Unsurprisingly, generating the weighted points was one area where a lot of time was spent, though it’s not yet clear if that’s because it’s python code or because that’s genuinely one of the most complex parts of the code - my best Julia alternative was to write my own short Shannon entropy function and make it search locally with ImageFiltering::mapwindow

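using StatsBase: countmap

# Shannon entropy (in bits) of the values in the window `w`
function shannon(w::AbstractMatrix)
   p = collect(values(countmap(w))) ./ length(w)  # relative frequency of each distinct value
   -sum(x * log2(x) for x in p)
end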

mapwindow(shannon, image, (19, 19))

though this uses a square window, whereas the Python version uses a nicer disk.
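For comparison, here is a plain-Python sketch of the same idea with a disk-shaped neighbourhood. It is not skimage’s implementation and the function names are made up; it only shows what “entropy of the values within a radius” means.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def disk_window(grid, cx, cy, radius):
    """Values of `grid` within `radius` of (cx, cy), clipped at the borders."""
    return [
        grid[y][x]
        for y in range(max(0, cy - radius), min(len(grid), cy + radius + 1))
        for x in range(max(0, cx - radius), min(len(grid[0]), cx + radius + 1))
        if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
    ]


def local_entropy(grid, radius):
    """Entropy of the disk-shaped neighbourhood around every cell."""
    return [
        [shannon_entropy(disk_window(grid, x, y, radius)) for x in range(len(grid[0]))]
        for y in range(len(grid))
    ]
```

A flat region scores 0 and the score rises as more distinct values share the neighbourhood, which is what pushes the sampled points towards busy parts of the image.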

The profiling shows a lot of blank areas, and I’m not sure how to interpret those

I realised at this point that I actually didn’t know how long the python version takes to run. I grabbed the original source code and tried running it (after installing the relevant python packages) but it failed - some packages had changed their arguments and signatures since this was written. A couple of small updates later, my fork now runs the code. It doesn’t take terribly long to run - it doesn’t display the image, it saves it, and I’m not sure if that’s a factor. I (naively?) expected that the Julia version would be a lot faster, and I’m hopeful that there’s performance I’ve left on the table.

If anyone is interested in playing with a small-ish Julia package, feel free to poke at this - it’s definitely not being used for anything critical.

For now, I’m enjoying throwing images at this and getting some nice looking results

If you’re interested in having a play with this package or helping to improve it, it’s on GitHub - I’m not planning to publish it to the registry any time soon, but that’s perhaps something to look forward to in the future. For now, the main issues I see with this package are:

The white border around the produced image remains - I have tried setting margin=0mm but that doesn’t appear to help
Performance is not as good as it can be, I suspect; the entropy calculation (calling python) is definitely a bottleneck.
To speed up the processing, only every 10th pixel is used to determine the average colour of the triangle - this may fail to identify an entire triangle.
CI - I generated this package in VSCode using PkgTemplates and it is the first Julia package I’ve built. CI failed immediately, so I’ve probably done something wrong.
I am still somewhat of a beginner in Julia, so there are probably many places in which improvements can be made - feel free to suggest them!

As always, I can be found on Mastodon and the comment section below.


Making Links a Little Less Hyper

website@jcarroll.com.au (Jonathan Carroll) — Sun, 17 Dec 2023 00:00:00 +0000

Hyperlinks are great - they add value to a block of text by adding additional links out to more things to read - but they’re a distraction if you’re trying to read an in-depth piece of text and comprehend it linearly. Let’s hack the web!

I’ve been reading the updated edition of The Shallows: What the Internet Is Doing to Our Brains and one particular point resonated with me - on discussing the differences between print books and web content, the author notes that the proliferation of hyperlinks throughout the latter distracts us from reading because we constantly assess whether or not we will click a link, or worse, we click it and read that instead.

I’m very familiar with this distraction from reading long-form articles, and I typically end up with a backlog of open tabs that I plan to get to later so that I’m not too distracted from what I’m currently reading. With that said, I do reflect on the fact that I don’t think this is a new thing - I eventually stopped reading the footnotes in several non-fiction books recently because I felt it was too hard to keep flipping between the text and the itemized notes at either the end of the chapter or book.

Footnotes, with a number next to a word or at the end of a phrase, can be overlooked a bit easier than inline hyperlinks, where some stretch of text serves as the anchor to some other page, e.g. this link to Wikipedia without having to show the actual URL of the page.

I started thinking - what if we just turn the hyperlinks off, temporarily? We can do a full reading of some content, then come back with the hyperlinks back on to see what additional content there is to read, knowing that we’ve already parsed the primary content and are now available to investigate the secondary components. I think that distinction is significant - if it was primary content, it would be in the main text, but by using a hyperlink or a footnote, it is deemed secondary.

Web content has the benefit that, while it’s presented in one particular way on your screen, the instructions for doing so come along for the ride behind the scenes in the form of HTML, JavaScript, CSS, and other browser-related things. With the right knowledge, they’re malleable, and we can change them. What if we just find all the hyperlinks and make them look like regular text?

I searched for whether or not such a thing already exists, and sure enough one of the first results I found was a post from someone who came to exactly the same conclusion after reading the same book! In their post they describe building a Firefox extension that does what I want - turns off styling by overriding the CSS of anchor tags so that they inherit their parent styling

a,
a:hover,
a:focus,
a:active,
a:visited {
    text-decoration: none !important;
    color: inherit !important;
    background-color: inherit !important;
    border-bottom: initial !important;
}

This works for that person on that browser, but I want something a bit more general.

I don’t exactly know why they fell out of favor, but JavaScript bookmarklets are great - a bookmark containing JavaScript code you can run on any page. This seemed like what I wanted, so I started hacking away and eventually came up with this function

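(function () {
    // look for a style element injected by a previous run
    const ql = document.getElementById('quietlinks');
    if (ql) {
        // already active: remove it to restore the page's own link styling
        ql.parentNode.removeChild(ql);
    } else {
        // not active: inject a style element that overrides the link styling
        const s = document.createElement('style');
        s.id = "quietlinks";
        s.innerText = "a, a:hover, a:focus, a:active, a:visited { text-decoration: none !important; color: inherit !important; background-color: inherit !important; border-bottom: initial !important;}";
        document.head.appendChild(s);
    }
})();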

It checks to see if there is an element with the ID "quietlinks" and if not, it creates a new overriding style element with the above CSS and appends that to the page’s head. If it does find such an element, it removes it from the page. This function therefore toggles between applying custom CSS and removing it.

To work in a bookmarklet it all needs to be on one line and the double quotes need to be percent-encoded (as %22), and the minifier I used shortens variable names, so it ends up looking like this

javascript:!function(){if(ql=document.getElementById(%22quietlinks%22),ql)ql.parentNode.removeChild(ql);else{var e=document.createElement(%22style%22);e.id=%22quietlinks%22,e.innerText=%22a, a:hover, a:focus, a:active, a:visited { text-decoration: none !important; color: inherit !important; background-color: inherit !important; border-bottom: initial !important;}%22,document.head.appendChild(e)}}();

To save this bookmarklet, drag this link into your bookmarks bar (or just click it to try it on this page)

QuietLinks

Now you can use it - here are a few links to nowhere that are styled. Click on the saved bookmarklet, and they should lose their styling. Click on it again and the styling will return.

When text contains a lot of links it can be difficult to read because we continually need to stop to decide whether or not we will follow the link. Disabling the hyperlink decoration could be helpful.

These are all fake links that go literally nowhere, so they don’t have the domain suffix (inserted via JavaScript, as described in this post) but on a regular site, the result should look something like this

Note that the hyperlinks remain - they’re still clickable - but they no longer have special styling applied to them (commonly blue with underline) so they just read like regular text.

There are some edge cases I discovered already, e.g. the links on Mastodon are all <span> elements, not <a>, so they aren’t affected. My intention was for reading long articles, though, so that doesn’t bother me too much. This approach is also a bit greedy in finding anchors, so the ‘share’ icons below disappear as their styling is disrupted. I’ll keep using this for a while and see if it really helps me comprehend what I’m reading.

Comments and improvements most welcome. I can be found on Mastodon or use the comments below.

P.S. I’m conscious of the irony in writing a post filled with hyperlinks while at the same time disparaging them, but with this tool at hand, it’s fine!

Advent of Array Elegance (AoC2023 Day 7)

website@jcarroll.com.au (Jonathan Carroll) — Sun, 10 Dec 2023 00:00:00 +0000

I’m solving Advent of Code this year using more relaxed criteria than last year, in that I’m allowing myself to use packages where they’re helpful, rather than strictly base R. Last year I re-solved half of the exercises using Rust, which helped me learn a lot about Rust. This year I’m enamored with APL, and I wanted to share a particularly beautiful solution.

⚠⚠⚠⚠⚠

Spoilers ahead for Day 7, in case you haven’t yet completed it yourself.

⚠⚠⚠⚠⚠

I solved Day 7 of Advent of Code using base R by testing whether or not a given hand was of each type with an individual function, either returning 0 (if it was not of that type) or N + a score, where N was sufficiently different between each type that they would sort nicely. For the score, I initially tried offsetting each card in a poor-man’s base-15 as 15^(4:0)*card_score but later improved on that by using hex digits (which automatically sort nicer). The large N values ensured that ‘type’ would be sorted before the first, second, and subsequent cards.

That was sufficient to do an apply(strength, hands), calculate the order of those, and multiply by the relevant bids. Aside from a bug not caught by the test case (the difference between bid*order(x) and bid[order(x)]*seq_along(x)) it was an okay solution to the problem, and it worked.

After solving each day, I’ve been trying to re-solve using APL; in particular Dyalog APL. For those who don’t know, APL is an old language (circa 1960s) born from a mathematical notation in which a single glyph (symbol) represents some operation or application of a function. This makes it look very different to more modern languages, partly because of the glyphs, but also because it requires no boilerplate whatsoever. As an array language, it deals with vectors and matrices without needing to “loop over columns” or “for i in values”. It looks scary at first, but it’s really not - once you’re familiar with the glyphs it’s actually beautiful!

Let’s say you have a matrix m which contains the values 1 through 9

and you want to sum the columns. Chances are, the language you normally use will require you to first calculate the size of the matrix, maybe even perform a loop. In APL, it’s

    +⌿m
12 15 18

⌿ is the glyph for “reduce along first axis”, or perform some operation (supplied as its left argument) to its right argument. +⌿ is therefore “sum columns”. No boilerplate, just a direct explanation (the glyphs themselves are better names than any word you could attach to them) of what needs to be done.
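For contrast, here is roughly what the same column sum looks like without array primitives. This is only a sketch in Rust, assuming a rectangular matrix stored as a vector of rows; the name `column_sums` is my own and not from any library.

```rust
/// Sum each column of a rectangular matrix stored as a slice of rows.
/// `column_sums` is a hypothetical helper name used only for illustration.
fn column_sums(m: &[Vec<i32>]) -> Vec<i32> {
    // Work out the number of columns from the first row (0 for an empty matrix).
    let ncol = m.first().map_or(0, |row| row.len());
    let mut sums = vec![0; ncol];
    // Loop over rows, then over columns: the boilerplate that +⌿ hides.
    // Assumes every row has exactly `ncol` entries.
    for row in m {
        for (j, value) in row.iter().enumerate() {
            sums[j] += value;
        }
    }
    sums
}
```

Calling `column_sums(&[vec![1, 2, 3], vec![4, 5, 6], vec![7, 8, 9]])` gives `[12, 15, 18]`, the same values as the APL result, but only after spelling out the size calculation and both loops.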

Sure, you need to learn the glyphs, and potentially even how to enter them; one option being a prefix key then a corresponding key. How committed am I to learning those, you ask? Well, here’s my laptop

My laptop with APL stickers on the keys

I considered using APL for my Day 7 solution, but with so many functions defined, and such fiddly if/else logic, I figured it was just ill-suited to APL. Then I saw a video recap of an APL solution for Day 7 and my mind was blown.

Meanwhile, I saw a post from Elias Mårtenson, creator of the Kap language, promoting some examples of Kap and was even more interested given that it can do some things that (Dyalog) APL can’t, like produce graphics.

Can your APL do this?

    chart:line mtcars 
┌→──────────────────────────────────────────────────────────────┐
↓1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1│
│4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4│
│4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2│
└───────────────────────────────────────────────────────────────┘

chart:line mtcars in Kap

Kap is a fairly new APL-based language (written in Kotlin) that supports most of Dyalog APL, but adds some cool extensions and alterations like lazy evaluation and parallel execution.

Uiua is another new language on the scene (written in Rust) which also supports graphics; the Uiua logo itself is written in Uiua

Xy ← °⍉⊞⊟. ÷÷2: -÷2,⇡.200
Rgb ← [:°⊟×.Xy ↯△⊢Xy0.5]
u ← ↥<0.2:>0.7.+×2 ×.:°⊟Xy
c ← <:⍜°√/+ Xy
⍉⊂:-¬u c1 +0.1 ↧¤c0.95Rgb

Uiua logo, coded in Uiua

The online editor for Uiua uses colours to distinguish different functions/operators, and the author has the flexibility to do what they want with that, so it’s awesome to see what they’ve used for “all” (⋔) and “transpose” (⍉)…

Uiua coloured glyphs for ‘all’ and ‘transpose’

I figured I’d try to reproduce the APL solution in Kap as a way to learn more about that language. The APL/Kap solution is so elegant! Additionally, I tried writing equivalent R code. I’ll interleave all three in this post (a nice excuse to get tabsets working!).

Reading Input

To start with, get the data into the workspace - this reads in a vector with each element representing a row of input

APL

Reading from a file is performed using ⎕NGET

    ⊃⎕NGET'p07.txt'1

 32T3K 765  T55J5 684  KK677 28  KTJJT 220  QQQJA 483

Kap

Kap uses some namespaces, which makes reading in a bit nicer, and the output is boxed, with explicit quotes for strings

p ← io:read "p07.txt"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃"32T3K 765" "T55J5 684" "KK677 28" "KTJJT 220" "QQQJA 483"┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

R

readLines reads in each line as an element of a vector

p <- readLines("example07.txt")
p

## [1] "32T3K 765" "T55J5 684" "KK677 28"  "KTJJT 220" "QQQJA 483"

Preprocessing

The input consists of hands of cards juxtaposed with a bid value, separated by a space. The approach here is not to treat them individually, but to create a matrix containing columns of hands and columns of bids.

APL

Partition ((≠⊆⊢)) on spaces (' ') for each (¨) row
```
    ' '(≠⊆⊢)¨⊃⎕NGET'p07.txt'1
```
```
 32T3K  765    T55J5  684    KK677  28    KTJJT  220    QQQJA  483
```
It’s not entirely clear from this layout, but this is a vector of length-2 vectors.

These are “mixed” (stacked; ↑), and the result assigned (←) to p
```
p←↑' '(≠⊆⊢)¨⊃⎕NGET'p07.txt'1
```
```
 32T3K  765 
 T55J5  684 
 KK677  28  
 KTJJT  220 
 QQQJA  483 
```
This is now a matrix, where the first column contains the hands, the second (last) column contains the bids.

Kap

Rather than the partition idiom, Kap has regex support, so splitting the components involves regex:split for each (¨) element of input

p←⊃{" " regex:split ⍵}¨p

┌→────────────┐
↓"32T3K" "765"│
│"T55J5" "684"│
│"KK677"  "28"│
│"KTJJT" "220"│
│"QQQJA" "483"│
└─────────────┘

R

The boilerplate of R’s matrix construction takes a toll after using APL/Kap…

p <- matrix(unlist(strsplit(p, " ")), ncol = 2, byrow = TRUE)
p

##      [,1]    [,2] 
## [1,] "32T3K" "765"
## [2,] "T55J5" "684"
## [3,] "KK677" "28" 
## [4,] "KTJJT" "220"
## [5,] "QQQJA" "483"

Extraction

The hands and bids can be extracted into their own variables.

APL

This can be achieved several ways, but a clean way is by reducing (/) with either the ‘leftmost’ (⊣) or ‘rightmost’ (⊢) function, and evaluating (executing ⍎) each (¨) of the bids to convert from strings to numbers
```
hands←⊣/p
bids←⍎¨⊢/p
```
Kap

Kap uses exactly the same approach as APL for this
```
hands←⊣/p
bids←⍎¨⊢/p
```
R

R’s ‘subset by index’ works just fine, but if this was generalised I’d use something like p[, ncol(p)] to get to the last column
```
hands <- p[,1]
hands
```
```
## [1] "32T3K" "T55J5" "KK677" "KTJJT" "QQQJA"
```
```
bids <- as.integer(p[,2])
bids
```
```
## [1] 765 684  28 220 483
```
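For comparison with the three versions above, here is a minimal Rust sketch of the same preprocessing and extraction steps: split each line at the space, keep the hand as text, and convert the bid to a number. The function name `parse_hands` is an assumption of mine, and silently skipping malformed lines is a simplification.

```rust
/// Split each "HAND BID" line into a hand (text) and a bid (number).
/// `parse_hands` is an illustrative name; malformed lines are skipped
/// rather than reported, which a real solution might not want.
fn parse_hands(lines: &[&str]) -> (Vec<String>, Vec<u32>) {
    lines
        .iter()
        .filter_map(|line| {
            // split_once returns None when the line contains no space
            let (hand, bid) = line.split_once(' ')?;
            // parse() fails for non-numeric bids; ok()? skips that line
            let bid = bid.trim().parse::<u32>().ok()?;
            Some((hand.to_string(), bid))
        })
        // Turn the iterator of pairs into a pair of vectors: hands and bids
        .unzip()
}
```

The `unzip()` at the end plays the role of taking the first and last columns of the matrix: it returns the hands and the bids as two separate vectors.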

Tabulate

Now comes the interesting part! Rather than deal with the types separately, one approach is to identify them by their relative counts; a five-of-a-kind has 5 of one card and nothing else; a four-of-a-kind has four of one and one of another.

APL

APL has a “key” operator (⌸) which takes a function as its left operand; that function can count the occurrences of each element with “tally” (≢)
```
      {⍺,≢⍵}⌸'TGGATAACTTGAAC'
```
```
T 4
G 3
A 5
C 2
```
In this case, we can get just the tallied count of each card in the hand with
```
    {⊢∘≢⍵}⌸¨hands
```
```
2 1 1 1  1 3 1  2 1 2  1 2 2  3 1 1
```
We can then sort (⍵[⍒⍵]) these, take just the first two values (2↑), and decode (⊥) using base 10 to a single number. A nice feature of APL is that trying to take the “first N” elements of a vector shorter than N pads the result to the full N with zeroes.
```
    f←{10⊥2↑{⍵[⍒⍵]}⊢∘≢⌸⍵}
    f¨hands
```
```
21 31 22 22 31
```

Kap

Kap doesn’t have the equivalent Key, but after some discussion with the creator, it’s entirely possible to get something that does the same

  key⇐(⍪+⌿≡⌻)∘∪ ⍝ using outer product - see the R solution
  key2⇐{u←∪⍵ ⋄ c←⍸˝∧u⍳⍵} ⍝ using inverse 'where' and 'index of'

  key2¨hands

┌→────────────────────────────────────────┐
│┌→──────┐ ┌→────┐ ┌→────┐ ┌→────┐ ┌→────┐│
││2 1 1 1│ │1 3 1│ │2 1 2│ │1 2 2│ │3 1 1││
│└───────┘ └─────┘ └─────┘ └─────┘ └─────┘│
└─────────────────────────────────────────┘

The rest is the same as APL, except Kap uses a dedicated sort (∨)

    handrank⇐{10⊥2↑∨⊢/key ⍵}
    handrank¨hands

┏━━━━━━━━━━━━━━┓
┃21 31 22 22 31┃
┗━━━━━━━━━━━━━━┛

R

I wanted to recreate the above approach in R, so this will take the long way ’round.

First, we need a ‘key’ function

key <- function(x) {
  l <- strsplit(x, "")[[1]]
  setNames(colSums(outer(l, unique(l), "==")), unique(l))
}

sapply(hands, key)

## $`32T3K`
## 3 2 T K 
## 2 1 1 1 
## 
## $T55J5
## T 5 J 
## 1 3 1 
## 
## $KK677
## K 6 7 
## 2 1 2 
## 
## $KTJJT
## K T J 
## 1 2 2 
## 
## $QQQJA
## Q J A 
## 3 1 1

The idea of this is to create an outer product between the set of unique letters in the string, and the individual letters, performing an == check on each combination

y <- strsplit(hands[2], "")[[1]]
outer(y, unique(y), "==")

##       [,1]  [,2]  [,3]
## [1,]  TRUE FALSE FALSE
## [2,] FALSE  TRUE FALSE
## [3,] FALSE  TRUE FALSE
## [4,] FALSE FALSE  TRUE
## [5,] FALSE  TRUE FALSE

This is, of course, unnecessary as R has a way to do this

table(y)

## y
## 5 J T 
## 3 1 1

but I wanted to see how to do it from scratch.

Applying this over the hands, we can sort each of the counts again, but now taking the first two values fails for the five-of-a-kind which only has a 5, so in that case I add the missing 0. Decoding as base 10 can be done a couple of ways, but pasting and converting seems to work fine.

handrank <- function(x) {
  rank <- sort(sapply(x, key), decreasing = TRUE)
  if (length(rank) == 1) rank <- c(rank, 0)
  as.integer(paste(rank[1:2], collapse = ""))
}

sapply(hands, handrank)

## 32T3K T55J5 KK677 KTJJT QQQJA 
##    21    31    22    22    31
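The tabulate, sort, take-two, decode pipeline is not specific to array languages. Below is a hedged Rust sketch of the same idea; `hand_type` is a name I made up for illustration, and a `HashMap` stands in for Key or `table()`.

```rust
use std::collections::HashMap;

/// Encode a hand's type from its two largest card counts, e.g. "T55J5" -> 31.
/// `hand_type` is a hypothetical name; the encoding follows the 10⊥2↑ idea.
fn hand_type(hand: &str) -> u32 {
    // Tally how many times each card appears (the job of Key or table()).
    let mut counts: HashMap<char, u32> = HashMap::new();
    for card in hand.chars() {
        *counts.entry(card).or_insert(0) += 1;
    }
    // Sort the tallies from largest to smallest.
    let mut tallies: Vec<u32> = counts.into_values().collect();
    tallies.sort_unstable_by(|a, b| b.cmp(a));
    // Take the first two, padding with 0 for five-of-a-kind, and decode base 10.
    let first = tallies.first().copied().unwrap_or(0);
    let second = tallies.get(1).copied().unwrap_or(0);
    10 * first + second
}
```

For the five example hands this gives 21, 31, 22, 22 and 31, and a five-of-a-kind such as "AAAAA" gives 50 thanks to the explicit zero padding.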

Subsequent Rankings and Answer

Finally, the part where the ‘array’ approach shines! Rather than constructing some sortable number for each hand, we can just score each card and use an array.

APL

Creating a vector of all the cards is aided by the ‘numbers as a string’ helper ⎕D. Drop the first two of these (2↓) then append the ‘face’ cards
```
    r←'TJQKA',⍨2↓⎕D
    r
```
```
23456789TJQKA
```
Stacking the hands into a matrix of cards
```
    ↑hands
```
```
32T3K
T55J5
KK677
KTJJT
QQQJA
```
we can ask for the index of matches to the individual cards with ⍳
```
    r⍳↑hands
```
```
 2  1  9  2 12
 9  4  4 10  4
12 12  5  6  6
12  9 10 10  9
11 11 11 10 13
```
Prepending (,) each row with the tabulated type of each hand
```
    r{⍵,⍺⍳↑hands}f¨hands
```
```
21  2  1  9  2 12
31  9  4  4 10  4
22 12 12  5  6  6
22 12  9 10 10  9
31 11 11 11 10 13
```
Now, some real magic… APL supports “total array ordering” which means we can just sort the entire thing - it will sort by the first column, using the second and subsequent columns for ties. Given that the first column is the ‘type’ of hand, and subsequent columns are values of each card in order, that’s precisely the sorting we need!
```
    r{⍋⍋⍵,⍺⍳↑hands}f¨hands
```
```
1 4 3 2 5
```
There’s a nice discussion from BQN about why the double grading is needed

Finally, multiplying by the bids themselves, and sum-reducing gives the final answer
```
  +/r{bids×⍋⍋⍵,⍺⍳↑hands}f¨hands
```
```
6440
```
Kap

This is mostly the same solution as APL, except I couldn’t find the ‘numbers as string’ helper so I just typed it out. Kap also uses ‘disclose’ (⊃) in place of mix (↑) (ref).
```
    ranks←"23456789TJQKA"
    +/ranks{bids×1+⍋⍋⍵,⍺⍳⊃hands}handrank¨hands
```
```
6440
```

R

R doesn’t support Total Array Ordering, but it does seem to have a way to do it, or so say the documentation examples for order

## or along 1st column, ties along 2nd, ... *arbitrary* no.{columns}:
dd[ do.call(order, dd), ]

That only works for a data.frame, which is a list (per do.call’s requirement). We can still work with that. First, smoosh together all the hands and convert the individual cards to a matrix - again, a long line of commands for what is reasonably straightforward in APL… 3 3⍴'abcdefghi' reshapes those 9 letters into a 3x3 matrix.

m <- matrix(strsplit(paste0(hands, collapse = ""), "")[[1]], ncol = 5, byrow = TRUE)
m

##      [,1] [,2] [,3] [,4] [,5]
## [1,] "3"  "2"  "T"  "3"  "K" 
## [2,] "T"  "5"  "5"  "J"  "5" 
## [3,] "K"  "K"  "6"  "7"  "7" 
## [4,] "K"  "T"  "J"  "J"  "T" 
## [5,] "Q"  "Q"  "Q"  "J"  "A"

The individual cards vector benefits from coercing the digits to characters

ranks <- c(2:9, "T", "J", "Q", "K", "A")

The index mapping does actually work nicely with match, except it returns a single vector, not a matrix, so we need to reshape yet again. Plus, this time, the matches went down columns not along rows, so we need to use byrow = FALSE

mm <- matrix(match(m, ranks), ncol = 5, byrow = FALSE)
mm

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    1    9    2   12
## [2,]    9    4    4   10    4
## [3,]   12   12    5    6    6
## [4,]   12    9   10   10    9
## [5,]   11   11   11   10   13
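The index lookup itself is simple to express elsewhere too. As a rough Rust sketch (the name `card_values` and the choice to map unknown cards to 0 are my assumptions), each card is replaced by its 1-based position in the ranking string:

```rust
/// Map each card of a hand to its 1-based position in the ranking string,
/// mirroring r⍳↑hands and match(m, ranks). `card_values` is an illustrative name.
fn card_values(hand: &str) -> Vec<u32> {
    const RANKS: &str = "23456789TJQKA";
    hand.chars()
        // find() returns a 0-based byte offset; every rank is ASCII, so that
        // offset is also the character index. Unknown cards map to 0 here,
        // whereas match() in R would return NA.
        .map(|card| RANKS.find(card).map_or(0, |i| (i as u32) + 1))
        .collect()
}
```

Applied to "32T3K" this returns `[2, 1, 9, 2, 12]`, the first row of the matrix above. No reshaping is needed because each hand is processed as its own row.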

Prepending with the type rankings does work nicely via cbind

g <- cbind(sapply(hands, handrank), mm)
g

##       [,1] [,2] [,3] [,4] [,5] [,6]
## 32T3K   21    2    1    9    2   12
## T55J5   31    9    4    4   10    4
## KK677   22   12   12    5    6    6
## KTJJT   22   12    9   10   10    9
## QQQJA   31   11   11   11   10   13

Then ordering with the do.call idiom

gdf <- as.data.frame(g)
gdf[do.call(order, gdf), ]

##       V1 V2 V3 V4 V5 V6
## 32T3K 21  2  1  9  2 12
## KTJJT 22 12  9 10 10  9
## KK677 22 12 12  5  6  6
## T55J5 31  9  4  4 10  4
## QQQJA 31 11 11 11 10 13

Putting this all together into a function

sortrank <- function(x, y) {
  m <- matrix(strsplit(paste0(y, collapse = ""), "")[[1]], ncol = 5, byrow = TRUE)
  mm <- matrix(match(m, x), ncol = 5, byrow = FALSE)
  g <- cbind(sapply(y, handrank), mm)
  do.call(order, as.data.frame(g))
}

sortrank(ranks, hands)

## [1] 1 4 3 2 5

This isn’t the double sorting that APL and Kap used, and that little difference is what held me up for all too long trying to figure out why my solution passed tests but gave the wrong answer. Annoyingly, this mistake doesn’t show up in the test case because the ranks only differ by a switched place. The true input was not so kind.

This result is the order in which we need to place the bids, so doing that, then multiplying by the position (since it’s sorted, this is just a vector from 1 to the number of elements) we get the answer

sum(bids[sortrank(ranks, hands)]*seq_along(bids))

## [1] 6440
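The difference between a single grade (an ordering) and a double grade (a ranking) is easier to see in a small sketch. The Rust below is illustrative only and all function names are my own. It relies on the fact that `Vec<u32>` compares lexicographically, so a row of "type, then card values" sorts by type first and breaks ties on later columns.

```rust
/// Ascending grade: the indices that would sort `rows` (like APL's ⍋ or R's order()).
/// Vec<u32> compares lexicographically, so ties in the first column fall
/// through to later columns. Function names here are illustrative only.
fn grade(rows: &[Vec<u32>]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..rows.len()).collect();
    idx.sort_by(|&a, &b| rows[a].cmp(&rows[b]));
    idx
}

/// 1-based rank of each row: the position it would occupy once sorted.
/// This inverts the grade, which is what the second ⍋ achieves.
fn rank(rows: &[Vec<u32>]) -> Vec<usize> {
    let order = grade(rows);
    let mut ranks = vec![0; rows.len()];
    for (position, &row) in order.iter().enumerate() {
        ranks[row] = position + 1;
    }
    ranks
}

/// Total winnings: each bid multiplied by the rank of its hand.
fn winnings(rows: &[Vec<u32>], bids: &[u32]) -> u32 {
    rank(rows)
        .iter()
        .zip(bids)
        .map(|(&r, &bid)| (r as u32) * bid)
        .sum()
}
```

For three single-column rows holding 3, 1 and 2, `grade` returns `[1, 2, 0]` (zero-based: where each sorted element came from) while `rank` returns `[3, 1, 2]` (where each element ends up). In the five-hand example the two happen to agree once the offset is accounted for, which is why the test case does not expose the mistake; `winnings` on those rows and bids gives 6440.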

Summary

So, how do these solutions all look? I’ll stop with the tabsets for a side-by-side comparison.

Compacting the APL solution (which does involve some duplication) it’s as simple as

p←↑' '(≠⊆⊢)¨⊃⎕NGET'p07.txt'1
+/('TJQKA',⍨2↓⎕D){(⍎¨⊢/p)×⍋⍋⍵,⍺⍳↑⊣/p}{10⊥2↑{⍵[⍒⍵]}⊢∘≢⌸⍵}¨⊣/p

which, admittedly, requires a fair amount of unpacking to read. In full form, it’s

p←↑' '(≠⊆⊢)¨⊃⎕NGET'p07.txt'1
hands←⊣/p
bids←⍎¨⊢/p
f←{10⊥2↑{⍵[⍒⍵]}⊢∘≢⌸⍵}
r←'TJQKA',⍨2↓⎕D
+/r{bids×⍋⍋⍵,⍺⍳↑hands}f¨hands

which is still pretty nice, considering what it’s doing.

The R solution, somewhat minimally, and leveraging table, is

handrank <- function(x) {
  rank <- sort(sapply(strsplit(x, ""), table), decreasing = TRUE)
  if (length(rank) == 1) rank <- c(rank, 0)
  as.integer(paste(rank[1:2], collapse = ""))
}

sortrank <- function(x, y) {
  m <- matrix(strsplit(paste0(y, collapse = ""), "")[[1]], ncol = 5, byrow = TRUE)
  mm <- matrix(match(m, x), ncol = 5, byrow = FALSE)
  g <- cbind(sapply(y, handrank), mm)
  do.call(order, as.data.frame(g))
}

solve <- function(x) {
  p <- matrix(unlist(strsplit(x, " ")), ncol = 2, byrow = TRUE)
  hands <- p[,1]
  bids <- as.integer(p[,2])
  ranks <- c(2:9, "T", "J", "Q", "K", "A")
  sum(bids[sortrank(ranks, hands)]*seq_along(bids))
}

solve(readLines("example07.txt"))

Certainly more typing, but still a much shorter solution than the one I originally came up with.

Takeaways

Both APL and Kap (and so many other languages) benefit greatly from treating a string as an array of characters. This always hurts in R, where strsplit(x, "") is needed.

The array approach here highlights how one can think differently about a problem, provided the tools are at hand.

Kap has a lot to offer - it’s (vastly) newer, which comes with both advantages (can do new things) and disadvantages (things need to be implemented, and they won’t necessarily carry over 1:1).

Advent of Code once again proves to be a useful exercise.

One more thing

I saw a solution in Uiua on Mastodon and had to give it a go, too…

Input ← ⊜(⊜□≠@ .)≠@\n.&fras"p07.txt"
Label ← ⇌"AKQJT98765432"
Bids ← ⋕⊢↘1⍉
Cards ← ⊐≡(⊗:Label)⊢⍉
Types ← 0_1_2_4_8_5_10_9_3_6_12_11_13_7_14_15⊚1_4_3_3_2_2_1
TypeStr ← ⊏⊗⊙Types≡(°⋯≡/=◫2⊏⍏.)
/+×+1⍏⍏/+×ⁿ⇌⇡6⧻Label⊂⊃TypeStr⍉⊃Cards Bids Input

I think this is taking the same approach, though unpacking this is even trickier.

Comments and improvements most welcome. I can be found on Mastodon or use the comments below.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.2 (2023-10-31)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-12-10
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.18    2023-06-19 [1] CRAN (R 4.3.2)
##  bookdown      0.36    2023-10-16 [1] CRAN (R 4.3.2)
##  bslib         0.5.1   2023-08-11 [3] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [3] CRAN (R 4.3.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.6.1   2023-03-23 [3] CRAN (R 4.2.3)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.3.2)
##  digest        0.6.33  2023-07-07 [3] CRAN (R 4.3.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.22    2023-09-29 [3] CRAN (R 4.3.1)
##  fastmap       1.1.1   2023-02-24 [3] CRAN (R 4.2.2)
##  fs            1.6.3   2023-07-20 [3] CRAN (R 4.3.1)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  htmltools     0.5.6.1 2023-10-06 [3] CRAN (R 4.3.1)
##  htmlwidgets   1.6.2   2023-03-17 [1] CRAN (R 4.3.2)
##  httpuv        1.6.12  2023-10-23 [1] CRAN (R 4.3.2)
##  icecream      0.2.1   2023-09-27 [1] CRAN (R 4.3.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.7   2023-06-29 [3] CRAN (R 4.3.1)
##  knitr         1.44    2023-09-11 [3] CRAN (R 4.3.1)
##  later         1.3.1   2023-05-02 [1] CRAN (R 4.3.2)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.3.2)
##  pkgbuild      1.4.2   2023-06-26 [1] CRAN (R 4.3.2)
##  pkgload       1.3.3   2023-09-22 [1] CRAN (R 4.3.2)
##  prettyunits   1.2.0   2023-09-24 [3] CRAN (R 4.3.1)
##  processx      3.8.2   2023-06-30 [3] CRAN (R 4.3.1)
##  profvis       0.3.8   2023-05-02 [1] CRAN (R 4.3.2)
##  promises      1.2.1   2023-08-10 [1] CRAN (R 4.3.2)
##  ps            1.7.5   2023-04-18 [3] CRAN (R 4.3.0)
##  purrr         1.0.2   2023-08-10 [3] CRAN (R 4.3.1)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.11  2023-07-06 [1] CRAN (R 4.3.2)
##  remotes       2.4.2.1 2023-07-18 [1] CRAN (R 4.3.2)
##  rlang         1.1.1   2023-04-28 [3] CRAN (R 4.3.0)
##  rmarkdown     2.25    2023-09-18 [3] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [3] CRAN (R 4.3.1)
##  sass          0.4.7   2023-07-15 [3] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.2)
##  shiny         1.7.5.1 2023-10-14 [1] CRAN (R 4.3.2)
##  stringi       1.7.12  2023-01-11 [3] CRAN (R 4.2.2)
##  stringr       1.5.0   2022-12-02 [3] CRAN (R 4.3.0)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.3.2)
##  usethis       2.2.2   2023-07-06 [1] CRAN (R 4.3.2)
##  vctrs         0.6.4   2023-10-12 [3] CRAN (R 4.3.1)
##  xfun          0.40    2023-08-09 [3] CRAN (R 4.3.1)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.3.2)
##  yaml          2.3.7   2023-01-23 [3] CRAN (R 4.2.2)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.3
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Advent of Code 2022

website@jcarroll.com.au (Jonathan Carroll) — Tue, 28 Nov 2023 00:00:00 +0000

In the lead up to Christmas each year, Advent of Code offers a series of 25 puzzles which start out reasonably simple, but get progressively harder, eventually requiring knowledge of algorithms and dynamic programming techniques. Last year I solved these in (strictly) base R on the day they were released (or as close to as I could). I then (starting Dec 27) went back and re-solved (13 of) them in Rust.

This post details what I learned along the way and some fun visualisations I made.

As I eventually ran out of time before the 2023 AoC event, some of the latter solutions are just links to my GitHub repo without comment. I’ll try to update those at some point.

Day 1: Calorie Counting

R

I hadn’t participated in AoC before this, so part of this day involved setting up a clean way to get the puzzle into R and figure out how I was going to run/test my solutions. The {aoc} package makes this quite smooth by using a session cookie to fetch the puzzle from the website and scaffold the input and functions for a given day.

Each puzzle has a small worked example, which requires a small example data input. For the first two weeks I painstakingly copied this input from the puzzle text to the templated example_input_xx function. The actual input for the puzzle is typically much larger and I believe is randomised from a handful of variants so that not everyone gets the exact same input, which makes sharing solutions less of a problem. This input is stored in a .txt file in the inst/ directory by the {aoc} package, which also templates a run-dayxx.R file which reads said input.

All that’s left for the user is to fill in the fxxa and fxxb functions which solve the part a and part b of each day’s puzzle.

Solving the puzzle begins with parsing the input data, which may be a newline-delimited series of numbers, or something more complex. In this case, it is groups of numbers delimited by a blank line. This puzzle asks us to find the group with the largest total. With the data loaded as a long string containing newlines \n I split at a double-newline, then split within each group at the remaining newline, trimmed the string, converted to an integer, and summed the result, which gives a total value per group. Finally, I determined the largest value from the groups with the pattern x[which.max(x)]

f01a <- function(x) {
  xvec <- strsplit(x, "\n\n")[[1]]
  tots <- unlist(lapply(xvec, \(y) sum(as.integer(trimws(strsplit(y, "\n")[[1]])))))
  tots[which.max(tots)]
}

An alternative would have been to sort tots in decreasing order and take the first value, or simply to call max(tots).

The second part of each puzzle expands the problem - in this case, rather than just the largest value from a group, it asks for the largest three groups

f01b <- function(x) {
  xvec <- strsplit(x, "\n\n")[[1]]
  tots <- unlist(lapply(xvec, \(y) sum(as.integer(trimws(strsplit(y, "\n")[[1]])))))
  res <- 0
  for (i in 1:3) {
    n <- which.max(tots)
    res <- res + tots[n]
    tots <- tots[-n]
  }
  res
}

In hindsight, sum(head(sort(tots, decreasing = TRUE), 3)) looks like it would have been clearer.

I wasn’t interested in the performance of my solutions, but for the sake of comparison later, here is how long these take to run over the real input, which contains 2251 lines

microbenchmark::microbenchmark(f01a(x), f01b(x), times = 100, check = NULL)
Unit: milliseconds
    expr      min       lq     mean   median       uq      max neval cld
 f01a(x) 21.21986 21.50103 22.26573 21.63479 21.82128 31.41858   100   a
 f01b(x) 21.18393 21.52034 22.29896 21.67297 21.87063 32.98093   100   a

Running the final solutions from the templated inst/run-dayxx.R file involves building the package (so that the daily functions are available) and running

library(adventofcode22)
x <- readLines("./inst/input01.txt")
x <- paste(x, collapse = "\n")

p1 <- f01a(x)
p2 <- f01b(x)

Rust

Returning to these puzzles from Rust presents the same issue - how do I get the inputs and parse them? The equivalent to the {aoc} package in Rust is a template repository {advent-of-code-rust} which adds some functionality to cargo to scaffold and solve each day’s puzzle.

This crate also adds some helper functions for reading in the inputs and some tests for confirming that the solutions successfully solve with the example data.

Working with arbitrary data in Rust was a bit of a learning experience for me - until this point I’d worked with known structures where I knew exactly what size and shape to expect, and as such I could define what needed to happen. With the puzzle input, I needed to learn how to work with unknown lengths and anticipate what might not work.

I learned from my R solutions that a shared ‘helper’ function to read the data is quite useful, so I started there. As with the R solution, splitting the data into groups at a double newline produces the ‘elf’ groups. Splitting each of those groups involved a map which splits each group’s text into lines(), converts to integers (u32) with parse(), then sum()s each group, collect()ing the result from each group back into a vector. The unwrap() in the middle of this is because parse() can fail - something may not be representable as an integer - so parse() returns a Result type, which can be either a value, or an error. unwrap() simply says “this will never fail, but if it does, crash the entire program”.

fn parse01(input: &str) -> Vec<u32> {
    let elf = input.split("\n\n").collect::<Vec<&str>>();
    let calories: Vec<u32> = elf
        .into_iter()
        .map(|x| x.lines().map(|l| l.parse::<u32>().unwrap()).sum())
        .collect();
    calories
}

Actually solving the first part is then just converting the vector to an iterator and taking the maximum value

pub fn part_one(input: &str) -> Option<u32> {
    let calories = parse01(input);
    calories.into_iter().max()
}

This returns an Option because a) that’s what the solution template requires, and b) max() has nothing to return for an empty iterator, in which case it gives None.
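The `Option` is easy to see in isolation. A minimal sketch, using a helper name (`largest`) that is mine and not part of the puzzle template:

```rust
/// Largest value in a slice, or None when the slice is empty.
/// `largest` is an illustrative helper, not part of the puzzle template.
fn largest(values: &[u32]) -> Option<u32> {
    // iter().max() yields Option<&u32>; copied() turns it into Option<u32>.
    values.iter().max().copied()
}
```

`largest(&[3, 9, 4])` is `Some(9)`, while `largest(&[])` is `None` because there is no largest element to return.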

For the second part I took advantage of the idea I should have had for the R solution and sorted the result (in-place), reversed it (in-place), took the first 3 values, and summed them

pub fn part_two(input: &str) -> Option<u32> {
    let mut calories = parse01(input);
    calories.sort();
    calories.reverse();
    let top3 = calories.iter().take(3);
    Some(top3.sum())
}

Running this solution, the timing seems impressive

cargo solve 01
    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/01`
🎄 Part 1 🎄
72718 (elapsed: 1.01ms)
🎄 Part 2 🎄
213089 (elapsed: 1.07ms)

(about 20x faster than the R solution) except that this is the debug build - it still has debug symbols and some other things that make it not as fast as it can be. Using a release build…

cargo solve 01 --release
    Finished release [optimized + debuginfo] target(s) in 0.02s
     Running `target/release/01`
🎄 Part 1 🎄
72718 (elapsed: 115.74µs)
🎄 Part 2 🎄
213089 (elapsed: 116.51µs)

Yes - one hundred microseconds. 🤯

Day 2: Rock Paper Scissors

R

This puzzle involves combinations of A, B, C and X, Y, Z which lead to different configurations. I don’t know if it’s quite cheating, but I just hardcoded the results into some helper functions

f02_helper <- function(x) {
  switch(x,
         AX = 1 + 3,
         AY = 2 + 6,
         AZ = 3 + 0,
         BX = 1 + 0,
         BY = 2 + 3,
         BZ = 3 + 6,
         CX = 1 + 6,
         CY = 2 + 0,
         CZ = 3 + 3)
}

and summed the matching values, dropping spaces

f02a <- function(x) {
  x <- gsub(" ", "", x)
  sum(sapply(x, f02_helper))
}

The second part is just a variation on this, so another helper and the same idea

f02b_helper <- function(x) {
  switch(x,
         AX = 3 + 0,
         AY = 1 + 3,
         AZ = 2 + 6,
         BX = 1 + 0,
         BY = 2 + 3,
         BZ = 3 + 6,
         CX = 2 + 0,
         CY = 3 + 3,
         CZ = 1 + 6)
}

f02b <- function(x) {
  x <- gsub(" ", "", x)
  sum(sapply(x, f02b_helper))
}

There was probably an algorithmic way to achieve this, but the answer works.

For comparison’s sake…

microbenchmark::microbenchmark(f02a(x), f02b(x), times = 100, check = NULL)
Unit: milliseconds
    expr      min       lq     mean   median       uq      max neval cld
 f02a(x) 4.814174 4.921636 5.225773 4.992835 5.259293 8.873809   100   a
 f02b(x) 4.837843 4.918625 5.142451 4.984148 5.117036 7.762490   100   a

Rust

I could do something similar with Rust, using a match expression inside a map

pub fn part_one(input: &str) -> Option<u32> {
    let guide = parse02(input);
    let res: Vec<u32> = guide
        .into_iter()
        .map(|x| match x.as_str() {
            "AX" => 1 + 3,
            "AY" => 2 + 6,
            "AZ" => 3 + 0,
            "BX" => 1 + 0,
            "BY" => 2 + 3,
            "BZ" => 3 + 6,
            "CX" => 1 + 6,
            "CY" => 2 + 0,
            "CZ" => 3 + 3,
            _ => 0,
        })
        .collect();
    Some(res.iter().sum())
}

pub fn part_two(input: &str) -> Option<u32> {
    let guide = parse02(input);
    let res: Vec<u32> = guide
        .into_iter()
        .map(|x| match x.as_str() {
            "AX" => 3 + 0,
            "AY" => 1 + 3,
            "AZ" => 2 + 6,
            "BX" => 1 + 0,
            "BY" => 2 + 3,
            "BZ" => 3 + 6,
            "CX" => 2 + 0,
            "CY" => 3 + 3,
            "CZ" => 1 + 6,
            _ => 0,
        })
        .collect();
    Some(res.iter().sum())
}

This time, the difference in timing wasn’t so pronounced

cargo solve 02 --release
    Finished release [optimized + debuginfo] target(s) in 0.02s
     Running `target/release/02`
🎄 Part 1 🎄
15422 (elapsed: 699.20µs)
🎄 Part 2 🎄
15442 (elapsed: 527.74µs)

Still faster, but now we’re dealing with string comparisons.
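As an aside, the nine-entry lookup tables can be collapsed into modular arithmetic, which is probably the "algorithmic way" hinted at earlier. What follows is only a sketch of that idea, not the solution that was timed above. It assumes the same two-letter codes (such as "AX") and numbers the shapes 0 (rock), 1 (paper), 2 (scissors). The names indices, score_part_one and score_part_two are hypothetical.

```rust
/// Split a two-letter code such as "AX" into zero-based indices (0, 1, 2).
/// Hypothetical helper, not part of the original solution.
fn indices(code: &str) -> Option<(u32, u32)> {
    let b = code.as_bytes();
    if b.len() != 2 {
        return None;
    }
    let first = u32::from(b[0].checked_sub(b'A')?);
    let second = u32::from(b[1].checked_sub(b'X')?);
    if first > 2 || second > 2 {
        return None;
    }
    Some((first, second))
}

/// Part one: the second letter is my shape.
fn score_part_one(code: &str) -> Option<u32> {
    let (theirs, mine) = indices(code)?;
    // 0 = loss, 1 = draw, 2 = win; the +4 keeps the subtraction non-negative
    let outcome = (mine + 4 - theirs) % 3;
    Some(mine + 1 + 3 * outcome)
}

/// Part two: the second letter is the required outcome.
fn score_part_two(code: &str) -> Option<u32> {
    let (theirs, outcome) = indices(code)?;
    // losing picks the shape one step "behind" theirs, winning one step "ahead"
    let mine = (theirs + outcome + 2) % 3;
    Some(mine + 1 + 3 * outcome)
}
```

Each function returns None for anything that isn't a valid code, where the match version falls through to 0. Whether this is any faster than the match is untested here.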

Day 3: Rucksack Reorganization

R

This puzzle involves ‘rucksacks’ containing letters so we’re going to be doing more string comparisons. The problem statement says that we need to find the single character that is common between the first and second halves of a string. As will be a common theme, I approached this by first solving it for one input as a helper, then mapping over all the inputs. My solution involves splitting the first and second halves of the string with strsplit(), finding the intersection of these (which should be a single character), and matching that to either lowercase or uppercase letters, which R nicely has as inbuilt data structures letters and LETTERS, respectively. This makes for, I believe, a fairly compact solution

f03a <- function(x) {
  sum(sapply(x, f03_helper))
}

f03_helper <- function(x) {
  half <- nchar(x)/2
  comp1 <- strsplit(substring(x, 1, half), "")[[1]]
  comp2 <- strsplit(substring(x, half+1), "")[[1]]
  solo <- intersect(comp1, comp2)
  prio <- match(solo, c(letters, LETTERS))
  prio
}

The second part expands to using 3 groups instead of the two halves. I needed a way to split the input (one string per line) into groups of 3. I haven’t used this in a very long time, but I remembered learning about “generate factor levels” gl() back when I first learned R. This produces a sequence of factors which can be passed to split(), so splitting 12 lines into blocks of 3 would produce 4 levels:

gl(12/3, 3)

##  [1] 1 1 1 2 2 2 3 3 3 4 4 4
## Levels: 1 2 3 4

Aside from that, the only other difference was the double intersection - it’s a shame that intersect only takes two arguments, so I just need to perform it twice

f03b <- function(x) {
  grps <- split(x, as.integer(gl(length(x)/3, 3)))
  sum(sapply(grps, f03b_helper))
}

f03b_helper <- function(x) {
  x1 <- strsplit(x[1], "")[[1]]
  x2 <- strsplit(x[2], "")[[1]]
  x3 <- strsplit(x[3], "")[[1]]
  comm <- intersect(intersect(x1, x2), x3)
  prio <- match(comm, c(letters, LETTERS))
  prio
}

Rust

Figuring out how to do this in Rust took a bit more effort. I don’t know if it was the best way, but I found I could take an intersection of a HashSet object. Rust has a nice split_at() method which helps split the strings, and (as with the lines() used earlier) a chars() method to split into individual characters. No inbuilt letters, though, so I used an ASCII lookup trick to calculate the priority.

use std::collections::HashSet;

fn parse03(input: &str) -> Vec<String> {
    input.lines().map(|x| x.to_string()).collect()
}

fn shared_item(rucksack: String) -> Vec<char> {
    let l = rucksack.len();
    let (str1, str2) = rucksack.split_at(l / 2);

    let comp1: HashSet<char> = HashSet::from_iter(str1.chars());
    let comp2: HashSet<char> = HashSet::from_iter(str2.chars());

    let common = comp1.intersection(&comp2);

    common.copied().collect()
}

fn priority(item: char) -> u32 {
    match item {
        lowercase @ 'a'..='z' => lowercase as u32 - ('a' as u32) + 1,
        uppercase @ 'A'..='Z' => uppercase as u32 - ('A' as u32) + 27,
        _ => 0,
    }
}

pub fn part_one(input: &str) -> Option<u32> {
    let parsed = parse03(&input);
    let repeated: Vec<_> = parsed.iter().map(|x| shared_item(x.to_owned())).collect();
    let mut s = 0;
    for c in repeated {
        s += priority(c[0])
    }
    Some(s)
}

Definitely not as clean as the R solution, here. For the second part, I found some help in a Reddit thread about a three-way intersection. Here, the chunks(n) method nicely produces the three groups

fn badge(rucksacks: Vec<String>) -> Vec<char> {
    let mut badges = vec![];

    for group in rucksacks.chunks(3) {
        let h1: HashSet<char> = HashSet::from_iter(group[0].chars());
        let h2: HashSet<char> = HashSet::from_iter(group[1].chars());
        let h3: HashSet<char> = HashSet::from_iter(group[2].chars());

        let common: Vec<_> = h1
            .iter()
            .filter(|e| h2.contains(e) && h3.contains(e))
            .collect();
        badges.push(*common[0]);
    }

    badges
}

pub fn part_two(input: &str) -> Option<u32> {
    let parsed = parse03(&input);
    let badge: Vec<_> = badge(parsed);
    let mut s = 0;
    for c in badge {
        s += priority(c)
    }
    Some(s)
}
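The two-way and three-way intersections are really the same operation applied to a different number of sets. A possible generalisation, sketched here with a hypothetical common_chars function, starts from the characters of the first string and keeps only those that also appear in every later one, using HashSet::retain.

```rust
use std::collections::HashSet;

/// Characters that appear in every string of `group`.
/// Hypothetical helper; an empty group yields an empty set.
fn common_chars(group: &[&str]) -> HashSet<char> {
    let mut strings = group.iter();
    let mut common: HashSet<char> = match strings.next() {
        Some(first) => first.chars().collect(),
        None => return HashSet::new(),
    };
    for s in strings {
        let other: HashSet<char> = s.chars().collect();
        // drop anything the next string does not contain
        common.retain(|c| other.contains(c));
    }
    common
}
```

With this, part one would pass the two halves of a rucksack and part two a chunk of three rucksacks. It stays close to what Reduce(intersect, ...) would do in R.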

Day 4: Camp Cleanup

Now the parsing gets harder. This puzzle involves finding where ranges overlap, so

5-7: ....567..  
7-9: ......789

overlap at the 7, while

2-6: .23456...  
4-8: ...45678.

overlaps at 4, 5, and 6.

R

Again, taking the “do it once, then map” approach, I converted the format a-b into a:b and then eval(parse(text))’d the result. This worked surprisingly well. The puzzle then asks how many times one of the pair is entirely contained within the other, so all() and %in% are great help here.

f04a <- function(x) {
  sum(sapply(x, f04_helper))
}

f04_helper <- function(x) {
  both <- sapply(sub("-", ":", strsplit(x, ",")[[1]]), \(y) eval(parse(text = y)), simplify = FALSE, USE.NAMES = FALSE)
  all(both[[1]] %in% both[[2]]) || all(both[[2]] %in% both[[1]])
}

The second part asks for how many overlap at all, so it’s just a change from all() to any()

f04b <- function(x) {
  sum(sapply(x, f04b_helper))
}

f04b_helper <- function(x) {
  both <- sapply(sub("-", ":", strsplit(x, ",")[[1]]), \(y) eval(parse(text = y)), simplify = FALSE, USE.NAMES = FALSE)
  any(both[[1]] %in% both[[2]]) || any(both[[2]] %in% both[[1]])
}

Rust

I created a structure to contain the ranges, parsed out the strings into actual ranges, and parsed the input

#[derive(Debug)]
struct Assignments {
    sections: String,
}

impl Assignments {
    fn ids(&self) -> std::ops::Range<u32> {
        let rangelimits = &self.sections.split_once('-').unwrap();
        let start = rangelimits.0.parse::<u32>().unwrap();
        let end = rangelimits.1.parse::<u32>().unwrap();
        start..end
    }
}

fn create_assignments(line: &str) -> Vec<Assignments> {
    let pair = line.split_once(',').unwrap();
    let p1 = Assignments {
        sections: pair.0.to_string(),
    };
    let p2 = Assignments {
        sections: pair.1.to_string(),
    };
    vec![p1, p2]
}

fn parse04(input: &str) -> Vec<Vec<Assignments>> {
    let l = input.lines();
    l.into_iter().map(|x| create_assignments(x)).collect()
}

Having to do this in Rust made me happy that R has an intersect() function, because now I needed one and had to code it by hand (I think…)

To determine if one range is fully contained within another, I compared the start and end values. Iterating over the pairs I just incremented a counter for those which were fully overlapping

fn fully_contains(pairs: Vec<Assignments>) -> bool {
    let p1 = pairs[0].ids();
    let p2 = pairs[1].ids();
    if p1.len() >= p2.len() {
        return p1.start <= p2.start && p1.end >= p2.end;
    } else {
        return p2.start <= p1.start && p2.end >= p1.end;
    }
}

pub fn part_one(input: &str) -> Option<u32> {
    let all_assignments = parse04(input);
    let mut overlapping = 0;
    for ass in all_assignments {
        if fully_contains(ass) {
            overlapping += 1;
        }
    }
    Some(overlapping)
}

For the second part, I needed another algorithm, so StackOverflow to the rescue

fn overlap_at_all(pairs: Vec<Assignments>) -> bool {
    let p1 = pairs[0].ids();
    let p2 = pairs[1].ids();
    // https://stackoverflow.com/a/325964/4168169
    // (StartA <= EndB) and (EndA >= StartB)
    p1.start <= p2.end && p1.end >= p2.start
}

pub fn part_two(input: &str) -> Option<u32> {
    let all_assignments = parse04(input);
    let mut overlapping = 0;
    for ass in all_assignments {
        if overlap_at_all(ass) {
            overlapping += 1;
        }
    }
    Some(overlapping)
}
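One thing to be aware of in the code above: start..end is a half-open Range, while the puzzle's sections are inclusive. The comparisons still work because only the stored start and end values are compared, but std::ops::RangeInclusive states the intent more directly. The following is a sketch under that assumption, with hypothetical names parse_range, fully_contains, overlaps and count_pairs.

```rust
use std::ops::RangeInclusive;

/// Parse "2-4" into the inclusive range 2..=4.
fn parse_range(s: &str) -> Option<RangeInclusive<u32>> {
    let (start, end) = s.split_once('-')?;
    let start: u32 = start.trim().parse().ok()?;
    let end: u32 = end.trim().parse().ok()?;
    Some(start..=end)
}

/// True if either range lies entirely inside the other.
fn fully_contains(a: &RangeInclusive<u32>, b: &RangeInclusive<u32>) -> bool {
    (a.start() <= b.start() && a.end() >= b.end())
        || (b.start() <= a.start() && b.end() >= a.end())
}

/// True if the ranges share at least one value.
fn overlaps(a: &RangeInclusive<u32>, b: &RangeInclusive<u32>) -> bool {
    a.start() <= b.end() && b.start() <= a.end()
}

/// Count the lines of the form "a-b,c-d" whose two ranges satisfy `pred`.
/// Lines that fail to parse are skipped.
fn count_pairs(
    input: &str,
    pred: fn(&RangeInclusive<u32>, &RangeInclusive<u32>) -> bool,
) -> usize {
    input
        .lines()
        .filter_map(|line| {
            let (left, right) = line.split_once(',')?;
            Some((parse_range(left)?, parse_range(right)?))
        })
        .filter(|(a, b)| pred(a, b))
        .count()
}
```

Both parts then become count_pairs(input, fully_contains) and count_pairs(input, overlaps), without needing to compare lengths first.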

Day 5: Supply Stacks

R

This one made me a little more afraid as it involved parsing ASCII-art-like input

    [D]
[N] [C]
[Z] [M] [P]
 1   2   3

move 1 from 2 to 1
move 3 from 1 to 3
move 2 from 2 to 1
move 1 from 1 to 2

Starting with the “stacks”, I realised that a “crate” takes up 3 characters plus possibly a trailing space (e.g. [A]), so I could substring in blocks of 4. I reversed each stack so that the top “crate” came last

extract_stack <- function(x) {
  # split into stacks
  n <- seq(1, nc <- nchar(x), by = 4)
  stack <- substring(x, n, c(n[-1]-1, nc))
  stack <- trimws(sub("]", "", sub("[", "", stack, fixed = TRUE), fixed = TRUE))
  stack
}

get_stacks <- function(x) {
  x <- x[1:(grep("^$", x)-1)]
  y <- t(sapply(x, extract_stack, USE.NAMES = FALSE))
  stackno <- y[nrow(y), ]
  y <- y[-nrow(y), ]
  z <- as.list(as.data.frame(y))
  z <- lapply(z, rev)
  z <- lapply(z, \(w) w[w != ""])
  z
}

Parsing the instructions was a great opportunity for something like {unglue}, if only I wasn’t limiting myself to strictly base R. Nonetheless, the instructions formed a straightforward pattern, so it wasn’t too hard to work with

get_instruction <- function(x) {
  x <- sub("move ", "", x)
  n <- as.integer(sub("([0-9]+).*", "\\1", x))
  x <- sub("^.*?from ", "", x)
  from <- as.integer(sub("([0-9]+).*", "\\1", x))
  to <- as.integer(sub("^.*?to ", "", x))
  data.frame(n, from, to)
}

get_instructions <- function(x) {
  x <- x[(grep("^$", x)+1):length(x)]
  y <- lapply(x, get_instruction)
  do.call(rbind, y)
}

Performing the crane operations only involved selecting some number (1 or several) of elements from the end (the top) of one list and appending them to another

crane <- function(stack, inst, model) {
  for (r in seq_len(nrow(inst))) {
    stack <- .crane(stack, inst[r, ], model)
  }
  stack
}

.crane <- function(stack, inst, model) {
  sfrom <- paste0("V", inst$from)
  sto <- paste0("V", inst$to)
  pick <- tail(stack[[sfrom]], inst$n)
  if (model == 9000) {
    pick <- rev(pick)
  }
  stack[[sfrom]] <- head(stack[[sfrom]], -inst$n)
  stack[[sto]] <- c(stack[[sto]], pick)
  stack
}

The flexibility and symmetry of head(n), head(-n), tail(n), and tail(-n) made this particularly nice. This was one instance where I re-used my solution to the first part with an argument for the second part.

Rust

If I thought the input parsing was hard in R, I wasn’t looking forward to doing it in Rust. I implemented the stacks in much the same way - taking 4 chars at a time

#[derive(Debug)]
struct Stacks {
    stacks: String,
}

impl Stacks {
    fn crates(&self) -> Vec<Vec<char>> {
        let stacklines = &self
            .stacks
            .lines()
            .into_iter()
            .map(|x| x.chars().collect::<Vec<char>>())
            .collect::<Vec<Vec<char>>>();
        let mut stackentries = vec![];
        for l in stacklines.iter() {
            stackentries.push(l.iter().skip(1).step_by(4).collect::<Vec<&char>>());
        }
        // reshape to stacks
        let mut stack = vec![vec![' '; 50]; stackentries[1].len()];
        for s in 0..stackentries.len() - 1 {
            for el in 0..stackentries[s].len() {
                stack[el][s] = stackentries[s][el].to_owned()
            }
        }
        for s in 0..stack.len() {
            stack[s].reverse();
            stack[s].retain(|x| *x != ' ');
        }

        stack
    }
}

The instructions invited a regex solution, but I found it to be (relatively) slow. I tried the ‘unglue’ approach

let re = Regex::new(r"move (\d*) from (\d*) to (\d*)").unwrap();
let caps = re.captures(&self.input).unwrap();

and this ended up taking 89ms. The full R solution took 257ms which is certainly slower, but I expected a bigger improvement from moving to Rust. I refactored to avoid using the regex, instead just filtering to chars that parsed as numbers

#[derive(Debug)]
struct Instructions {
    input: String,
}

impl Instructions {
    fn parse(&self) -> (usize, usize, usize) {
        let instr = String::from(&self.input);
        let caps = instr
            .split_whitespace()
            .filter(|c| c.parse::<usize>().is_ok())
            .collect::<Vec<_>>();
        let moveto = caps[0].parse::<usize>().unwrap();
        let from = caps[1].parse::<usize>().unwrap();
        let to = caps[2].parse::<usize>().unwrap();

        (moveto, from, to)
    }
}

and this version ran in 403µs - much better.

Putting the two pieces together as a tuple

fn parse05(input: &str) -> (Stacks, Vec<Instructions>) {
    let parts = input.split_once("\n\n").unwrap();
    let stacks = Stacks {
        stacks: String::from(parts.0),
    };
    let instr = parts
        .1
        .lines()
        .map(|x| Instructions {
            input: String::from(x),
        })
        .collect::<Vec<_>>();
    (stacks, instr)
}

Actually running the simulation required a crane function

fn crane(crates: Vec<Vec<char>>, instr: (usize, usize, usize)) -> Vec<Vec<char>> {
    let mut tmpcrates = crates.clone();
    for _i in 0..instr.0 {
        let tomove: char = tmpcrates[instr.1 - 1].pop().unwrap();
        tmpcrates[instr.2 - 1].push(tomove);
    }
    tmpcrates
}

pub fn part_one(input: &str) -> Option<String> {
    let (stacks, instr) = parse05(&input);
    let mut crates = stacks.crates();
    for i in 0..instr.len() {
        crates = crane(crates, instr[i].parse());
    }
    let tops = crates.iter().map(|s| s.last().unwrap()).collect::<String>();
    Some(tops)
}

and, not reusing the solution, part two

fn crane9001(crates: Vec<Vec<char>>, instr: (usize, usize, usize)) -> Vec<Vec<char>> {
    let mut tmpcrates = crates.clone(); 
    let new_len = tmpcrates[instr.1 - 1].len();
    let mut tomove = vec![];
    for _i in 0..instr.0 {
        tomove.push(tmpcrates[instr.1 - 1].pop().unwrap());
    }
    tomove.reverse();
    tmpcrates[instr.1 - 1].truncate(new_len - instr.0);
    for x in tomove.into_iter() {
        tmpcrates[instr.2 - 1].push(x);
    }
    tmpcrates
}

pub fn part_two(input: &str) -> Option<String> {
    let (stacks, instr) = parse05(&input);
    let mut crates = stacks.crates();

    for i in 0..instr.len() {
        crates = crane9001(crates, instr[i].parse());
    }
    let tops = crates.iter().map(|s| s.last().unwrap()).collect::<String>();
    Some(tops)
}

Not as bad as it could have been. The fact that Rust lets me iterate over a string as a sequence of chars (as many other languages do) makes some of this a lot nicer. It’s something I do wish R did differently now that I’ve used it in other places, but strings are hard.
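The two crane functions differ only in whether the picked-up crates are reversed, which mirrors the model argument in the R version. A possible single-function sketch uses Vec::split_off to lift the top n crates in one go. The name move_crates and the one_at_a_time flag are hypothetical, and the function assumes valid 1-based stack numbers.

```rust
/// Move `n` crates from stack `from` to stack `to` (both 1-based).
/// `one_at_a_time = true` behaves like the 9000 crane (order reversed),
/// `false` like the 9001 crane (order kept). Panics on an invalid stack number.
fn move_crates(
    stacks: &mut [Vec<char>],
    n: usize,
    from: usize,
    to: usize,
    one_at_a_time: bool,
) {
    let source = &mut stacks[from - 1];
    // index of the first crate to lift; saturating_sub avoids underflow
    let keep = source.len().saturating_sub(n);
    let mut picked = source.split_off(keep);
    if one_at_a_time {
        picked.reverse();
    }
    stacks[to - 1].extend(picked);
}
```

Because it mutates the stacks in place, this also avoids cloning the whole Vec<Vec<char>> for every instruction.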

Day 6: Tuning Trouble

After complaining about strings the previous day, parsing this one sounded potentially tricky, but I think it worked out nicely. The problem involves finding the first group of 4 characters where they are all different.

R

Returning to the ‘do it once, then *apply’ approach, I was happy to know that R’s substring is vectorised, so the first and last arguments can be vectors, e.g. taking 10 letters at a time of the alphabet

l <- paste0(letters, collapse = "")
l

## [1] "abcdefghijklmnopqrstuvwxyz"

substring(l, seq(1, 17), seq(10, 26))

##  [1] "abcdefghij" "bcdefghijk" "cdefghijkl" "defghijklm" "efghijklmn"
##  [6] "fghijklmno" "ghijklmnop" "hijklmnopq" "ijklmnopqr" "jklmnopqrs"
## [11] "klmnopqrst" "lmnopqrstu" "mnopqrstuv" "nopqrstuvw" "opqrstuvwx"
## [16] "pqrstuvwxy" "qrstuvwxyz"

This is the exact sort of grouping I need for this puzzle. The rest is figuring out if the group contains 4 unique characters. The offset is to account for the number of characters since the start of the original string

f06_helper <- function(x) {
  grp4 <- substring(x, seq(1, nchar(x) - 3), seq(4, nchar(x)))
  4 + which(sapply(strsplit(grp4, ""), \(y) length(unique(y))) == 4)[1] - 1
}

f06a <- function(x) {
  sapply(x, f06_helper)
}

The second part really could have just been adding an argument to specify the group length, but I went the long way around

f06b_helper <- function(x) {
  grp14 <- substring(x, seq(1, nchar(x) - 13), seq(14, nchar(x)))
  14 + which(sapply(strsplit(grp14, ""), \(y) length(unique(y))) == 14)[1] - 1
}

f06b <- function(x) {
  sapply(x, f06b_helper)
}

Rust

Without R’s vectorised substring, I needed to parse 4 characters at a time - again I was thankful that Rust treats strings as a series of Chars. To keep track of which Chars had been seen in the last 4 Chars I used a HashSet. I was pleased to learn that R does in fact have such a structure in the form of utils::hashtab() but this is only available in newer versions of R

use std::collections::HashSet;

pub fn part_one(input: &str) -> Option<u32> {
    let mut i: u32 = 1;
    let mut recent = HashSet::new();
    let mut lastchars = vec![' '; 3];

    for c in input.chars() {
        for j in 0..3 {
            recent.insert(lastchars[j]);
        }
        recent.insert(c);
        if i > 3 && recent.len() == 4 {
            break
        };
        for i in 0..2 {
            lastchars[i] = lastchars[i+1];
        }
        lastchars[2] = c;
        recent.clear();
        i += 1;
    }
    Some(i)
}

The second part is again very similar, and again rather than adapting my solution I wrote a new one for part two

pub fn part_two(input: &str) -> Option<u32> {
    let mut i: u32 = 1;
    let mut recent = HashSet::new();
    let mut lastchars = vec![' '; 13];

    for c in input.chars() {
        for j in 0..13 {
            recent.insert(lastchars[j]);
        }
        recent.insert(c);
        if i > 13 && recent.len() == 14 {
            break
        };
        for i in 0..12 {
            lastchars[i] = lastchars[i+1];
        }
        lastchars[12] = c;
        recent.clear();
        i += 1;
    }
    Some(i)
}
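Since the two parts differ only in the window length, they could share one function. A minimal sketch, assuming a hypothetical first_marker function, uses the slice method windows() to get something close to R's vectorised substring.

```rust
use std::collections::HashSet;

/// Number of characters read when the first run of `window`
/// all-different characters ends, or None if there is no such run.
fn first_marker(input: &str, window: usize) -> Option<usize> {
    if window == 0 {
        return None; // windows(0) would panic
    }
    let chars: Vec<char> = input.chars().collect();
    chars
        .windows(window)
        // a window is a marker when all of its characters are distinct
        .position(|w| w.iter().collect::<HashSet<_>>().len() == window)
        .map(|start| start + window)
}
```

Part one would call first_marker(input, 4) and part two first_marker(input, 14). It builds a fresh HashSet per window, so it is not necessarily faster than the loops above, just shorter.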

Day 7: No Space Left On Device

The input for this puzzle is a bit gnarly

$ cd /
$ ls
dir a
14848514 b.txt
8504156 c.dat
...

but it turned out a brute-force replacement approach didn’t work too badly.

R

There’s probably some good algorithm to deal with this, but instead I wrote a lot of for-loops to see what needed to be done. The tricky part of walking the directory tree was having somewhere to keep track of a) which directory I was currently in, and b) what I’d already seen. I’m sure a recursive approach could be of help here, but instead I used an environment because I knew it was somewhat memory efficient; a global list would need to keep reallocating and would be slow.

f07a <- function(x) {
  dir_env <- new.env()
  current_dir <- "/"
  assign(current_dir, 0, envir = dir_env)
  for (inst in x) {
    if (inst == "$ cd /") next
    if (inst == "$ cd ..") {
      current_dir <- head(current_dir, -1)
      next
    }
    if (startsWith(inst, "$ cd")) {
      dir <- sub("$ cd ", "", inst, fixed = TRUE)
      current_dir <- c(current_dir, dir)
      assign(paste0(current_dir, collapse = "/"), 0, envir = dir_env)
      next
    }
    if (inst == "$ ls") next
    if (startsWith(inst, "dir")) {
      dir <- sub("dir ", "", inst)
      next
    } else if (grepl("^[0-9]", inst)) {
      l <- strsplit(inst, " ")[[1]]
      size <- l[1]
      for (d in seq_along(current_dir)) {
        this.d <- paste0(current_dir[1:d], collapse = "/")
        assign(this.d, dir_env[[this.d]] + as.integer(size), envir = dir_env)
      }
      next
    } else {
      stop("what?")
    }
  }
  sizes <- sapply(ls(dir_env), get, env = dir_env)
  res <- sum(sizes[which(sizes <= 100000)])
  list(del = res, env = dir_env)
}

This isn’t recursive, but it works. The second part is much shorter, since it can reuse the first part

f07b <- function(x) {
  alldirs <- f07a(x)$env
  todelete <- alldirs[["/"]] - 40000000
  sizes <- sapply(ls(alldirs), get, env = alldirs)
  candidates <- sizes[which(sizes >= todelete)]
  smallest <- candidates[which.min(candidates)]
  smallest
}

Rust

Here I took the same approach, but using a HashMap as the filesystem

use std::collections::HashMap;

pub fn part_one(input: &str) -> Option<u32> {
    let mut dir_deque = vec![];
    let mut current_dir = String::from("");
    let mut filesystem = HashMap::new();
    for l in input.lines() {
        if l == "$ cd .." {
            dir_deque.pop().unwrap();
            current_dir = dir_deque.join("");
            continue;
        } else if l == "$ cd /" {
            current_dir = String::from("/");
            filesystem.insert(current_dir.clone(), 0);
            dir_deque = vec![String::from("/")];
            continue;
        } else if l.starts_with("dir") {
            continue;
        } else if l.starts_with("$ cd") {
            let new_dir = l.replace("$ cd ", "");
            dir_deque.push(new_dir.clone() + &"/");
            current_dir = current_dir + &new_dir.clone() + &"/";
            filesystem.insert(current_dir.clone(), 0);
            continue;
        } else if l.chars().next().map_or(false, |c| c.is_ascii_digit()) {
            let parts = l.split_whitespace().collect::<Vec<_>>();
            let dir_size = parts[0].parse::<u32>().unwrap();
            for d in 0..dir_deque.len() {
                let this_d = dir_deque[0..=d].join("");
                let known_size = filesystem.get(&this_d).unwrap();
                filesystem.insert(this_d, known_size + dir_size);
            }
            continue;
        }
    }

    let totalsize = filesystem.iter()
            .filter(|&(_k, v)| *v <= 1e5 as u32)
            .map(|(_k, v)| *v)
            .collect::<Vec<u32>>()
            .iter()
            .sum();

    Some(totalsize)
        
}

pub fn part_two(input: &str) -> Option<u32> {
    let mut dir_deque = vec![];
    let mut current_dir = String::from("");
    let mut filesystem = HashMap::new();
    for l in input.lines() {
        if l == "$ cd .." {
            dir_deque.pop().unwrap();
            current_dir = dir_deque.join("");
            continue;
        } else if l == "$ cd /" {
            current_dir = String::from("/");
            filesystem.insert(current_dir.clone(), 0);
            dir_deque = vec![String::from("/")];
            continue;
        } else if l.starts_with("dir") {
            continue;
        } else if l.starts_with("$ cd") {
            let new_dir = l.replace("$ cd ", "");
            dir_deque.push(new_dir.clone() + &"/");
            current_dir = current_dir + &new_dir.clone() + &"/";
            filesystem.insert(current_dir.clone(), 0);
            continue;
        } else if l.chars().next().map_or(false, |c| c.is_ascii_digit()) {
            let parts = l.split_whitespace().collect::<Vec<_>>();
            let dir_size = parts[0].parse::<u32>().unwrap();
            for d in 0..dir_deque.len() {
                let this_d = dir_deque[0..=d].join("");
                let known_size = filesystem.get(&this_d).unwrap();
                filesystem.insert(this_d, known_size + dir_size);
            }
            continue;
        }
    }

    let to_delete = filesystem.get("/").unwrap() - (4e7 as u32);
    let candidates = filesystem.iter()
        .filter(|&(_k, v)| *v >= to_delete)
        .map(|(_k, v)| *v)
        .collect::<Vec<u32>>();
    Some(*candidates.iter().min().unwrap())

}
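The two parts repeat the whole parsing loop, so one way to tidy this up is to pull the size bookkeeping into a shared function. The following is a sketch of that idea rather than the code that produced the answers. The names dir_sizes, sum_small and smallest_to_delete are hypothetical, and directory keys are built as "/", "/a", "/a/e" and so on.

```rust
use std::collections::HashMap;

/// Total size of every directory, keyed by its full path.
fn dir_sizes(input: &str) -> HashMap<String, u64> {
    let mut stack: Vec<String> = Vec::new(); // an empty stack means "/"
    let mut sizes: HashMap<String, u64> = HashMap::new();
    sizes.insert(String::from("/"), 0);
    for line in input.lines() {
        let parts: Vec<&str> = line.split_whitespace().collect();
        match parts.as_slice() {
            ["$", "cd", "/"] => stack.clear(),
            ["$", "cd", ".."] => {
                stack.pop();
            }
            ["$", "cd", name] => stack.push((*name).to_string()),
            // "123 file" adds to this directory and all its parents;
            // "$ ls" and "dir x" fail to parse as a number and are skipped
            [size, _name] => {
                if let Ok(bytes) = size.parse::<u64>() {
                    for depth in 0..=stack.len() {
                        let path = format!("/{}", stack[..depth].join("/"));
                        *sizes.entry(path).or_insert(0) += bytes;
                    }
                }
            }
            _ => {}
        }
    }
    sizes
}

/// Part one: sum of all directories no larger than 100000.
fn sum_small(sizes: &HashMap<String, u64>) -> u64 {
    sizes.values().filter(|&&v| v <= 100_000).sum()
}

/// Part two: smallest directory that brings usage down to 40000000.
fn smallest_to_delete(sizes: &HashMap<String, u64>) -> Option<u64> {
    let needed = sizes.get("/")?.saturating_sub(40_000_000);
    sizes.values().copied().filter(|&v| v >= needed).min()
}
```

Joining the path components with "/" also keeps directories such as a/bc and ab/c distinct, and u64 leaves headroom over u32 for larger inputs.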

Day 8: Treetop Tree House

(see links)

R

Rust

Day 9: Rope Bridge

This one involves keeping track of the positions of several ‘knots’ in a rope as it moves.

R

I wrote a lot of helper functions for this one

f09a <- function(x) {
  visited <- list()
  head_pos <- c(5, 1)
  tail_pos <- c(5, 1)
  # print_grid(head_pos, tail_pos)
  visited <- c(visited, list(tail_pos))
  for (instr in x) {
    tmp <- move_rope(head_pos, tail_pos, instr)
    head_pos <- tmp[[1]]
    tail_pos <- tmp[[2]]
    visited <- c(visited, tmp[[3]])
  }
  length(unique(visited))
}

print_grid <- function(head_pos, tail_pos, size = 6) {
  grid <- matrix(".", nrow = size, ncol = size)
  grid[matrix(tail_pos, ncol = 2)] <- "T"
  grid[matrix(head_pos, ncol = 2)] <- "H"
  print(grid)
}

print_knots <- function(k, size = 10) {
  grid <- matrix(".", nrow = size, ncol = size)
  for (i in seq_len(length(k))) {
    grid[matrix(k[[i]], ncol = 2)] <- i
  }
  print(grid)
}

move_head <- function(head_pos, dir) {
  if (dir == "L") return(c(head_pos[1], head_pos[2] - 1))
  if (dir == "R") return(c(head_pos[1], head_pos[2] + 1))
  if (dir == "U") return(c(head_pos[1] - 1, head_pos[2]))
  if (dir == "D") return(c(head_pos[1] + 1, head_pos[2]))
}

move_rope <- function(head_pos, tail_pos, x) {
  visited <- list()
  dir <- sub(" .*", "", x)
  dist <- as.integer(sub("[LRUD] ", "", x))
  for (i in seq_len(dist)) {
    head_pos <- move_head(head_pos, dir)
    tail_pos <- move_tail(head_pos, tail_pos)
    visited <- c(visited, list(tail_pos))
  }
  return(list(head_pos, tail_pos, visited))
}

move_knots <- function(knots, x) {
  visited <- list()
  dir <- sub(" .*", "", x)
  dist <- as.integer(sub("[LRUD] ", "", x))
  for (i in seq_len(dist)) {
    knots[[1]] <- move_head(knots[[1]], dir)
    for (k in 2:10) {
      knots[[k]] <- move_tail(knots[[k-1]], knots[[k]])
    }
    visited <- c(visited, list(knots[[10]]))
  }
  return(list(knots, visited))
}

touching <- function(head_pos, tail_pos) {
  (head_pos[1] == tail_pos[1] && head_pos[2] == tail_pos[2]) ||
  (abs(head_pos[1] - tail_pos[1]) <= 1 && abs(head_pos[2] - tail_pos[2]) <= 1)
}

move_tail <- function(head_pos, tail_pos) {
  if (touching(head_pos, tail_pos)) return(tail_pos)
  if (tail_pos[1] == head_pos[1]) return(c(tail_pos[1], tail_pos[2] + sign(head_pos[2] - tail_pos[2])*1))
  if (tail_pos[2] == head_pos[2]) return(c(tail_pos[1] + sign(head_pos[1] - tail_pos[1])*1, tail_pos[2]))
  return(c(tail_pos[1] + sign(head_pos[1] - tail_pos[1])*1, tail_pos[2] + sign(head_pos[2] - tail_pos[2])*1))
}

This was also the first one I found the time to plot - here I plotted the path of the 10th knot, as well as the positions of the other knots after each step
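The three movement cases in move_tail() all reduce to "step one unit towards the head along each axis", which is what sign() does. A sketch of the same rule in Rust, using i32::signum, is below. This is not the Rust solution from the repo, and the names follow and count_tail_positions are hypothetical.

```rust
use std::collections::HashSet;

/// New position of a knot at `tail` after the knot ahead of it moved to `head`.
/// Positions are (row, column) pairs.
fn follow(head: (i32, i32), tail: (i32, i32)) -> (i32, i32) {
    let (dr, dc) = (head.0 - tail.0, head.1 - tail.1);
    if dr.abs() <= 1 && dc.abs() <= 1 {
        return tail; // touching (or overlapping): stay put
    }
    // signum is 0 when already aligned on that axis, so this covers
    // the straight and the diagonal case alike
    (tail.0 + dr.signum(), tail.1 + dc.signum())
}

/// Number of distinct positions visited by the last of `knots` knots.
/// Lines that are not of the form "R 4" are skipped.
fn count_tail_positions(moves: &str, knots: usize) -> usize {
    let mut rope = vec![(0i32, 0i32); knots.max(1)];
    let last = rope.len() - 1;
    let mut seen = HashSet::new();
    seen.insert(rope[last]);
    for line in moves.lines() {
        let (dir, n) = match line.split_once(' ') {
            Some(pair) => pair,
            None => continue,
        };
        let n: usize = match n.trim().parse() {
            Ok(n) => n,
            Err(_) => continue,
        };
        let step = match dir {
            "L" => (0, -1),
            "R" => (0, 1),
            "U" => (-1, 0),
            "D" => (1, 0),
            _ => continue,
        };
        for _ in 0..n {
            rope[0].0 += step.0;
            rope[0].1 += step.1;
            for k in 1..rope.len() {
                rope[k] = follow(rope[k - 1], rope[k]);
            }
            seen.insert(rope[last]);
        }
    }
    seen.len()
}
```

Passing 2 knots corresponds to the head-and-tail case of part one and 10 knots to the longer rope of part two.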

Rust

Day 10: Cathode-Ray Tube

(see links)

R

Rust

Day 11: Monkey in the Middle

(see links)

R

Rust

Day 12: Hill Climbing Algorithm

This one required that I learn a pathfinding algorithm - something I hadn’t really done before. I ended up learning (and implementing) Dijkstra’s Algorithm for finding the shortest paths between nodes in a graph.

R

f12a <- function(x) {
  rows <- strsplit(x, "")
  grid <- matrix(unlist(rows), ncol = nchar(x[1]), byrow = TRUE)
  ngrid <- grid
  ngrid[which(grid == "S", arr.ind = TRUE)] <- "a"
  ngrid[which(grid == "E", arr.ind = TRUE)] <- "z"
  ngrid[] <- match(ngrid[], letters)
  mode(ngrid) <- "integer"
  startat <- which(t(grid) == "S")
  endat <- which(t(grid) == "E")
  min_path <- dijkstra(ngrid, endat, dir = -1)
  # dijkstra() returns list(distances, paths); we want the distance at the start
  min_path[[1]][startat]
}

get_pos <- function(grid, v) {
  i <- floor((v-1)/ncol(grid))+1
  j <- ((v-1) %% ncol(grid))+1
  return(c(i, j))
}

can_reach <- function(ngrid, v, dir = 1) {
  x <- get_pos(ngrid, v)
  i <- x[1]
  j <- x[2]
  # can only move 1 row away
  res <- abs(floor(0:(prod(dim(ngrid))-1) / ncol(ngrid)) + 1 - i) <= 1 &
    # can only move 1 col away
    abs((0:(prod(dim(ngrid))-1)%%ncol(ngrid)) + 1 - j) <= 1 &
    # can't move diagonally
    abs(floor(0:(prod(dim(ngrid))-1) / ncol(ngrid)) + 1 - i) + abs((0:(prod(dim(ngrid))-1)%%ncol(ngrid)) + 1 - j) == 1
    if (dir == 1) {
    # can only step up 1
      res <- res & c(t(ngrid - ngrid[i, j] <= 1))
    } else {
      res <- res & c(t(ngrid[i, j] - ngrid <= 1))
    }
  as.integer(res)
}

dijkstra <- function(grid, start, dir = -1){
  #' Implementation of dijkstra using on-demand query
  #' derived from https://www.algorithms-and-technologies.com/dijkstra/r
  #' This returns an array containing the length of the shortest path from the start node to each other node.
  #' It is only guaranteed to return correct results if there are no negative edges in the graph. Positive cycles are fine.
  #' This has a runtime of O(|V|^2) (|V| = number of Nodes), for a faster implementation see @see ../fast/Dijkstra.java (using adjacency lists)
  #' @param grid an integer matrix of heights; edges are looked up on demand via can_reach() instead of being stored as an adjacency matrix.
  #' @param start the node to start from.
  #' @param dir are we going up or down? passed to can_reach()
  #' @return an array containing the shortest distances from the given start node to each other node

  # This contains the distances from the start node to all other nodes
  distances = rep(Inf, prod(dim(grid)))
  paths = vector("list", prod(dim(grid)))

  # This contains whether a node was already visited
  visited = rep(FALSE, prod(dim(grid)))

  # The distance from the start node to itself is of course 0
  distances[start] = 0
  paths[[start]] = start

  # While there are nodes left to visit...
  repeat{

    # ... find the node with the currently shortest distance from the start node...
    shortest_distance = Inf
    shortest_index = -1
    for(i in seq_along(distances)) {
      # ... by going through all nodes that haven't been visited yet
      if(distances[i] < shortest_distance && !visited[i]){
        shortest_distance = distances[i]
        shortest_index = i
      }
    }

    if(shortest_index == -1){
      # There was no node not yet visited --> We are done
      return (list(distances, paths))
    }
    # ...then, for all neighboring nodes that haven't been visited yet....
    g <- can_reach(grid, shortest_index, dir = dir)
    for(i in seq_along(g)) {
      # ...if the path over this edge is shorter...
      if(g[i] != 0 && distances[i] > distances[shortest_index] + g[i]){
        # ...Save this path as new shortest path.
        distances[i] = distances[shortest_index] + g[i]
        paths[[i]] <- c(paths[[shortest_index]], i)
      }
    }
    # Lastly, note that we are finished with this node.
    visited[shortest_index] = TRUE
  }
}

With a decent amount of plotting code, I ended up with an animation showing the solution for the test data

which I was very pleased about. Even better was the full solution animation
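Because every allowed step costs exactly 1, a breadth-first search gives the same distances as Dijkstra here without needing to scan for the closest unvisited node. The sketch below swaps in BFS as an alternative technique, so it is not a port of the R code above. Like the R version it searches backwards from E. The function name shortest_steps and the any_a flag are hypothetical, and the grid is assumed to be rectangular.

```rust
use std::collections::VecDeque;

/// Fewest steps from 'S' (or, if `any_a` is true, from any 'a' square) to 'E'.
/// Searches backwards from 'E'; returns None if no route exists.
fn shortest_steps(input: &str, any_a: bool) -> Option<usize> {
    let grid: Vec<Vec<u8>> = input.lines().map(|l| l.bytes().collect()).collect();
    let height = |b: u8| match b {
        b'S' => b'a',
        b'E' => b'z',
        other => other,
    };
    let mut end = None;
    for (r, row) in grid.iter().enumerate() {
        for (c, &b) in row.iter().enumerate() {
            if b == b'E' {
                end = Some((r, c));
            }
        }
    }
    let end = end?;
    let (rows, cols) = (grid.len(), grid[0].len());
    let mut dist = vec![vec![usize::MAX; cols]; rows];
    dist[end.0][end.1] = 0;
    let mut queue = VecDeque::from([end]);
    let deltas: [(i32, i32); 4] = [(0, 1), (1, 0), (0, -1), (-1, 0)];
    while let Some((r, c)) = queue.pop_front() {
        let here = grid[r][c];
        if here == b'S' || (any_a && here == b'a') {
            return Some(dist[r][c]);
        }
        for (dr, dc) in deltas {
            let (nr, nc) = (r as i32 + dr, c as i32 + dc);
            if nr < 0 || nc < 0 || nr as usize >= rows || nc as usize >= cols {
                continue;
            }
            let (nr, nc) = (nr as usize, nc as usize);
            // walking backwards: the forward move neighbour -> here must
            // climb by at most 1
            let climbable = height(here) <= height(grid[nr][nc]) + 1;
            if climbable && dist[nr][nc] == usize::MAX {
                dist[nr][nc] = dist[r][c] + 1;
                queue.push_back((nr, nc));
            }
        }
    }
    None
}
```

Starting at E means one search can answer both "from S" and "from the nearest a", which is the same reason for passing dir = -1 to dijkstra() above.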

Rust

Day 13: Distress Signal

This one involves comparing nested lists like [[1],[2,3,4]] vs [[1],4]. I really want to go back and try this one in Haskell because those comparisons are (I believe) built-in.

R

Rust

Day 14: Regolith Reservoir

I didn’t finish my Rust solution for this one, but I was very happy with my R solution. The goal here is to fill the area with falling sand, allowing for some obstacles.

R

f14a <- function(x) {
  allrocks <- lapply(x, rocks)
  cave <- matrix(".", nrow = 200, ncol = 1500)
  for (r in allrocks) {
    for (rr in seq_along(r[-1])) {
      f <- fill_rocks(r[rr], r[rr+1])
      cave[f] <- "#"
    }
  }
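  # fall() records resting grains in a global `sandmat` (used for plotting), so initialise it here
  sandmat <<- NULL
  done <- FALSE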
  while(!done) {
    cave <- fall(cave, c(1, 500+500))
    done <- cave[matrix(c(1,1), ncol = 2)] == "X"
  }
  sum(cave == "o")
}

fall <- function(cave, sand, crit = "fall") {
  down <- c(sand[1]+1, sand[2])
  if (crit == "fall" && down[1] > 200) {
    cave[matrix(c(1,1), ncol = 2)] <- "X"
    return(cave)
  } else if (blocked(cave, c(1, 500+500))) {
    sandmat <<- rbind(sandmat, c(1, 500+500))
    cave[matrix(c(1,500+500), ncol = 2)] <- "o"
    cave[matrix(c(1,1), ncol = 2)] <- "X"
    return(cave)
  }
  if (blocked(cave, down)) {
    downleft <- c(sand[1]+1, sand[2]-1)
    if (blocked(cave, downleft)) {
      downright <- c(sand[1]+1, sand[2]+1)
      if (blocked(cave, downright)) {
        sandmat <<- rbind(sandmat, sand)
        cave[matrix(sand, ncol = 2)] <- "o"
      } else {
        return(fall(cave, downright))
      }
    } else {
      return(fall(cave, downleft))
    }
  } else {
    return(fall(cave, down))
  }
  return(cave)
}

blocked <- function(cave, x) {
  cave[matrix(x, ncol = 2)] %in% c("#", "o")
}

rocks <- function(x) {
  rocks <- strsplit(x, " -> ")[[1]]
  rocks <- strsplit(rocks, ",")
  for (r in seq_along(rocks)) {
    rocks[[r]] <- as.integer(rocks[[r]])
    rocks[[r]][1] <- rocks[[r]][1] + 500
  }
  rocks
}

fill_rocks <- function(x, y) {
  x <- x[[1]]
  x[2] <- x[2] + 1
  y <- y[[1]]
  y[2] <- y[2] + 1
  # vertical segment: same x (column), so span the rows
  if (x[1] == y[1]) {
    span <- x[2]:y[2]
    return(matrix(c(span, rep(x[1], length(span))), ncol = 2, byrow = FALSE))
  }
  # horizontal segment: same y (row), so span the columns
  if (x[2] == y[2]) {
    span <- x[1]:y[1]
    return(matrix(c(rep(x[2], length(span)), span), ncol = 2, byrow = FALSE))
  }
}

I animated the falling sand, filling up the cave, but with so many particles it didn’t go very well, especially when limiting the frames

Instead, a render of the final solution, with the sand coloured by the time at which it came to rest, looked much cooler

Day 15: Beacon Exclusion Zone

(see links)

R

Day 16: Proboscidea Volcanium

(see links)

R

Day 17: Pyroclastic Flow

While it was never mentioned by name, this one was essentially a game of Tetris.

R

At this point, there’s too much code to copy inline. Check out the repo links.

I couldn’t help but plot this one as an animation…

Day 18: Boiling Boulders

I got to learn even more algorithms for this one - this time a flood-fill algorithm.
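For anyone who hasn't met flood-fill before, the idea can be sketched on a small 2D grid. This is an illustration only, not the solution from the repo (the puzzle itself is three-dimensional); the function name flood_fill and the logical-matrix encoding are assumptions.

```r
# Flood fill on a logical matrix: TRUE cells are walls, FALSE cells are open.
# Returns a logical matrix marking every open cell reachable from `start`,
# which is assumed to be an open cell given as c(row, col).
flood_fill <- function(walls, start) {
  reached <- matrix(FALSE, nrow(walls), ncol(walls))
  steps <- rbind(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))
  queue <- list(start)
  reached[start[1], start[2]] <- TRUE
  while (length(queue) > 0) {
    cell <- queue[[1]]
    queue <- queue[-1]
    for (s in seq_len(nrow(steps))) {
      nb <- cell + steps[s, ]
      inside <- nb[1] >= 1 && nb[1] <= nrow(walls) &&
        nb[2] >= 1 && nb[2] <= ncol(walls)
      if (inside && !walls[nb[1], nb[2]] && !reached[nb[1], nb[2]]) {
        reached[nb[1], nb[2]] <- TRUE
        queue[[length(queue) + 1]] <- nb
      }
    }
  }
  reached
}

# A ring of walls enclosing one open cell in the centre
walls <- matrix(FALSE, 5, 5)
walls[2:4, 2:4] <- TRUE
walls[3, 3] <- FALSE
outside <- flood_fill(walls, c(1, 1))
outside[3, 3]  # FALSE: the pocket is sealed off from the outside
sum(outside)   # 16: only the outer ring of cells is reached
```

Filling from the outside and seeing what is never reached is one way to tell enclosed pockets apart from the exterior.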

R

I took advantage of one of coolbutuseless’ packages {isocubes} to plot the shape of the lava droplet

Day 19: Not Enough Minerals

(see links)

R

Day 20: Grove Positioning System

(see links)

R

Day 21: Monkey Math

I had no intentions of making it on to the leaderboard for timing, even though the puzzles were released at an entirely reasonable time for me. I actually got to this one late in the evening due to some other commitments, but I am quietly confident that my R solution could have been one of the fastest solves…

The problem is figuring out what the value of the ‘root’ monkey is, given the following operations

root: pppw + sjmn
dbpl: 5
cczh: sllz + lgvd
zczc: 2
ptdq: humn - dvpt
dvpt: 3
lfqf: 4
humn: 5
ljgn: 2
sjmn: drzm * dbpl
sllz: 4
pppw: cczh / lfqf
lgvd: ljgn * ptdq
drzm: hmdt - zczc
hmdt: 32

R

I realised fairly quickly that I could just make each of the connections a function call, and evaluate the entire stack! This was very fast to write, and I got to my full solution faster than most of those at the top of the leaderboard, but much later in the day.

f21a <- function(x) {
  defs <- sapply(x, parseInput)
  for (d in defs) {
    eval(parse(text = d))
  }
  format(root(), scientific = FALSE)
}

parseInput <- function(x) {
  monkey <- sub("^(.*):.*", "\\1", x)
  ret <- sub(".*: (.*)$", "\\1", x)
  if (is.na(suppressWarnings(as.integer(ret)))) {
    ret <- strsplit(ret, " ")[[1]]
    v1 <- ret[1]
    op <- ret[2]
    v2 <- ret[3]
    def <- paste0(monkey, " <- function() { ", v1, "() ", op, " ", v2, "() }")
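  } else {
    def <- paste0(monkey, " <- function() { ", ret, " }")
  }
  # return the definition explicitly rather than relying on the
  # (invisible) value of the assignment above
  def
}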

Sure, sometimes (most of the time), eval(parse(text = )) is a terrible idea, but in this case it worked out great!
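For comparison, the same idea can be sketched without generating code: keep each monkey's job in a named list and resolve names recursively. This is a hypothetical alternative rather than what I submitted; solve_monkeys and its internals are names invented for the illustration, and it assumes every monkey is either a number or a "lhs op rhs" triple, as in the example above.

```r
# Hypothetical alternative to f21a(): look jobs up by name instead of
# building function definitions as strings.
# `x` is a character vector of lines such as "root: pppw + sjmn".
solve_monkeys <- function(x) {
  parts <- strsplit(x, ": ", fixed = TRUE)
  jobs <- setNames(
    lapply(parts, function(p) strsplit(p[2], " ", fixed = TRUE)[[1]]),
    vapply(parts, function(p) p[1], character(1))
  )
  value <- function(monkey) {
    job <- jobs[[monkey]]
    if (length(job) == 1) {
      return(as.numeric(job))
    }
    # job is c(lhs, op, rhs); match.fun() turns "+" into the function `+`
    match.fun(job[2])(value(job[1]), value(job[3]))
  }
  value("root")
}

example <- c(
  "root: pppw + sjmn", "dbpl: 5", "cczh: sllz + lgvd", "zczc: 2",
  "ptdq: humn - dvpt", "dvpt: 3", "lfqf: 4", "humn: 5", "ljgn: 2",
  "sjmn: drzm * dbpl", "sllz: 4", "pppw: cczh / lfqf",
  "lgvd: ljgn * ptdq", "drzm: hmdt - zczc", "hmdt: 32"
)
solve_monkeys(example)  # 152
```

It is a few more lines than the eval(parse(text = )) version, but nothing from the input is ever executed as code.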

Day 22: Monkey Map

(see links)

R

Day 23: Unstable Diffusion

(see links)

R

Day 24: Blizzard Basin

(see links)

R

Day 25: Full of Hot Air

(see links)

R

Summary

I really enjoyed Advent of Code, and I ended up donating as thanks for providing such a nice experience. I’ll be having a go at AoC 2023 but won’t be so strict; I may not solve each puzzle on the day it’s released and I will be allowing myself to use whatever libraries and whatever languages I want.

Will you be participating? I’d love to compare solutions once we’re done! I can be found on Mastodon and I’ll be commenting on the puzzles as I go.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.2 (2023-10-31)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-11-28
##  pandoc   3.1.8 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.18    2023-06-19 [1] CRAN (R 4.3.2)
##  bookdown      0.36    2023-10-16 [1] CRAN (R 4.3.2)
##  bslib         0.5.1   2023-08-11 [3] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [3] CRAN (R 4.3.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.6.1   2023-03-23 [3] CRAN (R 4.2.3)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.3.2)
##  digest        0.6.33  2023-07-07 [3] CRAN (R 4.3.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.22    2023-09-29 [3] CRAN (R 4.3.1)
##  fastmap       1.1.1   2023-02-24 [3] CRAN (R 4.2.2)
##  fs            1.6.3   2023-07-20 [3] CRAN (R 4.3.1)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  htmltools     0.5.6.1 2023-10-06 [3] CRAN (R 4.3.1)
##  htmlwidgets   1.6.2   2023-03-17 [1] CRAN (R 4.3.2)
##  httpuv        1.6.12  2023-10-23 [1] CRAN (R 4.3.2)
##  icecream      0.2.1   2023-09-27 [1] CRAN (R 4.3.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.7   2023-06-29 [3] CRAN (R 4.3.1)
##  knitr         1.44    2023-09-11 [3] CRAN (R 4.3.1)
##  later         1.3.1   2023-05-02 [1] CRAN (R 4.3.2)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.3.2)
##  pkgbuild      1.4.2   2023-06-26 [1] CRAN (R 4.3.2)
##  pkgload       1.3.3   2023-09-22 [1] CRAN (R 4.3.2)
##  prettyunits   1.2.0   2023-09-24 [3] CRAN (R 4.3.1)
##  processx      3.8.2   2023-06-30 [3] CRAN (R 4.3.1)
##  profvis       0.3.8   2023-05-02 [1] CRAN (R 4.3.2)
##  promises      1.2.1   2023-08-10 [1] CRAN (R 4.3.2)
##  ps            1.7.5   2023-04-18 [3] CRAN (R 4.3.0)
##  purrr         1.0.2   2023-08-10 [3] CRAN (R 4.3.1)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.11  2023-07-06 [1] CRAN (R 4.3.2)
##  remotes       2.4.2.1 2023-07-18 [1] CRAN (R 4.3.2)
##  rlang         1.1.1   2023-04-28 [3] CRAN (R 4.3.0)
##  rmarkdown     2.25    2023-09-18 [3] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [3] CRAN (R 4.3.1)
##  sass          0.4.7   2023-07-15 [3] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.2)
##  shiny         1.7.5.1 2023-10-14 [1] CRAN (R 4.3.2)
##  stringi       1.7.12  2023-01-11 [3] CRAN (R 4.2.2)
##  stringr       1.5.0   2022-12-02 [3] CRAN (R 4.3.0)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.3.2)
##  usethis       2.2.2   2023-07-06 [1] CRAN (R 4.3.2)
##  vctrs         0.6.4   2023-10-12 [3] CRAN (R 4.3.1)
##  xfun          0.40    2023-08-09 [3] CRAN (R 4.3.1)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.3.2)
##  yaml          2.3.7   2023-01-23 [3] CRAN (R 4.2.2)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.3
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Print Debugging (Now with Icecream!)

website@jcarroll.com.au (Jonathan Carroll) — Tue, 07 Nov 2023 00:00:00 +0000

Print debugging has its place. Sure, it’s not always the best way to debug something, but it can often be the fastest. In this post I describe a useful way to do this in Rust and how we can get similar behaviour in R.

I love the Rust dbg! macro - it wraps a value or expression and prints the result to help debug what’s happening in the middle of some function. If we had some complicated function that combined some values, e.g.

fn f(val1: i32, val2: i32) -> i32 {
    // do some things
    let otherval: i32 = 10;
    // final result
    val1 + val2 + otherval
}

fn main() {
    println!("final result = {}", f(5, 6))
}

Running this gives just the final result, as expected

final result = 21

We might want to check what the intermediate combinations of otherval with val1 and val2 are (terrible names, I know). One option is to just print them

fn f(val1: i32, val2: i32) -> i32 {
    // do some things
    let otherval: i32 = 10;
    println!("{}", otherval + val1);
    println!("{}", otherval + val2);
    // final result
    val1 + val2 + otherval
}

fn main() {
    println!("final result = {}", f(5, 6))
}

Running this shows the values we printed, but with no context

15
16
final result = 21

We could add some context manually

fn f(val1: i32, val2: i32) -> i32 {
    // do some things
    let otherval: i32 = 10;
    println!("first temp val = {}", otherval + val1);
    println!("second temp val = {}", otherval + val2);
    // final result
    val1 + val2 + otherval
}

fn main() {
    println!("final result = {}", f(5, 6))
}

producing

first temp val = 15
second temp val = 16
final result = 21

but across an entire codebase, this is going to get messy, fast.

A better option is the dbg! macro, which takes an expression (a value, or something that produces a value) and prints both the expression itself and the resulting value

fn f(val1: i32, val2: i32) -> i32 {
    // do some things
    let otherval: i32 = 10;
    dbg!(otherval + val1);
    dbg!(otherval + val2);
    // final result
    val1 + val2 + otherval
}

fn main() {
    println!("final result = {}", f(5, 6))
}

Running this produces

[src/main.rs:15] otherval + val1 = 15
[src/main.rs:16] otherval + val2 = 16
final result = 21

and we see we get the file and line-number context, the expression we wrapped, and the value.

This is extremely useful, and helps me to figure out what’s going on in the middle of some implementation.

One of the downsides is that even if I make a release build, these statements remain, so I need to go through and remove them all once I’m finished debugging.

A “better” solution is to use a full logging solution like the log crate which enables using different log levels, turning off logging outside of some threshold, etc… but that seems more suited to intentional logging, not debugging something that isn’t working.

Having played with this in Rust, of course I wanted the same thing in R. I built a minimal viable proof-of-concept which leverages {rlang} to capture the expression

dbg <- function(x) {
  ex <- rlang::f_text(rlang::enquos(x)[[1]])
  ret <- rlang::eval_bare(x)
  message(glue::glue("DEBUG: {ex} = {ret}"))
  ret
}

This works rather well - it postpones evaluation of the expression and prints the result without affecting any variables

a <- 1
b <- 3
x <- dbg(a + b)

## DEBUG: a + b = 4

y <- dbg(2*x + 3)

## DEBUG: 2 * x + 3 = 11

z <- 10 + dbg(y*2)

## DEBUG: y * 2 = 22

## [1] 32

It lacks one beautiful part of the Rust solution, though - if I include this in some functions sourced from a file, I won’t be able to tell which file the statement came from. Plus, it doesn’t deal so well with large structures

x <- dbg(head(mtcars))

## DEBUG: head(mtcars) = c(21, 21, 22.8, 21.4, 18.7, 18.1)DEBUG: head(mtcars) = c(6, 6, 4, 6, 8, 6)DEBUG: head(mtcars) = c(160, 160, 108, 258, 360, 225)DEBUG: head(mtcars) = c(110, 110, 93, 110, 175, 105)DEBUG: head(mtcars) = c(3.9, 3.9, 3.85, 3.08, 3.15, 2.76)DEBUG: head(mtcars) = c(2.62, 2.875, 2.32, 3.215, 3.44, 3.46)DEBUG: head(mtcars) = c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22)DEBUG: head(mtcars) = c(0, 0, 1, 1, 0, 1)DEBUG: head(mtcars) = c(1, 1, 1, 0, 0, 0)DEBUG: head(mtcars) = c(4, 4, 4, 3, 3, 3)DEBUG: head(mtcars) = c(4, 4, 1, 1, 2, 1)
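The large-structure problem comes from glue vectorising over the columns of the data.frame. A base-R sketch that avoids it is below; dbg2 and max_lines are hypothetical names, and this is only a variation on the proof-of-concept above, not a replacement for a real package. It still doesn't know which file it was called from.

```r
# A base-R sketch of dbg(): substitute() captures the unevaluated argument,
# and capture.output(print(...)) formats the value the way the console would.
dbg2 <- function(x, max_lines = 5) {
  ex <- paste(deparse(substitute(x)), collapse = " ")
  shown <- utils::capture.output(print(x))
  # keep only the first few printed lines of large objects
  if (length(shown) > max_lines) {
    shown <- c(shown[seq_len(max_lines)], "...")
  }
  if (length(shown) == 1) {
    message("DEBUG: ", ex, " = ", shown)
  } else {
    message("DEBUG: ", ex, " =\n", paste(shown, collapse = "\n"))
  }
  # hand the value back so dbg2() can sit in the middle of an expression
  invisible(x)
}

a <- 1
b <- 3
x <- dbg2(a + b)
y <- dbg2(head(mtcars))
```

The first call emits a single line ending in "[1] 4"; the second prints the header and the first rows of the table once, followed by "...".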

At some point I saw a blog post about a debug logging package {icecream} which had this ability of tracing the srcref of a file containing the debug statement, so I wanted to see if I could extract that to suit my needs. Running the ic() statement as a print debugger works nicely

f <- function(val1, val2) {
  otherval <- 10
  icecream::ic(otherval + val1)
  icecream::ic(otherval + val2)
  val1 + val2 + otherval
}
f(5, 6)

## ℹ ic| `otherval + val1`: num 15

## ℹ ic| `otherval + val2`: num 16

## [1] 21

it indeed wraps the expression and shows the result. After poking around at the internals, I realised it actually does everything that I wanted, I just needed to change some of the defaults

options(icecream.peeking.function = utils::head,
        icecream.max.lines = 5,
        icecream.prefix = "dbg:",
        icecream.always.include.context = TRUE)

Now it prints like the dbg! macro

f <- function(val1, val2) {
  otherval <- 10
  icecream::ic(otherval + val1)
  icecream::ic(otherval + val2)
  val1 + val2 + otherval
}
f(5, 6)

## ℹ dbg: `f()` in <env: global> | `otherval + val1`: [1] 15

## ℹ dbg: `f()` in <env: global> | `otherval + val2`: [1] 16

## [1] 21

To make it even more like the Rust macro, I made a similar binding of .dbg (so that it doesn’t appear in my workspace by default) and added the following to my .Rprofile

# install.packages("icecream")
if (requireNamespace("icecream", quietly = TRUE)) {
  .dbg <- icecream::ic
  options(icecream.peeking.function = utils::head,
          icecream.max.lines = 5,
          icecream.prefix = "dbg:",
          icecream.always.include.context = TRUE)
}

so now I can add debug statements

f <- function(val1, val2) {
  otherval <- 10
  .dbg(otherval + val1)
  .dbg(otherval + val2)
  val1 + val2 + otherval
}
f(5, 6)

## ℹ dbg: `f()` in <env: global> | `otherval + val1`: [1] 15

## ℹ dbg: `f()` in <env: global> | `otherval + val2`: [1] 16

## [1] 21

Better yet, I can turn them off if I don’t need them

icecream::ic_disable()
f(5, 6)

## [1] 21

This works as I had hoped; I can even debug partway through an expression

icecream::ic_enable()
3 + .dbg(4 + 6)

## ℹ dbg: <env: global> | `4 + 6`: [1] 10

## [1] 13

and if I source a file, I get the context

## test_dbg.R:
f <- function() {
  x <- 7
  .dbg(x + 3)
  7
}

source("test_dbg.R")
f()

ℹ dbg: `f()` in test_dbg.R:3:2 | `x + 3`: [1] 10

[1] 7

It even deals with printing larger objects, given the icecream.peeking.function and icecream.max.lines options set above

.dbg(mtcars)

## ℹ dbg: <env: global> | `mtcars`: 
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

That seems to be feature-equivalent to the Rust dbg! macro, plus the ability to turn off the messages, so I’m very happy with this result.

Do you have a better solution? I can be found on Mastodon or use the comments below.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-11-07
##  pandoc   3.1.8 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown      0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib         0.5.1   2023-08-11 [3] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [3] CRAN (R 4.3.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.6.1   2023-03-23 [3] CRAN (R 4.2.3)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest        0.6.33  2023-07-07 [3] CRAN (R 4.3.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.21    2023-05-05 [3] CRAN (R 4.3.0)
##  fansi         1.0.4   2023-01-22 [3] CRAN (R 4.2.2)
##  fastmap       1.1.1   2023-02-24 [3] CRAN (R 4.2.2)
##  fs            1.6.3   2023-07-20 [3] CRAN (R 4.3.1)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  htmltools     0.5.6   2023-08-10 [3] CRAN (R 4.3.1)
##  htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  icecream      0.2.1   2023-09-27 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.7   2023-06-29 [3] CRAN (R 4.3.1)
##  knitr         1.43    2023-05-25 [3] CRAN (R 4.3.0)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  pillar        1.9.0   2023-03-22 [3] CRAN (R 4.2.3)
##  pkgbuild      1.4.0   2022-11-27 [1] CRAN (R 4.1.2)
##  pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.8.2   2023-06-30 [3] CRAN (R 4.3.1)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps            1.7.5   2023-04-18 [3] CRAN (R 4.3.0)
##  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.1.2)
##  rmarkdown     2.24    2023-08-14 [3] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [3] CRAN (R 4.3.1)
##  sass          0.4.7   2023-07-15 [3] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny         1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi       1.7.12  2023-01-11 [3] CRAN (R 4.2.2)
##  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  utf8          1.2.3   2023-01-31 [3] CRAN (R 4.2.2)
##  vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.1.2)
##  xfun          0.40    2023-08-09 [3] CRAN (R 4.3.1)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml          2.3.7   2023-01-23 [3] CRAN (R 4.2.2)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Hooray, Array!

website@jcarroll.com.au (Jonathan Carroll) — Mon, 09 Oct 2023 00:00:00 +0000

If you’re reading this hoping that I’m done with droning on about array-languages, close the tab… it only gets worse from here. If you thought APL was unreadable, even after my earlier blog posts, again - close button is right there. In this post I try out a brand new stack-based array language and continue to advocate for learning such things.

I subscribe to a lot of RSS feeds these days - it’s certainly making a comeback, and it’s great to see developers returning to blogging outside of paid platforms. Keeping up with all of those posts, however, does take quite a bit of time. So when I find one that I really do find engaging, I do my best to dig in.

This post by Hillel Wayne wasn’t specifically interesting (my dance card for learning new languages is already pretty full, but I can’t help looking at others) but it did present a small, bite-sized puzzle to solve; what’s a simple way to “generate 5 random 10-character strings”. Now, that’s pretty much a code-golf question right there. Hillel presents a solution in Raku (a.k.a. Perl 6 - note the “p” and “6” in the Raku logo) as a “quick” solution

for ^5 {say ('a'..'z').roll(10).join}

dmjfpwxspu
vernmlljkw
korntotesp
rkpewsoqjn
blswvruden

and I can’t argue with that - it’s readable (even though I don’t know much Perl/Raku), I can reason about what and how it’s doing what it’s doing (making an educated guess about ^ which is indeed a range operation and say being an output method; roll is a nice choice for random selection).

When I see problems like this, I start to think through what tools I could use to solve it. I still default to R because it’s the language I know best, but my first attempt isn’t nearly as concise

sapply(1:5, \(x) paste0(sample(letters, 10, replace = TRUE), collapse = ""))

## [1] "kcqicytylm" "peybjcbumk" "bbvhibgqjs" "uzbpzrkywp" "zettlmjghm"

I do like that R has letters (and LETTERS) as built-in structures; that makes things a little easier. I could write that just as easily as a for-loop, especially since I don’t actually need an argument to the anonymous function

for (i in 1:5) {
  print(
    paste0(
      sample(letters, 10, replace = TRUE), 
      collapse = ""
    )
  )
}

## [1] "zwwpihqipr"
## [1] "cxunwvojaq"
## [1] "xlkcubjysw"
## [1] "ilpohtgcag"
## [1] "ralzlrszen"

Side-by-side, these aren’t all that different…

Raku vs R solutions with common colouring

R defaults to replace = FALSE which needs adjusting, and doesn’t like concatenating strings quite so easily as join(), but otherwise the translation is fairly straightforward. The Raku version is shorter, for sure.

I could probably go and try a few other languages, but I’m all too tempted to try APL. Unfortunately, tryapl.org seems to be down, but then I remembered… New on the scene is Uiua (pronounced “wee-wuh”) following the footsteps of other APL-descendants such as BQN. This was covered by The Array Cast panel who interviewed the author, as well as Conor from the same group in several videos.

The idea of a stack-based language is that you put some data “on the stack” then you no longer need to refer to arguments; a monadic function just applies to whatever is on the top of the stack. A dyadic function just applies to the top two pieces of data on the stack. Need another copy of your data somewhere in your processing? Just duplicate it on the stack.

The way this works in practice is you “read” from right-to-left (same as APL), so if we put the values 2 and 3 on the stack then use the dyadic + function

+ 3 2

Similar to APL, this language uses glyphs for primitive functions, but a really nice feature is writing out the “name” of the function you want (in the above case, add) which the interpreter will convert to the glyph for you, so

add 3 2

produces the same code (with glyphs) and output.
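If the stack idea is new to you, a toy model in R may help. This is purely an illustration of the evaluation order, not how Uiua is implemented; run_stack and its two-entry function table are made up for the example, and it only understands numbers, add and mult.

```r
# Toy stack machine: tokens are read right-to-left; numbers are pushed,
# and a dyadic function pops two values and pushes the result.
run_stack <- function(code) {
  dyads <- list(add = `+`, mult = `*`)  # only two functions, for illustration
  stack <- c()
  for (tok in rev(strsplit(code, " +")[[1]])) {
    if (tok %in% names(dyads)) {
      n <- length(stack)
      top <- stack[n]
      second <- stack[n - 1]
      # replace the top two values with the result
      stack <- c(stack[seq_len(n - 2)], dyads[[tok]](top, second))
    } else {
      stack <- c(stack, as.numeric(tok))
    }
  }
  stack
}

run_stack("add 3 2")          # 5
run_stack("mult 10 add 3 2")  # 50
```

Note that no function ever names its arguments; each one just consumes whatever is on top of the stack when it runs.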

Working with a stack would certainly be something different for me, but I figured it’s worth a try! The first hurdle came quickly; how do I get the letters of the alphabet? Reading through the examples, I found that I can specify a character literal with @, and Uiua supports some arithmetic on these so this works

+ @a 1
+ @a 25

@b
@z

Next, I needed to generate all the letters, and thankfully, adding a range from 0 to 25 (⇡26) works

+@a⇡26

"abcdefghijklmnopqrstuvwxyz"

Note that you can also use add @a range 26 - the interpreter inserts the glyphs for you.

Next, I need a way to sample 10 letters from this. There’s a rand function (the glyph looks like a dice - nice!) but it only produces a single value between 0 and 1. Additionally, I need to run this several times to get the 10 values. The front page of the website has a nice example demonstrating exactly this, so that helps. It uses rand (⚂) and repeat (⍥) to generate 5 random numbers between 0 and 1, then mult (×) to bring the range up to 0 to 10, then finally floor (⌊) to return to integers.

⌊×10[⍥⚂5]

[5 3 7 8 4]

In my case, I want to generate 10 values and I need to multiply by 26 to have the right indices

⌊×26[⍥⚂10]+@a⇡26

"abcdefghijklmnopqrstuvwxyz"
[10 25 23 20 4 25 15 2 24 24]

The values on the stack are then the letters of the alphabet, and 10 indices to be selected.

This is where I had to pause and think - how do I repeat this 5 times? There’s no loops (I don’t think). Then I realised - this is an array language… I should be leveraging that!

Instead of asking for 10 indices, I can ask for 50. Then, I just need to reshape (↯) these 50 values into 5 groups of 10

↯5_10⌊×26[⍥⚂50]+@a⇡26

"abcdefghijklmnopqrstuvwxyz"
╭─                               
╷ 21 19  4 18  2 24  6  1  2  6  
   0 12  1  1 12  2 12  7 12  0  
   5  1 19  6 22 19 23 18 12 25  
  20 13 10 19 17  2 12  1 16  4  
   9 24  6  9 18  6 21 18 23  1  
                                ╯

Now, there are two data objects on the stack; the alphabet, and an array of indices to be selected. A dyadic select (⊏) will take these two objects, and select elements of the first based on indices of the second, and voila!

⊏↯5_10⌊×26[⍥⚂50]+@a⇡26

╭─              
╷ "gewctqbttq"  
  "vsbvzbqiod"  
  "wpmkmnuxwz"  
  "rymyxzqibo"  
  "zxtnpadwvl"  
               ╯
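To check my understanding of each glyph, the same pipeline can be mirrored step by step in R. This is a rough translation under my own reading of the glyphs, and the variable names are just for illustration; the one real difference is that R indexes from 1, so the 0-to-25 indices need shifting.

```r
# Step-by-step R analogue of the Uiua pipeline (R indexes from 1, so add 1)
alphabet <- letters                                      # +@a⇡26
idx <- floor(26 * runif(50))                             # ⌊×26[⍥⚂50]: integers 0..25
grid <- matrix(idx, nrow = 5, ncol = 10, byrow = TRUE)   # ↯5_10
picked <- matrix(alphabet[grid + 1], nrow = 5)           # ⊏
strings <- apply(picked, 1, paste0, collapse = "")       # join each row
strings
```

The final apply() is the only step without a Uiua counterpart; there a row of characters already is a string.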

That’s a walkthrough of the glyphs in my final solution - you can play with it on the website yourself - but one could enter those function names in full and the interpreter will figure it out for you

select reshape 5_10 floor mult 26[repeat rand 50] add @a range 26

...

⊏ ↯ 5_10 ⌊ × 26[⍥⚂ 50] + @a ⇡ 26

╭─              
╷ "wtyefkiavu"  
  "gfllwkuqcn"  
  "qydoiyqprk"  
  "awvdxdsymj"  
  "zzvueychem"  
               ╯

I know people like to say this is “unreadable” but with a little colour, a lot of the elements of the Raku and R solutions are here

Uiua solution with colours corresponding to the earlier Raku and R solutions

So… is that more concise than the R or even Raku solutions? Gosh, no. BUT, I had a lot more fun writing it. For certain problems, APL-like languages really do have a lot to offer, and for all I know there’s a much better way to spell this in very few glyphs that I’ve overlooked (feel free to send me one!).

You can make quite complex things; the Uiua logo itself - made in Uiua!

xy ← ⍘⍉⊞⊟. ÷÷2∶ -÷2,⇡.200
Rgb ← [∶⍘⊟×.xy ↯△⊢xy0.5]
u ← ↥<0.2∶>0.7.+×2 ×.∶⍘⊟xy
c ← <∶√/+ⁿ2 xy
⍉⊂∶-¬u c1 +0.1 ∺↧Rgb c0.95

Uiua code to generate the Uiua logo

Another neat fact about this language is that it’s written in Rust, so it’s potentially quite fast as well. Array stuff in Rust is top of mind for me at the moment - this cool post from earlier in the year covers an implementation of some APL-like array processing in Rust which I’m keen to dig deeper into (there’s a not-too-old repo of things already built). I clearly need to re-read my own posts, because I actually linked to that cool post above in my first APL-related post, but only because I had searched for “rank polymorphism” and it fit the bill.

The fact that R has a lot of these array-compatible functions out-of-the-box is terribly underpromoted and undercelebrated. Bringing this back around to R, can I use the array method there? I can certainly build a matrix of 50 letters quite concisely, though the fact that R doesn’t concatenate characters so easily still hurts

m <- matrix(sample(letters, 50, replace = TRUE), 5, 10)
apply(m, 1, \(x) paste0(x, collapse = ""))

## [1] "wzyfpoyegm" "xjehbspfql" "vjpvimtwkm" "uzkwmgcmix" "suakdpagvl"

I’m hoping to play a bit more with Uiua, and I was genuinely impressed that I managed to solve this at all, but I’m still just beginning my journey in APL and there’s no shortage of things to learn there. In fact, despite having no tryapl.org, I do have the Ride editor locally. A bit of searching for clues later, and I have something!

In (Dyalog) APL you can create the uppercase alphabet with just

  ⎕A

ABCDEFGHIJKLMNOPQRSTUVWXYZ

similar to LETTERS. Lowercase letters can be generated with

  819⌶⎕A

abcdefghijklmnopqrstuvwxyz

or (possibly implementation-specific)

  ⎕c⎕a

abcdefghijklmnopqrstuvwxyz

Selecting random elements from this involves roll with the syntax

  ?5 10⍴26

18 16  7  8 22 25 15 17 24 19
18 23 24  9 25 17  4  2 25 24
10 13  6 11 10 17 21  9 15 20
25  8  3 12  4  2 21  3  1 18
 2  5 17 19 25  3  3 21  9  4

which produces random values between 1 and the right argument (in this case, a 5x10 reshape of the value 26 repeated over and over). That’s exactly what we need as indices to select letters. Putting these together

⎕c⎕a[?5 10⍴26]

axpyohnotq
hsrottizwk
dgecrgxbcu
qvvxszptpq
wmaktfuvwf

Even better - if we store the letters like R does, and define a functional version which takes a left argument (⍺; the shape of the array), a right argument (⍵; the letters to sample from), and automatically calculates the length as ≢⍵, then the entire solution is

letters←⎕c⎕a
randstrings←{⍵[?⍺⍴≢⍵]}
5 10 randstrings letters

npentutsdo
jttcnqeuqm
imgrtupyfx
eliiqnishu
jonkovlmcn

Okay, that’s concise! And, provided you know what ?, ⍴, and ≢ do, it’s fairly readable (in my opinion, at least).
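For R readers, the dfn maps fairly directly onto a two-argument function. The sketch below is my own hypothetical counterpart (randstrings, shape and x are names chosen to mirror the APL), with ⍺ becoming the first argument and ⍵ the second; it assumes a two-element shape.

```r
# Hypothetical R counterpart of the APL dfn {⍵[?⍺⍴≢⍵]}
randstrings <- function(shape, x) {
  # ?⍺⍴≢⍵ : prod(shape) random indices between 1 and length(x)
  idx <- sample.int(length(x), prod(shape), replace = TRUE)
  # ⍵[...] : select, then lay the picks out as a shape[1] x shape[2] matrix
  m <- matrix(x[idx], nrow = shape[1], ncol = shape[2])
  # APL prints a character matrix as rows of text; R needs an explicit paste
  apply(m, 1, paste0, collapse = "")
}

randstrings(c(5, 10), letters)
```

Both languages index from 1 here, so unlike the Uiua version there is no offset to worry about.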

Can you make a better/shorter/more interesting solution to the random strings problem? Or can you improve the Uiua solution? I can be found on Mastodon or use the comments below.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-10-09
##  pandoc   3.1.8 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown      0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib         0.5.1   2023-08-11 [3] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [3] CRAN (R 4.3.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.6.1   2023-03-23 [3] CRAN (R 4.2.3)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest        0.6.33  2023-07-07 [3] CRAN (R 4.3.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.21    2023-05-05 [3] CRAN (R 4.3.0)
##  fastmap       1.1.1   2023-02-24 [3] CRAN (R 4.2.2)
##  fs            1.6.3   2023-07-20 [3] CRAN (R 4.3.1)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  htmltools     0.5.6   2023-08-10 [3] CRAN (R 4.3.1)
##  htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.7   2023-06-29 [3] CRAN (R 4.3.1)
##  knitr         1.43    2023-05-25 [3] CRAN (R 4.3.0)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  pkgbuild      1.4.0   2022-11-27 [1] CRAN (R 4.1.2)
##  pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.8.2   2023-06-30 [3] CRAN (R 4.3.1)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps            1.7.5   2023-04-18 [3] CRAN (R 4.3.0)
##  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.1.2)
##  rmarkdown     2.24    2023-08-14 [3] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [3] CRAN (R 4.3.1)
##  sass          0.4.7   2023-07-15 [3] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny         1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi       1.7.12  2023-01-11 [3] CRAN (R 4.2.2)
##  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.1.2)
##  xfun          0.40    2023-08-09 [3] CRAN (R 4.3.1)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml          2.3.7   2023-01-23 [3] CRAN (R 4.2.2)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Four Filters for Functional (Programming) Friends

website@jcarroll.com.au (Jonathan Carroll) — Wed, 30 Aug 2023 00:00:00 +0000

I’m part of a local Functional Programming Meetup group which hosts talks, but also coordinates social meetings where we discuss all sorts of FP-related topics including Haskell and other languages. We’ve started running challenges where we all solve a given problem in a language of our choosing then discuss over drinks how they compare.

This month we went with an “easy” problem with a wrinkle - we would solve the ‘strain’ exercise from Exercism (Haskell, Python - your access to these is likely conditional on you being enrolled in that language track) with an extension:

The problem is trivial; the challenge is to solve it in 4 different ways using your language of choice.

The problem itself is given as

Implement the keep and discard operation on collections. Given a collection and a predicate on the collection’s elements, keep returns a new collection containing those elements where the predicate is true, while discard returns a new collection containing those elements where the predicate is false.

For example, given the collection of numbers:

1, 2, 3, 4, 5

And the predicate:

“is the number even?”

Then your keep operation should produce:

2, 4

While your discard operation should produce:

1, 3, 5

but with a restriction:

Keep your hands off that filter/reject/whatchamacallit functionality provided by your standard library! Solve this one yourself using other basic tools instead.

I figured it’s a good opportunity to write as I solve it, so here are my R solutions.

I’ll define a test case so I can try out things as I go

test_vec <- c(1, 2, 3, 4, 5)
test_vec

## [1] 1 2 3 4 5

and the predicates related to ‘even’ and ‘odd’ as functions which return TRUE or FALSE

is_even <- function(x) {
  x %% 2 == 0
}

is_even(7)

## [1] FALSE

is_even(8)

## [1] TRUE

is_odd <- function(x) {
  !is_even(x)
}

is_odd(7)

## [1] TRUE

is_odd(8)

## [1] FALSE

Firstly, the restriction doesn’t seem to worry me because when I think of “filter” in R I immediately think of dplyr::filter() which works on data.frame (or tibble) objects, and (given the examples) we’re aiming to work with vectors (the problem is stated the same in several languages, so “collection” is a generalisation).

What about base::Filter()? The help states

Filter extracts the elements of a vector for which a predicate (logical) function gives true.

Filter(is_even, test_vec)

## [1] 2 4

Filter(is_odd, test_vec)

## [1] 1 3 5

Yep, that works exactly as I hoped, but is also a built-in “filter” so I can’t use it.

When I think of keep and discard I do think of the purrr functions, and while these, too, do exactly what I want

purrr::keep(test_vec, is_even)

## [1] 2 4

purrr::discard(test_vec, is_even)

## [1] 1 3 5

They’re in a library, so I’m going to say they don’t count.

One of the things I like about the way R does subsetting (via the square-bracket [ which is by itself a function, but requires a matching ] to satisfy the parser) is that you can use a logical vector to subset another vector,

c(3, 5, 8, 12)[c(TRUE, FALSE, FALSE, TRUE)]

## [1]  3 12

which means that if I can produce such a logical vector, say, by applying a predicate function, I can do subsetting that way

test_vec[is_even(test_vec)]

## [1] 2 4

test_vec[is_odd(test_vec)]

## [1] 1 3 5

Instead of using is_odd() I can just negate the logical vector to get the same effect

test_vec[!is_even(test_vec)]

## [1] 1 3 5

I can make those into functions that take a predicate and a vector

keep_1 <- function(f, x) {
  x[f(x)]
}

discard_1 <- function(f, x) {
  x[!f(x)]
}

Testing these

keep_1(is_even, test_vec)

## [1] 2 4

discard_1(is_even, test_vec)

## [1] 1 3 5

One down!

One thing to note with this approach is that R is vectorised - I’ve discussed this a few times on this blog (most recently) - which means that these predicate functions will gladly take a vector, not just a single value. This works for the is_even() function because inside that, the modulo operator %% is itself vectorised, so

is_even(c(2, 4, 6, 9, 11, 13))

## [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

As I wrote in my previous post, thinking like this just becomes so natural in R that I have to force myself to remember that not every language does that.

It’s also worth mentioning that I’m passing a reference to the function is_even to our keep and discard functions - that is serving as our predicate because I need a way to state “is the number even?” which references the number, so I need a function. That doesn’t have to be a named function, though - it could be an “anonymous” function (a “lambda”) if I wanted

keep_1(function(z) z %% 2 == 0, test_vec)

## [1] 2 4

discard_1(function(z) z %% 2 == 0, test_vec)

## [1] 1 3 5
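Since the session used here is R 4.1, the shorter `\(z)` spelling of `function(z)` should also be available (it arrived in R 4.1.0). A minimal sketch, with keep_1() and discard_1() redefined in compact form so the block runs on its own:

```r
# Redefined from above so this block is self-contained
keep_1 <- function(f, x) x[f(x)]
discard_1 <- function(f, x) x[!f(x)]

test_vec <- c(1, 2, 3, 4, 5)

# \(z) is shorthand for function(z); needs R >= 4.1.0
kept <- keep_1(\(z) z %% 2 == 0, test_vec)
dropped <- discard_1(\(z) z %% 2 == 0, test_vec)

kept
dropped
```

On older versions of R the `\(z)` form is a syntax error, so the longhand `function(z)` is the safer choice for code that needs to run everywhere.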

I can subset a vector with a logical vector of the same length, specifying whether or not to include that element, but I can also subset by position (keeping in mind that R is a 1-based language which means the first element is indexed by a 1 - why would any language do anything different? 😜)

c(3, 5, 8, 12)[c(1, 4)]

## [1]  3 12

The function which() takes a logical vector and returns which indices are TRUE

which(c(TRUE, FALSE, FALSE, TRUE))

## [1] 1 4

so I can use this with our predicate to keep elements

keep_2 <- function(f, x) {
  x[which(f(x))]
}

keep_2(is_even, test_vec)

## [1] 2 4
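One place where the logical and positional approaches differ is missing values. A small sketch, using a vector with an NA that is not part of the original problem: a logical NA used as a subscript keeps an NA in the result, while which() silently drops it.

```r
is_even <- function(x) x %% 2 == 0

with_na <- c(1, 2, NA, 4)

# is_even(NA) is NA, and subsetting by a logical NA yields NA
by_logical <- with_na[is_even(with_na)]

# which() only reports positions that are TRUE, so the NA disappears
by_position <- with_na[which(is_even(with_na))]

by_logical
by_position
```

Which behaviour is preferable depends on whether a missing value should propagate or be treated as "predicate not satisfied".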

However, discarding elements by index doesn’t use a logical negation, it uses a negative sign (-)

discard_2 <- function(f, x) {
  x[-which(f(x))]
}

discard_2(is_even, test_vec)

## [1] 1 3 5

If you look at the source of Filter(), you’ll see that I wasn’t far off of exactly that

Filter

## function (f, x) 
## {
##     ind <- as.logical(unlist(lapply(x, f)))
##     x[which(ind)]
## }
## <bytecode: 0x55da1b321ad8>
## <environment: namespace:base>

but it still counts.
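There is an edge case in discard_2() worth knowing about: when nothing matches the predicate, which() returns an empty vector, and negating an empty index still selects nothing rather than everything. The sketch below shows that, along with one possible guard; discard_2_safe is a hypothetical name for illustration, not part of the solutions above.

```r
is_even <- function(x) x %% 2 == 0

# As defined above
discard_2 <- function(f, x) x[-which(f(x))]

all_odd <- c(1, 3, 5)

# Nothing is even, so nothing should be discarded... but everything is
naive <- discard_2(is_even, all_odd)

# Hypothetical guard: only use negative indexing when there is something to drop
discard_2_safe <- function(f, x) {
  drop <- which(f(x))
  if (length(drop) == 0) return(x)
  x[-drop]
}

safe <- discard_2_safe(is_even, all_odd)

naive
safe
```

The logical-negation version, discard_1(), doesn’t have this problem because `!f(x)` is always the same length as x.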

Another option would be to unpack the elements themselves and do some stepwise comparisons in a loop. For each element el in the vector x, test if f(el) is TRUE, and if it is, concatenate el to the end of the accumulating result vector

keep_3 <- function(f, x) {
  result <- c()
  for (el in x) {
    if (f(el)) {
      result <- c(result, el)
    }
  }
  result
}

keep_3(is_even, test_vec)

## [1] 2 4

discard_3 <- function(f, x) {
  result <- c()
  for (el in x) {
    if (!f(el)) {
      result <- c(result, el)
    }
  }
  result
}

discard_3(is_even, test_vec)

## [1] 1 3 5

Of course, this approach is a Bad Idea™ in general (growing result with c() inside a loop copies the vector on every iteration) but I’m not optimising anything here. This approach does have the advantage that it doesn’t rely on R’s vectorised capabilities, since each element is passed to the predicate function individually, so if I did have a non-vectorised predicate function, this would still work.
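To make the non-vectorised case concrete, here is a sketch with a predicate built around `if`, which only handles one value at a time. The names is_even_scalar and keep_elementwise are made up for this illustration. Rather than growing a vector in a loop, vapply() calls the predicate once per element and collects the logical results.

```r
# A predicate that only makes sense for a single value at a time
is_even_scalar <- function(x) {
  if (x %% 2 == 0) TRUE else FALSE
}

# Hypothetical helper: apply the predicate element by element,
# insisting that each call returns exactly one logical value
keep_elementwise <- function(f, x) {
  x[vapply(x, f, logical(1))]
}

test_vec <- c(1, 2, 3, 4, 5)
kept <- keep_elementwise(is_even_scalar, test_vec)
kept
```

Passing the whole vector to is_even_scalar() at once would not give the right answer: depending on the R version, an `if` condition longer than one element is either a warning (with only the first element used) or an error.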

I really want a “weird” way to do this. R has plenty of weird to go around, but since I’ve been learning some Haskell, and the challenge originally referenced the Haskell solution, what if I code a Haskell-esque solution?

Haskell makes good use of recursive functions. Any loop can be written as a recursion (and vice-versa) so the previous solution is a good starting point. First, I define a base case; if I run out of numbers to process, return NULL. A convenient feature of R vectors is that NULLs are dropped

c(1, 2, NULL, 3, 4, NULL, 5)

## [1] 1 2 3 4 5

Otherwise, I can take the first value in the vector and test it with the predicate. If it returns TRUE I can append it to what I’ve calculated so far, and recursively call the function again with the rest of the vector. That could look like

keep_4 <- function(f, x) {
  if (!length(x)) return(NULL)
  if (f(x[1])) {
    return(c(x[1], Recall(f, x[-1])))
  } else {
    return(Recall(f, x[-1]))
  }
}

keep_4(is_even, test_vec)

## [1] 2 4

Some interesting points about this: the Recall() function is nice for defining a recursive function. I could have used keep_4 there, but the advantage of this implementation is that I can rename the function and it still works as expected

keep_4_also <- keep_4
rm("keep_4")

keep_4_also(is_even, test_vec)

## [1] 2 4

If I had explicitly referenced keep_4 inside itself, that recursion would fail with this renaming.
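A sketch of that failure, using a hypothetical keep_by_name() that calls itself by name instead of using Recall(). The error is caught with tryCatch() so the message can be inspected.

```r
is_even <- function(x) x %% 2 == 0

# Hypothetical variant of keep_4 that refers to itself by name
keep_by_name <- function(f, x) {
  if (!length(x)) return(NULL)
  rest <- keep_by_name(f, x[-1])
  if (f(x[1])) c(x[1], rest) else rest
}

keep_renamed <- keep_by_name
rm("keep_by_name")

# The body still looks up the old name, which no longer exists
failure <- tryCatch(
  keep_renamed(is_even, c(1, 2, 3, 4, 5)),
  error = function(e) conditionMessage(e)
)
failure
```

The message should complain that the function keep_by_name could not be found, since the name is only resolved when the recursive call actually runs.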

The negative subsetting works as described above; x[-1] means “not including the first element”. Lastly, testing if (!length(x)) works because 0 can be coerced to FALSE and any other value to TRUE, so if the length of x is 0, the negated condition is met and the function returns NULL.
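Both pieces can be checked in isolation. A small sketch, with two throwaway vectors that are not part of the solution:

```r
empty <- numeric(0)
full <- c(3, 5, 8)

# length 0 coerces to FALSE, so negating it gives TRUE (the base case fires)
empty_is_base_case <- !length(empty)

# any non-zero length coerces to TRUE, so negating it gives FALSE
full_is_base_case <- !length(full)

# dropping the first element leaves the rest of the vector
rest <- full[-1]

empty_is_base_case
full_is_base_case
rest
```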

The discarding variant is similar, just with the two return()s around the other way, or

discard_4 <- function(f, x) {
  if (!length(x)) return(NULL)
  if (!f(x[1])) {
    return(c(x[1], Recall(f, x[-1])))
  } else {
    return(Recall(f, x[-1]))
  }
}

discard_4(is_even, test_vec)

## [1] 1 3 5

There we go; 4 hand-coded implementations of keep and discard in R.

Can you think of another that doesn’t use Filter() or an external library? Let me know in the comments below or on Mastodon. I’m looking forward to seeing how people solved this in other languages.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-08-30
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown      0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib         0.5.0   2023-06-09 [3] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [3] CRAN (R 4.3.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.6.1   2023-03-23 [3] CRAN (R 4.2.3)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest        0.6.33  2023-07-07 [3] CRAN (R 4.3.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.21    2023-05-05 [3] CRAN (R 4.3.0)
##  fastmap       1.1.1   2023-02-24 [3] CRAN (R 4.2.2)
##  fs            1.6.3   2023-07-20 [3] CRAN (R 4.3.1)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  htmltools     0.5.6   2023-08-10 [3] CRAN (R 4.3.1)
##  htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.7   2023-06-29 [3] CRAN (R 4.3.1)
##  knitr         1.43    2023-05-25 [3] CRAN (R 4.3.0)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  pkgbuild      1.4.0   2022-11-27 [1] CRAN (R 4.1.2)
##  pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.8.2   2023-06-30 [3] CRAN (R 4.3.1)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps            1.7.5   2023-04-18 [3] CRAN (R 4.3.0)
##  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.1.2)
##  rmarkdown     2.23    2023-07-01 [3] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [3] CRAN (R 4.3.1)
##  sass          0.4.7   2023-07-15 [3] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny         1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi       1.7.12  2023-01-11 [3] CRAN (R 4.2.2)
##  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.1.2)
##  xfun          0.40    2023-08-09 [3] CRAN (R 4.3.1)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml          2.3.7   2023-01-23 [3] CRAN (R 4.2.2)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Now You're Thinking with Arrays

website@jcarroll.com.au (Jonathan Carroll) — Tue, 29 Aug 2023 00:00:00 +0000

I keep hearing the assertion that “writing APL/Haskell/etc… makes you think differently” and I kept wondering why I agreed with the statement but at the same time didn’t think too much of it. I believe I’ve figured out that it’s because I happened to have been using Array-aware languages this whole time! It turns out R is an even better language for beginners than I thought.

Let’s start with some basics. A “scalar” value is just a number by itself. That might have some units that may or may not be represented well in what you’re doing, but it’s a single value on its own, like 42. A “vector” in R is just a collection of these “scalar” values and is constructed with the c() function

c(3, 4, 5, 6)

## [1] 3 4 5 6

Going right back to basics, the [1] output at the start of the line indicates the index of the element directly to its right, in this case the first element. If we had more elements, then the newline starts with the index of the first element on that line. Here I’ve set the line width smaller than usual so that it wraps sooner

1:42

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12
## [13] 13 14 15 16 17 18 19 20 21 22 23 24
## [25] 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42

The quirk of how R works with vectors is that there aren’t actually any scalar values - if you try to create a vector with only a single element, it’s still a vector

x <- c(42)
x

## [1] 42

is.vector(x)

## [1] TRUE

(note the [1] indicating the first index of the vector x and the vector TRUE). Even if you don’t try to make it a vector, it still is one

x <- 42
x

## [1] 42

is.vector(x)

## [1] TRUE
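A few more ways to poke at the same idea, as a quick sketch: a bare number has a length, can be indexed like any other vector, and is identical to the same value wrapped in c().

```r
x <- 42

# A "scalar" is a vector of length one
len <- length(x)

# Wrapping it in c() changes nothing
same <- identical(x, c(42))

# It can be subset like any other vector
first <- x[1]

# and it combines with longer vectors element by element
shifted <- x + c(1, 2, 3)

len
same
first
shifted
```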

Mike Mahoney has a great post detailing the term “vector” and how it relates to an R vector as well as the more mathematical definition which involves constructing an “arrow” in some space so that you describe both “magnitude” and “direction” at the same time.

“Direction and magnitude”

So, we always have a vector if we have a 1-dimensional collection of data. But wait, you say, there’s also list!

x <- list(a = 42)
x

## $a
## [1] 42

is.vector(x)

## [1] TRUE

A nice try, but lists are also vectors, it’s just that they’re “recursive”

is.recursive(x)

## [1] TRUE

is.recursive(c(42))

## [1] FALSE

Fine, what about a matrix, then?

x <- matrix(1:9, nrow = 3, ncol = 3)
x

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

is.vector(x)

## [1] FALSE

No, but that still makes sense - a matrix isn’t a vector. It is, however, an “array” - the naming convention in R is a bit messy because, while “matrix” and “array” are often the same thing, as the dimensions increase, more things expect an “array” class, so R tags a “matrix” with both

class(matrix())

## [1] "matrix" "array"

This recently surprised Josiah Parry leading to this post explaining some of the internal inconsistencies of this class of object.
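One way to see how close a matrix is to a vector: in this sketch, a plain vector becomes a matrix just by being given a dim attribute, and stops counting as a “vector” at that point (is.vector() returns FALSE once there are attributes other than names).

```r
v <- 1:9
was_vector <- is.vector(v)

# Attaching dimensions turns the same data into a 3 x 3 matrix
dim(v) <- c(3, 3)

now_matrix <- is.matrix(v)
now_array <- is.array(v)
still_vector <- is.vector(v)

# The result is the same object that matrix() would have built
same_as_matrix <- identical(v, matrix(1:9, nrow = 3, ncol = 3))

v
```

The values are stored in the same order either way; the dim attribute only changes how they are interpreted and printed.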

Now that we have vectors figured out, I can get to the point of this post - that thinking about data with a “vector” or even “array” mindset works differently.

I started learning some APL because I loved some videos by code_report. The person behind those is Conor Hoekstra. I didn’t realise that I’d actually heard a CoRecursive podcast episode interviewing him, so now I need to go back and re-listen to that one. Conor also hosts The Array Cast podcast that I heard him mention in yet another of his podcasts (how do people have the time to make all of these!?!). I was listening to the latest of these; an interview with Rob Pike, one of the co-creators of Go and UTF-8 - it’s a really interesting interview full of history, insights, and a lot of serious name dropping.

Anyway, Rob is describing what it is he really likes about APL and says

“I saw a talk some time ago, I wish I could remember where it was, where somebody said, this is why programming languages are so difficult for people. Let’s say that I have a list of numbers and I want to add seven to every number on that list. And he went through about a dozen languages showing how you create a list of numbers and then add seven to it. Right? And it went on and on and on. And he said, “Wouldn’t it be nice if you could just type 7+ and then write the list of numbers?” And he said, “Well, you know what? There’s one language that actually does that, and that’s APL.” And I think there’s something really profound in that, that there’s no ceremony in APL. If you want to add two numbers together in any language, you can add two numbers together. But if you want to add a matrix and a matrix, or a matrix and a vector, or a vector and a scalar or whatever, there’s no extra ceremony involved. You just write it down.”

(link to shownotes)

The talk he mentions is linked in the shownotes as “Stop Writing Dead Programs” by Jack Rusher (Strange Loop 2022) (linked to the relevant timestamp, and which I’m pretty sure I’ve watched before - it’s a great talk!) where Jack shows how to add 1 to a vector of values in a handful of languages. He demonstrates that in some languages there’s lots you need to write that has nothing to do with the problem itself; allocating memory, looping to some length, etc… then leads to demonstrating that the way to do this in APL is

1 + 1 2 3 4

with none of the overhead - just the declaration of what operation should occur.

The excitement with which Rob explains this in the podcast spoke to how important this idea is; that you can work with more than just scalar values in the mathematical sense without having to explain to the language what you mean and write a loop around a vector.

Two questions were buzzing at the back of my mind, though:

Why isn’t this such a revelation to me?
Is this not a common feature?

I know R does work this way because I’m very familiar with it, and perhaps that is the answer to the first question - I know R better than any other language I know, and perhaps I’ve just become accustomed to being able to do things like “add two vectors”.

a <- c(1, 2, 3, 4, 5) # or 1:5
a + 7

## [1]  8  9 10 11 12

Now you’re thinking with ~~portals~~ vectors!

The ideas of “add two vectors” and “add a number to a vector” are one and the same, as discussed above. The ability to do so is called “rank polymorphism” and R has a weak version of it - not everything works for every dimension, but single values, vectors, and matrices do generalise for many functions. I can add a value to every element of a matrix, too

m <- matrix(1:12, nrow = 3, ncol = 4)
m

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

m + 7

##      [,1] [,2] [,3] [,4]
## [1,]    8   11   14   17
## [2,]    9   12   15   18
## [3,]   10   13   16   19

and adding a vector to a matrix repeats the operation over rows

m <- matrix(1, nrow = 3, ncol = 4)
m

##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1    1
## [2,]    1    1    1    1
## [3,]    1    1    1    1

v <- c(11, 22, 33)
m + v

##      [,1] [,2] [,3] [,4]
## [1,]   12   12   12   12
## [2,]   23   23   23   23
## [3,]   34   34   34   34

Sidenote: the distinction between “repeat over rows” and “repeat over columns” is also discussed in the Array Cast episode - if you want to know more, there’s “leading axis theory”. R uses column-major order which is why the matrix m filled the sequential values down the first column, and why you need to specify byrow = TRUE if you want to fill the other way. It’s also why m + v repeats over rows, although if you are expecting it to repeat over columns and try to use a v with 4 elements it will (silently) work, recycling the vector v, and giving you something you didn’t expect

v <- c(11, 22, 33, 44)
m + v

##      [,1] [,2] [,3] [,4]
## [1,]   12   45   34   23
## [2,]   23   12   45   34
## [3,]   34   23   12   45
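If the intent really was “one value per column”, the recycling has to be told about it. A sketch of two ways that should give the same answer: sweep() over the column margin, or transposing so that the recycling lines up and transposing back.

```r
m <- matrix(1, nrow = 3, ncol = 4)
v <- c(11, 22, 33, 44)

# MARGIN = 2 means "match v against the columns"
by_col <- sweep(m, 2, v, "+")

# Alternatively: transpose so recycling runs along the 4 columns, then undo it
by_col_t <- t(t(m) + v)

by_col
```

Each column of the result holds a single repeated value (12, 23, 34, 45), which is what the naive m + v was presumably meant to produce.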

{reticulate} has a really nice explainer of the differences between R (column-major) and python (row-major), and importantly, the interop between these two.

So, is working with arrays actually so uncommon? I first thought of Julia, and since it’s much newer than R and took a lot of inspiration from a variety of languages, perhaps it works

a = [1, 2, 3, 4, 5]
a + 7

ERROR: MethodError: no method matching +(::Vector{Int64}, ::Int64)
For element-wise addition, use broadcasting with dot syntax: array .+ scalar

Not quite, but the error message is extremely helpful. Julia wants to perform element-wise addition using the broadcasting operator . so it needs to be

a .+ 7

5-element Vector{Int64}:
  8
  9
 10
 11
 12

Still, that’s a “know the language” thing that’s outside of “add a number to a vector”, so no credit.

Well, what about python?

a = [1, 2, 3, 4, 5]
a + 7

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate list (not "int") to list

The canonical way, I believe, is to use a list comprehension

a = [1, 2, 3, 4, 5]
[i + 7 for i in a]

[8, 9, 10, 11, 12]

and we’re once more using a language feature that’s outside of “add a number to a vector” so again, no credit. For the pedants: there is library support for this if you use numpy

import numpy as np

a = [1, 2, 3, 4, 5]
np.array(a) + 7

array([ 8,  9, 10, 11, 12])

but I wouldn’t call that a success.

I asked ChatGPT what other languages could do this and it suggested MATLAB. Now, that’s a proprietary language I don’t have access to, but octave is an open-source alternative that is more or less the same, so I tried that

a = [1, 2, 3, 4, 5];
a

a =

   1   2   3   4   5

a + 7

ans =

    8    9   10   11   12

Okay, yeah - a win for MATLAB. I remember using MATLAB back at Uni in an early (second year?) Maths (differential equations?) course and it was probably the very first time I was actually introduced to a programming language. IIRC, “everything is a matrix” (which works out okay for engineering and maths use-cases) so this a) probably isn’t surprising that it works, and b) makes sense that it gets lumped in with the “array languages”.

Thinking back to the other programming languages I’ve learned sufficiently, I wondered how Fortran dealt with this - I used Fortran (90) for all of my PhD and postdoc calculations. I loved that Fortran had vectors (and n-dimensional arrays) without having to do any manual memory allocation, and for that reason alone it was well-suited to theoretical physics modeling. I’ve been re-learning some Fortran via Exercism, so I gave that a go

$ cat array.f90

program add_to_array

  implicit none
  integer, dimension(5) :: a

  a = (/1, 2, 3, 4, 5/)
  print *, a + 7

end program add_to_array

Compiling and running this…

$ gfortran -o array array.f90
$ ./array

           8           9          10          11          12

A win! Okay, a little ceremony to declare the vector itself, but that’s strict typing for you.

With these results at hand, I think back to the question

Why isn’t this such a revelation to me?

I learned MATLAB, Fortran, then R, over the course of about a decade, and barely touched other languages with any seriousness while doing so… I’ve been using array languages more or less exclusively all this time.

“You merely learned to use arrays, I was born in them, molded by them”

Better still, they’re all column-major array languages.

I think this is why APL seems so beautiful to me - it does what I know I want and it does it with the least amount of ceremony.

I wrote a bit about this in a previous post - that a language can hide some complexity for you, like the fact that it does need to internally do a loop over some elements in order to add two vectors, but when the language itself provides an interface where you don’t have to worry about that, things get beautiful.
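To make the hidden loop visible, here is a sketch of “add 7 to every element” written out by hand in R, next to the vectorised form. The explicit version has to allocate the result and walk the indices itself; both arrive at the same values.

```r
a <- c(1, 2, 3, 4, 5)

# The ceremony: allocate a result, then loop over every index
looped <- numeric(length(a))
for (i in seq_along(a)) {
  looped[i] <- a[i] + 7
}

# The array way: just say what should happen
vectorised <- a + 7

identical(looped, vectorised)
```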

At PyConAU this year there was a keynote “The Complexity of Simplicity” which reminded me a lot of another post “Complexity Has to Live Somewhere”. I think APL really nailed removing a lot of the syntax complexity of a language, leaving mainly just the operations you wish to perform. Haskell does similar but adds back in (albeit, useful) language features that involve syntax.

Of the languages I did learn first, I would have to say that R wins over MATLAB and Fortran in terms of suitability as a first programming language, but now that I recognise that the “array” way of thinking comes along with that, I really do think it has a big advantage over, say, python in terms of shaping that mindset. Sure, if you start out with numpy you may gain that same advantage, but either way I would like to think there’s a lot to be gained from starting with an “array-aware” language.

Did I overlook another language that can work so nicely with arrays? Have you reflected on how you think in terms of arrays and programming in general? Let me know in the comments or on Mastodon.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-08-29
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown      0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib         0.5.0   2023-06-09 [3] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [3] CRAN (R 4.3.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.6.1   2023-03-23 [3] CRAN (R 4.2.3)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest        0.6.33  2023-07-07 [3] CRAN (R 4.3.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.21    2023-05-05 [3] CRAN (R 4.3.0)
##  fastmap       1.1.1   2023-02-24 [3] CRAN (R 4.2.2)
##  fs            1.6.3   2023-07-20 [3] CRAN (R 4.3.1)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  htmltools     0.5.6   2023-08-10 [3] CRAN (R 4.3.1)
##  htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.7   2023-06-29 [3] CRAN (R 4.3.1)
##  knitr         1.43    2023-05-25 [3] CRAN (R 4.3.0)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  pkgbuild      1.4.0   2022-11-27 [1] CRAN (R 4.1.2)
##  pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.8.2   2023-06-30 [3] CRAN (R 4.3.1)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps            1.7.5   2023-04-18 [3] CRAN (R 4.3.0)
##  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.1.2)
##  rmarkdown     2.23    2023-07-01 [3] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [3] CRAN (R 4.3.1)
##  sass          0.4.7   2023-07-15 [3] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny         1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi       1.7.12  2023-01-11 [3] CRAN (R 4.2.2)
##  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.1.2)
##  xfun          0.40    2023-08-09 [3] CRAN (R 4.3.1)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml          2.3.7   2023-01-23 [3] CRAN (R 4.2.2)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Taking from Infinite Sequences

website@jcarroll.com.au (Jonathan Carroll) — Fri, 18 Aug 2023 00:00:00 +0000

One thing that has really caught my attention as I learn more programming languages is the idea of generators or infinite sequences of values. Yes, infinite. Coming from R, that seems unlikely, but in at least several other languages, it’s entirely possible thanks to iterators and lazy evaluation.

I saw this video which solves a codewars challenge using an infinite list, which references this one on the same topic.

First, a diversion into recursion

A timely combination. @rverbsr

In Haskell, one of the first exercises after “Hello, World!” people discover is the Fibonacci sequence, where the $n^{\rm th}$ value is given by the sum of the two previous values. As a function, this can be written as

λ> fib 0 = 0
λ> fib 1 = 1
λ> fib n = fib (n-1) + fib (n-2)

Essentially, fib(0) returns 0, fib(1) returns 1, and for any other value n it returns the (recursively defined) sum of the two previous values. This isn’t infinite at all…

\[ \begin{align*} {\rm fib}(4) &= {\rm fib}(3) + {\rm fib}(2)\\ &= ({\rm fib}(2) + {\rm fib}(1)) + ({\rm fib}(1) + {\rm fib}(0))\\ &= {\rm fib}(1) + {\rm fib}(0) + {\rm fib}(1) + {\rm fib}(1) + {\rm fib}(0)\\ &= 1 + 0 + 1 + 1 + 0\\ &= 3 \end{align*} \]

We could write that in R as

fib <- function(n) {
  if (n == 0) return(0)
  if (n == 1) return(1)
  fib(n - 1) + fib(n - 2)
}

fib(4)

## [1] 3

fib(8)

## [1] 21

This may come as a surprise to some - the function is defined in terms of itself and some base cases. This is entirely fine in R and Haskell as they’re lazily evaluated - nothing happens until a value is actually used. In R, this means that if an argument to a function isn’t used, it’s not evaluated at all

loud_func <- function() {
  cat("HELLLOOOOO!!!\n")
}

stays_quiet <- function(f, g = loud_func()) {
  f(c(1, 2, 3, 4))
}

stays_quiet(f = mean)

## [1] 2.5

Notice that although the argument g is the invocation of loud_func(), it’s never evaluated because we don’t use it. If we did use it…

noisy <- function(f, g = loud_func()) {
  g
  f(c(1, 2, 3, 4))
}

noisy(sum)

## HELLLOOOOO!!!

## [1] 10
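Another way to see that arguments are only evaluated at the point of use is a default that depends on another argument. In this sketch (lazy_default is a made-up name), the default for y isn’t computed until y is first touched, by which time x has already been changed inside the function.

```r
lazy_default <- function(x, y = x * 2) {
  x <- x + 1   # runs before y has ever been looked at
  y            # the default x * 2 is evaluated here, using the updated x
}

from_default <- lazy_default(1)           # (1 + 1) * 2
from_caller <- lazy_default(1, y = 10)    # the default is never evaluated

from_default
from_caller
```

If defaults were evaluated eagerly when the function was called, the first result would be 2 rather than 4.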

So, in these languages, we can define a function recursively. Given the base case(s), these will eventually return just a number, so the computation will complete.

Instead of just adding the values together, we can create a sequence of values by concatenating the iterations together. Starting with data and working down to a base case is called “recursion”, while starting from a base case and building up a data structure is “corecursion”.

If we want a sequence of values that represents the Fibonacci numbers, we can use

fibs <- function(n) {
  if (n == 0) return(0)
  if (n == 1) return(1)
  prev <- fibs(n - 1)
  c(prev, sum(tail(prev, 2)))
}

fibs(10)

##  [1]  1  1  2  3  5  8 13 21 34 55

What’s blown my mind recently is the concept of infinite data structures. If I defined some function that, instead of working down to a base case, just kept expanding, say, by concatenating with a larger number (corecursion), such as

inf_series <- function(x) {
  c(x, inf_series(x + 1))
}

Then, if I tried to evaluate inf_series(5) this would produce

c(5, inf_series(6))

which would expand to

c(5, 6, inf_series(7))

and so on… forever. Of course, R can’t keep going forever

inf_series(5)

Error: C stack usage  7971732 is too close to the limit

This error comes about because each function in R is a “closure” which “encloses” its environment. The way R keeps track of that (and where it needs to return after returning from a function) is by adding “stack frames” each time it dives deeper into a function calling a function. We’ve asked it to add an infinite number of these, so at some point it says “too many”.
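The same thing happens in other eagerly evaluated languages. A minimal Python sketch of the same function (a comparison only, not part of the R code) hits the interpreter’s recursion limit rather than the C stack limit:

```python
def inf_series(x):
    # no base case: every call immediately makes another call
    return [x] + inf_series(x + 1)


try:
    inf_series(5)
    outcome = "finished"
except RecursionError:
    # Python stops the runaway recursion at its configured depth limit
    outcome = "RecursionError"

print(outcome)
```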

Okay, so, not possible, right?

In Haskell I can define

λ> fibs = 0 : scanl (+) 1 fibs

which is a concatenation (:) of 0 with the result of scanl (+) 1 fibs. Note carefully, this isn’t a function - it’s a vector of values defined recursively 🤯

Mind = blown

To explain the definition a little more: scanl is similar to reduce in that it takes a starting value, a vector, and a binary operator, but rather than reducing the vector to a value, it creates a new vector with the successively reduced values. In this example, the values 1..5 are successively added (+) to 0, so the second entry is 0+1=1, the next is 1+2=3, the next is 3+3=6, then 6+4=10, then 10+5=15

λ> scanl (+) 0 [1..5]
[0,1,3,6,10,15]
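For readers more at home in Python, itertools.accumulate with an initial value behaves much like scanl. This is only a comparison sketch and assumes Python 3.8 or later, where the initial argument was added.

```python
from itertools import accumulate

# running sums of 1..5, starting from 0, like scanl (+) 0 [1..5]
running = list(accumulate([1, 2, 3, 4, 5], initial=0))
print(running)
```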

In the Fibonacci case, the operator is still addition, but the starting value is 1 and the vector is … the entire vector we’re defining. Writing out some of the terms makes this easier to understand

λ> scanl (+) 1 [0, 1, 1, 2, 3, 5, 8]
[1,1,2,3,5,8,13,21]

The first terms in the sequence, after concatenating with 0 will be

λ> [0, 1, 1, 2, 3, 5, 8]

so, by defining fibs in terms of fibs, the sequence can go on forever. So, what if you try to print this? In GHCi, the output will just stream forever, which isn’t particularly useful. Instead, we can ask for some number of values, say, ten

λ> take 10 fibs
[0,1,1,2,3,5,8,13,21,34]
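The self-referential definition can be imitated with a recursive generator in Python. This is a sketch under assumptions, not an equivalent: islice plays the role of take, accumulate with initial=1 plays the role of scanl (+) 1, and unlike Haskell nothing is shared, so every level of recursion builds a fresh generator.

```python
from itertools import accumulate, islice


def fibs():
    # mirrors: fibs = 0 : scanl (+) 1 fibs
    yield 0
    # fibs() here only creates a generator; it does no work until asked
    yield from accumulate(fibs(), initial=1)


first_ten = list(islice(fibs(), 10))
print(first_ten)
```

Because each element pulls from a deeper copy of the generator, this gets slow (and deep) quickly, so it is an illustration of the idea rather than a practical way to produce the sequence.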

Due to the laziness of Haskell, nothing is computed until it’s needed, so asking for any number of values is fast, despite the list being “infinite”

λ> take 100 fibs
[0,1,1,2,3,5,8,13,21,34,55,89,144,233,377,610,987,1597,2584,4181,6765,10946,
 17711,28657,46368,75025,121393,196418,317811,514229,832040,1346269,2178309,
 3524578,5702887,9227465,14930352,24157817,39088169,63245986,102334155,
 165580141,267914296,433494437,701408733,1134903170,1836311903,2971215073,
 4807526976,7778742049,12586269025,20365011074,32951280099,53316291173,
 86267571272,139583862445,225851433717,365435296162,591286729879,956722026041,
 1548008755920,2504730781961,4052739537881,6557470319842,10610209857723,
 17167680177565,27777890035288,44945570212853,72723460248141,117669030460994,
 190392490709135,308061521170129,498454011879264,806515533049393,
 1304969544928657,2111485077978050,3416454622906707,5527939700884757,
 8944394323791464,14472334024676221,23416728348467685,37889062373143906,
 61305790721611591,99194853094755497,160500643816367088,259695496911122585,
 420196140727489673,679891637638612258,1100087778366101931,1779979416004714189,
 2880067194370816120,4660046610375530309,7540113804746346429,12200160415121876738,
 19740274219868223167,31940434634990099905,51680708854858323072,
 83621143489848422977,135301852344706746049,218922995834555169026]

The idea of taking some values from an iterator shows up in other languages.

In Rust, I can create an infinite iterator of the value 1

use std::iter;
let ones = iter::repeat(1);

and I can take some number of these, say, five, collected into a vector

ones.take(5).collect::<Vec<_>>()

[1, 1, 1, 1, 1]

Python also has a notion of infinite sequences, and they’re likely even more common.

When you work with a (regular, finite) list of values in a range, you get a range object

numbers = range(1, 9)
numbers

## range(1, 9)

type(numbers)

## <class 'range'>

You can convert this to a regular list with

list(numbers)

## [1, 2, 3, 4, 5, 6, 7, 8]

type(list(numbers))

## <class 'list'>

If you filter the numbers (which works on a range or a list) you get a filter object

even_numbers = filter(lambda x: x % 2 == 0, numbers)
even_numbers

## <filter object at 0x7f5ac61da470>

type(even_numbers)

## <class 'filter'>

which you can also convert to a list to see the values

list(even_numbers)

## [2, 4, 6, 8]

But, you can use this as an iterator, so you can get the ‘next’ value as many times as you need (defined here again to restart the iterator)

even_numbers = filter(lambda x: x % 2 == 0, numbers)
next(even_numbers)

## 2

next(even_numbers)

## 4

next(even_numbers)

## 6

next(even_numbers)

## 8

until there’s none left

next(even_numbers)

## Error in py_call_impl(callable, dots$args, dots$keywords): StopIteration
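If the exception is unwelcome, next() accepts a second argument that is returned once the iterator is exhausted. A small sketch (the seen list is just an illustrative name):

```python
numbers = range(1, 9)
even_numbers = filter(lambda x: x % 2 == 0, numbers)

seen = []
while True:
    # None is returned instead of raising StopIteration
    value = next(even_numbers, None)
    if value is None:
        break
    seen.append(value)

print(seen)
```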

As before, you can convert these to a fixed-length list, if desired

list(filter(lambda x: x % 2 == 0, numbers))

## [2, 4, 6, 8]

Still with me? Great. We can create an infinite list, if we want to, because it isn’t evaluated until we ask for elements

def infinitenumbers():
    count = 0
    while True:
        yield count
        count += 1

nums = infinitenumbers()

nums

## <generator object infinitenumbers at 0x7f5ac615d310>

type(nums)

## <class 'generator'>

This is a generator which means it’s capable of generating values. We can ask for as many as we want, now

next(nums)

## 0

next(nums)

## 1

next(nums)

## 2

next(nums)

## 3

next(nums)

## 4

If we want to convert some number of these to a list, we need a new function, roughly the equivalent of Haskell’s take, in order to extract these

from itertools import islice

nums = infinitenumbers()

list(islice(nums, 10))

## [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

and as before, if we ask for more now, we get the next batch

list(islice(nums, 10))

## [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
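Wrapping islice gives something closer to Haskell’s take, and the Fibonacci sequence itself fits naturally into a generator. A minimal sketch, where take follows the recipe in the itertools documentation and fib_gen is a hypothetical name:

```python
from itertools import islice


def take(n, iterable):
    """Return the first n items of iterable as a list."""
    return list(islice(iterable, n))


def fib_gen():
    """Yield the Fibonacci numbers forever, starting from 0."""
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b


first = take(10, fib_gen())
print(first)
```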

So, can we do this at all in R? We can’t build an infinite data structure per se, but what if we use a recursive definition and just… stop when it’s recursed enough times?

I initially thought about hacking into functions with body() to keep track of this, but we’ve already discussed something we can use - the stack frames! R keeps track of how deep it’s recursed using these, and we can access that information with sys.calls(), the help for which describes

.GlobalEnv is given number 0 in the list of frames. Each subsequent function evaluation increases the frame stack by 1.

So, we can tell from inside a recursive function how deeply nested we currently are.

I wrote the following helper, named for the Haskell inspiration

take <- function(f, n, x = 1) {
  current_depth <- length(sys.calls())  # Subtract 1 to exclude the current call,
                                        # but add 1 to start at 1
  
  # uncomment this line to watch the magic happen!
  # cat("Current Stack Depth: ", current_depth, "\n")
 
  if (current_depth >= n) {
    return(f(x))
  } else {
    return(c(f(x), take(f, n, x + 1)))
  }
}

this checks length(sys.calls()) which starts at 1 the first time it’s called, and adds 1 every time we go deeper. So long as we haven’t reached the requested depth, it combines the passed-in function evaluated at $x + i$ with a new evaluation one level deeper.

When that reaches the requested depth, it returns the evaluated function, bubbling up the returned values so that we end up with a vector of $n$ values

\[ [f(x), f(x+1), f(x+2), f(x+3), \dots, f(x+n-1)] \]

Neat idea, but does it work? We can’t pass in an actual infinite data structure, but we can pass a function that defines one

A (trivial) function that produces a number at each value of x is

numbers <- function(x) {
  x
}

If we pretend that’s a generator for every number, we can “take” some values from it

take(numbers, 5)

 [1] 1 2 3 4 5

take(numbers, 10)

 [1]  1  2  3  4  5  6  7  8  9 10

A more complicated recipe for an infinite list of numbers could be

g <- function(x) {
  x + 1
}

take(g, 7)

[1] 2 3 4 5 6 7 8

take(g, 10)

 [1]  2  3  4  5  6  7  8  9 10 11

Dealing with non-sequential numbers might be trickier… what if we want all the even numbers?

evens <- function(x) {
  if(x %% 2 == 0) x else NULL
}

take(evens, 10)

 [1]  2  4  6  8 10

No, that only checks the first 10 numbers. Instead,

evens <- function(x) {
  x * 2
}

take(evens, 7)

 [1]  2  4  6  8 10 12 14

take(evens, 10)

 [1]  2  4  6  8 10 12 14 16 18 20
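With a genuinely lazy iterator the first approach does work, because the filter keeps pulling candidates until enough have passed. A Python sketch for comparison, using itertools.count as the infinite source:

```python
from itertools import count, islice


def is_even(x):
    return x % 2 == 0


# filter() is lazy, so it only inspects as many numbers as islice demands
first_evens = list(islice(filter(is_even, count(1)), 10))
print(first_evens)
```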

What about our original example?

fibs <- function(x) {
  if (x == 0) return(0)
  if (x == 1) return(1)
  fibs(x - 1) + fibs(x - 2)
}

# 10 values, starting at 0
take(fibs, 10, x = 0)

 [1]  0  1  1  2  3  5  8 13 21 34

# 10 values, starting at 1
take(fibs, 10, x = 1)

 [1]  1  1  2  3  5  8 13 21 34 55

Or, if we want 12 values, starting at the 10th

take(fibs, 12, x = 10)

 [1]    55    89   144   233   377   610   987  1597  2584  4181  6765 10946

It… works!

I’d say that’s working quite nicely!!!

Some caveats to keep in mind, though…

Since we’re relying on a count of stack frames on top of .GlobalEnv, this take() implementation won’t work nicely inside another function. In fact, since {knitr} is already a few functions deep, it also doesn’t work in an Rmd file (including this blog which is Rmd via {blogdown}). Not for use in production, but a fun exercise to figure it out at all.
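One way around that caveat is to count down explicitly rather than reading the depth off the call stack. The sketch below shows the idea in Python rather than R, and is only one possible approach; the argument order mirrors the R helper, and wrapper is a hypothetical caller added to show it no longer matters where take() is called from.

```python
def take(f, n, x=1):
    """Collect f(x), f(x + 1), ..., f(x + n - 1) by recursing on n."""
    if n <= 0:
        return []
    return [f(x)] + take(f, n - 1, x + 1)


def fibs(x):
    if x == 0:
        return 0
    if x == 1:
        return 1
    return fibs(x - 1) + fibs(x - 2)


def wrapper():
    # being called from inside another function makes no difference
    return take(fibs, 10, x=0)


print(wrapper())
print(take(fibs, 12, x=10))
```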

Is there a better way to achieve this take() functionality? Where do you use infinite iterators/generators in R or another language? Spot an improvement that I can make? I can be found on Mastodon or use the comments below.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-08-18
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown      0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib         0.5.0   2023-06-09 [3] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [3] CRAN (R 4.3.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.6.1   2023-03-23 [3] CRAN (R 4.2.3)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest        0.6.33  2023-07-07 [3] CRAN (R 4.3.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.21    2023-05-05 [3] CRAN (R 4.3.0)
##  fastmap       1.1.1   2023-02-24 [3] CRAN (R 4.2.2)
##  fs            1.6.3   2023-07-20 [3] CRAN (R 4.3.1)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  here          1.0.1   2020-12-13 [1] CRAN (R 4.1.2)
##  htmltools     0.5.6   2023-08-10 [3] CRAN (R 4.3.1)
##  htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.7   2023-06-29 [3] CRAN (R 4.3.1)
##  knitr         1.43    2023-05-25 [3] CRAN (R 4.3.0)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lattice       0.21-8  2023-04-05 [4] CRAN (R 4.3.0)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  Matrix        1.6-0   2023-07-08 [4] CRAN (R 4.3.1)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  pkgbuild      1.4.0   2022-11-27 [1] CRAN (R 4.1.2)
##  pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  png           0.1-7   2013-12-03 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.8.2   2023-06-30 [3] CRAN (R 4.3.1)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps            1.7.5   2023-04-18 [3] CRAN (R 4.3.0)
##  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  reticulate    1.26    2022-08-31 [1] CRAN (R 4.1.2)
##  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.1.2)
##  rmarkdown     2.23    2023-07-01 [3] CRAN (R 4.3.1)
##  rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.1.2)
##  rstudioapi    0.15.0  2023-07-07 [3] CRAN (R 4.3.1)
##  sass          0.4.7   2023-07-15 [3] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny         1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi       1.7.12  2023-01-11 [3] CRAN (R 4.2.2)
##  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.1.2)
##  xfun          0.40    2023-08-09 [3] CRAN (R 4.3.1)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml          2.3.7   2023-01-23 [3] CRAN (R 4.2.2)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ─ Python configuration ───────────────────────────────────────────────────────
##  python:         /usr/bin/python3
##  libpython:      /usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so
##  pythonhome:     //usr://usr
##  version:        3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
##  numpy:          /home/jono/.local/lib/python3.10/site-packages/numpy
##  numpy_version:  1.24.1
##  
##  NOTE: Python version was forced by RETICULATE_PYTHON_FALLBACK
## 
## ──────────────────────────────────────────────────────────────────────────────

Pythagorean Triples with Comprehensions

website@jcarroll.com.au (Jonathan Carroll) — Sun, 13 Aug 2023 00:00:00 +0000

I’ve been learning at least one new programming language per month through Exercism and the #12in23 challenge. I keep saying, every time you learn a new language, you learn something about all the others you know. Plus, once you know $N$ languages, the $(N+1)^{\rm th}$ is significantly easier. This post covers a calculation I came across in Haskell, and how I can now do the same in a lot of other languages - and perhaps can’t as easily in others.

All of the languages here, I’m learning via Exercism, or at least I’m completing a handful or more exercises in each of the languages, which means learning enough of the syntax to be able to complete those. The #12in23 challenge is to try 12 languages in 2023… I’m doing just fine so far

#12in23 progress as of July 2023 - I already have my 12, but no reason to stop now

Haskell

I’ve been reading the (great!) online version of Learn You a Haskell for Great Good! - Haskell is a (properly) “pure” functional language, part of which means it has no side-effects, which includes, say, printing to the console. Haskell, of course, has a way around this (monads!) but it means there’s a lot to get through before you even get to printing a “Hello, World!” example. It’s also lazy which means it doesn’t evaluate something if it doesn’t need to, which makes for some good performance, sometimes.

This video does a really nice job explaining the principles of pure functional programming using JavaScript to introduce Haskell, building recursive functions that only take a single argument and return a single value.

One example that caught my eye in the list comprehensions section was this

ghci> let rightTriangles' = [ (a,b,c) | c <- [1..10], b <- [1..c], a <- [1..b], a^2 + b^2 == c^2, a+b+c == 24]  
ghci> rightTriangles'  
[(6,8,10)]

This perhaps isn’t too hard to read, even for those unfamiliar with the language. ghci is the interactive REPL for the Glasgow Haskell Compiler, so the prompt starts with that. Haskell uses a let binding to identify variables, and the apostrophe just indicates that this is a slightly different version compared to the one defined slightly earlier in the chapter.

The list comprehension itself is perhaps not so dissimilar to one you’d find in Python; it defines some tuple (a, b, c) and | identifies some constraints, namely that c is taken from a range of 1 to 10, b is taken from a range of 1 to c, and a is taken from a range of 1 to b, along with the criteria that $a^2 + b^2 = c^2$ (the numbers form a Pythagorean triple) and their sum is 24. I discussed the Pythagorean triples in my last post - no coincidence (/s). If you evaluate this line, you more-or-less immediately get back the result

[(6,8,10)]

which is a Pythagorean triple

\[6^2 + 8^2 = 36 + 64 = 100 = 10^2\]

for which

\[a + b + c = 6 + 8 + 10 = 24\]

This isn’t a groundbreaking calculation, but I’ve done a lot of R, and my mind was a little blown that such a calculation could really be done in a single line just by specifying those constraints. Not a solver, not a grid of values with a filter, just specifying constraints.

R

Anyone who knows me knows I write a lot of R. I wrote a book on it. I solved all of the Advent of Code 2022 puzzles in strictly base R (I really need to write that post).

Now, R (unfortunately) doesn’t have any comprehensions, list or otherwise, so I started to wonder how I would do this in R. The best I can come up with is

expand.grid(a=1:10, b=1:10, c=1:10) |>
  dplyr::filter(a^2 + b^2 == c^2 & 
                  a + b + c == 24 & 
                  a < b & 
                  b < c)

##   a b  c
## 1 6 8 10

but that involves explicitly creating all 1000 combinations of a, b, and c. There may be a multi-step way to limit the grid to $a < b$ and $b < c$ but that’s more code. Maybe the Haskell solution also has to generate these behind the scenes, but it isn’t up to the user to do that, so it feels nicer. I like the filter() verb here - technically the & joining is redundant and I could have passed each condition as its own argument. expand.grid() is one of those underutilised functions that comes in very handy sometimes - or its cousin tidyr::crossing() which wraps tidyr::expand_grid() and additionally performs de-duplication and sorting.

Now that I know more languages, I felt I could explore this a bit further!

Python

In Python, which I feel is well-known for list comprehensions, this translates more or less 1:1 to

[(a,b,c) for c in range(1,11) for b in range(1,c) for a in range(1,b) if ((a**2 + b**2) == c**2) if a+b+c==24]

[(6, 8, 10)]

Of course, ranges are specified differently, but otherwise this follows the Haskell solution quite nicely, including the dynamic ranges of b and a which avoids needing to search the entire 10*10*10 space.
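To put a number on that saving, here is a small sketch that counts how many candidate triples each approach visits before any filtering. The variable names are illustrative only.

```python
from itertools import product

# every (a, b, c) with each value in 1..10, as expand.grid() would build
full_grid = sum(1 for _ in product(range(1, 11), repeat=3))

# the dependent ranges used in the comprehension above
dependent = sum(
    1
    for c in range(1, 11)
    for b in range(1, c)
    for a in range(1, b)
)

print(full_grid, dependent)
```

The dependent ranges visit only the combinations with a < b < c, which is 120 of the 1000.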

I appreciate there’s a silly language war between Python and R but honestly, a lot of stuff is written in Python and a lot of people write in Python. I figure it’s better to understand that language for when I need it than to stick my head in the sand and claim some sort of superiority. There’s bits I don’t like, sure, but that doesn’t mean I shouldn’t learn it. I’m even registered and attending PyConAU next weekend.

Rust

Rust is a fun language with easily the most helpful compiler ever made - you can make a lot of mistakes, but the error messages and hints are unparalleled. I’m currently taking Tim McNamara’s ‘How To Learn Rust’ course which has a lot of practical lessons and I’ve built some fun things already. I completed the first 13 Advent of Code 2022 puzzles in Rust, after which it all got a bit too complicated (and I do really need to write that post).

Rust doesn’t have list comprehensions (I believe there are cargo crates which do add such functionality) so it’s back to nested loops

for c in 1..=10 {
  for b in 1..=c {
    for a in 1..=b {
      if a*a + b*b == c*c && a+b+c == 24 {
        println!("{}, {}, {}", a, b, c);
      }
    }
  }
};

6, 8, 10

That doesn’t allocate a result at all, it just prints the values when it encounters them, and since the loop is nested, it can limit the search to $b \leq c$ and $a \leq b$, but it does explicitly run the loop across all those combinations. It’s possible there’s a much better way to do this, but I couldn’t think of it.

Common Lisp

I like the idea of Common Lisp, and I’m making my way through Practical Common Lisp slowly. I suspect I enjoy some of the descendants like Clojure a bit more, but it’s absolutely worth learning. Miles McBain has a great post about how learning about lisp quoting helps understand more of the tidyverse. I have used lisp in a code-golf post.

Lisp doesn’t have comprehensions so it relies on loops, and again, just prints the result, returning NIL

  (loop for c from 1 to 10
        do (loop for b from 1 to c
                 do (loop for a from 1 to b
                      do (when (and (= (+ a b c) 24) (= (+ (* a a) (* b b)) (* c c)))
                        (format t "~d, ~d, ~d~%" a b c)))))

6, 8, 10
NIL

The loop is still constrained to $b \leq c$ and $a \leq b$, but definitely runs through all those values.

Julia

I really want to learn more Julia, but I’m not entirely new to the language. I have completed the first 25 Project Euler problems in Julia (by no means optimised solutions). I think what’s holding me back is the fact that almost every presentation using it is so very mathsy - and I’m a physicist by training. I love that the tidyverse is making its way over in the forms of Queryverse, DataFramesMeta, and more recently (and most likely with more success) the Tidier family.

Julia does have list comprehensions, and additionally has an “element” operator with the mathematically-familiar symbol ∈

[(a,b,c) for c ∈ 1:10, b ∈ 1:10, a ∈ 1:10 if (a^2 + b^2 == c^2) && (a+b+c == 24) && b <= c && a <= b]

1-element Vector{Tuple{Int64, Int64, Int64}}:
 (6, 8, 10)

Unfortunately, in this comma-separated form the choices for b and a still need to run through all 10 values because the ranges can’t refer to one another. Julia does allow dependent ranges when each one gets its own for keyword (for c ∈ 1:10 for b ∈ 1:c for a ∈ 1:b), which would match what Haskell and Python do. I came to Julia from mainly only knowing R, so dealing with an output of type Vector{Tuple{Int64, Int64, Int64}} initially proved to be a challenge, but I’d say learning more Rust has made me feel a lot more comfortable around working with types.

Clojure

Clojure feels to me like “lisp, but with good libraries”. There’s definitely syntax differences, but most of them feel like improvements.

(for [c (range 11)
      b (range c)
      a (range b)
     :when (and (== (+ (* a a) (* b b)) (* c c)) (== (+ a b c) 24))]
[a b c])

([6 8 10])

This still feels like a comprehension, but the syntax is certainly a bit more convoluted. Bonus points for the dynamic ranges of b and a. Still, a long way off of completely unreadable, I’d say.

Scala

I’m learning a lot of functional programming, and I think I’m happy that some of the textbooks use Scala rather than some alternatives. I’m still very new to this language, but so far I think I like it.

Again, we’re back to a loop, but most of it is straightforward assignments and we get the dynamic ranges of b and a

for {
     c <- 1 until 11
     b <- 1 until c
     a <- 1 until b
     if a * a + b * b == c * c && a + b + c == 24
     } {
        println(s"Side lengths: $a, $b, $c")
}

Side lengths: 6, 8, 10

C

I mentioned that I performed this calculation in C in my last post - that ends up being just a loop

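/* assumes #include <stdio.h> and #include <math.h>, inside a function such as main() */
int a, b, c;

printf("%4s\t%4s\t%4s\n", "a", "b", "c");
printf("   -------------------------\n");
for (c = 1; c <= 24; c++) 
  for (b = 1; b <= c; b++)
    for (a = 1; a <= b; a++)
      /* relies on pow() giving exact results for integers this small */
      if ( ( pow ( a, 2 ) + pow ( b, 2 ) ) == pow ( c, 2 ) ) {
        printf("%4i\t%4i\t%4i\n", a, b, c);
      }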

I haven’t run the output directly, since it needs an entire program supporting it, but it’s the right answer.

So, what does it look like if you run all of these together? I’ve been getting back into using tmux and it’s very powerful. One of the features is splitting a window into panes, so I did that - one for each of these languages!

Calculating the Pythagorean Triple with perimeter 24 in several languages at once - link

I still think the Haskell solution shines above all the rest. It has all of the simplicity and language richness with none of the boilerplate. I like that it’s declarative (“get an answer to this”) rather than imperative (“do this, then that, then loop here…”). Comparing all of these, it’s clear there are no guarantees about being able to define the dynamic iteration ranges so another win for Haskell, there.

Following my last post, @Kazinator mentioned to me that the “TXR Lisp code, calling calcsum directly via FFI using Lisp nested arrays” could be written as

$ cat calcsum.tl
(typedef arr3d (ptr (array (ptr (array (ptr (array int)))))))

(with-dyn-lib "./calcsum.so"
  (deffi calcsum "calcsum" void (int (ptr arr3d))))

(let* ((dim 16)
       (arr (vector dim)))
  (each ((a 0..dim))
    (set [arr a] (vector dim))
    (each ((b 0..dim))
      (set [[arr a] b] (vector dim 0))))
  (calcsum (pred dim) arr)
  (each-prod ((a 1..dim)
              (b 1..dim)
              (c 1..dim))
    (let ((sum [[[arr a] b] c]))
      (if (plusp sum)
        (put-line (pic "### + ### + ### = ####" a b c sum))))))

$ txr  calcsum.tl

  3 +   4 +   5 =   12
  5 +  12 +  13 =   30
  6 +   8 +  10 =   24
  9 +  12 +  15 =   36

This calculates all the combinations up to some value (as my post did) but it’s already clear there’s some cool features there.
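For comparison, the same table can be produced with a comprehension. This is a Python sketch, not a translation of calcsum (whose source isn’t shown here); it assumes the rows are the triples with a < b < c below the given limit, which matches the output above for a limit of 16.

```python
def triples_with_sums(limit):
    """All (a, b, c, a + b + c) with a < b < c < limit and a^2 + b^2 == c^2."""
    return [
        (a, b, c, a + b + c)
        for a in range(1, limit)
        for b in range(a + 1, limit)
        for c in range(b + 1, limit)
        if a * a + b * b == c * c
    ]


rows = triples_with_sums(16)
for a, b, c, total in rows:
    print(f"{a:3d} + {b:3d} + {c:3d} = {total:4d}")
```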

How does your favourite language calculate the Pythagorean triple with a sum of 24? What can I do better in the solution I have above for a language you know? I can be found on Mastodon or use the comments below.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-08-13
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown      0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib         0.5.0   2023-06-09 [3] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [3] CRAN (R 4.3.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.6.1   2023-03-23 [3] CRAN (R 4.2.3)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest        0.6.33  2023-07-07 [3] CRAN (R 4.3.1)
##  dplyr         1.1.2   2023-04-20 [3] CRAN (R 4.3.0)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.21    2023-05-05 [3] CRAN (R 4.3.0)
##  fansi         1.0.4   2023-01-22 [3] CRAN (R 4.2.2)
##  fastmap       1.1.1   2023-02-24 [3] CRAN (R 4.2.2)
##  fs            1.6.3   2023-07-20 [3] CRAN (R 4.3.1)
##  generics      0.1.3   2022-07-05 [3] CRAN (R 4.2.1)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  htmltools     0.5.5   2023-03-23 [3] CRAN (R 4.2.3)
##  htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.7   2023-06-29 [3] CRAN (R 4.3.1)
##  JuliaCall     0.17.5  2022-09-08 [1] CRAN (R 4.1.2)
##  knitr         1.43    2023-05-25 [3] CRAN (R 4.3.0)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lattice       0.21-8  2023-04-05 [4] CRAN (R 4.3.0)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  Matrix        1.6-0   2023-07-08 [4] CRAN (R 4.3.1)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  pillar        1.9.0   2023-03-22 [3] CRAN (R 4.2.3)
##  pkgbuild      1.4.0   2022-11-27 [1] CRAN (R 4.1.2)
##  pkgconfig     2.0.3   2019-09-22 [3] CRAN (R 4.0.1)
##  pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  png           0.1-7   2013-12-03 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.8.2   2023-06-30 [3] CRAN (R 4.3.1)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps            1.7.5   2023-04-18 [3] CRAN (R 4.3.0)
##  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  reticulate    1.26    2022-08-31 [1] CRAN (R 4.1.2)
##  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.1.2)
##  rmarkdown     2.23    2023-07-01 [3] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [3] CRAN (R 4.3.1)
##  sass          0.4.7   2023-07-15 [3] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny         1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi       1.7.12  2023-01-11 [3] CRAN (R 4.2.2)
##  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  tibble        3.2.1   2023-03-20 [3] CRAN (R 4.3.1)
##  tidyselect    1.2.0   2022-10-10 [3] CRAN (R 4.2.1)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  utf8          1.2.3   2023-01-31 [3] CRAN (R 4.2.2)
##  vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.1.2)
##  xfun          0.39    2023-04-20 [3] CRAN (R 4.3.0)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml          2.3.7   2023-01-23 [3] CRAN (R 4.2.2)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Wrapping C Code in an R Package

website@jcarroll.com.au (Jonathan Carroll) — Fri, 11 Aug 2023 00:00:00 +0000

Your collaborator says to you “I have some code I’d like to distribute to people who will probably work in R most of the time. I don’t write R, but I write C. Can you package this up for me?” so you have a few options: re-write the code in R, package up the C code and make it available in R, or say no. I decided to try out the second of these, and this post details how I achieved that.

Before we even start, this is an excellent post summarising many of the finer points involved here - go read that! Then, read some of @coolbutuseless’ various repositories demonstrating how to wrap C code into R packages. These, and many others, go much deeper into how to achieve this, but I’m going to detail what I did because a) I’ll want to remember, later; b) I had enough trouble piecing together what I needed between these excellent posts and some older, possibly out of date posts; and c) I did build some functionality beyond what was done in those straightforward examples.

Those of you who know R really well probably know that the language itself is in no small part written in C. Many packages do the same, usually for performance reasons. This becomes most apparent if you install a package “from source” and see a lot of this mess fly past in your console

gcc -I"/usr/share/R/include" -DNDEBUG -I./pkg/    -fvisibility=hidden -fpic  -g -O2 -ffile-prefix-map=/build/r-base-4A2Reg/r-base-4.1.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c file1.c -o file1.o
gcc -I"/usr/share/R/include" -DNDEBUG -I./pkg/    -fvisibility=hidden -fpic  -g -O2 -ffile-prefix-map=/build/r-base-4A2Reg/r-base-4.1.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c file2.c -o file2.o
gcc -I"/usr/share/R/include" -DNDEBUG -I./pkg/    -fvisibility=hidden -fpic  -g -O2 -ffile-prefix-map=/build/r-base-4A2Reg/r-base-4.1.2=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g  -c pkg.c -o pkg.o
gcc -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro -o pkg.so file1.o file2.o pkg.o -L/usr/lib/R/lib -lR

Other languages are supported, including Fortran (yet to be superseded for numerical libraries), C++, Rust, and various others. You can usually dig into the source of these if you can track down where they come from. When debugging a function call, R is happy to step through individual lines of R code. Try the following

debugonce(seq.default)
seq(5)

and step through the lines of seq.default until it reaches 1L:from (yes, seq(from = x) produces the values 1 to from… sigh) where it returns that value as

exiting from: seq.default(5)
[1] 1 2 3 4 5

When the function involves C code, though, R can’t step through that because it’s compiled. One of the most common ways to hit that limitation is when a function calls either .Internal() or .Primitive().

I went looking for a function containing one of these (there are plenty) and found .row_names_info

# number of rownames
.row_names_info(mtcars)

## [1] 32

# the rownames themselves
.row_names_info(mtcars, type = 0)

##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"

if we wanted to see what .row_names_info() does we would write

.row_names_info

## function (x, type = 1L) 
## .Internal(shortRowNames(x, type))
## <bytecode: 0x563e6bf3c890>
## <environment: namespace:base>

but we can’t see any deeper unless we ask where that C code lives. I recommend using pryr::show_c_source() (as I did in a previous post) to identify the C code for these, e.g.

pryr::show_c_source(.Internal(shortRowNames(mtcars)))

shortRowNames is implemented by do_shortRowNames with op = 0

which opens a GitHub search of a copy of the R source in a browser. The file we want is attrib.c and contains the C code

SEXP do_shortRowNames(SEXP call, SEXP op, SEXP args, SEXP env)
{
    /* return  n if the data frame 'vec' has c(NA, n) rownames;
     *	       nrow(.) otherwise;  note that data frames with nrow(.) == 0
     *		have no row.names.
     ==> is also used in dim.data.frame() */

    checkArity(op, args);
    SEXP s = getAttrib0(CAR(args), R_RowNamesSymbol), ans = s;
    int type = asInteger(CADR(args));

    if( type < 0 || type > 2)
	error(_("invalid '%s' argument"), "type");

    if(type >= 1) {
	int n = (isInteger(s) && LENGTH(s) == 2 && INTEGER(s)[0] == NA_INTEGER)
	    ? INTEGER(s)[1] : (isNull(s) ? 0 : LENGTH(s));
	ans = ScalarInteger((type == 1) ? n : abs(n));
    }
    return ans;
}

Fully interpreting this is beyond the scope of this post, but the links at the start of this post cover most of what’s not plain C code here.

I won’t share my collaborator’s exact code, but I can write enough C that I can create something with all the relevant features.

Let’s calculate Pythagorean Triples! These are sets of 3 integers (whole numbers) a, b, and c such that a triangle with sides of those lengths will be a right-triangle (contains a 90 degree / right-angle). These have the property that \[a^2 + b^2 = c^2\]

Pythagorean theorem https://en.wikipedia.org/wiki/Pythagorean_theorem

The smallest of these is 3, 4, 5 because \[3^2 + 4^2 = 9 + 16 = 25 = 5^2\]

Generating these just happens to fit the use-case I’m emulating, plus I have a whole other blog post coming up about these (stay tuned!).

Some C code to generate these up to some maximum side-length, written similar to the code I received, is

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main (int argc, char *argv[]) {

  int a, b, c;
  int maxval;
  int ***triangles;

  if ( argc != 2 ) {
    printf("Usage: triangle max_side_length\n");
    exit(EXIT_FAILURE);
  }

  maxval = atoi( argv[1] );

  /* indices run from 0 to maxval inclusive, so each dimension needs maxval + 1 slots */
  triangles = (int ***) malloc ((maxval + 1) * sizeof(int **));
  for (a = 0; a <= maxval; ++a) {
    triangles[a] = (int **) malloc ((maxval + 1) * sizeof(int *));
    for (b = 0; b <= maxval; ++b) {
      triangles[a][b] = (int *) malloc ((maxval + 1) * sizeof(int));
      for (c = 0; c <= maxval; ++c) {
        triangles[a][b][c] = 0;
      }
    }
  }

  for (c = 1; c <= maxval; c++) {
    for (b = 1; b <= c; b++)
      for (a = 1; a <= b; a++)
        if ( ( pow ( a, 2 ) + pow ( b, 2 ) ) == pow ( c, 2 ) ) {
          triangles[a][b][c] = a + b + c;
        }
  }

  printf("%4s\t%4s\t%4s\t%4s\n", "a", "b", "c", "sum");
  printf("   -------------------------\n");
  for (c = 1; c <= maxval; c++) {
    for (b = 1; b <= c; b++)
      for (a = 1; a <= b; a++)
        if ( ( pow ( a, 2 ) + pow ( b, 2 ) ) == pow ( c, 2 ) ) {
          printf("%4i\t%4i\t%4i\t%4i\n", a, b, c, triangles[a][b][c]);
          }
      }

  exit(EXIT_SUCCESS);

}

I won’t make this an entire C tutorial, but the main pieces are:

Load some libraries for printing to screen, doing math, …

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

Define an entrypoint function (the thing that will run when the code is run) which takes some number of character arguments argv, the first of which is the compiled name of the program itself

int main (int argc, char *argv[]) {

Define some variables, the most significant being triangles which is denoted as a pointer to a pointer to a pointer (!)

  int a, b, c;
  int maxval;
  int ***triangles;

That’s a lot of indirection, but it’s just creating a reference to a 3-dimensional array.

Side-note: 0-indexed languages actually make a bit more sense when working with pointer math because a “vector” of memory addresses really only needs to “point” to the starting address, then every element is some offset away from that, so the first element of some vector vec might have some address x, but you can access that with vec[0]. You can access the next element with vec[1] which means “offset 1 position from x, the starting address.” You can access the fifth value by offsetting 4 positions, so vec[4].

One of my favourite bits of C trivia is that this syntactic sugar of using square brackets to identify positions actually translates to

vec[0] is at address x + 0 => vec + 0
vec[1] is at address x + 1 => vec + 1
vec[2] is at address x + 2 => vec + 2
...
vec[5] is at address x + 5 => vec + 5

but addition (+) is symmetric (commutative) so we can just as easily write

vec + 0 => 0 + vec => 0 + x is at address 0[vec]
vec + 1 => 1 + vec => 1 + x is at address 1[vec]
vec + 2 => 2 + vec => 2 + x is at address 2[vec]
...
vec + 5 => 5 + vec => 5 + x is at address 5[vec]

and it all works out… 5[obj] is valid, and corresponds to the same address as obj[5].
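
To see this for yourself, here is a small sketch (the helper name same_element is made up for illustration) that compares the three spellings of the same access. It relies only on the C rule that vec[i] is defined as *(vec + i).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true when vec[i], *(vec + i) and i[vec]
   all refer to the same element of the array. */
bool same_element(const int *vec, size_t i) {
  return &vec[i] == vec + i      /* indexing is pointer arithmetic */
      && &vec[i] == &i[vec]      /* ... and the addition commutes  */
      && vec[i] == *(vec + i);   /* so the values agree as well    */
}
```

With int v[6] = {10, 20, 30, 40, 50, 60}; both v[5] and 5[v] read the last element, and same_element(v, 5) is true.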

Back to our function. Unless exactly one argument is supplied after the program name (so that argc is 2), the usage information is printed and the program exits; otherwise that argument is used to set the upper bound on the length of a side of the triangle, converting it from a string to an int with atoi

  if ( argc != 2 ) {
    printf("Usage: triangle max_side_length\n");
    exit(EXIT_FAILURE);
  }

  maxval = atoi( argv[1] );
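
One thing to be aware of is that atoi returns 0 for input it can't parse, so it can't tell "abc" from a genuine 0. A more defensive variant could use strtol from the standard library instead. This is only a sketch, and parse_positive_int is a hypothetical helper, not part of the program above.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a strictly positive int from s.
   Returns 0 and stores the value in *out on success, -1 otherwise. */
int parse_positive_int(const char *s, int *out) {
  char *end;
  errno = 0;
  long v = strtol(s, &end, 10);
  if (end == s || *end != '\0')   /* no digits, or trailing junk */
    return -1;
  if (errno == ERANGE || v < 1 || v > INT_MAX)   /* out of range for our use */
    return -1;
  *out = (int) v;
  return 0;
}
```

In main() this could replace the atoi call, printing the usage message whenever the helper returns -1.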

Next, the array is allocated (and assigned a default value of 0)

  /* indices run from 0 to maxval inclusive, so each dimension needs maxval + 1 slots */
  triangles = (int ***) malloc ((maxval + 1) * sizeof(int **));
  for (a = 0; a <= maxval; ++a) {
    triangles[a] = (int **) malloc ((maxval + 1) * sizeof(int *));
    for (b = 0; b <= maxval; ++b) {
      triangles[a][b] = (int *) malloc ((maxval + 1) * sizeof(int));
      for (c = 0; c <= maxval; ++c) {
        triangles[a][b][c] = 0;
      }
    }
  }

and then, finally, we do the ‘calculation’ which just involves stepping through every value, and if our criterion of \[a^2 + b^2 == c^2\] is met, we assign the sum of these to an element in triangles indexed by a, b, and c

  for (c = 1; c <= maxval; c++) {
    for (b = 1; b <= c; b++)
      for (a = 1; a <= b; a++)
        if ( ( pow ( a, 2 ) + pow ( b, 2 ) ) == pow ( c, 2 ) ) {
          triangles[a][b][c] = a + b + c;
        }
  }

This isn’t efficient at all - there will be lots of 0 values, but this is a simple program.
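
For comparison, here is a sketch of a leaner approach that stores only the matches instead of a mostly-empty 3-dimensional array. The triple struct and find_triples function are hypothetical names of my own, not what the original code does, and the test uses integer arithmetic (a * a) rather than pow, which assumes maxval is small enough that the squares don't overflow an int.

```c
#include <stdlib.h>

/* Hypothetical record for one result row. */
typedef struct { int a, b, c, sum; } triple;

/* Collects every triple with a <= b <= c <= maxval.
   Returns the number found and stores a malloc'd array in *out
   (the caller frees it); on allocation failure *out is NULL. */
size_t find_triples(int maxval, triple **out) {
  size_t n = 0, cap = 8;
  triple *res = malloc(cap * sizeof *res);
  if (res == NULL) { *out = NULL; return 0; }

  for (int c = 1; c <= maxval; c++)
    for (int b = 1; b <= c; b++)
      for (int a = 1; a <= b; a++)
        if (a * a + b * b == c * c) {
          if (n == cap) {                 /* grow the array when it is full */
            cap *= 2;
            triple *tmp = realloc(res, cap * sizeof *tmp);
            if (tmp == NULL) { free(res); *out = NULL; return 0; }
            res = tmp;
          }
          res[n++] = (triple){a, b, c, a + b + c};
        }

  *out = res;
  return n;
}
```

Memory use now grows with the number of triples found rather than with the cube of maxval, at the cost of a little bookkeeping.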

The last section of the code loops back through all of a, b, and c, re-tests the same condition, and for each match prints the three sides along with the sum a + b + c (the value stored in triangles[a][b][c])

  printf("%4s\t%4s\t%4s\t%4s\n", "a", "b", "c", "sum");
  printf("   -------------------------\n");
  for (c = 1; c <= maxval; c++) {
    for (b = 1; b <= c; b++)
      for (a = 1; a <= b; a++)
        if ( ( pow ( a, 2 ) + pow ( b, 2 ) ) == pow ( c, 2 ) ) {
          printf("%4i\t%4i\t%4i\t%4i\n", a, b, c, triangles[a][b][c]);
          }
      }

With all of this saved as triangles.c we can compile and run this code in a terminal

$ cc -O -o triangle triangles.c

$ ./triangle 
Usage: triangle max_side_length

$ ./triangle 16
   a	   b	   c	 sum
   -------------------------
   3	   4	   5	  12
   6	   8	  10	  24
   5	  12	  13	  30
   9	  12	  15	  36

Woot! You can even check that it has worked: \[9^2 + 12^2 = 81 + 144 = 225 = 15^2\]

Back to the goal of this post - how do we get R to run that? We have some C code, now what?

First, I created an R package. I like using RStudio for this as it auto-generates a lot of the structure I want. It does, however, create an example R file R/hello.R (and corresponding man/hello.Rd page) so I delete those. I also delete the NAMESPACE because I’m going to use {roxygen2} to generate a new one. I check ‘Generate documentation with Roxygen’ in the Build tools menu, making sure to select ‘Build & Reload’ (which should be a default, no?)

Generate documentation with Roxygen

and build my empty package.

I love the {usethis} package for building packages, and there’s support there for what we’re doing, too! usethis::use_c() sets up some structure and adds the required boilerplate so that Roxygen knows we’re using C code. This does add a src/code.c file we can delete and in its place we can put our own C code.

If you read the links at the start of this post, you’ll recognise that this C code isn’t quite ready to be used in an R package, though - we need to be able to send an R object (a SEXP) to this C code, not a char. More importantly, the functionality of the C code is all wrapped up in the main() entrypoint function - it would be great if that was refactored out to another function that could be called from main() but also from an R-facing function.

I communicated this to my colleague and they agreed we could refactor that, but they want to still run the C code from the command line, so we can’t just put everything in our R-facing function. The actual processing in the code could go to a new function that doesn’t return anything, but does update the 3-dimensional array passed by reference

void calculate_sum(int maxval, int ****tri_sum) {

  int a, b, c;

  for (c = 1; c <= maxval; c++) {
    for (b = 1; b <= c; b++)
      for (a = 1; a <= b; a++)
        if ( ( pow ( a, 2 ) + pow ( b, 2 ) ) == pow ( c, 2 ) ) {
          (*tri_sum)[a][b][c] = a + b + c;
        }
  }
}

[... in main()]

 printf("calling external sum\n");
 calculate_sum(maxval, &triangles);

Yes, that’s a pointer to a pointer to a pointer to a pointer (!!!!).

The gotchas I encountered here were that

*tri_sum[a][b][c]

would index tri_sum first and only then dereference the result, because [] binds more tightly than *, so I needed

(*tri_sum)[a][b][c]

and &triangles sends a reference to the triangles object.
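
The same precedence rule shows up with one fewer level of pointers, which makes it easier to see. In this sketch (set_elem is a hypothetical helper, not part of the triangles code) the caller passes the address of a pointer, just as &triangles is passed above.

```c
/* Hypothetical helper: writes value into element i of the array
   that *arr points to. */
void set_elem(int **arr, int i, int value) {
  (*arr)[i] = value;   /* dereference arr first, then index the array */
  /* Without the parentheses, *arr[i] parses as *(arr[i]): it indexes
     the outer pointer first, which is only valid for i == 0 here. */
}
```

Called as set_elem(&p, 2, 7) with int *p pointing at an array, this sets the third element of that array to 7.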

Compiling and running this shows that we’ve successfully refactored out the main functionality

$ cc -O -o triangle1 triangles1.c

$ ./triangle1 20
calling external sum
   a	   b	   c	 sum
   -------------------------
   3	   4	   5	  12
   6	   8	  10	  24
   5	  12	  13	  30
   9	  12	  15	  36
   8	  15	  17	  40
  12	  16	  20	  48

But this still isn’t quite what we need for R… we need to pass and return SEXPs.

Rather than disrupt the runnable C code, we can add some additional R-specific code. That requires the R-related libraries

#include <R.h>
#include <Rinternals.h>

(keeping in mind that these are required if the user is compiling all of this code - it’s possible, but perhaps we’ll comment these out when just using the C code standalone).

We need a function that takes a SEXP (our maximum value) and returns a SEXP - this is required, but so far we’re just printing to screen. We’ll return something for now. A function that meets these criteria and calls the new calculate_sum() could be

SEXP C_triangles(SEXP maximum) {

  int a, b, c;
  int ***triangles;

  int maxval = * INTEGER(maximum);

  /* calculate_sum() writes at indices up to maxval inclusive, so each dimension needs maxval + 1 slots */
  triangles = (int ***) malloc ((maxval + 1) * sizeof(int **));
  for (a = 0; a <= maxval; ++a) {
    triangles[a] = (int **) malloc ((maxval + 1) * sizeof(int *));
    for (b = 0; b <= maxval; ++b) {
      triangles[a][b] = (int *) malloc ((maxval + 1) * sizeof(int));
      for (c = 0; c <= maxval; ++c) {
        triangles[a][b][c] = 0;
      }
    }
  }

  printf("calling C function to calc sum\n");
  calculate_sum(maxval, &triangles);

  printf("%4s\t%4s\t%4s\t%4s\n", "a", "b", "c", "sum");
  printf("   -------------------------\n");
  for (c = 1; c < maxval; ++c) {
    for (b = 1; b <= c; ++b)
      for (a = 1; a <= b; ++a)
        if ( ( pow ( a, 2 ) + pow ( b, 2 ) ) == pow ( c, 2 ) ) {
          printf("%4i\t%4i\t%4i\t%4i\n", a, b, c, triangles[a][b][c]);
        }
  }

  SEXP result = PROTECT(allocVector(LGLSXP, 1));
  LOGICAL(result)[0] = 1;
  UNPROTECT(1);

  return(result);

}

This is very similar to what’s in main() - it still performs the allocation then calls out to the calculating code, then prints the result. The only new part is creating a logical result object (1 == TRUE) so that there’s something to return.

You can read about PROTECT which guards against garbage collection in the R-exts manual.

The new functions called here such as allocVector come from the Rinternals library and are macros for functions starting with Rf_ - i.e. Rf_allocVector. I had some issues initially because I followed some guides which used #define R_NO_REMAP. Keep in mind that if you use that (so that library functions aren’t remapped) you will need to use the Rf_ versions of these functions. I ended up removing the #define myself, but I’m not sure if that will bite me later.

This also needs to convert the SEXP input maximum to a C int via * INTEGER(maximum).

We now have something that should work in our R package! Saving this as src/triangles.c we can add the R interface as R/triangles.R containing just

#' triangles
#'
#' @export
triangles <- function(maxval) {
  .Call("C_triangles", as.integer(maxval))
}

where we definitely only send an integer to the C function.

Build the package, which compiles the code, and load the package

library(triangles)
triangles(20)

calling C function to calc sum
   a	   b	   c	 sum
   -------------------------
   3	   4	   5	  12
   6	   8	  10	  24
   5	  12	  13	  30
   9	  12	  15	  36
   8	  15	  17	  40
[1] TRUE

We see the debug print statement, then the printed output, and finally our returned TRUE. Success!

The original code was made to work in a command line pipeline where the values were read back in by a subsequent program, e.g.

$ ./triangle 16 | tail -n +3 | awk '{ sum += $4 } END { print sum }'
102

so printing to stdout made sense there, but we want to use the values in R, so it would be great to return an actual data.frame. This repo contains a great example of doing that but I want to return a data.frame with a variable number of rows, and I need to allocate data into that. ChatGPT actually got me close enough to a working version. Here’s what I ended up with


  [...]

  calculate_sum(maxval, &triangles);

  /* count rows */
  int nrows = 0;
  for (c = 1; c < maxval; ++c) {
    for (b = 1; b <= c; ++b)
      for (a = 1; a <= b; ++a)
          if (triangles[a][b][c] != 0) {
          nrows += 1;
        }
  }

  /* output a data.frame */
  int ncols = 4;

  SEXP col1, col2, col3, col4, df;
  PROTECT(df = allocVector(VECSXP, ncols));

  PROTECT(col1 = allocVector(INTSXP, nrows));
  PROTECT(col2 = allocVector(INTSXP, nrows));
  PROTECT(col3 = allocVector(INTSXP, nrows));
  PROTECT(col4 = allocVector(INTSXP, nrows));

  int j = 0;
  for (c = 1; c < maxval; ++c) {
    for (b = 1; b <= c; ++b) {
      for (a = 1; a <= b; ++a) {
        if (triangles[a][b][c] != 0) {
          INTEGER(col1)[j] = a;
          INTEGER(col2)[j] = b;
          INTEGER(col3)[j] = c;
          INTEGER(col4)[j] = triangles[a][b][c];
          j += 1;
        }
      }
    }
  }

  SET_VECTOR_ELT(df, 0, col1);
  SET_VECTOR_ELT(df, 1, col2);
  SET_VECTOR_ELT(df, 2, col3);
  SET_VECTOR_ELT(df, 3, col4);

  SEXP colNames;
  PROTECT(colNames = allocVector(STRSXP, ncols));
  SET_STRING_ELT(colNames, 0, mkChar("a"));
  SET_STRING_ELT(colNames, 1, mkChar("b"));
  SET_STRING_ELT(colNames, 2, mkChar("c"));
  SET_STRING_ELT(colNames, 3, mkChar("sum"));
  setAttrib(df, R_NamesSymbol, colNames);

  SEXP rowNames;
  PROTECT(rowNames = allocVector(STRSXP, nrows));
  for (int i = 0; i < nrows; ++i) {
    char rowName[11];
    snprintf(rowName, sizeof(rowName), "%10d", i + 1);
    SET_STRING_ELT(rowNames, i, mkChar(rowName));
  }
  setAttrib(df, R_RowNamesSymbol, rowNames);

  SEXP className;
  PROTECT(className = allocVector(STRSXP, 1));
  SET_STRING_ELT(className, 0, mkChar("data.frame"));
  classgets(df, className);

  UNPROTECT(8);
  return df;

Going through the biggest parts of this: first we identify how many rows we want to output by counting the nonzero elements of the passed-by-reference triangles

  /* count rows */
  int nrows = 0;
  for (c = 1; c < maxval; ++c) {
    for (b = 1; b <= c; ++b)
      for (a = 1; a <= b; ++a)
          if (triangles[a][b][c] != 0) {
          nrows += 1;
        }
  }

then we allocate vectors - first a list with one element per column (4), then one vector of length nrows for each of the columns

 /* output a data.frame */
  int ncols = 4;

  SEXP col1, col2, col3, col4, df;
  PROTECT(df = allocVector(VECSXP, ncols));

  PROTECT(col1 = allocVector(INTSXP, nrows));
  PROTECT(col2 = allocVector(INTSXP, nrows));
  PROTECT(col3 = allocVector(INTSXP, nrows));
  PROTECT(col4 = allocVector(INTSXP, nrows));

These are then populated in a loop with a new counter for the nonzero elements

  int j = 0;
  for (c = 1; c < maxval; ++c) {
    for (b = 1; b <= c; ++b) {
      for (a = 1; a <= b; ++a) {
        if (triangles[a][b][c] != 0) {
          INTEGER(col1)[j] = a;
          INTEGER(col2)[j] = b;
          INTEGER(col3)[j] = c;
          INTEGER(col4)[j] = triangles[a][b][c];
          j += 1;
        }
      }
    }
  }

and finally the vectors linked into the list

  SET_VECTOR_ELT(df, 0, col1);
  SET_VECTOR_ELT(df, 1, col2);
  SET_VECTOR_ELT(df, 2, col3);
  SET_VECTOR_ELT(df, 3, col4);

The rest is mostly boilerplate of setting up the data.frame: assigning column names

  SEXP colNames;
  PROTECT(colNames = allocVector(STRSXP, ncols));
  SET_STRING_ELT(colNames, 0, mkChar("a"));
  SET_STRING_ELT(colNames, 1, mkChar("b"));
  SET_STRING_ELT(colNames, 2, mkChar("c"));
  SET_STRING_ELT(colNames, 3, mkChar("sum"));
  setAttrib(df, R_NamesSymbol, colNames);

and row names

  SEXP rowNames;
  PROTECT(rowNames = allocVector(STRSXP, nrows));
  for (int i = 0; i < nrows; ++i) {
    char rowName[11];
    snprintf(rowName, sizeof(rowName), "%10d", i + 1);
    SET_STRING_ELT(rowNames, i, mkChar(rowName));
  }
  setAttrib(df, R_RowNamesSymbol, rowNames);

and the class itself

  SEXP className;
  PROTECT(className = allocVector(STRSXP, 1));
  SET_STRING_ELT(className, 0, mkChar("data.frame"));
  classgets(df, className);

then finally UNPROTECTing the PROTECTED objects and returning the data.frame

  UNPROTECT(8);
  return df;

At one point, I had forgotten that I had modified an example and now had more PROTECT wrappers around objects, but hadn’t updated the number in UNPROTECT. It turns out this leads to a warning from R about a stack imbalance - not particularly meaningful if you don’t realise that it refers to the protection stack, so FYI.

So, with this new code in src/triangles.c we re-build and reload and try it out

library(triangles)

x <- triangles(16)

x

##            a  b  c sum
##          1 3  4  5  12
##          2 6  8 10  24
##          3 5 12 13  30
##          4 9 12 15  36

str(x)

## 'data.frame':	4 obs. of  4 variables:
##  $ a  : int  3 6 5 9
##  $ b  : int  4 8 12 12
##  $ c  : int  5 10 13 15
##  $ sum: int  12 24 30 36

Nothing printed when not expected, and the result is really a data.frame! We can even work with the data now

sum(x$sum)

## [1] 102

We still have the C code, and this can be updated as it evolves without affecting the R interface to it. With the R parts commented out, it can still be run as if it was just a regular C file. If we really want to compile it with the R parts still there we can include the R libraries (on a linux system, for example) with

$ cc -O -o triangle triangles.c -I/usr/share/R/include -L/usr/lib/R/lib -lR

Update 12-Aug-2023: I forgot to mention that in order to pass checks, R wants to have the following, typically in a file src/init.c

#include <R.h>
#include <Rinternals.h>
#include <stdlib.h> // for NULL
#include <R_ext/Rdynload.h>

/* .Call calls */
extern SEXP C_triangles(SEXP maximum);

static const R_CallMethodDef CallEntries[] = {
  {"C_triangles", (DL_FUNC) &C_triangles, 1},
  {NULL, NULL, 0}
};

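/* R looks for R_init_<name> when it loads the shared library, so the suffix must match the package's library name */
void R_init_triangles(DllInfo *dll) {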
  R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
  R_useDynamicSymbols(dll, FALSE);
}

I won’t say I understand this bit, but it is mentioned in this part of Davis’ post.

I also updated the snprintf call in the rownames assignment since I got a warning about truncation.
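
One way to avoid that kind of warning, sketched here as standalone C, is to size the buffer for the widest possible int and to check what snprintf returns. The helper name format_row_name is hypothetical, the buffer size assumes a 32-bit int, and this is not necessarily the change that went into the package.

```c
#include <stdio.h>

/* Hypothetical helper: formats index right-aligned in 10 columns.
   Returns 0 on success, -1 if the text did not fit in buf.
   A 32-bit int needs at most 11 characters ("-2147483648"),
   so a 12-byte buffer always has room for the terminating NUL. */
int format_row_name(char *buf, size_t size, int index) {
  int written = snprintf(buf, size, "%10d", index);
  /* snprintf returns the length it wanted to write, excluding the NUL */
  return (written >= 0 && (size_t) written < size) ? 0 : -1;
}
```

With char rowName[12]; the call format_row_name(rowName, sizeof(rowName), i + 1) always succeeds, whereas a deliberately small buffer makes the truncation visible through the return value.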

There are some valid concerns about the fact that I’m not explicitly free()ing the allocated memory, so I plan to add some code to do that.
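
As a starting point, the allocation and the matching clean-up could be paired up in two small functions. This is a sketch under my own assumptions: alloc_cube and free_cube are hypothetical names, n is the number of slots per dimension, and it relies on calloc'd pointers comparing equal to NULL, which holds on mainstream platforms but is not strictly guaranteed by the C standard.

```c
#include <stdlib.h>

/* Hypothetical helper: releases everything alloc_cube() allocated.
   Safe to call on a partially built cube. */
void free_cube(int ***cube, int n) {
  if (cube == NULL) return;
  for (int a = 0; a < n; ++a) {
    if (cube[a] == NULL) continue;
    for (int b = 0; b < n; ++b)
      free(cube[a][b]);            /* free(NULL) is a no-op */
    free(cube[a]);
  }
  free(cube);
}

/* Hypothetical helper: allocates an n x n x n array of zeros.
   Returns NULL (after cleaning up) if any allocation fails. */
int ***alloc_cube(int n) {
  if (n <= 0) return NULL;
  int ***cube = calloc(n, sizeof *cube);
  if (cube == NULL) return NULL;
  for (int a = 0; a < n; ++a) {
    cube[a] = calloc(n, sizeof *cube[a]);
    if (cube[a] == NULL) { free_cube(cube, n); return NULL; }
    for (int b = 0; b < n; ++b) {
      cube[a][b] = calloc(n, sizeof *cube[a][b]);
      if (cube[a][b] == NULL) { free_cube(cube, n); return NULL; }
    }
  }
  return cube;
}
```

Both main() and the R-facing function could then call free_cube() once they have finished reading the results. Inside an R package there is also R_alloc, described in the R-exts manual, whose memory R reclaims at the end of the .Call.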

As always, I’ve learned a lot messing with things outside of my comfort zone here. I wouldn’t say that I want to write a lot more C code, but at least now I feel somewhat comfortable bringing it into R to work with.

The package I detailed building here is on GitHub: https://github.com/jonocarroll/triangles in case it’s useful to you.

There’s always more for me to learn, though, so if you have comments, feedback, suggestions for improvements, etc… I can be found on Mastodon or use the comments below.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-08-12
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown      0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib         0.5.0   2023-06-09 [3] CRAN (R 4.3.1)
##  cachem        1.0.8   2023-05-01 [3] CRAN (R 4.3.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.6.1   2023-03-23 [3] CRAN (R 4.2.3)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest        0.6.33  2023-07-07 [3] CRAN (R 4.3.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.21    2023-05-05 [3] CRAN (R 4.3.0)
##  fastmap       1.1.1   2023-02-24 [3] CRAN (R 4.2.2)
##  fs            1.6.3   2023-07-20 [3] CRAN (R 4.3.1)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  htmltools     0.5.5   2023-03-23 [3] CRAN (R 4.2.3)
##  htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.7   2023-06-29 [3] CRAN (R 4.3.1)
##  knitr         1.43    2023-05-25 [3] CRAN (R 4.3.0)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  pkgbuild      1.4.0   2022-11-27 [1] CRAN (R 4.1.2)
##  pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.8.2   2023-06-30 [3] CRAN (R 4.3.1)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps            1.7.5   2023-04-18 [3] CRAN (R 4.3.0)
##  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.1.2)
##  rmarkdown     2.23    2023-07-01 [3] CRAN (R 4.3.1)
##  rstudioapi    0.15.0  2023-07-07 [3] CRAN (R 4.3.1)
##  sass          0.4.7   2023-07-15 [3] CRAN (R 4.3.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny         1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi       1.7.12  2023-01-11 [3] CRAN (R 4.2.2)
##  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  triangles   * 0.1.0   2023-08-11 [1] local
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  vctrs         0.6.3   2023-06-14 [1] CRAN (R 4.1.2)
##  xfun          0.39    2023-04-20 [3] CRAN (R 4.3.0)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml          2.3.7   2023-01-23 [3] CRAN (R 4.2.2)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Argument Matching Across Languages

website@jcarroll.com.au (Jonathan Carroll) — Sun, 06 Aug 2023 00:00:00 +0000

With Functional Programming, we write functions which take arguments and do something with or based on those arguments. You might not think there’s much to learn given that tiny description of “an argument to a function”, but the syntax and mechanics in different languages are actually widely variable and intricate.

Let’s say I have some function in R that takes three arguments, x, y, and z, and just prints them out in a string in that order.

r_fun <- function(x, y, z) {
  sprintf("arguments are: %s, %s, %s", x, y, z)
}

Calling this function with good practices (specifying all the argument names in full) would look like this

r_fun(x = "a", y = "b", z = "c")

## [1] "arguments are: a, b, c"

I said “in full” because by default, R will happily do partial matching, so long as it can uniquely figure out which argument you mean

long_args <- function(alphabet = "a to z", altitude = 100) {
  print(sprintf("alphabet: %s", alphabet))
  print(sprintf("altitude: %d", altitude))
}
long_args(alphabet = "[A-Z]", altitude = 50)

## [1] "alphabet: [A-Z]"
## [1] "altitude: 50"

In this case, both arguments start with "al" so it’s ambiguous up to there

long_args(al = "letters")

## Error in long_args(al = "letters"): argument 1 matches multiple formal arguments

but we only need to specify enough letters to disambiguate

long_args(alpha = "LETTERS", alt = 200)

## [1] "alphabet: LETTERS"
## [1] "altitude: 200"

Relying on this behaviour is dangerous, and it’s recommended to turn on warnings when this happens with

options(warnPartialMatchArgs = TRUE)
long_args(alpha = "LETTERS", alt = 200)

## Warning in long_args(alpha = "LETTERS", alt = 200): partial argument match of
## 'alpha' to 'alphabet'

## Warning in long_args(alpha = "LETTERS", alt = 200): partial argument match of
## 'alt' to 'altitude'

## [1] "alphabet: LETTERS"
## [1] "altitude: 200"

You don’t have to use argument names when calling the function, though - you can just rely on positional arguments

r_fun("a", "b", "c")

## [1] "arguments are: a, b, c"

and this is very commonly done, despite it being less clear what any of those refer to, and despite the risk that the function changes argument ordering in an updated version. It works, though.

Extensive sidenote: square-bracket matrix subsetting officially uses the (poorly? traditionally?) named arguments i and j as [i, j] but it actually entirely ignores them and uses positional arguments. The documentation (?`[`) does warn about this

“Note that these operations do not match their index arguments in the standard way: argument names are ignored and positional matching only is used. So m[j = 2, i = 1] is equivalent to m[2, 1] and not to m[1, 2].”

but it would be very easy to get bitten by it if one tried to use the names directly

m <- matrix(1:9, 3, 3, byrow = TRUE)
m

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

m[i = 1, j = 2]

## [1] 2

m[j = 2, i = 1]

## [1] 4

Thomas Lumley notes that

“it used to be that no primitive functions did argument matching by name. "/" and "-" and switch() and some others still don’t. I’m not sure why "[" wasn’t changed in 2.11 when a bunch of primitives got normal argument matching.”

Worse still, perhaps - the seq() function creates a sequence of values. It has the formal arguments with defaults from = 1 and to = 1 so you can calculate

seq(from = 2, to = 5)

## [1] 2 3 4 5

or you can leverage the default of from = 1

seq(to = 5)

## [1] 1 2 3 4 5

However, there are five “forms” in which you can provide arguments to this function and they behave differently. If you only specify the first argument unnamed, it treats this as to despite the first argument being from

seq(5)

## [1] 1 2 3 4 5

which is extra strange, because if you do specify to with its ostensibly default value 1, the sequence is backwards

seq(5, to = 1)

## [1] 5 4 3 2 1

Back to our function - a feature that makes R really neat is that you can specify the named arguments in any order

r_fun(z = "c", x = "a", y = "b")

## [1] "arguments are: a, b, c"

If you don’t specify them by name, R will default to positions, so specifying just one (e.g. z) but leaving the rest unspecified, R will presume you want the others in positional order

r_fun(z = "c", "a", "b")

## [1] "arguments are: a, b, c"

Where it gets really interesting is you can go back to named arguments further along and again, R will figure out that you mean the remaining unnamed argument

r_fun(z = "c", "b", x = "a")

## [1] "arguments are: a, b, c"
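For comparison, here is a small Python sketch of the same calls. Python isn’t one of the languages covered in this post, and this r_fun is a hypothetical stand-in for the R function above. Keyword arguments can be reordered freely, but Python rejects a positional argument that follows a keyword one, so the R trick of going back and forth between named and unnamed arguments doesn’t carry over.

```python
def r_fun(x, y, z):
    # Hypothetical Python counterpart of the R r_fun() used in this post
    return f"arguments are: {x}, {y}, {z}"

# Keyword arguments can be given in any order, as in R
reordered = r_fun(z="c", x="a", y="b")

# Positional arguments may come before keyword arguments...
mixed = r_fun("a", "b", z="c")

# ...but a positional argument after a keyword argument is a syntax error,
# so the offending call is kept in a string and compiled explicitly
try:
    compile('r_fun(z="c", "a", "b")', "<example>", "eval")
    positional_after_keyword_ok = True
except SyntaxError:
    positional_after_keyword_ok = False

print(reordered)                    # arguments are: a, b, c
print(mixed)                        # arguments are: a, b, c
print(positional_after_keyword_ok)  # False
```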

This only holds if the function doesn’t use the ellipsis ... which captures “any other arguments” when calling the function, often to be passed on to another function. If the function signature has ... then any arguments that don’t match a formal argument, by name or by position, are captured. This example function just combines any other arguments into a comma-separated string, if there are any (tested with the under-documented ...length() which returns the number of arguments captured via ...)

dot_f <- function(a = 1, b = 2, ...) {
  print(sprintf("named arguments: %s, %s", a, b))
  if (...length()) {
    print(sprintf("additional arguments: %s", toString(list(...))))
  }
}

You can call this with just the named arguments

dot_f(a = 3, b = 4)

## [1] "named arguments: 3, 4"

or you can add more arguments (no name required)

dot_f(a = 3, b = 4, 5)

## [1] "named arguments: 3, 4"
## [1] "additional arguments: 5"

As before, none of the names are really required, and we can add as many as we want

dot_f(3, 4, 5, 6, 7)

## [1] "named arguments: 3, 4"
## [1] "additional arguments: 5, 6, 7"

We can name them if we want

dot_f(a = 3, b = 4, blah = 5)

## [1] "named arguments: 3, 4"
## [1] "additional arguments: 5"

but here be danger, because those names can be anything and aren’t matched to the actual function, so this works (say, I misspelled an argument name a as A)

dot_f(A = 3, B = 4, 5)

## [1] "named arguments: 5, 2"
## [1] "additional arguments: 3, 4"

Notice that the additional arguments are the ones I named (not those in the function definition); the 5 has been positionally matched to a; and b has taken its default value of 2 because no other arguments were provided.
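Python has a similar trap. In this rough sketch (a hypothetical dot_f, with *args and **kwargs standing in for R’s ...) the misspelled names A and B are silently swallowed as extras, while the unnamed 5 lands in a and b keeps its default.

```python
def dot_f(a=1, b=2, *args, **kwargs):
    # Hypothetical analogue of the R dot_f(): *args collects leftover positional
    # values and **kwargs collects keyword names that match no parameter
    extras = list(args) + list(kwargs.values())
    return (a, b), extras

# Python wants positional values before keywords, so the 5 goes first here
named, extras = dot_f(5, A=3, B=4)

print(named)   # (5, 2)
print(extras)  # [3, 4]
```

Dropping **kwargs from the signature would turn the misspelling into a TypeError, which is one way to guard against it.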

We can still mix up the ordering of positions, provided everything else matches up

dot_f(3, b = 4, 5)

## [1] "named arguments: 3, 4"
## [1] "additional arguments: 5"

dot_f(3, b = 4, 5, a = 2)

## [1] "named arguments: 2, 4"
## [1] "additional arguments: 3, 5"

The flexibility in all of this is what encouraged Joe Cheng to use R as an interface to HTML in the form of shiny, what he calls “a bizarrely good host language” (should link to the right timestamp) and he notes that other languages don’t let you do this sort of mixing up of named and positional arguments.

Okay, that’s R - weird and fun, but a lot of flexibility.

I saw this post mentioned in the #rust hashtag on Mastodon and had a look - it surprised me at first because I thought “what do you mean Rust doesn’t have named arguments?”…

I’ve become so used to the inline help from VSCode when I’m writing Rust that I didn’t realise I wasn’t using named arguments.

Here’s a function I wrote for my toy rock-paper-scissors game in Rust

fn play(a: Throw, b: Throw) -> GameResult {
    let result = match a.cmp(&b) {
        Ordering::Equal => GameResult::Tie,
        Ordering::Greater => GameResult::YouWin,
        Ordering::Less => GameResult::YouLose,
    };

    println!("{} {}", "Result:".purple().bold(), result);

    result
}

It has arguments a and b because I did a terrible job naming them - I knew exactly how I planned to use them, so bad luck to anyone else.

Calling that function further down in the code I have

let user = val.user();
let computer = Throw::computer();
play(user, computer);

BUT what I see in the editor has the argument names, unless I switch off hints (which I have bound to holding Ctrl+Alt at the moment)

Toggling inlay hints in VSCode

So, I can’t just rearrange arguments in Rust?

If I define a function with two arguments

>> fn two_args(a: f64, b: &str) -> String {
    let res = format!("all arguments: {a}, {b}");
    res
}

then I can call it

>> two_args(42.0, "forty-two")
"all arguments: 42, forty-two"

Just swapping the arguments obviously fails because 42.0 isn’t a &str and "forty-two" isn’t an f64. But there isn’t a way to say “the value for that argument is this”; I can’t use any of these

two_args(a = 42.0, b = "forty-two")
two_args(a: 42.0, b: "forty-two")

two_args(b = "forty-two", a = 42.0)
two_args(b: "forty-two", a: 42.0)

I suspect the fact that this was a surprise to me means I’m earlier in my Rust learning than I had thought - I clearly haven’t built anything that has functionality I didn’t directly need, because I haven’t had to worry about calling functions in strange ways yet.

There is one loophole… time to break out another cool toy: {rextendr}

library(rextendr)

rust_function(
  'fn two_args(a: f64, b: &str) -> String {
          let res = format!("all arguments: {a}, {b}");
          res
  }'
)

This produces an R function that takes two arguments, a and b which I can call as if it was an R function

two_args(a = 42, b = "forty-two")

## [1] "all arguments: 42, forty-two"

I can call it without argument names

two_args(42, "forty-two")

## [1] "all arguments: 42, forty-two"

and I can swap them

two_args(b = "forty-two", a = 42)

## [1] "all arguments: 42, forty-two"

This is just because the argument matching happens before the values get sent down to the Rust code - the function here is an R function that calls other code internally

two_args

## function (a, b) 
## .Call("wrap__two_args", a, b, PACKAGE = "librextendr1")
## <bytecode: 0x55d873cff7b8>

The idea for this blogpost started when I was learning some TypeScript and came across this https://github.com/gibbok/typescript-book#typescript-fundamental-comparison-rules

“Function parameters are compared by types, not by their names:”

type X = (a: number) => void;
type Y = (a: number) => void;
let x: X = (j: number) => undefined;
let y: Y = (k: number) => undefined;
y = x; // Valid
x = y; // Valid

which initially struck me as strange, and I needed to work through some examples in a live setting. On reflection, I think I see that this is exactly what I would specify in e.g. Haskell - “a function that takes a number”, not “a function with an argument named a which is a number”

x :: Float -> ()

Because technically all functions in Haskell actually only take a single argument (the notation Int -> Int -> Int reveals this fact nicely, but in practice the notation makes it feel like multiple arguments can be used) there is no way to “pass arguments by name” but there is a neat way to swap the order of arguments that a function expects to receive; flip

flip :: (a -> b -> c) -> b -> a -> c

>>> flip (++) "hello" "world"
"worldhello"

-- or

>>> "hello" ++ "world"
"helloworld"

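The same idea is easy to sketch in Python, purely for comparison. This flip is a hypothetical helper written for this example, not something built in, and operator.concat plays the role of (++).

```python
from operator import concat

def flip(f):
    # Return a version of the two-argument function f with its arguments swapped
    return lambda b, a: f(a, b)

flipped = flip(concat)("hello", "world")
plain = concat("hello", "world")

print(flipped)  # worldhello
print(plain)    # helloworld
```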
Those of you familiar with R’s S3 dispatch functionality will perhaps note that the ‘first’ argument has a special role; it controls exactly which method will be called. If we had some function which was flexible in the sense that it could take several different ‘classes’ and do something different with them, we would write that as

flexi <- function(a, b) {
  UseMethod("flexi")
}

flexi.matrix <- function(a, b) {
  paste0("a is a matrix, b = ", b)
}

flexi.data.frame <- function(a, b) {
  paste0("a is a data.frame, b = ", b)
}

flexi.default <- function(a, b) {
  paste0("a is something else, b = ", b)
}

Now, depending on whether a is a matrix, a data.frame, or something else, one of the ‘methods’ will be called

flexi(a = matrix(), b = 7)

## [1] "a is a matrix, b = 7"

flexi(a = data.frame(), b = 8)

## [1] "a is a data.frame, b = 8"

flexi(a = 1, b = 9)

## [1] "a is something else, b = 9"

even if we swap the order of the arguments in the call

flexi(b = 3, a = matrix())

## [1] "a is a matrix, b = 3"
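For comparison, Python’s functools.singledispatch also dispatches on the type of the first argument. The sketch below mirrors the flexi example, with list and dict standing in for matrix and data.frame (an arbitrary choice for illustration). One difference, as far as I can tell: the dispatching argument has to be passed positionally, so the swapped-order call above has no direct equivalent.

```python
from functools import singledispatch

@singledispatch
def flexi(a, b):
    # Fallback, playing the role of flexi.default
    return f"a is something else, b = {b}"

@flexi.register
def _(a: list, b):
    # Chosen when the first argument is a list
    return f"a is a list, b = {b}"

@flexi.register
def _(a: dict, b):
    # Chosen when the first argument is a dict
    return f"a is a dict, b = {b}"

results = [flexi([1, 2], b=7), flexi({}, b=8), flexi(1, b=9)]
print(results)
```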

S4 dispatch goes even further and dispatches based on more than just the class of the first argument. Stuart Lee has a great guide on S4. The point is, you can do something different depending on what you pass to multiple arguments

s4flexi(matrix(), data.frame(), 7)
s4flexi(matrix(), data.frame(), list())
s4flexi(matrix(), data.frame(), NULL)

Julia has some of the most interesting argument parsing. I love the Haskell-like function declarations - so little boilerplate! We define some function f that takes two arguments

f(a, b) = a + b

## f (generic function with 1 method)

f(4, 5)

## 9

Similar to the Rust situation, though - these aren’t named outside of the function body, so we can’t refer to them either in that order or reversed

f(a = 4, b = 5)

MethodError: no method matching f(; a=4, b=5)
Closest candidates are:
  f(!Matched::Any, !Matched::Any) at none:3 got unsupported keyword arguments "a", "b"

The reason is that Julia uses the python-esque keyword argument syntax, where unnamed arguments appear first, followed by any keyword arguments following a ;, so we can specify these correctly as

f(; a, b) = a + b

## f (generic function with 2 methods)

f(a = 4, b = 6)

## 10

Julia is optionally typed, which means we can be flippant with the types here, or we can be very specific - we can specify that a should be an integer and b should be a string. Keyword arguments don’t take part in dispatch, though, so this replaces the keyword method we already defined rather than adding a new one (the method count stays at 2). In this case, I want to return a string with the two values

f(; a::Int, b::String) = "$a; $b"

## f (generic function with 2 methods)

f(a = 42, b = "life, universe, everything")

## "42; life, universe, everything"

Since these are now named, we can swap them

f(b = "L, U, E", a = 42)

## "42; L, U, E"

but what’s even more powerful is we can define a general method, and add type-specific methods for whatever combination of argument types we want; the first of these returns an integer, while the other two return strings

g(a, b) = a + b

## g (generic function with 1 method)

g(a::Int, b::String) = "unnamed int, string: $a; $b"

## g (generic function with 2 methods)

g(a::String, b::Int) = "unnamed string, int: $a; $b"

## g (generic function with 3 methods)

Then, depending on what types we provide in each argument, a different method is called

g(3, 2)

## 5

g("abc", 123)

## "unnamed string, int: abc; 123"

g(123, "abc")

## "unnamed int, string: 123; abc"

Similar to S4, but so easy to declare and use! Of course, this doesn’t work if we want these to be named since that would be ambiguous.

As I’m slowly learning APL, I’ve found it interesting that there’s a well-known approach of writing “point-free” (“tacit”) functions which don’t specify arguments at all.

Last of all, I’ve had the pleasure of dealing with C this week including passing a pointer to some object into a function, in which case the value outside of the function is updated. That’s a whole other post I’m working on.

How does your favourite language use arguments? Let me know! I can be found on Mastodon or use the comments below.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-08-06
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  assertthat    0.2.1   2019-03-21 [3] CRAN (R 4.0.1)
##  blogdown      1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown      0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  brio          1.1.3   2021-11-30 [1] CRAN (R 4.1.2)
##  bslib         0.4.1   2022-11-02 [3] CRAN (R 4.2.2)
##  cachem        1.0.6   2021-08-19 [3] CRAN (R 4.2.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.4.1   2022-09-23 [3] CRAN (R 4.2.1)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  DBI           1.1.3   2022-06-18 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest        0.6.30  2022-10-18 [3] CRAN (R 4.2.1)
##  dplyr         1.0.10  2022-09-01 [3] CRAN (R 4.2.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.18    2022-11-07 [3] CRAN (R 4.2.2)
##  fansi         1.0.3   2022-03-24 [3] CRAN (R 4.2.0)
##  fastmap       1.1.0   2021-01-25 [3] CRAN (R 4.2.0)
##  fs            1.5.2   2021-12-08 [3] CRAN (R 4.1.2)
##  generics      0.1.3   2022-07-05 [3] CRAN (R 4.2.1)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  htmltools     0.5.3   2022-07-18 [3] CRAN (R 4.2.1)
##  htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.3   2022-10-21 [3] CRAN (R 4.2.1)
##  JuliaCall     0.17.5  2022-09-08 [1] CRAN (R 4.1.2)
##  knitr         1.40    2022-08-24 [3] CRAN (R 4.2.1)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  pillar        1.8.1   2022-08-19 [3] CRAN (R 4.2.1)
##  pkgbuild      1.4.0   2022-11-27 [1] CRAN (R 4.1.2)
##  pkgconfig     2.0.3   2019-09-22 [3] CRAN (R 4.0.1)
##  pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.8.0   2022-10-26 [3] CRAN (R 4.2.1)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps            1.7.2   2022-10-26 [3] CRAN (R 4.2.2)
##  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rextendr    * 0.3.0   2023-05-30 [1] CRAN (R 4.1.2)
##  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.1.2)
##  rmarkdown     2.18    2022-11-09 [3] CRAN (R 4.2.2)
##  rprojroot     2.0.3   2022-04-02 [1] CRAN (R 4.1.2)
##  rstudioapi    0.14    2022-08-22 [3] CRAN (R 4.2.1)
##  sass          0.4.2   2022-07-16 [3] CRAN (R 4.2.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny         1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi       1.7.8   2022-07-11 [3] CRAN (R 4.2.1)
##  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  tibble        3.1.8   2022-07-22 [3] CRAN (R 4.2.2)
##  tidyselect    1.2.0   2022-10-10 [3] CRAN (R 4.2.1)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  utf8          1.2.2   2021-07-24 [3] CRAN (R 4.2.0)
##  vctrs         0.5.2   2023-01-23 [1] CRAN (R 4.1.2)
##  withr         2.5.0   2022-03-03 [3] CRAN (R 4.2.0)
##  xfun          0.34    2022-10-18 [3] CRAN (R 4.2.1)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml          2.3.6   2022-10-18 [3] CRAN (R 4.2.1)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Array Languages: R vs APL

website@jcarroll.com.au (Jonathan Carroll) — Fri, 07 Jul 2023 00:00:00 +0000

I’ve been learning at least one new programming language a month through Exercism which has been really fun and interesting. I frequently say that “every language you learn teaches you something about all the others you know” and with nearly a dozen under my belt so far I’m starting to worry about the combinatorics of that statement.

APL isn’t on the list of languages but I’ve seen it in codegolf solutions often enough that it seemed worth a look.

Now, when I say “learning” I mean “good enough to do 5 toy exercises” which is what you need to do in order to earn the badge for that month in the “#12in23 challenge” (gamification FTW). That’s often sufficient for me to get a taste for the language and see if it’s something I’d like to dive deeper into.

It means I’ve been watching a lot of “<language> beginner tutorial” videos recently, which may have been what prompted YouTube to suggest to me a video from code_report; I think this one comparing a leetcode solution to the problem

find the GCD (greatest common divisor) of the smallest and largest numbers in an array

written in 16 (sixteen!!!) languages. Some of those I know a little or a moderate amount about, but one stood out. The APL solution comprises 5 glyphs (symbols) representing operations

      ⌈/∨⌊/

I’ve seen APL solutions pop up in codegolf and they’ve just looked like madness, which is probably fair. The linked video prompted me to look into some of their other videos and they do a great job explaining the glyphs in APL and how they compare to other languages. It turns out this madness is not nearly as hard to read as it looks. The above glyphs represent “maximum” (⌈), “reduce” (/), “GCD” (∨), and “minimum” (⌊) and those all correspond well to the problem statement. The function itself is “point-free” whereby the argument(s) aren’t specified at all; like saying mean rather than mean(x). For the truly adventurous: ‘The Hideous Beauty of Point-Free Programming; An exercise in combinators using Haskell’
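To make the mapping from glyphs to operations concrete, here is a minimal Python sketch of the same problem. The name find_gcd and the sample inputs are made up for this example; the body reads in the same order as the glyphs: maximum, GCD, minimum.

```python
from math import gcd

def find_gcd(nums):
    # GCD of the largest and smallest values, mirroring max-reduce, GCD, min-reduce
    return gcd(max(nums), min(nums))

first = find_gcd([2, 5, 6, 9, 10])
second = find_gcd([7, 5, 6, 8, 3])

print(first)   # 2
print(second)  # 1
```

Unlike the APL version, this one has to name its argument.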

I ended up diving deeper and deeper, and it all started to make more and more sense.

In a recent stream, ThePrimeagen responded to the comment about some language that “<x> is more readable” with “readability is just familiarity” and that stuck with me - I’m not entirely sure I 100% agree with it because I can find several ways to write some code that someone familiar with that language will either find easy or hard to read, despite familiarity. I think {dplyr} in R does a fantastic job of abstracting operations with verbs and making data pipelines easy to comprehend, certainly much more than the base-equivalent code.

So, would APL be “readable” if I was more familiar with it? Let’s find out!

There aren’t that many glyphs in APL - there are far more unique functions in most big libraries from any mainstream language. Looking at the top of the ‘ride’ editor for Dyalog APL there are 80 glyphs. To make a slightly unfair example, there are a lot of exported functions (288 of them) in {dplyr}…

packageVersion("dplyr")

## [1] '1.0.10'

ns <- sort(getNamespaceExports("dplyr"))
head(ns, 20)

##  [1] ".data"        "%>%"          "across"       "add_count"    "add_count_"  
##  [6] "add_row"      "add_rownames" "add_tally"    "add_tally_"   "all_equal"   
## [11] "all_of"       "all_vars"     "anti_join"    "any_of"       "any_vars"    
## [16] "arrange"      "arrange_"     "arrange_all"  "arrange_at"   "arrange_if"

tail(ns, 20)

##  [1] "tbl_vars"            "tibble"              "top_frac"           
##  [4] "top_n"               "transmute"           "transmute_"         
##  [7] "transmute_all"       "transmute_at"        "transmute_if"       
## [10] "tribble"             "type_sum"            "ungroup"            
## [13] "union"               "union_all"           "validate_grouped_df"
## [16] "validate_rowwise_df" "vars"                "with_groups"        
## [19] "with_order"          "wrap_dbplyr_obj"

Counting everything listed as S3method or export in the NAMESPACE file gives 470+. Sure, these aren’t all user-facing, but still. Lots.

So, 80 isn’t a “huge” number, if that’s the entire language.

I watched some more videos about what the glyphs mean and how they work. I started to become slightly familiar with what they mean. Learning is done with the hands, not the eyes, though - as this (not new) blog post goes into great detail on, so I felt that I needed to actually write something. I installed Dyalog APL and the ride editor (given that it uses glyphs, a non-standard editor seems to make sense; I’ve otherwise been completing the Exercism solutions in emacs). I also found tryapl.org as an online editor.

The first step was to just follow along what I’d seen in the videos. I had most recently watched this one that does include a comparison to R (and Julia) so I tried to recreate what I’d seen built up. I was shocked that I actually could!

Recreating construction of an X-matrix in APL using tryapl.org

From reshaping into a matrix, to building up the sequence, to inserting the combinator - it all came together easily enough.

On “combinators” - if you aren’t familiar with Lambda Calculus and have a spare hour, this is a wonderful talk explaining the basics and demonstrating them using JavaScript.

More videos, more learning. I found this one which is another leetcode problem which was roughly

find the maximum value of the sum of rows of a matrix

That sounded like something R would easily handle, but this particular video didn’t feature R. It did feature C++, the solution for which requires two for loops and looked (to me) horrific - I’m used to just passing a matrix to an R function and not having to worry about loops.

I’ve had many discussions on this topic because for whatever reason, for loops have a particular reputation in R despite them not (necessarily) being any worse than any other solution. The short response is that if you’re using one when you could be using vectorisation, you’re probably stating your problem poorly and can do better (in terms of readability, performance, or both). This video covers the points really nicely.

Jenny Bryan made the point that

Of course someone has to write loops… It doesn’t have to be you

alluding to the fact that vectorisation (either with the *apply family or purrr) still has a C loop buried within (I covered some of this myself in another post).

Miles McBain makes a point of never using them (directly).

Okay, so, returning to the leetcode problem. The APL solution in the video is reshaping (⍴) a vector to a matrix then reducing (/) addition (+) across rows (last-axis; c.f. first axis would be +⌿) and reducing (/) that with maximum (⌈) making the entire solution

      x ← 3 3⍴1 2 3 5 5 5 3 1 4
      ⌈/+/x
 15

which is an elegant, compact solution. APL agrees to ignore the [1] at the start of R’s default output if R agrees to ignore the odd indenting of APL commands.

As a sidenote: I love that I finally get to use the OG assignment arrow ← that inspired the usage in R (as <-). This isn’t some ligature font, it’s the actual arrow glyph with Unicode code point U+2190. The APL keyboard has this on a key and that was common around the time that it made it into R (or S).

The video explains that this solution is particularly nice because it’s explicit that two “reduce” operations are occurring. The + operator in APL can be either unary (takes 1 argument) or binary (takes 2 arguments) but it can’t loop over an entire vector. To achieve that, it’s combined with / which performs “reduce”, essentially applying + across the input.

It’s a fairly straightforward answer with R, too:

a <- matrix(c(1, 2, 3,
              5, 5, 5,
              3, 1, 4),
            3, 3, byrow = TRUE)
a

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    5    5    5
## [3,]    3    1    4

max(rowSums(a))

## [1] 15

and done. Nice. No for loops. Or are there? Of course there are, somewhere, but can we write this “like” the APL solution and be more explicit with the “reduce” steps over binary operators? R has a Reduce() function for exactly this case.

A simplified rowSums() function could just be applying the sum operation to the rows of the matrix

s <- function(x) apply(x, 1, sum)

but sum(x) is itself vectorised - it’s an application of the binary + operation across a vector, so really we could have

s <- function(x) apply(x, 1, \(y) Reduce(`+`, y))
s(a)

## [1]  6 15  8

This isn’t so bad compared to APL which “naturally” performs the reduction over that dimension. Compare (⍝ signifies a comment):

      x
1 2 3
5 5 5
3 1 4

⍝ "rowSums"

      +/x
6 15 8

⍝ "colSums"

      +⌿x
9 8 12

There’s nothing here that says x needs to have more than 1 dimension, though - it’s the same operator(s) on a vector, just that they do the same thing

      +/(1 2 3)
6
      +⌿(1 2 3)
6

max is also vectorised, so a simple, ostensibly binary version of that could be

m <- function(x, y) ifelse(x > y, x, y)
m(1, 2)

## [1] 2

m(4, 2)

## [1] 4

Together, an R solution using these could be

Reduce(m, s(a))

## [1] 15

which, if we shortened Reduce to a single character

R <- Reduce

would be

R(m, s(a))

## [1] 15

That’s not a lot more characters than APL. I’ve abstracted at least one of the functions, though - APL uses the operators directly, in which case we’d have

maxWealth <- \(x) R(m, apply(x, 1, \(y) R(`+`, y)))
maxWealth(a)

## [1] 15

That’s only using Reduce, binary +, a simplified max (which we could imagine was a built-in we could shorten to m), and the apply over rows.

Comparing these directly (with some artistic license):

 m R + R
 ⌈ / + /
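The same two-reduce shape can be sketched outside R as well. This Python version is only an illustration, using functools.reduce with binary add and the built-in max as the two folded operations; the variable names are my own.

```python
from functools import reduce
from operator import add

x = [[1, 2, 3],
     [5, 5, 5],
     [3, 1, 4]]

# "+/" on each row: fold binary addition across the row
row_sums = [reduce(add, row) for row in x]

# "⌈/": fold binary max across the row sums
max_wealth = reduce(max, row_sums)

print(row_sums)    # [6, 15, 8]
print(max_wealth)  # 15
```

The list comprehension plays the part of apply over rows, which APL gets for free from reducing along the last axis.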

The point of this whole exercise wasn’t to rebuild the APL solution in R - it was to think more deeply about what abstractions R offers and how they compare to a language that uses (only) the atomic constructs directly.

I love that in R I can pass either individual values or a vector to sum and it “just deals with it”

sum(4, 5, 6) # sum some "scalars"

## [1] 15

vals <- c(4, 5, 6)
sum(vals) # sum a vector

## [1] 15

This ability to hide/abstract the looping over dimensions and to work directly with objects with more than one dimension is what qualifies R as an “array language”. This is also (mimicking, perhaps) “rank polymorphism” which APL does have. Julia gets around this with “broadcasting”. But, at least in R, this hides/abstracts some of what is happening, and sometimes/often, that’s a for loop.

Does every programmer need to know the gory details? Absolutely not. Might it be useful for gaining a better understanding of the language and how to work with it? I really think it is. It’s why I’m digging further and further into functional programming in general.

I do believe that the APL solution is more explicit in what it’s doing; that it doesn’t hide (much, if any) of the implementation details. I’m comfortable with the abstractions in R and will continue to write R for this reason, but if I had a need to do some array math in any other language, I now feel like APL really does have a lot to offer.

Bonus Round

I was thinking about the leetcode problem and thought that a slightly more complex version would be to return “which row has the maximum?” rather than the maximum itself.

In R, there is another useful function to achieve this

which.max(rowSums(a))

## [1] 2

so, have I learned enough APL to do this myself?

There’s a “Grade Down” operator (⍒) which seems equivalent to R’s order(decreasing = TRUE) and a “First” operator (⊃) like head(n = 1) so a solution seems to be to get the indices of the sorted (decreasing) elements then take the first one

      ⊃⍒+/x
2

Apparently an alternative would be to find the (first) element of the input (⍵) that matches the maximum which would be

      {⍵⍳⌈/⍵}(+/x)
2

which, at least to me, isn’t as elegant.

Lastly, Kieran Healy relayed to me a small algorithm for finding ‘primes smaller than some number’ in APL which cleaned up as

      ((⊢~∘.×⍨)1↓⍳)(50)
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47

This makes use of some combinators (e.g. the C-combinator ⍨ - possibly the coolest glyph in the entire system), but roughly involves filtering values not (~) members (∈) of values produced by the outer (∘) product (.) using multiplication (×) (i.e. numbers that can be made by multiplying other numbers) from the sequence (⍳) from 2 to some value (dropping (↓) 1; 3↓⍳8 == 4:8). With the small amount I’ve learned - mostly from watching someone else use the language - I was able to decipher at least what the operators were in all of that, even if I probably couldn’t come up with the solution myself.

I’m happy to call that “readable”.

I looked around for code to generate the primes below some number in R. I couldn’t (easily) find one that worked without an explicit loop. I found a version in {sfsmisc} which compacts to

primes. <- function(n) {
  ## By Bill Venables <= 2001
  x <- 1:n
  x[1] <- 0
  p <- 1
  while((p <- p + 1) <= floor(sqrt(n)))
    if(x[p] != 0)
      x[seq(p^2, n, p)] <- 0
  x[x > 0]
}
primes.(50)

##  [1]  2  3  5  7 11 13 17 19 23 29 31 37 41 43 47

Taking inspiration from the APL solution, though - what if we just generate all products from the set of numbers 2:n and exclude those as “not prime” from all the numbers up to n?

primes <- function(n) {
  s <- 2:n
  setdiff(s, c(outer(s, s, `*`)))
}
primes(50)

##  [1]  2  3  5  7 11 13 17 19 23 29 31 37 41 43 47

That… works! It’s slower and uses more memory, for sure, but that wasn’t our criteria, and isn’t relevant for a once-off evaluation. Even better - I can “read” exactly what it’s doing.
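The “remove everything that appears in the multiplication table” idea isn’t specific to array languages. As a rough sketch in Python (using a set of pairwise products in place of the outer product, which is my own substitution):

```python
def primes(n):
    # Candidates 2..n, like 2:n in the R version
    s = range(2, n + 1)
    # Every pairwise product, i.e. the flattened multiplication table
    products = {a * b for a in s for b in s}
    # Keep the candidates that never show up as a product
    return [v for v in s if v not in products]

result = primes(50)
print(result)
```

It has the same quadratic cost in time and memory as the outer() version, so it’s a reading exercise rather than a practical sieve.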

I’ve learned a lot and I’ll continue to learn more about APL because I really do think that understanding how these operators come together to build a function will be enlightening in terms of a functional approach.

I still haven’t made it to trying out BQN (almost constructed by incrementing each letter of APL, IBM -> HAL style, but perhaps officially “Big Questions Notation”, and sometimes pronounced “bacon”) but it sounds like it has some newer improvements over APL and will be worth a try.

As always, comments and discussions are welcome here or on Mastodon.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-07-07
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  assertthat    0.2.1   2019-03-21 [3] CRAN (R 4.0.1)
##  blogdown      1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown      0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib         0.4.1   2022-11-02 [3] CRAN (R 4.2.2)
##  cachem        1.0.6   2021-08-19 [3] CRAN (R 4.2.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.4.1   2022-09-23 [3] CRAN (R 4.2.1)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  DBI           1.1.3   2022-06-18 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest        0.6.30  2022-10-18 [3] CRAN (R 4.2.1)
##  dplyr         1.0.10  2022-09-01 [3] CRAN (R 4.2.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.18    2022-11-07 [3] CRAN (R 4.2.2)
##  fansi         1.0.3   2022-03-24 [3] CRAN (R 4.2.0)
##  fastmap       1.1.0   2021-01-25 [3] CRAN (R 4.2.0)
##  fs            1.5.2   2021-12-08 [3] CRAN (R 4.1.2)
##  generics      0.1.3   2022-07-05 [3] CRAN (R 4.2.1)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  htmltools     0.5.3   2022-07-18 [3] CRAN (R 4.2.1)
##  htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.3   2022-10-21 [3] CRAN (R 4.2.1)
##  knitr         1.40    2022-08-24 [3] CRAN (R 4.2.1)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  pillar        1.8.1   2022-08-19 [3] CRAN (R 4.2.1)
##  pkgbuild      1.4.0   2022-11-27 [1] CRAN (R 4.1.2)
##  pkgconfig     2.0.3   2019-09-22 [3] CRAN (R 4.0.1)
##  pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.8.0   2022-10-26 [3] CRAN (R 4.2.1)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps            1.7.2   2022-10-26 [3] CRAN (R 4.2.2)
##  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.1.2)
##  rmarkdown     2.18    2022-11-09 [3] CRAN (R 4.2.2)
##  rstudioapi    0.14    2022-08-22 [3] CRAN (R 4.2.1)
##  sass          0.4.2   2022-07-16 [3] CRAN (R 4.2.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny         1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi       1.7.8   2022-07-11 [3] CRAN (R 4.2.1)
##  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  tibble        3.1.8   2022-07-22 [3] CRAN (R 4.2.2)
##  tidyselect    1.2.0   2022-10-10 [3] CRAN (R 4.2.1)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  utf8          1.2.2   2021-07-24 [3] CRAN (R 4.2.0)
##  vctrs         0.5.2   2023-01-23 [1] CRAN (R 4.1.2)
##  xfun          0.34    2022-10-18 [3] CRAN (R 4.2.1)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml          2.3.6   2022-10-18 [3] CRAN (R 4.2.1)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Reflecting on Macros

website@jcarroll.com.au (Jonathan Carroll) — Sat, 10 Jun 2023 00:00:00 +0000

I’ve been following the drama of the RustConf Keynote Fiasco (RKNF, per @fasterthanlime) from a great distance - I’m not involved in that community beyond starting to learn the language. But the controversial topic itself, Compile-Time Reflection, seemed like something interesting that I could learn about.

A good start is usually a Wikipedia page, and I found one called “Reflective programming” under the “MetaProgramming” category, where it defines

reflection is the ability of a process to examine, introspect, and modify its own structure and behavior

That sounds somewhat familiar from what metaprogramming I’ve read about. One of the great features of R is the ability to inspect and rewrite functions. For example, printing the sd() function (which calculates the standard deviation of its input) shows its definition

sd

## function (x, na.rm = FALSE) 
## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
##     na.rm = na.rm))
## <bytecode: 0x5647347e3980>
## <environment: namespace:stats>

Trying to extract a “component” of that function results in the ever-classic error

sd[1]

## Error in sd[1]: object of type 'closure' is not subsettable

However, using body() we can get to the components of the function

body(sd)

## sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
##     na.rm = na.rm))

body(sd)[1]

## sqrt()

and we can even mess with it (meaninglessly, in this case)

vals <- c(1, 3, 5, 7)
sd(vals)

## [1] 2.581989

my_sd <- sd
body(my_sd)[1] <- call("log")
my_sd # note that the function now (wrongly) uses log() instead of sqrt()

## function (x, na.rm = FALSE) 
## log(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
##     na.rm = na.rm))
## <environment: namespace:stats>

my_sd(vals)

## [1] 1.89712
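The same kind of surgery works on any closure, not just sd(). Here is a minimal sketch on a throwaway function (add_default is a made-up name, purely for illustration): formals() and body() expose the pieces, and the body is an ordinary call object whose first element is the function being called.

```r
# A hypothetical toy function, used only for illustration
add_default <- function(x, y = 2) x + y

formals(add_default)$y    # the default value of `y`
body(add_default)         # the call `x + y`
body(add_default)[[1]]    # the function being called: `+`

# Swap `+` for `*` by replacing the first element of the call
body(add_default)[[1]] <- as.name("*")
add_default(3)            # now computes 3 * 2, i.e. 6
```

Using [[1]] here returns the bare symbol, whereas the [1] used above returns a one-element call such as sqrt().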

The Wikipedia page lists the following example of reflection in R

# Without reflection, assuming foo() returns an S3-type object that has method "hello"
obj <- foo()
hello(obj)

# With reflection
class_name <- "foo"
generic_having_foo_method <- "hello"
obj <- do.call(class_name, list())
do.call(generic_having_foo_method, alist(obj))

Using a more concrete data object and class, e.g. tibble::tibble and summary might be clearer

library(tibble) # do.call doesn't like pkg::fun as a string

# Without reflection
obj <- tibble(a = 1:2, b = 3:4)
summary(obj)

##        a              b       
##  Min.   :1.00   Min.   :3.00  
##  1st Qu.:1.25   1st Qu.:3.25  
##  Median :1.50   Median :3.50  
##  Mean   :1.50   Mean   :3.50  
##  3rd Qu.:1.75   3rd Qu.:3.75  
##  Max.   :2.00   Max.   :4.00

# With reflection
class_name <- "tibble"
generic_having_foo_method <- "summary"
obj <- do.call(class_name, list(a = 1:2, b = 3:4))
obj

## # A tibble: 2 × 2
##       a     b
##   <int> <int>
## 1     1     3
## 2     2     4

do.call(generic_having_foo_method, alist(obj))

##        a              b       
##  Min.   :1.00   Min.   :3.00  
##  1st Qu.:1.25   1st Qu.:3.25  
##  Median :1.50   Median :3.50  
##  Mean   :1.50   Mean   :3.50  
##  3rd Qu.:1.75   3rd Qu.:3.75  
##  Max.   :2.00   Max.   :4.00

So, maybe it’s more to do with being able to use a string containing the “name” of a function and go and find that function, or just the ability to generate functions on-demand based on non-function objects (?). Please, let me know if there’s a more enlightening explanation.
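If the "find a function from its name" reading is roughly right, base R has a few tools for it besides do.call(). This is a small sketch of that idea as I understand it, not a definitive account of reflection; the particular functions looked up here are arbitrary choices.

```r
# Look a function up from its name held in a string
fun_name <- "toupper"
f <- match.fun(fun_name)
f("reflection")

# For a namespaced function, pass the package and function names separately
# (this sidesteps the "pkg::fun as a string" problem noted above)
sd_fun <- getExportedValue("stats", "sd")
sd_fun(c(1, 3, 5, 7))

# Find the S3 method a generic would dispatch to for a given class
summary_df <- utils::getS3method("summary", "data.frame")
is.function(summary_df)
```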

I still don’t think I understand that at all (more time required) but I did note in some additional research that “reflection” and “macros” are two very similar concepts. Now macros are something I’ve heard of at least, so I was off to do some more research.

Unfortunately, web searches for the terms “reflection” and “macro” turn up a lot of macro-lens photography results.

I’ve heard of macros in Julia where they’re used to “rewrite” an expression. This is a nice rundown of the process, as are the official docs. These are used in many places. One up-and-coming place is the new Tidier.jl which implements the tidyverse (at least the most common dplyr and purrr parts) using macros (denoted with a @ prefix)

using Tidier
using RDatasets

movies = dataset("ggplot2", "movies");

@chain movies begin
    @mutate(Budget = Budget / 1_000_000)
    @filter(Budget >= mean(skipmissing(Budget)))
    @select(Title, Budget)
    @slice(1:5)
end

Rust uses macros for printing (amongst other things); println!() is a macro, apparently at least in part because it needs to be able to take an arbitrary number of args, so one can write

>> println!("a = {}, b = {}, c = {}", 1, 2, 3)
a = 1, b = 2, c = 3

Rust has a shorthand macro for creating a new vector vec!()

>> let v = vec![2, 3, 4];

and also has the “debug macro” dbg!() which is super handy - it prints out the expression you wrap, plus the value, so you can inspect the current state with e.g.

>> dbg!(&v);
[src/lib.rs:109] &v = [
    2,
    3,
    4,
]

This last one would be great to have in R… as a side note, we could construct a simple version with {rlang}

dbg <- function(x) {
  ex <- rlang::f_text(rlang::enquos(x)[[1]])
  ret <- rlang::eval_bare(x)
  message(glue::glue("DEBUG: {ex} = {ret}"))
  ret
}

a <- 1
b <- 3
x <- dbg(a + b)

## DEBUG: a + b = 4

y <- dbg(2*x + 3)

## DEBUG: 2 * x + 3 = 11

z <- 10 + dbg(y*2)

## DEBUG: y * 2 = 22

In all of these examples of macros, the code that is run is different to the code you write because the macro makes some changes before executing.
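To make "the code that is run is different to the code you write" concrete in R terms, here is a deliberately silly sketch. The names swap_plus and times_macro are made up for this example, and it only handles simple expressions; it captures the caller's code unevaluated, rewrites every + into *, and only then evaluates it.

```r
# Recursively walk an expression, replacing the symbol `+` with `*`
swap_plus <- function(e) {
  if (is.call(e)) {
    as.call(lapply(as.list(e), swap_plus))
  } else if (identical(e, as.name("+"))) {
    as.name("*")
  } else {
    e
  }
}

# Capture the caller's code unevaluated, rewrite it, then run the result
# in the caller's environment
times_macro <- function(expr) {
  code <- swap_plus(substitute(expr))
  eval(code, parent.frame())
}

times_macro(2 + 3)  # the code that actually runs is 2 * 3, giving 6
```

This is an ordinary function doing its rewriting at run time, so it is only an imitation of a macro, but the capture, rewrite, evaluate shape is the same.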

In R there isn’t a “proper” way to do this but we do have ways to manipulate code and we do have ways to retrieve “unparsed” input, e.g. substitute(). A quick look for “macros in R” turned up a function in a package that is more than 20 years old (I was only starting University when this came out and knew approximately 0 programming) and comes with a journal article; gtools::defmacro() by Thomas Lumley has a construction for writing something that behaves like a macro.

That article is from 2001 when R 1.3.1 was being released. The example code made me do a double-take

library(gtools)

####
# macro for replacing a specified missing value indicator with NA
# within a dataframe
###
setNA <- defmacro(df, var, values,
  expr = {
    df$var[df$var %in% values] <- NA
  }
)

# create example data using 999 as a missing value indicator
d <- data.frame(
  Grp = c("Trt", "Ctl", "Ctl", "Trt", "Ctl", "Ctl", "Trt", "Ctl", "Trt", "Ctl"),
  V1 = c(1, 2, 3, 4, 5, 6, 999, 8, 9, 10),
  V2 = c(1, 1, 1, 1, 1, 2, 999, 2, 999, 999),
  stringsAsFactors = TRUE
)
d

##    Grp  V1  V2
## 1  Trt   1   1
## 2  Ctl   2   1
## 3  Ctl   3   1
## 4  Trt   4   1
## 5  Ctl   5   1
## 6  Ctl   6   2
## 7  Trt 999 999
## 8  Ctl   8   2
## 9  Trt   9 999
## 10 Ctl  10 999

# Try it out
setNA(d, V1, 999)
setNA(d, V2, 999)
d

##    Grp V1 V2
## 1  Trt  1  1
## 2  Ctl  2  1
## 3  Ctl  3  1
## 4  Trt  4  1
## 5  Ctl  5  1
## 6  Ctl  6  2
## 7  Trt NA NA
## 8  Ctl  8  2
## 9  Trt  9 NA
## 10 Ctl 10 NA

Wait - I thought… there’s no assignment in those last lines, but the data is being modified!?! Sure enough, the internals of defmacro make it clear that this is the case, but it seemed like magic. Essentially, this identifies what needs to happen, what it needs to happen to (via substitute()), and makes it happen in the parent.frame(). Neat! So, what else can we do with this?
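To convince myself of the mechanism, here is a hand-rolled sketch of the same trick without gtools. This is not defmacro's actual implementation, and set_na_sketch is a made-up name; it builds the assignment with bquote() from the unevaluated argument names and evaluates it in the caller's frame.

```r
# Hypothetical stand-in for the setNA macro, written as a plain function
set_na_sketch <- function(df, var, values) {
  df_sym  <- substitute(df)                  # e.g. the symbol `d`
  var_chr <- as.character(substitute(var))   # e.g. "V1"
  # Build the code: d[["V1"]][d[["V1"]] %in% values] <- NA
  code <- bquote(
    .(df_sym)[[.(var_chr)]][.(df_sym)[[.(var_chr)]] %in% .(values)] <- NA
  )
  # Run it where the caller's data frame lives, so the caller's copy changes
  eval(code, parent.frame())
  invisible(NULL)
}

d <- data.frame(V1 = c(1, 999, 3), V2 = c(999, 2, 999))
set_na_sketch(d, V1, 999)
d$V1  # the 999 in V1 is now NA; V2 is untouched
```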

I thought about it for a while and realised what could be a [te|ho]rrific one…

Just a couple of weeks ago, Danielle Navarro made a wish

not for the first time I find myself wishing that push() and pop() were S3 generics in #rstats

Now, if you’re not familiar with those, pop(x) removes the first element of a structure x (e.g. a vector) and returns that first value, leaving the original object x containing only the remaining elements, whereas push(x, y) inserts the value y as the first element of x, moving the remaining elements down the line. These show up more in object-oriented languages, but they don’t exist in R.

If we define a vector a containing some values

a <- c(3, 1, 4, 1, 5, 9)

and we wish to extract the first value, we can certainly do so with

a[1]

## [1] 3

but, due to the nature of R, the vector a is unchanged

a

## [1] 3 1 4 1 5 9

Instead, we could remove the first value of a with

a[-1]

## [1] 1 4 1 5 9

but again, a remains unchanged - in order to modify a we must redefine it as e.g.

a <- a[-1]
a

## [1] 1 4 1 5 9

If we wanted to build a pop() function, we could use substitute() to figure out what the passed input object was, perform the extraction of the first element, and so on…
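That substitute() route might look something like the sketch below. pop_sketch is a hypothetical name, and it only accepts a bare variable name; it reads the name of the object that was passed and overwrites that object in the caller's frame.

```r
# Hypothetical substitute()-based pop; only works on a bare variable name
pop_sketch <- function(x) {
  target <- substitute(x)
  if (!is.name(target)) stop("`x` must be a bare variable name")
  first <- x[1]
  # Overwrite the caller's variable with everything except the first element
  assign(as.character(target), x[-1], envir = parent.frame())
  first
}

a <- c(3, 1, 4, 1, 5, 9)
first <- pop_sketch(a)
first  # 3
a      # the remaining five values
```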

But as we’ve just seen, there’s a better way to define that - a macro!

r_pop <- gtools::defmacro(x, expr = {
  ret <- x[1]
  x <- x[-1]
  ret
})

Now, if we use that on a vector

a <- c(3, 1, 4, 1, 5, 9)
r_pop(a)

## [1] 3

a

## [1] 1 4 1 5 9

It works!!!

Danielle wanted a Generic, though, so we can easily make pop() a Generic and add methods for some classes (which can be further extended).

To that end, I present a brand new package; {weasel}

pop() goes the {weasel}

This defines pop() and push() as Generics with methods defined for vectors, lists, and data.frames

a <- list(x = c(2, 3), y = c("foo", "bar"), z = c(3.1, 4.2, 6.9))
a

## $x
## [1] 2 3
## 
## $y
## [1] "foo" "bar"
## 
## $z
## [1] 3.1 4.2 6.9

x <- pop(a)
a

## $y
## [1] "foo" "bar"
## 
## $z
## [1] 3.1 4.2 6.9

x

## [1] 2 3

a <- data.frame(x = c(2, 3, 4), y = c("foo", "bar", "baz"), z = c(3.1, 4.2, 6.9))
a

##   x   y   z
## 1 2 foo 3.1
## 2 3 bar 4.2
## 3 4 baz 6.9

x <- pop(a)
a

##   x   y   z
## 2 3 bar 4.2
## 3 4 baz 6.9

x

##   x   y   z
## 1 2 foo 3.1

a <- c(1, 4, 1, 5, 9)
a

## [1] 1 4 1 5 9

push(a, 3)
a

## [1] 3 1 4 1 5 9

a <- data.frame(y = c("foo", "bar", "baz"), z = c(3.1, 4.2, 6.9))
a

##     y   z
## 1 foo 3.1
## 2 bar 4.2
## 3 baz 6.9

push(a, data.frame(y = 99, z = 77))
a

##     y    z
## 1  99 77.0
## 2 foo  3.1
## 3 bar  4.2
## 4 baz  6.9

I wrote this (simple) package as a bit of an exercise - I really don’t think you should actually use it for anything. The “looks like it modifies in-place but actually doesn’t” is really non-idiomatic for R. Nonetheless, I was really interested to see that defmacro can be used as a function definition that the dispatch machinery will respect. The only catch I’ve found so far is that I can’t use ellipses (...) in the function signature.
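For comparison, here is how a generic with the same caller-modifying behaviour could be sketched using only base R. This is not how {weasel} is written (that uses defmacro), and pop_first is a made-up name. It leans on the fact that inside an S3 method, substitute() still sees the original argument expression and parent.frame() is the frame that called the generic.

```r
# Hypothetical generic; methods overwrite the caller's variable
pop_first <- function(x) UseMethod("pop_first")

# Vectors and lists: drop the first element, return it
pop_first.default <- function(x) {
  target <- substitute(x)
  stopifnot(is.name(target))
  assign(as.character(target), x[-1], envir = parent.frame())
  x[[1]]
}

# Data frames: drop the first row, return it as a one-row data frame
pop_first.data.frame <- function(x) {
  target <- substitute(x)
  stopifnot(is.name(target))
  assign(as.character(target), x[-1, , drop = FALSE], envir = parent.frame())
  x[1, , drop = FALSE]
}

v <- c(3, 1, 4)
out <- pop_first(v)

df <- data.frame(x = c(2, 3, 4), y = c("foo", "bar", "baz"))
row1 <- pop_first(df)
```

The same caveat applies: this looks like in-place modification but is really a reassignment in the caller's frame, so I wouldn't use it for anything serious either.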

I noticed that Dirk Schumacher built a similar defmacro package more recently, but that appears to be more aimed at building macros to be expanded on package load (funnily enough, “compile-time macros” - we’ve come full circle). This seems like a great opportunity for “inlining” some functions. I’ll definitely be digging deeper into that one.

Let me know if you have a better explanation of any of the concepts I’ve (badly) described here; I’m absolutely just learning and following Julia Evans’ advice about blogging.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-06-17
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown      0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib         0.4.1   2022-11-02 [3] CRAN (R 4.2.2)
##  cachem        1.0.6   2021-08-19 [3] CRAN (R 4.2.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.4.1   2022-09-23 [3] CRAN (R 4.2.1)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest        0.6.30  2022-10-18 [3] CRAN (R 4.2.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.18    2022-11-07 [3] CRAN (R 4.2.2)
##  fansi         1.0.3   2022-03-24 [3] CRAN (R 4.2.0)
##  fastmap       1.1.0   2021-01-25 [3] CRAN (R 4.2.0)
##  fs            1.5.2   2021-12-08 [3] CRAN (R 4.1.2)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  gtools      * 3.9.4   2022-11-27 [1] CRAN (R 4.1.2)
##  htmltools     0.5.3   2022-07-18 [3] CRAN (R 4.2.1)
##  htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.3   2022-10-21 [3] CRAN (R 4.2.1)
##  knitr         1.40    2022-08-24 [3] CRAN (R 4.2.1)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  pillar        1.8.1   2022-08-19 [3] CRAN (R 4.2.1)
##  pkgbuild      1.4.0   2022-11-27 [1] CRAN (R 4.1.2)
##  pkgconfig     2.0.3   2019-09-22 [3] CRAN (R 4.0.1)
##  pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.8.0   2022-10-26 [3] CRAN (R 4.2.1)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps            1.7.2   2022-10-26 [3] CRAN (R 4.2.2)
##  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.1.2)
##  rmarkdown     2.18    2022-11-09 [3] CRAN (R 4.2.2)
##  rstudioapi    0.14    2022-08-22 [3] CRAN (R 4.2.1)
##  sass          0.4.2   2022-07-16 [3] CRAN (R 4.2.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny         1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi       1.7.8   2022-07-11 [3] CRAN (R 4.2.1)
##  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  tibble      * 3.1.8   2022-07-22 [3] CRAN (R 4.2.2)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  utf8          1.2.2   2021-07-24 [3] CRAN (R 4.2.0)
##  vctrs         0.5.2   2023-01-23 [1] CRAN (R 4.1.2)
##  weasel      * 0.1.0   2023-06-09 [1] local
##  xfun          0.34    2022-10-18 [3] CRAN (R 4.2.1)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml          2.3.6   2022-10-18 [3] CRAN (R 4.2.1)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Hyperlink Annotations in JavaScript and CSS

website@jcarroll.com.au (Jonathan Carroll) — Fri, 02 Jun 2023 00:00:00 +0000

This might not have been difficult for a seasoned web-dev, but it was reasonably tricky to find a clear solution online (at least it was for me) so here’s how I added the neat domain hints next to all the hyperlinks on my blog.

I’m familiar with these from the RWeekly site where hyperlinks are annotated with the target domain -

Rweekly.org annotated hyperlinks

I find this really useful to see where a link will take me.

I hadn’t looked into how those were being added, but we certainly weren’t doing it manually; we use regular markdown links like

[link description](https://example.com)

I recently saw a blog where these also appeared and it made me want to figure out how to add those to mine. Trying to search for “add domain next to hyperlink” doesn’t seem to produce much along the lines of what I was trying to do, and to make matters worse, I wasn’t sure whether this was part of Hugo/blogdown, JavaScript, CSS, or something else entirely.

I finally got enough clues to patch together a solution and I’m pretty happy with it!

My approach was to add some JavaScript (jQuery, I believe) to all built pages that inserts the hostname of the link target in parentheses. The simple version of that looks like

$('a').each(function () {
    this.hostname && $(this).after(' ('+this.hostname+')');
});

Breaking this down:

$('a') locates all instances of an anchor (<a href="...">)
.each() is a map over each of these, which takes a function
provided the anchor's hostname property is not empty, after() inserts some text following the link: this.hostname in parentheses, which is just the host part of the URL being linked to

Getting this sourced into my blog means placing this inside a

$(document).ready(function() {

});

block. I added all of that to a new links.js file in static/js/ (which I had to create). I then edited /layouts/partials/footer_custom.html to include

<script src="{{ "js/links.js" | relURL }}"></script>

This inserts a <script> line into every page, adding the path of that file relative to the actual site. Phew.

On testing that, it does work, but it works for every link on the page, including those in the header, the social media share buttons… everything. That’s exactly what we asked for, of course, by selecting $('a').

After using the Inspect developer tools, I found that the main article of a blog post on my site has a blog-post class, so I can filter down the annotations to just anchors within that with

$(document).ready(function() {
  $('.blog-post a').each(function() {
      this.hostname && $(this).after(' ('+this.hostname+')');
  });
});

Checking the output, that prevents the header links from being annotated, but the share buttons are still within that article. Excluding those specifically just needs a .not()

$(document).ready(function() {
  $('.blog-post a').not('.share a').each(function() {
      this.hostname && $(this).after(' ('+this.hostname+')');
  });
});

Lastly, there are some annotations to my own page - those aren’t necessary (though I suppose they don’t hurt). I can remove those from being processed by checking if the link destination hostname is the same as the current page hostname (i.e. if it’s a link to the current site or an external link)

$(document).ready(function() {
  $('.blog-post a').not('.share a').each(function() {
      if (this.hostname != window.location.hostname) {
          this.hostname && $(this).after(' ('+this.hostname+')');
      }
  });
});

And that’s it! External links are now annotated.
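For anyone who thinks more easily in R than in JavaScript, the decision the script makes for each link can be sketched in R. This is only an illustration of the logic, not part of the site: link_hostname and annotate_link are made-up helpers, the regex is a rough approximation of what the browser's hostname property does (it ignores things like user:password@ prefixes), and "jcarroll.com.au" stands in for window.location.hostname.

```r
# Hypothetical helper: pull the host part out of an absolute URL
link_hostname <- function(url) {
  m <- regmatches(url, regexec("^[A-Za-z][A-Za-z0-9+.-]*://([^/:?#]+)", url))
  # Relative links (no scheme) have no match, so return "" for those
  vapply(m, function(x) if (length(x) == 2) x[2] else "", character(1))
}

# Append " (host)" only when the link has a host and it isn't our own
annotate_link <- function(text, url, own_host) {
  host <- link_hostname(url)
  external <- nzchar(host) & host != own_host
  ifelse(external, paste0(text, " (", host, ")"), text)
}

annotate_link("RWeekly", "https://rweekly.org/about", "jcarroll.com.au")
annotate_link("About me", "https://jcarroll.com.au/about/", "jcarroll.com.au")
```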

As a last step, I decided to style these slightly differently. That means adding a class to the added text, which I did with a <span>

$(document).ready(function() {
  $('.blog-post a').not('.share a').each(function() {
      if (this.hostname != window.location.hostname) {
          this.hostname && $(this).after(' <span class="link-annot">('+this.hostname+')</span>');
      }
  });
});

Adding some CSS for this class means some of the same steps; I added a new static/css/links.css file and added

.link-annot {
  color: #808080;
  font-size: 14px;
  font-family: monospace;
}

to make the annotations grey, slightly smaller than the body text, and in monospace font.

I made sure this was sourced into the pages by editing /layouts/partials/head_custom.html to include

<link rel="stylesheet" href="{{ "css/links.css" | relURL }}" />

and finally, I have what I wanted!

Hyperlink annotations; automatically added and styled

If anyone wants to do the same, all of the changes I needed to make are in this pull request

Was there an easier way to do this? Let me know on Mastodon or use the comments below.

Which Plot Was That?

website@jcarroll.com.au (Jonathan Carroll) — Fri, 26 May 2023 00:00:00 +0000

Plotly has a nice way of making click-events available to the calling language, but it isn’t quite as simple when using subplot(). This isn’t a post about a new feature, but I didn’t quickly find a resource for it so I’ll add my findings to make it easier for the next person.

Plotly (as a graphics library) is a JavaScript library that has been ported to R, Python, Julia, and - surprising to me - MATLAB and F#. It provides an interactive plotting framework that works really nicely for web-facing apps including R’s {shiny}.

I’m currently building an internal tool at work and wanted to add some click-event-based reactivity. Plotly supports that by registering an ‘event’ with a ‘source’ which can be listened to with an event_data() call. A simple shiny app demonstrating that might be

library(plotly)
library(shiny)

ui <- basicPage("",
                mainPanel(
                  plotlyOutput("p"),
                  verbatimTextOutput("out")
                )
)

server <- function(input, output, session) {
  output$p <- renderPlotly({
    plotly::plot_ly(data = mtcars,
                    y = ~ hp,
                    x = ~ mpg,
                    type = "scatter",
                    mode = "markers",
                    source = "click_src") |> # default is "A"
      event_register("plotly_click")
  })

  output$out <- renderPrint({
    click_data <- event_data("plotly_click", source = "click_src")
    req(click_data)
    message("CLICK!")
    click_data
  })
}

runApp(shinyApp(ui = ui, server = server))

Listening to click events in plotly

There’s a bit to break down here if you’re not familiar with {shiny};

A user interface stored as ui which describes how the app should “look”. In this extremely simple case, it’s some plotly output followed by some text.
A server function which performs the ‘backend’ operations, sending outputs to the components corresponding to the UI elements. In this case producing a plotly plot of the mtcars dataset with a ‘scatter’ plot of the hp column on the y-axis and the mpg column on the x-axis. The source argument specifies a ‘label’ for the event (defaulting to "A" but specified as "click_src" in this case). Finally, the ‘event’ is registered. This example also includes a text output of the data associated with clicking on a point in the plot, and a message to the console every time that happens.
A call to runApp() which starts an app with the specified ui and server.

This generates a simple shiny app with one plot. Clicking on any of the points produces a text output containing:

curveNumber: identifying the ‘trace’ number for that data. We only have one, so this will always be 0 (JavaScript starts counting at 0)
pointNumber: ostensibly the index of the clicked point in the original dataset, though I believe that may not always be the case
x: the x-coordinate of the clicked point
y: the y-coordinate of the clicked point

This is nice for interacting with the plot to, say, highlight a row in a table containing the same data. With two of these plots side-by-side one can give each a unique source and “listen” to those independently.
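A sketch of that two-independent-plots case might look like the server below. The output IDs (p1, p2, out1, out2) and source labels ("left_src", "right_src") are names I've made up here, and it assumes a UI containing matching plotlyOutput() and verbatimTextOutput() elements.

```r
# Sketch only: assumes the UI contains plotlyOutput("p1"), plotlyOutput("p2"),
# verbatimTextOutput("out1") and verbatimTextOutput("out2")
server_two_sources <- function(input, output, session) {
  output$p1 <- plotly::renderPlotly({
    plotly::plot_ly(data = mtcars, x = ~ mpg, y = ~ hp,
                    type = "scatter", mode = "markers",
                    source = "left_src") |>
      plotly::event_register("plotly_click")
  })

  output$p2 <- plotly::renderPlotly({
    plotly::plot_ly(data = mtcars, x = ~ mpg, y = ~ wt,
                    type = "scatter", mode = "markers",
                    source = "right_src") |>
      plotly::event_register("plotly_click")
  })

  # Each listener only reacts to clicks from its own source
  output$out1 <- shiny::renderPrint({
    click_data <- plotly::event_data("plotly_click", source = "left_src")
    shiny::req(click_data)
    click_data
  })

  output$out2 <- shiny::renderPrint({
    click_data <- plotly::event_data("plotly_click", source = "right_src")
    shiny::req(click_data)
    click_data
  })
}
```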

If, however, we have several plots and want them to share a common x-axis (so that panning works across all of the plots) we need to “combine” the plots using plotly::subplot(). This doesn’t take a source argument itself, and when we provide a list of several plots, it produces a warning that

Warning: Can only have one: source

How, then, do we identify which subplot was clicked?

If each subplot contained a single “trace”, then curveNumber would correspond to that trace (in the order they were supplied to subplot) and we could identify which subplot was clicked. A small example of the server code (the UI would be the same) for such a setup might be

server <- function(input, output, session) {
  output$p <- renderPlotly({
    p1 <- plotly::plot_ly(data = mtcars,
                          y = ~ hp,
                          x = ~ mpg,
                          type = "scatter",
                          mode = "markers")

    p2 <- plotly::plot_ly(data = mtcars,
                          y = ~ wt,
                          x = ~ mpg,
                          type = "scatter",
                          mode = "markers")

    p3 <- plotly::plot_ly(data = mtcars,
                          y = ~ disp,
                          x = ~ mpg,
                          type = "scatter",
                          mode = "markers")

    s <- plotly::subplot(
      list(p1, p2, p3),
      shareX = TRUE,
      nrows = 3,
      heights = c(1, 1, 1)/3
    ) |>
      event_register("plotly_click")
    s$x$source <- "click_src" # subplot does not take a `source` argument
    s
  })

  output$out <- renderPrint({
    click_data <- event_data("plotly_click", source = "click_src")
    req(click_data)
    message("CLICK!")
    click_data
  })
}

Multiple traces - the second is “trace1” because JavaScript counts from 0

Because subplot doesn’t take a source argument, the (single) source needs to be added into the resulting object by force with the s$x$source line. This works, and we can get click data back from each subplot. In theory, curveNumber identifies which subplot was clicked.

However, if a subplot contained multiple traces (as my actual example did - a difficult to count number of traces that was updated as the underlying data changed… each different ‘color’ point you plot is a unique trace) then this gets complicated.

A minor update to the server, adding one additional “markers” trace to the second plot…

    p2 <- plotly::plot_ly(data = mtcars,
                          y = ~ wt,
                          x = ~ mpg,
                          type = "scatter",
                          mode = "markers") |>
      add_markers(y = ~ drat)                   # <- an additional trace

With one additional trace, it becomes difficult to determine which plot was clicked based on curveNumber alone
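If you did know how many traces each subplot contained, the mapping could be done by hand. The sketch below assumes curveNumber is assigned in the order the traces were supplied, as described above; traces_per_plot and subplot_from_curve are hypothetical names. It mostly shows how much bookkeeping that needs, and it breaks as soon as the trace counts change.

```r
# Hypothetical bookkeeping: number of traces in each subplot, in subplot order
traces_per_plot <- c(1, 2, 1)

# Map a zero-based curveNumber onto a one-based subplot index
subplot_from_curve <- function(curve_number, traces_per_plot) {
  # Break points are the first curveNumber belonging to each subplot
  findInterval(curve_number, cumsum(c(0, traces_per_plot)))
}

subplot_from_curve(0:3, traces_per_plot)  # 1 2 2 3
```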

So, how can we identify the subplot when we can’t count the traces? The solution appears to be to add another entry to the click-data using customdata…

server <- function(input, output, session) {
  output$p <- renderPlotly({
    p1 <- plotly::plot_ly(data = mtcars,
                          y = ~ hp,
                          x = ~ mpg,
                          type = "scatter",
                          mode = "markers",
                          customdata = "first_plot")       # <--

    p2 <- plotly::plot_ly(data = mtcars,
                          y = ~ wt,
                          x = ~ mpg,
                          type = "scatter",
                          mode = "markers",
                          customdata = "second_plot") |>   # <--
      add_markers(y = ~ drat, customdata = "second_plot")  # <--

    p3 <- plotly::plot_ly(data = mtcars,
                          y = ~ disp,
                          x = ~ mpg,
                          type = "scatter",
                          mode = "markers",
                          customdata = "third_plot")       # <--

    s <- plotly::subplot(
      list(p1, p2, p3),
      shareX = TRUE,
      nrows = 3,
      heights = c(1, 1, 1)/3
    ) |>
      event_register("plotly_click")
    s$x$source <- "click_src"
    s
  })

  output$out <- renderPrint({
    click_data <- event_data("plotly_click", source = "click_src")
    req(click_data)
    message("CLICK!")
    click_data 
  })
}

By adding some customdata it’s easy to determine which plot was clicked

In this example I’ve added a single customdata value to each plot so it will be recycled across all of the data points in each plot. I’ve also added the same "second_plot" value to both of the traces in the second plot, but you could further distinguish those if desired. You can also add a vector of customdata (one value per point, in order) to individually identify the records, such as a key value to deterministically reproduce the pointNumber functionality.
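Acting on the click is then a matter of reading that value back out. Here is a minimal sketch, assuming the click data arrives as a data frame with a customdata column alongside the fields listed earlier. fake_click is made-up data standing in for the result of event_data(), and clicked_plot is a hypothetical helper.

```r
# Stand-in for what event_data("plotly_click") might return
# (the values here are made up for illustration)
fake_click <- data.frame(
  curveNumber = 2, pointNumber = 7, x = 21.4, y = 3.9,
  customdata = "second_plot"
)

# Return the plot label for a click, or NA if there is nothing to read
clicked_plot <- function(click_data) {
  if (is.null(click_data) || !"customdata" %in% names(click_data)) {
    return(NA_character_)
  }
  as.character(click_data$customdata[[1]])
}

clicked_plot(fake_click)  # "second_plot"
clicked_plot(NULL)        # NA, e.g. before anything has been clicked
```

Inside the app this would be called on the result of event_data("plotly_click", source = "click_src") in place of fake_click.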

As a final check (after doing all the old-school research myself) I asked an AI how to identify which plot was clicked and it more or less gave the answers I’ve described here, with some (different) example code and all. It took a bit of prompting to get it to go further than just using the curveNumber but I was amazed that it really did produce a (more or less) working proof-of-concept with minimal refinement. I definitely need to jump straight to that more often instead of fiddling around with solutions that don’t work for too long.

Is there a better way to achieve this? Let me know! I’m pretty much not on the bird site any more but I can be found on Mastodon or use the comments below.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-06-17
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown      0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib         0.4.1   2022-11-02 [3] CRAN (R 4.2.2)
##  cachem        1.0.6   2021-08-19 [3] CRAN (R 4.2.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.4.1   2022-09-23 [3] CRAN (R 4.2.1)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest        0.6.30  2022-10-18 [3] CRAN (R 4.2.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.18    2022-11-07 [3] CRAN (R 4.2.2)
##  fastmap       1.1.0   2021-01-25 [3] CRAN (R 4.2.0)
##  fs            1.5.2   2021-12-08 [3] CRAN (R 4.1.2)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  htmltools     0.5.3   2022-07-18 [3] CRAN (R 4.2.1)
##  htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.3   2022-10-21 [3] CRAN (R 4.2.1)
##  knitr         1.40    2022-08-24 [3] CRAN (R 4.2.1)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  pkgbuild      1.4.0   2022-11-27 [1] CRAN (R 4.1.2)
##  pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.8.0   2022-10-26 [3] CRAN (R 4.2.1)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps            1.7.2   2022-10-26 [3] CRAN (R 4.2.2)
##  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.1.2)
##  rmarkdown     2.18    2022-11-09 [3] CRAN (R 4.2.2)
##  rstudioapi    0.14    2022-08-22 [3] CRAN (R 4.2.1)
##  sass          0.4.2   2022-07-16 [3] CRAN (R 4.2.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny         1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi       1.7.8   2022-07-11 [3] CRAN (R 4.2.1)
##  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  vctrs         0.5.2   2023-01-23 [1] CRAN (R 4.1.2)
##  xfun          0.34    2022-10-18 [3] CRAN (R 4.2.1)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml          2.3.6   2022-10-18 [3] CRAN (R 4.2.1)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Polyglot Exploration of Function Overloading

website@jcarroll.com.au (Jonathan Carroll) — Mon, 03 Apr 2023 00:00:00 +0000

I’ve been working my way through Exercism exercises in a variety of languages because I strongly believe every language you learn something about teaches you about all the others you know, and makes for useful comparisons between what features they offer. I was¹ Learning Me a Haskell for Great Good (a nod to the guide/book Learn You a Haskell for Great Good!) and something about Pattern Matching just seemed extremely familiar.

Pattern Matching is sort of like a case statement, but rather than just comparing literal values against some enum, it takes into consideration how the input “looks”. A simple example is to match against either an empty list [] (just that; an empty list) or a non-empty list denoted (x:xs). In Haskell, : is the “cons” operator (cons in lisp), which prepends an element to a list (++ is the operator that concatenates two lists), so this pattern is a first element x followed by the rest of the list, xs. The wildcard pattern _ matches “whatever”.

A map function definition (from here) is then

map _ []     = []
map f (x:xs) = f x : map f xs

These are two definitions for map, chosen according to which pattern the two arguments match. The first takes “whatever” (doesn’t matter, is ignored) and an empty list and just returns an empty list. The second takes some function f and a non-empty list, and prepends (:) f x (the function f applied to the first element of the list, x) to map f xs (the result of providing f and the rest of the list xs to map, recursively).
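
For readers more at home in JavaScript, roughly the same recursion can be sketched with array destructuring standing in for the (x:xs) pattern. This is only an analogy: the name mapRec is made up here, and JavaScript has to check the empty case with an explicit if rather than by pattern.

```javascript
// Recursive map, mirroring the two Haskell clauses.
function mapRec(f, list) {
  // Clause 1: map _ [] = []
  if (list.length === 0) return [];
  // Clause 2: map f (x:xs) = f x : map f xs
  const [x, ...xs] = list;
  return [f(x), ...mapRec(f, xs)];
}

console.log(mapRec(n => n * 2, [1, 2, 3])); // [ 2, 4, 6 ]
```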

Since Haskell is strongly typed, I don’t think pattern matching on its own can be used to define the same named function for different types (type classes are the usual tool for that), but it can certainly do something different depending on the pattern of the data. In this example, if the argument is an empty list, return 0; if the argument is a length-1 list (arg1 prepended to an empty list) then return arg1 * 100; and if the argument is a longer list, return the product of the first element and the second. The output below comes from calling fun [5.0] and fun [5.0, 5.0]

fun :: [Float] -> Float
fun [] = 0.0
fun (arg1:[]) = arg1 * 100.0
fun (arg1:arg2) = arg1 * (head arg2)

main = do
  print (fun [5.0])
  print (fun [5.0, 5.0])

500.0
25.0

Woo! A different function called depending on the input. I believe it might be possible to actually have optional arguments via Maybe (see the Data.Maybe module) but I couldn’t get an example to compile the way I wanted².

Rust has something similar but more specific to a case statement; a match expression can take patterns as options and return whichever matches (example from here)

fn main() {
    let input = 's';

    match input {
        'q'                   => println!("Quitting"),
        'a' | 's' | 'w' | 'd' => println!("Moving around"),
        '0'..='9'             => println!("Number input"),
        _                     => println!("Something else"),
    }
}

Moving around

Another common use of match is to switch between the variants Some and None of an Option, or Ok and Err of a Result (see here).

The familiarity of the Haskell pattern matching / function definition took me back to one of the very first programming ‘tricks’ I learned way back in the late 2000s working on my PhD, using Fortran; “function overloading”. I wasn’t formally taught programming at all (an oversight, given how important it became to doing my research), so I just had to pick up bits and pieces from people who knew more.

I had a bunch of integration routines³ which were slightly different depending on whether or not the limits were finite⁴, so I had to call the right one with various if statements. The ‘trick’ I was taught was to use INTERFACE / MODULE PROCEDURE blocks to “dispatch” depending on the function signature, or at least the number of arguments. This meant that I could just call integrate regardless of whether it was a signature with 4 arguments, or a signature with an additional argument if a bound was Infty.

A “small” (Fortran is hardly economical with page real-estate) example of this, following the Haskell example, defines two functions Fun1arg and Fun2arg which can be consolidated into fun with the INTERFACE block. Calling fun(x) or fun(x, y) is routed to the function with the relevant signature.

MODULE exampleDispatch
  IMPLICIT NONE

  INTERFACE fun
     MODULE PROCEDURE Fun1arg, Fun2arg
  END INTERFACE fun

  CONTAINS

    ! A function that takes one argument
    ! and multiplies it by 100
    REAL FUNCTION Fun1arg(arg1)
      IMPLICIT NONE
      REAL, INTENT( IN ) :: arg1
      Fun1arg = arg1 * 100.0
    END FUNCTION Fun1arg

    ! A function that takes two arguments
    ! and multiplies them
    REAL FUNCTION Fun2arg(arg1, arg2)
      IMPLICIT NONE
      REAL, INTENT( IN ) :: arg1, arg2
      Fun2arg = arg1 * arg2
    END FUNCTION Fun2arg

END MODULE exampleDispatch

PROGRAM dispatch

  USE exampleDispatch

  IMPLICIT NONE
  REAL :: a = 5.0
  ! fun comes from the generic interface in the module, so it is not redeclared here

  PRINT *, fun(a)
  PRINT *, fun(a, a)

END PROGRAM dispatch

   500.000000    
   25.0000000

That takes me back! I’m going to dig out my old research code and get it into GitHub for posterity. I’m also going to do the Fortran exercises in Exercism to reminisce some more.

So, not quite the same as the Haskell version, but it got me thinking about dispatch. R has several approaches. The most common is S3 in which dispatch occurs based on the class of the first argument to a function, so you can have something different happen to a data.frame argument and a tibble argument, but in both cases the signature has the same “shape” - only the types vary.

Wiring that up to behave differently for a list and for any other value looks like the following (the default case would break for anything that can’t be multiplied by a number, but it’s a toy example)

fun <- function(x) {
  UseMethod("fun")
}

fun.default <- function(x) { 
  x * 100
}

fun.list <- function(x) {
  x[[1]] * x[[2]]
}

fun(5)
fun(list(5, 5))

[1] 500
[1] 25

Another option is to use S4 which is more complicated but more powerful. Here, dispatch can occur based on the entire signature, though (and I may be wrong) I believe that, too, still needs to have a consistent “shape”. A fantastic guide to S4 is Stuart Lee’s post here.

An S4 version of my example could have two options for the signature; one where both x and y are "numeric", and another where x is "numeric" and y is "missing". "ANY" would also work in place of "numeric" and encompass a wider scope.

setGeneric("fun", function(x, y, ...) standardGeneric("fun"))

setMethod("fun", c("numeric", "missing"), function(x, y) {
  x * 100
})

setMethod("fun", c("numeric", "numeric"), function(x, y) {
  x * y
})

fun(5)
fun(5, 5)

[1] 500
[1] 25

So, can we ever do what I was originally inspired to do - write a simple definition of a function that calculates differently depending on the number of arguments? Aha - Julia to the rescue!! Julia has a beautifully simple syntax for defining methods on signatures: just write it out!

fun(x) = x * 100
fun(x, y) = x * y

println(fun(5))
println(fun(5, 5))

500
25

That’s two different signatures for fun computing different things, and a lot less boilerplate compared to the other languages, especially Fortran. What’s written above is the entire script. You can even go further and be specific about the types, say, mixing Int and Float64 definitions

fun(x::Int) = x * 100
fun(x::Float64) = x * 200

fun(x::Int, y::Int) = x * y
fun(x::Int, y::Float64) = x * y * 2

println(fun(5))
println(fun(5.))
println(fun(5, 5))
println(fun(5, 5.))

It doesn’t get simpler or more powerful than that!!
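
As a contrast, here is a sketch of the same toy fun in JavaScript, which has no built-in overloading: a later definition of a function simply replaces an earlier one, so the dispatch on argument count has to be written by hand. The lookup-table approach below (and the name implementations) is just one way to do it for illustration, not an established idiom.

```javascript
// One implementation per argument count, keyed by arity.
const implementations = {
  1: x => x * 100,
  2: (x, y) => x * y,
};

function fun(...args) {
  const impl = implementations[args.length];
  if (impl === undefined) {
    throw new TypeError(`fun: no method for ${args.length} argument(s)`);
  }
  return impl(...args);
}

console.log(fun(5));    // 500
console.log(fun(5, 5)); // 25
```

Compared with the Julia version, the “method table” is something you have to build and maintain yourself, and nothing checks the argument types.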

I’ve added all these examples to a repo split out by language, and some instructions for running them (assuming you have the language tooling already set up).

Do you have another example from a language that does this (well? poorly?) or similar? Leave a comment if you have one, or find me on Mastodon.

¹ In part due to a strong representation of Haskell at my local Functional Programming Meetup.
² I’m highly likely doing something wrong - I never wrote any Haskell before last week.
³ Numerical Recipes in Fortran 90 was about the most important book we had for writing code, basically nothing else was trusted - getting a digital copy of the code was considered a sign of true power.
⁴ What, you don’t have to integrate up to infinity in your code?

Version Zero Easter Eggs

website@jcarroll.com.au (Jonathan Carroll) — Fri, 31 Mar 2023 00:00:00 +0000

I’ve just finished reading ‘Version Zero’ by David Yoon. I really enjoyed it. There’s some (javascript) code on some separator pages between some of the chapters that is loosely tied into the plot and general theme of the book. I love solving puzzles, so what was I supposed to do, just leave it at that?

Incidentally, I’ve hooked my reading list into my mini blog so my ‘Currently Reading’ list is (ideally) up to date.

I can’t help myself when it comes to puzzles or easter eggs like this. Decrypting the new Australian 50c coin puzzle triggered a conversation with one of our top spy agencies. I learned a whole lot from solving this Gaussian Primes puzzle.

I very much enjoyed the book. I won’t give away too much, but it does a great job of calmly building up a story, the characters, the plot and then ramping up the excitement. In the acknowledgements the author thanks his genius nephew Eric Yoon (yoonicode.com - what a great URL!) for the easter egg code.

I figured I’d try to run the code and see what it does. I wanted to carefully read this - especially given the theme of the book - to make sure it wasn’t going to delete my networking setup or something. The first step was to get the code from the (paper) pages into a computer. I looked around - web search, GitHub, Eric’s site - and couldn’t find an online copy anywhere. I tried searching for a few other unique-looking terms in the code but found nothing. Has no one written up a discussion about this easter egg? The book is from 2021, so it’s not that new. I came across it randomly walking the shelves at my local library (credit to librarians for prominently featuring great suggestions!). Maybe I’ve just overlooked a write-up somewhere, but maybe I’m the first?

I had a go at OCR via tesseract but since it’s javascript and not natural-language text, it didn’t have much luck. There are supposedly some language packs for tesseract but none of them helped with the images I had.

So, with no digital copy of the code to pull in, I guess the only thing to do is to ~~forget about it and move on~~ MANUALLY TYPE ALL 80-SOMETHING LINES IN. This was … interesting, but not too bad, really. The choice of font in the book and somewhat low dpi meant that the difference between ( and { was very subtle. Having a little bit of domain knowledge helps in this case. Entering the code was otherwise just a matter of typing until I got to this (or what looks something like this) line

subunit.push(btoa("ß]xëÏz×|ç¼Û¾v"));

Oh… No.

My best guess was to use Google Docs’ ‘insert character’ which lets you draw the symbol you’re looking for and gives some options.

Google Docs’ insert character dialog

The next step was to actually confirm that I hadn’t made any transcription errors, which should show up if I try to run the code. I checked that I could run a file of javascript code with node and it worked, sort of. It failed about halfway through because a function wasn’t defined - btoa and atob are browser functions that older versions of Node don’t provide as globals (newer versions do, marked as legacy). I added some definitions for atob and btoa and that was resolved

atob = a => Buffer.from(a, 'base64').toString('binary')
btoa = b => Buffer.from(b, 'binary').toString('base64')

The code didn’t run, though, because some values didn’t convert to BigInt, probably because I’d entered them wrong.

I carefully went through what I’d entered and did realise that there was only a subtle difference between 1 (one) and l (lowercase L) and made some fixes.

Apart from that, the code ran until it hit the BigInt conversion of a particular value and failed. I forced the code to skip over that value and got to an answer. The result of running the code is similar to the Advent of Code 2022 Day 10 Part Two puzzle which writes out # and spaces in lines to spell out ASCII-art words. Clearly my code was broken, because I could see what it was supposed to spell out, and there were errors.

I figured I had to trace back through the result being built up and figure out which values end up on which lines. Extra fun, because there’s a filter in the middle that removes some of the input. Sure enough, one of the offending lines was the unicode-salad line above - great. The other seemed to be

const datetime = new Date(1997, 7, 24);

Super great - this is going to involve a timezone issue, isn’t it?

Back to the unicode, I searched again for this specific line (minus the unicode) and actually did get a hit - Google Books has a copy of the (Czech?) translation of the book and returns this line. Not precisely (something hasn’t encoded correctly) and not selectable in the book, but selectable in the Google result for it. That wasn’t much help after all.

Let’s walk through the code and see how it works before we resolve it. A full copy (with my own annotations and fixes) is here, if anyone else wants to not have to type all of it in.

const VERSION_NUMBER = 0;
const AGENT = "BLACK HALO";
const year = 0x2018;
const enc = [
    021, 024, 015, 015,
    026, -031, 030, 016,
    034, 027, 021, 034,
    021, 014, 025, -022,
    017, 016, 032, 027
];
let res = ["You are infinite"];
RANDOM_SEED = 20879976793454946324n;

## 0
## BLACK HALO
## 8216
## 17,20,13,13,22,-25,24,14,28,23,17,28,17,12,21,-18,15,14,26,23
## You are infinite
## 20879976793454946324

So far so good. Some constants and the start of a result res - an array containing some text.

if (VERSION_NUMBER % 2 < 1) res.shift();

With VERSION_NUMBER == 0 this just drops the first (and only) value of res, so we’re back to an empty result.

res.push(enc.map((i, idx) => {
    return String.fromCharCode(
        AGENT.charCodeAt(
            idx % AGENT.length
        ) - i
    );
}).reduce((i, j) => {
    return i.toString() + j.toString();
}));
res[0]

## 18465903081007629328

This does some math on the characters of AGENT and produces what will eventually be the second line of actual output (currently the first).
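
Spelled out, the “math” is a per-character subtraction: each entry of enc is taken away from the character code of the matching letter of AGENT (wrapping around after ten characters), and the result is turned back into a character. The sketch below restates that with the octal literals written in the modern 0o form, since legacy literals such as 021 are a syntax error in strict mode; the variable names are mine.

```javascript
const AGENT = "BLACK HALO";
const enc = [
  0o21, 0o24, 0o15, 0o15,
  0o26, -0o31, 0o30, 0o16,
  0o34, 0o27, 0o21, 0o34,
  0o21, 0o14, 0o25, -0o22,
  0o17, 0o16, 0o32, 0o27,
];

// Subtract each offset from the code of the matching AGENT letter.
const digits = enc.map((shift, idx) => {
  const letterCode = AGENT.charCodeAt(idx % AGENT.length);
  return String.fromCharCode(letterCode - shift);
});

// First character: "B" is 66, minus 0o21 (17) is 49, which is "1".
console.log(digits[0]);
console.log(digits.join("")); // 18465903081007629328
```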

res.unshift(atob("MzU3NzU1MDM2NTgxMDMzNTg0OTU="));
res[0]

## 35775503658103358495

This becomes the first line of output due to the unshift.

res.push(
    (8939935261623587079n << 2n).toString() 
);
res.push((RANDOM_SEED & 0x18C445CAC40447832n | 0n).toString());
res.push("" + (151845383424178857009896n / BigInt(year)));

## 35759741046494348316
## 18465906380616247312
## 18481667894861107231

These become the third, fourth, and fifth lines of the output.

The next lines set up something to be used later,

let as_json = {
    coordinates: '{"x": 2, "y": 5}',
    tolerance: 0.1,
    subunit: [2 ** 8]
};
const c = JSON.parse(as_json.coordinates);

## { coordinates: '{"x": 2, "y": 5}', tolerance: 0.1, subunit: [ 256 ] }
## { x: 2, y: 5 }

The next line then adds a separator of 0 to the result

res.push((z => `value: ${z}`.slice(7))((x => x >>> 42)(3 ** 5)));

## 0
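
That one-liner is denser than it looks. 3 ** 5 is 243; the unsigned right shift >>> only uses the low five bits of its shift count, so >>> 42 behaves like >>> 10 and 243 becomes 0; the template string is then "value: 0", and slice(7) drops the seven characters of "value: ". A step-by-step sketch (the intermediate names are mine):

```javascript
const power = 3 ** 5;                // 243
const effectiveShift = 42 % 32;      // shift counts wrap modulo 32, giving 10
const shifted = power >>> 42;        // same result as power >>> 10
const labelled = `value: ${shifted}`;
const separator = labelled.slice(7); // drop the 7 characters of "value: "

console.log(power, effectiveShift, shifted, JSON.stringify(separator));
```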

The next part is a bit of a red herring since it sets up a subunit object

let subunit = as_json["subunit"];
eval("subunit" + `${String.fromCharCode(46)}pop()`);

but the eval results in

## subunit.pop()

so that is back to empty.

This adds some data to subunit

subunit.push(69 + 114 + 105 + 99 + 32 + 89 + 111 + 111 + 110);

## 840

but again it’s overwritten with

subunit[0] = Math.round(euclidianDistance(c.x, c.y, 48, 1967.46095)) + "4568824394612736"; 
[...]
/**
    * @returns the distance between 2d point (x, y) and (x1, y1)
    */
function euclidianDistance(x, y, x1, y1) {
    return Math.sqrt(((x - x1) ** 2) + ((y - y1) ** 2));
}

## 19634568824394612736

This is the first line of the second block of output.
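
One detail about the discarded subunit.push line: the nine addends are not arbitrary. Read as character codes they spell out a name, which this small sketch confirms (the codes are copied from that line; the variable names are mine):

```javascript
const codes = [69, 114, 105, 99, 32, 89, 111, 111, 110];

const total = codes.reduce((sum, code) => sum + code, 0);
const hidden = String.fromCharCode(...codes);

console.log(total);  // 840
console.log(hidden); // Eric Yoon
```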

The next line of output should come from the code that I have as

subunit.push(btoa("ß]xëÏz×|ç¼Û¾v"));
res = res.concat(subunit);

but that produces

## 3114689613znvNu+dg==

which doesn’t convert to BigInt at all. We’ll come back to that.

const str = "MjQyNDI4NzczNDQ0MjgwNjQ3Njg=bMTk2MTc2ODAxMTY0MTIzMTc2OTY=bMTk2MzQ1Njg0OTI2MDgzODkxMjA=bMA=="; 
res = res.concat(str.split("b").map(b => atob(b)));

splits up the str at the letter b and runs atob over the pieces

## [
##   '24242877344428064768',
##   '19617680116412317696',
##   '19634568492608389120',
##   '0'
## ]

providing the rest of the second block of output and the next separator.

The next lines set up a Date object and extract part of its string representation (which is in the local timezone, but it’s just taking the "19" from "1997")

const datetime = new Date(1997, 7, 24);
res.push(
    datetime.toString().slice(11, 13) +
        (
            634601705079659136n +
            BigInt(datetime.getTime())
        )
);

## 19634602577426259136

which looks okay, but gives the wrong value on the first line of the last block - another one to come back to.

The next lines were fun to enter and validate (not)

res = res.concat(
    [
        "Mjg4NDIxOTU1MjI5NzAyMDYyMDg=", /** block 3, line 2 */
        "MTVkIGhlcnJpbmcgZ2V0IHJla3Q=", 
        "MTEwNTI5MDA1Mjk2MDU5NzY2MTM0",
        "MjQyMzA1MjI2OTg2ODIzNjM5MDQ=", /** block 3, line 3 */
        "SG9wZSB5b3UgbGlrZSBSZWdleCE=", 
        "MTk1MjA0NjkyMDUyODYzMDQ0MDM=",
        "MjE5MjQ2NjY0OTUzMjkxMjQzNTI=", /** block 3, line 4 */
        "MjYwMjg2MDQ4NjAyODMwNTUxMDI=",
        "MTk2MzQ2MDI1OTM2MDc1MTUxMzY=", /** block 3, line 5 */
        "TG92ZSwgUGlsb3QuIDwzICA8MyA=", 
        "MzA0NTgyNTg0Mzk1NzM4OTU3OTM5"
    ].filter( 
       i => i.match(/M[j|T].+[QUINOA][x12][DjTLMNOP]{2}[^aeiou]\*?.{1,5}[a-zA-Z5]+=/g) 
    ).map(atob)
).map((i, j) => {
    if(j > 5 && j < 12 && j != 7) {
        return BigInt(i) & BigInt(31775n << 50n);
    }
    return i;
});

This involves filtering some entries from the big block of encoded text with the regex, running the remaining ones through atob, and appending them to res. The final .map then runs over the whole of res, but the condition only lets it touch indices 6 to 11 (skipping 7), which are the second block of values and its separator; those get ANDed with a bit mask.
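
The mask itself is worth a closer look. 31775 is 111110000011111 in binary (five ones, five zeros, five ones), and shifting it left by 50 places lines it up with the first fifteen columns of a 65-column row, so everything after the first two letters is blanked. A sketch using the subunit[0] value from earlier (the helper name render is mine, and it applies the same "#" for 1, " " for 0 rule as the program's final loop):

```javascript
const mask = 31775n << 50n;
const row = 19634568824394612736n; // subunit[0] from earlier
const masked = row & mask;

// "#" for each 1 bit, " " for each 0 bit.
const render = n => n.toString(2).replace(/1/g, "#").replace(/0/g, " ");

console.log(31775n.toString(2)); // 111110000011111
console.log(masked.toString());  // 19634568475428519936
console.log(`[${render(masked)}]`);
```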

While debugging this, I found another easter egg hidden within - one that wouldn’t be found just by running the code itself. Some of the lines filtered out by the regex convert to plaintext!

atob("MTVkIGhlcnJpbmcgZ2V0IHJla3Q=")
atob("SG9wZSB5b3UgbGlrZSBSZWdleCE=")
atob("TG92ZSwgUGlsb3QuIDwzICA8MyA=")

## 15d herring get rekt
## Hope you like Regex!
## Love, Pilot. <3  <3

Niiiiice!

The final lines of code take these values, convert to binary, and print a # for each 1 (and a space otherwise)

for (const i of res) {
    const bin = BigInt(i).toString(2);
    let ln = "";
    for (const j of bin) ln += j == "0" ? " " : "#";
    console.log(ln);
}

If you do that, a message (slightly corrupted) appears.
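
To see the rendering on a single value, here is the same loop wrapped in a function (renderRow is my name for it) and applied to the first entry of res from earlier. Each row is 65 bits wide: seven letters of five columns, separated by five-column gaps. Because toString(2) drops leading zeros, this only lines up when the leftmost column of a row is filled.

```javascript
// Render one value as a row of "#" (1 bits) and " " (0 bits).
function renderRow(value) {
  let line = "";
  for (const bit of BigInt(value).toString(2)) {
    line += bit === "0" ? " " : "#";
  }
  return line;
}

console.log(renderRow("35775503658103358495"));
console.log(JSON.stringify(renderRow("0"))); // a separator row is a single space
```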

I decided to work backwards, since I was fairly sure what the ‘right’ solution should be. Taking those lines (manually corrected), converting them all the way back through the processing in reverse, I could see what the ‘right’ code should be.

The unicode line that produces what I think is the “right” solution is

subunit.push(btoa("ß]xëÏyÓm¼÷\x8DùÓ}ú"));

The characters are mainly close but not perfect, so maybe a LOCALE issue? Something to do with Linux (which I’m on) vs Windows?
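
Working backwards for this line can be sketched as follows, assuming the intended row is "## ##" followed by a gap and a single "#" (my reading of what the second row of the middle block should be) and that rows are 65 columns wide. The helper names are made up, and Buffer is the same Node stand-in for atob used earlier.

```javascript
// Convert one row of "#" and " " back into the decimal string stored in res.
function rowToDecimal(row, width = 65) {
  const bits = row.padEnd(width, " ").replace(/#/g, "1").replace(/ /g, "0");
  return BigInt("0b" + bits).toString();
}

// Invert btoa: the Latin-1 text whose base64 encoding is `digits`.
function textForBtoa(digits) {
  return Buffer.from(digits, "base64").toString("latin1");
}

const digits = rowToDecimal("## ##     #");
const text = textForBtoa(digits);
const hexCodes = [...text].map(ch => ch.charCodeAt(0).toString(16)).join(" ");

console.log(digits);   // 31146895022894350336
console.log(hexCodes); // df 5d 78 eb cf 79 d3 6d bc f7 8d f9 d3 7d fa
```

Those fifteen code points match the corrected string above character for character. One of them, 0x8d, is a non-printing control character, which may be part of why the printed line can’t be transcribed faithfully.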

The date line seems to be off by exactly 16 hours and 30 minutes which is disturbingly likely to be a timezone issue. I’m at GMT+10:30 (Adelaide, South Australia) at the moment. StackOverflow seems to have a lot of angry comments regarding whether or not this is an issue for Date(). I seem to be able to get the “right” solution with

const datetime = new Date(1997, 7, 24, 16, 30);
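
The offset makes sense if the original was run somewhere seven hours behind UTC: midnight in Adelaide on that date is at UTC+9:30 (no daylight saving in August), and 16 hours 30 minutes later is 07:00 UTC, i.e. midnight at UTC-7. That reading is an inference, not something the book states. Under that assumption the value can be built with Date.UTC so it no longer depends on the machine’s timezone (variable names here are mine):

```javascript
const MS_PER_MINUTE = 60 * 1000;

// Local midnight on 24 Aug 1997 in Adelaide (UTC+9:30), expressed in UTC.
const adelaideMidnight = Date.UTC(1997, 7, 23, 14, 30);
// The manual fix: 16 hours 30 minutes later.
const shifted = adelaideMidnight + (16 * 60 + 30) * MS_PER_MINUTE;
// Midnight on 24 Aug 1997 at UTC-7, the assumed original timezone.
const assumedOriginal = Date.UTC(1997, 7, 24, 7, 0);

// The "19" prefix is hardcoded here; the original slices it from toString().
const line = "19" + (634601705079659136n + BigInt(assumedOriginal));

console.log(shifted === assumedOriginal); // true
console.log(line);                        // 19634602577485659136
```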

With all that in place, it’s time to run all of the code! If I do that, I get…

#####     #####     #####     #####     #####     #   #     #####
#         #   #     #   #     #           #       #   #     #    
#####     #   #     #####     #  ##       #       #   #     ###  
#         #   #     # #       #   #       #        # #      #    
#         #####     #  ##     #####     #####       #       #####
 
#   #     #####                                                  
## ##     #                                                      
# # #     ###                                                    
#   #     #                                                      
#   #     #####                                                  
 
#   #     #####     #####     #         #         #####          
##  #     #   #     #         #         #         #              
# # #     #   #     ###       #         #         ###            
#  ##     #   #     #         #         #         #              
#   #     #####     #####     #####     #####     #####

which is fitting, and thoroughly satisfying to finally produce.

I also ran this code (with my corrections, minus the atob and btoa definitions) over on jsfiddle.net and it seems to give the right solution, which makes me think perhaps it really is an error in the code or how it was printed.

What an adventure! I learned a lot of javascript (how to run it with node and in a browser for debugging), played with tesseract, and learned about entering unicode. I’m sending this to Eric Yoon for comment and will update if I hear anything.

As a side note for this post, you’ll notice that the code blocks are all nicely rendered as usual - in this case they’re the actual javascript from the easter egg code. {knitr} does have a way to evaluate javascript in code chunks with the node engine, but that essentially runs node -e 'CODE' on each chunk independently, so you can’t define a variable in one chunk then reference it in another. That wasn’t sufficient for this exploration. I did find an (old) implementation that uses {V8} in Yihui’s (already experimental) {runr}, but it was written for a much older version of {knitr} and was out of date.

So, of course the thing to do was ~~just hardcode the output~~ SHAVE A YAK AND UPDATE THE IMPLEMENTATION. If you’d like to have javascript code chunks in your Rmd, I’ve made a pull request to that original implementation and have my own fork.

It seems to work okay, with the exception that it doesn’t pull in Buffer so my custom atob function doesn’t work, and V8 doesn’t seem to provide an atob of its own. The persistent context also causes a problem: const and let declarations end up being evaluated more than once, and it doesn’t like that. Otherwise, variables persist across chunks just fine - these chunks are fully live:

Define a variable x

x = 1 + 5;

## 6

Then continue the block

x + 12

## 18

So, that’s working.

As always, leave a comment if you have one, or find me on Mastodon (I’m much less on Twitter these days). If you have a correction or annotation to add to the code it’s here.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-03-31
##  pandoc   2.19.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.13    2022-09-24 [1] CRAN (R 4.1.2)
##  bookdown      0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib         0.4.1   2022-11-02 [3] CRAN (R 4.2.2)
##  cachem        1.0.6   2021-08-19 [3] CRAN (R 4.2.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.4.1   2022-09-23 [3] CRAN (R 4.2.1)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  curl          4.3.3   2022-10-06 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest        0.6.30  2022-10-18 [3] CRAN (R 4.2.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.18    2022-11-07 [3] CRAN (R 4.2.2)
##  fastmap       1.1.0   2021-01-25 [3] CRAN (R 4.2.0)
##  fs            1.5.2   2021-12-08 [3] CRAN (R 4.1.2)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  htmltools     0.5.3   2022-07-18 [3] CRAN (R 4.2.1)
##  htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.3   2022-10-21 [3] CRAN (R 4.2.1)
##  knitr         1.40    2022-08-24 [3] CRAN (R 4.2.1)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  pkgbuild      1.3.1   2021-12-20 [1] CRAN (R 4.1.2)
##  pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.8.0   2022-10-26 [3] CRAN (R 4.2.1)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps            1.7.2   2022-10-26 [3] CRAN (R 4.2.2)
##  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.1.2)
##  rmarkdown     2.18    2022-11-09 [3] CRAN (R 4.2.2)
##  rstudioapi    0.14    2022-08-22 [3] CRAN (R 4.2.1)
##  runr          0.0.7   2023-03-31 [1] local
##  sass          0.4.2   2022-07-16 [3] CRAN (R 4.2.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny         1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi       1.7.8   2022-07-11 [3] CRAN (R 4.2.1)
##  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  V8          * 4.2.2   2022-11-03 [1] CRAN (R 4.1.2)
##  vctrs         0.5.2   2023-01-23 [1] CRAN (R 4.1.2)
##  xfun          0.34    2022-10-18 [3] CRAN (R 4.2.1)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml          2.3.6   2022-10-18 [3] CRAN (R 4.2.1)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

The Most Complex Puzzle I've Ever Solved

website@jcarroll.com.au (Jonathan Carroll) — Sat, 29 Oct 2022 00:00:00 +0000

Don’t show me puzzles, unless you want to be responsible for me staying up too late solving them. I’m far too easily nerd-sniped. This one was certainly the most complex I’ve ever solved. Quite complicated too, but definitely the most complex (you’ll see).

A few days ago a colleague of mine pointed me to this FiveThirtyEight article, which isn’t new (2018), but does feature someone else we both work with as the author of the first puzzle. Brandon is a mathematician-turned-computational-biologist/geneticist and a top-level MIT puzzler.

The puzzle in the article consists of this image (you may want to save and enlarge, yourself)

Studies in two-factor authentication

and the clue:

Ugh! Dad says the computer will hurt my eyes, but I doubt that’s his prime concern. Time to see what requires such complex security.

How could I possibly pass up an opportunity to solve such a cool puzzle?

I played with a few ideas, but can’t say I made any progress. My colleague also pointed me to a solution from the MIT puzzles website (I won’t spoil anything just yet) after which things started to make a lot more sense.

The critical words in the clue are “prime” and “complex”… we’re going to be dealing with Gaussian Primes; a special case of Gaussian Integers.

I would say “math warning” but if math scares you, you probably could do with some scaring.

A Gaussian Integer is a complex number (with a real and an imaginary part) $z = a + bi$ where both $a$ and $b$ are integers. A Gaussian Integer is a Gaussian Prime

“if and only if either its norm is a prime number, or it is the product of a unit ($\pm 1$, $\pm i$) and a prime number of the form $4n + 3$”

The first part of this requires that the norm ($a^2 + b^2$) is itself a prime number. This will be a positive, real integer. The alternative means that $a=0$ or $b=0$ and we can write the absolute value of the other (which will be prime) as $4n + 3$ for some non-negative $n$.
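As a quick worked check of both branches of that definition, here are a few values (most of them reappear in the sanity checks further on):

```latex
\begin{aligned}
3 + 20i &: \quad 3^2 + 20^2 = 409 \text{ (prime)} &&\Rightarrow \text{Gaussian Prime} \\
1 + 3i  &: \quad 1^2 + 3^2 = 10 = 2 \times 5 &&\Rightarrow \text{not a Gaussian Prime} \\
3       &: \quad b = 0,\ 3 = 4 \cdot 0 + 3 &&\Rightarrow \text{Gaussian Prime} \\
5       &: \quad b = 0,\ 5 = 4 \cdot 1 + 1 &&\Rightarrow \text{not a Gaussian Prime, since } 5 = (2 + i)(2 - i)
\end{aligned}
```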

Working with complex numbers in R is actually very well supported. It’s not something you’d work with a lot in the vast majority of data science (“the average number of sprockets produced in the first quarter was $2 + 3i$”?) but R has complex as an atomic type and many functions support operations on this.
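A small illustration of that support, using only base R functions (the value of z is arbitrary and just for demonstration):

```r
z <- 3 + 4i          # a complex literal; is.complex(z) is TRUE
Re(z)                # real part: 3
Im(z)                # imaginary part: 4
Mod(z)               # modulus: 5
Conj(z)              # complex conjugate: 3-4i
z * (1 - 2i)         # ordinary arithmetic just works: 11-2i
```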

Okay, with that in mind, we can generate a bunch of Gaussian Primes. In base R, of course. First, we’re going to need a way to determine if an integer is a prime number. We’re not worried about performance, so let’s just try to divide our target number $n$ by every integer from 2 up to $\sqrt n$; if anything divides cleanly (the remainder is zero) it’s not a prime number. That can be implemented (shamelessly stolen from StackOverflow) as

is.prime <- function(n) n == 2L || all(n %% 2L:max(2,floor(sqrt(n))) != 0)

Sanity check:

is.prime(7)

## [1] TRUE

is.prime(131)

## [1] TRUE

is.prime(100)

## [1] FALSE
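One edge case worth knowing about: the one-liner reports 1 as prime, because 1 %% 2 is non-zero. A minimal sketch of a stricter variant is below; the name is_prime_strict is hypothetical and this is not what the rest of the post uses.

```r
# hypothetical stricter variant: rejects anything below 2 and non-whole numbers
is_prime_strict <- function(n) {
  if (n < 2 || n != floor(n)) return(FALSE)
  if (n < 4) return(TRUE)               # 2 and 3 are prime
  all(n %% 2:floor(sqrt(n)) != 0)       # trial division from 2 up to sqrt(n)
}

is_prime_strict(1)    # FALSE
is_prime_strict(7)    # TRUE
is_prime_strict(100)  # FALSE
```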

Next we’ll need a way to tell if a number is a Gaussian Prime. Implementing the definition above, and vectorizing it, involves working with the real (Re()) and imaginary (Im()) parts of a complex number

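isGP <- function(n) {
  # both parts non-zero: Gaussian Prime if the norm is prime
  (Re(n) != 0 && Im(n) != 0 && is.prime(Re(n)^2+Im(n)^2)) ||
    # purely imaginary: |Im| must be a prime of the form 4n + 3
    (Re(n) == 0 && is.prime(abs(Im(n))) && abs(Im(n)) %% 4 == 3) ||
    # purely real: |Re| must be a prime of the form 4n + 3
    (Im(n) == 0 && is.prime(abs(Re(n))) && abs(Re(n)) %% 4 == 3)
}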
isGPv <- Vectorize(isGP)

Sanity check:

isGP(-5-4i)

## [1] TRUE

isGP(3)

## [1] TRUE

isGP(1 + 3i)

## [1] FALSE

isGP(3 + 20i) # https://planetmath.org/gaussianprime

## [1] TRUE

Now we can build a grid of integers on the complex plane and mark which are Gaussian Primes. For the sake of this puzzle, we’ll limit to 250 integers in each positive direction

x <- expand.grid(real = 0:249, im = 0:249)
x$complex <- x$real + (x$im)*1i
x$isGP <- isGPv(x$complex)
head(x)

##   real im complex  isGP
## 1    0  0    0+0i FALSE
## 2    1  0    1+0i FALSE
## 3    2  0    2+0i FALSE
## 4    3  0    3+0i  TRUE
## 5    4  0    4+0i FALSE
## 6    5  0    5+0i FALSE

I’m solving this in base R, but we can use a package for visualising things…

library(ggplot2)
gg <- ggplot(x, aes(real, im, fill = isGP)) +
  geom_tile() +
  scale_fill_manual(
    values = c(`TRUE` = "black", `FALSE` = "white"), guide = "none"
  ) +
  theme_void() + 
  theme(aspect.ratio = 1)
gg

Careful inspection shows that this does match the puzzle image, except that the puzzle version has some additional coloured pixels… Interesting.

Reading the puzzle image (fetched directly, because Chrome wants to give me a .webp and maybe I’m getting too old to deal with that) in as pixel data into three channels (R, G, B) (yes, one external package, fine)

img <- "puzzle.png"
# download.file("https://fivethirtyeight.com/wp-content/uploads/2018/01/puzzle1.png", img)
img <- png::readPNG(img)

we can rescale these to 8-bit numbers, convert to hex, then combine into hex colours

img <- list(
  red = img[,,1]*255, 
  green = img[,,2]*255, 
  blue = img[,,3]*255
)
img <- lapply(img, as.hexmode)
img <- matrix(
  do.call(paste0, img), 
  nrow = 250, ncol = 250, 
  byrow = TRUE
)
# identify the locations of pixels 
#  that are not black or white
idx <- which(! img == "000000" & ! img == "ffffff", arr.ind = TRUE)
cols <- img[idx]

d <- as.data.frame(idx)
# image reads with (0,0) top left
#  so flip it
d$col <- 250 - d$col 
# start at 0
d$row <- d$row - 1 
d$color <- paste0("#", cols)
head(d)

##   row col   color
## 1  57 178 #0000ff
## 2  47 140 #cccc00
## 3  46 125 #ff0000
## 4  60 109 #0000ff
## 5  15 104 #0000ff
## 6  58 103 #ff0000
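The rescale-then-hex step can also be sketched for a single pixel with sprintf(). The helper name to_hex is hypothetical, and it assumes channel values in [0, 1], which is the range png::readPNG() returns.

```r
# hypothetical helper: r, g, b are assumed to lie in [0, 1]
to_hex <- function(r, g, b) {
  sprintf("#%02x%02x%02x",
          as.integer(round(r * 255)),   # round first to avoid 152.9999... surprises
          as.integer(round(g * 255)),
          as.integer(round(b * 255)))
}

to_hex(1, 153 / 255, 25 / 255)      # "#ff9919"
to_hex(c(0, 1), c(0, 1), c(1, 1))   # "#0000ff" "#ffffff"
```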

These colours can be identified just by entering them into a search engine, or by using one of the very recent RStudio builds

known_colors <- c(red = "#ff0000", 
                  orange = "#ff9919", 
                  yellow = "#cccc00",
                  green = "#00ff00", 
                  blue = "#0000ff", 
                  purple = "#7f00cc"
)

d$colorname <- names(known_colors)[match(d$color, known_colors)]
head(d)

##   row col   color colorname
## 1  57 178 #0000ff      blue
## 2  47 140 #cccc00    yellow
## 3  46 125 #ff0000       red
## 4  60 109 #0000ff      blue
## 5  15 104 #0000ff      blue
## 6  58 103 #ff0000       red

This looks fantastic in the recent RStudio versions, FYI

Appropriate colour highlighting in RStudio

Now comes the hard part (and I’ll gladly admit I’d never have figured this out without seeing a solution first) - if we assume the coloured pixels represent complex numbers, and we can factor those into the product of two Gaussian Primes (remember the clue?) then we can do something with those. So, how do we find the factors? Multiplying two numbers, even complex numbers, is pretty straightforward. Going the other way and figuring out which two prime factors a number has (even a regular integer) is hard enough that it’s the foundation of cryptographic keys.

More searching turns up this resource which details an approach:

There are three cases:

The prime factor p of the norm is 2: This means that the factor of the Gaussian integer is 1+i or 1-i.

The prime factor p of the norm is multiple of 4 plus 3: this value cannot be expressed as a sum of two squares, so p is not a norm, but $p^2$ is. Since $p^2 = p^2 + 0^2$, and there is no prime norm that divides $p^2$, the number p + 0i is a Gaussian prime, and the repeated factor p must be discarded.

The prime factor p of the norm is multiple of 4 plus 1: this number can be expressed as a sum of two squares, by using the methods explained in the sum of squares page. If $p = m^2 + n^2$, then you can check whether m + ni or m − ni are divisors of the original Gaussian number.

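This translates to: given the norm $N$ of a Gaussian Integer, each prime factor $p$ of $N$ points to candidate Gaussian Prime factors: if $p = 2$ the candidates are $1 \pm i$; if $p$ is of the form $4n + 3$ the candidate is $p$ itself; and if $p = m^2 + n^2$ (the $4n + 1$ case) the candidates are $m \pm ni$.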
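To make that concrete, here is the recipe worked by hand for $43 + 80i$, the number used in the sanity checks below:

```latex
\begin{aligned}
N(43 + 80i) &= 43^2 + 80^2 = 8249 = 73 \times 113 \\
73 &= 4 \cdot 18 + 1 = 8^2 + 3^2 &&\Rightarrow \text{candidates } 8 \pm 3i,\ 3 \pm 8i \\
113 &= 4 \cdot 28 + 1 = 8^2 + 7^2 &&\Rightarrow \text{candidates } 8 \pm 7i,\ 7 \pm 8i \\
(8 + 3i)(8 + 7i) &= 64 + 56i + 24i + 21i^2 = 43 + 80i
\end{aligned}
```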

So, we’ll need a function to operate on the norm of our Gaussian Integer. The norm itself is defined as

complexnorm <- function(z) {
  Re(z)^2 + Im(z)^2
}

Sanity check:

complexnorm(3 + 4i)

## [1] 25
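The reason the norm is useful for factoring is that it is multiplicative: the norm of a product equals the product of the norms, so prime factors of the norm point to factors of the number. A small check with arbitrary values (z and w are just examples):

```r
complexnorm <- function(z) Re(z)^2 + Im(z)^2   # same definition as above

z <- 2 + 1i
w <- 3 + 2i
z * w                               # 4+7i
complexnorm(z) * complexnorm(w)     # 5 * 13 = 65
complexnorm(z * w)                  # 4^2 + 7^2 = 65
```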

We can implement the approach above as

norm_factors <- function(N) {

  if (N %% 2 == 0) {
    ## N is 2: the candidate is 1+i or 1-i
    if (divides(N, (1+1i))) return(1+1i)
    if (divides(N, (1-1i))) return(1-1i)

  } else if (N %% 4 == 3) {
    ## N is a multiple of 4 plus 3: not a sum of two squares, discard
    return(NULL)

  } else if (N %% 4 == 1) {
    ## N is a multiple of 4 plus 1: candidates come from the sum of two squares
    return(sos(N))

  } else {
    ## something's wrong
    stop("this shouldn't happen")
  }
}

There are a couple of undefined functions here (R is fine with this; it’s lazy).

We need a way to tell if one complex number divides another “neatly”, in the sense that the quotient is a Gaussian Integer. I’ve called that divides() and an implementation could be

divides <- function(x, y) {
  z <- x / y
  (intish(Re(z)) && intish(Im(z)))
}

This relies on being able to say that a real, floating-point value looks like an integer. This is an annoying part of working with numbers - sometimes, especially if you’re doing maths, numbers aren’t precisely representable in the computer as you hope. The classic example is

0.1 + 0.2 == 0.3

## [1] FALSE

Why doesn’t that work? Looks simple enough. Let’s print more digits

print(0.1 + 0.2, digits = 20)

## [1] 0.30000000000000004441

This is so common, there’s even a website: https://0.30000000000000004.com/
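For plain comparisons like this one, base R already offers a tolerance-based check in all.equal(); the hand-rolled version on the last line is just a sketch of the same idea.

```r
x <- 0.1 + 0.2
x == 0.3                                   # FALSE: exact comparison
isTRUE(all.equal(x, 0.3))                  # TRUE: equal within a small tolerance
abs(x - 0.3) < sqrt(.Machine$double.eps)   # TRUE: the same idea written by hand
```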

So, can’t we just use R’s is.integer()? Would I be going through this if we could?

is.integer(3) # entered as a numeric value

## [1] FALSE

is.integer(3L) # entered as an integer

## [1] TRUE

So is.integer() checks how the value is stored, not whether it’s a whole number; if it wasn’t entered as an integer, it’s not an integer. What about trying to round-trip through as.integer() and comparing to the original? If x and as.integer(x) are the same, it’s an integer, right?

as.integer(3)        # makes sense

## [1] 3

as.integer(3.000001) # so far so good

## [1] 3

as.integer(3.999999) # oh, no

## [1] 3

so, if our value is ever so slightly under the integer, it will be truncated all the way down to the integer below (as.integer() drops the fractional part rather than rounding). Okay, so, how can we do this? round() rounds to the nearest integer, so let’s check if the absolute difference between x and round(x) is very small

intish <- function(x) {
  abs(round(x) - x) < 1e-7
}
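To see the difference between truncating and rounding side by side, including for negative values, here is a short comparison (the test values are arbitrary; intish() is repeated so the snippet runs on its own):

```r
x <- c(3.999999, -3.999999)
as.integer(x)   #  3 -3  (drops the fractional part, i.e. truncates towards zero)
trunc(x)        #  3 -3
floor(x)        #  3 -4
round(x)        #  4 -4

intish <- function(x) abs(round(x) - x) < 1e-7
intish(c(3, 3.999999, 3.99999999))   # TRUE FALSE TRUE
```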

Sanity check:

# 43 + 80i = (8 + 3i)(8 + 7i)
divides(43 + 80i, 8 + 3i)

## [1] TRUE

divides(43 + 80i, 8 + 7i)

## [1] TRUE

divides(43 + 80i, 5 + 5i)

## [1] FALSE

The other missing function is for the last condition of norm_factors(), when the factor can be represented as the sum of two squares, so sos() could be implemented as

sos <- function(p) {
  s <- sqrt(p)
  i <- seq_len(ceiling(s))
  g <- expand.grid(i, i)
  g$sos <- g[, 1]^2 + g[, 2]^2
  opts <- unlist(g[g$sos == p, c(1, 2)][1, ])
  c(round(opts[1]) + round(opts[2])*1i,
    round(opts[1]) - round(opts[2])*1i,
    round(opts[2]) + round(opts[1])*1i,
    round(opts[2]) - round(opts[1])*1i)
}

This enumerates all the combinations of integers $i$ up to $\sqrt p$ and checks if the sum of any two squares is equal to the input $p$. If so, those are returned as candidates of the form $m \pm ni$.
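The grid works fine at this scale. As an alternative sketch (not the approach used in this post), the same search can be done with one loop: for each $m$, check whether $p - m^2$ is itself a perfect square. The name two_squares is hypothetical.

```r
# hypothetical alternative to the grid search: returns c(m, n) with
# m^2 + n^2 == p and m, n >= 1, or NULL if no such pair exists
two_squares <- function(p) {
  for (m in seq_len(floor(sqrt(p)))) {
    n <- sqrt(p - m^2)
    if (n >= 1 && n == round(n)) return(c(m, n))
  }
  NULL
}

two_squares(73)    # 3 8
two_squares(113)   # 7 8
two_squares(7)     # NULL (7 is of the form 4n + 3)
```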

In order to use the above approach of norm_factors() we need to find the prime factors of the norm of the Gaussian Integer we’re factoring. We will then test each of those with this approach.

Finding the prime factors of a regular integer is a little more straightforward (for very small integers, less than thousands; for integers with thousands of digits we get into public-key cryptography spaces). In this case, we just enumerate the integers, check if the input is divisible, and take those that are prime (according to our earlier definition)

all_prime_factors <- function(x) {
  div <- seq_len(x)
  f <- div[x %% div == 0]
  f[sapply(f, is.prime)]
}

Again, it’s StackOverflow to the rescue here. In case JD Long is reading this, you may be pleased to see that yes, your musings are still being read (and leveraged) over a decade later.

Sanity check:

all_prime_factors(325)

## [1]  1  5 13
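For comparison, a more conventional trial-division sketch divides each factor out as it is found, so it reports repeated factors and never returns 1. The name prime_factorise is hypothetical and this is not what the rest of the post relies on.

```r
# hypothetical helper: prime factorisation with multiplicity by trial division
prime_factorise <- function(x) {
  factors <- c()
  d <- 2
  while (x > 1) {
    while (x %% d == 0) {       # divide d out as many times as it goes
      factors <- c(factors, d)
      x <- x / d
    }
    d <- d + 1
  }
  factors
}

prime_factorise(325)    # 5 5 13
prime_factorise(8249)   # 73 113
```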

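The 1 in the all_prime_factors(325) output sneaks in because our is.prime() counts 1 as prime; it does no harm here, because the candidates it produces are NA and get dropped in the next function. Now we can put that all together into a function that finds the Gaussian Prime factors of a Gaussian Integer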

GP_factors <- function(n) {
  # get all prime factors of the norm of n
  allf <- all_prime_factors(complexnorm(n))
  # get all candidate factors of those
  tests <- lapply(allf, norm_factors)
  # flatten into a vector of candidates
  tests <- unlist(tests)
  # remove anything that didn't work
  tests <- tests[!is.na(tests)]
  # check if n can be divided by any candidates and keep those
  tests <- tests[sapply(tests, function(x) divides(n, x))]
  # check if we have a Gaussian Prime and keep those
  tests <- tests[isGPv(tests)]
  # only find positive real and imaginary elements
  res <- tests[sapply(tests, function(x) Re(x) > 0 && Im(x) > 0)]

  # the factors should be the candidate and n / candidate
  # rounded to integers just to be sure
  unique(unname(c(round(res), round(n / res))))
}

but does it work? How about the example from earlier…

Sanity check:

# 43 + 80i = (8 + 3i)(8 + 7i)
GP_factors(43 + 80i)

## [1] 8+3i 8+7i

That. Is. So. Satisfying!

Applying this to our coloured points (converted back to complex)

d$complex <- d$row + d$col*1i
d$factor_pairs <- sapply(seq_len(nrow(d)),
                         function(x) {
                           list(unique(GP_factors(d$complex[x])))
                         })
head(d)

##   row col   color colorname complex factor_pairs
## 1  57 178 #0000ff      blue 57+178i 10+9i, 12+7i
## 2  47 140 #cccc00    yellow 47+140i  8+7i, 12+7i
## 3  46 125 #ff0000       red 46+125i  8+7i, 11+6i
## 4  60 109 #0000ff      blue 60+109i  8+7i, 11+4i
## 5  15 104 #0000ff      blue 15+104i  6+5i, 10+9i
## 6  58 103 #ff0000       red 58+103i  8+5i, 11+6i
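A cheap way to double-check those factor pairs is to multiply them back together. The values below are copied by hand from the six rows printed above.

```r
targets <- c(57+178i, 47+140i, 46+125i, 60+109i, 15+104i, 58+103i)
f1 <- c(10+9i,  8+7i,  8+7i,  8+7i,  6+5i,  8+5i)
f2 <- c(12+7i, 12+7i, 11+6i, 11+4i, 10+9i, 11+6i)

# complex multiplication is vectorised, so this checks all six rows at once
all(f1 * f2 == targets)   # TRUE
```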

Now we just extract the real and imaginary parts of those pairs

d$factor1 <- sapply(d$factor_pairs, `[[`, 1)
d$factor2 <- sapply(d$factor_pairs, `[[`, 2)
d$x1 <- Re(d$factor1)
d$y1 <- Im(d$factor1)
d$x2 <- Re(d$factor2)
d$y2 <- Im(d$factor2)
head(d)

##   row col   color colorname complex factor_pairs factor1 factor2 x1 y1 x2 y2
## 1  57 178 #0000ff      blue 57+178i 10+9i, 12+7i   10+9i   12+7i 10  9 12  7
## 2  47 140 #cccc00    yellow 47+140i  8+7i, 12+7i    8+7i   12+7i  8  7 12  7
## 3  46 125 #ff0000       red 46+125i  8+7i, 11+6i    8+7i   11+6i  8  7 11  6
## 4  60 109 #0000ff      blue 60+109i  8+7i, 11+4i    8+7i   11+4i  8  7 11  4
## 5  15 104 #0000ff      blue 15+104i  6+5i, 10+9i    6+5i   10+9i  6  5 10  9
## 6  58 103 #ff0000       red 58+103i  8+5i, 11+6i    8+5i   11+6i  8  5 11  6

Looping over the different colors as groups, we can draw segments on our image joining the two Gaussian Prime factors. The segments are all in one corner of the plot, so I’ve zoomed in to the first dozen pixels square. I’ve also faded the Gaussian Primes to make the solution a bit clearer

gglist <- list()
suppressMessages({ # replacing fill scale
  for (col in names(known_colors)) {
    dcol <- d[d$colorname == col, ]
    gglist[[col]] <- gg +
      geom_segment(data = dcol,
                   aes(x = x1, y = y1,
                       xend = x2, yend = y2,
                       col = colorname),
                   linewidth = 1.5,
                   inherit.aes = FALSE) +
      coord_cartesian(xlim = c(0, 12), ylim = c(0, 12)) +
      scale_color_manual(values =
                           setNames(d$colorname, d$colorname),
                         guide = "none") +
      scale_fill_manual(
        values = c(`TRUE` = "grey90", `FALSE` = "white"), 
        guide = "none"
      ) +
      theme_void() +
      theme(aspect.ratio = 1)
  }
})

And, finally, printing the result as a nice reveal, we can plot all of those at once

cowplot::plot_grid(plotlist = gglist, nrow = 2)

This spells out “BOTNET” which is the answer to the puzzle! And what a puzzle!

I had a lot of fun solving this - I’m not sure if there was an easier way, and I definitely couldn’t have made it this far without a significant hint, but I’m very pleased that I could solve the entire thing in (mostly) base R.

As always, comments, critiques, and suggestions are welcome both here and on Twitter.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-06-17
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  assertthat    0.2.1   2019-03-21 [3] CRAN (R 4.0.1)
##  blogdown      1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown      0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib         0.4.1   2022-11-02 [3] CRAN (R 4.2.2)
##  cachem        1.0.6   2021-08-19 [3] CRAN (R 4.2.0)
##  callr         3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli           3.4.1   2022-09-23 [3] CRAN (R 4.2.1)
##  colorspace    2.0-3   2022-02-21 [3] CRAN (R 4.2.0)
##  cowplot       1.1.1   2020-12-30 [1] CRAN (R 4.1.2)
##  crayon        1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  DBI           1.1.3   2022-06-18 [3] CRAN (R 4.2.1)
##  devtools      2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest        0.6.30  2022-10-18 [3] CRAN (R 4.2.1)
##  dplyr         1.0.10  2022-09-01 [3] CRAN (R 4.2.1)
##  ellipsis      0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.18    2022-11-07 [3] CRAN (R 4.2.2)
##  fansi         1.0.3   2022-03-24 [3] CRAN (R 4.2.0)
##  farver        2.1.1   2022-07-06 [3] CRAN (R 4.2.1)
##  fastmap       1.1.0   2021-01-25 [3] CRAN (R 4.2.0)
##  fs            1.5.2   2021-12-08 [3] CRAN (R 4.1.2)
##  generics      0.1.3   2022-07-05 [3] CRAN (R 4.2.1)
##  ggplot2     * 3.4.1   2023-02-10 [1] CRAN (R 4.1.2)
##  glue          1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  gtable        0.3.1   2022-09-01 [3] CRAN (R 4.2.1)
##  highr         0.9     2021-04-16 [3] CRAN (R 4.1.1)
##  htmltools     0.5.3   2022-07-18 [3] CRAN (R 4.2.1)
##  htmlwidgets   1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv        1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.3   2022-10-21 [3] CRAN (R 4.2.1)
##  knitr         1.40    2022-08-24 [3] CRAN (R 4.2.1)
##  labeling      0.4.2   2020-10-20 [3] CRAN (R 4.2.0)
##  later         1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr      2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  memoise       2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  munsell       0.5.0   2018-06-12 [3] CRAN (R 4.0.1)
##  pillar        1.8.1   2022-08-19 [3] CRAN (R 4.2.1)
##  pkgbuild      1.4.0   2022-11-27 [1] CRAN (R 4.1.2)
##  pkgconfig     2.0.3   2019-09-22 [3] CRAN (R 4.0.1)
##  pkgload       1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  png           0.1-7   2013-12-03 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.8.0   2022-10-26 [3] CRAN (R 4.2.1)
##  profvis       0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises      1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps            1.7.2   2022-10-26 [3] CRAN (R 4.2.2)
##  purrr         1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6            2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  Rcpp          1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.1.2)
##  rmarkdown     2.18    2022-11-09 [3] CRAN (R 4.2.2)
##  rstudioapi    0.14    2022-08-22 [3] CRAN (R 4.2.1)
##  sass          0.4.2   2022-07-16 [3] CRAN (R 4.2.1)
##  scales        1.2.1   2022-08-20 [3] CRAN (R 4.2.1)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny         1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi       1.7.8   2022-07-11 [3] CRAN (R 4.2.1)
##  stringr       1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  tibble        3.1.8   2022-07-22 [3] CRAN (R 4.2.2)
##  tidyselect    1.2.0   2022-10-10 [3] CRAN (R 4.2.1)
##  urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis       2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  utf8          1.2.2   2021-07-24 [3] CRAN (R 4.2.0)
##  vctrs         0.5.2   2023-01-23 [1] CRAN (R 4.1.2)
##  withr         2.5.0   2022-03-03 [3] CRAN (R 4.2.0)
##  xfun          0.34    2022-10-18 [3] CRAN (R 4.2.1)
##  xtable        1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml          2.3.6   2022-10-18 [3] CRAN (R 4.2.1)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────

Polyglot Sorting

website@jcarroll.com.au (Jonathan Carroll) — Sat, 08 Oct 2022 00:00:00 +0000

I’ve had the impression lately that everyone is learning Rust and there’s plenty of great material out there to make that easier. {gifski} is perhaps the most well-known example of an R package wrapping a Rust Cargo crate. I don’t really know any systems language particularly well, so I figured I’d wade into it and see what it’s like.

The big advantages I’ve heard are that it’s more modern than C++, is “safe” (in the sense that you can’t compile something that tries to read out of bounds memory), and it’s super fast (it’s a compiled, strictly-typed language, so one would hope so).

I had a browse through some beginner material, and watched some videos on YouTube. Just enough to have some understanding of the syntax and keywords so I could actually search for things once I inevitably hit problems.

Getting everything up and running went surprisingly smoothly. Installing the toolchain went okay on my Linux (Pop!_OS) machine, and the getting started guide was straightforward enough to follow along with. I soon enough had Ferris welcoming me to the world of Rust

----------------------------
< Hello fellow Rustaceans! >
----------------------------
              \
               \
                 _~^~^~_
             \) /  o o  \ (/
               '_   -   _'
               / '-----' \

Visual Studio Code works nicely as a multi-language editor, and while it’s great to have errors visible to you immediately, I can imagine that gets annoying pretty quick (especially if you write as much bad Rust code as I do).

Next I needed to actually code something up myself. I love small, silly problems for learning - you don’t know exactly what problems you’ll solve along the way. This one ended up being really helpful.

I had this tweet

This week I’ve been posting #Python 🐍 quizzes about sorting.

Let’s see if you can put everything together and solve a challenge! 💪#CuriousAboutCode pic.twitter.com/ht51eA3Ttj
— David Amos (@somacdivad) September 16, 2022

in my bookmarks because I wanted to try to solve this with R (naturally) but I decided it was a reasonable candidate for trying to solve a problem and learn some language at the same time, so I decided to give it a go with Rust. This is slightly more complicated than an academic “sort some strings” because it’s “natural sorting” (2 before 10) and has a complicating character in the middle.
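To see why a plain string sort isn’t enough, compare how the same values order as strings and as numbers. This is a small R illustration; method = "radix" is used so the string comparison doesn’t depend on the locale.

```r
x <- c("aa-2", "aa-10")
sort(x, method = "radix")         # "aa-10" "aa-2": character by character, "1" < "2"
sort(as.integer(c("2", "10")))    # 2 10: compared as numbers
```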

The first step was to get Rust to read in and just print back the ‘data’ (strings). I managed to copy some “print a vector of strings” code and got that working. I’ll figure out later what’s going on with the format string here

println!("{:?}", x);

After that, I battled errors in converting between String, &str, and i32 types; returning a Result (error) rather than a value; dealing with obscure errors (“cannot move out of borrowed content”, “expected named lifetime parameter” - ???); and a lack of method support for a struct I just created (which didn’t have any inherited ‘type’). All in all, nothing too surprising given I know approximately 0 Rust, but I got there in the end!

Now, this won’t be anything “good”, but it does compile and appears to give the right answer, so I’m led to believe that means it’s “right”.

// enable printing of the struct
#[derive(Debug)]
// create a struct with a String and an integer
// not using &str due to lifetime issues
struct Pair {
    x: String,
    y: i32
}

fn main() {
    // input data vector
    let v = vec!["aa-2", "ab-100", "aa-10", "ba-25", "ab-3"];
    // create an accumulating vector of `Pair`s
    let mut res: Vec<Pair> = vec![];
    // for each string, split at '-', 
    //  convert the first part to String and the second to integer.
    //  then push onto the accumulator
    for s in v {
        let a: Vec<&str> = s.split("-").collect();
        let tmp_pair = Pair {x: a[0].to_string(), y: a[1].parse::<i32>().unwrap() };
        res.push(tmp_pair);
    }
    // sort by Pair.x then Pair.y
    res.sort_by_key(|k| (k.x.clone(), k.y.clone()));
    // start building a new vector for the final result
    let mut res2: Vec<String> = vec![];
    // paste together Pair.x, '-', and Pair.y (as String)
    for s2 in res {
        res2.push(s2.x + "-" + &s2.y.to_string());
    }

    // ["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
    println!("{:?}", res2);
}

Running

cargo run --release

produces the expected output

["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]

Feel free to suggest anything that could be improved, I’m sure there’s plenty.

That might have been an okay place to stop, but I did still want to see if I could solve the problem with R, and how that might compare (in approach, readability, and speed), so I coded that up as

# input vector
s <- c("aa-2", "ab-100", "aa-10", "ba-25", "ab-3")
# split into pairs of strings
x <- strsplit(s, "-")
# take elements of s sorted by the first elements of x then
#  the second (as integers)
s[order(sapply(x, `[[`, 1), as.integer(sapply(x, `[[`, 2)))]

## [1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25"

I don’t love that I had to use sapply() twice, but the only other alternative I could think of was to strip out the first and second element lists and use those in a do.call()

s[do.call(order, list(unlist(x)[c(TRUE, FALSE)], as.integer(unlist(x)[c(FALSE, TRUE)])))]

## [1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25"

which… isn’t better.

I also had an idea to shoehorn dplyr::arrange() into this, but that requires a data.frame. One idea I had was to read in the data, using "-" as a delimiter, explicitly stating that I wanted to read it as character and integer data. That seemed to work, which means I can try what I hoped

suppressMessages(library(dplyr, quietly = TRUE))
# input vector
s <- c("aa-2", "ab-100", "aa-10", "ba-25", "ab-3")

# read strings as fields delimited by '-', 
#  expecting character and integer
s %>% read.delim(
    text = .,
    sep = "-",
    header = FALSE,
    colClasses = c("character", "integer")
) %>%
    # sort by first then second column
    arrange(V1, V2) %>%
    # collapse to single string per row
    mutate(res = paste(V1, V2, sep = "-")) %>%
    pull()

## [1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25"

Why stop there? I know other languages! Okay, the Python and Julia examples I found in other Tweets.

In Julia, two options were offered. This one

strings = String["aa-2", "ab-100", "aa-10", "ba-25", "ab-3"];
print(join.(sort(split.(strings, "-"), by = x -> (x[1], parse(Int, x[2]))), "-"))

## ["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]

(I added a type to the input and an explicit print), and this one

strings = String["aa-2", "ab-100", "aa-10", "ba-25", "ab-3"];
print(sort(strings, by = x->split(x, "-") |> v->(v[1], parse(Int, v[2]))))

## ["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]

The Python example offered by the original author of the challenge was

def parts(s):
    letters, nums = s.split("-")
    return letters, int(nums)

strings = ["aa-2", "ab-100", "aa-10", "ba-25", "ab-3"]

print(sorted(strings, key=parts))

## ['aa-2', 'aa-10', 'ab-3', 'ab-100', 'ba-25']

I actually really like this one - it’s the approach I wanted to use for R; provide sort with a function returning the keys to use. Alas.
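For what it’s worth, a key-function sort can be approximated in R with a small wrapper around order(). This is only a sketch: sorted_by_key and parts are hypothetical names rather than base R functions, and the key function is assumed to return a list with one element per sort level.

```r
# hypothetical helper: sort x by the keys that key() returns for each element
sorted_by_key <- function(x, key) {
  keys <- lapply(x, key)
  # regroup into one vector per key position, e.g. all letters, then all numbers
  cols <- lapply(seq_along(keys[[1]]), function(j) sapply(keys, `[[`, j))
  x[do.call(order, cols)]
}

parts <- function(s) {
  p <- strsplit(s, "-", fixed = TRUE)[[1]]
  list(p[1], as.integer(p[2]))
}

strings <- c("aa-2", "ab-100", "aa-10", "ba-25", "ab-3")
sorted_by_key(strings, parts)
```

It ends up doing the same work as the order() call earlier, just hidden behind a Python-like interface.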

Lastly, I remembered that there’s a sort command available from bash (part of GNU coreutils) that can do natural sorting with the -V (version sort) flag. I’m reminded of this anecdote (“More shell, less egg”) about using a very simple bash script when it’s possible. That came together okay

#!/bin/bash 

v=("aa-2" "ab-100" "aa-10" "ba-25" "ab-3")
readarray -t a_out < <(printf '%s\n' "${v[@]}" | sort -V)
printf '%s ' "${a_out[@]}"
echo 

exit 0

## aa-2 aa-10 ab-3 ab-100 ba-25

By the way, aside from the Rust example, all of these were run directly in the Rmd source of this post with knitr’s powerful engines… multi-language support FTW!

So, how do all these compare? I haven’t tuned any of these for performance; they’re how I would have written them as a developer trying to achieve something. Sure, if performance was an issue, I’d do some optimization, but I was curious just how the performance compares ‘out of the box’.

Mainly for my own posterity, I’ll add how I tracked this. I wrote the relevant code for each language in a file with suffix/filetype appropriate to each language. They’re all here, in case anyone is interested. Then I wanted to run each of them a few times, keeping track of the timing in a file. The solution I went with was to echo into a file (appending each time) both the input and output, with e.g.

echo "Rust (optimized/release)" >> timing
{time cargo run --release} >> timing 2>&1
{time cargo run --release} >> timing 2>&1
{time cargo run --release} >> timing 2>&1

(yes, trivial to loop 3 times, but whatever).
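Timing the whole command includes starting the interpreter, not just the sort. As a complementary sketch (not how the numbers in this post were collected), the sort itself can be timed in-process, here in R with system.time(). The function name natural_sort and the repeat count of 1000 are arbitrary choices for illustration.

```r
s <- c("aa-2", "ab-100", "aa-10", "ba-25", "ab-3")

# the base R solution from earlier, wrapped in a function (hypothetical name)
natural_sort <- function(s) {
  x <- strsplit(s, "-")
  s[order(sapply(x, `[[`, 1), as.integer(sapply(x, `[[`, 2)))]
}

# repeat the call many times so the elapsed time is measurable
timing <- system.time(for (i in 1:1000) natural_sort(s))
timing[["elapsed"]] / 1000   # rough average seconds per call; varies by machine
```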

Doing this for all the languages (with both versions for R and Julia) I get

Rust (optimized/release)
    Finished release [optimized] target(s) in 0.00s
     Running `target/release/sort`
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
cargo run --release  0.04s user 0.02s system 99% cpu 0.066 total
    Finished release [optimized] target(s) in 0.00s
     Running `target/release/sort`
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
cargo run --release  0.07s user 0.01s system 99% cpu 0.087 total
    Finished release [optimized] target(s) in 0.00s
     Running `target/release/sort`
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
cargo run --release  0.06s user 0.02s system 98% cpu 0.084 total

R1
[1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25" 
Rscript sort1.R  0.15s user 0.05s system 102% cpu 0.197 total
[1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25" 
Rscript sort1.R  0.17s user 0.05s system 102% cpu 0.206 total
[1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25" 
Rscript sort1.R  0.16s user 0.05s system 103% cpu 0.202 total

R2
[1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25" 
Rscript sort2.R  0.72s user 0.05s system 100% cpu 0.774 total
[1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25" 
Rscript sort2.R  0.67s user 0.06s system 100% cpu 0.720 total
[1] "aa-2"   "aa-10"  "ab-3"   "ab-100" "ba-25" 
Rscript sort2.R  0.69s user 0.04s system 99% cpu 0.737 total

Python
['aa-2', 'aa-10', 'ab-3', 'ab-100', 'ba-25']
python3 sort.py  0.03s user 0.00s system 98% cpu 0.032 total
['aa-2', 'aa-10', 'ab-3', 'ab-100', 'ba-25']
python3 sort.py  0.02s user 0.01s system 98% cpu 0.034 total
['aa-2', 'aa-10', 'ab-3', 'ab-100', 'ba-25']
python3 sort.py  0.03s user 0.02s system 98% cpu 0.059 total

Julia1
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
julia sort1.jl  1.10s user 0.68s system 236% cpu 0.750 total
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
julia sort1.jl  1.14s user 0.64s system 233% cpu 0.765 total
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
julia sort1.jl  1.13s user 0.62s system 241% cpu 0.725 total

Julia2
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
julia sort2.jl  0.97s user 0.64s system 270% cpu 0.596 total
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
julia sort2.jl  1.00s user 0.58s system 259% cpu 0.607 total
["aa-2", "aa-10", "ab-3", "ab-100", "ba-25"]
julia sort2.jl  0.96s user 0.63s system 276% cpu 0.578 total

Bash
aa-2 aa-10 ab-3 ab-100 ba-25 
./sort.sh  0.01s user 0.00s system 109% cpu 0.013 total
aa-2 aa-10 ab-3 ab-100 ba-25 
./sort.sh  0.00s user 0.01s system 108% cpu 0.015 total
aa-2 aa-10 ab-3 ab-100 ba-25 
./sort.sh  0.01s user 0.00s system 99% cpu 0.009 total

This wouldn’t be much of a coding/benchmark post without a plot, so I also did a visual comparison

library(ggplot2)
d <- tibble::tribble(
  ~language, ~version, ~run, ~time,
  "Rust", "", 1, 0.066,
  "Rust", "", 2, 0.087,
  "Rust", "", 3, 0.084,
  "R", "1", 1, 0.197,
  "R", "1", 2, 0.206,
  "R", "1", 3, 0.202,
  "R", "2", 1, 0.774,
  "R", "2", 2, 0.720,
  "R", "2", 3, 0.737,
  "Julia", "1", 1, 0.750,
  "Julia", "1", 2, 0.765,
  "Julia", "1", 3, 0.725,
  "Julia", "2", 1, 0.596,
  "Julia", "2", 2, 0.607,
  "Julia", "2", 3, 0.578,
  "Python", "", 1, 0.032,
  "Python", "", 2, 0.034,
  "Python", "", 3, 0.059,
  "Bash", "", 1, 0.013,
  "Bash", "", 2, 0.015,
  "Bash", "", 3, 0.009
)

d$language <- factor(
  d$language, 
  levels = c("Rust", "R", "Julia", "Python", "Bash")
)

ggplot(d, aes(language, time, fill = language, group = run)) + 
  geom_col(position = position_dodge(0.9)) + 
  facet_grid(
    ~language + version, 
    scales = "free_x", 
    labeller = label_wrap_gen(multi_line = FALSE), 
    switch = "x"
  ) + 
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(
    title = "Performance of sort functions by language", 
    y = "Time [s]", 
    x = "Language, Version"
  ) + 
  scale_fill_brewer(palette = "Set1")

It’s true - Rust does pretty well, even with my terrible coding. My R implementation (the sensible one) isn’t too bad - perhaps over many strings it would be a bit slow. Surprisingly, the Julia implementations are actually quite slow. I don’t have a good explanation for that. I’m using Julia 1.5.0 which is slightly out of date, so perhaps that needs an update. The Python implementation does particularly well - I really should learn more Python. The syntax there isn’t the worst, either. Oh, no - do I like that?

The big winner, though, is the simplest of all - Bash crushes the rest of the languages with a two-liner, and calling it doesn’t involve compiling anything.

As I said, I’m not particularly interested in optimizing any of these - this is how they compare as written.
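For a quick numeric summary to go with the plot, here is a minimal Python sketch that averages the three runs per implementation and ranks them. The numbers are copied from the timings reported above; the dictionary and variable names are made up for this illustration. Keep in mind that `time` measures the whole command, so every total includes starting the runtime as well as the sort itself.

```python
from statistics import mean

# Wall-clock "total" seconds for each of the three runs reported above.
timings = {
    "Rust": [0.066, 0.087, 0.084],
    "R (1)": [0.197, 0.206, 0.202],
    "R (2)": [0.774, 0.720, 0.737],
    "Julia (1)": [0.750, 0.765, 0.725],
    "Julia (2)": [0.596, 0.607, 0.578],
    "Python": [0.032, 0.034, 0.059],
    "Bash": [0.013, 0.015, 0.009],
}

# Average the runs, then order the implementations from fastest to slowest.
averages = {name: mean(runs) for name, runs in timings.items()}
ranking = sorted(averages, key=averages.get)

for name in ranking:
    print(f"{name:<10} {averages[name]:.3f} s")
```

With only three runs each and a five-element input, this ranking says more about start-up cost than about sorting speed.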

In summary, I learned some Rust - enough to actually manipulate some data. I’ll keep trying and hopefully some day I’ll be semi-literate in it.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2023-06-17
##  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version date (UTC) lib source
##  assertthat     0.2.1   2019-03-21 [3] CRAN (R 4.0.1)
##  blogdown       1.17    2023-05-16 [1] CRAN (R 4.1.2)
##  bookdown       0.29    2022-09-12 [1] CRAN (R 4.1.2)
##  bslib          0.4.1   2022-11-02 [3] CRAN (R 4.2.2)
##  cachem         1.0.6   2021-08-19 [3] CRAN (R 4.2.0)
##  callr          3.7.3   2022-11-02 [3] CRAN (R 4.2.2)
##  cli            3.4.1   2022-09-23 [3] CRAN (R 4.2.1)
##  colorspace     2.0-3   2022-02-21 [3] CRAN (R 4.2.0)
##  crayon         1.5.2   2022-09-29 [3] CRAN (R 4.2.1)
##  DBI            1.1.3   2022-06-18 [3] CRAN (R 4.2.1)
##  devtools       2.4.5   2022-10-11 [1] CRAN (R 4.1.2)
##  digest         0.6.30  2022-10-18 [3] CRAN (R 4.2.1)
##  dplyr        * 1.0.10  2022-09-01 [3] CRAN (R 4.2.1)
##  ellipsis       0.3.2   2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate       0.18    2022-11-07 [3] CRAN (R 4.2.2)
##  fansi          1.0.3   2022-03-24 [3] CRAN (R 4.2.0)
##  farver         2.1.1   2022-07-06 [3] CRAN (R 4.2.1)
##  fastmap        1.1.0   2021-01-25 [3] CRAN (R 4.2.0)
##  fs             1.5.2   2021-12-08 [3] CRAN (R 4.1.2)
##  generics       0.1.3   2022-07-05 [3] CRAN (R 4.2.1)
##  ggplot2      * 3.4.1   2023-02-10 [1] CRAN (R 4.1.2)
##  glue           1.6.2   2022-02-24 [3] CRAN (R 4.2.0)
##  gtable         0.3.1   2022-09-01 [3] CRAN (R 4.2.1)
##  here           1.0.1   2020-12-13 [1] CRAN (R 4.1.2)
##  highr          0.9     2021-04-16 [3] CRAN (R 4.1.1)
##  htmltools      0.5.3   2022-07-18 [3] CRAN (R 4.2.1)
##  htmlwidgets    1.5.4   2021-09-08 [1] CRAN (R 4.1.2)
##  httpuv         1.6.6   2022-09-08 [1] CRAN (R 4.1.2)
##  jquerylib      0.1.4   2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite       1.8.3   2022-10-21 [3] CRAN (R 4.2.1)
##  JuliaCall      0.17.5  2022-09-08 [1] CRAN (R 4.1.2)
##  knitr          1.40    2022-08-24 [3] CRAN (R 4.2.1)
##  labeling       0.4.2   2020-10-20 [3] CRAN (R 4.2.0)
##  later          1.3.0   2021-08-18 [1] CRAN (R 4.1.2)
##  lattice        0.20-45 2021-09-22 [4] CRAN (R 4.2.0)
##  lifecycle      1.0.3   2022-10-07 [3] CRAN (R 4.2.1)
##  magrittr       2.0.3   2022-03-30 [3] CRAN (R 4.2.0)
##  Matrix         1.5-3   2022-11-11 [4] CRAN (R 4.2.2)
##  memoise        2.0.1   2021-11-26 [3] CRAN (R 4.2.0)
##  mime           0.12    2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI         0.1.1.1 2018-05-18 [1] CRAN (R 4.1.2)
##  munsell        0.5.0   2018-06-12 [3] CRAN (R 4.0.1)
##  pillar         1.8.1   2022-08-19 [3] CRAN (R 4.2.1)
##  pkgbuild       1.4.0   2022-11-27 [1] CRAN (R 4.1.2)
##  pkgconfig      2.0.3   2019-09-22 [3] CRAN (R 4.0.1)
##  pkgload        1.3.0   2022-06-27 [1] CRAN (R 4.1.2)
##  png            0.1-7   2013-12-03 [1] CRAN (R 4.1.2)
##  prettyunits    1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx       3.8.0   2022-10-26 [3] CRAN (R 4.2.1)
##  profvis        0.3.7   2020-11-02 [1] CRAN (R 4.1.2)
##  promises       1.2.0.1 2021-02-11 [1] CRAN (R 4.1.2)
##  ps             1.7.2   2022-10-26 [3] CRAN (R 4.2.2)
##  purrr          1.0.1   2023-01-10 [1] CRAN (R 4.1.2)
##  R6             2.5.1   2021-08-19 [3] CRAN (R 4.2.0)
##  RColorBrewer   1.1-3   2022-04-03 [3] CRAN (R 4.2.0)
##  Rcpp           1.0.9   2022-07-08 [1] CRAN (R 4.1.2)
##  remotes        2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  reticulate     1.26    2022-08-31 [1] CRAN (R 4.1.2)
##  rlang          1.0.6   2022-09-24 [1] CRAN (R 4.1.2)
##  rmarkdown      2.18    2022-11-09 [3] CRAN (R 4.2.2)
##  rprojroot      2.0.3   2022-04-02 [1] CRAN (R 4.1.2)
##  rstudioapi     0.14    2022-08-22 [3] CRAN (R 4.2.1)
##  sass           0.4.2   2022-07-16 [3] CRAN (R 4.2.1)
##  scales         1.2.1   2022-08-20 [3] CRAN (R 4.2.1)
##  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
##  shiny          1.7.2   2022-07-19 [1] CRAN (R 4.1.2)
##  stringi        1.7.8   2022-07-11 [3] CRAN (R 4.2.1)
##  stringr        1.5.0   2022-12-02 [1] CRAN (R 4.1.2)
##  tibble         3.1.8   2022-07-22 [3] CRAN (R 4.2.2)
##  tidyselect     1.2.0   2022-10-10 [3] CRAN (R 4.2.1)
##  urlchecker     1.0.1   2021-11-30 [1] CRAN (R 4.1.2)
##  usethis        2.1.6   2022-05-25 [1] CRAN (R 4.1.2)
##  utf8           1.2.2   2021-07-24 [3] CRAN (R 4.2.0)
##  vctrs          0.5.2   2023-01-23 [1] CRAN (R 4.1.2)
##  withr          2.5.0   2022-03-03 [3] CRAN (R 4.2.0)
##  xfun           0.34    2022-10-18 [3] CRAN (R 4.2.1)
##  xtable         1.8-4   2019-04-21 [1] CRAN (R 4.1.2)
##  yaml           2.3.6   2022-10-18 [3] CRAN (R 4.2.1)
## 
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
## 
## ─ Python configuration ───────────────────────────────────────────────────────
##  python:         /usr/bin/python3
##  libpython:      /usr/lib/python3.10/config-3.10-x86_64-linux-gnu/libpython3.10.so
##  pythonhome:     //usr://usr
##  version:        3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
##  numpy:          /home/jono/.local/lib/python3.10/site-packages/numpy
##  numpy_version:  1.24.1
##  
##  NOTE: Python version was forced by RETICULATE_PYTHON_FALLBACK
## 
## ──────────────────────────────────────────────────────────────────────────────

Australian Signals Directorate 50c Coin Decryption

website@jcarroll.com.au (Jonathan Carroll) — Thu, 01 Sep 2022 00:00:00 +0000

Updated: 2022-09-04

I took a very long time to post about the last Australian Signals Directorate (then DSD) decryption, so this time I’ll be a lot more punctual. This article was published today announcing that ASD have collaborated to release a new 50c coin containing a decryption challenge.

The new ASD 50c coin

That looks like fun! Typing in the letters and numbers from the image certainly wasn’t, but everything after that was. Of course, I’ll be solving the entire thing with R.

Apparently there’s ~~4~~ 6 challenges here.

Added 2022-09-04:

The obverse (head) side of the coin

has some boxes under particular letters in “ELIZABETH AUSTRALIA” (bolded in the original post: B, T, H, A, S, and a second A). These are Braille numbers.

I’m committed to doing all this solving in base R, so no external packages, but @coolbutuseless has a great post about Braille in R where he notes that the system can be bit-encoded quite nicely. Essentially, the positions of the filled boxes can be represented uniquely by a pattern of bits. This means we can store the Braille numbers as bits and identify which one is which. If we store the lookup table as

nums <- c(1, 5, 3, 11, 9, 7, 15, 13, 6) # 1:9

then we can see one of these (e.g. 8) in the Braille form with

print(matrix(intToBits(nums[8])[1:6], ncol = 2, byrow = TRUE))

##      [,1] [,2]
## [1,]   01   00
## [2,]   01   01
## [3,]   00   00

Taking the patterns under each of the letters

code = list(
  B = c(1,1,0,0),
  T = c(1,0,1,0),
  H = c(1,1,1,0),
  A = c(1,0,0,0),
  S = c(1,0,0,1),
  a = c(1,1,0,1)
)

then calculating their bit values

sums <- sapply(code, function(x) sum(x*2^(0:3)))

we can compare against the lookup table and sort the result to see

sort(setNames(match(sums, nums), names(code)))

## A T B a S H 
## 1 2 3 4 5 6

which leads us to the cipher we should use for the next challenge!
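The same lookup can be sketched in Python for comparison. This is only an illustration of the bit-encoding idea, reusing the table and box patterns from the R code; `bit_value`, `digits` and `keyword` are names invented here.

```python
# Bit value of each Braille digit 1..9, same table as `nums` in the R code.
NUMS = [1, 5, 3, 11, 9, 7, 15, 13, 6]

# Filled (1) / empty (0) boxes under each marked letter, as in `code`.
CODE = {
    "B": [1, 1, 0, 0],
    "T": [1, 0, 1, 0],
    "H": [1, 1, 1, 0],
    "A": [1, 0, 0, 0],
    "S": [1, 0, 0, 1],
    "a": [1, 1, 0, 1],
}


def bit_value(bits):
    """Sum the bits weighted by 2**position, least significant first."""
    return sum(b << i for i, b in enumerate(bits))


# Look each pattern up in the table (+1 because the table starts at digit 1).
digits = {letter: NUMS.index(bit_value(bits)) + 1 for letter, bits in CODE.items()}

# Reading the letters in digit order spells out the hint.
keyword = "".join(sorted(digits, key=digits.get))
print(keyword)
```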

The text around the rim looks to be split into sections. The shortest one is

txt1 <- "URMWXOZIRGBRM7DRWGSC5WVKGS"

I tried a few different substitution ciphers and hit gold with an Atbash cipher where the alphabet is simply reversed. That’s easy enough to code up…

solve_atbash <- function(txt) {
  txt <- strsplit(txt, "")[[1]]
  atbash <- rev(LETTERS)
  res <- LETTERS[match(txt, atbash)]
  # if an element doesn't match, it's probably a number 
  # and can go straight in
  res[is.na(res)] <- txt[is.na(res)]
  paste(res, collapse = "")
}

R having the alphabet available as LETTERS is certainly nice in this case. Applying that to the string above we get

solve_atbash(txt1)

## [1] "FINDCLARITYIN7WIDTHX5DEPTH"

which we can space out a bit to read “FIND CLARITY IN 7 WIDTH X 5 DEPTH”. Sounds like we’re going to need a matrix - good news for R!
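Because Atbash maps each letter to its mirror image in the alphabet, the same function both encodes and decodes. A minimal Python sketch of the idea, using the standard library's translation tables (the function name simply mirrors the R one):

```python
import string

UPPER = string.ascii_uppercase
# Map A->Z, B->Y, ...; characters without a mapping (digits) pass through.
ATBASH = str.maketrans(UPPER, UPPER[::-1])


def solve_atbash(text):
    """Apply the Atbash substitution to upper-case letters in `text`."""
    return text.translate(ATBASH)


print(solve_atbash("URMWXOZIRGBRM7DRWGSC5WVKGS"))
```

Applying it twice returns the original string, which is a handy sanity check.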

Trying the next rim letters

txt2 <- "DVZIVZFWZXRLFHRMXLMXVKGZMWNVGRXFOLFHRMVCVXFGRLM"
solve_atbash(txt2)

## [1] "WEAREAUDACIOUSINCONCEPTANDMETICULOUSINEXECUTION"

which once again needs some spaces, but we can read “WE ARE AUDACIOUS IN CONCEPT AND METICULOUS IN EXECUTION”. No additional hints there, I guess - just some filler.

The inner ring of text doesn’t reveal anything with the cipher

inner <- "BGOAMVOEIATSIRLNGTTNEOGRERGXNTEAIFCECAIEOALEKFNR5LWEFCHDEEAEEE7NMDRXX5"
solve_atbash(inner)

## [1] "YTLZNELVRZGHRIOMTGGMVLTIVITCMGVZRUXVXZRVLZOVPUMI5ODVUXSWVVZVVV7MNWICC5"

but we had the earlier clue of a 7 x 5 matrix… that’s only 35 characters, so maybe we need 2

mat1 <- matrix(strsplit(inner, "")[[1]][1:35], 5, 7, byrow = TRUE)
mat1

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] "B"  "G"  "O"  "A"  "M"  "V"  "O" 
## [2,] "E"  "I"  "A"  "T"  "S"  "I"  "R" 
## [3,] "L"  "N"  "G"  "T"  "T"  "N"  "E" 
## [4,] "O"  "G"  "R"  "E"  "R"  "G"  "X" 
## [5,] "N"  "T"  "E"  "A"  "I"  "F"  "C"

Looking down the columns the text reads consistently, so let’s paste those together

res1 <- paste(apply(mat1, 2, paste, collapse = ""), collapse = "")

Doing the same for the remaining letters then joining the results

mat2 <- matrix(strsplit(inner, "")[[1]][36:70], 5, 7, byrow = TRUE)
res2 <- paste(apply(mat2, 2, paste, collapse = ""), collapse = "")
paste(res1, res2, collapse = "")

## [1] "BELONGINGTOAGREATTEAMSTRIVINGFOREXC ELLENCEWEMAKEADIFFERENCEXORHEXA5D75"

which, with spaces, reads “BELONGING TO A GREAT TEAM STRIVING FOR EXCELLENCE WE MAKE A DIFFERENCE XOR HEX A5D75”.
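The step here is a simple columnar transposition: write 35 characters into a grid row by row, 7 wide and 5 deep, then read down the columns. A short Python sketch of that, assuming the same `inner` string (`read_columns` is a hypothetical helper name):

```python
inner = "BGOAMVOEIATSIRLNGTTNEOGRERGXNTEAIFCECAIEOALEKFNR5LWEFCHDEEAEEE7NMDRXX5"


def read_columns(chunk, width=7):
    """Lay `chunk` out in rows of `width` and read it back column by column."""
    rows = [chunk[i:i + width] for i in range(0, len(chunk), width)]
    return "".join("".join(col) for col in zip(*rows))


# Two 7 x 5 grids of 35 characters each, decoded one after the other.
message = "".join(read_columns(inner[i:i + 35]) for i in range(0, len(inner), 35))
print(message)
```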

XOR is familiar from the last time I solved the challenge! The key ‘A5D75’ (l33tspeak for ASD’s 75th Anniversary, I take it) has an odd number of hex characters, so it can’t be split into whole bytes on its own. Writing it twice gives ‘A5D75A5D75’, which splits cleanly into five bytes that I can repeat to XOR against the input. I can only assume the big chunk of hex text is the remaining input. Typing that in was … interesting.

hex <- "
E3B8287D4290F7233814D7A47A291DC0F71B2806
D1A53B311CC4B97A0E1CC2B93B31068593332F10
C6A3352F14D1B27A3514D6F7382F1AD0B0322955
D1B83D3801CDB2287D05C0B82A311085A033291D
85A3323855D6BC333119D6FB7A3C11C4A72E3C17
CCBB33290C85B6343955CCBA3B3A1CCBB62E341A
CBF72E3255CAA73F2F14D1B27A341B85A3323855
D6BB333055C4A53F3C55C7B22E2A10C0B97A291D
C0F73E3413C3BE392819D1F73B331185A3323855
CCBA2A3206D6BE3831108B"
hex <- gsub("\\n", "", hex) # remove linebreaks
# split into pairs of hex characters (one byte each)
pairs <- sapply(seq(1, nchar(hex), by = 2), function(x) substr(hex, x, x+1))
# xor key from earlier solution, duplicated so that pairs can be extracted
xor <- "A5D75A5D75"
# duplicate to length of input
xor <- rep(sapply(seq(1, nchar(xor), by = 2), function(x) substr(xor, x, x+1)), 40)[1:length(pairs)]
# xor input and key as integers
res <- bitwXor(strtoi(pairs, 16L), strtoi(xor, 16L))
# convert result to ASCII
cat(rawToChar(as.raw(res)))

## For 75 years the Australian Signals Directorate has brought together people with the skills, adaptability and imagination to operate in the slim area between the difficult and the impossible.
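For readers who want to see the repeating-key XOR in isolation, here is a small Python sketch applied to just the first line of the hex block. It assumes the same doubled key as above; `xor_hex` is a name made up for this example.

```python
from itertools import cycle


def xor_hex(cipher_hex, key_hex):
    """XOR hex-encoded data against a hex-encoded key, repeating the key."""
    cipher = bytes.fromhex(cipher_hex)
    key = bytes.fromhex(key_hex)
    return bytes(c ^ k for c, k in zip(cipher, cycle(key)))


first_line = "E3B8287D4290F7233814D7A47A291DC0F71B2806"
key = "A5D75" * 2  # doubled so that it splits into whole bytes

print(xor_hex(first_line, key).decode("ascii"))
```

XOR is its own inverse, so running the output back through `xor_hex` with the same key recovers the original bytes.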

What a nice challenge! I don’t expect to be getting a phone call from ASD any time soon, but this was certainly fun to solve with R.

Added 2022-09-04

The inner ring text has a dark/light pattern to it. Treating this as binary

txt <- "BGOAMVOEIATSIRLNGTTNEOGRERGXNTEAIFCECAIEOALEKFNR5LWEFCHDEEAEEE7NMDRXX5"
bin <- "1000001101001110001001000011110001011100100110010011000001100100110010"

then splitting into groups (of 7, since $2^7 = 128$ is sufficient for the ASCII text table)

bin <- sapply(seq(1, nchar(bin), by = 7), function(x) substr(bin, x, x+6))
bin

##  [1] "1000001" "1010011" "1000100" "1000011" "1100010" "1110010" "0110010"
##  [8] "0110000" "0110010" "0110010"

then converting to ASCII, this time with a base of 2 for the binary data

rawToChar(as.raw(strtoi(bin, 2L)))

## [1] "ASDCbr2022"

which looks to be short for “ASD CANBERRA 2022”.
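The same 7-bit decoding, sketched in Python as a cross-check (variable names are illustrative only):

```python
bits = "1000001101001110001001000011110001011100100110010011000001100100110010"

# Cut the stream into 7-bit groups and read each one as an ASCII code.
groups = [bits[i:i + 7] for i in range(0, len(bits), 7)]
text = "".join(chr(int(group, 2)) for group in groups)
print(text)
```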

The outer ring additionally has a shaded pattern. Instead of binary, we can treat this as Morse code with a light letter representing a dot, a dark letter representing a dash, and a shaded letter representing a space. If we start at the double space near the top of the coin, the pattern is

txt <- "WNVGRXFOLFHRMVCVXFGRLM.URMWXOZIRGBRM7DRWGSC5WVKGSDVZIVZFWZXRLFHRMXLMXVKGZM"
pat <- "-.. ... -... .- .-.. -... . .-. - .--. .- .-. -.- .---- ----. ....- --... "

Splitting this at the spaces

pat <- strsplit(pat, " ")[[1]]
pat

##  [1] "-.."   "..."   "-..."  ".-"    ".-.."  "-..."  "."     ".-."   "-"    
## [10] ".--."  ".-"    ".-."   "-.-"   ".----" "----." "....-" "--..."

I’m still trying to do this in base R, so again, no packages. Instead I’ll load a lookup table

morse <-
  data.frame(char = c(
    "A", "B", "C", "D",
    "E", "F", "G", "H",
    "I", "J", "K", "L",
    "M", "N", "O", "P",
    "Q", "R", "S", "T",
    "U", "V", "W", "X",
    "Y", "Z", "0", "1",
    "2", "3", "4", "5",
    "6", "7", "8", "9",
    ",", "?", ":", "-",
    "\"", "(", "=", "*",
    ".", ";", "/", "'",
    "_", ")", "+", "@",
    " "),
    row.names = c(
      ".-", "-...", "-.-.", "-..",
      ".", "..-.", "--.", "....",
      "..", ".---", "-.-", ".-..",
      "--", "-.", "---", ".--.",
      "--.-", ".-.", "...", "-",
      "..-", "...-", ".--", "-..-",
      "-.--", "--..", "-----", ".----",
      "..---", "...--", "....-", ".....",
      "-....", "--...", "---..", "----.",
      "__..__", "..__..", "___...", "_...._",
      "._.._.", "_.__.", "_..._", "_.._",
      "._._._", "_._._.", "_.._.",
      ".____.", "..__._", "_.__._", "._._.",
      ".__._.", "   ")
  )

I like using rownames as an easy way to lookup values, despite the aversion to them in the tidyverse. Now it’s just a matter of extracting the values based on the lookup

paste(morse[pat, ], collapse = "")

## [1] "DSBALBERTPARK1947"

which stands for “DSB ALBERT PARK 1947”. Back when the division was started in 1947 at Albert Park it was the Defence Signals Bureau.
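A dictionary does the same job as the row-name lookup. The Python sketch below is deliberately minimal: the table only holds the symbols that occur in this particular pattern, not the full Morse alphabet.

```python
# Partial Morse table: only the symbols needed for this pattern.
MORSE = {
    ".-": "A", "-...": "B", "-..": "D", ".": "E", "-.-": "K",
    ".-..": "L", ".--.": "P", ".-.": "R", "...": "S", "-": "T",
    ".----": "1", "....-": "4", "--...": "7", "----.": "9",
}

pattern = "-.. ... -... .- .-.. -... . .-. - .--. .- .-. -.- .---- ----. ....- --... "

# split() with no argument also discards the trailing space.
decoded = "".join(MORSE[symbol] for symbol in pattern.split())
print(decoded)
```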

The very last part is the squares and circles - that appears to be the ASD’s typeface and I think just spells out “ASD”.

Thanks for the comments and helpful tips, everyone!

~~Now I just need to get one of the coins as a souvenir.~~ I managed to get one of the coins from the Mint, and they’re now sold out.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 21.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2022-09-04                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  blogdown      1.8     2022-02-16 [1] CRAN (R 4.1.2)
##  bookdown      0.24    2021-09-02 [1] CRAN (R 4.1.2)
##  brio          1.1.1   2021-01-20 [3] CRAN (R 4.0.3)
##  bslib         0.3.1   2021-10-06 [1] CRAN (R 4.1.2)
##  cachem        1.0.3   2021-02-04 [3] CRAN (R 4.0.3)
##  callr         3.7.0   2021-04-20 [1] CRAN (R 4.1.2)
##  cli           3.2.0   2022-02-14 [1] CRAN (R 4.1.2)
##  crayon        1.5.0   2022-02-14 [1] CRAN (R 4.1.2)
##  desc          1.4.1   2022-03-06 [1] CRAN (R 4.1.2)
##  devtools      2.4.3   2021-11-30 [1] CRAN (R 4.1.2)
##  digest        0.6.27  2020-10-24 [3] CRAN (R 4.0.3)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.2)
##  evaluate      0.14    2019-05-28 [3] CRAN (R 4.0.1)
##  fastmap       1.1.0   2021-01-25 [3] CRAN (R 4.0.3)
##  fs            1.5.0   2020-07-31 [3] CRAN (R 4.0.2)
##  glue          1.6.1   2022-01-22 [1] CRAN (R 4.1.2)
##  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.1.2)
##  jsonlite      1.7.2   2020-12-09 [3] CRAN (R 4.0.3)
##  knitr         1.37    2021-12-16 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
##  magrittr      2.0.1   2020-11-17 [3] CRAN (R 4.0.3)
##  memoise       2.0.0   2021-01-26 [3] CRAN (R 4.0.3)
##  pkgbuild      1.2.0   2020-12-15 [3] CRAN (R 4.0.3)
##  pkgload       1.2.4   2021-11-30 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.5.2   2021-04-30 [1] CRAN (R 4.1.2)
##  ps            1.5.0   2020-12-05 [3] CRAN (R 4.0.3)
##  purrr         0.3.4   2020-04-17 [3] CRAN (R 4.0.1)
##  R6            2.5.0   2020-10-28 [3] CRAN (R 4.0.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.0.1   2022-02-03 [1] CRAN (R 4.1.2)
##  rmarkdown     2.13    2022-03-10 [1] CRAN (R 4.1.2)
##  rprojroot     2.0.2   2020-11-15 [3] CRAN (R 4.0.3)
##  rstudioapi    0.13    2020-11-12 [3] CRAN (R 4.0.3)
##  sass          0.4.0   2021-05-12 [1] CRAN (R 4.1.2)
##  sessioninfo   1.1.1   2018-11-05 [3] CRAN (R 4.0.1)
##  stringi       1.5.3   2020-09-09 [3] CRAN (R 4.0.2)
##  stringr       1.4.0   2019-02-10 [3] CRAN (R 4.0.1)
##  testthat      3.1.2   2022-01-20 [1] CRAN (R 4.1.2)
##  usethis       2.1.5   2021-12-09 [1] CRAN (R 4.1.2)
##  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
##  xfun          0.30    2022-03-02 [1] CRAN (R 4.1.2)
##  yaml          2.2.1   2020-02-01 [3] CRAN (R 4.0.1)
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

Lissajous Curve Matrix in Julia

website@jcarroll.com.au (Jonathan Carroll) — Thu, 12 May 2022 00:00:00 +0000

Another ‘small learning project’ for me as I continue to learn Julia. I’ve said many times that small projects with a defined goal are one of the best ways to learn, at least for me. This one was inspired by yet another Reddit post

These are at least reminiscent of Lissajous curves but they primarily just looked pretty cool - that animation is very nicely put together.

That graphic was made using Typescript which is itself neat to begin with, but it looked like something that Julia might be well-suited to, at least the parts I’ve learned so far. It seems to involve interpolating between points and animation, both of which I recently covered on my mini blog.

Better yet, it appeared that matrix operations might be a useful component, for which Julia seems particularly well-suited.

The first thing I needed to do was to get a polygon plotted in Julia. This already challenged my existing knowledge, but that’s where the learning happens. I dabbled with the Shape type and didn’t get very far. I found some other implementations that plotted shapes, but none (at least that I understood) that produced a set of points I could interpolate between.

I ended up defining my own function that calculates the vertices of an n-sided polygon with a bit of math. There’s very likely already something that does it, but it failed the discoverability aspect. The function I came up with is

"""
    vertices(center, R, n[, closed])

# Arguments
- `center::Point`: center of polygon
- `R::Real`: circumradius
- `n::Int`: number of sides
- `closed::Bool`: should the first point be repeated?

Polygon has a flat bottom and points progress counterclockwise 
starting at the right end of the base

The final point is the starting point when closed = true
"""
function vertices(center::Point, R::Real, n::Int, closed::Bool=true) 
    X = center[1] .+ R * cos.(π/n .* (1 .+ 2 .* (0:n-1)) .- π/2)
    Y = center[2] .+ R * sin.(π/n .* (1 .+ 2 .* (0:n-1)) .- π/2)

    res = permutedims([X Y])
    ## append the start point if closed
    if closed
        res = hcat(res, res[:,1])
    end
    return res
end

This is where playing around with code and data where you know what you want but not how to produce it is the most useful. Coming from R, I was at risk of trying to create a data.frame of x and y points, but Arrays make more sense here. Getting the points in the right structure was the biggest learning experience for me - combining Arrays of Points doesn’t quite work in the way I expect coming from R, but I think this works.

I did play with the idea of making my own struct for this group of Points, but even though (I think) it inherited from AbstractArray, none of the Array methods seemed to work for it - more to learn for next time!
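To make the geometry in `vertices` concrete, here is a rough Python equivalent using only the standard library. It is a sketch of the same formula (vertex k sits at angle π/n · (1 + 2k) − π/2 from the center), returning a list of (x, y) tuples instead of a 2×n Array; that layout is a choice made for this example.

```python
import math


def vertices(center, radius, n, closed=True):
    """Vertices of a regular n-gon with a flat bottom, counterclockwise,
    starting at the right end of the base."""
    cx, cy = center
    pts = []
    for k in range(n):
        angle = math.pi / n * (1 + 2 * k) - math.pi / 2
        pts.append((cx + radius * math.cos(angle), cy + radius * math.sin(angle)))
    if closed:
        pts.append(pts[0])  # repeat the start point to close the outline
    return pts


# A square with circumradius sqrt(2) has corners at (+-1, +-1).
square = vertices((0, 0), math.sqrt(2), 4)
print(square)
```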

I wanted to make sure that the points I generated here seem to make sense, so I can plot them. Getting the plots to work requires using Plots, and Point comes from GeometryBasics, so

using Plots
import GeometryBasics: Point

then plotting the vertices of a polygon is as easy as

a = vertices(Point(0,0), 1, 5, true);
plot(a[1,:], a[2,:], xlim = (-1.2, 1.2), ylim = (-1.2, 1.2), ratio = 1)
scatter!(a[1,:], a[2,:])

And just by changing the number of vertices

a = vertices(Point(0,0), 1, 6, true);
plot(a[1,:], a[2,:], xlim = (-1.2, 1.2), ylim = (-1.2, 1.2), ratio = 1)
scatter!(a[1,:], a[2,:])

I find it somewhat odd that plot doesn’t have an Array method and I need to explicitly slice out the x and y arguments, but perhaps I’m “holding it wrong”?

Next I wanted to interpolate points between these vertices. I played with interpolation in Julia in my last mini blog post so I knew that function was

interpolate(a, b) = t -> ((1.0 - t) * a + t * b)

Interpolating between vertices meant interpolating between any two vertices, then repeating that over pairs. Taking the case of two vertices first

"""
    _interPoints(pts, steps, slice)

# Arguments
- `pts::Array`: Array of `Point`s representing a polygon
- `steps::Int`: number of points to interpolate
- `slice::Int`: which polygon vertex to begin with; points will be interpolated to the next vertex

This is an internal function to interpolate points between 
    two vertices of a polygon. It is intended to be used 
    in a `map` across slices of a polygon.
"""
function _interPoints(pts::Array, steps::Int, slice::Int) 
    int = interpolate(pts[:,slice], pts[:,slice+1])
    explode = [int(t) for t in range(0,1,length=steps)]
    return hcat(explode...)
end

which I can test with

a = vertices(Point(0,0), 1, 5, true);
b = _interPoints(a, 10, 1);
plot(a[1,:], a[2,:], ratio = 1)
scatter!(b[1,:], b[2,:])

Then, mapping across pairs of points is just

"""
    interPoints(pts, steps)

# Arguments
- `pts::Array`: Array of `Point`s representing a polygon
- `steps::Int`: number of points to interpolate between each pair of vertices

This takes an `Array` of `Point`s representing polygon vertices and interpolates between the vertices
"""
function interPoints(pts::Array, steps::Int) 
    res = map(s -> _interPoints(pts, steps, s), 1:(size(pts,2)-1))
    return hcat(res...)
end

Plotting all these points

a = vertices(Point(0,0), 1, 5, true);
b = interPoints(a, 10);
plot(b[1,:], b[2,:], xlim = (-1.2, 1.2), ylim = (-1.2, 1.2), ratio = 1)
scatter!(b[1,:], b[2,:])
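The interpolation logic can be sketched in Python as well. This version works on a list of (x, y) tuples rather than a 2×n Array, and `inter_points` is an illustrative name. As in the Julia code, both ends of every edge are included, so each interior vertex appears twice in the result.

```python
def interpolate(a, b):
    """Return a function mapping t in [0, 1] to a point on the segment a -> b."""
    return lambda t: tuple((1.0 - t) * p + t * q for p, q in zip(a, b))


def inter_points(pts, steps):
    """Place `steps` evenly spaced points (ends included) on every edge.

    Assumes steps >= 2 and that `pts` is a closed outline.
    """
    out = []
    for start, end in zip(pts, pts[1:]):
        point_at = interpolate(start, end)
        out.extend(point_at(i / (steps - 1)) for i in range(steps))
    return out


triangle = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (0.0, 0.0)]  # closed outline
path = inter_points(triangle, 3)
print(len(path), path[:4])
```

Three edges with three points each give nine points in total, which is the same bookkeeping that makes a triangle with 24 steps come out at 72 points.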

Animating these points is as simple as

anim = @animate for t in 1:size(b,2)
    plot(b[1,:], b[2,:], xlim = (-1.2, 1.2), ylim = (-1.2, 1.2), ratio=1)
    scatter!([b[1,t]], [b[2,t]], markersize=8)
end

gif(anim, fps = 12)

and I think that’s pretty great progress towards what I want to make. Now I just need to run more of these at different speeds, and find the intersections of them.

Taking the intersection problem first, I just want to create two polygons and extract the x values from one and the y values from the other. Simple enough

"""
Find the intersection of two Arrays (representing polygons)

# Arguments
- `a::Array`: first polygon (for x values)
- `b::Array`: second polygon (for y values)

Take the x values from a and the y values from b
"""
function intersection(a::Array, b::Array) 
    permutedims(hcat([(a[1, :])...], [(b[2, :])...]))
end

permutedims was the big win for me here - I naively expected to be able to transpose an Array but that ends up with some LinearAlgebra.Adjoint mess and I got confused

[1 2; 3 4]

## 2×2 Array{Int64,2}:
##  1  2
##  3  4


[1 2; 3 4]'

## 2×2 Adjoint{Int64,Array{Int64,2}}:
##  1  3
##  2  4

Anyway, this appears to be able to take the intersection of two Arrays. Let’s plot it!

t1 = interPoints(vertices(Point(2,8), 0.5, 5), 10);
t2 = interPoints(vertices(Point(1,7), 0.5, 5), 10);
tx = intersection(t1, t2);

plot(t1[1,:], t1[2,:], xlim = (0,3.5), ylim = (6,9), ratio = 1)
plot!(t2[1,:], t2[2,:])
plot!(tx[1,:], tx[2,:])

Perfect! Now I just need to do it a bunch more times (at different ‘speeds’) and animate it.

I originally worked out the array math by hand and found a suitable number of points to plot for any given polygon and which multiplicative factors I could use, then I worked backwards to formalise it into a function

"""

    speed_factor(poly, speed)

# Arguments
- `poly::Array`: Array of `Point`s representing a polygon
- `speed::Real`: multiplicative factor representing the number of times a polygon should be traversed
"""
function speed_factor(poly::Array, speed::Real)
    if (speed % 1 == 0)
        res = repeat(poly, outer = (1,Int(speed)))
    else 
        n = Int(floor(speed / 1))
        res = repeat(poly, outer=(1,n))
        n_rem = Int(speed*size(poly,2)-size(res,2))
        res = hcat(res, poly[:,1:n_rem])
    end
    res
end

If I create a polygon of 72 interpolated points, I can create another with the same number of points but with larger gaps between them. This means the ‘faster’ polygon will loop around some n>1 number of times.

r = 0.4; # circumradius for the polygon
d = 3;   # number of vertices

# Both produce a 2x72 Array{Float64,2}
tx1 = interPoints(vertices(Point(2,6), r, d), 24)
tx2 = speed_factor(interPoints(vertices(Point(3,6), r, d), 16) , 1.5)
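The point counting is easier to see with plain lists. The Python sketch below follows the same idea as `speed_factor` (repeat the path for each whole lap, then append the leftover fraction of a lap). It uses `round` to guard against floating-point error in `speed * len(points)`, which is a choice made for this example rather than something the Julia code does.

```python
def speed_factor(points, speed):
    """Traverse `points` `speed` times, where speed may be fractional."""
    whole = int(speed)                    # number of complete laps
    total = round(speed * len(points))    # points wanted overall
    extra = total - whole * len(points)   # points in the final part-lap
    return points * whole + points[:extra]


lap = list(range(48))          # stand-in for a 48-point polygon path
fast = speed_factor(lap, 1.5)  # one and a half laps
print(len(fast))
```

The faster paths start with fewer points per lap (48, 36, 30, 24) so that every path ends up with the same 72 frames.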

I can create a series of these, say, at speeds of 1, 1.5, 2, 2.4, and 3. These are just nice numbers which each divide the largest number of points (72) into a whole number of points per lap (72, 48, 36, 30, and 24)

## n = 3
r = 0.4;
d = 3;

tx1 = interPoints(vertices(Point(2,6), r, d), 24)
tx2 = speed_factor(interPoints(vertices(Point(3,6), r, d), 16), 1.5) 
tx3 = speed_factor(interPoints(vertices(Point(4,6), r, d), 12), 2)
tx4 = speed_factor(interPoints(vertices(Point(5,6), r, d), 10), 2.4)
tx5 = speed_factor(interPoints(vertices(Point(6,6), r, d), 8), 3)

ty1 = interPoints(vertices(Point(1,5), r, d), 24)
ty2 = speed_factor(interPoints(vertices(Point(1,4), r, d), 16), 1.5)
ty3 = speed_factor(interPoints(vertices(Point(1,3), r, d), 12), 2)
ty4 = speed_factor(interPoints(vertices(Point(1,2), r, d), 10), 2.4)
ty5 = speed_factor(interPoints(vertices(Point(1,1), r, d), 8), 3)

The variable name is arbitrary, but these are a sequence of polygons along the x and y axes of some plot area.

One thing that I really like about Julia is that anything can be in an Array (similar to lists in R) so I can combine these groups of points into an Array of Arrays

allx = [tx1, tx2, tx3, tx4, tx5]
ally = [ty1, ty2, ty3, ty4, ty5]

Now, how to calculate all the intersections? Julia of course does “broadcasting” where we can take some operation and (in R parlance) “vectorize it”. That initially led me to

intersection.(allx, ally)

which does indeed do that - it produces a 5-element Array{Array{Float64,2},1} but that’s not what I wanted… this only calculates the ‘diagonal’ of intersection(tx1, ty1), intersection(tx2, ty2), …

Thankfully, Julia also has array comprehensions, so the full ‘matrix’ of intersections is actually

allint = [intersection(x, y) for x in allx, y in ally]

which produces a 5×5 Array{Array{Float64,2},2} - the full matrix! With that in place, we now have all the pieces we need, so we just need to plot them.
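The difference between the two approaches is pairing versus crossing. A small Python sketch with toy data shows it: `zip` plays the role of the broadcast call and only yields the diagonal, while a nested comprehension yields every combination. The paths here are invented two-point examples, not the real polygons.

```python
def intersection(a, b):
    """Pair the x of each point in `a` with the y of the matching point in `b`."""
    return [(ax, by) for (ax, _), (_, by) in zip(a, b)]


allx = [[(1, 9), (2, 9)], [(3, 9), (4, 9)]]  # toy paths along the top
ally = [[(0, 5), (0, 6)], [(0, 7), (0, 8)]]  # toy paths down the side

# Element-wise pairing: only (x1, y1), (x2, y2), ... are combined.
diagonal = [intersection(x, y) for x, y in zip(allx, ally)]

# Every x path against every y path: grid[i][j] combines allx[i] and ally[j].
grid = [[intersection(x, y) for y in ally] for x in allx]

print(len(diagonal), len(grid), len(grid[0]))
```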

The following sets up a plot on every ‘timestep’ (one per point in the interpolation) where it redraws the canvas, with the progressive drawing of each polygon and the intersections, plus some tracking lines along the x and y extractions. One of the very neat things I entirely failed to appreciate earlier was the concept of iterable objects - Julia knows that if I ask for x in obj I want to iterate over all the elements

bbox = Point(6.5,6.5);

anim3 = @animate for t in 1:size(tx1,2)
    plot(xlim=(0,bbox[1]), ylim=(0,bbox[2]), 
        legend=false, ratio=1, axis=nothing, border=:none, 
        background_color="black", size=(1200,1200))
    for p in 1:size(allx,1)
        plot!(allx[p][1,1:t], allx[p][2,1:t], color=p, linewidth=6)
        plot!(ally[p][1,1:t], ally[p][2,1:t], color=p, linewidth=6)
        
        plot!([allx[p][1,t], allx[p][1,t]], [0.5, allx[p][2,t]], color="grey", alpha=0.5, linewidth=5)
        plot!([ally[p][1,t], bbox[1]], [ally[p][2,t], ally[p][2,t]], color="grey", alpha=0.5, linewidth=5)
    end
    for p in allint
        plot!(p[1,1:t], p[2,1:t], color="blue", linewidth=5)
    end
end

gif(anim3, "n3.gif", fps=12)

And, finally, the result

With all that in place, it’s reasonably straightforward to adapt this to other polygons. For n=4

r = 0.4;
d = 4;

tx1 = interPoints(vertices(Point(2,6), r, d), 24)
tx2 = speed_factor(interPoints(vertices(Point(3,6), r, d), 16), 1.5)
tx3 = speed_factor(interPoints(vertices(Point(4,6), r, d), 12), 2)
tx4 = speed_factor(interPoints(vertices(Point(5,6), r, d), 10), 2.4)
tx5 = speed_factor(interPoints(vertices(Point(6,6), r, d), 8), 3)

ty1 = interPoints(vertices(Point(1,5), r, d), 24)
ty2 = speed_factor(interPoints(vertices(Point(1,4), r, d), 16), 1.5)
ty3 = speed_factor(interPoints(vertices(Point(1,3), r, d), 12), 2)
ty4 = speed_factor(interPoints(vertices(Point(1,2), r, d), 10), 2.4)
ty5 = speed_factor(interPoints(vertices(Point(1,1), r, d), 8), 3)

allx = [tx1, tx2, tx3, tx4, tx5]
ally = [ty1, ty2, ty3, ty4, ty5]
allint = [intersection(x, y) for x in allx, y in ally]

bbox = Point(6.5,6.5);

anim4 = @animate for t in 1:size(tx1,2)
    plot(xlim=(0,bbox[2]), ylim=(0,bbox[2]), 
        legend=false, ratio=1, axis=nothing, border=:none, 
        background_color="black", size=(1200,1200))
    for p in 1:size(allx,1)
        plot!(allx[p][1,1:t], allx[p][2,1:t], color=p, linewidth=6)
        plot!(ally[p][1,1:t], ally[p][2,1:t], color=p, linewidth=6)
        
        plot!([allx[p][1,t], allx[p][1,t]], [0.5, allx[p][2,t]], color="grey", alpha=0.5, linewidth=5)
        plot!([ally[p][1,t], bbox[2]], [ally[p][2,t], ally[p][2,t]], color="grey", alpha=0.5, linewidth=5)
    end
    for p in allint
        plot!(p[1,1:t], p[2,1:t], color="blue", linewidth=5)
    end
end

gif(anim4, "n4.gif", fps=12)

n=5 (very similar code)

and n=6

I was extremely happy to see these come together, and I’m genuinely surprised by how little code it took. I could certainly imagine trying to do the same in R, but I have doubts that it would come together quite so cleanly.

This is definitely still part of my journey towards learning Julia, so if there’s something in here you can spot that I could have done better, I do encourage you to let me know! Either here in the comments or on Twitter.

The code for generating all of this can be found here.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 21.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2022-05-12                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  blogdown      1.8     2022-02-16 [1] CRAN (R 4.1.2)
##  bookdown      0.24    2021-09-02 [1] CRAN (R 4.1.2)
##  brio          1.1.1   2021-01-20 [3] CRAN (R 4.0.3)
##  bslib         0.3.1   2021-10-06 [1] CRAN (R 4.1.2)
##  cachem        1.0.3   2021-02-04 [3] CRAN (R 4.0.3)
##  callr         3.7.0   2021-04-20 [1] CRAN (R 4.1.2)
##  cli           3.2.0   2022-02-14 [1] CRAN (R 4.1.2)
##  crayon        1.5.0   2022-02-14 [1] CRAN (R 4.1.2)
##  desc          1.4.1   2022-03-06 [1] CRAN (R 4.1.2)
##  devtools      2.4.3   2021-11-30 [1] CRAN (R 4.1.2)
##  digest        0.6.27  2020-10-24 [3] CRAN (R 4.0.3)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.2)
##  evaluate      0.14    2019-05-28 [3] CRAN (R 4.0.1)
##  fastmap       1.1.0   2021-01-25 [3] CRAN (R 4.0.3)
##  fs            1.5.0   2020-07-31 [3] CRAN (R 4.0.2)
##  glue          1.6.1   2022-01-22 [1] CRAN (R 4.1.2)
##  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.1.2)
##  jsonlite      1.7.2   2020-12-09 [3] CRAN (R 4.0.3)
##  JuliaCall     0.17.4  2021-05-16 [1] CRAN (R 4.1.2)
##  knitr         1.37    2021-12-16 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
##  magrittr      2.0.1   2020-11-17 [3] CRAN (R 4.0.3)
##  memoise       2.0.0   2021-01-26 [3] CRAN (R 4.0.3)
##  pkgbuild      1.2.0   2020-12-15 [3] CRAN (R 4.0.3)
##  pkgload       1.2.4   2021-11-30 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.5.2   2021-04-30 [1] CRAN (R 4.1.2)
##  ps            1.5.0   2020-12-05 [3] CRAN (R 4.0.3)
##  purrr         0.3.4   2020-04-17 [3] CRAN (R 4.0.1)
##  R6            2.5.0   2020-10-28 [3] CRAN (R 4.0.2)
##  Rcpp          1.0.6   2021-01-15 [3] CRAN (R 4.0.3)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.0.1   2022-02-03 [1] CRAN (R 4.1.2)
##  rmarkdown     2.13    2022-03-10 [1] CRAN (R 4.1.2)
##  rprojroot     2.0.2   2020-11-15 [3] CRAN (R 4.0.3)
##  rstudioapi    0.13    2020-11-12 [3] CRAN (R 4.0.3)
##  sass          0.4.0   2021-05-12 [1] CRAN (R 4.1.2)
##  sessioninfo   1.1.1   2018-11-05 [3] CRAN (R 4.0.1)
##  stringi       1.5.3   2020-09-09 [3] CRAN (R 4.0.2)
##  stringr       1.4.0   2019-02-10 [3] CRAN (R 4.0.1)
##  testthat      3.1.2   2022-01-20 [1] CRAN (R 4.1.2)
##  usethis       2.1.5   2021-12-09 [1] CRAN (R 4.1.2)
##  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
##  xfun          0.30    2022-03-02 [1] CRAN (R 4.1.2)
##  yaml          2.2.1   2020-02-01 [3] CRAN (R 4.0.1)
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

Where for (loop) ARt Thou?

website@jcarroll.com.au (Jonathan Carroll) — Fri, 22 Apr 2022 00:00:00 +0000

I’ve long been interested in exactly how R works - not quite enough for me to learn all the internals, but I was surprised that I could not find a clear guide towards exactly how vectorization works at the deepest level.

Let’s say we want to add two vectors which we’ve defined as x and y

x <- c(2, 4, 6)
y <- c(1, 3, 2)

One way to do this (the verbose, elementwise way) would be to add each pair of elements

c(x[1] + y[1], x[2] + y[2], x[3] + y[3])

## [1] 3 7 8

but if you are familiar with not repeating yourself, you might write this in a loop. Best practice involves pre-filling the result to the correct size

res <- c(NA, NA, NA)
for (i in 1:3) {
  res[i] <- x[i] + y[i]
}
res

## [1] 3 7 8

There, we wrote a for loop.

Now, if you’ve ever seen R tutorials or discussions, you’ve probably had it drilled into you that this is “bad” and should be replaced with something else. One of those options is an apply family function

plus <- function(i) {
  x[i] + y[i]
}
sapply(1:3, plus)

## [1] 3 7 8

or something ‘tidier’

purrr::map_dbl(1:3, plus)

## [1] 3 7 8

(yes, yes, these functions are ‘bad’ because they don’t take x or y as arguments) but this still performs the operation elementwise. If you’ve seen even more R tutorials/discussions you’ve probably seen that vectorization is very handy - the R function + knows what to do with objects that aren’t just a single value, and does what you might expect

x + y

## [1] 3 7 8

Now, if you’ve really read a lot about R, you’ll know that ‘under the hood’ a for-loop is involved in every one of these, but it’s “lower down”, “at the C level”. Jenny Bryan makes the point that “Of course someone has to write loops / It doesn’t have to be you” and for this reason, vectorization in R is of great benefit.

So, there is a loop, but where exactly does that happen?

At some point, the computer needs to add the elements of x to the elements of y, and the simplest version of this happens one element at a time, in a loop. There’s a big sidetrack here about SIMD which I’ll try to avoid, but I will mention that the Microsoft fork of R (artist, formerly known as Revolution R) running on Intel chips can do SIMD in MKL.

So, let’s start at the operator.

`+`

## function (e1, e2)  .Primitive("+")

Digging into primitives is a little tricky, but {pryr} can help

pryr::show_c_source(.Primitive("+"))

+ is implemented by do_arith with op = PLUSOP

We can browse a copy of the source for do_arith (in arithmetic.c) here where we see some logic paths for scalars and vectors. Let’s assume we’re working with our example which has length(x) == length(y) > 1. With two non-scalar arguments

if !IS_SCALAR and argc == length(arg) == 2

This leads us to call R_binary

Depending on the class of the arguments, we need to call different functions, but for the sake of our example let’s say we have non-integer real numbers so we fork to real_binary. This takes a code argument for which type of operation we’re performing, and in our case it’s PLUSOP (noted above). There’s a case branch for this in which, provided the arguments are of the same length (n1 == n2), we call

R_ITERATE_CHECK(NINTERRUPT, n, i, da[i] = dx[i] + dy[i];);

That’s starting to look a lot like a loop - there’s an iterator i and a loop body, and we’re handing both off to something else (a macro rather than a function, as it turns out).

This jumps us over to a different file where we see that LOOP_WITH_INTERRUPT_CHECK definitely performs some sort of loop. It takes the body above along with R_ITERATE_CORE, which is finally the actual loop!

#define R_ITERATE_CORE(n, i, loop_body) do {    \
   for (; i < n; ++i) { loop_body } \
} while (0)

so, that’s where the actual loop in a vectorized R call happens! ALL that sits behind the innocent-looking +.
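To make the shape of that C code concrete, here is a rough R sketch of the same-length branch. It is an illustration only: real_binary_sketch() is a made-up name, and the real code works on raw C arrays with interrupt checks and recycling that are ignored here.

```r
# Hypothetical R rendering of the n1 == n2 branch for PLUSOP
real_binary_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))  # only the equal-length case is mimicked
  n <- length(x)
  ans <- numeric(n)        # plays the role of the pre-allocated result 'da'
  i <- 1                   # R is 1-indexed; the C loop starts at 0
  while (i <= n) {         # stands in for: for (; i < n; ++i)
    ans[i] <- x[i] + y[i]  # the loop body: da[i] = dx[i] + dy[i]
    i <- i + 1
  }
  ans
}

x <- c(2, 4, 6)
y <- c(1, 3, 2)
identical(real_binary_sketch(x, y), x + y)
```

The point is just that the vectorized x + y and the explicit loop compute the same thing; the difference is where the loop lives.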

That was thoroughly satisfying, but I did originally have in mind comparing R to another language - one where loops aren’t frowned upon because of performance, but rather encouraged… How do Julia loops differ?

Julia is not a vectorized language per se, but it has a neat ability to “vectorize” any operation, though in Julia syntax it’s “broadcasting”.

Simple addition can combine scalar values

3+4

## 7

Julia actually has scalar values (in R, even a single value is just a vector of length 1) so a single value could be

typeof(3)

## Int64

whereas several values need to be an Array, even if it only has 1 dimension

Vector{Int64}([1, 2, 3])

## 3-element Array{Int64,1}:
##  1
##  2
##  3

Trying to add two Arrays does work

[1, 2, 3] + [4, 5, 6]

## 3-element Array{Int64,1}:
##  5
##  7
##  9

but only because a specific method has been written for this case, i.e.

methods(+, (Array, Array))

## # 1 method for generic function "+":
## [1] +(A::Array, Bs::Array...) in Base at arraymath.jl:43

One thing I particularly like is that we can see exactly which method was called using the @which macro

@which [1, 2, 3, 4] + [1, 2, 3, 4]

+(A::Array, Bs::Array...) in Base at arraymath.jl:43

something that I really wish was easier to do in R. The @edit macro even jumps us right into the actual code for this dispatched call.

This ‘add vectors’ problem can be solved through broadcasting, which performs an operation elementwise

[1, 2, 3] .+ [4, 5, 6]

## 3-element Array{Int64,1}:
##  5
##  7
##  9

The fun fact about this I recently learned was that broadcasting works on any operation, even if that’s the pipe itself

["a", "list", "of", "strings"] .|> [uppercase, reverse, titlecase, length]

## 4-element Array{Any,1}:
##   "A"
##   "tsil"
##   "Of"
##  7

Back to our loops, the method for + on two Arrays points us to arraymath.jl (linked to current relevant line) which contains

function +(A::Array, Bs::Array...)
    for B in Bs
        promote_shape(A, B) # check size compatibility
    end
    broadcast_preserving_zero_d(+, A, Bs...)
end

The last part of that is the meaningful part, and that leads to Broadcast.broadcast_preserving_zero_d.

This starts to get out of my depth, but essentially

@inline function broadcast_preserving_zero_d(f, As...)
    bc = broadcasted(f, As...)
    r = materialize(bc)
    return length(axes(bc)) == 0 ? fill!(similar(bc, typeof(r)), r) : r
end

@inline broadcast(f, t::NTuple{N,Any}, ts::Vararg{NTuple{N,Any}}) where {N} = map(f, t, ts...)

involves a map operation to achieve the broadcasting.

Perhaps that’s a problem to tackle when I’m better at digging through Julia.

As always, comments, suggestions, and anything else welcome!

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 21.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2022-04-22                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  blogdown      1.8     2022-02-16 [1] CRAN (R 4.1.2)
##  bookdown      0.24    2021-09-02 [1] CRAN (R 4.1.2)
##  brio          1.1.1   2021-01-20 [3] CRAN (R 4.0.3)
##  bslib         0.3.1   2021-10-06 [1] CRAN (R 4.1.2)
##  cachem        1.0.3   2021-02-04 [3] CRAN (R 4.0.3)
##  callr         3.7.0   2021-04-20 [1] CRAN (R 4.1.2)
##  cli           3.2.0   2022-02-14 [1] CRAN (R 4.1.2)
##  crayon        1.5.0   2022-02-14 [1] CRAN (R 4.1.2)
##  desc          1.4.1   2022-03-06 [1] CRAN (R 4.1.2)
##  devtools      2.4.3   2021-11-30 [1] CRAN (R 4.1.2)
##  digest        0.6.27  2020-10-24 [3] CRAN (R 4.0.3)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.2)
##  evaluate      0.14    2019-05-28 [3] CRAN (R 4.0.1)
##  fastmap       1.1.0   2021-01-25 [3] CRAN (R 4.0.3)
##  fs            1.5.0   2020-07-31 [3] CRAN (R 4.0.2)
##  glue          1.6.1   2022-01-22 [1] CRAN (R 4.1.2)
##  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.1.2)
##  jsonlite      1.7.2   2020-12-09 [3] CRAN (R 4.0.3)
##  JuliaCall   * 0.17.4  2021-05-16 [1] CRAN (R 4.1.2)
##  knitr         1.37    2021-12-16 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
##  magrittr      2.0.1   2020-11-17 [3] CRAN (R 4.0.3)
##  memoise       2.0.0   2021-01-26 [3] CRAN (R 4.0.3)
##  pkgbuild      1.2.0   2020-12-15 [3] CRAN (R 4.0.3)
##  pkgload       1.2.4   2021-11-30 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.5.2   2021-04-30 [1] CRAN (R 4.1.2)
##  ps            1.5.0   2020-12-05 [3] CRAN (R 4.0.3)
##  purrr         0.3.4   2020-04-17 [3] CRAN (R 4.0.1)
##  R6            2.5.0   2020-10-28 [3] CRAN (R 4.0.2)
##  Rcpp          1.0.6   2021-01-15 [3] CRAN (R 4.0.3)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.0.1   2022-02-03 [1] CRAN (R 4.1.2)
##  rmarkdown     2.13    2022-03-10 [1] CRAN (R 4.1.2)
##  rprojroot     2.0.2   2020-11-15 [3] CRAN (R 4.0.3)
##  rstudioapi    0.13    2020-11-12 [3] CRAN (R 4.0.3)
##  sass          0.4.0   2021-05-12 [1] CRAN (R 4.1.2)
##  sessioninfo   1.1.1   2018-11-05 [3] CRAN (R 4.0.1)
##  stringi       1.5.3   2020-09-09 [3] CRAN (R 4.0.2)
##  stringr       1.4.0   2019-02-10 [3] CRAN (R 4.0.1)
##  testthat      3.1.2   2022-01-20 [1] CRAN (R 4.1.2)
##  usethis       2.1.5   2021-12-09 [1] CRAN (R 4.1.2)
##  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
##  xfun          0.30    2022-03-02 [1] CRAN (R 4.1.2)
##  yaml          2.2.1   2020-02-01 [3] CRAN (R 4.0.1)
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

Codegolf - Lisp Edition

website@jcarroll.com.au (Jonathan Carroll) — Sat, 02 Apr 2022 00:00:00 +0000

I occasionally like a round of code-golf (e.g. recently) and I try to solve these with R, but this one gave me some hope that I could make use of a really cool feature I knew about in common lisp.

lisp is timeless. https://xkcd.com/297

I have occasionally tinkered with lisp - initially because I learned emacs, but later because it’s really interesting and does teach a lot about quoting. Practical Common Lisp is a book I’m still (slowly) making my way through, but it’s a great read so far.

There’s a lot you can do with lisp - you can even connect it up to R (sort of).

Anyway, back to the code-golf. The problem as stated:

It’s 2050, and people have decided to write numbers in a new way. They want less to memorize, and number to be able to be written quicker. For every place value(ones, tens, hundreds, etc.) the number is written with the number in that place, a hyphen, and the place value name. “zero” and it’s place value does not need to be written. The number 0 and negative numbers do not need to be handled, so don’t worry about those.

Input: The input will be a positive integer up to 3 digits.

Output: The output should be a string that looks like something below.

Test cases:

56 => five-ten six
11 => ten one
72 => seven-ten two
478 => four-hundred seven-ten eight
754 => seven-hundred five-ten four
750 => seven-hundred five-ten
507 => five-hundred seven

On its own, this seems like it’s going to need some sort of mapping from digits to words. R does have one of those in the {english} package (I know this because I used it in the last example in this post) but code-golf doesn’t really allow you to use external packages (mostly).

What gave me hope is something I really wish R had natively, and that’s the "~R" option of lisp’s format function

(format nil "~R" 14000605)

"fourteen million six hundred five"

This works really nicely, and seemed like an efficient route to a code-golf solution.

What was missing from this? For starters, we explicitly need the tens digits to be of the form ‘n-ten’, which isn’t the case here

(format nil "~R" 478)

"four hundred seventy-eight"

I considered trying to do a text replacement of “ty” to “-ten” but, alas,

(format nil "~R" 56)

"fifty-six"

is going to break that pattern.

The alternative, I suppose, is to split out the digits and add the “-hundred” and “-ten” parts. This took me down a rabbit hole, but eventually I managed to pull together enough Stack Overflow answers to achieve

(map 'list #'digit-char-p (prin1-to-string 458))

(4 5 8)

There’s (hopefully) a faster way to do that, but it works.

Converting each of these digits to words means applying the format in a map. That… also took a while to figure out, and this is probably overkill

(mapcar (lambda (it) (format nil "~R" it)) (map 'list #'digit-char-p (prin1-to-string 458)))

("four" "five" "eight")

Pasting together this result with a list of suffixes requires the concatenate operator, again in a map, but with a lambda function to do this pairwise, otherwise it just appends the lists

(mapcar (lambda(j k) (concatenate 'string j k)) (mapcar (lambda (it) (format nil "~R" it)) (map 'list #'digit-char-p (prin1-to-string 458))) '("-hundred" "-ten" ""))

("four-hundred" "five-ten" "eight")

Nearly there! Or so I thought. How does this suffixing work when there isn’t a hundred digit, e.g. 21?

(print (mapcar (lambda(j k) (concatenate 'string j k)) (mapcar (lambda (it) (format nil "~R" it)) (map 'list #'digit-char-p (prin1-to-string 21))) '("-hundred" "-ten" "")))

("two-hundred" "one-ten")

Well, that’s not right. But lisp seems okay with having the unequal sized lists. How about starting from the ones digit (i.e. reversed)? That means reversing the split digits list and reversing the suffixes list, doing the operations, then reversing the result

(print (reverse (mapcar (lambda(j k) (concatenate 'string j k)) (mapcar (lambda (it) (format nil "~R" it)) (reverse (map 'list #'digit-char-p (prin1-to-string 21)))) (reverse '("-hundred" "-ten" "")))))

("two-ten" "one")

Fantastic! And the larger digits?

(print (reverse (mapcar (lambda(j k) (concatenate 'string j k)) (mapcar (lambda (it) (format nil "~R" it)) (reverse (map 'list #'digit-char-p (prin1-to-string 458)))) (reverse '("-hundred" "-ten" "")))))

("four-hundred" "five-ten" "eight")

Woohoo!

The last step is to manually reverse the suffix list, make it a function, and try out the test cases, which you can try out for yourself here

(defun f(n) (reverse (mapcar (lambda(j k) (concatenate 'string j k)) (mapcar (lambda (it) (format nil "~R" it)) (reverse (map 'list #'digit-char-p (prin1-to-string n)))) '("" "-ten" "-hundred"))))

(print (f 56))
(print (f 11))
(print (f 72))
(print (f 478))
(print (f 754))
(print (f 750))
(print (f 507))

("five-ten" "six") 
("one-ten" "one") 
("seven-ten" "two") 
("four-hundred" "seven-ten" "eight") 
("seven-hundred" "five-ten" "four") 
("seven-hundred" "five-ten" "zero") 
("five-hundred" "zero-ten" "seven")

That’s somewhat close to what the challenge wants, and the ‘Attempt This Online’ tool linked above claims 198 bytes for this solution, but it’s not quite there yet:

these should be a single string per test, which I presume involves collapsing the list into a 'string
I still have the "zero-ten" and "zero" entries which break the tests
"one" should only appear in the ones entry, so 11 should produce "ten one".

At this point, it was 1am, and I figured I’d learned enough for the day. If anyone would like to improve on this solution, please be my guest.
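For comparison, here is an ungolfed R sketch that deals with those three remaining issues. It is not a translation of the lisp above: it leans on a hand-written lookup vector in place of "~R", and new_number() is a hypothetical name chosen for illustration.

```r
# Hypothetical helper: write a 1-3 digit positive integer in the 'new' style
new_number <- function(n) {
  words <- c("one", "two", "three", "four", "five",
             "six", "seven", "eight", "nine")
  # split into digits, ones digit first
  digits <- rev(as.integer(strsplit(as.character(n), "")[[1]]))
  suffix <- c("", "-ten", "-hundred")[seq_along(digits)]
  keep <- digits > 0                    # zero places are not written
  parts <- paste0(words[digits[keep]], suffix[keep])
  parts <- sub("^one-", "", parts)      # "one-ten" becomes "ten"
  paste(rev(parts), collapse = " ")     # a single string, largest place first
}

sapply(c(56, 11, 72, 478, 754, 750, 507), new_number)
```

Working from the ones digit upwards is the same trick as reversing the lists in the lisp version, so the suffixes line up regardless of how many digits there are.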

What’s also great to see is that there’s a Julia solution now!

!n=n<10 ? split(" one two three four five six seven eight nine"," ")[n+1] :
n<20 ? "ten "*!(n-10) :
n<(H=100) ? !(n÷10)*"-"*!(10+n%10) :
n<2H ? "hundred "*!(n-H) :
!(n÷H)*"-"*!(H+n%H)

## ! (generic function with 1 method)

tests = [1; 11; 56; 72; 478; 754; 750; 507];
for t in tests
    println(t => !t)
end

## 1 => "one"
## 11 => "ten one"
## 56 => "five-ten six"
## 72 => "seven-ten two"
## 478 => "four-hundred seven-ten eight"
## 754 => "seven-hundred five-ten four"
## 750 => "seven-hundred five-ten "
## 507 => "five-hundred seven"

I’ll be trying to make sense of this for sure. You can try it out yourself here

As usual, the journey was the important part of this - I got to play with and learn some more lisp. There’s no prize for the challenge aside from arbitrary internet points, so I’m entirely happy with how this turned out.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 21.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2022-04-02                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [3] CRAN (R 4.0.1)
##  blogdown      1.8     2022-02-16 [1] CRAN (R 4.1.2)
##  bookdown      0.24    2021-09-02 [1] CRAN (R 4.1.2)
##  brio          1.1.1   2021-01-20 [3] CRAN (R 4.0.3)
##  bslib         0.3.1   2021-10-06 [1] CRAN (R 4.1.2)
##  cachem        1.0.3   2021-02-04 [3] CRAN (R 4.0.3)
##  callr         3.7.0   2021-04-20 [1] CRAN (R 4.1.2)
##  cli           3.2.0   2022-02-14 [1] CRAN (R 4.1.2)
##  crayon        1.5.0   2022-02-14 [1] CRAN (R 4.1.2)
##  DBI           1.1.1   2021-01-15 [3] CRAN (R 4.0.3)
##  desc          1.4.1   2022-03-06 [1] CRAN (R 4.1.2)
##  devtools      2.4.3   2021-11-30 [1] CRAN (R 4.1.2)
##  digest        0.6.27  2020-10-24 [3] CRAN (R 4.0.3)
##  dplyr       * 1.0.8   2022-02-08 [1] CRAN (R 4.1.2)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.2)
##  evaluate      0.14    2019-05-28 [3] CRAN (R 4.0.1)
##  fansi         0.4.2   2021-01-15 [3] CRAN (R 4.0.3)
##  fastmap       1.1.0   2021-01-25 [3] CRAN (R 4.0.3)
##  fs            1.5.0   2020-07-31 [3] CRAN (R 4.0.2)
##  generics      0.1.0   2020-10-31 [3] CRAN (R 4.0.3)
##  glue          1.6.1   2022-01-22 [1] CRAN (R 4.1.2)
##  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.1.2)
##  jsonlite      1.7.2   2020-12-09 [3] CRAN (R 4.0.3)
##  JuliaCall     0.17.4  2021-05-16 [1] CRAN (R 4.1.2)
##  knitr         1.37    2021-12-16 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
##  magrittr      2.0.1   2020-11-17 [3] CRAN (R 4.0.3)
##  memoise       2.0.0   2021-01-26 [3] CRAN (R 4.0.3)
##  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
##  pkgbuild      1.2.0   2020-12-15 [3] CRAN (R 4.0.3)
##  pkgconfig     2.0.3   2019-09-22 [3] CRAN (R 4.0.1)
##  pkgload       1.2.4   2021-11-30 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.5.2   2021-04-30 [1] CRAN (R 4.1.2)
##  ps            1.5.0   2020-12-05 [3] CRAN (R 4.0.3)
##  purrr         0.3.4   2020-04-17 [3] CRAN (R 4.0.1)
##  R6            2.5.0   2020-10-28 [3] CRAN (R 4.0.2)
##  Rcpp          1.0.6   2021-01-15 [3] CRAN (R 4.0.3)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.0.1   2022-02-03 [1] CRAN (R 4.1.2)
##  rmarkdown     2.13    2022-03-10 [1] CRAN (R 4.1.2)
##  rprojroot     2.0.2   2020-11-15 [3] CRAN (R 4.0.3)
##  rstudioapi    0.13    2020-11-12 [3] CRAN (R 4.0.3)
##  sass          0.4.0   2021-05-12 [1] CRAN (R 4.1.2)
##  sessioninfo   1.1.1   2018-11-05 [3] CRAN (R 4.0.1)
##  stringi       1.5.3   2020-09-09 [3] CRAN (R 4.0.2)
##  stringr       1.4.0   2019-02-10 [3] CRAN (R 4.0.1)
##  testthat      3.1.2   2022-01-20 [1] CRAN (R 4.1.2)
##  tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
##  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.1.2)
##  usethis       2.1.5   2021-12-09 [1] CRAN (R 4.1.2)
##  utf8          1.1.4   2018-05-24 [3] CRAN (R 4.0.2)
##  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.2)
##  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
##  xfun          0.30    2022-03-02 [1] CRAN (R 4.1.2)
##  yaml          2.2.1   2020-02-01 [3] CRAN (R 4.0.1)
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

Codegolfing Minecraft Lighting

website@jcarroll.com.au (Jonathan Carroll) — Sat, 26 Mar 2022 00:00:00 +0000

I occasionally like to participate in an odd sport known as ‘code golf’ where the aim is to write some code to achieve a given task using the smallest number of characters.

The traditional way to cheat at golf is to lower your score

R isn’t optimised for this in the slightest (why would it be?) and there are other languages which have expanded character sets which are, e.g. BQL, MATL, and 05AB1E.

These are typically short, contained problems which can be solved in a variety of ways, so they usually include some restrictions and test cases. Some are more amenable to using R, while others have better support in other languages.

This one caught my eye, partly because my kids are obsessed with Minecraft. The problem as stated:

Minecraft has a fairly unique lighting system. Each block’s light value is either one less than the brightest one surrounding it, or it is a light source itself. Your task is to write a method that takes in a 2D array of light source values, and then returns a 2D array with spread out lighting, where 0 is the minimum value.

Input1 = [
         [0, 0, 4, 0], 
         [0, 0, 0, 0], 
         [0, 2, 0, 0], 
         [0, 0, 0, 0]
        ]

Output1 = [
         [2, 3, 4, 3], 
         [1, 2, 3, 2], 
         [1, 2, 2, 1], 
         [0, 1, 1, 0]
        ]

Input2 = [
         [2, 0, 0, 3], 
         [0, 0, 0, 0], 
         [0, 0, 0, 0], 
         [0, 0, 0, 0]
        ]

Output2 = [
         [2, 1, 2, 3], 
         [1, 0, 1, 2], 
         [0, 0, 0, 1], 
         [0, 0, 0, 0]
        ]

Matrix operations? That sounds like something R can work with. I decided to have a go. There were already some answers using the golfing languages, and I can’t even read those, so those aren’t any help. There was at least one python answer, but I didn’t want to confuse myself trying to translate an existing answer when the tooling doesn’t quite work that way.

Defining the input matrices is straightforward enough

mtest1 <- matrix(c(0, 0, 4, 0,
                   0, 0, 0, 0,
                   0, 2, 0, 0,
                   0, 0, 0, 0), 4, 4, byrow = TRUE)
mtest1

##      [,1] [,2] [,3] [,4]
## [1,]    0    0    4    0
## [2,]    0    0    0    0
## [3,]    0    2    0    0
## [4,]    0    0    0    0

mtest2 <- matrix(c(2, 0, 0, 3,
                   0, 0, 0, 0,
                   0, 0, 0, 0,
                   0, 0, 0, 0), 4, 4, byrow = TRUE)
mtest2

##      [,1] [,2] [,3] [,4]
## [1,]    2    0    0    3
## [2,]    0    0    0    0
## [3,]    0    0    0    0
## [4,]    0    0    0    0

but do remember to set byrow = TRUE if you’re writing your matrix out … by rows.

I needed a way to identify the locations and values of these light sources. I know that which() can return array indices with arr.ind = TRUE so I tried that

which(mtest1 > 0, arr.ind = TRUE)

##      row col
## [1,]   3   2
## [2,]   1   3

I’ll also need the values at those sources

mtest1[mtest1 > 0]

## [1] 2 4

So far, so good. Now, I’ll need to spread ‘light’ out from each of those sources. That seems… trickier.

A few options came to mind, including a convolution operation, but I couldn’t get that to work. I eventually ended up writing a loop to set decreasing values along the row and column of each light source, forwards and backwards.

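# y: list of one-row data.frames, one per light source, with columns
#    l (light value), r (row), c (column); n: the dimension of the matrix
for (r in seq_along(y)) {
    q <- p <- y[[r]]
    # decreasing light along the row, to the right and then to the left
    q <- rbind(q, data.frame(l=p$l-seq_along(p$c:n)+1, r=p$r, c=p$c:n))
    q <- rbind(q, data.frame(l=p$l-seq_along(p$c:1)+1, r=p$r, c=p$c:1))
    # decreasing light along the column, downwards and then upwards
    q <- rbind(q, data.frame(l=p$l-seq_along(p$r:n)+1, r=p$r:n, c=p$c))
    q <- rbind(q, data.frame(l=p$l-seq_along(p$r:1)+1, r=p$r:1, c=p$c))
}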

This involved creating a data.frame of row and column values, plus the value of the light at that position. This isn’t efficient at all, and not just from the processing side - it uses a lot of characters.

One way to get around this in codegolf is to use a short alias to a longer named function, e.g. d = data.frame, b = rbind. This saves a lot of characters.

The idea of creating indices at which to set values comes from the fact that a matrix can be subset by another matrix that specifies the rows and columns. i.e.

## create a matrix
m = matrix(1:9, 3, 3, byrow = TRUE)
m

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

## specify the extraction of the elements as (1, 2) and (3, 2)
msub = matrix(c(1, 2, 3, 2), 2, 2, byrow = TRUE)
msub

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    2

m[msub]

## [1] 2 8

This can essentially ‘un-which()’ a matrix.
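The same index-matrix trick works for assignment, which is what makes the round trip possible. A minimal sketch, rebuilding the first test matrix from its which() output (src, idx, vals and rebuilt are illustrative names only):

```r
src <- matrix(c(0, 0, 4, 0,
                0, 0, 0, 0,
                0, 2, 0, 0,
                0, 0, 0, 0), 4, 4, byrow = TRUE)

idx  <- which(src > 0, arr.ind = TRUE)  # two-column matrix of (row, col)
vals <- src[src > 0]                    # values, in the same column-major order

rebuilt <- matrix(0, 4, 4)
rebuilt[idx] <- vals                    # assign by (row, col) pairs

identical(rebuilt, src)
```

Both which() and logical subsetting walk the matrix in column-major order, so the positions and values stay paired up.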

Once I had the reduced values in each direction away from a light source, for each light source, the last step was to combine these and take the maximum value at each element. Reduce() does this nicely with the function pmax() (parallel maximum which works across multiple vectors rather than the global maximum).
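As a small illustration of that combining step, here is a sketch with three made-up 2×2 ‘layers’ standing in for the per-source light matrices (the values are arbitrary):

```r
# Three hypothetical per-source layers
a <- matrix(c(3, 2, 1, 0), 2, 2)
b <- matrix(c(0, 1, 2, 3), 2, 2)
d <- matrix(c(1, 1, 1, 4), 2, 2)

# pmax() compares position by position; Reduce() folds it over the list
combined <- Reduce(pmax, list(a, b, d))
combined

# max() instead collapses everything to a single number
max(a, b, d)
```

Each cell of combined holds the brightest value seen at that position across the layers, and the matrix shape is carried over from the first layer.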

Lastly, a second pass using all these new points as light sources ensures that the ‘light’ is propagated in all directions.

The solution, as I had it, worked and produced the test case results, but it was not yet golfed.

To “golf” some R code there are some optimisations we can make.

Delete spaces where possible
Use = over <- to save a character each time
For R>=4.1, use the shorthand syntax for function()

f=\(x)x^2
f(3)

## [1] 9

Use T and F for TRUE and FALSE - generally inadvisable in regular code, but here they save some characters.
Use partial argument matching where possible - it’s a dangerous feature of the language, but you only “need” to use as many letters of a function argument as it takes to specify it uniquely, so

which(mtest1>0, arr.ind = TRUE)

##      row col
## [1,]   3   2
## [2,]   1   3

can be shortened to

which(mtest1>0,a=T)

##      row col
## [1,]   3   2
## [2,]   1   3

Create aliases to functions - just remember to alias the name of the function, not the call (with parentheses)

d=determinant.matrix
identical(
  d(mtest1),
  determinant.matrix(mtest1)
)

## [1] TRUE

Use the prefix notation version of functions which require many characters, e.g. if

if (3 > 2) {
  "res1"
} else {
  "res2"
}

`if`(3>2,"res1","res2")

keeping in mind that ifelse() is not a drop-in replacement: it returns a result shaped like its test argument (and it can evaluate both branches), which tripped me up

ifelse(TRUE, 1:4, 2)

## [1] 1
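A short side-by-side sketch of the difference (variable names are just for illustration): the prefix form of `if` hands back the selected branch untouched, while ifelse() produces something the same length as its test.

```r
# a length-1 test gives a length-1 result, so most of 1:4 is dropped
short <- ifelse(TRUE, 1:4, 2)

# the prefix form of `if` returns the whole selected branch
full <- `if`(TRUE, 1:4, 2)

# ifelse() comes into its own when the test itself is a vector
vec <- ifelse(c(TRUE, FALSE, TRUE), 1:3, 0)
```

Here short has length 1, full is the complete 1:4, and vec picks from 1:3 or 0 position by position.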

With all those in place, I landed at 377 characters for my solution. Certainly not great, considering the Python answer was ~200.

I really wanted a better way to “spread” the light out from a single point, but I wasn’t finding any nice solutions to that simpler sub-problem. A great way to get a solution is to ask other people, so I wrote a short post on my new mini blog asking the simpler question of how to achieve this. That cross-posts to Twitter, where June Choe provided a great outer() solution. I’d tried something like that but not so cleverly.

vx <- 4
vy <- 3
vv <- 5
n <- 8
outer(1:n, 1:n, function(x, y) pmax(vv - abs(x - vx) - abs(y - vy), 0))

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    0    1    2    1    0    0    0    0
## [2,]    1    2    3    2    1    0    0    0
## [3,]    2    3    4    3    2    1    0    0
## [4,]    3    4    5    4    3    2    1    0
## [5,]    2    3    4    3    2    1    0    0
## [6,]    1    2    3    2    1    0    0    0
## [7,]    0    1    2    1    0    0    0    0
## [8,]    0    0    1    0    0    0    0    0

This greatly improves R’s chances at solving this efficiently because now we can condense all the ‘spread light’ stuff into this one function, and because it’s not iterative, we can lapply() over the results.

The “final” version, after some more clean up, is an okay 182 characters

ls=\(m,w=T) {
  n=ncol(m)
  p=m>1
  i=which(p,a=T)
  y=lapply(1:nrow(i),\(j)outer(1:n,1:n,\(x,y)pmax(m[p][j]-abs(x-i[j,1])-abs(y-i[j,2]),0)))
  z=Reduce(pmax,y)
  `if`(w,ls(z,F),z)
}

ls(mtest1)

##      [,1] [,2] [,3] [,4]
## [1,]    2    3    4    3
## [2,]    1    2    3    2
## [3,]    1    2    2    1
## [4,]    0    1    1    0

ls(mtest2)

##      [,1] [,2] [,3] [,4]
## [1,]    2    1    2    3
## [2,]    1    0    1    2
## [3,]    0    0    0    1
## [4,]    0    0    0    0

I’m still of course interested if there are more optimisations to be made, so do let me know if you can spot any!

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.1.2 (2021-11-01)
##  os       Pop!_OS 21.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2022-03-26                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [3] CRAN (R 4.0.1)
##  blogdown      1.8     2022-02-16 [1] CRAN (R 4.1.2)
##  bookdown      0.24    2021-09-02 [1] CRAN (R 4.1.2)
##  brio          1.1.1   2021-01-20 [3] CRAN (R 4.0.3)
##  bslib         0.3.1   2021-10-06 [1] CRAN (R 4.1.2)
##  cachem        1.0.3   2021-02-04 [3] CRAN (R 4.0.3)
##  callr         3.7.0   2021-04-20 [1] CRAN (R 4.1.2)
##  cli           3.2.0   2022-02-14 [1] CRAN (R 4.1.2)
##  crayon        1.5.0   2022-02-14 [1] CRAN (R 4.1.2)
##  DBI           1.1.1   2021-01-15 [3] CRAN (R 4.0.3)
##  desc          1.4.1   2022-03-06 [1] CRAN (R 4.1.2)
##  devtools      2.4.3   2021-11-30 [1] CRAN (R 4.1.2)
##  digest        0.6.27  2020-10-24 [3] CRAN (R 4.0.3)
##  dplyr       * 1.0.8   2022-02-08 [1] CRAN (R 4.1.2)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.2)
##  evaluate      0.14    2019-05-28 [3] CRAN (R 4.0.1)
##  fansi         0.4.2   2021-01-15 [3] CRAN (R 4.0.3)
##  fastmap       1.1.0   2021-01-25 [3] CRAN (R 4.0.3)
##  forcats     * 0.5.1   2021-01-27 [3] CRAN (R 4.0.3)
##  fs            1.5.0   2020-07-31 [3] CRAN (R 4.0.2)
##  generics      0.1.0   2020-10-31 [3] CRAN (R 4.0.3)
##  glue          1.6.1   2022-01-22 [1] CRAN (R 4.1.2)
##  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.1.2)
##  jsonlite      1.7.2   2020-12-09 [3] CRAN (R 4.0.3)
##  knitr         1.37    2021-12-16 [1] CRAN (R 4.1.2)
##  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
##  magrittr      2.0.1   2020-11-17 [3] CRAN (R 4.0.3)
##  memoise       2.0.0   2021-01-26 [3] CRAN (R 4.0.3)
##  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
##  pkgbuild      1.2.0   2020-12-15 [3] CRAN (R 4.0.3)
##  pkgconfig     2.0.3   2019-09-22 [3] CRAN (R 4.0.1)
##  pkgload       1.2.4   2021-11-30 [1] CRAN (R 4.1.2)
##  prettyunits   1.1.1   2020-01-24 [3] CRAN (R 4.0.1)
##  processx      3.5.2   2021-04-30 [1] CRAN (R 4.1.2)
##  ps            1.5.0   2020-12-05 [3] CRAN (R 4.0.3)
##  purrr         0.3.4   2020-04-17 [3] CRAN (R 4.0.1)
##  R6            2.5.0   2020-10-28 [3] CRAN (R 4.0.2)
##  remotes       2.4.2   2021-11-30 [1] CRAN (R 4.1.2)
##  rlang         1.0.1   2022-02-03 [1] CRAN (R 4.1.2)
##  rmarkdown     2.13    2022-03-10 [1] CRAN (R 4.1.2)
##  rprojroot     2.0.2   2020-11-15 [3] CRAN (R 4.0.3)
##  rstudioapi    0.13    2020-11-12 [3] CRAN (R 4.0.3)
##  sass          0.4.0   2021-05-12 [1] CRAN (R 4.1.2)
##  sessioninfo   1.1.1   2018-11-05 [3] CRAN (R 4.0.1)
##  stringi       1.5.3   2020-09-09 [3] CRAN (R 4.0.2)
##  stringr       1.4.0   2019-02-10 [3] CRAN (R 4.0.1)
##  testthat      3.1.2   2022-01-20 [1] CRAN (R 4.1.2)
##  tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
##  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.1.2)
##  usethis       2.1.5   2021-12-09 [1] CRAN (R 4.1.2)
##  utf8          1.1.4   2018-05-24 [3] CRAN (R 4.0.2)
##  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.2)
##  withr         2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
##  xfun          0.30    2022-03-02 [1] CRAN (R 4.1.2)
##  yaml          2.2.1   2020-02-01 [3] CRAN (R 4.0.1)
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.1
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

Adventures in x86 ASM with rx86

website@jcarroll.com.au (Jonathan Carroll) — Thu, 23 Dec 2021 00:00:00 +0000

I just finished ‘Code: The Hidden Language of Computer Hardware and Software’ by Charles Petzold which was a really well-written (in my opinion) guided journey from flashing a light in Morse code through to building a whole computer, and everything needed along the way.

The section on encoding instructions for the processor (built up from logic gates) - assembly instructions as a human readable version of the machine code - was particularly interesting to me, and as I was describing this to a colleague I remembered that it’s not the first time I’ve played with assembly…

Years and years ago (I don’t recall how it actually started) I spent some time trying to solve a puzzle. I don’t recall whether I saw the puzzle or a solution first, but I do remember wanting to be able to understand it properly, and ideally be able to use some software I wrote to reach the solution.

The puzzle was just a set of characters on a poster for the (then named) Australian Defence Signals Directorate (now the Australian Signals Directorate - one of our Secret Squirrel orgs) at Ruxcon in 2011

ruxcon2011 DSD poster

Yes, that was a long time ago, but I never wrote up what I did, and now seems like a good enough time to get really distracted.

I would be surprised if I understood it well enough at the time, so I suspect I was aware of this blogpost which walks through the solution (spoilers). Nonetheless, I wanted to be able to do that myself, not just follow some instructions - I was confident that I could write enough code (of some sort) to go from this sequence of letters and symbols to the final solution.

My attempts at the time were mostly command-line attempts; the blog post linked above uses only web services, so that felt like I could make it ‘my own’. I first needed to get the characters into my computer - that’s just writing them out to a text file, say, a file called dsd

# dsd:
6AAAAABbi8uDwx4zwDPSigOK
ETLCiAM8AHQrg8EBg8MB6+wz
/7/z+TEct0SlpGf5dRyl53US
YQEE56Ri7Kdkj8IAABkcOsw=

Knowing that this is base64 encoded, I can decode it with base64 -d and inspect the resulting bytes with hexdump

cat dsd | base64 -d | hexdump -C

00000000  e8 00 00 00 00 5b 8b cb  83 c3 1e 33 c0 33 d2 8a  |.....[.....3.3..|
00000010  03 8a 11 32 c2 88 03 3c  00 74 2b 83 c1 01 83 c3  |...2...<.t+.....|
00000020  01 eb ec 33 ff bf f3 f9  31 1c b7 44 a5 a4 67 f9  |...3....1..D..g.|
00000030  75 1c a5 e7 75 12 61 01  04 e7 a4 62 ec a7 64 8f  |u...u.a....b..d.|
00000040  c2 00 00 19 1c 3a cc                              |.....:.|
00000047

To just get the bytecode, I used some different options and saved the file as dsd.hex

cat dsd | base64 -d | hexdump -v -e '/1 "%02X "' > dsd.hex

# dsd.hex:
E8 00 00 00 00 5B 8B CB 83 C3 1E 33 C0 33 D2 8A 03 8A 11 32 C2 88 03 3C 00 74 2B 83 C1 01 83 C3 01 EB EC 33 FF BF F3 F9 31 1C B7 44 A5 A4 67 F9 75 1C A5 E7 75 12 61 01 04 E7 A4 62 EC A7 64 8F C2 00 00 19 1C 3A CC

I did go a similar route to the linked blogpost and converted these bytes to shellcode, wrapped them in a C program and disassembled it with gdb, but much simpler was to use a better tool, in this case udis which I needed to install separately. This gives the same result as the blogpost, which was nice

udcli -x dsd.hex > dsd.hex.asm

# dsd.hex.asm:
0000000000000000 e800000000       call 0x5                
0000000000000005 5b               pop ebx                 
0000000000000006 8bcb             mov ecx, ebx            
0000000000000008 83c31e           add ebx, 0x1e           
000000000000000b 33c0             xor eax, eax            
000000000000000d 33d2             xor edx, edx            
000000000000000f 8a03             mov al, [ebx]           
0000000000000011 8a11             mov dl, [ecx]           
0000000000000013 32c2             xor al, dl              
0000000000000015 8803             mov [ebx], al           
0000000000000017 3c00             cmp al, 0x0             
0000000000000019 742b             jz 0x46                 
000000000000001b 83c101           add ecx, 0x1            
000000000000001e 83c301           add ebx, 0x1            
0000000000000021 ebec             jmp 0xf                 
0000000000000023 33ff             xor edi, edi            
0000000000000025 bff3f9311c       mov edi, 0x1c31f9f3     
000000000000002a b744             mov bh, 0x44            
000000000000002c a5               movsd                   
000000000000002d a4               movsb                   
000000000000002e 67f9             a16 stc                 
0000000000000030 751c             jnz 0x4e                
0000000000000032 a5               movsd                   
0000000000000033 e775             out 0x75, eax           
0000000000000035 126101           adc ah, [ecx+0x1]       
0000000000000038 04e7             add al, 0xe7            
000000000000003a a4               movsb                   
000000000000003b 62ec             invalid                 
000000000000003d a7               cmpsd                   
000000000000003e 648fc2           pop edx                 
0000000000000041 0000             add [eax], al           
0000000000000043 191c3a           sbb [edx+edi], ebx      
0000000000000046 cc               int3

At this point, I got a bit lost (at the time) because I didn’t understand assembly well enough (or at all), so, continuing with the logic presented in the linked blogpost, I considered just working with the bytes directly.

All we really need to do is take the bytes starting at 0x5 and 0x23 and xor them. I figured I’ll need the decimal value of these addresses; 0x5 is just 5, but 0x23 = 16*2 + 3 = 35. We can of course get this via printf

printf "%d\n" 0x23

## 35

or less simply with the built-in calculator tool bc, going from (input) base 16 to (output) base 10

echo "obase=10;ibase=16; 23" | bc

## 35

I placed the bytes in sequence (removing spaces) with

str=$(cat dsd.hex | sed 's/ //g')
echo $str

## E8000000005B8BCB83C31E33C033D28A038A1132C288033C00742B83C10183C301EBEC33FFBFF3F9311CB744A5A467F9751CA5E77512610104E7A462ECA7648FC20000191C3ACC

Since I have 2 characters per byte, 0x5 starts at (zero-based) character offset 10, and 0x23 starts at offset 70, so we define our strings as

astr=${str:70} # 0x23 to end
echo $astr

## 33FFBFF3F9311CB744A5A467F9751CA5E77512610104E7A462ECA7648FC20000191C3ACC

and

bstr=${str:10} # 0x5 to end
echo $bstr

## 5B8BCB83C31E33C033D28A038A1132C288033C00742B83C10183C301EBEC33FFBFF3F9311CB744A5A467F9751CA5E77512610104E7A462ECA7648FC20000191C3ACC

There is overlap here, which we will have to deal with when we get to it. For now, we want to xor these. Let’s cut these down to 60 characters (where they start to overlap)

trimastr=${astr:0:60}
echo $trimastr

## 33FFBFF3F9311CB744A5A467F9751CA5E77512610104E7A462ECA7648FC2

trimbstr=${bstr:0:60}
echo $trimbstr

## 5B8BCB83C31E33C033D28A038A1132C288033C00742B83C10183C301EBEC

The xor operator (^) chokes on this many digits (in fact, any more than about 8) so I’ve written a loop to xor the hex digits one at a time:

for i in {0..59} ; do echo $(( 0x${astr:$i:1} ^ 0x${bstr:$i:1} )) | awk '{printf "%X",$0}' ; done > xor.dat

# xor.dat:
687474703A2F2F7777772E6473642E676F762E61752F6465636F6465642E
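Working one hex digit at a time is safe because XOR acts on each bit independently, so nothing carries over from one digit to the next. A quick check of that claim on the first two bytes of each string, sketched in R (rather than shell) with base R’s strtoi() and bitwXor():

```r
a <- "33FF"  # first two bytes from 0x23
b <- "5B8B"  # first two bytes from 0x5

# xor digit by digit, as the shell loop does
digits_a <- strtoi(strsplit(a, "")[[1]], base = 16L)
digits_b <- strtoi(strsplit(b, "")[[1]], base = 16L)
digitwise <- paste(sprintf("%X", bitwXor(digits_a, digits_b)), collapse = "")

# xor the two 16-bit values in one go
whole <- sprintf("%04X", bitwXor(strtoi(a, base = 16L), strtoi(b, base = 16L)))

identical(digitwise, whole)
```

Both routes give 6874, matching the start of xor.dat.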

What about that bit that overlapped? We only xored the first 60 characters, but the length of the string is

echo ${#astr}

## 72

so we need those last 12 (72-60) characters (6 bytes) of overlap

olap1=${astr:60:12}
echo $olap1

## 0000191C3ACC

The assembly code would have overwritten the overlapped part by the time it reached there, so we need to xor with the xor’d part, i.e. the first 12 characters of xor.dat

olap2=$(cat xor.dat | cut -c -12)
echo $olap2

## 687474703A2F

Now, finally, do the last xor (^)

for i in {0..11} ; do echo $(( 0x${olap1:$i:1} ^ 0x${olap2:$i:1} )) | awk '{printf "%X",$0}' ; done > xorolap.dat

# xorolap.dat:
68746D6C00E3

so this needs to go at the end of our xor.dat

cat xor.dat xorolap.dat > xorfinal.dat

# xorfinal.dat:
687474703A2F2F7777772E6473642E676F762E61752F6465636F6465642E68746D6C00E3

The final 00 and after is useless, so let’s drop it. Finally, we just need to convert this back to text using xxd -r

cat xorfinal.dat | sed 's/[0]\{2\}.*//' | xxd -r -p > dsd.sol

Phew. I’m not going to reveal the solution just yet, because this isn’t the end of the story (but I did get the right answer).

So, that’s a command-line solution to (at least this part of) the puzzle. But now I know R!

Learning more about assembly from the book ‘Code’, it occurred to me that the operations - which could be implemented with something as simple as telegraph relays (or crabs) - were just operations on data. Given an input, produce an output (sort of). A MOV operation just moved some value stored at some address to another address (or to/from a register). This felt like it could be simple enough to encode in some R functions. Perhaps not some “pure” R functions, because I want the side-effect of altering a global memory bank, but surely I could do simple things like ADD.

I looked around to see if someone else had done this before. As is usually the case with odd requests, Mike (a.k.a. @coolbutuseless) has done something similar in the form of r64, although I didn’t initially appreciate that it targets a sufficiently distinct flavour of assembly (I never had a Commodore 64, we had an Amstrad CPC664 on which I really only played games). After a quick PR to bring that repo up to date with other changes by the author (a migration of one dependency) I realised this wasn’t what I needed, but I did learn a lot from how it was structured.

Okay, on to building something myself. I knew I’d need some memory and some registers. The registers seemed easy - they wouldn’t hold a lot and I could address them by name, e.g. eax. An environment seemed natural, both because of the named list structure, and because I knew it would be mutable. That seemed like a benefit for this use-case - having a global set of registers I could move data in and out of without making copies of the thing or passing it around everywhere.
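The mutability point is easy to see in a small sketch (the helper names here are mine, purely for illustration): a function that writes into an environment changes it for everyone, whereas a function that modifies a list only changes its own copy.

```r
# registers as an environment: updated in place
registers <- new.env()
assign("eax", 0L, envir = registers)
set_eax <- function(value) assign("eax", value, envir = registers)
set_eax(5L)
env_value <- get("eax", envir = registers)

# registers as a list: the function only modifies a local copy
reg_list <- list(eax = 0L)
set_eax_list <- function(regs, value) {
  regs$eax <- value
  invisible(regs)
}
set_eax_list(reg_list, 5L)
list_value <- reg_list$eax
```

After running this, env_value is 5 but list_value is still 0 unless the returned list is reassigned.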

Next I’d need memory. I figured a vector of hex values made sense, but I wanted to be able to refer to the first one as 0x00. Now, the names of a vector need to be a character vector - you can’t use the actual hex values

memory <- c(0x00 = 0x19, 0x01 = 0x1a, 0x02 = 0x1b)

## Error: <text>:1:18: unexpected '='
## 1: memory <- c(0x00 =
##                      ^

so we need to use character strings

memory <- c("0x00" = 0x19, "0x01" = 0x1a, "0x02" = 0x1b)
memory

## 0x00 0x01 0x02 
##   25   26   27

More importantly, we’ll need to ensure we only refer to these by the character strings because [ first tries a coercion to integer, which, side-note, is why this works

(1:10)[2.3] # since as.integer(2.3) == 2

## [1] 2

(1:10)[4.7] # since as.integer(4.7) == 4

## [1] 4

The risk is that we use a hex value to extract an element, in which case we might accidentally try to get the first value with

memory[0x00]

## named numeric(0)

Instead, we want

memory["0x00"]

## 0x00 
##   25

In order to make sure we always do this, we need a sanitize() function which always returns the string.
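As a rough idea of what such a function could look like, here is a hypothetical stand-in (sanitize_sketch() is my own name and the real sanitize() in the package may well differ). It assumes addresses fit in two hex digits and that inputs are either numbers or strings like "0x2" or "[0x22]".

```r
# hypothetical stand-in for sanitize(); not the package's implementation
sanitize_sketch <- function(x) {
  if (is.character(x)) {
    # drop any [ ] wrapping so that "[0x22]" is treated like "0x22"
    x <- gsub("\\[|\\]", "", x)
  }
  # as.integer() understands both numbers and "0x.." strings
  sprintf("0x%02x", as.integer(x))
}

memory <- c("0x00" = 0x19, "0x01" = 0x1a, "0x02" = 0x1b)
first <- memory[sanitize_sketch(0x00)]   # looked up by name, not by position
third <- memory[sanitize_sketch("0x2")]
```

With this in place memory[sanitize_sketch(0x00)] finds the element named "0x00" rather than returning an empty vector.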

We can convert a value to hexmode with as.hexmode, but that’s a lot of typing, so I added an alias as

hex <- as.hexmode

For processing assembly instructions, we might see something like

mov eax, 0x5

which should move the value 0x5 into the register eax… so we’ll need a way to distinguish direct addresses from registers. Worse still, we might refer to the address stored in a register, as [eax]. A reg_or_val() function would identify anything which points to an address (containing a [), any of the named registers, or a value, and would return the address (or where that points).
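To give a feel for that three-way distinction, here is a simplified sketch. resolve_operand() is a hypothetical name, not the package’s reg_or_val(), and it assumes a registers environment, a named mem vector, and two-digit hex addresses.

```r
# toy state for the example (values are illustrative)
registers <- new.env()
assign("eax", 0x23, envir = registers)
mem <- c("0x23" = 0x68)

# hypothetical resolver: bracketed operand, register name, or literal value
resolve_operand <- function(op) {
  if (grepl("[", op, fixed = TRUE)) {
    # [x] means "the value stored at the address x refers to"
    addr <- resolve_operand(gsub("\\[|\\]", "", op))
    return(unname(mem[sprintf("0x%02x", as.integer(addr))]))
  }
  if (exists(op, envir = registers, inherits = FALSE)) {
    return(get(op, envir = registers))
  }
  as.integer(op)  # a literal such as "0x5"
}

literal  <- resolve_operand("0x5")
register <- resolve_operand("eax")
pointed  <- resolve_operand("[eax]")
```

Here literal is 5, register is 35 (0x23) and pointed is 104 (0x68), the value stored at the address held in eax.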

With all of those pieces, the only thing left is to actually be able to run code.

Assembly runs sequentially through the instructions, unless we encounter some flow control opcodes (e.g. JMP - jump to address - I’ll keep calling them opcodes but mnemonics is a more correct term). The basic process would then be to read in the instruction, identify the opcode and the arguments, and execute, modifying the memory and registers in-place. Once that’s done, we move to the next instruction.

With flow control we might need to identify a different address to go to next, and that might depend on the status of the registers, for example JNZ 0x00 jumps to address 0x00 if the zero flag is not set. So, we can execute the current instruction but then apply any flag-based logic to see if we need to go to a new address, and go wherever we should go next. This is implemented as runasm().
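The shape of such a loop can be sketched with a toy machine. This is not the package’s runasm(), just an illustration under my own assumptions: three made-up instructions (dec, jnz, halt), a single counter, and a program stored as a data.frame of addresses and instructions.

```r
# hypothetical toy machine with three instructions: dec, jnz <addr>, halt
run_sketch <- function(program, counter) {
  zf <- 0L      # zero flag
  pc <- 1L      # row of the instruction to execute next
  steps <- 0L
  repeat {
    parts <- strsplit(program$instr[pc], " ")[[1]]
    steps <- steps + 1L
    if (parts[1] == "halt") break
    if (parts[1] == "dec") {
      counter <- counter - 1L
      zf <- as.integer(counter == 0L)
      pc <- pc + 1L
    } else if (parts[1] == "jnz") {
      # jump only while the zero flag is not set, otherwise fall through
      pc <- if (zf == 0L) match(parts[2], program$addr) else pc + 1L
    } else {
      stop("unknown instruction: ", parts[1])
    }
  }
  list(counter = counter, steps = steps)
}

program <- data.frame(
  addr  = c("00", "01", "02"),
  instr = c("dec", "jnz 00", "halt"),
  stringsAsFactors = FALSE
)
res <- run_sketch(program, counter = 3L)
```

Starting from 3, the counter is decremented three times, the jump is taken twice and skipped once, and the machine halts after 7 steps.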

That takes care of running the code, but what are we running? Oh, operations. Right. Well, we need some of those. Going through the opcodes I need for the puzzle, I’ll need the following:

call
pop
mov
add
xor
cmp
jz
jmp
halt

CALL just pushes a value (on x86, the address of the next instruction) onto the stack (register esp), POP retrieves it and stores it in a register or at an address, MOV as we said moves a value from place to place, ADD adds two values, XOR does what it suggests, and so on. These don’t seem tricky to implement, for example

mov <- function(x, y) {
  # copy y into x
  res <- hex(reg_or_val(y))
  if (x %in% names(registers)) {
    assign(x, res, envir = registers)
  } else {
    mem[sanitize(x)] <<- sanitize(res)
  }
  return(invisible(NULL))
}

The wrinkle will be that particular instructions also update registers, for example an ADD stores whether the result was 0x00 in the zero-flag register

add <- function(x, y) {
  # add y to x and save in x
  res <- hex(reg_or_val(x)) + hex(reg_or_val(y))
  if (x %in% names(registers)) {
    assign(x, res, envir = registers)
  } else {
    mem[sanitize(x)] <<- sanitize(res)
  }
  assign("zf", hex(as.integer(res == 0x00)), envir = registers)
  return(invisible(x))
}

A conditional jump (e.g. JZ or JNZ) will check this register and jump (or not) accordingly, whereas a plain JMP always jumps.

With these pieces in place, an R package was a natural home for the code, so I can now present the {rx86} package: https://github.com/jonocarroll/rx86

Let’s use it to solve the puzzle!!!

Starting with the puzzle string

dsd <- "6AAAAABbi8uDwx4zwDPSigOK
ETLCiAM8AHQrg8EBg8MB6+wz
/7/z+TEct0SlpGf5dRyl53US
YQEE56Ri7Kdkj8IAABkcOsw="

we decode it (this time in R)

(b64 <- base64enc::base64decode(dsd))

##  [1] e8 00 00 00 00 5b 8b cb 83 c3 1e 33 c0 33 d2 8a 03 8a 11 32 c2 88 03 3c 00
## [26] 74 2b 83 c1 01 83 c3 01 eb ec 33 ff bf f3 f9 31 1c b7 44 a5 a4 67 f9 75 1c
## [51] a5 e7 75 12 61 01 04 e7 a4 62 ec a7 64 8f c2 00 00 19 1c 3a cc

This will be the only non-R part: we still need to disassemble the bytecode into assembly, but we can do that from R with a system() call to udcli

(disas <- system("udcli -x", input = paste(b64, collapse = " "), intern = TRUE))

##  [1] "0000000000000000 e800000000       call 0x5                "
##  [2] "0000000000000005 5b               pop ebx                 "
##  [3] "0000000000000006 8bcb             mov ecx, ebx            "
##  [4] "0000000000000008 83c31e           add ebx, 0x1e           "
##  [5] "000000000000000b 33c0             xor eax, eax            "
##  [6] "000000000000000d 33d2             xor edx, edx            "
##  [7] "000000000000000f 8a03             mov al, [ebx]           "
##  [8] "0000000000000011 8a11             mov dl, [ecx]           "
##  [9] "0000000000000013 32c2             xor al, dl              "
## [10] "0000000000000015 8803             mov [ebx], al           "
## [11] "0000000000000017 3c00             cmp al, 0x0             "
## [12] "0000000000000019 742b             jz 0x46                 "
## [13] "000000000000001b 83c101           add ecx, 0x1            "
## [14] "000000000000001e 83c301           add ebx, 0x1            "
## [15] "0000000000000021 ebec             jmp 0xf                 "
## [16] "0000000000000023 33ff             xor edi, edi            "
## [17] "0000000000000025 bff3f9311c       mov edi, 0x1c31f9f3     "
## [18] "000000000000002a b744             mov bh, 0x44            "
## [19] "000000000000002c a5               movsd                   "
## [20] "000000000000002d a4               movsb                   "
## [21] "000000000000002e 67f9             a16 stc                 "
## [22] "0000000000000030 751c             jnz 0x4e                "
## [23] "0000000000000032 a5               movsd                   "
## [24] "0000000000000033 e775             out 0x75, eax           "
## [25] "0000000000000035 126101           adc ah, [ecx+0x1]       "
## [26] "0000000000000038 04e7             add al, 0xe7            "
## [27] "000000000000003a a4               movsb                   "
## [28] "000000000000003b 62ec             invalid                 "
## [29] "000000000000003d a7               cmpsd                   "
## [30] "000000000000003e 648fc2           pop edx                 "
## [31] "0000000000000041 0000             add [eax], al           "
## [32] "0000000000000043 191c3a           sbb [edx+edi], ebx      "
## [33] "0000000000000046 cc               int3                    "

We then read this back into R as a data.frame

asm <- suppressWarnings(
  readr::read_fwf(paste(disas, collapse = "\n"), 
                  col_types = "ccc",
                  col_positions = readr::fwf_widths(c(16, 16, 21)))
)
colnames(asm) <- c("addr", "bytecode", "instr")
# trim the leading 0s from addr since this is all we're using
asm$addr <- substr(asm$addr, nchar(asm$addr)-1, nchar(asm$addr))
asm

## # A tibble: 33 x 3
##    addr  bytecode   instr        
##    <chr> <chr>      <chr>        
##  1 00    e800000000 call 0x5     
##  2 05    5b         pop ebx      
##  3 06    8bcb       mov ecx, ebx 
##  4 08    83c31e     add ebx, 0x1e
##  5 0b    33c0       xor eax, eax 
##  6 0d    33d2       xor edx, edx 
##  7 0f    8a03       mov al, [ebx]
##  8 11    8a11       mov dl, [ecx]
##  9 13    32c2       xor al, dl   
## 10 15    8803       mov [ebx], al
## # … with 23 more rows

The last instruction, int3, is an interrupt, but let’s generalise it to a halt because we’ll be done

asm[33, "instr"] <- "halt"

We can run this with {rx86}… we need a memory array and some registers

mem <- create_mem()
registers <- create_reg()

Then we can run the code

runasm(asm)

As we saw earlier, the ‘code’ part of the asm is stored in 0x00 to 0x22 (the final jmp starts at 0x21 and is two bytes long) with the remaining addresses being used for temporary storage, from 0x23. The operations encoded perform an XOR between the values starting at 0x05 and those starting at 0x23, storing the results starting at 0x23. Extracting the memory from this offset onward (up to where it zeroes) results in

(mem_offset <- mem[which(names(mem) == "0x23"):length(mem)])

##   0x23   0x24   0x25   0x26   0x27   0x28   0x29   0x2a   0x2b   0x2c   0x2d 
## "0x68" "0x74" "0x74" "0x70" "0x3a" "0x2f" "0x2f" "0x77" "0x77" "0x77" "0x2e" 
##   0x2e   0x2f   0x30   0x31   0x32   0x33   0x34   0x35   0x36   0x37   0x38 
## "0x64" "0x73" "0x64" "0x2e" "0x67" "0x6f" "0x76" "0x2e" "0x61" "0x75" "0x2f" 
##   0x39   0x3a   0x3b   0x3c   0x3d   0x3e   0x3f   0x40   0x41   0x42   0x43 
## "0x64" "0x65" "0x63" "0x6f" "0x64" "0x65" "0x64" "0x2e" "0x68" "0x74" "0x6d" 
##   0x44   0x45   0x46   0x47   0x48   0x49   0x4a   0x4b   0x4c   0x4d   0x4e 
## "0x6c" "0x00" "0xcc" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" 
##   0x4f   0x50   0x51   0x52   0x53   0x54   0x55   0x56   0x57   0x58   0x59 
## "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" 
##   0x5a   0x5b   0x5c   0x5d   0x5e   0x5f   0x60   0x61   0x62   0x63   0x64 
## "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" 
##   0x65   0x66   0x67   0x68   0x69   0x6a   0x6b   0x6c   0x6d   0x6e   0x6f 
## "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" 
##   0x70   0x71   0x72   0x73   0x74   0x75   0x76   0x77   0x78   0x79   0x7a 
## "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" "0x00" 
##   0x7b   0x7c   0x7d   0x7e   0x7f 
## "0x00" "0x00" "0x00" "0x00" "0x00"

And then lastly, we need to convert this sequence of hex values into characters. I’ve added a helper which achieves this, dropping everything from the first null-terminating byte (0x00) onward

hex2string(mem_offset)

## [1] "http://www.dsd.gov.au/decoded.html"

TADA!

This example is stored along with the package as a vignette, so

vignette("dsd_ruxcon_challenge", package = "rx86")
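As a cross-check that doesn’t rely on {rx86}, the XOR loop from the disassembly can be replayed directly on the bytes listed in dsd.hex above. This is only a sketch of the loop between 0x0f and 0x21 (the variable names are mine, not the package’s), using base R’s bitwXor() and remembering that R vectors are 1-based while the addresses are 0-based.

```r
hex_bytes <- paste(
  "E8 00 00 00 00 5B 8B CB 83 C3 1E 33 C0 33 D2 8A 03 8A 11 32 C2 88 03 3C",
  "00 74 2B 83 C1 01 83 C3 01 EB EC 33 FF BF F3 F9 31 1C B7 44 A5 A4 67 F9",
  "75 1C A5 E7 75 12 61 01 04 E7 A4 62 EC A7 64 8F C2 00 00 19 1C 3A CC"
)
bytes <- strtoi(strsplit(hex_bytes, " ")[[1]], base = 16L)

src <- 0x05  # ecx: where the key bytes are read from (0-based address)
dst <- 0x23  # ebx: where the data is read and overwritten (0-based address)
repeat {
  val <- bitwXor(bytes[dst + 1], bytes[src + 1])
  bytes[dst + 1] <- val        # write back in place, as mov [ebx], al does
  if (val == 0L) break         # jz: stop at the first zero result
  src <- src + 1
  dst <- dst + 1
}

# everything from 0x23 up to (not including) the zero byte
decoded <- intToUtf8(bytes[(0x23 + 1):dst])
decoded
## [1] "http://www.dsd.gov.au/decoded.html"
```

Because the results are written back in place, the overlap takes care of itself: by the time the key pointer reaches 0x23 it is reading bytes that have already been decoded.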

The link in this solution just redirects to the ASD frontpage since the puzzle is now over 10 years old, but when it was active it led to a page with some binary

0100 0011 0100 1111 0100 1110 0100 0111 0101 0010 0100 0001 0101 0100 0101
0101 0100 1100 0100 0001 0101 0100 0100 1001 0100 1111 0100 1110 0101 0011

Originally, I solved this part at the command line by storing this code in a file named decoded and running a similar bc conversion to before, but this time from binary (ibase=2) to hex (obase=16), storing the result in decoded.hex

for bin in $(cat decoded) ; do echo "obase=16;ibase=2; $bin" | bc >> decoded.hex ; done

From there, removing the line breaks and spaces, and passing through xxd in reverse (similar to hexdump but reverse works on my machine)

cat decoded.hex | tr '\n' ' ' | sed 's/ //g' | xxd -r -p

Again, I’ll hold off showing the answer, but it was correct.

It would be satisfying to also do this in R, so I added another helper in {rx86} that does this conversion - it’s not terribly complex, but involves splitting a string into pairs of strings (split at some point) and a conversion

split_pairs <- function(x, split = "") {
  sst <- strsplit(x, split)[[1]]
  out <- paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
  paste0(out, collapse = ",")
}

bin2ascii <- function(bin) {
  nolb <- gsub("\n", " ", bin)
  split <- strsplit(split_pairs(paste(nolb, collapse = " "), split = " "), ",")[[1]]
  ints <- strtoi(split, base = 2)
  intToUtf8(ints)
}

It appears to do the job

binary <- "0100 0011 0100 1111 0100 1110 0100 0111 0101 0010 0100 0001 0101 0100 0101
0101 0100 1100 0100 0001 0101 0100 0100 1001 0100 1111 0100 1110 0101 0011"

bin2ascii(binary)

## [1] "CONGRATULATIONS"

There, an (almost) entirely R solution to the puzzle, and all it took was writing my own x86 assembly parser.

I did want to see if I’d made my parser too specific and it only worked with this one example around which I’d designed it, so I wanted to add another example. This journey started with the book ‘Code’, so it felt fitting to use an example from there. In the book, an example of multiplying two 8-bit numbers is used, which involved an ADC (add with carry) operation to handle overflow. This seemed like a good candidate, and I started coding it, but soon realised that the exact routine relied on an add al, 0xff which has the effect of adding -1 in 8-bit, but on my machine

as.integer(0xff)

## [1] 255

and

as.hexmode(-1)

## [1] "ffffffff"

which isn’t compatible. I could instead code a SUB opcode and sub al, 0x01 (which I did) but at this point I decided to abandon the 8-bit idea and simplify down to just doing the essential part of the program, which multiplies 167 and 28 through repeated additions (via a loop). The asm for this is also stored in the package, and executing it is as simple as

mult_asm <- suppressWarnings(
  readr::read_fwf(system.file("asm", "mult.asm", package = "rx86"), 
                  col_types = "ccc",
                  col_positions = readr::fwf_widths(c(3, 6, 20)))
)
colnames(mult_asm) <- c("addr", "bytecode", "instr")
print(mult_asm)

## # A tibble: 14 x 3
##    addr  bytecode instr         
##    <chr> <chr>    <chr>         
##  1 00    101005   mov al, [0x22]
##  2 03    201001   add al, [0x20]
##  3 06    111005   mov [0x22], al
##  4 09    101004   mov al, [0x21]
##  5 0c    221000   sub al, 0x01  
##  6 0f    111004   mov 0x21, al  
##  7 12    101003   jnz 0x00      
##  8 15    20001e   halt          
##  9 18    111003   invalid       
## 10 1b    330000   invalid       
## 11 1e    ff00     invalid       
## 12 20    a7       data 167      
## 13 21    1c       data 28       
## 14 22    00       result

Again, we need a new memory array and some registers

mem <- create_mem(len = 64)
registers <- create_reg()

Then run the code

runasm(mult_asm)

The final result can be extracted but it is still a hex value

mem[sanitize(0x22)]

##     0x22 
## "0x1244"

Converting it to an integer gives the expected result

as.integer(mem[sanitize(0x22)])

## [1] 4676

167*28

## [1] 4676

This, too, is stored as a vignette in the package, and can be found with

vignette("mult_code_petzold", package = "rx86")
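For reference, the arithmetic behind that routine can be sketched in plain R. Neither function below is part of {rx86}; the names are just for illustration. The first mirrors the add/sub/jnz loop, and the second shows why add al, 0xff acts as a decrement when results are kept to 8 bits.

```r
# multiply by adding `a` to a running total `b` times
multiply_by_addition <- function(a, b) {
  total <- 0L
  counter <- b
  while (counter != 0L) {   # the jnz: loop until the counter hits zero
    total <- total + a
    counter <- counter - 1L
  }
  total
}
product <- multiply_by_addition(167L, 28L)

# adding 0xff and keeping only the low 8 bits is the same as subtracting 1
add_8bit <- function(x, y) bitwAnd(x + y, 0xffL)
wrapped <- add_8bit(5L, 0xffL)
```

Here product is 4676 and wrapped is 4, i.e. 5 - 1, which is the wrap-around behaviour that R’s ordinary integers don’t give for free.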

The package is far from perfect, and only supports what I needed it to, but I’ve learned a lot about assembly and got to build something I’ve always wanted to. Plus I’ve finally written up my process for this puzzle that has been sitting on a disused laptop for a decade.

That’s not quite the end, though - I really wanted to test out what I’ve learned so far, and what good is a new programming ability without a “Hello, world!” example?

Almost all of the examples I found floating around use ‘modern’ asm (without the bytecode) and allow such luxuries as “storing a string” and “system calls” - none of that here, thank you. Instead, I added a new opcode mnemonic int 0x80 which sort of does what it should - it writes to screen the value (converted to character) of whatever is in the register eax. That’s helpful, but I still need the assembly that will use that. This is where I feel I’ve actually hand-programmed something myself. This is a piece of code that could literally have been punched into a card

Punch card

The whole thing works, of course

hello_asm <- suppressWarnings(
  readr::read_fwf(system.file("asm", "helloworld.asm", package = "rx86"), 
                  col_types = "ccc",
                  col_positions = readr::fwf_widths(c(3, 3, 20)))
)
colnames(hello_asm) <- c("addr", "bytecode", "instr")
print(hello_asm)

## # A tibble: 23 x 3
##    addr  bytecode instr        
##    <chr> <chr>    <chr>        
##  1 00    10       mov ecx, 0x0e
##  2 01    10       mov al, 0x08 
##  3 02    10       mov eax, [al]
##  4 03    cc       int 0x80     
##  5 04    28       sub ecx, 0x01
##  6 05    70       jz 0x17      
##  7 06    05       add al, 0x01 
##  8 07    e9       jmp 0x02     
##  9 08    48       data         
## 10 09    65       data         
## # … with 13 more rows

mem <- create_mem()
registers <- create_reg()

runasm(hello_asm)

## Hello, world!

and I find that honestly, ridiculously pleasing.
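For anyone who doesn't read assembly, here is a rough sketch of what that loop is doing, written in plain R rather than through {rx86}. It assumes the data block simply holds the ASCII codes of the message (only the first two bytes, 48 and 65, are visible in the truncated printout above), and the helper name `emit()` is made up for this illustration.

```r
## assumed contents of the data block: ASCII codes of the message
data_bytes <- utf8ToInt("Hello, world!")
sprintf("%02x", data_bytes[1:2])  # "48" "65", matching the first two data rows

## mimic the loop: load a byte, 'print' it as a character,
## decrement the counter, move the pointer along
emit <- function(bytes) {
    out <- character(0)
    counter <- length(bytes)
    ptr <- 1
    while (counter > 0) {
        out <- c(out, intToUtf8(bytes[ptr]))
        ptr <- ptr + 1
        counter <- counter - 1
    }
    paste(out, collapse = "")
}
emit(data_bytes)
```

The real program does the same bookkeeping with `ecx` as the counter and `al` as the pointer.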

This, too, is included in the package as

vignette("helloworld", package = "rx86")

I’m satisfied that {rx86} works, at least in some sense.

I’ve learned a lot along the way, and who knows, maybe I’ll add some more opcodes to the package. If you have some suggestions, please let me know!

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.3 (2020-10-10)
##  os       Pop!_OS 20.10               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2021-12-23                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  base64enc     0.1-3   2015-07-28 [1] CRAN (R 4.0.3)
##  blogdown      1.7     2021-12-19 [1] CRAN (R 4.0.3)
##  bookdown      0.24    2021-09-02 [1] CRAN (R 4.0.3)
##  bslib         0.3.1   2021-10-06 [1] CRAN (R 4.0.3)
##  callr         3.5.1   2020-10-13 [1] CRAN (R 4.0.3)
##  cli           3.1.0   2021-10-27 [1] CRAN (R 4.0.3)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.3)
##  desc          1.4.0   2021-09-28 [1] CRAN (R 4.0.3)
##  devtools      2.3.2   2020-09-18 [1] CRAN (R 4.0.3)
##  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.3)
##  dplyr         1.0.2   2020-08-18 [1] CRAN (R 4.0.3)
##  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.3)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.3)
##  fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.3)
##  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.0.3)
##  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.3)
##  generics      0.1.0   2020-10-31 [1] CRAN (R 4.0.3)
##  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.3)
##  hms           0.5.3   2020-01-08 [1] CRAN (R 4.0.3)
##  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.0.3)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.0.3)
##  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.0.3)
##  knitr         1.37    2021-12-16 [1] CRAN (R 4.0.3)
##  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.0.3)
##  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.3)
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.3)
##  pillar        1.4.7   2020-11-20 [1] CRAN (R 4.0.3)
##  pkgbuild      1.2.0   2020-12-15 [1] CRAN (R 4.0.3)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.3)
##  pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.3)
##  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.3)
##  processx      3.4.5   2020-11-30 [1] CRAN (R 4.0.3)
##  ps            1.5.0   2020-12-05 [1] CRAN (R 4.0.3)
##  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.3)
##  R6            2.5.0   2020-10-28 [1] CRAN (R 4.0.3)
##  readr         1.4.0   2020-10-05 [1] CRAN (R 4.0.3)
##  remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.3)
##  rlang         0.4.10  2020-12-30 [1] CRAN (R 4.0.3)
##  rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.0.3)
##  rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.0.3)
##  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.3)
##  rx86        * 0.1.0   2021-12-22 [1] local         
##  sass          0.4.0   2021-05-12 [1] CRAN (R 4.0.3)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.3)
##  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.3)
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.3)
##  testthat      3.0.1   2020-12-17 [1] CRAN (R 4.0.3)
##  tibble        3.0.4   2020-10-12 [1] CRAN (R 4.0.3)
##  tidyr         1.1.2   2020-08-27 [1] CRAN (R 4.0.3)
##  tidyselect    1.1.0   2020-05-11 [1] CRAN (R 4.0.3)
##  usethis       2.1.5   2021-12-09 [1] CRAN (R 4.0.3)
##  utf8          1.1.4   2018-05-24 [1] CRAN (R 4.0.3)
##  vctrs         0.3.6   2020-12-17 [1] CRAN (R 4.0.3)
##  withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.3)
##  xfun          0.29    2021-12-14 [1] CRAN (R 4.0.3)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.3)
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.0
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

Improving a Visualization

website@jcarroll.com.au (Jonathan Carroll) — Fri, 02 Jul 2021 00:00:00 +0000

I saw this post on Reddit’s r/dataisbeautiful showing this plot of streaming services market share, comparing 2020 to 2021

US Streaming Services Market Share, 2020 vs 2021

and thought it looked like a good candidate for trying out some plot improvement techniques.

Yes, that was a reasonably long while ago; this post has taken quite some time to put together. Life.

I’ve played with adding images to plot axes several times (also here, here, here) so that part shouldn’t pose too much of a challenge. First, I’ll try to rebuild the original. The original was built in PowerPoint but I’ll be reproducing it with R (surprise, surprise).

The data itself appears to be captured from something like this page (paywalled) but the precise values aren’t important; I’ll just take them directly from the original plot manually

streaming <- tibble::tribble(
  ~service, ~`2020`, ~`2021`,
  "netflix",    29, 20,
  "prime",      21, 16,
  "hulu",       16, 13,
  "disney",     12, 11,
  "apple",       4,  5,
  "peacock",     0,  5,
  "hbo",         3, 12,
  "paramount",   2,  3,
  "other",      13, 15,
)

I can build a simple barplot from this data with {ggplot2}. First we’ll need it in long format, with the services ordered as they are in the original. From that I can build a basic bar plot with dodged bars. There’s a few fiddly bits to work out, which I’ll try to go through with code comments.

library(ggplot2)
library(dplyr) ## for %>%, mutate(), left_join(), and arrange() used further down

## pivot to long format with the 
## year and share as their own columns
streaming_long <- tidyr::pivot_longer(streaming, 
                                      cols = -service, 
                                      names_to = "year", 
                                      values_to = "share")

## plot the years side-by-side in the original order
p <- ggplot(streaming_long) + 
  geom_col(aes(factor(service, levels = streaming$service), 
               share, fill = year), position = position_dodge(width = 0.9)) + 
  ## add a hidden set of points to make the legend circles easily
  geom_point(aes(x = service, y = -10, color = year, fill = year), size = 4) + 
  ## add the percentages just above each bar
  geom_text(aes(service, share + 1, label = paste0(share, "%"), group = year),
            position = position_dodge(width = 0.9), size = 3) +
  ## use similar colours to the original
  scale_fill_manual(values = c(`2020` = "red3", `2021` = "black")) +
  scale_color_manual(values = c(`2020` = "red3", `2021` = "black")) + 
  ## hide the fill legend and make the color legend horizontal
  guides(fill = "none", color = guide_legend(direction = "horizontal")) +
  scale_y_continuous(labels = scales::percent_format(scale = 1), 
                     limits = c(0, 35)) +
  labs(title = "US Streaming Market Share", 
       subtitle = "2020 vs 2021", 
       caption = "Source: Ampere Analytics via The Wrap
       
       Other Streaming Services include ESPN+, Showtime,
       Sling TV, Youtube TV, and Starz",
       x = "", y = "") +
  theme_minimal() + 
  theme(axis.text = element_text(size = 10),
        plot.title = element_text(size = 28, hjust= 0.5), 
        plot.subtitle = element_text(size = 28, hjust = 0.5),
        plot.caption = element_text(size = 7, color = "grey60"),
        plot.background = element_rect(fill = "#f4f7fc", size = 0),
        legend.title = element_blank(),
        legend.text= element_text(size = 12),
        panel.grid = element_blank(),
        ## move the color legend to an inset 
        legend.position = c(0.85, 0.8)) 
p

## Warning: Removed 18 rows containing missing values (geom_point).

Not bad. Let’s get some of the other elements looking right. I used a font identifying site to pick a similar font which seems to be Josefin Slab SemiBold.

library(extrafont)
fontfamily <- "Josefin Slab SemiBold"

p <- p + theme(plot.title = element_text(family = fontfamily),
               plot.subtitle = element_text(family = fontfamily))
p

## Warning: Removed 18 rows containing missing values (geom_point).

That’s fairly close. For the logos I’ll use the versions on Wikipedia

wiki <- "https://upload.wikimedia.org/wikipedia/commons/thumb/"
logos <- tibble::tribble(
  ~service, ~logo,
  "netflix", paste0(wiki, "0/08/Netflix_2015_logo.svg/340px-Netflix_2015_logo.svg.png"),
  "prime", paste0(wiki, "1/11/Amazon_Prime_Video_logo.svg/450px-Amazon_Prime_Video_logo.svg.png"),
  "hulu", paste0(wiki, "e/e4/Hulu_Logo.svg/440px-Hulu_Logo.svg.png"),
  "disney", paste0(wiki, "3/3e/Disney%2B_logo.svg/320px-Disney%2B_logo.svg.png"),
  "apple",  paste0(wiki, "2/28/Apple_TV_Plus_Logo.svg/500px-Apple_TV_Plus_Logo.svg.png"),
  "peacock", paste0(wiki, "d/d3/NBCUniversal_Peacock_Logo.svg/440px-NBCUniversal_Peacock_Logo.svg.png"),
  "hbo", paste0(wiki, "d/de/HBO_logo.svg/440px-HBO_logo.svg.png"),
  "paramount", paste0(wiki, "a/a5/Paramount_Plus.svg/440px-Paramount_Plus.svg.png"),
  "other", "other.png"
) %>% 
  mutate(path = file.path("images", paste(service, tools::file_ext(logo), sep = ".")))
labels <- setNames(paste0("<img src='", logos$path, "' width='35' />"), logos$service)
labels[["other"]] <- "other<br />streaming<br />services"

and save a local copy for faster loading/just in case

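## make sure the target directory exists before downloading
dir.create("images", showWarnings = FALSE)
## the 9th entry ("other") has no remote logo, so only fetch the first 8
for (r in 1:8) {
  download.file(logos$logo[r], logos$path[r])
}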

I can leverage {ggtext} to set these as the axis labels

p <- p + 
  scale_x_discrete(labels = labels) + 
  theme(axis.text.x = ggtext::element_markdown())
p

## Warning: Removed 18 rows containing missing values (geom_point).

That’s not too bad - for the sake of scrolling, here’s the original again

US Streaming Services Market Share, 2020 vs 2021

Now to try to improve it.

My first thought on seeing this plot with the legend was that {ggtext} makes this a lot easier to read by using the title as the legend. For the sake of the ‘removed rows containing missing values’ warning I’ll also drop the points layer

p$layers[[2]] <- NULL ## drop the geom_points layer
p <- p + 
  labs(subtitle = "<span style = 'color: red3;'>2020</span> vs 2021") + 
  theme(plot.subtitle = ggtext::element_markdown(), legend.position = "none")
p

To help make the black bars more distinct from the text, we could make these a different colour

p <- p + 
  scale_fill_manual(values = c(`2020` = "red3", `2021` = "darkcyan")) + 
  labs(subtitle = paste0("<span style = 'color: red3;'>2020</span> ",
                         "vs <span style = 'color: darkcyan;'>2021</span>")) + 
  theme(plot.subtitle = ggtext::element_markdown(), legend.position = "none")

## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.

Now to see if there’s a better way to represent this data. These are fractions of the total share… would a pie chart be a candidate?

streaming_pie <- left_join(streaming_long, logos, by = "service")

p_pie <- ggplot(streaming_pie, aes(1, share, fill = service, image = path)) + 
  geom_col() + 
  coord_polar("y") +
  ggimage::geom_image(size = 0.12, position = position_stack(vjust = 0.5)) +
  facet_wrap(~year, strip.position = "bottom") +
  guides(fill = "none", color = guide_legend(direction = "horizontal")) +
  labs(title = "US Streaming Market Share", subtitle = "2020 vs 2021", x = "", y = "") +
  theme_void() + 
  theme(text = element_text(size = 32),
        plot.title = element_text(family = fontfamily, size = 26, hjust = 0.5), 
        plot.subtitle = element_text(family = fontfamily, size = 26, hjust = 0.5),
        plot.background = element_rect(fill = "#f4f7fc", size = 0),
        legend.title = element_blank())
p_pie

No, that’s harder for making comparisons. Forget that idea.

This should possibly be a horizontal bar plot so that the labels read nicely

p + 
  coord_flip() + 
  scale_x_discrete(labels = labels, limits = rev(streaming$service)) + 
  theme(axis.text.y = ggtext::element_markdown())

## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

The ‘insight’ this is trying to display is how each service’s share has grown or shrunk. It wasn’t obvious to me that the original was ordered by that - decreases shown first, then increases (in some order, I’m still not sure what). What might make for a better plot is to center it on the 2020 share and show the growth (positive or negative)

streaming_delta <- dplyr::mutate(streaming, growth = `2021` - `2020`)

p <- ggplot(streaming_delta) + 
  geom_col(aes(factor(service, levels = streaming$service), growth, fill = growth > 0)) + 
  labs(title = "US Streaming Market Share Growth", x = "", y = "") +
  theme_minimal() + 
  theme(text = element_text(family = fontfamily),
        plot.title = element_text(size = 26), 
        plot.subtitle = element_text(size = 26)) + 
  scale_x_discrete(labels = labels) + 
  coord_flip() +
  theme(axis.text.y = ggtext::element_markdown()) +
  scale_fill_manual(values = c(`TRUE` = "darkcyan", `FALSE` = "red3")) + 
  labs(subtitle = paste0("<span style = 'color: darkcyan;'>Growth</span> ",
                         "vs <span style = 'color: red3;'>Loss</span> (% of 2020)")) + 
  theme(plot.subtitle = ggtext::element_markdown(),
        plot.background = element_rect(fill = "#f4f7fc", size = 0))
p
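One thing worth keeping in mind is that a plain difference between two shares is measured in percentage points, which is not the same thing as growth relative to the 2020 value. A small base R sketch, using three of the rows from the table above and made-up column names (`y2020`, `y2021`, `points`, `relative`), shows how different the two can look.

```r
## three rows copied from the streaming table; column names are illustrative
streaming_pp <- data.frame(
    service = c("netflix", "hbo", "peacock"),
    y2020 = c(29, 3, 0),
    y2021 = c(20, 12, 5)
)

## difference in percentage points (what `2021` - `2020` gives)
streaming_pp$points <- streaming_pp$y2021 - streaming_pp$y2020

## growth relative to the 2020 share, in percent
streaming_pp$relative <- 100 * streaming_pp$points / streaming_pp$y2020
streaming_pp
```

Netflix and HBO move by the same 9 points in opposite directions, but relative to where they started those are very different changes, and a service starting from 0 has no finite relative growth at all.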

Each comparison in the original required two bars, and the separation between the services didn’t make that easy to read. In this version, the absolute scale is lost (services starting high and decreasing aren’t so distinct from those starting low and gaining slightly) so what about using a dumbbell plot to show the separation between the two values while retaining their overall scales, rather than relying on comparing two bars?

library(ggalt, include.only = "geom_dumbbell")

p <- ggplot(streaming) + 
  ggalt::geom_dumbbell(aes(factor(service, levels = rev(streaming$service)), 
                           x = `2020`, xend = `2021`), 
                       colour_x = "red3", size_x = 6,
                       colour_xend = "darkcyan", size_xend = 6,
                       size = 1, dot_guide = TRUE, 
                       dot_guide_size = 0.5, 
                       dot_guide_colour = "grey25") + 
  scale_x_continuous(labels = scales::percent_format(scale = 1), 
                   limits = c(0, 35)) +
  scale_y_discrete(labels = labels) + 
  labs(title = "US Streaming Market Share", x = "", y = "") +
  labs(subtitle = paste0("<span style = 'color: red3;'>2020</span> ",
                         "vs <span style = 'color: darkcyan;'>2021</span>")) + 
  theme_minimal() +
  theme(plot.title = element_text(family = fontfamily, size = 26), 
        plot.subtitle = element_text(family = fontfamily, size = 26)) +
  theme(axis.text.y = ggtext::element_markdown(), 
        plot.subtitle = ggtext::element_markdown(), 
        plot.background = element_rect(fill = "#f4f7fc", size = 0),
        legend.position = "none") 
p

## Warning: Use of `streaming$service` is discouraged. Use `service` instead.

One last option to bring this data to life would be to animate between the two states…

library(gganimate)

p <- ggplot(streaming_long) + 
  geom_point(aes(share, 
                 factor(service, levels = streaming$service)), 
             size = 5, color = "darkcyan") + 
  scale_y_discrete(labels = labels) +
  labs(title = "US Streaming Market Share", x = "", y = "") +
  labs(subtitle = "{closest_state}") + 
  scale_x_continuous(labels = scales::percent_format(scale = 1), 
                     limits = c(0, 35)) +
  theme_minimal() +
  theme(text = element_text(family = fontfamily), 
        plot.title = element_text(size = 26), 
        plot.subtitle = element_text(size = 26),
        legend.position = "none") +
  theme(axis.text.y = ggtext::element_markdown(),
        plot.background = element_rect(fill = "#f4f7fc", size = 0))
  
anim <- p + 
  transition_states(year, transition_length = 4) +
  ease_aes() +
  enter_fade() +
  exit_fade()
animate(anim, renderer = gifski_renderer())
anim_save(filename = "images/streaming.gif")

I think these last few communicate more than the original. It’s always amazing how easy it is to bring this data into R and start playing with different presentations. Sure, perfecting the plots can take a little while, but the data is there to be played with.

One visualization which might have been interesting would be if we had the data on the transitions between services. Then an alluvial plot would be very informative. Simulating that…

streaming_flow <- tibble::tribble(
    ~`2020`, ~Freq, ~`2021`,
  "netflix",    12, "netflix",
  "netflix",     4, "prime",
  "netflix",     6, "disney",
  "netflix",     3, "apple",
  "netflix",     4, "hbo",
  "prime",      16, "prime",
  "prime",       5, "netflix",
  "prime",       2, "hulu",
  "prime",       1, "disney",
  "prime",       3, "peacock",
  "prime",       4, "hbo",
  "hulu",       13, "hulu",
  "hulu",        1, "netflix",
  "hulu",        2, "prime",
  "hulu",        1, "disney",
  "disney",     11, "disney",
  "disney",      1, "netflix",
  "disney",      3, "prime",
  "disney",      1, "hulu",
  "apple",       4, "apple",
  "apple",       2, "netflix",
  "apple",       1, "prime",
  "peacock",     2, "peacock",
  "peacock",     1, "netflix",
  "hbo",         3, "hbo",
  "hbo",         1, "prime"
)

library(ggalluvial)

lodes <- to_lodes_form(streaming_flow, axes = c("2020", "2021"))
lodes <- left_join(lodes, logos,  by = c("stratum" = "service"))
lodes <- arrange(lodes, stratum)

ggplot(lodes, aes(x = x, stratum = stratum, alluvium = alluvium, 
                  y = Freq, fill = stratum, label = stratum, image = path)) + 
  geom_flow() +
  geom_stratum(alpha = 0.6) + 
  geom_text(stat = "stratum", size = 4) +
  scale_x_discrete(limits = c("2020", "2021"), 
                   expand = c(0.1, 0.1)) +
  labs(title = " US Streaming Market Share Flow", 
       subtitle = " 2020 to 2021") +
  theme_void() +
  theme(axis.text.x = element_text(size = 20, vjust = 3),
        plot.title = element_text(family = fontfamily, size = 26), 
        plot.subtitle = element_text(family = fontfamily, size = 26),
        plot.background = element_rect(fill = "#f4f7fc", size = 0),
        legend.position = "none")

I spent way too long trying to get the logos into this one, but there seems to be a fundamental issue with combining {ggalluvial} with {ggimage} which I couldn’t resolve.

Do you have a better way to present this data? Let me know in the comments or on Twitter.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.3 (2020-10-10)
##  os       Pop!_OS 20.10               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2021-07-02                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version  date       lib source        
##  ash            1.0-15   2015-09-01 [1] CRAN (R 4.0.3)
##  assertthat     0.2.1    2019-03-21 [1] CRAN (R 4.0.3)
##  BiocManager    1.30.12  2021-03-28 [1] CRAN (R 4.0.3)
##  blogdown       1.2      2021-03-04 [1] CRAN (R 4.0.3)
##  bookdown       0.21     2020-10-13 [1] CRAN (R 4.0.3)
##  callr          3.5.1    2020-10-13 [1] CRAN (R 4.0.3)
##  cli            2.2.0    2020-11-20 [1] CRAN (R 4.0.3)
##  colorspace     2.0-0    2020-11-11 [1] CRAN (R 4.0.3)
##  crayon         1.3.4    2017-09-16 [1] CRAN (R 4.0.3)
##  desc           1.2.0    2018-05-01 [1] CRAN (R 4.0.3)
##  devtools       2.3.2    2020-09-18 [1] CRAN (R 4.0.3)
##  digest         0.6.27   2020-10-24 [1] CRAN (R 4.0.3)
##  dplyr        * 1.0.2    2020-08-18 [1] CRAN (R 4.0.3)
##  ellipsis       0.3.1    2020-05-15 [1] CRAN (R 4.0.3)
##  evaluate       0.14     2019-05-28 [1] CRAN (R 4.0.3)
##  extrafont    * 0.17     2014-12-08 [1] CRAN (R 4.0.3)
##  extrafontdb    1.0      2012-06-11 [1] CRAN (R 4.0.3)
##  fansi          0.4.1    2020-01-08 [1] CRAN (R 4.0.3)
##  farver         2.0.3    2020-01-16 [1] CRAN (R 4.0.3)
##  fs             1.5.0    2020-07-31 [1] CRAN (R 4.0.3)
##  generics       0.1.0    2020-10-31 [1] CRAN (R 4.0.3)
##  ggalluvial   * 0.12.3   2020-12-05 [1] CRAN (R 4.0.3)
##  ggalt        * 0.4.0    2017-02-15 [1] CRAN (R 4.0.3)
##  gganimate    * 1.0.7    2020-10-15 [1] CRAN (R 4.0.3)
##  ggimage      * 0.2.8    2020-04-02 [1] CRAN (R 4.0.3)
##  ggplot2      * 3.3.3    2020-12-30 [1] CRAN (R 4.0.3)
##  ggplotify      0.0.7    2021-05-11 [1] CRAN (R 4.0.3)
##  ggtext         0.1.1    2020-12-17 [1] CRAN (R 4.0.3)
##  gifski         1.4.3-1  2021-05-02 [1] CRAN (R 4.0.3)
##  glue           1.4.2    2020-08-27 [1] CRAN (R 4.0.3)
##  gridGraphics   0.5-1    2020-12-13 [1] CRAN (R 4.0.3)
##  gridtext       0.1.4    2020-12-10 [1] CRAN (R 4.0.3)
##  gtable         0.3.0    2019-03-25 [1] CRAN (R 4.0.3)
##  hms            0.5.3    2020-01-08 [1] CRAN (R 4.0.3)
##  htmltools      0.5.0    2020-06-16 [1] CRAN (R 4.0.3)
##  jsonlite       1.7.2    2020-12-09 [1] CRAN (R 4.0.3)
##  KernSmooth     2.23-18  2020-10-29 [4] CRAN (R 4.0.3)
##  knitr          1.30     2020-09-22 [1] CRAN (R 4.0.3)
##  labeling       0.4.2    2020-10-20 [1] CRAN (R 4.0.3)
##  lifecycle      0.2.0    2020-03-06 [1] CRAN (R 4.0.3)
##  magick         2.6.0    2021-01-13 [1] CRAN (R 4.0.3)
##  magrittr       2.0.1    2020-11-17 [1] CRAN (R 4.0.3)
##  maps           3.3.0    2018-04-03 [1] CRAN (R 4.0.3)
##  markdown       1.1      2019-08-07 [1] CRAN (R 4.0.3)
##  MASS           7.3-53   2020-09-09 [4] CRAN (R 4.0.2)
##  memoise        1.1.0    2017-04-21 [1] CRAN (R 4.0.3)
##  munsell        0.5.0    2018-06-12 [1] CRAN (R 4.0.3)
##  pillar         1.4.7    2020-11-20 [1] CRAN (R 4.0.3)
##  pkgbuild       1.2.0    2020-12-15 [1] CRAN (R 4.0.3)
##  pkgconfig      2.0.3    2019-09-22 [1] CRAN (R 4.0.3)
##  pkgload        1.1.0    2020-05-29 [1] CRAN (R 4.0.3)
##  png            0.1-7    2013-12-03 [1] CRAN (R 4.0.3)
##  prettyunits    1.1.1    2020-01-24 [1] CRAN (R 4.0.3)
##  processx       3.4.5    2020-11-30 [1] CRAN (R 4.0.3)
##  progress       1.2.2    2019-05-16 [1] CRAN (R 4.0.3)
##  proj4          1.0-10.1 2021-01-26 [1] CRAN (R 4.0.3)
##  ps             1.5.0    2020-12-05 [1] CRAN (R 4.0.3)
##  purrr          0.3.4    2020-04-17 [1] CRAN (R 4.0.3)
##  R6             2.5.0    2020-10-28 [1] CRAN (R 4.0.3)
##  RColorBrewer   1.1-2    2014-12-07 [1] CRAN (R 4.0.3)
##  Rcpp           1.0.5    2020-07-06 [1] CRAN (R 4.0.3)
##  remotes        2.2.0    2020-07-21 [1] CRAN (R 4.0.3)
##  rlang          0.4.10   2020-12-30 [1] CRAN (R 4.0.3)
##  rmarkdown      2.6      2020-12-14 [1] CRAN (R 4.0.3)
##  rprojroot      2.0.2    2020-11-15 [1] CRAN (R 4.0.3)
##  Rttf2pt1       1.3.8    2020-01-10 [1] CRAN (R 4.0.3)
##  rvcheck        0.1.8    2020-03-01 [1] CRAN (R 4.0.3)
##  scales         1.1.1    2020-05-11 [1] CRAN (R 4.0.3)
##  sessioninfo    1.1.1    2018-11-05 [1] CRAN (R 4.0.3)
##  stringi        1.5.3    2020-09-09 [1] CRAN (R 4.0.3)
##  stringr        1.4.0    2019-02-10 [1] CRAN (R 4.0.3)
##  testthat       3.0.1    2020-12-17 [1] CRAN (R 4.0.3)
##  tibble         3.0.4    2020-10-12 [1] CRAN (R 4.0.3)
##  tidyr          1.1.2    2020-08-27 [1] CRAN (R 4.0.3)
##  tidyselect     1.1.0    2020-05-11 [1] CRAN (R 4.0.3)
##  tweenr         1.0.2    2021-03-23 [1] CRAN (R 4.0.3)
##  usethis        2.0.0    2020-12-10 [1] CRAN (R 4.0.3)
##  vctrs          0.3.6    2020-12-17 [1] CRAN (R 4.0.3)
##  withr          2.3.0    2020-09-22 [1] CRAN (R 4.0.3)
##  xfun           0.22     2021-03-11 [1] CRAN (R 4.0.3)
##  xml2           1.3.2    2020-04-23 [1] CRAN (R 4.0.3)
##  yaml           2.2.1    2020-02-01 [1] CRAN (R 4.0.3)
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.0
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

isEven without modulo

website@jcarroll.com.au (Jonathan Carroll) — Mon, 09 Mar 2020 00:00:00 +0000

You may have seen the memes going around about fun ways to program the straightforward function isEven() which returns TRUE if the input is even, and FALSE otherwise. I had a play with this and it turned into enough for a blog post, and a nice walk through some features of R.

The ‘traditional’ way to check if an integer is even is to check if it is divisible by 2. This can be achieved with the modulo operator %% which gives the remainder after dividing by another number. For example, 5 modulo 2 or 5 %% 2 gives 1 because 2 goes into 5 twice with 1 leftover. If a number x is even, it is an exact multiple of 2, and so x %% 2 == 0.

5 %% 2

## [1] 1

6 %% 2

## [1] 0
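Since negative inputs come up in the tests further down, it's worth a quick look at how R treats them. This is just a small illustration in base R: the result of `%%` takes the sign of the divisor, and `%/%` (integer division) rounds down, so the two always fit back together.

```r
(-1) %% 2    # 1, not -1
(-6) %% 2    # 0
5 %/% 2      # 2

## dividend == divisor * quotient + remainder, for negatives too
x <- -3:3
all(x == 2 * (x %/% 2) + x %% 2)
```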

A function which tests values of x for this property could be written as

## 1
isEven <- function(x) {
    ## traditional modulo check
    x %% 2 == 0
}

The == operation checks that the left side is equal to the right side (but not necessarily identical, e.g. the classes can be different) and returns either TRUE or FALSE (or NA, but that’s not an issue for the cases we’re looking at here). I’ve also relied on the fact that the result of the last statement in a function body is used as the return value if no explicit return() is used.
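A couple of small examples of those points, as a sketch; the function name `lastValue()` is made up for the illustration.

```r
1L == 1            # TRUE: an integer and a double with equal values
identical(1L, 1)   # FALSE: the types differ
NA == 1            # NA: comparisons with missing values are missing

## no explicit return(); the last evaluated expression is the result
lastValue <- function() {
    1 + 1
}
lastValue()
```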

Confirming that this works is as easy as trying some values. It’s always good to check that your function produces results you expect. It’s also a good idea to try some odd values to ensure you don’t hit edge-cases.

isEven(0)

## [1] TRUE

isEven(3)

## [1] FALSE

isEven(4)

## [1] TRUE

isEven(-1)

## [1] FALSE

isEven(-6)

## [1] TRUE

In the process of playing with this function I refined how I tested my code. I started with a set of confirming evaluations like above. Then I wanted to confirm that they actually gave the results I expected, so I refined it to

test_isEven <- function() {
    all(
        isEven(0) == TRUE,
        isEven(3) == FALSE,
        isEven(4) == TRUE,
        isEven(-1) == FALSE,
        isEven(-6) == TRUE
    )
}

Now I just had one function to call which confirmed that all these tests gave the expected results. Once I had explained this layout to myself with the word ‘expected’, I realised what I actually wanted was a test suite, and testthat is a great candidate for that. Refactoring the above into a series of expectations might look like

library(testthat)
test_isEven <- function() {
    test_that("isEven peforms as expected", {
        expect_true(isEven(0))
        expect_false(isEven(3))
        expect_true(isEven(4))
        expect_false(isEven(-1))
        expect_true(isEven(-6))
    })
}

Now I can test any implementation of isEven() with just one function call, and if one of the expectations fails I’ll know which it is. Running this with the above isEven() produces no output, so the tests succeeded

test_isEven()

The ‘no output’ might be concerning, so we can also run a negative control to make sure it breaks when things are broken. Let’s break the isEven() by reversing the test

## 1a
isEven <- function(x) {
    ## (broken) traditional modulo check
    x %% 2 == 1
}
test_isEven()

## Error: Test failed: 'isEven peforms as expected'
## * <text>:4: isEven(0) isn't true.
## * <text>:5: isEven(3) isn't false.
## * <text>:6: isEven(4) isn't true.
## * <text>:7: isEven(-1) isn't false.
## * <text>:8: isEven(-6) isn't true.

So, we can trust that if we make a mistake or don’t implement this properly, we’ll know. Typically a function you write would have a lot more safety checking, such as ensuring that we actually passed a value, and that it’s an integer, but for the sake of this post I’m going to assume that these are both guaranteed to be true.
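For completeness, here is one way that safety checking might look. This is only a sketch of the idea, not something used in the rest of the post, and the name `isEvenChecked()` is made up so it doesn't clobber the versions being tested.

```r
isEvenChecked <- function(x) {
    ## did we actually get a value?
    if (missing(x)) stop("x must be supplied")
    ## is it a single, finite, whole number?
    ok <- is.numeric(x) && length(x) == 1 && is.finite(x) && x == round(x)
    if (!ok) stop("x must be a single whole number")
    x %% 2 == 0
}

isEvenChecked(4)
isEvenChecked(-1)
## these would both stop with an error:
## isEvenChecked(2.5)
## isEvenChecked("four")
```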

This version of isEven() is simple and it works, but that’s not what the internet wants - a common challenge is to make a version of isEven() which doesn’t use modulo. Now we need to think a little more, but we can at least check any implementation with our tests.

I came up with a few, both from borrowing from other solutions and on my own. Let’s see…

If the last digit is any of 0, 2, 4, 6, or 8, then it’s an even number

## 2
isEven <- function(x) {
    ## ends with an even digit
    grepl("[02468]$", x)
}
test_isEven()

With that same idea, if the least significant bit (binary) is unset then it’s even

## 3
isEven <- function(x) {
    ## least significant bit is unset
    x == 0 || !bitwAnd(x, 1)
}
test_isEven()
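To see why that works, it can help to look at the bits themselves. A quick sketch in base R follows; the helper `low_bits()` is made up here and only shows the four lowest bits of a small non-negative integer.

```r
## the four lowest bits, most significant first
low_bits <- function(x) rev(as.integer(intToBits(x))[1:4])

low_bits(5L)      # 0 1 0 1 -> last bit set, odd
low_bits(6L)      # 0 1 1 0 -> last bit unset, even

## bitwAnd() with 1 keeps only that last bit
bitwAnd(5L, 1L)   # 1
bitwAnd(6L, 1L)   # 0
```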

Continuing down the bitwise path, if we can shift right and then left and get back to the original number, then it’s even

## 4
isEven <- function(x) {
    ## bitwise shift right then left
    !(x-(bitwShiftL(bitwShiftR(x, 1), 1)))
}
test_isEven()

If we alternate FALSE and TRUE counting from 0 to x then we get our answer

## 5
isEven <- function(x) {
    ## alternate TRUE/FALSE
    y <- FALSE
    for (i in 0:x) {
        y <- !y
    }
    return(y)
}
test_isEven()

We could do the same thing with recursion

## 6
isEven <- function(x) {
    ## recursion, n-1 is odd if n is even
    if (x == 0) return(TRUE)
    !isEven(abs(x) - 1)
}
test_isEven()

Not quite using modulo, integer division by 2, doubled, should return the original value

## 7
isEven <- function(x) {
    ## integer division, doubled
    2*(x %/% 2) == x
}
test_isEven()

Similarly, the result of regular division cast to integer, doubled, should return the original value

## 7a
isEven <- function(x) {
    ## normal division, doubled
    2*as.integer(x/2) == x
}
test_isEven()

If we start from a number and count towards 0 by twos then we will hit 0 if the number is even

## 8
isEven <- function(x) {
    ## moving by 2s towards 0 ends at 0
    y <- x
    repeat({
        if (y == 0) return(TRUE)
        if (sign(x) != sign(y)) return(FALSE)
        y <- y - sign(x)*2
    })
}
test_isEven()

We can write that a bit simpler if we only use the absolute value of x

## 8a (abs version)
isEven <- function(x) {
    ## moving by 2s towards 0 ends at 0
    y <- abs(x)
    repeat({
        if (y == 0) return(TRUE)
        if (y < 0) return(FALSE)
        y <- y - 2
    })
}
test_isEven()

Exploiting mathematical properties, we know that -1 to any even power returns 1

## 9
isEven <- function(x) {
    ## -1 to an even power is 1
    (-1)**x == 1
}
test_isEven()

The relationship \[\cos(x\pi) = -\cos(\pi)\] which holds only when x is even, can also be exploited

## 10
isEven <- function(x) {
    ## cos(x*pi) == -cos(pi) == 1 only for even x
    cos(x*pi) == -cos(pi)
}
}
test_isEven()
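The check works because the cosine of a whole multiple of pi alternates between 1 and -1. Here's a quick look at that pattern, along with a more tolerant variant (the name `isEvenCos()` is made up for this sketch) that compares with `all.equal()` instead of `==`, since an exact floating point comparison may become fragile for very large x.

```r
x <- 0:5
round(cos(x * pi))   # 1 -1 1 -1 1 -1

## hypothetical tolerant variant of the cosine check
isEvenCos <- function(x) {
    isTRUE(all.equal(cos(x * pi), 1))
}
isEvenCos(4)
isEvenCos(-1)
```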

Now for some more R-specific solutions… R rounds towards even integers, and we can exploit that

## 11
isEven <- function(x) {
    ## x + 0.5 rounds to the nearest even integer,
    ## which is x itself only when x is even
    round(x + 0.5) == x
}
test_isEven()
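The rounding behaviour this relies on is easy to see directly. As a small illustration, values exactly halfway between two integers go to the even neighbour, so adding 0.5 only rounds back to where it started for even values.

```r
round(c(0.5, 1.5, 2.5, 3.5))   # 0 2 2 4

x <- 0:5
round(x + 0.5)                 # 0 2 2 4 4 6
round(x + 0.5) == x            # TRUE FALSE TRUE FALSE TRUE FALSE
```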

We can create a vector of ‘every other integer’ and check whether a value is in there

## 12
isEven <- function(x) {
    ## is x in set of 'every other integer'?
    abs(x) %in% (0:abs(x))[c(TRUE, FALSE)]
}
test_isEven()

Creating a vector of TRUE and FALSE we can extract the element corresponding to a value

## 13
isEven <- function(x) {
    ## even/odd sequence
    if (x == 0) return(TRUE)
    rep(c(FALSE, TRUE), (abs(x)/2) + 1)[abs(x)]
}
test_isEven()

Then, starting to get really absurd, we could solve the equation \[2n = x\] which will have an integer n if x is even

## 14
isEven <- function(x) {
    ## integer solution to 2n = x?
    n <- solve(2, x)
    as.integer(n) == n
}
test_isEven()

And, lastly, for the truly absurd, we can use the fact that every odd single digit, written as an English word, contains an “e”, and that “zero” and “eight” are the only even digits that do. This requires a couple of extra packages, but can be done.

## 15
isEven <- function(x) {
    ## zero and eight are the only even
    ## last digit as words with an e
    last <- english::words(as.integer(stringr::str_sub(x, -1, -1)))
    last == "zero" || last == "eight" || !grepl("e", last)
}
test_isEven()
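
The claim about the letter “e” can be checked without any extra packages. This is a sketch only: the digit words are typed out by hand rather than generated by english::words, and the variable names are my own.

```r
## digit words for 0-9, written out by hand (no extra packages)
digit_words <- c("zero", "one", "two", "three", "four",
                 "five", "six", "seven", "eight", "nine")
digits <- 0:9
has_e <- grepl("e", digit_words, fixed = TRUE)

## every odd digit contains an "e"
all(has_e[digits %% 2 == 1])

## the even digits that contain an "e"
digit_words[has_e & digits %% 2 == 0]
```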

This isn’t an exhaustive list, but it seemed like a good place to stop looking. If you can think of more then add them to this thread on Twitter.

I hope this demonstrates the usefulness of writing functions and testing them with testthat. Plus, if the %% operator ever breaks, you have plenty of alternatives.

devtools::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.6.2 (2019-12-12)
##  os       Pop!_OS 19.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2020-03-09                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date       lib source        
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.2)
##  backports     1.1.5   2019-10-02 [1] CRAN (R 3.6.2)
##  blogdown      0.18    2020-03-04 [1] CRAN (R 3.6.2)
##  bookdown      0.17    2020-01-11 [1] CRAN (R 3.6.2)
##  callr         3.4.2   2020-02-12 [1] CRAN (R 3.6.2)
##  cli           2.0.1   2020-01-08 [1] CRAN (R 3.6.2)
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.6.2)
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.6.2)
##  devtools      2.2.2   2020-02-17 [1] CRAN (R 3.6.2)
##  digest        0.6.24  2020-02-12 [1] CRAN (R 3.6.2)
##  ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.6.2)
##  english       1.2-5   2020-01-26 [1] CRAN (R 3.6.2)
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.6.2)
##  fansi         0.4.1   2020-01-08 [1] CRAN (R 3.6.2)
##  fs            1.3.1   2019-05-06 [1] CRAN (R 3.6.2)
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.6.2)
##  htmltools     0.4.0   2019-10-04 [1] CRAN (R 3.6.2)
##  knitr         1.28    2020-02-06 [1] CRAN (R 3.6.2)
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.6.2)
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.6.2)
##  pkgbuild      1.0.6   2019-10-09 [1] CRAN (R 3.6.2)
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.6.2)
##  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 3.6.2)
##  processx      3.4.2   2020-02-09 [1] CRAN (R 3.6.2)
##  ps            1.3.2   2020-02-13 [1] CRAN (R 3.6.2)
##  R6            2.4.1   2019-11-12 [1] CRAN (R 3.6.2)
##  Rcpp          1.0.3   2019-11-08 [1] CRAN (R 3.6.2)
##  remotes       2.1.1   2020-02-15 [1] CRAN (R 3.6.2)
##  rlang         0.4.5   2020-03-01 [1] CRAN (R 3.6.2)
##  rmarkdown     2.1     2020-01-20 [1] CRAN (R 3.6.2)
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.6.2)
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.6.2)
##  stringi       1.4.5   2020-01-11 [1] CRAN (R 3.6.2)
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.6.2)
##  testthat    * 2.3.1   2019-12-01 [1] CRAN (R 3.6.2)
##  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.6.2)
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.6.2)
##  xfun          0.12    2020-01-13 [1] CRAN (R 3.6.2)
##  yaml          2.2.1   2020-02-01 [1] CRAN (R 3.6.2)
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/3.6
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

r-bugs :: object_size

website@jcarroll.com.au (Jonathan Carroll) — Sat, 12 Oct 2019 00:00:00 +0000

As soon as the R-Foundation posted that they’re inviting cleanup of old bugs, I knew it would be an opportunity to learn more about the way R works on the inside.

I started looking through the list of open bugs for somewhere I could help out. I barely know anything about the actual C internals of the language (I’m hoping to learn) so I figured it would be best to start with some parts of the code I’m familiar with.

I had an open bug report for the internals of glm which I extended with a reproducible example. I had another look through the open bug reports for “glm” in case there was another report of this that I had overlooked (not that I can find) and found another which seemed fairly straightforward - some out of date documentation.

Bug 16522

That seems approachable. I tested that the documentation was still in that state (it was) and that the example did what it said (it did). Lastly, I read through the source of that function to double-check that the return value would indeed be more general. In fact, the method return value could be either the name of the fitting function as a character string, e.g.

glm.fit2 <- glm.fit
glm(1:4 ~ rnorm(4), method = "glm.fit2")$method

## [1] "glm.fit2"

or the actual function definition, if it was provided that way

head(glm(1:4 ~ rnorm(4), method = glm.fit)$method)

##                                                                          
## 1 function (x, y, weights = rep(1, nobs), start = NULL, etastart = NULL, 
## 2     mustart = NULL, offset = rep(0, nobs), family = gaussian(),        
## 3     control = list(), intercept = TRUE, singular.ok = TRUE)            
## 4 {                                                                      
## 5     control <- do.call("glm.control", control)                         
## 6     x <- as.matrix(x)

(truncated for simplicity).

This made for a small, (-2/+3)-line patch which was accepted and is now part of the source of R.

Sidenote: I made this patch using git, but I should really be doing this via SVN. This thread by Michael Chirico is what I’ll be following from here on.

Now, on to the next bug.

I had a browse through the sections and found this one from 2013

Bug 15389

That seems innocent enough, right? My experience with object.size is when I look at how much memory a given object is taking up (incorrectly, I now understand after reading Advanced R). What I always liked about this function was that it has a format method so I could easily convert the standard output

object.size(mtcars)

## 7208 bytes

into a different unit

format(object.size(mtcars), "KB")

## [1] "7 Kb"
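
The conversion itself is just a division by a power of 1024 followed by rounding. A rough sketch of that arithmetic follows; to_legacy_units is a made-up helper for illustration, not something from utils.

```r
## hypothetical helper: bytes -> value in units of 1024^power
to_legacy_units <- function(bytes, power, digits = 1L) {
  round(bytes / 1024^power, digits = digits)
}

## 7208 bytes expressed in Kb (power = 1)
kb <- to_legacy_units(7208, power = 1L)
paste(kb, "Kb")
```

7208 / 1024 is a little over 7.03, which rounds to 7 at one decimal place and matches the "7 Kb" above.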

Of course, in order to do this (to know to convert from bytes to Kb) the object needs to be classed appropriately. That’s the case, here

class(object.size(mtcars))

## [1] "object_size"
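
To make the role of the class concrete, here is a minimal sketch in which the same number is formatted with and without the object_size class. The value 2048 is arbitrary, and this assumes the utils package is loaded as it is in a default session.

```r
## a plain number formats as a plain number
plain <- 2048
format(plain)

## the same value with the class attached picks up the
## object_size format method (registered by utils)
sized <- structure(2048, class = "object_size")
format(sized, units = "Kb")
```

Only the classed version understands the units argument, because that argument belongs to the object_size method.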

That isn’t the case for the size element of file.info, though

example_file <- file.path(.Library, "base", "R", "base.rdb")
file.info(example_file)[, c("size", "isdir", "mode")]

##                                      size isdir mode
## /usr/lib/R/library/base/R/base.rdb 973156 FALSE  644

which is just a number

class(file.info(example_file)$size)

## [1] "numeric"

The suggestion is to make this of class object_size, which would enable the nice formatting of the unit (even though a ‘file’ is not an ‘object’). Seems fair, let’s have a look at what needs to happen to make that work - hopefully it’s as simple as adding a class to the size element. I use RStudio, and I have a copy of the r-source in a project, so I can simply CTRL+SHIFT+F to search all files in this project for “file.info”. Sure enough, it’s in /src/library/base/R/files.R

file.info <- function(..., extra_cols = TRUE)
{
    res <- .Internal(file.info(fn <- c(...), extra_cols))
    res$mtime <- .POSIXct(res$mtime)
    res$ctime <- .POSIXct(res$ctime)
    res$atime <- .POSIXct(res$atime)
    class(res) <- "data.frame"
    attr(res, "row.names") <- fn # not row.names<- as that does a length check
    res
}

Not much to it, but some surprises for sure. What immediately jumps out to me is that the result becomes a data.frame through the very hacky “slap a class on it” method rather than as.data.frame(). The call to .Internal means the actual work is done at the C level, so there may not be much hope of changing the class of the size there.
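
For anyone who hasn’t seen the “slap a class on it” trick before, this is roughly what it looks like on a toy list. The column values and file names are invented; only the two assignments mirror what file.info does.

```r
## start from a plain named list of equal-length columns
res <- list(size = c(10, 20), isdir = c(FALSE, TRUE))

## the "slap a class on it" approach, as in file.info
class(res) <- "data.frame"
attr(res, "row.names") <- c("a.txt", "some_dir")

is.data.frame(res)
dim(res)
```

No validation happens along the way, which is what makes it fast and also what makes it feel hacky.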

The simplest thing would appear to be to convert res$size to class object_size as soon as it’s created. There’s sometimes a converting function, e.g. as.object_size, but I don’t see one here. Looking at the internals of the object.size function

object.size

## function (x) 
## structure(.Call(C_objectSize, x), class = "object_size")
## <bytecode: 0x564bd0b58e30>
## <environment: namespace:utils>

suggests it’s safe enough to slap the class on an object, so let’s try that. I’ll rename this function for now so we can see if it’s working

file.info2 <- function(..., extra_cols = TRUE)
{
    res <- .Internal(file.info(fn <- c(...), extra_cols))
    class(res$size) <- "object_size"
    res$mtime <- .POSIXct(res$mtime)
    res$ctime <- .POSIXct(res$ctime)
    res$atime <- .POSIXct(res$atime)
    class(res) <- "data.frame"
    attr(res, "row.names") <- fn # not row.names<- as that does a length check
    res
}

file.info2(example_file)

## Error in round(x/base^power, digits = digits): invalid second argument of length 0

Well, that didn’t work. Huh. Did we do something wrong?

file.info2(example_file)$size

## 973156 bytes

No, that actually works. Did it actually fail to return an object?

fi <- file.info2(example_file)

Huh, that works, too. Can we see what’s inside?

str(fi)

## 'data.frame':    1 obs. of  10 variables:
##  $ size  : 'object_size' num 973156
##  $ isdir : logi FALSE
##  $ mode  : 'octmode' int 644
##  $ mtime : POSIXct, format: "2019-01-15 06:33:43"
##  $ ctime : POSIXct, format: "2019-08-10 20:44:16"
##  $ atime : POSIXct, format: "2019-08-10 20:44:13"
##  $ uid   : int 0
##  $ gid   : int 0
##  $ uname : chr "root"
##  $ grname: chr "root"

Well, that looks to be what we wanted - the size element has the right class. If we print this, however

print(fi)

## Error in round(x/base^power, digits = digits): invalid second argument of length 0

So, the issue seems to be in the print method somewhere. Using traceback() we can see that the error comes from this line in utils:::format.object_size, which seems to be having trouble finding the digits value

paste(round(x/base^power, digits = digits), unit)

The formal arguments for utils:::format.object_size include a default of digits = 1L, so why is it not being found? Tracing back further we see that the format.object_size method was called from format.data.frame, which loops through the columns of the data.frame and formats them according to each class there

format.data.frame <- function(x, ..., justify = "none")
{
    nc <- length(x)
    <...>
    rval <- vector("list", nc)
    for(i in seq_len(nc))
       rval[[i]] <- format(x[[i]], ..., justify = justify)
    <...>

but there’s no mention of digits here. Tracing even further back, this is called from print.data.frame and passes its digits argument. That function has the signature

print.data.frame <-
    function(x, ..., digits = NULL, quote = FALSE, right = TRUE,
         row.names = TRUE, max = NULL)

and now we can see where the problem comes from - we try to print our data.frame which has an object_size column, but that print method sets digits = NULL, which is passed on to format.data.frame and from there to format.object_size, which has no way to deal with digits = NULL. Ugh.
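
The mechanism is easier to see with three toy functions standing in for the real ones. This is only a sketch under that assumption: print_like, format_df_like and format_size_like are invented names, and they report the value of digits rather than doing any formatting.

```r
## hypothetical stand-ins for format.object_size,
## format.data.frame and print.data.frame
format_size_like <- function(x, digits = 1L, ...) {
  if (is.null(digits)) "digits is NULL" else paste("digits =", digits)
}
format_df_like <- function(x, ...) format_size_like(x, ...)
print_like <- function(x, ..., digits = NULL) {
  format_df_like(x, ..., digits = digits)
}

format_size_like(2048)  # default of 1L is used
print_like(2048)        # explicit NULL replaces the default
```

A default only applies when the argument is missing; an explicitly passed NULL counts as supplied, so the 1L never gets a look in.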

Can we test that? Sure, let’s create a simple data.frame, make a column be object_size, and try to print it

df <- data.frame(x = 1, y = 2048)
print(df)

##   x    y
## 1 1 2048

class(df$y) <- "object_size"
print(df)

## Error in round(x/base^power, digits = digits): invalid second argument of length 0

This is the behaviour if we auto-print an object by just referencing it by name

df

## Error in round(x/base^power, digits = digits): invalid second argument of length 0

but what if we do provide the digits argument to an actual call?

print(df, digits = 1L)

##   x          y
## 1 1 2048 bytes

SOLVED! But what to do about it? I see three options:

OPTION 1: Don’t default to digits = NULL in print.data.frame. I’m not sure what the consequence of this would be across all calls to this function, but I can’t imagine there’s much use for that default. A better default would seem to be

getOption("digits")

## [1] 7

Looking at ?options we can see how this is intended to be used

digits: controls the number of significant digits to print when printing numeric values. It is a suggestion only. Valid values are 1…22 with default 7. See the note in print.default about values greater than 15.

OPTION 2: Make round deal with NULL values. I worry that is already handled in that an error message appropriate to that function is generated

round(3.14159, digits = NULL)

## Error in round(3.14159, digits = NULL): invalid second argument of length 0

OPTION 3: Make format.object_size capable of dealing with digits = NULL. The fact that it has no way of dealing with this value seems like an oversight, since it must be a valid value in order to pass it to round. Again, the option of using a more sensible default comes to mind, but there already is a default in place here (digits = 1L); it’s just being overridden.

Instead, we would need some sort of handling in this situation, such as

digits <- ifelse(is.null(digits), 1L, digits)

## or 

digits <- ifelse(is.null(digits), getOption("digits"), digits)
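
Since digits is a single value here, the same guard could arguably be written with a plain if/else instead of the vectorised ifelse(). A sketch, with digits_or_default being a made-up helper name rather than anything in the patch:

```r
## hypothetical helper: fall back to a default when digits is NULL
digits_or_default <- function(digits, default = 1L) {
  if (is.null(digits)) default else digits
}

digits_or_default(NULL)
digits_or_default(3L)
```

Either form handles the NULL case; which one reads better is a matter of taste.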

I think this last option is the most satisfactory (though perhaps the first should also be addressed) so let’s see if that’s sufficient. Unfortunately, since the issue is deeply nested within another function’s namespace, we can’t just write a new format.object_size and call print.data.frame and expect it to work (without rebuilding R itself – learning how to do that is on my TODO list). What we can do, though, is to write a format.object_size3, class our new column as object_size3, and test that. With this new function written, we get

format.object_size3 <- function(x, units = "b", standard = "auto", digits = 1L, ...)
{
  known_bases <- c(legacy = 1024, IEC = 1024, SI = 1000)
  known_units <- list(
    SI     = c("B", "kB",  "MB",  "GB", "TB", "PB",  "EB",  "ZB",  "YB"),
    IEC    = c("B", "KiB", "MiB", "GiB","TiB","PiB", "EiB", "ZiB", "YiB"),
    legacy = c("b", "Kb",  "Mb",  "Gb", "Tb", "Pb"),
    LEGACY = c("B", "KB",  "MB",  "GB", "TB", "PB") # <- only for "KB"
  )
  
  units <- match.arg(units,
                     c("auto", unique(unlist(known_units), use.names = FALSE)))
  standard <- match.arg(standard, c("auto", names(known_bases)))
  digits <- ifelse(is.null(digits), 1L, digits) # added
  
  if (standard == "auto") { ## infer 'standard' from 'units':
    standard <- "legacy" # default; may become "SI"
    if (units != "auto") {
      if (endsWith(units, "iB"))
        standard <- "IEC"
      else if (endsWith(units, "b"))
        standard <- "legacy"   ## keep when "SI" is the default
      else if (units == "kB")
        ## SPECIAL: Drop when "SI" becomes the default
        stop("For SI units, specify 'standard = \"SI\"'")
    }
  }
  base      <- known_bases[[standard]]
  units_map <- known_units[[standard]]
  
  if (units == "auto") {
    power <- if (x <= 0) 0L else min(as.integer(log(x, base = base)),
                                     length(units_map) - 1L)
  } else {
    power <- match(toupper(units), toupper(units_map)) - 1L
    if (is.na(power))
      stop(gettextf("Unit \"%s\" is not part of standard \"%s\"",
                    sQuote(units), sQuote(standard)), domain = NA)
  }
  unit <- units_map[power + 1L]
  ## SPECIAL: Use suffix 'bytes' instead of 'b' for 'legacy' (or always) ?
  if (power == 0 && standard == "legacy") unit <- "bytes"
  
  paste(round(x / base^power, digits=digits), unit)
}

file.info3 <- function(..., extra_cols = TRUE)
{
    res <- .Internal(file.info(fn <- c(...), extra_cols))
    class(res$size) <- "object_size3"
    res$mtime <- .POSIXct(res$mtime)
    res$ctime <- .POSIXct(res$ctime)
    res$atime <- .POSIXct(res$atime)
    class(res) <- "data.frame"
    attr(res, "row.names") <- fn # not row.names<- as that does a length check
    res
}

file.info3(example_file)[, c("size", "isdir", "mode")]

##                                            size isdir mode
## /usr/lib/R/library/base/R/base.rdb 973156 bytes FALSE  644

It works!

As of R 3.2.0 there are also some helper extractors of these elements (mode, mtime, size) so our improvement extends to file.size. We can test this if we update the helper to use our version

file.size3 <- function(...) file.info3(..., extra_cols = FALSE)$size
file.size3(example_file)

## [1] 973156
## attr(,"class")
## [1] "object_size3"

We can format that specifically

format(file.size3(example_file), "KB")

## [1] "950.3 Kb"

but it doesn’t print nicely on its own here. This is because we’ve artificially changed the class to object_size3 so we’re no longer plugging in to all the methods for object_size. I could go through and redefine all of those, but for testing, it’s easier to just reset everything to object_size, define file.size to use my custom version, and test that, and that works.

I’ve seen others create a package containing the modified code so that all of the namespacing works out, but in this case it would be a large chunk of the print, subset, and format methods.

Are we done? What about another look at that class(res) <- "data.frame" business?

Would it make more sense for that to use res <- as.data.frame(res)? Let’s try

file.info4 <- function(..., extra_cols = TRUE)
{
    res <- .Internal(file.info(fn <- c(...), extra_cols))
    class(res$size) <- "object_size3"
    res$mtime <- .POSIXct(res$mtime)
    res$ctime <- .POSIXct(res$ctime)
    res$atime <- .POSIXct(res$atime)
    res <- as.data.frame(res)
    attr(res, "row.names") <- fn # not row.names<- as that does a length check
    res
}

file.info4(example_file)

## Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class '"object_size3"' to a data.frame

Fair enough, there’s probably no as.data.frame.object_size3 method (the 3 is for our convenience, but there’s no as.data.frame.object_size either). It’s simple enough to add

as.data.frame.object_size3 <- as.data.frame.vector

file.info4(example_file)

## Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class '"octmode"' to a data.frame

The “mode” element has the file permissions as octmode. I feel we’re getting a bit off track if we need to add a lot of other as.data.frame methods, but here we go

as.data.frame.octmode <- as.data.frame.vector

file.info4(example_file)

##                                            size isdir mode
## /usr/lib/R/library/base/R/base.rdb 973156 bytes FALSE  644
##                                                  mtime               ctime
## /usr/lib/R/library/base/R/base.rdb 2019-01-15 06:33:43 2019-08-10 20:44:16
##                                                  atime uid gid uname
## /usr/lib/R/library/base/R/base.rdb 2019-08-10 20:44:13   0   0  root
##                                    grname
## /usr/lib/R/library/base/R/base.rdb   root

Right, that’s working. I’ll raise the question of whether or not it’s worth the effort to support the as.data.frame conversion or whether the forcing of the class is better - I’m honestly not sure which.

I wrote a bug report and corresponding patch that adds the new NULL-check line to the definition of format.object_size and submitted that.

I then wrote another patch to the bug report this started with which implements this (now minor) change to the size element of file.info. It may be the case that neither is welcome in the core source, and that’s fine. The bug can be closed as WONTFIX.

I’ve learned a lot about how the components of these work, and either way a bug should get closed. I’m off to find the next one.

Addendum: I wanted to test this out properly, so I rebuilt my modified version of R (with these two patches in place) from source using docker. Assuming you have docker set up, this seems to do the trick

## pull the r-base docker image
## this has most of the requirements to build R
docker pull r-base

## run the image with a command-line
## with your local svn repository mounted
## your path to your svn directory will vary
docker run -ti -v /home/USER/svn/:/svn/ r-base /bin/bash 

## update whatever is necessary to build R from the svn source
apt update
apt-get install libcurl4-openssl-dev 
apt-get install texinfo ## needed to build manuals
apt-get install texlive-latex-base ## needed to build vignettes
apt-get install texlive-latex-extra ## sty files
apt-get install subversion ## to work with svn

## ensure svn is up to date
cd /svn/r-devel/
svn update

## build R and check that it works
## we're just using the command line, so X11 is not needed
## we're just building the source, so we don't need e.g. MASS
## we don't need to be using java, so don't include that
./configure --with-x=no --without-recommended-packages --disable-byte-compiled-packages --disable-java
## make in parallel
make -j4
make check
make check-all
make install

bin/R
## R Under development (unstable) (2019-10-10 r77275) -- "Unsuffered Consequences"

Woohoo!

Note that this is ephemeral - the changes live only in this container, so a fresh docker run of r-base starts without them. To prevent that, we can save the container as an image for re-use (from outside the container)

## your container ID will be different
docker commit -m "build env" 8f42f23123dd r-build-env

With that built, we can try out our patched functions (not run locally here)

example_file <- file.path(.Library, "base", "R", "base.rdb")
file.info(example_file)[, c("size", "isdir", "mode")]
#>                                              size isdir mode
#> /svn/r-devel/library/base/R/base.rdb 357435 bytes FALSE  644

file.size(example_file)
#> 357435 bytes

format(file.size(example_file), "KB")
#> [1] "349.1 Kb"

Success!

devtools::session_info()

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Pop!_OS 19.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2019-10-12                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version date       lib source                           
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.2)                   
##  backports     1.1.4   2019-04-10 [1] CRAN (R 3.5.2)                   
##  blogdown      0.14.1  2019-08-11 [1] Github (rstudio/blogdown@be4e91c)
##  bookdown      0.12    2019-07-11 [1] CRAN (R 3.5.2)                   
##  callr         3.3.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.2)                   
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.1)                   
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.1)                   
##  devtools      2.2.1   2019-09-24 [1] CRAN (R 3.5.2)                   
##  digest        0.6.20  2019-07-04 [1] CRAN (R 3.5.2)                   
##  ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.5.2)                   
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.5.2)                   
##  fs            1.3.1   2019-05-06 [1] CRAN (R 3.5.2)                   
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.5.2)                   
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.1)                   
##  knitr         1.25    2019-09-18 [1] CRAN (R 3.5.2)                   
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.1)                   
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.1)                   
##  pkgbuild      1.0.4   2019-08-05 [1] CRAN (R 3.5.2)                   
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)                   
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.1)                   
##  processx      3.4.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.1)                   
##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.1)                   
##  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.5.2)                   
##  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.5.2)                   
##  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.5.2)                   
##  rmarkdown     1.14    2019-07-12 [1] CRAN (R 3.5.2)                   
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.1)                   
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)                   
##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.5.2)                   
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.1)                   
##  testthat      2.2.1   2019-07-25 [1] CRAN (R 3.5.2)                   
##  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.5.2)                   
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.1)                   
##  xfun          0.8     2019-06-25 [1] CRAN (R 3.5.2)                   
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)                   
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

{ggtext} for images as x-axis labels

website@jcarroll.com.au (Jonathan Carroll) — Tue, 13 Aug 2019 00:00:00 +0000

I’ve written a few times about using an image as an x-axis label, and the solutions have been slowly improving. This one blows all of them out of the water.

Claus Wilke (@ClausWilke) now has a {ggtext} package which can very neatly add images as x-axis labels!

This makes the solution as simple as

library(ggplot2)
library(ggtext)
library(rvest)

## GDP per capita, top 11 countries
url      <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita"
html     <- xml2::read_html(url)
gdppc    <- html_table(html_nodes(html, "table")[3])[[1]][1:11,]

## clean up; remove non-ASCII and perform type conversions
gdppc$Country <- gsub("Â ", "", gdppc$Country)
gdppc$Country <- iconv(gdppc$Country, "latin1", "ASCII", sub="")
gdppc$Country[9] <- "United States of America"
gdppc$Rank    <- iconv(gdppc$Rank,    "latin1", "ASCII", sub="")
gdppc$`US$`   <- as.integer(sub(",", "", gdppc$`US$`))

## switching to a different source of flag images
labels <- setNames(
  paste0("<img src='http://www.senojflags.com/images/country-flag-icons/", 
         gsub(" ", "-", gdppc$Country), "-Flag.png'  width='30' /><br>", 
         sapply(strwrap(gdppc$Country, width = 10, simplify = FALSE), 
                function(x) paste(x, collapse = "<br>"))),
  gdppc$Country
)

## create a dummy dataset
npoints <- length(gdppc$Country)
y       <- gdppc$`US$`
x       <- gdppc$Country
dat     <- data.frame(x=factor(x, levels=gdppc$Country), y=y)

## NB: #85bb65 is the color of money in the USA apparently.
gg <- ggplot(dat, aes(x=x, y=y/1e3L, group=1)) 
gg <- gg + geom_bar(col="black", fill="#85bb65", stat="identity")
gg <- gg + scale_x_discrete(name = NULL, labels = labels)
gg <- gg + theme_minimal()
gg <- gg + scale_fill_discrete(guide=FALSE)
gg <- gg + theme(plot.background = element_rect(fill="grey90"))
gg <- gg + labs(title="GDP per capita", 
                subtitle="Top 11 countries", 
                x="", y="$US/1000", 
                caption=paste0("Source: ",url))
## ggtext::element_markdown
gg <- gg + theme(axis.text.x  = element_markdown(color = "black", size = 7), 
                 axis.text.y  = element_text(size=14),
                 axis.title.x = element_blank())
gg

Figure 1: {ggtext} makes this easy!

This is an even nicer solution than I’ve been using, not only because it’s shorter to use, but also because it’s more flexible - markdown can be processed (using element_markdown) wherever element_text is used, such as axis text, legends, titles, etc.
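
To show what element_markdown actually receives, here is a base-R sketch that builds a single label the same way as the labels vector above. The base_url is a placeholder I’ve made up, and only one country is used to keep it short.

```r
## hypothetical inputs, standing in for gdppc$Country and the flag site
country  <- "United States of America"
base_url <- "https://example.com/flags/"

## wrap the name onto several lines joined by <br>
wrapped <- paste(strwrap(country, width = 10), collapse = "<br>")

## image tag above the wrapped name, named by the original level
label <- setNames(
  paste0("<img src='", base_url, gsub(" ", "-", country),
         "-Flag.png' width='30' /><br>", wrapped),
  country
)
label
```

The names of the vector are what scale_x_discrete matches against the factor levels, and the values are the markdown/HTML that gets rendered in their place.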

I think this finally closes this chapter, but now it’s time to make some really cool graphs with images.

devtools::session_info()

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Pop!_OS 19.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2019-08-13                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version   date       lib
##  assertthat    0.2.1     2019-03-21 [1]
##  backports     1.1.4     2019-04-10 [1]
##  bitops        1.0-6     2013-08-17 [1]
##  blogdown      0.14.1    2019-08-11 [1]
##  bookdown      0.12      2019-07-11 [1]
##  callr         3.3.1     2019-07-18 [1]
##  cli           1.1.0     2019-03-19 [1]
##  colorspace    1.4-1     2019-03-18 [1]
##  crayon        1.3.4     2017-09-16 [1]
##  curl          4.0       2019-07-22 [1]
##  desc          1.2.0     2018-05-01 [1]
##  devtools      2.1.0     2019-07-06 [1]
##  digest        0.6.20    2019-07-04 [1]
##  dplyr         0.8.3     2019-07-04 [1]
##  evaluate      0.14      2019-05-28 [1]
##  fs            1.3.1     2019-05-06 [1]
##  ggplot2     * 3.2.1     2019-08-10 [1]
##  ggtext      * 0.1.0     2019-08-13 [1]
##  glue          1.3.1     2019-03-12 [1]
##  gridtext      0.1.0     2019-08-13 [1]
##  gtable        0.3.0     2019-03-25 [1]
##  highr         0.8       2019-03-20 [1]
##  htmltools     0.3.6     2017-04-28 [1]
##  httr          1.4.1     2019-08-05 [1]
##  knitr         1.24      2019-08-08 [1]
##  labeling      0.3       2014-08-23 [1]
##  lazyeval      0.2.2     2019-03-15 [1]
##  magrittr      1.5       2014-11-22 [1]
##  markdown      1.1       2019-08-07 [1]
##  memoise       1.1.0     2017-04-21 [1]
##  munsell       0.5.0     2018-06-12 [1]
##  pillar        1.4.2     2019-06-29 [1]
##  pkgbuild      1.0.4     2019-08-05 [1]
##  pkgconfig     2.0.2     2018-08-16 [1]
##  pkgload       1.0.2     2018-10-29 [1]
##  png           0.1-7     2013-12-03 [1]
##  prettyunits   1.0.2     2015-07-13 [1]
##  processx      3.4.1     2019-07-18 [1]
##  ps            1.3.0     2018-12-21 [1]
##  purrr         0.3.2     2019-03-15 [1]
##  R6            2.4.0     2019-02-14 [1]
##  Rcpp          1.0.2     2019-07-25 [1]
##  RCurl         1.95-4.12 2019-03-04 [1]
##  remotes       2.1.0     2019-06-24 [1]
##  rlang         0.4.0     2019-06-25 [1]
##  rmarkdown     1.14      2019-07-12 [1]
##  rprojroot     1.3-2     2018-01-03 [1]
##  rvest       * 0.3.4     2019-05-15 [1]
##  scales        1.0.0     2018-08-09 [1]
##  selectr       0.4-1     2018-04-06 [1]
##  sessioninfo   1.1.1     2018-11-05 [1]
##  stringi       1.4.3     2019-03-12 [1]
##  stringr       1.4.0     2019-02-10 [1]
##  testthat      2.2.1     2019-07-25 [1]
##  tibble        2.1.3     2019-06-06 [1]
##  tidyselect    0.2.5     2018-10-11 [1]
##  usethis       1.5.1     2019-07-04 [1]
##  withr         2.1.2     2018-03-15 [1]
##  xfun          0.8       2019-06-25 [1]
##  xml2        * 1.2.2     2019-08-09 [1]
##  yaml          2.2.0     2018-07-25 [1]
##  source                              
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  Github (rstudio/blogdown@be4e91c)   
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  Github (clauswilke/ggtext@5c7cfa9)  
##  CRAN (R 3.5.2)                      
##  Github (clauswilke/gridtext@21b7198)
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.1)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.2)                      
##  CRAN (R 3.5.1)                      
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

forcats::fct_match

website@jcarroll.com.au (Jonathan Carroll) — Fri, 22 Feb 2019 23:57:25 +0000

This journey started almost exactly a year ago, but it’s finally been sufficiently worked through and merged! Yay, I’ve officially contributed to the tidyverse (minor as it may be).

I’m at least as useful as Zoidberg

It began with a tweet, recalling a surprise I encountered that day during some routine data processing

Source of today's mild heart-attack: I have categories W, X_Y, and Z in some data. Intending to keep only the second two:

data %>% filter(g %in% c(“X Y”, “Z”)

Did you spot that I used a space instead of an underscore? I sure as heck didn't, and filtered excessively to just Z.
— Jonathan Carroll (@carroll_jono) March 6, 2018

For those of you not so comfortable with pipes and dplyr, I was trying to subset a data.frame ‘data’ (with a column g having values "W", "X_Y" and "Z") to only those rows for which the column g had the value "X_Y" or "Z" (not the actual values, of course, but that’s the idea). Without dplyr this might simply be

data[data$g %in% c("X Y", "Z"), ]

To make that more concrete, let’s actually show it in action

data <- data.frame(a = 1:5, g = c("X_Y", "W", "Z", "Z", "W"))
data

##   a   g
## 1 1 X_Y
## 2 2   W
## 3 3   Z
## 4 4   Z
## 5 5   W

data %>% 
   filter(g %in% c("X Y", "Z"))

##   a g
## 1 3 Z
## 2 4 Z

filter isn’t at fault here – the same issue would arise with [ – I have mis-specified the values I wish to match, so I am returned only the matching values. %in% is also performing its job - it returns a logical vector; the result of comparing the values in the column g to the vector c("X Y", "Z"). Both of these functions are behaving as they should, but the logic of what I was trying to achieve (subset to only these values) was lost.
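To see what %in% is handing over to filter here, a small base R sketch (rebuilding the same toy data as above) looks at the logical vector on its own:

```r
# Same toy data as above
data <- data.frame(a = 1:5, g = c("X_Y", "W", "Z", "Z", "W"))

# One TRUE/FALSE per row of g; the mistyped "X Y" matches nothing,
# and nothing complains about that
keep <- data$g %in% c("X Y", "Z")
keep

# Subsetting with that vector keeps only the "Z" rows
data[keep, ]
```

The "X_Y" row is dropped simply because its entry in the vector is FALSE, with no hint that one of the requested values was never found.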

Now, in some instances, that is exactly the behaviour you want – subset this vector to any of these values… where those values may not be present in the vector to begin with

data %>% 
   filter(values %in% all_known_values)

The problem, for me, is that there isn’t a way to say “all of these should be there”. The lack of matching happens silently. If you make a typo, you don’t get that level, and you aren’t told that it’s been skipped

simpsons_characters %>% 
   filter(first_name %in% c("Homer", "Marge", "Bert", "Lisa", "Maggie"))

Technically this is a double-post because I also want to sidenote this with something I am amazed I have not known about yet (I was approximately today years old when I learned about this)… I’ve used regex matching for a while, and have been surprised at how well I’ve been able to make it work occasionally. I’m familiar with counting patterns ((A){2} to match two occurrences of A) and ranges of counts ((A){2,4} to match between two and four occurrences of A) but I was not aware that you can specify a number of mistakes that can be included to still make a match…

grep("Bart", c("Bart", "Bort", "Brat"), value = TRUE)

## [1] "Bart"

grep("(Bart){~1}", c("Bart", "Bort", "Brat"), value = TRUE)

## [1] "Bart" "Bort"

(“Are you matching to me?”… “No, my regex also matches to ‘Bort’”)

Use (pattern){~n} to allow up to n substitutions in the pattern matching. Refer here and here.
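One way to see why "Bort" is let through while "Brat" is not is to count the edits between the pattern and each candidate. This is a side sketch rather than part of the original example: it uses adist() from the utils package, which reports the generalized Levenshtein distance between strings.

```r
candidates <- c("Bart", "Bort", "Brat")

# Number of single-character edits (insertions, deletions,
# substitutions) needed to turn "Bart" into each candidate
d <- drop(utils::adist("Bart", candidates))
names(d) <- candidates
d

# Only the candidates at most one edit away
candidates[d <= 1]
```

Swapping two adjacent letters, as in "Brat", counts as two edits here, which is consistent with it falling outside the {~1} allowance above.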

Back to the original problem – filter and %in% are doing their jobs, but we aren’t getting the result we want because we made a typo, and we aren’t told that we’ve done so.

Enter a new PR to forcats (originally to dplyr, but forcats does make more sense) which implements fct_match(f, lvls). This checks that all of the values in lvls are actually present in f before returning the logical vector of which entries they correspond to. With this, the pattern becomes (after loading the development version of forcats from github)

data %>% 
   filter(fct_match(g, c("X Y", "Z")))

## Error: Levels not present in factor: "X Y"

Yay! We’re notified that we’ve made an error. "X Y" isn’t actually in our column g. If we don’t make the error, we get the result we actually wanted in the first place. We can now use this successfully

data %>% 
   filter(fct_match(g, c("X_Y", "Z")))

##   a   g
## 1 1 X_Y
## 2 3   Z
## 3 4   Z
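The idea behind this check is small enough to sketch in base R. To be clear, this is not the forcats implementation; match_strict is a made-up name and the error wording is my own, but it follows the same description: refuse to continue if any requested value is absent, otherwise return the usual logical vector.

```r
# Hypothetical helper, not the forcats implementation
match_strict <- function(x, values) {
  # Which requested values never occur in x?
  missing_values <- setdiff(values, unique(as.character(x)))
  if (length(missing_values) > 0) {
    stop("Values not present: ",
         paste0('"', missing_values, '"', collapse = ", "),
         call. = FALSE)
  }
  # Everything requested exists, so plain %in% is now safe
  x %in% values
}

g <- c("X_Y", "W", "Z", "Z", "W")
match_strict(g, c("X_Y", "Z"))
```

Calling match_strict(g, c("X Y", "Z")) stops with an error naming "X Y" instead of silently returning fewer rows.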

It took a while for the PR to be addressed (the tidyverse crew have plenty of backlog, no doubt) but after some minor requested changes and a very neat cleanup by Hadley himself, it’s been merged.

My original version had a few bells and whistles that the current implementation has put aside. The first was inverting the matching with fct_exclude to make it easier to negate the matching without having to create a new anonymous function, i.e. ~!fct_match(.x). I find this particularly useful since a pipe expects a call/named function, not a lambda/anonymous function, which is actually quite painful to construct

data %>%
   pull(g) %>%
   (function(x) !fct_match(x, c("X_Y", "Z")))

## [1] FALSE  TRUE FALSE FALSE  TRUE

whereas if we defined

fct_exclude <- function(f, lvls, ...) !fct_match(f, lvls, ...)

we can use

data %>%
   pull(g) %>%
   fct_exclude(c("X_Y", "Z"))

## [1] FALSE  TRUE FALSE FALSE  TRUE

The other was specifying whether or not to include missing levels when considering if lvls is a valid value in f since unique(f) and levels(f) can return different answers.
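A quick illustration of that last point, using a throwaway factor that declares a level which never appears in the data:

```r
# A factor that declares a level "c" which never occurs in the data
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

levels(f)                # all declared levels, used or not
as.character(unique(f))  # only the values actually present

# So whether "c" counts as 'present' depends on which one you ask
"c" %in% levels(f)
"c" %in% unique(f)
```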

The cleanup really made me think about how much ‘fluff’ some of my code can have. Sure, it’s nice to encapsulate some logic in a small additional function, but sometimes you can actually replace all of that with a one-liner and not need all that. If you’re ever in the mood to see how compact internal code can really be, check out the source of forcats.

Hopefully this pattern of filter(fct_match(f, lvls)) is useful to others. It’s certainly going to save me overlooking some typos.

devtools::session_info()

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Pop!_OS 19.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2019-08-13                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version date       lib source                           
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.2)                   
##  backports     1.1.4   2019-04-10 [1] CRAN (R 3.5.2)                   
##  blogdown      0.14.1  2019-08-11 [1] Github (rstudio/blogdown@be4e91c)
##  bookdown      0.12    2019-07-11 [1] CRAN (R 3.5.2)                   
##  callr         3.3.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.2)                   
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.1)                   
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.1)                   
##  devtools      2.1.0   2019-07-06 [1] CRAN (R 3.5.2)                   
##  digest        0.6.20  2019-07-04 [1] CRAN (R 3.5.2)                   
##  dplyr       * 0.8.3   2019-07-04 [1] CRAN (R 3.5.2)                   
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.5.2)                   
##  forcats     * 0.4.0   2019-02-17 [1] CRAN (R 3.5.1)                   
##  fs            1.3.1   2019-05-06 [1] CRAN (R 3.5.2)                   
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.5.2)                   
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.1)                   
##  knitr         1.24    2019-08-08 [1] CRAN (R 3.5.2)                   
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.1)                   
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.1)                   
##  pillar        1.4.2   2019-06-29 [1] CRAN (R 3.5.2)                   
##  pkgbuild      1.0.4   2019-08-05 [1] CRAN (R 3.5.2)                   
##  pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.1)                   
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)                   
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.1)                   
##  processx      3.4.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.1)                   
##  purrr         0.3.2   2019-03-15 [1] CRAN (R 3.5.2)                   
##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.1)                   
##  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.5.2)                   
##  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.5.2)                   
##  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.5.2)                   
##  rmarkdown     1.14    2019-07-12 [1] CRAN (R 3.5.2)                   
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.1)                   
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)                   
##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.5.2)                   
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.1)                   
##  testthat      2.2.1   2019-07-25 [1] CRAN (R 3.5.2)                   
##  tibble        2.1.3   2019-06-06 [1] CRAN (R 3.5.2)                   
##  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.1)                   
##  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.5.2)                   
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.1)                   
##  xfun          0.8     2019-06-25 [1] CRAN (R 3.5.2)                   
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)                   
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

Even more images as x-axis labels

website@jcarroll.com.au (Jonathan Carroll) — Tue, 16 Oct 2018 23:18:32 +0000

This is the last update to this strange saga… I hope.

Image labels… Photo: http://www.premierpaper.com/shop/custom-labels/

Easily two of the most popular posts on my blog are this one and this one describing a couple of ways in which I managed to hack together using an image as a category label in a ggplot.

There are likely many people who believe one should never do such a thing, but given the popularity, it seems a lot of people aren’t listening to that. Good on you.

One of these posts was recently shared again by the amazing #rstats amplifier Mara Averick (if you’re not following her on Twitter, you’re missing out) and [@baptiste_auguie](https://twitter.com/baptiste_auguie) (the saviour of the previous implementation) mentioned that he had written a ‘hack’ to get chemical symbols as a categorical axis label using tikzDevice. That package leverages $\LaTeX$ (with which I am very familiar, having written my PhD thesis entirely in $\LaTeX$ many moons ago) to treat all of the text in an image as potential $\LaTeX$ commands and produce a working source code which generates the required plot.

The example code is straightforward enough

options(tikzLatexPackages = c(getOption('tikzLatexPackages'),
                              "\\usepackage{acide-amine}\n")) 

d = data.frame(x=1:10, y=1:10, f=factor(sample(letters[1:2], 10, repl=TRUE))) 

p <- qplot(x,y,data=d) + theme_bw() + 
  theme(plot.margin = unit(c(1, 1, 5, 1), "lines"), 
       axis.text.x = element_text(size = 12 * 
        0.8, lineheight = 0.9, vjust = 10)) + 
  scale_x_continuous(breaks = c(2, 8), labels=c("\\phe{15}", "\\leu{15}")) 

tikz("annotation.tex", standAlone=TRUE, width=4, height=4) 
print(p)
dev.off()  # close the device so annotation.tex is fully written

and produces this

This got me curious, though – if it can process arbitrary $\LaTeX$, could it process an \includegraphics call?

Efficient! If it's arbitrary LaTeX, could the labels just be \includegraphics calls?
— Jonathan Carroll (@carroll_jono) October 11, 2018

Yes, as it turns out.

A quick test showed that it was indeed possible, which only leaves re-implementing the previous posts’ images using this method.

I’ve done so, and the code isn’t particularly shorter than the other method.

Producing nearly the same end result.

tikzDevice result

There are a few differences compared to the previous version(s):

I had a request for rotating the additional text, which I actually [also updated recently](https://gist.github.com/jonocarroll/2f9490f1f5e7c82ef8b791a4b91fc9ca#file-images_as_xaxis_labels_updated-r), and it seemed to fit better, so I rotated the labels within the $\LaTeX$ command.
Since all of the text has been rendered via $\LaTeX$, the fonts are a bit different.
The rankings have since changed, so I’ve added an 11th to keep Australia in the list.

The $\LaTeX$ component of this also meant that a few changes were necessary in the other labels, such as the dollar sign in the y-axis label, and the underscores throughout (these are considered special characters in $\LaTeX$). Lastly, the result of running the tikz command is that a .tex ($\LaTeX$ source code) file is produced. This isn’t quite the plot image file we want. It does however have the commands to generate one. The last steps in the above gist are to process this .tex file with $\LaTeX$. Here I used the tools::texi2dvi function, but one could also use a system command to their $\LaTeX$ installation.
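The escaping of special characters mentioned above can be sketched with a small helper. This is an illustration rather than the code from the gist; escape_latex is a made-up name, and it only handles the two characters discussed here (the dollar sign and the underscore).

```r
# Hypothetical helper: put a backslash in front of $ and _ so that
# LaTeX treats them as literal characters
escape_latex <- function(x) {
  gsub("([$_])", "\\\\\\1", x)
}

# Printed with cat() to show the string LaTeX would actually receive
cat(escape_latex("GDP_per_capita ($)"), "\n")
```

Other special characters (%, &, # and so on) would need the same treatment if they appeared in a label.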

That still only produced a PDF. The last step was to use the magick package to convert this into an image.
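Put together, the two post-processing steps might look something like the following. Treat it as a rough sketch under assumptions: tex_to_png is a made-up wrapper, the file names and the density value are placeholders, and it needs both a working $\LaTeX$ installation and a magick build that can read PDFs.

```r
# Hypothetical wrapper around the two steps; needs a LaTeX installation
# and the magick package
tex_to_png <- function(tex_file, png_file, density = 300) {
  # Compile the .tex source; the PDF is written to the working directory
  tools::texi2dvi(tex_file, pdf = TRUE, clean = TRUE)
  pdf_file <- sub("\\.tex$", ".pdf", basename(tex_file))

  # Rasterise the PDF and save it as a PNG
  img <- magick::image_read(pdf_file, density = density)
  magick::image_write(img, path = png_file, format = "png")
}

# Example call (not run here):
# tex_to_png("annotation.tex", "annotation.png")
```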

Overall, this is a nice proof of concept, but I don’t think it’s a particularly tidy way of achieving the goal of image axis labels. It does however lay the groundwork for anyone else who decides this might be a useful route to take. Plus I learned a bit more about how tikzDevice works and got to revisit my $\LaTeX$ knowledge.

devtools::session_info()

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Pop!_OS 19.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2019-08-13                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version date       lib source                           
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.2)                   
##  backports     1.1.4   2019-04-10 [1] CRAN (R 3.5.2)                   
##  blogdown      0.14.1  2019-08-11 [1] Github (rstudio/blogdown@be4e91c)
##  bookdown      0.12    2019-07-11 [1] CRAN (R 3.5.2)                   
##  callr         3.3.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.2)                   
##  colorspace    1.4-1   2019-03-18 [1] CRAN (R 3.5.2)                   
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.1)                   
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.1)                   
##  devtools      2.1.0   2019-07-06 [1] CRAN (R 3.5.2)                   
##  digest        0.6.20  2019-07-04 [1] CRAN (R 3.5.2)                   
##  dplyr         0.8.3   2019-07-04 [1] CRAN (R 3.5.2)                   
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.5.2)                   
##  filehash      2.4-2   2019-04-17 [1] CRAN (R 3.5.2)                   
##  fs            1.3.1   2019-05-06 [1] CRAN (R 3.5.2)                   
##  ggplot2     * 3.2.1   2019-08-10 [1] CRAN (R 3.5.2)                   
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.5.2)                   
##  gtable        0.3.0   2019-03-25 [1] CRAN (R 3.5.2)                   
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.1)                   
##  knitr         1.24    2019-08-08 [1] CRAN (R 3.5.2)                   
##  lazyeval      0.2.2   2019-03-15 [1] CRAN (R 3.5.2)                   
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.1)                   
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.1)                   
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 3.5.1)                   
##  pillar        1.4.2   2019-06-29 [1] CRAN (R 3.5.2)                   
##  pkgbuild      1.0.4   2019-08-05 [1] CRAN (R 3.5.2)                   
##  pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.1)                   
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)                   
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.1)                   
##  processx      3.4.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.1)                   
##  purrr         0.3.2   2019-03-15 [1] CRAN (R 3.5.2)                   
##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.1)                   
##  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.5.2)                   
##  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.5.2)                   
##  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.5.2)                   
##  rmarkdown     1.14    2019-07-12 [1] CRAN (R 3.5.2)                   
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.1)                   
##  scales        1.0.0   2018-08-09 [1] CRAN (R 3.5.1)                   
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)                   
##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.5.2)                   
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.1)                   
##  testthat      2.2.1   2019-07-25 [1] CRAN (R 3.5.2)                   
##  tibble        2.1.3   2019-06-06 [1] CRAN (R 3.5.2)                   
##  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.1)                   
##  tikzDevice  * 0.12.3  2019-08-07 [1] CRAN (R 3.5.2)                   
##  tinytex     * 0.15    2019-08-07 [1] CRAN (R 3.5.2)                   
##  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.5.2)                   
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.1)                   
##  xfun          0.8     2019-06-25 [1] CRAN (R 3.5.2)                   
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)                   
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

Adding strings in R

website@jcarroll.com.au (Jonathan Carroll) — Sat, 06 Oct 2018 00:09:15 +0000

This started out as a “hey, I wonder…” sort of thing, but as usual, they tend to end up as interesting voyages into the deepest depths of code, so I thought I’d write it up and share. Shoutout to [@coolbutuseless](https://twitter.com/coolbutuseless) for proving that a little curiosity can go a long way and inspiring me to keep digging into interesting topics.

This is what you get if you “glue” “strings”. Photo: https://craftwhack.com/cool-craft-string-easter-eggs/

This post came across my feed last week, referring to the roperators package on CRAN. In that post, the author introduces an infix operator from that package which ‘adds’ (concatenates/pastes) strings

"using infix (%) operators " %+% "R can do simple string addition"

## [1] "using infix (%) operators R can do simple string addition"

This might be familiar if you use python

"python " + "adds " + "strings"

or javascript

"javascript " + "also adds " + "strings"
## javascript also adds strings

or perhaps even go

package main

import "fmt"

func main() {
  fmt.Println("go " + "even adds " + "strings")
}

or Julia

"julia can " * "add strings"

but this is not something natively available in R

"this doesn't" + "work"

## Error in "this doesn't" + "work": non-numeric argument to binary operator
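For comparison, the base R way to join strings is paste0(), and an infix operator along the lines of the one above takes a single line to define. The operator name %cat% below is made up for this sketch, so as not to suggest this is how roperators implements %+%.

```r
# The built-in way
paste0("this ", "works")

# A user-defined infix doing the same thing (illustrative name)
`%cat%` <- function(lhs, rhs) paste0(lhs, rhs)
"so does " %cat% "this"
```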

Could we make it work, though? That got me wondering. My first guess was to just create a new + function which does allow for this. The normal addition operator is

`+`

## function (e1, e2)  .Primitive("+")

so a first attempt might be

`+` <- function(e1, e2) {
  if (is.character(e1) | is.character(e2)) {
    paste0(e1, e2)
  } else {
    base::`+`(e1, e2)
  }
}

This checks to see if the left or right side of the operator is a character-classed object, and if either is, it pastes the two together. Otherwise it just uses the ‘regular’ addition operator between the two arguments. This works for simple cases, e.g.

"a" + "b"

## [1] "ab"

"a" + 2

## [1] "a2"

2 + 2

## [1] 4

2 + "a"

## [1] "2a"

But we hit an important snag if we try to add two character-represented numbers

"200" + "200"

## [1] "200200"

That’s probably going to be an issue if we read in unformatted data (e.g. from a CSV) as characters and try to treat it like numbers. Normally this would throw the above error about not being numeric, but now we get a silent weird number-character. That’s no good.

An extension to this checks whether or not we have the number-as-a-character situation and falls back to the correct interpretation in that case

`+` <- function(e1, e2) {
  ## unary
  if (missing(e2)) return(e1)
  if (!is.na(suppressWarnings(as.numeric(e1))) & !is.na(suppressWarnings(as.numeric(e2)))) {
    ## both arguments numeric-like but characters
    return(base::`+`(as.numeric(e1), as.numeric(e2)))
  } else if ((is.character(e1) & is.na(suppressWarnings(as.numeric(e1)))) | 
             (is.character(e2) & is.na(suppressWarnings(as.numeric(e2))))) {
    ## at least one true character 
    return(paste0(e1, e2))
  } else {
    ## both numeric
    return(base::`+`(e1, e2))
  }
}

"a" + "b"

## [1] "ab"

"a" + 2

## [1] "a2"

2 + 2

## [1] 4

2 + "a"

## [1] "2a"

"2" + "2"

## [1] 4

2 + "edgy" + 4 + "me"

## [1] "2edgy4me"

So, that’s one option for string addition in R. Is it the right one? The idea of actually dispatching on a character class is inviting. Can we just add a +.character method (since there doesn’t seem to already be one)? Normally when we have S3 dispatch we need a generic function which calls UseMethod() with the generic’s own name, but we don’t have that in this case. + is an internal generic, which is probably the first sign that we’re going to have trouble. If we try to define the method

`+` <- base::`+`
`+.character` <- function(e1, e2) {
  paste0(e1, e2)
}
"a" + "b"

## Error in "a" + "b": non-numeric argument to binary operator

It seems to fail. What went wrong? Is dispatch not working?

We want to dispatch on “character” – is that what we have?

class("a")

## [1] "character"

What if we explicitly create an object with that class?

structure("a", class = "character") + 2

## [1] "a2"

2 + structure("a", class = "character")

## [1] "2a"

What if we try to dispatch on some new class?

`+.foo` <- function(e1, e2) {
  paste0(e1, e2)
}

structure("a", class = "foo") + 2

## [1] "a2"

but no dice for just a regular atomic character object. Time to revisit the help pages.
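For contrast, this is what dispatch on a plain character vector looks like for an ordinary generic that calls UseMethod() itself. The function names are made up for illustration.

```r
# A regular (closure) generic; all names here are illustrative
describe <- function(x, ...) UseMethod("describe")
describe.character <- function(x, ...) "a character vector"
describe.default <- function(x, ...) "something else"

describe("a")  # dispatches on the implicit class "character"
describe(2)    # no numeric method defined, so falls to the default
```

Here a bare string happily reaches the character method, so the difference really is in how the + primitive decides whether to dispatch at all.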

In R, addition is limited to particular classes of objects, defined by the Ops group (there are also Math, Summary, and Complex groups). The methods for the Ops group members describe which classes can be involved in operations involving any of the Ops group members:

"+", "-", "*", "/", "^", "%%", "%/%"
"&", "|", "!"
"==", "!=", "<", "<=", ">=", ">"

These methods are:

eval(.S3methods("Ops"), envir = baseenv())

##  [1] Ops.data.frame      Ops.Date            Ops.difftime       
##  [4] Ops.factor          Ops.numeric_version Ops.ordered        
##  [7] Ops.POSIXt          Ops.quosure*        Ops.raster*        
## [10] Ops.roman*          Ops.ts*             Ops.unit*          
## see '?methods' for accessing help and source code

What’s missing from this list, in order for us to be able to just use “string” + “string” is a character method. What’s perhaps even more surprising is that there is a roman method! Whaaaat?

as.roman("1") + as.roman("5")

## [1] VI

as.roman("2000") + as.roman("18")

## [1] MMXVIII

Since the operations need to be defined for all the members of the Ops group, we would also need to define what to do with, say, * between strings. When one side is a string and the other is a number, a reasonable approach might be that which was taken in the original post (using a new infix %s*%)

"a" %s*% 3

##     a 
## "aaa"

There is, of course, a function to do this already

strrep("a", 3)

## [1] "aaa"

but I could see creating "a" * 3 as a shortcut to this. Note: this exists in python

"a" * 3

## 'aaa'

I don’t know what one would expect "a" * "b" to produce.
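One way to experiment with the whole group at once, without touching base R, is to give the strings a class of their own and write a single Ops method for it. This is only a sketch: the class name strlike and the choice of what + and * mean are assumptions made for illustration, only binary use is handled, and every other operator is simply rejected.

```r
# Illustrative class name "strlike"; not part of base R or any package
Ops.strlike <- function(e1, e2) {
  a <- unclass(e1)
  b <- unclass(e2)
  # .Generic holds the name of the operator that was actually called
  switch(.Generic,
    "+"  = structure(paste0(a, b), class = "strlike"),
    "*"  = structure(strrep(a, b), class = "strlike"),
    "==" = a == b,
    stop("'", .Generic, "' is not defined for strlike objects", call. = FALSE)
  )
}

x <- structure("a", class = "strlike")
unclass(x + "b")
unclass(x * 3)
```

Anything not listed, such as x / 2, ends in an error, which at least makes the undecided cases loud rather than silent.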

The problem with where this is heading is that we aren’t allowed to create the method for an atomic class, as Joris Meys and Brodie Gaslam point out on Twitter

Yes, you're right. Below is what I remembered, which suggested that if it were not sealed, it could be defined, but that isn't true b/c do_arith only dispatches on objects (as you point out), although in theory it could dispatch on atomics, but probably doesn't for speed. pic.twitter.com/UXk6Tdm3lW
— BrodieG (@BrodieGaslam) October 4, 2018

setMethod("+", c("character", "character"), function(e1, e2) paste0(e1, e2))

## Error in setMethod("+", c("character", "character"), function(e1, e2) paste0(e1, : the method for function '+' and signature e1="character", e2="character" is sealed and cannot be re-defined

so no luck there. Brodie also links to a Stack Overflow discussion on this very topic where it is pointed out by Martin Mächler that this has been discussed on r-devel – that makes for some interesting historical weigh-ins on why this isn’t a thing in R. Incidentally, the small-world effect comes into play regarding that Stack Overflow post as one of the three answers happens to be a former work colleague of mine.
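The "only dispatches on objects" point from the tweet can be checked from R itself. As a small sketch, is.object() reports whether a value carries a class attribute, which a bare string does not, even though class() still reports "character" for it:

```r
plain  <- "a"
tagged <- structure("a", class = "foo")

class(plain)          # implicit class, reported but not stored
attr(plain, "class")  # NULL: there is no class attribute
is.object(plain)      # FALSE

is.object(tagged)     # TRUE: has a class attribute
```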

So, in the end, it seems the best we can do is the rather long-winded overwrite of + which checks if the arguments really are characters. I don’t mind this, and would probably use it if it was in base R or a package. The biggest issue that people seem to have with this is that it ‘looks like’ addition, but it’s not commutative. If that word is new to you, it just means that x + y should give the same answer as y + x. For numbers, the regular + satisfies this:

2 + 3

## [1] 5

3 + 2

## [1] 5

but when we try to do this with strings… not so much

"a" + "b"

## [1] "ab"

"b" + "a"

## [1] "ba"

This doesn’t particularly bother me, because I’m okay with this not actually being ‘mathematical addition’. The fun turn this then took was the suggestion from Joris Meys that Julia’s non-associative operators are a strength of the language. There, the way that you group values matters

a + b + c is parsed as +(a, b, c) not +(+(a, b), c).

I’ll eventually get around to learning more Julia, but this is already hurting my brain.

That distinction may be of interest, however, to Miles McBain, whose concern was more about repeated applications of + being a bottleneck

I hate + for string concatenation. “a” + “b” + “c” is paste(“a”, paste(“b”,“c”)). So you end up copying the data in “b” and “c” twice due to the data being immutable. That can really add up fast with more +'s if you are careless. Like I was in my first programming job.
— Miles McBain (@MilesMcBain) October 4, 2018

In that case, parsing as +("a", "b", "c") is exactly what would be desired.
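For reference, R itself parses a chain of + as nested two-argument calls, which can be seen by picking apart the quoted expression. The final line shows that paste0(), by contrast, already takes all the pieces in a single call.

```r
e <- quote(a + b + c)

# The outer call has three elements: the function and two arguments
as.list(e)

# The first argument is itself the call a + b
e[[2]]

# paste0() takes all the pieces in one call
paste0("a", "b", "c")
```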

So, what’s the conclusion of all of this? I’ve learned (and re-learned) a heap more about how the Ops group works, I’ve played a lot with dispatch, and I’ve thought deeply about edge-cases for adding strings. I’ve also been exposed to a bit more Julia. All in all, a worthwhile dive into something potentially silly, but a lot of fun. If you have some thoughts on the matter, leave a comment here or reply on Twitter – I’d love to hear about another angle to this story.

devtools::session_info()

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Pop!_OS 19.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2019-08-13                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version date       lib source                           
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.2)                   
##  backports     1.1.4   2019-04-10 [1] CRAN (R 3.5.2)                   
##  blogdown      0.14.1  2019-08-11 [1] Github (rstudio/blogdown@be4e91c)
##  bookdown      0.12    2019-07-11 [1] CRAN (R 3.5.2)                   
##  callr         3.3.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.2)                   
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.1)                   
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.1)                   
##  devtools      2.1.0   2019-07-06 [1] CRAN (R 3.5.2)                   
##  digest        0.6.20  2019-07-04 [1] CRAN (R 3.5.2)                   
##  dplyr       * 0.8.3   2019-07-04 [1] CRAN (R 3.5.2)                   
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.5.2)                   
##  forcats     * 0.4.0   2019-02-17 [1] CRAN (R 3.5.1)                   
##  fs            1.3.1   2019-05-06 [1] CRAN (R 3.5.2)                   
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.5.2)                   
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.1)                   
##  jsonlite      1.6     2018-12-07 [1] CRAN (R 3.5.1)                   
##  knitr         1.24    2019-08-08 [1] CRAN (R 3.5.2)                   
##  lattice       0.20-38 2018-11-04 [1] CRAN (R 3.5.1)                   
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.1)                   
##  Matrix        1.2-17  2019-03-22 [1] CRAN (R 3.5.2)                   
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.1)                   
##  pillar        1.4.2   2019-06-29 [1] CRAN (R 3.5.2)                   
##  pkgbuild      1.0.4   2019-08-05 [1] CRAN (R 3.5.2)                   
##  pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.1)                   
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)                   
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.1)                   
##  processx      3.4.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.1)                   
##  purrr         0.3.2   2019-03-15 [1] CRAN (R 3.5.2)                   
##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.1)                   
##  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.5.2)                   
##  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.5.2)                   
##  reticulate    1.13    2019-07-24 [1] CRAN (R 3.5.2)                   
##  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.5.2)                   
##  rmarkdown     1.14    2019-07-12 [1] CRAN (R 3.5.2)                   
##  roperators  * 1.1.0   2018-09-28 [1] CRAN (R 3.5.1)                   
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.1)                   
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)                   
##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.5.2)                   
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.1)                   
##  testthat      2.2.1   2019-07-25 [1] CRAN (R 3.5.2)                   
##  tibble        2.1.3   2019-06-06 [1] CRAN (R 3.5.2)                   
##  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.1)                   
##  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.5.2)                   
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.1)                   
##  xfun          0.8     2019-06-25 [1] CRAN (R 3.5.2)                   
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)                   
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

Constricted development with reticulate

website@jcarroll.com.au (Jonathan Carroll) — Wed, 04 Apr 2018 23:38:05 +0000

I’ve been using the reticulate package occasionally for a while now, so I was surprised to see that it had only just been officially released.

reticulate: R interface to Python https://t.co/qVWmwoMQAP. Comprehensive set of interoperability tools including R Markdown Python engine #rstats #pydata pic.twitter.com/SuWM6Y3Pk0
— RStudio (@rstudio) March 26, 2018

It’s a brilliant piece of work, allowing python and R to coexist in the same workflow.

Another opportunity came up today to use it so I thought it might be nice to do a very quick blog post to show just how easy it is to take external python code and have it callable directly from R. In this case, [@coolbutuseless](https://twitter.com/coolbutuseless) posed a challenge on Twitter to write a fast ‘needle in a haystack’ search of a small vector inside a larger one. I looked over the existing candidates and figured some sort of Sieve of Eratosthenes-esque algorithm might have a chance (though the name eluded me entirely at the time).

My proposal was to search for the first digit using which(), and use this reduced vector of possible-matches in additional tests on the remaining parts of the ‘needle’. [@coolbutuseless](https://twitter.com/coolbutuseless) refactored my attempt allowing for arbitrary length needles and found it to do quite well against the current offerings. What he still wanted though was a Boyer–Moore string search algorithm implementation. This is apparently what GNU grep uses, so it’s probably pretty okay.
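The sieve idea described above can be sketched in a few lines of base R. This is an illustration under my own assumptions rather than the code from the Twitter thread; find_needle() and the example vectors are hypothetical names.

```r
# Hypothetical helper: positions where `needle` starts inside `haystack`
find_needle <- function(needle, haystack) {
  n <- length(needle)
  # Candidate start positions: wherever the first element of the needle occurs
  candidates <- which(haystack == needle[1])
  # Drop candidates too close to the end to fit the whole needle
  candidates <- candidates[candidates + n - 1 <= length(haystack)]
  # Sieve: keep only the candidates that also match each later element
  for (i in seq_len(n)[-1]) {
    candidates <- candidates[haystack[candidates + i - 1] == needle[i]]
  }
  candidates
}

haystack <- c(5, 1, 2, 3, 9, 1, 2, 4, 1, 2, 3)
find_needle(c(1, 2, 3), haystack)
## [1] 2 9
```

Each pass only inspects the surviving candidates, so the work shrinks quickly when the first element of the needle is rare in the haystack.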

That algorithm is pretty clever about how it goes about the search, starting in a similar way to what I did (the sieve approach was apparently the leading string match method prior to Boyer-Moore). It’s much more complicated though, so I wasn’t about to write one of those myself in R. Nowadays, people think of C/C++ when there’s functionality they want to grab from elsewhere. There’s a C implementation on the Wikipedia site, so that seems like a nice place to start. I saved the text to a new boyermoore.c file and ran

R CMD SHLIB boyermoore.c

from a terminal to compile it into boyermoore.so. This could then be loaded into R with dyn.load("boyermoore.so") and in theory called with .C("boyer_moore", <something>, <something>). I tried a couple of <something>s (which weren’t pointers) and promptly crashed RStudio.

The python implementation is also listed on Wikipedia, so I figured that’s another route to try. I saved the text to a new boyermoor.py file (also embedded below) and set about loading the functions from R. This is actually much simpler than for C:

library(reticulate)
bm <- py_run_file("boyermoor.py")

This executes the python file and creates a new named list with each exported python function as an element. How easy is that!?! Calling the function would be as easy as

bm$string_search(needle, haystack)

Not quite that easy of course… The implementation assumes that both the ‘needle’ and the ‘haystack’ are text, not numbers. To solve this, I converted my numbers (in the range 0 to 12) to letters using the built-in LETTERS vector. While it worked as expected, a benchmark test showed that it was nowhere near as fast as my R approach. I can’t say this is due to the algorithm itself, which should be fairly fast; it probably has more to do with the fact that I’m using two different languages.
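As a rough sketch of the number-to-letter conversion, assuming values between 0 and 25: to_text() and the example vectors are hypothetical, and base R’s fixed-string search stands in for the python function so the snippet runs without python.

```r
# Hypothetical helper: map integers 0..25 onto "A".."Z" and collapse to one string
to_text <- function(x) paste(LETTERS[x + 1], collapse = "")

needle   <- c(3, 0, 12)
haystack <- c(7, 3, 0, 12, 5, 3, 0, 11)

to_text(needle)
## [1] "DAM"
to_text(haystack)
## [1] "HDAMFDAL"

# 1-based start position of the first match, using a plain fixed-string search
as.integer(regexpr(to_text(needle), to_text(haystack), fixed = TRUE))
## [1] 2
```

The + 1 is needed because LETTERS is indexed from 1 while the values start at 0.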

The entire call from R looks pretty neat and tidy

despite a lot of python code in the background

I’d certainly recommend having reticulate in your arsenal next time you need to attack a problem using python from within R. There’s a whole heap of useful ways to interact between R and python with this including importing python modules and calling python scripts, etc…

As a side-note: keep an eye on the ergo project, which aims to connect the Go language just as easily.

devtools::session_info()

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Pop!_OS 19.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2019-08-13                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version date       lib source                           
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.2)                   
##  backports     1.1.4   2019-04-10 [1] CRAN (R 3.5.2)                   
##  blogdown      0.14.1  2019-08-11 [1] Github (rstudio/blogdown@be4e91c)
##  bookdown      0.12    2019-07-11 [1] CRAN (R 3.5.2)                   
##  callr         3.3.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.2)                   
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.1)                   
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.1)                   
##  devtools      2.1.0   2019-07-06 [1] CRAN (R 3.5.2)                   
##  digest        0.6.20  2019-07-04 [1] CRAN (R 3.5.2)                   
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.5.2)                   
##  fs            1.3.1   2019-05-06 [1] CRAN (R 3.5.2)                   
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.5.2)                   
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.1)                   
##  jsonlite      1.6     2018-12-07 [1] CRAN (R 3.5.1)                   
##  knitr         1.24    2019-08-08 [1] CRAN (R 3.5.2)                   
##  lattice       0.20-38 2018-11-04 [1] CRAN (R 3.5.1)                   
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.1)                   
##  Matrix        1.2-17  2019-03-22 [1] CRAN (R 3.5.2)                   
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.1)                   
##  pkgbuild      1.0.4   2019-08-05 [1] CRAN (R 3.5.2)                   
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)                   
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.1)                   
##  processx      3.4.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.1)                   
##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.1)                   
##  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.5.2)                   
##  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.5.2)                   
##  reticulate  * 1.13    2019-07-24 [1] CRAN (R 3.5.2)                   
##  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.5.2)                   
##  rmarkdown     1.14    2019-07-12 [1] CRAN (R 3.5.2)                   
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.1)                   
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)                   
##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.5.2)                   
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.1)                   
##  testthat      2.2.1   2019-07-25 [1] CRAN (R 3.5.2)                   
##  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.5.2)                   
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.1)                   
##  xfun          0.8     2019-06-25 [1] CRAN (R 3.5.2)                   
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)                   
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

JC and the Vignettes

website@jcarroll.com.au (Jonathan Carroll) — Tue, 06 Mar 2018 15:55:31 +0000

If that’s not a great 1960’s band name then I don’t know what is (hint: I don’t know what is).

At the start of the year I set out my ‘goals for 2018’ just like many of us do; an overly ambitious list of things we’d like to do to better ourselves. My list includes improving my French to better interact with a colleague I cohabitate and work with while onsite (my 48 day streak on Duolingo was interrupted by travel… c’est un bon début); reading more books (two a month, so far so good); writing more blog posts (one a month, this one included); interacting more with the R community; and using a bullet journal (all of these are currently tracked in said bullet journal).

My original plan for the increased interaction was to pick an R package a month. I’d pick a package which didn’t already have a vignette, learn the package, write a vignette, submit it as a PR, and blog about the experience. This seemed straightforward enough. There’s a long-standing feeling that too many R packages lack vignettes (note: https://juliasilge.com/blog/mining-cran-description/ – an analysis I intend to reproduce/update). I looked through my backlog of interesting packages I meant to look at more closely and checked to see if they already had vignettes… all of them did (womp womp).

For those not familiar, vignettes in R packages are long-form documentation. Not just a listing of each function, but a good solid walkthrough of background, a use-case, examples, motivations, pitfalls, comparisons, performance metrics, and so on. Function documentation rarely provides sufficient detail like this, so vignettes are a convenient way to include some longer discussions about your package. The problem is that people either neglect to, forget to, or aren’t aware that they can (and should!) write vignettes for their packages.

Rather than admit defeat and throw another resolution on the ever-growing pile of failures (I’m looking your way, dusty calligraphy set) I decided to take a different approach. I sent out an offer on Twitter: suggest a package which needs a vignette. It seemed to be popular enough

The old argument was that #rstats packages generally lack vignettes. Someone scraped CRAN and found ‘most’ (lots of?) packages don't have them.

With my book now with the copyeditors, I finally have some time to get/give back to this awesome community…
— Jonathan Carroll (@carroll_jono) February 7, 2018

So, my offer: point me towards a new(ish) package on GitHub that (a) does something cool, and (b) doesn't have a vignette. I'll learn the package inside-out, write a vignette, submit it as a PR, and blog about it. Your package, someone else's which you use, I don't mind…
— Jonathan Carroll (@carroll_jono) February 7, 2018

Thus began the ‘Volunteer Vignettes’ program. I got to work on the first one almost immediately, and doing so has already uncovered bugs, inconsistencies, and insights to the author (I do plan to start a conversation with original authors before beginning any actual work). I’ll be writing each one up once I’m ‘done’ with it, sharing the insights discovered along the way, plus some new ideas about how vignettes might evolve.

If you’re new to vignettes, at this point you may be asking “How does one go about making one? What tools are required? How do I include one in my package?”, and I’m glad you asked. Over the next few months I’ll be blogging about vignettes; how they’re currently used, how they might be more useful, and how we might be able to get people to use them more. I’m also scheduled to present the eventual conclusion of this project at useR! 2018, so I’d better get it done!

For now, stay tuned!

P.S. for those interested in a very old-school jam:

Thou shalt not compare numeric values (except when it works)

website@jcarroll.com.au (Jonathan Carroll) — Mon, 04 Sep 2017 15:02:07 +0000

This was just going to be a few Tweets but it ended up being a bit of a rollercoaster of learning for me, and I haven’t blogged in far too long, so I’m writing it up quickly as a ‘hey look at that’ example for newcomers.

I’ve been working on the ‘merging data’ part of my book and, as I do when I’m writing this stuff, I had a play around with some examples to see if there was anything funky going on if a reader was to try something slightly different. I’ve been using dplyr for the examples after being thoroughly convinced on Twitter to do so. It’s going well. Mostly.

## if you haven't already done so, load dplyr
library(dplyr)

My example involved joining together two tibbles containing text values. Nothing too surprising. I wondered though; do numbers behave the way I expect? Now, a big rule in programming is ‘thou shalt not compare numbers’, and it holds especially true when numbers aren’t exactly integers. This is because representing non-integers is hard, and what you see on the screen isn’t always what the computer sees internally.

Thou shalt not compare numbers

If I had a tibble where the column I would use to join had integers

dataA <- tribble(
    ~X, ~Y,
    0L, 100L,
    1L, 101L,
    2L, 102L,
    3L, 103L
)
dataA

## # A tibble: 4 x 2
##       X     Y
##   <int> <int>
## 1     0   100
## 2     1   101
## 3     2   102
## 4     3   103

and another tibble with numeric in that column

dataB <- tribble(
    ~X, ~Z,
    0, 1000L,
    1, 1001L,
    2, 1002L,
    3, 1003L
)
dataB

## # A tibble: 4 x 2
##       X     Z
##   <dbl> <int>
## 1     0  1000
## 2     1  1001
## 3     2  1002
## 4     3  1003

would they still join?

full_join(dataA, dataB)

## Joining, by = "X"

## # A tibble: 4 x 3
##       X     Y     Z
##   <dbl> <int> <int>
## 1     0   100  1000
## 2     1   101  1001
## 3     2   102  1002
## 4     3   103  1003

Okay, sure. R treats these as close enough to join. I mean, maybe it shouldn’t, but we’ll work with what we have. R doesn’t always think these are equal

identical(0L, 0)

## [1] FALSE

identical(2L, 2)

## [1] FALSE

though sometimes it does

0L == 0

## [1] TRUE

2L == 2

## [1] TRUE

(== coerces types before comparing). Well, what if one of these just ‘looks like’ the other value (that is, it can be coerced to the same thing)?

dataC <- tribble(
    ~X, ~Z,
    "0", 100L,
    "1", 101L,
    "2", 102L,
    "3", 103L
)
dataC

## # A tibble: 4 x 2
##   X         Z
##   <chr> <int>
## 1 0       100
## 2 1       101
## 3 2       102
## 4 3       103

full_join(dataA, dataC)

## Joining, by = "X"

## Error: Can't join on 'X' x 'X' because of incompatible types (character / integer)

That’s probably wise. Of course, R is perfectly happy with things like

"2":"5"

## [1] 2 3 4 5

and == thinks that’s fine

"0" == 0L

## [1] TRUE

"2" == 2L

## [1] TRUE

but who am I to argue?
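A short sketch of what == is doing with mixed types: when one side is character, the other side is converted to character before comparing, so only strings that match the number’s default text form compare equal.

```r
# When one side is character, `==` converts the other side to character first
as.character(2L)
## [1] "2"
"2" == 2L                 # TRUE, because this is really "2" == "2"

# Strings that merely 'mean' the same number do not match
"2.0" == 2                # FALSE: as.character(2) is "2", not "2.0"
"02" == 2                 # FALSE for the same reason

# Converting explicitly makes the intent clear
as.numeric("2.0") == 2    # TRUE
```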

Anyway, how far apart can those integers and numerics be before they aren’t able to be joined? What if we shift the ‘numeric in name only’ values away from the integers just a teensy bit? .Machine$double.eps is the built-in value for the smallest positive number x for which 1 + x != 1 (loosely, ‘the tiniest difference R can see next to 1’). On this system it’s 2.220446 × 10^{-16}.

dataBeps <- tribble(
    ~X, ~Z,
    0 + .Machine$double.eps, 1000L,
    1 + .Machine$double.eps, 1001L,
    2 + .Machine$double.eps, 1002L,
    3 + .Machine$double.eps, 1003L
)
dataBeps

## # A tibble: 4 x 2
##          X     Z
##      <dbl> <int>
## 1 2.22e-16  1000
## 2 1.00e+ 0  1001
## 3 2.00e+ 0  1002
## 4 3.00e+ 0  1003

Well, that’s… weirder. After a full_join(dataA, dataBeps), the values offset from 2 and 3 joined fine, but the 0 and 1 each got multiple copies since R thinks they’re different. What if we offset a little further?

dataB2eps <- tribble(
    ~X, ~Z,
    0 + 2*.Machine$double.eps, 1000L,
    1 + 2*.Machine$double.eps, 1001L,
    2 + 2*.Machine$double.eps, 1002L,
    3 + 2*.Machine$double.eps, 1003L
)
dataB2eps

## # A tibble: 4 x 2
##          X     Z
##      <dbl> <int>
## 1 4.44e-16  1000
## 2 1.00e+ 0  1001
## 3 2.00e+ 0  1002
## 4 3.00e+ 0  1003

Joining dataA with dataB2eps gives what I’d expect. So, what’s going on? Why does R think those numbers are the same? Let’s check with a minimal example: for each of the values 0:4, let’s compare that integer with the same value offset by .Machine$double.eps

library(purrr) ## for the 'thou shalt not for-loop' crowd
map_lgl(0:4, ~ as.integer(.x) == as.integer(.x) + .Machine$double.eps)

## [1] FALSE FALSE  TRUE  TRUE  TRUE

And there we have it. Some sort of relative difference tolerance maybe? In any case, the general rule to live by is to never compare floats. Add this to the list of reasons why.
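The pattern above matches how double-precision numbers are spaced rather than any tolerance applied by ==. The gap between neighbouring doubles doubles at each power of two, so an offset that registers next to 1 is rounded away next to 2. A minimal base R sketch, assuming standard IEEE 754 doubles:

```r
eps <- .Machine$double.eps

# eps is defined relative to 1: it is the gap between 1 and the next double
1 + eps == 1        # FALSE
1 + eps / 2 == 1    # TRUE: too small to register next to 1

# From 2 up to 4 the gap between neighbouring doubles is 2 * eps,
# so adding a single eps rounds straight back to the starting value
(0:4 + eps) == 0:4
## [1] FALSE FALSE  TRUE  TRUE  TRUE

# An offset of 2 * eps is large enough to survive for 0 through 3
(0:3 + 2 * eps) == 0:3
## [1] FALSE FALSE FALSE FALSE
```

In other words, the ‘relative difference’ guess is roughly right, but the relativity lives in the number format itself.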

For what it’s worth, I’m sure this is hardly a surprising detail to the dplyr team. They’ve dealt with things like this for a long time and I’m sure it was much worse before those changes.

Update: As noted in the comments, R does have a way to check if things are ‘nearly equal’ (within some specified tolerance) via all.equal()

purrr::map_lgl(0:4, ~all.equal(.x, .x + .Machine$double.eps))

## [1] TRUE TRUE TRUE TRUE TRUE

However, this does require the user to either specify the exact tolerance under which they consider two numbers ‘equal’, or to use the default (which, judging by the source of all.equal.numeric() is sqrt(.Machine$double.eps) or around 1.490116 × 10^{-8} on this system). This means that numbers can be ‘quite’ different (depending on what’s an important difference) and still considered equal

purrr::map_lgl(0:4, ~ all.equal(.x, .x + 1e-8))

## [1] TRUE TRUE TRUE TRUE TRUE
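One detail worth sketching: all.equal() never returns FALSE. When the values differ by more than the tolerance it returns a character description instead, so it is usually wrapped in isTRUE(). The tolerances below are arbitrary values chosen for illustration.

```r
# Outside the tolerance, all.equal() returns a description, not FALSE
res <- all.equal(1, 1 + 1e-6)
is.character(res)                 # TRUE
isTRUE(res)                       # FALSE

# An explicit tolerance documents what 'equal' means for the problem at hand
isTRUE(all.equal(1, 1 + 1e-6, tolerance = 1e-5))    # TRUE
isTRUE(all.equal(1, 1 + 1e-8, tolerance = 1e-10))   # FALSE
```

The map_lgl() calls above only work because every comparison happened to fall inside the tolerance; a single mismatch would hand map_lgl() a character string and raise an error.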

devtools::session_info()

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Pop!_OS 19.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2019-08-13                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version date       lib source                           
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.2)                   
##  backports     1.1.4   2019-04-10 [1] CRAN (R 3.5.2)                   
##  blogdown      0.14.1  2019-08-11 [1] Github (rstudio/blogdown@be4e91c)
##  bookdown      0.12    2019-07-11 [1] CRAN (R 3.5.2)                   
##  callr         3.3.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.2)                   
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.1)                   
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.1)                   
##  devtools      2.1.0   2019-07-06 [1] CRAN (R 3.5.2)                   
##  digest        0.6.20  2019-07-04 [1] CRAN (R 3.5.2)                   
##  dplyr       * 0.8.3   2019-07-04 [1] CRAN (R 3.5.2)                   
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.5.2)                   
##  fansi         0.4.0   2018-10-05 [1] CRAN (R 3.5.1)                   
##  fs            1.3.1   2019-05-06 [1] CRAN (R 3.5.2)                   
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.5.2)                   
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.1)                   
##  knitr         1.24    2019-08-08 [1] CRAN (R 3.5.2)                   
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.1)                   
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.1)                   
##  pillar        1.4.2   2019-06-29 [1] CRAN (R 3.5.2)                   
##  pkgbuild      1.0.4   2019-08-05 [1] CRAN (R 3.5.2)                   
##  pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.1)                   
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)                   
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.1)                   
##  processx      3.4.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.1)                   
##  purrr       * 0.3.2   2019-03-15 [1] CRAN (R 3.5.2)                   
##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.1)                   
##  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.5.2)                   
##  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.5.2)                   
##  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.5.2)                   
##  rmarkdown     1.14    2019-07-12 [1] CRAN (R 3.5.2)                   
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.1)                   
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)                   
##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.5.2)                   
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.1)                   
##  testthat      2.2.1   2019-07-25 [1] CRAN (R 3.5.2)                   
##  tibble      * 2.1.3   2019-06-06 [1] CRAN (R 3.5.2)                   
##  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.1)                   
##  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.5.2)                   
##  utf8          1.1.4   2018-05-24 [1] CRAN (R 3.5.1)                   
##  vctrs         0.2.0   2019-07-05 [1] CRAN (R 3.5.2)                   
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.1)                   
##  xfun          0.8     2019-06-25 [1] CRAN (R 3.5.2)                   
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)                   
##  zeallot       0.1.0   2018-01-28 [1] CRAN (R 3.5.2)                   
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

Data Munging With R Preview - Storing Values (Assigning)

website@jcarroll.com.au (Jonathan Carroll) — Mon, 26 Jun 2017 23:10:03 +0000

[Update] The title of this book has since been changed to Beyond Spreadsheets with R.

Since about October last year, I’ve been writing an introduction to R book. It’s been quite the experience. I’ve finally started making time to document some of the interesting things I’ve learned (about R, about writing, about how to bring those two together) along the way.

The book is aimed at proper beginners; people with absolutely no formal coding experience. This tends to mean people coming from Excel who need to do more than a spreadsheet can/should.

I'm writing an R book for real beginners (ppl with 0 code XP) via @ManningBooks! What tripped you up most when you first learned R? Pls RT!
— Jonathan Carroll (@carroll_jono) September 27, 2016

Most of the books I’ve looked at which claim to teach programming begin with some strong assumptions about the reader already knowing how to program, and teach the specific syntax of some language. That’s no good if this is your first language, so I’m working towards teaching the concepts, the language, and the syntax (warts and all).

The book is currently available under the Manning Early Access Program (MEAP) which means if you buy it you get the draft of the first three chapters right now. If you find something you still don’t understand, or you don’t like how I’ve written some/all of it, then jump onto the forum and let me know. I make more edits and write more chapters, and you get updated versions. Lather, rinse, repeat until the final version and you get a polished book which (if I’m any good) contains what you want it to.

I’m genuinely interested in getting this right; I want to help people learn R. I contribute a bit of time on Stack Overflow answering people’s questions, and it’s very common to see questions that shouldn’t need asking. I don’t blame the user for not knowing something (a different answer for not searching, perhaps), but I can help make the resource they need.

To show that I really want people to contribute, here’s a discount code to sweeten the deal: use mlcarroll for 50% off here.

Chapter 1 is a free download, so please check that out too! At the moment the MEAP covers the first three chapters, but the following four aren’t too far behind.

I’ll document some of the behind-the-scenes process shortly, but for now here’s an excerpt from chapter 2:

2.2. Storing Values (Assigning)

In order to do something with our data, we will need to tell R what to call it, so that we can refer to it in our code. In programming in general, we typically have variables (things that may vary) and values (our data). We’ve already seen that different data values can have different types, but we haven’t told R to store any of them yet. Next, we’ll create some variables to store our data values.

2.2.1. Data (Variables)

If we have the values 4 and 8 and we want to do something with them, we can use the values literally (say, add them together as 4 + 8). You may be familiar with this if you frequently use Excel; data values are stored in cells (groups of which you can opt to name) and you tell the program which values you wish to combine in some calculation by selecting the cells with the mouse or keyboard. Alternatively, you can opt to refer to cells by their grid reference (e.g. A1). Similarly to this second method, we can store values in variables (things that may vary) and abstract away the values. In R, assigning of values to variables takes the following form

variable <- value

The assignment operator <- can be thought of as storing the value/thing on the right hand side into the name/thing on the left hand side. For example, try typing x <- 4 into the R **Console** then press Enter:

Figure 2.1. The variable x has been assigned the value 4.

You could just as easily use the equals sign to achieve this; x = 4 but I recommend you use <- for this for reasons that will become clear later. You’ll notice that the **Environment** tab of the **Workspace** pane now lists x under **Values** and shows the number 4 next to it, as shown in Fig 2.2.

Figure 2.2. The variable x has been assigned the value 4.

What happened behind the scenes was that when we pressed Enter, R took the entire expression that we entered (x <- 4) and evaluated it. Since we told R to assign the value 4 to the variable x, R converted the value 4 to binary and placed that in the computer’s memory. R then gives us a reference to that place in the computer’s memory and labels it x. A diagram of this process is shown in Fig 2.3. Nothing else appeared in the **Console** because the action of assigning a value doesn’t return anything (we’ll cover this more in our section on functions).

Figure 2.3. Assigning a value to a variable. The value entered is converted to binary, then stored in memory, the reference to which is labelled by the variable.

This is overly simplified, of course. Technically speaking, in R, names have objects rather than the other way around. This means that R can be quite memory efficient since it doesn’t create a copy of anything it doesn’t need to.

Caution: On “hidden” variables

Variables which begin with a period (e.g. .length) are considered hidden and do not appear in the **Environment** tab of the **Workspace**. They otherwise behave exactly as any other variable; they can be printed and manipulated. An example of one of these is the .Last.value variable, which exists from the moment you load up R (with the value TRUE) - this contains the output value of the last statement executed (handy if you forgot to assign it to something). There are very few reasons you’ll want to use this feature (dot-prefixed hidden variables) on purpose at the moment, so for now, avoid creating variable names with this pattern. The exception to the hidden nature of these is again the .Last.value variable which you can request to be visible in the **Environment** tab via Tools ▸ Global Options… ▸ General ▸ Show .Last.value in environment listing.
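To see the hidden behaviour for yourself, here is a small sketch; .secret is a made-up name used only for illustration.

```r
x <- 4
.secret <- 42

ls()                     # lists x, but not .secret
ls(all.names = TRUE)     # lists .secret as well

# Hidden variables still behave like any other
.secret + 1
## [1] 43
```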

We can retrieve the value assigned to the variable x by asking R to print the value of x

print(x = x)

## [1] 4

for which we have a useful shortcut - if your entire expression is just a variable, R will assume you mean to print() it, so

x

## [1] 4

works just the same.

Now, about that [1]: it’s important to know that in R, there’s no such thing as a single value; every value is actually a vector of values (we’ll cover these properly in the next chapter, but think of these as collections of values of the same type).^[1]

Whenever R prints a value it allows for the case where the value contains more than one number. To make this easier on the eye, it labels the first value appearing on the line by its position in the collection. For collections (vectors) with just a single value, this might appear strange, but this makes more sense once our variables contain more values

# Print the column names of the mtcars dataset
names(x = mtcars)

##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

We can assign another value to another variable

y <- 8

There are few restrictions on what we can name our variables, but R will complain if you try to break them. Variable names should start with a letter, not a number; trying to create the variable 2b will generate an error. Names can also start with a dot (.) as long as it’s not immediately followed by a number, although you may wish to avoid doing so. The rest of the name can consist of letters (upper and lower case), numbers, dots (.), and underscores (_), but no other punctuation or symbols.
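If you would like to test a name without running anything, one possible sketch is to ask R whether the text parses. The helper is_valid_code() is hypothetical, written just for this illustration; make.names() is a base R function that repairs invalid names.

```r
# Hypothetical helper: TRUE if the text is syntactically valid R (nothing is run)
is_valid_code <- function(txt) {
  tryCatch({
    parse(text = txt)
    TRUE
  }, error = function(e) FALSE)
}

is_valid_code("b2 <- 1")         # TRUE: starts with a letter
is_valid_code("2b <- 1")         # FALSE: starts with a number
is_valid_code(".b2 <- 1")        # TRUE: a dot followed by a letter
is_valid_code(".2b <- 1")        # FALSE: a dot followed by a number
is_valid_code("my_var.1 <- 1")   # TRUE: . and _ are allowed inside a name
is_valid_code("my var <- 1")     # FALSE: spaces are not allowed

# make.names() shows how R would repair an invalid name
make.names(c("2b", "my var"))
## [1] "X2b"    "my.var"
```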

There are also certain reserved words that you can’t name variables as. Some are reserved for built-in functions or keywords

if, else, repeat, while, function, for, in, next, and break.

Others are reserved for particular values

TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, and NA_character_.

We’ll come back to what each of these means, but for now you just need to know that you can’t create a variable with one of those names.
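The naming rules and reserved words above can be checked without breaking your session. This is only a sketch: can_assign() is a hypothetical helper written for this example (it isn't part of R) which tries an assignment in a throwaway environment and reports whether it worked.

```r
# Hypothetical helper: TRUE if the assignment in `code` works, FALSE if
# R refuses it (either as a syntax error or as an invalid assignment)
can_assign <- function(code) {
  tryCatch({
    eval(parse(text = code), envir = new.env())
    TRUE
  }, error = function(e) FALSE)
}

can_assign("b2 <- 1")        # starts with a letter: allowed
can_assign("2b <- 1")        # starts with a number: not allowed
can_assign(".b <- 1")        # dot followed by a letter: allowed
can_assign(".2b <- 1")       # dot followed by a number: not allowed
can_assign("my_var.2 <- 1")  # dots and underscores inside a name: allowed
can_assign("if <- 1")        # reserved keyword: not allowed
can_assign("TRUE <- 1")      # reserved value: not allowed
```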

Caution: On overwriting names

What you can do however, which you may wish to take care with, is overwrite the in-built names of variables and functions. By default, the value pi is available (π = 3.141593).

If you were translating an equation into code, and wanted to enter the value p_i you might accidentally call it pi and in doing so change the default value, causing all sorts of trouble when you next go to use it or call a function you’ve written which expects it to still be the default.

The default value can still be accessed by specifying the package in which it is defined, separated by two colons (::). In the case of pi, this is the base package.

# Re-defining `pi` to be equal to exactly `3`
pi <- 3L
# The default, correct value is still available.
base::pi

## [1] 3.141593

This is also an issue for functions, with the same solution; specify the package in which it is defined to use that definition. We’ll return to this in a section on ‘scope’.
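Here's a minimal sketch of the same situation for a function, using mean() purely as an example; the replacement function is deliberately silly.

```r
# Deliberately (and unwisely) re-use the name of a built-in function
mean <- function(x) "not the mean you were expecting"

# Our new definition is found first
mean(c(1, 2, 3))

# Specifying the package still reaches the original definition
base::mean(c(1, 2, 3))

# Removing our version makes `mean` refer to the built-in function again
rm(mean)
mean(c(1, 2, 3))
```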

We’ll cover how to do things to our variables in more detail in the next section, but for now let’s see what happens if we add our variables x and y in the same way as we did for our regular numbers

x + y

## [1] 12

which is what we got when we added these numbers explicitly. Note that since our expression produces just a number (no assignment), the value is printed. We’ll cover how to add and subtract values in more depth in our section on basic mathematics.

R has no problems with overwriting these values, and it doesn’t mind what data you overwrite these with.^[2]

y <- 'banana'
y

## [1] "banana"

R is case-sensitive, which means that it treats a and A as distinct names. You can have a variable named myVariable and another named MYvariable and another named myVARIABLE and R will hold the value assigned to each independently.
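A quick sketch of this, using the same three names:

```r
myVariable <- 1
MYvariable <- 2
myVARIABLE <- 3

# Three distinct variables, each holding its own value
c(myVariable, MYvariable, myVARIABLE)

# A differently capitalised name which was never assigned does not exist
exists("MyVariable")
```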

On variable names:

There are only two hard things in Computer Science: cache invalidation and naming things.

— Phil Karlton
Principal Curmudgeon Netscape Communications Corporation

I said earlier that R won’t keep track of your units so it’s a good idea to name your variables in a way that makes logical sense, is meaningful, and will help you remember what it represents. Variables x and y are fine for playing around with values, but aren’t particularly meaningful if your data represents speeds, where you may want to use something like speed_kmph for speeds in kilometers per hour. Underscores (_) are allowed in variable names, but whether or not you use them is up to you. Some programmers prefer to name variables in this way (sometimes referred to as ‘snake_case’), others prefer ‘CamelCase’. The use of periods (dots, .) to separate words is discouraged for reasons beyond the scope of this book.^[3]

Important: Naming things

Be careful when naming your variables. Make them meaningful and concise. Six months from now, will you remember what data_17 corresponds to? Tomorrow, will you remember that newdata was updated twice?

2.2.2. Unchanging Data

If you’re familiar with working with data in a spreadsheet program (such as Excel), you may expect your variables to behave in a way that they won’t. Automatic recalculation is a very useful feature of spreadsheet programs, but it’s not how R behaves.

If we assign our two variables, then add them, we can save that result to another variable

a <- 4
b <- 8
sum_of_a_and_b <- a + b

This has the value we expect

print(x = sum_of_a_and_b)

## [1] 12

Now, if we change one of these values

b <- 2

this has no impact on the value of the variable we created to hold the sum earlier

print(x = sum_of_a_and_b)

## [1] 12

Once the sum was calculated, and that value stored in a variable, the connection to the original values was lost. This makes things reliable because you know for sure what value a variable will have at any point in your calculation by following the steps that lead to it, whereas a spreadsheet depends much more on its current overall state.
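If you do want the stored sum to reflect the new value of b, the calculation has to be run again. A short sketch of the whole sequence:

```r
a <- 4
b <- 8
sum_of_a_and_b <- a + b   # calculated from the values of a and b right now

b <- 2                    # changing b does not touch the stored result
sum_of_a_and_b

sum_of_a_and_b <- a + b   # re-run the calculation to pick up the new b
sum_of_a_and_b
```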

2.2.3. Assignment Operators (<- vs =)

If you’ve read some R code already, you’ve possibly seen that both <- and = are used to assign values to objects, and this tends to cause some confusion. Technically, R will accept either when assigning variables, so in that respect it comes down to a matter of style (I still highly recommend assigning with <-). The big difference comes when using functions that take arguments - there you should only use = to specify the value of the argument. For example, when we inspected the mtcars data, we could specify a string with which to indent the output

str(object = mtcars, indent.str = '>>  ')

## 'data.frame':    32 obs. of  11 variables:
## >>  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## >>  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
## >>  $ disp: num  160 160 108 258 360 ...
## >>  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
## >>  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## >>  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
## >>  $ qsec: num  16.5 17 18.6 19.4 17 ...
## >>  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
## >>  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
## >>  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
## >>  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

If we had used <- instead of = for either argument, then R would treat that as creating a new variable object or indent.str with value mtcars or '>> ' respectively, which isn’t what we want.
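The same effect is easier to see with a smaller example. This sketch uses paste() and its collapse argument rather than str(), simply because the difference shows up directly in the result:

```r
# With `=`, we set the `collapse` argument of paste()
paste(c("a", "b"), collapse = "-")

# With `<-`, we instead create a variable called `collapse`, and its value
# is passed along as just another unnamed piece of text to paste together
paste(c("a", "b"), collapse <- "-")

# The accidental variable now exists in our workspace
collapse
```

Both calls run without complaint, which is exactly why this mistake can be hard to spot.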

Examples:

score <- 4.8
score

## [1] 4.8

str(object = score)

##  num 4.8

fruit <- 'banana'
fruit

## [1] "banana"

str(object = fruit)

##  chr "banana"

Note that we didn’t need to tell R that one of these was a number and one was a string; it figured that out itself. It’s good practice (and easier to read) to make your <- line up vertically when defining several variables:

first_name <- 'John'
last_name  <- 'Smith'
top_points <- 23

but only if this can be achieved without adding too many spaces (exactly how many is too many is up to you).

Caution: Watch this space!

An extra space can make a big difference to the syntax. Compare:

a <- 3

with

a < - 3

## [1] FALSE

In the first case we assigned the value 3 to the variable a (which returns nothing). In the second case, with a wayward space, we compared a to the value -3 which returns FALSE (I’ll explain why that works at all, later).

Now that we know how to provide some data to R, what if we want to explicitly tell R that our data should be of a specific type, or we want to convert our data to a different type? That’s an article for another day.

If you’re interested in seeing more, and hopefully providing feedback on what you do/don’t like about it, then use the discount code mlcarroll here for 50% off and get reading!

1. In technical terms, R has no scalar types.

2. This is where the distinction of weakly typed becomes important - in a strongly typed language you would not be able to arbitrarily change the type of a variable.

3. This syntax is already used within R to denote functions acting on a specific class, such as print.Date().

Images as x-axis labels (updated)

website@jcarroll.com.au (Jonathan Carroll) — Fri, 03 Jun 2016 08:18:26 +0000

They say “if you want to find an answer on the internet, just present a wrong one as fact. Then wait.”

It didn’t take long, actually. Despite my searches while trying to get images into x-axis labels it seems I overlooked a working, significantly less hacky implementation. My Google-fu had in fact let me down.

Baptiste Auguié ([@tpab](https://twitter.com/tpab) / [@baptiste](https://github.com/baptiste)) had this working a while ago (seemingly before the ggplot2 update that broke other methods), and in a definitively less hacky way. I’ve added a new gist (if you’re reading this on R-bloggers, the gist isn’t embedded, so either follow the link or view on my site) which implements it on the same graph as earlier, and I like this significantly more.

This method gets around the element_text() validation and updates the grobs in a way that’s above my pay grade/understanding of ggplot2 internals, and is a much more consistent way to go about it. This also:

places the factor labels on the graph along with the picture, covering some concerns about people not knowing which maps are for which country,
leaves room for the caption to go back in, which I wanted,
automatically scales the grob better,
doesn’t involve creating an external grob and thus turning off clipping; using axis.text.x is exactly what I was hoping for.

Updated version using @baptiste’s method; much better.

My version worked (sort of) but only because it used options that were bad practice (not doubting that for a moment). I’d like to see this method make it into ggplot2 properly; Baptiste had an open GitHub issue involving it a while ago but it has since been closed, presumably without the feature being incorporated.

I started the previous post by saying how awesome open-source software is (e.g. R). You know what else is awesome? The #rstats community. Thank you to every one of you.

Images as x-axis labels

website@jcarroll.com.au (Jonathan Carroll) — Thu, 02 Jun 2016 22:42:31 +0000

Open-source software is awesome. If I found that a piece of closed-source software was missing a feature that I wanted, well, bad luck. I probably couldn’t even tell if it was actually missing or if I just didn’t know about it. When the source is available, maintained, and documented however, things get fun. We can identify, and perhaps fill, gaps.

I’ve thought for a couple of projects which had bar-graphs that it would be neat to have the categories labelled by an icon or a picture. Say, the logo for a company or an illustrative example. Sure, you could fire up GIMP/Inkscape and manually insert them over the top of the text labels (each and every time you re-produce the graph… no thanks) but that’s not how I operate.

There are probably very few cases for which this is technically a good idea (trying to be a featured author on JunkCharts might very well be one of those reasons). Nonetheless, there are at least a couple of requests for this floating around on stackoverflow; here and here for example. I struggled to find any satisfactory solutions that were in current working order (though perhaps my Google-fu has failed me).

The second link there has a working example, but the big update to ggplot2 breaks that pretty strongly; opts was deprecated and now element_text() has a gatekeeper validation routine that prevents any such messing around. The first link however takes a different route. I couldn’t get that one to work either, but in any case the answer is a year out of date (updates in ggplot2 can easily have broken the gTree relations), not particularly flexible, and relies on saving intermediate image files for PostScriptTrace to read back in which I’m not a fan of (and couldn’t get to work anyway).

I decided that I perhaps had enough ammunition to hack something together myself (emphasis on hack), and sure enough it seems to have worked (for a limited definition of “worked” with no attached or implied guarantees whatsoever).

GDP per capita with flags for x-axis labels. This was harder to make than it seemed, but I’ve since added a little more flexibility to it.

The way to go about making your own is as follows;

Stop and carefully re-evaluate the choices that you’ve made to bring you to this decision. Are you sure? Okay…
Save the images (in the correct factor order) into a list (e.g. pics).
Build your bar graph with categorical x-axis as per normal, using theme() to remove the labels. Save as an object (e.g. g).
Source the function from this gist (at your own risk… copy and paste if you prefer):

devtools::source_gist("1d1bdb00a7b3910d62bf3eec8a77b4a7")

Call (or pipe your ggplot object to) the function:

g %>% add_images_as_xlabels(pics)

## or

add_images_as_xlabels(g, pics)

Your image will be re-drawn with your pictures labelling the categories.

Here’s an example of the code used to generate the GDP per capita image, featuring some fairly brief (for what it does) rvest scraping (to reiterate; I don’t want to have to do any of this by hand, so let’s code it up!).

At least a few caveats surround what I did manage to get working, including but not limited to:

I’m not sure how to put the x-axis title back in at the right position without padding it with a lot of linebreaks ("\n\n\n\nX-AXIS TITLE").
I’m not sure how to move the caption line from labs() (assuming you’re using the development version of ggplot2 on GitHub with @hrbrmstr’s excellent annotation additions) so it potentially gets drawn over.
The spacing below the graph is currently arbitrarily set to a few lines more than necessary, but it’s a compromise in having an arbitrary number of images loaded at their correct sizes.
Similarly, I’ve just expanded the plot range of the original graph by a seemingly okay amount which has worked for the few examples I’ve tried.
Using a graph like this places the onus of domain knowledge onto the reader; if you don’t know what those flags refer to then this graph is less useful than one with the countries labelled with words. Prettier though.

I’ve no doubt that there must be a better way to do this, but it’s beyond my understanding of how ggproto works, and I can’t seem to bypass element_text’s requirements with what I do know. If you would like to help develop this into something more robust then I’m most interested. Given that it’s a single function I wasn’t going to create a package just for this, but I’m willing to help incorporate it into someone’s existing package. Hit the comments or ping me on Twitter ([@carroll_jono](https://twitter.com/carroll_jono))!

From a (set.)seed grows a mighty dataset

website@jcarroll.com.au (Jonathan Carroll) — Mon, 30 May 2016 21:35:26 +0000

Can you predict the output from this code?

printStr <- function(str) paste(str, collapse="")

set.seed(12173423); x <- sample(LETTERS, 5, replace=TRUE)
set.seed(7723132);y <- sample(LETTERS, 5, replace=TRUE)

paste(printStr(x), printStr(y))

Okay, the first bit is straightforward; it’s a function that collapses a vector of strings into a single string. The next two lines appear to provide a random integer to the set.seed function then sample the pool of LETTERS 5 times with replacement. The last line uses the function from the first line to collapse each of those samples of letters and combine them into one string. Easy enough. Looks like it will produce a random string. Give it a try… go on, the seeds should make it reproducible.

[1] "HELLO WORLD"

Whoa! What are the odds of that!?! Of all the possible letters we could have sampled, we get that!

Okay, yes, it’s rigged. Pretty neat choice of values for set.seed there, right? I came across the Java variant of this via StackOverflow’s ‘Hot Network Questions’ sidebar (a rabbit-hole equal in depth to a Wikipedia wiki-hole). The seeds just happen to be ones that when sampled 5 times with replacement produce the right values to extract those letters in order. That seems simple enough until you want to find them.

Update (2019): With R 3.6 the random number generator (RNG) has been updated to avoid a particular bug, the result of which is that this entire process will be invalid for that R version. This will still work as advertised in versions prior to 3.6, but the same seed will produce different strings in 3.6 and above.

The possible combinations of 5 letters, chosen with replacement, from the pool of 26 is $26^5$ which is a lot, but not insanely many. I work with multi-million row datasets frequently enough. So, we could just run a loop over all integers (set.seed rounds to nearest integer anyway; refer to ?set.seed), set the seed to that value, and save the sampled letters. The first combination will be

set.seed(1L)
sample(LETTERS, 5, replace=TRUE)

## [1] "G" "J" "O" "X" "F"

So, we write a loop and iterate over the seed, saving the outputs. But wait, you may wonder, what’s to guarantee that we don’t get the same sample twice? Nothing. It’s a random sample starting from a different seed every time; there’s no control over the results after the fact. A quick check confirms this; here’s a duplicate of the first record appearing at seed 3415066L

set.seed(1L);sample1 <- sample(LETTERS, 5, replace=TRUE)
set.seed(3415066L); sample2 <- sample(LETTERS, 5, replace=TRUE)
identical(sample1, sample2)

## [1] TRUE

So, set.seed(1L) produces the same 5 letter sample as set.seed(3415066L). There are definitely duplicates of other combinations between those two too. Okay, so we’re not going to be limited to $26^5$. How many though? Who knows? What’s the distribution of duplication? Without knowing how many we need to try for, we can take the upper limit and go for that; on my machine I get

.Machine$integer.max

## [1] 2147483647

which is certainly a bigger number, but not out of the realm of possibility.
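Before scaling up, the brute-force idea can be sketched on a much smaller problem. Here find_seed() is a hypothetical helper written just for this illustration, and the target is a 2-letter string so that the search finishes almost instantly. The seed it finds depends on your R version (see the update note above), so I haven't listed an expected value.

```r
# Hypothetical helper: try seeds 1, 2, 3, ... until sampling from LETTERS
# reproduces the letters of `target`; returns NA if no seed up to
# `max_seed` works
find_seed <- function(target, max_seed = 1e5L) {
  letters_wanted <- strsplit(target, "")[[1]]
  n_letters <- length(letters_wanted)
  for (seed in seq_len(max_seed)) {
    set.seed(seed)
    candidate <- sample(LETTERS, n_letters, replace = TRUE)
    if (identical(candidate, letters_wanted)) {
      return(seed)
    }
  }
  NA_integer_
}

seed_hi <- find_seed("HI")

# Confirm that the seed found really does regenerate the target
set.seed(seed_hi)
paste(sample(LETTERS, 2, replace = TRUE), collapse = "")
```

With only $26^2 = 676$ possible outcomes a match turns up quickly; the 5-letter case is the same loop with far more seeds to get through, which is where the parallelisation comes in.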

To make life easier, we can split the problem up. It’s “embarrassingly parallel” (each iteration is completely independent of the others) so it’s perfect for parallelisation. If you haven’t read Drew Schmidt a.k.a wrathematics’ semi-NSFW guide to Parallelism, R, and OpenMP then stop reading this and go read that.

You’re back, great. Let’s assume for now that you too have access to a big, fast computer and want to parallelise the loop over all positive integers. If you’re lucky, it’s as easy as

library(parallel)
cl <-
  makeCluster(7) ## 8-core machine, leave one out to remain stable
clusterApply(cl, seq(1, (.Machine$integer.max - 1), 1e7), function(x) {
  ## one row per seed in this chunk of 1e7 seeds; keep words as strings
  wordvec <- data.frame(word = character(1e7L), seed = integer(1e7L),
                        stringsAsFactors = FALSE)
  for (iloop in 1:(1e7)) {
    iseed <- x + iloop - 1
    if (abs(iseed) < .Machine$integer.max) {
      set.seed(iseed)
      wordvec[iloop, "word"] <-
        paste0(LETTERS[sample(26, 5, replace = TRUE)], collapse = "")
      wordvec[iloop, "seed"] <- iseed ## store the seed alongside its word
    }
  }
  write.csv(wordvec, file = paste0("seeded_words_", as.character(x), ".csv"))
})

but life’s not that easy. This is slow as a week of Mondays. For starters, updating the data.frame this many times will probably exhaust your RAM. This was run on a machine with 32GB available, and it got full, fast. Writing out large .csv files is slow, and given that each of them has ten million rows, the 215 files aren’t particularly small; there are a lot of duplicate entries.

We can make this better with a few adjustments;

library(parallel)
cl <-
  makeCluster(7) ## 8-core machine, leave one out to remain stable
clusterApply(cl, seq(1, (.Machine$integer.max - 1), 1e7), function(x) {
  library(data.table)
  wordvec <- data.table(word = character(1e7L), seed = integer(1e7L))
  for (iloop in 1:(1e7)) {
    iseed <- x + iloop - 1
    if (abs(iseed) < .Machine$integer.max) {
      set.seed(iseed)
      set(
        wordvec,
        i = iloop,
        j = "word",
        value = paste0(LETTERS[sample(26, 5, replace = TRUE)], collapse = "")
      )
      set(wordvec,
          i = iloop,
          j = "seed",
          value = iseed)
    }
  }
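  unique_wordvec <- unique(wordvec, by = "word")
  ## note: save() writes R's .RData format (despite the .rds extension),
  ## so these files need to be read back in with load(), not readRDS()
  save(unique_wordvec,
       file = paste0("seeded_words_", as.character(x), ".rds"))
})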

Using data.table means that the set() operations update the table by reference (in place, without copying it) and this alone speeds up the loops. Removing duplicates using unique (which has a method for data.table) and saving as a compressed binary .rds file makes this a little less bulky. All in all, this can be completed in a few hours on a decent enough machine. I did try using feather for the saving of files and my early tests using smaller subsets showed it to be amazingly fast. Unfortunately there are still some bugs to be ironed out of that package for large files/lots of rows, and my 215 files ended up small, but unreadable.
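If you don't have data.table to hand, the same "don't repeatedly modify a data.frame" idea can be sketched in base R. To be clear, this is a different technique to the set() calls above, not what I actually ran: it fills pre-allocated plain vectors and only builds the data.frame once at the end, on a deliberately tiny range of seeds.

```r
n_seeds <- 1000L  # tiny range for illustration; the real run used chunks of 1e7

# Pre-allocate plain vectors; updating these in a loop is cheap
words <- character(n_seeds)
seeds <- integer(n_seeds)

for (i in seq_len(n_seeds)) {
  set.seed(i)
  words[i] <- paste0(sample(LETTERS, 5, replace = TRUE), collapse = "")
  seeds[i] <- i
}

# Build the data.frame once, then keep the first seed seen for each word
wordvec <- data.frame(word = words, seed = seeds, stringsAsFactors = FALSE)
unique_wordvec <- wordvec[!duplicated(wordvec$word), ]
nrow(unique_wordvec)
```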

Given that there are only $26^5 = 11881376$ combinations that we’re looking for, depending on how often duplicates come up, we probably don’t need all the results. I’ll save you the trouble and let you know that the loop only needs to go up to, at most, 113449118. Reading all of the files back in and merging them again requires some careful consideration. R isn’t too keen on creating objects larger than 2GB, so we can’t really just merge 113449118 lines of data. Taking it step-wise, I managed to get it to work

library(data.table)
library(dplyr)
load("seeded_words_1.rds") ## load the first file
## objects were saved as unique_wordvec so save
## to a new name to avoid overwriting
bigdf <- unique_wordvec
rm(unique_wordvec) ## then drop the saved version

allfiles <- list.files(pattern = "01.rds")

## files were saved as 'seeded_words_X.rds' where x was steps of 1e7
## sorting alphabetically gives the wrong order
for (file in allfiles[order(as.numeric(sub("\\.rds", "", sub(
  "[a-z_]*", "", allfiles
))))]) {
  cat(paste0("** Processing file ", file, "\n")) ## show notifications on the screen
  load(file)
  bigdf <-
    unique(data.table(bind_rows(bigdf, unique_wordvec)), by = "word") ## drop duplicates as we go
  rm(unique_wordvec)
  if (nrow(bigdf) >= 26 ^ 5) {
    cat(paste0("** Processing OUTPUT file.\n"))
    save(bigdf, file = "all_seeded_words.rds")
    stop()
  }
}

This results in a 75MB .rds of all unique combinations of 5 letters sampled with replacement, and the seed that generates them. Not particularly share-able or convenient. We’re mainly interested in actual words. We can filter this list down to English words if we can think of some way to do that (with the associated assumptions and limitations that may bring). Here’s one R option:

library(ScrabbleScore)
words <- bigdf[is.twl06.word(bigdf$word), ]

This filters the generated 5-letter words against a Scrabble Official Tournament and Club Word List which is as close as I can be bothered getting to ‘English’ words. What’s left is a list of 8938 5-letter words with their associated generating seeds. Sure enough, filtering the twl06 wordlist down to the 5-letter words gives exactly that many; we’ve generated all the 5-letter words in that data set. Cool. What were we hoping to do with it? Oh, right.

print(words["HELLO"])
#> 1: HELLO 12173423
print(words["WORLD"])
#> 1: WORLD 7723132

There we go, the seeds used in the original question for this post. If we wanted, we could write other words or phrases in this way.

set.seed(5360994); x <- sample(LETTERS, 5, replace=TRUE)
set.seed(21732771); y <- sample(LETTERS, 5, replace=TRUE)

paste(printStr(x), printStr(y))

## [1] "STATS RULES"

We might be interested in what the distribution of unique, English words looks like. Here you go;

library(ggplot2)
gg <- ggplot(words, aes(x=seq_along(seed), y=seed))
gg <- gg + geom_point(alpha=0.6, col="steelblue1", pch=20, size=2)
gg <- gg + theme_bw()
gg <- gg + labs(title="Seed that generates unique, English words",
                subtitle="Filtered as valid Scrabble TW06 words",
                caption="https://en.wikipedia.org/wiki/Official_Tournament_and_Club_Word_List",
                x="Index",
                y="Seed")
gg

I’ve converted that using the excellent plotly::ggplotly() function so you can mouseover each point to see the corresponding word.

Fairly uniform looking in that plot. How about the density?

library(ggplot2)
gv <- ggplot(words, aes(x=factor(0), y=seed))
gv <- gv + geom_violin(fill="steelblue1")
gv <- gv + theme_bw()
gv <- gv + theme(axis.title.x=element_blank(),
                 axis.text.x=element_blank(),
                 axis.ticks.x=element_blank())
gv <- gv + labs(title="Violin plot of seed that generates unique, English words",
                subtitle="Filtered as valid Scrabble TW06 words",
                caption="https://en.wikipedia.org/wiki/Official_Tournament_and_Club_Word_List",
                y="Seed")
gv

which tapers off nicely towards the larger seeds, as more and more of the words produced are duplicates.

Finally, what about the distribution by starting letter?

Unsurprisingly; not many words starting with “X” (13) and lots starting with “S” (1084). The last word produced (the one with the largest unique seed before we run out of unique words) is “HUTCH” at 113449118.

Can we do anything else with this? The first thing that comes to mind is using this to encode a message. This method is reminiscent of a hash function; it takes some data and via a 1-way mechanism, produces an encoded message. Of course, the 1-way nature of this takes a word and encodes it to an integer that can’t be easily predicted, so hopefully your message is all integers. Many reasons make this a bad idea to actually use for this purpose. The first being that in the world of digital security, if you’re thinking of rolling your own, you’re setting yourself up for trouble. Much smarter people than you or I have spent a lot of time getting digital security right, and it still isn’t.

As for actual technical issues, the obvious one is that it can be brute-forced (as we just showed) easily. I produced the list of all 5-letter combinations produced from all possible integers in a few hours. Modern GPU processing can perform many millions of these calculations per second. Another technical fault of this would be that collisions are all too easy, as demonstrated by our duplicates. A good hash function should make it impractical to find a second input that produces the same hashed value. MD5 has this flaw. If you were to try and use this encoding to validate a message (say, the integer represents a checksum of the message contents, encoded as a difficult to predict word) then it would be far too easy for a malicious entity to produce the same word from their own message padded out with junk data.
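To see just how cheap collisions are, here's a sketch on a scaled-down version of the problem. It uses 3-letter samples and one more seed than there are possible 3-letter strings, so at least one collision is guaranteed by the pigeonhole principle. word_from_seed() is a hypothetical helper for this example, and which seeds collide will depend on your R version.

```r
# Hypothetical helper: the string produced by sampling letters for a given seed
word_from_seed <- function(seed, n_letters = 3) {
  set.seed(seed)
  paste0(sample(LETTERS, n_letters, replace = TRUE), collapse = "")
}

n_possible <- 26^3                 # 17576 distinct 3-letter strings
seeds <- seq_len(n_possible + 1)   # one more seed than possible strings
words <- vapply(seeds, word_from_seed, character(1))

# With more seeds than possible outputs, at least two seeds must share a word
first_dup <- anyDuplicated(words)  # position of the first repeated word
colliding_seeds <- seeds[words == words[first_dup]]
colliding_seeds
```

In practice the first repeat shows up long before the guaranteed limit, which is the same behaviour we saw with the duplicates in the 5-letter results.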

So, not very useful for encryption/hashing (not that it should be). I don’t really have a useful application for this apart from the riddle at the start of this post, but it’s been an interesting journey through optimisation, parallelisation, filtering, and file input/output. I’d say that has made it worthwhile enough.

The data file of valid Scrabble words can be downloaded here if you’d like it. I’ll gladly provide the full 5-letter list on request.

I’m not a data-security expert so any and all of my advice there is liable to be rubbish. Do you know a better way to generate this data, or an aspect I haven’t considered? Hit the comments and let me know.

Title image: CC-BY U.S. Department of Agriculture

devtools::session_info()

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Pop!_OS 19.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2019-08-13                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version date       lib source                           
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.2)                   
##  backports     1.1.4   2019-04-10 [1] CRAN (R 3.5.2)                   
##  blogdown      0.14.1  2019-08-11 [1] Github (rstudio/blogdown@be4e91c)
##  bookdown      0.12    2019-07-11 [1] CRAN (R 3.5.2)                   
##  callr         3.3.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.2)                   
##  colorspace    1.4-1   2019-03-18 [1] CRAN (R 3.5.2)                   
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.1)                   
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.1)                   
##  devtools      2.1.0   2019-07-06 [1] CRAN (R 3.5.2)                   
##  digest        0.6.20  2019-07-04 [1] CRAN (R 3.5.2)                   
##  dplyr         0.8.3   2019-07-04 [1] CRAN (R 3.5.2)                   
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.5.2)                   
##  fs            1.3.1   2019-05-06 [1] CRAN (R 3.5.2)                   
##  ggplot2     * 3.2.1   2019-08-10 [1] CRAN (R 3.5.2)                   
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.5.2)                   
##  gtable        0.3.0   2019-03-25 [1] CRAN (R 3.5.2)                   
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.1)                   
##  knitr         1.24    2019-08-08 [1] CRAN (R 3.5.2)                   
##  lazyeval      0.2.2   2019-03-15 [1] CRAN (R 3.5.2)                   
##  magrittr    * 1.5     2014-11-22 [1] CRAN (R 3.5.1)                   
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.1)                   
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 3.5.1)                   
##  pillar        1.4.2   2019-06-29 [1] CRAN (R 3.5.2)                   
##  pkgbuild      1.0.4   2019-08-05 [1] CRAN (R 3.5.2)                   
##  pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.1)                   
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)                   
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.1)                   
##  processx      3.4.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.1)                   
##  purrr         0.3.2   2019-03-15 [1] CRAN (R 3.5.2)                   
##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.1)                   
##  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.5.2)                   
##  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.5.2)                   
##  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.5.2)                   
##  rmarkdown     1.14    2019-07-12 [1] CRAN (R 3.5.2)                   
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.1)                   
##  scales        1.0.0   2018-08-09 [1] CRAN (R 3.5.1)                   
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)                   
##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.5.2)                   
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.1)                   
##  testthat      2.2.1   2019-07-25 [1] CRAN (R 3.5.2)                   
##  tibble        2.1.3   2019-06-06 [1] CRAN (R 3.5.2)                   
##  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.1)                   
##  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.5.2)                   
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.1)                   
##  xfun          0.8     2019-06-25 [1] CRAN (R 3.5.2)                   
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)                   
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

Bad Neighbours (no, not the movie)

website@jcarroll.com.au (Jonathan Carroll) — Sat, 30 Apr 2016 01:06:29 +0000

Another day, another compulsion to see if I can do any better than someone’s solution.

This one also comes from the FiveThirtyEight Puzzler challenge, courtesy of Xi’an.

The original challenge this time was

The misanthropes are coming. Suppose there is a row of some number, N, of houses in a new, initially empty development. Misanthropes are moving into the development one at a time and selecting a house at random from those that have nobody in them and nobody living next door. They keep on coming until no acceptable houses remain. At most, one out of two houses will be occupied; at least one out of three houses will be. But what’s the expected fraction of occupied houses as the development gets larger, that is, as N goes to infinity?

which seems straightforward enough. Xi’an has a nice writeup of the analytical solution (which looks pretty well thought out) but that’s not what caught my attention. The (probably not intentionally provocative) statements

A riddle from The Riddler where brute-force simulation does not pay:
Hence this limits the realisation of simulation to, say, N=10⁴

however, are like a red flag to a bull for me. The code provided for Xi’an’s solution isn’t optimised, and doesn’t take advantage of some potential speed-ups. 10,000 iterations seems like it should be quick. There’s also a typo in the microbenchmark code there; time should be times, otherwise it’s treated as one more (trivial) expression to benchmark rather than as the number of repetitions. Anyway, improvements to the code can be made.

I took a slightly different approach; I assigned a vector of ‘houses’ being either occupied or available, identified as such by a boolean (TRUE/FALSE). For the purposes of this question, available means that there is a) no occupant; b) no occupant on either side. The function I ended up with was

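misanthropist <- function(N) {
  
  occupied   <- rep(FALSE, N)  # nobody has moved in yet
  acceptable <- rep(TRUE, N)   # every house starts out acceptable
  
  while(any(acceptable)) {
    # indices of houses that are empty and have empty neighbours
    possible <- .Internal(which(acceptable))
    # pick one at random, remember it as `movedin`, and mark it occupied
    occupied[movedin <- possible[.Internal(sample(length(possible), 1, FALSE, NULL))]] <- TRUE
    # that house and both of its neighbours are no longer acceptable
    acceptable[c(movedin-1, movedin, movedin+1)] <- FALSE
  }
  
  return(mean(occupied))  # fraction of houses that ended up occupied
}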

library(compiler)
misanthropist_c <- cmpfun(misanthropist)

There are a heap of tricks employed here to speed evaluation up, and a few that aren’t because it turns out they didn’t perform better.

the acceptable vector is populated independently of the occupied vector; acceptable = ! occupied seemed like it was a contender but ended up being slower.
any(acceptable) works faster than sum(acceptable)>0 in the while loop, presumably because of short-circuiting (we only need to know that one is TRUE, at which point we don’t need to keep testing).
I’ve used .Internal calls where possible (which, sample); this removes a tiny bit of overhead.
the switching of acceptable to FALSE for the newly occupied house and those on either side can be done in a single step via a c() subsetting. Originally I had coded around the potential issues of trying to set acceptable[0] or acceptable[N+1] when their neighbours moved in, but as it turns out, R doesn’t complain about either: an assignment to index 0 is silently ignored, and an assignment to index N+1 just extends the vector by one (FALSE) element, which never counts as acceptable, so no more checks are needed.
the proportion of occupied houses is easily calculated at the end given that as.integer(TRUE)==1 and as.integer(FALSE)==0, so the mean of the boolean vector is the proportion of TRUE values.
finally, I’ve byte-compiled the function with compiler::cmpfun. Built-in functions in base are already byte-compiled, and this helps just a little bit more.
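The out-of-bounds behaviour relied on in the list above is easy to check on a toy vector. This is a minimal sketch rather than part of the original function; the values of N and movedin are purely illustrative.

```r
# Toy vector standing in for the `acceptable` flags of a 5-house row
N <- 5
acceptable <- rep(TRUE, N)

# Someone moves into house 1: indices 0, 1 and 2 are set to FALSE.
# The zero index is silently dropped by R.
movedin <- 1
acceptable[c(movedin - 1, movedin, movedin + 1)] <- FALSE
length(acceptable)

# Someone moves into house N: index N + 1 extends the vector by one
# element, which is FALSE and so is never picked up by which().
movedin <- N
acceptable[c(movedin - 1, movedin, movedin + 1)] <- FALSE
length(acceptable)
which(acceptable)
```

Since mean() is taken over occupied, which is never indexed beyond N, the extra element in acceptable has no effect on the result.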

Back to the original question; how many iterations can we do? First, let’s compare what we’ve got so far with a reasonable number of iterations

> microbenchmark(frqns(1000L), misanthropist_c(1000L), times=3, unit="ms")
Unit: milliseconds
                   expr         min          lq        mean      median          uq         max neval
           frqns(1000L) 3600.981381 3601.460353 3655.512618 3601.939325 3682.778237 3763.617149     3
 misanthropist_c(1000L)    3.447858    3.470277    3.512251    3.492697    3.544448    3.596199     3

Uh, yep. That’s, just a little bit faster. Smidgen. 3.5ms/1000 iterations. What about a few more iterations on my optimised function? How about that 10,000 limit?

> microbenchmark(misanthropist_c(10000L), times=3, unit="ms")
Unit: milliseconds
                    expr      min       lq     mean   median       uq      max neval
 misanthropist_c(10000L) 194.0501 194.8379 198.7545 195.6258 201.1066 206.5875     3

Maybe we shouldn’t get too cocky. 10 times as many iterations takes 56 times longer.

> microbenchmark(misanthropist_c(100000L), times=3, unit="ms")
Unit: milliseconds
                     expr      min       lq    mean  median       uq      max neval
 misanthropist_c(100000L) 18260.85 18355.87 18422.4 18450.9 18503.18 18555.47     3

That brings us up to 184ms/1000 iterations. 10 times as many iterations again takes 92 times longer. It’s definitely slowing down.
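One rough way to quantify that slowdown is to fit a straight line to the three median timings on a log-log scale and read the slope off as the scaling exponent. This is only a sketch with three data points, and it is an assumption that any fit used for this post looked like this; the extrapolation at the end gives a ballpark for how long larger runs would take.

```r
# Median timings (ms) copied from the three microbenchmark runs above
timings <- data.frame(
  n  = c(1e3, 1e4, 1e5),
  ms = c(3.492697, 195.6258, 18450.9)
)

# Straight-line fit on the log-log scale; the slope is the scaling exponent
fit <- lm(log10(ms) ~ log10(n), data = timings)
slope <- unname(coef(fit)[2])
slope  # a little under 2

# Extrapolate to larger developments and convert ms to minutes
bigger <- data.frame(n = c(1e6, 2e6, 3e6))
minutes <- 10^predict(fit, newdata = bigger) / 1000 / 60
round(minutes)
```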

On a log-log plot of time against iterations with a slope of 2, it’s clearer that the problem scales as $\mathcal{O}(n^2)$. That suggests that we should be able to complete the 1,000,000 iteration evaluation in about 20 minutes. 2,000,000 iterations in around 1 hour 20 minutes. 3,000,000 in 3 hours. Where am I going with this you ask? Xi’an requested help from stackexchange (a great move which paid off well) to get the analytical solution to the problem. If you check the timestamps, you’ll notice

# asked    Apr 25 at 14:04
# answered Apr 25 at 16:25

So, let’s say that stackexchange was offline when you were impatiently working on a solution, you coded perfectly and knew how to optimise your functions. How close to the right answer can you get in this amount of time (2.5 hours)? We can probably do at most 2,000,000 iterations. Does that reach a close-enough solution? Rather than making my code run for that long, let’s see if Xi’an’s recursive equation gets the same answer (obviously faster).

xian <- function(N) {
  a=b=1
  for (n in 3:N){ C=(1+2*a+(n-1)*b)/n;a=b;b=C}
  return(C/N)
}

xian1e5 <- xian(1e5L)
mine1e5 <- misanthropist_c(1e5L)
format(2*100*(xian1e5 - mine1e5)/(xian1e5 + mine1e5), digits=3)

## [1] "-0.0335"

so off by about 0.03% at that stage (for this particular run; the simulation is random), presumably getting closer with more iterations. Let’s use the recursive equation for this next bit then, knowing that in the above scenario we would be using the full iterative approach. The recursive expression itself can also be optimised. I did re-write it in C (xian_c) to see if that helped, but compiler::cmpfun (as xian_c2) does just as good a job (as one might expect)

> microbenchmark(xian(1e7), xian_c(1e7), xian_c2(1e7), times=5, unit="ms")
Unit: milliseconds
           expr        min         lq       mean     median         uq        max neval
 xian(1e+07)    16881.8007 16920.1492 16935.2306 16931.9014 16963.9973 16978.3044     5
 xian_c(1e+07)    114.8676   115.0287   115.0771   115.0948   115.1507   115.2438     5
 xian_c2(1e+07)   114.8645   114.9419   116.2547   114.9835   117.2379   119.2454     5

so clearly some improvements can be made. This one scales much better with iterations, to the point that I can just max it out

.Machine$integer.max
# 2147483647
format(xian_c(.Machine$integer.max), digits=20)
                  # 0.43233235833753796973
0.5*(1-exp(-2))   # 0.43233235838169364884

So now we have an upper limit on precision. We’ll be able to get at best within 0.0000000102% of the exact answer.
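That percentage can be reproduced from the two numbers printed above. A small sketch, with the recursive result pasted in as a literal rather than recomputed:

```r
# Value returned by the recursion at N = .Machine$integer.max (copied from above)
approx <- 0.43233235833753796973
# Analytic limit
exact <- 0.5 * (1 - exp(-2))

# Relative difference, expressed as a percentage
pct_diff <- 100 * abs(approx - exact) / exact
format(pct_diff, digits = 3)
```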

If I repeatedly use the xian_c function with different numbers of iterations, we can see how well we should expect to do

Is 2e-5% close enough for a couple of hours work?

And there we have it. If we’d been stuck with the non-recursive method and needed to get as close to the right answer as possible in a comparable time to obtaining the analytic solution and coding/running it, we could get pretty darn close. I’d say the brute-force method lives to see another day! … provided you do a bit of optimising and don’t mind worrying about the Halting problem.

Did I miss an important optimisation? Know a better approach? Hit the comments and let me know!

devtools::session_info()

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Pop!_OS 19.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2019-08-13                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version date       lib source                           
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.2)                   
##  backports     1.1.4   2019-04-10 [1] CRAN (R 3.5.2)                   
##  blogdown      0.14.1  2019-08-11 [1] Github (rstudio/blogdown@be4e91c)
##  bookdown      0.12    2019-07-11 [1] CRAN (R 3.5.2)                   
##  callr         3.3.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.2)                   
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.1)                   
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.1)                   
##  devtools      2.1.0   2019-07-06 [1] CRAN (R 3.5.2)                   
##  digest        0.6.20  2019-07-04 [1] CRAN (R 3.5.2)                   
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.5.2)                   
##  fs            1.3.1   2019-05-06 [1] CRAN (R 3.5.2)                   
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.5.2)                   
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.1)                   
##  knitr         1.24    2019-08-08 [1] CRAN (R 3.5.2)                   
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.1)                   
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.1)                   
##  pkgbuild      1.0.4   2019-08-05 [1] CRAN (R 3.5.2)                   
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)                   
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.1)                   
##  processx      3.4.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.1)                   
##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.1)                   
##  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.5.2)                   
##  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.5.2)                   
##  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.5.2)                   
##  rmarkdown     1.14    2019-07-12 [1] CRAN (R 3.5.2)                   
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.1)                   
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)                   
##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.5.2)                   
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.1)                   
##  testthat      2.2.1   2019-07-25 [1] CRAN (R 3.5.2)                   
##  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.5.2)                   
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.1)                   
##  xfun          0.8     2019-06-25 [1] CRAN (R 3.5.2)                   
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)                   
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

Solving Inequality (the math kind)

website@jcarroll.com.au (Jonathan Carroll) — Wed, 27 Apr 2016 22:47:55 +0000

This neat approach showed up recently as an answer to a FiveThirtyEight puzzle and of course I couldn’t help but throw it at dplyr as soon as I could. Turns out that’s not a terrible idea. The question posed is

optimise

200a + 100b + 50c + 25d

under the constraints

400a + 400b + 150c + 50d ≤ 1000, b ≤ a, a ≤ 1, c ≤ 8, d ≤ 4,
and (a,b,c,d) all non-negative integers.

Leaving aside any interpretations of wording of the original question (let’s just start with trying to solve this system of inequalities) the solution provided used 4 nested loops, which can definitely be avoided.

My approach was to create all possible combinations of the 4 variables (within the given constraints), filter out the ones that don’t meet the constraint criteria, then sort by the evaluating expression to find which one does best.

I’m not suggesting that this is by any means always the best approach, but when the phase-space of possible solutions is so low (especially combinations of small integers) then this is pretty tidy (technically a single dplyr chain).
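The steps described above (enumerate, filter, sort) can be sketched in base R as follows. This is not the original dplyr chain, just an illustration of the same idea; the names grid, feasible and best are my own.

```r
# Every combination of the four variables within their individual bounds
grid <- expand.grid(a = 0:1, b = 0:1, c = 0:8, d = 0:4)

# Keep only the combinations that satisfy the remaining constraints
feasible <- subset(grid, 400 * a + 400 * b + 150 * c + 50 * d <= 1000 & b <= a)

# Evaluate the expression being optimised and sort, best first
feasible$value <- with(feasible, 200 * a + 100 * b + 50 * c + 25 * d)
best <- feasible[order(-feasible$value), ][1, ]
best
```

With only 2 x 2 x 9 x 5 = 180 candidate rows the full enumeration is effectively instant.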

Alternatively, one could set this up as an equation and use a linear solver. In that case, we want to optimise $\max(\|A x\|)$ subject to the constraints $G x \le h$ where $A$ and $x$ represent the coefficients and variables to be optimised, $h$ the constraint vector, and $G$ a matrix of coefficients for the constraints. For the system we’re looking at, that matrix inequality looks like this

\[ \left[\begin{array}{cccc}400 & 400 & 150 & 50 \\1 & 0 & 0 & 0 \\0 & 1 & 0 & 0 \\0 & 0 & 1 & 0 \\0 & 0 & 0 & 1\end{array}\right] \left[\begin{array}{c}a \\ b \\ c \\ d\end{array}\right] \le \left[\begin{array}{c}1000 \\ 1 \\ 1 \\ 8 \\ 4\end{array}\right]\ . \]

Of course, the constraint that $b \le a$ needs to be checked after the fact.

Programming this is fairly straightforward, even with the constraint that these are integer solutions. limSolve::linp is made for exactly these types of problems.

which results in the same answer as our manual brute-force search.

One last thing to try is to plot the solution space and see how it looks. Sounds like a good opportunity to try out plotly.

Since this is technically a 5D plot (4 variables and a value) it’s a little difficult to visualise. I’ve reduced the dimensionality by treating each unique combination of $a \le 1$ and $b \le a$ (i.e. $00,~01,~10,~11$) as a group and using colour to distinguish those. The plot below should show up as a 3D object, so click, drag, and scroll it to have a closer look. Clicking on a group will remove/add it so you can get a clearer view, and hovering over a point should bring up the values of the axes and evaluation.

Going back to the expression that’s being optimised it’s pretty clear why it’s broken down into 4 planes when grouped this way (substitute different values of $a$ and $b$ to see).

Do you have another way to solve this? Drop a line or a link in the comments.

#auunconf slack users' timezone locations

website@jcarroll.com.au (Jonathan Carroll) — Thu, 14 Apr 2016 23:04:15 +0000

I had never used slack before, but had read a heap of tech articles extolling its virtues. Apparently this is what our current Prime Minister advocates within Cabinet. The upcoming #auunconf organising team set up a channel and invited the participants, so I checked it out. Slack is pretty awesome as far as a unified workspace/messaging protocol can go. What makes it even more awesome, is that someone (@hrbrmstr, no surprise) has made an R package that talks to it.

After installing/loading the slackr package, obtaining an API key (the usual drill; create an app, request key, save it somewhere and pray you don’t lose it or share it) and saving it in ~/.slackr (so I don’t have to remember to delete it from shared code) it was as simple as calling slackr_users() to get a data.frame of the users and their relevant data. Neat!

The only geographical information in there was the timezone, so I figured I would merge that with a shapefile of such and plot it. Here’s the code I ended up creating

Once I had plotted the map I wished the projection was more Pacific-centered, and looked into making that happen. It appears to be trickier than I wanted to bother with for such a small project, so I ended up abandoning it. I did find a stackoverflow answer that seemed to have all the right ingredients (again, @hrbrmstr at work) but I couldn’t get it to plot in any sort of reasonable time.

Map of #auunconf slack users

The unique users so far claim to come from:

Australia/Brisbane
Australia/Canberra
Asia/Ulaanbaatar
America/Indiana/Indianapolis
Australia/Adelaide
Europe/Amsterdam
Pacific/Auckland

so quite the diverse crowd.

Once all was done and plotted, uploading the image to the slack team was as easy as dev_slackr("#general") which sends the current graphic to the #general channel of the slack team that slackr was configured for. Sure enough, it worked!

It works!

I’m not entirely sure what I’ll use this for, but it was certainly a fun exercise to get working. Perhaps I can generalise it enough to submit a pull-request to make it available in slackr?

Simpler isn't always faster

website@jcarroll.com.au (Jonathan Carroll) — Thu, 14 Apr 2016 21:52:45 +0000

My name is Jonathan, and I have a coding obsession.

I’ll admit it, the Hadleyverse has ruined me. I can no longer read a blog post or stackoverflow question in base R and leave it be. There are improvements to make, and I’m somewhat sure that I know what they are. Most of them involve dplyr. Many involve data.table. Some involve purrr.

This one came up on R-bloggers today (which leads back to MilanoR) and seemed like a good opportunity. The problem raised was; given a list of data.frames, can you create a list of the variables sorted into those data.frames? i.e. can you turn this

df_list_in <- list (
        df_1 = data.frame(x = 1:5, y = 5:1),
        df_2 = data.frame(x = 6:10, y = 10:6),
        df_3 = data.frame(x = 11:15, y = 15:11)
    )

into this

df_list_out <- list (
        df_x = data.frame(x_1 = 1:5, x_2 = 6:10, x_3 = 11:15),
        df_y = data.frame(y_1 = 5:1, y_2 = 10:6, y_3 = 15:11)
)

That looks like a problem I came across recently. Let’s see…

I managed to replace that function – which, while fast, is a little obtuse and difficult to read – with essentially a one-liner

df_list_in %>% purrr::transpose() %>% lapply(as.data.frame)

## $x
##   df_1 df_2 df_3
## 1    1    6   11
## 2    2    7   12
## 3    3    8   13
## 4    4    9   14
## 5    5   10   15
## 
## $y
##   df_1 df_2 df_3
## 1    5   10   15
## 2    4    9   14
## 3    3    8   13
## 4    2    7   12
## 5    1    6   11
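For anyone without purrr to hand, the same reshaping can be sketched in base R. This is an alternative illustration of what the transpose step is doing, not the function benchmarked in this post; vars and out are names I have made up here.

```r
# Same input as above
df_list_in <- list(
  df_1 = data.frame(x = 1:5, y = 5:1),
  df_2 = data.frame(x = 6:10, y = 10:6),
  df_3 = data.frame(x = 11:15, y = 15:11)
)

# For each variable name, pull that column out of every data.frame
# and bind the pieces into a new data.frame
vars <- names(df_list_in[[1]])
out <- lapply(setNames(vars, vars), function(v) {
  as.data.frame(lapply(df_list_in, `[[`, v))
})
out
```

This assumes every data.frame in the list has the same column names as the first one.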

You may now proceed to argue over which is easier/simpler/more accessible/requires less knowledge of additional packages/etc… If you ask me, it’s damn-near perfect as long as you can place a cursor on transpose in RStudio and hit F1 which will bring up the purrr::transpose help menu and explain exactly what is going on. Anyway, how does it compare? Here’s Michy’s graph (formatting updated and my function added)

and then, just for fun (and because I wanted an excuse to try it out) here’s a yarrr::pirateplot of the same data

My one-line function (without the magrittr syntactical sugar) does slightly better than the arrange_col function (on average), but has a lot less up-front code and is more readable (to me at least). The performance of any of these three doesn’t seem like it would have trouble scaling for any practical use-case.

Scaling up the problem to a list of 100 data.frames each with 1000 observations of 50 variables, the same result pans out as shown in the above microbenchmark and pirateplot below

On the giant example (100 data.frames of 1000 observations of 50 variables) the difference is 20ms vs 380ms. Honestly, I don’t know what I’d do with the additional 360ms, but chances are I’d just waste them. I’ll take the efficient code on this one.

Can you do even better than the one-liner? Spot a potential issue? Have I made a mistake? Got comments? You know what to do.

devtools::session_info()

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Pop!_OS 19.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2019-08-13                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version date       lib source                           
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.2)                   
##  backports     1.1.4   2019-04-10 [1] CRAN (R 3.5.2)                   
##  blogdown      0.14.1  2019-08-11 [1] Github (rstudio/blogdown@be4e91c)
##  bookdown      0.12    2019-07-11 [1] CRAN (R 3.5.2)                   
##  callr         3.3.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.2)                   
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.1)                   
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.1)                   
##  devtools      2.1.0   2019-07-06 [1] CRAN (R 3.5.2)                   
##  digest        0.6.20  2019-07-04 [1] CRAN (R 3.5.2)                   
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.5.2)                   
##  fs            1.3.1   2019-05-06 [1] CRAN (R 3.5.2)                   
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.5.2)                   
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.1)                   
##  knitr         1.24    2019-08-08 [1] CRAN (R 3.5.2)                   
##  magrittr    * 1.5     2014-11-22 [1] CRAN (R 3.5.1)                   
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.1)                   
##  pkgbuild      1.0.4   2019-08-05 [1] CRAN (R 3.5.2)                   
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)                   
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.1)                   
##  processx      3.4.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.1)                   
##  purrr         0.3.2   2019-03-15 [1] CRAN (R 3.5.2)                   
##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.1)                   
##  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.5.2)                   
##  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.5.2)                   
##  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.5.2)                   
##  rmarkdown     1.14    2019-07-12 [1] CRAN (R 3.5.2)                   
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.1)                   
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)                   
##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.5.2)                   
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.1)                   
##  testthat      2.2.1   2019-07-25 [1] CRAN (R 3.5.2)                   
##  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.5.2)                   
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.1)                   
##  xfun          0.8     2019-06-25 [1] CRAN (R 3.5.2)                   
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)                   
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

52vis Week 2 Challenge -- Australian Version

website@jcarroll.com.au (Jonathan Carroll) — Tue, 12 Apr 2016 21:18:19 +0000

I mapped out the USA homelessness rate in my last post as a challenge and noted at the end that it would be interesting to do the same for Australia. That was the first comment I received in person, too. “Let’s do it!” I said. What I found may shock you (click-bait title; check).

Most of the code carried over. Of course, lacking hrbrmstr’s neat albersusa equivalent I had to obtain and process the shapefile myself. Thankfully, the ABS have me covered. Here’s the whole script;

For starters, I compared the Australian statistics on the same scale as the USA (median 1.63‰, capped at 3x that value) and was shocked

AUS homeless population, US scale

Yep, it appears we’re worse than the USA for homelessness. That sucks. What if we put it back on our own scale, how do our states do relatively? Well, for starters, the median goes up to 3.5‰ (I’ve again capped at 3x that value) but a lot of that seems to be coming from NT. Looking at the data itself, our lowest value is indeed higher than the USA median, so we’ve nothing to be proud of. That said, some states are doing better than our own median. TAS looks to be nicely below, while SA seems to be sitting around the median.

AUS homeless population, AUS scale

If we drill down to the data itself, we can see what the actual figures look like. I’ve had a go at hrbrmstr’s geom_lollipop (from the dev version of ggalt) and it works nicely, as expected. I’ve left NT off this first graph so that the others stand a fighting chance at the scale.

And here’s what happens if you include the Northern Territory

Ouch. It looks odd, but it’s correct. The number of people in the NT in 2011 was around 231,331 but the census-estimated homeless population was 15,479, which means that 6.7% of the population (i.e. 67‰) were homeless. What? Have I made a mistake? No, it’s just horrifyingly true.
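For the sceptical, the arithmetic behind that figure, using only the two numbers quoted above:

```r
nt_population <- 231331  # NT population, 2011
nt_homeless   <- 15479   # census-estimated homeless population

# Rate per thousand residents (per mille)
nt_rate <- 1000 * nt_homeless / nt_population
round(nt_rate, 1)
```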

Aw, man. I came here for data analysis, not feels. Clearly this is a national shame, and something needs to be done about it.

52vis Week 2 Challenge

website@jcarroll.com.au (Jonathan Carroll) — Sun, 10 Apr 2016 22:01:17 +0000

From Bob Rudis’ blog comes a weekly data/coding challenge. I didn’t quite get the time to tackle last week’s but I thought this one offered up a pretty good opportunity.

Half the challenge is of course data processing/tidying, which is a big part of data science anyway (“75% of data science is getting the data in the right format, the other 50% is doing something with it” in case you haven’t heard the old joke). Needless to say, I’m using R for this one.

In case folks are wondering why I’m doing this, it’s pretty simple. We need a society that has high data literacy and we need folks who are capable of making awesome, truthful data visualizations. The only way to do that is by working with data over, and over, and over, and over again.
Directed projects with some reward are one of the best Pavlovian ways to accomplish that :-)

The data this week is from the U.S. Department of Housing and Urban Development and involves homeless statistics, which I guess is pretty confronting given that I’ve done all this work from the comfort of my warm bed. Back to the topic at hand though. Bob Rudis provided some sample code and a nice facet_wrapped lollipop graph. I’ve also gone the ggplot2 route but I’ve done mine as a choropleth with Bob Rudis’ neat extensions for USA projections. It’s an almost too-obvious choice, so I spruced it up with the gganimate package. The script is here:

The map shows states with the median homeless population per thousand state population as white, with more than that coloured red, less than the median coloured blue (no, it’s not a political map). Each frame shows a different year of data. I think it does an okay job of displaying the changes in this statistic over a few years.

USA Homeless population, scaled by state population, and capped at 3x the national median. White fill represents median values. Grey states didn’t have data for that year.

I’m loving the annotations extensions to ggplot2; they really make these graphs a lot more professional looking. As for interpreting this map, well, that’s perhaps a little trickier. It seems to look like things got a bit worse overall in the earlier years of this data set, but since then they’ve been getting better. The west coast still has a large homeless population, and the central states seem to be a lot better. What’s not obvious from this, and that’s a general failing of non-size-proportional maps, is that some of the smallest states have some of the biggest per mille homelessness rates; D.C. tops out the scale in every year at between 9.3‰ (2007) and 11.7‰ (2014), followed by Oregon and Hawaii who see more than 3x the national median for more than a couple of years.

I’m now somewhat curious to see what the Australian version looks like. Perhaps that’s a topic for the upcoming ROpenSci #auunconf.

As always, comments and suggestions welcome. The full repo of files is available here, for which I’ll be adding a pull-request back into the original repo.

Bring on the ROpenSci #auunconf 2016!

website@jcarroll.com.au (Jonathan Carroll) — Fri, 01 Apr 2016 06:25:01 +0000

I’ll be heading to the 2016 ROpenSci un-conference (hackathon) in Brisbane later this month to smash out a heap of open-science R code. Ideas are already flowing quite nicely, and I’m confident that any ideas we don’t end up officially working on will get their chance in the very near future.

One thing I noticed from the organisers was that coffee won’t be provided in an official sense. As a physicist at heart, that’s strange (scary) yet understandable (physics conferences go through an astounding amount of coffee; we once had a full-time barista on deck). There are supposedly plenty of nearby places to get a good coffee, but where? Time for some R code!

This ended up being a little easier than I first thought thanks to someone already identifying the right Google Places API endpoint and providing an example function. I re-wrote the function to be a bit more general and to suit my needs a little better. After that it’s just a matter of extracting and plotting locations, adding a 2d density, and prettifying the output.

I’m loving hrbrmstr’s annotations additions to ggplot2; I think they really bring R graphics into a professional appearance. I have a feeling that my locations when not at the hackathon itself will correlate well with this density map as I try to find the best local coffee.

Stay tuned for updates on the projects we end up developing. I have a good feeling that they’re going to be somewhat awesome.

Suggestions on the above code most welcome. Also, if you happen to know of a great coffee house near there that isn’t listed, hit the comments section!

Image marginal histograms

website@jcarroll.com.au (Jonathan Carroll) — Sat, 12 Mar 2016 00:37:37 +0000

Another day, another interesting challenge.

I follow Bob Rudis’ (a.k.a. hrbrmstr’s) blog, typically via R-bloggers, and this post caught my eye. Partly because I thought I knew of an existing way to do this. As usual, actually getting that to work took a little longer than I might have hoped, but I think the end result is pretty neat.

His post describes the process of writing an R function to take an image file, for example this one

and producing a histogram along the sides of the number of pixels on a given row/column. This is what he created (a different image to the example, I believe)

Something funny is going on with the right-hand histogram; it doesn’t line up with the image.

Here’s my approach.

It leverages the png package to extract the channels into a matrix, converts those to x,y,z data.frames, takes the median value, plots that with ggplot2, then leverages ggExtra::ggMarginal to add the marginal histograms. Note that the ggExtra package has some bugs (it hasn’t been maintained in a while) in relation to more recent (possibly the dev branch) of ggplot2. I got it working on at least one of my machines. This is my result

I’ve had several uses for these types of marginal plots lately, so hopefully I can sort out the issues I’ve been getting in combination with ggplot2.

Is it crowded in here?

website@jcarroll.com.au (Jonathan Carroll) — Wed, 09 Mar 2016 22:31:13 +0000

This was a neat graphic that someone made. It shows the population at a given latitude or longitude as a bar chart, overlaid on a map of the world itself. It shows where people live; the bigger the bar, the more people living at that latitude/longitude.

“I can do that.” I said. In R of course. So here it is;

I love that such a small amount of code can produce something so interesting. Click the images below to view them in all their full-size glory.

How is this useful? Well… okay, it’s not. It’s pretty. That’s what it is. And a neat exercise in data manipulation and plotting.

UPDATE: As per a comment on my reddit thread, I’ve updated this to include a logarithmic colour-scale for population. The populations follow a nice logit curve if you arrange them in order:

Here’s the updated graphics:

Jackpot!

website@jcarroll.com.au (Jonathan Carroll) — Wed, 13 Jan 2016 15:18:22 +0000

The powerball lottery in the USA has jackpotted to a first prize of $1.3 billion, which is just a silly amount of money.

The cost of an entry (if you happen to be in the USA) is just $2, which is very much a ‘take a gamble’ sort of amount. If you’re an Aussie (except from SA like me) you can still have a go, but it will cost you considerably more ($10.50) and you’ll still have to pay the relevant US taxes if you win.

The following scenario has been raised a few times around the intertubes; if it costs $2 per ticket, the chances of winning (1/number of combinations of the drawn numbers) is 1/292,201,338, and the prize is over a billion dollars – why not buy one of every ticket and guarantee a win?

First, let’s look at the game. There are 69 white balls from which 5 will be drawn. There is also a pool of 26 powerballs from which 1 will be drawn. You need 5/5 + 1 to win the jackpot.

The odds of getting that right, if you recall your combinatorics, is one in

\[ \displaystyle{{69 \choose 5} \cdot {26 \choose 1} = 292,201,338}\ . \]

Doubling that makes buying all the tickets a mere $600 million or so. I’ll get my wallet.
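As a quick sanity check, the ticket count and the total outlay can be reproduced with base R’s choose(). The variable names here are only for illustration.

```r
## number of distinct tickets: 5 white balls from 69, 1 powerball from 26
n_tickets <- choose(69, 5) * choose(26, 1)

## cost of buying one of every ticket at $2 each
total_cost <- 2 * n_tickets

prettyNum(n_tickets, big.mark = ",")
prettyNum(total_cost, big.mark = ",")
```

The first value should match the 292,201,338 combinations quoted above, and the second is the “$600 million or so”.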

This would be easy money if it weren’t for three important facts; first, the cash prize is actually $930 million if you take it right away, so we’re already out of pocket quite a bit. Second, you may need to split the jackpot with one or more people, meaning a significantly lower return, possibly less than you invested. Lastly, you also need to pay tax on the income, which is around 40% on that. Maybe it’s not such a good deal.

If you have one of every ticket however, you win every prize. How much does that get you? Back to combinatorics. To figure out how many combinations there are of each division we need to calculate the number of ways to get the number of correct and incorrect balls comparing our ticket to the draw, and multiply by the value of that prize.

So, for the next best prize (a mere $1 million) we need to have all 5 of the white balls but not the powerball on our ticket. There are 5 winning white balls, and we need all 5 of them. We need to match one of the 25 non-winning powerballs too, so the number of matching combinations is

\[ \displaystyle{{5 \choose 5} \cdot {25 \choose 1} = 25}\ . \]

So, there are 25 ways in which we could do this (get all 5 of the white ball numbers on our ticket, but not the powerball). That means that if we have one of each ticket, 25 of them will be worth a million dollars each.

Continuing this logic the total winnings would be

\[ \begin{array}{lcl} {\rm WINNINGS} &=& 930,000,000\times{5 \choose 5} \cdot {1 \choose 1} \\ &+& 1,000,000\times{5 \choose 5} \cdot {25 \choose 1} \\ &+& 50,000\times{5 \choose 4} \cdot {64 \choose 1} \cdot {1 \choose 1} \\ &+& 100\times{5 \choose 4} \cdot {64 \choose 1} \cdot {25 \choose 1} \\ &+& 100\times{5 \choose 3} \cdot {64 \choose 2} \cdot {1 \choose 1} \\ &+& 7\times{5 \choose 3} \cdot {64 \choose 2} \cdot {25 \choose 1} \\ &+& 7\times{5 \choose 2} \cdot {64 \choose 3} \cdot {1 \choose 1} \\ &+& 4\times{5 \choose 1} \cdot {64 \choose 4} \cdot {1 \choose 1} \\ &+& 4\times{5 \choose 0} \cdot {64 \choose 5} \cdot {1 \choose 1}\end{array}\ . \]

or programmed as

winnings <- 930e6*1 +                                     ## cash prize for jackpot, 1 winner
            1e6*choose(5,5)*choose(25,1)                + ## match 5 out of 5 white, don't match powerball
            5e4*choose(5,4)*choose(69-5,1)*choose(1,1)  + ## match 4 out of 5 white, match powerball
            1e2*choose(5,4)*choose(69-5,1)*choose(25,1) + ## match 4 out of 5 white, don't match powerball
            1e2*choose(5,3)*choose(69-5,2)*choose(1,1)  + ## match 3 out of 5 white, match powerball
            7*choose(5,3)*choose(69-5,2)*choose(25,1)   + ## match 3 out of 5 white, don't match powerball
            7*choose(5,2)*choose(69-5,3)*choose(1,1)    + ## match 2 out of 5 white, match powerball
            4*choose(5,1)*choose(69-5,4)*choose(1,1)    + ## match 1 out of 5 white, match powerball
            4*choose(5,0)*choose(69-5,5)*choose(1,1)      ## match 0 out of 5 white, match powerball
prettyNum(winnings, big.mark=",")

## [1] "1,023,466,048"

So, winning all prizes all by yourself (everyone else who might have won the jackpot lost their tickets) nets you a little over a billion pre-tax dollars on its own. Not bad, but still pretty risky since you’re betting on not sharing.

The big question will be how big does the lottery need to get before this starts to look like a plausible option? The cost of tickets and total number of combinations are constants, so there must be some jackpot prize for which it’s a good bet to buy all the tickets, given that the chances of sharing don’t go up considerably (if you trust the FiveThirtyEight analysis of historical entries);
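One rough way to put numbers on that question, as a sketch only: it assumes a flat 40% tax on all winnings (the “around 40%” mentioned earlier), that the ticket cost is not deductible, that the lower-division prizes are fixed at the values used above, and that the cash jackpot is split evenly between all winners. The function name break_even_jackpot is hypothetical.

```r
n_tickets    <- choose(69, 5) * choose(26, 1)  # one of every ticket
ticket_cost  <- 2 * n_tickets                  # $2 per ticket
lower_prizes <- 1023466048 - 930e6             # every prize except the jackpot

## smallest cash jackpot J for which
##   (J / n_winners + lower_prizes) * (1 - tax) >= ticket_cost
break_even_jackpot <- function(n_winners = 1, tax = 0.4) {
  n_winners * (ticket_cost / (1 - tax) - lower_prizes)
}

## break-even cash jackpot if we are the sole winner, or share with 1 or 2 others
prettyNum(round(break_even_jackpot(n_winners = 1:3)), big.mark = ",")
```

Under these assumptions the required jackpot grows linearly with the number of people sharing it, which is why the chance of sharing matters so much.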

devtools::session_info()

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Pop!_OS 19.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2019-08-13                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version date       lib source                           
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.2)                   
##  backports     1.1.4   2019-04-10 [1] CRAN (R 3.5.2)                   
##  blogdown      0.14.1  2019-08-11 [1] Github (rstudio/blogdown@be4e91c)
##  bookdown      0.12    2019-07-11 [1] CRAN (R 3.5.2)                   
##  callr         3.3.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.2)                   
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.1)                   
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.1)                   
##  devtools      2.1.0   2019-07-06 [1] CRAN (R 3.5.2)                   
##  digest        0.6.20  2019-07-04 [1] CRAN (R 3.5.2)                   
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.5.2)                   
##  fs            1.3.1   2019-05-06 [1] CRAN (R 3.5.2)                   
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.5.2)                   
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.1)                   
##  knitr         1.24    2019-08-08 [1] CRAN (R 3.5.2)                   
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.1)                   
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.1)                   
##  pkgbuild      1.0.4   2019-08-05 [1] CRAN (R 3.5.2)                   
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)                   
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.1)                   
##  processx      3.4.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.1)                   
##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.1)                   
##  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.5.2)                   
##  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.5.2)                   
##  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.5.2)                   
##  rmarkdown     1.14    2019-07-12 [1] CRAN (R 3.5.2)                   
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.1)                   
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)                   
##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.5.2)                   
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.1)                   
##  testthat      2.2.1   2019-07-25 [1] CRAN (R 3.5.2)                   
##  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.5.2)                   
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.1)                   
##  xfun          0.8     2019-06-25 [1] CRAN (R 3.5.2)                   
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)                   
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library

SimplyStats Thanksgiving Puzzle

website@jcarroll.com.au (Jonathan Carroll) — Thu, 26 Nov 2015 09:39:33 +0000

I owe a lot to Jeff Leek and Roger Peng for their great Coursera courses, in which I learned to program in R.

They (along with Rafa Irizarry) run the Simply Statistics blog, which I highly recommend. They posted a Thanksgiving puzzle in which a data.frame needs to be converted from one form to another, spelling out ‘thanksgiving’.

http://simplystatistics.org/2015/11/25/a-thanksgiving-dplyr-rubiks-cube-puzzle-for-you/

The puzzle: convert this

into this

My solution, which uses Rubik’s Cube rotations of rows and columns (and dplyr of course):

Suggestions on how I could have done this differently (or automated solutions) most welcome!

What are the odds?

website@jcarroll.com.au (Jonathan Carroll) — Tue, 10 Mar 2015 08:00:20 +0000

After posting this photo of our lottery ticket to Facebook

I thought more and more about random-event probabilities.

I know my way around numbers just fine, so I know that the odds of winning Division 1 in the South Australian Saturday Night X-Lotto is 1 in 8,145,060 (yes, that’s one of the semi-useless numbers I have memorised).

That’s fairly improbable at face value, sure, but it’s me, so I’m going deeper. Calculating odds of events with limited outcomes can be as easy as multiplying out individual probabilities. For example, the odds of 8 women in a mum’s group all having a second child of the opposite sex to their first is just the product of 1 option from 2 choices, 8 times (multiplied);

\[ \left(\frac{1}{2}\right)^8 = \frac{1}{256}\ . \]

With things like drawing multiple numbered balls from a pool there are complications; it doesn’t matter what order you take them out in, and you need to account for all the possible combinations. The odds are nonetheless fairly easy to calculate if you know/remember your combinatorics: if, from the initial pool of $n$ numbers, we need to choose $r$ without replacement, then the notation for this is “n choose r” and has the formula

\[ \displaystyle{{n \choose r} = \frac{n!}{r!(n-r)!}}\ , \]

where the exclamation mark denotes factorial ($p! = p \times (p-1) \times (p-2) \times \ldots \times 1$).

We are choosing 6 numbers from a pool of 45 without replacement, so the odds of any one exact choice coming up is

\[ \displaystyle{{45 \choose 6} = \frac{45!}{6!\ 39!}}\ . \]

We can make use of a little more math to simplify that;

\[ \frac{(a+b)!}{b!} = (a+b) \times (a+b-1) \times (a+b-2) \times \ldots \times (b+1)\ , \]

so we are left with

\[ \frac{45!}{6!\ 39!} = \frac{(6+39)!}{6!\ 39!} = \frac{45 \times 44 \times 43 \times 42 \times 41 \times 40}{6 \times 5 \times 4 \times 3 \times 2 \times 1} = \frac{5864443200}{720} = 8,145,060\ . \]
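The same number can be checked in R three ways: with choose(), with the full factorial formula, and with the simplified product. The factorial version goes through very large floating point values, so treat it as agreeing only up to rounding.

```r
## 45 choose 6, computed three equivalent ways
choose(45, 6)                                    # built-in binomial coefficient
factorial(45) / (factorial(6) * factorial(39))   # n! / (r! (n-r)!)
prod(45:40) / factorial(6)                       # simplified form from above
```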

How about getting (any) 4 numbers? That’s choosing 4 from a pool of 45;

\[ \displaystyle{{45 \choose 4} = 148,995}\ . \]

Games like Powerball have much worse odds; in that game Division 1 is won by selecting 6 numbers from a pool of 40 without replacement, but then also selecting the Powerball from a new pool of 20;

\[ \displaystyle{{40 \choose 6} \times {20 \choose 1} = 76,767,600}\ . \]
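A short check of these figures in R, which also shows how much harder this Powerball is than the Saturday X-Lotto (the variable names are just for illustration):

```r
xlotto    <- choose(45, 6)                  # Saturday X-Lotto, Division 1
any_four  <- choose(45, 4)                  # choosing 4 from a pool of 45
powerball <- choose(40, 6) * choose(20, 1)  # 6 from 40, then 1 Powerball from 20

prettyNum(c(any_four, powerball), big.mark = ",")

## how many times less likely is a Powerball Division 1 than an X-Lotto one?
powerball / xlotto
```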

The point I was making with the photo was that it seemed somewhat cruel that the two missing numbers from our otherwise winning combination were both off by exactly 10, and thus only two little 1’s stood between us and a decent $2 million. I know perfectly well that any two numbers qualified for those positions, but having the actual digits right there was just frustrating.

I know there are plenty of people who don’t play the lottery because the odds are so bad. The problem I find with that thinking is two-fold; firstly, “you gotta be in it to win it.” A cliché, sure, but valid. If you don’t have a ticket, your chances of winning are precisely zero. Secondly, while the odds of predicting the next random drawing of 6 numbers from a pool of 45 is 1 in 8,145,060, and thus the individual chances of winning are ‘low’, sure enough someone wins more or less every week. A scientific hero of mine, Richard Feynman puts it perfectly;

You know, the most amazing thing happened to me tonight. I was coming here, on the way to the lecture, and I came in through the parking lot. And you won’t believe what happened. I saw a car with the license plate ARW 357. Can you imagine? Of all the millions of license plates in the state, what was the chance I would see that particular one tonight? Amazing!

The point is that every game (line on a ticket) has exactly the same chances of winning as any other. Sure, individually that’s ‘low’ odds, but it’s no lower for me than it is for you. Someone is quite likely going to be life-changingly wealthier each week, just as someone’s car with some licence plate was going to be in the parking lot. It’s easy to make the quote as above after the fact and state the odds of seeing that one, but that’s applying an ex-post analysis to that particular event, confusing the matter somewhat.

There is of course some advantage in having more games available (buying more tickets), though it’s no guarantee. You could buy 8,145,060 tickets and still not win; those calculated odds correspond to an infinite sample (you want an infinite sample simulator? Here you go). If I buy a 20-game ticket then my chances of winning improve by a factor of 20 to 20/8,145,060, or 1 in 407,253. That’s getting pretty reasonable; we’re under the ‘1 in a million’ line now.
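To make the “no guarantee” point concrete, here is a small sketch. It assumes each game is an independent random pick (so combinations can repeat), which is an assumption of this example rather than something stated above.

```r
n_combos <- choose(45, 6)
p_win    <- 1 / n_combos       # chance for a single game

## a 20-game ticket with 20 distinct combinations
p_20 <- 20 * p_win
1 / p_20                       # "1 in ..." odds for the whole ticket

## as many independent random games as there are combinations:
## the chance that not one of them wins
p_none <- (1 - p_win)^n_combos
p_none
exp(-1)                        # the large-n limit of (1 - 1/n)^n
```

Even with that many random games there is still roughly a one-in-three chance of missing Division 1; only buying every distinct combination removes the risk.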

Another way to think about this is that even 1 in 8 million isn’t a terribly low probability in the grand scheme of things. Sure, things with that sort of probability aren’t going to be showing up in your day-to-day life a lot, but there are interesting examples. There’s apparently a 1 in 11 million chance that you’ll die in a plane accident, which is hopefully reassuring. There’s this family who rented a car and travelled around the U.S.A., eventually parking next to another car with a consecutive number plate. The fact that it’s the same model of car isn’t terribly surprising; the plates were probably given in sequence to a dealer. The fact that the identifiable vehicles have been brought back together is the surprising part.

They claim the other was rented from a completely different state to where theirs was. I’ve seen a licence plate consecutive to ours on the road, but here in Adelaide that’s probably not nearly so unlikely. I actually have a similar story; one Easter we went up the coast to Port Broughton, which was, to be honest, quite boring, so we got in the car and drove around to various towns, eventually stopping in for a snorkel at Moonta Bay. We pulled into the car park and I recognised the person getting out of their car in the bay across the road from ours. It was a visiting German post-doc from my research group who had decided to drive up from Adelaide on a whim, and who just happened to arrive at the same place as us at the same time and park a few meters from us. I digress.

What then, were the odds of getting those two ‘off by 10’ numbers on my game line? With 4 of the 6 numbers already correct, the pool was reduced by 4 numbers, and I needed to choose another 2, so the individual chances of any two remaining numbers being drawn were

\[ \displaystyle{{41 \choose 2} = 820}\ . \]

That applies whether we want the chances of specifically 13 and 16 being chosen or any other pair of numbers.

In large physics experiments such as the LHC, statistics play a large role in determining when an event (e.g. observing a Nobel-prize winning particle) is ‘likely’ and when it isn’t. Part of that includes taking into consideration the fact that lots of observations are made, making the distinction between rare events and random fluctuations a fine line, and so a ‘look elsewhere effect’ is included in the calculations. The analogy here is that I would have been just as upset with 23 and 26, or 33 and 36, so we should include those in our calculations of ‘how likely’. This changes our calculations from the above, to choosing any of 6 numbers from the remaining 41, then any of 5 from the remaining 40;

\[ \frac{6}{41} \times \frac{5}{40} = \frac{30}{1640} = \frac{3}{164} \sim 1\ {\rm in}\ 55\ , \]

so I’d say not all that unlikely to happen.
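The two estimates can be put side by side in R; the variable names are hypothetical and the numbers are just the ones from the formulas above.

```r
## one specific pair (e.g. 13 and 16) from the 41 remaining numbers
p_specific <- 1 / choose(41, 2)

## the 'look elsewhere' version: any of 6 numbers, then any of the other 5
p_any <- (6 / 41) * (5 / 40)

1 / p_specific      # "1 in ..." for one exact pair
1 / p_any           # "1 in ..." for any equally annoying pair
p_any / p_specific  # how much more likely the broader event is
```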

This is why patterns show up so frequently from apparently random events; we focus on specific odds of unlikely events but fail to take into account the other equally unlikely combinations. Bumping into a friend from a decade ago might seem unlikely, but you need to multiply those odds by the number of people who would have elicited the same response from you, had you bumped into them. In my earlier anecdote about bumping into a fellow researcher, I would have been just as surprised to see pretty much anyone I knew, so the probability of this ‘event’ is suddenly significantly higher, though possibly still ‘low’. With the car example, this probably happens a lot and you don’t notice it. I certainly don’t read the licence plate of every vehicle I park near. Also, how close would the plates need to be to surprise you? Off by a digit? Just the last digit? A letter? The ‘look elsewhere’ effect really raises this one significantly.

Anyway, after all that, am I still going to buy a ticket sometime this month? Probably.

Project Euler Q5 :: Smallest multiple

website@jcarroll.com.au (Jonathan Carroll) — Thu, 08 Jan 2015 23:25:58 +0000

Explanation. Standard caveat: don’t look here if you are trying to do these yourself.

2520 is the smallest number that can be divided by each of the numbers from 1 to 10 without any remainder. What is the smallest positive number that is evenly divisible by all of the numbers from 1 to 20?

I’m getting the feeling that brute-force is going to be quite the useful tool for these questions. Thankfully R can churn through numbers really fast.

So, we’re after a number divisible by 1, 2, 3, ..., 10. Let’s vectorise that and check the stated answer

all(2520 %% 1:10 == 0)

## [1] TRUE

Easy enough. The solution value must be divisible by 20, so we can just test multiples of 20 for the above property

i <- 20
y <- FALSE
while(!y) {
  i <- i + 20
  y <- all(i %% 1:20 == 0)
}
i

## [1] 232792560

### CORRECT

Wrapping a system.time() call around that assures us that this is still done in under a minute, as per the guidelines

   user  system elapsed 
 26.150   0.000  26.192
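For comparison, a different technique from the brute-force loop above: the answer is the least common multiple of 1 to 20, which can be built up pairwise using Euclid’s algorithm. The helper names gcd and lcm are defined here for illustration; they are not base R functions.

```r
## greatest common divisor via Euclid's algorithm
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

## least common multiple of two numbers
lcm <- function(a, b) a / gcd(a, b) * b

## fold lcm over the whole range
Reduce(lcm, 1:10)
Reduce(lcm, 1:20)
```

This should reproduce 2520 for 1 to 10 and the brute-force answer for 1 to 20, and it only needs a handful of arithmetic operations instead of millions of loop iterations.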

Project Euler Q4 :: Largest palindrome product

website@jcarroll.com.au (Jonathan Carroll) — Thu, 08 Jan 2015 22:24:09 +0000

Explanation. Standard caveat: don’t look here if you are trying to do these yourself.

A palindromic number reads the same both ways. The largest palindrome made from the product of two 2-digit numbers is 9009 = 91 × 99. Find the largest palindrome made from the product of two 3-digit numbers.

This seems like another brute-force question. There aren’t that many numbers to test.

## check the worked solution
91*99

## [1] 9009

I’m not aware of an is.palindrome function, but it’s easy enough to code.

is.palindrome <- function(x) {
   ## convert to character and explode
   x <- unlist(strsplit(as.character(x), ""))
   ## check if the vector is palindromic
   return(identical(x, rev(x)))
}
is.palindrome(9009)

## [1] TRUE

is.palindrome(9001)

## [1] FALSE

Let’s try it out for the two digit example and make sure we’re on the right track. Multiply all two digit numbers together and test them for palindrome-ness, then find the largest of those.

twodigits <- 10:99
prods <- expand.grid(twodigits, twodigits)
prods$prod <- prods[ ,1]*prods[ ,2]
prods.palindromes <- prods$prod[sapply(prods$prod, is.palindrome)]
max(prods.palindromes)

## [1] 9009

Great! What about three digits?

threedigits <- 100:999
prods <- expand.grid(threedigits, threedigits)
prods$prod <- prods[ ,1]*prods[ ,2]
prods.palindromes <- prods$prod[sapply(prods$prod, is.palindrome)]
largest <- max(prods.palindromes)
largest

## [1] 906609

### CORRECT

Takes a little longer, and generates a nice little 10MB, 810,000 row data.frame along the way.

format(object.size(prods), units="Mb")

## [1] "9.4 Mb"

The two three digit numbers?

prods[prods$prod==largest, ]

##        Var1 Var2   prod
## 732594  993  913 906609
## 804514  913  993 906609
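Each product shows up twice because a × b and b × a are both in the grid. A possible variation, sketched here with hypothetical names (is_palindrome, pairs, best), keeps only the pairs with a <= b so that roughly half as many products need testing.

```r
## same string-reversal test as above, under a different name
is_palindrome <- function(x) {
  digits <- strsplit(as.character(x), "")[[1]]
  identical(digits, rev(digits))
}

three <- 100:999
pairs <- expand.grid(a = three, b = three)
pairs <- pairs[pairs$a <= pairs$b, ]   # drop the mirrored duplicates
pairs$prod <- pairs$a * pairs$b

## keep the palindromic products and pick the largest
pal  <- pairs[vapply(pairs$prod, is_palindrome, logical(1)), ]
best <- pal[which.max(pal$prod), ]
best
```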

Project Euler Q3 :: Largest prime factor

website@jcarroll.com.au (Jonathan Carroll) — Sat, 03 Jan 2015 00:52:07 +0000

Explanation. Standard caveat: don’t look here if you are trying to do these yourself.

The prime factors of 13195 are 5, 7, 13 and 29.
What is the largest prime factor of the number 600851475143?

It seems so simple at first glance, until of course you look at how big that last number is. I started off by making sure I understood the issue.

## check the worked solution
5*7*13*29

## [1] 13195

So far so good. At this point I realised that there isn’t an inbuilt is.prime so I stole one from this site.

is.prime <- function(num) {
  if (num == 2L) {
    TRUE
  } else if (any(num %% 2L:(num-1L) == 0L)) {
    FALSE
  } else {
    TRUE
  }
}

Testing the example works pretty well…

## let's loop up to n and list the prime factors
prime.factors <- function(n) {
  primes <- c()
  for(i in 1:n) {
    ## take advantage of lazy logical evaluation
    ## (&& short-circuits, a single & does not)
    ## and short-cut to only the factors
    if(n %% i == 0 && is.prime(i)) primes <- c(primes, i)
  }
  return(primes)
}
prime.factors(13195)

## [1]  5  7 13 29

but I hit a snag when I tried to do the same for the problem value.

w <- as.integer(600851475143)

## Warning: NAs introduced by coercion to integer range

prime.factors(600851475143) ## Error: cannot allocate vector of size 4476.7 Gb

## Error in prime.factors(600851475143): long vectors not supported yet: eval.c:6387

Sure enough, that’s bigger than the largest value R’s integer type can hold

as.numeric("600851475143") > .Machine$integer.max

## [1] TRUE

So, I abandoned the pre-filled list of values and went with brute force again. To speed it up, I delayed testing for primes until later, since I can do that over the generated list with an apply, and only bothered testing values of i below both sqrt(n) and n/f, where f is the largest factor found so far.

## lists are too big. Find the primes by brute force
## using floating point representations
z <- as.numeric("600851475143")
i <- 2
factors <- 1
## loop through values of i that are
## less than sqrt(z) and
## less than z/the largest found factor
while(i < sqrt(z) && i < z/max(factors)) {
  ## skip the prime test for now
  if(z %% i == 0) factors <- c(factors, i)
  i <- i + 1
}
factors

## [1]      1     71    839   1471   6857  59569 104441 486847

factors.prime <- sapply(factors, is.prime)
primes <- factors[factors.prime] 
z == prod(primes)

## [1] TRUE

max(primes)

## [1] 6857

### CORRECT
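An alternative to collecting factors and testing them for primality afterwards is plain trial division that divides each factor out as soon as it is found. This is a swapped-in technique, not the approach used above, and the function name largest_prime_factor is hypothetical. It relies on the value fitting exactly in a double, which holds for numbers below 2^53.

```r
largest_prime_factor <- function(n) {
  f <- 2
  ## once f * f exceeds what is left of n, the remainder must be prime
  while (f * f <= n) {
    if (n %% f == 0) {
      n <- n / f      # divide the factor out and try the same f again
    } else {
      f <- f + 1      # otherwise move on to the next candidate
    }
  }
  n
}

largest_prime_factor(13195)
largest_prime_factor(600851475143)
```

Because every small factor is removed before larger candidates are tried, any f that divides the remaining n is automatically prime, so no separate is.prime test is needed.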

Project Euler Q2 :: Even Fibonacci numbers

website@jcarroll.com.au (Jonathan Carroll) — Fri, 02 Jan 2015 22:56:50 +0000

Explanation. Standard caveat: don’t look here if you are trying to do these yourself.

Each new term in the Fibonacci sequence is generated by adding the previous two terms. By starting with 1 and 2, the first 10 terms will be:

1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …

By considering the terms in the Fibonacci sequence whose values do not exceed four million, find the sum of the even-valued terms.

Getting a little trickier already. I initially attempted this one with a brute-force recursive calculation of each term in the sequence:

## recursively define the Fibonacci sequence
fibonacci <- function(x) {
   y <- ifelse(x > 1, fibonacci(x-1)+fibonacci(x-2), x)
   return(y)
}

## values in fibonacci < 4e6
z <- 4e6L
## check i=20,30,40
fibonacci(20)

## [1] 6765

fibonacci(30) # takes over 8 seconds

## [1] 832040

# fibonacci(40) # takes too long

But of course these solutions are supposed to be calculable in about a minute, so I’ve taken a wrong turn.
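As an aside, the recursive definition can be kept if each term is cached the first time it is computed. The following is only a minimal sketch: `fib_cache` and `fib_memo` are hypothetical names that are not part of the original attempt, and the indexing matches `fibonacci()` above (term 0 is 0, term 1 is 1).

```r
## Hypothetical helper (not from the original post): memoised Fibonacci.
## Same indexing as fibonacci() above: F(0) = 0, F(1) = 1.
fib_cache <- new.env()

fib_memo <- function(x) {
  key <- as.character(x)
  ## return the cached value if this term has already been computed
  cached <- fib_cache[[key]]
  if (!is.null(cached)) return(cached)
  ## otherwise compute it once from the two previous terms and store it
  y <- if (x > 1) fib_memo(x - 1) + fib_memo(x - 2) else x
  assign(key, y, envir = fib_cache)
  y
}

fib_memo(30)
## [1] 832040
fib_memo(40)
## [1] 102334155
```

With the cache in place each term is computed only once, so the number of calls grows linearly with `x` instead of exponentially.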

It quickly became obvious that the recursive version recomputes every earlier term each time it needs a new one, which is silly when the problem explicitly needs all of them anyway. I decided to store the sequence as a data.frame to keep the iteration number alongside each term. Note that I use the more correct definition of the sequence, which starts with 1, 1, 2. That’s not going to be an issue here, as the extra leading 1 is odd and we’re only summing the even values.

## this is silly, why recompute every time?
fibonacci.seq <- data.frame(n=integer(), f=integer())
fibonacci.seq[1,] <- c(1,1)
fibonacci.seq[2,] <- c(2,1)
w <- 2
f <- fibonacci.seq$f[w]
## keep appending terms until the latest one reaches the limit z
while(f < z) {
 w <- w + 1
 fibonacci.seq[w,] <- data.frame(n=w, f=fibonacci.seq$f[w-1]+fibonacci.seq$f[w-2])
 f <- fibonacci.seq$f[w]
}
head(fibonacci.seq)

##   n f
## 1 1 1
## 2 2 1
## 3 3 2
## 4 4 3
## 5 5 5
## 6 6 8

tail(fibonacci.seq)

##     n       f
## 29 29  514229
## 30 30  832040
## 31 31 1346269
## 32 32 2178309
## 33 33 3524578
## 34 34 5702887

## sum of even values that do not exceed z (the loop stores one term past the limit)
sum(fibonacci.seq[fibonacci.seq$f %% 2 == 0 & fibonacci.seq$f <= z, "f"])

## [1] 4613732

### CORRECT
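Since only the running total is needed, the sequence does not have to be stored at all. A minimal sketch of that idea follows; `sum_even_fib` is a hypothetical helper rather than the code used above, and it starts from 1, 2 as in the problem statement.

```r
## Hypothetical helper (not from the original post): sum the even Fibonacci
## terms that do not exceed `limit`, holding only two terms at a time.
sum_even_fib <- function(limit) {
  a <- 1      # current term
  b <- 2      # next term
  total <- 0  # running sum of the even terms seen so far
  while (a <= limit) {
    if (a %% 2 == 0) total <- total + a
    nxt <- a + b
    a <- b
    b <- nxt
  }
  total
}

sum_even_fib(4e6)
## [1] 4613732
```

Because the loop condition tests the current term before adding it, a term above the limit is never counted, even if it happens to be even.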

Project Euler Q1 :: Multiples of 3 and 5

website@jcarroll.com.au (Jonathan Carroll) — Fri, 02 Jan 2015 22:45:10 +0000

Explanation. Standard caveat: don’t look here if you are trying to do these yourself.

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23. Find the sum of all the multiples of 3 or 5 below 1000.

This one’s pretty straightforward, really, as one might hope for the first question. R’s built-in subsetting mechanism handles the extraction fairly nicely. I perhaps would have liked a way to do this without first defining x, though I suppose it could just be repeated in the last line.

## check the worked solution
sum(c(3,5,6,9))

## [1] 23

## values < 1000
x <- 1:999

## sum of the x that are divisible by 3 or by 5
sum(x[x %% 3 == 0 | x %% 5 == 0])

## [1] 233168

### CORRECT
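The same total can also be reached without building x at all, by summing each arithmetic series directly and using inclusion-exclusion. This is only a sketch of that alternative: `sum_multiples_below` is a hypothetical helper, not something from the solution above.

```r
## Hypothetical helper (not from the original post): sum of the multiples of k
## strictly below n, using k * (1 + 2 + ... + m) with m = floor((n - 1) / k).
sum_multiples_below <- function(k, n) {
  m <- (n - 1) %/% k
  k * m * (m + 1) / 2
}

## multiples of 15 are counted under both 3 and 5, so subtract them once
sum_multiples_below(3, 1000) + sum_multiples_below(5, 1000) - sum_multiples_below(15, 1000)
## [1] 233168
```

For the worked example, `sum_multiples_below(3, 10) + sum_multiples_below(5, 10) - sum_multiples_below(15, 10)` gives 18 + 5 - 0 = 23, matching the check above.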

Project Euler

website@jcarroll.com.au (Jonathan Carroll) — Fri, 02 Jan 2015 22:27:10 +0000

As a means of honing my R programming skills, I’ve decided to tackle the Project Euler questions exclusively using my new favourite programming language. Besides, it seems that R can do just about everything; surely it can handle some programming games. If I get the time, I’ll add updates in other languages as I build my knowledge of them.

R is not going to be the best language for some of the questions (I see some very Python-suitable questions) but they should all be do-able in R, and I’d bet R beats some of the others in at least a few questions.

I’ll try to post a solution or two a week at least, alongside my other projects. There are about 500 questions available, presumably increasing in difficulty, so that’s quite a depth of material to cover.

If you know of a good improvement that can be made, by all means drop a line in the comments.

It should be noted, however, as per the Project Euler site:

I learned so much solving problem XXX so is it okay to publish my solution elsewhere?
It appears that you have answered your own question. There is nothing quite like that “Aha!” moment when you finally beat a problem which you have been working on for some time. It is often through the best of intentions in wishing to share our insights so that others can enjoy that moment too. Sadly, however, that will not be the case for your readers. Real learning is an active process and seeing how it is done is a long way from experiencing that epiphany of discovery. Please do not deny others what you have so richly valued yourself.

so if you’re planning on tackling these yourself, you’re better off steering clear of these posts for now and comparing solutions later. That’s what I’m doing at least.