r-bugs :: object_size

As soon as the R-Foundation posted that they’re inviting cleanup of old bugs, I knew it would be an opportunity to learn more about the way R works on the inside.

I started looking through the list of open bugs for somewhere I could help out. I barely know anything about the actual C internals of the language (I’m hoping to learn) so I figured it would be best to start with some parts of the code I’m familiar with.

I had an open bug report for the internals of glm which I extended with a reproducible example. I had another look through the open bug reports for “glm” in case there was another report of this that I had overlooked (not that I can find) and found another which seemed fairly straightforward - some out of date documentation.

Bug 16522

Bug 16522

That seems approachable. I tested that the documentation was still in that state (it was) and that the example did what it said (it did). Lastly, I read through the source of that function to double-check that the return value would indeed be more general. In fact, the method return value could be either the name of the fitting function as a character string, e.g.

glm.fit2 <- glm.fit
glm(1:4 ~ rnorm(4), method = "glm.fit2")$method
## [1] "glm.fit2"

or the actual function definition, if it was provided that way

head(glm(1:4 ~ rnorm(4), method = glm.fit)$method)
##                                                                          
## 1 function (x, y, weights = rep(1, nobs), start = NULL, etastart = NULL, 
## 2     mustart = NULL, offset = rep(0, nobs), family = gaussian(),        
## 3     control = list(), intercept = TRUE, singular.ok = TRUE)            
## 4 {                                                                      
## 5     control <- do.call("glm.control", control)                         
## 6     x <- as.matrix(x)

(truncated for simplicity).

This made for a small, (-2/+3)-line patch which was accepted and is now part of the source of R.

Sidenote: I made this patch using git, but I should really be doing this via SVN. This thread by Michael Chirico is what I’ll be following from here on.

Now, on to the next bug.

I had a browse through the sections and found this one from 2013

Bug 15389

Bug 15389

That seems innocent enough, right? My experience with object.size is when I look at how much memory a given object is taking up (incorrectly, I now understand after reading Advanced R). What I always liked about this function was that it has a format method so I could easily convert the standard output

object.size(mtcars)
## 7208 bytes

into a different unit very easily

format(object.size(mtcars), "KB")
## [1] "7 Kb"

Of course, in order to do this (to know to convert from bytes to Kb) the object needs to be classed appropriately. That’s the case, here

class(object.size(mtcars))
## [1] "object_size"

That isn’t the case for the size element of file.info, though

example_file <- file.path(.Library, "base", "R", "base.rdb")
file.info(example_file)[, c("size", "isdir", "mode")]
##                                      size isdir mode
## /usr/lib/R/library/base/R/base.rdb 973156 FALSE  644

which is just a number

class(file.info(example_file)$size)
## [1] "numeric"

The suggestion is to make this of class object_size, which would enable the nice formatting of the unit (even though a ‘file’ is not an ‘object’). Seems fair, let’s have a look at what needs to happen to make that work - hopefully it’s as simple as adding a class to the size element. I use RStudio, and I have a copy of the r-source in a project, so I can simply CTRL+SHIFT+F to search all files in this project for “file.info”. Sure enough, it’s in /src/library/base/R/files.R

file.info <- function(..., extra_cols = TRUE)
{
    res <- .Internal(file.info(fn <- c(...), extra_cols))
    res$mtime <- .POSIXct(res$mtime)
    res$ctime <- .POSIXct(res$ctime)
    res$atime <- .POSIXct(res$atime)
    class(res) <- "data.frame"
    attr(res, "row.names") <- fn # not row.names<- as that does a length check
    res
}

Not much to it, but some surprises for sure. What immediately jumps out to me is that the fact that the result is a data.frame is achieved through the very hacky “slap a class on it” method rather than as.data.frame(). The call to .Internal means the actual work is done at the C level, so there may not be much hope changing the class of the size there.

The simplest thing would appear to be to convert res$size to class object_size as soon as it’s created. There’s sometimes a converting function, e.g. as.object_size, but I don’t see one here. Looking at the internals of the object.size function

object.size
## function (x) 
## structure(.Call(C_objectSize, x), class = "object_size")
## <bytecode: 0x564bd0b58e30>
## <environment: namespace:utils>

suggests it’s safe enough to slap the class on an object, so let’s try that. I’ll rename this function for now so we can see if it’s working

file.info2 <- function(..., extra_cols = TRUE)
{
    res <- .Internal(file.info(fn <- c(...), extra_cols))
    class(res$size) <- "object_size"
    res$mtime <- .POSIXct(res$mtime)
    res$ctime <- .POSIXct(res$ctime)
    res$atime <- .POSIXct(res$atime)
    class(res) <- "data.frame"
    attr(res, "row.names") <- fn # not row.names<- as that does a length check
    res
}

file.info2(example_file)
## Error in round(x/base^power, digits = digits): invalid second argument of length 0

Well, that didn’t work. Huh. Did we do something wrong?

file.info2(example_file)$size
## 973156 bytes

No, that actually works. Did it actually fail to return an object?

fi <- file.info2(example_file)

Huh, that works, too. Can we see what’s inside?

str(fi)
## 'data.frame':    1 obs. of  10 variables:
##  $ size  : 'object_size' num 973156
##  $ isdir : logi FALSE
##  $ mode  : 'octmode' int 644
##  $ mtime : POSIXct, format: "2019-01-15 06:33:43"
##  $ ctime : POSIXct, format: "2019-08-10 20:44:16"
##  $ atime : POSIXct, format: "2019-08-10 20:44:13"
##  $ uid   : int 0
##  $ gid   : int 0
##  $ uname : chr "root"
##  $ grname: chr "root"

Well, that looks to be what we wanted - the size element has the right class. If we print this, however

print(fi)
## Error in round(x/base^power, digits = digits): invalid second argument of length 0

So, the issue seems to be in the print method somewhere. using traceback() we can see that the error comes from this line in utils:::format.object_size which seems to be having trouble finding the digits value

paste(round(x/base^power, digits = digits), unit)

The formal arguments for utils:::format.object_size has a default of digits = 1L so why is it not being found? Tracing back further we see that the format.object_size method was called from format.data.frame which loops through the columns of the data.frame and formats them according to each class there

format.data.frame <- function(x, ..., justify = "none")
{
    nc <- length(x)
    <...>
    rval <- vector("list", nc)
    for(i in seq_len(nc))
       rval[[i]] <- format(x[[i]], ..., justify = justify)
    <...>

but there’s no mention of digits here. Tracing even further back, this is called from print.data.frame and passes its digits argument. That function has the signature

print.data.frame <-
    function(x, ..., digits = NULL, quote = FALSE, right = TRUE,
         row.names = TRUE, max = NULL)

and now we can see where the problem comes from - we try to print our data.frame which has an object_size column, but that print method sets digits = NULL which is passed to format.data.frame which is passed to format.object_size which has no way to deal with digits = NULL. Ugh.

Can we test that? Sure, let’s create a simple data.frame, make a column be object_size, and try to print it

df <- data.frame(x = 1, y = 2048)
print(df)
##   x    y
## 1 1 2048
class(df$y) <- "object_size"
print(df)
## Error in round(x/base^power, digits = digits): invalid second argument of length 0

This is the behaviour if we auto-print an object by just referencing it by name

df
## Error in round(x/base^power, digits = digits): invalid second argument of length 0

but what if we do provide the digits argument to an actual call?

print(df, digits = 1L)
##   x          y
## 1 1 2048 bytes

SOLVED! But what to do about it? I see three options:

OPTION 1: Don’t default to digits = NULL in print.data.frame. I’m not sure what the consequence of this would be across all calls to this function, but I can’t imagine there’s much use for that default. A better default would seem to be

getOption("digits")
## [1] 7

Looking at ?options we can see how this is intended to be used

digits: controls the number of significant digits to print when printing numeric values. It is a suggestion only. Valid values are 1…22 with default 7. See the note in print.default about values greater than 15.

OPTION 2: Make round deal with NULL values. I worry that is already handled in that an error message appropriate to that function is generated

round(3.14159, digits = NULL)
## Error in round(3.14159, digits = NULL): invalid second argument of length 0

OPTION 3: Make format.object_size capable of dealing with digits = NULL. The fact that it has no way of dealing with this value seems like an oversight, since it must be a valid value in order to pass it to round. Again, the option of using a more sensible default comes to mind, but there already is a default in place here (digits = 1L) it’s just being overridden.

Instead, we would need some sort of handling in this situation, such as

digits <- ifelse(is.null(digits), 1L, digits)

## or 

digits <- ifelse(is.null(digits), getOption("digits"), digits)

I think this last option is the most satisfactory (though perhaps the first should also be addressed) so let’s see if that’s sufficient. Unfortunately, since the issue is deeply nested within another function’s namespace, we can’t just write a new format.object_size and call print.data.frame and expect it to work (without rebuilding R itself – learning how to do that is on my TODO list). What we can do, though, is to write a format.object_size3, class our new column as object_size3, and test that. With this new function written, we get

format.object_size3 <- function(x, units = "b", standard = "auto", digits = 1L, ...)
{
  known_bases <- c(legacy = 1024, IEC = 1024, SI = 1000)
  known_units <- list(
    SI     = c("B", "kB",  "MB",  "GB", "TB", "PB",  "EB",  "ZB",  "YB"),
    IEC    = c("B", "KiB", "MiB", "GiB","TiB","PiB", "EiB", "ZiB", "YiB"),
    legacy = c("b", "Kb",  "Mb",  "Gb", "Tb", "Pb"),
    LEGACY = c("B", "KB",  "MB",  "GB", "TB", "PB") # <- only for "KB"
  )
  
  units <- match.arg(units,
                     c("auto", unique(unlist(known_units), use.names = FALSE)))
  standard <- match.arg(standard, c("auto", names(known_bases)))
  digits <- ifelse(is.null(digits), 1L, digits) # added
  
  if (standard == "auto") { ## infer 'standard' from 'units':
    standard <- "legacy" # default; may become "SI"
    if (units != "auto") {
      if (endsWith(units, "iB"))
        standard <- "IEC"
      else if (endsWith(units, "b"))
        standard <- "legacy"   ## keep when "SI" is the default
      else if (units == "kB")
        ## SPECIAL: Drop when "SI" becomes the default
        stop("For SI units, specify 'standard = \"SI\"'")
    }
  }
  base      <- known_bases[[standard]]
  units_map <- known_units[[standard]]
  
  if (units == "auto") {
    power <- if (x <= 0) 0L else min(as.integer(log(x, base = base)),
                                     length(units_map) - 1L)
  } else {
    power <- match(toupper(units), toupper(units_map)) - 1L
    if (is.na(power))
      stop(gettextf("Unit \"%s\" is not part of standard \"%s\"",
                    sQuote(units), sQuote(standard)), domain = NA)
  }
  unit <- units_map[power + 1L]
  ## SPECIAL: Use suffix 'bytes' instead of 'b' for 'legacy' (or always) ?
  if (power == 0 && standard == "legacy") unit <- "bytes"
  
  paste(round(x / base^power, digits=digits), unit)
}

file.info3 <- function(..., extra_cols = TRUE)
{
    res <- .Internal(file.info(fn <- c(...), extra_cols))
    class(res$size) <- "object_size3"
    res$mtime <- .POSIXct(res$mtime)
    res$ctime <- .POSIXct(res$ctime)
    res$atime <- .POSIXct(res$atime)
    class(res) <- "data.frame"
    attr(res, "row.names") <- fn # not row.names<- as that does a length check
    res
}

file.info3(example_file)[, c("size", "isdir", "mode")]
##                                            size isdir mode
## /usr/lib/R/library/base/R/base.rdb 973156 bytes FALSE  644

It works!

As of R 3.2.0 there are also some helper extractors of these elements (mode, mtime, size) so our improvement extends to file.size. We can test this if we update the helper to use our version

file.size3 <- function(...) file.info3(..., extra_cols = FALSE)$size
file.size3(example_file)
## [1] 973156
## attr(,"class")
## [1] "object_size3"

We can format that specifically

format(file.size3(example_file), "KB")
## [1] "950.3 Kb"

but it doesn’t print nicely on its own here. This is because we’ve artificially changed the class to object_size3 so we’re no longer plugging in to all the methods for object_size. I could go through and redefine all of those, but for testing, it’s easier to just reset everything to object_size, define file.size to use my custom version, and test that, and that works.

I’ve seen others create a package containing the modified code so that all of the namespacing works out, but in this case it would be a large chunk of the print, subset, and format methods.

Are we done? What about another look at that class(res) <- "data.frame" business?

Would it make more sense for that to use res <- as.data.frame(res)? Let’s try

file.info4 <- function(..., extra_cols = TRUE)
{
    res <- .Internal(file.info(fn <- c(...), extra_cols))
    class(res$size) <- "object_size3"
    res$mtime <- .POSIXct(res$mtime)
    res$ctime <- .POSIXct(res$ctime)
    res$atime <- .POSIXct(res$atime)
    res <- as.data.frame(res)
    attr(res, "row.names") <- fn # not row.names<- as that does a length check
    res
}

file.info4(example_file)
## Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class '"object_size3"' to a data.frame

Fair enough, there’s probably no as.data.frame.object_size3 method (the 3 is for our convenience, but there’s no as.data.frame.object_size either). It’s simple enough to add

as.data.frame.object_size3 <- as.data.frame.vector

file.info4(example_file)
## Error in as.data.frame.default(x[[i]], optional = TRUE): cannot coerce class '"octmode"' to a data.frame

The “mode” element has the file permissions as octmode. I feel we’re getting a bit off track if we need to add a lot of other as.data.frame methods, but here we go

as.data.frame.octmode <- as.data.frame.vector

file.info4(example_file)
##                                            size isdir mode
## /usr/lib/R/library/base/R/base.rdb 973156 bytes FALSE  644
##                                                  mtime               ctime
## /usr/lib/R/library/base/R/base.rdb 2019-01-15 06:33:43 2019-08-10 20:44:16
##                                                  atime uid gid uname
## /usr/lib/R/library/base/R/base.rdb 2019-08-10 20:44:13   0   0  root
##                                    grname
## /usr/lib/R/library/base/R/base.rdb   root

Right, that’s working. I’ll raise the question of whether or not it’s worth the effort to support the as.data.frame conversion or whether the forcing of the class is better - I’m honestly not sure which.

I wrote a bug report and corresponding patch that adds the new test for NULL line to the definition of format.object_size and submitted that.

I then wrote another patch to the bug report this started with which implements this (now minor) change to the size element of file.info. It may be the case that neither is welcome in the core source, and that’s fine. The bug can be closed as WONTFIX.

I’ve learned a lot about how the components of these work, and either way a bug should get closed. I’m off to find the next one.


Addendum: I wanted to test this out properly, so I rebuilt my modified version of R (with these two patches in place) from source using docker. Assuming you have docker set up, this seems to do the trick

## pull the r-base docker image
## this has most of the requirements to build R
docker pull r-base

## run the image with a command-line
## with your local svn repository mounted
## your path to your svn directory will vary
docker run -ti -v /home/USER/svn/:/svn/ r-base /bin/bash 

## update whatever is necessary to build R from the svn source
apt update
apt-get install libcurl4-openssl-dev 
apt-get install texinfo ## needed to build manuals
apt-get install texlive-latex-base ## needed to build vignettes
apt-get install texlive-latex-extra ## sty files
apt-get install subversion ## to work with svn

## ensure svn is up to date
cd /svn/r-devel/
svn update

## build R and check that it works
## we're just using the command line, so X11 is not needed
## we're just building the source, so we don't need e.g. MASS
## we don't need to be using java, so don't include that
./configure --with-x=no --without-recommended-packages --disable-byte-compiled-packages --disable-java
## make in parallel
make -j4
make check
make check-all
make install

bin/R
## R Under development (unstable) (2019-10-10 r77275) -- "Unsuffered Consequences"

Woohoo!

Note that this is ephemeral - the changes won’t persist after you stop the image. To prevent that, we can save the image for re-use (from outside the image)

## your image ID will be different
docker commit -m "build env" 8f42f23123dd r-build-env

With that built, we can try out our patched functions (not run locally here)

example_file <- file.path(.Library, "base", "R", "base.rdb")
file.info(example_file)[, c("size", "isdir", "mode")]
#>                                              size isdir mode
#> /svn/r-devel/library/base/R/base.rdb 357435 bytes FALSE  644

file.size(example_file)
#> 357435 bytes

format(file.size(example_file), "KB")
#> [1] "349.1 Kb"

Success!


devtools::session_info()

## ─ Session info ──────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.5.2 (2018-12-20)
##  os       Pop!_OS 19.04               
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language en_AU:en                    
##  collate  en_AU.UTF-8                 
##  ctype    en_AU.UTF-8                 
##  tz       Australia/Adelaide          
##  date     2019-10-12                  
## 
## ─ Packages ──────────────────────────────────────────────────────────────
##  package     * version date       lib source                           
##  assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.2)                   
##  backports     1.1.4   2019-04-10 [1] CRAN (R 3.5.2)                   
##  blogdown      0.14.1  2019-08-11 [1] Github (rstudio/blogdown@be4e91c)
##  bookdown      0.12    2019-07-11 [1] CRAN (R 3.5.2)                   
##  callr         3.3.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.2)                   
##  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.1)                   
##  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.1)                   
##  devtools      2.2.1   2019-09-24 [1] CRAN (R 3.5.2)                   
##  digest        0.6.20  2019-07-04 [1] CRAN (R 3.5.2)                   
##  ellipsis      0.3.0   2019-09-20 [1] CRAN (R 3.5.2)                   
##  evaluate      0.14    2019-05-28 [1] CRAN (R 3.5.2)                   
##  fs            1.3.1   2019-05-06 [1] CRAN (R 3.5.2)                   
##  glue          1.3.1   2019-03-12 [1] CRAN (R 3.5.2)                   
##  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.1)                   
##  knitr         1.25    2019-09-18 [1] CRAN (R 3.5.2)                   
##  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.1)                   
##  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.1)                   
##  pkgbuild      1.0.4   2019-08-05 [1] CRAN (R 3.5.2)                   
##  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.1)                   
##  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.1)                   
##  processx      3.4.1   2019-07-18 [1] CRAN (R 3.5.2)                   
##  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.1)                   
##  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.1)                   
##  Rcpp          1.0.2   2019-07-25 [1] CRAN (R 3.5.2)                   
##  remotes       2.1.0   2019-06-24 [1] CRAN (R 3.5.2)                   
##  rlang         0.4.0   2019-06-25 [1] CRAN (R 3.5.2)                   
##  rmarkdown     1.14    2019-07-12 [1] CRAN (R 3.5.2)                   
##  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.1)                   
##  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.1)                   
##  stringi       1.4.3   2019-03-12 [1] CRAN (R 3.5.2)                   
##  stringr       1.4.0   2019-02-10 [1] CRAN (R 3.5.1)                   
##  testthat      2.2.1   2019-07-25 [1] CRAN (R 3.5.2)                   
##  usethis       1.5.1   2019-07-04 [1] CRAN (R 3.5.2)                   
##  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.1)                   
##  xfun          0.8     2019-06-25 [1] CRAN (R 3.5.2)                   
##  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.1)                   
## 
## [1] /home/jono/R/x86_64-pc-linux-gnu-library/3.5
## [2] /usr/local/lib/R/site-library
## [3] /usr/lib/R/site-library
## [4] /usr/lib/R/library


r-bugs 

See also