--- title: "Controlling Splitting Behavior" author: "Gabriel Becker" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Controlling Splitting Behavior} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- # Controlling Facet Levels ## Provided Functions By default, `split_*_by(varname, ...)` generates a facet for each *level* the variable `varname` takes in the data - including unobserved ones in the `factor` case. This behavior can be customized in various ways. The most straightforward way to customize which facets are generated by a split is with one of the split functions or split function families provided by `rtables`. These predefined split functions and function factories implement commonly desired customization patterns of splitting behavior (i.e., faceting behavior). They include: - `remove_split_levels` - remove specified levels from the data for facet generation. - `keep_split_levels` - keep only specified levels in the data for facet generation (removing all others). - `drop_split_levels` - drop levels that are unobserved *within the data being split*, i.e., associated with the parent facet. - `reorder_split_levels` - reorder the levels (and thus the generated facets) to the specified order. - `trim_levels_in_group` - drop unobserved levels *of another variable* independently within the data associated with each facet generated by the current split. - `add_overall_level`, `add_combo_levels` - add additional "virtual" levels which combine two or more levels of the variable being split. See the following section. - `trim_levels_to_map` - trim the levels of multiple variables to a pre-specified set of value combinations. See the following section. The first four of these are fairly self-describing and for brevity, we refer our readers to `?split_funcs` for details including working examples. ## Controlling Combinations of Levels Across Multiple Variables Often with nested splitting involving multiple variables, the values of the variables in question are *logically nested*; meaning that certain values of the inner variable are only coherent in combination with a specific value or values of the outer variable. As an example, suppose we have a variable `vehicle_class`, which can take the values `"automobile"`, and `"boat"`, and a variable `vehicle_type`, which can take the values `"car"`, `"truck"`, `"suv"`,`"sailboat"`, and `"cruiseliner"`. The combination (`"automobile"`, `"cruiseliner"`) does not make sense and will never occur in any (correctly cleaned) data set; nor does the combination (`"boat"`, `"truck"`). 
## Controlling Combinations of Levels Across Multiple Variables

Often with nested splitting involving multiple variables, the values of the variables in question are *logically nested*; that is, certain values of the inner variable are only coherent in combination with a specific value or values of the outer variable.

As an example, suppose we have a variable `vehicle_class`, which can take the values `"auto"` and `"boat"`, and a variable `vehicle_type`, which can take the values `"car"`, `"truck"`, `"suv"`, `"sailboat"`, and `"cruiseliner"`. The combination (`"auto"`, `"cruiseliner"`) does not make sense and will never occur in any (correctly cleaned) data set; nor does the combination (`"boat"`, `"truck"`).

We will showcase strategies to deal with this in the next sections, using the following artificial data:

```{r}
set.seed(0)
levs_type <- c("car", "truck", "suv", "sailboat", "cruiseliner")
vclass <- sample(c("auto", "boat"), 1000, replace = TRUE)
auto_inds <- which(vclass == "auto")
vtype <- rep(NA_character_, 1000)
vtype[auto_inds] <- sample(
  c("car", "truck"), ## suv missing on purpose
  length(auto_inds),
  replace = TRUE
)
vtype[-auto_inds] <- sample(
  c("sailboat", "cruiseliner"),
  1000 - length(auto_inds),
  replace = TRUE
)

vehic_data <- data.frame(
  vehicle_class = factor(vclass),
  vehicle_type = factor(vtype, levels = levs_type),
  color = sample(
    c("white", "black", "red"), 1000,
    prob = c(1, 2, 1), replace = TRUE
  ),
  cost = ifelse(
    vclass == "boat",
    rnorm(1000, 100000, sd = 5000),
    rnorm(1000, 40000, sd = 5000)
  )
)

head(vehic_data)
```

### `trim_levels_in_group`

The `trim_levels_in_group` split function factory creates split functions which deal with this issue empirically: any combination which *is observed in the data being tabulated* will appear as nested facets within the table, while combinations that are not observed will not.

If we use default level-based faceting, we get several logically incoherent cells within our table:

```{r examples, message=FALSE}
library(rtables)

lyt <- basic_table() %>%
  split_cols_by("color") %>%
  split_rows_by("vehicle_class") %>%
  split_rows_by("vehicle_type") %>%
  analyze("cost")

build_table(lyt, vehic_data)
```

This is obviously not the table we want, as the majority of its space is taken up by meaningless combinations. If we use `trim_levels_in_group` to trim the levels of `vehicle_type` separately within each level of `vehicle_class`, we get a table which only has meaningful combinations:

```{r}
lyt2 <- basic_table() %>%
  split_cols_by("color") %>%
  split_rows_by("vehicle_class", split_fun = trim_levels_in_group("vehicle_type")) %>%
  split_rows_by("vehicle_type") %>%
  analyze("cost")

build_table(lyt2, vehic_data)
```

Note, however, that it does not contain *all* meaningful combinations, only those actually observed in our data, which *happen* not to include the perfectly valid (`"auto"`, `"suv"`) combination. To restrict level combinations to those which are valid *regardless of whether the combination was observed*, we must use `trim_levels_to_map()` instead.

### `trim_levels_to_map`

`trim_levels_to_map` is similar to `trim_levels_in_group` in that its purpose is to avoid combinatorial explosion when nesting splits of logically nested variables. Unlike its sibling function, however, with `trim_levels_to_map` we define the exact set of allowed combinations *a priori*, and that exact set of combinations is produced in the resulting table, regardless of whether they are observed.

```{r, message = FALSE}
library(tibble)

map <- tribble(
  ~vehicle_class, ~vehicle_type,
  "auto", "truck",
  "auto", "suv",
  "auto", "car",
  "boat", "sailboat",
  "boat", "cruiseliner"
)

lyt3 <- basic_table() %>%
  split_cols_by("color") %>%
  split_rows_by("vehicle_class", split_fun = trim_levels_to_map(map)) %>%
  split_rows_by("vehicle_type") %>%
  analyze("cost")

build_table(lyt3, vehic_data)
```

Now we see that the (`"auto"`, `"suv"`) combination is again present, even though it is populated with `NA`s (because there is no data in that category), while the logically invalid combinations remain absent.
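As an aside, the map need not be written out by hand. One alternative, sketched below with base R only (the name `map2` is purely illustrative), is to start from the combinations observed in the data and append the valid-but-unobserved (`"auto"`, `"suv"`) pair; the result should be usable with `trim_levels_to_map()` in place of the hand-written `map` above.

```{r}
## Assemble the map programmatically (illustrative sketch): observed
## combinations plus the valid but unobserved ("auto", "suv") pair
map2 <- unique(vehic_data[, c("vehicle_class", "vehicle_type")])
map2[] <- lapply(map2, as.character)
map2 <- rbind(map2, data.frame(vehicle_class = "auto", vehicle_type = "suv"))
map2 <- map2[order(map2$vehicle_class, map2$vehicle_type), ]
rownames(map2) <- NULL
map2
```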
## Combining Levels

Another very common manipulation of faceting in a table context is the introduction of combination levels that are not explicitly modeled in the data. Most often this involves the addition of an "overall" category, but in both principle and practice it can involve any arbitrary combination of levels. `rtables` explicitly supports this via the `add_overall_level` (for the overall case) and `add_combo_levels` split function factories.

### `add_overall_level`

`add_overall_level` accepts `valname`, the name of the new level, as well as `label` and `first` (whether the new level should come first, if `TRUE`, or last, if `FALSE`, in the ordering). Building further on our arbitrary vehicles table, we can use this to create an "all colors" category:

```{r}
lyt4 <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("color", split_fun = add_overall_level("allcolors", label = "All Colors")) %>%
  split_rows_by("vehicle_class", split_fun = trim_levels_to_map(map)) %>%
  split_rows_by("vehicle_type") %>%
  analyze("cost")

build_table(lyt4, vehic_data)
```

With the column counts turned on, we can see that the "All Colors" column encompasses the full 1000 (completely fake) vehicles in our data set.

To add more arbitrary combinations, we use `add_combo_levels`.

### `add_combo_levels`

`add_combo_levels` allows us to add one or more arbitrary combination levels to the faceting structure of our table. We do this by defining a *combination data.frame* which describes the levels we want to add. A combination `data.frame` has the following columns and one row for each combination to add:

- `valname` - string indicating the name of the value, which will appear in paths.
- `label` - a string indicating the label which should be displayed when rendering.
- `levelcombo` - character vector of the individual levels to be combined in this combination level.
- `exargs` - a list (usually `list()`) of extra arguments which should be passed to analysis and content functions when tabulating within this column or row.

Suppose we wanted combination levels for all non-white colors, and for white and black colors. We do this like so:

```{r}
combodf <- tribble(
  ~valname, ~label, ~levelcombo, ~exargs,
  "non-white", "Non-White", c("black", "red"), list(),
  "blackwhite", "Black or White", c("black", "white"), list()
)

lyt5 <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("color", split_fun = add_combo_levels(combodf)) %>%
  split_rows_by("vehicle_class", split_fun = trim_levels_to_map(map)) %>%
  split_rows_by("vehicle_type") %>%
  analyze("cost")

build_table(lyt5, vehic_data)
```

# Fully Customizing Split (Facet) Behavior

Beyond the ability to select common splitting customizations from the split functions and split function factories `rtables` provides, we can also fully customize every aspect of splitting behavior by creating our own split functions. While it is possible to do so by hand, the primary way we do this is via the `make_split_fun()` function, which accepts functions implementing different component behaviors and combines them into a split function which can be used in a layout.

Splitting, or faceting as it is done in `rtables`, can be thought of as the combination of three steps:

1. preprocessing - transformation of the incoming data which will be faceted, e.g., dropping unused factor levels.
2. splitting - mapping the incoming data to a set of one or more subsets representing individual facets.
3. postprocessing - operations on the facets, e.g., combining or removing them.

The `make_split_fun()` function allows us to specify custom behaviors for each of these steps independently when defining custom splitting behavior, via the `pre`, `core_split`, and `post` arguments, which correspond to the three steps above, respectively; a schematic of the resulting call shape is sketched below.
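To fix ideas, here is a minimal schematic of that call shape. The component functions are empty placeholders of our own (not `rtables` functions); their signatures follow the conventions described in the next paragraphs.

```{r}
## Schematic only: no-op placeholder components showing the expected signatures
my_pre_fn <- function(df, spl, vals, labels, ...) {
  ## modify and return the incoming data before facets are computed
  df
}

my_post_fn <- function(splret, spl, fulldf, ...) {
  ## modify and return the split result (values, datasplit, labels)
  splret
}

my_split_fun <- make_split_fun(
  pre = list(my_pre_fn),
  post = list(my_post_fn)
)
```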
The `pre` argument accepts zero or more pre-processing functions, which must accept `df`, `spl`, `vals`, and `labels`, and can optionally accept `.spl_context`. They then manipulate `df` (the incoming data for the split) and return a modified data.frame. This modified data.frame *must* contain all columns present in the incoming data.frame, but can add columns if necessary. Note, however, that these new columns _cannot be used in the layout as split or analysis variables_, because they will not be present when validity checking is done.

The pre-processing component is useful for things such as manipulating factor levels, e.g., to trim unobserved ones or to reorder levels based on observed counts.

For a more detailed discussion of what custom split functions do, and an example of a custom split function not implemented via `make_split_fun()`, see `?custom_split_funs`.

## An Example Custom Split Function

Here we will implement an arbitrary custom split function in which we specify both pre- and post-processing instructions. It is unusual for users to need to override the core splitting logic - and doing so is currently only supported in row space - so we leave it out of this example, but we provide a narrow example of that usage below.

### An Illustrative Example of A Custom Split Function

First, we define two pieces of pre-processing behavior:

1. A function which reverses the order of the levels of a variable (while retaining which level is associated with which observation), and
2. A function factory which creates a function that removes a level and the data associated with it.

```{r}
## reverse order of levels
rev_lev <- function(df, spl, vals, labels, ...) {
  ## in the split_rows_by() and split_cols_by() cases,
  ## spl_variable() gives us the variable
  var <- spl_variable(spl)
  vec <- df[[var]]
  levs <- if (is.character(vec)) unique(vec) else levels(vec)
  df[[var]] <- factor(vec, levels = rev(levs))
  df
}

rem_lev_facet <- function(torem) {
  function(df, spl, vals, labels, ...) {
    var <- spl_variable(spl)
    vec <- df[[var]]
    bad <- vec == torem
    df <- df[!bad, ]
    levs <- if (is.character(vec)) unique(vec) else levels(vec)
    df[[var]] <- factor(as.character(vec[!bad]), levels = setdiff(levs, torem))
    df
  }
}
```

Next, we implement our post-processing function. Here we will reorder the facets based on the amount of data each of them represents.

```{r}
sort_them_facets <- function(splret, spl, fulldf, ...) {
  ord <- order(sapply(splret$datasplit, nrow))
  make_split_result(
    splret$values[ord],
    splret$datasplit[ord],
    splret$labels[ord]
  )
}
```

Finally, we construct our custom split function and use it to create our table:

```{r}
silly_splfun1 <- make_split_fun(
  pre = list(
    rev_lev,
    rem_lev_facet("white")
  ),
  post = list(sort_them_facets)
)

lyt6 <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("color", split_fun = silly_splfun1) %>%
  split_rows_by("vehicle_class", split_fun = trim_levels_to_map(map)) %>%
  split_rows_by("vehicle_type") %>%
  analyze("cost")

build_table(lyt6, vehic_data)
```

### Overriding the Core Split Function

**Currently, overriding core split behavior is only supported in functions used for row splits.**

Next, we write a custom core-splitting function which divides the observations into four groups: the first 100, observations 101-500, observations 501-900, and the last 100. We could claim this was to test for structural bias in the first and last observations, but really it is simply to illustrate overriding the core splitting machinery; it has no meaningful statistical purpose.

```{r}
silly_core_split <- function(spl, df, vals, labels, .spl_context) {
  make_split_result(
    c("first", "lowmid", "highmid", "last"),
    datasplit = list(
      df[1:100, ],
      df[101:500, ],
      df[501:900, ],
      df[901:1000, ]
    ),
    labels = c(
      "first 100",
      "obs 101-500",
      "obs 501-900",
      "last 100"
    )
  )
}
```

We can use this to construct a split function. It could be combined with pre- and post-processing functions, since each of the stages is performed independently, but in this case we won't, because pre- or post-processing would not make much sense given our core splitting behavior.

```{r}
even_sillier_splfun <- make_split_fun(core_split = silly_core_split)

lyt7 <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("color") %>%
  split_rows_by("vehicle_class", split_fun = even_sillier_splfun) %>%
  split_rows_by("vehicle_type") %>%
  analyze("cost")

build_table(lyt7, vehic_data)
```

### Design of Pre- and Post-Processing Functions For Use in `make_split_fun`

Pre-processing and post-processing functions in the custom-splitting context are best thought of as (and implemented as) independent, atomic building blocks for the desired overall behavior. This allows them to be reused in a flexible mix-and-match way.

`rtables` provides several behavior components implemented as either functions or function factories (a short sketch of combining them follows this list):

- Pre-processing "behavior blocks"
  - `drop_facet_levels` - drop unobserved levels in the variable being split
- Post-processing "behavior blocks"
  - `trim_levels_in_facets` - provides `trim_levels_in_group` behavior
  - `add_overall_facet` - add a combination facet for the full data
  - `add_combo_facet` - add a single combination facet (can be used more than once in a single `make_split_fun` call)
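For example, a minimal sketch combining two of these provided blocks on our vehicle data (the layout name `lyt8` is illustrative, and the exact factory arguments shown are assumptions; see `?make_split_fun` for the authoritative signatures):

```{r}
## Combine provided building blocks (illustrative sketch): drop any
## unobserved levels of the split variable, then add an overall facet
## spanning the full data
block_splfun <- make_split_fun(
  pre = list(drop_facet_levels),
  post = list(add_overall_facet("allobs", "All Observations"))
)

lyt8 <- basic_table(show_colcounts = TRUE) %>%
  split_cols_by("color", split_fun = block_splfun) %>%
  analyze("cost")

build_table(lyt8, vehic_data)
```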