--- title: "Join Keys" author: "NEST CoreDev" output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Join Keys} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Overview The `teal.data` package provides a way to define primary keys for a dataset and establish relationships with other datasets. Each dataset _joining keys_ can be characterized by: - Columns constituting the primary key; - Foreign/merge keys, analogous to `SQL` foreign keys. Typically, an application developer specifies these keys manually. However, for datasets following the ADaM standard, `teal.data` can automatically assign keys using the `default_cdisc_join_keys` object. Refer to the section ["Joining Keys with ADaM Datasets"](#join-keys-in-adam-datasets) for details on using this object to select specific datasets. ##### Uses of `join_keys` class in _teal_ applications The primary function of the `join_keys` class in `teal` applications is to facilitate the seamless [merging of datasets](https://insightsengineering.github.io/teal.transform/latest-tag/articles/data-merge.html) using `teal.transform`. Additionally, it plays a role in the data filtering using the _[Filter Panel](https://insightsengineering.github.io/teal/latest-tag/articles/filter-panel.html)_ in a `teal` application. The filters applied to a (parent) dataset are also applied to their children. ## Anatomy of `join_keys` The `join_keys` object contains information about the foreign/primary keys of multiple datasets. Each key is represented by a pair of datasets (by name reference) and a named character vector that encodes the column name mapping between the two datasets. In addition, a foreign key may also contain a _parent-child_ attribute that is used in the "Filter Panel" as we mentioned above. A new join keys can be created as an empty object, or by defining an initial set of primary and foreign keys. That initial object can be extended by adding/modifying/removing keys and by establishing parent-child relationships between datasets. ##### `join_keys(...)`: Join Keys Constructor / Getter / Setter Convenient function that is used both as the constructor and as the getter for _join_keys_ objects. As the _Getter_ it is used to retrieve the _joining keys_ that are contained in other objects, such as a `teal_data` object. As the _Constructor_ it is used to specify a collection of multiple individual keys (via `join_key` function described below). ##### `join_key(dataset_1, dataset_2, key, parent)`: Single Join Key Constructor Specifies a primary key or a relationship between two datasets. - `dataset_1`, `dataset_2`: names of the datasets (if `dataset_2` is the same as `dataset_1` or is omitted, it creates a primary key); - `key`: named `character` vector of column name mapping between datasets (unnamed vector assumes column names between datasets are the same); - `directed` (optional): flag that indicates whether there is a _parent-child_ directed relationship between `dataset_2` to `dataset_1` (latter as a parent). Note that join keys are assumed to be symmetric, i.e., `join_key("ds1", "ds2", c("ds1_col" = "ds2_col"))` establishes a relationship from "x" to "y" and vice versa. By default, the new joining key will set the `dataset_1` as the parent. ##### Example & Output ```{r, include=FALSE} # nolint start: commented_code_linter. ``` ```{r, results="hide", message=FALSE, tidy=FALSE} library(teal.data) jk <- join_keys( join_key("ds1", keys = "col_1"), # ds1: [col_1] join_key("ds2", keys = c("col_1", "col_2")), # ds2: [col_1, col_2] join_key("ds3", keys = c("col_1", "col_3")), # ds3: [col_1, col_3] join_key("ds1", "ds2", keys = "col_1"), # ds1 <-- ds2 join_key("ds1", "ds3", keys = "col_1"), # ds1 <-- ds3 join_key("ds4", "ds5", keys = c("col_4" = "col_5"), directed = FALSE) # ds4 <--> ds5 ) # The parent-child relationships are created automatically (unless 'parent' parameter is "none") jk ``` ```{r, include=FALSE} # nolint end: commented_code_linter. ``` | Output of `print(jk)` | Output annotation | | ---------------------------------- |:----------------------------------------:| | `## A join_keys object containing foreign keys between 3 datasets:` | **Title** | | `## ds1: [col_1]` | **Primary keys**
_\_: [_\_] | | `## <-- ds2: [col_1]` | **Foreign keys**
_(arrow `<--` denotes `ds1` is the parent of `ds2`)_ | | `## <-- ds3: [col_1]` | | | `## ds2: [col_1, col_2]` | | | `## --> ds1: [col_1]` | arrow `-->` denotes `ds2` is a child of `ds1` | | `## --* (implicit via parent with): ds3` | **Implicit relationship between `ds2` & `ds3`**
_(given that they share common keys with same parent)_ | | `## ds3: [col_1, col_3]` | | | `## --> ds1: [col_1]` | | | `## --* (implicit via parent with): ds2` | | | `## ds4: [no primary keys]` | | | `## <-> ds5: [col_5]` | **Foreign keys**
_(arrow `<->` denotes no parent definition between datasets)_ | | `## ds5: [no primary keys]` | | | `## <-> ds4: [col_4] ` | | ## Accessing and Modifying keys The _subset_ operator with 2 indices (`x[i, j]`) is used to retrieve the primary/foreign keys. Both indices must be a string denoting the dataset name. ```{r} # Using the jk object defined in "Anatomy of Join Keys" jk # Getting primary key of "ds1" jk["ds1", "ds1"] # Getting foreign keys between "ds4" and "ds5" jk["ds4", "ds5"] ``` Note that there is a symmetry in the keys between `ds4` and `ds5` relationship: ```{r} jk["ds5", "ds4"] jk["ds5", "ds4"] ``` When only 1 argument is used this operator will return a `join_keys` object that is filtered accordingly. ```{r} # Using the jk object defined in "Anatomy of Join Keys" jk # Getting primary key of "ds1" jk["ds1", "ds1"] # Getting keys of "ds1" and "ds2" jk[c("ds1", "ds2")] ``` Modifying or adding a key uses the same notation with the assignment operator `<-`. A symmetric relationship will be created automatically, where the parent (by default) will be the dataset defined in the first index. Assigning `NULL` value will delete the relationship. ```{r} # Adding a new ds5 <-- ds1 key jk["ds1", "ds5"] <- "a_column" # Removing an existing key jk["ds4", "ds5"] <- NULL ``` ## Merging Join Keys The combining of multiple _joining keys_ is achieved using the `c()` generic function, which generates a symmetric and valid `join_keys` object. When encountering identical relationship pairs, this operation retains the last occurrence of the pair within the specified argument order. For added convenience, the function also accommodates `join_key_set` objects created through the `join_key` function. These objects can be provided as the initial argument or in any other position as needed. ```{r} jk1 <- join_keys(join_key("ds1", "ds1", "col_1")) jk2 <- join_keys(join_key("ds2", "ds2", "col_1"), join_key("ds1", "ds2", "col_1")) # Merging c(jk1, jk2) # Keeping last occurence c(jk1, jk2, join_keys(join_key("ds2", "ds2", "col_2"), join_key("ds1", "ds2", c("col_1" = "col_2")))) # Merges join_key and join_key_set objects (from join_key function) c(jk1, join_key("ds3", "ds3", "col_3")) ``` ## Join Keys Relationships There are 2 types of relationships encoded with _joining keys_ that are described in the following sections. The _primary_ and _foreign_ keys are created explicitly using the constructor for individual keys (`join_key`). Additionally, the `join_keys` object infers implicit relationships when two datasets share foreign keys to a parent dataset, but not between themselves. These implicit relationships are available just like another foreign key and can be used to merge datasets, despite not being defined by the user. ### Primary Key with `teal_data` When using the `teal_data` function, the simplest method to define the join keys is to use the `join_keys` argument. We can specify the column(s) of the dataset that (together) uniquely identify rows in the dataset. ```{r, include=FALSE} # nolint start: commented_code_linter. ``` ```{r, message=FALSE} library(teal.data) td_pk <- within( teal_data(), ds1 <- transform(iris, id = seq_len(nrow(iris))) ) datanames(td_pk) <- "ds1" join_keys(td_pk) <- join_keys(join_key("ds1", keys = "id")) join_keys(td_pk) ``` We can extend the previous example and define primary keys for multiple datasets: ```{r, message=FALSE} td_pk <- within( td_pk, { ds2 <- data.frame(W = 10:1, V = 5:14, M = rep(1:5, 2)) ds3 <- data.frame(V = 5:14, N = 4) } ) datanames(td_pk) <- c(datanames(td_pk), "ds2", "ds3") join_keys(td_pk)["ds2", "ds2"] <- c("V", "W") join_keys(td_pk)["ds3", "ds3"] <- c("V", "W") join_keys(td_pk) ``` ```{r, include=FALSE} # nolint end: commented_code_linter. ``` ### Foreign Keys with `teal_data` When passing multiple datasets to the `teal_data` function, dataset relationships are set using `join_keys` and `join_key` functions, which then can be used to merge datasets together within `teal` apps. For users familiar with `SQL` database schema, these relationships are symmetric and not as strict as `SQL` foreign key relationships as `teal` does not validate whether the values defined as foreign key columns are present in the table. For example: ```{r} library(teal.data) td_fk <- within( teal_data(), { ds1 <- data.frame(X = 1:10, Y = 21:30, Z = 1:10) ds2 <- data.frame(W = 10:1, V = 5:14, M = rep(1:5, 2)) ds3 <- data.frame(V = 5:14, N = 4) } ) datanames(td_fk) <- c("ds1", "ds2", "ds3") join_keys(td_fk) <- join_keys( # Primary keys join_key("ds1", keys = c("X")), join_key("ds2", keys = c("V", "W")), join_key("ds3", keys = c("V")), # Foreign keys join_key("ds1", "ds2", c("X" = "W")), join_key("ds2", "ds3", c("V" = "V")) ) join_keys(td_fk) ``` ### Implicit Relationships Two datasets that share common _foreign_ keys to the same _parent_ dataset have an implicit relationship between them that is modeled and accessible in _joining keys_. This is a special relationship that is inferred from existing foreign keys. It does not need to be explicitly defined but it can be accessed and overwritten just as any other foreign key. These implicit relationships can be used to merge 2 datasets together, just as if they were defined manually. ```{r} library(teal.data) td <- within( teal_data(), { ds1 <- data.frame(X = 1:10, Y = 21:30, Z = 1:10) ds2 <- data.frame(W = 10:1, V = 5:14, M = rep(1:5, 2)) ds3 <- data.frame(V = 5:14, N = 4) ds4 <- data.frame(V = 5:14, R = rnorm(10)) } ) datanames(td) <- c("ds1", "ds2", "ds3", "ds4") join_keys(td) <- join_keys( # Primary keys join_key("ds1", keys = c("X")), join_key("ds2", keys = c("V", "W")), join_key("ds3", keys = c("V")), join_key("ds4", keys = c("V")), # Foreign keys join_key("ds1", "ds2", c("X" = "W")), join_key("ds2", "ds3", c("V" = "V")), join_key("ds1", "ds4", c("X" = "B")) ) join_keys(td) join_keys(td)["ds2", "ds4"] ``` Note that the definition above contains no `join_key` for `"ds2" <-> "ds4"` ## _Join Keys_ in ADaM Datasets `teal.data` provides a set of default join keys (primary and foreign) for datasets named according to the ADaM standard. They are stored in `default_cdisc_join_keys`. ```{r} names(default_cdisc_join_keys) |> sort() ``` When not all default keys are required users can select a smaller set of datasets by subsetting (for example: `default_cdisc_join_keys[c("ADSL", "ADTTE")]`). Please note, that for every element that is selected, its parent will also be returned (if it has one) as well as any pair-wise relationships in the resulting selection. ```{r} default_cdisc_join_keys default_cdisc_join_keys["ADSL"] default_cdisc_join_keys["ADTTE"] default_cdisc_join_keys[c("ADSL", "ADTTE", "ADRS")] ```