1 Intro to Data Manipulation

All of you have most likely used every single element of this module, but now we will reassess these operators and behaviors to get a deeper understanding - thus allowing us to wield them more effectively.

1.1 Subsetting

The three subsetting (or subscripting) operators available in R are [, [[, and $. There are also some functions such as subset. Each has different behaviors and caveats attached that are important when deciding which to use for the intended task.

Subscript	Effect
Positive Numeric Vector	selects items in those indices
Negative Numeric Vector	selects all but those indices
Character Vector	selects items with those names
Logical Vector	selects all TRUE (and NA) items
Missing	selects all missing

You can easily see how each of these works given a simple vector

x <- c(1, 5, NA, 8, 10)
names(x) <- c("a", "b", "c", "d", "e")
x[1]
#> a 
#> 1
x[-1]
#>  b  c  d  e 
#>  5 NA  8 10
x[c(1:3)]
#>  a  b  c 
#>  1  5 NA
x[-c(1:3)]
#>  d  e 
#>  8 10
x[is.na(x)]
#>  c 
#> NA
x[!is.na(x)]
#>  a  b  d  e 
#>  1  5  8 10
x["b"]
#> b 
#> 5
x[c("b", "c")]
#>  b  c 
#>  5 NA

# Logical subsetting returns values that you give as true
x[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
#>  a  c 
#>  1 NA
# but don't forget about the recycling rules!
x[c(TRUE, FALSE)]
#>  a  c  e 
#>  1 NA 10

# if you call a specific index more than once it will be returned more than once
x[c(2, 2)]
#> b b 
#> 5 5

By default, [ will simplify the results to the lowest possible dimensionality. That is, it will reduce any higher dimensionality object to a list or vector. This is because if you select a subset, R will coerce the result to the appropriate dimensionality. We will give an example of this momentarily. To stop this behavior you can use the drop = FALSE option.

For higher dimensionality objects, rows and columns are subset individually and can be combined in a single call

Theoph[c(1:10), c("Time", "conc")]
#>     Time  conc
#> 1   0.00  0.74
#> 2   0.25  2.84
#> 3   0.57  6.57
#> 4   1.12 10.50
#> 5   2.02  9.66
#> 6   3.82  8.58
#> 7   5.10  8.36
#> 8   7.03  7.47
#> 9   9.05  6.89
#> 10 12.12  5.94

[[ and $ allow us to take out components of the list.

Likewise, given data frames are lists of column, [[ can be used to extract a column from data frames.

[[ is similar to [, however, it only returns a single value. $ is shorthand for [[ but is only available for character subsetting

There are two additional important distinctions between $ and [[

$ can not be used for column names stored as a variable:

id <- "Subject"
Theoph$id
Theoph[[id]]

$ allows for partial matching, though this is not advised given code completion engines in this day and age

names(Theoph)
# $ allows for partial matching
head(Theoph$Sub)
head(Theoph$"Subj")
head(Theoph$Subject)

# '[[' does not
head(Theoph[["Subj"]])

Using $ can lead to unintended consequences if you’re looking to grab a column of a certain name that isn’t there but a partial match is - especially if this is nested in a function where it isn’t clear immediately

[ and [[ are both useful for different tasks. In a general sense you use them to accomplish the following:

	Simplifying	Preserving
Vector	`x[[1]]`	`x[1]`
List	`x[[1]]`	`x[1]`
Factor	`x[1:4, drop = T]`	`x[1:4]`
Array	`x[1, ]`, `x[, 1]`	`x[1, , drop = F]`, `x[, 1, drop = F]`
Data frame	`x[, 1]`, `x[[1]]`	`x[, 1, drop = F]`, `x[1]`

There are benefits for each - simplying is often beneficial when you are looking for a result. Preserving is often beneficial in the programming context when you want to keep the results and structure the same.

An easy way to think about it:

“If list x is a train carrying objects, then x[[5]] is the object in car 5; x[4:6] is a train of cars 4-6.” — [@RLangTip](http://twitter.com/#!/RLangTip/status/118339256388304896)

One thing to note: S3 and S4 objects can override the standard behavior of [ and [[ so they behave differently for different types of objects. This can be useful for controlling simplification vs preservation behavior.

1.1.1 Try it yourself

create a vector, list, and dataframe.
Extract elements using [, [[, and $
What are the type and attributes that remain for the extracted piece of information
Quickly brainstorm a couple situations that these could be important to remember for more complex tasks

1.2 Logical Subsetting

One of the most common ways to subset rows is to use logical subsetting.

Let’s take a look

Theoph[Theoph$Subject ==1,]
#>    Subject   Wt Dose  Time  conc
#> 1        1 79.6 4.02  0.00  0.74
#> 2        1 79.6 4.02  0.25  2.84
#> 3        1 79.6 4.02  0.57  6.57
#> 4        1 79.6 4.02  1.12 10.50
#> 5        1 79.6 4.02  2.02  9.66
#> 6        1 79.6 4.02  3.82  8.58
#> 7        1 79.6 4.02  5.10  8.36
#> 8        1 79.6 4.02  7.03  7.47
#> 9        1 79.6 4.02  9.05  6.89
#> 10       1 79.6 4.02 12.12  5.94
#> 11       1 79.6 4.02 24.37  3.28

We just subset the Theoph dataframe to only give us back the rows in which Subject equals 1. How does R go about doing this? Logical subsetting!

Notice what we get when we simply ask for Theoph$Subject == 1

Theoph$Subject ==1
#>   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

It doesn’t give us back the values for the rows where subject equals one, rather, it gives us back a vector of TRUE or FALSE values.

So, in reality, we are using the logical subsetting rules to extract the rows of the dataframe that come back TRUE from our logical query.

We can do this ‘by hand’ to show whats going on

subj1 <- ifelse(Theoph$Subject == 1, TRUE, FALSE)
subj1
#>   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#>  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#>  [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Theoph[subj1,]
#>    Subject   Wt Dose  Time  conc
#> 1        1 79.6 4.02  0.00  0.74
#> 2        1 79.6 4.02  0.25  2.84
#> 3        1 79.6 4.02  0.57  6.57
#> 4        1 79.6 4.02  1.12 10.50
#> 5        1 79.6 4.02  2.02  9.66
#> 6        1 79.6 4.02  3.82  8.58
#> 7        1 79.6 4.02  5.10  8.36
#> 8        1 79.6 4.02  7.03  7.47
#> 9        1 79.6 4.02  9.05  6.89
#> 10       1 79.6 4.02 12.12  5.94
#> 11       1 79.6 4.02 24.37  3.28

Logical subsetting is at the core of many of R’s operations. Any time you’re matching, or checking with is.* you are using logical subsetting to test the condition you’re looking for then returning the TRUE results

1.3 Common Subsetting Situations and Some Useful Functions

Now that you’ve gotten your feet wet with the basics of subsetting, lets check-in with some of the commonly used operators that give us some enhanced subsetting functionality.

Note: These all take advantage of logical subsetting:

Take a moment to prod through to documentation for the other operations. Note for things operations like %in% or & to query help you need to add a single quote around it like so ?'%in%'

This is a good chance for us to take a deeper look @ both how these functions work and how to read the documentation

pause to look @ %in%, is.na, and which documentation

%in%
is.na
!
duplicated
unique
&
|
any
all
which

It can be useful to make yourself a brief ‘cheat sheet’ of some of the common operations you use to reference when you’re thinking about what you are trying to do what you want your output to be.

For example:

%in% - compares values from the input vector with a table vector and returns T/F. * input/output are coerced to vectors and then type-matched before comparison. Factors, lists converted to character vectors! * Never returns NA, making it useful forif` conditions
- Can be slow for long lists and best avoided for complex cases

There is an interesting nugget of information in the documentation - that the input is coerced to a vector then type-matched.

So what is going on in these two situations?

ISM <- c(TRUE, FALSE, TRUE, FALSE)
ISMnums <- c(1, 0, 1, 0)
SUBJ <- c(1, 2, 3, 4)
ISM %in% SUBJ
#> [1]  TRUE FALSE  TRUE FALSE
which(ISM %in% SUBJ)
#> [1] 1 3
ISMnums %in% SUBJ
#> [1]  TRUE FALSE  TRUE FALSE
which(ISMnums %in% SUBJ)
#> [1] 1 3

ISM <- c(TRUE, FALSE, TRUE, FALSE)
SUBJ <- factor(c(1, 2, 3, 4))
ISM %in% SUBJ
#> [1] FALSE FALSE FALSE FALSE
which(ISM %in% SUBJ)
#> integer(0)
ISMnums %in% SUBJ
#> [1]  TRUE FALSE  TRUE FALSE
which(ISMnums %in% SUBJ)
#> [1] 1 3

hint: look @ ID numbers

The moral of the story, is make sure you know what is going on under the hood - sometimes you can get the ‘right’ answer for the wrong reasons.

1.3.1 Pop quiz question

What’s wrong with this subsetting command? dosingdf <- df[unique(df$ID),]

1.4 Data Manipulation

Now that we can slice and dice our data how we want - let’s examine how we can actually manipulate the data

Our goal for this section is to be able to: - rename columns systematically - reorder and rearrange rows and columns to our needs - create new columns

1.5 renaming columns

Quick renaming of columns can be easily accomplished using the dplyr rename command (originally from plyr) with the structure:

dataframe <- rename(dataframe, c("oldcolname1" = "newcolname1", ...))

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
names(Theoph)
#> [1] "Subject" "Wt"      "Dose"    "Time"    "conc"
Theoph <- rename(Theoph, ID = Subject)
names(Theoph)
#> [1] "ID"   "Wt"   "Dose" "Time" "conc"

This can also be done directly without PKPDmisc, however is a bit more verbose

names(Theoph)[names(Theoph)=="Subject"] <- "ID"

Pause for a second - given what we’ve learned about subsetting - what is going on based on the way we’ve constructed the renaming.

Answer: names(Theoph) creates a vector of names - the names(Theoph) == "Subject" logically subsets the vector to identify which index matches the query. <- "ID" is to assign a new value to that index location(s).

Column names can be directly accessed using the colnames function (or for dataframes or lists simply names), and you can rename them all directly by giving it a vector of names.

colnames(Theoph) <- c("hello", "there")
colnames(Theoph)
#> [1] "hello" "there" NA      NA      NA
rm(Theoph)

As you can see, this can be dangerous due to not giving the proper length vector (remember R’s recycling rule!), likewise, if the order of columns changes unexpectedly, your vector could rename columns incorrectly.

There are a couple things directly accessing all the colnames can be useful for though.

For example, capitalization of columns can often be inconsistent and frustrating. This can be quickly fixed by converting all columns to uppercase or lowercase using toupper() and tolower()

names(Theoph)
#> [1] "Subject" "Wt"      "Dose"    "Time"    "conc"
names(Theoph) <- toupper(names(Theoph))
names(Theoph)
#> [1] "SUBJECT" "WT"      "DOSE"    "TIME"    "CONC"
rm(Theoph)

1.6 reordering rows/columns

When reordering columns in a dataframe you can actually think of it as creating a new dataframe in which the columns get created in the way you order them.

newTheoph <- Theoph[c("Subject", "Time", "conc", "Dose", "Wt")]
head(Theoph)
#>   Subject   Wt Dose Time  conc
#> 1       1 79.6 4.02 0.00  0.74
#> 2       1 79.6 4.02 0.25  2.84
#> 3       1 79.6 4.02 0.57  6.57
#> 4       1 79.6 4.02 1.12 10.50
#> 5       1 79.6 4.02 2.02  9.66
#> 6       1 79.6 4.02 3.82  8.58
head(newTheoph)
#>   Subject Time  conc Dose   Wt
#> 1       1 0.00  0.74 4.02 79.6
#> 2       1 0.25  2.84 4.02 79.6
#> 3       1 0.57  6.57 4.02 79.6
#> 4       1 1.12 10.50 4.02 79.6
#> 5       1 2.02  9.66 4.02 79.6
#> 6       1 3.82  8.58 4.02 79.6

Most of the time since you don’t want to create a new data frame every time you reorder, you can simply overwrite the old data frame. Theoph <- Theoph[c("Subject", "Time", "conc", "Dose", "Wt")]

Just like all other subsetting we’ve gone over, we can also organize by index

Theoph <- Theoph[c(1, 4, 5, 3, 2)]

I suggest against it unless you have good reason. (ie you know your code will always be a specific structure) Even then, while faster to type than named indices, it makes legibility more difficult, and modification down the line more tedious trying to keep track.

1.7 Keys

Similar to columns, rows also have names. As you slice and dice and reorder a dataset it can get pretty ugly, so if the need arises, row names can be rest by rownames(df) <- NULL

Rows can always be referred by their name. They are also structurally distinguishable by their content.

A key is the set of columns that can uniquely distinguish every row. Different datasets can have keys of varying complexity (number of columns)

A basic dosing dataset key may be as simple as ID, however for a cross-over clinical trial a dataset may be keyed on ID, time, and cohort.

The general relationship between key columns and other columns is:

Key columns represent unique objects (persons, groups, sites, etc) and the other columns should characterize these objects

For example, a person might be the key column for different concentration, time, wt, etc measurements, thus if you are ‘extracting’ information you’d likely want it based on the key column, so for example by max concentration by individual)

R does not explicitly recognize keys - it is up to you to keep track. Keys become increasingly important when delving into advanced data manipulation. Using dplyr group_by makes working with keys much more straightforward than in base R.