1 Intro to Data Manipulation
All of you have most likely used every element of this module, but we will now revisit these operators and behaviors to gain a deeper understanding, allowing us to wield them more effectively.
1.1 Subsetting
The three subsetting (or subscripting) operators available in R are `[`, `[[`, and `$`. There are also some functions such as `subset`. Each has different behaviors and caveats attached that are important when deciding which to use for the intended task.
Subscript | Effect |
---|---|
Positive Numeric Vector | selects items in those indices |
Negative Numeric Vector | selects all but those indices |
Character Vector | selects items with those names |
Logical Vector | selects all TRUE (and NA) items |
Missing | selects all items |
You can easily see how each of these works given a simple vector:
x <- c(1, 5, NA, 8, 10)
names(x) <- c("a", "b", "c", "d", "e")
x[1]
#> a
#> 1
x[-1]
#> b c d e
#> 5 NA 8 10
x[c(1:3)]
#> a b c
#> 1 5 NA
x[-c(1:3)]
#> d e
#> 8 10
x[is.na(x)]
#> c
#> NA
x[!is.na(x)]
#> a b d e
#> 1 5 8 10
x["b"]
#> b
#> 5
x[c("b", "c")]
#> b c
#> 5 NA
# Logical subsetting returns values that you give as true
x[c(TRUE, FALSE, TRUE, FALSE, FALSE)]
#> a c
#> 1 NA
# but don't forget about the recycling rules!
x[c(TRUE, FALSE)]
#> a c e
#> 1 NA 10
# if you call a specific index more than once it will be returned more than once
x[c(2, 2)]
#> b b
#> 5 5
By default, `[` simplifies the result to the lowest possible dimensionality. That is, when you select a subset of a higher-dimensional object, R coerces the result to the simplest structure that can hold it, such as a vector or a list. We will give an example of this momentarily. To stop this behavior you can use the `drop = FALSE` option.
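As a minimal sketch of the simplification described above (the object names `m` and `df` are illustrative, not from the original material), compare what `[` returns with and without `drop = FALSE`:

```r
# A 2 x 3 matrix: selecting one row simplifies to a plain vector by default
m <- matrix(1:6, nrow = 2)
row_simplified <- m[1, ]                # integer vector, no dim attribute
row_preserved  <- m[1, , drop = FALSE]  # still a 1 x 3 matrix

# The same applies when selecting a single data frame column
df <- as.data.frame(Theoph)
col_simplified <- df[1:3, "conc"]                # numeric vector
col_preserved  <- df[1:3, "conc", drop = FALSE]  # data frame with one column

dim(row_simplified)
dim(row_preserved)
class(col_simplified)
class(col_preserved)
```

Preserving the structure tends to matter most inside functions, where later code may assume it is still working with a matrix or data frame.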
For higher dimensionality objects, rows and columns are subset individually and can be combined in a single call:
Theoph[c(1:10), c("Time", "conc")]
#> Time conc
#> 1 0.00 0.74
#> 2 0.25 2.84
#> 3 0.57 6.57
#> 4 1.12 10.50
#> 5 2.02 9.66
#> 6 3.82 8.58
#> 7 5.10 8.36
#> 8 7.03 7.47
#> 9 9.05 6.89
#> 10 12.12 5.94
`[[` and `$` allow us to take out components of a list. Likewise, given that data frames are lists of columns, `[[` can be used to extract a column from a data frame.
`[[` is similar to `[`, however, it only returns a single element. `$` is shorthand for `[[` but is only available for character subsetting.
There are two additional important distinctions between `$` and `[[`:
`$` cannot be used for column names stored as a variable:
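id <- "Subject"
# looks for a column literally named "id", which does not exist, so returns NULL
Theoph$id
# evaluates id first, so this returns the Subject column
Theoph[[id]]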
`$` allows for partial matching, though relying on it is not advised given the code completion available in modern editors:
names(Theoph)
# $ allows for partial matching
head(Theoph$Sub)
head(Theoph$"Subj")
head(Theoph$Subject)
# '[[' does not
head(Theoph[["Subj"]])
Using `$` can lead to unintended consequences if you’re looking to grab a column of a certain name that isn’t there but a partial match is, especially if this is nested in a function where the problem isn’t immediately clear.
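A minimal sketch of how this can go wrong, using a hypothetical data frame `dosing` and a hypothetical helper `total_dose` (neither is part of the original material):

```r
# Hypothetical data: there is a column "dose_mg" but no column "dose"
dosing <- data.frame(id = 1:3, dose_mg = c(100, 200, 300))

# `$` silently falls back to the partial match "dose_mg"
dosing$dose

# `[[` requires an exact name by default and returns NULL instead
dosing[["dose"]]

# Inside a function the mismatch is easy to miss: this helper is meant to
# read a column called "dose" but quietly uses "dose_mg"
total_dose <- function(d) sum(d$dose)
total_dose(dosing)
```

Using `[[` in functions makes a missing column show up as `NULL` rather than as a silently substituted column.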
`[` and `[[` are both useful for different tasks. In a general sense you use them to accomplish the following:
Object | Simplifying | Preserving |
---|---|---|
Vector | x[[1]] | x[1] |
List | x[[1]] | x[1] |
Factor | x[1:4, drop = TRUE] | x[1:4] |
Array | x[1, ] , x[, 1] | x[1, , drop = FALSE] , x[, 1, drop = FALSE] |
Data frame | x[, 1] , x[[1]] | x[, 1, drop = FALSE] , x[1] |
There are benefits to each. Simplifying is often beneficial when you are just looking for a result. Preserving is often beneficial in a programming context, when you want the structure of the output to stay the same.
An easy way to think about it:
“If list x is a train carrying objects, then x[[5]] is the object in car 5; x[4:6] is a train of cars 4-6.” ([@RLangTip](http://twitter.com/#!/RLangTip/status/118339256388304896))
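The analogy can be sketched with a small list; the list `train` and its element names are made up for illustration:

```r
# A small named list standing in for the "train"
train <- list(engine = "diesel", cargo = 1:3, caboose = TRUE)

car   <- train[["cargo"]]  # the contents of one car: an integer vector
short <- train["cargo"]    # a shorter train: still a list, of length 1
cars  <- train[2:3]        # a train made of cars 2 and 3

class(car)
class(short)
names(cars)
```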
One thing to note: S3 and S4 objects can override the standard behavior of `[` and `[[` so they behave differently for different types of objects. This can be useful for controlling simplification vs preservation behavior.
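Factors are one familiar case: they are S3 objects with their own `[` method. The short sketch below (the factor `f` is illustrative) shows how that method handles levels with and without `drop = TRUE`:

```r
f <- factor(c("low", "mid", "high"))

kept    <- f[1]               # the factor method keeps all three levels
dropped <- f[1, drop = TRUE]  # unused levels are removed

levels(kept)
levels(dropped)
```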
1.1.1 Try it yourself
- Create a vector, a list, and a dataframe.
- Extract elements using `[`, `[[`, and `$`.
- What are the type and attributes of each extracted piece of information?
- Quickly brainstorm a couple of situations where these differences could be important to remember for more complex tasks.
1.2 Logical Subsetting
One of the most common ways to subset rows is to use logical subsetting.
Let’s take a look:
Theoph[Theoph$Subject == 1, ]
#> Subject Wt Dose Time conc
#> 1 1 79.6 4.02 0.00 0.74
#> 2 1 79.6 4.02 0.25 2.84
#> 3 1 79.6 4.02 0.57 6.57
#> 4 1 79.6 4.02 1.12 10.50
#> 5 1 79.6 4.02 2.02 9.66
#> 6 1 79.6 4.02 3.82 8.58
#> 7 1 79.6 4.02 5.10 8.36
#> 8 1 79.6 4.02 7.03 7.47
#> 9 1 79.6 4.02 9.05 6.89
#> 10 1 79.6 4.02 12.12 5.94
#> 11 1 79.6 4.02 24.37 3.28
We just subset the Theoph dataframe to only give us back the rows in which Subject equals 1. How does R go about doing this? Logical subsetting!
Notice what we get when we simply ask for `Theoph$Subject == 1`:
Theoph$Subject == 1
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
It doesn’t give us back the values for the rows where Subject equals 1; rather, it gives us back a vector of `TRUE` or `FALSE` values.
So, in reality, we are using the logical subsetting rules to extract the rows of the dataframe that come back `TRUE` from our logical query.
We can do this ‘by hand’ to show what’s going on:
subj1 <- ifelse(Theoph$Subject == 1, TRUE, FALSE)
subj1
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Theoph[subj1,]
#> Subject Wt Dose Time conc
#> 1 1 79.6 4.02 0.00 0.74
#> 2 1 79.6 4.02 0.25 2.84
#> 3 1 79.6 4.02 0.57 6.57
#> 4 1 79.6 4.02 1.12 10.50
#> 5 1 79.6 4.02 2.02 9.66
#> 6 1 79.6 4.02 3.82 8.58
#> 7 1 79.6 4.02 5.10 8.36
#> 8 1 79.6 4.02 7.03 7.47
#> 9 1 79.6 4.02 9.05 6.89
#> 10 1 79.6 4.02 12.12 5.94
#> 11 1 79.6 4.02 24.37 3.28
Logical subsetting is at the core of many of R’s operations. Any time you’re matching, or checking with an `is.*` function, you are building a logical vector that tests the condition you’re looking for, then using it to return the `TRUE` results.
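A short sketch of these rules in action, reusing the vector `x` from earlier and a plain data frame copy of `Theoph` (the names `df` and `high_subj1` are illustrative):

```r
x <- c(1, 5, NA, 8, 10)

# The comparison itself is just a logical vector (NA where x is NA)
x > 4

# Subsetting with it keeps the TRUE positions and returns NA for NA positions
x[x > 4]

# which() converts TRUE positions to indices and drops the NA
x[which(x > 4)]

# Conditions can be combined with & and |, and sum() counts the TRUE values
df <- as.data.frame(Theoph)
high_subj1 <- df[df$Subject == 1 & df$conc > 8, ]
nrow(high_subj1)
sum(df$Subject == 1)
```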
1.3 Common Subsetting Situations and Some Useful Functions
Now that you’ve gotten your feet wet with the basics of subsetting, let’s check in with some of the commonly used operators and functions that give us enhanced subsetting functionality.
Note: These are all commonly used together with logical subsetting:
- `%in%`
- `is.na`
- `!`
- `duplicated`
- `unique`
- `&`
- `|`
- `any`
- `all`
- `which`
Take a moment to look through the documentation for these operations. Note that to query help for operators like `%in%` or `&` you need to wrap them in quotes, like so: `?'%in%'`
This is a good chance for us to take a deeper look at both how these functions work and how to read the documentation. Pause to look at the `%in%`, `is.na`, and `which` documentation.
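As a quick illustration of the operators and functions listed above, here is a sketch on a small made-up vector of IDs (the vector `ids` is hypothetical):

```r
ids <- c(1, 1, 2, 3, 3, 3)

ids %in% c(1, 3)       # is each element found in the table?
duplicated(ids)        # has this value already appeared earlier?
unique(ids)            # the distinct values themselves
ids[!duplicated(ids)]  # the same values, obtained via logical subsetting
any(ids > 2)           # is at least one element TRUE?
all(ids > 0)           # are all elements TRUE?
which(ids == 3)        # positions of the TRUE elements
```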
It can be useful to make yourself a brief ‘cheat sheet’ of some of the common operations you use, to reference when you’re thinking about what you are trying to do and what you want your output to be.
For example:
`%in%` - compares values from the input vector with a table vector and returns TRUE/FALSE.
- The input and the table are coerced to vectors and then type-matched before comparison. Factors and lists are converted to character vectors!
- Never returns NA, making it useful for `if` conditions
- Can be slow for long lists and best avoided for complex cases
There is an interesting nugget of information in the documentation - that the input is coerced to a vector then type-matched.
So what is going on in these two situations?
ISM <- c(TRUE, FALSE, TRUE, FALSE)
ISMnums <- c(1, 0, 1, 0)
SUBJ <- c(1, 2, 3, 4)
ISM %in% SUBJ
#> [1] TRUE FALSE TRUE FALSE
which(ISM %in% SUBJ)
#> [1] 1 3
ISMnums %in% SUBJ
#> [1] TRUE FALSE TRUE FALSE
which(ISMnums %in% SUBJ)
#> [1] 1 3
ISM <- c(TRUE, FALSE, TRUE, FALSE)
SUBJ <- factor(c(1, 2, 3, 4))
ISM %in% SUBJ
#> [1] FALSE FALSE FALSE FALSE
which(ISM %in% SUBJ)
#> integer(0)
ISMnums %in% SUBJ
#> [1] TRUE FALSE TRUE FALSE
which(ISMnums %in% SUBJ)
#> [1] 1 3
Hint: look at the ID numbers.
The moral of the story is to make sure you know what is going on under the hood. Sometimes you can get the ‘right’ answer for the wrong reasons.
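One way to see why the two situations differ is to look at the conversions involved. The sketch below simply prints what a factor and a logical vector become under each conversion; the name `subj_num` is illustrative:

```r
# What a factor and a logical vector look like once converted to character
as.character(factor(c(1, 2, 3, 4)))
as.character(c(TRUE, FALSE))

# What a logical vector looks like once converted to numeric
as.numeric(c(TRUE, FALSE))

# Being explicit about the type avoids relying on implicit coercion
SUBJ <- factor(c(1, 2, 3, 4))
subj_num <- as.numeric(as.character(SUBJ))
c(1, 5) %in% subj_num
```

Against a numeric table, `TRUE` is compared as 1, which happens to match subject 1. Against a factor table, it is compared as the string "TRUE", which matches nothing.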
1.3.1 Pop quiz question
What’s wrong with this subsetting command?
dosingdf <- df[unique(df$ID), ]
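One way to investigate the question is to try the command on a small made-up data frame. The columns and values below are hypothetical, and `!duplicated()` is offered as one possible alternative rather than the only fix:

```r
df <- data.frame(ID = c(1, 1, 2, 2, 3, 3),
                 DOSE = c(10, 10, 20, 20, 30, 30))

# unique(df$ID) returns the values 1, 2, 3, which `[` then treats as
# row positions, so this returns rows 1 to 3 rather than one row per ID
by_value <- df[unique(df$ID), ]
by_value$ID

# !duplicated(df$ID) is a logical vector marking the first row of each ID
first_rows <- df[!duplicated(df$ID), ]
first_rows$ID
```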
1.4 Data Manipulation
Now that we can slice and dice our data how we want - let’s examine how we can actually manipulate the data
Our goal for this section is to be able to:
- rename columns systematically
- reorder and rearrange rows and columns to our needs
- create new columns
1.5 Renaming Columns
Quick renaming of columns can be easily accomplished using the dplyr `rename` command (originally from plyr). Note that dplyr’s version takes `new = old` pairs, with the structure:
dataframe <- rename(dataframe, newcolname1 = oldcolname1, ...)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
names(Theoph)
#> [1] "Subject" "Wt" "Dose" "Time" "conc"
Theoph <- rename(Theoph, ID = Subject)
names(Theoph)
#> [1] "ID" "Wt" "Dose" "Time" "conc"
This can also be done directly without dplyr, however it is a bit more verbose:
names(Theoph)[names(Theoph) == "Subject"] <- "ID"
Pause for a second. Given what we’ve learned about subsetting, what is going on in the way we’ve constructed the renaming?
Answer: `names(Theoph)` creates a vector of names. `names(Theoph) == "Subject"` produces a logical vector that is used to subset the names vector, identifying which index matches the query. `<- "ID"` then assigns a new value to the matching index location(s).
Column names can be directly accessed using the `colnames` function (or, for dataframes or lists, simply `names`), and you can rename them all directly by giving it a vector of names.
colnames(Theoph) <- c("hello", "there")
colnames(Theoph)
#> [1] "hello" "there" NA NA NA
rm(Theoph)
As you can see, this can be dangerous if you do not give a vector of the proper length: the names are not recycled, and the remaining columns are left with `NA` names. Likewise, if the order of columns changes unexpectedly, your vector could rename columns incorrectly.
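Two defensive patterns can reduce these risks. This is a sketch only; the new column names and the objects `df`, `df2`, `old`, and `new` are made up for illustration:

```r
df <- as.data.frame(Theoph)

# Guard against a names vector of the wrong length before assigning
new_names <- c("ID", "WT", "DOSE", "TIME", "DV")
stopifnot(length(new_names) == ncol(df))
names(df) <- new_names

# Or rename by matching on the old names, so column order does not matter
df2 <- as.data.frame(Theoph)
old <- c("conc", "Subject")
new <- c("DV", "ID")
names(df2)[match(old, names(df2))] <- new
names(df2)
```

The second pattern uses `match()` to look up each old name’s position, so it keeps working if the columns are later rearranged.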
There are a couple of things that directly accessing all the column names can be useful for, though.
For example, capitalization of columns can often be inconsistent and frustrating. This can be quickly fixed by converting all column names to uppercase or lowercase using `toupper()` and `tolower()`:
names(Theoph)
#> [1] "Subject" "Wt" "Dose" "Time" "conc"
names(Theoph) <- toupper(names(Theoph))
names(Theoph)
#> [1] "SUBJECT" "WT" "DOSE" "TIME" "CONC"
rm(Theoph)
1.6 Reordering Rows/Columns
When reordering columns in a dataframe you can actually think of it as creating a new dataframe in which the columns get created in the way you order them.
newTheoph <- Theoph[c("Subject", "Time", "conc", "Dose", "Wt")]
head(Theoph)
#> Subject Wt Dose Time conc
#> 1 1 79.6 4.02 0.00 0.74
#> 2 1 79.6 4.02 0.25 2.84
#> 3 1 79.6 4.02 0.57 6.57
#> 4 1 79.6 4.02 1.12 10.50
#> 5 1 79.6 4.02 2.02 9.66
#> 6 1 79.6 4.02 3.82 8.58
head(newTheoph)
#> Subject Time conc Dose Wt
#> 1 1 0.00 0.74 4.02 79.6
#> 2 1 0.25 2.84 4.02 79.6
#> 3 1 0.57 6.57 4.02 79.6
#> 4 1 1.12 10.50 4.02 79.6
#> 5 1 2.02 9.66 4.02 79.6
#> 6 1 3.82 8.58 4.02 79.6
Most of the time, since you don’t want to create a new data frame every time you reorder, you can simply overwrite the old data frame:
Theoph <- Theoph[c("Subject", "Time", "conc", "Dose", "Wt")]
Just like all other subsetting we’ve gone over, we can also organize by index:
Theoph <- Theoph[c(1, 4, 5, 3, 2)]
I suggest against it unless you have good reason (i.e. you know your data will always have a specific structure). Even then, while numeric indices are faster to type than names, they make the code harder to read and later modifications more tedious to keep track of.
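The same subsetting ideas extend to rows. As a minimal sketch (the names `df`, `by_conc`, and `conc_first` are illustrative), `order()` returns row positions in sorted order, and `setdiff()` can help move a column by name without listing every other column:

```r
df <- as.data.frame(Theoph)

# Reorder rows: order() returns the row positions in sorted order
by_conc <- df[order(df$conc, decreasing = TRUE), ]
head(by_conc, 3)

# Move one column to the front by name, keeping the rest in their current order
conc_first <- df[c("conc", setdiff(names(df), "conc"))]
names(conc_first)
```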
1.7 Keys
Similar to columns, rows also have names. As you slice and dice and reorder a dataset the row names can get pretty ugly, so if the need arises, they can be reset with `rownames(df) <- NULL`.
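A small sketch of what happens to row names after subsetting, using a plain data frame copy of `Theoph` (the names `df`, `sub`, `before`, and `after` are illustrative):

```r
df <- as.data.frame(Theoph)

# Subsetting keeps the original row names
sub <- df[df$Subject == 1 & df$conc > 8, ]
before <- rownames(sub)
before

# Resetting them renumbers the rows from 1
rownames(sub) <- NULL
after <- rownames(sub)
after
```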
Rows can always be referred to by their name. They are also structurally distinguishable by their content.
A key is the set of columns that can uniquely distinguish every row. Different datasets can have keys of varying complexity (number of columns).
A basic dosing dataset key may be as simple as ID; however, for a cross-over clinical trial a dataset may be keyed on ID, time, and cohort.
The general relationship between key columns and other columns is:
Key columns represent unique objects (persons, groups, sites, etc.) and the other columns should characterize these objects.
For example, a person might be the key for different concentration, time, weight, etc. measurements. Thus, if you are ‘extracting’ information, you’d likely want it based on the key column (for example, max concentration by individual).
R does not explicitly recognize keys - it is up to you to keep track. Keys become increasingly important when delving into advanced data manipulation. Using dplyr’s `group_by` makes working with keys much more straightforward than in base R.
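Since keeping track of keys is left to you, it can help to check a candidate key explicitly. The sketch below uses only base R so that it is self-contained; the choice of `Subject` plus `Time` as the key for `Theoph`, and the names `df` and `cmax`, are assumptions made for illustration:

```r
df <- as.data.frame(Theoph)

# Is a candidate key unique? anyDuplicated() returns 0 when no row repeats
anyDuplicated(df["Subject"])             # non-zero: Subject alone is not a key
anyDuplicated(df[c("Subject", "Time")])  # 0: Subject + Time identifies each row

# Summarize by key: maximum concentration for each Subject
cmax <- tapply(df$conc, df$Subject, max)
cmax[["1"]]
```

A grouped summary in dplyr would express the same per-subject calculation; `tapply()` is used here only to avoid depending on an extra package.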