Using R Tidyverse

The new (and hot) way of analysing data with R is the tidyverse https://www.tidyverse.org/ which includes the online book ‘R for Data Science’. Please check it out!

Instead of dataframes we use Tibbles now. First import the data using the File menu in Rstudio - and make sure FACTORS is off and tables and columns show. This is the old way (remember) of plotting the individuals_attributes data frame

ind_attr=read.csv("data-202002/Individual_attributes.csv")
plot(ind_attr$ELISA ~ ind_attr$Time_to_diagnosis)

we want to turn ind_attr into a tibble with

library(tidyverse)
## -- Attaching packages ------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ---------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
tb = as_tibble(ind_attr)
tb
## # A tibble: 96 x 8
##    SampleID Phenotype Time_to_diagnos~   Age Gender  ELISA Location
##    <fct>    <fct>                <dbl> <int> <fct>   <dbl> <fct>   
##  1 16K0021  Slow gro~               21    21 M      6.47e3 Kilifi ~
##  2 16K0053  Slow gro~               21    33 F      5.75e3 Kilifi ~
##  3 16K0109  Slow gro~               21    36 F      4.02e3 Kilifi ~
##  4 16K0135  Slow gro~               21    29 M      3.35e3 Kilifi ~
##  5 17K0003  Slow gro~               21    40 M      3.36e4 Kilifi ~
##  6 17K0009  Slow gro~               21    33 F      1.22e5 Kilifi ~
##  7 17K0026  Slow gro~               21    31 M      2.26e4 Kilifi ~
##  8 17K0033  Slow gro~               21    32 F      1.55e4 Kilifi ~
##  9 17K0047  Slow gro~               21    22 M      2.87e4 Kilifi ~
## 10 17K0055  Slow gro~               21    32 M      3.51e3 Kilifi ~
## # ... with 86 more rows, and 1 more variable: New.phenotype <fct>

now we try the new way plotting using ggplot and tibble

ggplot(data = tb) + geom_point(mapping = aes(y=ELISA, x = Time_to_diagnosis))
## Warning: Removed 11 rows containing missing values (geom_point).

which shows that all high ELISA values are for all late diagnosis only. ELISA uses a solid-phase enzyme immunoassay (EIA) to detect the presence of a ligand (commonly a protein) in a liquid sample using antibodies directed against the protein to be measured.

Correlation

Let’s try a simple correlation. This site has some interesting ideas which we may visit later. Let’s correlate using the pipes from dplyr:

library(dplyr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
cs = cbind(tb$Time_to_diagnosis,tb$ELISA,tb$Age)
rcorr(cs)
##      [,1] [,2] [,3]
## [1,] 1.00 0.20 0.08
## [2,] 0.20 1.00 0.19
## [3,] 0.08 0.19 1.00
## 
## n
##      [,1] [,2] [,3]
## [1,]   85   85   71
## [2,]   85   85   71
## [3,]   71   71   71
## 
## P
##      [,1]   [,2]   [,3]  
## [1,]        0.0616 0.5302
## [2,] 0.0616        0.1081
## [3,] 0.5302 0.1081

From this it is clear that correlations between time to diagnosis, age and ELISA are low.

Schizont

The schizont column compares to a log transformed ELISA(?). Let’s check that

data <- read_csv("data-202003/final_chmi_covariates.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   subjectid = col_character(),
##   location = col_character(),
##   gender = col_character(),
##   thal_genotype = col_character(),
##   sickle = col_character(),
##   g6pd_genotype = col_character(),
##   phenotypes_all = col_character()
## )
## See spec(...) for full column specifications.

Now it shoud be a native tibble. Let’s plot:

ggplot(data = data) + geom_point(mapping = aes(y=schizont, x = time_to_diagnosis_last_PCR))

which does look similar to

ggplot(data = tb) + geom_point(mapping = aes(y=log(ELISA), x = Time_to_diagnosis))
## Warning: Removed 11 rows containing missing values (geom_point).

Let’s do some colouring. There are three locations.

ggplot(data = data) + geom_point(mapping = aes(y=schizont, x = time_to_diagnosis_last_PCR, color=location))

You can tell the Schizont load is higher for Kilifi South though there does not appear to be a clear relationship between time to dianosis (other than that all high values are beyond 22 days). Let’s try some model:

ggplot(data = data) + 
  geom_point(mapping = aes(y=schizont, x = time_to_diagnosis_last_PCR, colour=location)) +
  geom_smooth(mapping = aes(time_to_diagnosis_last_PCR,schizont), se=FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Now there is a trend line towards numbers of days. Very nice. Let’s see if we can split by location

ggplot(data=data, aes(time_to_diagnosis_last_PCR,schizont,colour=location)) + geom_point() + geom_smooth(method=lm,formula = y ~ splines::bs(x,3),se=FALSE)

The anti-schizont antibody is measured before infection. Diagnosis is based on PCR. Based on this figure the schizont antibody is typically lower in Kilifi North - which points at less malaria activity/infections. When schizont antibody is higher, time to diagnosis is later. This is true for all areas.

Gender shows no real effect

ggplot(data=data, aes(time_to_diagnosis_last_PCR,schizont,colour=gender)) + geom_point() + geom_smooth(method=lm,formula = y ~y,se=FALSE)
## Warning in model.matrix.default(mt, mf, contrasts): the response appeared on the
## right-hand side and was dropped
## Warning in model.matrix.default(mt, mf, contrasts): problem with term 1 in
## model.matrix: no columns are assigned
## Warning in model.matrix.default(mt, mf, contrasts): the response appeared on the
## right-hand side and was dropped
## Warning in model.matrix.default(mt, mf, contrasts): problem with term 1 in
## model.matrix: no columns are assigned