version 0.03.01 “2017-09-20 07:44:23 PDT”

1 Introduction

Personal predictive models use an individual’s covariates to assign him/her a probability of developing a specific disease or outcome within a specified future time period [0, t_star] and before a specified competing risk. Predictive models are often evaluated by comparing their assigned risks to outcome occurence among participants in a longitudinal cohort study. Performance of such models can be evaluated using two criteria: calibration and concordance. Model calibration (also called goodness-of-fit) measures how well the model-assigned risks agree with subsequent observed outcomes. rmap provides a grouped goodness-of-fit test and grouped and individualized attribute diagrams. Concordance (also called the area under the ROC curve) measures how well a model separates or discriminates positive and negative outcomes.rmap provides estimation and confidence intervals for the concordance as well as ROC plots. As of version 0.03.00, rmap handles three designs for obtaining samples of participants: (1) random sampling, (2) two-stage sampling, and (3) weighted designs.

Below I show some R code for reading in some simulated data, calling some rmap functions, and displaying rmap in action: estimating concordance, producing ROC plots, testing for grouped goodness-of-fit, and producing grouped and individualized attribute diagrams.

library(rmap)
xxx = read.csv(file = "random_sample_example.csv", stringsAsFactors = FALSE)
e = xxx$e
t = xxx$t
r = xxx$r
t_star = 10
K = 4
N_bootstraps = 100 
set.seed(1)
rmap_answers = rmap_random_sample(e, t, r, t_star, K, N_bootstraps)
individual = individual_random_sample(e, t, r, t_star, N_bootstraps)
the_message = paste("rmap in action", sep = "\n")
grid.arrange(textGrob(the_message), 
             pretty_risk_plot(rmap_answers), 
             pretty_roc_plot(rmap_answers), 
             pretty_individual_risk_plot(individual),
             ncol = 2)

In the examples in this webpage, I use N_bootstraps = 100 so you can run through them relatively quickly. When you perform your own analyses, I suggest starting with N_bootstraps = 100 for organizing your analyses, and then changing to higher numbers to obtain your final results.

The html page that you are looking at was created in RStudio. You may download the source RStudio file rmap-tutorial.Rmd, open it in RStudio and follow along. All RStudio code chunks except those inside A repeat of the previous example, but with different variable names and An example that compares two risk models are “self-contained. This means that except in these two sections, you can try out any code chunk by stepping through any one. Just start at the beginning of the code chunk and send each line to R. You will not need any definitions from a previous code chunk. To evaluate code inside the two exception sections, start in the beginning of the first code chunk each section. In general, the code in each section is self-contained. To install RStudio use the link provided in the next section.

The following sections describe how to install rmap, rmap inputs, how to interpret rmap outputs, and some examples.

TOP

2 Installing rmap

If you have not yet installed R, download the latest version (at least 3.4.0) for your operating system at:

http://www.r-project.org

One of the required steps will be to install devtools. Before that, if your operating system is Windows, you will need to install Rtools, and if your operating system is Mac OS, you will need to install Xcode and command-line tools:

https://cran.r-project.org/bin/windows/Rtools/

https://developer.apple.com/xcode/

http://osxdaily.com/2014/02/12/install-command-line-tools-mac-os-x/

Run the R application. To install the rmap package, enter the following lines of code to the R prompt:

install.packages("devtools")
library(devtools)
install_github("gailg/rmap")
if("rmap" %in% rownames(installed.packages())){
  print("rmap installed successfully--you are good to go!")
} else {
  print("something went wrong--ask for help")
}

If your installation was successful, you should see the message

[1] "rmap installed successfully--you are good to go!"

To install RStudio, follow the instructions here: https://www.rstudio.com/products/rstudio/download/

TOP

3 Data preparation

You need a data set, containing, say N individuals, and for the n-th person, you have the following data:

3.1 e

The event for the individual. Record e = 1 if the outcome of interest was observed to occur before being censored and before the competing event. Record e = 2 if the competing risk was observed to occur before the outcome and before being censored. And record e = 0 if the individual was censored before outcome and competing risk. Acceptable values are 0, 1, 2.

3.2 t

The time until the event e. Mathematically speaking, let t1 be time until outcome, t2 time until competing risk, t0 time until censoring . Then t is the minimum of t1, t2, t0; and e above, records which of the three kinds of events occurred first. Acceptable values for t are positive real numbers (that can exceed t_star the right endpoint of the time period of interest).

3.3 r

The risk assigned by the personal predictive model. This is the probability of the outcome of interest occuring before the competing risk and before t_star. Acceptable values are real numbers between 0 and 1, including 0 and 1.

3.4 category (maybe)

If your data were obtained by two-stage sampling, category is the two-stage category indicating which two-stage category the person fell into. (See also N_first_stage in Other data you may need.) If your data are a cohort sample whose relevant covariates do not match those of your target population, and you would like to weight your cohort sample based on covariate information from a target population, category records which covariate category the (cohort) individual fell into. (See also N_target or target_category in Other data you may need.) If your data were obtained by a random sample, you will not need category.

4 Reading in data

If your data is an excel file with columns e, t, r, and possibly category, save it as a CSV (comma separated value) file to your working directory. For practice you can download my CSV file random_sample_example.csv, save it to your working directory, and try out the following commands. If you get the same results I get below, you are in business.

xxx = read.csv(file = "random_sample_example.csv", stringsAsFactors = FALSE)
head(xxx)
##   e    t      r
## 1 2 5.40 0.5055
## 2 2 0.30 0.6004
## 3 0 2.43 0.4809
## 4 0 0.74 0.7456
## 5 1 2.88 0.6170
## 6 0 0.66 0.4827

TOP

5 Other data you may need

If your data were obtained using a two-stage sample design, you will need N_first_stage. If your data were obtained using a weighted design, you will need either N_target or target_category. Otherwise, you have a random sample and you can skip this section.

5.1 N_first_stage

If you obtained your data using a two-stage sample, you obtained a random sample of your population (the first stage sample), recording information that was relatively easy to obtain and that places individuals into two or more categories. And then you oversampled or undersampled these categories to obtain the N individuals whose values of e, t, r, and category you have recorded. N_first_stage counts the number of individuals from the first stage sample that fell in each category. If your categories were A and B, and in the first stage 123 fell in category A and 456 fell in category B, then you would use for N_first_stage the named vector c(A = 123, B = 456). Each name in N_first_stage must have at least one representative in category.

5.2 N_target or target_category

If covariates of the participants of the cohort study differ from those of the population for which your predictive model is targeted, you can weight the cohort participants using an additional sample, a target sample, whose covariate categories match the target population. Classify individuals in both cohort sample and target sample into covariate categories, labeled with capital letters beginning with "A", "B", etc. Let category be the vector of length N with n-th element containing the covariate category of the n-th person. Let target_category be a vector containing one element for each person in the target sample recording his/her covariate category.

If instead of providing rmap with the vector target_category, you wish to provide rmap with counts of the numbers of individuals in the target population that fell in each covariate category, let N_target be the named vector counting the number of individuals from the target sample that fell in each category. If your categories were "A" and "B", and in the first stage 123 fell in category A and 456 fell in category B, then you would use N_target to be the named vector c(A = 123, B = 456). Each name in N_target must have at least one representative in category.

TOP

6 rmap inputs

The example in the introduction calls the functions rmap_random_sample and individual_random_sample. These functions are “wrapper” functions in which I prespecified some of the options in the workhorse functions rmap and rmap_individual. If you have a random sample and are happy with the options I have chosen, all you need to do is make sure you have correctly specified the inputs for the wrapper functions (e, t, r, t_star, K, N_bootstraps, a slightly smaller set of inputs than for the workhorse functions), and then you can follow my example in Introduction. This section describes these inputs as well as other inputs when you need more than what the wrapper functions offer.

The wrapper functions rmap_random_sample and individual_random_sample assume your design is a random sample, and the confidence level of confidence intervals is to be 0.95. Additionally, rmap_random_sample assumes that you wish risk groups is to be defined using K quantiles, and the assigned risks of each risk group is to be summarized using their mean.

The following details the inputs to rmap.

6.1 e, t, r

These were described in Data preparation. If your cohort contains N individuals, then each of e, t, and r is a vector of length N, and the n-th element pertains to the n-th individual.

6.2 t_star

This positive number is the right end point of the time period of interest. The personal risk model that you are asking rmap to evaluate expresses the risk, or the probability that the outcome occurs before t_star and before the prespecified competing risk.

6.3 design

Use this input to specify your study design and give any necessary details pertinient to your study design.

If your design was a random sample, use design = "random_sample".

If your design was a two-stage sample, use design = list(category = category, N_first_stage = N_first_stage) where N_first_stage is described in Other data you may need.

If your design was biased producing a cohort sample whose relevant categories do not proportionately match those of your target population, and you have a target sample with categories target_category which records the covariate category of each person in the target sample, or N_target which counts the number of individuals in the target sample falling in each category (see Other data you may need), use design = list(category = category, target_category = target_category) or design = list(category = category, N_target = N_target).

6.4 risk_group

Use this input to specify how you wish the assigned risk is to be broken down into risk groups.

If you would like rmap to break down your sample into K approximately equal-sized groups, for example K = 4 so the risks are broken down into quartiles, you can specify risk_group = list(K = 4). (Throughout this web page, and even when I am not using qunatiles to define risk groups, I denote K to be the number of risk groups.)

If you would like to specify cut points to cut up the assigned risks, say all individuals having assigned risk r less than 0.30 in one risk group, greater than or equal 0.30 and less than 0.70 in the second risk group, and greater than or equal 0.70 in the third, you can specify risk_group = list(cutoffs = (0, 0.30, 0.70, 1)).

If you have broken down your risk groups manually and have a vector k, the n-th element taking on an integer in 1, 2, ..., K, indicating which risk group to assign to the n-th person, you can specify risk_group = list(k = k).

6.5 r_summary

r_summary allows you to choose how you would like to summarize the assigned risks in each risk group. You can specify r_summary = "mean" or r_summary = "median" with obvious result Or if your risk groups were specified using risk_group = list(cutoffs = cutoffs), with say cutoffs = c(0, 0.30, 0.70, 1), you can specify r_summary = "midpoint" to use the midpoint of each interval defined by your cutoffs, in this example, 0.15, 0.50, 0.85 would be the summary values of the three risk groups.

6.6 N_bootstraps

This nonnegative integer specifies the number of bootstraps. If N_bootstraps = 0, no bootstraps will be performed. For random samples and two-stage samples, grouped goodness-of-fit tests and concordance estimates can be gotten without bootstraps, but for inferences on concordance as well as inferences on all quantities when the design is a weighted sample, you will need bootstrapping.

The examples used in this web page use N_bootstraps = 100 so that the examples can be stepped through relatively quickly. You should use much larger numbers, for example N_bootstraps = 400 or N_bootstraps = 1000.

6.7 confidence_level

A positive number less than 1. If you do not specify confidence_level, rmap will use 0.95.

TOP

7 rmap_individual inputs

The other workhorse function rmap_individual does not perform formal tests or inference, but provides a summary plot, an individualized attribute diagram which plots an estimate of risk at each observed assigned risk using an epsilon kernel neighborhood.

rmap_individual requires e, t, r, t_star, design, N_bootstraps, and confidence_level, which have been discussed above, together with one additional argument epsilon.

7.1 epsilon

This is a small positive number specifying the kernel neighborhoods used to calculate observed risk at each distinct assigned risk. Asymptotic theory suggests good behavior for epsilon = N^(-1/3).

TOP

8 Explaining rmap output

rmap output is broken down into two objects: numerical_summaries and plots.

library(rmap)
xxx = read.csv("random_sample_example.csv")
e = xxx$e
t = xxx$t
r = xxx$r
t_star = 10
design = "random_sample"
risk_group = list(K = 4)
r_summary = "mean"
N_bootstraps = 100
set.seed(1)
rmap_answers = rmap(e, t, r, t_star, design, risk_group, r_summary, N_bootstraps)
names(rmap_answers)
## [1] "numerical_summaries" "plots"
rmap_answers$numerical_summaries
## $concordance
##   estimate      lower      upper 
## 0.61790756 0.53944772 0.68182796 
## 
## $gof_asymptotic
##  statistic    p_value 
## 0.77240948 0.54293795 
## 
## $gof_bootstrap
##  statistic    p_value 
## 0.67450509 0.60955997 
## 
## $grouped_estimates
##   gamma_hat          r     pi_hat
## 1      0.25 0.44360667 0.40537867
## 2      0.25 0.54622000 0.57387391
## 3      0.25 0.61601333 0.52766265
## 4      0.25 0.71231067 0.64958005
## 
## $grouped_asymptotic_sds
##            sd      lower      upper in_ci
## 1 0.070702944 0.27727820 0.54780272   yes
## 2 0.072182000 0.43024256 0.70603630   yes
## 3 0.066140594 0.39906304 0.65269306   yes
## 4 0.067403305 0.50920130 0.76809369   yes
## 
## $grouped_bootstrap_sds
##            sd      lower      upper in_ci
## 1 0.072765207 0.28445228 0.57311430   yes
## 2 0.073477427 0.45170787 0.73024667   yes
## 3 0.071437034 0.39580420 0.65979942   yes
## 4 0.072397093 0.52166454 0.78910443   yes
names(rmap_answers$plots)
## [1] "df_for_roc_plot"  "df_for_risk_plot" "roc_plot"        
## [4] "risk_plot"
head(rmap_answers$plots$df_for_roc_plot)
##     one_minus_specificity  sensitivity
## 289          0.0000000000 0.0000000000
## 286          0.0000000000 0.0066321352
## 284          0.0092496924 0.0066321352
## 283          0.0188664264 0.0066321352
## 282          0.0278984790 0.0066321352
## 280          0.0278984790 0.0153300546
rmap_answers$plots$roc_plot

rmap_answers$plots$risk_plot

8.1 numerical_summaries consist of the following

  1. concordance contains the concordance estimate as well as the lower and upper bounds of a 95 percent bootstrap confidence interval (or a confidence interval with a different confidence-level if you specified it with the input confidence_level)

  2. gof_asymptotic reports the goodness of fit statistic, with definition \(\chi^2_K = \sum_{k = 1}^K \text{gamma_hat}_k \Big( \text{pi_hat}_k - \text{r}_k \Big)^2 \Big/ \text{sd}_k^2\), as well as its p_value for testing if the grouped assigned risks fit the observed cohort data. The variances (\(\text{sd}_k^2\)) in the denominator of the statistic are gotten by asymptotic theory. This object is NULL if theoretical variances are not available (i.e. when the design is weighted).

  3. gof_bootstrap is similar to gof_asymptotic but variances are gotten by bootstrapping. This object is NULL if bootstrap variances are not available (i.e when N_bootstraps = 0).

  4. grouped_estimates is a data.frame containing gamma_hat which estimates the proportion of individuals in each risk group, r which summarizes the assigned risk in each risk group (using the summary specified by r_summary). and pi_hat which estimates the probability given risk group, of getting the outcome before t_star and before the competing risk.

  5. grouped_asymptotic_sds is a data.frame containing sd the square root of the variance of pi_hat for each risk group, the lower and upper bounds of the 95 percent confidence interval for pi, and in_ci indicating whether or not the confidence interval contains r. The estimated variance and confidence intervals are calculated using asymptotic theory. This is available only for random samples and two-stage samples.

  6. grouped_bootstrap_sds is similar to grouped_asymptotic_sds, but the variance and confidence intervals are calculated using the bootstrap. This is available only if you specify N_bootstraps to be a nonzero number.

8.2 plots contain

  1. df_for_roc_plot is a data.frame that you can use to construct custom ROC plots.

  2. roc_plot contains a ROC plot (with one_minus_specificity on the x-axis and sensitivity on the y-axis).

  3. risk_plot contains a grouped attribute diagram with grouped_estimates$r plotted on the x-axis and grouped_estimates$pi_hat together with grouped_asymptotic_sds$lower and grouped_asymptotic_sds$upper, if they are available or if they are not available grouped_bootstrap_sds$lower and grouped_bootstrap_sds$upper, plotted on the y-axis.

8.3 rmap_individual output

library(rmap)
xxx = read.csv("random_sample_example.csv")
e = xxx$e
t = xxx$t
r = xxx$r
t_star = 10
design = "random_sample"
epsilon = length(e)^(-1/3)
N_bootstraps = 100
set.seed(1)
individual = rmap_individual(e, t, r, t_star, design, epsilon, N_bootstraps)
names(individual)
## [1] "df_for_risk_plot" "risk_plot"
head(individual$df_for_risk_plot)
##   assigned_risk observed_risk     lower     upper
## 1         26.75     36.518547 19.382884 60.200836
## 2         31.16     36.639417 19.451383 61.349186
## 3         32.26     34.837770 20.312142 61.866644
## 4         32.30     37.376353 19.517306 62.296359
## 5         32.99     37.376353 19.517306 62.296359
## 6         35.16     39.047342 21.435216 62.296359
individual$risk_plot

rmap_individual output contains two objects: df_for_risk_plot and risk_plot.

  1. df_for_risk_plot is a data.frame that you can use to construct custom individualized attribute diagrams.

  2. risk_plot contains an individualized attribute diagram with unique values of the input r plotted on the x-axis and epsilon kernel nearest neighbor estiamte of outcome probability on the y-axis.

TOP

9 A random sample example using the wrapper function rmap_random_sample and individual_random_sample

If you have a random sample, and you are happy using assigned risk quantiles to define your risk groups, the mean to summarize the risks in each risk group, and confidence level 0.95, you can use the wrapper functions rmap_random_sample and individual_random_sample as illustrated in Introduction

TOP

10 A weighted cohort example using the wrapper function rmap_weighted_sample

The wrapper function rmap_weighted_sample can perform an rmap analysis in which the covariates of the participants in the cohort sample do not match those of the target population, and you have a sample from the target population with covariate categories that you can compare with covariate categories of the cohort sample. This wrapper function assumes you want your risk groups to be defined by cutoffs, you want to summarize risks in each risk group using means, and the confidence level of all confidence intervals to be 0.95.

You can follow along in this example by first downloading the two data sets, weighted_example_cohort_sample.csv and weighted_example_target_sample.csv, and making sure they are in your working directory.

library(rmap)
xxx = read.csv("weighted_example_cohort_sample.csv", 
               stringsAsFactors = FALSE)
head(xxx)
##   e    t      r category
## 1 1 2.86 0.8517        C
## 2 2 0.96 0.5130        C
## 3 1 2.03 0.5482        C
## 4 1 0.54 0.3027        A
## 5 1 1.94 0.2500        A
## 6 1 2.63 0.7006        D
e = xxx$e
t = xxx$t
r = xxx$r
t_star = 10
target_sample = read.csv("weighted_example_target_sample.csv", 
                         stringsAsFactors = FALSE)
head(target_sample)
##     x_1   x_2 category
## 1 -3.17  0.15        A
## 2 -1.39  0.85        D
## 3 -3.50 -0.61        A
## 4 -2.44 -0.19        C
## 5 -1.13  0.84        D
## 6 -2.98 -0.03        A
target_category = target_sample$category
category = xxx$category
cutoffs = c(0, 0.20, 1)
N_bootstraps = 100
set.seed(2)
rmap_answers = rmap_weighted_sample(
  e, t, r, category, target_category, t_star, cutoffs, N_bootstraps)
individual = individual_weighted_sample(e, t, r, category, target_category, t_star, N_bootstraps)
grid.arrange(pretty_roc_plot(rmap_answers), 
             pretty_risk_plot(rmap_answers), 
             ncol = 2)

Each of the functions rmap_random_sample, individual_random_sample, rmap_weighted_sample_fn, individual_weighted_sample is a wrapper function. You can see what is inside the definition of any R function just by sending to R the name of the function, without trailing parentheses or inputs. By examining how I constructed these wrapper functions, you can see how you can create your own customized calls to rmap.

library(rmap)
rmap_weighted_sample
## function (
##   e, t, r, category, target_category, t_star, cutoffs, N_bootstraps){
##   design = list(target_category = target_category, category = category)
##   risk_group = list(cutoffs = cutoffs)
##   r_summary = "mean"
##   confidence_level = 0.95
##   rmap_answers = rmap(e, t, r, t_star, design, risk_group, r_summary, 
##                       N_bootstraps, confidence_level)
##   rmap_answers
## }
## <environment: namespace:rmap>

11 A weighted cohort example using rmap and rmap_individual

If I didn’t have rmap_weighted_sample_fn, this is how I would get the same results. Notice how I put the the vector cutoffs into a named list risk_group and the two vectors target_category and category into the named list design.

library(rmap)
xxx = read.csv("weighted_example_cohort_sample.csv", 
               stringsAsFactors = FALSE)
e = xxx$e
t = xxx$t
r = xxx$r
t_star = 10
target_sample = read.csv("weighted_example_target_sample.csv", 
                         stringsAsFactors = FALSE)
target_category = target_sample$category
category = xxx$category
set.seed(1)
design = list(category = category, target_category = target_category)
cutoffs = c(0, 0.20, 1)
risk_group = list(cutoffs = cutoffs)
r_summary = "mean"
N_bootstraps = 100
confidence_level = 0.95
rmap_answers = rmap(e, t, r, t_star, design, risk_group, r_summary, 
                    N_bootstraps, confidence_level)
epsilon = length(e)^(-1/3)
individual = rmap_individual(e, t, r, t_star, design, epsilon,
                             N_bootstraps, confidence_level)
the_message = paste("doing it without calling rmap_weighted_sample", sep = "\n")
grid.arrange(textGrob(the_message), 
             pretty_risk_plot(rmap_answers), 
             pretty_roc_plot(rmap_answers), 
             pretty_individual_risk_plot(individual),
             ncol = 2)

TOP

12 A repeat of the previous example, but with different variable names

You may use different variable names for all inputs of R functions. I’ve repeated the code from the previous section but renamed all input. Notice that although I’ve given them different names, the inputs are in the same order.

library(rmap)
xxx = read.csv("weighted_example_cohort_sample.csv", 
               stringsAsFactors = FALSE)
my_e = xxx$e
my_t = xxx$t
my_r = xxx$r
my_t_star = 10
target_sample = read.csv("weighted_example_target_sample.csv", 
                         stringsAsFactors = FALSE)
my_target_category = target_sample$category
my_category = xxx$category
set.seed(1)
my_design = list(category = my_category, target_category = my_target_category)
my_cutoffs = c(0, 0.20, 1)
my_risk_group = list(cutoffs = my_cutoffs)
my_r_summary = "mean"
my_N_bootstraps = 100
my_confidence_level = 0.95
rmap_answers = rmap(my_e, my_t, my_r, my_t_star, my_design, my_risk_group, my_r_summary, 
                    my_N_bootstraps, my_confidence_level)
my_epsilon = length(my_e)^(-1/3)
individual = rmap_individual(my_e, my_t, my_r, my_t_star, my_design, my_epsilon,
                             my_N_bootstraps, my_confidence_level)
the_message = paste("doing it without calling rmap_weighted_sample", sep = "\n")
grid.arrange(textGrob(the_message), 
             pretty_risk_plot(rmap_answers), 
             pretty_roc_plot(rmap_answers), 
             pretty_individual_risk_plot(individual),
             ncol = 2)

The inputs my_design and my_risk_group are lists that contain additional objects. I can examine what is inside my_design

my_design
## $category
##   [1] "C" "C" "C" "A" "A" "D" "C" "C" "C" "A" "A" "A" "C" "C" "A" "C" "A"
##  [18] "C" "C" "B" "A" "C" "A" "C" "A" "A" "A" "C" "A" "A" "A" "A" "A" "C"
##  [35] "C" "C" "C" "C" "C" "A" "C" "C" "C" "C" "A" "C" "C" "C" "C" "A" "A"
##  [52] "A" "C" "C" "C" "A" "C" "C" "A" "A" "C" "C" "C" "A" "C" "A" "C" "A"
##  [69] "C" "C" "C" "A" "A" "A" "A" "C" "C" "A" "C" "C" "A" "C" "A" "A" "A"
##  [86] "C" "A" "A" "A" "C" "A" "C" "C" "C" "C" "A" "A" "C" "C" "C" "A" "C"
## [103] "C" "A" "A" "A" "A" "A" "A" "C" "C" "A" "A" "C" "A" "C" "D" "C" "C"
## [120] "C" "C" "C" "A" "A" "C" "A" "A" "C" "A" "A" "A" "A" "C" "A" "C" "C"
## [137] "C" "C" "C" "C" "A" "A" "A" "C" "A" "C" "A" "A" "C" "A" "C" "C" "C"
## [154] "A" "C" "A" "A" "A" "A" "C" "C" "C" "A" "D" "C" "C" "C" "C" "A" "C"
## [171] "D" "C" "A" "A" "C" "A" "A" "C" "A" "C" "C" "A" "A" "A" "A" "A" "D"
## [188] "C" "A" "C" "C" "C" "C" "C" "A" "A" "C" "C" "C" "A" "C" "B" "A" "C"
## [205] "D" "A" "C" "C" "C" "A" "A" "A" "A" "A" "A" "C" "D" "C" "A" "A" "A"
## [222] "A" "C" "C"
## 
## $target_category
##   [1] "A" "D" "A" "C" "D" "A" "A" "B" "A" "C" "D" "A" "A" "B" "A" "B" "A"
##  [18] "B" "D" "A" "C" "D" "D" "C" "D" "A" "D" "C" "B" "B" "C" "C" "D" "D"
##  [35] "D" "D" "B" "B" "B" "B" "C" "A" "B" "D" "B" "D" "B" "A" "B" "A" "D"
##  [52] "C" "C" "A" "A" "A" "C" "B" "A" "D" "A" "C" "B" "C" "B" "B" "B" "A"
##  [69] "B" "A" "B" "A" "C" "C" "C" "B" "D" "D" "D" "B" "D" "A" "C" "B" "C"
##  [86] "A" "C" "A" "B" "B" "C" "A" "C" "C" "D" "D" "A" "B" "A" "A" "A" "D"
## [103] "D" "D" "A" "C" "B" "B" "A" "D" "B" "C" "D" "C" "B" "D" "B" "C" "C"
## [120] "B" "C" "C" "B" "A" "D" "C" "B" "C" "A" "B" "D" "B" "B" "B" "C" "D"
## [137] "C" "B" "D" "D" "A" "B" "C" "D" "B" "B" "C" "A" "B" "B" "C" "D" "B"
## [154] "D" "A" "D" "B" "C" "A" "D" "B" "C" "C" "C" "B" "B" "B" "B" "A" "A"
## [171] "B" "D" "A" "A" "A" "C" "C" "B" "C" "A" "C" "A" "D" "C" "B" "B" "B"
## [188] "B" "D" "B" "C" "D" "B" "D" "A" "D" "C" "C" "A" "C" "B" "B" "D" "A"
## [205] "D" "C" "D" "D" "B" "C" "C" "D" "C" "A" "C" "A" "B" "D" "D" "C" "C"
## [222] "B" "D" "B" "B" "B" "A" "B" "A" "B" "A" "C" "B" "B" "B" "D" "D" "B"
## [239] "B" "C" "C" "B" "B" "D" "B" "B" "D" "B" "A" "D" "A" "A" "C" "A" "B"
## [256] "A" "A" "B" "D" "B" "B" "C" "C" "A" "D" "D" "C" "A" "B" "A" "B" "B"
## [273] "A" "D" "D" "D" "A" "C" "D" "A" "C" "B" "A" "C" "A" "B" "C" "D" "D"
## [290] "A" "B" "D" "C" "C" "C" "B" "D" "D" "D" "D" "C" "D" "D" "C" "A" "A"
## [307] "C" "B" "B" "C" "B" "D" "C" "A" "D" "C" "D" "A" "B" "D" "A" "C" "D"
## [324] "D" "A" "A" "C" "B" "B" "A" "B" "B" "A" "B" "D" "A" "B" "D" "C" "D"
## [341] "B" "A" "A" "C" "B" "A" "A" "B" "C" "D" "B" "C" "D" "A" "B" "D" "D"
## [358] "A" "D" "C" "B" "C" "D" "C" "B" "C" "B" "A" "C" "A" "C" "D" "B" "D"
## [375] "A" "D" "B" "C" "D" "B" "D" "C" "C" "C" "D" "D" "D" "A" "B" "B" "B"
## [392] "D" "A" "C" "C" "A" "A" "D" "A" "B"

I see that my_design is a list that contains two objects, named category and target_category. These objects get their names and what is inside from their definitions inside the list my_design = list(category = my_category, target_category = my_target_category). The first object is named category, and what is inside category is my_category

my_category
##   [1] "C" "C" "C" "A" "A" "D" "C" "C" "C" "A" "A" "A" "C" "C" "A" "C" "A"
##  [18] "C" "C" "B" "A" "C" "A" "C" "A" "A" "A" "C" "A" "A" "A" "A" "A" "C"
##  [35] "C" "C" "C" "C" "C" "A" "C" "C" "C" "C" "A" "C" "C" "C" "C" "A" "A"
##  [52] "A" "C" "C" "C" "A" "C" "C" "A" "A" "C" "C" "C" "A" "C" "A" "C" "A"
##  [69] "C" "C" "C" "A" "A" "A" "A" "C" "C" "A" "C" "C" "A" "C" "A" "A" "A"
##  [86] "C" "A" "A" "A" "C" "A" "C" "C" "C" "C" "A" "A" "C" "C" "C" "A" "C"
## [103] "C" "A" "A" "A" "A" "A" "A" "C" "C" "A" "A" "C" "A" "C" "D" "C" "C"
## [120] "C" "C" "C" "A" "A" "C" "A" "A" "C" "A" "A" "A" "A" "C" "A" "C" "C"
## [137] "C" "C" "C" "C" "A" "A" "A" "C" "A" "C" "A" "A" "C" "A" "C" "C" "C"
## [154] "A" "C" "A" "A" "A" "A" "C" "C" "C" "A" "D" "C" "C" "C" "C" "A" "C"
## [171] "D" "C" "A" "A" "C" "A" "A" "C" "A" "C" "C" "A" "A" "A" "A" "A" "D"
## [188] "C" "A" "C" "C" "C" "C" "C" "A" "A" "C" "C" "C" "A" "C" "B" "A" "C"
## [205] "D" "A" "C" "C" "C" "A" "A" "A" "A" "A" "A" "C" "D" "C" "A" "A" "A"
## [222] "A" "C" "C"

When you hand my_design to rmap, the names of the objects target and target_category signal to rmap that you want to do a weighted analysis and and the objects category contains the covariate categories of the cohort sample and target_category contains the covariate categories of the target sample.

Similarly I can examine my_risk_group

my_risk_group
## $cutoffs
## [1] 0.0 0.2 1.0

I see that my_risk_group is a list that contains one object named cutoffs. This object gets its name and what is inside it from its definition in the list my_risk_group = list(cutoffs = my_cutoffs). The object inside is named cutoffs, and what is inside cutoffs is the same as my_cutoffs. When you hand my_risk_group to rmap, the name of the object cutoffs tells rmap that the contents should be interpreted as cut points.

my_cutoffs
## [1] 0.0 0.2 1.0

TOP

13 A random-sample example using rmap and rmap_individual

random_sample_example.csv

library(rmap)
xxx = read.csv("random_sample_example.csv")
e = xxx$e
t = xxx$t
r = xxx$r
t_star = 10
design = "random_sample"
risk_group = list(K = 4)
r_summary = "mean"
N_bootstraps = 100
set.seed(1)
rmap_answers = rmap(e, t, r, t_star, design, risk_group, r_summary, N_bootstraps)
epsilon = length(e)^(-1/3)
individual = rmap_individual(e, t, r, t_star, design, epsilon, N_bootstraps)
the_message = paste("rmap on random sample", sep = "\n")
grid.arrange(textGrob(the_message), 
             pretty_risk_plot(rmap_answers), 
             pretty_roc_plot(rmap_answers), 
             pretty_individual_risk_plot(individual),
             ncol = 2)

14 a two-stage sample example using rmap and rmap_individual

two_stage_sample_example.csv

library(rmap)
xxx = read.csv("two_stage_sample_example.csv", stringsAsFactors = FALSE)
e = xxx$e
t = xxx$t
r = xxx$r
t_star = 10
N_first_stage = c(A = 132, B = 168)
category = xxx$category
design = list(category = category, N_first_stage = N_first_stage)
risk_group = list(K = 3)
r_summary = "mean"
N_bootstraps = 100
set.seed(3)
rmap_answers = rmap(e, t, r, t_star, design, risk_group, r_summary, N_bootstraps)
epsilon = length(e)^(-1/3)
individual = rmap_individual(e, t, r, t_star, design, epsilon, N_bootstraps)
the_message = paste("rmap on two-stage-sample")
grid.arrange(textGrob(the_message), 
             pretty_risk_plot(rmap_answers), 
             pretty_roc_plot(rmap_answers), 
             pretty_individual_risk_plot(individual),
             ncol = 2)

15 An example that compares two risk models

two_model_comparison_example.csv

15.1 Read data and run rmap and rmap_individual on each model

library(rmap)
xxx = read.csv("two_model_comparison_example.csv")
head(xxx)
##   e    t     r1     r2
## 1 1 1.11 0.8002 0.8205
## 2 2 1.79 0.4391 0.3795
## 3 1 1.69 0.4975 0.5169
## 4 1 3.99 0.5308 0.4506
## 5 2 1.45 0.4652 0.4128
## 6 2 2.15 0.4679 0.4262
e = xxx$e
t = xxx$t
r1 = xxx$r1
r2 = xxx$r2
t_star = 10
design = "random_sample"
risk_group = list(K = 4)
r_summary = "mean"
N_bootstraps = 100
rmap_1 = rmap(e, t, r1, t_star, design, risk_group, r_summary, N_bootstraps)
rmap_2 = rmap(e, t, r2, t_star, design, risk_group, r_summary, N_bootstraps)
epsilon = nrow(xxx)^(-1/3)
risk_group = list(epsilon = epsilon)
individual_1 = rmap_individual(e, t, r1, t_star, design, risk_group, N_bootstraps)
individual_2 = rmap_individual(e, t, r2, t_star, design, risk_group, N_bootstraps)
rmap_1$numerical_summaries
## $concordance
##   estimate      lower      upper 
## 0.63761916 0.60365961 0.67776946 
## 
## $gof_asymptotic
##  statistic    p_value 
## 1.01610754 0.39735476 
## 
## $gof_bootstrap
##  statistic    p_value 
## 1.04105560 0.38423595 
## 
## $grouped_estimates
##   gamma_hat         r     pi_hat
## 1      0.25 0.4355136 0.45236268
## 2      0.25 0.5416832 0.49696358
## 3      0.25 0.6143128 0.55264638
## 4      0.25 0.7124136 0.72332186
## 
## $grouped_asymptotic_sds
##            sd      lower      upper in_ci
## 1 0.037580869 0.38025595 0.52652607   yes
## 2 0.039119764 0.42095456 0.57311321   yes
## 3 0.039448222 0.47467859 0.62810837   yes
## 4 0.032455091 0.65546335 0.78225513   yes
## 
## $grouped_bootstrap_sds
##            sd      lower      upper in_ci
## 1 0.035550750 0.38539789 0.51276520   yes
## 2 0.039732544 0.41835472 0.57789113   yes
## 3 0.038579876 0.47637754 0.61901071   yes
## 4 0.031768492 0.65771376 0.79417079   yes
names(rmap_1$plots)
## [1] "df_for_roc_plot"  "df_for_risk_plot" "roc_plot"        
## [4] "risk_plot"
names(individual_1)
## [1] "df_for_risk_plot" "risk_plot"

15.2 Create a ROC plot comparing the two models

roc_1 = rmap_1$plots$df_for_roc_plot
roc_2 = rmap_2$plots$df_for_roc_plot
head(roc_1)
##     one_minus_specificity  sensitivity
## 873          0.0000000000 0.0000000000
## 872          0.0000000000 0.0018047327
## 871          0.0041063929 0.0018047327
## 870          0.0041063929 0.0039918629
## 869          0.0041063929 0.0058486117
## 868          0.0041063929 0.0079045932
roc_1$model = rep("Model 1 is super duper", nrow(roc_1))
roc_2$model = rep("Model 2", nrow(roc_2))
df = rbind(roc_1, roc_2)
ggplot(df, aes(x = one_minus_specificity, y = sensitivity, color = model)) +
  geom_step() + 
  geom_abline(slope = 1, intercept = 0, color = "gray", linetype = 2) +
  scale_color_manual(values = c("blue", "red")) +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(0.2, 0.9)) +
  ggtitle("ROC plots for assigned risk Models 1 and 2")

15.3 Create an attribute diagram comparing the two models

risk_1 = rmap_1$plots$df_for_risk_plot
risk_2 = rmap_2$plots$df_for_risk_plot
head(risk_1)
##          r    pi_hat     lower     upper
## 1 43.55136 45.236268 38.025595 52.652607
## 2 54.16832 49.696358 42.095456 57.311321
## 3 61.43128 55.264638 47.467859 62.810837
## 4 71.24136 72.332186 65.546335 78.225513
risk_1$model = rep("Model 1 is super duper", nrow(risk_1))
risk_2$model = rep("Model 2", nrow(risk_2))
df = rbind(risk_1, risk_2)
df
##          r    pi_hat     lower     upper                  model
## 1 43.55136 45.236268 38.025595 52.652607 Model 1 is super duper
## 2 54.16832 49.696358 42.095456 57.311321 Model 1 is super duper
## 3 61.43128 55.264638 47.467859 62.810837 Model 1 is super duper
## 4 71.24136 72.332186 65.546335 78.225513 Model 1 is super duper
## 5 38.09284 49.026215 41.619358 56.476075                Model 2
## 6 53.14648 51.646121 44.302436 58.919391                Model 2
## 7 63.16632 55.070842 47.412032 62.496544                Model 2
## 8 75.87008 67.363156 59.806163 74.114299                Model 2
ggplot(df, aes(x = r, y = pi_hat, ymin = lower, ymax = upper, color = model)) +
  geom_point() +
  geom_line() +
  geom_abline(slope = 1, intercept = 0, color = "gray", linetype = 2) +
    scale_color_manual(values = c("blue", "red")) +
  geom_errorbar(aes(ymin = df$lower, ymax = df$upper), width = 2) +
  xlim(0, 100) + ylim(0, 100) +
  xlab("Assigned Risk (%)") + ylab("Observed Risk (%)") +
  theme(legend.title = element_blank()) +
  theme(legend.position = c(0.2, 0.9)) +
  ggtitle("Attribute diagram comparing assigned risk Models 1 and 2")

15.4 Attribute and individualized attribute diagrams comparing the two models

p1 = rmap_1$plots$risk_plot + ggtitle("Zee Model One")
p2 = rmap_2$plots$risk_plot + ggtitle("Yep this is Model Two")
p3 = individual_1$risk_plot
p4 = individual_2$risk_plot
grobs = lapply(list(p1, p2, p3, p4), `+`, 
               theme(axis.title = element_blank(),
                     plot.title = element_text(hjust = 0.5)))
space = .5
grid.arrange(grobs[[1]] + theme( axis.text.x = element_blank(),
                                 plot.margin = unit(c(space,0,0,space), "cm")),
             grobs[[2]] + theme( axis.text = element_blank(),
                                 plot.margin = unit(c(space,space,0,0), "cm")),
             grobs[[3]] + theme( plot.margin = unit(c(0,0,space,space), "cm")), 
             grobs[[4]] + theme( axis.text.y = element_blank(),
                                 plot.margin = unit(c(0,space,space,0), "cm")), 
             left = "Observed Risks (%)",
             bottom = "Assigned Risks (%)",
             top = "Attribute and Individualized Attribute Diagrams for Models 1 and 2",
             ncol = 2)

TOP

16 Example datasets you can download

random_sample_example.csv

two_stage_sample_example.csv

weighted_example_cohort_sample.csv

weighted_example_target_sample.csv

two_model_comparison_example.csv

TOP

17 Documentation and references

“Evaluating disease prediction models using a cohort whose covariate distribution differs from that of the target population” click here

“Multi-stage sampling in genetic epidemiology”, Alice Whittemore and Jerry Halpern, January 30, 1997, Statistics in Medicine. doc-WHSMMR2016.pdf

rmap-formulas-v02.pdf

rmap-simulatedData-v01.pdf

rmap-walkthrough-v01.pdf

concordance-formulas.pdf