rmap
inputsrmap_individual
inputsrmap
outputrmap_random_sample
and individual_random_sample
rmap_weighted_sample
rmap
and rmap_individual
rmap
and rmap_individual
rmap
and rmap_individual
version 0.03.01 “2017-09-20 07:44:23 PDT”
Personal predictive models use an individual’s covariates to assign him/her a probability of developing a specific disease or outcome within a specified future time period [0, t_star]
and before a specified competing risk. Predictive models are often evaluated by comparing their assigned risks to outcome occurence among participants in a longitudinal cohort study. Performance of such models can be evaluated using two criteria: calibration and concordance. Model calibration (also called goodness-of-fit) measures how well the model-assigned risks agree with subsequent observed outcomes. rmap
provides a grouped goodness-of-fit test and grouped and individualized attribute diagrams. Concordance (also called the area under the ROC curve) measures how well a model separates or discriminates positive and negative outcomes.rmap
provides estimation and confidence intervals for the concordance as well as ROC plots. As of version 0.03.00, rmap
handles three designs for obtaining samples of participants: (1) random sampling, (2) two-stage sampling, and (3) weighted designs.
Below I show some R code for reading in some simulated data, calling some rmap
functions, and displaying rmap
in action: estimating concordance, producing ROC plots, testing for grouped goodness-of-fit, and producing grouped and individualized attribute diagrams.
library(rmap)
xxx = read.csv(file = "random_sample_example.csv", stringsAsFactors = FALSE)
e = xxx$e
t = xxx$t
r = xxx$r
t_star = 10
K = 4
N_bootstraps = 100
set.seed(1)
rmap_answers = rmap_random_sample(e, t, r, t_star, K, N_bootstraps)
individual = individual_random_sample(e, t, r, t_star, N_bootstraps)
the_message = paste("rmap in action", sep = "\n")
grid.arrange(textGrob(the_message),
pretty_risk_plot(rmap_answers),
pretty_roc_plot(rmap_answers),
pretty_individual_risk_plot(individual),
ncol = 2)
In the examples in this webpage, I use N_bootstraps = 100
so you can run through them relatively quickly. When you perform your own analyses, I suggest starting with N_bootstraps = 100
for organizing your analyses, and then changing to higher numbers to obtain your final results.
The html page that you are looking at was created in RStudio. You may download the source RStudio file rmap-tutorial.Rmd, open it in RStudio and follow along. All RStudio code chunks except those inside A repeat of the previous example, but with different variable names and An example that compares two risk models are “self-contained. This means that except in these two sections, you can try out any code chunk by stepping through any one. Just start at the beginning of the code chunk and send each line to R. You will not need any definitions from a previous code chunk. To evaluate code inside the two exception sections, start in the beginning of the first code chunk each section. In general, the code in each section is self-contained. To install RStudio use the link provided in the next section.
The following sections describe how to install rmap
, rmap
inputs, how to interpret rmap
outputs, and some examples.
If you have not yet installed R, download the latest version (at least 3.4.0) for your operating system at:
One of the required steps will be to install devtools
. Before that, if your operating system is Windows, you will need to install Rtools, and if your operating system is Mac OS, you will need to install Xcode and command-line tools:
https://cran.r-project.org/bin/windows/Rtools/
https://developer.apple.com/xcode/
http://osxdaily.com/2014/02/12/install-command-line-tools-mac-os-x/
Run the R application. To install the rmap package, enter the following lines of code to the R prompt:
install.packages("devtools")
library(devtools)
install_github("gailg/rmap")
if("rmap" %in% rownames(installed.packages())){
print("rmap installed successfully--you are good to go!")
} else {
print("something went wrong--ask for help")
}
If your installation was successful, you should see the message
[1] "rmap installed successfully--you are good to go!"
To install RStudio, follow the instructions here: https://www.rstudio.com/products/rstudio/download/
You need a data set, containing, say N
individuals, and for the n
-th person, you have the following data:
e
The event for the individual. Record e = 1
if the outcome of interest was observed to occur before being censored and before the competing event. Record e = 2
if the competing risk was observed to occur before the outcome and before being censored. And record e = 0
if the individual was censored before outcome and competing risk. Acceptable values are 0, 1, 2
.
t
The time until the event e
. Mathematically speaking, let t1
be time until outcome, t2
time until competing risk, t0
time until censoring . Then t
is the minimum of t1
, t2
, t0
; and e
above, records which of the three kinds of events occurred first. Acceptable values for t
are positive real numbers (that can exceed t_star
the right endpoint of the time period of interest).
r
The risk assigned by the personal predictive model. This is the probability of the outcome of interest occuring before the competing risk and before t_star
. Acceptable values are real numbers between 0
and 1
, including 0
and 1
.
category
(maybe)If your data were obtained by two-stage sampling, category
is the two-stage category indicating which two-stage category the person fell into. (See also N_first_stage
in Other data you may need.) If your data are a cohort sample whose relevant covariates do not match those of your target population, and you would like to weight your cohort sample based on covariate information from a target population, category
records which covariate category the (cohort) individual fell into. (See also N_target
or target_category
in Other data you may need.) If your data were obtained by a random sample, you will not need category
.
If your data is an excel file with columns e
, t
, r
, and possibly category
, save it as a CSV (comma separated value) file to your working directory. For practice you can download my CSV file random_sample_example.csv, save it to your working directory, and try out the following commands. If you get the same results I get below, you are in business.
xxx = read.csv(file = "random_sample_example.csv", stringsAsFactors = FALSE)
head(xxx)
## e t r
## 1 2 5.40 0.5055
## 2 2 0.30 0.6004
## 3 0 2.43 0.4809
## 4 0 0.74 0.7456
## 5 1 2.88 0.6170
## 6 0 0.66 0.4827
If your data were obtained using a two-stage sample design, you will need N_first_stage
. If your data were obtained using a weighted design, you will need either N_target
or target_category
. Otherwise, you have a random sample and you can skip this section.
N_first_stage
If you obtained your data using a two-stage sample, you obtained a random sample of your population (the first stage sample), recording information that was relatively easy to obtain and that places individuals into two or more categories. And then you oversampled or undersampled these categories to obtain the N
individuals whose values of e
, t
, r
, and category
you have recorded. N_first_stage
counts the number of individuals from the first stage sample that fell in each category. If your categories were A and B, and in the first stage 123
fell in category A and 456
fell in category B, then you would use for N_first_stage
the named vector c(A = 123, B = 456)
. Each name in N_first_stage
must have at least one representative in category
.
N_target
or target_category
If covariates of the participants of the cohort study differ from those of the population for which your predictive model is targeted, you can weight the cohort participants using an additional sample, a target sample, whose covariate categories match the target population. Classify individuals in both cohort sample and target sample into covariate categories, labeled with capital letters beginning with "A"
, "B"
, etc. Let category
be the vector of length N
with n
-th element containing the covariate category of the n
-th person. Let target_category
be a vector containing one element for each person in the target sample recording his/her covariate category.
If instead of providing rmap
with the vector target_category
, you wish to provide rmap
with counts of the numbers of individuals in the target population that fell in each covariate category, let N_target
be the named vector counting the number of individuals from the target sample that fell in each category. If your categories were "A"
and "B"
, and in the first stage 123
fell in category A and 456
fell in category B, then you would use N_target
to be the named vector c(A = 123, B = 456)
. Each name in N_target
must have at least one representative in category
.
rmap
inputsThe example in the introduction calls the functions rmap_random_sample
and individual_random_sample
. These functions are “wrapper” functions in which I prespecified some of the options in the workhorse functions rmap
and rmap_individual
. If you have a random sample and are happy with the options I have chosen, all you need to do is make sure you have correctly specified the inputs for the wrapper functions (e
, t
, r
, t_star
, K
, N_bootstraps
, a slightly smaller set of inputs than for the workhorse functions), and then you can follow my example in Introduction. This section describes these inputs as well as other inputs when you need more than what the wrapper functions offer.
The wrapper functions rmap_random_sample
and individual_random_sample
assume your design is a random sample, and the confidence level of confidence intervals is to be 0.95
. Additionally, rmap_random_sample
assumes that you wish risk groups is to be defined using K
quantiles, and the assigned risks of each risk group is to be summarized using their mean.
The following details the inputs to rmap
.
e
, t
, r
These were described in Data preparation. If your cohort contains N
individuals, then each of e
, t
, and r
is a vector of length N
, and the n
-th element pertains to the n
-th individual.
t_star
This positive number is the right end point of the time period of interest. The personal risk model that you are asking rmap
to evaluate expresses the risk, or the probability that the outcome occurs before t_star
and before the prespecified competing risk.
design
Use this input to specify your study design and give any necessary details pertinient to your study design.
If your design was a random sample, use design = "random_sample"
.
If your design was a two-stage sample, use design = list(category = category, N_first_stage = N_first_stage)
where N_first_stage
is described in Other data you may need.
If your design was biased producing a cohort sample whose relevant categories do not proportionately match those of your target population, and you have a target sample with categories target_category
which records the covariate category of each person in the target sample, or N_target
which counts the number of individuals in the target sample falling in each category (see Other data you may need), use design = list(category = category, target_category = target_category)
or design = list(category = category, N_target = N_target)
.
risk_group
Use this input to specify how you wish the assigned risk is to be broken down into risk groups.
If you would like rmap
to break down your sample into K
approximately equal-sized groups, for example K = 4
so the risks are broken down into quartiles, you can specify risk_group = list(K = 4)
. (Throughout this web page, and even when I am not using qunatiles to define risk groups, I denote K
to be the number of risk groups.)
If you would like to specify cut points to cut up the assigned risks, say all individuals having assigned risk r
less than 0.30
in one risk group, greater than or equal 0.30
and less than 0.70
in the second risk group, and greater than or equal 0.70
in the third, you can specify risk_group = list(cutoffs = (0, 0.30, 0.70, 1))
.
If you have broken down your risk groups manually and have a vector k
, the n
-th element taking on an integer in 1, 2, ..., K
, indicating which risk group to assign to the n
-th person, you can specify risk_group = list(k = k)
.
r_summary
r_summary
allows you to choose how you would like to summarize the assigned risks in each risk group. You can specify r_summary = "mean"
or r_summary = "median"
with obvious result Or if your risk groups were specified using risk_group = list(cutoffs = cutoffs)
, with say cutoffs = c(0, 0.30, 0.70, 1)
, you can specify r_summary = "midpoint"
to use the midpoint of each interval defined by your cutoffs
, in this example, 0.15, 0.50, 0.85
would be the summary values of the three risk groups.
N_bootstraps
This nonnegative integer specifies the number of bootstraps. If N_bootstraps = 0
, no bootstraps will be performed. For random samples and two-stage samples, grouped goodness-of-fit tests and concordance estimates can be gotten without bootstraps, but for inferences on concordance as well as inferences on all quantities when the design is a weighted sample, you will need bootstrapping.
The examples used in this web page use N_bootstraps = 100
so that the examples can be stepped through relatively quickly. You should use much larger numbers, for example N_bootstraps = 400
or N_bootstraps = 1000
.
confidence_level
A positive number less than 1
. If you do not specify confidence_level
, rmap
will use 0.95
.
rmap_individual
inputsThe other workhorse function rmap_individual
does not perform formal tests or inference, but provides a summary plot, an individualized attribute diagram which plots an estimate of risk at each observed assigned risk using an epsilon kernel neighborhood.
rmap_individual
requires e
, t
, r
, t_star
, design
, N_bootstraps
, and confidence_level
, which have been discussed above, together with one additional argument epsilon
.
epsilon
This is a small positive number specifying the kernel neighborhoods used to calculate observed risk at each distinct assigned risk. Asymptotic theory suggests good behavior for epsilon = N^(-1/3)
.
rmap
outputrmap
output is broken down into two objects: numerical_summaries
and plots
.
library(rmap)
xxx = read.csv("random_sample_example.csv")
e = xxx$e
t = xxx$t
r = xxx$r
t_star = 10
design = "random_sample"
risk_group = list(K = 4)
r_summary = "mean"
N_bootstraps = 100
set.seed(1)
rmap_answers = rmap(e, t, r, t_star, design, risk_group, r_summary, N_bootstraps)
names(rmap_answers)
## [1] "numerical_summaries" "plots"
rmap_answers$numerical_summaries
## $concordance
## estimate lower upper
## 0.61790756 0.53944772 0.68182796
##
## $gof_asymptotic
## statistic p_value
## 0.77240948 0.54293795
##
## $gof_bootstrap
## statistic p_value
## 0.67450509 0.60955997
##
## $grouped_estimates
## gamma_hat r pi_hat
## 1 0.25 0.44360667 0.40537867
## 2 0.25 0.54622000 0.57387391
## 3 0.25 0.61601333 0.52766265
## 4 0.25 0.71231067 0.64958005
##
## $grouped_asymptotic_sds
## sd lower upper in_ci
## 1 0.070702944 0.27727820 0.54780272 yes
## 2 0.072182000 0.43024256 0.70603630 yes
## 3 0.066140594 0.39906304 0.65269306 yes
## 4 0.067403305 0.50920130 0.76809369 yes
##
## $grouped_bootstrap_sds
## sd lower upper in_ci
## 1 0.072765207 0.28445228 0.57311430 yes
## 2 0.073477427 0.45170787 0.73024667 yes
## 3 0.071437034 0.39580420 0.65979942 yes
## 4 0.072397093 0.52166454 0.78910443 yes
names(rmap_answers$plots)
## [1] "df_for_roc_plot" "df_for_risk_plot" "roc_plot"
## [4] "risk_plot"
head(rmap_answers$plots$df_for_roc_plot)
## one_minus_specificity sensitivity
## 289 0.0000000000 0.0000000000
## 286 0.0000000000 0.0066321352
## 284 0.0092496924 0.0066321352
## 283 0.0188664264 0.0066321352
## 282 0.0278984790 0.0066321352
## 280 0.0278984790 0.0153300546
rmap_answers$plots$roc_plot
rmap_answers$plots$risk_plot
numerical_summaries
consist of the followingconcordance
contains the concordance estimate
as well as the lower
and upper
bounds of a 95 percent bootstrap confidence interval (or a confidence interval with a different confidence-level if you specified it with the input confidence_level
)
gof_asymptotic
reports the goodness of fit statistic
, with definition \(\chi^2_K = \sum_{k = 1}^K \text{gamma_hat}_k \Big( \text{pi_hat}_k - \text{r}_k \Big)^2 \Big/ \text{sd}_k^2\), as well as its p_value
for testing if the grouped assigned risks fit the observed cohort data. The variances (\(\text{sd}_k^2\)) in the denominator of the statistic
are gotten by asymptotic theory. This object is NULL
if theoretical variances are not available (i.e. when the design is weighted).
gof_bootstrap
is similar to gof_asymptotic
but variances are gotten by bootstrapping. This object is NULL
if bootstrap variances are not available (i.e when N_bootstraps = 0
).
grouped_estimates
is a data.frame containing gamma_hat
which estimates the proportion of individuals in each risk group, r
which summarizes the assigned risk in each risk group (using the summary specified by r_summary
). and pi_hat
which estimates the probability given risk group, of getting the outcome before t_star
and before the competing risk.
grouped_asymptotic_sds
is a data.frame containing sd
the square root of the variance of pi_hat
for each risk group, the lower
and upper
bounds of the 95 percent confidence interval for pi
, and in_ci
indicating whether or not the confidence interval contains r
. The estimated variance and confidence intervals are calculated using asymptotic theory. This is available only for random samples and two-stage samples.
grouped_bootstrap_sds
is similar to grouped_asymptotic_sds
, but the variance and confidence intervals are calculated using the bootstrap. This is available only if you specify N_bootstraps
to be a nonzero number.
plots
containdf_for_roc_plot
is a data.frame that you can use to construct custom ROC plots.
roc_plot
contains a ROC plot (with one_minus_specificity
on the x-axis and sensitivity
on the y-axis).
risk_plot
contains a grouped attribute diagram with grouped_estimates$r
plotted on the x-axis and grouped_estimates$pi_hat
together with grouped_asymptotic_sds$lower
and grouped_asymptotic_sds$upper
, if they are available or if they are not available grouped_bootstrap_sds$lower
and grouped_bootstrap_sds$upper
, plotted on the y-axis.
rmap_individual
outputlibrary(rmap)
xxx = read.csv("random_sample_example.csv")
e = xxx$e
t = xxx$t
r = xxx$r
t_star = 10
design = "random_sample"
epsilon = length(e)^(-1/3)
N_bootstraps = 100
set.seed(1)
individual = rmap_individual(e, t, r, t_star, design, epsilon, N_bootstraps)
names(individual)
## [1] "df_for_risk_plot" "risk_plot"
head(individual$df_for_risk_plot)
## assigned_risk observed_risk lower upper
## 1 26.75 36.518547 19.382884 60.200836
## 2 31.16 36.639417 19.451383 61.349186
## 3 32.26 34.837770 20.312142 61.866644
## 4 32.30 37.376353 19.517306 62.296359
## 5 32.99 37.376353 19.517306 62.296359
## 6 35.16 39.047342 21.435216 62.296359
individual$risk_plot
rmap_individual
output contains two objects: df_for_risk_plot
and risk_plot
.
df_for_risk_plot
is a data.frame that you can use to construct custom individualized attribute diagrams.
risk_plot
contains an individualized attribute diagram with unique values of the input r
plotted on the x-axis and epsilon kernel nearest neighbor estiamte of outcome probability on the y-axis.
rmap_random_sample
and individual_random_sample
If you have a random sample, and you are happy using assigned risk quantiles to define your risk groups, the mean to summarize the risks in each risk group, and confidence level 0.95
, you can use the wrapper functions rmap_random_sample
and individual_random_sample
as illustrated in Introduction
rmap_weighted_sample
The wrapper function rmap_weighted_sample
can perform an rmap
analysis in which the covariates of the participants in the cohort sample do not match those of the target population, and you have a sample from the target population with covariate categories that you can compare with covariate categories of the cohort sample. This wrapper function assumes you want your risk groups to be defined by cutoffs
, you want to summarize risks in each risk group using means, and the confidence level of all confidence intervals to be 0.95
.
You can follow along in this example by first downloading the two data sets, weighted_example_cohort_sample.csv and weighted_example_target_sample.csv, and making sure they are in your working directory.
library(rmap)
xxx = read.csv("weighted_example_cohort_sample.csv",
stringsAsFactors = FALSE)
head(xxx)
## e t r category
## 1 1 2.86 0.8517 C
## 2 2 0.96 0.5130 C
## 3 1 2.03 0.5482 C
## 4 1 0.54 0.3027 A
## 5 1 1.94 0.2500 A
## 6 1 2.63 0.7006 D
e = xxx$e
t = xxx$t
r = xxx$r
t_star = 10
target_sample = read.csv("weighted_example_target_sample.csv",
stringsAsFactors = FALSE)
head(target_sample)
## x_1 x_2 category
## 1 -3.17 0.15 A
## 2 -1.39 0.85 D
## 3 -3.50 -0.61 A
## 4 -2.44 -0.19 C
## 5 -1.13 0.84 D
## 6 -2.98 -0.03 A
target_category = target_sample$category
category = xxx$category
cutoffs = c(0, 0.20, 1)
N_bootstraps = 100
set.seed(2)
rmap_answers = rmap_weighted_sample(
e, t, r, category, target_category, t_star, cutoffs, N_bootstraps)
individual = individual_weighted_sample(e, t, r, category, target_category, t_star, N_bootstraps)
grid.arrange(pretty_roc_plot(rmap_answers),
pretty_risk_plot(rmap_answers),
ncol = 2)
Each of the functions rmap_random_sample
, individual_random_sample
, rmap_weighted_sample_fn
, individual_weighted_sample
is a wrapper function. You can see what is inside the definition of any R function just by sending to R the name of the function, without trailing parentheses or inputs. By examining how I constructed these wrapper functions, you can see how you can create your own customized calls to rmap
.
library(rmap)
rmap_weighted_sample
## function (
## e, t, r, category, target_category, t_star, cutoffs, N_bootstraps){
## design = list(target_category = target_category, category = category)
## risk_group = list(cutoffs = cutoffs)
## r_summary = "mean"
## confidence_level = 0.95
## rmap_answers = rmap(e, t, r, t_star, design, risk_group, r_summary,
## N_bootstraps, confidence_level)
## rmap_answers
## }
## <environment: namespace:rmap>
rmap
and rmap_individual
If I didn’t have rmap_weighted_sample_fn
, this is how I would get the same results. Notice how I put the the vector cutoffs
into a named list risk_group
and the two vectors target_category
and category
into the named list design
.
library(rmap)
xxx = read.csv("weighted_example_cohort_sample.csv",
stringsAsFactors = FALSE)
e = xxx$e
t = xxx$t
r = xxx$r
t_star = 10
target_sample = read.csv("weighted_example_target_sample.csv",
stringsAsFactors = FALSE)
target_category = target_sample$category
category = xxx$category
set.seed(1)
design = list(category = category, target_category = target_category)
cutoffs = c(0, 0.20, 1)
risk_group = list(cutoffs = cutoffs)
r_summary = "mean"
N_bootstraps = 100
confidence_level = 0.95
rmap_answers = rmap(e, t, r, t_star, design, risk_group, r_summary,
N_bootstraps, confidence_level)
epsilon = length(e)^(-1/3)
individual = rmap_individual(e, t, r, t_star, design, epsilon,
N_bootstraps, confidence_level)
the_message = paste("doing it without calling rmap_weighted_sample", sep = "\n")
grid.arrange(textGrob(the_message),
pretty_risk_plot(rmap_answers),
pretty_roc_plot(rmap_answers),
pretty_individual_risk_plot(individual),
ncol = 2)
You may use different variable names for all inputs of R functions. I’ve repeated the code from the previous section but renamed all input. Notice that although I’ve given them different names, the inputs are in the same order.
library(rmap)
xxx = read.csv("weighted_example_cohort_sample.csv",
stringsAsFactors = FALSE)
my_e = xxx$e
my_t = xxx$t
my_r = xxx$r
my_t_star = 10
target_sample = read.csv("weighted_example_target_sample.csv",
stringsAsFactors = FALSE)
my_target_category = target_sample$category
my_category = xxx$category
set.seed(1)
my_design = list(category = my_category, target_category = my_target_category)
my_cutoffs = c(0, 0.20, 1)
my_risk_group = list(cutoffs = my_cutoffs)
my_r_summary = "mean"
my_N_bootstraps = 100
my_confidence_level = 0.95
rmap_answers = rmap(my_e, my_t, my_r, my_t_star, my_design, my_risk_group, my_r_summary,
my_N_bootstraps, my_confidence_level)
my_epsilon = length(my_e)^(-1/3)
individual = rmap_individual(my_e, my_t, my_r, my_t_star, my_design, my_epsilon,
my_N_bootstraps, my_confidence_level)
the_message = paste("doing it without calling rmap_weighted_sample", sep = "\n")
grid.arrange(textGrob(the_message),
pretty_risk_plot(rmap_answers),
pretty_roc_plot(rmap_answers),
pretty_individual_risk_plot(individual),
ncol = 2)
The inputs my_design
and my_risk_group
are lists that contain additional objects. I can examine what is inside my_design
…
my_design
## $category
## [1] "C" "C" "C" "A" "A" "D" "C" "C" "C" "A" "A" "A" "C" "C" "A" "C" "A"
## [18] "C" "C" "B" "A" "C" "A" "C" "A" "A" "A" "C" "A" "A" "A" "A" "A" "C"
## [35] "C" "C" "C" "C" "C" "A" "C" "C" "C" "C" "A" "C" "C" "C" "C" "A" "A"
## [52] "A" "C" "C" "C" "A" "C" "C" "A" "A" "C" "C" "C" "A" "C" "A" "C" "A"
## [69] "C" "C" "C" "A" "A" "A" "A" "C" "C" "A" "C" "C" "A" "C" "A" "A" "A"
## [86] "C" "A" "A" "A" "C" "A" "C" "C" "C" "C" "A" "A" "C" "C" "C" "A" "C"
## [103] "C" "A" "A" "A" "A" "A" "A" "C" "C" "A" "A" "C" "A" "C" "D" "C" "C"
## [120] "C" "C" "C" "A" "A" "C" "A" "A" "C" "A" "A" "A" "A" "C" "A" "C" "C"
## [137] "C" "C" "C" "C" "A" "A" "A" "C" "A" "C" "A" "A" "C" "A" "C" "C" "C"
## [154] "A" "C" "A" "A" "A" "A" "C" "C" "C" "A" "D" "C" "C" "C" "C" "A" "C"
## [171] "D" "C" "A" "A" "C" "A" "A" "C" "A" "C" "C" "A" "A" "A" "A" "A" "D"
## [188] "C" "A" "C" "C" "C" "C" "C" "A" "A" "C" "C" "C" "A" "C" "B" "A" "C"
## [205] "D" "A" "C" "C" "C" "A" "A" "A" "A" "A" "A" "C" "D" "C" "A" "A" "A"
## [222] "A" "C" "C"
##
## $target_category
## [1] "A" "D" "A" "C" "D" "A" "A" "B" "A" "C" "D" "A" "A" "B" "A" "B" "A"
## [18] "B" "D" "A" "C" "D" "D" "C" "D" "A" "D" "C" "B" "B" "C" "C" "D" "D"
## [35] "D" "D" "B" "B" "B" "B" "C" "A" "B" "D" "B" "D" "B" "A" "B" "A" "D"
## [52] "C" "C" "A" "A" "A" "C" "B" "A" "D" "A" "C" "B" "C" "B" "B" "B" "A"
## [69] "B" "A" "B" "A" "C" "C" "C" "B" "D" "D" "D" "B" "D" "A" "C" "B" "C"
## [86] "A" "C" "A" "B" "B" "C" "A" "C" "C" "D" "D" "A" "B" "A" "A" "A" "D"
## [103] "D" "D" "A" "C" "B" "B" "A" "D" "B" "C" "D" "C" "B" "D" "B" "C" "C"
## [120] "B" "C" "C" "B" "A" "D" "C" "B" "C" "A" "B" "D" "B" "B" "B" "C" "D"
## [137] "C" "B" "D" "D" "A" "B" "C" "D" "B" "B" "C" "A" "B" "B" "C" "D" "B"
## [154] "D" "A" "D" "B" "C" "A" "D" "B" "C" "C" "C" "B" "B" "B" "B" "A" "A"
## [171] "B" "D" "A" "A" "A" "C" "C" "B" "C" "A" "C" "A" "D" "C" "B" "B" "B"
## [188] "B" "D" "B" "C" "D" "B" "D" "A" "D" "C" "C" "A" "C" "B" "B" "D" "A"
## [205] "D" "C" "D" "D" "B" "C" "C" "D" "C" "A" "C" "A" "B" "D" "D" "C" "C"
## [222] "B" "D" "B" "B" "B" "A" "B" "A" "B" "A" "C" "B" "B" "B" "D" "D" "B"
## [239] "B" "C" "C" "B" "B" "D" "B" "B" "D" "B" "A" "D" "A" "A" "C" "A" "B"
## [256] "A" "A" "B" "D" "B" "B" "C" "C" "A" "D" "D" "C" "A" "B" "A" "B" "B"
## [273] "A" "D" "D" "D" "A" "C" "D" "A" "C" "B" "A" "C" "A" "B" "C" "D" "D"
## [290] "A" "B" "D" "C" "C" "C" "B" "D" "D" "D" "D" "C" "D" "D" "C" "A" "A"
## [307] "C" "B" "B" "C" "B" "D" "C" "A" "D" "C" "D" "A" "B" "D" "A" "C" "D"
## [324] "D" "A" "A" "C" "B" "B" "A" "B" "B" "A" "B" "D" "A" "B" "D" "C" "D"
## [341] "B" "A" "A" "C" "B" "A" "A" "B" "C" "D" "B" "C" "D" "A" "B" "D" "D"
## [358] "A" "D" "C" "B" "C" "D" "C" "B" "C" "B" "A" "C" "A" "C" "D" "B" "D"
## [375] "A" "D" "B" "C" "D" "B" "D" "C" "C" "C" "D" "D" "D" "A" "B" "B" "B"
## [392] "D" "A" "C" "C" "A" "A" "D" "A" "B"
I see that my_design
is a list that contains two objects, named category
and target_category
. These objects get their names and what is inside from their definitions inside the list my_design = list(category = my_category, target_category = my_target_category)
. The first object is named category
, and what is inside category
is my_category
my_category
## [1] "C" "C" "C" "A" "A" "D" "C" "C" "C" "A" "A" "A" "C" "C" "A" "C" "A"
## [18] "C" "C" "B" "A" "C" "A" "C" "A" "A" "A" "C" "A" "A" "A" "A" "A" "C"
## [35] "C" "C" "C" "C" "C" "A" "C" "C" "C" "C" "A" "C" "C" "C" "C" "A" "A"
## [52] "A" "C" "C" "C" "A" "C" "C" "A" "A" "C" "C" "C" "A" "C" "A" "C" "A"
## [69] "C" "C" "C" "A" "A" "A" "A" "C" "C" "A" "C" "C" "A" "C" "A" "A" "A"
## [86] "C" "A" "A" "A" "C" "A" "C" "C" "C" "C" "A" "A" "C" "C" "C" "A" "C"
## [103] "C" "A" "A" "A" "A" "A" "A" "C" "C" "A" "A" "C" "A" "C" "D" "C" "C"
## [120] "C" "C" "C" "A" "A" "C" "A" "A" "C" "A" "A" "A" "A" "C" "A" "C" "C"
## [137] "C" "C" "C" "C" "A" "A" "A" "C" "A" "C" "A" "A" "C" "A" "C" "C" "C"
## [154] "A" "C" "A" "A" "A" "A" "C" "C" "C" "A" "D" "C" "C" "C" "C" "A" "C"
## [171] "D" "C" "A" "A" "C" "A" "A" "C" "A" "C" "C" "A" "A" "A" "A" "A" "D"
## [188] "C" "A" "C" "C" "C" "C" "C" "A" "A" "C" "C" "C" "A" "C" "B" "A" "C"
## [205] "D" "A" "C" "C" "C" "A" "A" "A" "A" "A" "A" "C" "D" "C" "A" "A" "A"
## [222] "A" "C" "C"
When you hand my_design
to rmap
, the names of the objects target
and target_category
signal to rmap
that you want to do a weighted analysis and and the objects category
contains the covariate categories of the cohort sample and target_category
contains the covariate categories of the target sample.
Similarly I can examine my_risk_group
my_risk_group
## $cutoffs
## [1] 0.0 0.2 1.0
I see that my_risk_group
is a list that contains one object named cutoffs
. This object gets its name and what is inside it from its definition in the list my_risk_group = list(cutoffs = my_cutoffs)
. The object inside is named cutoffs
, and what is inside cutoffs
is the same as my_cutoffs
. When you hand my_risk_group
to rmap
, the name of the object cutoffs
tells rmap
that the contents should be interpreted as cut points.
my_cutoffs
## [1] 0.0 0.2 1.0
rmap
and rmap_individual
library(rmap)
xxx = read.csv("random_sample_example.csv")
e = xxx$e
t = xxx$t
r = xxx$r
t_star = 10
design = "random_sample"
risk_group = list(K = 4)
r_summary = "mean"
N_bootstraps = 100
set.seed(1)
rmap_answers = rmap(e, t, r, t_star, design, risk_group, r_summary, N_bootstraps)
epsilon = length(e)^(-1/3)
individual = rmap_individual(e, t, r, t_star, design, epsilon, N_bootstraps)
the_message = paste("rmap on random sample", sep = "\n")
grid.arrange(textGrob(the_message),
pretty_risk_plot(rmap_answers),
pretty_roc_plot(rmap_answers),
pretty_individual_risk_plot(individual),
ncol = 2)
rmap
and rmap_individual
library(rmap)
xxx = read.csv("two_stage_sample_example.csv", stringsAsFactors = FALSE)
e = xxx$e
t = xxx$t
r = xxx$r
t_star = 10
N_first_stage = c(A = 132, B = 168)
category = xxx$category
design = list(category = category, N_first_stage = N_first_stage)
risk_group = list(K = 3)
r_summary = "mean"
N_bootstraps = 100
set.seed(3)
rmap_answers = rmap(e, t, r, t_star, design, risk_group, r_summary, N_bootstraps)
epsilon = length(e)^(-1/3)
individual = rmap_individual(e, t, r, t_star, design, epsilon, N_bootstraps)
the_message = paste("rmap on two-stage-sample")
grid.arrange(textGrob(the_message),
pretty_risk_plot(rmap_answers),
pretty_roc_plot(rmap_answers),
pretty_individual_risk_plot(individual),
ncol = 2)
two_model_comparison_example.csv
rmap
and rmap_individual
on each modellibrary(rmap)
xxx = read.csv("two_model_comparison_example.csv")
head(xxx)
## e t r1 r2
## 1 1 1.11 0.8002 0.8205
## 2 2 1.79 0.4391 0.3795
## 3 1 1.69 0.4975 0.5169
## 4 1 3.99 0.5308 0.4506
## 5 2 1.45 0.4652 0.4128
## 6 2 2.15 0.4679 0.4262
e = xxx$e
t = xxx$t
r1 = xxx$r1
r2 = xxx$r2
t_star = 10
design = "random_sample"
risk_group = list(K = 4)
r_summary = "mean"
N_bootstraps = 100
rmap_1 = rmap(e, t, r1, t_star, design, risk_group, r_summary, N_bootstraps)
rmap_2 = rmap(e, t, r2, t_star, design, risk_group, r_summary, N_bootstraps)
epsilon = nrow(xxx)^(-1/3)
risk_group = list(epsilon = epsilon)
individual_1 = rmap_individual(e, t, r1, t_star, design, risk_group, N_bootstraps)
individual_2 = rmap_individual(e, t, r2, t_star, design, risk_group, N_bootstraps)
rmap_1$numerical_summaries
## $concordance
## estimate lower upper
## 0.63761916 0.60365961 0.67776946
##
## $gof_asymptotic
## statistic p_value
## 1.01610754 0.39735476
##
## $gof_bootstrap
## statistic p_value
## 1.04105560 0.38423595
##
## $grouped_estimates
## gamma_hat r pi_hat
## 1 0.25 0.4355136 0.45236268
## 2 0.25 0.5416832 0.49696358
## 3 0.25 0.6143128 0.55264638
## 4 0.25 0.7124136 0.72332186
##
## $grouped_asymptotic_sds
## sd lower upper in_ci
## 1 0.037580869 0.38025595 0.52652607 yes
## 2 0.039119764 0.42095456 0.57311321 yes
## 3 0.039448222 0.47467859 0.62810837 yes
## 4 0.032455091 0.65546335 0.78225513 yes
##
## $grouped_bootstrap_sds
## sd lower upper in_ci
## 1 0.035550750 0.38539789 0.51276520 yes
## 2 0.039732544 0.41835472 0.57789113 yes
## 3 0.038579876 0.47637754 0.61901071 yes
## 4 0.031768492 0.65771376 0.79417079 yes
names(rmap_1$plots)
## [1] "df_for_roc_plot" "df_for_risk_plot" "roc_plot"
## [4] "risk_plot"
names(individual_1)
## [1] "df_for_risk_plot" "risk_plot"
roc_1 = rmap_1$plots$df_for_roc_plot
roc_2 = rmap_2$plots$df_for_roc_plot
head(roc_1)
## one_minus_specificity sensitivity
## 873 0.0000000000 0.0000000000
## 872 0.0000000000 0.0018047327
## 871 0.0041063929 0.0018047327
## 870 0.0041063929 0.0039918629
## 869 0.0041063929 0.0058486117
## 868 0.0041063929 0.0079045932
roc_1$model = rep("Model 1 is super duper", nrow(roc_1))
roc_2$model = rep("Model 2", nrow(roc_2))
df = rbind(roc_1, roc_2)
ggplot(df, aes(x = one_minus_specificity, y = sensitivity, color = model)) +
geom_step() +
geom_abline(slope = 1, intercept = 0, color = "gray", linetype = 2) +
scale_color_manual(values = c("blue", "red")) +
theme(legend.title = element_blank()) +
theme(legend.position = c(0.2, 0.9)) +
ggtitle("ROC plots for assigned risk Models 1 and 2")
risk_1 = rmap_1$plots$df_for_risk_plot
risk_2 = rmap_2$plots$df_for_risk_plot
head(risk_1)
## r pi_hat lower upper
## 1 43.55136 45.236268 38.025595 52.652607
## 2 54.16832 49.696358 42.095456 57.311321
## 3 61.43128 55.264638 47.467859 62.810837
## 4 71.24136 72.332186 65.546335 78.225513
risk_1$model = rep("Model 1 is super duper", nrow(risk_1))
risk_2$model = rep("Model 2", nrow(risk_2))
df = rbind(risk_1, risk_2)
df
## r pi_hat lower upper model
## 1 43.55136 45.236268 38.025595 52.652607 Model 1 is super duper
## 2 54.16832 49.696358 42.095456 57.311321 Model 1 is super duper
## 3 61.43128 55.264638 47.467859 62.810837 Model 1 is super duper
## 4 71.24136 72.332186 65.546335 78.225513 Model 1 is super duper
## 5 38.09284 49.026215 41.619358 56.476075 Model 2
## 6 53.14648 51.646121 44.302436 58.919391 Model 2
## 7 63.16632 55.070842 47.412032 62.496544 Model 2
## 8 75.87008 67.363156 59.806163 74.114299 Model 2
ggplot(df, aes(x = r, y = pi_hat, ymin = lower, ymax = upper, color = model)) +
geom_point() +
geom_line() +
geom_abline(slope = 1, intercept = 0, color = "gray", linetype = 2) +
scale_color_manual(values = c("blue", "red")) +
geom_errorbar(aes(ymin = df$lower, ymax = df$upper), width = 2) +
xlim(0, 100) + ylim(0, 100) +
xlab("Assigned Risk (%)") + ylab("Observed Risk (%)") +
theme(legend.title = element_blank()) +
theme(legend.position = c(0.2, 0.9)) +
ggtitle("Attribute diagram comparing assigned risk Models 1 and 2")
p1 = rmap_1$plots$risk_plot + ggtitle("Zee Model One")
p2 = rmap_2$plots$risk_plot + ggtitle("Yep this is Model Two")
p3 = individual_1$risk_plot
p4 = individual_2$risk_plot
grobs = lapply(list(p1, p2, p3, p4), `+`,
theme(axis.title = element_blank(),
plot.title = element_text(hjust = 0.5)))
space = .5
grid.arrange(grobs[[1]] + theme( axis.text.x = element_blank(),
plot.margin = unit(c(space,0,0,space), "cm")),
grobs[[2]] + theme( axis.text = element_blank(),
plot.margin = unit(c(space,space,0,0), "cm")),
grobs[[3]] + theme( plot.margin = unit(c(0,0,space,space), "cm")),
grobs[[4]] + theme( axis.text.y = element_blank(),
plot.margin = unit(c(0,space,space,0), "cm")),
left = "Observed Risks (%)",
bottom = "Assigned Risks (%)",
top = "Attribute and Individualized Attribute Diagrams for Models 1 and 2",
ncol = 2)
weighted_example_cohort_sample.csv
weighted_example_target_sample.csv
“Evaluating disease prediction models using a cohort whose covariate distribution differs from that of the target population” click here
“Multi-stage sampling in genetic epidemiology”, Alice Whittemore and Jerry Halpern, January 30, 1997, Statistics in Medicine. doc-WHSMMR2016.pdf