“2018-02-23 15:55:19 GMT”

1 Introduction

For comments or questions, please contact Gail Gong at gailgongster@gmail.com

To download the RStudio source file for this webpage, click here: gatars-follow-me.Rmd

gatars (“Genetic Association Tests for Arbitrarily Related Subjects”) tests the association between a specified set of genetic markers, called target markers, and a binary or quantitative trait using subjects with any genealogical relationship by calculating the P-values of the following seven statistics: the three basic statistics given by the squared burden \(Q_B\), the SKAT \(Q_S\) and the trait-based \(Q_T\), and their four optimized linear combinations \(Q_{BS}\), \(Q_{BT}\), \(Q_{ST}\), \(Q_{BST}\). The gatars package handles only autosomal markers.

Recall the notation from the manuscript: \(G\) is the \(N \times M\) genotype matrix of target markers, \(y\) is the \(N \times 1\) column vector of subjects’ coded trait phenotypes, and \(\mu\) is the \(N \times 1\) column vector of user-specified phenotype predictions. Any linear combination of the three basic statistics can be written as a quadratic form \(Q(\alpha) = Z^T A(\alpha) Z\). Here \(Z\) is a linear function of the two vectors \(W G y\) and \(W G \mu\), \(W\) is a diagonal matrix whose user-specified diagonal components weight the target markers, \(A(\alpha)\) is a square matrix that does not depend on \(G\), and \(\alpha = \big( \alpha_B, \alpha_S, \alpha_T \big)\) is a triple of coefficients that lie in the unit triangle shown in Figure 1 of the manuscript. The limiting null distribution of \(Z\) when \(N\) is large is multivariate normal with mean and covariance matrix that involve the first two moments of \(G\).

For any given fixed \(\alpha\), the davies function from the CompQuadForm package can provide the significance level for \(Q(\alpha)\), and therefore for any fixed linear combination of the three basic statistics. However the optimized statistics depend on data-driven \(\alpha\)’s, so the nominal P-values from davies do not reflect the their true significance levels. To estimate the limiting null distribution of an optimized \(Q\)-statistic, gatars uses genome resampling: Simulate many null versions of the \(Q\)-statistic corresponding to a large number \(K\) of simulated null genotype matrices \(\tilde{G}^{(1)}, \cdots, \tilde{G}^{(K)}\). Specifically, for each \(k\), construct the \(N \times M\) matrix \(\tilde{G}^{(k)}\) by sampling one column from each sampling set, described below. Then, for the \(k\)-th simulated matrix \(\tilde{G}^{(k)}\), find the linear combination \(\alpha^{k}\) having minimal nominal significance level \(P_{\alpha^{k}}\) and store the transformed value \(X_{\alpha^{k}} = - \text{log}_{10} \big( P_{\alpha^{k}} \big)\) . Repeat this procedure \(K\) times and estimate the significance level of the observed \(Q\)-statistic to be the proportion of the \(K\) transformed values \(\Big\{X_{\alpha^k} \Big\}_{k = 1, \cdots, K}\) that exceed the observed transformed P-value corresponding to the observed \(Q\)-statistic.

To create sampling sets, gatars requires the \(N\) subjects’ genotypes at roughly 100,000 markers located throughout the autosomal genome. gatars chooses from these markers to create \(M\) sampling sets, one for each target marker, so that the following two requirements are satisfied: (1) the sampling set for the \(m\)-th target marker contains markers all of whose minor allele frequencies (MAFs) match the MAF of the \(m\)-th target marker, and (2) the markers in the sampling sets are independent of all target markers and all markers known to be trait-associated. (A MAF “matches” \(\pi_m\) if it is within \(\pi_m \times \big[1 -\) epsilon \(, 1 +\) epsilon \(\big]\).)

To perform its calculations gatars requires the following input from the user: 1) The binary or quantitative phenotype data \(y_n\), \(n = 1, \cdots, N\), of the \(N\) subjects. 2) A user-specified null trait prediction \(\mu_n\) (for example \(\mu_n\) can be the mean of the \(N\) \(y_n\) values or the value predicted from a logistic or linear regression of \(y_n\) on nongenetic covariates and/or on principal components of ancestry). 3) Coded genotypes \(g_n = \big(g_{n1}, \cdots, g_{nM} \big)\) at \(M\) target markers. 4) The pairwise inter-personal correlation coefficients summarized by the \(N \times N\) matrix \(\Psi\) in the manuscript. 5) The \(N\) subjects’ genotypes at roughly 100,000 markers located throughout the autosomal genome. 6) The positions of any autosomal markers known to be trait-associated.

TOP

2 Installing gatars

If you have not yet installed R, download the latest version (at least 3.2.4) for your operating system at:

http://www.r-project.org

Run the R application. To install the gatars package, enter the following lines of code to the R prompt:

install.packages("devtools")
library(devtools)
install_github("gailg/gatars")
if("gatars" %in% rownames(installed.packages())){
  print("gatars installed successfully--you are good to go!")
} else {
  print("something went wrong--ask for help")
}

If your installation was successful, you should see the message

[1] "gatars installed successfully--you are good to go!"

TOP

3 Data required

gatars requires quite large amounts of data, and typically, which can be organized in Plink (http://pngu.mgh.harvard.edu/~purcell/plink/). We assume that you can bring your data into R and create the following six data sets.

3.1 bim

A data.frame containing the three columns chromosome, snp, bp, andcontaining \(L\) rows corresponding to the \(M\) target markers and those used to build the sampling sets. The \(l\)-th row of bim summarizes the \(l\)-th marker and corresponds to the \(l\)-th column of genotype (described below). The column labeled chromosome contains integers between \(1\) and \(22\) (other integers may be included, but only chromosomes \(1\) through \(22\) will be used), the snp column contains character strings that name the marker, and the bp column contains integers giving the markers’ bp positions. (Because the gatars data set hotspot (described below) is given in Build hg38/GRCh38, bp must also be expressed in Build hg38/GRCh38.)

3.2 exclusion_region

A data.frame with one or several rows of the three columns chromosome, start, and end. Each row of exclusion_region reflects one contiguous genomic region known to be trait-associated and therefore a region used by gatars when creating the sampling sets. The column chromosome is an integer between 1 and 22 identifying the autosomal chromosome containing the region, and start and end describe its starting and ending positions (bp). If the region consists of a single marker, then start and end are both equal to the position of this marker. start and end must be expressed in Build hg38/GRCh38.

In addition, gatars provides the data set hotspot for your convenience. hotspot contains the columnschromosome,start, andend;chromosomegives the chromosome (an integer between $1$ and $22$) containing a given recombination hotspot;startandend` give the extent of the recombination hotspot (integers reflecting their Build hg38/GRCh38 bp positions).

3.3 genotype

An \(N \times L\) matrix, whose \((n,l)\)-th element records the \(n\)-th subject’s coded genotype at the \(l\)-th of \(L\) markers. The \(l\)-th column of genotype corresponds to the \(l\)-th row of bim. The object genotype could be obtained by reading in the .bed file from plink. (Here we distinguish genotype (the \(N \times L\) matrix whose columns correspond to the \(M\) target markers AND the additional \(L - M\) sampling set marker candidates) from the \(N \times M\) matrix \(G\), described in the manuscript, containing just the \(M\) target markers).

3.4 phenotype

A data.frame containing \(N\) rows and the two columns y and mu where y contains the subjects’ coded phenotypes (either 0 or 1 if the trait is binary or a real number if the trait is quantitative) and where mu contains the user-specified null trait prediction (a real number) for \(y\).

3.5 Psi = \(\Psi\)

Psi is the \(N \times N\) matrix of user-specified null correlation coefficients between pairs of subjects at any marker. Psi can be estimated using known family pedigree structures and/or from the subjects’ genotypes at markers independent of those in the target set in software packages such as KING (Manichaikul et al 2010).

3.6 target_markers

A character vector of length \(M\) that is a subset of the column bim$snp. This vector identifies the target markers.

TOP

4 An example data set

For illustration I will now use the example data set provided by gatars. You can access this example data with the following commands.

library(gatars)
# Preparing the data
bim = gatars_example$bim
exclusion_region = gatars_example$exclusion_region
genotype = gatars_example$genotype
phenotype = gatars_example$phenotype
Psi = gatars_example$Psi
target_markers = gatars_example$target_markers[3:5]

4.1 bim has 24578 rows and includes the columns chromosome, snp, and bp

str(bim)
## 'data.frame':    24578 obs. of  3 variables:
##  $ chromosome: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ snp       : chr  "exm94" "exm210" "exm269" "exm328" ...
##  $ bp        : int  874501 879481 881918 887967 949491 977356 978694 979748 980552 982722 ...

4.2 genotype has 200 rows and 24578 columns

I chose a relatively small number of columns to keep the example small.

str(genotype)
##  int [1:200, 1:24578] 0 0 0 1 0 0 0 0 0 0 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:200] "1" "2" "3" "4" ...
##   ..$ : chr [1:24578] "exm94" "exm210" "exm269" "exm328" ...

The column names of genotype must match bim$snp:

all(colnames(genotype) == bim$snp)
## [1] TRUE

4.3 target_markers is a character vector containing 3 names

str(target_markers)
##  chr [1:3] "exm1061853" "exm1061861" "exm1061863"

The elements in target_markers must be included in bim$snp

all(target_markers %in% bim$snp)
## [1] TRUE

4.4 phenotype has 200 rows and includes the columns y and mu

str(phenotype)
## 'data.frame':    200 obs. of  3 variables:
##  $ id: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ y : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ mu: num  0.879 0.838 0.798 0.838 0.798 ...

4.5 Psi is a 200 \(\times\) 200 square matrix

str(Psi)
##  num [1:200, 1:200] 1 0.5 0 0 0 0 0 0 0 0 ...

The following figure reflects the fact that the first 100 people are made up of 50 sib pairs, and the last 100 people are independent.

# figure to illustrate Psi
NNN = nrow(phenotype)
first_ten = 1:10
last_ten = NNN - (9:0)
matrix_image_fn(Psi[c(first_ten, last_ten), c(first_ten, last_ten)],
                main = "First and last 10 rows and columns of Psi")

4.6 exclusion_region is a data.frame containing the columns chromosome, start, and end

str(exclusion_region)
## 'data.frame':    115 obs. of  3 variables:
##  $ chromosome: num  1 1 1 1 1 2 2 2 2 2 ...
##  $ start     : int  10496040 150685811 154861707 204549714 205788696 9977740 10570604 20688505 43326810 62904596 ...
##  $ end       : int  10496040 150685811 154861707 204549714 205788696 9977740 10570604 20688505 43326810 62904596 ...

4.7 hotspot is available as soon as you enter gatars:

str(hotspot)
## 'data.frame':    25657 obs. of  4 variables:
##  $ chromosome: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ center    : num  949794 1725398 2021339 2102281 2282281 ...
##  $ start     : int  635391 1722898 1950897 2099781 2273781 2387781 2462781 2835085 2881085 2956085 ...
##  $ end       : int  1264196 1727898 2091781 2104781 2290781 2437781 2493781 2841085 2902085 2971086 ...

5 Check that your genotype matrix has rank \(M\)

genotype_target_markers is the \(N \times M\) genotype matrix.

# Checking the rank of the genotype_target_markers matrix
library(Matrix)
genotype_target_markers = genotype[, target_markers]
list(target_markers = target_markers,
     rank = as.numeric(rankMatrix( genotype_target_markers)))
## $target_markers
## [1] "exm1061853" "exm1061861" "exm1061863"
## 
## $rank
## [1] 3

6 gatars-sampling-set

Once you have your data in R, and you have checked that the column rank of your genotype matrix \(G =\) genotype_target_markers is \(M\), use the function gatars_sampling_set to create your sampling sets.

6.1 The arguments of gatars_sampling_set

The following is the function header of gatars_sampling_set showing which arguments are needed and in which order.

gatars_sampling_set(
  bim,
  epsilon,
  exclusion_region,
  genotype,
  hotspot,
  target_markers
)

6.1.1 bim, exclusion_region, genotype, hotspot, and target_markers,

These objects have already been discussed in An example data set.

6.1.2 epsilon

A positive small real number used to used to determine if a marker can be included in a sampling set. When creating the \(m\)-th sampling set for target marker with minor allele frequency \(\pi_m\), only those markers whose minor allele frequencies fall within the interval \(\pi_m \times \big[1 -\) epsilon \(, 1 +\) epsilon \(\big]\) can be included in the sampling set.

6.2 Example calls to gatars_sampling_set

# Example calls to gatars_sampling_set
set.seed(42)
epsilon = 0.01
exclusion_region = NULL
sampling_set = gatars_sampling_set(
  bim,
  epsilon,
  exclusion_region,
  genotype,
  hotspot,
  target_markers
)
print(sampling_set)
## $sampling_set_report
##      min     pi    max set_size
## 1 0.0200 0.0200 0.0200      632
## 2 0.0525 0.0525 0.0525      687
## 3 0.0100 0.0100 0.0100      627
## 
## $minimum_sampling_set_size
## [1] 627
previous_sampling_set = sampling_set
set.seed(42)
exclusion_region = gatars_example$exclusion_region
head(exclusion_region)
##    chromosome     start       end
## 1           1  10496040  10496040
## 2           1 150685811 150685811
## 3           1 154861707 154861707
## 4           1 204549714 204549714
## 5           1 205788696 205788696
## 11          2   9977740   9977740
sampling_set = gatars_sampling_set(
  bim,
  epsilon,
  exclusion_region,
  genotype,
  hotspot,
  target_markers
)
print(previous_sampling_set)
## $sampling_set_report
##      min     pi    max set_size
## 1 0.0200 0.0200 0.0200      632
## 2 0.0525 0.0525 0.0525      687
## 3 0.0100 0.0100 0.0100      627
## 
## $minimum_sampling_set_size
## [1] 627
print(sampling_set)
## $sampling_set_report
##      min     pi    max set_size
## 1 0.0200 0.0200 0.0200      527
## 2 0.0525 0.0525 0.0525      570
## 3 0.0100 0.0100 0.0100      509
## 
## $minimum_sampling_set_size
## [1] 509

TOP

7 gatars_test_size

7.1 The arguments of gatars_test_size

The following is the function header of gatars_test_size showing which arguments are needed and in which order.

gatars_test_size(phenotype, Psi, sampling_set, N_simulated_nulls, weights)

7.1.1 phenotype and Psi

These objects have already been discussed in An example data set.

7.1.2 sampling_set

This object is the result of gatars_sampling_set and you may follow the example from gatars-sampling-set

7.1.3 N_simulated_nulls = \(K\)

A positive integer which specifies the number of simulated null genotype matrices to generate to determine the p-value of the optimized statistics.

7.1.4 weights

A vector of length equal to \(M\) the length of target_markers. The entries in the vector are non-negative real numbers. The size of the m-th entry reflects the importance of the m-th target marker. If weights is not specified, the gatars_test_size assumes you would like equal weights among the \(M\) target markers.

7.1.5 statistics

A character vector with default value c("BS", "BT", "ST", "BST") reflecting which optimized statistics you would like gatars_test_size to compute.

TOP

7.2 Example calls to gatars_test_size

# Calling gatars_test_size 
start = Sys.time()
set.seed(42)
gatars_test_size(
  phenotype, Psi, sampling_set, N_simulated_nulls = 100, 
  statistics = c("BST"), weights = c(5, 3, 2))
## $N_simulated_nulls_required
## [1] 100
## 
## $p_value
##           B           S           T         BST 
## 0.006850701 0.041152586 0.010609984 0.000000000 
## 
## $q
##         B         S         T       BST 
##  48.24472  22.41665 160.53376  61.42519 
## 
## $x
##      BST 
## 2.406464
elapsed_time = Sys.time() - start
paste("100 sim reps used", 
      round(elapsed_time, 1), attributes(elapsed_time)$units)
## [1] "100 sim reps used 4.8 secs"

8 All the code in one place

library(gatars)
# Preparing the data
bim = gatars_example$bim
exclusion_region = gatars_example$exclusion_region
genotype = gatars_example$genotype
phenotype = gatars_example$phenotype
Psi = gatars_example$Psi
target_markers = gatars_example$target_markers[3:5]

# figure to illustrate Psi
NNN = nrow(phenotype)
first_ten = 1:10
last_ten = NNN - (9:0)
matrix_image_fn(Psi[c(first_ten, last_ten), c(first_ten, last_ten)],
                main = "First and last 10 rows and columns of Psi")

# Checking the rank of the genotype_target_markers matrix
library(Matrix)
genotype_target_markers = genotype[, target_markers]
list(target_markers = target_markers,
     rank = as.numeric(rankMatrix( genotype_target_markers)))
## $target_markers
## [1] "exm1061853" "exm1061861" "exm1061863"
## 
## $rank
## [1] 3
# Example calls to gatars_sampling_set
set.seed(42)
epsilon = 0.01
exclusion_region = NULL
sampling_set = gatars_sampling_set(
  bim,
  epsilon,
  exclusion_region,
  genotype,
  hotspot,
  target_markers
)
print(sampling_set)
## $sampling_set_report
##      min     pi    max set_size
## 1 0.0200 0.0200 0.0200      632
## 2 0.0525 0.0525 0.0525      687
## 3 0.0100 0.0100 0.0100      627
## 
## $minimum_sampling_set_size
## [1] 627
previous_sampling_set = sampling_set
set.seed(42)
exclusion_region = gatars_example$exclusion_region
head(exclusion_region)
##    chromosome     start       end
## 1           1  10496040  10496040
## 2           1 150685811 150685811
## 3           1 154861707 154861707
## 4           1 204549714 204549714
## 5           1 205788696 205788696
## 11          2   9977740   9977740
sampling_set = gatars_sampling_set(
  bim,
  epsilon,
  exclusion_region,
  genotype,
  hotspot,
  target_markers
)
print(previous_sampling_set)
## $sampling_set_report
##      min     pi    max set_size
## 1 0.0200 0.0200 0.0200      632
## 2 0.0525 0.0525 0.0525      687
## 3 0.0100 0.0100 0.0100      627
## 
## $minimum_sampling_set_size
## [1] 627
print(sampling_set)
## $sampling_set_report
##      min     pi    max set_size
## 1 0.0200 0.0200 0.0200      527
## 2 0.0525 0.0525 0.0525      570
## 3 0.0100 0.0100 0.0100      509
## 
## $minimum_sampling_set_size
## [1] 509
# Calling gatars_test_size 
start = Sys.time()
set.seed(42)
gatars_test_size(
  phenotype, Psi, sampling_set, N_simulated_nulls = 100, 
  statistics = c("BS", "BT", "ST", "BST"), weights = c(5, 3, 2))
## $N_simulated_nulls_required
## [1] 100
## 
## $p_value
##           B           S           T          BS          BT          ST 
## 0.006850701 0.041152586 0.010609984 0.000000000 0.000000000 0.000000000 
##         BST 
## 0.000000000 
## 
## $q
##         B         S         T        BS        BT        ST       BST 
##  48.24472  22.41665 160.53376  48.24472  61.75213  89.10342  61.42519 
## 
## $x
##       BS       BT       ST      BST 
## 2.164265 2.406581 2.027539 2.406464
elapsed_time = Sys.time() - start
paste("100 sim reps used", 
      round(elapsed_time, 1), attributes(elapsed_time)$units)
## [1] "100 sim reps used 8.4 secs"
start = Sys.time()
set.seed(42)
gatars_test_size(
  phenotype, Psi, sampling_set, N_simulated_nulls = 1000, 
  statistics = c("BST"), weights = c(5, 3, 2))
## $N_simulated_nulls_required
## [1] 1000
## 
## $p_value
##           B           S           T         BST 
## 0.006850701 0.041152586 0.010609984 0.005000000 
## 
## $q
##         B         S         T       BST 
##  48.24472  22.41665 160.53376  61.42519 
## 
## $x
##      BST 
## 2.406464
elapsed_time = Sys.time() - start
paste("1000 sim reps used", 
      round(elapsed_time, 1), attributes(elapsed_time)$units)
## [1] "1000 sim reps used 50 secs"
start = Sys.time()
set.seed(42)
gatars_test_size(
  phenotype, Psi, sampling_set, N_simulated_nulls = 2000, 
  statistics = c("BST"), weights = c(5, 3, 2))
## $N_simulated_nulls_required
## [1] 2000
## 
## $p_value
##           B           S           T         BST 
## 0.006850701 0.041152586 0.010609984 0.007500000 
## 
## $q
##         B         S         T       BST 
##  48.24472  22.41665 160.53376  61.42519 
## 
## $x
##      BST 
## 2.406464
elapsed_time = Sys.time() - start
paste("2000 sim reps used", 
      round(elapsed_time, 1), attributes(elapsed_time)$units)
## [1] "2000 sim reps used 1.7 mins"