2020-08-18

Docker上のnlmixrを使ってpopulation pharmacokinetics解析

2022.11.3追記

最近リリースされた nlmixr2 を使用すれば、Pythonをインストールすることなくpopulation pharmacokinetics解析を行うことができます。

Nonlinear Mixed Effects Models in Population PK/PD • nlmixr2

オープンソースの母集団薬物動態解析ソフトウェアであるnlmixr (https://ascpt.onlinelibrary.wiley.com/doi/full/10.1002/psp4.12445)を少し使ってみたのですが、Rとpythonを連携する必要があったりと立ち上げに少し手間取ったので、Dockerを使った環境設定について備忘録を残しておきます。

基本的には GitHub - yoshidk6/nlmixr.docker: docker for nlmixr に書いてある通りです。

事前準備その1

まずはDockerのインストールが必要です。 Windowsの場合ここも結構面倒ですが、pythonとRの設定を色々いじるよりは楽かと思います。特に手元にあるWindows環境がHome版だけだったので、Docker自体のインストールにも大分手間取りました。

以下のページを参考にし、まずはWSL2を導入します。 Windows 10 Home で WSL 2 + Docker を使う - Qiita
途中で確か 0x80070003 か 0x80370102 のエラーがでて進めなくなったので、下を参考にBIOSの設定をいじったら無事WSL2を入れることができました。 Install WSL | Microsoft Learn

事前準備その2

ローカル環境内の作業用フォルダを用意します。以下の例では C:\Users\USERID\Documents\test_nlmixr とします。 USERIDは自分のユーザー名で置き換えて下さい。

Docker imageを動かす

Docker Hub上にアップロードされているDockerイメージ (https://hub.docker.com/r/yoshidk6/nlmixr) を、以下のコマンドで起動します。

docker run -v C:/Users/USERID/Documents/test_nlmixr:/home/rstudio/docs -d -p 8787:8787 -e PASSWORD=nlmixr yoshidk6/nlmixr

ここでは-vオプションを使って、ローカル上のtest_nlmixrフォルダにDocker上からアクセス出来るように設定しています。再び、USERIDは自分のユーザー名で置き換えて下さい。

下の様に、特にエラーなしで進めば成功しているはずです。

RStudio Serverに接続し、nlmixrを実行

localhost:8787 にブラウザから接続し、上で設定したパスワードを使ってログインします。

RStudioが無事起動したら、examplesフォルダ内にある example.R を開いて、コマンドを上から順に実行してみてください。29行目ではsaemによる推定結果が表示されるはずです。

また、44行目以降を実行すると、shinyMixRが起動し、GUI上でモデルファイルの操作などが出来ます。

Dockerの終了

本来はコマンドを使って操作すべきところですが、僕は怠惰なのでGUIに頼ってしまいます。

右下のメニュー内にあるクジラのアイコンを右クリックし、Dashboardを選ぶと、

こんな感じで起動したコンテナを停止もしくは削除することが出来ます。

2020-08-18

gretaパッケージをMac環境でインストールする

gretaパッケージは、TensorFlow Probabilityを活用してHamiltonian Monte Carlo法による統計モデリングを行うRのパッケージです。

greta-stats.org

StanやBUGSなどと違い、すべてR上でモデルの構築が出来るため、学習のハードルが低くなっているのが特徴です。
また、上のページにもある通り、モデルの構造を綺麗に表示してくれる機能が標準で備わっています。

f:id:yoshidk6:20200602000100p:plain — Example model structure visualization

TensorFlowを使っているためPythonをRにリンクする必要があり、Get started with greta • gretaに従っただけだとインストールがうまく行きませんでした。
デフォルトで選択されるPython環境をいじる必要があったので、備忘録的に手順を残しておきます。

"R reticulateの設定"などで検索するといろいろな情報が出てきますが、今回の手順は基本的に下のページに従いました。

Chapter 15 Install Tensorflow, greta, and causact | A Business Analyst’s Introduction to Business Analytics

PythonとTensorFlow, TensorFlow Probabilityをインストールする

r-tensorflowという名前のenvironmentをAnaconda上*1に作り、その中に適切なバージョンのPythonとTensorFlow, TensorFlow Probabilityをインストールするという流れになります。

まずはAnacondaをインストールします Anaconda | Individual Edition
Anacondaのプロンプト上(Macではターミナル)で以下のコマンドを打ちます
1. conda create -n r-tensorflow python=3.7 tensorflow=1.14 pyyaml requests Pillow pip numpy=1.16
2. conda activate r-tensorflow
3. pip install tensorflow-probability==0.7

Rに結びつけるPythonのバージョン設定

Rにreticulateをインストールし、Pythonの環境を設定します。
Mac上(MacOS 10.14 Mojave)だとデフォルトではPython 2.xが選ばれてしまい、毎回Rを起動するたびにPythonの環境を選ぶ必要があって面倒です。
そのため、デフォルトで使うPythonを~/.Rprofileで設定してしまいます。グローバルに設定を変えたくない場合は、プロジェクトごとに設定することも出来るはずです。

reticulateをインストールする install.packages("reticulate")
RStudio上で.Rprofileを開く file.edit(file.path("~", ".Rprofile"))
次のコマンドを.Rprofileに追記して保存する reticulate::use_condaenv("r-tensorflow")
Rを再起動し、 reticulate::py_config() を打って r-tensorflow が選ばれていることを確認します

gretaのインストール

R/RStudio上でgretaをインストールします。

install.packages("greta")
library(greta)

動作確認

ここまでの手順がエラーなく進めば、gretaが使えるようになっているはずです。
Get started with greta • greta上にある簡単な例を試してみてください。

library(greta)

# data
x <- as_data(iris$Petal.Length)
y <- as_data(iris$Sepal.Length)

# variables and priors
int <- normal(0, 1)
coef <- normal(0, 3)
sd <- student(3, 0, 1, truncation = c(0, Inf))

# operations
mean <- int + coef * x

# likelihood
distribution(y) <- normal(mean, sd)

# defining the model
m <- model(int, coef, sd)

# plotting
# plot(m) # This works only if you installed igraph and DiagrammeR

# sampling
draws <- mcmc(m, n_samples = 1000)

# result summary
summary(draws)

*1:Anaconda上のPython/TensorFlowだとサンプリング速度が2-8倍になるそうです。

2020-01-28

Simulating parametric survival model with parametric bootstrap to capture uncertainty

I recently released an R package on CRAN calledsurvParamSim for parametric survival simulation, and here want to describe a bit more on details & motivations behind developing this package.

Parametric Survival Simulation with Parameter Uncertainty • survParamSim

The purpose of survParamSim is to package the R functions developed by a great scientist, the late Laurent Claret, and streamline the function definition/APIs to make these widely available to the public. Laurent developed these functions for his pioneering works in applying tumor growth inhibition models for clinical outcome predictions of oncology studies, such as the one below. clincancerres.aacrjournals.org

Explaining the theory behind his works is well beyond the scope of this short blog post, but the key technical highlight is that we are using parametric survival models with various covariates for predicting survival profiles and treatment benefits. In applying these workflows in the context of drug development (or maybe in most of other contexts), we want to make a prediction of not only the mean survival profiles but also the uncertainties associated with model predictions.

There are mainly two sources of uncertainty for such survival predictions. The first one is due to the nature of the survival prediction: because survival (or rather failure) is modeled as probabilistic processes, a predicted survival curve for a group will never be identical after multiple simulations. There is another source of uncertainty, which is the uncertainty associated with model parameters, such as coefficients associated with covariates in the model.

The first element is fairly straightforward to evaluate - we can run simulations many times and quantify how variable the survival curves can be by looking at prediction intervals. For example, we expect the prediction interval to generally get narrower if we make a prediction with larger number of patients. However, for the second part, there is not a readily available method for exploration to my (limited) knowledge. survival::predict.survreg() has se.fit option, but it is not straightforward to carry the output from this option for making predicted survival profiles of a group.

These are the main reasons for putting together this package - to enable the incorporation of above two sources of uncertainty in survival predictions, using parametric models obtained with survival::survreg(). For the second element of the uncertainty, we are using parametric bootstrap from variance-covariance matrix of the model parameter estimates.

Example

The code below is essentially the same as the one on GitHub pages, just showing a simple work flow to give you a sense of what to expect in terms of outputs. Vignette here has a bit more explanations on the workflow.

For this example, we use colon dataset in survival package, where the treatment benefit of Levamisole+5-FU was compared with control (called as Obs). Various covariates were further explored, such as extent of tumor local spread.

First, you would run survreg to fit parametric survival model:

library(tidyverse)
library(survival)
library(survParamSim)

set.seed(12345)

# ref for dataset https://vincentarelbundock.github.io/Rdatasets/doc/survival/colon.html
colon2 <- 
  as_tibble(colon) %>% 
  # recurrence only and not including Lev alone arm
  filter(rx != "Lev", etype == 1) %>% 
  # Same definition as Lin et al, 1994
  mutate(rx = factor(rx, levels = c("Obs", "Lev+5FU")),
         depth = as.numeric(extent <= 2))

Here we have two covariates, node4 and depth, in addition to the treatment assignment rx.

fit.colon <- survreg(Surv(time, status) ~ rx + node4 + depth, 
                     data = colon2, dist = "lognormal")

Next, parametric bootstrap simulation can be run with surv_param_sim() function from this package:

sim <- 
  surv_param_sim(object = fit.colon, newdata = colon2, 
                 censor.dur = c(1800, 3000),
                 # Simulating only 100 times to make the example go fast
                 n.rep = 100)

The output object merely has raw simulated survival data with 100 repetition (defined by n.rep argument). You can find what options you have on this output by simply printing the object.

sim
#> ---- Simulated survival data with the following model ----
#> survreg(formula = Surv(time, status) ~ rx + node4 + depth, data = colon2, 
#>     dist = "lognormal")
#> 
#> * Use `extract_sim()` function to extract individual simulated survivals
#> * Use `calc_km_pi()` function to get survival curves and median survival time
#> * Use `calc_hr_pi()` function to get hazard ratio
#> 
#> * Settings:
#>     #simulations: 100 
#>     #subjects: 619 (without NA in model variables)

To visualize simulated survival curves with uncertainly, you can use calc_km_pi() and plot_km_pi() functions sequentially. You can also specify faceting/grouping with trt and group input arguments.

km.pi <- calc_km_pi(sim, trt = "rx", group = c("node4", "depth"))

km.pi
#> ---- Simulated and observed (if calculated) survival curves ----
#> * Use `extract_median_surv()` to extract median survival times
#> * Use `extract_km_pi()` to extract prediction intervals of K-M curves
#> * Use `plot_km_pi()` to draw survival curves
#> 
#> * Settings:
#>     trt: rx 
#>     group: node4 
#>     pi.range: 0.95 
#>     calc.obs: TRUE

plot_km_pi(km.pi) +
  theme(legend.position = "bottom") +
  labs(y = "Recurrence free rate") +
  expand_limits(y = 0)

f:id:yoshidk6:20200124163724p:plain — Predicted (shaded area) and observed (line) survival curves, per depth and node4 groups

Similarly, you can visualize hazard ratios and prediction intervals with calc_hr_pi() and plot_hr_pi() functions.

hr.pi <- calc_hr_pi(sim, trt = "rx", group = c("depth"))

hr.pi
#> ---- Simulated and observed (if calculated) hazard ratio ----
#> * Use `extract_hr_pi()` to extract prediction intervals and observed HR
#> * Use `extract_hr()` to extract individual simulated HRs
#> * Use `plot_hr_pi()` to draw histogram of predicted HR
#> 
#> * Settings:
#>     trt: rx
#>          (Lev+5FU as test trt, Obs as control)
#>     group: depth 
#>     pi.range: 0.95 
#>     calc.obs: TRUE
plot_hr_pi(hr.pi)

f:id:yoshidk6:20200124163843p:plain — Histogram of predicted hazard ratios, prediction intervals (dashed lines), and observed hazard ratios (red line), per depth groups

2019-02-19

rstanやrstanarm等をWindows上でソースからインストールする時の設定

最近これに従って(Step by step guide for creating a package that depends on RStan • rstantools)RStanを使用するRのパッケージを作っていたのですが、どうもdevtools::install()やdevtools::build(binary = TRUE)などコンパイルを要するステップに入るとエラーが出てインストール出来ないという状況に陥りました。いろいろ調べると、どうやらカスタムパッケージの問題ではなく、コンパイルの設定自体に問題があり、RStanやRStanArmもソースからインストール出来ないことが分かったので、対処法をメモしておきたいと思います。

ちなみに、コンピューターはWindows 7 or 10を使っているので、もしかしたらMac等では問題なく進むのかもしれません。

RStanのWikiに書かれているステップを実行する

まずは下の2つのページに書かれているステップに従い、(1) C++ Toolchainのインストールと(2) Makevarsファイルの設定(2つ目のページの Configuration 項)を行います。

g++のバージョンを確認する

通常は問題ないはずですが、僕の環境では他のプログラム用の古いバージョンのg++がPATH上で優先されてしまっていて、コンパイル時にエラーを起こしてしまっていました。
R上で以下のコードを打ち、バージョンが4.9以降だったら問題ありません。

system("g++ -v")

もしそうでない場合には、環境変数のPATHの頭に以下を追記します。

C:\Rtools\bin;C:\Rtools\mingw_64\bin;C:\Rtools\mingw_32\bin;

Makevar.winと.Rprofileを編集する

本当は最初のステップだけで上手く行くはずなのですが、僕の環境では上の手順を辿っても上手くいきませんでした。 *1 これを回避するために、Makevars.winと.Rprofileを編集する必要がありました。

Makevars.win

Makevars.winの位置はfile.path(Sys.getenv("HOME"), ".R")で、Windowsだと通常は C:/Users/USERNAME/Documents/.R/内にあります。 Makevars.win内の二行目のこれを

CXX14 = g++ -m$(WIN) -std=c++1y

以下のものに変更します。

CXX14 = $(BINPREF)g++ -m$(WIN) -std=c++1y

.Rprofile

.RprofileはSys.getenv("HOME")にあるはずですが、無ければ作成してください。上と同様に、Windowsだと通常は C:/Users/USERNAME/Documents/内に置かれます。このファイル内に以下の行を追加してください。

Sys.setenv(BINPREF = "C:/RTools/mingw_$(WIN)/bin/")

*1:見たところ32ビット版のバイナリを作る時に64ビット版のg++を使ってしまっているようなので、PATH周りの問題かもしれません。

2018-09-05

Using purrr's map family functions in dplyr::mutate

map family functions of the purrr package are very useful for using non-vectorized functions in dplyr::mutate chain (see GitHub - jennybc/row-oriented-workflows: Row-oriented workflows in R with the tidyverse or https://www.jimhester.com/2018/04/12/vectorize/).
I encounter the needs for this especially when dealing with nested data frames.

One of the drawbacks is that name/input argument assignments become confusing when you want to use more than two columns of your data frames (and using pmap family) for the function of interest. This post first briefly review how mutate works in combination with map or map2, then provide two approaches to avoid confusions around name assignments when using pmap.

How mutate works with vectorized functions
Non-vectorized function with one or two input arguments (map or map2)
Non-vectorized function with three or more input arguments (pmap)

How `mutate` works with vectorized functions

In most cases, the processes you want to do in mutate is vectorized and there is no need to use map family function. This works because the output from the function of interest (c in the example below) has the same length as the original data frame, and mutate only need to append one column to the data frame.

library(tidyverse)

df0 <- tibble(a = 1:3, b = 4:6)

df0 %>% mutate(c = a + b)

Non-vectorized function with one or two input arguments (`map` or `map2`)

Imagine that we want to create a new column containing arithmetic progressions in each row [ref (in Japanese)]. Since seq function is not vectorized, we cannot directly use this in mutate chain.

df1 <- tibble(a = c(1, 2), b = c(3, 6), c = c(8, 10))

df1 %>% mutate(d = seq(a, b))
# Error in mutate_impl(.data, dots) : Evaluation error: 'from' must be of length 1.

Instead, we can use map family function here. map family function take list(s) as input arguments and apply the function of interest using each element of the given lists. Because each column of data frames in R is a list, map works very well in combination.

In this example, we want to provide two input arguments to the seq function, from and to. map2 is the appropriate function for this.

df2 <- df1 %>% mutate(d = map2(a, b, seq))

as.data.frame(df2)
#  a b  c             d
#1 1 3  8       1, 2, 3
#2 2 6 10 2, 3, 4, 5, 6

The figure below shows how map function handles this process in mutate chain.

f:id:yoshidk6:20180806153334p:plain

Like you do with map function outside mutate, we can use map_dbl or map_chr to create columns with double or character types.

If we want to explicitly specify names of the argument, .x and .y can be used. See what happens with this:

df2 <- mutate(df1, d = map2(a, b, ~seq(.y, .x)))

as.data.frame(df2)

Non-vectorized function with three or more input arguments (`pmap`)

Assignment of column names become confusing when using three or more columns, because we don't have shorthand like .x or .y any more. Let's take a look at the following example using rnorm function.

Case example

Generate a list of random numbers for each row with rnorm function. Each row of the original data frame contain different value of mean, sd, n.

We will first prepare a data frame with columns corresponding to mean, sd, n, and apply rnorm function for each row using pmap. Each element of the new column data contains a vector of random samples *1. This type of structure is called as "nested data frames" and there are many resources on this, such as 25 Many models | R for Data Science.

A simple case

If your data frame has the exact same names and numbers of columns to the input arguments of the function of interest, a simple syntax like the one below works *2.

df4 <- 
  tribble(~mean, ~sd, ~n,
          1,  0.03, 2,
          10, 0.1,  4,
          5,  0.1,  4)

df4.2 <- 
  df4 %>% 
  mutate(data = pmap(., rnorm)) 

as.data.frame(df4.2)

One caution is that the syntax like the one below doesn't work. pmap thinks that you are calling rnorm(df4$n, df4$mean, df4$sd) for each row, and each element of the new column contain three random samples from the same list of mean and sd. *3

df4 %>% mutate(data = pmap(., ~rnorm(n, mean, sd))) %>% as.data.frame() # Wrong answer

Number of columns > Number of input arguments

In most cases, however, you will have more columns than the input arguments. pmap complains in this case, saying that you have unused argument.

df5 <- 
  tribble(~mean, ~sd, ~dummy, ~n,
          1,  0.03, "a", 2,
          10, 0.1,  "b", 4,
          5,  0.1,  "c", 4)

df5 %>% mutate(data = pmap(., rnorm))　 # Error

There are two ways to avoid this error.

Make a small list on the fly

The first method is to create a small list that only contains the necessary columns (Ref: Dplyr: Alternatives to rowwise - tidyverse - RStudio Community )

df5.2 <- 
  df5 %>% 
  mutate(data = pmap(list(n=n, mean=mean, sd=sd), rnorm))

as.data.frame(df5.2)

Here, list(n=n, mean=mean, sd=sd) create a new list with three vectors named n, mean, and sd, which serves the same purpose as the df4 data frame in the above example.

Mind that if you don't give names to the elements of the new list, the order of the list items will be used to associate with input arguments of rnorm. My recommendation is to always assign names to the list elements.

df5 %>% mutate(data = pmap(list(n, mean, sd), rnorm)) # Correct but not recommended
df5 %>% mutate(data = pmap(list(mean, sd, n), rnorm)) # Wrong answer

Use `...` to ignore unused columns

The second method is to absorb unused columns with ... (Ref: Map over multiple inputs simultaneously. — map2 • purrr). A syntax like the one below works because pmap automatically associate names of the input list and names in function(). In other word, columns names of the data frame must match the variable names in the function().

df5.3 <- 
  df5 %>% 
  mutate(data = pmap(., function(n, mean, sd, ...) rnorm(n=n, mean=mean, sd=sd))) 

as.data.frame(df5.3)

Input arguments of function() and rnorm() are not automatically associated with names. It is recommended to explicitly associate input argument name for the function of interest (rnorm in this case).

df5 %>% mutate(data = pmap(., function(n, mean, sd, ...) rnorm(n, mean, sd))) # Correct but not recommended
df5 %>% mutate(data = pmap(., function(n, mean, sd, ...) rnorm(mean, sd, n))) # Wrong answer
df5 %>% mutate(data = pmap(., function(mean, sd, n, ...) rnorm(mean, sd, n))) # Wrong answer

A syntax like the one below gives unexpected outputs, as you saw in the df4 example.

df5 %>% mutate(data = pmap(., function(...) rnorm(n=n, mean=mean, sd=sd))) # Wrong answer

Column names are different from the input argument names

You can use either of the two approaches above.

df6 <- 
  tribble(~mean1, ~sd1, ~dummy, ~n1,
          1,  0.03, "a", 2,
          10, 0.1,  "b", 4,
          5,  0.1,  "c", 4)

df6.2 <-
  df6 %>% mutate(data = pmap(list(mean=mean1, sd=sd1, n=n1), rnorm)) 
as.data.frame(df6.2)

df6.3 <- 
  df6 %>% mutate(data = pmap(., function(n1, mean1, sd1, ...) rnorm(n=n1, mean=mean1, sd=sd1)))
as.data.frame(df6.3)

*1:In the examples below (and above), we further use as.data.frame function to exposure actual numbers of vectors

*2:This works even if the order of the columns is different from the order of input arguments

*3:This happens because rnorm is actually vectorized. See ?rnorm: If length(n) > 1, the length is taken to be the number required.

2018-08-06

dplyr::mutate内でベクトル化されていない関数を使う - purrr::pmapでのカラム名の対応付け

Rのデータフレームの各行に対して処理を行いたいが、関数をベクトル化することが難しい場合*1には、下で簡単に書くようにmutateとmapを組み合わせると非常に便利です。しかし、map内で定義する関数の入力値が3つ以上になると話が少しややこしくなります。納得するまで結構混乱したので、自分の備忘録も兼ねて記事にしておくことにしました。

通常のmutateを使った処理
各行について別々に処理、入力値が2つまでの場合
各行について別々に処理、入力値が3つ以上の場合
map系関数の使い方について参考になるサイト

通常の`mutate`を使った処理

殆どの場合、mutate内で行う処理(例えば+、*、ifelse等)はベクトル化されているので、出力値(下のケースではc)が他の列と同じ長さになり、元のデータフレームに追加されます。

library(tidyverse)

df0 <- tibble(a = 1:3, b = 4:6)

df0 %>% mutate(c = a + b)

各行について別々に処理、入力値が2つまでの場合

こちらにあるように、データフレームの各行に対して等差数列を作成する例を考えてみます。このとき、seq関数はベクトル化されていないため、以下のように書くとエラーになります。

df1 <- tibble(a = c(1, 2), b = c(3, 6), c = c(8, 10))

df %>% mutate(d = seq(x, y))
# Error in mutate_impl(.data, dots) : Evaluation error: 'from' must be of length 1.

ここでmap系関数の出番です。
map系関数は、リストを引数とし、リスト内の各要素に対して与えられた関数を適用した結果を返すという働きを持っています。 Rのデータフレームは各列がそれぞれリストとなっているので、map系巻数と相性が良くなっています。

今回の例ではseq関数にfromとtoの2つの引数を与えたいので、map2関数を使います。

df2 <- df1 %>% mutate(d = map2(a, b, seq))

print(df2$d)

下にこの処理がどのように行われているかのイメージ図を貼ってみました。

f:id:yoshidk6:20180806153334p:plain

もちろん、使用したい処理が一つの引数のみを取る場合はmapを使えばいいですし、作成する列をdoubleやcharacter型にしたければmap_dblやmap_chrなどを使います。また、mapやmap2では、関数内で明示的に引数を指定したい場合、.xや.yを使います。例えば、上の例でdf2 <- mutate(df1, d = map2(a, b, ~seq(.y, .x)))を実行してみてください。

各行について別々に処理、入力値が3つ以上の場合

この様にmap系関数を使うと、ベクトル化されていない処理をmutate内で簡潔に書くことが出来ます。 3つ以上のカラムを使用したい場合、同様にpmapを使うことになるのですが、カラム名と関数の引数名の対応付けが少し複雑になります。

以下の例を用いて、どのように対応付けを行うことができるのか見ていきます。

使用する例

複数の平均値・標準偏差の組み合わせについて、それぞれ異なった数の乱数をrnorm関数を用いて算出したい。

ここではまず、mean、sd、nに対応する3つのカラムを持ったデータフレームを作成し、各行ごとにrnorm関数を適用して乱数を発生させています。新しく生成されたカラムdataには、各行ごとに乱数値を保持したベクトルが格納されます。
結果を見やすくするため、この例では更にunnnest()関数を用いて縦長のデータフレームに変換していますが、変換せずにそのままmap系関数を使って処理を続けていくことも可能です。
ネスト化周りに関してはR4DSなど色々と解説があると思うので、馴染みのない方はそちらも参考にしてみてください。

R for Data Science
Rではじめるデータサイエンス、 20.2.1 入れ子データ
tidyverse脳になって階層構造のあるデータフレームを使いこなそう

単純な例

データフレーム内のカラムが、pmap内で使用する関数の引数と個数・名前共に一致する場合にのみ、以下のような簡潔な文法が通じます。

df4 <- 
  tribble(~mean, ~sd, ~n,
          1,  0.03, 2,
          10, 0.1,  4,
          5,  0.1,  4)

df4 %>% mutate(data = pmap(., rnorm)) %>% unnest()

ちなみに、下のような書き方だと、rnorm内の入力はdf4$n df4$mean df4$sdであると見做されてしまい、想定していない結果となってしまいます。

df4 %>% mutate(data = pmap(., ~rnorm(n, mean, sd))) %>% unnest() # Wrong answer

余分なカラムが含まれている場合

実際には、カラム数が関数の引数の数と一致するという状況は非常に稀だと思います。余計なカラムが含まれていると、unused argumentがあると言って怒られます。

df5 <- 
  tribble(~mean, ~sd, ~dummy, ~n,
          1,  0.03, "a", 2,
          10, 0.1,  "b", 4,
          5,  0.1,  "c", 4)

df5 %>% mutate(data = pmap(., rnorm))　 # Error

これを回避するには、2つの方法があります。

小さいリストを作ってしまう

1つ目は、pmapの第一引数をdf4のようなリストにその場で変えてしまう方法です。(参照: Dplyr: Alternatives to rowwise - tidyverse - RStudio Community )

df5 %>% mutate(data = pmap(list(n=n, mean=mean, sd=sd), rnorm)) %>% unnest()

注意点として、下のようにリスト内要素に名前をつけないと、rnormへの受け渡しが順序のみによって行われてしまいます。少し面倒ですが、上のようにきちんと名前を割り振る書き方を推奨します。

df5 %>% mutate(data = pmap(list(n, mean, sd), rnorm)) %>% unnest() # Correct but not recommended
df5 %>% mutate(data = pmap(list(mean, sd, n), rnorm)) %>% unnest() # Wrong answer

`...` で未使用変数を受け流す

2つ目は、pmapへの入力ではデータフレームをそのまま渡す一方で、関数宣言の所で未使用の変数を受け流す方法です(参照: Map over multiple inputs simultaneously. — map2 • purrr, "Use ... to absorb unused components of input list .l")。
pmapが、入力したリスト名と関数の引数名を自動で照合してくれる性質を利用しているので、function()内の変数名とデータフレームのカラム名が一致する必要があります。
引数を二回書く必要が出てきますが、リスト内要素に名前をつけ忘れる心配はありません。

df5 %>% mutate(data = pmap(., function(n, mean, sd, ...) rnorm(n=n, mean=mean, sd=sd))) %>% unnest()

ただしもちろん、普段関数を使う際と同様に、引数を名前で紐付けないと誤った結果となります。

df5 %>% mutate(data = pmap(., function(n, mean, sd, ...) rnorm(n, mean, sd))) %>% unnest() # Correct but not recommended
df5 %>% mutate(data = pmap(., function(n, mean, sd, ...) rnorm(mean, sd, n))) %>% unnest() # Wrong answer
df5 %>% mutate(data = pmap(., function(mean, sd, n, ...) rnorm(mean, sd, n))) %>% unnest() # Wrong answer

先にdf4の例で見たように、以下のような書き方も誤った結果となります。

df5 %>% mutate(data = pmap(., function(...) rnorm(n=n, mean=mean, sd=sd))) %>% unnest() # Wrong answer

カラム名がpmap内で使用する関数の引数名と一致しない場合

上2つのいずれの方法を使っても対応できます。

df6 <- 
  tribble(~mean1, ~sd1, ~dummy, ~n1,
          1,  0.03, "a", 2,
          10, 0.1,  "b", 4,
          5,  0.1,  "c", 4)

df6 %>% mutate(data = pmap(list(mean=mean1, sd=sd1, n=n1), rnorm)) %>% unnest()
df6 %>% mutate(data = pmap(., function(n1, mean1, sd1, ...) rnorm(n=n1, mean=mean1, sd=sd1))) %>% unnest()

map系関数の使い方について参考になるサイト

*1:特にnested dafa frameを扱っているときに頻発します

2017-10-30

Pharmacometrics解析のためのRStanへのTorstenライブラリの導入

Metrum Research Groupが開発しているTorstenライブラリは、Stan上でPharmacometrics解析を行うための拡張ライブラリです。 BaseのStanパッケージに比べ、以下の拡張がなされています。

NONMEMフォーマットのデータセットの読み込みとイベントスケジュール(薬剤投与、測定)の処理
標準薬物動態ライブラリの搭載 (NONMEMのPREDPPのようなもの)
- 線形1-/2-コンパートメントモデル+一次吸収の解析解
- 線形コンパートメントモデル用の汎用モデル (Padé approximantionによる解)
- 汎用ODEモデル (Runge KuttaもしくはBDF)

実用上では一番目のポイントがかなり重要で、わざわざStan用にデータセットを編集/Stanモデル内でイベントスケジュールの管理をする必要がなくなるため、Pharmacometrics解析にStanを使う心理的・労力的ハードルが低くなります。解析解などを用いている線形コンパートメントモデルはもちろん高速ですが、汎用モデルのODE solverもC++上で動いている(はず)ので、計算時間は比較的早いと思われます。

この記事では、TorstenライブラリをRStanに導入する方法を解説し、一例として2-コンパートメントモデルを使ってみます。

インストール
- 必要なもの
- 手順
使用例
参照したリンク先のまとめ

インストール

必要なもの

R (とRStudio)
devtoolsライブラリ(CRANからインストール可能)
RStanライブラリ

手順

最近までインストールが複雑でしたが、Metrumが専用ライブラリを作成してくれたので、RStanへのインストールが格段に楽になりました。

R上で以下のコマンドを入力するとインストールが行われます。RStanライブラリのヘッダーを置き換えてコンパイルをし直すようなので、インストールに比較的時間がかかります。

devtools::install_github('metrumresearchgroup/TorstenHeaders')
torstenHeaders::install_torsten()

使用例

こちらもMetrumによるGitHubリポジトリに、Torstenを使用した幾つかのモデルの例が載っています。今回はこのうちの2-コンパートメントモデルの例を使用してみます。
https://github.com/metrumresearchgroup/example-models
Torstenのマニュアルもこのリポジトリに含まれています(PDFへのリンク)。

はじめに、今回使用するライブラリを読み込んでおき、RStanの設定もしてしまいます。

library(tidyverse)
library(ggmcmc)
library(rstan)

rstan_options(auto_write=T)
options(mc.cores=parallel::detectCores())

データセット

解析に使用するデータセットをRに読み込みます。下記のRスクリプトをダウンロードした上で、read_rdump関数を使用して読み込みを行います。 https://github.com/metrumresearchgroup/example-models/blob/torsten-0.83/TwoCptModel/TwoCptModel.data.R

data <- rstan::read_rdump("TwoCptModel.data.R")

元のデータセットとしては、これらすべての要素がカラムとして格納されたデータフレームを想像してもらえるとイメージが掴めると思います。各行が一つのイベントに対応し、それぞれ薬物投与やサンプリングに対応しています。各データが何に対応しているかを簡単に解説すると、

amt: 薬物の投与量
addl: 薬物投与を計何回行うか
ii: 投与間隔
cmt: イベントが発生するコンパートメントのインデックス (1は吸収コンパートメント、2は中心コンパートメント)
cObs: 測定値
evid: 測定の場合は0、投与の場合は1に対応するインデックス
iObs: 全イベントのうち測定に対応する行のインデックス
nObs: 測定データの個数
nt: 全イベントの個数
rate: 定速静注の速度(今回は使用なし)
ss: 定常状態から計算の開始(今回は使用なし)
time: 各イベントが発生した時間

となっています。

実際にCSVファイルなどからデータを読み込む場合は、下記のRコードの57~75行目の様な処理を行います。基本的には、with関数を使って各カラムを独立したベクトルとしてリストに格納するだけです。この際、DVカラム(観測データが含まれる)に関しては、投与イベント等、観測データが欠損しているレコードを除く必要があります。インデックスiObsを使用することで、DVのうち観測データのみがcObsに格納されています。
https://github.com/metrumresearchgroup/example-models/blob/torsten-0.83/TwoCptModel/TwoCptModelSimulation.R

Stanモデル

以下に2-コンパートメントモデルを使用したモデルの例があげられています。1被験者のパラメータを推定するだけのシンプルなモデルです。
https://github.com/metrumresearchgroup/example-models/blob/torsten-0.83/TwoCptModel/TwoCptModel.stan

鍵となるのは65~69行目で、Torstenに含まれるPKModelTwoCpt関数を使用してODEの(解析)解を求めています。

  // PKModelTwoCpt takes in the NONMEM data, followed by the parameter
  // arrays and returns a matrix with the predicted amount in each 
  // compartment at each event.
  x = PKModelTwoCpt(time, amt, rate, ii, evid, cmt, addl, ss,
                   theta, biovar, tlag);

timeからssまではデータセットの中身をそのまま渡しています。 thetaがパラメータ(CL, Q, V1, V2, ka)を含む配列、biovarとtlagは追加のパラメータです。それぞれ、

CL: 中心コンパートメントからの消失クリアランス
Q: 中心コンパートメントと末梢コンパートメント間のクリアランス
V1: 中心コンパートメントの分布容積
V2: 末梢コンパートメントの分布容積
ka: 吸収速度定数
biovar: 各コンパートメントへのバイオアベイラビリティ
tlag: 各コンパートメントへの投与ラグタイム

を表します。
現在、PKModelOneCptとPKModelTwoCptの引数としてはこれらすべてが必要であり、例えばラグタイムを設定する必要がない場合でも0を渡す必要があるようです*1。

この関数を走らせると、渡したtimeにおける全コンパートメントの薬物量がmatrixとして返されます。

Rスクリプト

上記のStanコードを動かすためのRスクリプトは以下にあります。 https://raw.githubusercontent.com/metrumresearchgroup/example-models/torsten-0.83/R/TwoCpt.R

ただ、色々と不可欠で無いものが入っているので、最小限の要素のみ残したものを以下に記載します。ここでは、Stanモデルファイルがworking directlyに保存されていると仮定しています。また、データファイルは上記のコードを使って読み込み済みと仮定しています。

## create initial estimates
init <- function() {
  list(CL = exp(rnorm(1, log(10), 0.2)),
       Q = exp(rnorm(1, log(15), 0.2)),
       V1 = exp(rnorm(1, log(35), 0.2)),
       V2 =  exp(rnorm(1, log(105), 0.2)),
       ka = exp(rnorm(1, log(2), 0.2)),
       sigma = runif(1, 0.5, 2))
}

## run Stan
fit <- stan(file = "TwoCptModel.stan",
            data = data,
            init = init)

推定が終わったら、ggmcmcライブラリなどを用いて結果を評価します。

fit.param <- ggs(fit)
list.param <- c("CL", "Q", "V1", "V2", "ka")

fit.param.plot <- 
  fit.param %>% filter(Parameter %in% list.param)

ggs_density(fit.param.plot)
ggs_traceplot(fit.param.plot)
ggs_autocorrelation(fit.param.plot)
ggs_Rhat(fit.param.plot)

推定値に基づいたシミュレーション結果と測定値を比較します。

mcmc.sample <- rstan::extract(fit)

obs.pred.interval <- 
  mcmc.sample$cObsPred %>% 
  apply(MARGIN=2,quantile, prob=c(0.05, 0.5, 0.95)) %>% 
  t() %>% 
  tbl_df()

obs.pred.interval %>% 
  mutate(Time = data$time[data$iObs],
         cObs = data$cObs) %>% 
  ggplot(aes(Time, `50%`)) +
  geom_line() +
  geom_ribbon(aes(ymin=`5%`, ymax=`95%`),alpha=0.1) +
  geom_point(aes(y=cObs)) +
  scale_x_continuous(breaks=seq(0, 200, by=24)) +
  xlab("Time (hour)") +
  ylab("Concentration")

全体の傾向・ばらつき共によく再現されています。図では複数回投与後の薬物動態プロファイルに見えませんが、これはday 2-7の間で投与前濃度しか測定を行っていないため、それ以外の時点のシミュレーション結果が出力されていないことに依ります*2。 f:id:yoshidk6:20171030065007p:plain

その他

今回の例ではまずTorstenライブラリを使ってみることに主眼をおいたため、単一被験者のデータ解析という非常にシンプルなものになっています。ポピュレーション解析をに興味がある方は、同じexample-models内にあるTwoCptModelPopulationを見てみることをおすすめします。

参照したリンク先のまとめ

RStanのインストール方法 https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started-%28Japanese%29
Torstenを導入するためのRライブラリ https://github.com/metrumresearchgroup/TorstenHeaders
Torstenライブラリを用いたモデル例をまとめたGitHubリポジトリ https://github.com/metrumresearchgroup/example-models
- Torstenのマニュアル https://github.com/metrumresearchgroup/example-models/raw/torsten-0.83/torstenManual.pdf
- 2-コンパートメントモデルのStanモデルとデータファイル生成コード https://github.com/metrumresearchgroup/example-models/blob/torsten-0.83/TwoCptModel/
- 上記のモデルを動かすためのRコード https://github.com/metrumresearchgroup/example-models/tree/torsten-0.83/R/

*1:省略できるよう改良を計画しているようです。

*2:RとStanコードを改良することで、測定点以外の時間についてもシミュレーション結果を表示することが可能です。そのうちそれについて記事を書くかもしれません。

目次