A Bayesian Workflow for Spatially Correlated Random Effects in On-farm Experiment

Zhanglong Cao
(Katia Stefanova, Mark Gibberd, Suman Rakshit)

8/11/22

I’d like to begin by acknowledging the Traditional Owners of the land on which we meet today. I would also like to pay my respects to Elders past and present.

SAGI West Node Research

Analysing Large Strip Experiments: the aim is to estimate the spatially-varying treatment effects (e.g., yield response to nitrogen rates) in large paddocks. This may enable the creation of site-specific management zones.

  • Geographically weighted regression method (published in 2020)

  • Bayesian analysis framework (published in 2022)

  • MET framework for categorical variables (submitted)

  • Deep Gaussian process for spatio-temporal data

On-farm experiments

On-farm experiments

Treatments are applied to adjacent strips to detect spatial variation in treatment response.

Large strip trial

Improve profitability and sustainability

Due to spatial variability, different parts of a field exhibit different yield response to treatments. Some parts may not need high nitrogen rate, and this decreases cost and promotes environmental sustainability.

A large strip trial

Identify optimum nitrogen rates

  • Large scale (strip) trials are needed to identify optimum rates that vary across the field — small plot experiments are ineffective for this purpose.

  • To generate a spatial map of optimum nitrogen rate, we need to estimate the spatially-varying effects of nitrogen on yield.

Example: Argentinian corn field experiment

Example: Argentinian corn field experiment

Topographic factor with levels W = West slope, HT = Hilltop, E = East slope, LO = Low East.

Example: Argentinian corn field experiment

Inadequacy of a global model

Inadequacy of a global model

Zone-specific models are limiting

  • These zone boundaries are mostly arbitrary and do not represent the change points of the spatially-varying relationship.

  • The problem of the global model persists within each zone, because we assume that the regression coefficients are constant within each zone. Any variation in the spatial relationship within a zone would not be captured by a zone-specific regression model.

Geographically weighted regression

For constructing the spatial map, we used GWR to compute the regression coefficients at a regular grid of points covering the study region.

  • Estimation based on the local-likelihood

  • Estimate local regression coefficients

source: Fotheringham, Brunsdon, and Charlton (2003)

A better solution is geographically weighted regression, which allows the regression coefficients to be computed at every point within the study region. The creation of management zones based on these estimates would be much more meaningful than zones created arbitrarily.

Yield prediction with GWR

source: Rakshit et al. (2020)

Bayesian Statistics and Workflow

Why Bayesian?

  • The adoption of Bayesian approaches simplifies the interpretation of the results and augments the inference (Che and Xu (2010)).

  • Compared with REML, Baek et al. (2019) demonstrated advantages in terms of variation control and more powerful inference.

  • It is a complementary approach to GWR.

Linear mixed model notation

A linear mixed effects model for $Y$, using matrix notation, is

$$Y = Xb + Zu + e,$$
and
$$\begin{bmatrix} u \\ e \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Sigma_u & 0 \\ 0 & \Sigma_e \end{bmatrix} \right).$$

$b$ and $u$ are vectors of fixed and random effects, respectively, $X$ and $Z$ are the associated design matrices, and $e$ is the residual error vector. It is typically assumed that the vectors $u$ and $e$ are distributed independently of each other and that their joint distribution is a multivariate Gaussian distribution.

Linear mixed model notation

Therefore,

$$Y \sim N(Xb, \, Z \Sigma_u Z^\top + \Sigma_e),$$
where $\Sigma_u$ and $\Sigma_e$ are variance-covariance matrices corresponding to the random vectors $u$ and $e$, respectively (Zimmerman and Harville (1991), Butler et al. (2009)).

Bayesian hierarchical model

At location $s_i$, the former model is re-written as

$$y(s_i) = \sum_{m=1}^{l} b_m x_m(s_i) + \sum_{j=1}^{h} u_j(s_i) z_j(s_i) + e(s_i), \qquad u_i \mid \theta_u \sim N(0, V_u(\theta_u)), \qquad e(s_i) \mid \sigma_e \sim N(0, \sigma_e^2).$$
And,
$$Y \sim N(Xb + Zu, \, \Sigma_e).$$

$b_1, \dots, b_l$ are global effects corresponding to the $l$ explanatory variables $x_1, \dots, x_l$; $z_1, \dots, z_h$ denote $h$ variables whose effects are fitted as local effects; $u_j(s_i)$ denotes the local effect corresponding to $z_j$ at grid point $s_i \in S$; $u_i = (u_1(s_i), \dots, u_h(s_i))^\top$ is the vector of all local effects at $s_i$, $i = 1, \dots, n$; $\theta_u$ is the set of parameters of the covariance matrix $V_u$; and $\sigma_e$ is the error standard deviation, assumed to follow either a Gamma, half-Cauchy, or half-normal distribution.

If $u$ were absorbed into the residual term, the inverse of a huge covariance matrix would need to be computed, which slows down posterior sampling and reduces efficiency.

Particular model

A regression model of particular interest is the quadratic response model. The term associated with the global effects would take the form:

$$b_1 + b_2 x(s_i) + b_3 x^2(s_i), \qquad i = 1, \dots, n,$$
where $x(s_i)$ is the particular level of some controllable treatment applied at location $s_i$. Local departures from the global treatment effects $b_2$ and $b_3$ take the form:
$$u_1(s_i) + u_2(s_i) x(s_i) + u_3(s_i) x^2(s_i), \qquad i = 1, \dots, n,$$
where $u_1(s_i)$, $u_2(s_i)$, and $u_3(s_i)$ are spatially correlated local effects corresponding to the location $s_i$.

Spatially correlated random parameters

For the local effects $u = (u_1, u_2, u_3)$, the variance is

$$\Sigma_u = I_n \otimes V_u.$$
Or
$$\Sigma_u = V_s \otimes V_u.$$

$V_s$ can be an $AR1 \otimes AR1$ structure for regular grid data or a Matérn covariance function for irregular grid data.

To incorporate spatial correlation amongst the model parameters in our Bayesian hierarchical modelling framework, we investigate how the variance-covariance matrix of $u$ can be specified to represent the spatial correlation across all the grid points $s_i$, $i = 1, \dots, n$. Note that, at location $s_i$, the covariance matrix of $u_i$ is $V_u$. If the local effects are assumed spatially independent, then

$$\Sigma_u = I_n \otimes V_u.$$
If the correlation between grid points is characterised by a spatial variance-covariance matrix $V_s$, the variance-covariance matrix of $u$ is given by
$$\Sigma_u = V_s \otimes V_u,$$
where $V_s$ may be considered either an $AR1 \times AR1$ spatial variance-covariance matrix or a weighted distance matrix.
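As a minimal illustration of this Kronecker structure (the grid size, AR1 parameters, and the placeholder $V_u$ below are assumptions for illustration only), the two forms of $\Sigma_u$ can be constructed in R as:

# Sketch: the two covariance structures for u, with illustrative values only.
ar1_cor <- function(n, rho) rho^abs(outer(seq_len(n), seq_len(n), "-"))   # AR(1) correlation matrix

nr <- 2; nc <- 3                                      # a tiny 2 x 3 grid, so n = 6 locations
Vs <- kronecker(ar1_cor(nc, 0.4), ar1_cor(nr, 0.6))   # AR1 (x) AR1 spatial correlation across grid points

Vu <- diag(3)                                         # placeholder covariance of (u1, u2, u3) at one location
Sigma_iid     <- kronecker(diag(nr * nc), Vu)         # I_n (x) Vu: spatially independent local effects
Sigma_spatial <- kronecker(Vs, Vu)                    # Vs (x) Vu: spatially correlated local effects
dim(Sigma_spatial)                                    # 18 x 18 for this toy example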

Why use spatially correlated $\Sigma_u$?

source: Rakshit et al. (2020)

  • Only a single treatment is directly observed in any one position.

Why use spatially correlated $\Sigma_u$?

The spatial model allows information from neighbouring positions with other treatments to be exploited (Piepho et al. (2011)).

Bayesian workflow

source: Gelman et al. (2020)

We need a Bayesian workflow, rather than mere Bayesian inference, for several reasons (Gelman et al. (2020)).

A workflow is more general than an example but less precisely specified than a method.

  • Computation can be a challenge, and we often need to work through various steps including fitting simpler or alternative models, approximate computation that is less accurate but faster, and exploration of the fitting process, in order to get to inferences that we trust.

  • In difficult problems we typically do not know ahead of time what model we want to fit, and even in those rare cases that an acceptable model has been chosen ahead of time, we will generally want to expand it as we gather more data or want to ask more detailed questions of the data we have.

  • Even if our data were static, and we knew what model to fit, and we had no problems fitting it, we still would want to understand the fitted model and its relation to the data, and that understanding can often best be achieved by comparing inferences from a series of related models.

  • Sometimes different models yield different conclusions, without one of them being clearly favourable. In such cases, presenting multiple models is helpful to illustrate the uncertainty in model choice.

Bayesian process

Suppose $\theta \in \Theta$ is the set of all parameters under consideration. For a given $f: \Theta \to \mathbb{R}$, the main focus in the Bayesian approach is to estimate $f(\theta)$, typically by its conditional expectation, which is given by

$$\mathrm{E}[f(\theta) \mid Y] = \int_{\Theta} f(\theta) \, p(\theta \mid Y) \, d\theta.$$
Assuming a prior distribution for $\theta$ and applying Bayes' theorem, we obtain the posterior density function $p(\theta \mid Y)$, which subsequently leads to the solution in the above equation.

Posterior distribution

In order to estimate $\theta$ conditional on $Y$, we use Bayes' theorem to obtain the joint posterior density of the parameters in terms of the likelihood $p(Y \mid \theta)$ and the prior $\pi(\theta)$ as follows:

$$p(\theta \mid Y) \propto p(Y \mid \theta) \, \pi(\theta).$$
The distribution $p(\theta \mid Y)$ is the key ingredient for "Bayesian inference" of the parameter $\theta$. The posterior distribution $p(\theta \mid Y)$ provides all information about $\theta$ conditional on the observed data (Che and Xu (2010)).

Likelihood

For the multivariate Gaussian distribution, we obtain

$$\log p(Y \mid \theta) \propto -\tfrac{1}{2} (Y - Xb - Zu)^\top \Sigma_e^{-1} (Y - Xb - Zu) - \tfrac{1}{2} \ln \det \Sigma_e,$$
and for the multivariate Student-$t$ distribution
$$\log p(Y \mid \theta) \propto -\tfrac{\nu + n}{2} \ln\!\left(1 + \tfrac{1}{\nu} (Y - Xb - Zu)^\top \Sigma_e^{-1} (Y - Xb - Zu)\right) - \tfrac{n}{2} \ln \nu + \ln \Gamma\!\left(\tfrac{\nu + n}{2}\right) - \ln \Gamma\!\left(\tfrac{\nu}{2}\right) - \tfrac{1}{2} \ln \det \Sigma_e,$$
where $\nu \geq 1$ is the degrees of freedom.

Prior specification

  • Usually, if nothing is known from earlier studies, we can use a flat non-informative prior, $p(\theta) \propto \text{constant}$, also called an "improper prior" (Gelman (2006)).

  • In many circumstances, a Cauchy or Gamma prior is a reasonable candidate for regression coefficients, and an inverse Wishart (IW) or inverse Gamma distribution is commonly used as the prior for the standard deviation parameter of a hierarchical model.

  • Gelman (2006) and Gelman, Simpson, and Betancourt (2017) suggested weakly informative priors for variance parameters in Bayesian analyses of hierarchical linear models.

Prior specification

To specify a prior distribution for the parameters associated with the variance-covariance matrix VuVu, note that the matrix can be decomposed as follows:

$$V_u = B(\sigma_u) \, R_u \, B(\sigma_u),$$
where $B(\sigma_u)$ denotes the diagonal matrix with diagonal elements $\sigma_{u_1}, \dots, \sigma_{u_h}$, the standard deviations of $u_1, \dots, u_h$, and $R_u$ is the matrix whose diagonal elements are equal to unity and whose off-diagonal elements are the correlation coefficients between the random effects.
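As a small worked example (the standard deviations and correlations below are hypothetical), the decomposition can be written directly in R:

# Sketch of Vu = B(sigma_u) Ru B(sigma_u), with hypothetical values.
sigma_u <- c(5, 0.05, 0.0005)                 # standard deviations of u1, u2, u3
Ru <- matrix(c( 1.0,  0.3, -0.2,
                0.3,  1.0, -0.5,
               -0.2, -0.5,  1.0), 3, 3)       # correlation matrix (assigned an LKJ prior in the model)
B  <- diag(sigma_u)
Vu <- B %*% Ru %*% B                          # variance-covariance of the local effects at one location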

Prior specification

For the matrix RuRu with correlation coefficients, we specify the Lewandowski-Kurowicka-Joe (LKJ) distribution (Lewandowski, Kurowicka, and Joe (2009)) as the prior distribution, and this specification is given by

$$R_u \sim \mathrm{LKJcorr}(\epsilon),$$
where $\mathrm{LKJcorr}(\epsilon)$ denotes the LKJ distribution over positive definite correlation matrices, which depends on the value of a positive parameter $\epsilon$. The parameter $\epsilon$ controls the correlations in such a way that, as the value of $\epsilon$ increases, the correlations amongst parameters decrease.

LKJ Prior

$$R_u = \begin{bmatrix} 1 & \rho_{12} & \rho_{13} \\ \rho_{21} & 1 & \rho_{23} \\ \rho_{31} & \rho_{32} & 1 \end{bmatrix},$$
where the $\rho$'s are the pairwise correlation parameters.

The posterior distribution can be calculated by combining the likelihood and prior distribution (Besag and Higdon (1999), Tsionas (2002)).

Predictive distribution

The predictive distribution for a new query location $s^*$, based on the aforementioned posterior distribution, is obtained by marginalizing over $\theta$ and is written as

$$p(y(s^*) \mid x(s^*), z(s^*), Y, X, Z) = \int p(y(s^*) \mid x(s^*), z(s^*), \theta) \, p(\theta \mid Y, X, Z) \, d\theta.$$

Posterior sampling with Markov chains

For high-dimensional problems, the most widely used random sampling methods are Markov chain Monte Carlo methods.

  • Metropolis-Hastings method (Robert and Casella (1999))

  • Gibbs sampling (Geman and Geman (1984))

  • Slice sampling (Neal (2003))

It is also possible to extend Gibbs sampling in various ways. For example, in the case of variables whose conditional distribution is not easy to sample from, a single iteration of slice sampling or the Metropolis–Hastings algorithm can be used to sample from the variables in question.

The simplest version of slice sampling is similar to Gibbs sampling in that it consists of one-dimensional transitions in the state space; however there is no requirement that the one-dimensional conditional distributions be easy to sample from, nor that they have any convexity properties such as are required for adaptive rejection sampling.
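For intuition, a minimal random-walk Metropolis sampler for a toy one-dimensional target can be written in a few lines of R (the target, starting value, and step size are illustrative only):

# Minimal random-walk Metropolis sketch for a toy 1-d target density.
set.seed(1)
log_target <- function(theta) dnorm(theta, mean = 0, sd = 1, log = TRUE)   # toy target

n_iter <- 5000
step   <- 0.5                                 # sd of the symmetric random-walk proposal
theta  <- numeric(n_iter)
theta[1] <- -3                                # arbitrary starting value
for (t in 2:n_iter) {
  proposal  <- rnorm(1, theta[t - 1], step)
  log_alpha <- log_target(proposal) - log_target(theta[t - 1])
  theta[t]  <- if (log(runif(1)) < log_alpha) proposal else theta[t - 1]
}
mean(theta[-(1:1000)])                        # estimate after discarding burn-in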

Posterior sampling with Markov chains

However, simple Metropolis algorithms and Gibbs sampling algorithms, although widely used, can perform poorly because they explore the space by a slow random walk (MacKay (2003)).

  • Hamiltonian Monte Carlo (HMC) (Brooks et al. (2011), Duane et al. (1987))

  • No-U-Turn Sampler (NUTS) (Hoffman and Gelman (2014))

Random walk Metropolis-Hastings

source: https://github.com/chi-feng

Hamiltonian Monte Carlo (HMC)

  • Hamiltonian Monte Carlo (HMC) (Brooks et al. (2011), Duane et al. (1987)) is an efficient Markov chain Monte Carlo (MCMC) method that overcomes the inefficiency associated with the random walk and with the sensitivity to correlated parameters.

  • An important step in HMC is the drawing of a set of auxiliary momentum variables $r = \{r_1, \dots, r_d\}$, independently from the standard normal distribution for each parameter in the set $\theta = \{\theta_1, \dots, \theta_d\}$. The joint density function $f(\theta, r)$ of $\theta$ and $r$ is given by

    $$f(\theta, r) \propto \exp\{L(\theta) - K(r)\} = \exp\{-H(\theta, r)\},$$
    where $H(\theta, r)$ is the Hamiltonian system dynamics (HSD) equation with potential energy $L(\theta)$ and kinetic energy $K(r)$.

Hamiltonian Monte Carlo (HMC)

  • The HSD is numerically approximated in discrete time space with the leapfrog method to maintain the total energy when a new sample (θ∗,r∗)(θ∗,r∗) is drawn.

  • The leapfrog method requires two parameters: (i) a step size $\epsilon$, representing the distance between two consecutive draws, and (ii) a desired number of steps $L$, required to complete the process. A new sample is accepted with the probability

    $$\alpha = \min\left\{1, \frac{f(\theta^*, r^*)}{f(\theta, r)}\right\}.$$
    A bare-bones sketch of these steps is given below.
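The following R sketch illustrates the leapfrog updates and the accept/reject rule above for a standard normal target; the step size and number of steps are fixed by hand and purely illustrative:

# Bare-bones HMC sketch for a standard normal target, where L(theta) = -theta^2 / 2.
set.seed(2)
U      <- function(theta) 0.5 * theta^2       # potential energy: negative log target
grad_U <- function(theta) theta

hmc_step <- function(theta, eps = 0.2, L = 20) {
  r <- rnorm(1)                               # auxiliary momentum from a standard normal
  theta_new <- theta
  r_new <- r - 0.5 * eps * grad_U(theta_new)              # initial half step for momentum
  for (l in 1:L) {
    theta_new <- theta_new + eps * r_new                  # full step for position
    if (l < L) r_new <- r_new - eps * grad_U(theta_new)   # full step for momentum
  }
  r_new <- r_new - 0.5 * eps * grad_U(theta_new)          # final half step for momentum
  log_alpha <- (U(theta) + 0.5 * r^2) - (U(theta_new) + 0.5 * r_new^2)
  if (log(runif(1)) < log_alpha) theta_new else theta     # accept with probability min{1, f*/f}
}

draws <- numeric(2000)
for (t in 2:2000) draws[t] <- hmc_step(draws[t - 1])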

Hamiltonian Monte Carlo (HMC)

source: https://github.com/chi-feng

Problems with HMC

HMC is highly sensitive to the choice of $\epsilon$ and $L$, which in turn may crucially affect the results.

source: https://github.com/chi-feng

No-U-Turn Sampler (NUTS)

The No-U-Turn Sampler (NUTS) adapts the step size during the warm-up (burn-in) phase to achieve a target acceptance rate and then uses it for all sampling iterations (Hoffman and Gelman (2014), Monnahan, Thorson, and Branch (2017)).

The NUTS also eliminates the need to specify a value of $L$ by using the criterion

$$\frac{d}{dt} \frac{(\theta^* - \theta) \cdot (\theta^* - \theta)}{2} = (\theta^* - \theta) \cdot \frac{d}{dt}(\theta^* - \theta) = (\theta^* - \theta) \cdot r^* < 0,$$
where $r^*$ is the current momentum and $(\theta^* - \theta)$ is the distance from the initial position to the current position. The idea is that the trajectory keeps exploring the space until $\theta^*$ starts to move back towards $\theta$.

No-U-Turn Sampler (NUTS)

The slice sampling generates a finite set of samples of the form $(\theta, r)$ during the doubling procedure and the binary tree building process by randomly taking forward and backward leapfrog steps until

$$(\theta^+ - \theta^-) \cdot r^- < 0 \quad \text{or} \quad (\theta^+ - \theta^-) \cdot r^+ < 0,$$
where $(\theta^-, r^-)$ and $(\theta^+, r^+)$ are the leftmost and rightmost leaves, respectively, in the subtree. The best candidate $(\theta^*, r^*)$ is uniformly sampled from the subset of all candidate values of $(\theta, r)$.

No-U-Turn Sampler (NUTS)

NUTS

Example of building a binary tree via repeated doubling (Hoffman and Gelman (2014)).

Each doubling proceeds by choosing a direction (forwards or backwards in time) uniformly at random, then simulating Hamiltonian dynamics for $2^j$ leapfrog steps in that direction, where $j$ is the number of previous doublings (and the height of the binary tree). The figures at top show a trajectory in two dimensions (with corresponding binary tree in dashed lines) as it evolves over four doublings, and the figures below show the evolution of the binary tree. In this example, the directions chosen were forward (light orange node), backward (yellow nodes), backward (blue nodes), and forward (green nodes).

No-U-Turn Sampler (NUTS)

source: https://github.com/chi-feng

No-U-Turn Sampler (NUTS)

NUTS

Samples generated by random-walk Metropolis, Gibbs sampling, and NUTS. (Hoffman and Gelman (2014))

The plots compare 1,000 independent draws from a highly correlated 250-dimensional distribution (right) with 1,000,000 samples (thinned to 1,000 samples for display) generated by random-walk Metropolis (left), 1,000,000 samples (thinned to 1,000 samples for display) generated by Gibbs sampling (second from left), and 1,000 samples generated by NUTS (second from right). Only the first two dimensions are shown here.

Posterior predictive checking

The posterior predictive (PP) checking uses the posterior distribution of the model parameters to regenerate the observations.

Let $Y^{rep}$ denote a simulated or replicated data set, generated using the posterior predictive distribution

$$p(Y^{rep} \mid Y) = \int p(Y^{rep} \mid \theta) \, p(\theta \mid Y) \, d\theta.$$
To assess the fitted model, several data sets are simulated from $p(Y^{rep} \mid Y)$, and each of them is compared with the observed data $Y$.
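In practice, the replicated data can be generated in the generated quantities block (the y_rep quantity in the Stan program shown later) and compared with the observed data using, for example, the bayesplot package. A sketch, assuming the fitted rstan object is named fit and the observed response vector is Y:

# Sketch of posterior predictive checks, assuming an rstan object `fit` with a
# `y_rep` generated quantity and an observed response vector `Y`.
library(rstan)
library(bayesplot)

y_rep <- as.matrix(fit, pars = "y_rep")       # posterior draws of replicated data (draws x N)
ppc_dens_overlay(Y, y_rep[1:100, ])           # overlay densities of 100 replicated data sets on the data
ppc_stat(Y, y_rep, stat = "mean")             # compare a summary statistic of Y with its replicates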

Leave-one-out (LOO) cross validation

In Bayesian statistics, the expected log LOO predictive density (ELPD) is used to measure the predictive accuracy:

$$\mathrm{elpd} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i}),$$
where $p(y_i \mid y_{-i}) = \int p(y_i \mid \theta) \, p(\theta \mid y_{-i}) \, d\theta$ is the LOO predictive density with the $i$-th observation omitted from the data set (Vehtari, Gelman, and Gabry (2017)).

Bürkner, Gabry, and Vehtari (2021) proposed an approximate LOO CV, which uses only a single model fit and calculates the pointwise log predictive density as a fast approximation to the exact LOO CV.

Pareto-smoothed importance-sampling (PSIS)

The LOO estimate is improved by using Pareto smoothed importance sampling, which applies a smoothing procedure to the importance weights (Vehtari, Gelman, and Gabry (2017)).

The PSIS-LOO-CV estimate is computed by taking a weighted sum over all $n$ pointwise log-likelihoods:

$$\widehat{\mathrm{psis}} = \sum_{i=1}^{n} \log\!\left( \frac{\sum_{m=1}^{M} p(y_i \mid \theta^{(m)}) \, w_i^{(m)}}{\sum_{m=1}^{M} w_i^{(m)}} \right),$$
where $w_i^{(m)}$ are stabilised weights computed during PSIS, $m = 1, \dots, M$.

The approximate LOO CV draws $n$ samples, each of size $M$, from the posterior distribution. For each observation, the pointwise log-likelihood is then computed based on the $M$ sampled values.
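Assuming the Stan program stores the pointwise log-likelihood in a log_lik generated quantity (as in the code shown later) and the fitted object is named fit, the PSIS-LOO estimate can be obtained with the loo package; the last line mirrors the weighted sum above:

# Sketch of PSIS-LOO for an rstan fit with a `log_lik` generated quantity
# (relative effective sample sizes omitted here for brevity).
library(loo)

log_lik <- extract_log_lik(fit, parameter_name = "log_lik")   # S x n matrix of pointwise log-likelihoods
loo_fit <- loo(log_lik)                                        # elpd_loo, p_loo, looic and Pareto k diagnostics
print(loo_fit)

psis_obj <- psis(-log_lik)                                     # Pareto-smoothed importance ratios
w        <- weights(psis_obj, log = FALSE, normalize = FALSE)  # stabilised weights w_i^(m)
elpd_hat <- sum(log(colSums(exp(log_lik) * w) / colSums(w)))   # the weighted sum in the formula above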

Pareto-smoothed importance-sampling (PSIS)

The estimated shape parameter $\hat{k}$ of the generalized Pareto distribution can be used to assess the reliability of the estimate:

  • If $k < \tfrac{1}{2}$, the variance of the raw importance ratios is finite, the central limit theorem holds, and the estimate converges quickly.

  • If $\tfrac{1}{2} < k < 1$, the variance of the raw importance ratios is infinite but the mean exists, the generalized central limit theorem for stable distributions holds, and the convergence of the estimate is slower. The variance of the PSIS estimate is finite but may be large.

  • If $k > 1$, the variance and the mean of the raw ratios distribution do not exist. The variance of the PSIS estimate is finite but may be large.

If the estimated tail shape parameter $\hat{k}$ exceeds 0.5, the user should be warned, although in practice good performance has been observed for values of $\hat{k}$ up to 0.7. Even if the PSIS estimate has a finite variance, when $\hat{k}$ exceeds 0.7 the user should consider sampling directly from $p(\theta \mid y_{-i})$ for the problematic $i$, using K-fold cross-validation, or using a more robust model.

Bayesian R2R2

The Bayesian $R^2$ (Gelman et al. (2019)) is defined as the variance of the predicted values divided by the variance of the predicted values plus the expected residual variance:

$$\text{Bayesian } R^2 = \frac{\mathrm{Var}(Y^{pred})}{\mathrm{Var}(Y^{pred}) + \mathrm{Var}(res)}.$$

  • It should not be interpreted on its own if the model has a large number of bad Pareto $\hat{k}$ values, i.e., values greater than 0.7 or, even worse, greater than 1.

The proportion of variance explained, the classical $R^2 = \mathrm{Var}(\hat{y}) / \mathrm{Var}(y)$, is a commonly used measure of model fit, and there is a long literature on interpreting it, adjusting it for degrees of freedom used in fitting the model, and generalizing it to other settings such as hierarchical models; see, for example, Xu (2003) and Gelman and Pardoe (2006).

Two challenges arise in defining $R^2$ in a Bayesian context. The first is the desire to reflect posterior uncertainty in the coefficients, which should remove or at least reduce the overfitting problem of least squares. Second, in the presence of strong prior information and weak data, it is possible for the fitted variance $\mathrm{Var}(\hat{y})$ to be higher than the total variance $\mathrm{Var}(y)$, so that the classical formula can yield an $R^2$ greater than 1 (Tjur 2009).
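Under the Gaussian model, the Bayesian $R^2$ can be computed draw by draw from the posterior of the linear predictor and the residual standard deviation. A sketch, assuming draws of mu and sigma are available from the fitted rstan object fit (rstanarm also provides bayes_R2() for models fitted with that package):

# Sketch: Bayesian R^2 computed per posterior draw, assuming draws of the
# linear predictor `mu` (draws x N) and residual sd `sigma` in the fit.
mu    <- as.matrix(fit, pars = "mu")
sigma <- as.matrix(fit, pars = "sigma")[, 1]

var_fit  <- apply(mu, 1, var)                     # variance of the predicted values, per draw
var_res  <- sigma^2                               # expected residual variance, per draw
bayes_R2 <- var_fit / (var_fit + var_res)

quantile(bayes_R2, c(0.025, 0.5, 0.975))          # posterior median and 95% credible interval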

Las Rosas data

Las Rosas data

To obtain the map of locally varying optimal input rates, we specified a quadratic regression model, in which the corn yield is modelled as a quadratic function of the nitrogen rate. The optimal treatment can be determined by estimating the coefficients of the quadratic regression model at each grid point.
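For a quadratic response with a negative quadratic coefficient, the yield-maximising rate at grid point $s_i$ is the stationary point $N_{opt}(s_i) = -(b_1 + u_1(s_i)) / \{2 (b_2 + u_2(s_i))\}$. A sketch in R with hypothetical posterior means of the coefficients (the clipping range is also an assumption):

# Sketch: per-location optimal N rate from a fitted quadratic response
# y = (b0 + u0) + (b1 + u1) N + (b2 + u2) N^2 (all values hypothetical).
b1 <- 0.20; b2 <- -0.0008                  # posterior means of the global linear and quadratic coefficients
u1 <- rnorm(100, 0, 0.02)                  # posterior means of the local linear departures at 100 grid points
u2 <- rnorm(100, 0, 0.0002)                # posterior means of the local quadratic departures

quad  <- b2 + u2
n_opt <- -(b1 + u1) / (2 * quad)           # stationary point of the per-location quadratic
n_opt[quad >= 0] <- NA                     # a maximum exists only where the quadratic coefficient is negative
n_opt <- pmin(pmax(n_opt, 0), 250)         # clip to the assumed range of applied rates (0-250 kg/ha)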

|                     | Model 1                      | Model 2            | Model 3                      | Model 4            |
|---------------------|------------------------------|--------------------|------------------------------|--------------------|
| Spatial correlation | No                           | Yes                | No                           | Yes                |
| $\mathrm{Var}(u)$   | $I_{n\times n} \otimes V_u$  | $V_s \otimes V_u$  | $I_{n\times n} \otimes V_u$  | $V_s \otimes V_u$  |
| Distribution        | Gaussian                     | Gaussian           | Student-$t$                  | Student-$t$        |

RStan

RStan is the R interface to the Stan C++ package. RStan provides

  • full Bayesian inference using the No-U-Turn sampler (NUTS), a variant of Hamiltonian Monte Carlo (HMC)
  • approximate Bayesian inference using automatic differentiation variational inference (ADVI)
  • penalized maximum likelihood estimation using L-BFGS optimization

RStan

functions {
  matrix chol_AR_matrix(real rho,int d){
    matrix[d,d] MatAR;
    MatAR = rep_matrix(0,d,d);
    for(i in 1:d){
      for(j in i:d){
        if(j>=i && i==1) MatAR[j,i] = rho^(j-1);
        else if(i>=2 && j>=i) MatAR[j,i] = rho^(j-i)*sqrt(1-rho^2);
      }
    }
    return MatAR;
  }
}
data {
  int<lower=1> N;
  vector[N] Y;
  int<lower=1> K; 
  matrix[N, K] X; 
  int<lower=1> N_1; 
  int<lower=1> M_1; 
  int<lower=1> J_1[N];
  vector[N] Z_1_1;
  int nrow;
  int ncol;
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  
  vector[Kc] means_X;  
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;
  real Intercept;  
  real<lower=0> sigma; 
  vector<lower=0>[M_1] sd_1;  
  vector[N_1] z_1[M_1]; 
  matrix[ncol, nrow] z_2;
  real<lower=0,upper=1> rho_r;
  real<lower=0,upper=1> rho_c;
}
transformed parameters {
  vector[N_1] r_1_1;
  vector[N] Sig;
  r_1_1 = (sd_1[1] * (z_1[1]));
  Sig = sigma * to_vector(chol_AR_matrix(rho_c,ncol) * z_2 * chol_AR_matrix(rho_r,nrow)');
}
model {
  vector[N] mu = Intercept + rep_vector(0.0, N);
  for (n in 1:N) {
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + Sig[n];
  }
  target += normal_id_glm_lpdf(Y | Xc, mu, b, sigma);
  // prior
  target += student_t_lpdf(Intercept | 3, 84.7, 31.3);
  target += student_t_lpdf(sigma | 3, 0, 31.3)
    - 1 * student_t_lccdf(0 | 3, 0, 31.3);
  target += student_t_lpdf(sd_1 | 3, 0, 31.3)
    - 1 * student_t_lccdf(0 | 3, 0, 31.3);
  target += std_normal_lpdf(z_1[1]);
  target += std_normal_lpdf(to_vector(z_2));
  target += uniform_lpdf(rho_r | 0,1) + uniform_lpdf(rho_c | 0,1);
}
generated quantities {
  vector[N] log_lik;
  vector[N] y_rep;
  vector[N] mu = Intercept + Xc * b;
  real b_Intercept = Intercept - dot_product(means_X, b);
  for (n in 1:N) {
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + Sig[n];
    log_lik[n] = normal_lpdf(Y[n] | mu[n], sigma);  // pointwise log-likelihood for LOO
    y_rep[n] = normal_rng(mu[n], sigma);            // replicated data for posterior predictive checks
  }
}
}
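A sketch of compiling and sampling from this program with rstan; the file name, data object names, and tuning values below are assumptions for illustration:

# Sketch: fitting the Stan program above with rstan (data and file names assumed).
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

stan_data <- list(
  N = nrow(lasrosas), Y = lasrosas$yield,
  K = 3, X = cbind(1, lasrosas$nitro, lasrosas$nitro^2),
  N_1 = n_grid, M_1 = 1, J_1 = lasrosas$grid_id, Z_1_1 = lasrosas$nitro,
  nrow = n_row, ncol = n_col
)

fit <- stan(
  file = "spatial_quadratic.stan", data = stan_data,
  chains = 4, iter = 2000, warmup = 1000, seed = 2022,
  control = list(adapt_delta = 0.95, max_treedepth = 12)   # NUTS tuning
)
print(fit, pars = c("b_Intercept", "b", "sigma", "sd_1", "rho_r", "rho_c"))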

RStan

Other packages:

  • brms (R)

  • rstanarm (R)

  • glmer2stan (R)

  • PyStan (Python stan)

  • Stan.jl (Julia stan)

Prior predictive simulations


| Parameter  | Model 1         | Model 2         | Model 3          | Model 4          |
|------------|-----------------|-----------------|------------------|------------------|
| $b_0$      | $N(80, 10)$     | $N(80, 10)$     | $N(80, 10)$      | $N(80, 10)$      |
| $b_1$      | $N(0, 0.01)$    | $N(0, 0.01)$    | $N(0, 0.01)$     | $N(0, 0.01)$     |
| $b_2$      | $N(0, 0.001)$   | $N(0, 0.001)$   | $N(0, 0.001)$    | $N(0, 0.001)$    |
| $\sigma_0$ | $N^+(0, 1)$     | $N^+(0, 1)$     | $N^+(0, 1)$      | $N^+(0, 1)$      |
| $\sigma_1$ | $N^+(0, 0.01)$  | $N^+(0, 0.01)$  | $N^+(0, 0.01)$   | $N^+(0, 0.01)$   |
| $\sigma_2$ | $N^+(0, 0.001)$ | $N^+(0, 0.001)$ | $N^+(0, 0.001)$  | $N^+(0, 0.001)$  |
| $\sigma_e$ | $N^+(0, 1)$     | $N^+(0, 1)$     | $N^+(0, 1)$      | $N^+(0, 1)$      |
| $R_u$      | —               | LKJcorr(1)      | —                | LKJcorr(1)       |
| $\rho_c$   | —               | $U(0, 1)$       | —                | $U(0, 1)$        |
| $\rho_r$   | —               | $U(0, 1)$       | —                | $U(0, 1)$        |
| $\nu$      | —               | —               | $\Gamma(2, 0.1)$ | $\Gamma(2, 0.1)$ |
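Given the priors in the table, prior predictive yield curves can be simulated directly in R to check that they cover agronomically plausible yields before seeing the data. A minimal sketch (the nitrogen range is illustrative, and the half-normals are drawn with abs() for brevity):

# Sketch: prior predictive simulation of yield-nitrogen curves from the priors above.
set.seed(2022)
n_sim  <- 100
N_rate <- seq(0, 150, by = 5)                # nitrogen rates to evaluate (illustrative range)

b0 <- rnorm(n_sim, 80, 10)
b1 <- rnorm(n_sim, 0, 0.01)
b2 <- rnorm(n_sim, 0, 0.001)
sigma_e <- abs(rnorm(n_sim, 0, 1))           # half-normal N+(0, 1)

yield_sim <- sapply(seq_len(n_sim), function(s)
  b0[s] + b1[s] * N_rate + b2[s] * N_rate^2 + rnorm(length(N_rate), 0, sigma_e[s]))

matplot(N_rate, yield_sim, type = "l", lty = 1, col = grey(0.7),
        xlab = "Nitrogen rate", ylab = "Simulated yield")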

Prior predictive simulations

Posterior checking (fitting)

Posterior checking (skewness)

Posterior checking (LOO PIT)

LOO CV predictive cumulative density plots can also be used to assess the performance of the fitted models. A model is well calibrated for continuous responses when the corresponding plot shows asymptotically uniform behaviour. The figure compares the density of the computed LOO PIT values (the thick dark curve) with 100 simulated data sets from a standard uniform distribution (the thin light curves). It is evident that Models 1 and 3 are miscalibrated. Although the Model 2 fit seems good, the frown shape of the curve indicates poorer calibration than Model 4, which implies that Model 2 is either misspecified or too flexible. A flexible model often has the capability of predicting out-of-sample data successfully. However, amongst the four fitted models, Model 4 demonstrates the best fit for the Las Rosas data set.
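With the log_lik and y_rep quantities from the Stan program, a LOO PIT overlay like the one described above can be produced with loo and bayesplot (again assuming the fitted object fit and observed response Y):

# Sketch: LOO PIT overlay for a fitted rstan model with `log_lik` and `y_rep`.
library(loo)
library(bayesplot)

log_lik  <- extract_log_lik(fit, parameter_name = "log_lik")
psis_obj <- psis(-log_lik)                               # smoothed importance weights
y_rep    <- as.matrix(fit, pars = "y_rep")

ppc_loo_pit_overlay(y = Y, yrep = y_rep, lw = weights(psis_obj))   # should look approximately uniform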

Pareto $\hat{k}$

| Pareto $\hat{k}$     | Model 1 |       |            | Model 2 |       |            | Model 3 |       |            | Model 4 |       |            |
|----------------------|---------|-------|------------|---------|-------|------------|---------|-------|------------|---------|-------|------------|
|                      | Count   | %     | Min. n_eff | Count   | %     | Min. n_eff | Count   | %     | Min. n_eff | Count   | %     | Min. n_eff |
| (-Inf, 0.5] (good)   | 28      | 1.7%  | 457        | 1585    | 94.7% | 432        | 1474    | 88.1% | 494        | 1672    | 99.9% | 868        |
| (0.5, 0.7] (ok)      | 372     | 22.2% | 112        | 83      | 5.0%  | 103        | 176     | 10.5% | 254        | 2       | 0.1%  | 1733       |
| (0.7, 1] (bad)       | 1138    | 68.0% | 18         | 4       | 0.2%  | 70         | 24      | 1.4%  | 170        | 0       | 0.0%  | 0          |
| (1, Inf) (very bad)  | 136     | 8.1%  | 8          | 2       | 0.1%  | 4          | 0       | 0.0%  | 0          | 0       | 0.0%  | 0          |

The reliability and approximate convergence rate of the PSIS-based estimates can be assessed using the estimates for the shape parameter k of the generalized Pareto distribution:

If k<0.5 then the distribution of raw importance ratios has finite variance and the central limit theorem holds. However, as k approaches 0.5 the RMSE of plain importance sampling (IS) increases significantly while PSIS has lower RMSE.

If 0.5≤k<1 then the variance of the raw importance ratios is infinite, but the mean exists. TIS and PSIS estimates have finite variance by accepting some bias. The convergence of the estimate is slower with increasing k. If k is between 0.5 and approximately 0.7 then we observe practically useful convergence rates and Monte Carlo error estimates with PSIS (the bias of TIS increases faster than the bias of PSIS). If k>0.7 we observe impractical convergence rates and unreliable Monte Carlo error estimates.

If k≥1 then neither the variance nor the mean of the raw importance ratios exists. The convergence rate is close to zero and the bias can be large with practical sample sizes.

Model evaluation

|                | Model 1  |               | Model 2  |               | Model 3  |               | Model 4  |               |
|----------------|----------|---------------|----------|---------------|----------|---------------|----------|---------------|
|                | Estimate | SE            | Estimate | SE            | Estimate | SE            | Estimate | SE            |
| elpd           | -7236.2  | 13.4          | -4945.2  | 134.8         | -7848.4  | 17.1          | -4734.3  | 38.3          |
| ploo           | 1487.1   | 11.7          | 341.8    | 41.3          | 241.2    | 6.8           | 516.1    | 10.5          |
| looic          | 14472.5  | 26.7          | 9890.4   | 269.6         | 15696.8  | 34.3          | 9468.7   | 76.7          |
|                | Median   | CI            | Median   | CI            | Median   | CI            | Median   | CI            |
| Bayesian $R^2$ | 0.842    | 0.563 ~ 0.965 | 0.974    | 0.972 ~ 0.977 | 0.190    | 0.135 ~ 0.251 | 0.989    | 0.987 ~ 0.991 |

The LOO information criterion (looic) is $-2 \times$ elpd, i.e., the elpd on the deviance scale.

The $R^2$ is valid only when the model is not misspecified. The table shows that Model 1 is a better fit than Model 3 in terms of the $R^2$ value, but these two models are misspecified, as evidenced by their high Pareto $\hat{k}$ values and large looic values. Therefore, we focus only on Models 2 and 4. Model 4, with the Student-$t$ distribution, is better than Model 2, with the Gaussian distribution, in terms of a smaller looic and a higher $R^2$ value. The bad Pareto $\hat{k}$ values in Model 2 are eliminated by fitting Model 4. Therefore, we use Model 4 to fit the Las Rosas data, and the results are presented in the section below.
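The pairwise comparison of the candidate models on the elpd scale can also be carried out with loo_compare(); the loo objects below are assumed to have been computed for each of the four fitted models:

# Sketch: comparing the four fitted models on elpd_loo (loo1-loo4 assumed computed).
library(loo)
loo_compare(list(`Model 1` = loo1, `Model 2` = loo2,
                 `Model 3` = loo3, `Model 4` = loo4))   # models ordered by elpd_loo, with SE of differences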

Results

Maps of the estimated coefficients $\beta_0$, $\beta_1$, and $\beta_2$.

Yield

Yield prediction

The figure depicts the map of the estimated yield corresponding to the spatially-varying adjusted optimal treatment rates across the field.

Compare with GWR

|                | GWR                    | Bayesian                                                                        |
|----------------|------------------------|---------------------------------------------------------------------------------|
| Inference      | with neighbouring data | with all data                                                                   |
| Initialisation | bandwidth selection    | prior specification                                                             |
| Objective      | local log-likelihood   | global log-likelihood                                                           |
| Evaluation     | $t$ scores, $p$-values | credible intervals, PP check and LOO PIT, Pareto $k$ diagnosis, Bayesian $R^2$  |

In GWR, the results crucially depend on the bandwidth of the selected kernel function. Although an appropriate bandwidth can be selected using spatial cross validation, this is computationally challenging for large data sets. To estimate the regression parameters for a query location, GWR gives neighbouring observations more weight than distant ones. In contrast, the proposed Bayesian approach uses all data in one go to produce estimates for all grid points, based on a spatial variance matrix defined for the entire field. Bayesian inference is affected by the choice of priors and the likelihood; however, the influence of the prior diminishes as the amount of data increases. The Bayesian approach is in general more flexible than GWR, as it can be easily extended and applied broadly to other applications.

Further application of BHM and GWR

Two types of large-strip trial design

Two design

Response

Two response

$$\begin{cases} \text{Linear:} & y_i = b_0 + u_{0i} + (b_1 + u_{1i}) N_i + e_i \\ \text{Quadratic:} & y_i = b_0 + u_{0i} + (b_1 + u_{1i}) N_i + (b_2 + u_{2i}) N_i^2 + e_i, \end{cases}$$

Spatial correlation

For the variance-covariance of the random effects,

$$\Sigma_u = V_s \otimes V_u,$$
we chose $V_s = I$, $V_s = AR1(\rho_c) \otimes AR1(\rho_r)$, and the Matérn covariance function
$$V_s(d) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \sqrt{2\nu}\, \frac{d}{r} \right)^{\nu} K_{\nu}\!\left( \sqrt{2\nu}\, \frac{d}{r} \right).$$
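The Matérn form above can be evaluated directly in R using the modified Bessel function of the second kind (besselK); the parameter values and grid below are illustrative:

# Sketch: Matern covariance V_s(d) for a set of locations (illustrative parameters).
matern_cov <- function(d, sigma2 = 1, nu = 1.5, r = 50) {
  out <- matrix(sigma2, nrow(d), ncol(d))          # variance on the diagonal (d = 0)
  pos <- d > 0
  s   <- sqrt(2 * nu) * d[pos] / r
  out[pos] <- sigma2 * 2^(1 - nu) / gamma(nu) * s^nu * besselK(s, nu)
  out
}

loc <- expand.grid(x = 1:5, y = 1:5)               # a small 5 x 5 grid of locations
d   <- as.matrix(dist(loc))                        # pairwise distances
Vs  <- matern_cov(d)                               # spatial covariance matrix for these locations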

Results

Linear

Results

Quadratic

Take-home message

  • Large strip trials
  • Visualization: prior checking, posterior checking
  • Diagnosis and evaluation: LOO, PIT, Pareto $k$, $R^2$
  • NUTS: rstan
  • A systematic design is superior to a randomised design in some ways

References

Baek, Eunkyeng, S. Natasha Beretvas, Wim Van den Noortgate, and John M. Ferron. 2019. “Brief Research Report: Bayesian Versus REML Estimations With Noninformative Priors in Multilevel Single-Case Data.” The Journal of Experimental Education, March, 1–13. https://doi.org/10.1080/00220973.2018.1527280.
Besag, J., and D. Higdon. 1999. “Bayesian Analysis of Agricultural Field Experiments.” J Royal Statistical Soc B 61 (4): 691–746. https://doi.org/10.1111/1467-9868.00201.
Brooks, Steve, Andrew Gelman, Galin Jones, and Xiao-Li Meng. 2011. Handbook of Markov Chain Monte Carlo. CRC press.
Bürkner, Paul-Christian, Jonah Gabry, and Aki Vehtari. 2021. “Efficient Leave-One-Out Cross-Validation for Bayesian Non-Factorized Normal and Student-t Models.” Comput Stat 36 (2): 1243–61. https://doi.org/10.1007/s00180-020-01045-4.
Butler, DG, BR Cullis, AR Gilmour, BJ Gogel, and R Thompson. 2009. “ASReml-r Reference Manual Version 4.” The State of Queensland, Department of Primary Industries and Fisheries: Brisbane, Qld.
Che, X., and S. Xu. 2010. “Bayesian Data Analysis for Agricultural Experiments.” Can. J. Plant Sci. 90 (5): 575–603. https://doi.org/10.4141/cjps10004.
Duane, Simon, Anthony D. Kennedy, Brian J. Pendleton, and Duncan Roweth. 1987. “Hybrid Monte Carlo.” Physics Letters B 195 (2): 216–22. https://doi.org/10.1016/0370-2693(87)91197-x.
Fotheringham, A Stewart, Chris Brunsdon, and Martin Charlton. 2003. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. John Wiley & Sons.
Gelman, Andrew. 2006. “Prior Distributions for Variance Parameters in Hierarchical Models (Comment on Article by Browne and Draper).” Bayesian Analysis 1 (3): 515–34. https://doi.org/10.1214/06-BA117A.
Gelman, Andrew, Ben Goodrich, Jonah Gabry, and Aki Vehtari. 2019. “R-Squared for Bayesian Regression Models.” American Statistician 73 (3): 307–9. https://doi.org/10.1080/00031305.2018.1549100.
Gelman, Andrew, Daniel Simpson, and Michael Betancourt. 2017. “The Prior Can Often Only Be Understood in the Context of the Likelihood.” Entropy 19 (10): 555. https://doi.org/10.3390/e19100555.
Gelman, Andrew, Aki Vehtari, Daniel Simpson, Charles C Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. 2020. “Bayesian Workflow.” arXiv Preprint arXiv:2011.01808.
Geman, Stuart, and Donald Geman. 1984. “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images.” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6: 721–41.
Hoffman, Matthew D., and Andrew Gelman. 2014. “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” J. Mach. Learn. Res. 15 (1): 1593–623.
Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random Correlation Matrices Based on Vines and Extended Onion Method.” Journal of Multivariate Analysis 100 (9): 1989–2001. https://doi.org/10.1016/j.jmva.2009.04.008.
MacKay, David J. C. 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press.
Monnahan, Cole C., James T. Thorson, and Trevor A. Branch. 2017. “Faster Estimation of Bayesian Models in Ecology Using Hamiltonian Monte Carlo.” Methods in Ecology and Evolution 8 (3): 339–48. https://doi.org/10.1111/2041-210X.12681.
Neal, Radford M. 2003. “Slice Sampling.” The Annals of Statistics 31 (3): 705–67.
Piepho, Hans-Peter, Christel Richter, Joachim Spilke, Karin Hartung, and Arndt Kunick. 2011. “Statistical Aspects of on-Farm Experimentation.” Crop & Pasture Science 62: 721–35. https://doi.org/10.1071/cp11175.
Rakshit, Suman, Adrian Baddeley, Katia Stefanova, Karyn Reeves, Kefei Chen, Zhanglong Cao, Fiona Evans, and Mark Gibberd. 2020. “Novel Approach to the Analysis of Spatially-Varying Treatment Effects in on-Farm Experiments.” Field Crops Research 255 (October 2019): 107783. https://doi.org/gg2vv7.
Robert, Christian P, and George Casella. 1999. “The Metropolis—Hastings Algorithm.” In Monte Carlo Statistical Methods, 231–83. Springer.
Tsionas, Efthymios G. 2002. “Bayesian Inference in the Noncentral Student-t Model.” Journal of Computational and Graphical Statistics 11 (1): 208–21. https://doi.org/10.1198/106186002317375695.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Stat Comput 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.
Zimmerman, Dale L., and David A. Harville. 1991. “A Random Field Approach to the Analysis of Field-Plot Experiments and Other Spatial Experiments.” Biometrics 47 (1): 223–39. https://doi.org/10.2307/2532508.

Thank you.
