All Posts

Browse all articles and research updates

Launching Stats Journey – A Statistics Learning Platform

Introducing Stats Journey, an interactive learning platform for statistics and data science education built with modern web technologies.
interactive-learning statistics web-development react

Proving the Superiority of Systematic Designs in On-Farm Trials

Presentation at AAGI Annual Science Symposium on systematic design superiority in on-farm agricultural trials.
systematic-design on-farm-trials agricultural-statistics AAGI

LSD function for asreml models

Presentation at AASC 2024

This is the presentation that I gave on 4 September at AASC 2024, Rottnest Island.

Case Studies in Advanced Analysis of Large Strip On-farm Experiments

Abstract: On-farm experiments (OFE) are gaining attention among farmers and agronomists for testing various research questions on real farms. The Analytics for the Australian Grains Industry (AAGI) has developed several techniques for analysing OFE data. Geographically Weighted Regression (GWR) and the multi-environment trial (MET) technique, which partitions paddocks into pseudo-environments (PEs), have proven effective. Additionally, we have explored the potential of the Generalised Additive Model (GAM) for handling temporal and spatial variability, given its flexibility in accommodating non-linear variables. In this presentation, we will demonstrate case studies using these techniques to analyse OFE data and compare the outcomes of different approaches.

PDF.

Presentation at Pawsey Centre

This is the 5-minute presentation that I gave to the GRDC western panel and growers at the Pawsey Centre.

Analytics Innovations in On-farm Experiments


PDF.

A Deep Spatio-temporal Gaussian Process for Yield Prediction in West Australia

In the realm of precision farming, the application of Machine Learning (ML) and Deep Learning (DL) algorithms has proven to be invaluable for analyzing and modeling agricultural data that varies across both space and time. While most research has heavily leaned toward utilizing remote sensing data, particularly in image recognition, the advancements in deep learning methodologies for regression problems have been comparatively slow.

Regression techniques are crucial for large-scale agricultural trials, especially when the goal is to identify optimal management practices for specific locations within a field. To enhance predictive accuracy, it’s essential to incorporate various factors—topographical, environmental, and ground conditions—into the modeling process.

To address this gap, our recent work introduces an innovative framework that leverages spatio-temporal estimation and prediction of agricultural data through deep learning and Gaussian processes. By integrating a deep Gaussian Process component, we explicitly model the spatio-temporal structure of the data, while utilizing standard deep learning techniques to account for other influential factors.

We applied our framework to two real-world datasets, demonstrating its effectiveness and accuracy in yield prediction. This approach not only enhances our understanding of agricultural dynamics but also paves the way for more informed decision-making in precision farming.

Stay tuned for further insights as we delve deeper into the applications and implications of this research!

An Introduction to RShiny

This workshop introduces participants to RShiny, a powerful web application framework for R. Attendees will learn how to create interactive web applications that visualise data and respond to user input.

Workshop Overview

The workshop will cover the following key topics:

  • Introduction to R and Shiny: Understanding the basics of R and how it integrates with Shiny.
  • Creating a Simple Shiny App: Step-by-step guidance on building your first Shiny application.
  • UI and Server Components: Exploring the user interface (UI) and server functions that make up a Shiny app.
  • Reactive Programming: Learning how to make applications responsive to user input.
  • Data Visualization: Using libraries like ggplot2 to create dynamic plots within Shiny apps.
  • Deployment: Overview of how to deploy Shiny apps to shinyapps.io or a local server.
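
The topics above converge on a single small app. A minimal sketch of the kind of application built in the workshop (a hypothetical example, assuming the shiny and ggplot2 packages are installed; the built-in faithful data set stands in for participants' own data):

```r
library(shiny)
library(ggplot2)

# UI component: layout, a slider input, and a plot output placeholder
ui <- fluidPage(
  titlePanel("Histogram explorer"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20)
    ),
    mainPanel(plotOutput("hist"))
  )
)

# Server component: renderPlot() re-executes whenever input$bins changes,
# which is the reactive programming idea covered in the workshop
server <- function(input, output) {
  output$hist <- renderPlot({
    ggplot(faithful, aes(waiting)) + geom_histogram(bins = input$bins)
  })
}

shinyApp(ui, server)  # run locally; deployable to shinyapps.io or a Shiny Server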

Target Audience

This workshop is aimed at R users who want to enhance their data analysis skills by creating interactive web applications. No prior experience with Shiny is required, but a basic understanding of R programming is recommended.

Objectives

By the end of the workshop, participants will be able to:

  • Develop a basic Shiny application.
  • Understand the key components of Shiny apps.
  • Implement interactive features to enhance user experience.
  • Deploy their applications for public access.

Resources

Participants will be provided with:

  • Workshop materials, including example code and data.
  • Access to recorded sessions for future reference.
  • Support for any follow-up questions after the workshop.

PDF.

A Bayesian Workflow for Spatially Correlated Random Effects in On-farm Experiment

Check out the embedded Quarto HTML file below:

Optimal design for on-farm strip trials --- systematic or randomised?

Agronomists and biometricians have no doubt about the importance of randomisation in agricultural experiments. Even when agronomists extend experimentation from small trials to large on-farm trials, randomised designs predominate over systematic designs most of the time.

However, the situation may change depending on the objective of the on-farm experiment (OFE). If the goal of the OFE is to obtain a smooth map showing the optimal level of a controllable input across a grid of rows and columns covering the whole field, a systematic design should be preferred over a randomised design in terms of robustness and reliability.

Using the novel geographically weighted regression (GWR) for OFE and simulation studies, we conclude that, for large OFE strip trials, the difference between randomised and systematic designs is not significant when the response is assumed linear and spatial variation is not taken into account. However, systematic designs are superior to randomised designs when the response is assumed quadratic.

Confounded Treatment and Blocking Structure

I encountered a problem where the treatment is confounded with the blocking structure. In this situation, it is not possible to assess the interaction between the treatment and genotypes.

Example

Here is an example of the trial:

library(ggplot2)

set.seed(1234)
Rep <- rep(as.vector(replicate(2,sample(c("Rep1","Rep2","Rep3"),3))),each=5)
Gens <- paste0("Gen",1:5)
Geno <- as.vector(replicate(6, sample(Gens,5)))
TOS <- rep(c("TOS1","TOS2"),each=15)
Rows <- rep(1:5,6)
Cols <- rep(1:6,each=5)
dat <- data.frame(Rows,Cols,Rep,Geno,TOS)
ggplot(dat) + geom_tile(aes(Cols,Rows,fill=Rep))

ggplot(dat) + geom_tile(aes(Cols,Rows,fill=TOS))

ggplot(dat) + geom_tile(aes(Cols,Rows,fill=Geno))

dat$y <- 3+rnorm(length(Rep))
names(dat)
## [1] "Rows" "Cols" "Rep"  "Geno" "TOS"  "y"
nms <- c("Rows","Cols","Rep","Geno","TOS")
dat[nms] <- lapply(dat[nms], factor)
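
A quick way to see the confounding, assuming the `dat` object built above: each TOS level occupies its own contiguous block of columns, so any area effect is inseparable from the TOS effect.

```r
# TOS is completely aliased with field position: TOS1 sits in columns 1-3
# and TOS2 in columns 4-6, with 5 plots per column
with(dat, table(TOS, Cols))
##       Cols
## TOS    1 2 3 4 5 6
##   TOS1 5 5 5 0 0 0
##   TOS2 0 0 0 5 5 5
```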

The researchers would like to assess the effects of time of sowing (TOS), genotype, and their interaction on a set of traits.

Model 1

With this trial layout, the interaction cannot be assessed because the treatment is confounded with the blocking structure.

library(asreml)
mod1 <- asreml(y ~ TOS*Geno,
               random = ~ TOS:Rep,
               data = dat)
## Online License checked out Thu Mar  4 16:57:37 2021
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Thu Mar  4 16:57:37 2021
##           LogLik        Sigma2     DF     wall    cpu
##  1      -12.0709      0.654877     20 16:57:37    0.0
##  2      -12.0484      0.644739     20 16:57:37    0.0
##  3      -12.0259      0.624656     20 16:57:37    0.0
##  4      -12.0256      0.622321     20 16:57:37    0.0
plot(mod1)

wald(mod1,denDF = "default",ssType = "conditional")
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Thu Mar  4 16:57:37 2021
##           LogLik        Sigma2     DF     wall    cpu
##  1      -12.0256      0.622286     20 16:57:37    0.0
##  2      -12.0256      0.622286     20 16:57:37    0.0
##  3      -12.0256      0.622286     20 16:57:37    0.0
## $Wald
## 
## Wald tests for fixed effects.
## Response: y
## 
##             Df denDF   F.inc   F.con Margin      Pr
## (Intercept)  1     4 131.600 131.600        0.00033
## TOS          1     4   0.003   0.003      A 0.95740
## Geno         4    16   0.394   0.394      A 0.80986
## TOS:Geno     4    16   0.619   0.619      B 0.65513
## 
## $stratumVariances
##         df  Variance TOS:Rep units!R
## TOS:Rep  4 1.1778601       5       1
## units!R 16 0.6222861       0       1

Model 2

The two TOS blocks were treated as two different environments. As such, multi-environment trial (MET) analysis techniques were used, assuming heterogeneous variances for the two TOS blocks.

mod2 <- asreml(y ~ TOS + Geno,
               random = ~ at(TOS):Rep,
               data = dat)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Thu Mar  4 16:57:37 2021
##           LogLik        Sigma2     DF     wall    cpu
##  1      -12.3657      0.609972     24 16:57:37    0.0
##  2      -12.3039      0.599510     24 16:57:37    0.0
##  3      -12.2332      0.579837     24 16:57:37    0.0
##  4      -12.2288      0.575302     24 16:57:37    0.0
## Warning in asreml(y ~ TOS + Geno, random = ~at(TOS):Rep, data = dat): Some
## components changed by more than 1% on the last iteration.
plot(mod2)

wald(mod2,denDF = "default",ssType = "conditional")
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Thu Mar  4 16:57:38 2021
##           LogLik        Sigma2     DF     wall    cpu
##  1      -12.2287      0.574921     24 16:57:38    0.0
##  2      -12.2287      0.574920     24 16:57:38    0.0
##  3      -12.2287      0.574918     24 16:57:38    0.0
## $Wald
## 
## Wald tests for fixed effects.
## Response: y
## 
##             Df denDF   F.inc   F.con Margin      Pr
## (Intercept)  1   3.8 138.100 138.100        0.00039
## TOS          1   3.8   0.003   0.003      A 0.95753
## Geno         4  20.0   0.427   0.427      A 0.78766
## 
## $stratumVariances
##                   df  Variance at(TOS, TOS1):Rep at(TOS, TOS2):Rep units!R
## at(TOS, TOS1):Rep  2 1.4385847                 5     -2.297635e-17       1
## at(TOS, TOS2):Rep  2 0.9171355                 0      5.000000e+00       1
## units!R           20 0.5749179                 0      0.000000e+00       1

Hans-Peter Piepho has a paper discussing this issue, with an example showing complete confounding of area and treatment effects. So Model 1 in our case is not valid.

References

  • Hans-Peter Piepho, Mian Faisal Nazir, M. Kausar Nawaz Shah (2016). Design and analysis of a trial to select for stress tolerance. COMMUNICATIONS IN BIOMETRY AND CROP SCIENCE. VOL. 11, NO. 1, 2016, PP. 1–9

A note on estimating LMM with REML and Bayesian approaches

I noticed that REML and Bayesian approaches treat the random term \(u\) of a linear mixed model differently.

Suppose we have a linear mixed model \[\begin{equation} y=X\beta + Zu+e, \end{equation}\] with \(E[u]=0\), \(E[e]=0\) and \[\begin{equation} Var \begin{bmatrix} u \\ e \end{bmatrix} = \begin{bmatrix} G & 0 \\ 0 & R \end{bmatrix}. \end{equation}\]

REML

REML has the assumption that \[\begin{align} E[y] &= X\beta \\ Var[y] &= ZGZ^\top+R. \end{align}\] Then it calculates the BLUP in the following way that \[\begin{align} X^\top R^{-1}X\hat{\beta} + X^\top R^{-1}Z\hat{u} &= X^\top R^{-1} y \\ Z^\top R^{-1} X\hat{\beta} + (Z^\top R^{-1} Z +G^{-1})\hat{u} &= Z^\top R^{-1} y \end{align}\] which, further with \(V = ZGZ^\top + R\), are \[\begin{align} \hat{\beta} &= (X^\top V^{-1}X)^{-1} X^\top V^{-1}y\\ \hat{u} &= (Z^\top R^{-1} Z+G^{-1})^{-1}Z^\top R^{-1}(I - X(X^\top V^{-1}X)^{-1}X^\top V^{-1})y \\ &= GZ^\top V^{-1}(I - X(X^\top V^{-1}X)^{-1}X^\top V^{-1})y \end{align}\]
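
As a standalone numerical check (simulated matrices, not data from any trial in this note), the equivalence of the two expressions for \(\hat{u}\) above can be verified directly:

```r
set.seed(1)
n <- 30; q <- 4
X <- cbind(1, rnorm(n))                       # fixed-effect design
Z <- diag(q)[sample(q, n, replace = TRUE), ]  # one-hot random-effect design
G <- 2   * diag(q)                            # Var(u)
R <- 0.5 * diag(n)                            # Var(e)
y <- rnorm(n)

V  <- Z %*% G %*% t(Z) + R                    # Var(y) under REML
Vi <- solve(V); Ri <- solve(R)
P  <- diag(n) - X %*% solve(t(X) %*% Vi %*% X, t(X) %*% Vi)

u.mme    <- solve(t(Z) %*% Ri %*% Z + solve(G), t(Z) %*% Ri %*% P %*% y)
u.closed <- G %*% t(Z) %*% Vi %*% P %*% y
all.equal(c(u.mme), c(u.closed))
## [1] TRUE
```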

Bayesian

Alternatively, Bayesian analysis uses \[\begin{align} E[y] &= X\beta + Zu \\ Var[y] &= R. \end{align}\] Unlike maximum likelihood approaches, Bayesian approaches treat \(u\) as a model parameter in the same manner as \(\beta\), rather than assuming it is part of the error term. In this way, the uncertainty in its estimate can be naturally evaluated.

Then the Bayesian inference is \[\begin{equation} p(\theta\mid y) \propto p(y\mid\theta)\pi(\theta), \end{equation}\] the posterior distribution. The distribution \(p(\theta\mid y)\) constitutes the "Bayesian inference" of the parameter, because all information about \(\theta\) is contained in it. Taking the natural logarithm of the posterior, for a multivariate Gaussian likelihood, we have \[\begin{equation} L(\theta) \propto -\frac{1}{2} (y-X\beta-Zu)^\top R^{-1}(y-X\beta-Zu) -\frac{1}{2}\ln\det R + \ln \pi(\theta). \end{equation}\]
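
The two treatments of \(u\) are connected: integrating the hierarchical (Bayesian-style) model over \(u\) recovers exactly the REML marginal \(y \sim N(X\beta, ZGZ^\top + R)\). A standalone sketch with a scalar \(u\) and simulated data:

```r
set.seed(0)
n <- 3
X <- matrix(1, n, 1); beta <- 1
Z <- matrix(1, n, 1)            # a single random effect shared by all obs
g <- 2; r <- 0.5                # G = g, R = r * I
y <- rnorm(n)

# Hierarchical view: integrate p(y | u) p(u) over u numerically
dens.cond <- function(u)
  prod(dnorm(y, mean = X %*% beta + Z * u, sd = sqrt(r))) * dnorm(u, 0, sqrt(g))
marg.numeric <- integrate(Vectorize(dens.cond), -Inf, Inf)$value

# Marginal (REML-style) view: y ~ N(X beta, Z G Z' + R) directly
V <- g * Z %*% t(Z) + r * diag(n)
d <- y - c(X %*% beta)
marg.closed <- exp(-0.5 * (n * log(2 * pi) + c(determinant(V)$modulus) +
                           c(t(d) %*% solve(V, d))))
all.equal(marg.numeric, marg.closed, tolerance = 1e-4)
## [1] TRUE
```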

Presentation at SAGI Symposium 2020

This is a joint presentation by Suman and me at the SAGI Symposium 2020.

Part 1 is given by Suman about OFE and the GWR approach. Details are in the publication.

Title: On-farm Strip Trials: going beyond small plot experiments

Abstract: Accommodating spatial variation is common in analysing field trials, and has become a challenge in large on-farm experiments. Simple linear mixed models with a Gaussian distribution are incapable of dealing with complex data sets and result in misspecification. This paper describes a Bayesian framework for analysing models with spatially correlated random parameters in large on-farm experiments. With advanced model diagnostic tools, we found that the model with the Gaussian assumption is misspecified; therefore, we augment the model with a Student-t distribution. Open-source R code implementing our proposed method for analysing on-farm data is provided for further use. We also discuss the differences between the Bayesian approach and GWR, and compare the results from the two approaches.

Presentation files

Good PP check and R square but large Pareto k values

Recently, I encountered a problem when using PSIS and Pareto k for model diagnostics and selection. Some models have good posterior predictive checking performance but a few terrible Pareto k values, and sometimes a model with more high k values may indeed be misspecified.

There is a discussion on the Stan forum.

Single Site OFE

Data visualization

The data set is lasrosas.corn from the R package agridat: yield monitor data (quintals/ha) from a corn trial conducted in Argentina in 2001, in which six nitrogen rate treatments were applied in three replicated blocks comprising 18 strips across the field. The raw data are converted to a 93 rows \(\times\) 18 columns grid.

summary(gridded_df_ordered)
##        x                  y                yield            nitro
##  Min.   :-5137122   Min.   :-6258764   Min.   : 12.66   Min.   :  0.0
##  1st Qu.:-5136910   1st Qu.:-6258727   1st Qu.: 50.51   1st Qu.: 39.0
##  Median :-5136707   Median :-6258689   Median : 84.66   Median : 63.0
##  Mean   :-5136705   Mean   :-6258689   Mean   : 75.32   Mean   : 64.9
##  3rd Qu.:-5136504   3rd Qu.:-6258650   3rd Qu.: 99.52   3rd Qu.: 99.8
##  Max.   :-5136293   Max.   :-6258613   Max.   :117.90   Max.   :124.6
##
##  topo           bv         rep       nf          StripNo         RowId
##  E :350   Min.   : 91.74   R1:558   N0:279   Strip: 1:  93   Min.   : 1
##  HT:414   1st Qu.:168.10   R2:558   N1:279   Strip: 2:  93   1st Qu.:24
##  LO:424   Median :173.63   R3:558   N2:279   Strip: 3:  93   Median :47
##  W :486   Mean   :174.67            N3:279   Strip: 4:  93   Mean   :47
##           3rd Qu.:180.26            N4:279   Strip: 5:  93   3rd Qu.:70
##           Max.   :213.82            N5:279   Strip: 6:  93   Max.   :93
##                                              (Other) :1116
##       row            col           cnitro         nitro.sq       cnitro.sq
##  1      :  18   1      :  93   Min.   :-64.9   Min.   :    0   Min.   : 110.2
##  2      :  18   2      :  93   1st Qu.:-25.9   1st Qu.: 1521   1st Qu.: 204.5
##  3      :  18   3      :  93   Median : -1.9   Median : 4123   Median : 944.4
##  4      :  18   4      :  93   Mean   :  0.0   Mean   : 5875   Mean   :1663.3
##  5      :  18   5      :  93   3rd Qu.: 34.9   3rd Qu.: 9960   3rd Qu.:3564.1
##  6      :  18   6      :  93   Max.   : 59.7   Max.   :15525   Max.   :4212.0
##  (Other):1566   (Other):1116
##    gridobject       gridId
##  1      :   6   1      :   1
##  2      :   6   2      :   1
##  3      :   6   3      :   1
##  4      :   6   4      :   1
##  5      :   6   5      :   1
##  6      :   6   6      :   1
##  (Other):1638   (Other):1668
library(ggplot2)
ggplot(gridded_df_ordered) + geom_point(aes(row,col,colour=rep),size=2) + theme_bw()

ggplot(gridded_df_ordered) + geom_point(aes(row,col,colour=nf),size=2) + theme_bw()

ggplot(gridded_df_ordered) + geom_point(aes(row,col,colour=yield),size=2) + theme_bw()

ggplot(gridded_df_ordered) + geom_point(aes(row,col,colour=topo),size=2) + theme_bw()

library(ggridges)  # for geom_density_ridges()
ggplot(gridded_df_ordered, aes(x = yield, y = topo, fill = topo)) +
  geom_density_ridges()
## Picking joint bandwidth of 3.01

library(sp)          # coordinates(), spplot()
library(spatstat)    # ppp objects and Smooth()
library(colorRamps)  # matlab.like2 colour palette

countplot   <- gridded_df_ordered
countplot$x1<- as.numeric(countplot$row)
countplot$y1<- as.numeric(countplot$col)
coordinates(countplot) <- c("x1", "y1")

spplot(countplot, "yield")

countplot.ppp <- as(countplot["yield"], "ppp")
countplot.ppp.smooth <- Smooth(countplot.ppp)
plot(countplot.ppp.smooth, col=matlab.like2,  ribsep=0.01,
     ribside="bottom", main="")
contour(countplot.ppp.smooth, add=TRUE, lwd=1,
        vfont=c("sans serif", "bold italic"), labcex=0.9)

Models

df <- read.csv("SinglesiteModel.csv")
kable(df)%>%kable_styling(bootstrap_options = c("striped", "hover"))
| Model    | Fixed                       | Random                                               | Residual          | Log-likelihood |       AIC |       BIC |
|----------|-----------------------------|------------------------------------------------------|-------------------|---------------:|----------:|----------:|
| model 0  | 1                           | rep+nf                                               |                   |      -6266.143 | 12538.290 | 12554.550 |
| model 1  | 1                           | rep+nf                                               | id(row):id(col)   |      -6266.143 | 12538.290 | 12554.550 |
| model 2  | 1+lin(row)+lin(col)         | rep+nf                                               | ar1(row):ar1(col) |      -3913.398 |  7836.796 |  7863.902 |
| model 3  | 1+lin(row)                  | rep+nf+spl(row)                                      | ar1(row):ar1(col) |      -3607.354 |  7226.708 |  7259.238 |
| model 4  | 1+lin(row)                  | rep+nf+spl(row)+outl                                 | ar1(row):ar1(col) |      -3441.974 |  6897.948 |  6935.900 |
| model 5  | 1+lin(row)+nf               | rep+spl(row)+outl                                    | ar1(row):ar1(col) |      -3434.315 |  6880.630 |  6913.143 |
| model 6  | 1+lin(row)+nf+lin(row):nf   | rep+spl(row)+outl                                    | ar1(row):ar1(col) |      -3450.218 |  6912.436 |  6944.931 |
| model 7  | 1+lin(row)+nf               | rep+spl(row)+outl+nf:spl(row)                        | ar1(row):ar1(col) |      -3430.177 |  6874.353 |  6912.284 |
| model 8  | 1+cnitro+cnitro.sq          | rep                                                  | ar1(row):ar1(col) |      -3922.920 |  7853.840 |  7875.525 |
| model 9  | 1+lin(row)+cnitro+cnitro.sq | rep+spl(row)                                         | ar1(row):ar1(col) |      -3611.849 |  7233.698 |  7260.801 |
| model 10 | 1+lin(row)+cnitro+cnitro.sq | rep+spl(row)+outl                                    | ar1(row):ar1(col) |      -3446.323 |  6904.646 |  6937.170 |
| model 11 | 1+lin(row)+cnitro+cnitro.sq | rep+spl(row)+outl+cnitro:spl(row)+cnitro.sq:spl(row) | ar1(row):ar1(col) |      -3438.807 |  6893.614 |  6936.979 |

Fit treatment as random

#initial model
library(asreml)
mod0 <- asreml(yield ~ 1,
               random= ~ rep+nf,
               data=gridded_df_ordered, maxiter=20)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:00 2020
##           LogLik        Sigma2     DF     wall    cpu
##  1     -6273.834       652.480   1673 09:37:00    0.0 (2 restrained)
##  2     -6267.399       654.012   1673 09:37:00    0.0 (1 restrained)
##  3     -6266.281       654.396   1673 09:37:00    0.0 (1 restrained)
##  4     -6266.152       654.576   1673 09:37:00    0.0 (1 restrained)
##  5     -6266.144       654.571   1673 09:37:00    0.0 (1 restrained)
##  6     -6266.143       654.571   1673 09:37:00    0.0
asrmod(mod0)
## $loglik
## [1] -6266.143
##
## $AIC
## [1] 12538.29
## attr(,"parameters")
## [1] 3
##
## $BIC
## [1] 12554.55
## attr(,"parameters")
## [1] 3
plot(mod0)

mod1 <- asreml(yield ~ 1,
               random= ~ rep+nf,
               residual= ~ id(row):id(col),
               data=gridded_df_ordered)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:02 2020
##           LogLik        Sigma2     DF     wall    cpu
##  1     -6273.834       652.480   1673 09:37:02    0.0 (2 restrained)
##  2     -6267.399       654.012   1673 09:37:02    0.0 (1 restrained)
##  3     -6266.281       654.396   1673 09:37:02    0.0 (1 restrained)
##  4     -6266.152       654.576   1673 09:37:02    0.0 (1 restrained)
##  5     -6266.144       654.571   1673 09:37:02    0.0 (1 restrained)
##  6     -6266.143       654.571   1673 09:37:02    0.0
asrmod(mod1)
## $loglik
## [1] -6266.143
##
## $AIC
## [1] 12538.29
## attr(,"parameters")
## [1] 3
##
## $BIC
## [1] 12554.55
## attr(,"parameters")
## [1] 3
summary(mod1,param="gamma")$varcomp
##                  gamma    component    std.error    z.ratio bound %ch
## rep       1.011929e-07 6.623795e-05 2.293633e-06 28.8790582     B   0
## nf        4.658895e-03 3.049578e+00 3.413517e+00  0.8933832     P   0
## row:col!R 1.000000e+00 6.545712e+02 2.266595e+01 28.8790582     P   0
#check the significance of the fixed effects
wald(mod1)
## Wald tests for fixed effects.
## Response: yield
## Terms added sequentially; adjusted for those above.
##
##               Df Sum of Sq Wald statistic Pr(Chisq)
## (Intercept)    1   4129124         6308.1 < 2.2e-16 ***
## residual (MS)          655
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#run the diagnostics
plot(varioGram(mod1))

plot(mod1)

#fit the linear column and linear row effects model
mod2 <- asreml(yield ~ 1+lin(row)+lin(col),
               random= ~ rep+nf,
               residual= ~ar1(row):ar1(col),
               data=gridded_df_ordered,maxiter=30)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:04 2020
##           LogLik        Sigma2     DF     wall    cpu
##  1     -5673.261       323.517   1671 09:37:04    0.0 (3 restrained)
##  2     -4615.786       126.868   1671 09:37:04    0.0 (2 restrained)
##  3     -3997.362       134.557   1671 09:37:04    0.0 (1 restrained)
##  4     -3963.913       161.589   1671 09:37:04    0.0 (1 restrained)
##  5     -3919.907       300.788   1671 09:37:04    0.0 (2 restrained)
##  6     -3913.614       345.522   1671 09:37:04    0.0 (1 restrained)
##  7     -3913.406       337.173   1671 09:37:04    0.0 (1 restrained)
##  8     -3913.398       341.281   1671 09:37:04    0.0
##  9     -3913.398       340.286   1671 09:37:04    0.0
#diagnostics
asrmod(mod2)
## $loglik
## [1] -3913.398
##
## $AIC
## [1] 7836.796
## attr(,"parameters")
## [1] 5
##
## $BIC
## [1] 7863.902
## attr(,"parameters")
## [1] 5
# -3913.398
summary(mod2)$varcomp
##                    component    std.error   z.ratio bound %ch
## rep             3.443454e-05 5.571003e-06  6.181031     B   0
## nf              8.883445e-05 1.437211e-05  6.181031     B   0
## row:col!R       3.402861e+02 5.505330e+01  6.181031     P   0
## row:col!row!cor 9.374610e-01 1.068600e-02 87.727923     U   0
## row:col!col!cor 2.639157e-01 2.502896e-02 10.544416     U   0
wald(mod2)
## Wald tests for fixed effects.
## Response: yield
## Terms added sequentially; adjusted for those above.
##
##               Df Sum of Sq Wald statistic Pr(Chisq)
## (Intercept)    1    277682         816.03 < 2.2e-16 ***
## lin(row)       1      2981           8.76  0.003077 **
## lin(col)       1        37           0.11  0.743124
## residual (MS)          340
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(varioGram(mod2))

plot(mod2)

# lin(col) is not significant, add spline row
mod3 <- asreml(yield~ 1+lin(row),
               random= ~rep+nf+spl(row),
               residual= ~ar1(row):ar1(col),
               data=gridded_df_ordered,maxit=30)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:06 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3736.861       30.5672   1672 09:37:06    0.1 (1 restrained)
##  2     -3671.982       29.3929   1672 09:37:06    0.1
##  3     -3624.633       29.7522   1672 09:37:07    0.1
##  4     -3609.884       31.0471   1672 09:37:07    0.1
##  5     -3607.429       32.0617   1672 09:37:07    0.1
##  6     -3607.356       32.2887   1672 09:37:07    0.1
##  7     -3607.354       32.3363   1672 09:37:07    0.1
asrmod(mod3)#$loglik
## $loglik
## [1] -3607.354
##
## $AIC
## [1] 7226.708
## attr(,"parameters")
## [1] 6
##
## $BIC
## [1] 7259.238
## attr(,"parameters")
## [1] 6
# -3607.354
summary(mod3)$varcomp
##                  component  std.error    z.ratio bound %ch
## rep              0.1837729 0.36505126  0.5034168     P 0.1
## nf               5.1187964 3.41221249  1.5001400     P 0.0
## spl(row)         9.8298245 2.96299498  3.3175299     P 0.0
## row:col!R       32.3362912 1.36943864 23.6128075     P 0.0
## row:col!row!cor  0.4415356 0.02081960 21.2076837     U 0.1
## row:col!col!cor  0.1486154 0.02568646  5.7857458     U 0.0
wald(mod3)
## Wald tests for fixed effects.
## Response: yield
## Terms added sequentially; adjusted for those above.
##
##               Df Sum of Sq Wald statistic Pr(Chisq)
## (Intercept)    1    188245         5821.5 < 2.2e-16 ***
## lin(row)       1     91584         2832.2 < 2.2e-16 ***
## residual (MS)           32
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(varioGram(mod3))

plot(mod3)

###############################################
# identify outliers
mod3.aom <- update(mod3,aom=T)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:09 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3607.354       32.3452   1672 09:37:09    0.1
##  2     -3607.354       32.3458   1672 09:37:09    0.1
##  3     -3607.354       32.3470   1672 09:37:09    0.1
# gridded_df_ordered$aom <- residuals(mod3.aom)
gridded_df_ordered$aom <- mod3.aom$aom$R[,2]
ggplot(gridded_df_ordered) + geom_histogram(aes(aom),binwidth=1) +
  ggtitle("Histogram of Standardised Residuals")+theme_bw()

# ggplot(gridded_df_ordered) + geom_boxplot(aes(topo,aom)) + theme_bw()

# standardised residual plots of model 3
mod3.1 <- mod3
mod3.1$residuals<- mod3.aom$aom$R[,2]
plot(mod3.1)

temp <- gridded_df_ordered
order(gridded_df_ordered$aom)[1:10]
##  [1]   74 1490 1116 1239 1369 1670  943 1473 1531   57
wh <- which(abs(gridded_df_ordered$aom)>5)
resID <- c(74)
gridded_df_ordered[resID,]
##            x        y yield nitro topo     bv rep nf  StripNo RowId row col
## 510 -5137046 -6258749 19.41  75.4    W 177.46  R1 N3 Strip: 2     5   5   2
##     cnitro nitro.sq cnitro.sq gridobject gridId     aom
## 510   10.5  5685.16    110.25          5     74 -17.228
gridded_df_ordered$outl <- 0
gridded_df_ordered$outl[resID] <- 1
gridded_df_ordered$outl <- as.factor(gridded_df_ordered$outl)

# removing extreme outliers
gridded_df_ordered$yield[74] <- NA

mod3.2 <- asreml(yield~ 1+lin(row),
                 random= ~rep+nf+spl(row),
                 residual= ~ar1(row):ar1(col),
                 data=gridded_df_ordered,maxit=30)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:11 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3628.468       26.9187   1671 09:37:11    0.1 (1 restrained)
##  2     -3528.873       25.1916   1671 09:37:11    0.1
##  3     -3460.926       25.5634   1671 09:37:11    0.1
##  4     -3440.607       27.1242   1671 09:37:11    0.1
##  5     -3437.212       28.4171   1671 09:37:11    0.1
##  6     -3437.106       28.7254   1671 09:37:11    0.1
##  7     -3437.102       28.7894   1671 09:37:11    0.1
##  8     -3437.102       28.8016   1671 09:37:11    0.1
asrmod(mod3.2)#$loglik
## $loglik
## [1] -3437.102
##
## $AIC
## [1] 6886.204
## attr(,"parameters")
## [1] 6
##
## $BIC
## [1] 6918.731
## attr(,"parameters")
## [1] 6
summary(mod3.2)$varcomp
##                  component  std.error    z.ratio bound %ch
## rep              0.1996678 0.39200006  0.5093566     P   0
## nf               5.2028157 3.47688797  1.4964002     P   0
## spl(row)         9.3337692 2.71290752  3.4405040     P   0
## row:col!R       28.8015509 1.29788494 22.1911435     P   0
## row:col!row!cor  0.5118910 0.02010364 25.4626049     U   0
## row:col!col!cor  0.1479415 0.02593579  5.7041445     U   0
wald(mod3.2)
## Wald tests for fixed effects.
## Response: yield
## Terms added sequentially; adjusted for those above.
##
##               Df Sum of Sq Wald statistic Pr(Chisq)
## (Intercept)    1    163993         5693.9 < 2.2e-16 ***
## lin(row)       1     76586         2659.1 < 2.2e-16 ***
## residual (MS)           29
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(varioGram(mod3.2))

plot(mod3.2)

mod3.2.aom <- update(mod3.2,aom=T)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:13 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3437.102       28.8039   1671 09:37:13    0.1
##  2     -3437.102       28.8041   1671 09:37:14    0.1
##  3     -3437.102       28.8044   1671 09:37:14    0.1
gridded_df_ordered$aom2 <- mod3.2.aom$aom$R[,2]

ggplot(gridded_df_ordered) + geom_histogram(aes(aom2),binwidth=1) +
  ggtitle("Histogram of Standardised Residuals")+theme_bw()

order(gridded_df_ordered$aom2)[1:10]
##  [1] 1490 1116 1239 1369 1670  943 1473  719 1131 1531
resID <- c(74,which(abs(gridded_df_ordered$aom2)>2))
gridded_df_ordered[resID,]
##             x        y  yield nitro topo     bv rep nf   StripNo RowId row col
## 510  -5137046 -6258749     NA  75.4    W 177.46  R1 N3  Strip: 2     5   5   2
## 110  -5137080 -6258749  87.73  75.4    W 166.49  R1 N3  Strip: 2     1   1   2
## 1101 -5137122 -6258613  86.66  39.0    W 142.55  R3 N1 Strip: 18     1   1  18
## 270  -5137105 -6258647 100.52  75.4    W 139.78  R3 N3 Strip: 14     2   2  14
## 2101 -5137113 -6258613 102.16  39.0    W 131.17  R3 N1 Strip: 18     2   2  18
## 412  -5137063 -6258741  91.04  99.8    W 174.33  R1 N4  Strip: 3     4   4   3
## 460  -5137088 -6258656 110.90 124.6    W  96.39  R3 N5 Strip: 13     4   4  13
## 570  -5137080 -6258647  80.05  75.4    W 171.65  R3 N3 Strip: 14     5   5  14
## 580  -5137079 -6258639  85.22  99.8    W 172.75  R3 N4 Strip: 15     5   5  15
## 5101 -5137088 -6258614  99.69  39.0    W 171.22  R3 N1 Strip: 18     5   5  18
## 610  -5137038 -6258749 108.03  75.4    W 173.70  R1 N3  Strip: 2     6   6   2
## 650  -5137071 -6258639 108.45  99.8    W 174.05  R3 N4 Strip: 15     6   6  15
## 812  -5137029 -6258741 100.54  99.8    W 177.22  R1 N4  Strip: 3     8   8   3
## 9    -5137013 -6258758 100.44 124.6    W 179.49  R1 N5  Strip: 1     9   9   1
## 920  -5137038 -6258690  76.03  99.8    W 176.80  R2 N4  Strip: 9     9   9   9
## 1112 -5137021 -6258690  78.20  99.8    W 175.24  R2 N4  Strip: 9    11  11   9
## 121  -5136987 -6258750  81.90  75.4    W 183.94  R1 N3  Strip: 2    12  12   2
## 128  -5137004 -6258699  49.13  75.4    W 185.62  R2 N3  Strip: 8    12  12   8
## 13   -5136979 -6258758  82.70 124.6    W 180.27  R1 N5  Strip: 1    13  13   1
## 23   -5136894 -6258759  62.98 124.6    W 181.74  R1 N5  Strip: 1    23  23   1
## 259  -5136902 -6258683  36.08   0.0    W 177.30  R2 N0 Strip: 10    25  25  10
## 2615 -5136910 -6258632  36.09   0.0    W 181.43  R3 N0 Strip: 16    26  26  16
## 3315 -5136851 -6258632  31.38   0.0   HT 171.74  R3 N0 Strip: 16    33  33  16
## 3614 -5136817 -6258641  37.03  99.8   HT 188.19  R3 N4 Strip: 15    35  35  15
## 3917 -5136800 -6258616  52.90  39.0   HT 202.74  R3 N1 Strip: 18    38  38  18
## 4013 -5136783 -6258650  34.62  75.4   HT 198.83  R3 N3 Strip: 14    39  39  14
## 4117 -5136775 -6258650  43.70  75.4   HT 203.71  R3 N3 Strip: 14    40  40  14
## 4118 -5136775 -6258641  32.05  99.8   HT 204.97  R3 N4 Strip: 15    40  40  15
## 4120 -5136783 -6258624  31.79  50.6   HT 203.68  R3 N2 Strip: 17    40  40  17
## 4218 -5136775 -6258625  47.02  50.6   HT 202.14  R3 N2 Strip: 17    41  41  17
## 423  -5136742 -6258735  12.66   0.0   HT 193.16  R1 N0  Strip: 4    42  42   4
## 442  -5136725 -6258743  32.90  99.8   HT 187.79  R1 N4  Strip: 3    44  44   3
## 4516 -5136749 -6258625  50.57  50.6   HT 182.88  R3 N2 Strip: 17    44  44  17
## 4713 -5136732 -6258616  62.88  39.0   HT 179.37  R3 N1 Strip: 18    46  46  18
## 5218 -5136690 -6258625  68.11  50.6   HT 182.17  R3 N2 Strip: 17    50  50  17
## 5212 -5136673 -6258676  47.52  50.6    E 177.43  R2 N2 Strip: 11    51  51  11
## 531  -5136640 -6258753  75.83  75.4    E 177.89  R1 N3  Strip: 2    52  52   2
## 546  -5136648 -6258710  58.28 124.6    E 176.14  R2 N5  Strip: 7    53  53   7
## 547  -5136648 -6258702  74.35  75.4    E 175.50  R2 N3  Strip: 8    53  53   8
## 548  -5136657 -6258693  63.75  99.8    E 178.95  R2 N4  Strip: 9    53  53   9
## 556  -5136640 -6258710  87.78 124.6    E 177.27  R2 N5  Strip: 7    54  54   7
## 5415 -5136648 -6258634  69.74   0.0    E 173.77  R3 N0 Strip: 16    54  54  16
## 5614 -5136648 -6258642  65.65  99.8    E 177.05  R3 N4 Strip: 15    55  55  15
## 5717 -5136648 -6258617  81.37  39.0    E 173.73  R3 N1 Strip: 18    55  55  18
## 5817 -5136639 -6258617 101.34  39.0    E 174.48  R3 N1 Strip: 18    56  56  18
## 583  -5136606 -6258736 103.97   0.0    E 170.44  R1 N0  Strip: 4    57  57   4
## 5915 -5136605 -6258634  91.11   0.0    E 171.43  R3 N0 Strip: 16    59  59  16
## 6118 -5136605 -6258643  98.81  99.8    E 171.22  R3 N4 Strip: 15    60  60  15
## 6015 -5136597 -6258634 102.16   0.0    E 171.42  R3 N0 Strip: 16    60  60  16
## 635  -5136572 -6258719 105.73  39.0    E 176.47  R1 N1  Strip: 6    61  61   6
## 645  -5136564 -6258719  97.94  39.0    E 177.19  R1 N1  Strip: 6    62  62   6
## 6414 -5136580 -6258643 111.74  99.8    E 175.66  R3 N4 Strip: 15    62  62  15
## 6416 -5136588 -6258626 109.24  50.6    E 164.57  R3 N2 Strip: 17    62  62  17
## 6417 -5136588 -6258618  81.00  39.0    E 163.03  R3 N1 Strip: 18    62  62  18
## 655  -5136555 -6258719 106.84  39.0    E 169.45  R1 N1  Strip: 6    63  63   6
## 6514 -5136572 -6258643  95.77  99.8    E 167.74  R3 N4 Strip: 15    63  63  15
## 6516 -5136580 -6258626  92.21  50.6    E 168.51  R3 N2 Strip: 17    63  63  17
## 6811 -5136538 -6258669 110.54  39.0    E 161.69  R2 N1 Strip: 12    66  66  12
## 6916 -5136546 -6258626  96.32  50.6    E 167.79  R3 N2 Strip: 17    67  67  17
## 708  -5136521 -6258694  98.09  99.8    E 169.45  R2 N4  Strip: 9    68  68   9
## 7016 -5136538 -6258626 112.85  50.6    E 170.98  R3 N2 Strip: 17    68  68  17
## 7112 -5136513 -6258694 111.87  99.8    E 171.61  R2 N4  Strip: 9    69  69   9
## 7118 -5136521 -6258643  90.45  99.8    E 169.94  R3 N4 Strip: 15    69  69  15
## 7313 -5136504 -6258652  94.33  75.4    E 165.49  R3 N3 Strip: 14    71  71  14
## 734  -5136479 -6258728  93.06  50.6   LO 169.12  R1 N2  Strip: 5    72  72   5
## 7414 -5136495 -6258644  98.88  99.8    E 161.76  R3 N4 Strip: 15    72  72  15
## 744  -5136471 -6258729 108.01  50.6   LO 162.02  R1 N2  Strip: 5    73  73   5
## 7614 -5136478 -6258644 117.90  99.8   LO 162.17  R3 N4 Strip: 15    74  74  15
## 76   -5136446 -6258763  95.93 124.6   LO 169.89  R1 N5  Strip: 1    75  75   1
## 77   -5136437 -6258763 116.64 124.6   LO 159.75  R1 N5  Strip: 1    76  76   1
## 78   -5136429 -6258763  97.56 124.6   LO 163.36  R1 N5  Strip: 1    77  77   1
## 7914 -5136453 -6258644 112.94  99.8   LO 172.52  R3 N4 Strip: 15    77  77  15
## 8113 -5136428 -6258687 107.59   0.0   LO 169.97  R2 N0 Strip: 10    79  79  10
## 8118 -5136436 -6258644  87.43  99.8   LO 163.91  R3 N4 Strip: 15    79  79  15
## 813  -5136412 -6258746 112.83  99.8   LO 170.25  R1 N4  Strip: 3    80  80   3
## 8211 -5136420 -6258687  94.65   0.0   LO 165.06  R2 N0 Strip: 10    80  80  10
## 8417 -5136419 -6258619 107.22  39.0   LO 164.82  R3 N1 Strip: 18    81  81  18
## 83   -5136386 -6258763 110.56 124.6   LO 163.70  R1 N5  Strip: 1    82  82   1
## 8413 -5136411 -6258653 100.51  75.4   LO 153.08  R3 N3 Strip: 14    82  82  14
## 8414 -5136411 -6258644  92.60  99.8   LO 161.18  R3 N4 Strip: 15    82  82  15
## 8513 -5136402 -6258653  69.21  75.4   LO 163.36  R3 N3 Strip: 14    83  83  14
## 8514 -5136402 -6258644 106.21  99.8   LO 167.98  R3 N4 Strip: 15    83  83  15
## 8315 -5136402 -6258636 102.93   0.0   LO 165.34  R3 N0 Strip: 16    83  83  16
## 8612 -5136394 -6258661 102.66 124.6   LO 165.70  R3 N5 Strip: 13    84  84  13
## 8613 -5136394 -6258653 105.25  75.4   LO 164.87  R3 N3 Strip: 14    84  84  14
## 8415 -5136394 -6258636  86.67   0.0   LO 161.98  R3 N0 Strip: 16    84  84  16
## 862  -5136369 -6258746 109.52  99.8   LO 161.32  R1 N4  Strip: 3    85  85   3
## 8712 -5136385 -6258662 117.19 124.6   LO 165.81  R3 N5 Strip: 13    85  85  13
## 87   -5136352 -6258763  85.64 124.6   LO 169.63  R1 N5  Strip: 1    86  86   1
## 871  -5136352 -6258755 102.12  75.4   LO 177.52  R1 N3  Strip: 2    86  86   2
## 872  -5136361 -6258746  94.48  99.8   LO 171.21  R1 N4  Strip: 3    86  86   3
## 9214 -5136343 -6258653 109.57  75.4   LO 164.81  R3 N3 Strip: 14    90  90  14
## 9015 -5136343 -6258636  85.13   0.0   LO 167.75  R3 N0 Strip: 16    90  90  16
## 9316 -5136343 -6258628 100.36  50.6   LO 166.01  R3 N2 Strip: 17    90  90  17
## 9413 -5136326 -6258654 104.21  75.4   LO 173.38  R3 N3 Strip: 14    92  92  14
## 9414 -5136326 -6258645 107.01  99.8   LO 170.19  R3 N4 Strip: 15    92  92  15
## 959  -5136318 -6258654  85.05  75.4   LO 167.84  R3 N3 Strip: 14    93  93  14
##      cnitro nitro.sq cnitro.sq gridobject gridId        aom outl      aom2
## 510    10.5  5685.16    110.25          5     74 -17.228003    1        NA
## 110    10.5  5685.16    110.25          1      2  -1.810830    0 -2.042881
## 1101  -25.9  1521.00    670.81        187     18  -2.074970    0 -2.364591
## 270    10.5  5685.16    110.25        188     32   1.906677    0  2.251556
## 2101  -25.9  1521.00    670.81        188     36   2.386740    0  2.832189
## 412    34.9  9960.04   1218.01          4     57  -2.598582    0 -2.003388
## 460    59.7 15525.16   3564.09        190     67   2.964006    0  3.252966
## 570    10.5  5685.16    110.25        191     86  -2.071350    0 -2.365493
## 580    34.9  9960.04   1218.01        191     87  -1.774945    0 -2.112016
## 5101  -25.9  1521.00    670.81        191     90   2.328933    0  2.637118
## 610    10.5  5685.16    110.25          6     92   8.586755    0  2.611539
## 650    34.9  9960.04   1218.01        192    105   3.148979    0  3.561319
## 812    34.9  9960.04   1218.01          8    129   1.981116    0  2.234895
## 9      59.7 15525.16   3564.09          9    145   1.914733    0  2.045011
## 920    34.9  9960.04   1218.01        102    153  -1.770748    0 -2.023645
## 1112   34.9  9960.04   1218.01        104    189   1.763364    0  2.018685
## 121    10.5  5685.16    110.25         12    200   2.374815    0  2.631044
## 128    10.5  5685.16    110.25        105    206  -2.209220    0 -2.415099
## 13     59.7 15525.16   3564.09         13    217   2.432004    0  2.556345
## 23     59.7 15525.16   3564.09         23    397   1.787428    0  2.033787
## 259   -64.9     0.00   4212.01        118    442  -1.852269    0 -2.063370
## 2615  -64.9     0.00   4212.01        212    466  -2.089166    0 -2.315745
## 3315  -64.9     0.00   4212.01        219    592  -1.935853    0 -2.125031
## 3614   34.9  9960.04   1218.01        221    627  -2.301760    0 -2.603531
## 3917  -25.9  1521.00    670.81        224    684   2.200645    0  2.429073
## 4013   10.5  5685.16    110.25        225    698  -1.889680    0 -2.295096
## 4117   10.5  5685.16    110.25        226    716   1.823544    0  2.056192
## 4118   34.9  9960.04   1218.01        226    717  -1.865773    0 -2.150357
## 4120  -14.3  2560.36    204.49        226    719  -2.561293    0 -3.077060
## 4218  -14.3  2560.36    204.49        227    737   3.090783    0  3.575272
## 423   -64.9     0.00   4212.01         42    742  -2.344173    0 -2.442339
## 442    34.9  9960.04   1218.01         44    777  -1.795709    0 -2.028226
## 4516  -14.3  2560.36    204.49        230    791   1.921436    0  2.151343
## 4713  -25.9  1521.00    670.81        232    828   2.241009    0  2.415864
## 5218  -14.3  2560.36    204.49        236    899   1.981692    0  2.176197
## 5212  -14.3  2560.36    204.49        144    911  -2.063355    0 -2.373900
## 531    10.5  5685.16    110.25         52    920   1.933792    0  2.130680
## 546    59.7 15525.16   3564.09        146    943  -2.868999    0 -3.267559
## 547    10.5  5685.16    110.25        146    944   1.753662    0  2.001848
## 548    34.9  9960.04   1218.01        146    945  -2.124362    0 -2.448849
## 556    59.7 15525.16   3564.09        147    961   2.705213    0  3.176615
## 5415  -64.9     0.00   4212.01        240    970  -1.801181    0 -2.188551
## 5614   34.9  9960.04   1218.01        241    987  -2.175815    0 -2.228924
## 5717  -25.9  1521.00    670.81        241    990  -2.083140    0 -2.502616
## 5817  -25.9  1521.00    670.81        242   1008   2.268576    0  2.575574
## 583   -64.9     0.00   4212.01         57   1012   1.896104    0  2.060612
## 5915  -64.9     0.00   4212.01        245   1060  -1.882338    0 -2.188605
## 6118   34.9  9960.04   1218.01        246   1077  -1.733118    0 -2.016476
## 6015  -64.9     0.00   4212.01        246   1078   2.232872    0  2.659047
## 635   -25.9  1521.00    670.81         61   1086   1.692664    0  2.002704
## 645   -25.9  1521.00    670.81         62   1104  -1.723047    0 -2.054753
## 6414   34.9  9960.04   1218.01        248   1113   1.863626    0  2.148763
## 6416  -14.3  2560.36    204.49        248   1115   1.886120    0  2.103782
## 6417  -25.9  1521.00    670.81        248   1116  -3.462416    0 -3.804494
## 655   -25.9  1521.00    670.81         63   1122   1.764086    0  2.083405
## 6514   34.9  9960.04   1218.01        249   1131  -2.527782    0 -2.967026
## 6516  -14.3  2560.36    204.49        249   1133  -2.164896    0 -2.500754
## 6811  -25.9  1521.00    670.81        159   1182   1.909261    0  2.140685
## 6916  -14.3  2560.36    204.49        253   1205  -2.141243    0 -2.525669
## 708    34.9  9960.04   1218.01        161   1215  -1.805094    0 -2.112972
## 7016  -14.3  2560.36    204.49        254   1223   2.765228    0  3.202909
## 7112   34.9  9960.04   1218.01        162   1233   2.021065    0  2.331553
## 7118   34.9  9960.04   1218.01        255   1239  -3.225017    0 -3.680258
## 7313   10.5  5685.16    110.25        257   1274  -2.387866    0 -2.759634
## 734   -14.3  2560.36    204.49         72   1283  -2.505848    0 -2.917365
## 7414   34.9  9960.04   1218.01        258   1293  -2.050650    0 -2.386792
## 744   -14.3  2560.36    204.49         73   1301   1.928096    0  2.254462
## 7614   34.9  9960.04   1218.01        260   1329   2.334467    0  2.602642
## 76     59.7 15525.16   3564.09         75   1333  -2.316430    0 -2.687370
## 77     59.7 15525.16   3564.09         76   1351   3.710383    0  4.346406
## 78     59.7 15525.16   3564.09         77   1369  -3.022961    0 -3.621419
## 7914   34.9  9960.04   1218.01        263   1383   2.880488    0  3.350964
## 8113  -64.9     0.00   4212.01        172   1414   2.063600    0  2.345742
## 8118   34.9  9960.04   1218.01        265   1419  -2.180484    0 -2.369311
## 813    34.9  9960.04   1218.01         80   1425   3.092094    0  3.596567
## 8211  -64.9     0.00   4212.01        173   1432  -2.005950    0 -2.420092
## 8417  -25.9  1521.00    670.81        267   1458   1.795682    0  2.103211
## 83     59.7 15525.16   3564.09         82   1459   2.009552    0  2.317294
## 8413   10.5  5685.16    110.25        268   1472   3.234469    0  4.033093
## 8414   34.9  9960.04   1218.01        268   1473  -2.700483    0 -3.117826
## 8513   10.5  5685.16    110.25        269   1490  -7.622807    0 -8.716190
## 8514   34.9  9960.04   1218.01        269   1491   2.846929    0  3.368931
## 8315  -64.9     0.00   4212.01        269   1492   1.824425    0  2.138103
## 8612   59.7 15525.16   3564.09        270   1507  -1.784264    0 -2.209579
## 8613   10.5  5685.16    110.25        270   1508   3.757767    0  4.556601
## 8415  -64.9     0.00   4212.01        270   1510  -2.387898    0 -2.780062
## 862    34.9  9960.04   1218.01         85   1515   2.280352    0  2.625644
## 8712   59.7 15525.16   3564.09        271   1525   2.323150    0  2.534360
## 87     59.7 15525.16   3564.09         86   1531  -2.662678    0 -2.950584
## 871    10.5  5685.16    110.25         86   1532   1.750922    0  2.023950
## 872    34.9  9960.04   1218.01         86   1533  -1.775027    0 -2.059063
## 9214   10.5  5685.16    110.25        276   1616   2.418454    0  2.729257
## 9015  -64.9     0.00   4212.01        276   1618  -2.223945    0 -2.570816
## 9316  -14.3  2560.36    204.49        276   1619   1.943662    0  2.287110
## 9413   10.5  5685.16    110.25        278   1652   2.184576    0  2.594662
## 9414   34.9  9960.04   1218.01        278   1653   1.883662    0  2.186259
## 959    10.5  5685.16    110.25        279   1670  -3.021050    0 -3.436206
## flag the observations identified as outliers (resID holds their row indices)
gridded_df_ordered$outl <- 0
gridded_df_ordered$outl[resID] <- 1
gridded_df_ordered$outl <- as.factor(gridded_df_ordered$outl)



### re-fit, accommodating the flagged outliers via the random outl term
mod4 <- asreml(yield~ 1+lin(row),
               random= ~rep+nf+spl(row)+outl,
               residual= ~ar1(row):ar1(col),
               data=gridded_df_ordered,maxit=30)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:15 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3629.874       26.9174   1671 09:37:15    0.1 (2 restrained)
##  2     -3529.028       25.1879   1671 09:37:15    0.1 (1 restrained)
##  3     -3460.859       25.5663   1671 09:37:15    0.1 (1 restrained)
##  4     -3440.593       27.1274   1671 09:37:15    0.1 (1 restrained)
##  5     -3437.212       28.4179   1671 09:37:15    0.1 (1 restrained)
##  6     -3437.106       28.7256   1671 09:37:15    0.1
##  7     -3437.102       28.7894   1671 09:37:15    0.1
##  8     -3437.102       28.8016   1671 09:37:15    0.1
asrmod(mod4)
## $loglik
## [1] -3437.102
##
## $AIC
## [1] 6888.204
## attr(,"parameters")
## [1] 7
##
## $BIC
## [1] 6926.152
## attr(,"parameters")
## [1] 7
summary(mod4)$varcomp
##                    component    std.error    z.ratio bound %ch
## outl            2.914513e-06 1.313368e-07 22.1911349     B   0
## rep             1.996674e-01 3.919996e-01  0.5093561     P   0
## nf              5.202816e+00 3.476888e+00  1.4964004     P   0
## spl(row)        9.333772e+00 2.712907e+00  3.4405054     P   0
## row:col!R       2.880156e+01 1.297886e+00 22.1911349     P   0
## row:col!row!cor 5.118911e-01 2.010364e-02 25.4626056     U   0
## row:col!col!cor 1.479415e-01 2.593579e-02  5.7041444     U   0
wald(mod4)
## Wald tests for fixed effects.
## Response: yield
## Terms added sequentially; adjusted for those above.
##
##               Df Sum of Sq Wald statistic Pr(Chisq)
## (Intercept)    1    163993         5693.9 < 2.2e-16 ***
## lin(row)       1     76586         2659.1 < 2.2e-16 ***
## residual (MS)           29
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(varioGram(mod4))

plot(mod4)

mod4.aom <- update(mod4, aom = TRUE)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:17 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3437.102       28.8039   1671 09:37:17    0.1
##  2     -3437.102       28.8041   1671 09:37:17    0.1
##  3     -3437.102       28.8044   1671 09:37:17    0.1
# standardised residual plots of model 4
mod4.1 <- mod4
mod4.1$residuals<- mod4.aom$aom$R[,2]
plot(mod4.1)

## BLUP
pv.blup4 <- predict(mod4,classify="nf")
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:19 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3437.102       28.8039   1671 09:37:19    0.1
##  2     -3437.102       28.8041   1671 09:37:19    0.1
##  3     -3437.102       28.8044   1671 09:37:19    0.1
pv.blup4$pvals
##
## Notes:
## - The predictions are obtained by averaging across the hypertable
##   calculated from model terms constructed solely from factors in
##   the averaging and classify sets.
## - Use 'average' to move ignored factors into the averaging set.
## - lin(row) evaluated at average value of 47.000000
## - The ignored set: rep,outl
## 
## 
##   nf predicted.value std.error    status
## 1 N0        43.75732  1.230931 Estimable
## 2 N1        47.02025  1.230343 Estimable
## 3 N2        47.67621  1.230902 Estimable
## 4 N3        48.65409  1.230951 Estimable
## 5 N4        49.45452  1.230932 Estimable
## 6 N5        49.85577  1.230343 Estimable
lrt(mod3,mod4)
## Likelihood ratio test(s) assuming nested random models.
## (See Self & Liang, 1987)
##
##           df LR-statistic Pr(Chisq)
## mod4/mod3  1        340.5 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Fit the nitrogen treatment (nf) as a fixed effect

mod5 <- asreml(yield~ 1+lin(row)+nf,
               random= ~rep+spl(row)+outl,
               residual=~ar1(row):ar1(col),
               data=gridded_df_ordered,maxit=30)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:24 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3621.512       26.8404   1666 09:37:24    0.1 (2 restrained)
##  2     -3521.236       25.1403   1666 09:37:24    0.1 (1 restrained)
##  3     -3453.252       25.5384   1666 09:37:24    0.1 (1 restrained)
##  4     -3432.959       27.1156   1666 09:37:24    0.1 (1 restrained)
##  5     -3429.555       28.4169   1666 09:37:24    0.1 (1 restrained)
##  6     -3429.448       28.7258   1666 09:37:24    0.1
##  7     -3429.444       28.7899   1666 09:37:24    0.1
##  8     -3429.444       28.8022   1666 09:37:24    0.1
mod5 <- update(mod5)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:25 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3429.444       28.8046   1666 09:37:25    0.1
##  2     -3429.444       28.8047   1666 09:37:26    0.1
asrmod(mod5)
## $loglik
## [1] -3429.444
##
## $AIC
## [1] 6870.888
## attr(,"parameters")
## [1] 6
##
## $BIC
## [1] 6903.397
## attr(,"parameters")
## [1] 6
summary(mod5)$varcomp
##                    component    std.error    z.ratio bound %ch
## outl            2.914831e-06 1.313756e-07 22.1870037     B   0
## rep             1.986928e-01 3.910214e-01  0.5081378     P   0
## spl(row)        9.334449e+00 2.712630e+00  3.4411066     P   0
## row:col!R       2.880470e+01 1.298269e+00 22.1870037     P   0
## row:col!row!cor 5.119366e-01 2.010437e-02 25.4639431     U   0
## row:col!col!cor 1.478615e-01 2.593447e-02  5.7013516     U   0
wald(mod5)
## Wald tests for fixed effects.
## Response: yield
## Terms added sequentially; adjusted for those above.
##
##               Df Sum of Sq Wald statistic Pr(Chisq)
## (Intercept)    1   1216030          42216 < 2.2e-16 ***
## lin(row)       1     76582           2659 < 2.2e-16 ***
## nf             5      2759             96 < 2.2e-16 ***
## residual (MS)           29
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(varioGram(mod5))

plot(mod5)

mod6 <- asreml(yield~ 1+lin(row)+nf+lin(row):nf,
               random= ~rep+spl(row)+outl,
               residual=~ar1(row):ar1(col),
               data=gridded_df_ordered,maxit=30)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:28 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3635.245       26.6853   1661 09:37:28    0.1 (2 restrained)
##  2     -3536.439       25.0406   1661 09:37:28    0.1 (1 restrained)
##  3     -3469.082       25.4564   1661 09:37:28    0.1 (1 restrained)
##  4     -3448.874       27.0394   1661 09:37:28    0.1 (1 restrained)
##  5     -3445.461       28.3482   1661 09:37:28    0.1 (1 restrained)
##  6     -3445.351       28.6624   1661 09:37:28    0.1
##  7     -3445.347       28.7289   1661 09:37:28    0.1
##  8     -3445.347       28.7418   1661 09:37:28    0.1
mod6 <- update(mod6)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:29 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3445.347       28.7444   1661 09:37:29    0.1
##  2     -3445.347       28.7445   1661 09:37:29    0.1
asrmod(mod6)
## $loglik
## [1] -3445.347
##
## $AIC
## [1] 6902.694
## attr(,"parameters")
## [1] 6
##
## $BIC
## [1] 6935.185
## attr(,"parameters")
## [1] 6
summary(mod6)$varcomp
##                    component    std.error    z.ratio bound %ch
## outl            2.908741e-06 1.317731e-07 22.0738570     B   0
## rep             1.987786e-01 3.905289e-01  0.5089984     P   0
## spl(row)        9.329717e+00 2.713675e+00  3.4380380     P   0
## row:col!R       2.874452e+01 1.302197e+00 22.0738570     P   0
## row:col!row!cor 5.100336e-01 2.035738e-02 25.0539977     U   0
## row:col!col!cor 1.511061e-01 2.604523e-02  5.8016782     U   0
wald(mod6)
## Wald tests for fixed effects.
## Response: yield
## Terms added sequentially; adjusted for those above.
##
##               Df Sum of Sq Wald statistic Pr(Chisq)
## (Intercept)    1   1213847          42229    <2e-16 ***
## lin(row)       1     76464           2660    <2e-16 ***
## nf             5      2781             97    <2e-16 ***
## lin(row):nf    5       167              6    0.3252
## residual (MS)           29
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(varioGram(mod6))

plot(mod6)

mod7 <- asreml(yield~ 1+lin(row)+nf,
               random= ~rep+outl+spl(row)+nf:spl(row),
               residual=~ar1(row):ar1(col),
               data=gridded_df_ordered,maxit=30)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:31 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3650.042       24.2461   1666 09:37:32    0.3 (3 restrained)
##  2     -3526.211       23.8951   1666 09:37:32    0.3 (2 restrained)
##  3     -3449.071       24.6911   1666 09:37:32    0.3 (2 restrained)
##  4     -3430.819       26.4861   1666 09:37:32    0.3 (1 restrained)
##  5     -3425.773       27.6115   1666 09:37:33    0.3 (1 restrained)
##  6     -3425.339       27.8155   1666 09:37:33    0.3
##  7     -3425.309       27.8430   1666 09:37:33    0.3
##  8     -3425.309       27.8493   1666 09:37:33    0.3
mod7 <- update(mod7)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:35 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3425.309       27.8508   1666 09:37:35    0.3
##  2     -3425.309       27.8509   1666 09:37:35    0.3
asrmod(mod7)
## $loglik
## [1] -3425.309
##
## $AIC
## [1] 6864.618
## attr(,"parameters")
## [1] 7
##
## $BIC
## [1] 6902.545
## attr(,"parameters")
## [1] 7
summary(mod7)$varcomp
##                    component    std.error    z.ratio bound %ch
## outl            2.818312e-06 1.268550e-07 22.2168025     B   0
## rep             2.117725e-01 3.908947e-01  0.5417634     P   0
## spl(row)        9.389952e+00 2.723416e+00  3.4478580     P   0
## nf:spl(row)     3.043402e-03 2.337963e-03  1.3017327     P   0
## row:col!R       2.785089e+01 1.253596e+00 22.2168025     P   0
## row:col!row!cor 4.933503e-01 2.094974e-02 23.5492360     U   0
## row:col!col!cor 1.565765e-01 2.605421e-02  6.0096423     U   0
wald(mod7)
## Wald tests for fixed effects.
## Response: yield
## Terms added sequentially; adjusted for those above.
##
##               Df Sum of Sq Wald statistic Pr(Chisq)
## (Intercept)    1   1176085          42228 < 2.2e-16 ***
## lin(row)       1     78843           2831 < 2.2e-16 ***
## nf             5      2926            105 < 2.2e-16 ***
## residual (MS)           28
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(varioGram(mod7))

plot(mod7)

pv.blue7 <- predict(mod7,classify="nf")
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:37 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3425.309       27.8511   1666 09:37:37    0.3
##  2     -3425.309       27.8511   1666 09:37:38    0.3
##  3     -3425.309       27.8511   1666 09:37:38    0.3
pv.blue7$pvals
##
## Notes:
## - The predictions are obtained by averaging across the hypertable
##   calculated from model terms constructed solely from factors in
##   the averaging and classify sets.
## - Use 'average' to move ignored factors into the averaging set.
## - lin(row) evaluated at average value of 47.000000
## - The ignored set: rep,outl
## 
## 
##   nf predicted.value std.error    status
## 1 N0        42.28202  1.327859 Estimable
## 2 N1        47.89010  1.325923 Estimable
## 3 N2        48.60999  1.327754 Estimable
## 4 N3        49.20231  1.327757 Estimable
## 5 N4        49.47227  1.327859 Estimable
## 6 N5        49.02294  1.325924 Estimable
lrt(mod5,mod7)
## Likelihood ratio test(s) assuming nested random models.
## (See Self & Liang, 1987)
##
##           df LR-statistic Pr(Chisq)
## mod7/mod5  1       8.2694  0.002016 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Compare the BLUEs, BLUPs and raw means

The BLUEs come from model 7 (nf fitted as fixed) and the BLUPs from model 4 (nf fitted as random).

library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following object is masked from 'package:maps':
##
##     ozone
## The following objects are masked from 'package:dplyr':
##
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
pv.blup.df <- pv.blup4$pvals
pv.blue.df <- pv.blue7$pvals
raw.mean <- ddply(gridded_df_ordered, ~nf,
                  summarise, mean = mean(yield, na.rm = TRUE),
                  sem = round(sd(yield, na.rm = TRUE)/sqrt(length(yield[!is.na(yield)])), 2))

pv.blup.df$status <- NULL
pv.blue.df$status <- NULL
pv.blup.df$type <- "BLUP"
pv.blue.df$type <- "BLUE"
raw.mean$type <- "RAW"

# the first column is named "Variety" for convenience; here it holds the nitrogen levels (nf)
names(pv.blup.df) <- c("Variety", "mean", "std.error", "type")
names(pv.blue.df) <- c("Variety", "mean", "std.error", "type")
names(raw.mean) <- c("Variety", "mean", "std.error", "type")

lup.pv1 <- cbind(raw.mean, pv.blup.df, pv.blue.df)
names(lup.pv1) <- c("Variety", "mean", "std.error1", "type1", "Variety2", "BLUP", "std.error2", "type2",
                    "Variety3", "BLUE", "std.error3", "type3")

lup.pv2 <- rbind(raw.mean, pv.blup.df, pv.blue.df)
lup.pv2 <- lup.pv2[order(lup.pv2$type, lup.pv2$Variety, lup.pv2$mean), ]

library(lattice)  # needed for xyplot
xyplot(mean ~ as.numeric(Variety), type = 'p', data = lup.pv2, lwd = 2,
       main = "Comparison", group = type,
       xlab = "Nitrogen Level",
       ylab = "Means", auto.key = list(columns = 3), as.table = TRUE,
       # six nitrogen levels, so six axis ticks
       scales = list(x = list(at = 1:6, rot = 90, cex = 0.7, labels = levels(lup.pv2$Variety))),
       par.settings = simpleTheme(col = c("red", "gold", "purple"),
                                  pch = c(8, 6, 5), lty = 1:3, cex = 0.5))

xyplot(mean ~ as.numeric(Variety), type = c('p', 'h'), data = lup.pv2, lwd = 2,
       main = "Comparison", group = type,
       xlab = "Nitrogen Level",
       ylab = "Means", auto.key = list(columns = 3), as.table = TRUE,
       scales = list(x = list(at = 1:6, rot = 90, cex = 0.7, labels = levels(lup.pv2$Variety))),
       par.settings = simpleTheme(col = c("red", "gold", "purple"),
                                  pch = c(8, 6, 5), lty = 1:3, cex = 0.5))

lup.pv1$Variety=factor(lup.pv1$Variety)
library(lattice)
xyplot(BLUP+BLUE~mean,type=c('p', 'r'),data=lup.pv1,lwd=2,main="Comparison",
       xlab="Observed means",
       ylab="Predicted means",auto.key=list(columns=2),as.table=TRUE,
       par.settings=simpleTheme(col=c("red", "blue"),
                                pch=c(8, 6),lty=(1:2), cex=0.5))

library(ggplot2)

pd <- position_dodge(width = 0.5)
ggplot(data = lup.pv2, aes(x = Variety, y = mean, group = type, colour = type)) +
  geom_line(aes(linetype = type), position = pd, lwd = 0.8) +
  scale_color_manual(values = c("red", "blue", "gold")) +
  geom_point(size = 2, position = pd, aes(shape = type)) +
  theme_bw() +  # apply the complete theme first so it does not overwrite the tweaks below
  theme(legend.position = "top", axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(data = lup.pv2, aes(x = type, y = mean, fill = type)) +
  geom_boxplot(aes(fill = type), width = 0.85, position = position_dodge(width = 0.85)) +
  labs(y = "Yield", x = "") +
  # scale_fill_manual(values = hmcol) +
  stat_summary(fun = "mean", colour = "darkblue", geom = "point",  # fun.y was deprecated in ggplot2 3.3.0
               position = position_dodge(width = 0.85), aes(fill = type),
               shape = 18, size = 2.5) +
  theme_bw() +  # apply the complete theme before the legend tweak so it is not overwritten
  theme(legend.position = "right")

Fit the quadratic form \(\beta_0 + \beta_1 N + \beta_2 N^2\)
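The centred nitrogen terms `cnitro` and `cnitro.sq` used in this model were constructed earlier in the document; as a quick sketch, they can be reproduced from the six applied N rates visible in the data print-out above:

```r
# Sketch: reconstruct the centred quadratic terms from the six N rates
# in the data. mean(rates) = 64.9, so e.g. the 75.4 rate becomes
# cnitro = 10.5 and cnitro.sq = 110.25, matching the printed columns.
rates     <- c(0, 39.0, 50.6, 75.4, 99.8, 124.6)
cnitro    <- rates - mean(rates)   # centre on the mean applied rate
cnitro.sq <- cnitro^2              # quadratic term for the response curve
```

Centring the rate before squaring reduces the collinearity between the linear and quadratic terms, which helps the REML iterations converge.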

mod8 <- asreml(yield~ 1+cnitro+cnitro.sq,
               random= ~rep,
               residual=~ar1(row):ar1(col),
               data=gridded_df_ordered,maxit=30)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:42 2020
##           LogLik        Sigma2     DF     wall    cpu
##  1     -5941.452       447.779   1670 09:37:42    0.0 (3 restrained)
##  2     -4781.814       154.203   1670 09:37:42    0.0 (2 restrained)
##  3     -3858.124       134.311   1670 09:37:42    0.0 (1 restrained)
##  4     -3816.302       161.165   1670 09:37:42    0.0 (2 restrained)
##  5     -3755.051       314.682   1670 09:37:42    0.0 (1 restrained)
##  6     -3738.361       437.305   1670 09:37:42    0.0
##  7     -3737.767       329.934   1670 09:37:42    0.0
##  8     -3737.397       405.785   1670 09:37:42    0.0
##  9     -3737.357       358.073   1670 09:37:42    0.0
## 10     -3737.312       390.537   1670 09:37:42    0.0
## 11     -3737.301       369.058   1670 09:37:42    0.0
## 12     -3737.293       383.625   1670 09:37:42    0.0
## 13     -3737.291       373.863   1670 09:37:42    0.0
## 14     -3737.289       380.468   1670 09:37:42    0.0
## 15     -3737.289       376.024   1670 09:37:42    0.0
## 16     -3737.288       379.026   1670 09:37:42    0.0
## 17     -3737.288       377.003   1670 09:37:42    0.0
## 18     -3737.288       378.369   1670 09:37:42    0.0
asrmod(mod8)
## $loglik
## [1] -3737.288
##
## $AIC
## [1] 7482.577
## attr(,"parameters")
## [1] 4
##
## $BIC
## [1] 7504.259
## attr(,"parameters")
## [1] 4
summary(mod8)$varcomp
##                    component    std.error    z.ratio bound %ch
## rep             3.828827e-05 7.483575e-06   5.116308     B   0
## row:col!R       3.783692e+02 7.395356e+01   5.116308     P   0
## row:col!row!cor 9.544919e-01 9.285171e-03 102.797448     U   0
## row:col!col!cor 2.858123e-01 2.444813e-02  11.690557     U   0
wald(mod8)
## Wald tests for fixed effects.
## Response: yield
## Terms added sequentially; adjusted for those above.
##
##               Df Sum of Sq Wald statistic Pr(Chisq)
## (Intercept)    1    217183         574.00    <2e-16 ***
## cnitro         1       366           0.97    0.3256
## cnitro.sq      1         3           0.01    0.9249
## residual (MS)          378
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(varioGram(mod8))

mod9 <- asreml(yield~ 1+cnitro+cnitro.sq+lin(row),
               random= ~rep+spl(row),
               residual=~ar1(row):ar1(col),
               data=gridded_df_ordered,maxit=30)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:43 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3630.728       26.7968   1669 09:37:44    0.0 (1 restrained)
##  2     -3532.309       25.1043   1669 09:37:44    0.0
##  3     -3465.017       25.4771   1669 09:37:44    0.0
##  4     -3444.899       27.0167   1669 09:37:44    0.0
##  5     -3441.556       28.2771   1669 09:37:44    0.0
##  6     -3441.455       28.5690   1669 09:37:44    0.0
##  7     -3441.452       28.6285   1669 09:37:44    0.0
asrmod(mod9)
## $loglik
## [1] -3441.452
##
## $AIC
## [1] 6892.904
## attr(,"parameters")
## [1] 5
##
## $BIC
## [1] 6920.004
## attr(,"parameters")
## [1] 5
summary(mod9)$varcomp
##                  component  std.error    z.ratio bound %ch
## rep              0.2028426 0.39285386  0.5163308     P 0.2
## spl(row)         9.3422237 2.71474268  3.4412925     P 0.0
## row:col!R       28.6285460 1.28083295 22.3515065     P 0.0
## row:col!row!cor  0.5088928 0.02007327 25.3517626     U 0.1
## row:col!col!cor  0.1488067 0.02590938  5.7433552     U 0.1
wald(mod9)
## Wald tests for fixed effects.
## Response: yield
## Terms added sequentially; adjusted for those above.
##
##               Df Sum of Sq Wald statistic Pr(Chisq)
## (Intercept)    1   1203272          42031 < 2.2e-16 ***
## cnitro         1      2532             88 < 2.2e-16 ***
## cnitro.sq      1       251              9  0.003052 **
## lin(row)       1     77066           2692 < 2.2e-16 ***
## residual (MS)           29
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(varioGram(mod9))

plot(mod9)

mod10 <- asreml(yield~ 1+cnitro+cnitro.sq+lin(row),
               random= ~rep+spl(row)+outl,
               residual=~ar1(row):ar1(col),
               data=gridded_df_ordered,maxit=30)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:46 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3632.147       26.7959   1669 09:37:46    0.1 (2 restrained)
##  2     -3532.485       25.1011   1669 09:37:46    0.1 (1 restrained)
##  3     -3464.956       25.4798   1669 09:37:46    0.1 (1 restrained)
##  4     -3444.886       27.0197   1669 09:37:46    0.1 (1 restrained)
##  5     -3441.555       28.2779   1669 09:37:46    0.1 (1 restrained)
##  6     -3441.455       28.5691   1669 09:37:46    0.1
##  7     -3441.452       28.6286   1669 09:37:46    0.1
mod10<- update(mod10)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:47 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3441.452       28.6396   1669 09:37:47    0.1
##  2     -3441.452       28.6403   1669 09:37:47    0.1
asrmod(mod10)
## $loglik
## [1] -3441.452
##
## $AIC
## [1] 6894.903
## attr(,"parameters")
## [1] 6
##
## $BIC
## [1] 6927.423
## attr(,"parameters")
## [1] 6
summary(mod10)$varcomp
##                    component    std.error    z.ratio bound %ch
## outl            2.898196e-06 1.297424e-07 22.3380859     B   0
## rep             2.028607e-01 3.927744e-01  0.5164815     P   0
## spl(row)        9.344648e+00 2.714414e+00  3.4426027     P   0
## row:col!R       2.864031e+01 1.282129e+00 22.3380859     P   0
## row:col!row!cor 5.089531e-01 2.007415e-02 25.3536564     U   0
## row:col!col!cor 1.487868e-01 2.590959e-02  5.7425402     U   0
wald(mod10)
## Wald tests for fixed effects.
## Response: yield
## Terms added sequentially; adjusted for those above.
##
##               Df Sum of Sq Wald statistic Pr(Chisq)
## (Intercept)    1   1203902          42035 < 2.2e-16 ***
## cnitro         1      2529             88 < 2.2e-16 ***
## cnitro.sq      1       251              9  0.003074 **
## lin(row)       1     77020           2689 < 2.2e-16 ***
## residual (MS)           29
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(varioGram(mod10))

plot(mod10)

mod11 <- asreml(yield~ 1+cnitro+cnitro.sq+lin(row),
               random= ~rep+spl(row)+outl+cnitro:spl(row)+cnitro.sq:spl(row),
               residual=~ar1(row):ar1(col),
               data=gridded_df_ordered,maxit=30)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:49 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -4106.350       23.9321   1669 09:37:50    0.3 (4 restrained)
##  2     -3891.952       22.8173   1669 09:37:50    0.3 (3 restrained)
##  3     -3713.606       23.2858   1669 09:37:50    0.3 (3 restrained)
##  4     -3601.118       24.6452   1669 09:37:51    0.3 (3 restrained)
##  5     -3522.435       25.4587   1669 09:37:51    0.3 (3 restrained)
##  6     -3466.290       26.0618   1669 09:37:51    0.3 (2 restrained)
##  7     -3442.216       26.8583   1669 09:37:52    0.3 (2 restrained)
##  8     -3433.955       27.1907   1669 09:37:52    0.3
##  9     -3433.943       27.2382   1669 09:37:52    0.3
mod11<- update(mod11)
## Model fitted using the gamma parameterization.
## ASReml 4.1.0 Tue Jul 21 09:37:53 2020
## Spline: design points closer than 0.0092 have been merged.
##           LogLik        Sigma2     DF     wall    cpu
##  1     -3433.943       27.2472   1669 09:37:54    0.3
##  2     -3433.943       27.2475   1669 09:37:54    0.3
asrmod(mod11)
## $loglik
## [1] -3433.943
##
## $AIC
## [1] 6883.887
## attr(,"parameters")
## [1] 8
##
## $BIC
## [1] 6927.247
## attr(,"parameters")
## [1] 8
summary(mod11)$varcomp
##                       component    std.error    z.ratio bound %ch
## outl               2.757257e-06 1.211081e-07 22.7669104     B   0
## rep                2.214044e-01 3.925147e-01  0.5640665     P   0
## spl(row)           9.282358e+00 2.702708e+00  3.4344659     P   0
## cnitro:spl(row)    1.102903e-08 4.844323e-10 22.7669104     B  NA
## cnitro.sq:spl(row) 1.102903e-08 4.844323e-10 22.7669104     B  NA
## row:col!R          2.724753e+01 1.196804e+00 22.7669104     P   0
## row:col!row!cor    4.819059e-01 2.088209e-02 23.0774740     U   0
## row:col!col!cor    1.611688e-01 2.597173e-02  6.2055493     U   0
wald(mod11)
## Wald tests for fixed effects.
## Response: yield
## Terms added sequentially; adjusted for those above.
##
##               Df Sum of Sq Wald statistic Pr(Chisq)
## (Intercept)    1   1146164          42065 < 2.2e-16 ***
## cnitro         1      2736            100 < 2.2e-16 ***
## cnitro.sq      1       290             11  0.001114 **
## lin(row)       1     80351           2949 < 2.2e-16 ***
## residual (MS)           27
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(varioGram(mod11))

plot(mod11)

lrt(mod10,mod11)
## Likelihood ratio test(s) assuming nested random models.
## (See Self & Liang, 1987)
##
##             df LR-statistic Pr(Chisq)
## mod11/mod10  2       15.017 0.0001904 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
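The reported statistic can be reproduced by hand from the two log-likelihoods above. With two variance components tested on the boundary of the parameter space, the Self & Liang (1987) null distribution is a chi-square mixture with weights (1/4, 1/2, 1/4) on \(\chi^2_0\), \(\chi^2_1\) and \(\chi^2_2\). A sketch (log-likelihood values read from the asrmod() output above):

```r
# Likelihood ratio test of mod11 vs mod10, computed by hand.
ll10 <- -3441.452
ll11 <- -3433.943
lr <- 2 * (ll11 - ll10)  # 15.018, matching the reported 15.017

# Chi-square mixture p-value for two boundary variance components:
# weights 1/4, 1/2, 1/4 on chi-square with 0, 1 and 2 df
# (the chi-square with 0 df contributes nothing to the tail).
pval <- 0.50 * pchisq(lr, df = 1, lower.tail = FALSE) +
        0.25 * pchisq(lr, df = 2, lower.tail = FALSE)
round(pval, 7)  # ~0.00019, matching the reported Pr(Chisq)
```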
fit11<- fitted(mod11)
fit7 <- fitted(mod7)

plot(fit11,fit7)
abline(0,1)

cor(fit11,fit7)
## [1] NA
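The NA here means that at least one of the fitted-value vectors contains missing entries; cor() can be told to use only complete pairs. A toy illustration (the short vectors below stand in for fit11 and fit7):

```r
# cor() returns NA whenever either input contains NA.
a <- c(1, 2, NA, 4)
b <- c(1.1, 1.9, 3.2, NA)
cor(a, b)                        # NA
cor(a, b, use = "complete.obs")  # computed from the complete pairs only
```

So cor(fit11, fit7, use = "complete.obs") would give a usable comparison of the two sets of fitted values.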

Comparison

Contour plots of fitted values

Raw data

Intercept by GWR

By ASReml

By Bayesian model

Contour plots of predicted values

Predicted yield (quintals/ha) for an overall medium nitrogen treatment of 75.4 kg/ha.

Prediction by GWR

Prediction by Bayesian model

Presentation at IBS 2019

This is the presentation that I gave on 5 December at IBS 2019 in Adelaide.

Title: Model selection and principle of parsimony in statistical modelling in agriculture

Abstract: Model selection is an important issue in biostatistical, psychological and agricultural studies. Root mean squared error (RMSE), Akaike’s information criterion (AIC), Bayesian information criterion (BIC) and their relatives are commonly used as goodness-of-fit criteria for selecting statistical models. However, no single technique is robust across every aspect of parameter estimation and model selection. Sometimes the winning model is “cursed”: the model that is best according to the selection criteria leads to over-fitting in practice. Goodness-of-fit must therefore be balanced against model complexity. We discuss this trap in model selection and the principle of parsimony, and present a weighted neighbouring cross-validation method, which we illustrate on an agricultural experimental data set.

PDF.

The Optimal We Found Is Not The Best We Want

Data Visualization

The yield data are from the paper Chen, O’Leary, and Evans (2019). The data were collected from two sites, CNT and NVT; here we focus on the NVT site.

NVTData <- read.table("NVTData.txt",header = TRUE)
plot(NVTData$Wavail,NVTData$yield,cex=0.1)

datOR    <- NVTData[order(NVTData$Wavail),]
knotsNum <- length(table(datOR$Wavail)) # unique knots

The NVT data comprise the variables lat, longi, year, soils, Wavail and yield, with 362 unique values of available water (Wavail).

Model Fitting by gam()

gam refers to a generalized additive model with integrated smoothness estimation (Wood 2006). The function gam() is provided by the mgcv package (Wood 2015).

fitgam.tp1 <- gam(yield ~ s(Wavail, bs = "tp"),data=datOR)
plot(datOR$Wavail,datOR$yield,cex=0.1,col="gray")
points(datOR$Wavail,fitgam.tp1$fitted.values,type="l",lwd=2)

In the model above, the basis is tp, a thin plate regression spline. In the following models, cr denotes a cubic regression spline and gp a Gaussian process smooth with one of several simple correlation functions.

fitgam.tp2 <- gam(yield ~ s(Wavail,k=3, bs = "tp"), data = datOR)
fitgam.tp3 <- gam(yield ~ s(Wavail,k=10, bs = "tp"),data = datOR)
fitgam.tp4 <- gam(yield ~ s(Wavail,k=50, bs = "tp"),data = datOR)
fitgam.tp5 <- gam(yield ~ s(Wavail,k=100, bs = "tp"),data = datOR)
plot(datOR$Wavail,datOR$yield,cex=0.1,col="gray")
points(datOR$Wavail,fitgam.tp2$fitted.values,type="l",col="red",lwd=2)
points(datOR$Wavail,fitgam.tp3$fitted.values,type="l",col="blue",lwd=2)
points(datOR$Wavail,fitgam.tp4$fitted.values,type="l",col="green",lwd=2)
points(datOR$Wavail,fitgam.tp5$fitted.values,type="l",col="brown",lwd=2)

AIC(fitgam.tp2,fitgam.tp3,fitgam.tp4,fitgam.tp5)
##                    df      AIC
## fitgam.tp2   3.998908 30748.05
## fitgam.tp3  10.933138 30492.71
## fitgam.tp4  50.120859 29182.53
## fitgam.tp5 100.266214 26895.97
BIC(fitgam.tp2,fitgam.tp3,fitgam.tp4,fitgam.tp5)
##                    df      BIC
## fitgam.tp2   3.998908 30777.73
## fitgam.tp3  10.933138 30573.87
## fitgam.tp4  50.120859 29554.61
## fitgam.tp5 100.266214 27640.31

A different basis function: cr.

fitgam.cr1 <- gam(yield ~ s(Wavail, bs = "cr"),data=datOR)
plot(datOR$Wavail,datOR$yield,cex=0.1,col="gray")
points(datOR$Wavail,fitgam.cr1$fitted.values,type="l",lwd=2)

fitgam.cr2 <- gam(yield ~ s(Wavail,k=3, bs = "cr"), data = datOR)
fitgam.cr3 <- gam(yield ~ s(Wavail,k=10, bs = "cr"),data = datOR)
fitgam.cr4 <- gam(yield ~ s(Wavail,k=50, bs = "cr"),data = datOR)
fitgam.cr5 <- gam(yield ~ s(Wavail,k=100, bs = "cr"),data = datOR)
plot(datOR$Wavail,datOR$yield,cex=0.1,col="gray")
points(datOR$Wavail,fitgam.cr2$fitted.values,type="l",col="red",lwd=2)
points(datOR$Wavail,fitgam.cr3$fitted.values,type="l",col="blue",lwd=2)
points(datOR$Wavail,fitgam.cr4$fitted.values,type="l",col="green",lwd=2)
points(datOR$Wavail,fitgam.cr5$fitted.values,type="l",col="brown",lwd=2)

AIC(fitgam.cr2,fitgam.cr3,fitgam.cr4,fitgam.cr5)
##                    df      AIC
## fitgam.cr2   3.999819 30755.26
## fitgam.cr3  10.967775 30579.91
## fitgam.cr4  50.613471 29010.34
## fitgam.cr5 100.517310 27075.68
BIC(fitgam.cr2,fitgam.cr3,fitgam.cr4,fitgam.cr5)
##                    df      BIC
## fitgam.cr2   3.999819 30784.96
## fitgam.cr3  10.967775 30661.33
## fitgam.cr4  50.613471 29386.08
## fitgam.cr5 100.517310 27821.88

A different basis function: gp.

fitgam.gp2 <- gam(yield ~ s(Wavail,k=3, bs = "gp"), data = datOR)
fitgam.gp3 <- gam(yield ~ s(Wavail,k=10, bs = "gp"),data = datOR)
fitgam.gp4 <- gam(yield ~ s(Wavail,k=50, bs = "gp"),data = datOR)
fitgam.gp5 <- gam(yield ~ s(Wavail,k=100, bs = "gp"),data = datOR)
plot(datOR$Wavail,datOR$yield,cex=0.1,col="gray")
points(datOR$Wavail,fitgam.gp2$fitted.values,type="l",col="red",lwd=2)
points(datOR$Wavail,fitgam.gp3$fitted.values,type="l",col="blue",lwd=2)
points(datOR$Wavail,fitgam.gp4$fitted.values,type="l",col="green",lwd=2)
points(datOR$Wavail,fitgam.gp5$fitted.values,type="l",col="brown",lwd=2)

AIC(fitgam.gp2,fitgam.gp3,fitgam.gp4,fitgam.gp5)
##                   df      AIC
## fitgam.gp2  3.999795 30774.97
## fitgam.gp3 10.968502 30525.49
## fitgam.gp4 21.906114 29926.96
## fitgam.gp5 22.203024 29924.30
BIC(fitgam.gp2,fitgam.gp3,fitgam.gp4,fitgam.gp5)
##                   df      BIC
## fitgam.gp2  3.999795 30804.67
## fitgam.gp3 10.968502 30606.92
## fitgam.gp4 21.906114 30089.58
## fitgam.gp5 22.203024 30089.13

Comparison of different selection criteria

aics.tp <- numeric(knotsNum)
gcvs.tp <- numeric(knotsNum)
devs.tp <- numeric(knotsNum)
aics.gp <- numeric(knotsNum)
gcvs.gp <- numeric(knotsNum)
devs.gp <- numeric(knotsNum)
for(kts in 3:knotsNum){
  fitgam.tp.temp <- gam(yield ~ s(Wavail,k=kts, bs = "tp"), 
                        data = datOR)
  aics.tp[kts] <- fitgam.tp.temp$aic
  gcvs.tp[kts] <- fitgam.tp.temp$gcv.ubre
  devs.tp[kts] <- fitgam.tp.temp$deviance
  fitgam.gp.temp <- gam(yield ~ s(Wavail,k=kts, bs = "gp"),
                        data = datOR)
  aics.gp[kts] <- fitgam.gp.temp$aic
  gcvs.gp[kts] <- fitgam.gp.temp$gcv.ubre
  devs.gp[kts] <- fitgam.gp.temp$deviance
  cat(kts);
}
knotsNum <- 50 # restrict the plots to the first 50 knot settings
par(mfrow=c(1,3))
plot(3:knotsNum,aics.tp[3:knotsNum])
plot(3:knotsNum,gcvs.tp[3:knotsNum])
plot(3:knotsNum,devs.tp[3:knotsNum])

par(mfrow=c(1,1))

par(mfrow=c(1,3))
plot(3:knotsNum,aics.gp[3:knotsNum])
plot(3:knotsNum,gcvs.gp[3:knotsNum])
plot(3:knotsNum,devs.gp[3:knotsNum])

par(mfrow=c(1,1))

fitgam.gp6 <- gam(yield ~ s(Wavail,k=22, bs = "gp"),data = datOR)
plot(datOR$Wavail,datOR$yield,cex=0.1,col="gray")
points(datOR$Wavail,fitgam.gp6$fitted.values,type="l",col="red",lwd=2)

However, the contradiction still exists

plot(datOR$Wavail,datOR$yield,cex=0.1,col="gray")
points(datOR$Wavail,fitgam.gp2$fitted.values,type="l",col="red",lwd=2)
points(datOR$Wavail,fitgam.gp3$fitted.values,type="l",col="blue",lwd=2)
points(datOR$Wavail,fitgam.gp4$fitted.values,type="l",col="green",lwd=2)
points(datOR$Wavail,fitgam.gp5$fitted.values,type="l",col="brown",lwd=2)

A possible solution: aggregated data

datOR.agg <- aggregate(datOR[c("lat","longi","year","Wavail","yield")],list(datOR$Wavail),mean)
names(datOR.agg)[1] <- c("WavailAgg")
plot(datOR$Wavail,datOR$yield,cex=0.1,pch=20,col="grey")
points(datOR.agg$WavailAgg,datOR.agg$yield,pch=20,col="red")

By taking the mean for each unique Wavail, we obtain an aggregated data set with 362 rows rather than 12,377.

aics.agg <- numeric(knotsNum)
gcvs.agg <- numeric(knotsNum)
devs.agg <- numeric(knotsNum)
for(kts in 3:knotsNum){
  fitgam.agg    <- gam(yield ~ s(Wavail,k=kts,bs = "gp"), 
                       data=datOR.agg)
  aics.agg[kts] <- fitgam.agg$aic
  gcvs.agg[kts] <- fitgam.agg$gcv.ubre
  devs.agg[kts] <- fitgam.agg$deviance
  cat(kts);
}
par(mfrow=c(1,3))
plot(3:knotsNum,aics.agg[-c(1:2)],type="b")
plot(3:knotsNum,gcvs.agg[-c(1:2)],type="b")
plot(3:knotsNum,devs.agg[-c(1:2)],type="b")

par(mfrow=c(1,1))
fitgam.gp.agg1 <- gam(yield ~ s(Wavail,k=3, bs = "gp"), data=datOR.agg)
fitgam.gp.agg2 <- gam(yield ~ s(Wavail,k=5, bs = "gp"), data=datOR.agg)
fitgam.gp.agg3 <- gam(yield ~ s(Wavail,k=10,bs = "gp"), data=datOR.agg)
plot(datOR.agg$Wavail,datOR.agg$yield,cex=0.1,pch=20)
points(datOR.agg$Wavail,fitgam.gp.agg1$fitted.values,type="l",col="red",lwd=2)
points(datOR.agg$Wavail,fitgam.gp.agg2$fitted.values,type = "l",col="blue",lwd=2)
points(datOR.agg$Wavail,fitgam.gp.agg3$fitted.values,type = "l",col="green",lwd=2)

Another solution: taking a subset

By taking a subset of size \(N\) from the original data set, we may reduce the variance.

set.seed(2019)
s<- sort(sample(1:nrow(datOR),1000))
datOR.subset  <- datOR[s,]
knotsNum  <- 50
aics.subset <- numeric(knotsNum)
gcvs.subset <- numeric(knotsNum)
divs.subset <- numeric(knotsNum)
for(kts in 3:knotsNum){
  fitgam.subset    <- gam(yield ~ s(Wavail,k=kts, bs = "gp"), 
                            data=datOR.subset)
  aics.subset[kts] <- fitgam.subset$aic
  gcvs.subset[kts] <- fitgam.subset$gcv.ubre
  divs.subset[kts] <- fitgam.subset$deviance
}
par(mfrow=c(1,3))
plot(aics.subset[-c(1,2)])
plot(gcvs.subset[-c(1,2)])
plot(divs.subset[-c(1,2)])

par(mfrow=c(1,1))
fitgam.subset <- gam(yield ~ s(Wavail,k=20, bs = "gp"), 
                        data=datOR.subset)
plot(datOR.subset$Wavail,datOR.subset$yield,cex=0.5,pch=20,col="grey")
points(datOR.subset$Wavail,fitgam.subset$fitted.values,type="l")

Bootstrap and naive resampling

datOR.sample <- NULL
wavs <- names(table(datOR$Wavail))
set.seed(2019)
for(k in wavs){
  tempdat<- datOR[which(datOR$Wavail == k),]
  s<- sort(sample(1:nrow(tempdat),1))
  datOR.sample<-rbind(datOR.sample,tempdat[s,])
}
fitgam.sample <- gam(yield  ~ s(Wavail,k=30,bs = "gp"), 
                     data=datOR.sample)

Taking one observation at random for each unique Wavail gives a subset of 362 rows, similar in spirit to the aggregated-data approach.

plot(datOR.sample$Wavail,datOR.sample$yield,cex=0.5,pch=20,col="grey")
points(datOR.sample$Wavail,fitgam.sample$fitted.values,type = "l")

fitgam.gp.sample1 <- gam(yield ~ s(Wavail,k=3,bs = "gp"), data=datOR.sample)
fitgam.gp.sample2 <- gam(yield ~ s(Wavail,k=10,bs = "gp"), data=datOR.sample)
fitgam.gp.sample3 <- gam(yield ~ s(Wavail,k=50,bs = "gp"), data=datOR.sample)
fitgam.gp.sample4 <- gam(yield ~ s(Wavail,k=100,bs = "gp"), data=datOR.sample)
plot(datOR.sample$Wavail,datOR.sample$yield,cex=0.1,pch=20,col="grey")
points(datOR.sample$Wavail,fitgam.gp.sample1$fitted.values,type = "l",col="red",lwd=2)
points(datOR.sample$Wavail,fitgam.gp.sample2$fitted.values,type = "l",col="blue",lwd=2)
points(datOR.sample$Wavail,fitgam.gp.sample3$fitted.values,type = "l",col="green",lwd=2)
points(datOR.sample$Wavail,fitgam.gp.sample4$fitted.values,type = "l",col="brown",lwd=2)

AIC(fitgam.gp.sample1,fitgam.gp.sample2,fitgam.gp.sample3,fitgam.gp.sample4)
##                         df      AIC
## fitgam.gp.sample1 3.979491 886.6777
## fitgam.gp.sample2 5.735994 884.1414
## fitgam.gp.sample3 6.161752 883.8681
## fitgam.gp.sample4 6.720663 883.9109
BIC(fitgam.gp.sample1,fitgam.gp.sample2,fitgam.gp.sample3,fitgam.gp.sample4)
##                         df      BIC
## fitgam.gp.sample1 3.979491 902.1644
## fitgam.gp.sample2 5.735994 906.4638
## fitgam.gp.sample3 6.161752 907.8474
## fitgam.gp.sample4 6.720663 910.0654

Repeat the bootstrap sampling (naive resampling) 30 times.

N <- 30
boost.max1 <- matrix(0,N,362)
boost.max2 <- matrix(0,N,362)
boost.max3 <- matrix(0,N,362)
boost.max4 <- matrix(0,N,362)
rowNms <- matrix(0,N,362)
set.seed(2019)
for(i in 1:N){
  datOR.boost <- NULL
  for(k in wavs){
    tempdat<- datOR[which(datOR$Wavail == k),]
    s<- sort(sample(1:nrow(tempdat),1))
    datOR.boost<-rbind(datOR.boost,tempdat[s,])
  }
  fitgam.gp.boost1 <- gam(yield ~ s(Wavail,k=3,bs="gp"),  data=datOR.boost)
  fitgam.gp.boost2 <- gam(yield ~ s(Wavail,k=10,bs="gp"), data=datOR.boost)
  fitgam.gp.boost3 <- gam(yield ~ s(Wavail,k=50,bs="gp"), data=datOR.boost)
  fitgam.gp.boost4 <- gam(yield ~ s(Wavail,k=100,bs="gp"),data=datOR.boost)
  boost.max1[i,] <- fitgam.gp.boost1$fitted.values
  boost.max2[i,] <- fitgam.gp.boost2$fitted.values
  boost.max3[i,] <- fitgam.gp.boost3$fitted.values
  boost.max4[i,] <- fitgam.gp.boost4$fitted.values
  rowNms[i,]     <- row.names(datOR.boost)
}
hist(as.integer(rowNms),30)

plot(datOR$Wavail,datOR$yield,cex=0.5,pch=20,col="grey")
points(wavs,apply(boost.max1,2,mean),type="l",col="red",lwd=2)
points(wavs,apply(boost.max2,2,mean),type="l",col="blue",lwd=2)
points(wavs,apply(boost.max3,2,mean),type="l",col="green",lwd=2)
points(wavs,apply(boost.max4,2,mean),type="l",col="brown",lwd=2)

plot(datOR$Wavail,datOR$yield,cex=0.1,pch=20,col="grey")
for(k in 1:N) points(wavs,boost.max1[k,],col=grey(0.5),type="l")

plot(datOR$Wavail,datOR$yield,cex=0.1)
for(k in 1:N) points(wavs,boost.max2[k,],col=grey(0.5),type="l")

plot(datOR$Wavail,datOR$yield,cex=0.1)
for(k in 1:N) points(wavs,boost.max3[k,],col=grey(0.5),type="l")

plot(datOR$Wavail,datOR$yield,cex=0.1)
for(k in 1:N) points(wavs,boost.max4[k,],col=grey(0.5),type="l")

Comparison of the \(AIC\)s, \(BIC\)s and deviances collected from the 30 resampled fits.

plot(aics.max1,type="b",col="red")
points(aics.max2,type="b",col="blue")
points(aics.max3,type="b",col="green")
points(aics.max4,type="b",col="brown")

mean(aics.max1);mean(aics.max2);mean(aics.max3);mean(aics.max4)
## [1] 910.6492
## [1] 909.2509
## [1] 908.8717
## [1] 908.7469
plot(bics.max1,type="b",col="red")
points(bics.max2,type="b",col="blue")
points(bics.max3,type="b",col="green")
points(bics.max4,type="b",col="brown")

mean(bics.max1);mean(bics.max2);mean(bics.max3);mean(bics.max4)
## [1] 926.1302
## [1] 930.5354
## [1] 941.4827
## [1] 938.555
plot(devs.max1,type="b",col="red")
points(devs.max2,type="b",col="blue")
points(devs.max3,type="b",col="green")
points(devs.max4,type="b",col="brown")

mean(devs.max1);mean(devs.max2);mean(devs.max3);mean(devs.max4)
## [1] 256.7962
## [1] 253.7053
## [1] 249.4922
## [1] 250.3439

This approach is NOT robust.

A cutting-off CV or stratified CV

In stratified cross-validation (SCV), the values \(\mathbf{y}_i\), denoting all measured time points of an individual \(i\), are removed from the data set for testing. The remaining data \(\mathbf{y}^{-i}\) are used to train the model.

The cutting-off CV instead takes a block of consecutive individuals \(i,\ldots,i+k\) out of the data set for testing.

For example

tempTests <- subset(datOR,datOR$Wavail %in% wavs[91:100])
plot(datOR$Wavail,datOR$yield,pch=20,cex=0.5,col="grey")
points(tempTests$Wavail,tempTests$yield,pch=20,col="red")
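The cross-validation loop below calls an RMSE() helper that is never defined in this post; it may come from a package such as caret. A minimal stand-in, assuming the usual root-mean-squared-error definition:

```r
# Stand-in for the undefined RMSE() helper used in the CV loop below;
# caret::RMSE(pred, obs) behaves the same way.
RMSE <- function(pred, obs) sqrt(mean((pred - obs)^2))
```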

wavs <- names(table(datOR$Wavail))
knotsNum <- length(wavs)
cv.train <- matrix(0,knotsNum,knotsNum)
cv.tests <- matrix(0,knotsNum,knotsNum)

kfo <- 5
sq1 <- seq(3,30,by=1)
sq2 <- 2:(knotsNum-kfo)

for(nkts in sq1){
  for(st in sq2){
    s         <- st:(st+kfo-1)
    tempTrain <- subset(datOR,!(datOR$Wavail %in% wavs[s]))
    tempTests <- subset(datOR,datOR$Wavail %in% wavs[s])
    fitgam.gp.cv <- gam(yield ~ s(Wavail,k=nkts), data=tempTrain)
    cv.train[nkts,st] <- RMSE(fitgam.gp.cv$fitted.values,tempTrain$yield)
    preTest <- predict(fitgam.gp.cv,data.frame(Wavail = tempTests$Wavail))
    cv.tests[nkts,st] <- RMSE(preTest,tempTests$yield)
  }
  cat(nkts);
}
plot(sq1,apply(cv.train[sq1,sq2],1,mean),ylim=c(0.75,0.9),pch=20,type="b")
points(sq1,apply(cv.tests[sq1,sq2],1,mean),col="red",pch=20,type="b")
legend("top", legend=c("Training Error", "Testing error"), col=c("black", "red"),pch=c(20,20))

The training error and testing error are balanced at around 15 knots.
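Refitting at the balanced knot number gives a final model. A minimal sketch (NVTData.txt is not distributed with this post, so simulated data stand in for datOR here; k = 15 follows the cross-validation result):

```r
library(mgcv)

# Simulated stand-in for datOR (the real data come from NVTData.txt).
set.seed(2019)
demo <- data.frame(Wavail = sort(runif(500, 50, 400)))
demo$yield <- 1 + 0.015 * demo$Wavail - 2e-5 * demo$Wavail^2 +
  rnorm(500, sd = 0.5)

# Refit with the knot number chosen by the cutting-off CV.
fitgam.cv <- gam(yield ~ s(Wavail, k = 15, bs = "gp"), data = demo)
summary(fitgam.cv)$s.table  # edf and significance of the smooth
```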

Constrained Generalized Additive Model Fitting

The cgam package contains routines to fit generalized additive models in which the components may be modelled with shape and smoothness assumptions. The function s.incr.conc() specifies a smooth, increasing and concave shape restriction in a cgam() formula (Liao and Meyer 2019).

According to the values of the predictor vector and its shape, numknots, knots and space attributes, the cone edges are constructed from C-spline basis functions, where C-spline refers to a convex regression spline (Meyer 2008; Meyer 2013).

library(cgam)
fitcgam1 <- cgam(yield ~ s.incr.conc(Wavail,numknots=3),  data=datOR)
fitcgam2 <- cgam(yield ~ s.incr.conc(Wavail,numknots=10), data=datOR)
fitcgam3 <- cgam(yield ~ s.incr.conc(Wavail,numknots=50), data=datOR)
fitcgam4 <- cgam(yield ~ s.incr.conc(Wavail,numknots=100),data=datOR)
plot(datOR$Wavail,datOR$yield,cex=0.5,pch=20,col="grey")
points(datOR$Wavail,fitcgam1$muhat,type = "l",lwd=2,col="red")
points(datOR$Wavail,fitcgam2$muhat,type = "l",lwd=2,col="blue")
points(datOR$Wavail,fitcgam3$muhat,type = "l",lwd=2,col="green")
points(datOR$Wavail,fitcgam4$muhat,type = "l",lwd=2,col="brown")

fitcgam1$deviance
## [1] 8815.735
fitcgam2$deviance
## [1] 8630.178
fitcgam3$deviance
## [1] 8584.179
fitcgam4$deviance
## [1] 8579.372

A drawback of cgam is that it does not handle the interaction of two variables, whereas gam does.

fitcgam1.s <- cgam(yield ~ s.incr.conc(Wavail,numknots=3) + 
                   s(lat,longi, numknots = 6), data=datOR)
# system crash

Comparison of gam() and cgam() on deviance

kts     <- seq(3,40)
gamDev  <- numeric(102)
cgamDev <- numeric(102)
for(k in kts){
  gamfit    <- gam(yield ~ s(Wavail,k=k,bs="gp"), data=datOR)
  gamDev[k] <- gamfit$deviance
  cat(k);
}
for(k in kts){
  cgamfit   <- cgam(yield ~ s.incr.conc(Wavail,numknots=k), data=datOR)
  cgamDev[k]<- cgamfit$deviance
  cat(k);
}
kts     <- seq(3,40)
plot(kts,gamDev[kts],type="b",col="red",ylim=c(8000,9000))
points(kts,cgamDev[kts],type="b",col="blue")

The deviance from cgam() is more stable than that from gam().

Shape constrained additive models (SCAM)

Similar to cgam, scam fits shape constrained additive models (SCAM). A SCAM is a generalized linear model (GLM) in which the linear predictor is given by strictly parametric components plus a sum of smooth functions of the covariates, where some of the functions are assumed to be shape constrained (Pya and Wood 2015; Pya 2010).

fitscam1 <- scam(yield ~ s(Wavail,k=4, bs="cv"),  data=datOR)
fitscam2 <- scam(yield ~ s(Wavail,k=10, bs="cv"), data=datOR)
fitscam3 <- scam(yield ~ s(Wavail,k=50, bs="cv"), data=datOR)
fitscam4 <- scam(yield ~ s(Wavail,k=100, bs="cv"),data=datOR)

A scam formula is exactly like a gam formula except that it may contain shape-constrained smooth terms; bs="cv" indicates a concave P-spline.

plot(datOR$Wavail,datOR$yield,cex=0.5,pch=20,col="grey")
points(datOR$Wavail,fitscam1$fitted.values,type = "l",lwd=2,col="red")
points(datOR$Wavail,fitscam2$fitted.values,type = "l",lwd=2,col="blue")
points(datOR$Wavail,fitscam3$fitted.values,type = "l",lwd=2,col="green")
points(datOR$Wavail,fitscam4$fitted.values,type = "l",lwd=2,col="brown")

fitscam1$aic;fitscam2$aic;fitscam3$aic;fitscam4$aic
## [1] 30828.17
## [1] 30670.38
## [1] 30596.57
## [1] 30590.81
# BIC(fitscam1,fitscam2,fitscam3,fitscam4)

If we compare cgam and scam on the same amount of knots:

fitcgam2 <- cgam(yield ~ s.incr.conc(Wavail,numknots=10), data=datOR)
fitscam2 <- scam(yield ~ s(Wavail,k=10, bs="cv"), data=datOR)
fitcgam2$deviance;fitscam2$deviance
## [1] 8630.178
## [1] 8627.923
plot(datOR$Wavail,datOR$yield,cex=0.5,pch=20,col="grey")
points(datOR$Wavail,fitcgam2$muhat,type = "l",lwd=2,col="red")
points(datOR$Wavail,fitscam2$fitted.values,type = "l",lwd=2,col="blue")

Here we propose a research topic: a comparison of gam, cgam, scam (and asreml) on the performance of fitting longitudinal (repeatedly measured) data sets.

An additive Gaussian process regression model

Gaussian processes extend the multivariate Gaussian distribution to infinite collections of real-valued variables, any finite number of which have a joint Gaussian distribution. Gaussian process regression places a probability distribution over functions, fully defined by its mean \(m(t)\) and covariance \(K(s,t)\) functions as \[\begin{align} m(t)&=\mathrm{E} \lbrack f(t)\rbrack \\ K(s,t)&=\mathrm{E} \lbrack \left(f(s)-m(s)\right) \left(f(t)-m(t)\right)\rbrack, \end{align}\] where \(s\) and \(t\) are two inputs. A function \(f\) distributed as such is denoted in the form \[\begin{equation} f \sim \mathrm{GP}\left(m(t),K(s,t) \right). \end{equation}\] Usually the mean function is assumed to be zero everywhere.

Given a set of input variables \(\mathbf{t} = \left\lbrace t_1,\ldots,t_n\right\rbrace\) for the function \(f(t)\) and the output \(\mathbf{y}=f(\mathbf{t})+\varepsilon\) with i.i.d. Gaussian noise \(\varepsilon\) of variance \(\sigma_n^2\), we can use the above definition to predict the value of the function \(f_*=f(t_*)\) at a particular input \(t_*\). Since the noisy observations satisfy \[\begin{equation} \mathrm{Cov} (y_p,y_q) = K(t_p,t_q)+\sigma_n^2 \delta_{pq} \end{equation}\] where \(\delta_{pq}\) is the Kronecker delta, which is one if and only if \(p=q\) and zero otherwise, the joint distribution of the observed outputs \(\mathbf{y}\) and the estimated output \(f_*\) under the prior is \[\begin{equation} \begin{bmatrix} \mathbf{y}\\ f_*\end{bmatrix} \sim N \left( 0, \begin{bmatrix} K(\mathbf{t},\mathbf{t}) +\sigma_n^2I& K(\mathbf{t},t_*) \\ K(t_*,\mathbf{t}) & K(t_*,t_*) \end{bmatrix} \right). \end{equation}\] The posterior distribution over the predicted value is obtained by conditioning on the observed data \[\begin{equation} f_* \mid \mathbf{y},\mathbf{t},t_* \sim N\left(\bar{f_*},\mathrm{Cov} (f_*)\right) \end{equation}\] where \[\begin{align} \bar{f_*}&=\mathrm{E} \left( f_* \mid \mathbf{y},\mathbf{t},t_*\right) = K(t_*,\mathbf{t})\left( K(\mathbf{t},\mathbf{t})+\sigma_n^2I\right) ^{-1}\mathbf{y},\\ \mathrm{Cov}(f_*)&=K(t_*,t_*)-K(t_*,\mathbf{t})\left( K(\mathbf{t},\mathbf{t})+\sigma_n^2I\right) ^{-1}K(\mathbf{t},t_*). \end{align}\] Therefore it can be seen that the Bayesian estimate of a smoothing spline is a special case of Gaussian process regression with a diffuse prior and covariance matrix \(R(s,t)\).
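The posterior formulas above translate directly into a few lines of R. The following sketch uses a squared-exponential kernel on toy data (the kernel and its hyper-parameters are illustrative assumptions, not part of the original analysis):

```r
# Squared-exponential covariance K(s, t).
sqexp <- function(s, t, ell = 1, tau2 = 1)
  tau2 * exp(-outer(s, t, function(a, b) (a - b)^2) / (2 * ell^2))

# Posterior mean and covariance of f* given noisy observations y at t,
# following the conditioning formulas above.
gp_posterior <- function(t_train, y, t_star, sigma2 = 0.01) {
  K    <- sqexp(t_train, t_train)
  Ks   <- sqexp(t_star,  t_train)                       # K(t*, t)
  Kinv <- solve(K + sigma2 * diag(length(t_train)))     # (K + sigma2 I)^-1
  list(mean = drop(Ks %*% Kinv %*% y),
       cov  = sqexp(t_star, t_star) - Ks %*% Kinv %*% t(Ks))
}

set.seed(1)
t_tr <- seq(0, 2 * pi, length.out = 20)
y    <- sin(t_tr) + rnorm(20, sd = 0.1)
post <- gp_posterior(t_tr, y, seq(0, 2 * pi, length.out = 50))
```

Plotting post$mean against the prediction grid shows the usual smooth interpolant with shrinkage controlled by sigma2 and the length scale ell.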

Cheng et al. (2019) proposed an additive Gaussian process regression model for fitting longitudinal data in biomedical (biostatistical) research.

For example,

\[\begin{align*} \mbox{AGPM1: } y&=f^{(1)}(id)+\varepsilon \\ \mbox{AGPM2: } y&=f^{(1)}(id)+f^{(2)}(age)+f^{(3)}(id\times age)+\varepsilon \\ \mbox{AGPM3: } y&=f^{(1)}(id)+f^{(2)}(age)+f^{(3)}(id\times age)+f^{(4)}(loc)+f^{(5)}(loc\times age)+\varepsilon \end{align*}\]

knitr::include_graphics("agpm.png",dpi=100)
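Under the GP view, each additive model above corresponds to a sum of component kernels. A minimal sketch of the AGPM2 covariance (the kernel choices — an identity kernel for the categorical id and a squared-exponential for age — are illustrative assumptions):

```r
# Component kernels: "same individual" for id, squared-exponential for age,
# and their elementwise product for the id x age interaction.
k_id <- function(id) outer(id, id, "==") * 1.0
k_se <- function(x, ell = 5) exp(-outer(x, x, "-")^2 / (2 * ell^2))

# AGPM2: y = f1(id) + f2(age) + f3(id x age) + noise,
# whose covariance is the sum of the component kernels.
k_agpm2 <- function(id, age) k_id(id) + k_se(age) + k_id(id) * k_se(age)

id  <- c(1, 1, 2, 2)
age <- c(10, 20, 10, 20)
K   <- k_agpm2(id, age)  # 4 x 4 additive covariance matrix
```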

However, LonGP, a software package implementing additive Gaussian process regression, is only available for Matlab on Linux (or Mac) systems.

Here we propose another research topic: an additive Gaussian process regression model for longitudinal agricultural data, with a plan to develop an R package that statisticians can use readily on their own machines.

References

Chen, Kefei, Rebecca A. O’Leary, and Fiona H. Evans. 2019. “A Simple and Parsimonious Generalised Additive Model for Predicting Wheat Yield in a Decision Support Tool.” Agricultural Systems 173: 140–50.

Cheng, Lu, Siddharth Ramchandran, Tommi Vatanen, Niina Lietzén, Riitta Lahesmaa, Aki Vehtari, and Harri Lähdesmäki. 2019. “An Additive Gaussian Process Regression Model for Interpretable Non-Parametric Analysis of Longitudinal Data.” Nature Communications 10 (1): 1798.

Meyer, Mary C. 2008. “Inference Using Shape-Restricted Regression Splines.” The Annals of Applied Statistics 2 (3): 1013–33.

———. 2013. “Semi-Parametric Additive Constrained Regression.” Journal of Nonparametric Statistics 25 (3): 715–30.

Pya, Natalya. 2010. “Additive Models with Shape Constraints.” PhD thesis, University of Bath.

Pya, Natalya, and Simon N. Wood. 2015. “Shape Constrained Additive Models.” Statistics and Computing 25 (3): 543–59.

Wood, Simon. 2015. “Package ‘Mgcv’.” R Package Version 1: 29.

Wood, S. N. 2006. “Generalized Additive Models: An Introduction with R.” Chapman & Hall/CRC.

Liao, Xiyue, and Mary C. Meyer. 2019. “cgam: An R Package for the Constrained Generalized Additive Model.” Journal of Statistical Software 89 (5). https://doi.org/10.18637/jss.v089.i05.

Model Selection

The optimal model we find is not the best one we want

1 Visualizing data (Chen, O’Leary, and Evans 2019)

NVTData <- read.table("NVTData.txt",header = TRUE)
plot(NVTData$Wavail,NVTData$yield,cex=0.1)

2 Fitting data

library(mgcv)  # provides gam() and s()
fitgam0 <- gam(yield ~ s(Wavail), data=NVTData)
plot(NVTData$Wavail,NVTData$yield,cex=0.1)
points(NVTData$Wavail,fitgam0$fitted.values,type="l")

3 Ordering data

datOR    <- NVTData[order(NVTData$Wavail),]
knotsNum <- length(table(datOR$Wavail)) # unique knots
print(knotsNum)
## [1] 362
fitgam1 <- gam(yield ~ s(Wavail), data=datOR)
plot(datOR$Wavail,datOR$yield,cex=0.1)
points(datOR$Wavail,fitgam1$fitted.values,type="l")

4 Comparison

fitgam2 <- gam(yield ~ s(Wavail,k=3),   data = datOR)
fitgam3 <- gam(yield ~ s(Wavail,k=100), data = datOR)
fitgam4 <- gam(yield ~ s(Wavail,k=300), data = datOR)
plot(datOR$Wavail,datOR$yield,cex=0.1)
points(datOR$Wavail,fitgam2$fitted.values,type="l",col="red")
points(datOR$Wavail,fitgam3$fitted.values,type="l",col="blue")
points(datOR$Wavail,fitgam4$fitted.values,type="l",col="green")

compTable <- AIC(fitgam0,fitgam1,fitgam2,fitgam3,fitgam4)
compTable$BIC <- BIC(fitgam0,fitgam1,fitgam2,fitgam3,fitgam4)$BIC
compTable$GCV <- c(fitgam0$gcv.ubre, fitgam1$gcv.ubre, fitgam2$gcv.ubre, fitgam3$gcv.ubre, fitgam4$gcv.ubre)
compTable$deviance <- c(fitgam0$deviance, fitgam1$deviance, fitgam2$deviance, fitgam3$deviance, fitgam4$deviance)
compTable$RMSE <- c(
  sqrt(mean((fitgam0$fitted.values - NVTData$yield)^2)),
  sqrt(mean((fitgam1$fitted.values - datOR$yield)^2)),
  sqrt(mean((fitgam2$fitted.values - datOR$yield)^2)),
  sqrt(mean((fitgam3$fitted.values - datOR$yield)^2)),
  sqrt(mean((fitgam4$fitted.values - datOR$yield)^2))
)

print(compTable)
##                 df      AIC      BIC       GCV deviance      RMSE
## fitgam0  10.933138 30492.71 30573.87 0.6877164 8498.209 0.8286211
## fitgam1  10.933138 30492.71 30573.87 0.6877164 8498.209 0.8286211
## fitgam2   3.998908 30748.05 30777.73 0.7020508 8685.073 0.8376817
## fitgam3 100.266214 26895.97 27640.31 0.5143182 6264.017 0.7114080
## fitgam4 300.368361 14930.74 17160.55 0.1957113 2306.556 0.4316923

All of the criteria, AIC, BIC, GCV, deviance and root mean squared error (RMSE), indicate that the more knots the model uses, the better it appears to be.
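This behaviour is not specific to gam(): for any family of nested least-squares fits, training error can only decrease as the basis grows. A hypothetical Python sketch, with Chebyshev polynomial degree standing in for the number of knots:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(2019)
x = np.linspace(-1, 1, 200)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, x.size)  # smooth truth + noise

def train_rmse(degree):
    # Least-squares Chebyshev fit, evaluated on the SAME data it was fit to.
    coefs = C.chebfit(x, y, degree)
    fitted = C.chebval(x, coefs)
    return np.sqrt(np.mean((fitted - y) ** 2))

rmses = [train_rmse(d) for d in (3, 10, 30, 60)]
print(np.round(rmses, 3))  # training RMSE keeps shrinking as complexity grows
```

Held-out error, by contrast, eventually turns back upward as the fit starts chasing noise, which is exactly what the 5-fold cross-validation below is designed to reveal.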

cv.train <- matrix(NA, knotsNum, 5)
cv.tests <- matrix(NA, knotsNum, 5)
for(kts in 3:knotsNum){
  for(st in 1:5){  # 5-fold cross-validation
    s         <- seq(st, nrow(datOR), by=5)
    
    tempTrain <- datOR[-s,]
    tempTests <- datOR[s,]
    
    fitgam <- gam(yield ~ s(Wavail,k=kts), data=tempTrain)
    
    cv.train[kts,st] <- mean((fitgam$fitted.values-tempTrain$yield)^2)
    
    preTest <- predict(fitgam,data.frame(Wavail = tempTests$Wavail))
    
    cv.tests[kts,st] <- mean((preTest-tempTests$yield)^2)
  }
}
aics <- numeric(knotsNum)
gcvs <- numeric(knotsNum)
mses <- numeric(knotsNum)
for(kts in 3:knotsNum){
  fitgam    <- gam(yield ~ s(Wavail,k=kts), data=datOR)
  aics[kts] <- fitgam$aic
  gcvs[kts] <- fitgam$gcv.ubre
  mses[kts] <- sqrt(mean((fitgam$fitted.values-datOR$yield)^2))
}
cvtrain <- read.table("cvtrain.txt",header = TRUE)
cvtests <- read.table("cvtests.txt",header = TRUE)
par(mfrow=c(1,2));
plot(apply(cvtrain[-c(1,2),],1,mean),main="cv train");
plot(apply(cvtests[-c(1,2),],1,mean),main="cv test");

par(mfrow=c(1,1))

In the package mgcv, the default basis for s() within gam() is the penalized thin plate regression spline, which tends to give the best MSE performance (Wood 2003, 2006).

Zhu, Li, and Liang (2009) conclude that:

  1. \(R^2\) and \(RMSE\) have the same properties and perform worst in their simulation.
  2. The sample size affects the performance of \(AIC\), \(AIC_c\) and \(BIC\).
  3. \(AIC\) is not suitable for small samples and tends to select more complex models as the sample size becomes large.
  4. \(AIC_c\) and \(BIC\) perform best in small and large samples, respectively.
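For reference, under the usual definitions (our formulation, not quoted from Zhu, Li, and Liang), the information criteria for a model with \(k\) parameters, maximised likelihood \(\hat{L}\), and sample size \(n\) are:

\[\begin{align*} AIC &= -2\log\hat{L} + 2k \\ AIC_c &= AIC + \frac{2k(k+1)}{n-k-1} \\ BIC &= -2\log\hat{L} + k\log n \end{align*}\]

The correction term of \(AIC_c\) blows up as \(k\) approaches \(n\), so \(AIC_c\) penalises complexity much more heavily than \(AIC\) in small samples, while the \(\log n\) penalty of \(BIC\) dominates in large samples.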

Here is the contradiction:

The optimal model, as selected by criteria such as AIC() or BIC(), is not the best model we want when judged by visual inspection.

5 Potential solutions

5.1 Aggregating data

datOR.agg <- aggregate(datOR[c("lat","longi","Wavail","yield")],list(datOR$Wavail),mean)

plot(datOR$Wavail,datOR$yield,cex=0.1,pch=20)
points(datOR.agg$Wavail,datOR.agg$yield,pch=20,col="red")

aics.agg <- numeric(knotsNum)
gcvs.agg <- numeric(knotsNum)
divs.agg <- numeric(knotsNum)
mses.agg <- numeric(knotsNum)
for(kts in 3:knotsNum){
  fitgam.agg    <- gam(yield ~ s(Wavail,k=kts), data=datOR.agg)
  aics.agg[kts] <- fitgam.agg$aic
  gcvs.agg[kts] <- fitgam.agg$gcv.ubre
  divs.agg[kts] <- fitgam.agg$deviance
  mses.agg[kts] <- sqrt(mean((fitgam.agg$fitted.values-datOR.agg$yield)^2))
}
par(mfrow=c(2,2))
plot(aics.agg[-c(1:2)],type="b")
plot(gcvs.agg[-c(1:2)],type="b")
plot(divs.agg[-c(1:2)],type="b")
plot(mses.agg[-c(1:2)],type="b")

par(mfrow=c(1,1))
fitgam.agg1 <- gam(yield ~ s(Wavail,k=3), data=datOR.agg)
fitgam.agg2 <- gam(yield ~ s(Wavail,k=5), data=datOR.agg)
fitgam.agg3 <- gam(yield ~ s(Wavail,k=100), data=datOR.agg)
fitgam.agg4 <- gam(yield ~ s(Wavail,k=300), data=datOR.agg)

plot(datOR.agg$Wavail,datOR.agg$yield,cex=0.1,pch=20)
points(datOR.agg$Wavail,fitgam.agg1$fitted.values,type="l",col="red",lwd=2)
points(datOR.agg$Wavail,fitgam.agg2$fitted.values,type = "l",col="blue",lwd=2)
points(datOR.agg$Wavail,fitgam.agg3$fitted.values,type = "l",col="green",lwd=2)
points(datOR.agg$Wavail,fitgam.agg4$fitted.values,type = "l",col="brown",lwd=2)

compTable.agg <- AIC(fitgam.agg1,fitgam.agg2,fitgam.agg3,fitgam.agg4)
compTable.agg$BIC <- BIC(fitgam.agg1,fitgam.agg2,fitgam.agg3,fitgam.agg4)$BIC
compTable.agg$GCV <- c(fitgam.agg1$gcv.ubre, fitgam.agg2$gcv.ubre, fitgam.agg3$gcv.ubre, fitgam.agg4$gcv.ubre)
compTable.agg$deviance <- c(fitgam.agg1$deviance, fitgam.agg2$deviance, fitgam.agg3$deviance, fitgam.agg4$deviance)
compTable.agg$RMSE <- c(
  sqrt(mean((fitgam.agg1$fitted.values - datOR.agg$yield)^2)),
  sqrt(mean((fitgam.agg2$fitted.values - datOR.agg$yield)^2)),
  sqrt(mean((fitgam.agg3$fitted.values - datOR.agg$yield)^2)),
  sqrt(mean((fitgam.agg4$fitted.values - datOR.agg$yield)^2))
)
print(compTable.agg)
##                   df      AIC      BIC       GCV deviance      RMSE
## fitgam.agg1 3.981404 860.8931 876.3873 0.6280246 223.6155 0.7859532
## fitgam.agg2 5.259226 860.9008 881.3679 0.6280828 222.0472 0.7831921
## fitgam.agg3 5.549870 861.3912 882.9893 0.6289465 221.9914 0.7830937
## fitgam.agg4 5.549814 861.3912 882.9891 0.6289465 221.9914 0.7830939

Drawbacks: some valuable data are discarded, and aggregating by mean() might not be the best option.

5.2 Weighted values

datOR.wts <- datOR
plot(datOR$Wavail,datOR$yield,cex=0.5)
points(datOR$Wavail,fitgam2$fitted.values,type="l",col="red",lwd=2)

sres0 <- (datOR$yield - fitgam2$fitted.values)^2  # squared residuals from the k=3 fit
sres  <- 1/sres0                                  # inverse squared residuals as raw weights
datOR.wts$Weights <- 0
wavs <- names(table(datOR.wts$Wavail))

for(k in wavs){
  wh <- which(datOR.wts$Wavail == k)
  datOR.wts$Weights[wh] <- sres[wh]/sum(sres[wh])
}

fitgam.wts1 <- gam(yield ~ s(Wavail,k=3),weights = datOR.wts$Weights, data=datOR.wts)
fitgam.wts2 <- gam(yield ~ s(Wavail,k=5),weights = datOR.wts$Weights, data=datOR.wts)
fitgam.wts3 <- gam(yield ~ s(Wavail,k=100),weights = datOR.wts$Weights, data=datOR.wts)
fitgam.wts4 <- gam(yield ~ s(Wavail,k=300),weights = datOR.wts$Weights, data=datOR.wts)

plot(datOR.wts$Wavail,datOR.wts$yield,cex=0.1,pch=20)
points(datOR.wts$Wavail,fitgam.wts1$fitted.values,type = "l",col="red",lwd=2)
points(datOR.wts$Wavail,fitgam.wts2$fitted.values,type = "l",col="blue",lwd=2)
points(datOR.wts$Wavail,fitgam.wts3$fitted.values,type = "l",col="green",lwd=2)
points(datOR.wts$Wavail,fitgam.wts4$fitted.values,type = "l",col="brown",lwd=2)

AIC(fitgam.wts1,fitgam.wts2,fitgam.wts3,fitgam.wts4)
##                     df      AIC
## fitgam.wts1   3.999472 49964.63
## fitgam.wts2   5.995467 49841.01
## fitgam.wts3 100.426600 45655.24
## fitgam.wts4 300.412109 31041.80
BIC(fitgam.wts1,fitgam.wts2,fitgam.wts3,fitgam.wts4)
##                     df      BIC
## fitgam.wts1   3.999472 49994.32
## fitgam.wts2   5.995467 49885.51
## fitgam.wts3 100.426600 46400.77
## fitgam.wts4 300.412109 33271.94

5.3 Taking a subset

set.seed(2019)
datOR.sample <- NULL
for(k in wavs){
  tempdat<- datOR[which(datOR$Wavail == k),]
  s<- sort(sample(1:nrow(tempdat),1))
  datOR.sample<-rbind(datOR.sample,tempdat[s,])
}

fitgam.sample <- gam(yield  ~ s(Wavail,k=300), data=datOR.sample)
plot(datOR.sample$Wavail,datOR.sample$yield,cex=0.5)
points(datOR.sample$Wavail,fitgam.sample$fitted.values,type = "l")

AIC(fitgam.sample)
## [1] 884.1104
fitgam.sample1 <- gam(yield ~ s(Wavail,k=3), data=datOR.sample)
fitgam.sample2 <- gam(yield ~ s(Wavail,k=5), data=datOR.sample)
fitgam.sample3 <- gam(yield ~ s(Wavail,k=100), data=datOR.sample)
fitgam.sample4 <- gam(yield ~ s(Wavail,k=300), data=datOR.sample)

plot(datOR.sample$Wavail,datOR.sample$yield,cex=0.1,pch=20)
points(datOR.sample$Wavail,fitgam.sample1$fitted.values,type = "l",col="red",lwd=2)
points(datOR.sample$Wavail,fitgam.sample2$fitted.values,type = "l",col="blue",lwd=2)
points(datOR.sample$Wavail,fitgam.sample3$fitted.values,type = "l",col="green",lwd=2)
points(datOR.sample$Wavail,fitgam.sample4$fitted.values,type = "l",col="brown",lwd=2)

AIC(fitgam.sample1,fitgam.sample2,fitgam.sample3,fitgam.sample4)
##                      df      AIC
## fitgam.sample1 3.980366 884.9641
## fitgam.sample2 5.404610 884.2511
## fitgam.sample3 6.515357 884.1104
## fitgam.sample4 6.515493 884.1104
BIC(fitgam.sample1,fitgam.sample2,fitgam.sample3,fitgam.sample4)
##                      df      BIC
## fitgam.sample1 3.980366 900.4543
## fitgam.sample2 5.404610 905.2840
## fitgam.sample3 6.515357 909.4659
## fitgam.sample4 6.515493 909.4664

5.3.1 A naive resampling approach

N <- 30
boost.max1 <- matrix(0,N,knotsNum)  # one fitted value per unique Wavail (knotsNum = 362)
boost.max2 <- matrix(0,N,knotsNum)
boost.max3 <- matrix(0,N,knotsNum)
boost.max4 <- matrix(0,N,knotsNum)
for(i in 1:N){
  datOR.boost <- NULL
  for(k in wavs){
    tempdat<- datOR[which(datOR$Wavail == k),]
    s<- sort(sample(1:nrow(tempdat),1))
    datOR.boost <- rbind(datOR.boost,tempdat[s,])
  }
  fitgam.boost1 <- gam(yield ~ s(Wavail,k=3),   data=datOR.boost)
  fitgam.boost2 <- gam(yield ~ s(Wavail,k=5),   data=datOR.boost)
  fitgam.boost3 <- gam(yield ~ s(Wavail,k=100), data=datOR.boost)
  fitgam.boost4 <- gam(yield ~ s(Wavail,k=300), data=datOR.boost)
  
  boost.max1[i,] <- fitgam.boost1$fitted.values
  boost.max2[i,] <- fitgam.boost2$fitted.values
  boost.max3[i,] <- fitgam.boost3$fitted.values
  boost.max4[i,] <- fitgam.boost4$fitted.values
  cat(i);
}
plot(datOR$Wavail,datOR$yield,cex=0.1)
points(datOR$Wavail,fitgam2$fitted.values,type="l",lwd=2)
points(datOR.sample$Wavail,apply(boost.max1,2,mean),type="l",col="red")
points(datOR.sample$Wavail,apply(boost.max2,2,mean),type="l",col="blue")
points(datOR.sample$Wavail,apply(boost.max3,2,mean),type="l",col="green")
points(datOR.sample$Wavail,apply(boost.max4,2,mean),type="l",col="brown")

6 A question

Is it worth spending time on finding an approach that guarantees consistency, i.e. that the optimal model by criterion is also the best model, for large longitudinal data sets?

References

Chen, Kefei, Rebecca A. O’Leary, and Fiona H. Evans. 2019. “A Simple and Parsimonious Generalised Additive Model for Predicting Wheat Yield in a Decision Support Tool.” Agricultural Systems 173: 140–50.

Wood, Simon N. 2003. “Thin Plate Regression Splines.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65 (1): 95–114.

Wood, S. N. 2006. “Generalized Additive Models: An Introduction with R.” Chapman and Hall/CRC.

Zhu, Lixin, Lifang Li, and Zhenlin Liang. 2009. “Comparison of Six Statistical Approaches in the Selection of Appropriate Fish Growth Models.” Chinese Journal of Oceanology and Limnology 27 (3): 457. https://doi.org/10.1007/s00343-009-9236-6.

Sharing My Experience of Going to Court (Defending My Rights) in New Zealand

It has been almost two years since this happened. With some free time today, I looked back on the past few years and decided this story is worth sharing, as advice for friends studying and living abroad. I also want to encourage everyone: studying and living overseas is not easy, so when something goes wrong, don't panic and don't just swallow the grievance. When necessary, learn to use the law to protect your rights.

==============================================

1. Buying and selling personal items in New Zealand: Trademe

Anyone living in New Zealand who has bought or sold anything, whether new products or second-hand items, including cars, phones, and even rental listings, has most likely used the website Trademe. Much like Taobao in China, any individual or business can register and post listings on the platform. After a successful sale, the seller pays Trademe an "advertising fee", while the buyer pays no extra service fee. When the transaction is complete, both parties rate each other, and a higher score means a better reputation.

Also, thanks to the general trust between people in New Zealand, many transactions work like this: the buyer pays into the seller's account first, then waits for the seller to ship the item. Even strangers living far apart on the North and South Islands usually trade this way. There are risks, and scams do happen, but in all my years living in New Zealand I never came across a seller who took the money and never shipped.

==============================================

How it started: a faulty second-hand laptop

It all started when I spotted a 15-inch MacBook Pro 2014 on Trademe. After reading the seller's description and exchanging a few messages with questions, the price seemed right and the condition seemed good. Moreover, because of the anti-reflective coating delamination issue on MacBook displays, this laptop qualified for a free screen replacement, so the small dings along the edge of the screen were not a concern. For these reasons I decided to buy it. In 2017 this machine cost 1,500 NZD.

On the afternoon of the third day after I paid, the laptop arrived, shipped from Wellington to Dunedin.

I unboxed it, plugged in the power, and switched it on. Everything seemed fine, so I started migrating the data over from my old computer. Since that would take about an hour, I left for the gym and let the laptop do its work.

An hour later, back from the gym, I found the laptop had shut down. I pressed the power button to restart it; the data migration had completed. At first I assumed it had simply powered itself off after finishing the migration, so I thought nothing of it. After booting up, I started enjoying the impact of the big screen; upgrading from 13 to 15 inches really is quite satisfying.

It didn't last. After a short while the laptop shut down by itself, the abrupt kind of shutdown you get from a sudden power cut, with no warning at all. I assumed it was a routine glitch; after all, my old 13-inch MacBook Pro had had the same problem. So I restarted, reset the SMC, and went back to using the system.

A few minutes later the same thing happened again: a sudden power-off shutdown. I began to suspect something was wrong with this machine and hurried to contact the seller. He replied very quickly: there was no problem whatsoever.

Okay. Giving the seller the benefit of the doubt for the moment, I restored the laptop to factory settings.

A few dozen minutes after booting up again, the problem reappeared. I realised something was off; it had to be a fault in the machine. This time I took out my phone and recorded a video: power on, enter the system, use it for a bit, then "pia", the laptop cuts its own power and shuts down.

At that point I was one hundred percent certain the laptop was faulty. I contacted the seller urgently, sent him the video, told him the laptop had a problem, and asked to return it, offering to cover the shipping myself. The seller did not reply that day.

And so began my journey of defending my rights.

==============================================

2. Step one: seek help and gather evidence

I had bought a faulty second-hand laptop for 1,500 dollars, no small expense for an ordinary international student, more than a month's living costs. That money could not be wasted; it had to come back.

Having heard nothing from the seller that day, I took the laptop to an Apple authorised repair centre in Dunedin first thing the next morning. Incidentally, Dunedin has two officially authorised service centres; the one closer to the university is in the city centre, opposite VTNZ.

I explained why I had come, asked them to test the laptop, and left my university email as the contact (university staff and students get a discount on repairs there). I also said that if nothing was wrong, I would like the screen replaced. They agreed and told me to come back the next morning.

Back at the office, I received a reply from the seller: he confirmed the laptop had no problems at all, and from then on he ignored me.

A few days later he did send one message, saying I should send the laptop back first and then he would pay. Since I no longer trusted the seller, I insisted on refund first, return second. He refused, so we reached a stalemate.

I then went to the university's international office. As an international student, they are the first people you think of when something goes wrong. A staff member met with me and contacted Trademe customer service on behalf of the university to explain the situation. Their reply was that they could try to persuade the seller to refund and accept a return, but they could not force him to.

Later, after I lodged a refund-and-return request through Trademe's online customer service, I got the same answer: they had phoned and spoken to the seller, but he refused a refund or return. And for privacy reasons they could not give me the seller's contact details, including address and phone number, unless I obtained a court application form.

The next day I collected the test report from the Apple service centre and was told that the logic board showed signs of liquid damage, which was the likely cause of the sudden shutdowns. With this report I could be certain the laptop was faulty, yet the seller had disclosed nothing about it on the product description page. I suspected fraud and decided to seek legal help.

==================================

3. Step two: taking it to court

From information gathered online, I found two possible routes to resolve the problem: 1. mediation through a middleman; 2. small-claims proceedings in court.

The second route in New Zealand is called the Disputes Tribunal, which specialises in neighbourhood conflicts and small-value disputes, such as tenants not paying rent or landlords withholding bonds. It cannot handle large-value cases, which require a lawyer and full legal proceedings. My case involved a small amount, so it qualified, no lawyer was needed, and the fee was 45 NZD.

I had tried the mediation route first. The middleman is called a mediator; on campus, Campus Watch has a position called the Proctor who can also act as one. Early on I contacted the seller and offered several options:

Option A: find a mediator; I give him the laptop, the seller refunds me, and the mediator then hands the laptop to the seller.

Option B: use the university's Proctor; I give him the laptop, the seller refunds me, and the Proctor then hands the laptop to the seller.

Option C: the seller finds a mediator; I give him the laptop, the seller gives him the money, then the mediator hands the laptop to the seller and the money to me.

The seller rejected every one of them. The Disputes Tribunal it was, then.

The Disputes Tribunal works like this: fill in an application form; submit your materials (audio, video, documents, etc.), one copy to the tribunal and one to the respondent; pay the application fee; receive notice of the hearing time and place; attend the hearing; and receive the decision about two weeks later. If the respondent lives elsewhere, they can attend by video link from their local court. If the respondent does not appear, the tribunal holds the hearing and issues its decision as scheduled anyway. The defendant is called the respondent and the plaintiff the applicant.

3.1

Having decided to file, I first downloaded from Trademe an application template for requesting the seller's address and contact details. I filled in the form, had it stamped at the court, and scanned and sent it back to Trademe. Two days later I received a letter containing the seller's real name, home address, and other details.

3.2

I downloaded my email correspondence with the seller and compiled it into a PDF ordered by date, scanned the test report, and saved the video recorded on the first night onto a cheap USB stick. I also took screenshots of the seller's Trademe listing page and its questions-and-answers page as supporting material. I wrote a statement of claim describing what had happened and my requested remedy, asking the seller to refund: 1,500 (the laptop) + 60 (shipping to receive and return it) + 69 (the diagnostic fee) + 45 (the filing fee) + 300 (lost time) = 1,974 NZD in total.

3.3

I printed the materials and sealed them in envelopes, posting one to the seller and delivering the other to the court in person. In Dunedin, the place to lodge documents is on the street by the casino; the hearings take place in another building across the road.

=============================================

4. Step three: the hearing

The tribunal sent notice of the hearing date and venue and asked whether I needed an interpreter. Since it was a free service, and I worried I might mishear something or fail to express myself during the hearing, I requested one.

Before the hearing I prepared a written statement, because both the applicant and the respondent get a chance to present their side. I wrote out the sequence of events and my requests, essentially the same content as my application with some revisions.

I was a little excited on the day. My first ever lawsuit, and it was in New Zealand, without a lawyer (I couldn't afford one), taking the field on my own.

Through security, into the hearing room, waiting for the session to begin. As expected, the respondent neither showed up nor applied to attend remotely. So in the room there were only three of us: me, the judge, and an interpreter.

The hearing itself was fairly simple. The judge first read out the procedure and points to note, and then it was my turn. I read through my prepared account of events and stated my requests. The judge then asked me a few questions; one I didn't quite catch, so the interpreter explained it, and I then gave my view.

Finally, the judge said the case would be decided on the basis that the goods sold did not match their description, subject to the final written decision. As for my claim of 1,974 NZD, the judge would consider it before ruling.

===========================================

5. Step four: the decision, and the seller refunds voluntarily

A week after the hearing I received the tribunal's decision. I had won, as expected, but only the 1,500 NZD for the laptop was awarded. The other losses were the result of my own choices and had nothing to do with the seller; for example, the 69 NZD diagnostic fee was something I had requested myself, not something the seller demanded, so that loss was mine to bear.

Holding the decision, however, did not necessarily mean the seller would pay up as ordered. The respondent had time to lodge an appeal with the tribunal and request a rehearing; if there was no appeal, I still had to wait two months before the court could begin enforcement to recover my losses on my behalf, and even that was not guaranteed to succeed.

In other words, winning the decision did not guarantee recovering the money.

Then came the turning point: the seller contacted me of his own accord and agreed to a refund, payment first, return second.

The final outcome: the seller refunded 1,530 NZD; I returned the laptop, having lost 144 NZD (30 + 69 + 45) plus time and energy.

===========================================

6. Afterword

I was delighted to get the refund. My feeling is that defending your rights in New Zealand is really not that hard; maybe I was lucky, or maybe it was because I persisted. Early on I had left the seller a negative review on Trademe and we had traded words; he probably never expected me to be so persistent as to actually take him to court. Whatever the reason, his decision to refund was a good outcome. I told the tribunal staff the result, and they congratulated me on it.

During this whole period, apart from telling Yang Po, I told no other friends. After graduating I told Dr. Weiwei, sharing this rare experience as good material for dinner-table conversation.

While defending my rights, I received support from staff at the university's international office and OUSA, and also from a judge at the court.

I still shop on Trademe, but I have never had another unpleasant experience.

After it was all over, I happened to have a friend buy me a new 13-inch MacBook Pro with Touch Bar from Hong Kong, officially retiring the old MacBook Pro that had been with me for six years; the old machine was also sold on Trademe.

===========================================

7. I am sharing this story to tell friends studying and living abroad: learn to use the law to protect your rights. Foreigners are not to be feared; do not suffer in silence, stand up bravely.


Department Seminar, 2018

A seminar at the Department of Maths & Stats, University of Otago, in July 2018.

Title: Adaptive Sequential MCMC for Combined State and Parameters Estimation

Abstract: Most algorithms for combined state and parameter estimation in state space models either estimate the states and parameters by incorporating the parameters in the state space, or marginalize out the parameters through sufficient statistics. In the case of a linear state space model and starting with a joint distribution over states, observations and parameters, we implement an MCMC sampler with two phases. In the learning phase, a self-tuning sampler is used to learn the parameter mean and covariance structure. In the estimation phase, the parameter mean and covariance structure informs the proposal mechanism and is also used in a delayed-acceptance algorithm, which greatly improves sampling efficiency. Information on the resulting state of the system is given by a Gaussian mixture. In on-line mode, the algorithm is adaptive and uses a sliding window approach by cutting off historical data to accelerate sampling speed and to maintain appropriate acceptance rates. We apply the algorithm to joint state and parameter estimation in the case of irregularly sampled GPS time series data.

PDF.

Poster at Bayes Comp 2018, Barcelona

This is the poster of the work done with my supervisor Dr. Matthew Parry. The poster was presented at the Bayes Comp 2018 conference in Barcelona.

Bayes Comp is a biennial conference sponsored by the ISBA section of the same name. The conference and the section both aim to promote original research into computational methods for inference and decision making and to encourage the use of frontier computational tools among practitioners, the development of adapted software, languages, platforms, and dedicated machines, and to translate and disseminate methods developed in other disciplines among statisticians. Details are at conference information.

Title: Adaptive Sequential MCMC for Combined State and Parameters Estimation

Abstract: Most algorithms for combined state and parameter estimation in state space models either estimate the states and parameters by incorporating the parameters in the state space, or marginalize out the parameters through sufficient statistics. In the case of a linear state space model and starting with a joint distribution over states, observations and parameters, we implement an MCMC sampler with two phases. In the learning phase, a self-tuning sampler is used to learn the parameter mean and covariance structure. In the estimation phase, the parameter mean and covariance structure informs the proposal mechanism and is also used in a delayed-acceptance algorithm, which greatly improves sampling efficiency. Information on the resulting state of the system is given by a Gaussian mixture. In on-line mode, the algorithm is adaptive and uses a sliding window approach by cutting off historical data to accelerate sampling speed and to maintain appropriate acceptance rates. We apply the algorithm to joint state and parameter estimation in the case of irregularly sampled GPS time series data.

Presentation files

Presentation at ICSA Shanghai 2016

The first international conference that I attended, in December 2016. This is part of the work during my PhD at the University of Otago.

Title: Trajectory estimation from GPS data using an adaptive smoothing spline

Abstract: GPS units record time series data of a moving object. Connecting consecutive location points represents the trajectory of an individual or a vehicle. However, sparse points and data errors produce a trajectory with sharp angles, which are unrealistic for a moving object. Smoothing spline methods can efficiently build a smoother trajectory. In a conventional smoothing spline, the objective function minimizes the location errors plus a penalty term with a single parameter that controls the smoothness of the reconstruction. An adaptive smoothing spline extends the single parameter to a function that varies over the domain and adapts to changes in roughness. In this talk, I will introduce a tractor spline that incorporates both location and velocity information but penalizes excessive accelerations. The penalty term also depends on the mechanical boom status. A new parameter, which controls the error of velocity, and adjusted penalty terms, which adapt to a more complicated curvature status, are introduced into the new objective function. We develop cross-validation techniques to find the three smoothing parameters of interest. Moreover, researchers have shown that a conventional smoothing spline can be constructed by Gaussian process regression when it is equipped with an appropriate inner product in a reproducing kernel Hilbert space. Similarly, we equip the tractor spline with an inner product that makes its function space, whose members have piecewise continuous second derivatives, an RKHS; a posterior mean of the estimates can then be found by GPR. Finally, simulation studies and a real data example are presented to demonstrate the effectiveness of the tractor spline.

Presentation files

New Zealand Maths and Stats Postgraduates Conference, 2016

We were the organizers of the NZMASP Conference 2016. It was held in Queenstown, Otago!

Such a beautiful place and a delightful experience here!

Title: Play with GPS Data from Off-line to On-line

Abstract: GPS units mounted on a vehicle record its position, speed and bearing. The time series of positions then represents the trajectory of the vehicle. Noisy measurements and infrequent sampling, however, mean simplistic trajectory reconstruction will have unrealistic features, like sharp turns. Smoothing spline methods can efficiently build smoother, more realistic trajectories from a batch of data, which makes them "off-line" methods. To obtain the instantaneous status and to predict the next position of a moving vehicle, a cheaper and more effective computational method, known as a dynamic linear system, is introduced. With it, the model keeps updating its parameters with all the data generated up to the current time, making it an "on-line" method. In this presentation, I will talk about how I played with GPS data by using and improving these models. Some spin-off results are given, and a numerical simulation study and a real data example are presented.

Presentation files

Department Seminar, 2016

A seminar at the Department of Maths & Stats, University of Otago, in May 2016.

Title: Trajectory Estimation of Time Series GPS data with Adaptive Smoothing Spline

Abstract: GPS units record time series data of a moving object. Connecting consecutive position points represents the trajectory of an individual or a vehicle. However, sparse points and data errors produce a trajectory with sharp angles, which are unrealistic for a moving object. Smoothing spline methods can efficiently build a smoother trajectory. In a conventional smoothing spline, the objective function minimizes the position errors plus a penalty term with a single parameter that controls the smoothness of the reconstruction. An adaptive smoothing spline extends the single parameter to a function that varies over the domain and adapts to changes in roughness. In this talk, I will introduce a tractor spline that incorporates both position and velocity information but penalizes excessive accelerations. The penalty term also depends on the mechanical boom status. A new parameter, which controls the errors of velocity, and adjusted penalty terms, which adapt to a more complicated curvature status, are introduced into the new objective function. We develop cross-validation techniques to find the three smoothing parameters of interest. A short discussion of the relationship between spline methods and Gaussian process regression will be given. A simulation study and a real data example are presented to demonstrate the effectiveness of this new method.

PDF.

New Zealand Maths and Stats Postgraduates Conference, 2015

Title: Spline-based approach to infer farm vehicle trajectories

Abstract: Tractors working in an orchard or vineyard are often fitted with GPS units. These GPS units provide noisy position and velocity information to orchardists, who want to know exactly where these tractors have been. A smoothing spline is a promising method for inferring such trajectories. I develop a smoothing spline that incorporates velocity information but penalises excessive accelerations. The application that concerns my industry partner is tractors in spraying mode, so the penalty term also depends on the spraying state. We develop cross-validation techniques to find the three smoothing parameters of interest. A simulation study and a real data example are presented to demonstrate the effectiveness of this new method.

PDF.

New Zealand Maths and Stats Postgraduates Conference, 2014

The New Zealand Maths and Stats Postgraduates Conference 2014, where I gave a presentation in English for the first time. It was a big challenge for me, and I really enjoyed it!

Lovely postgraduates, professional presentations, and enjoyable travel in Auckland with these young professionals!

This work is from my master's study at the University of Shanghai for Science and Technology in 2010.

Title: Uniqueness Theory of Meromorphic Functions Sharing Weighted Values with Its k-th Derivative

PDF.