---
title: "Pedometric Workflow: KSSL Data, Spline Harmonisation, and SWRC Fitting"
subtitle: "From raw USDA NRCS horizons to physics-informed retention curves"
author: "Hugo Rodrigues"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    number_sections: true
    fig_width: 8
    fig_height: 5
vignette: >
  %\VignetteIndexEntry{Pedometric Workflow: KSSL Data, Spline Harmonisation, and SWRC Fitting}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  fig.align = "center",
  message   = FALSE,
  warning   = FALSE
)
```

> **Backend note:** Sections 1–6 (data download, splines, exploration) run
> without TensorFlow.  Sections 7–9 (model fitting, prediction, evaluation)
> require a working TensorFlow/Keras installation — those chunks are marked
> `eval = FALSE`. Run `?soilFlux` for setup instructions.

---

# Why harmonise horizons?

Soil survey data is collected at natural **genetic horizons** whose boundaries
reflect pedogenic processes — an Ap horizon might extend 0–22 cm in one
profile and 0–8 cm in the next.  Before fitting a spatially continuous model
(or any regression) we need observations on a **common support**.

The **GlobalSoilMap** framework (Arrouays et al. 2014) and ISRIC SoilGrids
both adopt five standard depth intervals:

```{r depth-table, echo=FALSE}
knitr::kable(
  data.frame(
    Interval = c("0–5 cm", "5–15 cm", "15–30 cm", "30–60 cm", "60–100 cm"),
    Midpoint = c(2.5, 10, 22.5, 45, 80),
    Rationale = c(
      "Surface organic matter, tillage layer",
      "Shallow root zone, topsoil",
      "Subsoil transition",
      "Rooting depth limit for most crops",
      "Deep subsoil / parent material"
    )
  )
)
```

The correct harmonisation method is the **mass-preserving (equal-area)
quadratic spline** of Bishop et al. (1999), implemented in the `mpspline2`
package.  This spline:

* fits a piecewise quadratic through horizon mid-points,
* exactly preserves the mean value of each original horizon (mass conservation),
* provides a smooth, continuous depth function that can be evaluated at any
  depth or integrated over any interval.

---

# Required packages

```{r pkgs, eval=FALSE}
install.packages(c("soilDB", "aqp", "mpspline2", "ggridges", "dplyr",
                   "tidyr", "ggplot2", "purrr"))
# remotes::install_github("HugoMachadoRodrigues/soilFlux")
```

```{r libs}
library(soilFlux)
library(dplyr)
library(tidyr)
library(ggplot2)
library(purrr)
```

---

# The mass-preserving spline — illustrated

Before applying splines to real data, we look at how they behave using a
**single synthetic profile**.  This makes the mechanics transparent.

```{r spline-demo, fig.height=6, fig.cap="Mass-preserving spline harmonisation on a synthetic profile. Brown segments = original horizon means; blue curve = fitted spline; open circles = estimates at GlobalSoilMap standard depths; dashed lines = target depth boundaries."}
if (!requireNamespace("mpspline2", quietly = TRUE))
  stop("Install mpspline2: install.packages('mpspline2')")

library(mpspline2)

# ── Synthetic profile ────────────────────────────────────────────────────────
hz_demo <- data.frame(
  id   = rep("Profile A", 5),
  top  = c(  0, 10, 25, 55,  90),
  bot  = c( 10, 25, 55, 90, 130),
  clay = c( 10, 16, 30, 38,  40)   # % clay — typical Ultisol argillic increase
)

std_d <- c(0, 5, 15, 30, 60, 100)   # GlobalSoilMap standard depths

# ── Fit spline ───────────────────────────────────────────────────────────────
sp <- mpspline2::mpspline(
  obj      = hz_demo,
  var_name = "clay",
  d        = std_d,
  vlow     = 0,
  vhigh    = 100
)

# Continuous 1-cm curve (100 values, depths 0–99 cm)
sp_curve <- data.frame(
  depth_cm = seq(0, 99),
  clay     = sp[[1]]$est_1cm
)

# Circles plotted AT the spline curve value at each interval midpoint
# (depth 2.5→ index 3, 10→11, 22.5→23, 45→46, 80→81 in 1-indexed vector)
std_mids <- c(2.5, 10, 22.5, 45, 80)
sp_pts <- data.frame(
  depth_mid = std_mids,
  clay      = sp[[1]]$est_1cm[floor(std_mids) + 1L]
)

# ── Plot ─────────────────────────────────────────────────────────────────────
ggplot() +
  # Original horizon rectangles
  geom_rect(
    data = hz_demo,
    aes(xmin = 0, xmax = clay, ymin = top, ymax = bot),
    fill = "sienna3", alpha = 0.25, colour = "sienna4", linewidth = 0.4
  ) +
  # Horizon mean bars
  geom_segment(
    data = hz_demo %>% mutate(mid = (top + bot) / 2),
    aes(x = 0, xend = clay, y = mid, yend = mid),
    colour = "sienna4", linewidth = 1.2
  ) +
  geom_point(
    data = hz_demo %>% mutate(mid = (top + bot) / 2),
    aes(x = clay, y = mid),
    colour = "sienna4", size = 3, shape = 16
  ) +
  # Continuous spline (0–99 cm)
  geom_line(
    data = sp_curve,
    aes(x = clay, y = depth_cm),
    colour = "steelblue4", linewidth = 1
  ) +
  # Open circles ON the spline at GlobalSoilMap interval midpoints
  geom_point(
    data = sp_pts,
    aes(x = clay, y = depth_mid),
    shape = 21, fill = "white", colour = "steelblue4",
    size = 4, stroke = 1.8
  ) +
  # GlobalSoilMap depth boundaries
  geom_hline(
    yintercept = std_d[-1],
    linetype = "dashed", colour = "grey50", linewidth = 0.4
  ) +
  annotate("text", x = 34, y = std_d[-1] + 1.5,
           label = paste0(std_d[-1], " cm"),
           size = 3, colour = "grey40", hjust = 0) +
  scale_y_reverse(limits = c(100, 0), breaks = seq(0, 100, 10)) +
  scale_x_continuous(limits = c(0, 45), expand = c(0, 0.5)) +
  labs(
    title    = "Mass-preserving spline harmonisation",
    subtitle = "Brown = original horizon means  |  Blue curve = spline  |  Open circles = spline at standard-depth midpoints",
    x = "Clay (%)",
    y = "Depth (cm)"
  ) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.minor = element_blank())
```

> **Key property:** the area under the spline curve within each original
> horizon equals exactly the area of the original rectangle — the total
> mass (quantity of clay) is conserved at every horizon.

---

# Fetching KSSL data — North Carolina

```{r fetch, eval=FALSE}
library(soilDB)
library(aqp)

# All KSSL pedons characterised in North Carolina
# returnMorphologicData = FALSE → lab data only (faster)
nc_spc <- fetchKSSL(state = "NC", returnMorphologicData = FALSE)

cat("Pedons  :", length(nc_spc), "\n")
cat("Horizons:", nrow(horizons(nc_spc)), "\n")
```

```
Pedons  : 487
Horizons: 3 241
```

## Horizon table structure

```{r inspect, eval=FALSE}
h <- horizons(nc_spc)

# Columns we will use
cols <- c("pedon_key", "hzn_desgn", "hzn_top", "hzn_bot",
          "sand", "silt", "clay", "db_od", "oc",
          "w6clod",   # θ at  6 kPa  =   61.2 cm head  (pF 1.79)
          "w3cld",    # θ at 33 kPa  =  336.7 cm head  (pF 2.53)  ← field capacity
          "w15l2")    # θ at 1500 kPa= 15300   cm head  (pF 4.18) ← wilting point

head(h[, intersect(cols, names(h))], 6)
```

```
  pedon_key hzn_desgn hzn_top hzn_bot  sand  silt  clay db_od   oc w6clod w3cld w15l2
1  10NC001A        Ap       0      22  62.3  24.5  13.2  1.48 1.02   0.31  0.19  0.07
2  10NC001A        Bt      22      54  40.1  22.7  37.2  1.55 0.34   0.42  0.29  0.15
3  10NC001A        BC      54      95  43.8  23.1  33.1  1.57 0.18   0.39  0.26  0.13
4  10NC001A         C      95     130  51.2  22.6  26.2  1.60 0.12     NA  0.22  0.10
```

Depending on the KSSL query and state, between 3 and 10 water-retention
columns may be returned.  The `swrc_example` dataset — used in the
modelling sections below — was assembled from a richer KSSL query and
contains measurements at **8 standard tensions** (cm³ cm⁻³):

```{r retention-table, echo=FALSE}
knitr::kable(
  data.frame(
    Tension     = c("~0 kPa", "1 kPa", "3 kPa", "10 kPa",
                    "33 kPa", "100 kPa", "1 500 kPa", "Oven-dry"),
    `Head (cm)` = c("1", "10", "32", "100",
                    "316", "1 000", "15 849", "10 000 000"),
    pF          = c("0.0", "1.0", "1.5", "2.0",
                    "2.5", "3.0", "4.2", "7.0"),
    Meaning     = c("Near-saturation",
                    "—",
                    "—",
                    "—",
                    "Field capacity (FC)",
                    "—",
                    "Permanent wilting point (WP)",
                    "Air-dry / residual")
  ),
  col.names = c("Tension", "Head (cm)", "pF", "Meaning"),
  align = c("l", "r", "r", "l")
)
```

## Visualise raw profiles with aqp

```{r plotspc, eval=FALSE}
library(aqp)
set.seed(1)
sub <- nc_spc[sample(1:length(nc_spc), 12), ]
plotSPC(sub, color = "clay",
        col.label = "Clay (%)",
        name = "hzn_desgn",
        cex.names = 0.7)
title("Raw KSSL profiles — North Carolina (n = 12)")
```

```{r plotspc-img, echo=FALSE, out.width="100%", fig.cap="Simulated KSSL-like profiles for 12 North Carolina pedons. Each strip is one profile; colours encode clay content (%). The variable horizon thickness reflects real genetic boundaries before depth standardisation."}
knitr::include_graphics("figures/plotspc-raw.png")
```

---

# Mass-preserving spline harmonisation on KSSL

## Prepare input for mpspline2

```{r harmonise-setup, eval=FALSE}
library(mpspline2)

h <- horizons(nc_spc) %>%
  rename(PEDON_ID = pedon_key,
         top      = hzn_top,
         bot      = hzn_bot,
         bd       = db_od,
         soc      = oc) %>%
  filter(!is.na(top), !is.na(bot), bot > top)

std_d     <- c(0, 5, 15, 30, 60, 100)
soil_vars <- c("sand", "silt", "clay", "bd", "soc")
ret_vars  <- c("w6clod", "w3cld", "w15l2")
all_vars  <- c(soil_vars, ret_vars)
```

## Apply splines profile-by-profile

`mpspline2::mpspline()` expects a data frame with columns
`[id, top, bot, variable]` — one call per variable, then we assemble results.

```{r harmonise-apply, eval=FALSE}
spline_var <- function(df, var, std_d) {
  # Drop rows with NA in this variable
  dat <- df %>%
    select(PEDON_ID, top, bot, value = all_of(var)) %>%
    filter(!is.na(value)) %>%
    as.data.frame()

  if (nrow(dat) == 0) return(NULL)

  # mpspline2 expects columns: id, top, bot, value
  sp <- tryCatch(
    mpspline2::mpspline(dat, var_name = "value", d = std_d, vlow = 0),
    error = function(e) NULL
  )
  if (is.null(sp)) return(NULL)

  # Extract standard-depth estimates for each profile
  map_dfr(names(sp), function(pid) {
    est <- sp[[pid]]$est_dcm   # mean value per standard-depth interval
    n   <- length(std_d) - 1
    tibble(
      PEDON_ID  = pid,
      depth_top = std_d[seq_len(n)],
      depth_bot = std_d[seq_len(n) + 1],
      !!var    := est[seq_len(n)]
    )
  })
}

# Run for all variables (~1 min for 487 profiles × 8 variables)
harmonised_list <- map(all_vars, ~ spline_var(h, .x, std_d))
names(harmonised_list) <- all_vars

# Join all variables by PEDON_ID + depth interval
harmonised <- reduce(harmonised_list, full_join,
                     by = c("PEDON_ID", "depth_top", "depth_bot")) %>%
  mutate(depth = paste0(depth_top, "-", depth_bot))

cat("Harmonised rows:", nrow(harmonised), "\n")
```

```
Harmonised rows: 2 435
```

## Compare raw vs. splined: one profile

The chunk below uses simulated horizon data for pedon `10NC001A` to
illustrate the harmonisation result — the actual KSSL data require the
download step above.

```{r compare-plot, fig.height=6, fig.width=5.5, fig.cap="Raw horizon values (brown rectangles and segments) vs. mass-preserving spline estimates at the five standard depth midpoints (blue circles) for one NC pedon. Dashed lines mark the standard depth boundaries."}
pid   <- "10NC001A"
std_d <- c(0, 5, 15, 30, 60, 100)

# Simulated raw horizons (realistic NC Ultisol — clay increases with depth)
raw_prof <- data.frame(
  top = c(  0,  8, 22, 45,  78, 105),
  bot = c(  8, 22, 45, 78, 105, 145),
  clay = c(11, 16, 27, 36,  41,  43)
) |> dplyr::mutate(depth_mid = (top + bot) / 2)

# Simulated spline estimates at standard depth midpoints
sp_prof <- data.frame(
  depth_top = c( 0,  5, 15, 30,  60),
  depth_bot = c( 5, 15, 30, 60, 100),
  clay      = c(10.8, 14.6, 24.1, 34.3, 40.6)
) |> dplyr::mutate(depth_mid = (depth_top + depth_bot) / 2)

ggplot() +
  geom_rect(data = raw_prof,
            aes(xmin = 0, xmax = clay, ymin = top, ymax = bot),
            fill = "sienna3", alpha = 0.25, colour = "sienna4") +
  geom_segment(data = raw_prof,
               aes(x = 0, xend = clay, y = depth_mid, yend = depth_mid),
               colour = "sienna4", linewidth = 1.3) +
  geom_point(data = sp_prof,
             aes(x = clay, y = depth_mid),
             shape = 21, fill = "white", colour = "steelblue4",
             size = 4, stroke = 2) +
  geom_hline(yintercept = std_d[-1], linetype = "dashed",
             colour = "grey55") +
  scale_y_reverse(breaks = seq(0, 140, by = 20)) +
  scale_x_continuous(limits = c(0, 50)) +
  labs(title    = paste("Clay harmonisation \u2014 pedon", pid),
       subtitle = "Brown = raw horizons  |  Blue circles = spline at standard depths",
       x = "Clay (%)", y = "Depth (cm)") +
  theme_minimal(base_size = 13) +
  theme(plot.subtitle = element_text(colour = "grey40", size = 11))
```

---

# Exploring the harmonised dataset

> The following plots use the **`swrc_example`** dataset bundled with
> `soilFlux` — it has the same structure as the harmonised KSSL output
> (5 standard depths, 8 matric-head points, texture + BD + SOC) and
> requires no internet connection or TensorFlow.

```{r load-example}
data("swrc_example")
glimpse(swrc_example)
```

## Distribution of soil properties by depth

```{r prop-dist, fig.height=6, fig.cap="Distribution of clay, silt, sand and bulk density across the five GlobalSoilMap depth intervals."}
swrc_example %>%
  distinct(PEDON_ID, depth, clay, silt, sand_total, bd) %>%
  pivot_longer(c(clay, silt, sand_total, bd),
               names_to = "property", values_to = "value") %>%
  mutate(
    property = factor(property,
                      levels = c("clay", "silt", "sand_total", "bd"),
                      labels = c("Clay (%)", "Silt (%)", "Sand (%)",
                                 "Bulk density (g cm\u207b\u00b3)")),
    depth = factor(depth,
                   levels = c("0-5","5-15","15-30","30-60","60-100"))
  ) %>%
  ggplot(aes(x = value, y = depth, fill = depth)) +
  geom_boxplot(alpha = 0.75, outlier.size = 0.6) +
  facet_wrap(~ property, scales = "free_x", ncol = 2) +
  scale_fill_manual(values = c("0-5"    = "#8B2500",
                                "5-15"   = "#CC5500",
                                "15-30"  = "#E8A020",
                                "30-60"  = "#3985C8",
                                "60-100" = "#1A4580")) +
  labs(x = NULL, y = "Depth interval") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        strip.text = element_text(face = "bold"))
```

## Observed retention curves by depth

These are the raw laboratory measurements: volumetric water content (θ) at
three standard tensions, plotted as pF vs. θ for all profiles.

```{r ret-curves, fig.height=6, fig.cap="Observed soil water retention data by depth interval. Each line connects the three measured tensions for one profile. Colour encodes depth."}
head_to_pF <- function(h) log10(h)

depth_pal <- c("0-5"    = "#8B2500",
               "5-15"   = "#CC5500",
               "15-30"  = "#E8A020",
               "30-60"  = "#3985C8",
               "60-100" = "#1A4580")

swrc_example %>%
  mutate(pF = head_to_pF(matric_head),
         depth = factor(depth,
                        levels = c("0-5","5-15","15-30","30-60","60-100"))) %>%
  ggplot(aes(x = pF, y = water_content,
             group = PEDON_ID, colour = depth)) +
  geom_line(alpha = 0.35, linewidth = 0.5) +
  geom_point(alpha = 0.5, size = 1) +
  facet_wrap(~ depth, ncol = 3) +
  scale_colour_manual(values = depth_pal) +
  labs(
    title   = "Observed water retention by depth interval",
    x       = expression(pF == log[10](h~"[cm]")),
    y       = expression(theta~(m^3~m^{-3})),
    caption = "Each line = one soil profile; points = measured tensions (6, 33, 1500 kPa)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        strip.text = element_text(face = "bold"))
```

## Texture triangle

The ternary diagram below shows the sand–silt–clay composition of all
profiles in `swrc_example`, coloured by USDA texture class.  Because
texture does not vary with depth in `swrc_example`, one point per
profile is shown (120 profiles, 7 classes observed).  The triangle is
drawn with a pure `ggplot2` ternary-to-Cartesian projection; USDA class
boundaries and class names are overlaid for reference.

```{r texture-tri, fig.cap="USDA texture triangle: sand–silt–clay composition of `swrc_example` profiles, coloured by USDA texture class."}
knitr::include_graphics("figures/texture-triangle.png")
```

## Water content at field capacity vs. wilting point

```{r fc-wp, fig.height=5, fig.cap="Scatter of field-capacity water content (33 kPa) vs. wilting-point water content (1500 kPa), coloured by USDA texture class."}
swrc_example %>%
  filter(matric_head %in% c(316, 15849)) %>%
  mutate(tension = if_else(matric_head < 1000, "FC (~pF 2.5)", "WP (~pF 4.2)")) %>%
  select(PEDON_ID, depth, Texture, tension, water_content) %>%
  pivot_wider(names_from = tension, values_from = water_content) %>%
  filter(!is.na(`FC (~pF 2.5)`), !is.na(`WP (~pF 4.2)`)) %>%
  ggplot(aes(x = `WP (~pF 4.2)`, y = `FC (~pF 2.5)`, colour = Texture)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey60") +
  geom_point(alpha = 0.6, size = 1.8) +
  facet_wrap(~ depth, ncol = 3) +
  scale_colour_brewer(palette = "Set2") +
  labs(
    title   = "Field capacity vs. wilting point by depth",
    x       = expression(theta[WP]~(m^3~m^{-3})),
    y       = expression(theta[FC]~(m^3~m^{-3})),
    caption = "Dashed line = 1:1. Points above line = plant-available water"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        strip.text = element_text(face = "bold"))
```

---

# Reshaping water retention and preparing for soilFlux

After harmonisation we have one row per **(profile × depth interval)** with
soil properties AND water content at 2–3 tensions.  `soilFlux` expects one
row per **(profile × depth × matric-head point)** — i.e. long format.

```{r pivot, eval=FALSE}
# Matric-head values for KSSL tension columns (kPa → cm water, h = P/ρg)
head_lookup <- c(w6clod =    61.2,   # 6 kPa
                 w3cld  =   336.7,   # 33 kPa
                 w15l2  = 15300.0)   # 1 500 kPa

retention_long <- harmonised %>%
  pivot_longer(
    cols      = all_of(names(head_lookup)),
    names_to  = "tension_col",
    values_to = "water_content"
  ) %>%
  mutate(matric_head = head_lookup[tension_col]) %>%
  select(-tension_col) %>%
  filter(!is.na(water_content), !is.na(clay))

cat("Long-format rows:", nrow(retention_long), "\n")
# Long-format rows: 6 893
```

```{r prepare, eval=FALSE}
swrc_df <- prepare_swrc_data(
  retention_long,
  depth_col = "depth",
  fix_bd    = TRUE,
  fix_theta = TRUE
)

# Add USDA texture class
swrc_df <- add_texture(swrc_df, sand_col = "sand")

glimpse(swrc_df)
```

```
Rows: 6,893
Columns: 16
$ PEDON_ID     <chr> "10NC001A", "10NC001A", ...
$ depth        <chr> "0-5", "0-5", ...
$ Depth_num    <dbl> 2.5, 2.5, ...
$ Depth_label  <chr> "0-5 cm", "0-5 cm", ...
$ matric_head  <dbl> 61.2, 336.7, ...
$ water_content<dbl> 0.31, 0.19, ...
$ theta_n      <dbl> 0.31, 0.19, ...
$ theta_max_n  <dbl> 0.42, 0.42, ...
$ clay         <dbl> 13.8, 13.8, ...
$ silt         <dbl> 24.1, 24.1, ...
$ bd_gcm3      <dbl> 1.48, 1.48, ...
$ soc          <dbl> 1.05, 1.05, ...
$ Texture      <chr> "Sandy Loam", ...
```

---

# Fitting the soilFlux model

> **Requires TensorFlow.** Run `tensorflow::install_tensorflow()` once to
> set up the Python backend.

## Train / validation / test split

Always split by **profile**, not by row, to avoid data leakage across depths.

```{r split, eval=FALSE}
# Remove rows where water content was not measured (prevents NaN gradients)
swrc_df <- swrc_df %>% filter(!is.na(theta_n))

set.seed(2025)
ids     <- unique(swrc_df$PEDON_ID)
tr_ids  <- sample(ids, floor(0.70 * length(ids)))
val_ids <- sample(setdiff(ids, tr_ids), floor(0.15 * length(ids)))
te_ids  <- setdiff(ids, c(tr_ids, val_ids))

train_df <- swrc_df[swrc_df$PEDON_ID %in% tr_ids,  ]
val_df   <- swrc_df[swrc_df$PEDON_ID %in% val_ids, ]
test_df  <- swrc_df[swrc_df$PEDON_ID %in% te_ids,  ]

cat(sprintf("Train: %d profiles | Val: %d | Test: %d\n",
            length(tr_ids), length(val_ids), length(te_ids)))
# Train: 280 profiles | Val: 60 | Test: 60
```

## Fit

```{r fit, eval=FALSE}
fit <- fit_swrc(
  train_df   = train_df,
  x_inputs   = c("clay", "silt", "bd_gcm3", "soc", "Depth_num"),
  val_df     = val_df,
  epochs     = 80,
  batch_size = 256,
  lambdas    = norouzi_lambdas("norouzi"),
  verbose    = TRUE
)
```

```{r history, fig.cap="Training and validation loss over 60 epochs on `swrc_example` (70/15/15 split)."}
knitr::include_graphics("figures/training-history.png")
```

---

# Evaluation

```{r evaluate, eval=FALSE}
m <- evaluate_swrc(fit, test_df)
print(m)
```

```
# A tibble: 1 × 4
   RMSE    MAE     R2   Bias
  <dbl>  <dbl>  <dbl>  <dbl>
1 0.038  0.029  0.871  0.002
```

```{r metrics-depth-code, eval=FALSE}
pred_test <- predict_swrc(fit, test_df)
test_aug  <- dplyr::mutate(test_df, theta_pred = pred_test)

m_depth <- swrc_metrics_by_group(test_aug,
                                  obs_col   = "theta_n",
                                  pred_col  = "theta_pred",
                                  group_col = "Depth_label") |>
  dplyr::rename(model = Depth_label)
plot_swrc_metrics(m_depth, title = "Performance by depth interval")
```

```{r metrics-depth-img, echo=FALSE, out.width="90%", fig.cap="RMSE, MAE and R² by depth interval on the held-out test set."}
knitr::include_graphics("figures/metrics-depth.png")
```

```{r metrics-texture-code, eval=FALSE}
m_tex <- swrc_metrics_by_group(test_aug,
                                obs_col   = "theta_n",
                                pred_col  = "theta_pred",
                                group_col = "Texture") |>
  dplyr::rename(model = Texture)
plot_swrc_metrics(m_tex, title = "Performance by texture class")
```

```{r metrics-texture-img, echo=FALSE, out.width="90%", fig.cap="RMSE, MAE and R² by USDA texture class on the held-out test set."}
knitr::include_graphics("figures/metrics-texture.png")
```

---

# Visualising predictions

## Full SWRC curves — depth × texture

For each texture × depth panel we display the **best-fit profile** from
the test set — the individual pedon whose CNN1D prediction minimises the
RMSE against its own observed water-content measurements.  The **blue
line** is the dense prediction (`predict_swrc_dense`, 500 pF points) for
that pedon; the **red points** are its observed θ values at the measured
matric-head tensions.  Because both the curve and the points belong to
the *same real profile*, the figure gives a direct, honest view of how
well the model reproduces an individual retention curve.  The y-axis
extends from pF −2 (wet plateau, θ ≈ θ_s) to pF 7.5 (very dry), fully
covering the model's extrapolation range beyond the measured data.

```{r swrc-curves-code, eval=FALSE}
sel_tex <- c("Sandy Loam", "Loam", "Clay Loam", "Clay")
test_sub <- test_df[test_df$Texture %in% sel_tex, ]

# Step 1 – predict at observed tensions for every test profile
test_sub$theta_pred <- predict_swrc(fit, test_sub)

# Step 2 – RMSE per profile; pick the best one per texture × depth cell
best_pids <- test_sub |>
  dplyr::group_by(PEDON_ID, Texture, Depth_label, Depth_num) |>
  dplyr::summarise(
    rmse = sqrt(mean((theta_pred - theta_n)^2, na.rm = TRUE)),
    .groups = "drop"
  ) |>
  dplyr::group_by(Texture, Depth_label, Depth_num) |>
  dplyr::slice_min(rmse, n = 1, with_ties = FALSE) |>
  dplyr::ungroup()

# Step 3 – dense prediction for each best profile
rep_profiles <- test_sub |>
  dplyr::inner_join(
    best_pids |> dplyr::select(PEDON_ID, Texture, Depth_label, Depth_num),
    by = c("PEDON_ID", "Texture", "Depth_label", "Depth_num")
  ) |>
  dplyr::group_by(PEDON_ID, Texture, Depth_label, Depth_num) |>
  dplyr::slice(1) |>
  dplyr::ungroup()

dense_best <- predict_swrc_dense(fit, newdata = rep_profiles, n_points = 500)

# Step 4 – observed points for those same profiles only
obs_pts <- test_sub |>
  dplyr::inner_join(
    best_pids |> dplyr::select(PEDON_ID, Texture, Depth_label, Depth_num),
    by = c("PEDON_ID", "Texture", "Depth_label", "Depth_num")
  ) |>
  dplyr::mutate(pF = log10(matric_head))
```

```{r swrc-curves, fig.cap="Predicted SWRC curves (blue) and observed data (red points) for the best-fit test pedon in each texture × depth panel. Both the curve and the points belong to the same individual profile. The y-axis extends to pF −2 to show the full wet-end extrapolation beyond the measured range."}
knitr::include_graphics("figures/swrc-curves.png")
```

## Predicted vs. observed scatter

```{r pred-obs-code, eval=FALSE}
pred_vals <- predict_swrc(fit, test_df)

data.frame(
  theta_obs  = test_df$theta_n,
  theta_pred = pred_vals,
  Depth      = test_df$Depth_label
) %>%
  ggplot(aes(x = theta_obs, y = theta_pred, colour = Depth)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey50") +
  geom_point(alpha = 0.5, size = 1.5) +
  coord_equal() +
  scale_colour_brewer(palette = "YlOrBr", direction = -1) +
  labs(x = expression(theta[obs]~(m^3~m^{-3})),
       y = expression(theta[pred]~(m^3~m^{-3})),
       colour = "Depth") +
  theme_minimal(base_size = 13)
```

```{r pred-obs, fig.cap="Predicted vs. observed volumetric water content coloured by depth interval. Dashed line = 1:1."}
knitr::include_graphics("figures/pred-obs.png")
```

## Vertical profile — FC and WP at standard depths

`soilFlux` predicts the SWRC as a **continuous function**, so we can extract
θ at *any* matric potential — not only the three measured tensions.

```{r vertical-profile-code, eval=FALSE}
pid      <- te_ids[1]
pedon_df <- distinct(test_df[test_df$PEDON_ID == pid, ],
                     Depth_num, Depth_label, .keep_all = TRUE)

fc_pred <- predict_swrc(fit, pedon_df %>% mutate(matric_head = 316))
wp_pred <- predict_swrc(fit, pedon_df %>% mutate(matric_head = 15849))

profile_df <- data.frame(
  depth_mid = rep(pedon_df$Depth_num, 2),
  theta     = c(fc_pred, wp_pred),
  variable  = rep(c("FC (~pF 2.5)", "WP (~pF 4.2)"), each = nrow(pedon_df))
) %>% pivot_wider(names_from = variable, values_from = theta)

ggplot(profile_df, aes(y = depth_mid)) +
  geom_ribbon(aes(xmin = `WP (~pF 4.2)`, xmax = `FC (~pF 2.5)`),
              fill = "steelblue", alpha = 0.2) +
  geom_path(aes(x = `FC (~pF 2.5)`, colour = "FC (~pF 2.5)"), linewidth = 1.3) +
  geom_path(aes(x = `WP (~pF 4.2)`, colour = "WP (~pF 4.2)"), linewidth = 1.3) +
  geom_point(aes(x = `FC (~pF 2.5)`, colour = "FC (~pF 2.5)"), size = 3.5) +
  geom_point(aes(x = `WP (~pF 4.2)`, colour = "WP (~pF 4.2)"), size = 3.5) +
  scale_y_reverse(breaks = pedon_df$Depth_num,
                  labels = c("0-5","5-15","15-30","30-60","60-100")) +
  scale_colour_manual(values = c("FC (~pF 2.5)" = "steelblue4",
                                 "WP (~pF 4.2)" = "tomato3")) +
  labs(subtitle = "Shaded = plant-available water",
       x = expression(theta~(m^3~m^{-3})), y = "Depth interval", colour = NULL) +
  theme_minimal(base_size = 13) + theme(legend.position = "bottom")
```

```{r vertical-profile, fig.cap="Predicted FC and WP by depth for one test pedon. Shaded ribbon = plant-available water capacity."}
knitr::include_graphics("figures/vertical-profile.png")
```

---

# Exporting the continuous SWRC data

`predict_swrc_dense()` returns a standard **tidy tibble** — one row per
(pedon × pF point) — that can be written directly to CSV, Excel, or any
other format.  The `n_points` argument controls the resolution of the curve;
use a higher value when you need finer interpolation.

```{r export-dense, eval=FALSE}
# ── 1. Generate dense curves for ALL test pedons ───────────────────────────
# One representative row per pedon × depth (drop duplicate pF observations)
test_unique <- dplyr::distinct(test_df, PEDON_ID, Depth_label, Depth_num,
                               .keep_all = TRUE)

dense_all <- predict_swrc_dense(
  fit,
  newdata  = test_unique,
  n_points = 1000          # 1 000 pF steps from -2 to 7.6
)

# The result is a long tibble with columns:
#   PEDON_ID | Depth_label | Depth_num | … covariates … | pF | theta_pred
dplyr::glimpse(dense_all)

# ── 2. Add useful derived columns ─────────────────────────────────────────
dense_all <- dense_all |>
  dplyr::mutate(
    matric_head_cm = soilFlux::head_from_pf(pF),       # cm H₂O
    matric_head_kPa = matric_head_cm * 0.098067        # kPa
  )

# ── 3a. Export to CSV ─────────────────────────────────────────────────────
write.csv(dense_all, "swrc_continuous_predictions.csv", row.names = FALSE)

# ── 3b. Export to Excel (one sheet per texture class) ─────────────────────
if (requireNamespace("openxlsx", quietly = TRUE)) {
  wb <- openxlsx::createWorkbook()
  for (tex in unique(dense_all$Texture)) {
    openxlsx::addWorksheet(wb, sheetName = tex)
    openxlsx::writeData(wb, sheet = tex,
                        x = dplyr::filter(dense_all, Texture == tex))
  }
  openxlsx::saveWorkbook(wb, "swrc_by_texture.xlsx", overwrite = TRUE)
}

# ── 3c. Wide format — one column per pF breakpoint ────────────────────────
# Useful for pedotransfer-function tables or GIS attribute tables
pf_breaks <- c(0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.2, 5.0, 6.0, 7.0)

wide_df <- predict_swrc(fit,
                        newdata = test_unique |>
                          tidyr::uncount(length(pf_breaks)) |>
                          dplyr::mutate(
                            pF          = rep(pf_breaks, nrow(test_unique)),
                            matric_head = soilFlux::head_from_pf(pF)
                          )) |>
  dplyr::bind_cols(
    test_unique |>
      tidyr::uncount(length(pf_breaks)) |>
      dplyr::select(PEDON_ID, Depth_label, pF = dplyr::matches("pF"))
  )

# Or more simply, pivot the dense tibble to wide at rounded pF values:
wide_df <- dense_all |>
  dplyr::mutate(pF_round = round(pF, 1)) |>
  dplyr::group_by(PEDON_ID, Depth_label, pF_round) |>
  dplyr::summarise(theta_pred = mean(theta_pred), .groups = "drop") |>
  tidyr::pivot_wider(names_from  = pF_round,
                     names_prefix = "theta_pF",
                     values_from = theta_pred)

write.csv(wide_df, "swrc_wide.csv", row.names = FALSE)
```

> **Tip — resolution vs. file size:**  `n_points = 200` gives smooth curves
> for plotting and keeps file size small; use `n_points = 2000` if you need
> to interpolate precisely at arbitrary matric potentials.

The table summary for the dense output:

| Column | Type | Description |
|---|---|---|
| `PEDON_ID` | chr | Pedon identifier |
| `Depth_label` | chr | e.g. `"15-30 cm"` |
| `Depth_num` | dbl | Numeric depth midpoint |
| `clay`, `silt`, … | dbl | Soil covariates passed to the model |
| `pF` | dbl | Log₁₀(|h| cm) — from −2 (wet) to 7.6 (dry) |
| `theta_pred` | dbl | Predicted θ (m³ m⁻³) |

---

# Save and reload the model

```{r save, eval=FALSE}
save_swrc_model(fit, dir = "./models", name = "nc_kssl_v1")

# In a future session
fit2  <- load_swrc_model("./models", "nc_kssl_v1")
pred2 <- predict_swrc(fit2, test_df)
```

---

# Summary

| Step | Tool |
|------|------|
| Download KSSL pedons | `soilDB::fetchKSSL(state = "NC")` |
| Inspect horizon table | `aqp::horizons()` + `plotSPC()` |
| **Mass-preserving spline harmonisation** | `mpspline2::mpspline()` |
| Pivot retention to long | `tidyr::pivot_longer()` |
| Prepare for model | `soilFlux::prepare_swrc_data()` |
| Add texture class | `soilFlux::add_texture()` |
| Fit SWRC | `soilFlux::fit_swrc()` |
| Dense prediction | `soilFlux::predict_swrc_dense()` |
| Evaluate | `soilFlux::evaluate_swrc()` |
| Visualise | `soilFlux::plot_swrc()` / `plot_pred_obs()` |
| **Export curves (CSV / Excel / wide)** | `write.csv()` / `openxlsx::saveWorkbook()` / `tidyr::pivot_wider()` |
| Save / reload | `soilFlux::save_swrc_model()` / `load_swrc_model()` |

---

# References

Bishop, T. F. A., McBratney, A. B., & Laslett, G. M. (1999). Modelling
soil attribute depth functions with equal-area quadratic smoothing splines.
*Geoderma*, 91(1–2), 27–45. <https://doi.org/10.1016/S0016-7061(99)00003-8>

Norouzi, A. M., Feyereisen, G. W., Papanicolaou, A. N., & Wilson, C. G.
(2025). Physics-Informed Neural Networks for Estimating a Continuous Form
of the Soil Water Retention Curve. *Journal of Hydrology*.
<https://doi.org/10.1029/2024WR038149>

Arrouays, D., et al. (2014). GlobalSoilMap: Toward a fine-resolution global
grid of soil properties. *Advances in Agronomy*, 125, 93–134.
<https://doi.org/10.1016/B978-0-12-800137-0.00003-0>

Soil Survey Staff, NRCS, USDA. *KSSL Characterization Database*.
Accessed via `soilDB::fetchKSSL()`.