Spatial clustering based on correlation or other metrics.

Usage

cluster_locid(
  x,
  varname,
  locid = "locid",
  time = "UTC",
  locid_info = NULL,
  weight = NULL,
  group = NULL,
  k = c(1:20, 25, 30, 40, 50, 75, 100, 150, 200, 300, 500, 1000, 10000),
  max_loss = 0.05,
  distance = "cor",
  cores = 1,
  plot = FALSE,
  verbose = TRUE,
  ...
)

Arguments

x: `data.frame` (merra subset) with location and time identifiers, and a time-series variable to cluster.
varname: name of column with data to be used to cluster locations.
locid: name of column of location identifiers.
time: name of the column with the time dimension.
locid_info: (optional) `data.frame` or `sf` object with weights and/or spatial groups (regions) of location identifiers.
weight: (optional) name of column with (positive) weights in `locid_info`, used in calculating weighted `mean` and `sd` metrics.
group: (optional) name of column with group names of locations (such as regions). If provided, clustering is performed for each group separately.
k: (optional) integer vector of numbers of clusters to test. The default is the sequence shown in Usage. If `NULL`, the clustering process starts from `1`, goes up to the number of locations, and terminates when the `max_loss` condition is met.
max_loss: maximum loss of variation (standard deviation) of clustered variable, measured as `1 - sd(clustered_variable) / sd(original_variable)`. Default value is `0.05`, meaning up to `5` percent of variability of original, non-clustered variable is allowed to be lost by clustering.
distance: character name of the distance measure to use in `TSdist::KMedoids`. The default measure is `cor`, Pearson's correlation between the time series variable in different locations. Allowed measures: `"euclidean", "manhattan", "minkowski", "infnorm", "ccor", "sts", "dtw", "keogh_lb", "edr", "erp", "lcss", "fourier", "tquest", "dissimfull", "dissimapprox", "acf", "pacf", "ar.lpc.ceps", "ar.mah", "ar.mah.statistic", "ar.mah.pvalue", "ar.pic", "cdm", "cid", "cor", "cort", "wav", "int.per", "per", "mindist.sax", "ncd", "pred", "spec.glk", "spec.isd", "spec.llr", "pdc", "frechet"`.
cores: integer number of processor cores to use, currently ignored.
verbose: logical, whether to report the progress of the clustering process. `TRUE` by default.
...: additional parameters to pass to `TSdist::KMedoids`, might be required for some distance measures.
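
The interplay of `distance = "cor"` and `max_loss` can be illustrated with a small, self-contained sketch. It is not the package's internal code: `cluster::pam` stands in for `TSdist::KMedoids`, the correlation distance `sqrt(2 * (1 - rho))` is an assumed form, the toy data and all variable names are hypothetical, and replacing each location's series with its cluster mean is only one plausible reading of the "clustered variable".

```r
library(cluster)  # pam(); assumption: used here as a stand-in for TSdist::KMedoids

set.seed(42)
n_time <- 48
tt <- seq(0, 2 * pi, length.out = n_time)

# Toy data: 6 locations (columns) following two distinct temporal patterns
x <- cbind(
  sapply(1:3, function(i) sin(tt) + rnorm(n_time, sd = 0.1)),
  sapply(1:3, function(i) cos(tt) + rnorm(n_time, sd = 0.1))
)
colnames(x) <- paste0("loc", 1:6)

# Correlation-based dissimilarity between locations (assumed form)
rho <- cor(x)
d <- as.dist(sqrt(2 * pmax(1 - rho, 0)))

# k-medoids clustering of the locations into k = 2 clusters
fit <- pam(d, k = 2, diss = TRUE)
cluster_id <- fit$clustering

# Replace every location's series with the mean series of its cluster
x_clustered <- x
for (cl in unique(cluster_id)) {
  members <- which(cluster_id == cl)
  x_clustered[, members] <- rowMeans(x[, members, drop = FALSE])
}

# Loss of variation, as defined for `max_loss`
sd_loss <- 1 - sd(as.vector(x_clustered)) / sd(as.vector(x))
max_loss <- 0.05
accept <- sd_loss <= max_loss  # TRUE when this k satisfies the condition
```

In this reading, a larger `k` leaves more of the original variability intact, so `sd_loss` shrinks as `k` grows and the search can stop at the first `k` that satisfies `max_loss`.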

Value

`data.frame` with clustering results for each tested number of clusters, with columns:

k: Number of clusters
N: Total number of time series
locid: location identifier in `merra2ools` datasets
"group": (if provided) column with locid-groups
cluster: cluster number in every `k`-group
weight: weight of the cluster in the `k`-group
sd_N: standard deviation of the whole sample of (N) time-series
sd_k: standard deviation of clustered time series with `k` clusters
sd_loss: loss of standard deviation as a result of clustering, for each `k`
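
A result of this shape can be filtered to the smallest `k` that satisfies a given loss threshold. The sketch below uses a hand-made `data.frame` with some of the documented columns. The values are hypothetical, and the one-row-per-location-per-`k` layout is an assumption about the returned object.

```r
# Mock result: 3 locations, tested with k = 1, 2, 3 (hypothetical values)
res <- data.frame(
  k       = rep(c(1, 2, 3), each = 3),
  locid   = rep(c(101, 102, 103), times = 3),
  cluster = c(1, 1, 1,  1, 1, 2,  1, 2, 3),
  sd_loss = rep(c(0.20, 0.04, 0.00), each = 3)
)

max_loss <- 0.05

# Keep only the alternatives that satisfy the loss threshold
ok <- res[res$sd_loss <= max_loss, ]

# Smallest number of clusters among the acceptable alternatives
best_k <- min(ok$k)

# Cluster assignment of every location for the selected k
assignment <- ok[ok$k == best_k, c("locid", "cluster")]
```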

Examples

# see "Cluster locations" in "Get started"