Spatial clustering based on correlation or other metrics.
Usage
cluster_locid(
x,
varname,
locid = "locid",
time = "UTC",
locid_info = NULL,
weight = NULL,
group = NULL,
k = c(1:20, 25, 30, 40, 50, 75, 100, 150, 200, 300, 500, 1000, 10000),
max_loss = 0.05,
distance = "cor",
cores = 1,
plot = FALSE,
verbose = TRUE,
...
)
Arguments
- x
`data.frame` (merra subset) with location and time identifiers, and a time-series variable to cluster.
- varname
name of column with data to be used to cluster locations.
- locid
name of column of location identifiers.
- time
name of column with the time dimension.
- locid_info
(optional) `data.frame` or `sf` object with weights and/or spatial groups (regions) of location identifiers.
- weight
(optional) name of column with (positive) weights in `locid_info`, used in calculating weighted `mean` and `sd` metrics.
- group
(optional) name of column with group-names of locations (such as regions). If provided, clustering will be made for each group separately.
- k
(optional) integer vector with the numbers of clusters to test. If `NULL`, the clustering process starts from `1`, goes up to the number of locations, and terminates when the `max_loss` condition is met. The default is the vector shown in Usage.
- max_loss
maximum loss of variation (standard deviation) of clustered variable, measured as `1 - sd(clustered_variable) / sd(original_variable)`. Default value is `0.05`, meaning up to `5` percent of variability of original, non-clustered variable is allowed to be lost by clustering.
- distance
character name of the distance measure to use in `TSdist::KMedoids`. The default is `"cor"`, Pearson's correlation between the time series of the variable at different locations. Allowed measures: `"euclidean", "manhattan", "minkowski", "infnorm", "ccor", "sts", "dtw", "keogh_lb", "edr", "erp", "lcss", "fourier", "tquest", "dissimfull", "dissimapprox", "acf", "pacf", "ar.lpc.ceps", "ar.mah", "ar.mah.statistic", "ar.mah.pvalue", "ar.pic", "cdm", "cid", "cor", "cort", "wav", "int.per", "per", "mindist.sax", "ncd", "pred", "spec.glk", "spec.isd", "spec.llr", "pdc", "frechet"`.
- cores
integer number of processor cores to use, currently ignored.
- verbose
logical; if `TRUE` (the default), progress of the clustering process is reported.
- ...
additional parameters to pass to `TSdist::KMedoids`, might be required for some distance measures.
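To make the `max_loss` criterion concrete, the sketch below computes `1 - sd(clustered_variable) / sd(original_variable)` for several values of `k` on simulated data, using base R only. It is an illustration under stated assumptions, not the function's implementation: `hclust` stands in for `TSdist::KMedoids`, the distance `sqrt(2 * (1 - r))` is one common correlation-based definition, each location is replaced by the unweighted mean series of its cluster, and all object names (`wide`, `sd_loss`, `k_selected`) are hypothetical.

```r
set.seed(1)

# Simulated data: 6 locations x 200 time steps, two groups of similar series.
n_time <- 200
base_a <- sin(seq_len(n_time) / 10)
base_b <- cos(seq_len(n_time) / 7)
wide <- cbind(
  sapply(1:3, function(i) base_a + rnorm(n_time, sd = 0.1)),
  sapply(1:3, function(i) base_b + rnorm(n_time, sd = 0.1))
)
colnames(wide) <- paste0("loc", 1:6)

# Correlation-based distance between locations (assumed definition).
d <- as.dist(sqrt(2 * pmax(1 - cor(wide), 0)))

# Stand-in clustering; the documented function uses TSdist::KMedoids instead.
tree <- hclust(d, method = "average")

# Loss of standard deviation when every series is replaced by the
# (unweighted) mean series of its cluster.
sd_loss <- function(k) {
  cl <- cutree(tree, k = k)
  clustered <- sapply(seq_len(ncol(wide)), function(j) {
    rowMeans(wide[, cl == cl[j], drop = FALSE])
  })
  1 - sd(as.vector(clustered)) / sd(as.vector(wide))
}

k_values <- 1:6
loss <- sapply(k_values, sd_loss)

# Smallest tested k whose loss does not exceed max_loss.
max_loss <- 0.05
k_selected <- k_values[which(loss <= max_loss)[1]]
```

With one cluster per location the clustered series equal the originals, so the loss is zero. Fewer clusters average away more variability and the loss grows.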
Value
`data.frame` with clustering results for each tested number of clusters, with columns:
- k
Number of clusters
- N
Total number of time series
- locid
location identifier in `merra2ools` datasets
- "group"
(if provided) column with locid-groups
- cluster
cluster number in every `k`-group
- weight
weight of the cluster in the `k`-group
- sd_N
standard deviation of the whole sample of (N) time-series
- sd_k
standard deviation of clustered time series with `k` clusters
- sd_loss
loss of standard deviation as a result of clustering, for each `k`
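As a sketch of how the returned table might be used, the example below picks the smallest `k` that satisfies a loss threshold and extracts the location-to-cluster mapping for it. The `res` object is a hypothetical stand-in with made-up values. It assumes one row per combination of `k` and `locid` and includes only a subset of the documented columns.

```r
# Hypothetical result shaped like the documented output (values are made up).
res <- data.frame(
  k       = rep(1:3, each = 3),
  locid   = rep(c(101L, 102L, 103L), times = 3),
  cluster = c(1, 1, 1,  1, 1, 2,  1, 2, 3),
  sd_loss = rep(c(0.30, 0.04, 0.00), each = 3)
)

# Smallest number of clusters whose loss stays within the threshold.
max_loss <- 0.05
k_best <- min(res$k[res$sd_loss <= max_loss])

# Location-to-cluster mapping for the selected k.
mapping <- res[res$k == k_best, c("locid", "cluster")]
```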