
arxiv: 2607.01451 · v1 · pith:EB3BOMCUnew · submitted 2026-07-01 · 📊 stat.AP · cs.CG · stat.ME

Sampling for Region-Aggregated Spatial Scan Statistics

Pith reviewed 2026-07-03 01:13 UTC · model grok-4.3

classification 📊 stat.AP · cs.CG · stat.ME
keywords spatial scan statistics · region-aggregated data · uniform sampling · anomaly detection · geospatial analysis · statistical power · convergence analysis · point approximation

The pith

Replacing each aggregated region with 20-50 points sampled uniformly from its geometry, with the region's value split evenly among them, raises statistical power in spatial scan statistics compared with collapsing regions to centroids.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses a mismatch in geospatial anomaly detection: data arrives as counts inside fixed regions such as census tracts, yet the fastest scan algorithms expect point locations. Standard practice reduces every region to its single centroid, which the authors show throws away spatial extent and lowers detection power. Their fix converts each region into a modest cloud of 20-50 points drawn uniformly from its actual shape and distributes the region's total value evenly among those points. A convergence argument shows why this small number of samples already yields an accurate approximation. The result is a simple conversion step that lets existing point-based scanners run on region data with noticeably better sensitivity while staying computationally practical.

Core claim

The authors establish that a region can be replaced by a small set of points sampled uniformly from its geometry, with the region's count value divided equally among the points. This substitution produces a scan statistic whose power converges quickly to the power that would be obtained from the true continuous region. Because the approximation error drops rapidly, only 20-50 samples per region are needed in practice to recover most of the detection ability lost when regions are collapsed to centroids.

What carries the argument

Uniform sampling from region geometry combined with even value spreading, which converts each polygon into a small point set that approximates its contribution to the scan statistic.
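
The conversion step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a simple polygon given as a vertex list and uses rejection sampling inside the bounding box, and the names `point_in_polygon`, `region_to_points`, and the L-shaped example region are hypothetical.

```python
import random

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the simple polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def region_to_points(polygon, value, k=30, rng=None):
    """Replace a region by k uniform points, each carrying value / k."""
    rng = rng or random.Random()
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    min_x, max_x, min_y, max_y = min(xs), max(xs), min(ys), max(ys)
    points = []
    while len(points) < k:
        # Rejection sampling: draw in the bounding box, keep points inside.
        x = rng.uniform(min_x, max_x)
        y = rng.uniform(min_y, max_y)
        if point_in_polygon(x, y, polygon):
            points.append((x, y, value / k))
    return points

# Hypothetical L-shaped "county" with an aggregated count of 120.
l_shape = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 3), (0, 3)]
samples = region_to_points(l_shape, value=120, k=30, rng=random.Random(0))
```

Each returned triple is `(x, y, weight)`, so the output can be handed to any scanner that accepts weighted points. Rejection sampling is chosen here only because it needs no geometry library; it becomes slow for regions that fill a small share of their bounding box.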

If this is right

  • Existing point-based spatial scan algorithms can be applied directly to region-aggregated data without custom polygon-aware code.
  • Detection power rises because the sampled points retain information about the region's spatial extent rather than discarding it at a single centroid.
  • Computational cost remains comparable to the centroid method once the modest number of extra points is added.
  • The same conversion step applies to any scan statistic whose efficient implementation assumes point data.
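
The second point above can be illustrated with a toy case: a window that covers part of a region but misses its centroid. The sketch below is an assumption-laden example (a unit-square region, an axis-aligned rectangular window, and the hypothetical helpers `centroid_mass` and `sampled_mass`), not code from the paper.

```python
import random

def in_window(x, y, window):
    """True if (x, y) lies in the axis-aligned rectangle window."""
    x_min, y_min, x_max, y_max = window
    return x_min <= x <= x_max and y_min <= y <= y_max

def centroid_mass(region, value, window):
    """Centroid rule: the whole value is in the window, or none of it."""
    x_min, y_min, x_max, y_max = region
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    return value if in_window(cx, cy, window) else 0.0

def sampled_mass(region, value, window, k, rng):
    """Sampling rule: k uniform points, each carrying value / k."""
    x_min, y_min, x_max, y_max = region
    hits = sum(
        in_window(rng.uniform(x_min, x_max), rng.uniform(y_min, y_max), window)
        for _ in range(k)
    )
    return value * hits / k

region = (0.0, 0.0, 1.0, 1.0)    # hypothetical square region
window = (-1.0, -1.0, 0.3, 2.0)  # covers 30% of the region's area
value = 100.0

print(centroid_mass(region, value, window))  # 0.0: the centroid is outside
print(sampled_mass(region, value, window, 30, random.Random(1)))  # near 30 on average
```

The centroid rule assigns the window either all or none of the region's value, while the sampled estimate tracks the overlapping share of the area.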

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested on region data whose boundaries are known only approximately, to see how sensitive the power gain is to boundary error.
  • Because the sampling is independent per region, parallel generation of the point sets could further reduce preprocessing time on very large collections of polygons.
  • The convergence analysis might extend to other spatial statistics that aggregate over polygons, such as certain kernel density or hotspot methods.

Load-bearing premise

Uniform sampling from a region's geometry together with even spreading of its value produces an unbiased approximation to how that region would contribute if it were treated as a continuous area.
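
For a single region and a single window, the premise can be written out explicitly. The following is a standard Monte Carlo calculation, not a restatement of the paper's Theorem 1; the symbols $R$, $v$, $W$, $k$, $p_j$, $f$, and $\hat{m}_W$ are notation assumed here.

```latex
% R: region with aggregated value v; W: candidate window;
% p_1, ..., p_k: i.i.d. uniform samples from R, each carrying v/k;
% f = area(R \cap W) / area(R): share of the region covered by W.
\hat{m}_W = \frac{v}{k} \sum_{j=1}^{k} \mathbf{1}[p_j \in W],
\qquad
\mathbb{E}[\hat{m}_W] = v \cdot \frac{\operatorname{area}(R \cap W)}{\operatorname{area}(R)} = v f,
\qquad
\operatorname{Var}[\hat{m}_W] = \frac{v^2 f (1 - f)}{k} \le \frac{v^2}{4k}.
```

The standard deviation therefore shrinks like $1/\sqrt{k}$, the same rate quoted in the Figure 12 caption. Unbiasedness here means agreement with a region whose value is spread uniformly over its area; whether that is the right continuous model for a given dataset is the premise itself.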

What would settle it

Run the same scan statistic on a collection of real region-aggregated datasets, once with centroids and once with the 20-50 point sampling. If the sampled version shows no measurable increase in detected anomalies, or in power on synthetic signals planted inside the regions, the claimed improvement does not hold.

Figures

Figures reproduced from arXiv: 2607.01451 by Drew McClelland, Foad Namjoo, Jeff M. Phillips, Michael Matheny.

Figure 1: Region-to-point sampling on Arkansas counties: each county is replaced by …
Figure 2: Point Jaccard distance between target (red) and discovered rectangles (green outline) for …
Figure 3: New York City with 263 zip codes. The inset (upper right) shows the planted target rectangle (red dashed), spanning longitudes −74 to −73.8 and latitudes 40.6 to 40.8. The Point Jaccard distance versus pq difference, averaged over 20 trials with ±1 standard deviation bands …
Figure 4: Utah with 29 counties. The inset (upper right) shows the state and the planted target …
Figure 5: California with 58 counties. The inset (upper right) shows the state and the planted target …
Figure 6: The Point Jaccard distance on all 3,711 counties in the continental United States. The inset (upper right) shows the country and the planted target rectangle (red dashed), spanning longitudes −100 to −90 and latitudes 33 to 40.
Figure 7: Point Jaccard distance as a function of the relative size of the planted target rectangle in …
Figure 8: Point Jaccard distance between the planted target and the discovered region for Arkansas; …
Figure 9: Point Jaccard distance between the planted and discovered regions for Arkansas, with the …
Figure 10: Discovered rectangular windows on Arkansas for both planted targets (black dashed) at …
Figure 11: Georgia counties with a planted region covering approximately …
Figure 12: Point Jaccard distance as a function of k on Geom k, at fixed pq difference of 0.15, across six datasets. Mean over 20 trials; shaded bands show ±1 standard deviation. The black dashed line overlays Theorem 1's predicted asymptote d_Jac ∝ 1/√k, calibrated to Arkansas at k = 2. Perhaps surprisingly, this analysis indicates that as the number of regions n increases, then the number of samples per region k …
Figure 13: California Valley Fever: Point Jaccard distance to the San Joaquin Valley ground truth. The …
Figure 14: Circular cluster recovery on Arkansas counties: Point Jaccard distance vs. …
Figure 15: Discovered disk windows on Arkansas for both planted disk targets (black dashed) at …
Original abstract

Anomaly detection in geospatial data is a crucial tool in geographic information science (GIS), with applications ranging from national security to public-health surveillance to the study of societal disparities. This work focuses on spatial scan statistics and addresses a key mismatch: spatial counts are typically aggregated into predefined regions (census tracts, zip codes, counties), whereas the most efficient scan algorithms operate on spatial point data. The standard remedy -- collapsing each region to its centroid, as in widely used tools such as SaTScan -- is convenient but, as we show, discards the region's spatial extent and causes a significant loss in statistical power. To resolve this, we propose a simple yet scalable fix: replace each spatial region with 20-50 points sampled uniformly from its geometry and spread the region's values evenly across them. This approach improves statistical power while maintaining computational tractability. A convergence analysis explains why so few samples per region suffice. We recommend this sampling-based conversion as the default way to apply point-based spatial scan statistics to region-aggregated data for anomaly detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a sampling-based conversion for applying point-based spatial scan statistics to region-aggregated geospatial data: each region is replaced by 20-50 points drawn uniformly from its geometry, with the region's value spread evenly across the samples. The central claims are that this yields higher statistical power than the standard centroid approximation and that a convergence analysis shows why such a small number of samples per region suffices for practical use in anomaly detection.

Significance. If the approximation error is controlled uniformly over the collection of candidate windows, the method would provide a simple, scalable default for converting region data to point format while preserving more spatial information than centroids, with direct relevance to public-health and GIS applications.

major comments (2)
  1. [Convergence Analysis] Convergence Analysis section: the provided analysis establishes pointwise convergence of the per-region contribution under uniform sampling, but the scan statistic is defined as a supremum (maximum likelihood ratio) over all candidate windows. No uniform bound or concentration result over the (typically exponential) collection of windows is shown, so it is not immediate that the per-region error remains controlled after maximization; this directly affects whether the claimed power gain with 20-50 samples is guaranteed.
  2. [§4] §4 (Empirical Evaluation): the reported power comparisons use a fixed sample count (20-50) chosen after the fact; without a pre-specified sample-size rule or sensitivity analysis showing that the power advantage persists under the worst-case window, the empirical results do not yet confirm that the convergence analysis suffices for the maximized statistic.
minor comments (2)
  1. [§3] Notation for the even spreading of region values across samples is introduced without an explicit equation; adding a short displayed equation would clarify the conversion step.
  2. [Figure 2] Figure 2 caption does not state the number of Monte Carlo replications used to estimate power; this detail should be added for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our theoretical and empirical results. We address each major point below and will revise the manuscript accordingly to improve clarity and rigor.

point-by-point responses
  1. Referee: [Convergence Analysis] Convergence Analysis section: the provided analysis establishes pointwise convergence of the per-region contribution under uniform sampling, but the scan statistic is defined as a supremum (maximum likelihood ratio) over all candidate windows. No uniform bound or concentration result over the (typically exponential) collection of windows is shown, so it is not immediate that the per-region error remains controlled after maximization; this directly affects whether the claimed power gain with 20-50 samples is guaranteed.

    Authors: We agree that the analysis provides pointwise convergence per region rather than a uniform bound over the collection of windows. The per-region result is the core building block, and because the number of candidate windows is finite in any concrete application (even if large), the small per-region approximation error (controlled by the derived rate) translates to controlled error in the maximized statistic for the sample sizes considered. To make this explicit, we will revise the Convergence Analysis section to add a short discussion of the implications for the supremum, including a remark on the finite nature of the window collection and the continuity of the likelihood ratio. revision: yes
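
    The finite-collection argument in this response can be made concrete with Hoeffding's inequality and a union bound. What follows is a textbook sketch for one region, not the authors' planned revision; $\mathcal{W}$, $\varepsilon$, $\delta$, $f_W$, and $\hat{f}_W$ are notation assumed here.

```latex
% \hat{f}_W: fraction of a region's k uniform samples that fall in window W;
% f_W = area(R \cap W) / area(R) is its expectation;
% \mathcal{W}: finite collection of candidate windows, fixed before sampling.
\Pr\big[\, |\hat{f}_W - f_W| \ge \varepsilon \,\big] \le 2 e^{-2 k \varepsilon^2}
\quad \text{for each fixed } W,
\qquad
\Pr\big[\, \exists\, W \in \mathcal{W} : |\hat{f}_W - f_W| \ge \varepsilon \,\big]
\le 2\, |\mathcal{W}|\, e^{-2 k \varepsilon^2} \le \delta
\quad \text{when } k \ge \frac{\ln(2 |\mathcal{W}| / \delta)}{2 \varepsilon^2}.
```

    Under these assumptions the required number of samples grows only logarithmically with the number of windows. The bound covers a single region's sampled fraction; carrying it through the likelihood ratio and across regions would still need the continuity argument the authors mention.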

  2. Referee: [§4] §4 (Empirical Evaluation): the reported power comparisons use a fixed sample count (20-50) chosen after the fact; without a pre-specified sample-size rule or sensitivity analysis showing that the power advantage persists under the worst-case window, the empirical results do not yet confirm that the convergence analysis suffices for the maximized statistic.

    Authors: The sample sizes were chosen to align with the convergence rates shown in the analysis, where the approximation error drops below a practical threshold by k=20 samples per region. We will revise §4 to include an expanded sensitivity analysis that varies the number of samples per region (e.g., 5 to 100) across multiple simulated scenarios, explicitly checking robustness for windows that maximize the scan statistic and confirming that the power advantage over centroids stabilizes for k≥20. revision: yes
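
    A stripped-down version of such a sensitivity sweep is sketched below. It is an illustration under strong assumptions, not the authors' experiment: it tracks only the error in one region's share of a single window (covering a hypothetical 30% of the region), not the power of the maximized scan statistic, and the function names are invented for this example.

```python
import math
import random

def window_fraction_estimate(k, true_fraction, rng):
    """Fraction of k uniform samples landing in the window.

    Each sample falls in the window with probability true_fraction,
    the share of the region's area that the window covers.
    """
    hits = sum(rng.random() < true_fraction for _ in range(k))
    return hits / k

def rmse_for_k(k, true_fraction=0.3, trials=2000, seed=0):
    """Root-mean-square error of the sampled fraction over many trials."""
    rng = random.Random(seed)
    squared_error = 0.0
    for _ in range(trials):
        err = window_fraction_estimate(k, true_fraction, rng) - true_fraction
        squared_error += err * err
    return math.sqrt(squared_error / trials)

# Sweep the number of samples per region and compare with the
# binomial prediction sqrt(f * (1 - f) / k).
for k in (5, 10, 20, 50, 100):
    predicted = math.sqrt(0.3 * 0.7 / k)
    print(f"k={k:3d}  empirical RMSE={rmse_for_k(k):.3f}  predicted={predicted:.3f}")
```

    The empirical error should follow the 1/√k prediction closely, so going from 5 to 20 samples halves it while going from 20 to 100 is needed to halve it again. A full sensitivity analysis would repeat this with the scan statistic itself and planted signals.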

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper introduces a sampling conversion from region-aggregated data to point data and supplies a separate convergence analysis to justify the small sample count per region. No equation or central claim reduces by construction to a fitted parameter, self-definition, or self-citation chain; the approximation step and its error bound are presented as independent technical content. The reader's assessment of score 2.0 is consistent with the absence of load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The proposal rests on standard domain assumptions about region geometries and uniform sampling; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption Region geometries are available and permit uniform point sampling.
    Required for the sampling step described in the abstract.

pith-pipeline@v0.9.1-grok · 5719 in / 1198 out tokens · 40683 ms · 2026-07-03T01:13:33.201492+00:00 · methodology

