Bootstrapping a significance threshold for periodogram analysis
In periodogram analysis of data with measurement errors, one must make a critical decision about where to draw the line between what is interpreted as a real signal and what could be just a signature of noise. Typically I am analyzing time series measurements of brightnesses of stars with the Fourier transform or the Lomb-Scargle periodogram, and I want to know if the star is varying significantly in brightness or if the measurements are consistent with what we expect for noisy measurements of a constant-brightness source. My approach is to consider whether any peaks in the periodogram are tall enough to be exceedingly unlikely to represent noise. A periodogram peak must be as tall as the “significance threshold” for me to believe it represents a real signal.

I go over some of the basics of periodogram analysis in my previous post “How data sampling affects the Fourier transform periodogram.” The animated periodograms in that post include a line at 4 times the average amplitude in the periodogram as an approximate significance threshold, but careful data analysis should adopt a threshold level that is deliberately chosen to be appropriate for the problem at hand. This post will demonstrate how to calculate a statistical threshold with bootstrap resampling.
Consider the periodogram below of a pulsating white dwarf star observed by NASA’s TESS satellite under the target name TIC 188087204.

This is the amplitude spectrum (square root of the power spectrum on the y axis, in units of parts-per-thousand), showing what the best-fit amplitude is for a series of sinusoids with different frequencies (sampled on the x axis in units of microHertz from zero to the Nyquist frequency). This is the frequency-domain representation of 60 days of brightness measurements of a white dwarf star taken every 2 minutes, with considerable measurement noise affecting each data point. The noise manifests in the periodogram as a noise floor of peaks with a range of heights distributed across all sampled frequencies. This might look like the grass of an unkempt lawn growing along the bottom of the plot. It is apparent in the periodogram shown above that some peaks stick out taller than the rest. The question of significance, in the lawn analogy, is whether the tallest peaks are dandelions (signal) or just the tallest of the many blades of grass (noise).
If we make some simple (but not exactly correct) assumptions about the time series data, namely that the measurements are evenly spaced in time and have uncorrelated Gaussian-random errors, then the amplitude values in the periodogram follow a Chi distribution with two degrees of freedom (I describe this a bit more here and here). That distribution function drops off rapidly toward high amplitude, but there is always some probability that a random noise peak could be sampled very far into that tail, however small that probability may be. So no matter how high a single peak rises above the rest in the periodogram to distinguish itself from all the other noise peaks, there’s always some, perhaps infinitesimal, risk that it’s a noise peak itself.
Acknowledging the risk that we might misinterpret a tall noise peak as genuine signal, we must accept some risk tolerance. If we understand the distribution of peaks due to noise, we can calculate an amplitude threshold above which noise would produce a peak only in exceptionally rare circumstances. This threshold is typically set to correspond to a chosen false alarm probability (FAP). The FAP is the probability that a random data set like yours could have a noise peak that would exceed the threshold. Perhaps a 1/1000 risk would be tolerable for your data analysis, as a reasonable trade-off between risking a false detection and potentially missing a real signal. For an analytical distribution like a 2-degrees-of-freedom Chi distribution, this threshold can be set at the location where the cumulative distribution function (the integrated probability density function) equals 1 - FAP, e.g., 0.999 for FAP = 1/1000. Note that this is the threshold for a single periodogram value; a periodogram samples many frequencies, each of which is another chance for a tall noise peak, so the threshold for the highest peak in the whole periodogram must be correspondingly higher.
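To make the analytic case concrete, here is a minimal sketch under the idealized assumptions above. A Chi distribution with two degrees of freedom, scaled to the noise level, is the Rayleigh distribution, whose cumulative distribution function can be written in terms of the mean noise amplitude as 1 - exp(-pi A^2 / (4 mean^2)). The function name `rayleigh_threshold` and its parameters are hypothetical, and treating the sampled frequencies as `n_independent` independent draws is an additional assumption, not something taken from the analysis in this post.

```python
import math

def rayleigh_threshold(mean_amplitude, fap, n_independent=1):
    """Amplitude that idealized Rayleigh noise exceeds with probability `fap`.

    Assumes evenly spaced data with uncorrelated Gaussian errors, and
    (hypothetically) `n_independent` independent periodogram frequencies.
    """
    # Per-frequency tail probability p such that 1 - (1 - p)**n = fap.
    # log1p/expm1 keep this accurate when fap / n is tiny.
    p_single = -math.expm1(math.log1p(-fap) / n_independent)
    # Invert the Rayleigh survival function exp(-pi A^2 / (4 mean^2)) = p.
    return mean_amplitude * math.sqrt(-4.0 * math.log(p_single) / math.pi)

# Threshold in units of the mean noise amplitude:
single = rayleigh_threshold(1.0, fap=1e-3)                       # one frequency
many = rayleigh_threshold(1.0, fap=1e-3, n_independent=21600)    # many frequencies
print(f"single frequency: {single:.2f} x mean amplitude")
print(f"21600 frequencies: {many:.2f} x mean amplitude")
```

The value 21600 is only illustrative: it is half the number of 2-minute samples in 60 days. The gap between the two results shows why the number of frequencies searched matters as much as the FAP itself.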
Unfortunately, real data sets generally don’t satisfy the assumptions that would produce Chi-distributed noise: the time sampling may not be evenly spaced, and the errors may be non-Gaussian, correlated, or variable with time. Your results are so sensitive to where you set the significance threshold that the choice should be carefully considered. One approach to calculating a threshold based on the properties of your actual data is bootstrapping.
TODO:…null hypothesis…sampling with replacement…CDF of highest peaks.
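One way the ingredients named above could fit together is sketched here; this is an illustration under stated assumptions, not necessarily the procedure used for the periodogram in this post. The null hypothesis is that the measurements carry no coherent signal, so their order in time is irrelevant. Each bootstrap trial keeps the original time stamps, draws flux values with replacement, and records the highest peak in the resulting amplitude spectrum. The threshold is then read off the empirical distribution of those highest peaks. The function names `amplitude_spectrum` and `bootstrap_threshold`, and all of their parameters, are hypothetical.

```python
import numpy as np

def amplitude_spectrum(times, flux, freqs):
    """Amplitude of a direct Fourier transform at arbitrary frequencies.

    Works for unevenly spaced times; cost scales as len(freqs) * len(times).
    """
    flux = flux - np.mean(flux)  # remove the mean so zero frequency carries no power
    phase = -2j * np.pi * np.outer(freqs, times)
    return 2.0 / len(times) * np.abs(np.exp(phase) @ flux)

def bootstrap_threshold(times, flux, freqs, fap=0.001, n_boot=10000, seed=None):
    """Amplitude exceeded by the highest noise peak in a fraction `fap` of trials."""
    rng = np.random.default_rng(seed)
    max_peaks = np.empty(n_boot)
    for i in range(n_boot):
        # Null hypothesis: resample flux values with replacement,
        # keeping the real time stamps (and hence the real sampling pattern).
        resampled = rng.choice(flux, size=len(flux), replace=True)
        max_peaks[i] = amplitude_spectrum(times, resampled, freqs).max()
    # The threshold is the (1 - FAP) quantile of the highest-peak distribution.
    return np.quantile(max_peaks, 1.0 - fap), max_peaks

# Small synthetic demonstration (made-up data, not the TESS light curve).
rng = np.random.default_rng(42)
times = np.sort(rng.uniform(0.0, 10.0, 200))
flux = rng.normal(0.0, 1.0, 200)
freqs = np.linspace(0.1, 5.0, 100)
threshold, peaks = bootstrap_threshold(times, flux, freqs, fap=0.01, n_boot=500, seed=0)
print(f"bootstrap threshold: {threshold:.3f}")
```

Two cautions about this sketch. The number of trials should be well above 1/FAP, or the quantile is poorly determined. And if the data contain a real signal, resampling scrambles it into extra scatter, which raises the threshold somewhat and makes the test conservative.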