Time Averaging of 1-min Data (floor / ceiling / center)#
分钟级数据时间平均(floor / ceiling / center)#
This notebook loads one monthly archive (§2), runs the same solpos → clear-sky → ``qc_test`` → ``qc_mask`` pipeline as ``2.quality_control.ipynb`` in brief (§3), then shows averaging on the dataset via ``BSRNDataset.average`` (§4). The rest focuses on ``pretty_average`` and direct DataFrame use.
§5 explains how pretty_average differs from pandas.resample (windows, monthly label trim, coverage, match_ceiling_labels); §5.1 uses ``ds.data()`` (QC-masked minute data after §3) for alignments; §5.2 compares ``match_ceiling_labels`` for center only; §5.3 tabulates samples per 1 h interval (floor / ceiling / center). §6 applies 1 h averaging to the QIQ month. §7–8 plot comparisons (plots at the end of the workflow).
本笔记本加载月度存档(§2),在 §3 中简要复现与 ``2.quality_control.ipynb`` 相同的 solpos → clear-sky → ``qc_test`` → ``qc_mask`` 流程,随后在 §4 演示 ``BSRNDataset.average`` 的数据集管线;其后以 ``pretty_average`` 与直接操作 ``DataFrame`` 为主。
§5 说明差异(窗、月界裁剪、覆盖、match_ceiling_labels);§5.1 用 §3 后 ``ds.data()``(经 ``qc_mask`` 的分钟数据)比较对齐;§5.2 仅比较 center 的 ``match_ceiling_labels``;§5.3 统计每个 1h 窗的样本数(floor / ceiling / center)。§6 对 QIQ 全月做 1 小时平均。§7–8 为对比图(图放在工作流末尾)。
1. Import libraries / 导入库#
[1]:
import os
import matplotlib.pyplot as plt
import pandas as pd
import bsrn
from bsrn.utils import pretty_average
# Same sample file pattern as 4.clear_sky_detection.ipynb
INPUT_FILE = "/Volumes/Macintosh Research/Data/bsrn-qc/data/QIQ/qiq0824.dat.gz"
pd.options.display.max_columns = None
2. Load QIQ August data / 加载 QIQ 8 月数据#
[2]:
if os.path.exists(INPUT_FILE):
ds = bsrn.BSRNDataset.from_file(INPUT_FILE)
stn = ds.station_code
print(
f"Data for {stn} loaded successfully ({ds.year}-{ds.month:02d})."
)
display(ds.data().head())
else:
raise FileNotFoundError(
f"File not found: {INPUT_FILE}. Current working directory: {os.getcwd()}"
)
Data for QIQ loaded successfully (2024-08).
| ghi | bni | dhi | lwd | temp | rh | pressure | |
|---|---|---|---|---|---|---|---|
| 2024-08-01 00:00:00+00:00 | 105.0 | 0.0 | 106.0 | 421.0 | 25.2 | 70.0 | 981.0 |
| 2024-08-01 00:01:00+00:00 | 108.0 | 0.0 | 109.0 | 422.0 | 25.2 | 71.0 | 981.0 |
| 2024-08-01 00:02:00+00:00 | 112.0 | 0.0 | 113.0 | 421.0 | 25.2 | 70.0 | 981.0 |
| 2024-08-01 00:03:00+00:00 | 113.0 | 0.0 | 114.0 | 421.0 | 25.2 | 70.0 | 981.0 |
| 2024-08-01 00:04:00+00:00 | 116.0 | 0.0 | 117.0 | 421.0 | 25.2 | 71.0 | 981.0 |
3. Solar geometry, clear-sky (REST2), and QC / 太阳几何、晴空(REST2)与 QC#
We use the same pipeline as ``2.quality_control.ipynb``: BSRNDataset.solpos, BSRNDataset.clear_sky (here model='rest2' with MERRA-2 via Hugging Face), then BSRNDataset.qc_test followed immediately by BSRNDataset.qc_mask (failed irradiance → NaN, standard flag* columns dropped). Read ``2.quality_control`` for details; we do not tabulate flags here. ``ds.data()`` after §3 is the QC-masked minute frame for ``pretty_average`` from §5 onward;
§4 calls ``average`` on ``ds.model_copy(deep=True)`` so ``ds`` stays at minute resolution.
与 ``2.quality_control.ipynb`` 使用相同流程:BSRNDataset.solpos、BSRNDataset.clear_sky(此处 model='rest2',MERRA-2 经 Hugging Face)、再 BSRNDataset.qc_test 并紧接着 BSRNDataset.qc_mask(失败辐照度置 NaN,删除标准 flag* 列)。细节见 ``2.quality_control``;此处不汇总标记列。§3 后的 ``ds.data()`` 为已 QC 掩膜的分钟帧,供 §5 起 ``pretty_average`` 使用;§4 在 ``ds.model_copy(deep=True)`` 的副本上调用
``average``,以免 ``ds`` 被替换为较粗时间索引。
[3]:
# Same pipeline as 2.quality_control.ipynb / 与 2.quality_control 相同流程
ds.solpos()
ds.clear_sky(model="rest2")
ds.qc_test()
ds.qc_mask()
display(ds.data()[["ghi", "ghi_clear", "zenith"]].head())
Fetching MERRA-2 from Hugging Face: qiq/qiq0824_merra2.parquet
| ghi | ghi_clear | zenith | |
|---|---|---|---|
| 2024-08-01 00:00:00+00:00 | 105.0 | 507.439814 | 54.881 |
| 2024-08-01 00:01:00+00:00 | 108.0 | 509.994881 | 54.717 |
| 2024-08-01 00:02:00+00:00 | 112.0 | 512.545982 | 54.553 |
| 2024-08-01 00:03:00+00:00 | 113.0 | 515.077560 | 54.390 |
| 2024-08-01 00:04:00+00:00 | 116.0 | 517.620641 | 54.226 |
4. Pipeline averaging: BSRNDataset.average / 管线平均:BSRNDataset.average#
On a ``BSRNDataset``, call ``average(freq, …)`` on the dataset (not on the bare DataFrame from ``data()``). It delegates to ``pretty_average`` on the cached frame, then replaces that cache with the coarser-index result. §5+ still need minute-level ``ds``, so we ``model_copy(deep=True)`` (Pydantic) and call ``average`` on the copy—same columns as ``ds.data()`` after §3 (including QC-masked irradiance).
在 ``BSRNDataset`` 上调用 ``average(freq, …)``(不要对 ``data()`` 返回的裸 ``DataFrame`` 调用)。内部以 ``pretty_average`` 替换缓存。§5 及以后仍需要分钟级 ``ds``,故先用 ``model_copy(deep=True)``(Pydantic)深拷贝再 ``average``,副本与 §3 后 ``ds.data()`` 列一致(含 QC 掩膜后的辐照度)。
When you are done with minute data, you may call ``average`` on ``ds`` itself instead of a copy.
若不再需要分钟数据,也可直接在 ``ds`` 上调用 ``average``,无需副本。
[4]:
from IPython.display import display
ds_h = ds.model_copy(deep=True)
ds_h.average("1h", alignment="ceiling")
display(ds_h.data()[["ghi", "ghi_clear", "zenith"]].head())
| ghi | ghi_clear | zenith | |
|---|---|---|---|
| 2024-08-01 01:00:00+00:00 | 203.385965 | 581.242271 | 49.990333 |
| 2024-08-01 02:00:00+00:00 | 380.630435 | 706.014627 | 41.060483 |
| 2024-08-01 03:00:00+00:00 | 303.216667 | 791.170072 | 33.993883 |
| 2024-08-01 04:00:00+00:00 | 230.416667 | 826.488629 | 30.319967 |
| 2024-08-01 05:00:00+00:00 | 237.150000 | 817.447416 | 31.301850 |
5. What pretty_average does / pretty_average 的语义#
bsrn.utils.averaging.pretty_average implements explicit labeled windows — it is NOT pandas.DataFrame.resample. For each output label time L (bin length Δ from freq), data rows are selected with a rule that depends on ``alignment``.
pretty_average 使用显式标签窗,不是 resample。对每个输出标签时刻 L(freq 给出窗长 Δ),按 ``alignment`` 选取行。
Windows / 窗定义#
Alignment |
Interval |
Role of |
|---|---|---|
floor |
\([L, L+\Delta)\) |
Label at left edge / 标签在左端 |
ceiling |
\((L-\Delta, L]\) |
Label at right edge / 标签在右端 |
center |
\([L-\Delta/2+\mathrm{res},\, L+\Delta/2]\) |
Label at center; |
Monthly label trim / 月界标签裁剪#
The regular grid from floor(min) … ceil(max) at freq is trimmed so edge bins do not straddle month boundaries (lo / hi from the data index). floor drops labels at the high edge; ceiling drops at the low edge. For center, choose the same trim as ceiling or floor via ``match_ceiling_labels`` (default True).
规则网格会裁剪,避免边界小时跨月。center 与 ceiling 或 floor 共用哪一套裁剪由 ``match_ceiling_labels`` 控制(默认与 ceiling 一致)。
Coverage / 覆盖#
A numeric aggregate is emitted only if enough rows are “valid” (numeric columns: at least one finite value in that row).
floor / ceiling: valid count must be > half of the nominal step count \(\approx \Delta/\mathrm{res}\).
center: valid count must be > half of rows in the window (not the nominal count).
Otherwise the row is still emitted but values are NaN.
Inputs / 输入#
``DatetimeIndex`` is required.:nbsphinx-math:n
Below: §5.1 uses ``ds.data()`` for floor/ceiling/center; §5.2 compares ``match_ceiling_labels`` (center only); §5.3 counts samples per 1 h window. §6 applies pretty_average for the workflow plots.
下方:§5.1 用 ``ds.data()`` 比较 floor/ceiling/center;§5.2 比较 ``match_ceiling_labels``(仅 center);§5.3 统计每 1h 窗样本数;§6 为工作流整月应用。
5.1 Case A — full month, different label times / 整月、不同标签时刻#
All rows in ``ds.data()`` after §3 (one station-month from the archive). freq='1h': floor / ceiling / center differ in where the hour label sits; counts can differ slightly at month edges.
§3 后的 ``ds.data()`` 全表(存档中的单站单月)。freq='1h':floor / ceiling / center 的标签时刻不同;月界附近行数可略有差异。
[5]:
# ds.data() after §3: one station-month (see INPUT_FILE). / §3 后 ds.data() 为单站单月(见 INPUT_FILE)
df_a = ds.data()[["ghi", "zenith"]].copy()
freq = "1h"
f_a = pretty_average(df_a, freq, alignment="floor")
c_a = pretty_average(df_a, freq, alignment="ceiling")
z_a = pretty_average(df_a, freq, alignment="center")
print("Case A — 1 h labels (UTC) on the loaded month:")
print(f" native rows in window: {len(df_a)}")
for name, s in [("floor", f_a), ("ceiling", c_a), ("center", z_a)]:
print(f" {name:7s}: n={len(s):4d} first={s.index[0]} last={s.index[-1]}")
Case A — 1 h labels (UTC) on the loaded month:
native rows in window: 44640
floor : n= 744 first=2024-08-01 00:00:00+00:00 last=2024-08-31 23:00:00+00:00
ceiling: n= 744 first=2024-08-01 01:00:00+00:00 last=2024-09-01 00:00:00+00:00
center : n= 744 first=2024-08-01 01:00:00+00:00 last=2024-09-01 00:00:00+00:00
5.2 Case B — match_ceiling_labels usage / match_ceiling_labels 用法#
For ``alignment=’center’`` only: ``match_ceiling_labels`` chooses whether monthly label trimming follows ceiling (default) or floor. Compare the two ``center`` outputs (row counts and first/last labels).
仅 ``alignment=’center’``:``match_ceiling_labels`` 决定月界裁剪与 ceiling 还是 floor 一致(默认 True)。比较两种 ``center`` 结果(行数与首尾标签)。
[6]:
freq = "1h"
# Same input columns as §5.1; Case B only compares center + match_ceiling_labels / 与 §5.1 同列;本节只比较 center 与 match_ceiling_labels
df_b = ds.data()[["ghi", "zenith"]].copy()
z_hi = pretty_average(df_b, freq, alignment="center", match_ceiling_labels=True)
z_lo = pretty_average(df_b, freq, alignment="center", match_ceiling_labels=False)
print("Case B — center only: match_ceiling_labels True vs False")
print(f" n_row: {len(z_hi)} vs {len(z_lo)}")
print(" first label:", z_hi.index[0], "vs", z_lo.index[0])
print(" last label:", z_hi.index[-1], "vs", z_lo.index[-1])
Case B — center only: match_ceiling_labels True vs False
n_row: 744 vs 744
first label: 2024-08-01 01:00:00+00:00 vs 2024-08-01 00:00:00+00:00
last label: 2024-09-01 00:00:00+00:00 vs 2024-08-31 23:00:00+00:00
5.3 Case C — samples per 1 h interval / 每个 1 小时窗内样本数#
For ``freq=’1h’``, each output row is the mean over a time window; the table counts how many native rows fall in that window (``n_samples``) for floor, ceiling, and center (same logic as pretty_average). Labels and counts differ across alignments. Preview tables use ``display()`` so every row renders in the notebook (plain ``print`` on wide tables can be truncated).
``freq=’1h’`` 下,每个输出标签对应一个时间窗;下表统计窗内原生行数(``n_samples``),分别对应 floor、ceiling、center(与 pretty_average 一致)。不同对齐的标签时刻与行数可不同。预览表用 ``display()`` 逐块输出,避免宽表 ``print`` 在界面中被截断。
[7]:
import bsrn.utils.averaging as av
from IPython.display import display
def hourly_sample_counts(df, freq, alignment, match_ceiling_labels=True):
"""Native rows per 1 h label (same windows as pretty_average). / 与 pretty_average 相同的窗内行数"""
delta = av._period_delta(freq)
res = av._archive_timestep_1_or_3(df.index)
labels = av._label_grid(df.index, freq)
labels = av._trim_labels_for_alignment(
labels, df.index, freq, alignment, match_ceiling_labels=match_ceiling_labels,
)
idx = df.index
rows = []
for L in labels:
m = av._window_mask(idx, L, delta, alignment, res)
rows.append((L, int(m.sum())))
return pd.DataFrame(rows, columns=["label_utc", "n_samples"])
freq = "1h"
dfc = ds.data()[["ghi", "zenith"]].copy()
floor_ct = hourly_sample_counts(dfc, freq, "floor")
ceil_ct = hourly_sample_counts(dfc, freq, "ceiling")
cent_ct = hourly_sample_counts(dfc, freq, "center", match_ceiling_labels=True)
print("Case C — n_samples per 1 h interval (native rows in each window)")
# display() renders full HTML tables (avoids stdout truncation in long cells).
# display() 输出完整 HTML 表,避免长单元格内 stdout 被截断。
print("\nfloor — head 4 / 前 4:")
display(floor_ct.head(4))
print("floor — tail 4 / 後 4:")
display(floor_ct.tail(4))
print("\nceiling — head 4 / 前 4:")
display(ceil_ct.head(4))
print("ceiling — tail 4 / 後 4:")
display(ceil_ct.tail(4))
print("\ncenter (match_ceiling_labels=True) — head 4 / 前 4:")
display(cent_ct.head(4))
print("center — tail 4 / 後 4:")
display(cent_ct.tail(4))
print("\nSummary n_samples (all intervals / 全部区间):")
for name, tab in [("floor", floor_ct), ("ceiling", ceil_ct), ("center", cent_ct)]:
s = tab["n_samples"]
print(f" {name:7s}: count={len(s)}, min={s.min()}, max={s.max()}, mean={s.mean():.1f}")
Case C — n_samples per 1 h interval (native rows in each window)
floor — head 4 / 前 4:
| label_utc | n_samples | |
|---|---|---|
| 0 | 2024-08-01 00:00:00+00:00 | 60 |
| 1 | 2024-08-01 01:00:00+00:00 | 60 |
| 2 | 2024-08-01 02:00:00+00:00 | 60 |
| 3 | 2024-08-01 03:00:00+00:00 | 60 |
floor — tail 4 / 後 4:
| label_utc | n_samples | |
|---|---|---|
| 740 | 2024-08-31 20:00:00+00:00 | 60 |
| 741 | 2024-08-31 21:00:00+00:00 | 60 |
| 742 | 2024-08-31 22:00:00+00:00 | 60 |
| 743 | 2024-08-31 23:00:00+00:00 | 60 |
ceiling — head 4 / 前 4:
| label_utc | n_samples | |
|---|---|---|
| 0 | 2024-08-01 01:00:00+00:00 | 60 |
| 1 | 2024-08-01 02:00:00+00:00 | 60 |
| 2 | 2024-08-01 03:00:00+00:00 | 60 |
| 3 | 2024-08-01 04:00:00+00:00 | 60 |
ceiling — tail 4 / 後 4:
| label_utc | n_samples | |
|---|---|---|
| 740 | 2024-08-31 21:00:00+00:00 | 60 |
| 741 | 2024-08-31 22:00:00+00:00 | 60 |
| 742 | 2024-08-31 23:00:00+00:00 | 60 |
| 743 | 2024-09-01 00:00:00+00:00 | 59 |
center (match_ceiling_labels=True) — head 4 / 前 4:
| label_utc | n_samples | |
|---|---|---|
| 0 | 2024-08-01 01:00:00+00:00 | 60 |
| 1 | 2024-08-01 02:00:00+00:00 | 60 |
| 2 | 2024-08-01 03:00:00+00:00 | 60 |
| 3 | 2024-08-01 04:00:00+00:00 | 60 |
center — tail 4 / 後 4:
| label_utc | n_samples | |
|---|---|---|
| 740 | 2024-08-31 21:00:00+00:00 | 60 |
| 741 | 2024-08-31 22:00:00+00:00 | 60 |
| 742 | 2024-08-31 23:00:00+00:00 | 60 |
| 743 | 2024-09-01 00:00:00+00:00 | 29 |
Summary n_samples (all intervals / 全部区间):
floor : count=744, min=60, max=60, mean=60.0
ceiling: count=744, min=59, max=60, mean=60.0
center : count=744, min=29, max=60, mean=60.0
6. Apply to QIQ (1 h) / 应用到 QIQ(1 小时)#
Same freq and three alignments on the QC-masked August frame (§3). Row counts often differ slightly between alignments.
在 §3 QC 掩膜后的 8 月数据上使用相同 freq 与三种对齐;不同对齐下行数可能略有差异。
[8]:
FREQ = "1h"
# Irradiance + zenith (required by pretty_average)
cols_avg = [c for c in ("ghi", "ghi_clear", "zenith") if c in ds.data().columns]
df_in = ds.data()[cols_avg].copy()
avg_floor = pretty_average(df_in, FREQ, alignment="floor")
avg_ceil = pretty_average(df_in, FREQ, alignment="ceiling")
avg_center = pretty_average(df_in, FREQ, alignment="center")
print("Row counts — floor:", len(avg_floor), "ceiling:", len(avg_ceil), "center:", len(avg_center))
Row counts — floor: 744 ceiling: 744 center: 744
7. Plots: compare GHI means / 图:比较 GHI 平均#
Run §6 first so avg_floor, avg_ceil, and avg_center exist.
先运行 §6,以生成 avg_floor、avg_ceil、avg_center。
Overlay 1-hour mean GHI for the three alignments on a short slice (2024-08-08 UTC).
在一天片段上叠加三种对齐的 1 小时平均 GHI。
[9]:
day = "2024-08-08"
sub = slice(f"{day} 00:00", f"{day} 23:59")
fig, ax = plt.subplots(figsize=(10, 4), dpi=120)
if "ghi" in ds.data().columns:
ax.plot(
ds.data().loc[sub].index,
ds.data().loc[sub, "ghi"],
color="0.85",
linewidth=0.8,
label="GHI (1 min native)",
)
ax.plot(avg_floor.loc[sub].index, avg_floor.loc[sub, "ghi"], marker="o", markersize=3, label="floor")
ax.plot(avg_ceil.loc[sub].index, avg_ceil.loc[sub, "ghi"], marker="s", markersize=3, label="ceiling")
ax.plot(avg_center.loc[sub].index, avg_center.loc[sub, "ghi"], marker="^", markersize=3, label="center")
ax.set_ylabel("GHI [W/m²]")
ax.set_xlabel("Time (UTC)")
ax.set_title(f"{stn} — {day} — 1 h mean GHI vs alignment")
ax.legend(loc="upper left", fontsize=8)
fig.autofmt_xdate()
plt.tight_layout()
plt.show()
8. Optional: compare clear-sky GHI / 可选:比较晴空 GHI#
Same overlay for ``ghi_clear`` (REST2).
对 ``ghi_clear``(REST2)做同样叠加。
[10]:
if "ghi_clear" in avg_floor.columns:
fig, ax = plt.subplots(figsize=(10, 4), dpi=120)
ax.plot(avg_floor.loc[sub].index, avg_floor.loc[sub, "ghi_clear"], marker="o", markersize=3, label="floor")
ax.plot(avg_ceil.loc[sub].index, avg_ceil.loc[sub, "ghi_clear"], marker="s", markersize=3, label="ceiling")
ax.plot(avg_center.loc[sub].index, avg_center.loc[sub, "ghi_clear"], marker="^", markersize=3, label="center")
ax.set_ylabel("GHI clear [W/m²]")
ax.set_xlabel("Time (UTC)")
ax.set_title(f"{stn} — {day} — REST2 ghi_clear (1 h)")
ax.legend(loc="upper left", fontsize=8)
fig.autofmt_xdate()
plt.tight_layout()
plt.show()