Environmental Data Science

Fenyes Spectral Attenuation Pipeline

A reproducible water-quality modeling pipeline for deriving spectral Kd targets and irradiance-aware features from field measurements.

Fenyes is a scientific data-processing project centered on underwater light measurements. The repository assembles raw field spectra into attenuation targets, builds compact ML-ready datasets for the FULL and PAR wavelength ranges, generates reproducible train/test splits, derives irradiance inputs, and exposes an initial FastAPI and React interface for manual Kd, Imean, and Iz calculations.

Overview

The core repository is a reproducible Python pipeline that turns raw depth-resolved measurement files into structured outputs for water-quality modeling. It includes filename auditing, Kd spectrum generation, long-form dataset assembly, feature engineering, split generation, and irradiance derivation.

The processed datasets support two spectral configurations: a FULL range from 380 to 900 nm and a PAR range from 400 to 700 nm. Both are built from the same raw source table and exported as compact modeling datasets.

The repository also contains an initial application layer around the research code: a FastAPI backend that serves health, Kd, and irradiance endpoints, and a React/Vite frontend that provides a manual scientific calculator for Kd, Imean, and Iz workflows.

The project is portfolio-ready as an applied research and scientific software case study, but some presentation details such as the public GitHub URL, demo URL, and whether specific figures are safe for public reuse still need manual review.

Problem

Raw underwater light measurements are difficult to reuse directly for modeling because they are spread across heterogeneous field files, locations, dates, and measurement conventions.

Water-quality modeling workflows need consistent attenuation targets and aligned environmental features rather than ad hoc notebook processing.

Irradiance-aware features require a reproducible way to combine station metadata, clear-sky estimates, spectral reference data, and Beer-Lambert style attenuation logic.

What This Repository Implements

Built a public pipeline entrypoint that runs audit, Kd, dataset, feature, split, and irradiance stages in a fixed reproducible order.
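
A fixed stage order can be enforced with a small runner. The sketch below is illustrative only: the `STAGE_ORDER` identifiers and the `run_pipeline` function are hypothetical names chosen to mirror the stages listed above, not the repository's actual entrypoint.

```python
from typing import Callable, Dict, List, Optional

# Stage order as listed in the text; the identifiers themselves are hypothetical.
STAGE_ORDER = ["audit", "kd", "dataset", "features", "splits", "irradiance"]


def run_pipeline(stages: Dict[str, Callable[[], None]],
                 only: Optional[List[str]] = None) -> List[str]:
    """Run registered stages in STAGE_ORDER and return the names that ran."""
    unknown = set(stages) - set(STAGE_ORDER)
    if unknown:
        raise ValueError(f"Unknown stages: {sorted(unknown)}")
    executed = []
    for name in STAGE_ORDER:
        if name not in stages:
            continue  # stage not registered
        if only is not None and name not in only:
            continue  # stage filtered out by the caller
        stages[name]()
        executed.append(name)
    return executed
```

Because the loop iterates over `STAGE_ORDER` rather than over the caller's dictionary, the execution order stays the same regardless of how the stages were registered or which subset is requested.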

Implemented Kd spectrum processing from raw text measurements, including calibration loading, interpolation to a common wavelength grid, attenuation fitting, smoothing, and per-station CSV and PNG outputs.

Built compact FULL and PAR modeling datasets with log-transformed targets and environmental features derived from chlorophyll, CDOM, and TSS inputs.

Added reproducible split generation for both random-by-sample and holdout-by-location evaluation setups.

Added a first FastAPI backend and React frontend so the scientific logic can also be accessed through manual calculator workflows.

Approach

Normalized raw spectra onto a shared 380 to 900 nm wavelength grid and converted calibrated measurements into photon-based units before attenuation fitting.
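
The grid normalization and photon conversion can be sketched as follows. This is a minimal illustration, not the repository's code: the 1 nm grid step, the function names, and the input units (W m^-2 nm^-1) are assumptions.

```python
import numpy as np

# Physical constants (SI units).
PLANCK = 6.62607015e-34      # J s
LIGHT_SPEED = 2.99792458e8   # m / s
AVOGADRO = 6.02214076e23     # 1 / mol

# Shared grid from 380 to 900 nm; the 1 nm step is an assumption.
GRID_NM = np.arange(380, 901, 1, dtype=float)


def to_common_grid(wavelength_nm, values, grid_nm=GRID_NM):
    """Linearly interpolate one spectrum onto the shared wavelength grid."""
    wavelength_nm = np.asarray(wavelength_nm, dtype=float)
    values = np.asarray(values, dtype=float)
    order = np.argsort(wavelength_nm)  # np.interp needs increasing x
    return np.interp(grid_nm, wavelength_nm[order], values[order])


def energy_to_photon_flux(irradiance_w_m2_nm, wavelength_nm):
    """Convert W m^-2 nm^-1 to umol photons m^-2 s^-1 nm^-1."""
    wavelength_m = np.asarray(wavelength_nm, dtype=float) * 1e-9
    photons_per_joule = wavelength_m / (PLANCK * LIGHT_SPEED)
    return np.asarray(irradiance_w_m2_nm) * photons_per_joule / AVOGADRO * 1e6
```

Note that `np.interp` holds the edge values constant outside the measured range, so a real pipeline would likely mask grid points that the instrument did not cover.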

Estimated spectral Kd from log-linear fits of irradiance against depth, applied quality thresholds, smoothed the resulting spectra with a Savitzky-Golay filter, and saved both spectrum and QC plots per station.
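
A log-linear Kd fit with a quality threshold and Savitzky-Golay smoothing might look like the sketch below. The function names, the use of R² as the quality criterion, the `min_r2` value, and the smoothing window are all assumptions made for illustration; the source does not state which thresholds or filter settings the repository uses.

```python
import numpy as np
from scipy.signal import savgol_filter


def fit_kd_spectrum(depths_m, irradiance, min_r2=0.9):
    """Fit ln(E) = ln(E0) - Kd * z independently for each wavelength.

    depths_m:   shape (n_depths,)
    irradiance: shape (n_depths, n_wavelengths), strictly positive
    Returns (kd, r2); kd is NaN where r2 falls below min_r2.
    """
    z = np.asarray(depths_m, dtype=float)
    log_e = np.log(np.asarray(irradiance, dtype=float))
    # With a 2-D y, polyfit fits every column at once.
    slope, intercept = np.polyfit(z, log_e, deg=1)
    fitted = np.outer(z, slope) + intercept
    ss_res = np.sum((log_e - fitted) ** 2, axis=0)
    ss_tot = np.sum((log_e - log_e.mean(axis=0)) ** 2, axis=0)
    r2 = 1.0 - ss_res / ss_tot
    kd = np.where(r2 >= min_r2, -slope, np.nan)
    return kd, r2


def smooth_kd(kd, window_length=11, polyorder=2):
    """Smooth a Kd spectrum along the wavelength axis."""
    return savgol_filter(kd, window_length=window_length, polyorder=polyorder)
```

Kd is the negative slope of ln(E) against depth. `savgol_filter` propagates NaN values into neighbouring points, so rejected wavelengths would need to be interpolated or masked before smoothing.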

Assembled a long-form raw dataset and transformed it into compact ML tables for FULL and PAR wavelength ranges with one row per sample and one target column per wavelength.
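
The long-to-wide transformation can be sketched with a pandas pivot. The column names (`sample_id`, `wavelength_nm`, `kd`), the `log_kd_` prefix, and the use of the natural logarithm are hypothetical; the source only states that the tables have one row per sample, one target column per wavelength, and log-transformed targets.

```python
import numpy as np
import pandas as pd


def long_to_wide(long_df, wl_min, wl_max):
    """One row per sample, one log-Kd column per wavelength in [wl_min, wl_max]."""
    in_range = long_df[long_df["wavelength_nm"].between(wl_min, wl_max)]
    wide = in_range.pivot(index="sample_id", columns="wavelength_nm", values="kd")
    wide = np.log(wide)  # natural log is an assumption
    wide.columns = [f"log_kd_{int(wl)}" for wl in wide.columns]
    return wide.reset_index()
```

Calling `long_to_wide(df, 380, 900)` and `long_to_wide(df, 400, 700)` on the same long-form table would give the FULL and PAR variants respectively, before any environmental feature columns are joined.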

Generated reproducible train/test exports with a fixed seed and an explicit holdout-by-location strategy to separate location generalization from random sample splits.
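
The two split strategies can be sketched with a seeded NumPy generator. The function names, the seed value, and the 20% test fraction are assumptions; the source does not state which seed or fraction the repository uses.

```python
import numpy as np


def random_split(n_samples, test_fraction=0.2, seed=42):
    """Random-by-sample split; returns sorted (train_idx, test_idx)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return np.sort(order[n_test:]), np.sort(order[:n_test])


def holdout_by_location(locations, n_test_locations, seed=42):
    """Hold out whole locations so no location appears in both sets."""
    locations = np.asarray(locations)
    rng = np.random.default_rng(seed)
    unique = np.unique(locations)
    test_locs = rng.choice(unique, size=n_test_locations, replace=False)
    is_test = np.isin(locations, test_locs)
    return np.flatnonzero(~is_test), np.flatnonzero(is_test)
```

In the random split, samples from one location can land on both sides, so the score mostly reflects interpolation within known sites. The holdout split keeps every test location unseen during training, which is what isolates location generalization.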

Derived irradiance features with `pvlib` clear-sky calculations, spectral reference interpolation, photon conversion, and depth-averaged Imean computation.
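
The attenuation step behind Iz and Imean can be sketched as below, assuming the standard Beer-Lambert form Iz = I0 * exp(-Kd * z) and its average over the layer from the surface to depth z, Imean = I0 * (1 - exp(-Kd * z)) / (Kd * z). The function names are hypothetical, and the surface irradiance `i0` is taken as a given input here rather than computed with `pvlib`.

```python
import numpy as np


def irradiance_at_depth(i0, kd, z):
    """Beer-Lambert attenuation: Iz = I0 * exp(-Kd * z)."""
    return i0 * np.exp(-np.asarray(kd, dtype=float) * np.asarray(z, dtype=float))


def mean_irradiance(i0, kd, z):
    """Average of Iz over [0, z]: I0 * (1 - exp(-Kd * z)) / (Kd * z)."""
    x = np.asarray(kd, dtype=float) * np.asarray(z, dtype=float)
    safe_x = np.where(x == 0, 1.0, x)  # avoid 0/0; the limit at x -> 0 is 1
    ratio = np.where(x == 0, 1.0, -np.expm1(-safe_x) / safe_x)
    return i0 * ratio
```

Using `np.expm1` keeps the ratio accurate when Kd * z is very small, where computing 1 - exp(-x) directly would lose precision.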

Results

The repository produces a complete data workflow from raw field measurements to modeling tables, split artifacts, irradiance derivatives, and reusable model exports.

Processed outputs already exist in the repo, including `data/raw/raw_data.csv`, `data/processed/model_dataset_FULL.csv`, `data/processed/model_dataset_PAR.csv`, split metadata under `artifacts/reports/splits_par/`, and model artifacts under `artifacts/models/legacy_exports/`.

An initial API and frontend layer is implemented for manual scientific calculations, covering Kd prediction plus Imean and Iz workflows.

The application layer is not yet complete: the backend marks batch processing as a stub, and the frontend README explicitly notes that only manual calculations are implemented.

Raw Dataset

129,729 rows

From `data/raw/raw_data.csv`.

FULL Dataset

245 x 527

Rows x columns in `data/processed/model_dataset_FULL.csv`.

PAR Dataset

245 x 307

Rows x columns in `data/processed/model_dataset_PAR.csv`.

Evaluation Setups

2 split strategies

Random-by-sample and holdout-by-location.

Visuals


Artifact Reference

data/raw_measurements/uj/gyékényes20160908/Kd_spectrum_gyékényes20160908.png

Example Kd spectrum plot for a station measurement.

Artifact Reference

data/raw_measurements/uj/gyékényes20160908/Kd_QC_gyékényes20160908.png

Quality-control plot for a Kd spectrum measurement.

Charts & Figures

Saved figures and chart artifacts referenced by the project.

Artifact Reference

artifacts/reports/figures/parameter_jelentossegek.png

Saved project figure showing model parameter importance.

Chart Data

PAR Random Split Sample Counts

Saved in `artifacts/reports/splits_par/random_by_sample/metadata.json`.

Train: 190
Test: 48

PAR Holdout-by-Location Sample Counts

Saved in `artifacts/reports/splits_par/holdout_by_location/metadata.json`.

Train: 207
Test: 31

PAR Holdout-by-Location Coverage

Number of unique locations in the saved split metadata.

Train Locations: 33
Test Locations: 5