Final assignment: Data processing


The final exercise involves converting data from one or more providers.

Since this exercise is designed to prepare you for real project work, the information you need to solve it might be slightly incomplete or not provided in context. Use your best judgment, and most likely your client and your colleagues will be happy ;-)

Parts of this assignment can be solved in several ways. Use descriptive variable names and comments or descriptive text if necessary to clarify. The final solution should be clear to your colleagues and will be shared with some of your fellow students for review.

The data will be used for MIKE modelling and must be converted to Dfs with appropriate EUM types/units in order to be used by the MIKE software.

The data is provided as a zip file and two binary files (in the data folder - see FA.3 below).

Inside the zip file, there are many timeseries (ASCII format) of discharge data from streams located across several regions (*.dat).

Static data for each region is found in a separate file (region_info.csv).

Pandas read_csv is very powerful, but here are a few things to keep in mind:

  • Column separator e.g. comma (,)

  • Blank lines

  • Comments

  • Missing values

  • Date format
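The points above map onto read_csv arguments roughly as in the following minimal sketch. The file content, the `;` separator, the `-999` missing-value marker and the column names are all made up for illustration; inspect the real *.dat files to find the actual conventions.

```python
import io

import pandas as pd

# Hypothetical file content (NOT the real data): the separator, comment
# character, missing-value marker and date format are assumptions.
raw = """# station: example
date;discharge

2005-01-30;1.5
2005-01-31;-999
2005-02-01;2.5
"""

df = pd.read_csv(
    io.StringIO(raw),       # replace with the path to a real file
    sep=";",                # column separator
    comment="#",            # lines starting with '#' are ignored
    skip_blank_lines=True,  # the default, shown here for clarity
    na_values=[-999],       # treat this marker as a missing value (NaN)
    index_col="date",       # use the date column as index
    parse_dates=True,       # parse the index as datetimes
)

print(df)
print(df.isna().sum())
```

If the dates are not in ISO format, options such as `dayfirst=True` may be needed; check the parsed index before trusting it.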

The MIKE engine cannot handle missing values (delete values), so fill in missing values with interpolated values.

In order to save disk space, crop the timeseries to the simulation period, Feb 1 - June 30.
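One possible way to fill gaps and crop a series with pandas is sketched below. The values are invented, and the year 2005 is an assumption (taken from the time axis of the weather data in FA.3), not something stated for the discharge files.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with gaps; the year 2005 is an assumption
time = pd.date_range("2005-01-30", periods=5, freq="D")
q = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0], index=time, name="discharge")

# Linear interpolation in time fills gaps between valid values
q_filled = q.interpolate(method="time")

# Label-based slicing on a DatetimeIndex includes both end points
q_cropped = q_filled.loc["2005-02-01":"2005-06-30"]

print(q_cropped)
print("missing values:", q_cropped.isna().sum())
```

Interpolating before cropping lets gaps near the edges of the simulation period use neighbouring values from outside the period. Note that gaps at the very start of a series are not filled by the default settings of `interpolate`.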

FA.1 Convert all timeseries to Dfs0

import os
import numpy as np
import pandas as pd
import mikeio

from mikeio import Dataset
from mikeio.eum import EUMType, ItemInfo, EUMUnit
# This is one way to find and filter filenames in a directory
# [x for x in os.listdir("datafolder") if "some_str" in x]
# This is useful!
# help(pd.read_csv)
# example of reading csv
# df = pd.read_csv("../data/oceandata.csv", comment='#', index_col=0, sep=',', parse_dates=True)

a) Convert all timeseries to dfs0 (remember that the notebook should be runnable for your peers so put the files somewhere reasonable).

b) Read s15_east_novayork_river.dfs0, print the “header”, plot, and show that the number of missing values is 0.

FA.2 Add region specific info to normalize timeseries with surface area

Each timeseries belongs to a region identified in the filename, e.g. s15_east_novayork_river.dat is located in the novayork region.

a) Convert all timeseries to dfs0 with specific discharge, by doing:

For each timeseries in the dataset:

  • Find out which region it belongs to (hint: the string method split() will be useful)

  • Divide the timeseries values by the surface area for the region (take into account units)

  • Create a dfs0 file with specific discharge (discharge / area) (like the one with discharge from FA.1)
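The first two steps could look something like this sketch. It assumes the region is always the third underscore-separated token of the filename, and it uses an invented region table: the column name `area_km2`, the km² unit, the area value and the m³/s discharge unit are all assumptions to be checked against region_info.csv and the data files.

```python
import pandas as pd

filename = "s15_east_novayork_river.dat"

# Assumes the region is the third underscore-separated token of the filename
region = filename.split("_")[2]

# Hypothetical region table; column names, units and values are assumptions
region_info = pd.DataFrame(
    {"region": ["novayork"], "area_km2": [250.0]}
).set_index("region")

# Hypothetical discharge values, assumed to be in m3/s
discharge = pd.Series([5.0, 10.0], name="discharge")

# Convert the area to m2 so that the result is in m/s (under the assumed units)
area_m2 = region_info.loc[region, "area_km2"] * 1e6
specific_discharge = discharge / area_m2

print(region)
print(specific_discharge)
```

Whichever units you end up with, choose the EUM type and unit of the dfs0 item so that they match the numbers actually written to the file.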

b) Determine which station has the largest max specific discharge (in the simulation period).
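A compact way to compare stations is to collect the maximum of each series in a pandas Series and ask for the label of the largest entry. The station names and values below are placeholders, not real results.

```python
import pandas as pd

# Hypothetical specific-discharge series per station (placeholder values)
series_by_station = {
    "station_a": pd.Series([0.1, 0.4, 0.2]),
    "station_b": pd.Series([0.3, 0.9, 0.5]),
    "station_c": pd.Series([0.7, 0.6, 0.8]),
}

# Maximum per station, then the label of the largest maximum
max_by_station = pd.Series(
    {name: series.max() for name, series in series_by_station.items()}
)
largest_station = max_by_station.idxmax()

print(max_by_station)
print("largest max:", largest_station)
```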

FA.3 Gridded weather forcing data

The dataset is provided in NumPy binary format and consists of

  • Temperature 2m (degree Kelvin)

  • Relative humidity 2m (%)

The spatial grid is: 40°N - 50°N, 10°E-15°E with a grid spacing of 1 degree in each direction.
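As a sanity check, the coordinate axes implied by this description can be built with NumPy and compared with the shape of the loaded arrays. The sketch assumes that grid nodes lie exactly on the stated bounds (both end points included); if the values are cell-centred, the counts would differ.

```python
import numpy as np

dx = 1.0
dy = 1.0

# Adding half a step to the stop value makes the upper bound inclusive
lon = np.arange(10.0, 15.0 + dx / 2, dx)
lat = np.arange(40.0, 50.0 + dy / 2, dy)

nx = len(lon)
ny = len(lat)

print("nx =", nx, "ny =", ny)
```

Comparing `nx` and `ny` with `tmp.shape` shows which array axis corresponds to which direction; mikeio expects 2D gridded data ordered as (time, y, x).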

The time axis consists of two timesteps ‘2005-01-31’, ‘2005-07-31’ which is sufficient to cover the simulation period.

# Note: these files are in the data folder
tmp = np.load("../data/temperature_2m.npy")
rh = np.load("../data/rel_hum_2m.npy")
dy = 1.0
dx = 1.0
time = pd.date_range("2005-01-01", freq='6M', periods=2)
data = [tmp, rh]
# ds = Dataset(data,...
# geometry = mikeio.Grid2D(x,...)

a) Create the Dataset as indicated and write it to a new dfs2 file (check that the file appears as the one shown below when opening in MIKE Zero)

b) Open the file again and print the header and the longitude, latitude (of the origin)

c) Read the data from the file and document that the mean of the relative humidity is the same as the mean of the rh variable above (note: due to rounding errors they may not be exactly the same)
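A comparison with a tolerance is more robust than testing for exact equality. The sketch below mimics the round trip with random stand-in data, under the assumption that the values in the file are stored in single precision; the array shape and the tolerance are illustrative choices.

```python
import numpy as np

# Stand-in for the rh array (random values, NOT the real data)
rng = np.random.default_rng(0)
rh_original = rng.uniform(0.0, 100.0, size=(2, 11, 6))

# Assumption: writing to file and reading back behaves like a cast to float32
rh_from_file = rh_original.astype(np.float32)

mean_original = rh_original.mean()
mean_from_file = rh_from_file.astype(np.float64).mean()

# Compare with a relative tolerance instead of ==
same_mean = np.isclose(mean_original, mean_from_file, rtol=1e-6)

print(mean_original, mean_from_file, same_mean)
```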

The expected outcome when opening the file in MIKE Zero

Submission of solution

Your solution to the above tasks is to be delivered as a single Jupyter notebook file. Please create a new notebook and name it final_assignment_xyz.ipynb, where xyz are your initials. It should be easy to understand and runnable by your peers.

The solution will be reviewed by three of your fellow students, who will provide feedback on both the correctness and clarity of your solution.

The submission and review process is handled by Campus + Eduflow.