This section explores how to select specific subsets of time series data.
11.1 Why subset data?
Selecting a subset of data is useful for:
focusing analysis on data of interest (e.g. specific item or time range)
reducing memory usage and computational overhead (helpful for large files)
generating relevant illustrations (plots and table views)
11.2 Alternative Methods
There are various ways of selecting subsets of dfs0 data. This section covers two different approaches:
Using mikeio.read()
Using Pandas DataFrame
As mentioned, MIKE IO also provides additional functionality for selecting subsets, however this course focuses on Pandas for simplicity.
Memory considerations
Selecting data via the read() method is generally most performant, since it will avoid loading the entire file into memory. Selecting data via Dataset, DataArray, and DataFrame objects requires first loading the entire file into memory.
A dfs0 file is a special case where the entire file is loaded into memory regardless, however that will not be the case for other dfs formats (e.g. dfs2, dfsu). Therefore, it’s a good practice to use the read() method when you know which data you want in advance.
11.3 Selecting Items
When reading data with mikeio.read(), the items argument lets you specify which items to load. You can do this by providing a list of item names.
Using item indices can be convenient, especially for quick explorations. However, specifying item names explicitly makes your code more readable and robust to changes in the dfs0 file structure, such as if items are reordered.
From a Pandas DataFrame, you can select items using standard Pandas column selection techniques.
Python’s slice() method is versatile for defining ranges. While list-like slicing notation (e.g., time_series[start:end]) is common with Pandas DataFrames, slice(start, end) is the explicit way to create a slice object, often used in functions like mikeio.read().
From a Pandas DataFrame, standard indexing and slicing techniques of the DatetimeIndex may be used.
Providing a list to iloc returns another DataFrame, whereas providing a single value returns a Series.
For the first three time steps:
df.iloc[0:3]
ST 2: WL (m)
1993-12-02 00:00:00
-0.2689
1993-12-02 00:30:00
-0.2847
1993-12-02 01:00:00
-0.3020
For selection by timestamp strings, use .loc.
df.loc[["1993-12-02 00:00:00"]]
ST 2: WL (m)
1993-12-02
-0.2689
To select all data for a particular day:
df.loc["1993-12-03"].head()
ST 2: WL (m)
1993-12-03 00:00:00
0.0879
1993-12-03 00:30:00
0.0951
1993-12-03 01:00:00
0.0988
1993-12-03 01:30:00
0.0836
1993-12-03 02:00:00
0.0634
And for a range between start and end timestamps:
df.loc["1993-12-02 12:00":"1993-12-02 16:00"]
ST 2: WL (m)
1993-12-02 12:00:00
-0.4590
1993-12-02 12:30:00
-0.4698
1993-12-02 13:00:00
-0.4812
1993-12-02 13:30:00
-0.4919
1993-12-02 14:00:00
-0.5012
1993-12-02 14:30:00
-0.4798
1993-12-02 15:00:00
-0.4486
1993-12-02 15:30:00
-0.4137
1993-12-02 16:00:00
-0.3772
A key distinction in Pandas is between .iloc and .loc:
.iloc is used for integer-location based indexing (by position, e.g., df.iloc[0] for the first row).
.loc is used for label-based indexing (by index names or boolean arrays, e.g., df.loc['2023-01-01']).
When working with time series data having a DatetimeIndex, .loc is particularly powerful as it allows you to use date/time strings for intuitive selections and slicing, as shown in the examples.
11.5 Example
Let’s tie these concepts together with an example of plotting a subset of a dfs0 file.
1. Read a specific item of the dfs0 file into a Dataset