Pandas#

````{tab-set}
```{tab-item} Pandas
import pandas as pd
```
```{tab-item} Polars
import polars as pl
```
````
  Cell In[1], line 1
    ````{tab-set}
    ^
SyntaxError: invalid syntax
cities = {'name': ["Copenhagen", "London"],
          'population': [1.5, 11.2],
          'dist_to_coast': [0.0, 2.3]}
df = pd.DataFrame(cities)
df
df[df.name=='London']
df.population.mean()

Get row by number#

df.iloc[0]
df.iloc[1]

Get row by name (named index)#

df = df.set_index('name')
df
df.loc["London"]
df.index
df.columns

We can transpose the dataframe, (rows -> columns)

df.T
df.loc["London"].population

Delimited files#

Delimited files, separated by comma, semi-colon, tabs, spaces or any other special character, is a very common data format for tabular data. Comma separated value (csv) files can be read by the pandas read_csv function. It is a very powerful function, with a lot of options. It is very rare, that you have to write your own python function to parse csv files.

Below is an example of csv file:

  • Header with comments

  • Columns are separated with semi-colon (;)

  • Decimal separator is (,)

  • Date and time are in separate columns

  • There is a redundant station column

  • The column names are not in english

df = pd.read_csv("data/boogus.csv", 
                 comment="%",
                 sep=";",
                 decimal=",",
                 parse_dates=[["dato","tid"]],
                 usecols=[1,2,3])
df

Most functions in Pandas returns a copy, so even though the below line, looks like it changes the name, since it is printed to the screen, the df variable is not changed.

df.rename(columns={"temperatur": "air_temperature"})
df # not modfied
df_new_name = df.rename(columns={"temperatur": "air_temperature"})
df_new_name

Inline exercise

  1. Calculate the min and max tempterature by selecting the air_temperature column from the df_new_name dataframe like this: df_new_name.air_temperature and call the .min() and max() methods.

  2. Read the csv file again, but this time include all columns.

  3. Read the csv file once more, but don’t use the comment argument, use another argument to skip the first four rows. Help: pd.read_csv?

Timeseries#

df = pd.read_csv("data/oceandata.csv", parse_dates=True, index_col='datetime')
df.head()
type(df)
type(df.waterlevel)
df.plot()
df.waterlevel.plot()
df.index
df.describe()
df.loc['2015-02-18'].plot()
df.loc['2015-02-18'].interpolate().plot()
df.loc['2015-02-18 14:00':'2015-02-18 15:20']
df_interp = df.interpolate()
df_interp.loc['2015-02-18 14:00':'2015-02-18 15:20']

Resampling#

Aggregate temporal data

df.resample('H')

Resampling requires an aggregation function, e.g., sum, mean, median,…

df.resample('D').sum().head()

The sum function doesn’t make sense in this example. Better to use mean.

df.resample('H').mean().head()
df.resample('H').first().head()
df.resample('H').median().head()
df_h = df.resample('H').interpolate().dropna()
df_h.head()

Note: resample will use either the left or the right end-point depending on the resampling frequency (e.g. for hours the beginning of the hour but for months the end of the month). If you want to make sure you are resampling right - specify the closed argument.

Inline exercise#

Please find the maximum value for every 6 hour period.

# insert your code here

Extrapolation#

rng = pd.date_range("2015-02-17","2015-02-20",freq='H')
ix = pd.DatetimeIndex(rng)
dfr = df_interp.reindex(ix)
dfr.plot()
dfr.ffill().plot()
df_extra = dfr.bfill().ffill()
df_extra.plot()
df_extra
from IPython.display import YouTubeVideo

YouTubeVideo("8upGdZMlkYM")

For more tips and tricks on how to use Pandas for timeseries data see this talk: Ian Ozsvald: A gentle introduction to Pandas timeseries and Seaborn | PyData London 2019