Pandas#
````{tab-set}
```{tab-item} Pandas
import pandas as pd
```
```{tab-item} Polars
import polars as pl
```
````
Cell In[1], line 1
````{tab-set}
^
SyntaxError: invalid syntax
cities = {'name': ["Copenhagen", "London"],
'population': [1.5, 11.2],
'dist_to_coast': [0.0, 2.3]}
df = pd.DataFrame(cities)
df
df[df.name=='London']
df.population.mean()
Get row by number#
df.iloc[0]
df.iloc[1]
Get row by name (named index)#
df = df.set_index('name')
df
df.loc["London"]
df.index
df.columns
We can transpose the dataframe, (rows -> columns)
df.T
df.loc["London"].population
Delimited files#
Delimited files, separated by comma, semi-colon, tabs, spaces or any other special character, is a very common data format for tabular data. Comma separated value (csv) files can be read by the pandas read_csv
function. It is a very powerful function, with a lot of options. It is very rare, that you have to write your own python function to parse csv files.
Below is an example of csv file:
Header with comments
Columns are separated with semi-colon (;)
Decimal separator is (,)
Date and time are in separate columns
There is a redundant station column
The column names are not in english
df = pd.read_csv("data/boogus.csv",
comment="%",
sep=";",
decimal=",",
parse_dates=[["dato","tid"]],
usecols=[1,2,3])
df
Most functions in Pandas returns a copy, so even though the below line, looks like it changes the name, since it is printed to the screen, the df
variable is not changed.
df.rename(columns={"temperatur": "air_temperature"})
df # not modfied
df_new_name = df.rename(columns={"temperatur": "air_temperature"})
df_new_name
Inline exercise
Calculate the min and max tempterature by selecting the
air_temperature
column from thedf_new_name
dataframe like this:df_new_name.air_temperature
and call the .min()
andmax()
methods.Read the csv file again, but this time include all columns.
Read the csv file once more, but don’t use the comment argument, use another argument to skip the first four rows. Help:
pd.read_csv?
Timeseries#
df = pd.read_csv("data/oceandata.csv", parse_dates=True, index_col='datetime')
df.head()
type(df)
type(df.waterlevel)
df.plot()
df.waterlevel.plot()
df.index
df.describe()
df.loc['2015-02-18'].plot()
df.loc['2015-02-18'].interpolate().plot()
df.loc['2015-02-18 14:00':'2015-02-18 15:20']
df_interp = df.interpolate()
df_interp.loc['2015-02-18 14:00':'2015-02-18 15:20']
Resampling#
Aggregate temporal data
df.resample('H')
Resampling requires an aggregation function, e.g., sum, mean, median,…
df.resample('D').sum().head()
The sum
function doesn’t make sense in this example. Better to use mean
.
df.resample('H').mean().head()
df.resample('H').first().head()
df.resample('H').median().head()
df_h = df.resample('H').interpolate().dropna()
df_h.head()
Note: resample will use either the left or the right end-point depending on the resampling frequency (e.g. for hours the beginning of the hour but for months the end of the month). If you want to make sure you are resampling right - specify the closed argument.
Inline exercise#
Please find the maximum value for every 6 hour period.
# insert your code here
Extrapolation#
rng = pd.date_range("2015-02-17","2015-02-20",freq='H')
ix = pd.DatetimeIndex(rng)
dfr = df_interp.reindex(ix)
dfr.plot()
dfr.ffill().plot()
df_extra = dfr.bfill().ffill()
df_extra.plot()
df_extra
from IPython.display import YouTubeVideo
YouTubeVideo("8upGdZMlkYM")
For more tips and tricks on how to use Pandas for timeseries data see this talk: Ian Ozsvald: A gentle introduction to Pandas timeseries and Seaborn | PyData London 2019