# Pandas

In [None]:
import pandas as pd

In [None]:
cities = {'name': ["Copenhagen", "London"],
          'population': [1.5, 11.2],
          'dist_to_coast': [0.0, 2.3]}

In [None]:
df = pd.DataFrame(cities)
df

In [None]:
df[df.name=='London']

In [None]:
df.population.mean()

## Get row by number

In [None]:
df.iloc[0]

In [None]:
df.iloc[1]

## Get row by name (named index)

In [None]:
df = df.set_index('name')
df

In [None]:
df.loc["London"]

In [None]:
df.index

In [None]:
df.columns

We can transpose the dataframe (swap rows and columns):

In [None]:
df.T

In [None]:
df.loc["London"].population
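
As a compact recap of the two lookup styles above, the sketch below rebuilds the same toy data so that it runs on its own. The variable names (`by_position`, `by_label`, `london_population`) are only illustrative.

```python
import pandas as pd

# Same toy data as above, rebuilt so the snippet runs on its own
cities = pd.DataFrame(
    {"name": ["Copenhagen", "London"],
     "population": [1.5, 11.2],
     "dist_to_coast": [0.0, 2.3]}
).set_index("name")

by_position = cities.iloc[1]      # second row, regardless of index labels
by_label = cities.loc["London"]   # row whose index label is "London"

# Both lookups return the same row as a Series
same_row = by_position.equals(by_label)

# A single value: row label first, then column label
london_population = cities.loc["London", "population"]
```

Passing the row and column label together to `.loc` is an alternative to chaining `.loc["London"].population`.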

## Delimited files

Delimited files, with values separated by commas, semicolons, tabs, spaces or any other special character, are a very common format for tabular data. Comma-separated value (csv) files can be read with the [pandas `read_csv` function](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). It is a powerful function with many options, so you rarely need to write your own Python function to parse csv files.

Below is an example of a csv file:

![](images/csv_example.png)


* Header with comments
* Columns are separated with semi-colon (;)
* Decimal separator is (,)
* Date and time are in separate columns
* There is a redundant station column
* The column names are not in English
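
The features listed above can be tried out without the actual file. The sketch below uses a small in-memory sample with invented values (the station name and numbers are assumptions, not taken from `boogus.csv`) and the same `read_csv` arguments for comments, separator and decimal sign. Combining the date and time columns with `pd.to_datetime` after reading is one alternative to passing a nested list to `parse_dates`, a form that recent pandas releases flag as deprecated.

```python
import io
import pandas as pd

# Made-up sample mimicking the layout described above (not the real file)
raw = """% Station export
% Values are invented for illustration
station;dato;tid;temperatur
A1;2021-01-01;12:00;3,5
A1;2021-01-01;13:00;4,25
"""

sample = pd.read_csv(
    io.StringIO(raw),
    comment="%",      # skip the commented header lines
    sep=";",          # semicolon-separated columns
    decimal=",",      # comma as decimal separator
    usecols=["dato", "tid", "temperatur"],  # drop the redundant station column
)

# Combine the separate date and time columns into one timestamp index
sample["datetime"] = pd.to_datetime(sample["dato"] + " " + sample["tid"])
sample = sample.drop(columns=["dato", "tid"]).set_index("datetime")
```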

In [None]:
df = pd.read_csv("data/boogus.csv", 
                 comment="%",
                 sep=";",
                 decimal=",",
                 parse_dates=[["dato","tid"]],
                 usecols=[1,2,3])
df

Most functions in Pandas return a copy. The line below prints a dataframe with the new column name, so it looks like the name has changed, but the `df` variable itself is not modified.

In [None]:
df.rename(columns={"temperatur": "air_temperature"})

In [None]:
df  # not modified

In [None]:
df_new_name = df.rename(columns={"temperatur": "air_temperature"})
df_new_name
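
The copy behaviour can also be checked without looking at the printed output. This is a minimal sketch on an invented one-column dataframe (`temps` is a hypothetical name), comparing the column names before and after assigning the result of `rename`.

```python
import pandas as pd

temps = pd.DataFrame({"temperatur": [3.5, 4.25]})  # invented values

# rename returns a new DataFrame; the result is discarded here
temps.rename(columns={"temperatur": "air_temperature"})
unchanged = list(temps.columns)

# Assigning the result back to the same name keeps the change
temps = temps.rename(columns={"temperatur": "air_temperature"})
changed = list(temps.columns)
```

Whether to reuse the old variable name or introduce a new one, as with `df_new_name` above, is a matter of taste. A new name makes it easier to rerun notebook cells in any order.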

**Inline exercise**

1. Calculate the min and max temperature by selecting the `air_temperature` column from the `df_new_name` dataframe like this: `df_new_name.air_temperature` and calling the `.min()` and `.max()` methods.
2. Read the csv file again, but this time include all columns.
3. Read the csv file once more, but don't use the comment argument, use another argument to skip the first four rows. Help: `pd.read_csv?`

## Timeseries

In [None]:
df = pd.read_csv("data/oceandata.csv", parse_dates=True, index_col='datetime')

In [None]:
df.head()

In [None]:
type(df)

In [None]:
type(df.waterlevel)

In [None]:
df.plot()

In [None]:
df.waterlevel.plot()

In [None]:
df.index

In [None]:
df.describe()

In [None]:
df.loc['2015-02-18'].plot()

In [None]:
df.loc['2015-02-18'].interpolate().plot()

In [None]:
df.loc['2015-02-18 14:00':'2015-02-18 15:20']

In [None]:
df_interp = df.interpolate()

In [None]:
df_interp.loc['2015-02-18 14:00':'2015-02-18 15:20']
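
To see what `interpolate` does to the gaps, it helps to look at a series small enough to check by hand. The sketch below uses invented water levels, not the values from `oceandata.csv`, and also shows that a time slice with `.loc` includes both end points.

```python
import numpy as np
import pandas as pd

# Invented water levels at 20-minute intervals with two missing readings
idx = pd.date_range("2015-02-18 14:00", periods=5, freq="20min")
level = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0], index=idx, name="waterlevel")

# The default method="linear" treats the values as equally spaced
filled = level.interpolate()

# Label-based time slices include both end points
window = filled.loc["2015-02-18 14:20":"2015-02-18 15:00"]
```

Here the two missing readings become 2.0 and 3.0. For irregularly spaced timestamps, `interpolate(method="time")` weights by the actual time difference instead.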

## Resampling

Resampling aggregates time series data into regular time bins, e.g. hourly or daily.

In [None]:
df.resample('H')

Resampling requires an aggregation function, e.g., sum, mean, median,...

In [None]:
df.resample('D').sum().head()

The `sum` function doesn't make sense in this example. Better to use `mean`.

In [None]:
df.resample('H').mean().head()

In [None]:
df.resample('H').first().head()

In [None]:
df.resample('H').median().head()

In [None]:
df_h = df.resample('H').interpolate().dropna()
df_h.head()

Note: resample uses either the left or the right end-point of each bin, depending on the resampling frequency (e.g. the beginning of the hour for hours, but the end of the month for months). To be sure which end-point is used, specify the `closed` argument explicitly.
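
The effect of the `closed` argument is easiest to see on a handful of values. This is a minimal sketch with invented 30-minute readings. The related `label` argument, which decides which bin edge is used as the timestamp of the result, is set explicitly in the second call.

```python
import pandas as pd

# Invented 30-minute readings; note the value exactly at 01:00
idx = pd.date_range("2015-02-18 00:00", periods=4, freq="30min")
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Default for hourly bins: closed="left", so [00:00, 01:00) and [01:00, 02:00)
left = s.resample("h").sum()

# closed="right": bins are (23:00, 00:00], (00:00, 01:00] and (01:00, 02:00]
right = s.resample("h", closed="right", label="right").sum()
```

With the default, the reading at 01:00 is counted in the bin that starts at 01:00. With `closed="right"` it is counted in the bin that ends at 01:00, so the two results differ in both values and length.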

### Inline exercise

Please find the maximum value for every 6-hour period.

In [None]:
# insert your code here

## Extrapolation

In [None]:
# date_range already returns a DatetimeIndex, so no extra conversion is needed
ix = pd.date_range("2015-02-17", "2015-02-20", freq='H')
dfr = df_interp.reindex(ix)
dfr.plot()

In [None]:
dfr.ffill().plot()

In [None]:
df_extra = dfr.bfill().ffill()
df_extra.plot()

In [None]:
df_extra
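
The cells above extend the data to a longer time axis with `reindex` and then fill the new, empty time steps. The sketch below repeats the same steps on a few invented hourly values, so that the result can be checked by hand.

```python
import pandas as pd

# Invented hourly values covering only part of the target period
obs = pd.Series(
    [1.0, 2.0, 3.0],
    index=pd.date_range("2015-02-18 01:00", periods=3, freq="h"),
)

# The target index starts one hour earlier and ends two hours later than the data
target = pd.date_range("2015-02-18 00:00", "2015-02-18 05:00", freq="h")
extended = obs.reindex(target)  # time steps without data become NaN

# bfill fills the leading gap with the first value, ffill the trailing gap with the last
padded = extended.bfill().ffill()
```

Note that this holds the nearest observed value constant outside the data range. It does not extend any trend in the data.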

In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo("8upGdZMlkYM")

For more tips and tricks on how to use Pandas for timeseries data see this talk:
[Ian Ozsvald: A gentle introduction to Pandas timeseries and Seaborn | PyData London 2019](https://www.youtube.com/watch?v=8upGdZMlkYM)