In this note, I use df
for DataFrame
, s
for Series
.
import pandas as pd # import pandas package
import numpy as np
dataquest_aio = '<https://raw.githubusercontent.com/dinhanhthi/dataquest-aio/master/step-2-data-analysis-and-visualization/>'
dataset_url = dataquest_aio + 'course-4-data-cleaning-and-analysis/data/World_Happiness_2015.csv'
df = pd.read_csv(dataset_url) # read the data set
df.head()
Country | Region | Happiness Rank | Happiness Score | Standard Error | |
---|---|---|---|---|---|
0 | Switzerland | Western Europe | 1 | 7.587 | 0.03411 |
1 | Iceland | Western Europe | 2 | 7.561 | 0.04884 |
2 | Denmark | Western Europe | 3 | 7.527 | 0.03328 |
3 | Norway | Western Europe | 4 | 7.522 | 0.03880 |
4 | Canada | North America | 5 | 7.427 | 0.03553 |
groupby()
Group df
by column Region
and then selct the column Western Europe
,
df.groupby('Region').get_group('Western Europe') # returns a df
Country | Region | Happiness Rank | Happiness Score | Standard Error | |
---|---|---|---|---|---|
0 | Switzerland | Western Europe | 1 | 7.587 | 0.03411 |
1 | Iceland | Western Europe | 2 | 7.561 | 0.04884 |
2 | Denmark | Western Europe | 3 | 7.527 | 0.03328 |
3 | Norway | Western Europe | 4 | 7.522 | 0.03880 |
5 | Finland | Western Europe | 6 | 7.406 | 0.03140 |
Select just the Happiness Score
column and then find the mean
,
df.groupby('Region')['Happiness Score'].mean()
# other methods: size, max, min, count
Region
Australia and New Zealand 7.285000
Central and Eastern Europe 5.332931
Eastern Asia 5.626167
Latin America and Caribbean 6.144682
Middle East and Northern Africa 5.406900
North America 7.273000
Southeastern Asia 5.317444
Southern Asia 4.580857
Sub-Saharan Africa 4.202800
Western Europe 6.689619
Name: Happiness Score, dtype: float64
Apply multiple/custom functions,
def max_min(group):
return group.max() - group.min()
df.groupby(['Country', 'Region']).agg([np.mean, np.max, max_min]).head()
Happiness Rank | Happiness Score | ||||||
---|---|---|---|---|---|---|---|
mean | amax | max_min | mean | amax | max_min | ||
Country | Region | ||||||
Afghanistan | Southern Asia | 153 | 153 | 0 | 3.575 | 3.575 | 0.0 |
Albania | Central Europe | 95 | 95 | 0 | 4.959 | 4.959 | 0.0 |
Algeria | Middle Africa | 68 | 68 | 0 | 5.605 | 5.605 | 0.0 |
If you wanna apply different functions on different columns,
df.groupby(['Country', 'Region']).agg({
'Happiness Rank': max_min,
'Happiness Score': ['min', 'max'],
'Standard Error': 'count'
}).head(3)
Happiness Rank | Happiness Score | |||
---|---|---|---|---|
max_min | min | max | ||
Country | Region | |||
Afghanistan | Southern Asia | 0 | 3.575 | 3.575 |
Albania | Central Europe | 0 | 4.959 | 4.959 |
Algeria | Middle Africa | 0 | 5.605 | 5.605 |
Or using apply
and lambda
function,