Data Aggregation

In this note, I use df for DataFrame, s for Series.

Libraries

import pandas as pd # import pandas package
import numpy as np

Dataframe

dataquest_aio = '<https://raw.githubusercontent.com/dinhanhthi/dataquest-aio/master/step-2-data-analysis-and-visualization/>'
dataset_url = dataquest_aio + 'course-4-data-cleaning-and-analysis/data/World_Happiness_2015.csv'
df = pd.read_csv(dataset_url) # read the data set
df.head()

	Country	Region	Happiness Rank	Happiness Score	Standard Error
0	Switzerland	Western Europe	1	7.587	0.03411
1	Iceland	Western Europe	2	7.561	0.04884
2	Denmark	Western Europe	3	7.527	0.03328
3	Norway	Western Europe	4	7.522	0.03880
4	Canada	North America	5	7.427	0.03553

Group dataset using `groupby()`

Group df by column Region and then selct the column Western Europe,

df.groupby('Region').get_group('Western Europe') # returns a df

	Country	Region	Happiness Rank	Happiness Score	Standard Error
0	Switzerland	Western Europe	1	7.587	0.03411
1	Iceland	Western Europe	2	7.561	0.04884
2	Denmark	Western Europe	3	7.527	0.03328
3	Norway	Western Europe	4	7.522	0.03880
5	Finland	Western Europe	6	7.406	0.03140

Select just the Happiness Score column and then find the mean,

df.groupby('Region')['Happiness Score'].mean()
# other methods: size, max, min, count

Region
Australia and New Zealand          7.285000
Central and Eastern Europe         5.332931
Eastern Asia                       5.626167
Latin America and Caribbean        6.144682
Middle East and Northern Africa    5.406900
North America                      7.273000
Southeastern Asia                  5.317444
Southern Asia                      4.580857
Sub-Saharan Africa                 4.202800
Western Europe                     6.689619
Name: Happiness Score, dtype: float64

Apply multiple/custom functions,

def max_min(group):
  return group.max() - group.min()

df.groupby(['Country', 'Region']).agg([np.mean, np.max, max_min]).head()

		Happiness Rank			Happiness Score
		mean	amax	max_min	mean	amax	max_min
Country	Region
Afghanistan	Southern Asia	153	153	0	3.575	3.575	0.0
Albania	Central Europe	95	95	0	4.959	4.959	0.0
Algeria	Middle Africa	68	68	0	5.605	5.605	0.0

If you wanna apply different functions on different columns,

df.groupby(['Country', 'Region']).agg({
    'Happiness Rank': max_min,
    'Happiness Score': ['min', 'max'],
    'Standard Error': 'count'
}).head(3)

		Happiness Rank	Happiness Score
		max_min	min	max
Country	Region
Afghanistan	Southern Asia	0	3.575	3.575
Albania	Central Europe	0	4.959	4.959
Algeria	Middle Africa	0	5.605	5.605

Or using apply and lambda function,

Libraries

Dataframe

Group dataset using groupby()

Group dataset using `groupby()`