Open In Colab


I will be going through this dataset, found on Kaggle and modified to cover only the years 2015 to 2017.

I will clean the data, go into detail on correlation and the R and R² values, and look at regression plots for all the columns.


This dataset is obtained from Kaggle. I have modified the data to contain only data from 2015 to 2017. This report ranks 155 countries by their happiness level through 6 indicators:

  • economic production
  • social support
  • life expectancy
  • freedom
  • absence of corruption
  • generosity

The last indicator is the dystopia residual. The dystopia residual is "the Dystopia Happiness Score (1.85) + the Residual value, or the unexplained value, for each country". Dystopia is an imaginary country with the world's least happy people: high in corruption and low in average income, employment, etc. The dystopia residual is used as a benchmark and should be read side by side with the happiness score.

  • Low dystopia residual = low level of happiness
  • High dystopia residual = high level of happiness

Understanding the column data

  • Country
  • Happiness rank
  • Happiness score

This is obtained from a sample of the population: survey-takers asked respondents to rate their happiness on a scale from 1 to 10.

  • Economy (GDP per capita)
    • Extent to which GDP contributes to the happiness score
  • Family
    • Extent to which family contributes to the happiness score
  • Health
    • Extent to which health (life expectancy) contributes to the happiness score
  • Freedom
    • Extent to which freedom contributes to happiness. Freedom here covers freedom of speech, freedom to pursue what we want, etc.
  • Trust (Government corruption)
    • Extent to which trust, in terms of government corruption, contributes to the happiness score
  • Generosity
    • Extent to which generosity contributes to the happiness score
  • Dystopia residual
  • Year

Do note:

HappinessScore = Economic(GDPpercap) + Family + Health + Freedom + Trust + Generosity + Dystopia Residual
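As a quick arithmetic check of this identity, summing the seven 2015 component values for Switzerland (shown in the head() output further below) reproduces its reported happiness score:

```python
# Switzerland 2015: Economy, Family, Health, Freedom, Trust, Generosity, Dystopia Residual
components = [1.39651, 1.34951, 0.94143, 0.66557, 0.41978, 0.29678, 2.51738]
print(round(sum(components), 3))  # 7.587
```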

Let's get started!

In [0]:
# Importing packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
In [4]:
# Importing dataset into colab
from google.colab import files 
uploaded = files.upload() #import 2015, 2016, 2017 files
Saving World_Happiness_2017.csv to World_Happiness_2017.csv
In [156]:
#reading in all three files
raw_2015= pd.read_csv('World_Happiness_2015.csv')
raw_2016= pd.read_csv('World_Happiness_2016.csv')
raw_2017= pd.read_csv('World_Happiness_2017.csv')

#lets check each head
raw_2015.head(2) #2015 head
Country Region Happiness Rank Happiness Score Standard Error Economy (GDP per Capita) Family Health (Life Expectancy) Freedom Trust (Government Corruption) Generosity Dystopia Residual
0 Switzerland Western Europe 1 7.587 0.03411 1.39651 1.34951 0.94143 0.66557 0.41978 0.29678 2.51738
1 Iceland Western Europe 2 7.561 0.04884 1.30232 1.40223 0.94784 0.62877 0.14145 0.43630 2.70201
In [157]:
raw_2016.head(2) #2016 head
Country Region Happiness Rank Happiness Score Lower Confidence Interval Upper Confidence Interval Economy (GDP per Capita) Family Health (Life Expectancy) Freedom Trust (Government Corruption) Generosity Dystopia Residual
0 Denmark Western Europe 1 7.526 7.460 7.592 1.44178 1.16374 0.79504 0.57941 0.44453 0.36171 2.73939
1 Switzerland Western Europe 2 7.509 7.428 7.590 1.52733 1.14524 0.86303 0.58557 0.41203 0.28083 2.69463
In [158]:
raw_2017.head(2) #2017 head
Country Happiness.Rank Happiness.Score Whisker.high Whisker.low Economy..GDP.per.Capita. Family Health..Life.Expectancy. Freedom Generosity Trust..Government.Corruption. Dystopia.Residual
0 Norway 1 7.537 7.594445 7.479556 1.616463 1.533524 0.796667 0.635423 0.362012 0.315964 2.277027
1 Denmark 2 7.522 7.581728 7.462272 1.482383 1.551122 0.792566 0.626007 0.355280 0.400770 2.313707

Things we need to fix with the data

  • 2017 has different header names (dots instead of spaces, e.g. Happiness.Rank).
  • 2015 and 2016 have a Region column attached, giving each one extra column.

There will be other things to check, but I want to fix these first.
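One way to pin down exactly how the 2015 and 2016 headers differ is a set difference over the column names (typed out here from the head() outputs above):

```python
# Column names as printed in the head() outputs above
cols_2015 = {'Country', 'Region', 'Happiness Rank', 'Happiness Score',
             'Standard Error', 'Economy (GDP per Capita)', 'Family',
             'Health (Life Expectancy)', 'Freedom',
             'Trust (Government Corruption)', 'Generosity', 'Dystopia Residual'}
cols_2016 = {'Country', 'Region', 'Happiness Rank', 'Happiness Score',
             'Lower Confidence Interval', 'Upper Confidence Interval',
             'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)',
             'Freedom', 'Trust (Government Corruption)', 'Generosity',
             'Dystopia Residual'}
print(cols_2015 - cols_2016)  # columns only 2015 has
print(cols_2016 - cols_2015)  # columns only 2016 has
```

Only the error/interval columns differ, which is why selecting one shared column list below works for both years.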

Fixing 2015 and 2016 columns

In [159]:
correct_columns = ['Country', 'Happiness Rank', 'Happiness Score', 'Economy (GDP per Capita)', 'Family', 'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)', 'Generosity', 'Dystopia Residual']
raw_2015 = raw_2015[correct_columns]
raw_2016 = raw_2016[correct_columns]
if raw_2015.columns.equals(raw_2016.columns): #checks names and order, unlike comparing .all() results
  print("columns are the same")
columns are the same

Fixing 2017 data

The trust and generosity columns were swapped in our data, so we can go ahead and swap them back to match our 2015 and 2016 set.

In [160]:
#we need to change the the 2017 data to also look like the 2015 and 2016
temp_col = ['Country', 'Happiness.Rank', 'Happiness.Score','Economy..GDP.per.Capita.', 'Family', 'Health..Life.Expectancy.', 'Freedom', 'Trust..Government.Corruption.', 'Generosity', 'Dystopia.Residual']
raw_2017 = raw_2017[temp_col] #select and reorder columns to match 2015/2016, dropping the two whisker columns

raw_2017.columns = raw_2015.columns
if raw_2015.columns.equals(raw_2017.columns): #checks names and order, unlike comparing .all() results
  print("columns are the same")
columns are the same

Awesome, now all of our years match. Let's load them all into one dataset. First, we should add a Year column so we can use it later on.

Adding Year columns to data

In [0]:
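The contents of this cell were lost in the export; a minimal sketch of the missing step, using one-row stand-in frames in place of the real raw_2015/raw_2016/raw_2017 read in earlier, might look like:

```python
import pandas as pd

# One-row stand-ins; the notebook's real frames were loaded from the CSVs above
raw_2015 = pd.DataFrame({'Country': ['Switzerland']})
raw_2016 = pd.DataFrame({'Country': ['Denmark']})
raw_2017 = pd.DataFrame({'Country': ['Norway']})

# Tag every row of each frame with its source year
for df, year in ((raw_2015, 2015), (raw_2016, 2016), (raw_2017, 2017)):
    df['Year'] = year

print(raw_2017)
```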

Joining data into one dataset

In [163]:
join = [raw_2015, raw_2016, raw_2017]
data = pd.concat(join, ignore_index=True) #ignore_index gives one continuous row index across the three years
# Looking at total samples
data.count() #everything looks right
Country                          470
Happiness Rank                   470
Happiness Score                  470
Economy (GDP per Capita)         470
Family                           470
Health (Life Expectancy)         470
Freedom                          470
Trust (Government Corruption)    470
Generosity                       470
Dystopia Residual                470
Year                             470
dtype: int64
In [164]:
# Checking how many rows and columns we have
data.shape
(470, 11)
In [165]:
# Checking the data types of our data
data.dtypes
Country                           object
Happiness Rank                     int64
Happiness Score                  float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
Year                               int64
dtype: object
In [166]:
# Checking data for null values
data.isnull().sum()
Country                          0
Happiness Rank                   0
Happiness Score                  0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
Year                             0
dtype: int64
In [167]:
factors = ['Happiness Score','Economy (GDP per Capita)','Family','Health (Life Expectancy)','Freedom','Trust (Government Corruption)','Generosity','Dystopia Residual']
data_w_cols = data[factors]
data_w_cols[data_w_cols <= 0].count()
#data_w_cols[data_w_cols < 0].count() #uncomment to check whether any values are strictly negative
Happiness Score                  0
Economy (GDP per Capita)         3
Family                           3
Health (Life Expectancy)         3
Freedom                          3
Trust (Government Corruption)    3
Generosity                       3
Dystopia Residual                0
dtype: int64

If you uncomment the second count, you can see that none of these values are negative; they are all exactly 0.

Six of the columns each have 3 values set to 0. Out of 470 samples this won't throw off our results too much, but let me make sure the zeros don't all come from the same rows.

In [168]:
# Counting how many rows contain at least one zero value
(data_w_cols <= 0).any(axis=1).sum()

The 18 zero values are spread across multiple rows rather than concentrated in a single one, so they shouldn't distort our data too much. We are unsure whether these values were too low to register or simply not entered correctly. Alternatively, we could replace them with the mean of the column.

In [169]:
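This empty cell was presumably reserved for that optional imputation step. A minimal sketch of mean replacement, using a hypothetical one-column stand-in for `data_w_cols`:

```python
import numpy as np
import pandas as pd

# Hypothetical one-column stand-in for data_w_cols
data_w_cols = pd.DataFrame({'Generosity': [0.0, 0.2, 0.4]})

# Treat exact zeros as missing, then fill each with the mean of the remaining values
cleaned = data_w_cols.replace(0, np.nan)
cleaned = cleaned.fillna(cleaned.mean())
print(cleaned)
```

Note this inflates the column means slightly less than leaving the zeros in would deflate them, which is why it is only worth doing if we believe the zeros are data-entry gaps rather than true values.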

Let's take a look at some scatter plots

In [170]:
#pulling up a quick graph of plots for our data
g = sns.pairplot(data)
g.fig.suptitle('FacetGrid plot', fontsize = 20)
g.fig.subplots_adjust(top= 0.9)
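As a first step toward the correlation and R values promised in the intro, `DataFrame.corr` gives the pairwise Pearson r behind these scatter plots; a sketch with a tiny two-column stand-in for `data`:

```python
import pandas as pd

# Tiny stand-in; the notebook's `data` holds the real factor columns
data = pd.DataFrame({
    'Happiness Score': [7.5, 6.0, 4.2],
    'Freedom': [0.66, 0.50, 0.30],
})

# Pairwise Pearson correlation between the numeric columns
corr = data.corr(numeric_only=True)
print(corr.round(2))
```

Squaring any off-diagonal entry gives the R² for a simple regression of one column on the other.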