Introduction and Recap
In last part of this tutorial series we discussed and learnt about the Pandas IO. We learnt how to read/write dataframe from/to many external data sources like SQL Databases, HTML, JSON, CSV files etc. In this part of series we will learn more about the most used data-structure used in Pandas i.e DataFrame.
Setup
I am using a dataset provided by OGD platform. Dataset is State-wise educational status of suicide victims during 2014. You can get the CSV file for dataset here.
import pandas as pd import numpy as np import matplotlib.pyplot as plt df = pd.read_csv('suicide-education-2014.csv', encoding='ISO-8859-1')
Basic DataFrame Operations
DataFrame Attributes
# getting current SHAPE of DataFrame. shape refers to (ROWS, COLUMNS) >>> df.shape (390, 7) # getting dataframe's INDEX >>> df.index RangeIndex(start=0, stop=390, step=1) #retrieving COLUMNS of dataframe >>> df.columns Index(['States/UTs', 'Year', 'CrimeHead', 'Male', 'Female', 'Transgender', 'Total'], dtype='object') # to get the actual data in our data-structure as ndarrays we can use VALUES property >>> df.values array([['Andhra Pradesh', 2014, 'No Education', ..., 470, 0, 1397], ['Andhra Pradesh', 2014, 'Primary (upto class-5)', ..., 385, 0, 1128], ['Andhra Pradesh', 2014, 'Middle (upto class-8)', ..., 343, 0, 973], ..., ['Total (All India)', 2014, 'Professionals (MBA; etc.)', ..., 94, 0, 383], ['Total (All India)', 2014, 'Status not known', ..., 4438, 9, 14243], ['Total (All India)', 2014, 'Total', ..., 42521, 16, 131666]], dtype=object) # getting the TRANSPOSE of dataframe >>> df.T # checking if dataframe is EMPTY >>> df.empty False # getting DATATYPES of all the columns of dataframe >>> df.dtypes States/UTs object Year int64 CrimeHead object Male int64 Female int64 Transgender int64 Total int64 dtype: object # getting the DIMENSION of dataframe >>> df.ndim 2 # getting the SIZE of dataframe >>> df.size 2730
Printing Dataframe
#printing FIRST n ROWS of dataframe. By default prints FIRST 5 ROWS. >>> df.head(3) States/UTs Year CrimeHead Male Female Transgender Total 0 Andhra Pradesh 2014 No Education 927 470 0 1397 1 Andhra Pradesh 2014 Primary (upto class-5) 743 385 0 1128 2 Andhra Pradesh 2014 Middle (upto class-8) 630 343 0 973 #printing LAST n ROWS of dataframe. By default prints LAST 5 ROWS. >>> df.tail(3) States/UTs Year CrimeHead Male Female Transgender Total 387 Total (All India) 2014 Professionals (MBA; etc.) 289 94 0 383 388 Total (All India) 2014 Status not known 9796 4438 9 14243 389 Total (All India) 2014 Total 89129 42521 16 131666 # printing dataframe information >>> df.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 390 entries, 0 to 389 Data columns (total 7 columns): States/UTs 390 non-null object Year 390 non-null int64 CrimeHead 390 non-null object Male 390 non-null int64 Female 390 non-null int64 Transgender 390 non-null int64 Total 390 non-null int64 dtypes: int64(5), object(2) memory usage: 21.4+ KB
DataFrame Meta Operations
# changing the INDEX COLUMN of dataframe >>> df.set_index('COLUMN NAME') # assigning new COLUMNS list for new columns names >>> df.columns = ['states', 'year', 'education', 'male', 'female', 'transgender', 'total'] >>> df.columns Index(['states', 'year', 'education', 'male', 'female', 'transgender', 'total'], dtype='object')
Data Access and Retrieval Operations
Retrieving Columns
# getting SINGLE column using column name as ATTRIBUTE >>> df.states.head(2) 0 Andhra Pradesh 1 Andhra Pradesh Name: states, dtype: object # getting SINGLE column using dictionary like INDEXING >>> df['states'].head(2) 0 Andhra Pradesh 1 Andhra Pradesh Name: states, dtype: object # getting MULTIPLE columns >>> df[['states','education']].head(3) states education 0 Andhra Pradesh No Education 1 Andhra Pradesh Primary (upto class-5) 2 Andhra Pradesh Middle (upto class-8)
Retrieving Rows
# retrieving rows using SLICING >>> df[4:7] states year education male female transgender total 4 Andhra Pradesh 2014 Higher Secondary/ Intermediate/ Pre-University... 395 150 0 545 5 Andhra Pradesh 2014 Diploma/ITI/Certificate 77 16 0 93 6 Andhra Pradesh 2014 Graduate and above 174 39 0 213 # retrieving using DATAFRAME INDEX (loc is LABEL based) >>> df.loc[[4,7]] states year education male female transgender total 4 Andhra Pradesh 2014 Higher Secondary/ Intermediate/ Pre-University... 395 150 0 545 7 Andhra Pradesh 2014 Professionals (MBA; etc.) 16 4 0 20 # retrieving using dataframe ROW NUMBER (iloc is INTEGER based) >>> df.iloc[[4,7]] states year education male female transgender total 4 Andhra Pradesh 2014 Higher Secondary/ Intermediate/ Pre-University... 395 150 0 545 7 Andhra Pradesh 2014 Professionals (MBA; etc.) 16 4 0 20 # using SLICING with loc/iloc >>> df.loc[4:7] states year education male female transgender total 4 Andhra Pradesh 2014 Higher Secondary/ Intermediate/ Pre-University... 395 150 0 545 5 Andhra Pradesh 2014 Diploma/ITI/Certificate 77 16 0 93 6 Andhra Pradesh 2014 Graduate and above 174 39 0 213 7 Andhra Pradesh 2014 Professionals (MBA; etc.) 16 4 0 20
Retrieving Both Rows and Columns
We can use loc and iloc to retrieve both rows and columns also. loc and iloc takes second value as the list of values as columns.
# using loc >>> df.loc[[4,5,6,7],['male','education']] male education 4 395 Higher Secondary/ Intermediate/ Pre-University... 5 77 Diploma/ITI/Certificate 6 174 Graduate and above 7 16 Professionals (MBA; etc.) #using iloc >>> df.iloc[[4,5,6,7],[2,3]] education male 4 Higher Secondary/ Intermediate/ Pre-University... 395 5 Diploma/ITI/Certificate 77 6 Graduate and above 174 7 Professionals (MBA; etc.) 16
More about loc/iloc
Note that both loc/iloc takes two values first one is row(s) and second one is column(s). It can accept various different type of values, and their combinations.
- Single Scalar Value
- Python List
- Range
#SCALAR VALUE - row 5 and male column >>> df.loc[5, 'male'] 77 #PYTHON LIST >>> df.loc[[4,5,6,7],['education', 'male']] education male 4 Higher Secondary/ Intermediate/ Pre-University... 395 5 Diploma/ITI/Certificate 77 6 Graduate and above 174 7 Professionals (MBA; etc.) 16 #RANGE >>> df.loc[5:8, 'education':'female'] education male female 5 Diploma/ITI/Certificate 77 16 6 Graduate and above 174 39 7 Professionals (MBA; etc.) 16 4 8 Status not known 601 167 # retrieving COLUMNS ONLY >>> df.loc[:, ['education','male','female']].head(2) education male female 0 No Education 927 470 1 Primary (upto class-5) 743 385
To learn more about loc/iloc, indexing or selection in general or Advanced Indexing read the 5th part of this series where we will discuss advanced indexing techniques.
Filtering
We can filter rows using various different comparison operators and expressions. This filtering and selection capability makes dataframe access makes very convenient and powerful.
Firstly let’s retrieve all the states that have Transgender Suicide Rate > 0
>>> df[df.transgender > 0].tail(3)
states year education male female transgender total
384 Total (All India) 2014 Higher Secondary/ Intermediate/ Pre-University... 10079 4348 1 14428
388 Total (All India) 2014 Status not known 9796 4438 9 14243
389 Total (All India) 2014 Total 89129 42521 16 131666
As we can see that we have lot of rows included which are just rows with Total Values. So we will filter these rows out.
# states with transgender suicide rate > 0 with no Total rows >>> df[df.transgender > 0][df.states.str.contains('Total') == False] states year education male female transgender total 8 Andhra Pradesh 2014 Status not known 601 167 1 769 9 Andhra Pradesh 2014 Total 4220 1880 1 6101 112 Karnataka 2014 Middle (upto class-8) 1230 549 1 1780 118 Karnataka 2014 Status not known 889 282 1 1172 119 Karnataka 2014 Total 7684 3259 2 10945 140 Maharashtra 2014 No Education 1064 552 3 1619 149 Maharashtra 2014 Total 11828 4476 3 16307 232 Tamil Nadu 2014 Middle (upto class-8) 2551 1304 1 3856 233 Tamil Nadu 2014 Matriculate/Secondary (upto class-10) 2325 1027 1 3353 234 Tamil Nadu 2014 Higher Secondary/ Intermediate/ Pre-University... 1348 609 1 1958 238 Tamil Nadu 2014 Status not known 541 195 1 737 239 Tamil Nadu 2014 Total 10963 5155 4 16122 268 Uttar Pradesh 2014 Status not known 415 260 6 681 269 Uttar Pradesh 2014 Total 2099 1485 6 3590
Now since we only need unique state names whose suicide rate is greater than 0, therefore we will only retrieve states column and get that column’s unique values.
>>> df[df.transgender > 0][df.states.str.contains('Total') == False]['states'].unique()
array(['Andhra Pradesh', 'Karnataka', 'Maharashtra', 'Tamil Nadu',
'Uttar Pradesh'], dtype=object)
In similar fashion we can do all kind of comparison operations. Pandas will perform these vector operations and filter our the rows with False values.
Iteration
We will see two methods to iterate over dataframe rows which are iterrows() and itertuples()
Using iterrows()
# printing whole rows >>> for index,row in df.head(2).iterrows(): ... print(row) ... print() ... states Andhra Pradesh year 2014 education No Education male 927 female 470 transgender 0 total 1397 Name: 0, dtype: object states Andhra Pradesh year 2014 education Primary (upto class-5) male 743 female 385 transgender 0 total 1128 Name: 1, dtype: object # retrieving columns values >>> for index,row in df.head(2).iterrows(): ... print(row.education) ... No Education Primary (upto class-5)
Using itertuples()
Iterating over tuple we can access column values by using column name as property.
# printing whole row >>> for row in df.head().itertuples(): .... print(row) .... Pandas(Index=0, states='Andhra Pradesh', year=2014, education='No Education', male=927, female=470, transgender=0, total=1397) Pandas(Index=1, states='Andhra Pradesh', year=2014, education='Primary (upto class-5)', male=743, female=385, transgender=0, total=1128) Pandas(Index=2, states='Andhra Pradesh', year=2014, education='Middle (upto class-8)', male=630, female=343, transgender=0, total=973) Pandas(Index=3, states='Andhra Pradesh', year=2014, education='Matriculate/Secondary (upto class-10)', male=657, female=306, transgender=0, total=963) Pandas(Index=4, states='Andhra Pradesh', year=2014, education='Higher Secondary/ Intermediate/ Pre-University (upto class-12)', male=395, female=150, transgender=0, total=545) # retrieving COLUMN VALUES from row >>> for row in df.head().itertuples(): ... print(row.education) ... No Education Primary (upto class-5) Middle (upto class-8) Matriculate/Secondary (upto class-10) Higher Secondary/ Intermediate/ Pre-University (upto class-12)
Data Munging Operations
Handling Missing Values
We can handle missing data either by dropping rows with missing data or filling missing values using some strategy like mean, interpolation etc to guess missing values.
Dropping NA
We can chose to drop all rows that either contains all missing data or some missing data.
#dropping rows with ALL missing values >>> df.dropna(how='all') #dropping rows with ANY missing values >>> df.dropna(how='any') #DROPPING COLUMNS with missing values >>> df.drop(axis=1)
NOTE that dropna method will return new dataframe with dropped values. To do inplace drop, use keyword argument inplace
# INPLACE drop
>>> df.dropna(how='any', inplace=True)
Filling NA
#filling using a SINGLE SCALAR value >>> df.fillna(5)
There are different methods/ways using which we can fill the missing data, and which are passed as string to method argument:
- backfill/bfill – Fills the NaN values in backward direction
- pad/ffill – Fills the NaN values in forward direction i.e gets its value from previous value.
Just like dropna, fillna also accepts axis and inplace as optional arguments.
Filling NA using Interpolation
# filling values using LINEAR INTERPOLATION >>> df.interpolate()
interpolate() methods supports a lot of interpolation methods. Few of these different supported interpolation methods are:
- linear
- time
- index
- values
- nearest
- zero
- and many more…
# interpolation using actual index values >>> df.interpolate(method='index')
Just like dropna and fillna interpolate also accepts inplace and axis keyword argument as well.
Both interpolate and fillna accepts a keyword argument limit which takes a number which tells us what is upper limit for no of consecutive NaNs to allow interpolation or filling values.
# interpolation only those values whose consecutive NaN values is not greater than 5 >>> df.interpolate(limit=5)
Replacing Values
replace method can accept both scalar value and list as its first and second argument
>>> df.year.replace(2014, 14, inplace=True)
states year education male female transgender total
0 Andhra Pradesh 14 No Education 927 470 0 1397
1 Andhra Pradesh 14 Primary (upto class-5) 743 385 0 1128
2 Andhra Pradesh 14 Middle (upto class-8) 630 343 0 973
3 Andhra Pradesh 14 Matriculate/Secondary (upto class-10) 657 306 0 963
4 Andhra Pradesh 14 Higher Secondary/ Intermediate/ Pre-University... 395 150 0 545
Shifting Values
We can shift values by some number which will cycle the values. This operations often becomes useful in time series data.
>>> df.states.shift(-1)
Conclusion
In this post I discussed about Pandas DataFrame in little more depth. We learnt about various basic dataframe operations, various ways to retrieve data from dataframe, In next part of this series we will continue to discuss about dataframe and various dataframe manipulation and transformation operations. We also learnt different data munging operations. . In next part of this tutorial series we will look at various DataFrame manipulation operations with few statistical operations.