Pandas Tutorial: Part 3 – DataFrames In-Depth

Introduction and Recap

In last part of this tutorial series we discussed and learnt about the Pandas IO. We learnt how to read/write dataframe from/to many external data sources like SQL Databases, HTML, JSON, CSV files etc. In this part of series we will learn more about the most used data-structure used in Pandas i.e DataFrame.

Setup

I am using a dataset provided by OGD platform. Dataset is State-wise educational status of suicide victims during 2014. You can get the CSV file for dataset here.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('suicide-education-2014.csv', encoding='ISO-8859-1')

Basic DataFrame Operations

DataFrame Attributes

# getting current SHAPE of DataFrame. shape refers to (ROWS, COLUMNS)
>>> df.shape
(390, 7)

# getting dataframe's INDEX
>>> df.index
RangeIndex(start=0, stop=390, step=1)

#retrieving COLUMNS of dataframe
>>> df.columns
Index(['States/UTs', 'Year', 'CrimeHead', 'Male', 'Female', 'Transgender',
       'Total'],
      dtype='object')

# to get the actual data in our data-structure as ndarrays we can use VALUES property
>>> df.values
array([['Andhra Pradesh', 2014, 'No Education', ..., 470, 0, 1397],
       ['Andhra Pradesh', 2014, 'Primary (upto class-5)', ..., 385, 0, 1128],
       ['Andhra Pradesh', 2014, 'Middle (upto class-8)', ..., 343, 0, 973],
       ...,
       ['Total (All India)', 2014, 'Professionals (MBA; etc.)', ..., 94, 0,
        383],
       ['Total (All India)', 2014, 'Status not known', ..., 4438, 9, 14243],
       ['Total (All India)', 2014, 'Total', ..., 42521, 16, 131666]], dtype=object)

# getting the TRANSPOSE of dataframe
>>> df.T

# checking if dataframe is EMPTY
>>> df.empty
False

# getting DATATYPES of all the columns of dataframe
>>> df.dtypes
States/UTs     object
Year            int64
CrimeHead      object
Male            int64
Female          int64
Transgender     int64
Total           int64
dtype: object

# getting the DIMENSION of dataframe
>>> df.ndim
2

# getting the SIZE of dataframe
>>> df.size
2730

Printing Dataframe

#printing FIRST n ROWS of dataframe. By default prints FIRST 5 ROWS.
>>> df.head(3)
       States/UTs  Year                                          CrimeHead  Male  Female  Transgender  Total
0  Andhra Pradesh  2014                                       No Education   927     470            0   1397
1  Andhra Pradesh  2014                             Primary (upto class-5)   743     385            0   1128
2  Andhra Pradesh  2014                              Middle (upto class-8)   630     343            0    973

#printing LAST n ROWS of dataframe. By default prints LAST 5 ROWS.
>>> df.tail(3)
            States/UTs  Year                  CrimeHead   Male  Female  Transgender   Total
387  Total (All India)  2014  Professionals (MBA; etc.)    289      94            0     383
388  Total (All India)  2014           Status not known   9796    4438            9   14243
389  Total (All India)  2014                      Total  89129   42521           16  131666

# printing dataframe information
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 390 entries, 0 to 389
Data columns (total 7 columns):
States/UTs     390 non-null object
Year           390 non-null int64
CrimeHead      390 non-null object
Male           390 non-null int64
Female         390 non-null int64
Transgender    390 non-null int64
Total          390 non-null int64
dtypes: int64(5), object(2)
memory usage: 21.4+ KB

DataFrame Meta Operations

# changing the INDEX COLUMN of dataframe
>>> df.set_index('COLUMN NAME')

# assigning new COLUMNS list for new columns names
>>> df.columns = ['states', 'year', 'education', 'male', 'female', 'transgender', 'total']
>>> df.columns
Index(['states', 'year', 'education', 'male', 'female', 'transgender', 'total'], dtype='object')

Data Access and Retrieval Operations

Retrieving Columns

# getting SINGLE column using column name as ATTRIBUTE
>>> df.states.head(2)
0 Andhra Pradesh
1 Andhra Pradesh
Name: states, dtype: object

# getting SINGLE column using dictionary like INDEXING
>>> df['states'].head(2)
0 Andhra Pradesh
1 Andhra Pradesh
Name: states, dtype: object

# getting MULTIPLE columns
>>> df[['states','education']].head(3)
           states               education
0  Andhra Pradesh            No Education
1  Andhra Pradesh  Primary (upto class-5)
2  Andhra Pradesh   Middle (upto class-8)

Retrieving Rows

# retrieving rows using SLICING
>>> df[4:7]
           states  year                                          education  male  female  transgender  total
4  Andhra Pradesh  2014  Higher Secondary/ Intermediate/ Pre-University...   395     150            0    545
5  Andhra Pradesh  2014                            Diploma/ITI/Certificate    77      16            0     93
6  Andhra Pradesh  2014                                 Graduate and above   174      39            0    213

# retrieving using DATAFRAME INDEX (loc is LABEL based)
>>> df.loc[[4,7]]
           states  year                                          education  male  female  transgender  total
4  Andhra Pradesh  2014  Higher Secondary/ Intermediate/ Pre-University...   395     150            0    545
7  Andhra Pradesh  2014                          Professionals (MBA; etc.)    16       4            0     20

# retrieving using dataframe ROW NUMBER (iloc is INTEGER based)
>>> df.iloc[[4,7]]
           states  year                                          education  male  female  transgender  total
4  Andhra Pradesh  2014  Higher Secondary/ Intermediate/ Pre-University...   395     150            0    545
7  Andhra Pradesh  2014                          Professionals (MBA; etc.)    16       4            0     20

# using SLICING with loc/iloc
>>> df.loc[4:7]
           states  year                                          education  male  female  transgender  total
4  Andhra Pradesh  2014  Higher Secondary/ Intermediate/ Pre-University...   395     150            0    545
5  Andhra Pradesh  2014                            Diploma/ITI/Certificate    77      16            0     93
6  Andhra Pradesh  2014                                 Graduate and above   174      39            0    213
7  Andhra Pradesh  2014                          Professionals (MBA; etc.)    16       4            0     20

Retrieving Both Rows and Columns

We can use loc and iloc to retrieve both rows and columns also. loc and iloc takes second value as the list of values as columns.

# using loc
>>> df.loc[[4,5,6,7],['male','education']]
   male                                          education
4   395  Higher Secondary/ Intermediate/ Pre-University...
5    77                            Diploma/ITI/Certificate
6   174                                 Graduate and above
7    16                          Professionals (MBA; etc.)

#using iloc
>>> df.iloc[[4,5,6,7],[2,3]]
                                           education  male
4  Higher Secondary/ Intermediate/ Pre-University...   395
5                            Diploma/ITI/Certificate    77
6                                 Graduate and above   174
7                          Professionals (MBA; etc.)    16

More about loc/iloc

Note that both loc/iloc takes two values first one is row(s) and second one is column(s). It can accept various different type of values, and their combinations.

  • Single Scalar Value
  • Python List
  • Range
#SCALAR VALUE - row 5 and male column
>>> df.loc[5, 'male']
77

#PYTHON LIST
>>> df.loc[[4,5,6,7],['education', 'male']]
                                           education  male
4  Higher Secondary/ Intermediate/ Pre-University...   395
5                            Diploma/ITI/Certificate    77
6                                 Graduate and above   174
7                          Professionals (MBA; etc.)    16

#RANGE
>>> df.loc[5:8, 'education':'female']
                   education  male  female
5    Diploma/ITI/Certificate    77      16
6         Graduate and above   174      39
7  Professionals (MBA; etc.)    16       4
8           Status not known   601     167

# retrieving COLUMNS ONLY
>>> df.loc[:, ['education','male','female']].head(2)
                education  male  female
0            No Education   927     470
1  Primary (upto class-5)   743     385

To learn more about loc/iloc, indexing or selection in general or Advanced Indexing read the 5th part of this series where we will discuss advanced indexing techniques.

Filtering

We can filter rows using various different comparison operators and expressions. This filtering and selection capability makes dataframe access makes very convenient and powerful.

Firstly let’s retrieve all the states that have Transgender Suicide Rate > 0

>>> df[df.transgender > 0].tail(3)
                states  year                                          education   male  female  transgender   total
384  Total (All India)  2014  Higher Secondary/ Intermediate/ Pre-University...  10079    4348            1   14428
388  Total (All India)  2014                                   Status not known   9796    4438            9   14243
389  Total (All India)  2014                                              Total  89129   42521           16  131666

As we can see that we have lot of rows included which are just rows with Total Values. So we will filter these rows out.

# states with transgender suicide rate > 0 with no Total rows
>>> df[df.transgender > 0][df.states.str.contains('Total') == False]
             states  year                                          education   male  female  transgender  total
8    Andhra Pradesh  2014                                   Status not known    601     167            1    769
9    Andhra Pradesh  2014                                              Total   4220    1880            1   6101
112       Karnataka  2014                              Middle (upto class-8)   1230     549            1   1780
118       Karnataka  2014                                   Status not known    889     282            1   1172
119       Karnataka  2014                                              Total   7684    3259            2  10945
140     Maharashtra  2014                                       No Education   1064     552            3   1619
149     Maharashtra  2014                                              Total  11828    4476            3  16307
232      Tamil Nadu  2014                              Middle (upto class-8)   2551    1304            1   3856
233      Tamil Nadu  2014              Matriculate/Secondary (upto class-10)   2325    1027            1   3353
234      Tamil Nadu  2014  Higher Secondary/ Intermediate/ Pre-University...   1348     609            1   1958
238      Tamil Nadu  2014                                   Status not known    541     195            1    737
239      Tamil Nadu  2014                                              Total  10963    5155            4  16122
268   Uttar Pradesh  2014                                   Status not known    415     260            6    681
269   Uttar Pradesh  2014                                              Total   2099    1485            6   3590

Now since we only need unique state names whose suicide rate is greater than 0, therefore we will only retrieve states column and get that column’s unique values.

>>> df[df.transgender > 0][df.states.str.contains('Total') == False]['states'].unique()
array(['Andhra Pradesh', 'Karnataka', 'Maharashtra', 'Tamil Nadu',
 'Uttar Pradesh'], dtype=object)

In similar fashion we can do all kind of comparison operations. Pandas will perform these vector operations and filter our the rows with False values.

Iteration

We will see two methods to iterate over dataframe rows which are iterrows() and itertuples()

Using iterrows()

# printing whole rows
>>> for index,row in df.head(2).iterrows():
...     print(row)
...     print()
...
states         Andhra Pradesh
year                     2014
education        No Education
male                      927
female                    470
transgender                 0
total                    1397
Name: 0, dtype: object

states                 Andhra Pradesh
year                             2014
education      Primary (upto class-5)
male                              743
female                            385
transgender                         0
total                            1128
Name: 1, dtype: object

# retrieving columns values
>>> for index,row in df.head(2).iterrows():
... print(row.education)
...
No Education
Primary (upto class-5)

Using itertuples()

Iterating over tuple we can access column values by using column name as property.

# printing whole row
>>> for row in df.head().itertuples():
....    print(row)
....
Pandas(Index=0, states='Andhra Pradesh', year=2014, education='No Education', male=927, female=470, transgender=0, total=1397)
Pandas(Index=1, states='Andhra Pradesh', year=2014, education='Primary (upto class-5)', male=743, female=385, transgender=0, total=1128)
Pandas(Index=2, states='Andhra Pradesh', year=2014, education='Middle (upto class-8)', male=630, female=343, transgender=0, total=973)
Pandas(Index=3, states='Andhra Pradesh', year=2014, education='Matriculate/Secondary (upto class-10)', male=657, female=306, transgender=0, total=963)
Pandas(Index=4, states='Andhra Pradesh', year=2014, education='Higher Secondary/ Intermediate/ Pre-University (upto class-12)', male=395, female=150, transgender=0, total=545)

# retrieving COLUMN VALUES from row
>>> for row in df.head().itertuples():
... print(row.education)
...
No Education
Primary (upto class-5)
Middle (upto class-8)
Matriculate/Secondary (upto class-10)
Higher Secondary/ Intermediate/ Pre-University (upto class-12)

Data Munging Operations

Handling Missing Values

We can handle missing data either by dropping rows with missing data or filling missing values using some strategy like mean, interpolation etc to guess missing values.

Dropping NA

We can chose to drop all rows that either contains all missing data or some missing data.

#dropping rows with ALL missing values
>>> df.dropna(how='all')

#dropping rows with ANY missing values
>>> df.dropna(how='any')

#DROPPING COLUMNS with missing values
>>> df.drop(axis=1)

NOTE that dropna method will return new dataframe with dropped values. To do inplace drop, use keyword argument inplace

# INPLACE drop
>>> df.dropna(how='any', inplace=True)

Filling NA

#filling using a SINGLE SCALAR value
>>> df.fillna(5)

There are different methods/ways using which we can fill the missing data, and which are passed as string to method argument:

  • backfill/bfill – Fills the NaN values in backward direction
  • pad/ffill – Fills the NaN values in forward direction i.e gets its value from previous value.

Just like dropna, fillna also accepts axis and inplace as optional arguments.

Filling NA using Interpolation

# filling values using LINEAR INTERPOLATION
>>> df.interpolate()

interpolate() methods supports a lot of interpolation methods. Few of these different supported interpolation methods are:

  • linear
  • time
  • index
  • values
  • nearest
  • zero
  • and many more…
# interpolation using actual index values
>>> df.interpolate(method='index')

Just like dropna and fillna interpolate also accepts inplace and axis keyword argument as well.

Both interpolate and fillna accepts a keyword argument limit which takes a number which tells us what is upper limit for no of consecutive NaNs to allow interpolation or filling values.

# interpolation only those values whose consecutive NaN values is not greater than 5
>>> df.interpolate(limit=5)

Replacing Values

replace method can accept both scalar value and list as its first and second argument

>>> df.year.replace(2014, 14, inplace=True)
           states  year                                          education  male  female  transgender  total
0  Andhra Pradesh    14                                       No Education   927     470            0   1397
1  Andhra Pradesh    14                             Primary (upto class-5)   743     385            0   1128
2  Andhra Pradesh    14                              Middle (upto class-8)   630     343            0    973
3  Andhra Pradesh    14              Matriculate/Secondary (upto class-10)   657     306            0    963
4  Andhra Pradesh    14  Higher Secondary/ Intermediate/ Pre-University...   395     150            0    545

Shifting Values

We can shift values by some number which will cycle the values. This operations often becomes useful in time series data.

>>> df.states.shift(-1)

Conclusion

In this post I discussed about Pandas DataFrame in little more depth. We learnt about various basic dataframe operations, various ways to retrieve data from dataframe, In next part of this series we will continue to discuss about dataframe and various dataframe manipulation and transformation operations. We also learnt different data munging operations. . In next part of this tutorial series we will look at various DataFrame manipulation operations with few statistical operations.

Leave a comment