Table of Contents
Introduction
Boolean masking refers to selecting subsets of a DataFrame with the help of Boolean expressions. In particular, it can be applied to filtering the rows of a DataFrame based on certain specified Boolean expressions.
In DataFrames, however, we cannot use the Python logical operators and
, or
and not
. Instead, we need to use the symbols as shown in the table below.
Summary
Logical boolean operators in pandas.Python Logical Operators | pandas Logical Operators |
---|---|
and |
& |
or |
| |
not |
~ |
We first import the required libraries and modules.
1import numpy as np
2import pandas as pd
3import random as rd
Let’s create a more elaborate dataframe for this section.
1a='Tom Gary Lois Wendy Betty Jacobs Donald Yasmine Nick Andrew'.split()
2b=rd.choices(range(50,85),k=10)
3c=np.random.uniform(1.6, 1.9, 10).round(2)
4d = rd.choices(['Germany', 'UK', 'USA', 'France', 'Canada'], k=10)
5
6df = pd.DataFrame({'Weight':b,
7 'Height':c,
8 'Country':d}, index=a)
9df
Weight | Height | Country | |
---|---|---|---|
Tom | 59 | 1.85 | France |
Gary | 68 | 1.64 | Germany |
Lois | 81 | 1.85 | Germany |
Wendy | 56 | 1.67 | Canada |
Betty | 65 | 1.73 | USA |
Jacobs | 59 | 1.69 | France |
Donald | 72 | 1.81 | UK |
Yasmine | 52 | 1.85 | Canada |
Nick | 66 | 1.67 | Germany |
Andrew | 69 | 1.60 | UK |
Filtering Based on Column Values
Suppose we want to select persons whose weight is below 60.
Example
Filtering rows based on column ‘Weight’.1df[df['Weight']<60]
Weight | Height | Country | |
---|---|---|---|
Tom | 59 | 1.85 | France |
Wendy | 56 | 1.67 | Canada |
Jacobs | 59 | 1.69 | France |
Yasmine | 52 | 1.85 | Canada |
The expression within the square brackets is a Boolean expression and it is actually a Series of Boolean values.
1df['Weight']<60
Tom True
Gary False
Lois False
Wendy True
Betty False
Jacobs True
Donald False
Yasmine True
Nick False
Andrew False
Name: Weight, dtype: bool
We can select only persons (rows) who come from ‘Germany’.
Example
Filtering rows based on column ‘Country’.1df[df['Country']=='Germany']
Weight | Height | Country | |
---|---|---|---|
Gary | 68 | 1.64 | Germany |
Lois | 81 | 1.85 | Germany |
Nick | 66 | 1.67 | Germany |
Suppose we want to select persons whose weight is above 60 and height is below 1.70.
We first create a mask which is a Boolean expression containing the two conditions. Thereafter, we apply the mask to the DataFrame.
Example
Filtering rows based on multiple columns.1mask = (df['Weight']>60) & (df['Height']<1.7)
2df[mask]
Weight | Height | Country | |
---|---|---|---|
Gary | 68 | 1.64 | Germany |
Nick | 66 | 1.67 | Germany |
Andrew | 69 | 1.60 | UK |
If you have more than one condition, it is advisable to put every Boolean expression within parentheses so that you do not need to worry about operator precedence.
isin() Method
To filter the DataFrame to include only participants from USA or Germany.
Example
Filtering rows using theisin()
method.
1df.loc[df["Country"].isin(["USA", "Germany"]), :]
Weight | Height | Country | |
---|---|---|---|
Gary | 68 | 1.64 | Germany |
Lois | 81 | 1.85 | Germany |
Betty | 65 | 1.73 | USA |
Nick | 66 | 1.67 | Germany |
Filtering Based on Index
It is also possible to filter the rows of a DataFrame based on the index column. Recall that df.index
allows us to see all the row labels.
1df.index
Index(['Tom', 'Gary', 'Lois', 'Wendy', 'Betty', 'Jacobs', 'Donald', 'Yasmine',
'Nick', 'Andrew'],
dtype='object')
startswith() Method
To filter rows such that the index starts with a certain character, we use the str
attribute followed by the startswith()
method. The Boolean mask looks like this.
Example
Filtering rows based on the index using thestartswith()
method.
1df.index.str.startswith(('A','B'))
array([False, False, False, False, True, False, False, False, False,
True])
We then feed the above mask into df.loc
.
1df.loc[df.index.str.startswith(('A','B'))]
Weight | Height | Country | |
---|---|---|---|
Betty | 65 | 1.73 | USA |
Andrew | 69 | 1.60 | UK |
endswith() Method
To filter rows such that the index ends with certain characters, we use the str
attribute followed by the endswith()
method.
Example
Filtering rows based on the index using theendswith()
method.
1df.loc[df.index.str.endswith(('y','s'))]
Weight | Height | Country | |
---|---|---|---|
Gary | 68 | 1.64 | Germany |
Lois | 81 | 1.85 | Germany |
Wendy | 56 | 1.67 | Canada |
Betty | 65 | 1.73 | USA |
Jacobs | 59 | 1.69 | France |
contains() Method
To filter rows such that the index contains certain characters, we use the str
attribute followed by the contains()
method.
Example
Filtering rows such that the index contains string ‘on’.1df.loc[df.index.str.contains('on')]
Weight | Height | Country | |
---|---|---|---|
Donald | 72 | 1.81 | UK |
We can also filter the rows such that the index contains one of multiple specified strings.
Example
Filtering rows such that the index contains string ‘on’ or ‘as’.1df.loc[df.index.str.contains('on|as')]
Weight | Height | Country | |
---|---|---|---|
Donald | 72 | 1.81 | UK |
Yasmine | 52 | 1.85 | Canada |
Example
Filtering rows such that the index contains ‘on’, ‘as’ or ’et’.1df.loc[df.index.str.contains('on|as|et')]
Weight | Height | Country | |
---|---|---|---|
Betty | 65 | 1.73 | USA |
Donald | 72 | 1.81 | UK |
Yasmine | 52 | 1.85 | Canada |
We can also grep index labels that contain all the specified strings.
Example
Filtering rows such that the index contains both ‘o’ and ‘i’.1df.loc[(df.index.str.contains('o')) & (df.index.str.contains('i'))]
Weight | Height | Country | |
---|---|---|---|
Lois | 81 | 1.85 | Germany |
len() Method
To filter rows such that the index has more than a certain number of characters.
Example
Filtering rows based on the index using thelen()
method.
1df.loc[df.index.str.len()>5]
Weight | Height | Country | |
---|---|---|---|
Jacobs | 76 | 1.73 | Canada |
Donald | 55 | 1.65 | France |
Yasmine | 77 | 1.80 | Germany |
Andrew | 59 | 1.68 | Canada |
Filtering Integer Index
We first convert the current name index into an integer range index.
Example
Resetting DataFrame index to integer range.1df.reset_index(inplace=True)
The inplace=True
argument indicates that the DataFrame is modified in place which means that original df
is modified without making a further assignment.
1df
index | Weight | Height | Country | |
---|---|---|---|---|
0 | Tom | 54 | 1.84 | Canada |
1 | Gary | 57 | 1.77 | USA |
2 | Lois | 83 | 1.87 | France |
3 | Wendy | 52 | 1.84 | Canada |
4 | Betty | 67 | 1.61 | USA |
5 | Jacobs | 76 | 1.73 | Canada |
6 | Donald | 55 | 1.65 | France |
7 | Yasmine | 77 | 1.80 | Germany |
8 | Nick | 59 | 1.82 | Germany |
9 | Andrew | 59 | 1.68 | Canada |
We can select all rows whose indices lie between 3 and 8, exclusive.
Example
Filtering rows based on integer index.1df.loc[(df.index > 3) & (df.index < 8)]
index | Weight | Height | Country | |
---|---|---|---|---|
4 | Betty | 67 | 1.61 | USA |
5 | Jacobs | 76 | 1.73 | Canada |
6 | Donald | 55 | 1.65 | France |
7 | Yasmine | 77 | 1.80 | Germany |