Table of Contents
Introduction
Boolean masking refers to selecting subsets of a DataFrame with the help of Boolean expressions. In particular, it can be applied to filtering the columns of a DataFrame based on certain specified Boolean expressions.
In DataFrames, however, we cannot use the Python logical operators and
, or
and not
. Instead, we need to use the symbols as shown in the table below.
Summary
Logical boolean operators in pandas.Python Logical Operators | pandas Logical Operators |
---|---|
and |
& |
or |
| |
not |
~ |
We first import the required libraries and modules.
1import numpy as np
2import pandas as pd
Let’s create a more elaborate dataframe with 10 columns.
1R = np.random.RandomState(10) # set a random seed
2a='Tom Gary Lois Wendy Betty Jacobs Donald Yasmine Nick Andrew'.split()
3b=R.choice(range(50,85), size=10, replace=False)
4c=R.uniform(1.6, 1.9, 10).round(2)
5d=R.choice(['Germany', 'UK', 'USA', 'France', 'Canada'], size=10)
6
7df = pd.DataFrame({'Weight':b,
8 'Height':c,
9 'Country':d}, index=a)
10df = df.T
11df
Tom | Gary | Lois | Wendy | Betty | Jacobs | Donald | Yasmine | Nick | Andrew | |
---|---|---|---|---|---|---|---|---|---|---|
Weight | 73 | 74 | 52 | 57 | 55 | 53 | 62 | 64 | 76 | 63 |
Height | 1.78 | 1.65 | 1.85 | 1.63 | 1.69 | 1.62 | 1.62 | 1.83 | 1.6 | 1.78 |
Country | USA | Germany | UK | USA | Germany | Germany | France | UK | France | Canada |
Filtering Columns Based on Row Values
Suppose we want to select persons whose weight is below 60.
Example
Filtering columns based on row ‘Weight’.1df.loc[:, df.loc['Weight']<60]
Lois | Jacobs | |
---|---|---|
Weight | 52 | 53 |
Height | 1.85 | 1.62 |
Country | UK | Germany |
The expression within the square brackets is a Boolean expression and it is actually a Series of Boolean values.
1df.loc['Weight']<60
Tom False
Jack False
Gary False
Lois True
Jacobs True
Donald False
Yasmine False
Cindy False
Name: Weight, dtype: bool
Or persons who come from ‘Germany’.
Example
Filtering columns based on row ‘Country’.1df.loc[:,df.loc['Country']=='Germany']
Gary | Jacobs | |
---|---|---|
Weight | 74 | 53 |
Height | 1.65 | 1.62 |
Country | Germany | Germany |
Suppose we want to select persons whose weight is above 60 and height is below 1.70
We first create a mask which is a Boolean expression containing the two conditions. Thereafter, we apply the mask to the DataFrame.
Example
Filtering columns based on multiple rows.1mask = (df.loc['Weight']>60) & (df.loc['Height']<1.7)
2df.loc[:,mask]
Gary | Donald | |
---|---|---|
Weight | 74 | 62 |
Height | 1.65 | 1.62 |
Country | Germany | France |
If you have more than one condition, make sure to put every Boolean expression in between parentheses so that you do not need to worry about operator precedence.
isin() Method
To select only columns such that the ‘Country’ is either ‘USA’ or ‘Germany’.
Example
Filtering columns using theisin()
method.
1df.loc[:, df.loc['Country'].isin(["USA", "Germany"])]
Tom | Gary | Wendy | Betty | Jacobs | |
---|---|---|---|---|---|
Weight | 73 | 74 | 57 | 55 | 53 |
Height | 1.78 | 1.65 | 1.63 | 1.69 | 1.62 |
Country | USA | Germany | USA | Germany | Germany |
Filtering Based on Column Labels
It is also possible to filter the columns of a DataFrame based on the column labels. Recall that df.columns
allows us to see all the column labels.
1df.columns
Index(['Tom', 'Jack', 'Gary', 'Lois', 'Jacobs', 'Donald', 'Yasmine', 'Cindy'], dtype='object')
startswith() Method
To filter columns such that the label starts with a certain character, we use the str
attribute followed by the startswith()
method. The Boolean mask looks like this.
Example
Filtering columns such that label starts with ‘A’ or ‘B’.1df.columns.str.startswith(('A','B'))
array([False, False, False, False, True, False, False, False, False,
True])
We then feed the above mask into df.loc
.
1df.loc[:,df.columns.str.startswith(('A','B'))]
Betty | Andrew | |
---|---|---|
Weight | 55 | 63 |
Height | 1.69 | 1.78 |
Country | Germany | Canada |
endswith() Method
To filter columns such that the label ends with certain characters, we use the str
attribute followed by the endswith()
method.
Example
Filtering columns such that label ends with ‘y’ or ’s’.1df.loc[:,df.columns.str.endswith(('y','s'))]
Gary | Lois | Wendy | Betty | Jacobs | |
---|---|---|---|---|---|
Weight | 74 | 52 | 57 | 55 | 53 |
Height | 1.65 | 1.85 | 1.63 | 1.69 | 1.62 |
Country | Germany | UK | USA | Germany | Germany |
contains() Method
To filter columns such that the label contains certain characters, we use the str
attribute followed by the contains()
method.
Example
Filtering columns such that label contains a specified string.1df.loc[:,df.columns.str.contains('on')]
Donald | |
---|---|
Weight | 62 |
Height | 1.62 |
Country | France |
Column labels containing one of multiple string patterns.
Example
Filtering columns such that label contains one of 2 specified strings.1df.loc[:,df.columns.str.contains('on|as')]
Donald | Yasmine | |
---|---|---|
Weight | 62 | 64 |
Height | 1.62 | 1.83 |
Country | France | UK |
Example
Filtering columns such that label contains one of 3 specified strings.1df.loc[:,df.columns.str.contains('on|as|et')]
Betty | Donald | Yasmine | |
---|---|---|---|
Weight | 55 | 62 | 64 |
Height | 1.69 | 1.62 | 1.83 |
Country | Germany | France | UK |
To grep column labels that contain BOTH “o” and “i”.
Example
Filtering columns such that label contains both ‘o’ and ‘i’.1df.loc[:,(df.columns.str.contains('o')) & (df.columns.str.contains('i'))]
Lois | |
---|---|
Weight | 52 |
Height | 1.85 |
Country | UK |
len() Method
To filter columns such that the label has more than a certain number of characters.
Example
Filtering columns such that the label is greater than a certain length.1df.loc[:,df.columns.str.len()>5]
Jacobs | Donald | Yasmine | Andrew | |
---|---|---|---|---|
Weight | 53 | 62 | 64 | 63 |
Height | 1.62 | 1.62 | 1.83 | 1.78 |
Country | Germany | France | UK | Canada |