Table of Contents
- Introduction
- Selecting Columns by Labels
- Selection by the
loc
Attribute - Selection by the
iloc
Attribute
Introduction
We learned in the previous sections that a DataFrame behaves in many ways like a 2-D NumPy array or a structured array, and in other ways like a Python dictionary of Series objects sharing the same index.
These analogies turn out to be helpful as we study data selection in DataFrames. In this section, we will discuss accessing the columns in a DataFrame by label and by position.
Selecting Columns by Labels
We first import the required libraries and modules.
1import numpy as np
2import pandas as pd
3import random as rd
We now create a dataframe from a dictionary of lists using the random
module to generate the required data.
Example
Create a DataFrame from a dictionary of Python lists.1a=['Tom','Gary','Lois','Wendy','Betty']
2b=rd.choices(range(50,85),k=5)
3c=np.random.uniform(1.6, 1.9, 5).round(2)
4
5df = pd.DataFrame({'Weight':b,
6 'Height':c}, index=a)
7df
Weight | Height | |
---|---|---|
Tom | 58 | 1.65 |
Gary | 54 | 1.84 |
Lois | 57 | 1.70 |
Wendy | 79 | 1.73 |
Betty | 65 | 1.68 |
Selecting Single Column
We can access each column using its column label (which is the key of the original dictionary). The output is a Series.
Example
Access a column of a DataFrame using its label.1df['Weight']
Tom 58
Gary 54
Lois 57
Wendy 79
Betty 65
Name: Weight, dtype: int64
Alternatively, we can access each column using the column name as an attribute.
Example
Access a column of a DataFrame using using its label as an attribute.1df.Weight
Tom 58
Gary 54
Lois 57
Wendy 79
Betty 65
Name: Weight, dtype: int64
This second method of accessing a column is not recommended since it will not work if the column names are not strings or if the column names conflict with methods or attributes of the DataFrame.
For example, we create a column with ‘shape’ as its label.
Example
Create a DataFrame with theshape
attribute as a column label.
1df1 = pd.DataFrame({'Weight':b,
2 'shape':c}, index=a)
3df1
Weight | shape | |
---|---|---|
Tom | 58 | 1.65 |
Gary | 54 | 1.84 |
Lois | 57 | 1.70 |
Wendy | 79 | 1.73 |
Betty | 65 | 1.68 |
In this case, df1.shape
will return the shape of the dataframe rather than the ‘shape’ column.
Example
Column label conflicts attribute of DataFrame.1df1.shape
(5, 2)
The dictionary-like syntax can be used to modify the dataframe object. For example, creating a new column.
Example
Creating a new column from existing columns.1df['Ratio'] = df['Weight']/df['Height']
2df
Weight | Height | Ratio | |
---|---|---|---|
Tom | 58 | 1.65 | 35.151515 |
Gary | 54 | 1.84 | 29.347826 |
Lois | 57 | 1.70 | 33.529412 |
Wendy | 79 | 1.73 | 45.664740 |
Betty | 65 | 1.68 | 38.690476 |
We can also round off a certain column using the round()
method.
Example
Rounding off a numerical column of values.1df['Ratio'] = df['Ratio'].round(1)
2df
Weight | Height | Ratio | |
---|---|---|---|
Tom | 58 | 1.65 | 35.2 |
Gary | 54 | 1.84 | 29.3 |
Lois | 57 | 1.70 | 33.5 |
Wendy | 79 | 1.73 | 45.7 |
Betty | 65 | 1.68 | 38.7 |
Selecting Multiple Columns
We can put the labels of all columns we want in a list and feed into the dataframe.
Example
Selecting multiple columns using labels.1cols = ['Weight','Ratio']
2df[cols]
Weight | Ratio | |
---|---|---|
Tom | 58 | 35.2 |
Gary | 54 | 29.3 |
Lois | 57 | 33.5 |
Wendy | 79 | 45.7 |
Betty | 65 | 38.7 |
Selection by the loc Attribute
The loc
attribute is used to access a group of rows and columns by label(s) or a boolean array.
Syntax
Theloc
attribute.
1dataframe.loc[row_selection, column_selection]
where row_selection
and column_selection
refer to a list of row labels and column labels, respectively, i.e. ['label1', 'label2', 'label3']
. loc
supports the slice notation and therefore accepts a colon(:
) to select all rows or columns.
Selecting a Cell
We can select a cell value by specifying both row and column labels.
Example
Selecting a cell using theloc
attribute.
1df.loc['Tom','Height']
1.65
Selecting Single Column
Select an entire column using its label. The colon (:
) in the row position indicates all rows are selected.
Example
Selecting a column using theloc
attribute.
1df.loc[:,"Weight"]
Tom 58
Gary 54
Lois 57
Wendy 79
Betty 65
Name: Weight, dtype: int64
The above returns a series object.
1type(df.loc[:,"Weight"])
pandas.core.series.Series
If the column label is in a list, then a DataFrame will be created instead.
1df.loc[:,["Weight"]]
Weight | |
---|---|
Tom | 58 |
Gary | 54 |
Lois | 57 |
Wendy | 79 |
Betty | 65 |
1type(df.loc[:,["Weight"]])
pandas.core.frame.DataFrame
Selecting Multiple Columns
The labels of the columns we wish to select are inside a list. This will always return a DataFrame.
Example
Selecting multiple columns using theloc
attribute.
1df.loc[:,['Weight','Ratio']]
Weight | Ratio | |
---|---|---|
Tom | 58 | 35.2 |
Gary | 54 | 29.3 |
Lois | 57 | 33.5 |
Wendy | 79 | 45.7 |
Betty | 65 | 38.7 |
We can also select multiple contiguous columns (adjacent to each other with no gap) using the colon :
notation.
Example
Contiguous column selection.1df.loc[:,'Weight':'Height']
Weight | Height | |
---|---|---|
Tom | 58 | 1.65 |
Gary | 54 | 1.84 |
Lois | 57 | 1.70 |
Wendy | 79 | 1.73 |
Betty | 65 | 1.68 |
The usefulness of contiguous column selection may not be apparent in a DataFrame with only a few columns. However, imagine you have a large DataFrame and you need to select over a dozen contiguous columns. In this case, the colon (:
) notation will do away with the need to write down the label of every single column included in the selection.
Summary
Column selection by theloc
attribute.
Selection | Return Data Type | Example |
---|---|---|
Single value | Scalar | df.loc['Tom','Height'] |
Single column | Series | df.loc[:,"Weight"] |
Single column | DataFrame | df.loc[:,["Weight"]] |
Multiple columns | DataFrame | df.loc[:,['Weight','Ratio']] |
Contiguous columns | DataFrame | df.loc[:,'Weight':'Ratio'] |
Selection by the iloc Attribute
The iloc
attribute is a purely integer-location based indexing for selection by position.
Syntax
Theiloc
attribute.
1dataframe.iloc[row_selection, column_selection]
where row_selection
and column_selection
refer to a list of row indices and column indices, respectively.iloc
supports the slice notation and therefore accepts a colon(:
) to select all rows or columns.
When performing data selection using the iloc
attribute, we employ the standard half-open interval.
Selecting a Cell
We can select a cell value by specifying both row and column indices.
Example
Selecting a cell using theiloc
attribute.
1df.iloc[0,0]
58
Selecting Single Column
We can select an entire column using its index. The colon (:) in row_selection
indicates all rows are selected.
Example
Selecting a column using theiloc
attribute.
1df.iloc[:,2]
Tom 35.2
Gary 29.3
Lois 33.5
Wendy 45.7
Betty 38.7
Name: Ratio, dtype: float64
The above returns a series object.
1type(df.iloc[:,2])
pandas.core.series.Series
If the column label is in a list, then a DataFrame will be created instead.
Example
Selecting a column using theiloc
attribute - returns a DataFrame.
1df.iloc[:,[2]]
Ratio | |
---|---|
Tom | 35.2 |
Gary | 29.3 |
Lois | 33.5 |
Wendy | 45.7 |
Betty | 38.7 |
1type(df.iloc[:,[2]])
pandas.core.frame.DataFrame
Selecting Multiple Columns
Selecting multiple columns (whose indices are inside a list) will always return a DataFrame.
Example
Selecting multiple columns using theiloc
attribute.
1df.iloc[:,[0,2]]
Weight | Ratio | |
---|---|---|
Tom | 58 | 35.2 |
Gary | 54 | 29.3 |
Lois | 57 | 33.5 |
Wendy | 79 | 45.7 |
Betty | 65 | 38.7 |
We can also select multiple contiguous columns (adjacent to each other with no gap) using the colon :
notation.
Example
Contiguous column selection.1df.iloc[:,0:2]
Weight | Height | |
---|---|---|
Tom | 58 | 1.65 |
Gary | 54 | 1.84 |
Lois | 57 | 1.70 |
Wendy | 79 | 1.73 |
Betty | 65 | 1.68 |
Summary
Column selection by theiloc
attribute.
Selection | Return Data Type | Example |
---|---|---|
Single value | Scalar | df.iloc[1,2] |
Single column | Series | df.iloc[:,2] |
Single column | DataFrame | df.iloc[:,[2]] |
Multiple columns | DataFrame | df.iloc[:,[2,1]] |
Contiguous columns | DataFrame | df.iloc[:, 1:4] |