Table of Contents
- Introduction
- Syntax
- DataFrame from a Series Object
- DataFrame from a Dictionary of Series Objects
- DataFrame from a Dictionary of Python Lists
- DataFrame from a List of Dictionaries
- DataFrame from a Two-dimensional Array
- DataFrame from a Nested List
- DataFrame from a NumPy Structured Array
Introduction
A pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure. A DataFrame is analogous to an Excel spreadsheet with its rows and columns, while a Series is analogous to a single column of data.
Just like the Series object, which is an analog of a 1-D NumPy array with flexible indices, a DataFrame is an analog of a 2-D NumPy array with both flexible row indices and column labels.
It can also be thought of as a dictionary-like container for Series objects. We can think of a DataFrame as a sequence of aligned Series objects, or in other words, Series that share the same index.
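The dictionary-like view can be made concrete with a short sketch. The labels and values below are purely illustrative; the point is only that selecting a column by its label returns a Series, and that every column carries the same row index as the frame.

```python
import pandas as pd

# Two Series that share the same index (illustrative data).
idx = ['a', 'b', 'c']
x = pd.Series([1, 2, 3], index=idx)
y = pd.Series([0.1, 0.2, 0.3], index=idx)

# A DataFrame behaves like a dictionary of aligned Series.
frame = pd.DataFrame({'x': x, 'y': y})

col = frame['x']                      # dictionary-style access by column label
print(type(col).__name__)             # Series
print(col.index.equals(frame.index))  # True: the column shares the frame's index
print(list(frame.keys()))             # ['x', 'y']
```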
Syntax
A pandas DataFrame can be created using the following constructor.
Syntax
The pandas.DataFrame function.
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
Parameter | Required? | Default Value | Description |
---|---|---|---|
data | ❌ No | None | ndarray (structured or homogeneous), Iterable, dict, or DataFrame. If omitted, an empty DataFrame is created. |
index | ❌ No | RangeIndex | Index used to label the rows of the resulting frame. |
columns | ❌ No | RangeIndex | Column labels used for the resulting frame. |
dtype | ❌ No | Inferred from data | Data type to force. Only a single dtype is allowed. If None, it is inferred from data. |
copy | ❌ No | None | bool. Copy data from inputs. For dict data, the default of None behaves like copy=True. For DataFrame or 2-D ndarray input, the default of None behaves like copy=False. If data is a dict containing one or more Series (possibly of different dtypes), copy=False will ensure that these inputs are not copied. |
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
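As a rough illustration of the index, columns and dtype parameters, the sketch below builds the same small frame twice: once with the defaults and once with explicit labels and a forced dtype. The data and the labels r1, r2, c1, c2 are made up for this example.

```python
import pandas as pd

values = [[1, 2], [3, 4]]  # illustrative data, not taken from the text above

# With no labels given, rows and columns both get an integer RangeIndex.
default = pd.DataFrame(values)
print(default.index)    # RangeIndex(start=0, stop=2, step=1)
print(default.columns)  # RangeIndex(start=0, stop=2, step=1)

# Explicit row labels, column labels, and a single forced dtype.
labelled = pd.DataFrame(values,
                        index=['r1', 'r2'],
                        columns=['c1', 'c2'],
                        dtype=float)
print(labelled.dtypes.tolist())  # both columns are float64
```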
DataFrame from a Series Object
A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series.
We first import the required libraries and modules.
import numpy as np
import pandas as pd
import random as rd
Example
Step 1: Creating a pandas Series from a dictionary.
age_dict = {'Tom': 32,
            'Gary': 26,
            'Lois': 22,
            'Wendy': 31,
            'Betty': 35} # create dictionary

age = pd.Series(age_dict) # create Series
age
Tom 32
Gary 26
Lois 22
Wendy 31
Betty 35
dtype: int64
We now create a single-column DataFrame from the above Series object.
Example
Step 2: Creating a pandas DataFrame from a pandas Series.
pd.DataFrame(age, columns=['Age']) # create dataframe
| | Age |
---|---|
Tom | 32 |
Gary | 26 |
Lois | 22 |
Wendy | 31 |
Betty | 35 |
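An alternative that is not used in the example above is the Series.to_frame() method, which performs the same conversion directly on the Series. The sketch below rebuilds the age Series so that it runs on its own.

```python
import pandas as pd

# Same data as the age Series in the example above.
age = pd.Series({'Tom': 32, 'Gary': 26, 'Lois': 22, 'Wendy': 31, 'Betty': 35})

# Series.to_frame() converts a Series into a single-column DataFrame;
# the name argument sets the column label.
age_df = age.to_frame(name='Age')

print(age_df.shape)          # (5, 1)
print(list(age_df.columns))  # ['Age']
```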
DataFrame from a Dictionary of Series Objects
It is also possible to create a DataFrame from a dictionary of Series objects.
We create another Series object that shares the same index as the age Series object (created above).
Example
Creating another pandas Series from a dictionary.
height_dict = {'Tom': 1.75,
               'Gary': 1.83,
               'Lois': 1.69,
               'Wendy': 1.67,
               'Betty': 1.62} # create dictionary

height = pd.Series(height_dict) # create Series
height
Tom 1.75
Gary 1.83
Lois 1.69
Wendy 1.67
Betty 1.62
dtype: float64
We then create a DataFrame from a dictionary of the two Series objects.
Example
Creating a pandas DataFrame from a dictionary of two Series.
pd.DataFrame({'age': age, 'height': height}) # create dataframe
| | age | height |
---|---|---|
Tom | 32 | 1.75 |
Gary | 26 | 1.83 |
Lois | 22 | 1.69 |
Wendy | 31 | 1.67 |
Betty | 35 | 1.62 |
Note that the column labels are just the ‘keys’ of the dictionary.
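The two Series above happen to share exactly the same index. As a hedged sketch of what alignment means when they do not, the example below uses deliberately mismatched, made-up indices: the resulting row index is the union of both, and entries with no matching label become NaN.

```python
import pandas as pd

age = pd.Series({'Tom': 32, 'Gary': 26, 'Lois': 22})
# 'Lois' is missing here and 'Wendy' is extra (illustrative data).
height = pd.Series({'Tom': 1.75, 'Gary': 1.83, 'Wendy': 1.67})

people = pd.DataFrame({'age': age, 'height': height})
print(people)

# The row index is the union of both indices; gaps are filled with NaN.
print(sorted(people.index))                   # ['Gary', 'Lois', 'Tom', 'Wendy']
print(pd.isna(people.loc['Lois', 'height']))  # True
print(pd.isna(people.loc['Wendy', 'age']))    # True
```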
DataFrame from a Dictionary of Python Lists
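It is also possible to create a pandas DataFrame from a dictionary of Python lists.
We first create the sequences that store the names, ages, and heights. For convenience, we use Python's random module and NumPy's np.random functions to generate some random data, so the values will differ from run to run. Note that the heights (c below) form a NumPy array rather than a list; pd.DataFrame accepts either as a dictionary value.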
Example
Creating a pandas DataFrame from a dictionary of lists.
a = ['Tom', 'Gary', 'Lois', 'Wendy', 'Betty']
b = rd.choices(range(30, 45), k=5)
c = np.random.uniform(1.6, 1.9, 5).round(2)

# printing
for x in (a, b, c):
    print(x)
['Tom', 'Gary', 'Lois', 'Wendy', 'Betty']
[32, 41, 39, 36, 40]
[1.87 1.81 1.67 1.78 1.74]
We now create a dictionary of the above lists and generate a DataFrame using pd.DataFrame(), while specifying list a to be the index of the DataFrame.
df = pd.DataFrame({'Age': b,
                   'Height': c}, index=a)
df
| | Age | Height |
---|---|---|
Tom | 32 | 1.87 |
Gary | 41 | 1.81 |
Lois | 39 | 1.67 |
Wendy | 36 | 1.78 |
Betty | 40 | 1.74 |
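Because the ages and heights are random, the table above will not be reproduced exactly on another run. One possible way to make such an example repeatable is to seed the generators. The sketch below assumes a hypothetical helper make_people and uses rd.Random and np.random.default_rng as seeded stand-ins for the unseeded calls in the example, so its numbers will not match the table above.

```python
import random as rd

import numpy as np
import pandas as pd


def make_people(seed):
    """Build the Age/Height frame from seeded generators (hypothetical helper)."""
    names = ['Tom', 'Gary', 'Lois', 'Wendy', 'Betty']
    py_rng = rd.Random(seed)              # seeded stand-in for rd.choices
    np_rng = np.random.default_rng(seed)  # seeded stand-in for np.random.uniform
    ages = py_rng.choices(range(30, 45), k=len(names))
    heights = np_rng.uniform(1.6, 1.9, len(names)).round(2)
    return pd.DataFrame({'Age': ages, 'Height': heights}, index=names)


first = make_people(seed=0)
second = make_people(seed=0)
print(first.equals(second))  # True: the same seed reproduces the same frame
```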
DataFrame from a List of Dictionaries
It is possible to convert a list of dictionaries into a DataFrame. In this case, each dictionary becomes one row of the DataFrame, and the dictionary keys become the column labels.
Example
Creating a pandas DataFrame from a list of dictionaries.
dict1 = {'Name': 'Tom',
         'Age': 30,
         'Height': 1.88}

dict2 = {'Name': 'Gary',
         'Age': 38,
         'Height': 1.68}

pd.DataFrame([dict1, dict2])
| | Name | Age | Height |
---|---|---|---|
0 | Tom | 30 | 1.88 |
1 | Gary | 38 | 1.68 |
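The dictionaries in the example share the same keys. As a small sketch of the case where they do not (the second record below deliberately lacks a Height key), missing entries become NaN. The set_index call at the end is an optional extra step, not part of the example above, that promotes the Name column to the row index.

```python
import pandas as pd

records = [
    {'Name': 'Tom', 'Age': 30, 'Height': 1.88},
    {'Name': 'Gary', 'Age': 38},  # no 'Height' key (illustrative)
]

people = pd.DataFrame(records)
print(list(people.columns))              # ['Name', 'Age', 'Height']
print(people['Height'].isna().tolist())  # [False, True]

# Optionally promote the 'Name' column to the row index.
by_name = people.set_index('Name')
print(by_name.loc['Tom', 'Age'])         # 30
```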
DataFrame from a Two-dimensional Array
Given a 2-D NumPy array of data, we can create a DataFrame with any specified column and index labels. If omitted, an integer RangeIndex will be used.
First, let’s recall that the split() method applied to a string splits it on whitespace and returns a list of substrings. When the string consists of single characters separated by spaces, the result is a list of characters.
Example
Creating a list of characters from a string.
'A B C D E F'.split()
['A', 'B', 'C', 'D', 'E', 'F']
We now create a DataFrame where the column and index labels result from using the split() method.
Example
Creating a DataFrame from a 2-D NumPy array.
pd.DataFrame(np.random.rand(6, 4),
             index='A B C D E F'.split(),
             columns='W X Y Z'.split())
| | W | X | Y | Z |
---|---|---|---|---|
A | 0.074901 | 0.696408 | 0.153493 | 0.098724 |
B | 0.265094 | 0.659235 | 0.833043 | 0.985687 |
C | 0.930414 | 0.512948 | 0.539358 | 0.541957 |
D | 0.911736 | 0.975602 | 0.777425 | 0.922223 |
E | 0.255050 | 0.830163 | 0.964033 | 0.693914 |
F | 0.246925 | 0.060152 | 0.535843 | 0.622826 |
Note that np.random.rand(6,4) creates a 2-D array of the shape (6,4) and populates it with random samples from a uniform distribution over $[0, 1)$.
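One detail worth checking when labelling a 2-D array is that the number of index and column labels matches the array's shape. The sketch below uses a seeded np.random.default_rng generator instead of np.random.rand (an assumption made only so the example is repeatable) and shows that a mismatched index raises a ValueError.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded generator; the seed is arbitrary
values = rng.random((6, 4))      # uniform samples over [0, 1)

labelled = pd.DataFrame(values,
                        index='A B C D E F'.split(),
                        columns='W X Y Z'.split())
print(labelled.shape)  # (6, 4)

# The number of labels must match the array's shape.
try:
    pd.DataFrame(values, index=['A', 'B'], columns='W X Y Z'.split())
except ValueError as err:
    print('ValueError:', err)
```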
DataFrame from a Nested List
Another way of creating a DataFrame is to provide the data as a nested list, along with labels for the columns and the index.
Example
Creating a pandas DataFrame from a nested list.
data = [['Tom', 32, 1.75],
        ['Gary', 26, 1.83],
        ['Lois', 22, 1.69],
        ['Wendy', 31, 1.67],
        ['Betty', 35, 1.62]]
pd.DataFrame(data, columns=['Name', 'Age', 'Height'])
| | Name | Age | Height |
---|---|---|---|
0 | Tom | 32 | 1.75 |
1 | Gary | 26 | 1.83 |
2 | Lois | 22 | 1.69 |
3 | Wendy | 31 | 1.67 |
4 | Betty | 35 | 1.62 |
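The example above passes only column labels, so the rows fall back to the default integer index. The following sketch, using a shortened copy of the data and made-up row labels p1 to p3, shows the index argument in use and prints the dtype that pandas infers for each column.

```python
import pandas as pd

data = [['Tom', 32, 1.75],
        ['Gary', 26, 1.83],
        ['Lois', 22, 1.69]]

people = pd.DataFrame(data, columns=['Name', 'Age', 'Height'])

# Each column gets its own inferred dtype even though the rows are mixed lists.
print(people.dtypes)

# Passing index= replaces the default RangeIndex with custom row labels
# (the labels here are made up for illustration).
labelled = pd.DataFrame(data, columns=['Name', 'Age', 'Height'],
                        index=['p1', 'p2', 'p3'])
print(labelled.loc['p2', 'Name'])  # Gary
```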
DataFrame from a NumPy Structured Array
A NumPy structured array stores records with named, typed fields, which makes it resemble a stripped-down pandas DataFrame, so it comes as no surprise that the latter can be created directly from the former.
We first create a structured array using the sequences a, b and c defined earlier. The first step is to create the data types.
Example
Creating a pandas DataFrame from a NumPy structured array.
dt = np.dtype({
    'names': ('Name', 'Age', 'Height'),
    'formats': ('U10', int, float)})
print(dt)
[('Name', '<U10'), ('Age', '<i4'), ('Height', '<f8')]
The next step is to initialize the structured array. Since there are 5 entries in total, we allocate a 1-D structured array with 5 elements using the dt data type defined earlier. Note that np.empty does not initialize its entries, so the values printed below may differ from run to run.
data = np.empty(5, dt)
print(data)
[('', 0, 0.) ('', 0, 0.) ('', 0, 0.) ('', 0, 0.) ('', 0, 0.)]
The final step is to populate the empty structured array and convert it into a pandas DataFrame.
data['Name'] = a
data['Age'] = b
data['Height'] = c

pd.DataFrame(data)
| | Name | Age | Height |
---|---|---|---|
0 | Tom | 32 | 1.87 |
1 | Gary | 41 | 1.81 |
2 | Lois | 39 | 1.67 |
3 | Wendy | 36 | 1.78 |
4 | Betty | 40 | 1.74 |
This method of constructing a pandas DataFrame is not recommended unless the structured array is already available in the first place.
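When a structured array is already at hand, a more compact route may be possible. The sketch below is an alternative to the three-step construction above, not the method used in the text: it builds the array in one step from a list of tuples (values copied from the first three rows of the table above) and also shows pd.DataFrame.from_records, which can use one of the fields as the row index. The explicit 'i8' and 'f8' formats are an assumption chosen for portability.

```python
import numpy as np
import pandas as pd

dt = np.dtype([('Name', 'U10'), ('Age', 'i8'), ('Height', 'f8')])

# Build the structured array in one step from a list of tuples.
records = np.array([('Tom', 32, 1.87),
                    ('Gary', 41, 1.81),
                    ('Lois', 39, 1.67)], dtype=dt)

# The plain constructor turns each field into a column.
direct = pd.DataFrame(records)
print(list(direct.columns))            # ['Name', 'Age', 'Height']

# from_records can promote a field to the row index.
via_records = pd.DataFrame.from_records(records, index='Name')
print(via_records.loc['Gary', 'Age'])  # 41
```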