Introduction
This section discusses NumPy structured arrays, which provide efficient storage for compound, heterogeneous data.
Structured arrays are ndarrays whose datatype is a composition of simpler datatypes organized as a sequence of named fields. The following is an example of a structured array that consists of 3 named fields - “name”, “age” and “weight”.
      name  age  weight
0    Julia   25    55.0
1  Android   45    85.5
2     Rees   19    60.9
3   Calvin   37    68.0
4    Henry   19    61.5
Despite its 2-D, table-like appearance, the structured array above is actually 1-dimensional: it has 5 elements, each of which contains 3 fields.
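The 1-dimensional nature of a structured array can be checked directly. The following is a minimal sketch; the variable name `people` and the field types ('U10', 'i4', 'f8') are assumptions chosen for illustration.

```python
import numpy as np

# Assumed field types: 10-character string, 32-bit int, 64-bit float.
people = np.array(
    [("Julia", 25, 55.0), ("Android", 45, 85.5), ("Rees", 19, 60.9),
     ("Calvin", 37, 68.0), ("Henry", 19, 61.5)],
    dtype=[("name", "U10"), ("age", "i4"), ("weight", "f8")],
)

print(people.ndim)         # 1
print(people.shape)        # (5,)
print(people.dtype.names)  # ('name', 'age', 'weight')
```

The fields live in the dtype rather than in a second axis, which is why the shape has only one entry.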
Creating Structured Arrays
Imagine that we have several categories of data on a number of people (say, 'name', 'age', and 'weight'), and we would like to store these values in Python. One option is to store the three fields in three separate lists:
name = ['Julia', 'Android', 'Rees', 'Calvin', 'Henry']
age = [25, 45, 19, 37, 19]
weight = [55.0, 85.5, 60.9, 68.0, 61.5]
Unfortunately, these separate lists do not record how the fields relate to one another. It would be better to store the data in a single table-like structure on which we can perform simple analyses. NumPy structured arrays allow us to do that easily, though a pandas DataFrame is usually a better choice for larger compound datasets of this kind.
Compound Data Type
The first step to create a structured array is to define the data types of the different fields (columns).
Example
Specifying the compound data type with a list of tuples.
import numpy as np
# Explicit widths keep the layout identical on every platform
dt = np.dtype([('name', 'U10'), ('age', np.int32), ('weight', np.float64)])
print(dt)
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
For the 'name' field, we use the 'U10' data type, which means a Unicode string of at most 10 characters. For the 'age' field, we use np.int32, a 32-bit signed integer. Finally, for the 'weight' field we use np.float64, a 64-bit floating-point number. Explicit widths are preferable to the aliases np.int_ and np.float_: the width of np.int_ depends on the platform and NumPy version, and np.float_ was removed in NumPy 2.0.
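To make the meaning of these field types concrete, the sketch below inspects a compound dtype and shows what happens when a string is longer than the field allows. It assumes the fixed-width types np.int32 and np.float64 so that the sizes are the same on every platform; the sample record is made up for illustration.

```python
import numpy as np

dt = np.dtype([("name", "U10"), ("age", np.int32), ("weight", np.float64)])

print(dt.names)     # ('name', 'age', 'weight')
print(dt["age"])    # int32
print(dt.itemsize)  # 52 bytes per record: 10 chars * 4 bytes + 4 + 8

# A hypothetical 11-character name is silently cut to fit the 'U10' field.
row = np.array([("Bartholomew", 30, 70.0)], dtype=dt)
print(row["name"][0])  # Bartholome
```

Because the truncation is silent, it is worth choosing a string width that comfortably covers the longest expected value.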
Alternatively, we can also use a Python dictionary to specify the compound data type.
Example
Specifying the compound data type using a dictionary.
# Python's int and float would map to platform-default widths, so explicit types are used
dt = np.dtype({
    'names': ('name', 'age', 'weight'),
    'formats': ('U10', np.int32, np.float64)})
print(dt)
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
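The two ways of specifying the data type describe the same layout, which the following sketch checks. It also shows the optional 'offsets' key of the dictionary form; the offsets used here (0, 40, 48) and the names `dt_list`, `dt_dict` and `dt_padded` are assumptions chosen only to illustrate the idea.

```python
import numpy as np

names = ["name", "age", "weight"]
formats = ["U10", np.int32, np.float64]

dt_list = np.dtype(list(zip(names, formats)))
dt_dict = np.dtype({"names": names, "formats": formats})
print(dt_list == dt_dict)  # True

# The dictionary form can also place fields at explicit byte offsets.
# Here 'weight' starts at byte 48 instead of 44, leaving 4 bytes of padding.
dt_padded = np.dtype({"names": names, "formats": formats,
                      "offsets": [0, 40, 48]})
print(dt_padded.itemsize)  # 56
```

Explicit offsets are rarely needed for everyday work, but they matter when the array must match an existing binary layout.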
Initialization and Printing
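Since there are 5 entries in total, we create an uninitialized 1-D structured array with 5 elements using the dt data type defined earlier. Note that np.empty does not set the values, so the contents printed before the array is populated are arbitrary and may differ from the output shown here.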
Example
Initializing the structured array.
data = np.empty(5, dt)
print(data)
print(data.dtype)
[('', 0, 0.) ('', 0, 0.) ('', 0, 0.) ('', 0, 0.) ('', 0, 0.)]
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
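Because np.empty leaves the memory uninitialized, the blank records shown above are not guaranteed. If a predictable starting state is wanted, np.zeros is one alternative, sketched here with the field types assumed to be fixed-width.

```python
import numpy as np

dt = np.dtype([("name", "U10"), ("age", np.int32), ("weight", np.float64)])

# np.zeros fills every field with its zero value:
# an empty string for 'name', 0 for 'age' and 0.0 for 'weight'.
data = np.zeros(5, dtype=dt)
print(data)
```

The cost of zero-filling is negligible for an array of this size, so np.empty is mainly worthwhile for large arrays that are overwritten immediately.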
The final step is to populate the empty structured array and print the result. For pretty printing, we import the pandas package, whose DataFrame constructor accepts a structured array directly.
Example
Populating the empty structured array.
data['name'] = name
data['age'] = age
data['weight'] = weight
import pandas as pd
pd.DataFrame(data)
      name  age  weight
0    Julia   25    55.0
1  Android   45    85.5
2     Rees   19    60.9
3   Calvin   37    68.0
4    Henry   19    61.5
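As an alternative to allocating the array and filling it field by field, the same array can be built in one step from the three lists. This is a sketch of one possible approach, with fixed-width field types assumed.

```python
import numpy as np

name = ["Julia", "Android", "Rees", "Calvin", "Henry"]
age = [25, 45, 19, 37, 19]
weight = [55.0, 85.5, 60.9, 68.0, 61.5]

dt = np.dtype([("name", "U10"), ("age", np.int32), ("weight", np.float64)])

# zip() pairs the i-th entries of each list into one (name, age, weight)
# tuple per person. Each record must be a tuple, not a list.
data = np.array(list(zip(name, age, weight)), dtype=dt)
print(data["name"])
print(data["age"])
```

This form avoids ever exposing uninitialized values, at the price of building an intermediate list of tuples.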
Indexing and Slicing
In structured arrays, we can refer to values either by index or by name.
Example
Access all elements of a field.
data['name'] # get all names
array(['Julia', 'Android', 'Rees', 'Calvin', 'Henry'], dtype='<U10')
Example
Access all elements of a row.
data[1] # second row
('Android', 45, 85.5)
Example
Access all elements of the last row.
data[-1] # last row
('Henry', 19, 61.5)
Example
Access a certain field of the last row.
data[-1]['name'] # name from last row
'Henry'
Example
Access the first two rows. Here, we once again use the pandas DataFrame constructor for pretty output.
pd.DataFrame(data[0:2]) # first two rows
      name  age  weight
0    Julia   25    55.0
1  Android   45    85.5
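Two further points about indexing may be useful: selecting a single field returns a view onto the original array rather than a copy, and several fields can be selected at once by passing a list of names. The sketch below illustrates both; the changed age is a made-up value used only to show the effect.

```python
import numpy as np

dt = np.dtype([("name", "U10"), ("age", np.int32), ("weight", np.float64)])
data = np.array(
    [("Julia", 25, 55.0), ("Android", 45, 85.5), ("Rees", 19, 60.9),
     ("Calvin", 37, 68.0), ("Henry", 19, 61.5)],
    dtype=dt,
)

# A single field is a view: writing through it changes 'data' itself.
ages = data["age"]
ages[0] = 26
print(data[0]["age"])  # 26

# A list of field names selects several fields at once.
subset = data[["name", "weight"]]
print(subset.dtype.names)  # ('name', 'weight')
```

If an independent copy of a field is needed, calling `.copy()` on the result avoids accidental changes to the original array.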
Filtering
We can perform simple filtering on structured arrays by indexing them with Boolean masks built from their fields. For more complicated operations, the pandas package is the better tool.
Example
Filtering of data with a single condition.
pd.DataFrame(data[data['age'] < 30]) # only entries of persons whose age < 30
    name  age  weight
0  Julia   25    55.0
1   Rees   19    60.9
2  Henry   19    61.5
Example
Filtering of data with multiple conditions.
cond1 = data['age'] < 30 # age < 30
cond2 = data['weight'] > 60 # weight > 60
pd.DataFrame(data[cond1 & cond2])
    name  age  weight
0   Rees   19    60.9
1  Henry   19    61.5
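The conditions can also be written inline, in which case each comparison needs its own parentheses because `&` binds more tightly than `<` and `>`. The sketch below shows this, together with negating a mask and aggregating a field of the filtered result. The variable names are chosen for illustration.

```python
import numpy as np

dt = np.dtype([("name", "U10"), ("age", np.int32), ("weight", np.float64)])
data = np.array(
    [("Julia", 25, 55.0), ("Android", 45, 85.5), ("Rees", 19, 60.9),
     ("Calvin", 37, 68.0), ("Henry", 19, 61.5)],
    dtype=dt,
)

# Parentheses are required around each comparison when combining with &.
young_heavy = data[(data["age"] < 30) & (data["weight"] > 60)]
print(young_heavy["name"])  # ['Rees' 'Henry']

# ~ inverts a Boolean mask: everyone who is NOT under 30.
not_young = data[~(data["age"] < 30)]
print(not_young["name"])  # ['Android' 'Calvin']

# A field of the filtered array is an ordinary array and can be aggregated.
mean_weight = data[data["age"] < 30]["weight"].mean()
print(round(mean_weight, 2))  # 59.13
```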
Sorting
We can perform simple sorting on structured arrays too.
Example
Sorting a structured array by a single field (ascending).
df_sorted = np.sort(data, order='age')
pd.DataFrame(df_sorted)
      name  age  weight
0    Henry   19    61.5
1     Rees   19    60.9
2    Julia   25    55.0
3   Calvin   37    68.0
4  Android   45    85.5
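In the above example, the entries are sorted in ascending order by the 'age' column. Ties are broken by the remaining fields in the order they appear in the data type, which is why Henry precedes Rees.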
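np.sort returns a sorted copy and leaves the original array untouched. Two related options are sketched below: np.argsort, which returns the ordering as indices, and the ndarray.sort method, which sorts in place. The variable names are assumptions for illustration.

```python
import numpy as np

dt = np.dtype([("name", "U10"), ("age", np.int32), ("weight", np.float64)])
data = np.array(
    [("Julia", 25, 55.0), ("Android", 45, 85.5), ("Rees", 19, 60.9),
     ("Calvin", 37, 68.0), ("Henry", 19, 61.5)],
    dtype=dt,
)

# argsort gives the row indices that would sort the array by 'age'.
idx = np.argsort(data, order="age")
sorted_copy = data[idx]
print(sorted_copy["age"])  # [19 19 25 37 45]
first_name_before = data["name"][0]  # still 'Julia': data is unchanged

# ndarray.sort reorders the array itself and returns None.
data.sort(order="age")
print(data["age"])  # [19 19 25 37 45]
```

The index form is handy when the same ordering must be applied to another array of the same length.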
Example
Sorting a structured array by a single field (descending).
df_sorted = np.sort(data, order='age')
pd.DataFrame(df_sorted[::-1])
      name  age  weight
0  Android   45    85.5
1   Calvin   37    68.0
2    Julia   25    55.0
3     Rees   19    60.9
4    Henry   19    61.5
In the above example, the entries are sorted in descending order by the 'age' column.
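Reversing an ascending sort also reverses the order of tied entries. Where the original order of ties should be kept, one possible alternative is a stable argsort on the negated key, sketched below. It relies on the key being numeric, since strings cannot be negated.

```python
import numpy as np

dt = np.dtype([("name", "U10"), ("age", np.int32), ("weight", np.float64)])
data = np.array(
    [("Julia", 25, 55.0), ("Android", 45, 85.5), ("Rees", 19, 60.9),
     ("Calvin", 37, 68.0), ("Henry", 19, 61.5)],
    dtype=dt,
)

# Negating the ages makes the largest age sort first.
# kind="stable" keeps tied rows in their original relative order,
# so Rees (row 2) stays ahead of Henry (row 4).
idx = np.argsort(-data["age"], kind="stable")
descending = data[idx]
print(descending["name"])  # ['Android' 'Calvin' 'Julia' 'Rees' 'Henry']
```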
Example
Sorting a structured array by multiple fields (ascending).
df_sorted = np.sort(data, order=['age', 'weight'])
pd.DataFrame(df_sorted)
      name  age  weight
0     Rees   19    60.9
1    Henry   19    61.5
2    Julia   25    55.0
3   Calvin   37    68.0
4  Android   45    85.5
In the above example, the entries are sorted in ascending order by age and then by weight if ages are equal.
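The order argument always sorts every listed field in ascending order. For a mixed ordering, such as age ascending and weight descending, np.lexsort with a negated key is one option. The following is a sketch under the assumption that the descending key is numeric.

```python
import numpy as np

dt = np.dtype([("name", "U10"), ("age", np.int32), ("weight", np.float64)])
data = np.array(
    [("Julia", 25, 55.0), ("Android", 45, 85.5), ("Rees", 19, 60.9),
     ("Calvin", 37, 68.0), ("Henry", 19, 61.5)],
    dtype=dt,
)

# np.lexsort treats the LAST key as the primary one, so 'age' comes last.
# Negating 'weight' turns its ascending sort into a descending one.
idx = np.lexsort((-data["weight"], data["age"]))
mixed = data[idx]
print(mixed["name"])  # ['Henry' 'Rees' 'Julia' 'Calvin' 'Android']
```

Among the two 19-year-olds, Henry now comes first because his weight (61.5) is the larger of the two.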