Introduction

This section discusses NumPy structured arrays, which provide efficient storage for compound, heterogeneous data.

Structured arrays are ndarrays whose datatype is a composition of simpler datatypes organized as a sequence of named fields. The following is an example of a structured array that consists of 3 named fields: 'name', 'age', and 'weight'.

      name  age  weight
0    Julia   25    55.0
1  Android   45    85.5
2     Rees   19    60.9
3   Calvin   37    68.0
4    Henry   19    61.5

Although the structured array above looks two-dimensional, it is actually 1-dimensional: it has 5 elements, and each element is a record with 3 fields.
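The one-dimensional layout can be checked directly. The sketch below rebuilds the table above and inspects its shape; the field types ('U10', 'i4', 'f8') and the variable name people are assumptions made for illustration, since the table itself does not state them.

```python
import numpy as np

# Field types are assumed for this illustration: a 10-character Unicode
# string, a 32-bit integer, and a 64-bit float.
people = np.array(
    [('Julia', 25, 55.0), ('Android', 45, 85.5), ('Rees', 19, 60.9),
     ('Calvin', 37, 68.0), ('Henry', 19, 61.5)],
    dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f8')],
)

print(people.ndim)         # number of dimensions
print(people.shape)        # a single axis holding the five records
print(people.dtype.names)  # the field names of each record
```

The fields do not form a second axis; they are part of the data type of each element.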

Creating Structured Arrays

Imagine that we have several categories of data on a number of people (say, 'name', 'age', and 'weight'), and we would like to store these values in Python. One option is to store the data from these 3 fields in three separate lists:

name = ['Julia', 'Android', 'Rees', 'Calvin', 'Henry']
age = [25, 45, 19, 37, 19]
weight = [55.0, 85.5, 60.9, 68.0, 61.5]

Unfortunately, separate lists do not record how the different fields are related. It would be better to store the data in a single table-like structure on which we can perform simple analyses. NumPy structured arrays allow us to do that easily, though a pandas DataFrame is often a better choice for larger compound datasets of this kind.

Compound Data Type

The first step to create a structured array is to define the data types of the different fields (columns).

Example

Specifying the compound data type with a list of tuples.

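import numpy as np
# Each tuple is (field name, field data type).
dt = np.dtype([('name', 'U10'), ('age', np.int32), ('weight', np.float64)])
print(dt)
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]

For the 'name' field, we use the 'U10' data type, which is a Unicode string of at most 10 characters. For the 'age' field, we use np.int32, a 32-bit signed integer. Finally, for the 'weight' field we use np.float64, a 64-bit floating-point number. Explicit widths are used here because np.int_ is platform dependent (it is 64-bit on most systems) and the np.float_ alias was removed in NumPy 2.0. The '<' prefix in the printed dtype denotes little-endian byte order.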

Alternatively, we can also use a Python dictionary to specify the compound data type.

Example

Specifying the compound data type using a dictionary.

# Explicit 32-bit and 64-bit types keep the result identical across platforms.
dt = np.dtype({
    'names': ('name', 'age', 'weight'),
    'formats': ('U10', np.int32, np.float64)})
print(dt)
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
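As a quick check that the two specifications are interchangeable, the following sketch builds the compound type both ways and inspects it. The names dt_list and dt_dict are illustrative only, and the byte counts assume the default unaligned layout.

```python
import numpy as np

# The same compound type, written as a list of tuples and as a dictionary.
dt_list = np.dtype([('name', 'U10'), ('age', np.int32), ('weight', np.float64)])
dt_dict = np.dtype({
    'names': ['name', 'age', 'weight'],
    'formats': ['U10', np.int32, np.float64],
})

print(dt_list == dt_dict)  # both specifications describe the same layout
print(dt_list.names)       # field names, in order
print(dt_list['age'])      # data type of a single field
print(dt_list.itemsize)    # bytes per record: 40 (U10) + 4 + 8

# Each entry of .fields starts with (field dtype, byte offset in the record).
for field in dt_list.names:
    field_dtype, offset = dt_list.fields[field][:2]
    print(field, field_dtype, offset)
```

Each 'U10' value occupies 40 bytes because NumPy stores Unicode strings with 4 bytes per character.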

Initialization and Printing

Since there are a total of 5 entries, we initialize a 1-D structured array with 5 elements using the dt data type defined earlier. We use np.zeros rather than np.empty so that the initial contents are well defined; np.empty leaves the memory uninitialized, so it may show arbitrary values.

Example

Initializing the structured array.

data = np.zeros(5, dtype=dt)
print(data)
print(data.dtype)
[('', 0, 0.) ('', 0, 0.) ('', 0, 0.) ('', 0, 0.) ('', 0, 0.)]
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]

The final step is to populate the structured array and print the result. For pretty printing, we import the pandas package and pass the structured array to pd.DataFrame, which displays it as a table.

Example

Populating the empty structured array.

import pandas as pd
# Assign each list to its field; the values are converted to the field's type.
data['name'] = name
data['age'] = age
data['weight'] = weight
pd.DataFrame(data)
      name  age  weight
0    Julia   25    55.0
1  Android   45    85.5
2     Rees   19    60.9
3   Calvin   37    68.0
4    Henry   19    61.5
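The two steps above (initialize, then populate field by field) can also be collapsed into one. A minimal sketch, assuming the same three lists and an equivalent dtype, zips the lists into one tuple per person and passes them to np.array; the name records is introduced here for illustration.

```python
import numpy as np

name = ['Julia', 'Android', 'Rees', 'Calvin', 'Henry']
age = [25, 45, 19, 37, 19]
weight = [55.0, 85.5, 60.9, 68.0, 61.5]

dt = np.dtype([('name', 'U10'), ('age', np.int32), ('weight', np.float64)])

# zip pairs up the i-th entry of every list, giving one tuple per person.
records = list(zip(name, age, weight))

# Each tuple becomes one element of the structured array.
data = np.array(records, dtype=dt)
print(data)
```

The records must be tuples rather than lists; NumPy interprets a tuple as a single structured element.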

Indexing and Slicing

In structured arrays, we can refer to values either by index or by name.

Example

Access all elements of a field.

data['name']  # get all names
array(['Julia', 'Android', 'Rees', 'Calvin', 'Henry'], dtype='<U10')

Example

Access all elements of a row.

data[1]  # second row
('Android', 45, 85.5)

Example

Access all elements of last row.

data[-1]  # last row
('Henry', 19, 61.5)

Example

Access a certain field of last row.

data[-1]['name']  # name from last row
'Henry'

Example

Access first two rows.

Here, we once again use the pandas DataFrame constructor for pretty output.

pd.DataFrame(data[0:2])  # first two rows
      name  age  weight
0    Julia   25    55.0
1  Android   45    85.5
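The same indexing forms can be used to select several fields at once and to modify values. The sketch below is self-contained and rebuilds the example array; the replacement values (46, 26, 56.0) are made up purely to show assignment.

```python
import numpy as np

dt = np.dtype([('name', 'U10'), ('age', np.int32), ('weight', np.float64)])
data = np.array(
    [('Julia', 25, 55.0), ('Android', 45, 85.5), ('Rees', 19, 60.9),
     ('Calvin', 37, 68.0), ('Henry', 19, 61.5)],
    dtype=dt,
)

# Indexing with a list of field names keeps only those fields.
subset = data[['name', 'weight']]
print(subset.dtype.names)

# Selecting a single field returns a view, so assigning through it
# changes the original array.
data['age'][1] = 46
print(data[1])

# A whole record can be replaced by assigning a tuple.
data[0] = ('Julia', 26, 56.0)
print(data[0])
```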

Filtering

We can perform simple filtering on structured arrays by indexing them with Boolean masks. For more complicated operations, the pandas package is usually the better tool.

Example

Filtering of data with single condition.

pd.DataFrame(data[data['age'] < 30])  # only entries of persons whose age < 30
    name  age  weight
0  Julia   25    55.0
1   Rees   19    60.9
2  Henry   19    61.5

Example

Filtering of data with multiple conditions.

cond1 = data['age'] < 30  # age < 30
cond2 = data['weight'] > 60  # weight > 60
pd.DataFrame(data[cond1 & cond2])  # both conditions must hold
    name  age  weight
0   Rees   19    60.9
1  Henry   19    61.5
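Conditions can also be combined with | (or) and negated with ~ (not). The following sketch uses thresholds chosen only for illustration (age < 20, weight > 80) on a rebuilt copy of the example array. Note that each comparison is wrapped in parentheses, because | and & bind more tightly than < and > in Python.

```python
import numpy as np

dt = np.dtype([('name', 'U10'), ('age', np.int32), ('weight', np.float64)])
data = np.array(
    [('Julia', 25, 55.0), ('Android', 45, 85.5), ('Rees', 19, 60.9),
     ('Calvin', 37, 68.0), ('Henry', 19, 61.5)],
    dtype=dt,
)

# True where a person is younger than 20 OR heavier than 80.
mask_or = (data['age'] < 20) | (data['weight'] > 80)

print(data['name'][mask_or])      # names matching either condition
print(data['name'][~mask_or])     # names matching neither condition
print(np.count_nonzero(mask_or))  # how many entries matched
```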

Sorting

We can perform simple sorting on structured arrays too.

Example

Sorting a structured array by single field (ascending).

df_sorted = np.sort(data, order='age')  # returns a sorted copy
pd.DataFrame(df_sorted)
      name  age  weight
0    Henry   19    61.5
1     Rees   19    60.9
2    Julia   25    55.0
3   Calvin   37    68.0
4  Android   45    85.5

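In the above example, the entries are sorted in ascending order by the 'age' column. Where ages are equal, np.sort breaks the tie using the remaining fields in dtype order, which is why 'Henry' precedes 'Rees'.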

Example

Sorting a structured array by single field (descending).

df_sorted = np.sort(data, order='age')
pd.DataFrame(df_sorted[::-1])  # reverse the ascending result
      name  age  weight
0  Android   45    85.5
1   Calvin   37    68.0
2    Julia   25    55.0
3     Rees   19    60.9
4    Henry   19    61.5

In the above example, the entries are sorted in descending order by the 'age' column.

Example

Sorting a structured array by multiple fields (ascending).

df_sorted = np.sort(data, order=['age', 'weight'])
pd.DataFrame(df_sorted)
      name  age  weight
0     Rees   19    60.9
1    Henry   19    61.5
2    Julia   25    55.0
3   Calvin   37    68.0
4  Android   45    85.5

In the above example, the entries are sorted in ascending order by age and then by weight if ages are equal.
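Two related tools may be useful alongside np.sort. The sketch below, on a rebuilt copy of the example array, uses np.argsort to obtain the sorting permutation without copying the records, and the ndarray.sort method to sort in place. The variable names idx and names_by_age are illustrative.

```python
import numpy as np

dt = np.dtype([('name', 'U10'), ('age', np.int32), ('weight', np.float64)])
data = np.array(
    [('Julia', 25, 55.0), ('Android', 45, 85.5), ('Rees', 19, 60.9),
     ('Calvin', 37, 68.0), ('Henry', 19, 61.5)],
    dtype=dt,
)

# argsort returns the positions that would sort the array by age,
# then by weight; the array itself is left unchanged.
idx = np.argsort(data, order=['age', 'weight'])
print(idx)

# Apply the permutation to a single field.
names_by_age = data['name'][idx]
print(names_by_age)

# The sort method reorders the array itself instead of returning a copy.
data.sort(order='weight')
print(data['name'])
```

The index array is handy when the same ordering has to be applied to other arrays of the same length.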