Python’s NumPy library is the standard for working with numerical data. This article will explain why that is, and how to use it.
In addition, code examples for basic usage will be shown, and we will build a real-world example that uses NumPy to work with some example data.
Python and Arrays
Whether you are working with data collected for scientific or business reasons, it is usually collected in the form of arrays of data presented as tables with single or multiple columns.
A single column is a simple array – a one-dimensional ordered list of values.
When two or more columns are present, a matrix (a two-dimensional array) is created. Each row of data is its own array, so in effect, you have an array (the rows) that contains arrays (each row’s columns).
In computer programming, these are known as multidimensional arrays. You can go further and have three, four, or more dimensions by nesting more arrays. Three-dimensional arrays, for example, can be used to represent objects and movement in 3D space and are commonly used when building video games.
Python Lists Are Not Efficient for These Tasks
Python’s built-in list class holds an ordered collection of values, and can contain other lists – so it would seem sensible to use that to hold your arrays of data.
However, this is not the case. The Python list data type, while functionally capable, is not designed for this.
Firstly, Python lists are not efficient for numerical work. A list stores references to separate Python objects rather than the values themselves, so the values are not stored contiguously (side by side) in your computer’s memory, which makes them slower to access when you retrieve or modify them.
Secondly, they can become unwieldy quite quickly – if you’re working with large matrices or arrays with many dimensions, it can quickly become tough to manage the data within them using the basic methods provided for working with Python lists.
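To make the second point concrete, here is a small sketch that totals each column of a table, first with nested Python lists and then with NumPy. The variable names are made up for this illustration, and the example only shows the difference in code, not in speed:

```python
import numpy as np

# The same small table, first as nested Python lists, then as a NumPy array
rows = [[3, 5, 2], [4, 5, 9], [11, 0, 4]]
table = np.array(rows)

# Summing each column of the nested lists needs an explicit loop or comprehension
listColumnSums = [sum(row[i] for row in rows) for i in range(len(rows[0]))]

# NumPy expresses the same operation as a single call
numpyColumnSums = table.sum(axis=0)

print(listColumnSums)   # [18, 10, 15]
print(numpyColumnSums)  # [18 10 15]
```

The list version is still manageable for a table this small, but the bookkeeping grows with every extra dimension.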
What is NumPy?
Enter NumPy (Numerical Python).
NumPy is a Python library specifically built for working with numerical data stored in arrays of any size and dimension.
It is memory efficient and built with highly optimized code. NumPy’s core is written in the C programming language, allowing it to run much quicker than libraries that are written in Python only (don’t worry, you don’t need to know C to use NumPy – it presents an entirely Python interface).
NumPy’s syntax is also easy to use, offering pre-built functions for working with linear algebra, random numbers, Fourier transforms, and more.
Why Do You Need NumPy?
For these reasons alone, it makes sense to use NumPy. Your code will run faster, and as it’s easier to write and understand, will be more reliable.
One of the requirements when working with any data is accuracy, so better code = better results. NumPy also provides tools to address floating point number errors. The numpy.around() function can be used for rounding and correcting for floating point errors, for example.
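As a small illustration of the floating point problem, the classic 0.1 + 0.2 sum does not come out as exactly 0.3. The sketch below rounds it to 10 decimal places, which is an arbitrary choice for this example rather than a recommended setting:

```python
import numpy as np

raw = 0.1 + 0.2                        # Binary floating point cannot store this sum exactly
cleaned = np.around(raw, decimals=10)  # Round away the tiny error

print(raw)             # 0.30000000000000004
print(raw == 0.3)      # False
print(float(cleaned))  # 0.3
```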
Everyone’s Using It
Everyone from scientists and engineers to advertisers looking at large data sets uses NumPy. Because of this, there is a huge number of tutorials, questions and answers, and examples available. If you’re stuck, there will be somewhere you can ask a question, most likely even if it is industry-specific.
NumPy’s documentation is thorough, and provides its own quickstart guides, tutorials and how-tos, so getting started is easy, no matter what it is you need to accomplish.
The only thing you need to use NumPy is Python (version 3, not version 2!). Python comes in several flavours, so here are installation instructions for the most popular – regular Python and Python provided through the Anaconda data science platform.
Installing NumPy in Vanilla Python using pip
Pip is the most popular package manager for Python and can be used to install NumPy. If you don’t have pip installed, here’s how to do it. Once installed, installing NumPy is as simple as running:
pip install numpy
virtualenv is recommended when working with multiple Python projects – it lets you install Python packages per-project rather than globally, which is good if you need to use different versions of a package for different projects. Keeping things compartmentalized can help when debugging and moving your application to another computer (or collaborating with others), too.
To create a virtual environment and install NumPy, run the following commands:
# Install virtualenv globally (the default Python environment)
pip install virtualenv
# Create a new virtualenv environment, which will act as the root of your project
virtualenv -p python3 MyNumPyProject
# Change directory to the project
cd MyNumPyProject
# Activate the virtual environment (Linux and macOS shells)
source bin/activate
# Install NumPy in this project's virtual environment
pip install numpy
If you want to return to your global (default) Python environment, run:
deactivate
Installing NumPy in Anaconda
Install NumPy in Anaconda by running:
conda install numpy
Anaconda has its own virtual environments implementation as well, which you should utilize for the same reasons outlined for the default Python installation above:
# Create and activate a new environment for your project
conda create -n MyNumPyProject
conda activate MyNumPyProject
# Install NumPy
conda install numpy
Importing NumPy Into Your Python Script
Before you can use NumPy in your Python scripts, you need to import the library. Add the following to the top of your Python file:
import numpy as np
Above, numpy is imported under the name np, so that you don’t have to type out numpy every time you call the library. You could use whatever name you want, so long as it isn’t in use by any other library or variable, but np is the convention used in most existing projects and examples, so it’s best to stick to it if you can.
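If you want to confirm that the installation and import worked, one quick check is to print the installed version (the exact number you see will depend on your installation):

```python
import numpy as np

# Printing the version is a quick way to confirm the import worked
print(np.__version__)
```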
Now you’re ready to use NumPy – here are some basic examples to get started.
Creating a NumPy Array
Before you can work on your data, you need to create an array to hold it.
myArray = np.array([3, 7, 2, 4, 1, 0])
Above, a one-dimensional array is created containing some numbers. These numbers have no meaning; they’re just some randomly picked numbers used as an example.
Each value in the array can be accessed by its index:
myArray[0] # Will return the first value of the array, 3
Indexes are the position of each item in an array. Indexes start counting at position 0, with the second position being index 1, and so on.
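Here is a short sketch of a few ways to index the array defined above. The variable names on the left are just for illustration:

```python
import numpy as np

myArray = np.array([3, 7, 2, 4, 1, 0])

first = myArray[0]     # 3, the value at index 0
second = myArray[1]    # 7, the value at index 1
last = myArray[-1]     # 0, negative indexes count back from the end
middle = myArray[1:4]  # A slice holding the values at indexes 1, 2 and 3: 7, 2, 4
```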
Multidimensional arrays are declared the same way:
myMultiDimensionalArray = np.array([[3, 5, 2], [4, 5, 9], [11, 0 ,4]])
Above, an array of arrays is declared, creating a matrix. Presented visually, it would look like this:
3 5 2
4 5 9
11 0 4
Values in multidimensional arrays are also accessed by their index – but as there are two dimensions, two indexes are required to locate the value:
myMultiDimensionalArray[1][2] # Will return 9, the value stored in the second row (index 1 in the outer array) in the third column (index 2 of the inner array).
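As a sketch of two-index access, the example below reads a single value, a whole row, and a whole column from the matrix above. NumPy also accepts both indexes inside one pair of brackets, separated by a comma, which is the form used here:

```python
import numpy as np

myMultiDimensionalArray = np.array([[3, 5, 2], [4, 5, 9], [11, 0, 4]])

value = myMultiDimensionalArray[1, 2]        # 9: second row, third column
secondRow = myMultiDimensionalArray[1]       # The whole second row: 4, 5, 9
thirdColumn = myMultiDimensionalArray[:, 2]  # The whole third column: 2, 9, 4

print(myMultiDimensionalArray.shape)  # (3, 3): three rows, three columns
```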
Once an array is defined, you can modify it as you need, adding and removing values, and even changing the shape of the array.
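A minimal sketch of these modifications follows. One thing worth knowing is that NumPy arrays have a fixed size, so np.append() and np.delete() return a new array rather than changing the original:

```python
import numpy as np

myArray = np.array([3, 7, 2, 4, 1, 0])

myArray[0] = 8                    # Change a value in place
longer = np.append(myArray, 5)    # New array with 5 added to the end
shorter = np.delete(myArray, 1)   # New array without the value at index 1
reshaped = myArray.reshape(2, 3)  # The same six values arranged as 2 rows of 3

print(longer)    # [8 7 2 4 1 0 5]
print(shorter)   # [8 2 4 1 0]
print(reshaped.shape)  # (2, 3)
```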
Rounding All Values in A NumPy Array
This first example starts simple – rounding all of the values in a NumPy array using numpy.around():
myArray = np.array([3.11, 7.66, 2.55, 4.32, 1.04, 0.01])
roundedArray = np.around(myArray, decimals=1) # The decimals parameter sets the number of decimal places
print(roundedArray) # Will print the values 3.1, 7.7, 2.6, 4.3, 1.0 and 0.0
Rounding using numpy.around() is not always accurate. If accuracy is a concern, check out our article on rounding numbers in Python.
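One behaviour that often surprises people is how numpy.around() treats values that sit exactly halfway between two whole numbers: it rounds them to the nearest even number rather than always rounding up. A quick sketch:

```python
import numpy as np

halves = np.array([0.5, 1.5, 2.5, 3.5])

# Values exactly halfway between two integers go to the nearest even integer
roundedHalves = np.around(halves)
print(roundedHalves)  # [0. 2. 2. 4.]
```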
Analysis First Steps – Mean and Median
A common task when working with numerical data sets is calculating the mean and median. The mean is another word for the average of a data set, while the median is the middle value once the data is sorted (not to be confused with the mode, which is the most common value found in a data set).
myArray = np.array([3, 7, 2, 4, 1, 0, 2, 2])
mean = np.mean(myArray) # Will be calculated as 2.625
median = np.median(myArray) # Will be calculated as 2.0
2.625 is returned as the mean as it is the average of all of the numbers (their sum, 21, divided by 8), while 2 is returned as the median as it is the middle value once the numbers are sorted. In this data set, 2 also happens to be the most commonly occurring value (the mode).
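Because the mean, median and mode are easy to mix up, here is a sketch that calculates all three for the same array. NumPy does not provide a dedicated mode function, so the approach below, which counts values with np.unique(), is just one possible way of finding it:

```python
import numpy as np

myArray = np.array([3, 7, 2, 4, 1, 0, 2, 2])

mean = np.mean(myArray)      # 2.625: the sum (21) divided by the count (8)
median = np.median(myArray)  # 2.0: the middle of the sorted values 0, 1, 2, 2, 2, 3, 4, 7

# Count how often each distinct value occurs, then pick the most frequent one
values, counts = np.unique(myArray, return_counts=True)
mode = values[np.argmax(counts)]  # 2: it occurs three times
```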
The standard deviation of a data set is a measure of the amount of variation in the data (how spread out the values are). The standard deviation is represented by the Greek character σ (though you won’t see that often in programming languages as they tend to stick to alphanumeric characters for function and variable names).
The standard deviation is calculated by finding the average of the squared differences of each value from the mean, and then taking the square root of that average. This isn’t a maths blog, so I’ll leave it to someone better informed on the matter to explain the how and why.
You can skip the calculations by using NumPy’s numpy.std() function:
myArray = np.array([3, 7, 2, 4, 1, 0, 2, 2])
standardDev = np.std(myArray) # Will be calculated as approximately 1.996
Above, the standard deviation is calculated, showing that the standard deviation for the population provided is approximately 2.
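If you are curious what numpy.std() does under the hood, the sketch below repeats the calculation by hand and compares the result. It also shows the ddof parameter, which switches from the population standard deviation (the default) to the sample standard deviation:

```python
import numpy as np

myArray = np.array([3, 7, 2, 4, 1, 0, 2, 2])

# Square root of the average squared difference from the mean
squaredDifferences = (myArray - np.mean(myArray)) ** 2
manualStd = np.sqrt(np.mean(squaredDifferences))

populationStd = np.std(myArray)      # Matches manualStd, about 1.996
sampleStd = np.std(myArray, ddof=1)  # Divides by n - 1 instead of n, so slightly larger

print(manualStd, populationStd, sampleStd)
```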
This function can also be used on multidimensional arrays by specifying the axis parameter.
numpy.std() has long been part of NumPy, so it should be available in whichever version you have installed.
Finding the Largest Number
The numpy.amax() function returns the largest number in an array:
myArray = np.array([3, 7, 2, 4, 1, 0, 2, 2]) np.amax(myArray) # Will return 7
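A few closely related functions are often useful alongside numpy.amax(). This sketch also finds the smallest value and the position of the largest one:

```python
import numpy as np

myArray = np.array([3, 7, 2, 4, 1, 0, 2, 2])

largest = np.amax(myArray)         # 7
smallest = np.amin(myArray)        # 0
largestIndex = np.argmax(myArray)  # 1, the index where the 7 is stored
```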
Calculating the Results from Multidimensional arrays
By default, NumPy will flatten the array and treat the values as one big single dimensional array when performing calculations.
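You can operate on a specific axis of your multidimensional array by specifying the axis parameter.
The axis specifies the dimension along which the calculation runs. For a two-dimensional array, axis=0 runs down the rows and returns one result per column, while axis=1 runs across the columns and returns one result per row.
For example, in the multidimensional array:
myMultiDimensionalArray = np.array([[3, 5, 2], [4, 5, 9]])
…values 3 and 4 lie along axis 0 – they make up the first column of the data, so a calculation using axis=0 combines them into the first value of its result.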
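Here is a sketch of how the axis parameter changes the result for the array above, using numpy.amax() and numpy.std() as examples:

```python
import numpy as np

myMultiDimensionalArray = np.array([[3, 5, 2], [4, 5, 9]])

overallMax = np.amax(myMultiDimensionalArray)         # 9, the array is treated as flat
columnMax = np.amax(myMultiDimensionalArray, axis=0)  # One result per column: 4, 5, 9
rowMax = np.amax(myMultiDimensionalArray, axis=1)     # One result per row: 5, 9
columnStd = np.std(myMultiDimensionalArray, axis=0)   # One result per column: 0.5, 0.0, 3.5

print(columnMax)  # [4 5 9]
print(rowMax)     # [5 9]
```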
But Wait, There’s More
There are also many more mathematical functions you can utilize. Check out the NumPy documentation to see the full list of functions you can perform on arrays and the values in them.
In Action – A Real-World NumPy Example
Now for an example of how the above can be applied. Weather data is numerical and is often analyzed to find trends, interesting patterns, and strange outliers (like the hottest February day on record). The below example will define some weather data and then gain some insights from it using NumPy.
Below, a multidimensional NumPy array is defined with some mock data for the temperature measured (in degrees Celsius) on each day in February. Each week of the month receives its own row in the matrix:
febWeather = np.array([
    [12, 14, 17, 11, 14, 15, 9],
    [29, 16, 14, 10, 18, 19, 17],
    [9, 11, 18, 17, 12, 20, 10],
    [16, 24, 23, 20, 18, 22, 25]
])
Why February? Because it has 28 days and can be evenly split into weeks, making for a tidy array. In 2021, February also started on a Monday, which is quite convenient.
Knowing the Data
Before you start running your data through your various calculations, take a look at it. You should know what data is there, what it represents, and how it is formatted. This will ensure that the code you write is applicable to the data set – for example, if you didn’t know how the data was formatted, you might try to access values that don’t exist (e.g., trying to access data for day 30, which won’t be there).
You might also be able to spot some trends or outlying values with the naked eye – this can be useful for confirming the results you get from your calculations, and help you spot where things may have gone wrong.
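A few quick checks can help with this first look at the data. The sketch below repeats the mock February data so that it runs on its own, then prints its shape, size and range:

```python
import numpy as np

febWeather = np.array([
    [12, 14, 17, 11, 14, 15, 9],
    [29, 16, 14, 10, 18, 19, 17],
    [9, 11, 18, 17, 12, 20, 10],
    [16, 24, 23, 20, 18, 22, 25]
])

print(febWeather.shape)  # (4, 7): four weeks of seven days
print(febWeather.size)   # 28 values in total
print(febWeather.min(), febWeather.max())  # 9 29
```

Knowing that the shape is (4, 7) tells you straight away that there is no fifth week and no eighth day to ask for.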
Asking Questions and Forming a Hypothesis
Next, you need to know what questions you want answers to – you can’t write code if you don’t have a goal. Using the examples above, we will answer the following questions about the weather data:
- What was the average (mean) temperature for February?
- What was the hottest temperature on a Monday?
- What was the average (mean) temperature on a Wednesday?
- What was the median (middle) temperature?
- Standard deviation – how much did the temperature vary?
When working with your own data, you will need to come up with your own questions relevant to it.
Finding Answers Using Python and NumPy
Once you have your data, Python and NumPy can be used to run calculations and find the answers:
meanFebruaryTemperature = np.mean(febWeather)
hottestMondayTemperature = np.amax(febWeather[:, 0]) # Column 0 holds the Mondays
meanWednesdayTemperature = np.mean(febWeather[:, 2]) # Column 2 holds the Wednesdays
medianTemperature = np.median(febWeather)
standardDeviation = np.std(febWeather)
Note that there’s no need to loop over these arrays to access values – the column index is simply supplied to select the day we are operating on. Monday is the first day of the week, so it is in column 0, and Wednesday is in column 2.
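For reference, here is a runnable sketch of these calculations that also prints the answers. It selects each weekday's column by index (column 0 for Mondays, column 2 for Wednesdays), and the np.argmax() line is an extra added for this example to show which week had the hottest Monday:

```python
import numpy as np

febWeather = np.array([
    [12, 14, 17, 11, 14, 15, 9],
    [29, 16, 14, 10, 18, 19, 17],
    [9, 11, 18, 17, 12, 20, 10],
    [16, 24, 23, 20, 18, 22, 25]
])

mondays = febWeather[:, 0]     # First column: every Monday
wednesdays = febWeather[:, 2]  # Third column: every Wednesday

meanFebruaryTemperature = np.mean(febWeather)    # About 16.43
hottestMondayTemperature = np.amax(mondays)      # 29
hottestMondayWeek = np.argmax(mondays) + 1       # 2, counting weeks from 1
meanWednesdayTemperature = np.mean(wednesdays)   # 18.0
medianTemperature = np.median(febWeather)        # 16.5
standardDeviation = np.std(febWeather)           # About 5.05

print(meanFebruaryTemperature, hottestMondayTemperature, hottestMondayWeek)
print(meanWednesdayTemperature, medianTemperature, standardDeviation)
```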
When answering the questions you have formulated, you can use the full range of Python and NumPy mathematical functions.
Understanding the Data
Once you have some answers, you can form a hypothesis (a potential explanation as to why something is the way it is) and further your investigations. Data from multiple sources can be processed, trends can be associated, and hypotheses confirmed – perhaps another dataset shows a decrease in the average cloud cover on the weeks where the average temperature was higher, for example.
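As a rough sketch of how that cloud cover idea could be checked, the example below compares the weekly mean temperatures with a second data set using np.corrcoef(). The cloud cover figures are entirely made up for this illustration, so the result says nothing about real weather:

```python
import numpy as np

febWeather = np.array([
    [12, 14, 17, 11, 14, 15, 9],
    [29, 16, 14, 10, 18, 19, 17],
    [9, 11, 18, 17, 12, 20, 10],
    [16, 24, 23, 20, 18, 22, 25]
])

# Hypothetical average cloud cover (%) for each of the four weeks, invented for this example
weeklyCloudCover = np.array([70, 45, 65, 30])

weeklyMeanTemperature = np.mean(febWeather, axis=1)  # One mean per week (row)

# Correlation coefficient between the two series, ranging from -1 to 1
correlation = np.corrcoef(weeklyMeanTemperature, weeklyCloudCover)[0, 1]
print(correlation)  # Negative here: the warmer weeks in this made-up data were less cloudy
```

A strongly negative value would support the hypothesis, though four data points are far too few to draw a real conclusion.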