NumPy, short for Numerical Python, is one of the most foundational tools for scientific computing in Python today. It is so important that many other libraries, such as pandas, build on NumPy array objects. The most important tool in NumPy is the ndarray, which is usually a fixed-size multidimensional container of items of the same type and size. Other useful tools in NumPy are the mathematical functions that support fast operations on whole ndarrays without the explicit loops required in pure Python, and a C API that connects NumPy with libraries written in C, C++ or Fortran.
In this short piece, I attempt to highlight some important NumPy features that may be useful to individuals interested in scientific computing.
Before using NumPy, it must be installed (after installing Python, of course). This can be done using conda or pip. The specific commands can be found in the documentation and depend on the individual's setup, e.g. Windows, Linux or macOS. After installation, NumPy must be imported before it can be used:
import numpy as np
There are other ways of importing NumPy, but the form above is the most widely used convention, and sticking to it makes code easier to share with collaborators. To check the installed version, use the following:
print('numpy:', np.__version__)
numpy: 1.22.3
The NumPy array object, the basis of NumPy
The N-dimensional array object, or ndarray, is a fast, flexible container for large datasets in Python. Arrays let you perform mathematical operations on whole blocks of data using syntax similar to the equivalent operations between scalar elements. You can compute on the individual elements of an ndarray without a loop or list comprehension, which would be required for a similar computation on, say, a list in pure Python. In addition, computations in NumPy are significantly faster than those in pure Python, for example:
# Comparing speed with pure Python
my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))
An ndarray and a list of the same size are declared. Below, I time a similar computation on the array and on the list.
%timeit my_arr2 = my_arr * 2
%timeit my_list2 = [x * 2 for x in my_list]
1.08 ms ± 22.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
52.4 ms ± 786 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
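The result above shows that, in this run, the computation on the array is roughly 50 times faster than the same computation on the list. The speed-up comes from NumPy running the loop in compiled code over a contiguous block of same-typed values, rather than over individual Python objects.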
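The %timeit magic only works in IPython or Jupyter. As a minimal sketch of the same comparison in a plain Python script, the standard-library timeit module can be used. The smaller input size and the names doubled_arr, doubled_list, arr_seconds and list_seconds are my own choices for illustration, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

# Smaller inputs than in the text so the sketch runs quickly.
my_arr = np.arange(100_000)
my_list = list(range(100_000))

# Both approaches should produce the same values.
doubled_arr = my_arr * 2
doubled_list = [x * 2 for x in my_list]
same_values = doubled_arr.tolist() == doubled_list

# Total time for 20 repetitions of each approach, in seconds.
arr_seconds = timeit.timeit(lambda: my_arr * 2, number=20)
list_seconds = timeit.timeit(lambda: [x * 2 for x in my_list], number=20)

print("same values:", same_values)
print(f"array: {arr_seconds:.4f} s, list: {list_seconds:.4f} s")
```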
Continuing the discussion on the ndarray, it contains elements of the same data type. Every array has a shape, a tuple giving the size of each dimension, and a dtype, which describes the single data type shared by all of its elements.
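To make these attributes concrete, here is a small sketch that inspects them on a two-dimensional array. The array name grid and its values are only illustrative.

```python
import numpy as np

# A 2 x 3 array built from a nested list (illustrative values).
grid = np.array([[1.5, 2.0, 3.0], [4.0, 5.0, 6.0]])

print(grid.shape)  # (2, 3): two rows, three columns
print(grid.ndim)   # 2: number of dimensions
print(grid.dtype)  # float64: one data type shared by every element
print(grid.size)   # 6: total number of elements
```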
Creating ndarrays
ndarrays can be created with the array function, which accepts lists and other sequence-like data structures, as shown below:
# from lists
data = [1,2,3,4,5,6]
arr1 = np.array(data)
arr1
array([1, 2, 3, 4, 5, 6])
This creates an array object that contains the elements in the data list.
We can also input the elements directly as a list as follows:
arr2 = np.array([1,2,3,4,5,6])
arr2
array([1, 2, 3, 4, 5, 6])
This creates a new array with the same contents as arr1 above.
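Lists are not the only possible input. As a rough sketch of the "other data structures" mentioned earlier, the snippet below builds arrays from a tuple, a nested list and a list that mixes integers and floats. The variable names are my own and the values are arbitrary.

```python
import numpy as np

# From a tuple: behaves the same as a list input.
from_tuple = np.array((1, 2, 3))

# From a nested list: each inner list becomes a row.
nested = np.array([[1, 2, 3], [4, 5, 6]])

# Mixing ints and floats upcasts every element to a float type.
mixed = np.array([1, 2.5, 3])

print(from_tuple)    # [1 2 3]
print(nested.shape)  # (2, 3)
print(mixed.dtype)   # float64
```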
The names available in the NumPy namespace (functions, classes, constants and submodules) can be listed with the following code:
dir(np) # assuming numpy is imported as np
['ALLOW_THREADS',
'AxisError',
'BUFSIZE',
'CLIP',
'ComplexWarning',
'DataSource',
'ERR_CALL',
'ERR_DEFAULT',
'ERR_IGNORE',
'ERR_LOG',
'ERR_PRINT',
'ERR_RAISE',
'ERR_WARN',
'FLOATING_POINT_SUPPORT',
'FPE_DIVIDEBYZERO',
'FPE_INVALID',
'FPE_OVERFLOW',
'FPE_UNDERFLOW',
'False_',
'Inf',
'Infinity',
'MAXDIMS',
'MAY_SHARE_BOUNDS',
'MAY_SHARE_EXACT',
'ModuleDeprecationWarning',
'NAN',
'NINF',
'NZERO',
'NaN',
'PINF',
'PZERO',
'RAISE',
'RankWarning',
'SHIFT_DIVIDEBYZERO',
'SHIFT_INVALID',
'SHIFT_OVERFLOW',
'SHIFT_UNDERFLOW',
'ScalarType',
'Tester',
'TooHardError',
'True_',
'UFUNC_BUFSIZE_DEFAULT',
'UFUNC_PYVALS_NAME',
'VisibleDeprecationWarning',
'WRAP',
'_CopyMode',
'_NoValue',
'_UFUNC_API',
'__NUMPY_SETUP__',
'__all__',
'__builtins__',
'__cached__',
'__config__',
'__deprecated_attrs__',
'__dir__',
'__doc__',
'__expired_functions__',
'__file__',
'__getattr__',
'__git_version__',
'__loader__',
'__name__',
'__package__',
'__path__',
'__spec__',
'__version__',
'_add_newdoc_ufunc',
'_distributor_init',
'_financial_names',
'_from_dlpack',
'_globals',
'_mat',
'_pytesttester',
'_version',
'abs',
'absolute',
'add',
'add_docstring',
'add_newdoc',
'add_newdoc_ufunc',
'alen',
'all',
'allclose',
'alltrue',
'amax',
'amin',
'angle',
'any',
'append',
'apply_along_axis',
'apply_over_axes',
'arange',
'arccos',
'arccosh',
'arcsin',
'arcsinh',
'arctan',
'arctan2',
'arctanh',
'argmax',
'argmin',
'argpartition',
'argsort',
'argwhere',
'around',
'array',
'array2string',
'array_equal',
'array_equiv',
'array_repr',
'array_split',
'array_str',
'asanyarray',
'asarray',
'asarray_chkfinite',
'ascontiguousarray',
'asfarray',
'asfortranarray',
'asmatrix',
'asscalar',
'atleast_1d',
'atleast_2d',
'atleast_3d',
'average',
'bartlett',
'base_repr',
'binary_repr',
'bincount',
'bitwise_and',
'bitwise_not',
'bitwise_or',
'bitwise_xor',
'blackman',
'block',
'bmat',
'bool8',
'bool_',
'broadcast',
'broadcast_arrays',
'broadcast_shapes',
'broadcast_to',
'busday_count',
'busday_offset',
'busdaycalendar',
'byte',
'byte_bounds',
'bytes0',
'bytes_',
'c_',
'can_cast',
'cast',
'cbrt',
'cdouble',
'ceil',
'cfloat',
'char',
'character',
'chararray',
'choose',
'clip',
'clongdouble',
'clongfloat',
'column_stack',
'common_type',
'compare_chararrays',
'compat',
'complex128',
'complex256',
'complex64',
'complex_',
'complexfloating',
'compress',
'concatenate',
'conj',
'conjugate',
'convolve',
'copy',
'copysign',
'copyto',
'core',
'corrcoef',
'correlate',
'cos',
'cosh',
'count_nonzero',
'cov',
'cross',
'csingle',
'ctypeslib',
'cumprod',
'cumproduct',
'cumsum',
'datetime64',
'datetime_as_string',
'datetime_data',
'deg2rad',
'degrees',
'delete',
'deprecate',
'deprecate_with_doc',
'diag',
'diag_indices',
'diag_indices_from',
'diagflat',
'diagonal',
'diff',
'digitize',
'disp',
'divide',
'divmod',
'dot',
'double',
'dsplit',
'dstack',
'dtype',
'e',
'ediff1d',
'einsum',
'einsum_path',
'emath',
'empty',
'empty_like',
'equal',
'errstate',
'euler_gamma',
'exp',
'exp2',
'expand_dims',
'expm1',
'expm1x',
'extract',
'eye',
'fabs',
'fastCopyAndTranspose',
'fft',
'fill_diagonal',
'find_common_type',
'finfo',
'fix',
'flatiter',
'flatnonzero',
'flexible',
'flip',
'fliplr',
'flipud',
'float128',
'float16',
'float32',
'float64',
'float_',
'float_power',
'floating',
'floor',
'floor_divide',
'fmax',
'fmin',
'fmod',
'format_float_positional',
'format_float_scientific',
'format_parser',
'frexp',
'frombuffer',
'fromfile',
'fromfunction',
'fromiter',
'frompyfunc',
'fromregex',
'fromstring',
'full',
'full_like',
'gcd',
'generic',
'genfromtxt',
'geomspace',
'get_array_wrap',
'get_include',
'get_printoptions',
'getbufsize',
'geterr',
'geterrcall',
'geterrobj',
'gradient',
'greater',
'greater_equal',
'half',
'hamming',
'hanning',
'heaviside',
'histogram',
'histogram2d',
'histogram_bin_edges',
'histogramdd',
'hsplit',
'hstack',
'hypot',
'i0',
'identity',
'iinfo',
'imag',
'in1d',
'index_exp',
'indices',
'inexact',
'inf',
'info',
'infty',
'inner',
'insert',
'int0',
'int16',
'int32',
'int64',
'int8',
'int_',
'intc',
'integer',
'interp',
'intersect1d',
'intp',
'invert',
'is_busday',
'isclose',
'iscomplex',
'iscomplexobj',
'isfinite',
'isfortran',
'isin',
'isinf',
'isnan',
'isnat',
'isneginf',
'isposinf',
'isreal',
'isrealobj',
'isscalar',
'issctype',
'issubclass_',
'issubdtype',
'issubsctype',
'iterable',
'ix_',
'kaiser',
'kernel_version',
'kron',
'lcm',
'ldexp',
'left_shift',
'less',
'less_equal',
'lexsort',
'lib',
'linalg',
'linspace',
'little_endian',
'load',
'loadtxt',
'log',
'log10',
'log1p',
'log2',
'logaddexp',
'logaddexp2',
'logical_and',
'logical_not',
'logical_or',
'logical_xor',
'logspace',
'longcomplex',
'longdouble',
'longfloat',
'longlong',
'lookfor',
'ma',
'mask_indices',
'mat',
'math',
'matmul',
'matrix',
'matrixlib',
'max',
'maximum',
'maximum_sctype',
'may_share_memory',
'mean',
'median',
'memmap',
'meshgrid',
'mgrid',
'min',
'min_scalar_type',
'minimum',
'mintypecode',
'mod',
'modf',
'moveaxis',
'msort',
'multiply',
'nan',
'nan_to_num',
'nanargmax',
'nanargmin',
'nancumprod',
'nancumsum',
'nanmax',
'nanmean',
'nanmedian',
'nanmin',
'nanpercentile',
'nanprod',
'nanquantile',
'nanstd',
'nansum',
'nanvar',
'nbytes',
'ndarray',
'ndenumerate',
'ndim',
'ndindex',
'nditer',
'negative',
'nested_iters',
'newaxis',
'nextafter',
'nonzero',
'not_equal',
'numarray',
'number',
'obj2sctype',
'object0',
'object_',
'ogrid',
'oldnumeric',
'ones',
'ones_like',
'os',
'outer',
'packbits',
'pad',
'partition',
'percentile',
'pi',
'piecewise',
'place',
'poly',
'poly1d',
'polyadd',
'polyder',
'polydiv',
'polyfit',
'polyint',
'polymul',
'polynomial',
'polysub',
'polyval',
'positive',
'power',
'printoptions',
'prod',
'product',
'promote_types',
'ptp',
'put',
'put_along_axis',
'putmask',
'quantile',
'r_',
'rad2deg',
'radians',
'random',
'ravel',
'ravel_multi_index',
'real',
'real_if_close',
'rec',
'recarray',
'recfromcsv',
'recfromtxt',
'reciprocal',
'record',
'remainder',
'repeat',
'require',
'reshape',
'resize',
'result_type',
'right_shift',
'rint',
'roll',
'rollaxis',
'roots',
'rot90',
'round',
'round_',
'row_stack',
's_',
'safe_eval',
'save',
'savetxt',
'savez',
'savez_compressed',
'sctype2char',
'sctypeDict',
'sctypes',
'searchsorted',
'select',
'set_numeric_ops',
'set_printoptions',
'set_string_function',
'setbufsize',
'setdiff1d',
'seterr',
'seterrcall',
'seterrobj',
'setxor1d',
'shape',
'shares_memory',
'short',
'show_config',
'sign',
'signbit',
'signedinteger',
'sin',
'sinc',
'single',
'singlecomplex',
'sinh',
'size',
'sometrue',
'sort',
'sort_complex',
'source',
'spacing',
'split',
'sqrt',
'square',
'squeeze',
'stack',
'std',
'str0',
'str_',
'string_',
'subtract',
'sum',
'swapaxes',
'sys',
'take',
'take_along_axis',
'tan',
'tanh',
'tensordot',
'test',
'testing',
'tile',
'timedelta64',
'trace',
'tracemalloc_domain',
'transpose',
'trapz',
'tri',
'tril',
'tril_indices',
'tril_indices_from',
'trim_zeros',
'triu',
'triu_indices',
'triu_indices_from',
'true_divide',
'trunc',
'typecodes',
'typename',
'ubyte',
'ufunc',
'uint',
'uint0',
'uint16',
'uint32',
'uint64',
'uint8',
'uintc',
'uintp',
'ulonglong',
'unicode_',
'union1d',
'unique',
'unpackbits',
'unravel_index',
'unsignedinteger',
'unwrap',
'use_hugepage',
'ushort',
'vander',
'var',
'vdot',
'vectorize',
'version',
'void',
'void0',
'vsplit',
'vstack',
'warnings',
'where',
'who',
'zeros',
'zeros_like']
Using the dir function is a handy way of finding the methods and functions associated with a package or even an object.
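Since the full listing is long, it can help to filter it. The sketch below keeps only the public names and then narrows them to those containing "sort". The filter string is just an example, and the exact results depend on the installed NumPy version, so I do not show the output here.

```python
import numpy as np

# Keep only public names (those without a leading underscore).
public_names = [name for name in dir(np) if not name.startswith("_")]

# Narrow further to names that mention "sort".
sort_related = [name for name in public_names if "sort" in name]

print(len(public_names))
print(sort_related)
```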
Data types for ndarrays
It has been established that elements in an ndarray are usually of the same data type. As such, the data type of an array can be declared while creating the array as follows:
arr = np.array([4,3,5,6,7], dtype = np.float64)
arr
array([4., 3., 5., 6., 7.])
An existing array can also be converted to another data type with the astype method:
int_arr = arr.astype(np.int64)
int_arr
array([4, 3, 5, 6, 7])
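Two details of astype are worth seeing in a small sketch: casting floats to integers drops the fractional part, and astype returns a new array rather than changing the original. The arrays floats and numeric_strings below are made-up examples.

```python
import numpy as np

floats = np.array([1.7, -1.7, 2.2])

# Casting float to int drops the fractional part (truncates toward zero).
as_ints = floats.astype(np.int64)
print(as_ints)       # [ 1 -1  2]

# astype returns a new array; the original keeps its dtype.
print(floats.dtype)  # float64

# Strings that hold numbers can be converted as well.
numeric_strings = np.array(["1.25", "-9.6", "42"])
as_floats = numeric_strings.astype(np.float64)
print(as_floats)
```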
The following is a list of NumPy data types:
1. int
2. float
3. bool
4. complex
5. object
6. string
7. unicode
We can also specify the number of bits for some of them, e.g. int8, int32, float64, complex64.
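The bit width determines how much memory each element uses and, for integers, which values fit. A small sketch of how this could be checked, with itemsizes and int8_info as names I chose for the example:

```python
import numpy as np

# itemsize is the number of bytes each element occupies.
itemsizes = {}
for dtype in (np.int8, np.int32, np.float64, np.complex64):
    sample = np.zeros(3, dtype=dtype)
    itemsizes[sample.dtype.name] = sample.itemsize
print(itemsizes)  # {'int8': 1, 'int32': 4, 'float64': 8, 'complex64': 8}

# np.iinfo reports the range of values an integer type can hold.
int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)  # -128 127
```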
Mathematical Calculations in NumPy
NumPy has two features on ndarrays that reflect the power of the library: vectorization and broadcasting. Neither is available on built-in data structures like Python lists. Vectorization refers to element-wise calculations performed on entire arrays using the same simple syntax as scalar calculations:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr
array([[1., 2., 3.],
       [4., 5., 6.]])
arr * arr
array([[ 1.,  4.,  9.],
       [16., 25., 36.]])
Arithmetic with scalars propagates the scalar to each element in the array:
1 / arr
array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])
In the above, 1 is divided by each of the elements in arr to return a new array with the same shape as arr. This is the beauty of NumPy.
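Scalars are the simplest case of broadcasting. The same idea extends to arrays whose shapes are compatible, which the following sketch tries to illustrate. The arrays row and col are hypothetical examples chosen so that their shapes line up with arr.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

# A length-3 row is stretched across both rows of arr.
row = np.array([10., 20., 30.])
row_sum = arr + row
print(row_sum)
# [[11. 22. 33.]
#  [14. 25. 36.]]

# A (2, 1) column is stretched across the three columns of arr.
col = np.array([[100.], [200.]])
col_sum = arr + col
print(col_sum)
# [[101. 102. 103.]
#  [204. 205. 206.]]
```

In both cases no loop is written and the smaller array is not copied by hand; NumPy works out how to line the shapes up.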
Apart from regular arithmetic, NumPy also supports ufuncs (universal functions), which perform element-wise computations on ndarrays. Examples include numpy.sqrt and numpy.exp. Ufuncs can be unary (acting on the elements of a single array), e.g. numpy.sqrt, or binary (taking two arrays as arguments), e.g. numpy.multiply and numpy.mod. Ufuncs are powerful, and I have seen them used in a simple implementation of logistic regression.
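Here is a short sketch of the ufuncs named above. The sigmoid helper at the end is my own illustration of the kind of function a logistic regression implementation might build from numpy.exp; it is not taken from any particular implementation.

```python
import numpy as np

values = np.array([1., 4., 9., 16.])
others = np.array([2., 3., 4., 5.])

# Unary ufunc: applied to every element of one array.
roots = np.sqrt(values)
print(roots)       # [1. 2. 3. 4.]

# Binary ufuncs: applied to matching pairs of elements from two arrays.
products = np.multiply(values, others)
remainders = np.mod(values, others)
print(products)    # [ 2. 12. 36. 80.]
print(remainders)  # [1. 1. 1. 1.]


def sigmoid(z):
    """Logistic function built from the np.exp ufunc (illustrative helper)."""
    return 1.0 / (1.0 + np.exp(-z))


probabilities = sigmoid(np.array([-2., 0., 2.]))
print(probabilities)
```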
Indexing ndarrays
There are a variety of ways to index NumPy arrays, but in the basic form, indexing one-dimensional arrays is very similar to indexing lists in pure Python. Since an ndarray can have many dimensions, indexing becomes more involved as the number of dimensions grows. In addition, NumPy allows boolean indexing (using a condition) and fancy indexing (using integer arrays). Neither is available for built-in Python lists. An important distinction from Python's built-in lists is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array.
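The indexing styles described above can be sketched as follows. All array names and values are illustrative. The last part shows the view behaviour of slices, and the copy method as one way to get independent data.

```python
import numpy as np

arr = np.arange(10)  # [0 1 2 3 4 5 6 7 8 9]

# Boolean indexing: keep the elements where the condition is True.
evens = arr[arr % 2 == 0]
print(evens)   # [0 2 4 6 8]

# Fancy indexing: pick elements by a list of integer positions.
picked = arr[[7, 1, 4]]
print(picked)  # [7 1 4]

# Multidimensional indexing: row 1, column 2.
grid = np.array([[1, 2, 3], [4, 5, 6]])
corner = grid[1, 2]
print(corner)  # 6

# A slice is a view: writing to it changes the original array.
view = arr[2:5]
view[:] = 99
print(arr)     # [ 0  1 99 99 99  5  6  7  8  9]

# An explicit copy is independent of the original.
independent = arr[5:8].copy()
independent[:] = -1
print(arr[5:8])  # [5 6 7]
```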
Conclusion
In conclusion, NumPy provides powerful tools for scientific computing, and this piece is an attempt at a brief introduction. As I begin my data science journey, I try to document what I learn, and this is one such attempt. It is a very skeletal view of what is possible with NumPy, and I have barely begun to scratch the surface. I recommend checking the NumPy documentation and practicing the concepts as you learn them.