Data Manipulation with Pandas
Pandas is a newer package built on top of NumPy that provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data.
As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.
As we saw, NumPy’s ndarray data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks. While it serves this purpose very well, its limitations become clear when we need more flexibility (attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us. Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of “data munging” tasks that occupy much of a data scientist’s time.
Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices. As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are. Thus, before we go any further, let’s introduce these three fundamental Pandas data structures: the Series, DataFrame, and Index.
import numpy as np
import pandas as pd
The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
As we see in the preceding output, the Series wraps both a sequence of values and a sequence of indices. The values are simply a familiar NumPy array:
data.values
array([ 0.25, 0.5 , 0.75, 1. ])
The index is an array-like object of type pd.Index:
data.index
RangeIndex(start=0, stop=4, step=1)
Like a NumPy array, the data can be accessed by the associated index via the familiar Python square-bracket notation:
data[1:3]
1 0.50
2 0.75
dtype: float64
SERIES AS GENERALIZED NUMPY ARRAY
From what we have seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.
This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type. If we wish, we can even use strings as an index:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
data['b']
0.5
We can even use noncontiguous or nonsequential indices:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data[5]
0.5
SERIES AS SPECIALIZED DICTIONARY
In this way, you can think of a Pandas Series a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.
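As a rough sketch of that efficiency claim (illustrative only: timings vary by machine, and big_dict and big_series are hypothetical names introduced just for this comparison):
big_dict = {i: 0.5 * i for i in range(1000000)}
big_series = pd.Series(big_dict)
%timeit sum(big_dict.values())   # Python-level loop over boxed floats
%timeit big_series.sum()         # vectorized sum over a typed NumPy buffer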
We can make the Series-as-dictionary analogy even more clear by constructing a Series object directly from a Python dictionary:
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict)
population
California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
dtype: int64
By default, a Series will be created where the index is drawn from the sorted keys (note that newer versions of Pandas preserve the dictionary's insertion order instead). From here, typical dictionary-style item access can be performed:
population['California']
38332521
Unlike a dictionary, though, the Series also supports array-style operations such as slicing:
population['California':'Illinois']
California 38332521
Florida 19552860
Illinois 12882135
dtype: int64
CONSTRUCTING SERIES OBJECTS
We've already seen a few ways of constructing a Pandas Series from scratch; all of them are some version of the following:
pd.Series(data, index=index)
where index is an optional argument, and data can be one of many entities. For example, data can be a scalar, which is repeated to fill the specified index:
pd.Series(5, index=[100, 200, 300])
100 5
200 5
300 5
dtype: int64
data can also be a dictionary, in which case the index defaults to the dictionary keys (sorted, in the version used here):
pd.Series({2: 'a', 1: 'b', 3: 'c'})
1 b
2 a
3 c
dtype: object
In each case, the index can be explicitly set; the Series is then populated only with the explicitly identified keys:
pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])
3 c
2 a
dtype: object
The Pandas DataFrame Object
The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. We’ll now take a look at each of these perspectives.
DATAFRAME AS A GENERALIZED NUMPY ARRAY
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by “aligned” we mean that they share the same index.
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
dtype: int64
Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:
states = pd.DataFrame({'population': population,
'area': area})
states
area | population | |
---|---|---|
California | 423967 | 38332521 |
Florida | 170312 | 19552860 |
Illinois | 149995 | 12882135 |
New York | 141297 | 19651127 |
Texas | 695662 | 26448193 |
population
California 38332521
Florida 19552860
Illinois 12882135
New York 19651127
Texas 26448193
dtype: int64
area
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
dtype: int64
Like the Series, the DataFrame has an index attribute that gives access to the row labels:
states.index
Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:
states.columns
Index(['area', 'population'], dtype='object')
Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.
DATAFRAME AS SPECIALIZED DICTIONARY
Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. For example, asking for the ‘area’ attribute returns the Series object containing the areas we saw earlier:
states['area']
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
CONSTRUCTING DATAFRAME OBJECTS
A Pandas DataFrame can be constructed in a variety of ways; here are several examples.
#### 1) From a single Series object
A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:
pd.DataFrame(population, columns=['population'])
population | |
---|---|
California | 38332521 |
Florida | 19552860 |
Illinois | 12882135 |
New York | 19651127 |
Texas | 26448193 |
#### 2) From a list of lists
pd.DataFrame([["Michigan", "Ann Arbor"], ["Michigan", "Ypsilanti"]],
             columns=["State", "RegionName"])
State | RegionName | |
---|---|---|
0 | Michigan | Ann Arbor |
1 | Michigan | Ypsilanti |
#### 3) From a list of dicts
Any list of dictionaries can be made into a DataFrame:
data = [{'a': i, 'b': 2 * i}
for i in range(3)]
print(data)
pd.DataFrame(data)
[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]
a | b | |
---|---|---|
0 | 0 | 0 |
1 | 1 | 2 |
2 | 2 | 4 |
If some keys in one dictionary are missing, Pandas fills them in with NaN values:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
a | b | c | |
---|---|---|---|
0 | 1.0 | 2 | NaN |
1 | NaN | 3 | 4.0 |
#### 4) From a dictionary of Series objects
pd.DataFrame({"area":area,"population":population})
area | population | |
---|---|---|
California | 423967 | 38332521 |
Florida | 170312 | 19552860 |
Illinois | 149995 | 12882135 |
New York | 141297 | 19651127 |
Texas | 695662 | 26448193 |
#### 5) From a two-dimensional NumPy array
pd.DataFrame(np.random.rand(3, 2),
columns=['foo', 'bar'],
index=['a', 'b', 'c'])
foo | bar | |
---|---|---|
a | 0.819422 | 0.217693 |
b | 0.521680 | 0.138218 |
c | 0.091140 | 0.504031 |
#### 6) From a NumPy structured array
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
print(A)
pd.DataFrame(A)
[(0, 0.) (0, 0.) (0, 0.)]
A | B | |
---|---|---|
0 | 0 | 0.0 |
1 | 0 | 0.0 |
2 | 0 | 0.0 |
The Pandas Index Object
We have seen here that both the Series and DataFrame objects contain an explicit index that lets you reference and modify data. This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set.
ind = pd.Index([2, 3, 5, 7, 11])
ind
Int64Index([2, 3, 5, 7, 11], dtype='int64')
INDEX AS IMMUTABLE ARRAY
The Index object operates like an array in many ways; for example, we can use standard indexing notation to retrieve values or slices. One key difference is that Index objects are immutable; that is, they cannot be modified via the normal means. This immutability makes it safer to share indices between multiple DataFrames and arrays, without the potential for side effects from inadvertent index modification.
ind[1],ind[::2]
(3, Int64Index([2, 5, 11], dtype='int64'))
print(ind.size, ind.shape, ind.ndim, ind.dtype)
5 (5,) 1 int64
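For example, attempting to modify an Index in place fails (a quick check of the immutability claim):
ind[1] = 0
TypeError: Index does not support mutable operations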
INDEX AS ORDERED SET
Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python’s built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA & indB # intersection
Int64Index([3, 5, 7], dtype='int64')
indA | indB # union
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
indA ^ indB # symmetric difference
Int64Index([1, 2, 9, 11], dtype='int64')
(Recent versions of Pandas deprecate these operators for set arithmetic; prefer the equivalent methods indA.intersection(indB), indA.union(indB), and indA.symmetric_difference(indB).)
Data Indexing and Selection
In this section we'll look at the means of accessing and modifying values in Series and DataFrame objects. If you have used the NumPy patterns of indexing, slicing, masking, and fancy indexing, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.
INDEXERS: LOC, ILOC, AND IX
These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
1 a
3 b
5 c
dtype: object
# explicit index when indexing
data[1]
'a'
# implicit index when slicing
data[1:3]
3 b
5 c
dtype: object
Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.
First, the loc attribute allows indexing and slicing that always references the explicit index:
data.loc[1]
'a'
data.loc[1:3]
1 a
3 b
dtype: object
The iloc attribute allows indexing and slicing that always references the implicit Python-style index:
data.iloc[1]
'b'
data.iloc[1:3]
3 b
5 c
dtype: object
One guiding principle of Python code is that "explicit is better than implicit." The explicit nature of loc and iloc makes them very useful for maintaining clean and readable code, especially in the case of integer indexes.
Data Selection in DataFrame
Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.
DATAFRAME AS A DICTIONARY
DataFrame as a dictionary of related Series objects
area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data
area | pop | |
---|---|---|
California | 423967 | 38332521 |
Florida | 170312 | 19552860 |
Illinois | 149995 | 12882135 |
New York | 141297 | 19651127 |
Texas | 695662 | 26448193 |
The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:
data['area']
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
#Equivalently, we can use attribute-style access with column names that are strings:
data.area
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
data.area is data['area']
True
data.pop is data['pop']
False
Keep in mind that attribute-style access does not always work: if a column name is not a string, or if it conflicts with a method of the DataFrame, the attribute refers to the method instead. Here the DataFrame has a pop() method, so data.pop points to that method rather than the 'pop' column; for this reason, you should avoid the temptation to assign columns via attribute syntax.
#this dictionary-style syntax can also be used to modify the object, in this case to add a new column:
data['density'] = data['pop'] / data['area']
#This shows a preview of the straightforward syntax of element-by-element arithmetic between Series objects
data
area | pop | density | |
---|---|---|---|
California | 423967 | 38332521 | 90.413926 |
Florida | 170312 | 19552860 | 114.806121 |
Illinois | 149995 | 12882135 | 85.883763 |
New York | 141297 | 19651127 | 139.076746 |
Texas | 695662 | 26448193 | 38.018740 |
DATAFRAME AS TWO-DIMENSIONAL ARRAY
As mentioned previously, we can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute:
data.values
array([[ 4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
[ 1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
[ 1.49995000e+05, 1.28821350e+07, 8.58837628e+01],
[ 1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
[ 6.95662000e+05, 2.64481930e+07, 3.80187404e+01]])
With this picture in mind, we can perform many familiar array-like operations on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns:
data.T
California | Florida | Illinois | New York | Texas | |
---|---|---|---|---|---|
area | 4.239670e+05 | 1.703120e+05 | 1.499950e+05 | 1.412970e+05 | 6.956620e+05 |
pop | 3.833252e+07 | 1.955286e+07 | 1.288214e+07 | 1.965113e+07 | 2.644819e+07 |
density | 9.041393e+01 | 1.148061e+02 | 8.588376e+01 | 1.390767e+02 | 3.801874e+01 |
When it comes to indexing of DataFrame objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array. In particular, passing a single index to an array accesses a row:
data.values[0]
array([ 4.23967000e+05, 3.83325210e+07, 9.04139261e+01])
#and passing a single “index” to a DataFrame accesses a column:
data['area']
California 423967
Florida 170312
Illinois 149995
New York 141297
Texas 695662
Name: area, dtype: int64
Thus for array-style indexing, we need another convention. Here Pandas again uses the loc, iloc, and ix indexers mentioned earlier. Using the iloc indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:
data.iloc[:3, :2]
area | pop | |
---|---|---|
California | 423967 | 38332521 |
Florida | 170312 | 19552860 |
Illinois | 149995 | 12882135 |
data.loc[:'Illinois', :'pop']
area | pop | |
---|---|---|
California | 423967 | 38332521 |
Florida | 170312 | 19552860 |
Illinois | 149995 | 12882135 |
# The ix indexer allows a hybrid of these two approaches:
data.ix[:3, :'pop']
DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
(Note: in modern versions of Pandas, .ix has been removed entirely; use .loc and .iloc instead.)
area | pop | |
---|---|---|
California | 423967 | 38332521 |
Florida | 170312 | 19552860 |
Illinois | 149995 | 12882135 |
Any of the familiar NumPy-style data access patterns can be used within these indexers. For example, in the loc indexer we can combine masking and fancy indexing as in the following:
data.loc[data.density > 100, ['pop', 'density']]
pop | density | |
---|---|---|
Florida | 19552860 | 114.806121 |
New York | 19651127 | 139.076746 |
Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:
data.iloc[0, 2] = 90
data
area | pop | density | |
---|---|---|---|
California | 423967 | 38332521 | 90.000000 |
Florida | 170312 | 19552860 | 114.806121 |
Illinois | 149995 | 12882135 | 85.883763 |
New York | 141297 | 19651127 | 139.076746 |
Texas | 695662 | 26448193 | 38.018740 |
ADDITIONAL INDEXING CONVENTIONS
There are a couple extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice. First, while indexing refers to columns, slicing refers to rows:
data['Florida':'Illinois']
area | pop | density | |
---|---|---|---|
Florida | 170312 | 19552860 | 114.806121 |
Illinois | 149995 | 12882135 | 85.883763 |
#Such slices can also refer to rows by number rather than by index:
data[1:3]
area | pop | density | |
---|---|---|---|
Florida | 170312 | 19552860 | 114.806121 |
Illinois | 149995 | 12882135 | 85.883763 |
#Similarly, direct masking operations are also interpreted row-wise rather than column-wise:
data[data.density > 100]
area | pop | density | |
---|---|---|---|
Florida | 170312 | 19552860 | 114.806121 |
New York | 141297 | 19651127 | 139.076746 |
Operating on Data in Pandas
One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.). Pandas inherits much of this functionality from NumPy, and the ufuncs that we introduced in “Computation on NumPy Arrays: Universal Functions” are key to this.
Pandas includes a couple useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will preserve index and column labels in the output, and for binary operations such as addition and multiplication, Pandas will automatically align indices when passing the objects to the ufunc. This means that keeping the context of data and combining data from different sources—both potentially error-prone tasks with raw NumPy arrays—become essentially foolproof ones with Pandas. We will additionally see that there are well-defined operations between one-dimensional Series structures and two-dimensional DataFrame structures.
Ufuncs: Index Preservation
Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects.
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser
0 6
1 3
2 7
3 4
dtype: int32
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df
A | B | C | D | |
---|---|---|---|---|
0 | 6 | 9 | 2 | 6 |
1 | 7 | 4 | 3 | 7 |
2 | 7 | 2 | 5 | 4 |
np.exp(ser)
0 403.428793
1 20.085537
2 1096.633158
3 54.598150
dtype: float64
np.sin(df * np.pi / 4)
A | B | C | D | |
---|---|---|---|---|
0 | -1.000000 | 7.071068e-01 | 1.000000 | -1.000000e+00 |
1 | -0.707107 | 1.224647e-16 | 0.707107 | -7.071068e-01 |
2 | -0.707107 | 1.000000e+00 | -0.707107 | 1.224647e-16 |
Ufuncs: Index Alignment
For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation. This is very convenient when you are working with incomplete data.
INDEX ALIGNMENT IN SERIES
As an example, suppose we are combining two different data sources and wish to find only the top three US states by area and the top three by population:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127}, name='population')
Let's see what happens when we divide these to compute the population density:
population / area
Alaska NaN
California 90.413926
New York NaN
Texas 38.018740
dtype: float64
The resulting array contains the union of the indices of the two input arrays, which we could determine directly using standard Python set arithmetic on these indices:
area.index | population.index
Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
(Here too, recent versions of Pandas prefer the explicit area.index.union(population.index).)
Any item for which one or the other does not have an entry is marked with NaN. This index matching is implemented for any of Python's built-in arithmetic expressions:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B
0 NaN
1 5.0
2 9.0
3 NaN
dtype: float64
If using NaN values is not the desired behavior, the fill value can be modified using the appropriate object methods in place of the operators. For example, A.add(B) is equivalent to A + B, but allows optional explicit specification of the fill value for any missing entries:
A.add(B, fill_value=0)
0 2.0
1 5.0
2 9.0
3 5.0
dtype: float64
INDEX ALIGNMENT IN DATAFRAME
A similar type of alignment takes place for both columns and indices when you are performing operations on DataFrames:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
columns=list('AB'))
A
A | B | |
---|---|---|
0 | 1 | 11 |
1 | 5 | 1 |
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
columns=list('BAC'))
B
B | A | C | |
---|---|---|---|
0 | 4 | 0 | 9 |
1 | 5 | 8 | 0 |
2 | 9 | 2 | 6 |
A + B
A | B | C | |
---|---|---|---|
0 | 1.0 | 15.0 | NaN |
1 | 13.0 | 6.0 | NaN |
2 | NaN | NaN | NaN |
Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted. As was the case with Series, we can use the associated object’s arithmetic method and pass any desired fill_value to be used in place of missing entries. Here we’ll fill with the mean of all values in A (which we compute by first stacking the rows of A):
fill = A.stack().mean()
A.add(B, fill_value=fill)
A | B | C | |
---|---|---|---|
0 | 1.0 | 15.0 | 13.5 |
1 | 13.0 | 6.0 | 4.5 |
2 | 6.5 | 13.5 | 10.5 |
Ufuncs: Operations Between DataFrame and Series
When you are performing operations between a DataFrame and a Series, the index and column alignment is similarly maintained. Operations between a DataFrame and a Series are similar to operations between a two-dimensional and one-dimensional NumPy array. Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:
A = rng.randint(10, size=(3, 4))
A
array([[3, 8, 2, 4],
[2, 6, 4, 8],
[6, 1, 3, 8]])
A - A[0]
array([[ 0, 0, 0, 0],
[-1, -2, 2, 4],
[ 3, -7, 1, 4]])
According to NumPy’s broadcasting rules (see “Computation on Arrays: Broadcasting”), subtraction between a two-dimensional array and one of its rows is applied row-wise.
In Pandas, the convention similarly operates row-wise by default:
df = pd.DataFrame(A, columns=list('QRST'))
df
Q | R | S | T | |
---|---|---|---|---|
0 | 3 | 8 | 2 | 4 |
1 | 2 | 6 | 4 | 8 |
2 | 6 | 1 | 3 | 8 |
df - df.iloc[0]
Q | R | S | T | |
---|---|---|---|---|
0 | 0 | 0 | 0 | 0 |
1 | -1 | -2 | 2 | 4 |
2 | 3 | -7 | 1 | 4 |
If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the axis keyword:
df.subtract(df['R'], axis=0)
Q | R | S | T | |
---|---|---|---|---|
0 | -5 | 0 | -6 | -4 |
1 | -4 | 0 | -2 | 2 |
2 | 5 | 0 | 2 | 7 |
halfrow = df.iloc[0, ::2]
halfrow
Q 3
S 2
Name: 0, dtype: int32
df - halfrow
Q | R | S | T | |
---|---|---|---|---|
0 | 0.0 | NaN | 0.0 | NaN |
1 | -1.0 | NaN | 2.0 | NaN |
2 | 3.0 | NaN | 1.0 | NaN |
This preservation and alignment of indices and columns means that operations on data in Pandas will always maintain the data context, which prevents the common errors that can arise when working with heterogeneous and/or misaligned data in raw NumPy arrays.
Handling Missing Data
Real-world data is rarely clean and homogeneous; in particular, many interesting datasets will have some amount of data missing, and different data sources may indicate missing data in different ways. In this section we will discuss how Pandas chooses to represent missing data, and demonstrate some built-in Pandas tools for handling missing data in Python. Here and throughout the book, we'll refer to missing data in general as null, NaN, or NA values.
Missing Data in Pandas
The way in which Pandas handles missing values is constrained by its reliance on the NumPy package, which does not have a built-in notion of NA values for non-floating-point data types.
With these constraints in mind, Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point NaN value, and the Python None object. This choice has some side effects, as we will see, but in practice ends up being a good compromise in most cases of interest.
NONE: PYTHONIC MISSING DATA
The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code. Because None is a Python object, it cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type ‘object’ (i.e., arrays of Python objects):
vals1 = np.array([1, None, 3, 4])
vals1
array([1, None, 3, 4], dtype=object)
This dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects. While this kind of object array is useful for some purposes, any operations on the data will be done at the Python level, with much more overhead than the typically fast operations seen for arrays with native types:
for dtype in ["object","int"]:
print("dtype =",dtype)
%timeit np.arange(1E6,dtype=dtype).sum()
print()
dtype = object
188 ms ± 5.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
dtype = int
6.69 ms ± 85.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The use of Python objects in an array also means that if you perform aggregations like sum() or min() across an array with a None value, you will generally get an error:
vals1.sum()
#This reflects the fact that addition between an integer and None is undefined.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-82-0b91035bdd89> in <module>()
----> 1 vals1.sum()
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
NAN: MISSING NUMERICAL DATA
The other missing data representation, NaN (acronym for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
dtype('float64')
Notice that NumPy chose a native floating-point type for this array: unlike the object array from before, this array supports fast operations pushed into compiled code. Be aware that NaN is a bit like a data virus; it infects any other object it touches. Regardless of the operation, the result of arithmetic with NaN will be another NaN:
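A minimal illustration of this propagation:
1 + np.nan
nan
0 * np.nan
nan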
Note that this means that aggregates over the values are well defined (i.e., they don’t result in an error) but not always useful:
vals2.sum(), vals2.min(), vals2.max()
(nan, nan, nan)
NumPy does provide some special aggregations that will ignore these missing values:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
(8.0, 1.0, 4.0)
Keep in mind that NaN is specifically a floating-point value; there is no equivalent NaN value for integers, strings, or other types.
NAN AND NONE IN PANDAS
NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:
pd.Series([1, np.nan, 2, None])
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64
For types that don't have an available sentinel value, Pandas automatically type-casts when NA values are present. For example, if we set a value in an integer array to None, it is automatically upcast to a floating-point type:
x = pd.Series(range(2), dtype=int)
x[0] = None
x
Notice that in addition to casting the integer array to floating point, Pandas automatically converts the None to a NaN value.
Keep in mind that in Pandas, string data is always stored with an object dtype.
Operating on Null Values
As we have seen, Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures. They are:
isnull()
Generate a Boolean mask indicating missing values
notnull()
Opposite of isnull()
dropna()
Return a filtered version of the data
fillna()
Return a copy of the data with missing values filled or imputed
DETECTING NULL VALUES
Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Either one will return a Boolean mask over the data. For example:
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()
0    False
1     True
2    False
3     True
dtype: bool
As mentioned in "Data Indexing and Selection", Boolean masks can be used directly as a Series or DataFrame index:
data[data.notnull()]
0        1
2    hello
dtype: object
The isnull() and notnull() methods produce similar Boolean results for DataFrames.
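For instance, here is a small sketch (df_null is a hypothetical name introduced just for this illustration):
df_null = pd.DataFrame({'x': [1, np.nan], 'y': [None, 'hello']})
df_null.isnull()
x | y | |
---|---|---|
0 | False | True |
1 | True | False |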
DROPPING NULL VALUES
In addition to the masking used before, there are the convenience methods, dropna() (which removes NA values) and fillna() (which fills in NA values). For a Series, the result is straightforward:
data.dropna()
0        1
2    hello
dtype: object
#For a DataFrame, there are more options. Consider the following DataFrame:
df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 2 |
1 | 2.0 | 3.0 | 5 |
2 | NaN | 4.0 | 6 |
We cannot drop single values from a DataFrame; we can only drop full rows or full columns. Depending on the application, you might want one or the other, so dropna() gives a number of options for a DataFrame.
By default, dropna() will drop all rows in which any null value is present:
df.dropna()
0 | 1 | 2 | |
---|---|---|---|
1 | 2.0 | 3.0 | 5 |
Alternatively, you can drop NA values along a different axis; axis=1 drops all columns containing a null value:
df.dropna(axis=1)
2 | |
---|---|
0 | 2 |
1 | 5 |
2 | 6 |
But this drops some good data as well; you might rather be interested in dropping rows or columns with all NA values, or a majority of NA values. This can be specified through the how or thresh parameters, which allow fine control of the number of nulls to allow through.
The default is how='any', such that any row or column (depending on the axis keyword) containing a null value will be dropped. You can also specify how='all', which will only drop rows/columns that are all null values:
df[3] = np.nan
df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 2 | NaN |
1 | 2.0 | 3.0 | 5 | NaN |
2 | NaN | 4.0 | 6 | NaN |
df.dropna(axis=1, how='all')
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 2 |
1 | 2.0 | 3.0 | 5 |
2 | NaN | 4.0 | 6 |
For finer-grained control, the thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept:
df.dropna(axis='rows', thresh=3)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
1 | 2.0 | 3.0 | 5 | NaN |
Here the first and last rows have been dropped because they each contain only two non-null values.
FILLING NULL VALUES
Sometimes rather than dropping NA values, you would prefer to replace them with a valid value. Because this is such a common operation, Pandas provides the fillna() method, which returns a copy of the data with the null values replaced.
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64
We can fill NA entries with a single value, such as zero:
data.fillna(0)
a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
# forward-fill
data.fillna(method='ffill')
a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64
# backward-fill
data.fillna(method='bfill')
a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
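A side note: newer versions of Pandas deprecate the method keyword of fillna() in favor of dedicated methods, so under that assumption the equivalents are:
data.ffill()   # forward-fill
data.bfill()   # backward-fill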
For a DataFrame the options are similar, but we can also specify an axis along which the fills take place. Recall the df from the ufuncs section:
df
Q | R | S | T | |
---|---|---|---|---|
0 | 3 | 8 | 2 | 4 |
1 | 2 | 6 | 4 | 8 |
2 | 6 | 1 | 3 | 8 |
df.fillna(method='ffill', axis=1)
Q | R | S | T | |
---|---|---|---|---|
0 | 3 | 8 | 2 | 4 |
1 | 2 | 6 | 4 | 8 |
2 | 6 | 1 | 3 | 8 |
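Because this df contains no missing values, the fill leaves it unchanged. A more telling sketch reuses the DataFrame with NaNs from the dropna example (rebuilt here under the hypothetical name df2):
df2 = pd.DataFrame([[1, np.nan, 2],
                    [2, 3, 5],
                    [np.nan, 4, 6]])
df2.fillna(method='ffill', axis=1)
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | 1.0 | 2.0 |
1 | 2.0 | 3.0 | 5.0 |
2 | NaN | 4.0 | 6.0 |
Notice that the NaN in the last row remains: there is no value before it along that axis to propagate forward.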
Hierarchical Indexing
This section covers the essentials of multi-indexing (MultiIndex) and how to operate on multiply indexed data; for a complete treatment, refer to the Python Data Science Handbook.
pop = pd.DataFrame([['California',2000,33871648],
['California',2010,37253956],
['New York',2000,18976457],
['New York',2010,19378102],
['Texas',2000,20851820],
['Texas',2010,25145561]])
pop
0 | 1 | 2 | |
---|---|---|---|
0 | California | 2000 | 33871648 |
1 | California | 2010 | 37253956 |
2 | New York | 2000 | 18976457 |
3 | New York | 2010 | 19378102 |
4 | Texas | 2000 | 20851820 |
5 | Texas | 2010 | 25145561 |
newcols = {
0:'state',
1:'year',
2:'population'
}
pop.rename(columns=newcols,inplace=True)
pop
state | year | population | |
---|---|---|---|
0 | California | 2000 | 33871648 |
1 | California | 2010 | 37253956 |
2 | New York | 2000 | 18976457 |
3 | New York | 2010 | 19378102 |
4 | Texas | 2000 | 20851820 |
5 | Texas | 2010 | 25145561 |
Often when you are working with data in the real world, the raw input data looks like this and it’s useful to build a MultiIndex from the column values. This can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame:
pop.set_index('state', inplace=True)
pop
year | population | |
---|---|---|
state | ||
California | 2000 | 33871648 |
California | 2010 | 37253956 |
New York | 2000 | 18976457 |
New York | 2010 | 19378102 |
Texas | 2000 | 20851820 |
Texas | 2010 | 25145561 |
pop.reset_index(inplace=True)
pop.set_index(['state', 'year'])
population | ||
---|---|---|
state | year | |
California | 2000 | 33871648 |
2010 | 37253956 | |
New York | 2000 | 18976457 |
2010 | 19378102 | |
Texas | 2000 | 20851820 |
2010 | 25145561 |
In practice, I find this type of reindexing to be one of the more useful patterns when I encounter real-world datasets.
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
names=['subject', 'type'])
data = np.round(np.random.randn(4, 6), 1)
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
subject | Bob | Guido | Sue | ||||
---|---|---|---|---|---|---|---|
type | HR | Temp | HR | Temp | HR | Temp | |
year | visit | ||||||
2013 | 1 | -0.5 | 2.4 | -1.8 | -0.5 | -0.7 | -0.0 |
2 | 1.5 | 1.2 | 1.6 | 1.8 | 1.5 | 0.3 | |
2014 | 1 | 1.3 | -0.0 | 0.1 | -1.8 | -0.7 | 1.6 |
2 | -0.7 | -0.5 | 0.2 | -0.8 | 1.1 | 0.3 |
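With the MultiIndex in place, we can use the familiar indexing syntax on this fundamentally four-dimensional data. For example (a quick taste; the data above is random, so regenerated values will differ), we can look up Guido's heart-rate measurements with a tuple of column labels:
health_data['Guido', 'HR']
year  visit
2013  1       -1.8
      2        1.6
2014  1        0.1
      2        0.2
Name: (Guido, HR), dtype: float64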
Data Aggregations on Multi-Indices
We've previously seen that Pandas has built-in data aggregation methods, such as mean(), sum(), and max(). For hierarchically indexed data, these can be passed a level parameter that controls which subset of the data the aggregate is computed on. For example, let's return to our health data:
health_data
subject | Bob | Guido | Sue | ||||
---|---|---|---|---|---|---|---|
type | HR | Temp | HR | Temp | HR | Temp | |
year | visit | ||||||
2013 | 1 | -0.5 | 2.4 | -1.8 | -0.5 | -0.7 | -0.0 |
2 | 1.5 | 1.2 | 1.6 | 1.8 | 1.5 | 0.3 | |
2014 | 1 | 1.3 | -0.0 | 0.1 | -1.8 | -0.7 | 1.6 |
2 | -0.7 | -0.5 | 0.2 | -0.8 | 1.1 | 0.3 |
Perhaps we’d like to average out the measurements in the two visits each year. We can do this by naming the index level we’d like to explore, in this case the year:
data_mean = health_data.mean(level='year')
data_mean
subject | Bob | Guido | Sue | |||
---|---|---|---|---|---|---|
type | HR | Temp | HR | Temp | HR | Temp |
year | ||||||
2013 | 0.5 | 1.80 | -0.10 | 0.65 | 0.4 | 0.15 |
2014 | 0.3 | -0.25 | 0.15 | -1.30 | 0.2 | 0.95 |
data_mean.mean(axis=1, level='type')
type | HR | Temp |
---|---|---|
year | ||
2013 | 0.266667 | 0.866667 |
2014 | 0.216667 | -0.200000 |
Thus in two lines, we’ve been able to find the average heart rate and temperature measured among all subjects in all visits each year. This syntax is actually a shortcut to the GroupBy functionality, which we will discuss in “Aggregation and Grouping”. While this is a toy example, many real-world datasets have similar hierarchical structure.
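As an aside: newer versions of Pandas deprecate the level keyword of the aggregation methods in favor of an explicit GroupBy; under that assumption, one equivalent spelling of the two calls above is:
health_data.groupby(level='year').mean()          # average over rows by year
data_mean.groupby(level='type', axis=1).mean()    # average over columns by type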
Combining Datasets: Concat and Append
Some of the most interesting studies of data come from combining different data sources. These operations can involve anything from very straightforward concatenation of two different datasets, to more complicated database-style joins and merges that correctly handle any overlaps between the datasets. Series and DataFrames are built with this type of operation in mind, and Pandas includes functions and methods that make this sort of data wrangling fast and straightforward.
Here we'll take a look at simple concatenation of Series and DataFrames with the pd.concat function; later we'll dive into more sophisticated in-memory merges and joins implemented in Pandas.
def make_df(cols, ind):
"""Quickly make a DataFrame"""
data = {c: [str(c) + str(i) for i in ind]
for c in cols}
return pd.DataFrame(data, ind)
# example DataFrame
make_df('ABC', range(3))
A | B | C | |
---|---|---|---|
0 | A0 | B0 | C0 |
1 | A1 | B1 | C1 |
2 | A2 | B2 | C2 |
Simple Concatenation with pd.concat
pd.concat() can be used for a simple concatenation of Series or DataFrame objects. Its signature in Pandas v0.18 is:
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
1 A
2 B
3 C
4 D
5 E
6 F
dtype: object
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
print(df1); print(df2); print(pd.concat([df1, df2]))
A B
1 A1 B1
2 A2 B2
A B
3 A3 B3
4 A4 B4
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
print(df3); print(df4); print(pd.concat([df3, df4], axis='columns'))
A B
0 A0 B0
1 A1 B1
C D
0 C0 D0
1 C1 D1
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
DUPLICATE INDICES
One important difference between np.concatenate and pd.concat is that Pandas concatenation preserves indices, even if the result will have duplicate indices! Consider this simple example:
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index # make duplicate indices!
print(x); print(y); print(pd.concat([x, y]))
A B
0 A0 B0
1 A1 B1
A B
0 A2 B2
1 A3 B3
A B
0 A0 B0
1 A1 B1
0 A2 B2
1 A3 B3
Notice the repeated indices in the result. While this is valid within DataFrames, the outcome is often undesirable. pd.concat() gives us a few ways to handle it.
#Catching the repeats as an error
try:
pd.concat([x, y], verify_integrity=True)
except ValueError as e:
print("ValueError:", e)
ValueError: Indexes have overlapping values: [0, 1]
#Ignoring the index
print(x); print(y); print(pd.concat([x, y], ignore_index=True))
A B
0 A0 B0
1 A1 B1
A B
0 A2 B2
1 A3 B3
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
#Adding MultiIndex keys
print(x); print(y); print(pd.concat([x, y], keys=['x', 'y']))
A B
0 A0 B0
1 A1 B1
A B
0 A2 B2
1 A3 B3
A B
x 0 A0 B0
1 A1 B1
y 0 A2 B2
1 A3 B3
CONCATENATION WITH JOINS
In the simple examples we just looked at, we were mainly concatenating DataFrames with shared column names. In practice, data from different sources might have different sets of column names, and pd.concat offers several options in this case. Consider the concatenation of the following two DataFrames, which have some (but not all!) columns in common:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
print(df5); print(df6); print(pd.concat([df5, df6]))
A B C
1 A1 B1 C1
2 A2 B2 C2
B C D
3 B3 C3 D3
4 B4 C4 D4
A B C D
1 A1 B1 C1 NaN
2 A2 B2 C2 NaN
3 NaN B3 C3 D3
4 NaN B4 C4 D4
By default, the entries for which no data is available are filled with NA values. To change this, we can specify one of several options for the join and join_axes parameters of the concatenate function. By default, the join is a union of the input columns (join='outer'), but we can change this to an intersection of the columns using join='inner':
print(df5); print(df6);
print(pd.concat([df5, df6], join='inner'))
A B C
1 A1 B1 C1
2 A2 B2 C2
B C D
3 B3 C3 D3
4 B4 C4 D4
B C
1 B1 C1
2 B2 C2
3 B3 C3
4 B4 C4
Another option is to directly specify the index of the remaining columns using the join_axes argument, which takes a list of index objects. Here we'll specify that the returned columns should be the same as those of the first input:
print(df5); print(df6);
print(pd.concat([df5, df6], join_axes=[df5.columns]))
A B C
1 A1 B1 C1
2 A2 B2 C2
B C D
3 B3 C3 D3
4 B4 C4 D4
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
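Note that join_axes was deprecated and later removed from pd.concat in modern versions of Pandas; assuming you want df5's columns as above, an equivalent modern spelling is a reindex after the concat:
pd.concat([df5, df6]).reindex(columns=df5.columns)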
THE APPEND() METHOD
Because direct array concatenation is so common, Series and DataFrame objects have an append method that can accomplish the same thing in fewer keystrokes. For example, rather than calling pd.concat([df1, df2]), you can simply call df1.append(df2):
print(df1); print(df2); print(df1.append(df2))
A B
1 A1 B1
2 A2 B2
A B
3 A3 B3
4 A4 B4
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4
Keep in mind that unlike the append() and extend() methods of Python lists, the append() method in Pandas does not modify the original object—instead, it creates a new object with the combined data. It also is not a very efficient method, because it involves creation of a new index and data buffer. Thus, if you plan to do multiple append operations, it is generally better to build a list of DataFrames and pass them all at once to the concat() function.
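A minimal sketch of that recommended pattern, reusing the make_df helper defined earlier (and worth noting: DataFrame.append was removed entirely in Pandas 2.0, which makes this pattern the standard one):
dfs = [make_df('AB', [1, 2]), make_df('AB', [3, 4]), make_df('AB', [5, 6])]
result = pd.concat(dfs)   # one concatenation instead of repeated copies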
Combining Datasets: Merge and Join
One essential feature offered by Pandas is its high-performance, in-memory join and merge operations. If you have ever worked with databases, you should be familiar with this type of data interaction. The main interface for this is the pd.merge function, and we’ll see a few examples of how this can work in practice.
Relational Algebra
The behavior implemented in pd.merge() is a subset of what is known as relational algebra, which is a formal set of rules for manipulating relational data, and forms the conceptual foundation of operations available in most databases. The strength of the relational algebra approach is that it proposes several primitive operations, which become the building blocks of more complicated operations on any dataset. With this lexicon of fundamental operations implemented efficiently in a database or other program, a wide range of fairly complicated composite operations can be performed.
Pandas implements several of these fundamental building blocks in the pd.merge() function and the related join() method of Series and DataFrames. As we will see, these let you efficiently link data from different sources.
Categories of Joins
The pd.merge() function implements a number of types of joins: the one-to-one, many-to-one, and many-to-many joins. All three types of joins are accessed via an identical call to the pd.merge() interface; the type of join performed depends on the form of the input data. Here we will show simple examples of the three types of merges, and discuss detailed options further below.
ONE-TO-ONE JOINS
Perhaps the simplest type of merge expression is the one-to-one join, which is in many ways very similar to the column-wise concatenation seen in “Combining Datasets: Concat and Append”. As a concrete example, consider the following two DataFrames, which contain information on several employees in a company:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
print(df1); print(df2)
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR
employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014
df3 = pd.merge(df1, df2)
df3
employee | group | hire_date | |
---|---|---|---|
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
3 | Sue | HR | 2014 |
The pd.merge() function recognizes that each DataFrame has an “employee” column, and automatically joins using this column as a key. The result of the merge is a new DataFrame that combines the information from the two inputs. Notice that the order of entries in each column is not necessarily maintained: in this case, the order of the “employee” column differs between df1 and df2, and the pd.merge() function correctly accounts for this. Additionally, keep in mind that the merge in general discards the index, except in the special case of merges by index (see “The left_index and right_index keywords”).
MANY-TO-ONE JOINS
Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate. Consider the following example of a many-to-one join:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
print(df3); print(df4); print(pd.merge(df3, df4))
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
group supervisor
0 Accounting Carly
1 Engineering Guido
2 HR Steve
employee group hire_date supervisor
0 Bob Accounting 2008 Carly
1 Jake Engineering 2012 Guido
2 Lisa Engineering 2004 Guido
3 Sue HR 2014 Steve
The resulting DataFrame has an additional column with the “supervisor” information, where the information is repeated in one or more locations as required by the inputs.
MANY-TO-MANY JOINS
Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge. This will be perhaps most clear with a concrete example. Consider the following, where we have a DataFrame showing one or more skills associated with a particular group.
By performing a many-to-many join, we can recover the skills associated with any individual person:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
'Engineering', 'Engineering', 'HR', 'HR'],
'skills': ['math', 'spreadsheets', 'coding', 'linux',
'spreadsheets', 'organization']})
print(df1); print(df5); print(pd.merge(df1, df5))
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR
group skills
0 Accounting math
1 Accounting spreadsheets
2 Engineering coding
3 Engineering linux
4 HR spreadsheets
5 HR organization
employee group skills
0 Bob Accounting math
1 Bob Accounting spreadsheets
2 Jake Engineering coding
3 Jake Engineering linux
4 Lisa Engineering coding
5 Lisa Engineering linux
6 Sue HR spreadsheets
7 Sue HR organization
These three types of joins can be used with other Pandas tools to implement a wide array of functionality. But in practice, datasets are rarely as clean as the one we’re working with here. In the following section, we’ll consider some of the options provided by pd.merge() that enable you to tune how the join operations work.
Specification of the Merge Key
We’ve already seen the default behavior of pd.merge(): it looks for one or more matching column names between the two inputs, and uses this as the key. However, often the column names will not match so nicely, and pd.merge() provides a variety of options for handling this.
THE ON KEYWORD
Most simply, you can explicitly specify the name of the key column using the on keyword, which takes a column name or a list of column names:
print(df1); print(df2); print(pd.merge(df1, df2, on='employee'))
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR
employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
This option works only if both the left and right DataFrames have the specified column name.
THE LEFT_ON AND RIGHT_ON KEYWORDS
At times you may wish to merge two datasets with different column names; for example, we may have a dataset in which the employee name is labeled as “name” rather than “employee”. In this case, we can use the left_on and right_on keywords to specify the two column names:
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'salary': [70000, 80000, 120000, 90000]})
print(df1); print(df3);
print(pd.merge(df1, df3, left_on="employee", right_on="name"))
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR
name salary
0 Bob 70000
1 Jake 80000
2 Lisa 120000
3 Sue 90000
employee group name salary
0 Bob Accounting Bob 70000
1 Jake Engineering Jake 80000
2 Lisa Engineering Lisa 120000
3 Sue HR Sue 90000
The result has a redundant column that we can drop if desired—for example, by using the drop() method of DataFrames:
pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)
employee | group | salary | |
---|---|---|---|
0 | Bob | Accounting | 70000 |
1 | Jake | Engineering | 80000 |
2 | Lisa | Engineering | 120000 |
3 | Sue | HR | 90000 |
THE LEFT_INDEX AND RIGHT_INDEX KEYWORDS
Sometimes, rather than merging on a column, you would instead like to merge on an index. For example, your data might look like this:
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
print(df1a); print(df2a)
group
employee
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR
hire_date
employee
Lisa 2004
Bob 2008
Jake 2012
Sue 2014
You can use the index as the key for merging by specifying the left_index and/or right_index flags in pd.merge():
print(df1a); print(df2a);
print(pd.merge(df1a, df2a, left_index=True, right_index=True))
group
employee
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR
hire_date
employee
Lisa 2004
Bob 2008
Jake 2012
Sue 2014
group hire_date
employee
Bob Accounting 2008
Jake Engineering 2012
Lisa Engineering 2004
Sue HR 2014
For convenience, DataFrames implement the join() method, which performs a merge that defaults to joining on indices:
print(df1a); print(df2a); print(df1a.join(df2a))
group
employee
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR
hire_date
employee
Lisa 2004
Bob 2008
Jake 2012
Sue 2014
group hire_date
employee
Bob Accounting 2008
Jake Engineering 2012
Lisa Engineering 2004
Sue HR 2014
If you’d like to mix indices and columns, you can combine left_index with right_on or left_on with right_index to get the desired behavior:
print(df1a); print(df3);
print(pd.merge(df1a, df3, left_index=True, right_on='name'))
group
employee
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR
name salary
0 Bob 70000
1 Jake 80000
2 Lisa 120000
3 Sue 90000
group name salary
0 Accounting Bob 70000
1 Engineering Jake 80000
2 Engineering Lisa 120000
3 HR Sue 90000
Specifying Set Arithmetic for Joins
In all the preceding examples we have glossed over one important consideration in performing a join: the type of set arithmetic used in the join. This comes up when a value appears in one key column but not the other. Consider this example:
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
'food': ['fish', 'beans', 'bread']},
columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
'drink': ['wine', 'beer']},
columns=['name', 'drink'])
print(df6); print(df7); print(pd.merge(df6, df7))
name food
0 Peter fish
1 Paul beans
2 Mary bread
name drink
0 Mary wine
1 Joseph beer
name food drink
0 Mary bread wine
Here we have merged two datasets that have only a single "name" entry in common: Mary. By default, the result contains the intersection of the two sets of inputs; this is what is known as an inner join. We can specify this explicitly using the how keyword, which defaults to 'inner':
pd.merge(df6, df7, how='inner')
name | food | drink | |
---|---|---|---|
0 | Mary | bread | wine |
Other options for the how keyword are 'outer', 'left', and 'right'. An outer join returns a join over the union of the input columns, and fills in all missing values with NAs:
print(df6); print(df7); print(pd.merge(df6, df7, how='outer'))
name food
0 Peter fish
1 Paul beans
2 Mary bread
name drink
0 Mary wine
1 Joseph beer
name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine
3 Joseph NaN beer
The left join and right join return joins over the left entries and right entries, respectively. For example:
print(df6); print(df7); print(pd.merge(df6, df7, how='left'))
name food
0 Peter fish
1 Paul beans
2 Mary bread
name drink
0 Mary wine
1 Joseph beer
name food drink
0 Peter fish NaN
1 Paul beans NaN
2 Mary bread wine
The output rows now correspond to the entries in the left input. Using how='right' works in a similar manner:
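For completeness, a quick sketch of the right join on the same inputs (the output follows from df6 and df7 above):
pd.merge(df6, df7, how='right')
name food drink
0 Mary bread wine
1 Joseph NaN beer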
All of these options can be applied straightforwardly to any of the preceding join types.
Overlapping Column Names: The suffixes Keyword
Finally, you may end up in a case where your two input DataFrames have conflicting column names. Consider this example:
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [3, 1, 4, 2]})
print(df8); print(df9); print(pd.merge(df8, df9, on="name"))
name rank
0 Bob 1
1 Jake 2
2 Lisa 3
3 Sue 4
name rank
0 Bob 3
1 Jake 1
2 Lisa 4
3 Sue 2
name rank_x rank_y
0 Bob 1 3
1 Jake 2 1
2 Lisa 3 4
3 Sue 4 2
Because the output would have two conflicting column names, the merge function automatically appends a suffix _x or _y to make the output columns unique. If these defaults are inappropriate, it is possible to specify a custom suffix using the suffixes keyword:
print(df8); print(df9);
print(pd.merge(df8, df9, on="name", suffixes=["_L", "_R"]))
name rank
0 Bob 1
1 Jake 2
2 Lisa 3
3 Sue 4
name rank
0 Bob 3
1 Jake 1
2 Lisa 4
3 Sue 2
name rank_L rank_R
0 Bob 1 3
1 Jake 2 1
2 Lisa 3 4
3 Sue 4 2
These suffixes work in any of the possible join patterns, and work also if there are multiple overlapping columns.
For more information on these patterns, see “Aggregation and Grouping”, where we dive a bit deeper into relational algebra. Also see the “Merge, Join, and Concatenate” section of the Pandas documentation for further discussion of these topics.
Example: US States Data
Merge and join operations come up most often when one is combining data from different sources. Here we will consider an example of some data about US states and their populations. The data files can be found at http://github.com/jakevdp/data-USstates/:
https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html
Aggregation and Grouping
https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html
Pivot Tables
https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html
Vectorized String Operations
https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html
Working with Time Series
https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html
High-Performance Pandas: eval() and query()
https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html
Sources:
Python Data Science Handbook - https://jakevdp.github.io/PythonDataScienceHandbook/
Please note these notes are for reference. For a detailed explanation of the methods and a complete understanding, buy the book here: http://shop.oreilly.com/product/0636920034919.do