Housing Prices - Regression - Kaggle
Predict sales prices and practice feature engineering, RFs, and gradient boosting
Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
#main modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0)
import seaborn as sns
from scipy import stats
from scipy.stats import norm
Load the csv data into dataframes
#read the data from csv files into dataframes
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
#print out no of rows and columns in Train and test data
print('Train data:-\n Columns: {} Rows: {}'.format(train_df.shape[1],train_df.shape[0]))
print('Test data:-\n Columns: {} Rows: {}'.format(test_df.shape[1],test_df.shape[0]))
Train data:-
Columns: 81 Rows: 1460
Test data:-
Columns: 80 Rows: 1459
train_df.head()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
test_df.head()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |
5 rows × 80 columns
#combine both train and test data
#df=train_df.append(test_df,ignore_index=True)
df=pd.concat([train_df,test_df])
df.shape
(2919, 81)
Get a feel of the data
##only numeric columns
df.describe()
1stFlrSF | 2ndFlrSF | 3SsnPorch | BedroomAbvGr | BsmtFinSF1 | BsmtFinSF2 | BsmtFullBath | BsmtHalfBath | BsmtUnfSF | EnclosedPorch | ... | OverallQual | PoolArea | SalePrice | ScreenPorch | TotRmsAbvGrd | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | YrSold | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2919.000000 | 2919.000000 | 2919.000000 | 2919.000000 | 2918.000000 | 2918.000000 | 2917.000000 | 2917.000000 | 2918.000000 | 2919.000000 | ... | 2919.000000 | 2919.000000 | 1460.000000 | 2919.000000 | 2919.000000 | 2918.000000 | 2919.000000 | 2919.000000 | 2919.000000 | 2919.000000 |
mean | 1159.581706 | 336.483727 | 2.602261 | 2.860226 | 441.423235 | 49.582248 | 0.429894 | 0.061364 | 560.772104 | 23.098321 | ... | 6.089072 | 2.251799 | 180921.195890 | 16.062350 | 6.451524 | 1051.777587 | 93.709832 | 1971.312778 | 1984.264474 | 2007.792737 |
std | 392.362079 | 428.701456 | 25.188169 | 0.822693 | 455.610826 | 169.205611 | 0.524736 | 0.245687 | 439.543659 | 64.244246 | ... | 1.409947 | 35.663946 | 79442.502883 | 56.184365 | 1.569379 | 440.766258 | 126.526589 | 30.291442 | 20.894344 | 1.314964 |
min | 334.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 0.000000 | 34900.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 1872.000000 | 1950.000000 | 2006.000000 |
25% | 876.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 220.000000 | 0.000000 | ... | 5.000000 | 0.000000 | 129975.000000 | 0.000000 | 5.000000 | 793.000000 | 0.000000 | 1953.500000 | 1965.000000 | 2007.000000 |
50% | 1082.000000 | 0.000000 | 0.000000 | 3.000000 | 368.500000 | 0.000000 | 0.000000 | 0.000000 | 467.000000 | 0.000000 | ... | 6.000000 | 0.000000 | 163000.000000 | 0.000000 | 6.000000 | 989.500000 | 0.000000 | 1973.000000 | 1993.000000 | 2008.000000 |
75% | 1387.500000 | 704.000000 | 0.000000 | 3.000000 | 733.000000 | 0.000000 | 1.000000 | 0.000000 | 805.500000 | 0.000000 | ... | 7.000000 | 0.000000 | 214000.000000 | 0.000000 | 7.000000 | 1302.000000 | 168.000000 | 2001.000000 | 2004.000000 | 2009.000000 |
max | 5095.000000 | 2065.000000 | 508.000000 | 8.000000 | 5644.000000 | 1526.000000 | 3.000000 | 2.000000 | 2336.000000 | 1012.000000 | ... | 10.000000 | 800.000000 | 755000.000000 | 576.000000 | 15.000000 | 6110.000000 | 1424.000000 | 2010.000000 | 2010.000000 | 2010.000000 |
8 rows × 38 columns
#train_df.info()
def variable_dtype_plot(df):
    '''bar plot indicating the count of data types
    present in the dataframe'''
    df_dtype = pd.DataFrame(df.dtypes.value_counts()).reset_index().rename(columns={"index":"datatype",0:"count"})
    fig,ax = plt.subplots()
    fig.set_size_inches(20,5)
    sns.barplot(data=df_dtype,x="datatype",y="count",ax=ax,color="#34495e")
    ax.set(xlabel='Variable Type', ylabel='Count',title="Variables Count Across Datatype")
    plt.show()
variable_dtype_plot(df)
#check no of each column types
df.dtypes.value_counts()
object 43
int64 26
float64 12
dtype: int64
#get numerical and categorical columns of dataframe into lists.
categorical = list(df.columns[df.dtypes=="object"])
numerical = list(df.columns[df.dtypes!="object"])
#use describe method on the categorical columns
df[categorical].describe()
Alley | BldgType | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinType2 | BsmtQual | CentralAir | Condition1 | Condition2 | ... | MiscFeature | Neighborhood | PavedDrive | PoolQC | RoofMatl | RoofStyle | SaleCondition | SaleType | Street | Utilities | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 198 | 2919 | 2837 | 2837 | 2840 | 2839 | 2838 | 2919 | 2919 | 2919 | ... | 105 | 2919 | 2919 | 10 | 2919 | 2919 | 2919 | 2918 | 2919 | 2917 |
unique | 2 | 5 | 4 | 4 | 6 | 6 | 4 | 2 | 9 | 8 | ... | 4 | 25 | 3 | 3 | 8 | 6 | 6 | 9 | 2 | 2 |
top | Grvl | 1Fam | TA | No | Unf | Unf | TA | Y | Norm | Norm | ... | Shed | NAmes | Y | Gd | CompShg | Gable | Normal | WD | Pave | AllPub |
freq | 120 | 2425 | 2606 | 1904 | 851 | 2493 | 1283 | 2723 | 2511 | 2889 | ... | 95 | 443 | 2641 | 4 | 2876 | 2310 | 2402 | 2525 | 2907 | 2916 |
4 rows × 43 columns
Exploratory Data Analysis
Exploratory data analysis means analytically exploring the data to gain insights for subsequent processing and modeling. Usually we load the data with Pandas and make some visualizations to understand it.
Inspect the distribution of the target variable. Depending on the scoring metric used, an imbalanced target distribution can hurt the model's performance.
For numerical variables, use box plots and scatter plots to inspect their distributions and check for outliers.
For classification tasks, plot the data with points colored according to their labels; this can help with feature engineering.
Make pairwise distribution plots and examine their correlations (a small correlation-heatmap sketch follows the link below).
https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations
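As a quick illustration of the last point, a correlation heatmap of the numeric training features can be drawn in a couple of lines; this is a sketch only, using the train_df and the imports already loaded above.
##hedged sketch: pairwise correlations of the numeric training features
corr = train_df.select_dtypes(exclude=['object']).corr()
fig, ax = plt.subplots(figsize=(12,10))
sns.heatmap(corr, cmap='RdBu_r', center=0, ax=ax)
ax.set(title='Pairwise correlations of numeric features')
plt.show()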
#univariate analysis
##plotting histograms to find distribution of continuous variables in dataset
#num = [f for f in df.columns if df.dtypes[f] != 'object']
numdf=pd.melt(train_df,value_vars=numerical)
numgrid=sns.FacetGrid(numdf,col='variable',col_wrap=5,sharex=False,sharey=False)
numgrid=numgrid.map(sns.distplot,'value')
numgrid
<seaborn.axisgrid.FacetGrid at 0x210818eb080>
##count plot of categorical attributes
#col = [f for f in train_df.columns if train_df.dtypes[f] == 'object']
coldf=pd.melt(train_df,value_vars=categorical)
colgrid=sns.FacetGrid(coldf,col='variable',col_wrap=4,size=6,aspect=.6,sharex=False,sharey=False)
plt.xticks(rotation=90)
colgrid = colgrid.map(sns.countplot,'value')
plt.tight_layout()
plt.subplots_adjust(hspace=0.5, wspace=0.4)
###rotating x axis labels to prevent overlapping
for ax in colgrid.axes:
    for label in ax.get_xticklabels():
        label.set_rotation(90)
colgrid
<seaborn.axisgrid.FacetGrid at 0x210859a5320>
def boxplot(x,y,**kwargs):
    sns.boxplot(x=x,y=y)
    x = plt.xticks(rotation=90)
#cat = [f for f in train_df.columns if train_df.dtypes[f] == 'object']
p = pd.melt(train_df, id_vars='SalePrice', value_vars=categorical)
g = sns.FacetGrid (p, col='variable', col_wrap=2, sharex=False, sharey=False, size=5)
g = g.map(boxplot, 'value','SalePrice')
g
<seaborn.axisgrid.FacetGrid at 0x21083ae2898>
Data Preprocessing
Note that we combined the training and test sets into a single dataframe here (called df). The target variable (SalePrice) is only available for the 1460 training rows, which is why it appears below with 1459 missing values. The counts below are totals across the full dataset of 2919 rows, so for some features nearly all entries are missing, while for others it is just one or two. How we treat these missing values depends very much on why the data is missing, on the problem, and on the type of model we want to use.
Checking for null values and missing data
#sns.pairplot(train_df[['LotFrontage','LotArea']],diag_kind="hist")
#check the columns which are having any null values
#train_df.columns[train_df.isnull().any()]
print(' >>>>>Numerical variables having nulls and its counts<<<<<<<<')
num_nulls = df[numerical].isnull().sum()[df[numerical].isnull().sum()>0].sort_values(ascending=False)
num_nulls
>>>>>Numerical variables having nulls and its counts<<<<<<<<
SalePrice 1459
LotFrontage 486
GarageYrBlt 159
MasVnrArea 23
BsmtHalfBath 2
BsmtFullBath 2
TotalBsmtSF 1
GarageCars 1
GarageArea 1
BsmtUnfSF 1
BsmtFinSF2 1
BsmtFinSF1 1
dtype: int64
print(' >>>>>Categorical variables having nulls and its counts<<<<<<<<')
col_nulls = df[categorical].isnull().sum()[df[categorical].isnull().sum()>0].sort_values(ascending=False)
col_nulls
>>>>>Categorical variables having nulls and its counts<<<<<<<<
PoolQC 2909
MiscFeature 2814
Alley 2721
Fence 2348
FireplaceQu 1420
GarageQual 159
GarageFinish 159
GarageCond 159
GarageType 157
BsmtCond 82
BsmtExposure 82
BsmtQual 81
BsmtFinType2 80
BsmtFinType1 79
MasVnrType 24
MSZoning 4
Utilities 2
Functional 2
Electrical 1
Exterior1st 1
Exterior2nd 1
SaleType 1
KitchenQual 1
dtype: int64
Treating Null values
We could delete, impute, or leave the null values, depending on the algorithm used.
The univariate histogram of 'LotFrontage' shows a skewed distribution, so we use the median to impute its missing values.
df['LotFrontage'].fillna(df['LotFrontage'].median(),inplace=True)
We fill the 'GarageYrBlt' nulls with zero for now, since a missing value indicates that no garage was built.
Per the data description, missing values in the remaining numerical variables are also meaningful (no basement, no masonry veneer, and so on), so we fill those with 0 as well.
These "missing means absent" flags might be used to create new features during feature engineering, as in the small sketch below.
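For illustration only, here is one such derived flag; the name has_garage is hypothetical and the result is not used anywhere else in this notebook.
##illustrative sketch (not part of the pipeline): a binary "has a garage" indicator
has_garage = df['GarageYrBlt'].notnull().astype(int)
has_garage.value_counts()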
df['GarageYrBlt'].fillna(0, inplace=True)
df.MasVnrArea.fillna(0, inplace=True)
df.BsmtHalfBath.fillna(0, inplace=True)
df.BsmtFullBath.fillna(0, inplace=True)
df.GarageArea.fillna(0, inplace=True)
df.GarageCars.fillna(0, inplace=True)
df.TotalBsmtSF.fillna(0, inplace=True)
df.BsmtUnfSF.fillna(0, inplace=True)
df.BsmtFinSF2.fillna(0, inplace=True)
df.BsmtFinSF1.fillna(0, inplace=True)
Now that the missing values in the numerical variables are imputed, let's check the categorical nulls. In the case of PoolQC we could drop the whole column, given how many values are missing and how weak its dependence on the target appears; for now we impute 'NA', since PoolArea is also 0 for these rows.
df.loc[df['PoolQC'].isnull()==True,['PoolArea','PoolQC']].describe()
df.PoolQC.fillna('NA', inplace=True)
df.MiscFeature.fillna('NA', inplace=True)
df.Alley.fillna('NA', inplace=True)
df.Fence.fillna('NA', inplace=True)
df.FireplaceQu.fillna('NA', inplace=True)
df.GarageCond.fillna('NA', inplace=True)
df.GarageQual.fillna('NA', inplace=True)
df.GarageFinish.fillna('NA', inplace=True)
df.GarageType.fillna('NA', inplace=True)
df.BsmtExposure.fillna('NA', inplace=True)
df.BsmtCond.fillna('NA', inplace=True)
df.BsmtQual.fillna('NA', inplace=True)
df.BsmtFinType2.fillna('NA', inplace=True)
df.BsmtFinType1.fillna('NA', inplace=True)
df.MasVnrType.fillna('None', inplace=True)
df.Exterior2nd.fillna('None', inplace=True)
We could predict these values from other, similar rows in our data using a k-nearest-neighbours style approach, for example for the MSZoning variable. For now we leave MSZoning as is (the row drop is commented out below) and fill the other remaining categorical nulls with the column mode.
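One rough alternative, sketched here but deliberately left commented out (so the four MSZoning nulls remain in the summaries below), would be to fill MSZoning with the most frequent zoning class among houses in the same Neighborhood.
##alternative sketch, not applied in this notebook
#df['MSZoning'] = df.groupby('Neighborhood')['MSZoning'] \
#                   .transform(lambda s: s.fillna(s.mode()[0]))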
#df.dropna(subset =['MSZoning'],inplace=True)
df.Functional.fillna(df.Functional.mode()[0], inplace=True)
df.Utilities.fillna(df.Utilities.mode()[0], inplace=True)
df.Exterior1st.fillna(df.Exterior1st.mode()[0], inplace=True)
df.SaleType.fillna(df.SaleType.mode()[0], inplace=True)
df.KitchenQual.fillna(df.KitchenQual.mode()[0], inplace=True)
df.Electrical.fillna(df.Electrical.mode()[0], inplace=True)
For now we have handled the null values; we can always come back and try different imputation strategies.
Treating Outliers
Left out for now, since we plan to try tree-based models, which are less sensitive to outliers.
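For reference, a quick visual check of candidate outliers can be done on the training data; this is a sketch only and nothing is removed here.
##sketch: houses with a very large living area but a low sale price would stand out in this scatter plot
plt.scatter(train_df['GrLivArea'], train_df['SalePrice'], s=8)
plt.xlabel('GrLivArea')
plt.ylabel('SalePrice')
plt.show()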
Categorical variables encoding
http://pbpython.com/categorical-encoding.html Use either numeric encoding or one-hot encoding. As a guide, use one-hot encoding when there is no inherent order to the categories, and numeric (ordinal) encoding otherwise.
Among the variables with object dtype we can identify both nominal and ordinal features; we select these and encode them accordingly (a toy example of the two options follows below).
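A toy example of the distinction, using made-up data that is separate from our dataframe: ordered integer codes for an ordinal column versus one-hot dummies for a nominal one.
##toy illustration only
toy = pd.DataFrame({'Qual': ['Fa','TA','Gd','TA'], 'Zone': ['RL','RM','RL','FV']})
toy['Qual_code'] = pd.Categorical(toy['Qual'], categories=['Po','Fa','TA','Gd','Ex'], ordered=True).codes
print(pd.get_dummies(toy, columns=['Zone']))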
##selecting few cat features to be encoded.
cols_tobe_encoded= ['Alley','BsmtCond','BsmtExposure', 'BsmtFinType1','BsmtFinType2','BsmtQual', 'ExterCond','ExterQual' ,'FireplaceQu',
'Functional','GarageCond', 'GarageQual', 'HeatingQC','KitchenQual','LandSlope', 'PavedDrive', 'PoolQC', 'Street', 'Utilities' ]
df[cols_tobe_encoded].describe()
Alley | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinType2 | BsmtQual | ExterCond | ExterQual | FireplaceQu | Functional | GarageCond | GarageQual | HeatingQC | KitchenQual | LandSlope | PavedDrive | PoolQC | Street | Utilities | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 |
unique | 3 | 5 | 5 | 7 | 7 | 5 | 5 | 4 | 6 | 7 | 6 | 6 | 5 | 4 | 3 | 3 | 4 | 2 | 2 |
top | NA | TA | No | Unf | Unf | TA | TA | TA | NA | Typ | TA | TA | Ex | TA | Gtl | Y | NA | Pave | AllPub |
freq | 2721 | 2606 | 1904 | 851 | 2493 | 1283 | 2538 | 1798 | 1420 | 2719 | 2654 | 2604 | 1493 | 1493 | 2778 | 2641 | 2909 | 2907 | 2918 |
##categorical features having inherent ordering
ordinal_cat_cols= ['BsmtCond','BsmtExposure', 'BsmtFinType1','BsmtFinType2','BsmtQual', 'ExterCond','ExterQual' ,'FireplaceQu',
'Functional','GarageCond', 'GarageQual', 'HeatingQC','KitchenQual', 'PoolQC',]
##categorical features which are nominal
nominal_cat_cols= [i for i in cols_tobe_encoded if i not in ordinal_cat_cols]
print(nominal_cat_cols)
['Alley', 'LandSlope', 'PavedDrive', 'Street', 'Utilities']
[i for i in categorical if i not in cols_tobe_encoded]
['BldgType',
'CentralAir',
'Condition1',
'Condition2',
'Electrical',
'Exterior1st',
'Exterior2nd',
'Fence',
'Foundation',
'GarageFinish',
'GarageType',
'Heating',
'HouseStyle',
'LandContour',
'LotConfig',
'LotShape',
'MSZoning',
'MasVnrType',
'MiscFeature',
'Neighborhood',
'RoofMatl',
'RoofStyle',
'SaleCondition',
'SaleType']
Dummy variables tend to increase the multicollinearity in our dataset, so it is important to check the VIF later and remove highly collinear variables (a small sketch follows below). http://www.algosome.com/articles/dummy-variable-trap-regression.html
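A hedged sketch of that VIF check; it assumes statsmodels is installed and would be run on the final numeric design matrix (for example X_train[numerical], defined after the train/test split further below).
##sketch of the VIF check mentioned above
from statsmodels.stats.outliers_influence import variance_inflation_factor
def vif_table(X):
    '''variance inflation factor for each column of a numeric dataframe'''
    vals = X.values.astype(float)
    return pd.Series([variance_inflation_factor(vals, i) for i in range(vals.shape[1])],
                     index=X.columns).sort_values(ascending=False)
##e.g. later, after the split: vif_table(X_train[numerical]).head(10)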
df['BsmtCond']=df['BsmtCond'].astype('category',
categories=['NA','Po','Fa','TA','Gd','Ex'],
ordered=True)
#######
df['ExterQual']=df['ExterQual'].astype('category',
categories=['Po','Fa','TA','Gd','Ex'],
ordered=True)
#######
df['KitchenQual']=df['KitchenQual'].astype('category',
categories=['Po','Fa','TA','Gd','Ex'],
ordered=True)
#######
df['PoolQC']=df['PoolQC'].astype('category',
categories=['NA','Po','Fa','TA','Gd','Ex'],
ordered=True)
df['BsmtExposure']=df['BsmtExposure'].astype('category',
categories=['NA','No','Mn','Av','Gd'],
ordered=True)
df['BsmtFinType1']=df['BsmtFinType1'].astype('category',
categories=['NA','Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
ordered=True)
df['BsmtFinType2']=df['BsmtFinType2'].astype('category',
categories=['NA','Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
ordered=True)
df['BsmtQual']=df['BsmtQual'].astype('category',
categories=['NA','Po','Fa','TA','Gd','Ex'],
ordered=True)
df['ExterCond']=df['ExterCond'].astype('category',
categories=['Po','Fa','TA','Gd','Ex'],
ordered=True)
df['FireplaceQu']=df['FireplaceQu'].astype('category',
categories=['NA','Po','Fa','TA','Gd','Ex'],
ordered=True)
df['Functional']=df['Functional'].astype('category',
categories=['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'],
ordered=True)
df['GarageQual']=df['GarageQual'].astype('category',
categories=['NA','Po','Fa','TA','Gd','Ex'],
ordered=True)
df['GarageCond']=df['GarageCond'].astype('category',
categories=['NA','Po','Fa','TA','Gd','Ex'],
ordered=True)
df['HeatingQC']=df['HeatingQC'].astype('category',
categories=['Po','Fa','TA','Gd','Ex'],
ordered=True)
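A side note: the astype('category', categories=..., ordered=True) form used above was deprecated and later removed in newer pandas releases; on those versions the same conversions can be written with pd.CategoricalDtype, roughly as sketched here (kept commented out so the cells above remain the single source of the encoding).
##equivalent form for newer pandas versions (sketch, not re-run here)
#quality = ['NA','Po','Fa','TA','Gd','Ex']
#orderings = {'BsmtCond': quality, 'BsmtQual': quality, 'FireplaceQu': quality,
#             'GarageQual': quality, 'GarageCond': quality, 'PoolQC': quality,
#             'ExterQual': quality[1:], 'ExterCond': quality[1:],
#             'HeatingQC': quality[1:], 'KitchenQual': quality[1:],
#             'BsmtExposure': ['NA','No','Mn','Av','Gd'],
#             'BsmtFinType1': ['NA','Unf','LwQ','Rec','BLQ','ALQ','GLQ'],
#             'BsmtFinType2': ['NA','Unf','LwQ','Rec','BLQ','ALQ','GLQ'],
#             'Functional': ['Sal','Sev','Maj2','Maj1','Mod','Min2','Min1','Typ']}
#for col, cats in orderings.items():
#    df[col] = df[col].astype(pd.CategoricalDtype(categories=cats, ordered=True))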
variable_dtype_plot(df)
ordinal_cat_cols= ['BsmtCond','BsmtExposure', 'BsmtFinType1','BsmtFinType2','BsmtQual', 'ExterCond','ExterQual' ,'FireplaceQu',
                   'Functional','GarageCond', 'GarageQual', 'HeatingQC','KitchenQual', 'PoolQC',]
for col in ordinal_cat_cols:
    df[col] = df[col].cat.codes
##update list of numerical features and categorical features
numerical = numerical + ordinal_cat_cols
for col in ordinal_cat_cols:
    categorical.remove(col)
We have now marked the ordinal columns, encoded as integer codes, as numerical features and removed them from the categorical features list.
The columns remaining in the categorical features list will need to be one-hot encoded, since their values have no inherent ordering.
df[categorical].describe().T
count | unique | top | freq | |
---|---|---|---|---|
Alley | 2919 | 3 | NA | 2721 |
BldgType | 2919 | 5 | 1Fam | 2425 |
CentralAir | 2919 | 2 | Y | 2723 |
Condition1 | 2919 | 9 | Norm | 2511 |
Condition2 | 2919 | 8 | Norm | 2889 |
Electrical | 2919 | 5 | SBrkr | 2672 |
Exterior1st | 2919 | 15 | VinylSd | 1026 |
Exterior2nd | 2919 | 17 | VinylSd | 1014 |
Fence | 2919 | 5 | NA | 2348 |
Foundation | 2919 | 6 | PConc | 1308 |
GarageFinish | 2919 | 4 | Unf | 1230 |
GarageType | 2919 | 7 | Attchd | 1723 |
Heating | 2919 | 6 | GasA | 2874 |
HouseStyle | 2919 | 8 | 1Story | 1471 |
LandContour | 2919 | 4 | Lvl | 2622 |
LandSlope | 2919 | 3 | Gtl | 2778 |
LotConfig | 2919 | 5 | Inside | 2133 |
LotShape | 2919 | 4 | Reg | 1859 |
MSZoning | 2915 | 5 | RL | 2265 |
MasVnrType | 2919 | 4 | None | 1766 |
MiscFeature | 2919 | 5 | NA | 2814 |
Neighborhood | 2919 | 25 | NAmes | 443 |
PavedDrive | 2919 | 3 | Y | 2641 |
RoofMatl | 2919 | 8 | CompShg | 2876 |
RoofStyle | 2919 | 6 | Gable | 2310 |
SaleCondition | 2919 | 6 | Normal | 2402 |
SaleType | 2919 | 9 | WD | 2526 |
Street | 2919 | 2 | Pave | 2907 |
Utilities | 2919 | 2 | AllPub | 2918 |
##these variables tend to have no inherent ordering in their values
#df = pd.get_dummies(df,columns=['Alley','LandSlope','PavedDrive','Street','Utilities'])
There are still a few columns in our dataframe that are categorical in nature and need to be encoded using one of these techniques; MSSubClass, although stored as a number, is one of them.
##MSSubClass is a category feature and needs to be encoded as categories.
df.replace({'MSSubClass':{20:'class1', 30:'class2', 40:'class3', 45:'class4',
50:'class5', 60:'class6', 70:'class7', 75:'class8',
80:'class9', 85:'class10', 90:'class11', 120:'class12',
150:'class13', 160:'class14', 180:'class15', 190:'class16'}},inplace=True)
df['MSSubClass'] = df['MSSubClass'].astype('category')
##update categorical features
categorical.append('MSSubClass')
numerical.remove('MSSubClass')
df[categorical].describe()
Alley | BldgType | CentralAir | Condition1 | Condition2 | Electrical | Exterior1st | Exterior2nd | Fence | Foundation | ... | MiscFeature | Neighborhood | PavedDrive | RoofMatl | RoofStyle | SaleCondition | SaleType | Street | Utilities | MSSubClass | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | ... | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 | 2919 |
unique | 3 | 5 | 2 | 9 | 8 | 5 | 15 | 17 | 5 | 6 | ... | 5 | 25 | 3 | 8 | 6 | 6 | 9 | 2 | 2 | 16 |
top | NA | 1Fam | Y | Norm | Norm | SBrkr | VinylSd | VinylSd | NA | PConc | ... | NA | NAmes | Y | CompShg | Gable | Normal | WD | Pave | AllPub | class1 |
freq | 2721 | 2425 | 2723 | 2511 | 2889 | 2672 | 1026 | 1014 | 2348 | 1308 | ... | 2814 | 443 | 2641 | 2876 | 2310 | 2402 | 2526 | 2907 | 2918 | 1079 |
4 rows × 30 columns
As we can see, the Neighborhood attribute has many categories (25), so we will analyze it and combine neighborhoods into a smaller number of levels.
Below we plot the median SalePrice for each neighborhood; levels with bars of similar height are candidates to be merged into a single level.
train_df.groupby('Neighborhood')['SalePrice'].median().sort_values().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x2109188f860>
neighborhood_map = {"MeadowV" : 0, "IDOTRR" : 1, "BrDale" : 1, "OldTown" : 1, "Edwards" : 1,
"BrkSide" : 1,"Sawyer" : 1, "Blueste" : 1, "SWISU" : 2, "NAmes" : 2,
"NPkVill" : 2, "Mitchel" : 2, "SawyerW" : 2, "Gilbert" : 2, "NWAmes" : 2,
"Blmngtn" : 2, "CollgCr" : 2, "ClearCr" : 3, "Crawfor" : 3, "Veenker" : 3,
"Somerst" : 3, "Timber" : 3, "StoneBr" : 4, "NoRidge" : 4, "NridgHt" : 4}
df.replace({'Neighborhood':neighborhood_map},inplace=True)
df['Neighborhood'] = df['Neighborhood'].astype('category')
train_df.groupby('Exterior1st')['SalePrice'].median().sort_values().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x210916a9a58>
exterior1st_map = {"BrkComm":1,"AsphShn":1,"CBlock":1,"AsbShng":1,"WdShing":2,
"Wd Sdng":2,"MetalSd":2,"Stucco":2,"HdBoard":2,"BrkFace":3,"Plywood":3,
"VinylSd":4,"CemntBd":4,"Stone":4,"ImStucc":4,}
df.replace({'Exterior1st':exterior1st_map},inplace=True)
df['Exterior1st'] = df['Exterior1st'].astype('category')
train_df.groupby('Exterior2nd')['SalePrice'].median().sort_values().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x210917dd0f0>
exterior2nd_map = {'VinylSd':3, 'MetalSd':2, 'Wd Shng':2, 'HdBoard':2, 'Plywood':2, 'Wd Sdng':2,
'CmentBd':4, 'BrkFace':2, 'Stucco':2, 'AsbShng':1, 'Brk Cmn':2, 'ImStucc':3,
'AsphShn':2, 'Stone':2, 'Other':4, 'CBlock':1, 'None':0}
df.replace({'Exterior2nd':exterior2nd_map},inplace=True)
df['Exterior2nd'] = df['Exterior2nd'].astype('category')
df[categorical].describe().T
count | unique | top | freq | |
---|---|---|---|---|
Alley | 2919 | 3 | NA | 2721 |
BldgType | 2919 | 5 | 1Fam | 2425 |
CentralAir | 2919 | 2 | Y | 2723 |
Condition1 | 2919 | 9 | Norm | 2511 |
Condition2 | 2919 | 8 | Norm | 2889 |
Electrical | 2919 | 5 | SBrkr | 2672 |
Exterior1st | 2919 | 4 | 2 | 1402 |
Exterior2nd | 2919 | 5 | 2 | 1721 |
Fence | 2919 | 5 | NA | 2348 |
Foundation | 2919 | 6 | PConc | 1308 |
GarageFinish | 2919 | 4 | Unf | 1230 |
GarageType | 2919 | 7 | Attchd | 1723 |
Heating | 2919 | 6 | GasA | 2874 |
HouseStyle | 2919 | 8 | 1Story | 1471 |
LandContour | 2919 | 4 | Lvl | 2622 |
LandSlope | 2919 | 3 | Gtl | 2778 |
LotConfig | 2919 | 5 | Inside | 2133 |
LotShape | 2919 | 4 | Reg | 1859 |
MSZoning | 2915 | 5 | RL | 2265 |
MasVnrType | 2919 | 4 | None | 1766 |
MiscFeature | 2919 | 5 | NA | 2814 |
Neighborhood | 2919 | 5 | 2 | 1344 |
PavedDrive | 2919 | 3 | Y | 2641 |
RoofMatl | 2919 | 8 | CompShg | 2876 |
RoofStyle | 2919 | 6 | Gable | 2310 |
SaleCondition | 2919 | 6 | Normal | 2402 |
SaleType | 2919 | 9 | WD | 2526 |
Street | 2919 | 2 | Pave | 2907 |
Utilities | 2919 | 2 | AllPub | 2918 |
MSSubClass | 2919 | 16 | class1 | 1079 |
###drop MoSold: the boxplot below (drawn from the training data) suggests the month of sale has little effect on SalePrice
df.drop('MoSold',axis=1,inplace=True)
numerical.remove('MoSold')
sns.boxplot(x='MoSold',y='SalePrice', data=train_df)
<matplotlib.axes._subplots.AxesSubplot at 0x21092f44ac8>
#df = pd.get_dummies(df,columns=categorical,drop_first=True)
###one-hot encoding is applied later; after dropping MoSold we are left with 80 columns
df.shape
(2919, 80)
###now we plot the distribution of numerical variables in our dataset
#num = [f for f in df.columns if df.dtypes[f] != 'object']
numdf=pd.melt(df,value_vars=numerical)
numgrid=sns.FacetGrid(numdf,col='variable',col_wrap=4,sharex=False,sharey=False)
numgrid=numgrid.map(sns.distplot,'value')
numgrid
<seaborn.axisgrid.FacetGrid at 0x210935ac588>
We see that the target variable SalePrice has a right-skewed distribution. We will log-transform this variable so that it becomes close to normally distributed; a (near-)normal target helps in modeling the relationship between the target and the independent variables, and linear algorithms additionally assume constant variance in the error term. We can confirm the skewed behavior with the skewness metric. Moreover, submissions are evaluated on the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale price, which is another reason to log-transform this variable.
print ("The skewness of SalePrice is {}".format(df['SalePrice'].skew()))
The skewness of SalePrice is 1.8828757597682129
target = np.log(df['SalePrice'].dropna())
print ('Skewness is', target.skew())
sns.distplot(target)
Skewness is 0.121335062205
<matplotlib.axes._subplots.AxesSubplot at 0x21099fb4240>
df['SalePrice'] = np.log(df['SalePrice'])
#df['SalePrice']
#numdf1 = pd.melt(train_df, id_vars=['SalePrice'],value_vars=numerical)
#numgrid1 = sns.FacetGrid(numdf1, col="variable", col_wrap=4 , size=3.0,aspect=1.2,sharex=False, sharey=False)
#numgrid1.map(plt.scatter, "value",'SalePrice',s=1.5)
#numgrid1
from scipy.stats import skew
skewed = df[numerical].apply(lambda x: skew(x.dropna().astype(float)))
skewed = skewed[skewed>0.75].index
skewed
Index(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'BsmtFinSF1', 'BsmtFinSF2',
'BsmtHalfBath', 'BsmtUnfSF', 'EnclosedPorch', 'GrLivArea',
'KitchenAbvGr', 'LotArea', 'LotFrontage', 'LowQualFinSF', 'MasVnrArea',
'MiscVal', 'OpenPorchSF', 'PoolArea', 'ScreenPorch', 'TotRmsAbvGrd',
'TotalBsmtSF', 'WoodDeckSF', 'BsmtExposure', 'BsmtFinType2',
'ExterCond', 'ExterQual', 'PoolQC'],
dtype='object')
df[skewed]=np.log1p(df[skewed])
##we will plot the distributions again to check skewness
numdf=pd.melt(df,value_vars=numerical)
numgrid=sns.FacetGrid(numdf,col='variable',col_wrap=4,sharex=False,sharey=False)
numgrid=numgrid.map(sns.distplot,'value')
numgrid
<seaborn.axisgrid.FacetGrid at 0x21097addf98>
Having already encoded the ordinal categorical variables, we are left with one-hot encoding the nominal categorical variables.
df = pd.get_dummies(df,columns=categorical,drop_first=True)
df.shape
(2919, 184)
Now all our features are numerically encoded, as seen below in the plot of dtypes. We can now split the data back into train and test sets.
variable_dtype_plot(df)
X_train = df[:1460].drop(['SalePrice','Id'], axis=1)
y_train = df[:1460]['SalePrice']
X_test = df[1460:].drop(['SalePrice','Id'], axis=1)
Now we'll standardize the numeric features (those which are not dummy variables). Make sure to fit the scaler on the training data only and then transform both the train and the test data.
numerical.remove('SalePrice')
numerical.remove('Id')
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train[numerical])
X_train[numerical]= scaler.transform(X_train[numerical])
X_test[numerical]= scaler.transform(X_test[numerical])
X_train.shape,y_train.shape,X_test.shape
((1460, 182), (1460,), (1459, 182))
Feature Selection
Decision trees are by nature immune to multicollinearity: if two features are 99% correlated, the tree will choose only one of them when deciding on a split, whereas a model such as logistic regression would use both.
Since boosted trees are built from individual decision trees, they are also largely unaffected by multicollinearity. Even so, it is good practice to remove redundant features from any training set, irrespective of the model's algorithm: we can evaluate each feature's importance and retain only the best features for the final model.
After one-hot encoding we have 182 features, and overfitting can easily occur when many of them are redundant. We therefore use an XGBoost regressor to generate a ranking of feature importance.
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
Parameters max_depth and min_child_weight: these parameters add constraints on the architecture of the trees.
i) max_depth is the maximum number of nodes allowed from the root to the farthest leaf of a tree. Deeper trees can model more complex relationships by adding more nodes, but as we go deeper, splits become less relevant and are sometimes due only to noise, causing the model to overfit.
ii) min_child_weight is the minimum weight (or number of samples, if all samples have a weight of 1) required to create a new node in the tree. A smaller min_child_weight allows the algorithm to create children corresponding to fewer samples, which permits more complex trees that are, again, more likely to overfit. Together, these parameters control the complexity of the trees.
It is important to tune them together in order to find a good trade-off between model bias and variance.
param_grid = {'max_depth': [3,5,7], 'min_child_weight': [1,3,5]}
xgb_param = {'learning_rate': 0.1, 'n_estimators': 1000, 'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8,
'objective': 'reg:linear'}
xgb_reg = xgb.XGBRegressor(**xgb_param)
optimize_xgb = GridSearchCV(estimator=xgb_reg,param_grid=param_grid,scoring='neg_mean_squared_error',cv = 5, n_jobs = -1)
optimize_xgb.fit(X_train,y_train)
GridSearchCV(cv=5, error_score='raise',
estimator=XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=0.8,
gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
min_child_weight=1, missing=None, n_estimators=1000, nthread=-1,
objective='reg:linear', reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, seed=0, silent=True, subsample=0.8),
fit_params=None, iid=True, n_jobs=-1,
param_grid={'max_depth': [3, 5, 7], 'min_child_weight': [1, 3, 5]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
scoring='neg_mean_squared_error', verbose=0)
optimize_xgb.grid_scores_  ##deprecated in newer scikit-learn; pd.DataFrame(optimize_xgb.cv_results_) is the replacement
[mean: -0.01533, std: 0.00226, params: {'max_depth': 3, 'min_child_weight': 1},
mean: -0.01583, std: 0.00167, params: {'max_depth': 3, 'min_child_weight': 3},
mean: -0.01605, std: 0.00176, params: {'max_depth': 3, 'min_child_weight': 5},
mean: -0.01640, std: 0.00197, params: {'max_depth': 5, 'min_child_weight': 1},
mean: -0.01618, std: 0.00136, params: {'max_depth': 5, 'min_child_weight': 3},
mean: -0.01624, std: 0.00166, params: {'max_depth': 5, 'min_child_weight': 5},
mean: -0.01702, std: 0.00210, params: {'max_depth': 7, 'min_child_weight': 1},
mean: -0.01641, std: 0.00230, params: {'max_depth': 7, 'min_child_weight': 3},
mean: -0.01603, std: 0.00190, params: {'max_depth': 7, 'min_child_weight': 5}]
We could tune more hyperparameters, but for now we keep these values, since tuning large grids of hyperparameters is computationally expensive. The parameters num_boost_round and early_stopping_rounds help improve accuracy further. num_boost_round corresponds to the number of boosting rounds, i.e. the number of trees to build. Its optimal value depends strongly on the other parameters, so in principle it should be re-tuned every time another parameter is updated; doing that inside a grid search requires a lot of computational effort.
Fortunately, XGBoost provides a nice way to find the best number of rounds while training. Since trees are built sequentially, instead of fixing the number of rounds at the beginning, we can test the model at each step and check whether adding a new tree/round improves performance.
To do so, we define a validation set and a metric used to assess performance at each round. If performance hasn't improved for N rounds (N is set by early_stopping_rounds), we stop training and keep the best number of boosting rounds.
For this we will use the cv method of the native XGBoost API.
https://cambridgespark.com/content/tutorials/hyperparameter-tuning-in-xgboost/index.html
xgdmat = xgb.DMatrix(X_train, y_train) # Create our DMatrix to make XGBoost more efficient
##from above cross-validation GridSearch
our_params = {'eta': 0.1, 'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8,
'objective': 'reg:linear', 'max_depth':3, 'min_child_weight':1}
cv_xgb = xgb.cv(params = our_params, dtrain = xgdmat, num_boost_round = 3000, nfold = 5,
metrics = ['rmse'], # Make sure you enter metrics inside a list or you may encounter issues!
early_stopping_rounds = 100)
cv_xgb is a dataframe whose rows correspond to the boosting rounds used; here again, training stopped well before the 3000-round limit.
cv_xgb
test-rmse-mean | test-rmse-std | train-rmse-mean | train-rmse-std | |
---|---|---|---|---|
0 | 10.379822 | 0.033839 | 10.379870 | 0.007184 |
1 | 9.344127 | 0.034248 | 9.344181 | 0.006840 |
2 | 8.412292 | 0.035014 | 8.412352 | 0.006178 |
3 | 7.574085 | 0.035852 | 7.574152 | 0.005454 |
4 | 6.819066 | 0.035495 | 6.819701 | 0.004287 |
5 | 6.140290 | 0.032772 | 6.140391 | 0.003767 |
6 | 5.530189 | 0.030912 | 5.529307 | 0.003420 |
7 | 4.979819 | 0.029225 | 4.979449 | 0.003466 |
8 | 4.484459 | 0.027517 | 4.483990 | 0.003051 |
9 | 4.039501 | 0.026085 | 4.038489 | 0.002739 |
10 | 3.638263 | 0.025406 | 3.637138 | 0.002289 |
11 | 3.277578 | 0.025044 | 3.276346 | 0.002096 |
12 | 2.952482 | 0.024389 | 2.951513 | 0.002167 |
13 | 2.659741 | 0.023205 | 2.659352 | 0.001912 |
14 | 2.396945 | 0.022857 | 2.396294 | 0.001507 |
15 | 2.160222 | 0.021756 | 2.159610 | 0.001576 |
16 | 1.947135 | 0.020532 | 1.946340 | 0.001429 |
17 | 1.755617 | 0.020124 | 1.754581 | 0.001241 |
18 | 1.582870 | 0.018943 | 1.581904 | 0.001236 |
19 | 1.427049 | 0.018012 | 1.426389 | 0.001120 |
20 | 1.287859 | 0.017427 | 1.286661 | 0.000922 |
21 | 1.162530 | 0.016839 | 1.161176 | 0.000801 |
22 | 1.049724 | 0.016123 | 1.048125 | 0.000848 |
23 | 0.948474 | 0.015700 | 0.946466 | 0.000842 |
24 | 0.856856 | 0.015552 | 0.854900 | 0.000680 |
25 | 0.774883 | 0.014861 | 0.772707 | 0.000879 |
26 | 0.700937 | 0.014548 | 0.698810 | 0.001033 |
27 | 0.635141 | 0.014218 | 0.632490 | 0.000780 |
28 | 0.576207 | 0.013976 | 0.572851 | 0.000707 |
29 | 0.523479 | 0.013901 | 0.519411 | 0.000751 |
... | ... | ... | ... | ... |
254 | 0.122656 | 0.023641 | 0.058347 | 0.002054 |
255 | 0.122669 | 0.023616 | 0.058241 | 0.002025 |
256 | 0.122685 | 0.023640 | 0.058115 | 0.001961 |
257 | 0.122685 | 0.023619 | 0.058017 | 0.001935 |
258 | 0.122629 | 0.023698 | 0.057908 | 0.001909 |
259 | 0.122571 | 0.023667 | 0.057779 | 0.001914 |
260 | 0.122605 | 0.023673 | 0.057654 | 0.001936 |
261 | 0.122616 | 0.023668 | 0.057507 | 0.001962 |
262 | 0.122573 | 0.023633 | 0.057365 | 0.001951 |
263 | 0.122546 | 0.023652 | 0.057179 | 0.001928 |
264 | 0.122535 | 0.023687 | 0.057031 | 0.001918 |
265 | 0.122499 | 0.023673 | 0.056924 | 0.001915 |
266 | 0.122504 | 0.023701 | 0.056810 | 0.001923 |
267 | 0.122500 | 0.023755 | 0.056659 | 0.001932 |
268 | 0.122542 | 0.023772 | 0.056524 | 0.001897 |
269 | 0.122571 | 0.023703 | 0.056375 | 0.001889 |
270 | 0.122558 | 0.023742 | 0.056271 | 0.001878 |
271 | 0.122540 | 0.023715 | 0.056126 | 0.001883 |
272 | 0.122537 | 0.023754 | 0.055990 | 0.001877 |
273 | 0.122529 | 0.023788 | 0.055833 | 0.001884 |
274 | 0.122520 | 0.023803 | 0.055730 | 0.001892 |
275 | 0.122519 | 0.023809 | 0.055624 | 0.001864 |
276 | 0.122499 | 0.023802 | 0.055531 | 0.001843 |
277 | 0.122440 | 0.023850 | 0.055432 | 0.001824 |
278 | 0.122426 | 0.023860 | 0.055281 | 0.001801 |
279 | 0.122456 | 0.023842 | 0.055165 | 0.001770 |
280 | 0.122492 | 0.023872 | 0.055077 | 0.001774 |
281 | 0.122458 | 0.023897 | 0.054932 | 0.001792 |
282 | 0.122454 | 0.023887 | 0.054816 | 0.001775 |
283 | 0.122425 | 0.023908 | 0.054718 | 0.001784 |
284 rows × 4 columns
cv_xgb['test-rmse-mean'].min()
cv_xgb.loc[30:,["test-rmse-mean", "train-rmse-mean"]].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x2109d80c6d8>
As you can see, we stopped before reaching the maximum number of boosting rounds: after the 284th tree, adding more rounds no longer improved the RMSE on the held-out folds.
#xgb_reg = xgb.XGBRegressor(**our_params)
final_gb = xgb.train(our_params, xgdmat, num_boost_round = 284)
%matplotlib inline
import seaborn as sns
sns.set(font_scale = 1.5)
sorted(final_gb.get_fscore().items())
[('1stFlrSF', 122),
('2ndFlrSF', 39),
('3SsnPorch', 5),
('Alley_NA', 4),
('Alley_Pave', 10),
('BedroomAbvGr', 24),
('BldgType_2fmCon', 2),
('BldgType_Duplex', 1),
('BsmtCond', 11),
('BsmtExposure', 16),
('BsmtFinSF1', 67),
('BsmtFinSF2', 17),
('BsmtFinType1', 12),
('BsmtFinType2', 6),
('BsmtFullBath', 19),
('BsmtHalfBath', 3),
('BsmtQual', 11),
('BsmtUnfSF', 77),
('CentralAir_Y', 13),
('Condition1_Feedr', 3),
('Condition1_Norm', 19),
('Condition1_PosA', 1),
('Condition1_PosN', 3),
('Condition1_RRAe', 5),
('Condition1_RRAn', 1),
('Electrical_FuseF', 2),
('Electrical_SBrkr', 4),
('EnclosedPorch', 39),
('ExterCond', 17),
('ExterQual', 3),
('Exterior1st_2', 5),
('Exterior1st_3', 11),
('Exterior1st_4', 2),
('Exterior2nd_1', 10),
('Exterior2nd_2', 1),
('Exterior2nd_3', 1),
('Fence_GdWo', 8),
('Fence_MnPrv', 2),
('Fence_NA', 4),
('FireplaceQu', 8),
('Fireplaces', 4),
('Foundation_CBlock', 1),
('Foundation_PConc', 8),
('Foundation_Wood', 3),
('FullBath', 8),
('Functional', 31),
('GarageArea', 70),
('GarageCars', 8),
('GarageCond', 4),
('GarageFinish_RFn', 2),
('GarageFinish_Unf', 5),
('GarageQual', 5),
('GarageType_Attchd', 12),
('GarageType_Basment', 2),
('GarageType_BuiltIn', 1),
('GarageType_CarPort', 1),
('GarageType_Detchd', 4),
('GarageYrBlt', 56),
('GrLivArea', 116),
('HalfBath', 7),
('HeatingQC', 10),
('Heating_GasA', 3),
('Heating_Grav', 1),
('HouseStyle_1Story', 4),
('HouseStyle_2.5Unf', 2),
('HouseStyle_2Story', 1),
('HouseStyle_SLvl', 2),
('KitchenAbvGr', 8),
('KitchenQual', 20),
('LandContour_HLS', 1),
('LandContour_Low', 1),
('LandContour_Lvl', 5),
('LandSlope_Mod', 2),
('LandSlope_Sev', 1),
('LotArea', 102),
('LotConfig_CulDSac', 8),
('LotConfig_FR2', 4),
('LotConfig_FR3', 1),
('LotConfig_Inside', 5),
('LotFrontage', 52),
('LotShape_Reg', 2),
('LowQualFinSF', 7),
('MSSubClass_class2', 5),
('MSSubClass_class5', 7),
('MSSubClass_class6', 4),
('MSSubClass_class7', 1),
('MSSubClass_class8', 1),
('MSSubClass_class9', 1),
('MSZoning_FV', 2),
('MSZoning_RH', 1),
('MSZoning_RL', 5),
('MSZoning_RM', 5),
('MasVnrArea', 17),
('MasVnrType_BrkFace', 3),
('MasVnrType_Stone', 1),
('MiscVal', 3),
('Neighborhood_1', 4),
('Neighborhood_2', 7),
('Neighborhood_3', 10),
('Neighborhood_4', 8),
('OpenPorchSF', 38),
('OverallCond', 52),
('OverallQual', 68),
('PavedDrive_P', 1),
('PavedDrive_Y', 7),
('PoolArea', 8),
('RoofMatl_CompShg', 1),
('RoofMatl_Tar&Grv', 1),
('RoofMatl_WdShngl', 4),
('RoofStyle_Gable', 2),
('RoofStyle_Gambrel', 7),
('RoofStyle_Hip', 2),
('SaleCondition_Alloca', 5),
('SaleCondition_Family', 13),
('SaleCondition_Normal', 23),
('SaleCondition_Partial', 9),
('SaleType_ConLI', 1),
('SaleType_New', 9),
('SaleType_WD', 3),
('ScreenPorch', 22),
('Street_Pave', 9),
('TotRmsAbvGrd', 11),
('TotalBsmtSF', 62),
('WoodDeckSF', 37),
('YearBuilt', 65),
('YearRemodAdd', 42),
('YrSold', 19)]
##xgb.plot_importance(final_gb,max_num_features=50)
fig, ax = plt.subplots(figsize=(12,18))
#xgb.plot_importance(final_gb, max_num_features=50, height=0.8, ax=ax)
#plt.show()
%matplotlib inline
max_x=10
xgb.plot_importance(dict(sorted(final_gb.get_fscore().items(), reverse = True, key=lambda x:x[1])[:max_x]), height = 0.8)
<matplotlib.axes._subplots.AxesSubplot at 0x210fbb01240>
Now that we have an understanding of the feature importances, we can at least see what is driving the splits in the trees and where we might be able to improve with further feature engineering. You can try playing with the hyperparameters yourself, or engineer some new features, to see if you can beat the current benchmark; one possible feature-selection follow-up is sketched below.
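As a follow-up sketch (not part of the submission pipeline below), we could keep only the k most important features by fscore and re-run the cross-validation to check whether a leaner model performs comparably; the value of k here is an arbitrary choice.
##sketch: retrain on the top-k features only and compare the cross-validated RMSE
k = 60
top_features = [name for name, score in sorted(final_gb.get_fscore().items(),
                                               key=lambda kv: kv[1], reverse=True)[:k]]
xgdmat_top = xgb.DMatrix(X_train[top_features], y_train)
cv_top = xgb.cv(params=our_params, dtrain=xgdmat_top, num_boost_round=3000, nfold=5,
                metrics=['rmse'], early_stopping_rounds=100)
print(cv_top['test-rmse-mean'].min())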
The model has now been tuned using a cross-validated grid search through the sklearn API and early stopping through the built-in XGBoost API. Now we can see how it finally performs on the test set.
testdmat = xgb.DMatrix(X_test)
y_pred = final_gb.predict(testdmat) # Predict using our testdmat
y_pred
array([ 11.73430061, 11.97035789, 12.14089966, ..., 12.00389957,
11.66779327, 12.29656219], dtype=float32)
y_pred = np.exp(y_pred)
output = pd.DataFrame({'Id': test_df['Id'], 'SalePrice': y_pred})
output.to_csv('prediction-ensemble.csv', index=False)
Since the model was trained on log-transformed prices, the predictions are exponentiated back to dollar values before being written to the submission file.
#cv_xgb.shape
#print("Best MAE: {:.2f} with {} rounds".format(
# model.best_score,
# model.best_iteration+1))
#from xgboost import XGBRegressor
#xgb = XGBRegressor()
#xgb.fit(X_train, y_train)
#print(xgb.booster().get_score())
#imp = pd.DataFrame(xgb.feature_importances_ ,columns = ['Importance'],index = X_train.columns)
#imp = pd.DataFrame(xgb.booster().get_score(),index = X_train.columns)
#imp = imp.sort_values(['Importance'], ascending = False)
#model.booster().get_score(importance_type='weight')
#print(imp)