The automotive industry is one of the most dynamic and competitive sectors globally, with pricing being a critical factor in the decision-making process for both consumers and manufacturers. In this project, we delve into the world of car pricing analysis using a comprehensive dataset. Our objective is to gain valuable insights into the factors that influence car prices.
EDA (Exploratory Data Analysis) is a crucial first step in the data analysis process, allowing us to understand the dataset’s structure and to identify patterns, outliers, and relationships among variables. Through visualizations, statistical summaries, and data exploration, we aim to uncover hidden trends and insights that can inform pricing strategies and consumer decisions in the car market.
Using Jupyter Notebook for Exploratory Data Analysis (EDA) is a common and effective choice, as it allows you to interactively explore your data, visualize it, and document your analysis steps. It is not only used to explore your data but also to communicate your findings effectively. Therefore, strive for clarity, transparency, and reproducibility in your analysis.
The download option above leads to the CSV file that contains the data. The data were collected from the popular Kaggle website.
PROCESS
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

df = pd.read_csv('C:/Users/asus/OneDrive/Desktop/CarPrice_Assignment.csv')
df
You will get output similar to the image above. Now let's browse through the columns present in the dataset.
df.columns
output
Index(['car_ID', 'symboling', 'CarName', 'fueltype', 'aspiration','doornumber', 'carbody', 'drivewheel', 'enginelocation', 'wheelbase','carlength', 'carwidth', 'carheight', 'curbweight', 'enginetype','cylindernumber', 'enginesize', 'fuelsystem', 'boreratio', 'stroke','compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg','price'],dtype='object')
Remember that these snippets show their output directly only in Jupyter Notebook, which echoes the value of the last expression in a cell. In other interfaces you have to wrap the expression in print() to see the output.
- Check whether there are null values present in any column.
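As a minimal sketch of that difference, the snippet below uses a tiny made-up DataFrame (the names `toy` and `column_names` and all values are hypothetical, so it runs without the CSV) and prints the column list explicitly, the way you would in a plain Python script.

```python
import pandas as pd

# Hypothetical two-row stand-in for the car dataset; values are invented.
toy = pd.DataFrame({"CarName": ["model a", "model b"], "price": [10000.0, 25000.0]})

# In a script, a bare `toy.columns` line displays nothing, so print it explicitly.
column_names = toy.columns.tolist()
print(column_names)
```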
df.isnull().sum()
output
car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64
There aren’t any null values, so we are ready to go ahead. The next thing we need to check is the data type of each column.
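Since the real dataset has no missing values, it can help to see what the same check reports when something is missing. The sketch below uses a made-up frame (`toy` and its values are hypothetical) with one missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing horsepower value.
toy = pd.DataFrame({
    "horsepower": [100.0, np.nan, 150.0],
    "price": [10000.0, 12000.0, 18000.0],
})

null_counts = toy.isnull().sum()              # missing values per column
has_nulls = bool(toy.isnull().values.any())   # True if any cell is missing

print(null_counts)
print("any nulls:", has_nulls)
```

If `has_nulls` were True on the real data, you would decide whether to drop or fill those rows before going further.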
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype
 0   car_ID            205 non-null    int64
 1   symboling         205 non-null    int64
 2   CarName           205 non-null    object
 3   fueltype          205 non-null    object
 4   aspiration        205 non-null    object
 5   doornumber        205 non-null    object
 6   carbody           205 non-null    object
 7   drivewheel        205 non-null    object
 8   enginelocation    205 non-null    object
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64
 14  enginetype        205 non-null    object
 15  cylindernumber    205 non-null    object
 16  enginesize        205 non-null    int64
 17  fuelsystem        205 non-null    object
 18  boreratio         205 non-null    float64
 19  stroke            205 non-null    float64
 20  compressionratio  205 non-null    float64
 21  horsepower        205 non-null    int64
 22  peakrpm           205 non-null    int64
 23  citympg           205 non-null    int64
 24  highwaympg        205 non-null    int64
 25  price             205 non-null    float64
dtypes: float64(8), int64(8), object(10)
memory usage: 41.8+ KB
- Dropping the car_ID column, as it is of no use for the analysis.
df.drop(columns=['car_ID'], inplace=True)
df
As you can see, the car_ID column has been removed. Let's now inspect the different columns of the dataset.
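The info output lists doornumber as an object column because it stores the words 'two' and 'four' rather than numbers. If a numeric version is ever needed, one possible approach is a simple mapping. This is only a sketch on a made-up frame; `toy`, `word_to_int`, and the new `doors` column are hypothetical names.

```python
import pandas as pd

# Hypothetical toy column using the two door labels that appear in the dataset.
toy = pd.DataFrame({"doornumber": ["two", "four", "four"]})

# Map the number words to integers; unmapped labels would become NaN.
word_to_int = {"two": 2, "four": 4}
toy["doors"] = toy["doornumber"].map(word_to_int)

print(toy)
```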
df.describe()
This gives statistical summaries (count, mean, standard deviation, minimum, quartiles, and maximum) of the numeric columns in the dataset.
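The quartiles reported by describe() are what later statements of the form "75% of values are below X" rest on. Here is a small sketch of reading them for a single column; the `wheelbase` values below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical wheelbase values, chosen only to make the arithmetic easy to follow.
wheelbase = pd.Series(
    [90.0, 94.0, 96.0, 98.0, 100.0, 102.0, 110.0, 120.0], name="wheelbase"
)

# 25th, 50th, and 75th percentiles (linear interpolation, the pandas default).
quartiles = wheelbase.quantile([0.25, 0.5, 0.75])
summary = wheelbase.describe()  # the same quartiles appear as '25%', '50%', '75%'

print(quartiles)
```

For these made-up values, 75% of the observations lie at or below the third quartile, which is the kind of reading used in the insights further down.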
- Number of Cars having fuel type diesel and gas.
x = (df['fueltype'] == 'diesel').sum()
print("diesel type fuel:", x)
y = (df['fueltype'] == 'gas').sum()
print("gas type fuel:", y)
diesel type fuel: 20
gas type fuel: 185
#plot for fuel type
fuel_count = df['fueltype'].value_counts()
plt.figure(figsize=(4,4))
sns.countplot(x='fueltype', data=df, order=fuel_count.index, palette='magma')
plt.title("cars with different fuel type");
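Counting each category with its own comparison works, but value_counts() returns all categories at once, and normalize=True turns the counts into shares. A minimal sketch on a made-up column (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy column; the real counts come from the CSV.
toy = pd.DataFrame({"fueltype": ["gas", "gas", "diesel", "gas"]})

counts = toy["fueltype"].value_counts()                 # rows per category
shares = toy["fueltype"].value_counts(normalize=True)   # fraction per category

print(counts)
print(shares)
```

The same two calls could be applied to aspiration, doornumber, carbody, or drivewheel without writing one line per category.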
- Number of cars having aspiration type standard and turbo.
x = (df['aspiration'] == 'std').sum()
print("aspiration type standard:", x)
y = (df['aspiration'] == 'turbo').sum()
print("aspiration type turbo:", y)
aspiration type standard: 168
aspiration type turbo: 37
#Visualization for aspiration
aspira_count = df['aspiration'].value_counts()
plt.figure(figsize=(4,4))
sns.countplot(x='aspiration', data=df, order=aspira_count.index, palette='viridis')
plt.title("cars with different aspirations");
- Number of cars with different door number.
y = (df['doornumber'] == 'two').sum()
print("cars with two doors: ", y)
z = (df['doornumber'] == 'four').sum()
print("cars with four doors: ", z)
cars with two doors: 90
cars with four doors: 115
#visualization for doornumbers of cars.
door_count = df['doornumber'].value_counts()
plt.figure(figsize=(4,4))
sns.countplot(x='doornumber', data=df, order=door_count.index, palette='viridis')
plt.title("cars with different door number");
- Number of cars with different car bodies.
o = (df['carbody'] == 'hatchback').sum()
f = (df['carbody'] == 'sedan').sum()
g = (df['carbody'] == 'hardtop').sum()
h = (df['carbody'] == 'convertible').sum()
j = (df['carbody'] == 'wagon').sum()
print('sedan:', f, 'hardtop:', g, 'hatchback:', o, 'convertible:', h, 'wagon:', j)
sedan: 96 hardtop: 8 hatchback: 70 convertible: 6 wagon: 25
body_count = df['carbody'].value_counts()
plt.figure(figsize=(4,4))
sns.countplot(x='carbody', data=df, order=body_count.index, palette='viridis')
plt.xticks(rotation=90)
plt.title("cars with different car body");
- Number of cars having different drive wheel.
o = (df['drivewheel'] == 'rwd').sum()
y = (df['drivewheel'] == 'fwd').sum()
t = (df['drivewheel'] == '4wd').sum()
print("cars with rear wheel drive:", o)
print("cars with front wheel drive:", y)
print("cars with 4 wheel drive:", t)
cars with rear wheel drive: 76
cars with front wheel drive: 120
cars with 4 wheel drive: 9
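To turn these counts into percentages, divide each by the total. The sketch below copies the three counts printed above into a Series (the names `drive_counts` and `drive_pct` are just illustrative) so the arithmetic can be checked without the CSV.

```python
import pandas as pd

# Counts copied from the printed output: 120 fwd, 76 rwd, 9 4wd (205 cars in total).
drive_counts = pd.Series({"fwd": 120, "rwd": 76, "4wd": 9})

# Share of each drive type as a percentage, rounded to one decimal place.
drive_pct = (drive_counts / drive_counts.sum() * 100).round(1)

print(drive_pct)
```

This gives roughly 58.5% front wheel drive, 37.1% rear wheel drive, and 4.4% four wheel drive.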
- Distribution of Wheelbase of Cars.
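# sns.distplot is deprecated in recent seaborn releases;
# histplot with kde=True draws a comparable histogram with a density curve.
plt.figure(figsize=(10,6))
sns.histplot(df['wheelbase'], kde=True, stat='density', color='olivedrab')
plt.title("car wheelbase distribution plot");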
- Distribution of Width and Length of Cars.
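# -- 1
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True is used instead.
plt.figure(figsize=(10,6))
sns.histplot(df['carwidth'], kde=True, stat='density')
plt.title("car distribution plot for car width");

# -- 2 (run this in a separate Jupyter Notebook cell)
plt.figure(figsize=(10,6))
sns.histplot(df['carlength'], kde=True, stat='density')
plt.title('car distribution plot for car lengths');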
- Top 20 most expensive cars.
y = df.groupby(['CarName'])['price'].max()
v = y.sort_values(ascending=False).head(20)
plt.figure(figsize=(15,8))
sns.barplot(x=v.index, y=v, palette='magma')
plt.title("Most expensive Cars (top 20)")
plt.xticks(rotation=90);
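The sort-then-head pattern can also be written with nlargest(), which returns the top entries already sorted in descending order. A small sketch on invented data (`toy` and `top2` are hypothetical names, and the prices are made up):

```python
import pandas as pd

# Hypothetical toy data: 'model a' appears twice with different prices.
toy = pd.DataFrame({
    "CarName": ["model a", "model b", "model a", "model c"],
    "price": [10000.0, 25000.0, 12000.0, 18000.0],
})

# Highest price per car name, then keep the two largest.
top2 = toy.groupby("CarName")["price"].max().nlargest(2)

print(top2)
```

Swapping 2 for 20 gives the same selection as the top 20 charts in this walkthrough.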
- Top 20 Car companies with highest prices of Cars in total.
y = df.groupby(['CarName'])['price'].sum()
v = y.sort_values(ascending=False).head(20)
plt.figure(figsize=(15,8))
sns.barplot(x=v.index, y=v, palette='cividis')
plt.title("Car companies with their total expenses(top 20)")
plt.xticks(rotation=90);
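Note that the grouping above uses the full CarName, so the totals are per car name rather than strictly per manufacturer. If a per-company total is wanted, one possible approach is to take the first word of CarName as the company. That is an assumption about how the names are written, and the `company` column, the `toy` frame, and its values below are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data; names are invented to mimic a "<company> <model>" pattern.
toy = pd.DataFrame({
    "CarName": ["alpha one", "alpha two gt", "beta city"],
    "price": [10000.0, 15000.0, 12000.0],
})

# Assumption: the manufacturer is the first whitespace-separated word of CarName.
toy["company"] = toy["CarName"].str.split().str[0]

# Total price per company, largest first.
company_total = toy.groupby("company")["price"].sum().sort_values(ascending=False)

print(company_total)
```

On real data it would be worth checking the extracted company names for spelling variants before trusting the totals.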
- Top 20 Cars with highest Horse power
f = df.groupby(['CarName'])['horsepower'].max()
f = f.sort_values(ascending=False).head(20)
plt.figure(figsize=(15,8))
sns.barplot(x=f.index, y=f, palette='viridis_r')
plt.title("Car with most horsepower(top 20)")
plt.xticks(rotation=90);
- Top 20 Cars with Highest RPM.
f = df.groupby(['CarName'])['peakrpm'].max()
f = f.sort_values(ascending=False).head(20)
plt.figure(figsize=(15,8))
sns.barplot(x=f.index, y=f, palette='mako_r')
plt.title("Car with top RPM(top 20)")
plt.xticks(rotation=90);
- Cars with best mileage in City and Highway.
# -- 1
f = df.groupby(['CarName'])['citympg'].max()
f = f.sort_values(ascending=False).head(20)
plt.figure(figsize=(15,8))
sns.barplot(x=f.index, y=f, palette='gist_earth')
plt.title("Car with best Mileage in city (top 20)")
plt.xticks(rotation=90);

# -- 2 (run this in a separate Jupyter Notebook cell)
f = df.groupby(['CarName'])['highwaympg'].max()
f = f.sort_values(ascending=False).head(20)
plt.figure(figsize=(15,8))
sns.barplot(x=f.index, y=f, palette='cividis_r')
plt.title("Car with best Mileage on highway (top 20)")
plt.xticks(rotation=90);
- Price Vs Engine type & Price Vs Fuel System.
plt.figure(figsize=(10,5))
sns.boxplot(x='enginetype', y='price', data=df, palette='viridis_r')
plt.xticks(rotation=90);

plt.figure(figsize=(10,5))
sns.boxplot(x='fuelsystem', y='price', data=df, palette='inferno_r')
plt.xticks(rotation=90);
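The centre line of each box in a boxplot is the group median, so the same comparison can be read off numerically with a groupby. The sketch below uses invented engine type labels and prices (`toy`, `type_a`, and `type_b` are hypothetical) rather than the real data.

```python
import pandas as pd

# Hypothetical toy data with two made-up engine type labels.
toy = pd.DataFrame({
    "enginetype": ["type_a", "type_a", "type_a", "type_b", "type_b"],
    "price": [7000.0, 9000.0, 11000.0, 15000.0, 17000.0],
})

# Median price per engine type: the value drawn as the middle line of each box.
median_price = toy.groupby("enginetype")["price"].median()

print(median_price)
```

Replacing enginetype with fuelsystem gives the numeric counterpart of the second boxplot.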
INSIGHTS
- The majority of cars (185 of 205) run on gas; only 20 use diesel, which tends to be more expensive to purchase.
- Compared to the standard aspiration, far fewer cars have a turbocharged engine (37 versus 168). Even though turbocharged engines may cost more up front, they may eventually result in fuel savings and increased resale value. They might also cost more to maintain and fix.
- Most cars have four doors (115 of 205), and sedan is the most common body type (96 of 205).
- Around 58% of cars have front wheel drive, about 37% have rear wheel drive, and only a small share (about 4%) offer 4 wheel drive.
- Most of the cars have a low wheelbase: 75% of wheelbase values are at or below about 102, and the remaining 25% lie between 102 and 120.
- Most car lengths fall in the range 160 to 190, and 25% of car lengths lie between 183 and just above 200.
- Most car widths are between 65.5 and 66, and only 25% of car widths lie between 67 and 72.
- Buick regal sport coupe (turbo) has the highest price at 45400.0 USD, followed by bmw x5 at 41315.0 USD.
- Peugeot is the company with the highest total price of cars, followed by Porsche cayenne.
- Porsche cayenne is the car with the highest horsepower, Toyota corolla tercel has the highest peak RPM (revolutions per minute), and Honda civic tops both the highway and city mileage charts.
- For most fuel systems, car prices lie in the range of 10000 USD to 20000 USD.
So this is all about the data-driven insights on cars depending on various factors. If you want the raw code for this EDA, go through the link below.
link – https://github.com/Kai1817/Cars-Exploratory-Data-Analysis