Cars: Exploratory Data Analysis (EDA)

The automotive industry is one of the most dynamic and competitive sectors globally, with pricing being a critical factor in the decision-making process for both consumers and manufacturers. In this project, we delve into car pricing analysis using a comprehensive dataset. Our objective is to gain valuable insights into the factors that influence car prices.

EDA (Exploratory Data Analysis) is a crucial first step in the data analysis process, allowing us to understand the dataset’s structure and to identify patterns, outliers, and relationships among variables. Through visualizations, statistical summaries, and data exploration, we aim to uncover hidden trends and insights that can inform pricing strategies and consumer decisions in the car market.

Using Jupyter Notebook for Exploratory Data Analysis (EDA) is a common and effective choice, as it allows you to interactively explore your data, visualize it, and document your analysis steps. It is not only used to explore your data but also to communicate your findings effectively. Therefore, strive for clarity, transparency, and reproducibility in your analysis.

The download option above leads to the CSV file that contains the data. The dataset was collected from Kaggle.

PROCESS

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('C:/Users/asus/OneDrive/Desktop/CarPrice_Assignment.csv')
df

You will get output similar to the image above. Now let's browse through the columns present in the dataset.

df.columns

output

Index(['car_ID', 'symboling', 'CarName', 'fueltype', 'aspiration','doornumber', 'carbody', 'drivewheel', 'enginelocation', 'wheelbase','carlength', 'carwidth', 'carheight', 'curbweight', 'enginetype','cylindernumber', 'enginesize', 'fuelsystem', 'boreratio', 'stroke','compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg','price'],dtype='object')

Remember that this code displays its output this way only in Jupyter Notebook; in other interfaces you have to call the print function explicitly to see the output.
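As a small illustration of that point, the sketch below shows what a plain Python script would need. The two-row `sample` frame and its values are made up for the example and stand in for the real CSV data.

```python
import pandas as pd

# Made-up two-row stand-in for the real CarPrice_Assignment.csv data.
sample = pd.DataFrame({"fueltype": ["gas", "diesel"], "price": [13495.0, 16500.0]})

# A notebook cell ending in `sample.columns` would display the result on its own.
# In a plain .py script nothing is shown unless it is printed explicitly.
column_names = sample.columns.tolist()
print(column_names)
```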

  • Check whether there are null values present in any column.
df.isnull().sum()  

output

car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64

There aren’t any null values, so we are ready to go ahead. The next thing we need to check is the data type of each column.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
 0   car_ID            205 non-null    int64  
 1   symboling         205 non-null    int64  
 2   CarName           205 non-null    object 
 3   fueltype          205 non-null    object 
 4   aspiration        205 non-null    object 
 5   doornumber        205 non-null    object 
 6   carbody           205 non-null    object 
 7   drivewheel        205 non-null    object 
 8   enginelocation    205 non-null    object 
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64  
 14  enginetype        205 non-null    object 
 15  cylindernumber    205 non-null    object 
 16  enginesize        205 non-null    int64  
 17  fuelsystem        205 non-null    object 
 18  boreratio         205 non-null    float64
 19  stroke            205 non-null    float64
 20  compressionratio  205 non-null    float64
 21  horsepower        205 non-null    int64  
 22  peakrpm           205 non-null    int64  
 23  citympg           205 non-null    int64  
 24  highwaympg        205 non-null    int64  
 25  price             205 non-null    float64
dtypes: float64(8), int64(8), object(10)
memory usage: 41.8+ KB
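Since the dataset mixes numeric and text columns, it can help to split the column names by data type before going further. The following is only a sketch on a made-up `sample` frame (the column names follow the dataset, the rows are illustrative); on the real data the same two calls could be applied to `df`.

```python
import pandas as pd

# Made-up stand-in rows; column names follow the dataset shown above.
sample = pd.DataFrame({
    "CarName": ["bmw x5", "honda civic"],
    "fueltype": ["gas", "gas"],
    "horsepower": [182, 76],
    "price": [41315.0, 7129.0],
})

# Numeric columns (int64/float64) versus text columns, selected by dtype.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

The numeric list is what `describe()` summarizes by default, while the categorical list is what the count plots work on.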
  • Dropping the car_ID column, as it is only a row identifier and of no use for the analysis.
df.drop(columns=['car_ID'],inplace=True)
df

As you can see, the car_ID column has been removed. Let's now inspect the remaining columns of the dataset.

df.describe()

This gives a statistical summary (count, mean, standard deviation, minimum, quartiles, and maximum) of the numeric columns in the dataset.
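The quartile rows of `describe()` are easy to misread, so here is a minimal sketch of what the "75%" row means. The wheelbase values below are made up purely to show the mechanics; they are not taken from the dataset.

```python
import pandas as pd

# Made-up wheelbase values, only to show the mechanics.
wheelbase = pd.Series(
    [88.6, 93.7, 94.5, 96.5, 98.8, 102.4, 110.0, 120.9], name="wheelbase"
)

summary = wheelbase.describe()
q75 = wheelbase.quantile(0.75)  # same number as the "75%" row of describe()

# Three quarters of the values lie at or below the 75th percentile.
share_at_or_below = (wheelbase <= q75).mean()

print(summary)
print("75th percentile:", q75)
print("share at or below:", share_at_or_below)
```

Read this way, a "75%" value of about 102 for wheelbase would mean that roughly three quarters of the cars have a wheelbase of 102 or less.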

  • Number of Cars having fuel type diesel and gas.
x=(df['fueltype']=='diesel').sum()
print("diesel type fuel:",x)
y=(df['fueltype']=='gas').sum()
print("gas type fuel:",y)
diesel type fuel: 20
gas type fuel: 185
#plot for fuel type
fuel_count= df['fueltype'].value_counts()
plt.figure(figsize = (4,4))
sns.countplot(x='fueltype',data =df,order=fuel_count.index,palette ='magma')
plt.title("cars with different fuel type");
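The two separate counts above can also be obtained in one call with `value_counts()`, and `normalize=True` turns them into shares. The sketch below rebuilds the `fueltype` column from the counts printed above (185 gas, 20 diesel) rather than reading the CSV, so that it runs on its own.

```python
import pandas as pd

# Rebuild the fueltype column from the counts printed above (185 gas, 20 diesel).
fueltype = pd.Series(["gas"] * 185 + ["diesel"] * 20, name="fueltype")

# Absolute counts per category, most frequent first.
counts = fueltype.value_counts()

# The same counts expressed as percentages of all cars.
shares = fueltype.value_counts(normalize=True).mul(100).round(1)

print(counts)
print(shares)
```

The same pattern should work for aspiration, doornumber, carbody, and drivewheel by swapping the column name.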
  • Number of cars having aspiration type standard and turbo.
x=(df['aspiration']=='std').sum()
print("aspiration type standard:",x)
y=(df['aspiration']=='turbo').sum()
print("aspiration type turbo:",y)
aspiration type standard: 168
aspiration type turbo: 37
#Visualization for aspiration
aspira_count= df['aspiration'].value_counts()
plt.figure(figsize=(4,4))
sns.countplot(x='aspiration',data=df,order=aspira_count.index,palette ='viridis')
plt.title("cars with different aspirations");
  • Number of cars with different door number.
y=(df['doornumber']=='two').sum()
print("cars with two doors: ",y)
z=(df['doornumber']=='four').sum()
print("cars with four doors: ",z)
cars with two doors:  90
cars with four doors:  115
#visualization for doornumbers of cars.
door_count= df['doornumber'].value_counts()
plt.figure(figsize=(4,4))
sns.countplot(x='doornumber',data=df,order=door_count.index,palette='viridis')
plt.title("cars with different door number");
  • Number of cars with different car bodies.
o = (df['carbody']=='hatchback').sum()
f = (df['carbody']=='sedan').sum()
g = (df['carbody']=='hardtop').sum()
h = (df['carbody']=='convertible').sum()
j = (df['carbody']=='wagon').sum()
print('sedan:',f,'hardtop:',g,'hatchback:',o,'convertible:',h,'wagon:',j)
sedan: 96 hardtop: 8 hatchback: 70 convertible: 6 wagon: 25
body_count= df['carbody'].value_counts()
plt.figure(figsize=(4,4))
sns.countplot(x='carbody',data=df,order=body_count.index,palette='viridis')
plt.xticks(rotation=90)
plt.title("cars with different car body");
  • Number of cars having different drive wheel.
o= (df['drivewheel']=='rwd').sum()
y=(df['drivewheel']=='fwd').sum()
t =(df['drivewheel']=='4wd').sum()
print("cars with rear wheel drive:",o)
print("cars with front wheel drive:",y)
print("cars with 4 wheel drive:",t)
cars with rear wheel drive: 76
cars with front wheel drive: 120
cars with 4 wheel drive: 9
  • Distribution of Wheelbase of Cars.
plt.figure(figsize=(10,6))
sns.histplot(df['wheelbase'],kde=True,stat='density',color='olivedrab')  # histplot replaces the deprecated distplot
plt.title("car wheelbase distribution plot");
  • Distribution of Width and Length of Cars.
plt.figure(figsize=(10,6))
sns.histplot(df['carwidth'],kde=True,stat='density')  # histplot replaces the deprecated distplot
plt.title("car distribution plot for car width"); # -- 1
 
#Run the two plots in separate Jupyter Notebook cells
 
plt.figure(figsize=(10,6))
sns.histplot(df['carlength'],kde=True,stat='density')  # histplot replaces the deprecated distplot
plt.title('car distribution plot for car lengths'); # -- 2
  • Top 20 most expensive cars.
y=df.groupby(['CarName'])['price'].max()
v=y.sort_values(ascending=False).head(20)
plt.figure(figsize=(15,8))
sns.barplot(x=v.index,y=v,palette='magma')
plt.title("Most expensive Cars (top 20)")
plt.xticks(rotation=90);
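The sort-then-head step used for these top-N charts can also be written with `nlargest()`. This is a small sketch on a made-up `sample` frame (the prices are illustrative apart from the bmw x5 figure quoted in this article), showing that both forms pick the same rows.

```python
import pandas as pd

# Made-up rows for illustration; only the structure mirrors the dataset.
sample = pd.DataFrame({
    "CarName": ["bmw x5", "bmw x5", "honda civic", "toyota corolla"],
    "price": [41315.0, 36880.0, 7129.0, 9258.0],
})

# Highest price recorded for each car name.
max_price = sample.groupby("CarName")["price"].max()

# Two equivalent ways to keep the top entries (top 2 here, top 20 in the article).
top = max_price.nlargest(2)
same = max_price.sort_values(ascending=False).head(2)

print(top)
```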
  • Top 20 car names with the highest total price (CarName holds the full car name, so the grouping here is per car name rather than per company).
y=df.groupby(['CarName'])['price'].sum()
v=y.sort_values(ascending=False).head(20)
plt.figure(figsize=(15,8))
sns.barplot(x=v.index,y=v,palette='cividis')
plt.title("Cars with the highest total price (top 20)")
plt.xticks(rotation=90);
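To compare manufacturers rather than individual car names, a company column has to be derived first. The sketch below assumes that the manufacturer is the first word of `CarName`; the `company` column is a name introduced here for illustration, and the rows are made up apart from the two prices quoted in this article. Spelling variants of the same manufacturer, if the data contains any, would need to be merged before trusting the totals.

```python
import pandas as pd

# Made-up rows for illustration; only the structure mirrors the dataset.
sample = pd.DataFrame({
    "CarName": ["bmw x5", "bmw 320i", "porsche cayenne",
                "buick regal sport coupe (turbo)"],
    "price": [41315.0, 16430.0, 31400.5, 45400.0],
})

# Assumption: the manufacturer is the first whitespace-separated word of CarName.
sample["company"] = sample["CarName"].str.split().str[0].str.lower()

# Total price per company, highest first.
total_by_company = (
    sample.groupby("company")["price"].sum().sort_values(ascending=False)
)

print(total_by_company)
```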
  • Top 20 Cars with highest Horse power
f=df.groupby(['CarName'])['horsepower'].max()
f=f.sort_values(ascending=False).head(20)
plt.figure(figsize=(15,8))
sns.barplot(x=f.index,y=f,palette='viridis_r')
plt.title("Car with most horsepower(top 20)")
plt.xticks(rotation=90);
  • Top 20 Cars with Highest RPM.
f=df.groupby(['CarName'])['peakrpm'].max()
f=f.sort_values(ascending=False).head(20)
plt.figure(figsize=(15,8))
sns.barplot(x=f.index,y=f,palette='mako_r')
plt.title("Car with top RPM(top 20)")
plt.xticks(rotation=90);
  • Cars with best mileage in City and Highway.
f=df.groupby(['CarName'])['citympg'].max()
f=f.sort_values(ascending=False).head(20)
plt.figure(figsize=(15,8))
sns.barplot(x=f.index,y=f,palette='gist_earth')
plt.title("Car with best Mileage in city (top 20)")
plt.xticks(rotation=90);    # -- 1

#Run the two plots in separate Jupyter Notebook cells

f=df.groupby(['CarName'])['highwaympg'].max()
f=f.sort_values(ascending=False).head(20)
plt.figure(figsize=(15,8))
sns.barplot(x=f.index,y=f,palette='cividis_r')
plt.title("Car with best Mileage on highway (top 20)")
plt.xticks(rotation=90);   # -- 2
  • Price Vs Engine type & Price Vs Fuel System.
plt.figure(figsize=(10,5))
sns.boxplot(x='enginetype',y='price',data=df,palette='viridis_r')
plt.xticks(rotation=90);
plt.figure(figsize=(10,5))
sns.boxplot(x='fuelsystem',y='price',data=df,palette='inferno_r')
plt.xticks(rotation=90);
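A box plot is easier to read next to the numbers behind it. As a rough numeric companion, the sketch below computes the count, median, minimum, and maximum price per engine type on a made-up `sample` frame; the values are illustrative only, and the same aggregation could be run on `df` with 'fuelsystem' in place of 'enginetype'.

```python
import pandas as pd

# Made-up rows for illustration; only the structure mirrors the dataset.
sample = pd.DataFrame({
    "enginetype": ["ohc", "ohc", "ohc", "dohc", "dohc"],
    "price": [7000.0, 9000.0, 20000.0, 13000.0, 17000.0],
})

# Per-group summary: how many cars, and the median/min/max price in each group.
price_by_engine = (
    sample.groupby("enginetype")["price"]
    .agg(["count", "median", "min", "max"])
    .sort_values("median", ascending=False)
)

print(price_by_engine)
```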

INSIGHTS

  1. The majority of cars run on gas: 185 of the 205 cars use gas, while only 20 use diesel.
  2. Compared with the standard engine, relatively few cars (37 of 205) have a turbocharged engine. Even though turbocharged engines may cost more up front, they may eventually result in fuel savings and increased resale value. They might also cost more to maintain and repair.
  3. Most cars have four doors (115 of 205), and sedan is the most common body type (96 of 205).
  4. Around 58% of cars have front wheel drive, about 37% have rear wheel drive, and only about 4% offer 4 wheel drive.
  5. Most of the cars have a low wheelbase: about 75% of wheelbase values are at or below roughly 102, and the remaining 25% lie between 102 and 120.
  6. Most car lengths fall in the range 160 to 190, and 25% of car lengths lie between 183 and just above 200.
  7. Most car widths are between 65.5 and 66, and only 25% of car widths lie between 67 and 72.
  8. Buick regal sport coupe (turbo) has the highest price, 45400.0 USD, followed by bmw x5 at 41315.0 USD.
  9. Peugeot has the highest total price across its cars, followed by Porsche cayenne.
  10. Porsche cayenne has the highest horsepower, Toyota corolla tercel has the highest peak RPM (revolutions per minute), and Honda civic tops both the highway and city mileage rankings.
  11. For most fuel systems, car prices lie in the range 10000 USD to 20000 USD.
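The category shares mentioned in the insights can be checked by hand from the counts printed earlier in the article. The sketch below does that arithmetic in plain Python; the `shares` helper is a name introduced here for illustration, and the counts are copied from the outputs above.

```python
# Counts copied from the outputs printed earlier in the article.
TOTAL = 205
counts = {
    "fueltype": {"gas": 185, "diesel": 20},
    "aspiration": {"std": 168, "turbo": 37},
    "doornumber": {"four": 115, "two": 90},
    "carbody": {"sedan": 96, "hatchback": 70, "wagon": 25,
                "hardtop": 8, "convertible": 6},
    "drivewheel": {"fwd": 120, "rwd": 76, "4wd": 9},
}


def shares(category_counts, total=TOTAL):
    """Return each category's share of all cars, in percent, rounded to 0.1."""
    return {name: round(100 * n / total, 1) for name, n in category_counts.items()}


percentages = {column: shares(values) for column, values in counts.items()}

for column, values in percentages.items():
    print(column, values)
```

Each group of counts adds up to 205, which is a quick sanity check that no category was missed.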

That covers the data-driven insights about cars across these factors. If you want the raw code for this EDA, follow the link below.

link – https://github.com/Kai1817/Cars-Exploratory-Data-Analysis

Kaibalya Biswal

An aspiring Data Science 📊 and Machine Learning 📈 intern who wants to give his best according to the requirements, backed by a great work ethic 🖋️.
As a postgraduate in Applied Physics from VSSUT, Burla, Odisha, with a deep-rooted passion for exploring the intricacies of the quantum world, he embarked on a transformative journey into the dynamic field of data science. This transition was catalyzed by his dissertation on experiments with quantum computers, a pivotal experience that revealed the intersection of quantum physics and data-driven technology.
His academic pursuit led him to quantum computing, where he carried out experiments and worked with quantum algorithms. Collaborating with industry leaders like IBM, he immersed himself in the quantum realm, both theoretically and practically. The culmination of this research was marked by his proficiency in programming quantum computers and conducting simulations in Jupyter Notebook. His transition to data science is fueled by a belief in the transformative potential of this synergy.
In the world of data science, he sees an ever-expanding horizon of possibilities. By merging his expertise in quantum computing with the vast landscape of data analytics, he aims to pioneer novel approaches to solving real-world problems. The marriage of quantum and data science has the power to usher in a new era of innovation.
With a strong foundation in applied physics and a newfound passion for data science, he is poised to leverage the exponential growth of data and quantum computing, creating solutions that have far-reaching implications across industries. His journey is fueled by an insatiable curiosity, a thirst for knowledge, and an unwavering commitment to drive positive change through data-driven insights.
Currently, he is actively engaged in diverse projects spanning various topics in data science, harnessing the power of data to drive insights and actionable solutions.

LinkedIn: https://www.linkedin.com/in/kaibalya-biswal-701b6421a
github: https://github.com/Kai1817
