Importing and Splitting Data into Dependent and Independent Features for Machine Learning

By Vaibhav Sharma

Jan 15, 2019 • 6 Minute Read

Introduction

Importing data from different sources is fundamental to data science and machine learning. The abundance of good quality data not only eliminates a lot of pre-processing steps but also determines how likely your model is going to succeed in predicting plausible outcomes. The Python Panda library is the workhorse of a data scientist when dealing with table or matrix forms of data. Panda is written on top of NumPy and provides the additional level of abstraction. This helps users focus more on solving the problem statement by hiding the elaborate implementation details. It takes the input in the form of csv, txt or sql file and converts it into the dataframe object which is then available for splicing and analysis.

Importing Data from Various Sources

In this guide, we are going to work with household_data.csv; the contents of which are displayed below. Unless explicitly mentioned, the data of file will remain throughout this guide.

Reading the household_data.csv

          import pandas as pd
df = pd.read_csv('household_data.csv')
print(df)
    

Output:

          Item_Category	Gender	Age	Salary	Purchased
Fitness	Male	20	30000	Yes
Fitness	Female	50	70000	No
Food		Male	35	50000	Yes
Kitchen	Male	22	40000	No
Kitchen	Female	30	35000	Yes
    

Reading Excel Files

          import pandas as pd
df = pd.read_excel('household_data.xlsx', sheetname='household_data')
print(df)
    

Output:

          Item_Category	Gender	Age	Salary	Purchased
Fitness	Male	20	30000	Yes
Fitness	Female	50	70000	No
Food		Male	35	50000	Yes
Kitchen	Male	22	40000	No
Kitchen	Female	30	35000	Yes
    

Reading the SQL File and Putting the Contents of It to Dataframe

We are going to see how to read the contents returned by the select statement to the dataframe. The below snippet is for Oracle but the idea remains same for other databases. Only the connection details should change.

          import cx_Oracle
import pandas as pd 
dsn_tns = cx_Oracle.makedsn('server', 'port', 'SID')
conn = cx_Oracle.connect(user='username', password='password', dsn=dsn_tns)
cursor = conn.cursor()
df = pd.read_sql_query("select * from household_data where Item_Category = 'Food'",conn)
print(df)
    

Output:

          Item_Category	Gender	Age	Salary	Purchased
0  Food		Male	35	50000	Yes
    

Splitting, Splicing, and Analysis of Data Using Dataframes

Panda provides various in-built functions that come in handy when dealing with the data set.

Getting the Minimum, Maximum and Average of a Column

          print(df["Salary"].min())
print(df["Salary"].max())
print(df["Salary"].mean())
    

Output:

Getting the Count for the Column

Count: This method is useful when the user is interested in getting the number of elements present per column. If there is any value that is left null than that is eliminated from the count. Assume if the value of purchased is left blank for one of the rows then following would be the output.

      print(df.count(0))

Output:

          Item_Category    5
Gender           5
Age              5
Salary           5
Purchased        4
    

Shape and Size of the Dataframe

Shape is used to get the dimensions of the dataframe.

      print(df.shape)

Output:

      (5, 5)

Size is used to get the number of elements in the dataframe.

      print(df.size)

Output:

Dependent and Independent Variables

See the below equation:

      y = 10a + 2b - 4.3c

It demonstrates that the value of y is dependent on the value of a, b, and c. So, y is referred to as dependent feature or variable and a, b, and c are independent features or variables. Any predictive mathematical model tends to divide the observations (data) into dependent/ independent features in order to determine the causal effect. It should be noted that relationship between dependent and independent variables need not be linear, it can be polynomial. It is common practise while doing experiments to change one independent variable while keeping others constant to see the change caused on the dependent variable.

Splitting the Data-set into Independent and Dependent Features

In machine learning, the concept of dependent and independent variables is important to understand. In the above dataset, if you look closely, the first four columns (Item_Category, Gender, Age, Salary) determine the outcome of the fifth, or last, column (Purchased). Intuitively, it means that the decision to buy a product of a given category (Fitness item, Food product, kitchen goods) is determined by the Gender (Male, Female), Age, and the Salary of the individual. So, we can say that Purchased is the dependent variable, the value of which is determined by the other four variables.

With this in mind, we need to split our dataset into the matrix of independent variables and the vector or dependent variable. Mathematically, Vector is defined as a matrix that has just one column.

Splitting the Dataset into the Independent Feature Matrix:

          X = df.iloc[:, :-1].values
print(X)
    

Output:

          ['Fitness' 'Male' 20 30000]
['Fitness' 'Female' 50 70000]
['Food' 'Male' 35 50000]
['Kitchen' 'Male' 22 40000]
['Kitchen' 'Female' 30 35000]
    

Extracting the Dataset to Get the Dependent Vector

          Y = df.iloc[:, -1].values
print(Y)
    

Output:

      'Yes'

Conclusion

There are many other sophisticated methods available in Python Pandas that can help the user to import data from different sources to its dataframe. Once you have the data in the dataframe, it can then be used for various kinds of analysis. We also saw how to segregate the data into dependent and independent variables. In the next guide, we will see how to carry on a few more pre-processing steps before data can be presented to the machine learning models.

Vaibhav S.

Written content author.

More about this author