Data Science Beginners
Python is an extremely powerful language for data science and this guide will explain fundamental Python concepts like Variables, Arrays, Dataframes, and more.
Oct 24, 2019 • 13 Minute Read
Introduction
Python is an extremely powerful programming language for data science and artificial intelligence. As in every programming language, there are building blocks or fundamentals in Python that need to be understood to master the language.
In this guide, you will acquire the fundamental knowledge required to succeed in data science with Python. We will go through eight concepts:
- Variables
- Lists
- Dictionary
- Arrays
- Functions
- Packages
- Dataframes
- Introduction to Machine Learning
Variables
A variable is a specific, case-sensitive name that allows you to refer to a value, as shown in the example below. In the first line, we assign the value 88 to the variable 'marks'. In the second line, we reference the variable, which returns the value it stores.
marks = 88
marks
Output:
88
There are many types of variables. The most common ones are float (for real numbers), int (for integers), str (for strings or text), and bool (for True or False). The type of a variable can be checked with the type() function.
month = "march"
average_marks = 67.3
print(type(marks)); print(type(month)); print(type(average_marks))
Output:
<class 'int'>
<class 'str'>
<class 'float'>
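As a small illustration of case sensitivity and of converting between the common types, the sketch below may help. The second variable 'Marks', its value, and the pass mark of 40 are hypothetical and are not part of the original example.

```python
# Variable names are case-sensitive: 'marks' and 'Marks' are two different variables.
marks = 88
Marks = 91  # hypothetical second value, for illustration only

# The built-in float() and str() functions convert a value to another type.
marks_as_float = float(marks)
marks_as_text = str(marks)

# A comparison produces a bool; the pass mark of 40 is an assumption.
passed = marks >= 40

print(marks, Marks)
print(type(marks_as_float), type(marks_as_text), type(passed))
```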
Lists
Lists are one of the most versatile data structures in Python. They contain a collection of values. Lists can contain items of the same or different types, and the elements can also be changed. It is easy to create a list by simply defining a collection of comma-separated values in square brackets. The line of code below creates a list, 'movie', containing the names of the most successful movies and their box office collections in billions of dollars.
movie = ["Avengers: Endgame", 2.796, "Avatar", 2.789, "Titanic", 2.187, "Star Wars: The Force Awakens", 2.068, "Avengers: Infinity War", 2.048]
type(movie)
Output:
list
It is also possible to create a list of lists, as shown below.
movie2 = [["Avengers: Endgame", 2.796], ["Avatar", 2.789], ["Titanic", 2.187], ["Star Wars: The Force Awakens", 2.068], ["Avengers: Infinity War", 2.048]]
movie2
Output:
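[['Avengers: Endgame', 2.796],
 ['Avatar', 2.789],
 ['Titanic', 2.187],
 ['Star Wars: The Force Awakens', 2.068],
 ['Avengers: Infinity War', 2.048]]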
Subsetting lists is easy in Python. Indexing starts at 0, which is why it is called zero-based indexing. Suppose we want to see the first element of the list 'movie'. We can do this with the syntax movie[0].
movie[0]
Output:
'Avengers: Endgame'
Like subsetting, list slicing is easy in Python. The syntax is in the format list[start:end], in which the 'start' index is inclusive but the 'end' index is exclusive. The example below returns the elements at indices 3, 4, and 5 (the fourth through sixth elements), excluding the element at index 6.
movie[3:6]
Output:
[2.789, 'Titanic', 2.187]
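A few other slice forms follow from the same start-inclusive, end-exclusive rule. The sketch below reuses the 'movie' list from this section; the variable names 'first_pair', 'last_pair', and 'titles' are introduced here only for illustration.

```python
movie = ["Avengers: Endgame", 2.796, "Avatar", 2.789, "Titanic", 2.187,
         "Star Wars: The Force Awakens", 2.068, "Avengers: Infinity War", 2.048]

# Omitting 'start' slices from the beginning of the list.
first_pair = movie[:2]

# Negative indices count from the end of the list.
last_pair = movie[-2:]

# A third value sets the step; a step of 2 picks every other element,
# which in this list selects only the movie titles.
titles = movie[::2]

print(first_pair)
print(last_pair)
print(titles)
```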
We can also perform list manipulation to change, add or remove list elements, as shown below.
# Adding the new element
movie + ["Jurassic World", 1.67]
Output:
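['Avengers: Endgame', 2.796, 'Avatar', 2.789, 'Titanic', 2.187, 'Star Wars: The Force Awakens', 2.068, 'Avengers: Infinity War', 2.048, 'Jurassic World', 1.67]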
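Note that the + operator returns a new list and leaves the original unchanged. Changing and removing elements can be sketched as follows, using a shortened, hypothetical copy of the list named 'films' so that the original 'movie' list is not modified; the replacement value 2.8 is also only illustrative.

```python
# Shortened copy of the 'movie' list (the name 'films' is an assumption).
films = ["Avengers: Endgame", 2.796, "Avatar", 2.789, "Titanic", 2.187]

# The + operator builds a new list; 'films' itself is not modified.
longer = films + ["Jurassic World", 1.67]

# Changing an element in place by assigning to its index.
films[1] = 2.8

# Adding elements in place with append().
films.append("Jurassic World")
films.append(1.67)

# Removing the last two elements in place with del and a slice.
del films[-2:]

print(len(longer))
print(films)
```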
Tuples
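A tuple is a sequence of values separated by commas. The major differences between tuples and lists are that tuples cannot be changed after they are created and that tuples are written and displayed with parentheses (which are optional when creating one, as in the example below), whereas lists use square brackets.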
tuple_1 = 20,30,40,50,60
tuple_1
Output:
(20, 30, 40, 50, 60)
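The immutability of tuples can be checked with a short sketch like the one below. The replacement value 25 and the names 'first' and 'changed' are hypothetical.

```python
tuple_1 = 20, 30, 40, 50, 60

# Reading by index works the same way as with lists.
first = tuple_1[0]

# Assigning to an index raises TypeError because tuples are immutable.
try:
    tuple_1[0] = 25
    changed = True
except TypeError:
    changed = False

print(first, changed)
```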
Dictionary
Lists are convenient, but looking up a value by its numeric position is not always intuitive. A dictionary is an intuitive alternative. The major difference between the two is that lists are indexed by a range of numbers, whereas a dictionary is indexed by unique keys, which makes it suitable for lookup tables. A pair of braces, {}, creates an empty dictionary; placing comma-separated key: value pairs inside the braces creates a populated one, as shown below:
mov = {"Avengers: Endgame": 2.796, "Avatar":2.789, "Titanic":2.187}
mov["Avatar"]
Output:
2.789
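Dictionaries can also be modified after creation. The sketch below shows one way to add, update, check, and remove keys; the updated value 2.79 is hypothetical, while the added pair comes from the 'movie' list earlier in this guide.

```python
mov = {"Avengers: Endgame": 2.796, "Avatar": 2.789, "Titanic": 2.187}

# Adding a new key-value pair.
mov["Star Wars: The Force Awakens"] = 2.068

# Updating the value stored under an existing key (2.79 is illustrative).
mov["Avatar"] = 2.79

# Checking whether a key exists before looking it up.
has_titanic = "Titanic" in mov

# Removing a key-value pair.
del mov["Titanic"]

# get() returns a default value instead of raising KeyError for a missing key.
missing = mov.get("Titanic", None)

print(list(mov.keys()))
```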
Numpy Arrays
Another alternative to Python lists is numpy arrays, which are collections of data points. Lists are powerful, but for data science, we need an alternative that has speed and allows mathematical operations over the elements. Numpy arrays allow for simpler computations, as illustrated in the example below.
import numpy as np
runs = [100, 89, 75, 28]
np_runs = np.array(runs)
over = [10, 9, 6, 2]
np_over = np.array(over)
runs_over = np_runs / np_over
runs_over
Output:
array([10.        ,  9.88888889, 12.5       , 14.        ])
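To see why arrays are convenient here, the sketch below compares the same division on plain lists and on numpy arrays. The names 'list_division_works', 'rates_list', and 'rates_array' are introduced only for this illustration.

```python
import numpy as np

runs = [100, 89, 75, 28]
over = [10, 9, 6, 2]

# Plain lists do not support element-wise division; this raises TypeError.
try:
    runs / over
    list_division_works = True
except TypeError:
    list_division_works = False

# With lists, the same result needs an explicit loop or comprehension.
rates_list = [r / o for r, o in zip(runs, over)]

# NumPy arrays apply the operation to every pair of elements at once.
rates_array = np.array(runs) / np.array(over)

print(list_division_works)
print(np.round(rates_array, 2))
```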
It is possible to do subsetting of the numpy arrays in a similar manner as with lists.
runs_over[0]
Output:
10.0
It is also possible to create n-dimensional arrays using numpy. In the example below, we create a two-dimensional array containing runs and overs.
example_2d = np.array([[100, 89, 75, 28, 35],
[10, 9, 6, 2, 4]])
example_2d
Output:
array([[100, 89, 75, 28, 35],
[ 10, 9, 6, 2, 4]])
Subsetting of the n-dimensional numpy arrays also follows the zero-based indexing method. A few examples are shown in the lines of code below.
# Extracting the first row of the array
print(example_2d[0])
# Extracting the second element of the first row
print(example_2d[0][1])
# Extracting the second and third columns of all rows
example_2d[:,1:3]
Output:
[100  89  75  28  35]
89
array([[89, 75],
       [ 9,  6]])
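Numpy also accepts a row index and a column index inside a single pair of brackets, which is equivalent to the chained form used above. A brief sketch, assuming the same 'example_2d' array:

```python
import numpy as np

example_2d = np.array([[100, 89, 75, 28, 35],
                       [10, 9, 6, 2, 4]])

# shape gives (number of rows, number of columns).
print(example_2d.shape)

# [row, column] selects the same element as chained [row][column] indexing.
print(example_2d[0, 1])

# Second row, all columns.
print(example_2d[1, :])

# All rows, last column.
print(example_2d[:, -1])
```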
Functions
Functions are arguably the most widely used component in predictive modeling. In simple terms, a function is a chunk of reusable code that can be called upon to solve a particular problem. This reduces a lot of coding work at the data scientist's end. We have already used one such function: 'type()'. There are many built-in functions in Python, and for any standard task, there is likely to be a function. In the example below, we print the maximum value and the type of a list with the help of two functions.
s1 = [20, 30, 26, 32, 43, 13]
print(max(s1)); print(type(s1))
Output:
43
<class 'list'>
To read the documentation of a particular function, we can use the help() function.
help(max)
Output:
Help on built-in function max in module builtins:
max(...)
max(iterable, *[, default=obj, key=func]) -> value
max(arg1, arg2, *args, *[, key=func]) -> value
With a single iterable argument, return its biggest item. The
default keyword-only argument specifies an object to return if
the provided iterable is empty.
With two or more arguments, return the largest argument.
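Besides the built-in functions, you can define your own reusable code with the def keyword. The sketch below is a minimal example; the function name 'average' is hypothetical and is not part of Python's built-ins.

```python
def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


s1 = [20, 30, 26, 32, 43, 13]

# The function can be called as often as needed with different inputs.
mean_s1 = average(s1)
mean_pair = average([10, 20])

print(mean_s1)
print(mean_pair)
```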
Packages
Functions are powerful, but complex code can get messy, requiring a lot of maintenance. In such cases, we can get help from Python packages.
A package can be considered a directory of Python scripts, where each script is a module. These modules define functions, methods, and types. Python has several powerful packages, and some of the most common ones are:
1. NumPy - stands for Numerical Python. It is used for creating and dealing with n-dimensional arrays and contains basic linear algebra functions, along with many other numerical capabilities.
2. Matplotlib - for visualization.
3. Pandas - for structured data operations and manipulations.
4. scikit-learn - for machine learning. This is the most popular package for building machine learning models.
5. Statsmodels - for statistical modeling.
6. NLTK - for natural language processing.
7. SciPy - stands for Scientific Python, and is built on NumPy.
There are thousands of other packages, but the ones listed above are the most widely used for data science. It is easy to import these packages using the import command. For example, the lines of code below import the 'numpy' package and create an array, first using the full package name and then using the conventional alias 'np'.
import numpy
numpy.array([20, 21, 22])
import numpy as np
np.array([20, 21, 22])
Output:
array([20, 21, 22])
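A third import form brings a single name into scope with from ... import. The sketch below also imports 'math', a module from Python's standard library, purely as an additional illustration; the variable names 'arr' and 'root' are assumptions.

```python
# Importing a single function from a package.
from numpy import array

arr = array([20, 21, 22])

# Importing a module from the standard library, which ships with Python.
import math

root = math.sqrt(16)

print(arr, root)
```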
Dataframes
In a previous section, we learned about numpy arrays, which are collections of data points. However, the limitation of an array is that it can handle only one data type. For real-world data science problems, you need datasets that can handle different types of data, such as text, float, integer, etc. The solution is the dataframe, which is the de facto data format for machine learning and predictive modeling. The typical structure of a dataframe contains observations in rows and variables in columns.
A dataframe can be constructed from a dictionary, as in the lines of code below. The first line imports the pandas library. The next statement creates a dictionary with the keys 'movie', 'collections', and 'release_yr', each mapped to a list of values. The following line converts this dictionary into a pandas dataframe, and the last line displays the resulting dataframe.
import pandas as pd
dic_movie = {
"movie":["Avengers: Endgame", "Avatar", "Titanic", "Star Wars: The Force Awakens", "Avengers: Infinity War"],
"collections":[2.796, 2.789, 2.187, 2.068, 2.048],
"release_yr":[2019, 2009, 1997, 2015, 2018]}
movie_df = pd.DataFrame(dic_movie)
movie_df
Output:
| | movie | collections | release_yr |
|--- |------------------------------ |------------- |------------ |
| 0 | Avengers: Endgame | 2.796 | 2019 |
| 1 | Avatar | 2.789 | 2009 |
| 2 | Titanic | 2.187 | 1997 |
| 3 | Star Wars: The Force Awakens | 2.068 | 2015 |
| 4 | Avengers: Infinity War | 2.048 | 2018 |
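Once a dataframe exists, columns can be selected by name and rows filtered by a condition. The following is a minimal sketch using the same data; the cutoff year 2015 and the names 'collections' and 'recent' are chosen only for illustration.

```python
import pandas as pd

movie_df = pd.DataFrame({
    "movie": ["Avengers: Endgame", "Avatar", "Titanic",
              "Star Wars: The Force Awakens", "Avengers: Infinity War"],
    "collections": [2.796, 2.789, 2.187, 2.068, 2.048],
    "release_yr": [2019, 2009, 1997, 2015, 2018],
})

# Each column can hold a different data type.
print(movie_df.dtypes)

# Selecting a single column returns a pandas Series.
collections = movie_df["collections"]

# Filtering rows with a boolean condition (2015 is an arbitrary cutoff).
recent = movie_df[movie_df["release_yr"] >= 2015]

print(recent["movie"].tolist())
```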
We can also create dataframes from existing comma-separated values files, also called 'csv' files. The 'read_csv' function from the pandas library reads such a file, and the 'head' method displays its first five rows, as in the lines of code below.
df = pd.read_csv("data_desc.csv")
df.head()
Output:
| | Marital_status | Dependents | Is_graduate | Income | Loan_amount | Term_months | Credit_score | approval_status | Age | Sex |
|--- |---------------- |------------ |------------- |-------- |------------- |------------- |-------------- |----------------- |----- |----- |
| 0 | Yes | 2 | Yes | 306800 | 43500 | 204 | Satisfactory | Yes | 76 | M |
| 1 | Yes | 3 | Yes | 702100 | 104000 | 384 | Satisfactory | Yes | 75 | M |
| 2 | No | 0 | Yes | 558800 | 66500 | 384 | Satisfactory | Yes | 75 | M |
| 3 | Yes | 2 | Yes | 534500 | 64500 | 384 | Satisfactory | Yes | 75 | M |
| 4 | Yes | 2 | Yes | 468000 | 135000 | 384 | Satisfactory | Yes | 75 | M |
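If the file 'data_desc.csv' is not at hand, the same call can be tried on in-memory text. The sketch below is an assumption-laden stand-in: the CSV text contains only five of the columns and two of the rows shown above, and the names 'csv_text' and 'df_small' are hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk, reusing a few values from the output above.
csv_text = """Marital_status,Dependents,Is_graduate,Income,Loan_amount
Yes,2,Yes,306800,43500
No,0,Yes,558800,66500
"""

# read_csv accepts a file path or a file-like object such as StringIO.
df_small = pd.read_csv(io.StringIO(csv_text))

print(df_small.shape)
print(df_small.head())
```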
Machine Learning
In the previous sections, we have learned the basic concepts related to Python for data science. However, the most popular and challenging data science task is to build machine learning models. In simple terms, machine learning is the field of teaching machines and computers to learn from existing data and to make predictions on the new, unseen data. Python is one of the most powerful languages for machine learning, and is extensively used for building data science products.
Machine learning is a vast topic and is beyond the scope of this guide. To learn about data preparation and building machine learning models using Python, please refer to the following guides:
- Scikit Machine Learning
- Linear, Lasso, and Ridge Regression with scikit-learn
- Non-Linear Regression Trees with scikit-learn
- Machine Learning with Neural Networks Using scikit-learn
- Validating Machine Learning Models with scikit-learn
- Ensemble Modeling with scikit-learn
- Preparing Data for Modeling with scikit-learn
Conclusion
In this guide, you have acquired the fundamental knowledge to succeed in data science with Python. Specifically, you now have a basic understanding of:
- Variables
- Lists
- Dictionary
- Arrays
- Functions
- Packages
- Dataframes
- Introduction to Machine Learning
Understanding these concepts will enable you to handle basic data science tasks successfully and provide a foundation for more complex skills.