Data Science in Python: Pandas Introduction

In this post I share basic notes on pandas for data science beginners. You will also find a link to a Jupyter notebook with the code and its output.

Pandas Basics

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

Use the following import convention:

import pandas as pd

Pandas Data Structure

Series

A one-dimensional labeled array capable of holding any data type.

s = pd.Series([23, 55, -7, 2], index=['a', 'b', 'c', 'd'])
s

Output:
a    23
b    55
c    -7
d     2
dtype: int64
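
To make the "labeled array" idea a bit more concrete, here is a small sketch that pulls the labels and the values out of the same Series. The variable names `labels`, `values` and `mixed` are my own, chosen only for illustration.

```python
import pandas as pd

s = pd.Series([23, 55, -7, 2], index=['a', 'b', 'c', 'd'])

# The index holds the labels, the data holds the values
labels = list(s.index)
values = s.to_list()

# A Series can also hold mixed Python objects; pandas then falls back
# to the generic "object" dtype instead of int64
mixed = pd.Series([1, 'two', 3.0])

print(labels)
print(values)
print(mixed.dtype)
```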

DataFrame

A two-dimensional labeled data structure with columns of potentially different types

data = {'Country' : ['China', 'India', 'United States', 'Indonesia', 'Pakistan', 'Brazil', 'Nigeria', 'Bangladesh', 'Russia', 'Mexico'],
'Population':[1406371640, 1372574449, 331058112, 270203917, 225200000, 212656200, 211401000, 170054094, 146748590, 126014024] }
df = pd.DataFrame(data, columns=['Country', 'Population'])
df

Output:
Country    Population
0    China    1406371640
1    India    1372574449
2    United States    331058112
3    Indonesia    270203917
4    Pakistan    225200000
5    Brazil    212656200
6    Nigeria    211401000
7    Bangladesh    170054094
8    Russia    146748590
9    Mexico    126014024
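
As a quick check that the columns really have different types, the sketch below inspects a shortened version of the same DataFrame (only the first three countries, to keep it brief). The names `n_rows`, `n_cols`, `column_names` and `population` are illustrative.

```python
import pandas as pd

data = {'Country': ['China', 'India', 'United States'],
        'Population': [1406371640, 1372574449, 331058112]}
df = pd.DataFrame(data, columns=['Country', 'Population'])

# shape is a (rows, columns) tuple
n_rows, n_cols = df.shape
column_names = list(df.columns)

# Each column of a DataFrame is itself a Series
population = df['Population']

# dtypes lists the type of every column: text for Country, integer for Population
print(df.dtypes)
```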

Selection

Also see NumPy Arrays

Getting

s['b']

Output: 55

Slicing a DataFrame with [] selects rows by position:

df[6:]

Output:
Country    Population
6    Nigeria    211401000
7    Bangladesh    170054094
8    Russia    146748590
9    Mexico    126014024
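
The square brackets behave differently depending on what you pass in, which can be confusing at first. A minimal sketch, using a shortened DataFrame and illustrative variable names:

```python
import pandas as pd

s = pd.Series([23, 55, -7, 2], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({
    'Country': ['China', 'India', 'United States', 'Indonesia'],
    'Population': [1406371640, 1372574449, 331058112, 270203917],
})

# On a Series, a label returns the single value stored under it
one_value = s['b']

# On a DataFrame, a column name returns that column as a Series
countries = df['Country']

# On a DataFrame, a slice returns rows (here from position 2 to the end)
last_two = df[2:]

print(one_value)
print(last_two)
```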

Selecting, Boolean Indexing & Setting

By Position

df.iloc[3, 0]

Output: 'Indonesia'

By Label

df.loc[[6], 'Country']

Output:
6    Nigeria
Name: Country, dtype: object
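
With the default 0, 1, 2, ... index, position and label look the same, so the difference between iloc and loc is easy to miss. The sketch below makes it visible by using the country names as the index. This step is my own addition for illustration and is not part of the notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['China', 'India', 'United States'],
    'Population': [1406371640, 1372574449, 331058112],
})

# A single label returns a scalar, a list of labels returns a Series
scalar = df.loc[2, 'Country']
as_series = df.loc[[2], 'Country']

# Use the country names as row labels (illustrative step)
by_name = df.set_index('Country')

# loc looks up by label, iloc by integer position
pop_by_label = by_name.loc['India', 'Population']
pop_by_position = by_name.iloc[1, 0]

print(scalar)
print(pop_by_label, pop_by_position)
```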

Boolean Indexing

result = df[df['Population'] > 270203917]
result

Output:
Country    Population
0    China    1406371640
1    India    1372574449
2    United States    331058112
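
Note that Indonesia is missing from the result because the comparison is strict (>). Here is a short sketch of two variations: using >= to include the boundary value, and combining two conditions with &. The thresholds of 200 million and 1 billion are arbitrary values picked for the example.

```python
import pandas as pd

data = {'Country': ['China', 'India', 'United States', 'Indonesia', 'Pakistan',
                    'Brazil', 'Nigeria', 'Bangladesh', 'Russia', 'Mexico'],
        'Population': [1406371640, 1372574449, 331058112, 270203917, 225200000,
                       212656200, 211401000, 170054094, 146748590, 126014024]}
df = pd.DataFrame(data, columns=['Country', 'Population'])

# >= keeps the row that is exactly equal to the threshold (Indonesia)
at_least = df[df['Population'] >= 270203917]

# Combine conditions with & (and) or | (or); each condition needs parentheses
mask = (df['Population'] > 200_000_000) & (df['Population'] < 1_000_000_000)
mid_sized = df[mask]

print(at_least)
print(mid_sized)
```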

Setting

s['a'] = 777
s['d'] = 999
s

Output:
a    777
b     55
c     -7
d    999
dtype: int64
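
Setting works on more than one label at a time, and on DataFrames too. A minimal sketch, assuming a shortened DataFrame; the column name 'Over 1 billion' is made up for this example.

```python
import pandas as pd

s = pd.Series([23, 55, -7, 2], index=['a', 'b', 'c', 'd'])

# Set several labels in one statement (same result as two separate assignments)
s.loc[['a', 'd']] = [777, 999]

df = pd.DataFrame({
    'Country': ['China', 'India', 'United States'],
    'Population': [1406371640, 1372574449, 331058112],
})

# Assigning to a new column name adds that column to the DataFrame
df['Over 1 billion'] = df['Population'] > 1_000_000_000

print(s)
print(df)
```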

Conclusion

Pandas makes data analysis and manipulation flexible and easy.

See it in practice: Code and Execution

Credits

Photo by fabio on Unsplash