

Thursday, 29 September 2022

5. Data Analysis - Data Preparation

Check Missing data

  • X.isnull()
  • X.notnull()

Delete Missing Data

  • X.dropna() -- deletes any row that has missing data
  • X.dropna(how='all') -- deletes a row only if all columns in that row are missing
  • X.dropna(axis=1) -- deletes any column that has missing data
  • X.dropna(axis=1, how='all') -- deletes a column only if all values in that column are missing

Impute missing value

  • X.fillna(99) -- fills every missing value with the same constant
  • X.fillna({"col1":99, "col2":10}) -- 'col1' NA values are imputed with 99 and 'col2' NA values with 10
  • X.fillna(method='ffill') -- missing values are imputed with the previous value in the same column
  • X.fillna(method='ffill', limit=2) -- at most 2 values per column are imputed
  • X.fillna(X.mean()) -- imputing with statistical values (see the sketch below)
  • X.fillna(X.median())
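A minimal runnable sketch of these options on a made-up frame (the frame and column names col1/col2 are just for illustration; recent pandas versions also offer X.ffill() in place of fillna(method='ffill')):

  import numpy as np
  import pandas as pd

  # Hypothetical toy frame with a few missing values
  X = pd.DataFrame({"col1": [1.0, np.nan, 3.0, np.nan],
                    "col2": [np.nan, 20.0, 30.0, 40.0]})

  print(X.fillna(99))                          # every NaN becomes 99
  print(X.fillna({"col1": 99, "col2": 10}))    # per-column constants
  print(X.fillna(method="ffill", limit=2))     # forward-fill, at most 2 per column
  print(X.fillna(X.mean()))                    # column means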

Handling duplicate values

  • X.duplicated() -- returns a boolean Series marking duplicated rows.
    • If two rows are duplicates, the 2nd occurrence is marked as the duplicate.
  • X.drop_duplicates() -- removes the duplicates (the 2nd occurrence is removed)
  • X.duplicated(["col1"]) -- duplicates judged on 'col1' only
  • X.drop_duplicates(["col1"])
  • X.drop_duplicates(["col1"], keep="last") -- all duplicates are removed except the last one

Data Transformation

MAPS - Group various "Floors" to Lower/Higher category

  • mapFunction = {"Floor1":"Lower Floor", "Floor2":"Lower Floor", "Floor3":"Upper Floor", "Floor4":"Upper Floor"}
  • X["Floor Group] = X["Floor Number"].map(mapFunction)

Replace

  • X.replace(-999, np.nan)
  • X.replace([-999, -1000], [np.nan, 1000]) --- replaces -999 with np.nan and -1000 with 1000
  • X.replace({-999: np.nan, -1000: 1000}) --- the same replacement using a dictionary
Update Row and Column names
  • X.columns.map(lambda x: x.upper()) -- works on the Index object holding the column labels
  • X.index.map(lambda x: x.upper()) -- works on the row Index object
  • X.rename(index={'a':'A', 'b':'B'}, columns={'col1':'COL1', 'col2':'COL2'}) -- works with the dataframe
Changing the same object in place (inplace=True)
  • X.rename(index={'a':'A', 'b':'B'}, columns={'col1':'COL1', 'col2':'COL2'}, inplace=True)

Binning - CUT and QCUT  function

  • salary=[10,20,30,40,50,60]
  • salarybins=[0,20,40,60]
  • pd.cut(salary, salarybins) -- creates the bins (0, 20], (20, 40], (40, 60]
  • pd.cut(salary, 8) -- 8 equal-width bins are created
  • pd.qcut(salary, [0, 0.1, 0.5, 0.9, 1]) -- numbers from 0 to 1 are passed; they are the quantile edges of the bins (see the sketch below)
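The binning calls above as a runnable sketch:

  import pandas as pd

  salary = [10, 20, 30, 40, 50, 60]
  salarybins = [0, 20, 40, 60]

  print(pd.cut(salary, salarybins))              # bins (0, 20], (20, 40], (40, 60]
  print(pd.cut(salary, 8))                       # 8 equal-width bins
  print(pd.qcut(salary, [0, 0.1, 0.5, 0.9, 1]))  # bins cut at the 10th, 50th and 90th percentiles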
Identify outliers
  • X.describe() -- values far above the 75th percentile (for example, more than 1.5 × IQR above it) are candidate outliers

Taking Samples

  • data.sample(3) -- draws 3 random rows from the dataframe
  • data.sample(8, replace=True) -- samples 8 rows with replacement (rows may repeat)

Creating dummy variables

  • X1 = pd.get_dummies(X["col1"], prefix="dummy")
  • X = X.join(X1) -- joining the dummy columns back to the original dataframe (see the sketch below)



4. Data Analysis - Data Ingestion

Data Ingestion

  • pd.read_csv("XYZ.csv", )
  • pd.read_table("XYZ.csv", sep=",")
  • pd.read_table("XYZ.csv", sep=",", header=None) -- pandas will provide header column.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e']) --- provide column names.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e'], index_col="Names") --- Make one column as row labels.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e'], index_col=["Names1", "Names2"]) --- Make two column as row labels.
Check NULL values in the dataframe
  • X.isnull()
Read a particular character as NULL value from file
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"]) -- 'd' and "e" character will be read a NULL.
  • pd.read_csv("XYZ.csv", sep=",", na_values="Col1":["d","e"], "Col2":["a") -- 'd' and "e" from 'COL1" and "a" from "Col2" column will be read a NULL.
Reading large files
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], skiprows=[3,5]) - will skip rows 3 and 5 reading
Limiting the number of rows displayed or read
  • pd.options.display.max_rows = 10
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], skiprows=[3,5]) -- at most 10 rows are displayed: the first 5 and the last 5
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], nrows=5)  - First 5 rows will be read.

Reading a chunk of file 

  • fileChunk = pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], chunksize=5) -- each chunk will have at most 5 rows
  • for temp_chunksize in fileChunk:
    • print(temp_chunksize)
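A sketch of working through a large file chunk by chunk, e.g. summing one column (the file name and the column 'col1' are assumptions):

  import pandas as pd

  total = 0
  fileChunk = pd.read_csv("XYZ.csv", sep=",", chunksize=5)  # iterator of 5-row DataFrames
  for temp_chunk in fileChunk:
      total += temp_chunk["col1"].sum()     # work on one small chunk at a time
  print(total)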

Writing to a CSV file

  • X.to_csv("XYZ.csv") -- NAN value will be empty string
  • X.to_csv("XYZ.csv", na_rep="NULL") -- NAN value will be written as NULL
  • X.to_csv("XYZ.csv", na_rep="NULL", index = False, header=False) -- NAN value will be written as NULL, No row and column label will be printed.
  • X.to_csv("XYZ.csv", na_rep="NULL"index = False, columns=["col1", "col2"]) -- only 2 columns will be printed.

Reading JSON, HTML, Pickle file

  • pd.read_json("iris.json") -- read a JSON file
  • X = pd.read_html("XYZ.html") -- returns a list of DataFrames, one per HTML table
  • X.to_pickle('file') -- stores to a pickle file
  • Y = pd.read_pickle('file') -- reads from a pickle file


Wednesday, 28 September 2022

3. Data Analysis - Pandas Dataframe

 Pandas Dataframe creation

Dataframe creation using dictionary ( with only column values)

  • data1 = {"State":["Karnataka", "Jharkhand"], "Year":["2021", "2022"], "Name":['ABC', 'DEF']}
  • X= pd.DataFrame(data1)  ------- dataframe creation with all features
  • X= pd.DataFrame(data1, columns=["State", "Year"]) ---- dataframe creation with 2 features
  • X= pd.DataFrame(data1, columns=["State", "Year", "JUNK"]) -- Creating dataframe with invalid column. 

Dataframe creation using dictionary with column/index values

  • data1 = {"State":{'one':"Karnataka", 'two':"Jharkhand"}, "Year":{'one':"2021", 'two':"2022"}, "Name":{'one':'ABC', 'two':'DEF'}}
  • X=pd.DataFrame(data1)
  • X=pd.DataFrame(data1, columns=["State", "Year"])

By passing row index separately

  • data1 = {"State":["Karnataka", "Jharkhand"], "Year":["2021", "2022"], "Name":['ABC', 'DEF']}
  • rowIndex=['one', 'two']
  • X=pd.DataFrame(data1, columns=['State', 'Year'], index=rowIndex)
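Putting the three creation patterns above together as one runnable sketch:

  import pandas as pd

  # dictionary of column lists
  data1 = {"State": ["Karnataka", "Jharkhand"],
           "Year": ["2021", "2022"],
           "Name": ["ABC", "DEF"]}
  X = pd.DataFrame(data1, columns=["State", "Year", "JUNK"])   # the 'JUNK' column is all NaN

  # dictionary of dictionaries: the inner keys become the row index
  data2 = {"State": {"one": "Karnataka", "two": "Jharkhand"},
           "Year": {"one": "2021", "two": "2022"}}
  Y = pd.DataFrame(data2)

  # row index passed separately
  Z = pd.DataFrame(data1, columns=["State", "Year"], index=["one", "two"])
  print(X, Y, Z, sep="\n\n")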

Index and Column Update

  • Access row/columns
    • X.columns
    • X.index
  • Accessing Column Names
    • colName = X.columns
  • Access one particular column
    • X["State"]
    • X.State
  • Assign the same value to every row of a column
    • X["State"] = "ABC"
  • Assign different values to each row of a column
    • X["State"] = ["AP", "HP", "KA", "TN"]
  • Add a column
    • X['newColumn'] = X['State'] > "AP" -- a Boolean column is added
  • Delete a column
    • del X["newColumn"]
  • Check if row/column is present in dataframe
    • "newColumn" in X.columns
    • "one" in X.index

Index objects are immutable

Row reindexing - Use reindex to Change order of row

  • X.index[0] = 10 --- Error
  • X1 = X.reindex([10,20,30,0,1,1,3]) --- row reindexing; note that a new dataframe object is created

Column reindexing - Use reindex to Change order of columns

    • X1 = X.reindex(columns=["Country", "State"]) -- column reindexing

    Index value can be repeated

    • rowIndex=[0,1,1,1,3]

    Transpose

    • X.T

    Change the Heading of Columns 

    • X.columns.name = "ColumnName"
    • X.index.name = "IndexName"

    Element access from dataframe

    Pandas Series Object

    • X["row"]
    • X[0]
    • X[["row1","row3"]]
    • X[[0,2]]
    • X[X>50]
    • X["row1":"row3"]
    • X[0:2]

    Pandas Dataframe

    • X["col1"] -- will extract a column
    • X["col1":"col2"] - will extract a column
    • X[1:3] - row 1 to row=2 will be extracted --- CONFUSION ???

    Loc/At syntax

    • X.loc["row1"]
    • X.iloc[0]
    • X.loc["row1", ["col1", "col3"]]
    • X.iloc[0, [0,2]]
    • X[X>5]
    • X.loc[:"row3]
    • X.iloc[:3
    • X.loc[:"row3", "col1":]
    • X.iloc[:3, 2:]
    • X.at["row3", "col2"]   ---- single value is extracted through AT command
    • X.iat[3,2]
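    A small runnable sketch of label-based (loc/at) versus position-based (iloc/iat) access, on a made-up 3x3 frame:

      import pandas as pd

      X = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                       index=["row1", "row2", "row3"],
                       columns=["col1", "col2", "col3"])

      print(X.loc["row1"])                   # one row, by label
      print(X.iloc[0])                       # one row, by position
      print(X.loc["row1", ["col1", "col3"]])
      print(X.iloc[0, [0, 2]])
      print(X.loc[:"row3", "col1":])         # label slices include the end label
      print(X.iloc[:3, 2:])                  # position slices exclude the end position
      print(X.at["row3", "col2"])            # single scalar value, by label
      print(X.iat[2, 1])                     # same value, by position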

    Indexing in stages (chained indexing)

    • X.iloc[:3, :2].iloc[:, [0, 1]]  -- accessing in stages

    Arithmetic operations

    • X1 - X1frame ---- element-wise subtraction, aligned on row and column labels
    • X1.sub(X1frame)  --- same as X1 - X1frame
    • X1.rsub(X1frame, axis = "index") ---- same as X1frame - X1, broadcasting along the index
    • X1.add(X1frame, fill_value = 99) -- where a value is NaN in one operand, 99 is used instead

    Apply

    • X1.apply(lambda x: x.max()) -- will return max value from each column
    • X1.apply(lambda x:x.max(), axis="columns")  -- max value for each row is calculated.

    How to get min/max for each row

    • def minmax(x):
      • return pd.Series([x.min(), x.max()], index=["min", "max"])
    • X1.apply(minmax, axis = "columns")  -- will return min/max for each row.
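    The same idea as a self-contained runnable sketch (X1 is a made-up frame):

      import pandas as pd

      X1 = pd.DataFrame({"col1": [3, 7, 1], "col2": [9, 2, 5]})

      def minmax(x):
          # x is one row (axis="columns") or one column (default)
          return pd.Series([x.min(), x.max()], index=["min", "max"])

      print(X1.apply(lambda x: x.max()))                   # max of each column
      print(X1.apply(lambda x: x.max(), axis="columns"))   # max of each row
      print(X1.apply(minmax, axis="columns"))              # min and max for each row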

    Applymap

    • X.applymap(lambda x: x*10)  -- applymap works on each element of the dataframe; here every element is multiplied by 10

    Sorting

    • X.sort_index() -- sorting of row index
    • X.sort_index(axis=1) -- sorting of column index
    • X.sort_values() -- will sort based on values. -- for a series object
    • X.sort_values(by="col2") -- will be sorted based on col='col2' -- for dataframe object.
    • X.sort_values(by=["col2", "col1"])

    Rank

    • X.rank()  -- column wise ranking
    • X.rank(ascending=True)
    • X.rank(axis=1) --row wise ranking.

    Summarizing data

    • X.sum()
    • X.sum(axis=1)
    • X.describe()
    • X.corr()
    • X["col1"].corr(X["corr2"])
    • X.cov()
    • X["col1"].cov(X["corr2"])
    • Others ( sum, min,max, quantile,mean, median, kurt, skew, cumsum, cummax, cummin etc)



    Tuesday, 27 September 2022

    2. Data Analysis - Pandas Series

     Series Creation

    Series creation with default index
    • X = pd.Series([10,20,30,40])  -- by passing a list
    • X.index, X.values -- attributes, not methods
    • print(X[0], X[[0,2,3]], X[1:3]) --- access the series values
    Series creation with labeled index
    • X = pd.Series([10,20,30,40], index = ['l1', 'l2', 'l3', 'l4'])
    • X['l1'], X[['l1', 'l2']], X['l2':'l4']
    Series creation using dictionary
    • pd.Series(dict1) -- in same order as dictionary
    • pd.Series(dict1, index=['k2', 'k1'])  -- choose an order different from the dictionary's
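    A minimal runnable sketch of the creation patterns above:

      import pandas as pd

      X = pd.Series([10, 20, 30, 40])                            # default integer index 0..3
      Y = pd.Series([10, 20, 30, 40], index=["l1", "l2", "l3", "l4"])
      print(Y["l1"], list(Y[["l1", "l2"]]), list(Y["l2":"l4"]))  # label slices include the end label

      dict1 = {"k1": 1, "k2": 2}
      print(pd.Series(dict1))                                    # same order as the dictionary
      print(pd.Series(dict1, index=["k2", "k1"]))                # reorder via an explicit index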

    Series Filter

    • X[X<0]   -- returns all negative elements

    Mathematical operation

    • obj + obj2
      • labels that are not present in both series get NaN in the result

    check NULL value

    • X.isnull() -- returns boolean array
    • X.notnull() -- returns boolean array

    change the index

    • X.index = ['k1', 'k2']
    • X.index.name = "indexName"


    Monday, 26 September 2022

    1. Data Analysis - NumPy Operations

     Numpy Operations  

    Numpy Array Creation - 
    • X = np.array([10,20,30,40])
    • X = np.zeros(10)
    • X = np.ones(10)
    • X = np.empty(10)
    • X = np.arange(1, 11)  ---- will create a 10-element array with elements 1 to 10
    Array Creation using data type
    • X = np.array([10,20,30,40], dtype = np.float64)
    Changing data type
    • X = X.astype(np.int32)
    Arithmetic operation
    • X = np.array([[10,20,30,40],[50,60,70,80]])
    Indexing and slicing 
    • One dimension 
      • X[1] ----- Will return X[1]
      • X[2:4] ------ will return X[2], X[3]

    • Multiple dimension
      • X[0] ------- Will return 0th row
      • X[0][1] -- Will return 0th row and 1st column.
    Propagation and Broadcast
    • X[2:4] = 100   ---------- X[2], X[3] will be assigned 100.
    Row slicing & Column slicing using booleans
    • Y = np.array(["True", "True", "False", "False", "True"]),     
    • X[Y=="True"]      -- row slicing
    • X[:, Y="True"]     -- column slicing
    Manipulating all elements
    • X[X<0] = 0  --- any element less than 0 is assigned the value 0
    Fancy Indexing
    • Get rows
      • X[1]
      • X[[1,3,5]]

    • Get Columns
      • X[:, 1:4]
      • X[:, [1,4,5]] 

    • Get elements at paired row/column positions
      • X[[1,3,4], [2,4,5]] -- returns the three elements at positions (1,2), (3,4) and (4,5)
    Fancy indexing applied in stages (chained)
    • X[[1,4,5]][:, [0,1]]
    Reshape
    • X.reshape(8,4)
    • X.transpose()

    Universal Function

    • One input
      • np.sqrt(X)
      • np.exp(X)
    • Multiple Input
      • np.maximum(X,Y)
      • np.add(X,Y)
    • Finding NaN values
      • np.isnan(X)  -- output is a boolean array like [False, False, ..., True]; True indicates the corresponding value is NaN
    • Check whether any element is NaN
      • np.isnan(X).any()

    Vectorization

    • Mesh Grid
      • Xaxis, Yaxis = np.meshgrid(X,Y)
    • np.where
      • K = np.where(Z, X, Y)
      • K = np.where(X>0, Y ,Z )
        • np.where(X>99, X, 100) -> combination of a scalar and a numpy array
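    A runnable sketch of np.where with array and scalar branches (X, Y, Z are made-up arrays):

      import numpy as np

      X = np.array([5, -3, 120, 7])
      Y = np.array([1, 1, 1, 1])
      Z = np.array([0, 0, 0, 0])

      print(np.where(X > 0, Y, Z))       # 1 where X is positive, else 0
      print(np.where(X > 99, X, 100))    # keep X where it exceeds 99, else the scalar 100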

    Statistical Functions

    • np.mean(X) --- Mean of all element of the array.
    • np.std(X)
    • np.sum()
    • X.mean(axis=0) -- Mean of each column
    • X.mean(axis=1) -- Mean of each row.

    Boolean arrays

    • X.any() - IF any value is TRUE
    • X.sum() - Count of all TRUE value
    • X.all() - If all value is TRUE
    • X>0 -> returns a boolean array.

    Sorting

    • np.sort(X, axis = None) -- sort all element of array and return one dimensional array
    • np.sort(X, axis = 0) - Column wise sorting
    • np.sort(X, axis = 1 ) - Row wise sorting

    Remove duplicates

    • np.unique(X) --  removes the duplicates

    Check common values between two array

    • np.in1d(X,Y) -- returns a boolean array with True indicating a common value
    • np.union1d(X,Y) - returns union of X and Y

    Saving to a file

    Saving one array in one file

    • np.save("OneDimension", X)
    • np.load("OneDimension.npy")

    Save multiple array using key/value

    • np.savez("Onedimension", key1=X, key2=Y) -- saved as Onedimension.npz
    • k = np.load("Onedimension.npz")
    • k["key1"], k["key2"]
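    A runnable sketch of saving and reloading arrays (file names are placeholders):

      import numpy as np

      X = np.arange(5)
      Y = np.arange(10)

      np.save("OneDimension", X)                  # writes OneDimension.npy
      print(np.load("OneDimension.npy"))

      np.savez("Onedimension", key1=X, key2=Y)    # writes Onedimension.npz
      k = np.load("Onedimension.npz")
      print(k["key1"], k["key2"])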

    Other functionality

    • np.dot(X,Y)
    • k = inv(Y) ---- from numpy.linalg import inv





    Tuesday, 25 August 2020

    What is Python Programming Language ?

     Python is an open-source programming language that stresses simplicity, readability and brevity. It is free, which means you don't have to pay anything to use it. It is now being used in one form or another in many domains such as Data Analysis, Machine Learning, Gaming, GUI development, Network programming and so on, and this list is continuously growing.

    Do check the design philosophy of Python in the screenshot below (the output of 'import this' in Python). You will see words like 'Simple', 'Readability', 'easy to explain' and 'Explicit' that are the guiding force behind Python.

    Do check out my video on Python history and why we should use Python.

