Thursday, 29 September 2022

5. Data Analysis - Data Preparation

Check Missing data

  • X.isnull()
  • X.notnull()

Delete Missing Data

  • X.dropna() -- deletes any row that has missing data.
  • X.dropna(how="all") -- deletes a row only if all of its values are missing.
  • X.dropna(axis=1) -- deletes any column that has missing data.
  • X.dropna(axis=1, how="all") -- deletes a column only if all of its values are missing.

Impute missing value

  • X.fillna(99) -- fills every missing value with the same constant.
  • X.fillna({"col1":99, "col2":10}) -- 'col1' NA values are imputed with 99 and 'col2' NA values with 10.
  • X.fillna(method='ffill') -- each missing value is imputed with the previous value in the same column.
  • X.fillna(method='ffill', limit=2) -- at most 2 values per column are imputed.
  • X.fillna(X.mean()) -- imputing with statistical values.
  • X.fillna(X.median())
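A minimal sketch of the imputation options above, on a small made-up frame (column names `col1`/`col2` are illustrative; `.ffill()` is the modern spelling of `fillna(method='ffill')`):

```python
import numpy as np
import pandas as pd

# Small illustrative frame with missing values.
X = pd.DataFrame({"col1": [1.0, np.nan, 3.0],
                  "col2": [np.nan, 5.0, np.nan]})

filled_const = X.fillna({"col1": 99, "col2": 10})  # per-column constants
filled_ffill = X.ffill(limit=2)                    # forward fill, at most 2 per column
filled_mean = X.fillna(X.mean())                   # impute with each column's mean
```

Note that a value with no predecessor in its column (here the first `col2` entry) stays NaN after a forward fill.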

Handling duplicate values

  • X.duplicated() -- returns a boolean Series marking duplicated rows.
    • If two rows are identical, the 2nd row is flagged as the duplicate.
  • X.drop_duplicates() -- removes the duplicates (the 2nd occurrence is dropped).
  • X.duplicated(["col1"]) -- considers only 'col1' when checking for duplicates.
  • X.drop_duplicates(["col1"])
  • X.drop_duplicates(["col1"], keep="last") -- keeps the last occurrence of each duplicate instead of the first.
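The behaviour above can be checked on a tiny frame (the data is made up for illustration):

```python
import pandas as pd

X = pd.DataFrame({"col1": ["a", "a", "b"], "col2": [1, 1, 2]})

mask = X.duplicated()                                 # second identical row is flagged
deduped = X.drop_duplicates()                         # first occurrence survives
last_kept = X.drop_duplicates(["col1"], keep="last")  # keep last row per col1 value
```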

Data Transformation

MAPS - Group various "Floors" to Lower/Higher category

  • mapFunction = {"Floor1":"Lower Floor", "Floor2":"Lower Floor", "Floor3":"Upper Floor", "Floor4":"Upper Floor"}
  • X["Floor Group"] = X["Floor Number"].map(mapFunction)
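A small sketch of the mapping above (the frame and its values are illustrative; the dictionary is renamed `floor_map` here since it is a dict, not a function):

```python
import pandas as pd

X = pd.DataFrame({"Floor Number": ["Floor1", "Floor3", "Floor2"]})
floor_map = {"Floor1": "Lower Floor", "Floor2": "Lower Floor",
             "Floor3": "Upper Floor", "Floor4": "Upper Floor"}

# Each floor label is mapped to its group; unmapped labels would become NaN.
X["Floor Group"] = X["Floor Number"].map(floor_map)
```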

Replace

  • X.replace(-999, np.nan)
  • X.replace([-999, -1000], [np.nan, 1000]) --- replace -999 with np.nan and -1000 with 1000
  • X.replace({-999: np.nan, -1000: 1000}) --- the same replacement expressed as a dictionary
Update Row and Column names
  • X.columns.map(lambda x: x.upper())  -- works on the columns Index object
  • X.index.map(lambda x: x.upper()) -- works on the row Index object
  • X.rename(index={'a':'A', 'b':'B'}, columns={'col1':'COL1', 'col2':'COL2'}) -- works on the dataframe.
Changing the same object in place ( inplace=True )
  • X.rename(index={'a':'A', 'b':'B'}, columns={'col1':'COL1', 'col2':'COL2'}, inplace=True)

Binning - CUT and QCUT  function

  • salary=[10,20,30,40,50,60]
  • salarybins=[0,20,40,60]
  • pd.cut(salary, salarybins) -- create bins 0-20, 20-40, 40-60
  • pd.cut(salary, 8) -- 8 bins will be created.
  • pd.qcut(salary, [0,0.1,0.5,0.9,1]) -- numbers from 0 to 1 are passed; these are quantile boundaries, so each bin holds that share of the data.
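The cut/qcut calls above, run on the sample salaries from the notes:

```python
import pandas as pd

salary = [10, 20, 30, 40, 50, 60]

bins = pd.cut(salary, [0, 20, 40, 60])                  # explicit edges: (0,20], (20,40], (40,60]
eight_bins = pd.cut(salary, 8)                          # 8 equal-width bins
quantile_bins = pd.qcut(salary, [0, 0.1, 0.5, 0.9, 1])  # edges given as quantiles
```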
Identify outliers
  • X.describe() -- values far above the 75th percentile (for example, more than 1.5 times the IQR beyond it) can be outliers.

Taking Samples

  • data.sample(3) -- draws 3 random rows from the dataframe
  • data.sample(8, replace=True) -- samples with replacement, so the same row may be drawn more than once

Creating dummy variables

  • X1 = pd.get_dummies(X["col1"], prefix="dummy")
  • X = X.join(X1) -- joining the dummy columns back to the original dataframe.
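A minimal dummy-variable sketch (column `col1` and its values are made up):

```python
import pandas as pd

X = pd.DataFrame({"col1": ["a", "b", "a"]})
X1 = pd.get_dummies(X["col1"], prefix="dummy")  # one indicator column per category
X = X.join(X1)                                  # attach back to the original frame
```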



4. Data Analysis - Data Ingestion

Data Ingestion


  • pd.read_csv("XYZ.csv")
  • pd.read_table("XYZ.csv", sep=",")
  • pd.read_table("XYZ.csv", sep=",", header=None) -- pandas will generate default column names.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e']) --- provide column names.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e'], index_col="Names") --- make one column the row labels.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e'], index_col=["Names1", "Names2"]) --- make two columns the row labels.
Check NULL values in a dataframe
  • X.isnull()
Read particular characters as NULL values from a file
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"]) -- the characters 'd' and 'e' will be read as NULL.
  • pd.read_csv("XYZ.csv", sep=",", na_values={"Col1":["d","e"], "Col2":["a"]}) -- 'd' and 'e' from 'Col1' and 'a' from 'Col2' will be read as NULL.
Skipping rows while reading large files
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], skiprows=[3,5]) -- rows 3 and 5 will be skipped while reading.
Limiting rows displayed and rows read
  • pd.options.display.max_rows = 10
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], skiprows=[3,5])  -- the whole file is still read, but only 10 rows are displayed: the first 5 and the last 5.
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], nrows=5)  -- only the first 5 rows are read.

Reading a chunk of file 

  • fileChunk = pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], chunksize=5) -- every chunk will have 5 rows.
  • for temp_chunk in fileChunk:
    • print(temp_chunk)
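Chunked reading can be sketched with an in-memory buffer standing in for XYZ.csv (the data here is made up):

```python
import io
import pandas as pd

# io.StringIO stands in for a real "XYZ.csv" on disk.
csv_text = "a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n11,12\n"

rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    rows += len(chunk)  # each chunk is a DataFrame of up to 5 rows
```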

Writing to a CSV file

  • X.to_csv("XYZ.csv") -- NaN values will be written as empty strings
  • X.to_csv("XYZ.csv", na_rep="NULL") -- NaN values will be written as NULL
  • X.to_csv("XYZ.csv", na_rep="NULL", index=False, header=False) -- NaN values written as NULL; no row or column labels are written
  • X.to_csv("XYZ.csv", na_rep="NULL", index=False, columns=["col1", "col2"]) -- only the 2 listed columns are written

Reading JSON, HTML, Pickle file

  • pd.read_json("iris.json") -- read a JSON file
  • X = pd.read_html("*.html") -- returns a list of dataframes, one per HTML table
  • X.to_pickle('file') -- stores to a pickle file
  • Y = pd.read_pickle('file') -- reads from a pickle file


Wednesday, 28 September 2022

3. Data Analysis - Pandas Dataframe

 Pandas Dataframe creation

Dataframe creation using dictionary ( with only column values)

  • data1 = {"State":["Karnataka", "Jharkhand"], "Year":["2021", "2022"], "Name":['ABC', 'DEF']}
  • X= pd.DataFrame(data1)  ------- dataframe creation with all features
  • X= pd.DataFrame(data1, columns=["State", "Year"]) ---- dataframe creation with 2 features
  • X= pd.DataFrame(data1, columns=["State", "Year", "JUNK"]) -- the unknown column is created and filled with NaN.

Dataframe creation using a nested dictionary ( with column/index values )

  • data1 = {"State":{'one':"Karnataka", 'two':"Jharkhand"}, "Year":{'one':"2021", 'two':"2022"}, "Name":{'one':'ABC', 'two':'DEF'}}
  • X=pd.DataFrame(data1)
  • X=pd.DataFrame(data1, columns=["State", "Year"])

By passing row index separately

  • data1 = {"State":["Karnataka", "Jharkhand"], "Year":["2021", "2022"], "Name":['ABC', 'DEF']}
  • rowIndex=['one', 'two']
  • X=pd.DataFrame(data1, columns=['State', 'Year'], index=rowIndex)
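The creation patterns above can be combined in one sketch (the data echoes the notes):

```python
import pandas as pd

data1 = {"State": ["Karnataka", "Jharkhand"],
         "Year": ["2021", "2022"],
         "Name": ["ABC", "DEF"]}

# Select two features and supply row labels explicitly.
X = pd.DataFrame(data1, columns=["State", "Year"], index=["one", "two"])
```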

Index and Column Update

  • Access row/columns
    • X.columns
    • X.index
  • Accessing Column Names
    • colName = X.columns
  • Access one particular column
    • X["State"]
    • X.State
  • Assign the same value to an entire column
    • X["State"] = "ABC"
  • Assign different values to a column
    • X["State"] = ["AP", "HP", "KA", "TN"]
  • Add a column
    • X['newColumn'] = X['State']>"AP" -- Boolean array will be returned.
  • Delete a column
    • del X["newColumn"]
  • Check if row/column is present in dataframe
    • "newColumn" in X.columns
    • "one" in X.index

Index objects are immutable

Row reindexing - Use reindex to change the order of rows

  • X.index[0] = 10 --- Error, because Index objects are immutable.
  • X1 = X.reindex([10,20,30,0,1,1,3])  --- Row reindexing. Note that a new dataframe object is created.

Column reindexing - Use reindex to Change order of columns

    • X1 = X.reindex(columns=["Country", "State"]) -- column reindexing

    Index value can be repeated

    • rowIndex=[0,1,1,1,3]

    Transpose

    • X.T

    Change the Heading of Columns 

    • X.columns.name = "ColumnName"
    • X.index.name = "IndexName"

    Element access from dataframe

    Pandas Series Object

    • X["row"]
    • X[0]
    • X[["row1","row3"]]
    • X[[0,2]]
    • X[X>50]
    • X["row1":"row3"]
    • X[0:2]

    Pandas Dataframe

    • X["col1"] -- extracts a column
    • X[["col1", "col2"]] -- extracts multiple columns
    • X[1:3] -- rows 1 to 2 are extracted: slices select rows while single labels select columns, which is the usual source of confusion

    Loc/At syntax

    • X.loc["row1"]
    • X.iloc[0]
    • X.loc["row1", ["col1", "col3"]]
    • X.iloc[0, [0,2]]
    • X[X>5]
    • X.loc[:"row3"]
    • X.iloc[:3]
    • X.loc[:"row3", "col1":]
    • X.iloc[:3, 2:]
    • X.at["row3", "col2"]   ---- a single scalar value is extracted with .at
    • X.iat[3, 2]
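Label-based (`.loc`/`.at`) versus position-based (`.iloc`/`.iat`) access, on a made-up frame:

```python
import pandas as pd

X = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]}, index=["row1", "row2"])

by_label = X.loc["row1", "col2"]  # label based
by_pos = X.iloc[0, 1]             # position based, same cell
single = X.at["row2", "col1"]     # fast scalar lookup
```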

    Hierarchical indexing

    • X.iloc[:3, :2].iloc[:, [0, 1]] -- accessing in stages; each stage returns a dataframe that can be indexed again.

    Arithmetic operations

    • X1 - X1frame ---- element-wise subtraction, aligned on row/column labels
    • X1.sub(X1frame) --- same as X1 - X1frame
    • X1.rsub(X1frame, axis="index") ---- reversed operands: same as X1frame - X1
    • X1.add(X1frame, fill_value=99) -- where a value is missing in one frame, 99 is used instead of NaN

    Apply

    • X1.apply(lambda x: x.max()) -- will return max value from each column
    • X1.apply(lambda x:x.max(), axis="columns")  -- max value for each row is calculated.

    How to get min/max for each row

    • def minmax(x):
      • return pd.Series([x.min(), x.max()], index=["min", "max"])
    • X1.apply(minmax, axis = "columns")  -- will return min/max for each row.
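The minmax helper above, run on a small illustrative frame:

```python
import pandas as pd

def minmax(x):
    # Return both statistics as a labelled Series so apply builds a frame.
    return pd.Series([x.min(), x.max()], index=["min", "max"])

X1 = pd.DataFrame({"a": [1, 4], "b": [2, 3]})
result = X1.apply(minmax, axis="columns")  # one min/max pair per row
```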

    Applymap

    • X.applymap(lambda x: x*10)  -- applymap works for each element of array and in this case multiply each element by 10.

    Sorting

    • X.sort_index() -- sorting of row index
    • X.sort_index(axis =1 ) -- sorting of column index 
    • X.sort_values() -- will sort based on values. -- for a series object
    • X.sort_values(by="col2") -- will be sorted based on col='col2' -- for dataframe object.
    • X.sort_values(by=["col2", "col1"])

    Rank

    • X.rank()  -- column wise ranking
    • X.rank(ascending=True)
    • X.rank(axis=1) --row wise ranking.

    Summarizing data

    • X.sum()
    • X.sum(axis=1)
    • X.describe()
    • X.corr()
    • X["col1"].corr(X["col2"])
    • X.cov()
    • X["col1"].cov(X["col2"])
    • Others ( sum, min,max, quantile,mean, median, kurt, skew, cumsum, cummax, cummin etc)



    Tuesday, 27 September 2022

    2. Data Analysis - Pandas Series

     Series Creation

    Series creation with a default index
    • X = pd.Series([10,20,30,40])  -- by passing a list
    • X.index, X.values
    • print(X[0], X[[0,2,3]], X[1:3]) --- access the series values
    Series creation with a labeled index
    • X = pd.Series([10,20,30,40], index=['l1', 'l2', 'l3', 'l4'])
    • X['l1'], X[['l1', 'l2']], X['l2':'l4']
    Series creation using a dictionary
    • pd.Series(dict1) -- in the same order as the dictionary
    • pd.Series(dict1, index=['k2', 'k1'])  -- change the order from what the dictionary specifies
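A quick sketch of Series creation from a dictionary (the keys are illustrative):

```python
import pandas as pd

d = {"k1": 10, "k2": 20}

s = pd.Series(d)                       # keeps the dictionary's key order
s2 = pd.Series(d, index=["k2", "k1"])  # reorders via an explicit index
```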

    Series Filter

    • X[X<0]   -- returns all negative elements.

    Mathematical operation

    • obj + obj2
      • if index is not matching then it will be NAN

    check NULL value

    • X.isnull() -- returns a boolean array
    • X.notnull() -- returns a boolean array

    change the index

    • X.index = ['k1', 'k2']
    • X.index.name = "indexName"


    Monday, 26 September 2022

    1. Data Analysis - NumPy Operations

     Numpy Operations  

    Numpy Array Creation -
    • X = np.array([10,20,30,40])
    • X = np.zeros(10)
    • X = np.ones(10)
    • X = np.empty(10)
    • X = np.arange(1, 11)  ---- will create a 10 element array holding 1 to 10
    Array creation with a data type
    • X = np.array([10,20,30,40], dtype=np.float64)
    Changing data type
    • X = X.astype(np.int32)
    Arithmetic operations
    • X = np.array([[10,20,30,40],[50,60,70,80]]) -- element-wise arithmetic (X+X, X*X) works on equal-shaped arrays
    Indexing and slicing 
    • One dimension
      • X[1] ----- will return the element at index 1
      • X[2:4] ------ will return X[2], X[3]

    • Multiple dimension
      • X[0] ------- Will return 0th row
      • X[0][1] -- Will return 0th row and 1st column.
    Propagation and Broadcast
    • X[2:4] = 100   ---------- X[2], X[3] will be assigned 100.
    Row slicing & Column slicing using booleans
    • Y = np.array([True, True, False, False, True])
    • X[Y]      -- row slicing
    • X[:, Y]     -- column slicing
    Manipulating all elements
    • X[X<0] = 0  --- any element less than 0 will be assigned the value 0
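Boolean-mask assignment in one line (the array values are made up):

```python
import numpy as np

X = np.array([5, -3, 8, -1])
X[X < 0] = 0  # every negative element is replaced in place
```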
    Fancy Indexing
    • Get rows
      • X[1]
      • X[[1,3,5]]

    • Get Columns
      • X[:, 1:4]
      • X[:, [1,4,5]] 

    • Get Multiple rows and columns
      • X[[1,3,4], [2,4,5]]
    Fancy Indexing by chaining
    • X[[1,4,5]][:, [0,1]]
    Reshape
    • X.reshape(8,4)
    • X.transpose()

    Universal Function

    • One input
      • np.sqrt(X)
      • np.exp(X)
    • Multiple Input
      • np.maximum(X,Y)
      • np.add(X,Y)
    • Finding NaN values
      • np.isnan(X)  -- output is a boolean array like [False, False, ..., True]; True marks a NaN value.
    • Check whether any element is NaN
      • np.isnan(X).any()

    Vectorization

    • Mesh Grid
      • Xaxis, Yaxis = np.meshgrid(X,Y)
    • np.where
      • K = np.where(Z, X, Y)
      • K = np.where(X>0, Y ,Z )
        • np.where(X>99, X, 100) -> combination of scaler and numpy array
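The scalar/array mix from the last bullet, on a tiny array:

```python
import numpy as np

X = np.array([120, 50, 200])

# Keep values above 99 as-is; replace the rest with the scalar 100.
K = np.where(X > 99, X, 100)
```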

    Statistical Functions

    • np.mean(X) --- Mean of all element of the array.
    • np.std(X)
    • np.sum()
    • X.mean(axis=0) -- Mean of each column
    • X.mean(axis=1) -- Mean of each row.

    Boolean arrays

    • X.any() - IF any value is TRUE
    • X.sum() - Count of all TRUE value
    • X.all() - If all value is TRUE
    • X>0 -> returns a boolean array.

    Sorting

    • np.sort(X, axis = None) -- sort all element of array and return one dimensional array
    • np.sort(X, axis = 0) - Column wise sorting
    • np.sort(X, axis = 1 ) - Row wise sorting

    Remove duplicates

    • np.unique(X) --  removes the duplicates

    Check common values between two arrays

    • np.in1d(X,Y) -- returns a boolean array with True indicating a common value
    • np.union1d(X,Y) -- returns the union of X and Y

    Saving to a file

    Saving one array in one file

    • np.save("OneDimension", X)
    • np.load("OneDimension.npy")

    Save multiple array using key/value

    • np.savez("Onedimension", key1=X, key2=Y) -- written as Onedimension.npz
    • k = np.load("Onedimension.npz")
    • k["key1"], k["key2"]
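Saving and reloading keyed arrays, written to a temporary directory so the sketch is self-contained:

```python
import os
import tempfile

import numpy as np

X = np.array([1, 2, 3])
Y = np.array([4, 5, 6])

path = os.path.join(tempfile.mkdtemp(), "Onedimension.npz")
np.savez(path, key1=X, key2=Y)   # one .npz file holding both arrays

k = np.load(path)
restored = k["key1"]             # access by the keyword used at save time
```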

    Other functionality

    • np.dot(X,Y)
    • k = inv(Y) ---- from numpy.linalg import inv



