

Thursday, 29 September 2022

5. Data Analysis - Data Preparation

Check Missing data

  • X.isnull()
  • X.notnull()

Delete Missing Data

  • X.dropna() -- deletes any row that has missing data
  • X.dropna(how='all') -- deletes a row only if all columns in that row are missing
  • X.dropna(axis=1) -- deletes any column that has missing data
  • X.dropna(axis=1, how='all') -- deletes a column only if all values in that column are missing

Impute missing value

  • X.fillna(99) -- fills every missing value with the same constant
  • X.fillna({"col1":99, "col2":10}) -- 'col1' NA values are imputed with 99 and 'col2' NA values with 10
  • X.fillna(method='ffill') -- missing values are imputed with the previous value in the same column
  • X.fillna(method='ffill', limit=2) -- at most 2 values per column are imputed
  • X.fillna(X.mean()) -- imputing with statistical values (see the sketch below)
  • X.fillna(X.median())
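A minimal runnable sketch of these options on a made-up frame (the frame and column names col1/col2 are just for illustration; recent pandas versions also offer X.ffill() in place of fillna(method='ffill')):

  import numpy as np
  import pandas as pd

  # Hypothetical toy frame with a few missing values
  X = pd.DataFrame({"col1": [1.0, np.nan, 3.0, np.nan],
                    "col2": [np.nan, 20.0, 30.0, 40.0]})

  print(X.fillna(99))                          # every NaN becomes 99
  print(X.fillna({"col1": 99, "col2": 10}))    # per-column constants
  print(X.fillna(method="ffill", limit=2))     # forward-fill, at most 2 per column
  print(X.fillna(X.mean()))                    # column means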

Handling duplicate values

  • X.duplicated() -- returns a boolean Series marking duplicated rows.
    • If two rows are duplicates, the 2nd occurrence is marked as the duplicate.
  • X.drop_duplicates() -- removes the duplicates (the 2nd occurrence is removed)
  • X.duplicated(["col1"]) -- duplicates judged on 'col1' only
  • X.drop_duplicates(["col1"])
  • X.drop_duplicates(["col1"], keep="last") -- all duplicates are removed except the last one

Data Transformation

MAPS - Group various "Floors" to Lower/Higher category

  • mapFunction = {"Floor1":"Lower Floor", "Floor2":"Lower Floor", "Floor3":"Upper Floor", "Floor4":"Upper Floor"}
  • X["Floor Group] = X["Floor Number"].map(mapFunction)

Replace

  • X.replace(-999, np.nan)
  • X.replace([-999, -1000], [np.nan, 1000]) --- replaces -999 with np.nan and -1000 with 1000
  • X.replace({-999: np.nan, -1000: 1000}) --- the same replacement using a dictionary
Update Row and Column names
  • X.columns.map(lambda x: x.upper()) -- works on the Index object holding the column labels
  • X.index.map(lambda x: x.upper()) -- works on the row Index object
  • X.rename(index={'a':'A', 'b':'B'}, columns={'col1':'COL1', 'col2':'COL2'}) -- works with the dataframe
Changing the same object in place (inplace=True)
  • X.rename(index={'a':'A', 'b':'B'}, columns={'col1':'COL1', 'col2':'COL2'}, inplace=True)

Binning - CUT and QCUT  function

  • salary=[10,20,30,40,50,60]
  • salarybins=[0,20,40,60]
  • pd.cut(salary, salarybins) -- creates the bins (0, 20], (20, 40], (40, 60]
  • pd.cut(salary, 8) -- 8 equal-width bins are created
  • pd.qcut(salary, [0, 0.1, 0.5, 0.9, 1]) -- numbers from 0 to 1 are passed; they are the quantile edges of the bins (see the sketch below)
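The binning calls above as a runnable sketch:

  import pandas as pd

  salary = [10, 20, 30, 40, 50, 60]
  salarybins = [0, 20, 40, 60]

  print(pd.cut(salary, salarybins))              # bins (0, 20], (20, 40], (40, 60]
  print(pd.cut(salary, 8))                       # 8 equal-width bins
  print(pd.qcut(salary, [0, 0.1, 0.5, 0.9, 1]))  # bins cut at the 10th, 50th and 90th percentiles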
Identify outliers
  • X.describe() -- values far above the 75th percentile (for example, more than 1.5 × IQR above it) are candidate outliers

Taking Samples

  • data.sample(3) -- draws 3 random rows from the dataframe
  • data.sample(8, replace=True) -- samples 8 rows with replacement (rows may repeat)

Creating dummy variables

  • X1 = pd.get_dummies(X["col1"], prefix="dummy")
  • X = X.join(X1) -- joining the dummy columns back to the original dataframe (see the sketch below)



4. Data Analysis - Data Ingestion

Data Ingestion

  • pd.read_csv("XYZ.csv", )
  • pd.read_table("XYZ.csv", sep=",")
  • pd.read_table("XYZ.csv", sep=",", header=None) -- pandas will provide header column.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e']) --- provide column names.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e'], index_col="Names") --- Make one column as row labels.
  • pd.read_table("XYZ.csv", sep=",", names=['a', 'b', 'c', 'd', 'e'], index_col=["Names1", "Names2"]) --- Make two column as row labels.
Check NULL values in the dataframe
  • X.isnull()
Read a particular character as NULL value from file
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"]) -- 'd' and "e" character will be read a NULL.
  • pd.read_csv("XYZ.csv", sep=",", na_values="Col1":["d","e"], "Col2":["a") -- 'd' and "e" from 'COL1" and "a" from "Col2" column will be read a NULL.
Reading large files
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], skiprows=[3,5]) - will skip rows 3 and 5 reading
Limiting the number of rows displayed or read
  • pd.options.display.max_rows = 10
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], skiprows=[3,5]) -- at most 10 rows are displayed: the first 5 and the last 5
  • pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], nrows=5)  - First 5 rows will be read.

Reading a chunk of file 

  • fileChunk = pd.read_csv("XYZ.csv", sep=",", na_values=["d","e"], chunksize=5) -- each chunk will have at most 5 rows
  • for temp_chunksize in fileChunk:
    • print(temp_chunksize)
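A sketch of working through a large file chunk by chunk, e.g. summing one column (the file name and the column 'col1' are assumptions):

  import pandas as pd

  total = 0
  fileChunk = pd.read_csv("XYZ.csv", sep=",", chunksize=5)  # iterator of 5-row DataFrames
  for temp_chunk in fileChunk:
      total += temp_chunk["col1"].sum()     # work on one small chunk at a time
  print(total)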

Writing to a CSV file

  • X.to_csv("XYZ.csv") -- NAN value will be empty string
  • X.to_csv("XYZ.csv", na_rep="NULL") -- NAN value will be written as NULL
  • X.to_csv("XYZ.csv", na_rep="NULL", index = False, header=False) -- NAN value will be written as NULL, No row and column label will be printed.
  • X.to_csv("XYZ.csv", na_rep="NULL"index = False, columns=["col1", "col2"]) -- only 2 columns will be printed.

Reading JSON, HTML, Pickle file

  • pd.read_json("iris.json") -- read a JSON file
  • X = pd.read_html("XYZ.html") -- returns a list of DataFrames, one per HTML table
  • X.to_pickle('file') -- stores to a pickle file
  • Y = pd.read_pickle('file') -- reads from a pickle file


Wednesday, 28 September 2022

3. Data Analysis - Pandas Dataframe

 Pandas Dataframe creation

Dataframe creation using dictionary ( with only column values)

  • data1 = {"State":["Karnataka", "Jharkhand"], "Year":["2021", "2022"], "Name":['ABC', 'DEF']}
  • X= pd.DataFrame(data1)  ------- dataframe creation with all features
  • X= pd.DataFrame(data1, columns=["State", "Year"]) ---- dataframe creation with 2 features
  • X= pd.DataFrame(data1, columns=["State", "Year", "JUNK"]) -- Creating dataframe with invalid column. 

Dataframe creation using dictionary with column/index values

  • data1 = {"State":{'one':"Karnataka", 'two':"Jharkhand"}, "Year":{'one':"2021", 'two':"2022"}, "Name":{'one':'ABC', 'two':'DEF'}}
  • X=pd.DataFrame(data1)
  • X=pd.DataFrame(data1, columns=["State", "Year"])

By passing row index separately

  • data1 = {"State":["Karnataka", "Jharkhand"], "Year":["2021", "2022"], "Name":['ABC', 'DEF']}
  • rowIndex=['one', 'two']
  • X=pd.DataFrame(data1, columns=['State', 'Year'], index=rowIndex)
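Putting the three creation patterns above together as one runnable sketch:

  import pandas as pd

  # dictionary of column lists
  data1 = {"State": ["Karnataka", "Jharkhand"],
           "Year": ["2021", "2022"],
           "Name": ["ABC", "DEF"]}
  X = pd.DataFrame(data1, columns=["State", "Year", "JUNK"])   # the 'JUNK' column is all NaN

  # dictionary of dictionaries: the inner keys become the row index
  data2 = {"State": {"one": "Karnataka", "two": "Jharkhand"},
           "Year": {"one": "2021", "two": "2022"}}
  Y = pd.DataFrame(data2)

  # row index passed separately
  Z = pd.DataFrame(data1, columns=["State", "Year"], index=["one", "two"])
  print(X, Y, Z, sep="\n\n")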

Index and Column Update

  • Access row/columns
    • X.columns
    • X.index
  • Accessing Column Names
    • colName = X.columns
  • Access one particular column
    • X["State"]
    • X.State
  • Assign the same value to every row of a column
    • X["State"] = "ABC"
  • Assign different values to each row of a column
    • X["State"] = ["AP", "HP", "KA", "TN"]
  • Add a column
    • X['newColumn'] = X['State'] > "AP" -- a Boolean column is added
  • Delete a column
    • del X["newColumn"]
  • Check if row/column is present in dataframe
    • "newColumn" in X.columns
    • "one" in X.index

Index objects are immutable

Row reindexing - Use reindex to Change order of row

  • X.index[0] = 10 --- Error
  • X1 = X.reindex([10,20,30,0,1,1,3]) --- row reindexing; note that a new dataframe object is created

Column reindexing - Use reindex to Change order of columns

    • X1 = X.reindex(columns=["Country", "State"]) -- column reindexing

    Index value can be repeated

    • rowIndex=[0,1,1,1,3]

    Transpose

    • X.T

    Change the Heading of Columns 

    • X.columns.name = "ColumnName"
    • X.index.name = "IndexName"

    Element access from dataframe

    Pandas Series Object

    • X["row"]
    • X[0]
    • X[["row1","row3"]]
    • X[[0,2]]
    • X[X>50]
    • X["row1":"row3"]
    • X[0:2]

    Pandas Dataframe

    • X["col1"] -- will extract a column
    • X["col1":"col2"] - will extract a column
    • X[1:3] - row 1 to row=2 will be extracted --- CONFUSION ???

    Loc/At syntax

    • X.loc["row1"]
    • X.iloc[0]
    • X.loc["row1", ["col1", "col3"]]
    • X.iloc[0, [0,2]]
    • X[X>5]
    • X.loc[:"row3]
    • X.iloc[:3
    • X.loc[:"row3", "col1":]
    • X.iloc[:3, 2:]
    • X.at["row3", "col2"]   ---- single value is extracted through AT command
    • X.iat[3,2]
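    A small runnable sketch of label-based (loc/at) versus position-based (iloc/iat) access, on a made-up 3x3 frame:

      import pandas as pd

      X = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                       index=["row1", "row2", "row3"],
                       columns=["col1", "col2", "col3"])

      print(X.loc["row1"])                   # one row, by label
      print(X.iloc[0])                       # one row, by position
      print(X.loc["row1", ["col1", "col3"]])
      print(X.iloc[0, [0, 2]])
      print(X.loc[:"row3", "col1":])         # label slices include the end label
      print(X.iloc[:3, 2:])                  # position slices exclude the end position
      print(X.at["row3", "col2"])            # single scalar value, by label
      print(X.iat[2, 1])                     # same value, by position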

    Indexing in stages (chained indexing)

    • X.iloc[:3, :2].iloc[:, [0, 1]]  -- accessing in stages

    Arithmetic operations

    • X1 - X1frame ---- element-wise subtraction, aligned on row and column labels
    • X1.sub(X1frame)  --- same as X1 - X1frame
    • X1.rsub(X1frame, axis = "index") ---- same as X1frame - X1, broadcasting along the index
    • X1.add(X1frame, fill_value = 99) -- where a value is NaN in one operand, 99 is used instead

    Apply

    • X1.apply(lambda x: x.max()) -- will return max value from each column
    • X1.apply(lambda x:x.max(), axis="columns")  -- max value for each row is calculated.

    How to get min/max for each row

    • def minmax(x):
      • return pd.Series([x.min(), x.max()], index=["min", "max"])
    • X1.apply(minmax, axis = "columns")  -- will return min/max for each row.
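    The same idea as a self-contained runnable sketch (X1 is a made-up frame):

      import pandas as pd

      X1 = pd.DataFrame({"col1": [3, 7, 1], "col2": [9, 2, 5]})

      def minmax(x):
          # x is one row (axis="columns") or one column (default)
          return pd.Series([x.min(), x.max()], index=["min", "max"])

      print(X1.apply(lambda x: x.max()))                   # max of each column
      print(X1.apply(lambda x: x.max(), axis="columns"))   # max of each row
      print(X1.apply(minmax, axis="columns"))              # min and max for each row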

    Applymap

    • X.applymap(lambda x: x*10)  -- applymap works on each element of the dataframe; here every element is multiplied by 10

    Sorting

    • X.sort_index() -- sorting of row index
    • X.sort_index(axis=1) -- sorting of column index
    • X.sort_values() -- will sort based on values. -- for a series object
    • X.sort_values(by="col2") -- will be sorted based on col='col2' -- for dataframe object.
    • X.sort_values(by=["col2", "col1"])

    Rank

    • X.rank()  -- column wise ranking
    • X.rank(ascending=True)
    • X.rank(axis=1) --row wise ranking.

    Summarizing data

    • X.sum()
    • X.sum(axis=1)
    • X.describe()
    • X.corr()
    • X["col1"].corr(X["corr2"])
    • X.cov()
    • X["col1"].cov(X["corr2"])
    • Others ( sum, min,max, quantile,mean, median, kurt, skew, cumsum, cummax, cummin etc)



    Tuesday, 27 September 2022

    2. Data Analysis - Pandas Series

     Series Creation

    Series creation with default index
    • X = pd.Series([10,20,30,40])  -- by passing a list
    • X.index, X.values -- attributes, not methods
    • print(X[0], X[[0,2,3]], X[1:3]) --- access the series values
    Series creation with labeled index
    • X = pd.Series([10,20,30,40], index = ['l1', 'l2', 'l3', 'l4'])
    • X['l1'], X[['l1', 'l2']], X['l2':'l4']
    Series creation using dictionary
    • pd.Series(dict1) -- in same order as dictionary
    • pd.Series(dict1, index=['k2', 'k1'])  -- choose an order different from the dictionary's
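    A minimal runnable sketch of the creation patterns above:

      import pandas as pd

      X = pd.Series([10, 20, 30, 40])                            # default integer index 0..3
      Y = pd.Series([10, 20, 30, 40], index=["l1", "l2", "l3", "l4"])
      print(Y["l1"], list(Y[["l1", "l2"]]), list(Y["l2":"l4"]))  # label slices include the end label

      dict1 = {"k1": 1, "k2": 2}
      print(pd.Series(dict1))                                    # same order as the dictionary
      print(pd.Series(dict1, index=["k2", "k1"]))                # reorder via an explicit index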

    Series Filter

    • X[X<0]   -- returns all negative elements

    Mathematical operation

    • obj + obj2
      • labels that are not present in both series get NaN in the result

    check NULL value

    • X.isnull() -- returns boolean array
    • X.notnull() -- returns boolean array

    change the index

    • X.index = ['k1', 'k2']
    • X.index.name = "indexName"


    Monday, 26 September 2022

    1. Data Analysis - NumPy Operations

     Numpy Operations  

    Numpy Array Creation - 
    • X = np.array([10,20,30,40])
    • X = np.zeros(10)
    • X = np.ones(10)
    • X = np.empty(10)
    • X = np.arange(1, 11)  ---- will create a 10-element array with elements 1 to 10
    Array Creation using data type
    • X = np.array([10,20,30,40], dtype = np.float64)
    Changing data type
    • X = X.astype(np.int32)
    Arithmetic operation
    • X = np.array([[10,20,30,40],[50,60,70,80]])
    Indexing and slicing 
    • One dimension 
      • X[1] ----- Will return X[1]
      • X[2:4] ------ will return X[2], X[3]

    • Multiple dimension
      • X[0] ------- Will return 0th row
      • X[0][1] -- Will return 0th row and 1st column.
    Propagation and Broadcast
    • X[2:4] = 100   ---------- X[2], X[3] will be assigned 100.
    Row slicing & Column slicing using booleans
    • Y = np.array(["True", "True", "False", "False", "True"]),     
    • X[Y=="True"]      -- row slicing
    • X[:, Y="True"]     -- column slicing
    Manipulating all elements
    • X[X<0] = 0  --- any element less than 0 is assigned the value 0
    Fancy Indexing
    • Get rows
      • X[1]
      • X[[1,3,5]]

    • Get Columns
      • X[:, 1:4]
      • X[:, [1,4,5]] 

    • Get elements at paired row/column positions
      • X[[1,3,4], [2,4,5]] -- returns the three elements at positions (1,2), (3,4) and (4,5)
    Fancy indexing applied in stages (chained)
    • X[[1,4,5]][:, [0,1]]
    Reshape
    • X.reshape(8,4)
    • X.transpose()

    Universal Function

    • One input
      • np.sqrt(X)
      • np.exp(X)
    • Multiple Input
      • np.maximum(X,Y)
      • np.add(X,Y)
    • Finding NaN values
      • np.isnan(X)  -- output is a boolean array like [False, False, ..., True]; True indicates the corresponding value is NaN
    • Check whether any element is NaN
      • np.isnan(X).any()

    Vectorization

    • Mesh Grid
      • Xaxis, Yaxis = np.meshgrid(X,Y)
    • np.where
      • K = np.where(Z, X, Y)
      • K = np.where(X>0, Y ,Z )
        • np.where(X>99, X, 100) -> combination of a scalar and a numpy array
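    A runnable sketch of np.where with array and scalar branches (X, Y, Z are made-up arrays):

      import numpy as np

      X = np.array([5, -3, 120, 7])
      Y = np.array([1, 1, 1, 1])
      Z = np.array([0, 0, 0, 0])

      print(np.where(X > 0, Y, Z))       # 1 where X is positive, else 0
      print(np.where(X > 99, X, 100))    # keep X where it exceeds 99, else the scalar 100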

    Statistical Functions

    • np.mean(X) --- Mean of all element of the array.
    • np.std(X)
    • np.sum()
    • X.mean(axis=0) -- Mean of each column
    • X.mean(axis=1) -- Mean of each row.

    Boolean arrays

    • X.any() - IF any value is TRUE
    • X.sum() - Count of all TRUE value
    • X.all() - If all value is TRUE
    • X>0 -> returns a boolean array.

    Sorting

    • np.sort(X, axis = None) -- sort all element of array and return one dimensional array
    • np.sort(X, axis = 0) - Column wise sorting
    • np.sort(X, axis = 1 ) - Row wise sorting

    Remove duplicates

    • np.unique(X) --  removes the duplicates

    Check common values between two array

    • np.in1d(X,Y) -- returns a boolean array with True indicating a common value
    • np.union1d(X,Y) - returns union of X and Y

    Saving to a file

    Saving one array in one file

    • np.save("OneDimension", X)
    • np.load("OneDimension.npy")

    Save multiple array using key/value

    • np.savez("Onedimension", key1=X, key2=Y) -- saved as Onedimension.npz
    • k = np.load("Onedimension.npz")
    • k["key1"], k["key2"]
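    A runnable sketch of saving and reloading arrays (file names are placeholders):

      import numpy as np

      X = np.arange(5)
      Y = np.arange(10)

      np.save("OneDimension", X)                  # writes OneDimension.npy
      print(np.load("OneDimension.npy"))

      np.savez("Onedimension", key1=X, key2=Y)    # writes Onedimension.npz
      k = np.load("Onedimension.npz")
      print(k["key1"], k["key2"])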

    Other functionality

    • np.dot(X,Y)
    • k = inv(Y) ---- from numpy.linalg import inv





    Tuesday, 25 August 2020

    What is Python Programming Language ?

     Python is an open-source programming language that stresses simplicity, readability and brevity. It is free, which means you don't have to pay anything to use it. It is now being used in one form or another in many domains such as Data Analysis, Machine Learning, Gaming, GUI development, Network programming and so on, and this list is continuously growing.

    Do check the design philosophy of Python in the screenshot below (the output of 'import this' in Python). You will see words like 'Simple', 'Readability', 'easy to explain' and 'Explicit' that are the guiding force behind Python.

    Do check out my video on Python history and why we should use Python.

