5. Data Analysis - Data Preparation ~ Python and Machine Learning Blog

Thursday, 29 September 2022

5. Data Analysis - Data Preparation

September 29, 2022 No comments

Check Missing data

X.isnull()
X.notnull()

Delete Missing Data

X.drop_na() -- delete any row that has missing data
X.drop_na(how = all) - if all column in a row has missing data.
X.drop_na(axis=1) - will delete a column that has missing data
X.drop_na(axis=1, how=all) - will delete a column if all value in column has missing data.

Impute missing value

X.fillna(99) - fill all missing value with same constant value.
X.fillna({"col1":99, "col2":10}) -- 'col1' NA value will be imputed with 99 and 'col2' NA value will be imputed with 10.
x.fillna(method = 'ffill') -- will be imputed with previous value in the same column.
x.fillna(method = 'ffill', limit =2) -- In a column only 2 value will be imputed.
X.fillna(X.mean()), -- imputing with statistical values.
X.fillna(X.median())

Handling duplicate values

X.duplicated() -- returns boolean array to check if two rows are duplicated.

If two rows are duplicated then 2nd row will show as duplicate.

X.drop_duplicates() -- removes the duplicate.. ( 2nd row will be removed )
X.duplicated["col1"]
X.drop_duplicates["col1"]
X.drop_duplicates(["col1"], keep="last") -- all duplicates will be removed except the last one.

Data Transformation

MAPS - Group various "Floors" to Lower/Higher category

mapFunction = {"Floor1":"Lower Floor", "Floor2":"Lower Floor", "Floor3":"Upper Floor", "Floor4":"Upper Floor"}
X["Floor Group] = X["Floor Number"].map(mapFunction)

Replace

X.replace(-999, np.NAN)
X.replace([[-999, -1000],[np.NAN, 1000]]) --- replace -999 with np.NAN and -1000 with 1000
X.replace({-999:np.NAA, -1000:1000}) --- replace -999 with np.NAN and -1000 with 1000 using dictionary

Update Row and Column names

X.columns.map(lambda x:x.upper()) -- works with series object
X.index.map(lambda x:x.upper) -- works for series object
X.rename(index={'a':'A', 'b':'B'}, columns={'col1':'COL1', 'col2':'COL2'}) -- works with dataframe.

Changing to same object ( Inplace=True )

X.rename(index={'a':'A', 'b':'B'}, columns={'col1':'COL1', 'col2':'COL2'}, Inplace=True)

Binning - CUT and QCUT function

salary=[10,20,30,40,50,60]
salarybins=[0,20,40,60]
pd.cut(salary, salarybins) -- create bins 0-20, 20-40, 40-60
pd.cut(salary, 8) -- 8 bins will be created.
pd.qcut(salary, [0,0.1,0.5,0.9,1]) - no from 0 to 1 is paased and these number shows quartile info.

Identify outliers

X.decscribe() -- and any value greater than max of 75 percentile can be an outlier.

Taking Samples

data.sample(3) -- get 3 sample from the dataframe
data.sample(8, replace=True) - will create samples with replacement = True

Creating dummy variables

X1= pd.get_dummies(X["col1"], prefix = "dummy")
X["col1"].join(X1) -- joining to original dataframe.

Share:

0 comments:

Post a Comment