Check Missing data
- X.isnull()
- X.notnull()
Delete Missing Data
- X.drop_na() -- delete any row that has missing data
- X.drop_na(how = all) - if all column in a row has missing data.
- X.drop_na(axis=1) - will delete a column that has missing data
- X.drop_na(axis=1, how=all) - will delete a column if all value in column has missing data.
Impute missing value
- X.fillna(99) - fill all missing value with same constant value.
- X.fillna({"col1":99, "col2":10}) -- 'col1' NA value will be imputed with 99 and 'col2' NA value will be imputed with 10.
- x.fillna(method = 'ffill') -- will be imputed with previous value in the same column.
- x.fillna(method = 'ffill', limit =2) -- In a column only 2 value will be imputed.
- X.fillna(X.mean()), -- imputing with statistical values.
- X.fillna(X.median())
Handling duplicate values
- X.duplicated() -- returns boolean array to check if two rows are duplicated.
- If two rows are duplicated then 2nd row will show as duplicate.
- X.drop_duplicates() -- removes the duplicate.. ( 2nd row will be removed )
- X.duplicated["col1"]
- X.drop_duplicates["col1"]
- X.drop_duplicates(["col1"], keep="last") -- all duplicates will be removed except the last one.
Data Transformation
MAPS - Group various "Floors" to Lower/Higher category
- mapFunction = {"Floor1":"Lower Floor", "Floor2":"Lower Floor", "Floor3":"Upper Floor", "Floor4":"Upper Floor"}
- X["Floor Group] = X["Floor Number"].map(mapFunction)
Replace
- X.replace(-999, np.NAN)
- X.replace([[-999, -1000],[np.NAN, 1000]]) --- replace -999 with np.NAN and -1000 with 1000
- X.replace({-999:np.NAA, -1000:1000}) --- replace -999 with np.NAN and -1000 with 1000 using dictionary
Update Row and Column names
- X.columns.map(lambda x:x.upper()) -- works with series object
- X.index.map(lambda x:x.upper) -- works for series object
- X.rename(index={'a':'A', 'b':'B'}, columns={'col1':'COL1', 'col2':'COL2'}) -- works with dataframe.
Changing to same object ( Inplace=True )
- X.rename(index={'a':'A', 'b':'B'}, columns={'col1':'COL1', 'col2':'COL2'}, Inplace=True)
Binning - CUT and QCUT function
- salary=[10,20,30,40,50,60]
- salarybins=[0,20,40,60]
- pd.cut(salary, salarybins) -- create bins 0-20, 20-40, 40-60
- pd.cut(salary, 8) -- 8 bins will be created.
- pd.qcut(salary, [0,0.1,0.5,0.9,1]) - no from 0 to 1 is paased and these number shows quartile info.
Identify outliers
- X.decscribe() -- and any value greater than max of 75 percentile can be an outlier.
Taking Samples
- data.sample(3) -- get 3 sample from the dataframe
- data.sample(8, replace=True) - will create samples with replacement = True
Creating dummy variables
- X1= pd.get_dummies(X["col1"], prefix = "dummy")
- X["col1"].join(X1) -- joining to original dataframe.