Title: | Familiarity with Dataframes Before Data Manipulation |
---|---|
Description: | Real-life data is muddy, fuzzy, and unpredictable. This makes data manipulation tedious, and getting the data into the right shape is a major chunk of the work on its own. The functions in this package help us understand dataframes and so dramatically reduce data coding time. |
Authors: | Ashrith Reddy [aut, cre] |
Maintainer: | Ashrith Reddy <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.2 |
Built: | 2025-02-17 03:14:00 UTC |
Source: | https://github.com/cran/dataframeexplorer |
It occasionally happens that a column in a dataframe contains a single value throughout. This can lead to redundant computational cost and unexpected behavior in machine learning methods. This function scans through all columns of the dataframe and reports any column that has no variation.
detect_const_cols(dataset, return_type = "col_names", ignore_na = F)
dataset |
A data.frame |
return_type |
How to return the detected constant columns. Use "col_names", "col_positions", or "dataset" (returns the dataset with the constant columns deleted) |
ignore_na |
Whether NA should be ignored while checking if a column has just 1 unique value |
A vector of constant column names or column positions, or the dataset with the constant columns deleted, depending on the return_type parameter.
## Not run:
library(dplyr)  # provides mutate()
detect_const_cols(dataset = head(mutate(mtcars, mpg_2 = 999)))
## End(Not run)
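The check described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper name `find_constant_columns` and the toy data frame are assumptions made for the example, and only the "col_names" style of return value is shown.

```r
# Hypothetical helper (not part of dataframeexplorer): returns the names of
# columns that hold at most one distinct value.
find_constant_columns <- function(df, ignore_na = FALSE) {
  is_constant <- vapply(df, function(column) {
    # Optionally drop NA before counting distinct values
    values <- if (ignore_na) column[!is.na(column)] else column
    length(unique(values)) <= 1
  }, logical(1))
  names(df)[is_constant]
}

# Toy data: `source` is constant; `score` is constant only if NA is ignored
toy <- data.frame(id = 1:4, source = "web", score = c(NA, 5, 5, 5))

strict  <- find_constant_columns(toy)
lenient <- find_constant_columns(toy, ignore_na = TRUE)
print(strict)
print(lenient)
```

With `ignore_na = FALSE`, NA counts as a value of its own, so a column that mixes NA with one repeated value is not reported.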
It occasionally happens that 2 (or more) columns in a dataframe are exactly identical. This can lead to redundant computational cost and unexpected behavior in machine learning methods. This function scans through all column combinations of the dataframe and reports any 2 columns that are exactly identical.
detect_dupl_cols(dataset, return_type = "col_names", duplicate_col = "right")
dataset |
A data.frame |
return_type |
How to return the detected duplicate columns. Use "col_names", "col_positions", or "dataset" (returns the dataset with the duplicate columns deleted) |
duplicate_col |
If 2 columns are identical, which of the 2 columns should be treated as duplicate? Use "right" for right column, "left" for left. |
A vector of duplicate column names or column positions, or the dataset with the duplicate columns deleted, depending on the return_type parameter.
## Not run:
library(dplyr)  # provides mutate()
detect_dupl_cols(dataset = head(mutate(mtcars, mpg_2 = mpg)), duplicate_col = "right")
## End(Not run)
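To make the "left" versus "right" choice concrete, here is a minimal base R sketch. It is not the package's implementation: the helper name `find_duplicate_columns` is an assumption, and it relies on `duplicated()` over the list of columns rather than an explicit scan of column pairs.

```r
# Hypothetical helper (not part of dataframeexplorer): flags columns whose
# contents repeat another column.
find_duplicate_columns <- function(df, duplicate_col = c("right", "left")) {
  duplicate_col <- match.arg(duplicate_col)
  # "right" flags the later copy; "left" flags the earlier one (fromLast = TRUE)
  is_duplicate <- duplicated(as.list(df), fromLast = duplicate_col == "left")
  names(df)[is_duplicate]
}

# Toy data: `a_copy` repeats `a`
toy <- data.frame(a = c(1, 2, 3), b = c("x", "y", "z"), a_copy = c(1, 2, 3))

right_side <- find_duplicate_columns(toy, "right")
left_side  <- find_duplicate_columns(toy, "left")
print(right_side)
print(left_side)
```

Flagging the right-hand copy keeps the first occurrence of each column, which is usually the one carrying the original name.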
Real-life data is seldom perfect, and fields in a data.frame often contain entries not anticipated by the data scientist. This function helps you get to know the entries in your data before performing any manipulation on it. It generates an Excel workbook of frequency tables, with each column of the input dataframe in a separate sheet. Warning: An Excel sheet supports only 2^20 rows of data (approx. 1 million). If the number of unique entries in a column exceeds that, the lowest-frequency entries are dropped from the output.
frequency_table(
  dataset,
  output_filename = "",
  maximum_entries = 2^20,
  format_width = TRUE,
  sl_no_required = TRUE,
  frequency_required = TRUE,
  percentage_required = TRUE,
  cumulative_percentage_required = FALSE,
  string_length_required = TRUE
)
dataset |
A data.frame |
output_filename |
Name of the output Excel file (should end in ".xlsx"). Passing this parameter is strongly advised; otherwise the function's default is "frequency_table_<system_time>.xlsx" |
maximum_entries |
Maximum number of unique entries in the output. For example, setting this parameter to 10000 returns only the 10000 most frequent entries in each column |
format_width |
Boolean input indicating whether the column widths in the output Excel file should be set to "auto" |
sl_no_required |
Boolean input indicating if Sl_No column needs to be present in output excel |
frequency_required |
Boolean input indicating if Frequency column needs to be present in output excel |
percentage_required |
Boolean input indicating if Percentage column needs to be present in output excel |
cumulative_percentage_required |
Boolean input indicating if Cumulative_Percentage column needs to be present in output excel |
string_length_required |
Boolean input indicating if String_Length column needs to be present in output excel |
Does not return a value to the calling function; writes the output to the file system instead.
## Not run:
frequency_table(dataset = iris, output_filename = "frequency_table_iris.xlsx")
frequency_table(dataset = mtcars, output_filename = "C:/Users/Desktop/frequency_table_mtcars.xlsx")
## End(Not run)
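The per-column table that ends up in each sheet can be approximated in base R. This is a rough sketch under stated assumptions, not the package's code: the helper name `column_frequency` and the `Entry` column name are made up for the example, the other column names follow the parameter descriptions above, and nothing is written to Excel.

```r
# Hypothetical helper (not part of dataframeexplorer): frequency table for a
# single column, most frequent entries first.
column_frequency <- function(values, maximum_entries = 2^20) {
  counts <- sort(table(values, useNA = "ifany"), decreasing = TRUE)
  # Keep only the most frequent entries, mirroring maximum_entries
  counts <- counts[seq_len(min(length(counts), maximum_entries))]
  frequency <- as.integer(counts)
  data.frame(
    Sl_No = seq_along(frequency),
    Entry = names(counts),
    Frequency = frequency,
    Percentage = round(100 * frequency / length(values), 2),
    stringsAsFactors = FALSE
  )
}

freq <- column_frequency(c("a", "b", "a", "c", "a", "b"))
print(freq)

# Truncating to the two most frequent entries drops "c"
top_two <- column_frequency(c("a", "b", "a", "c", "a", "b"), maximum_entries = 2)
print(top_two)
```

Percentages are computed against the full column length, so they no longer sum to 100 once low-frequency entries are truncated.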
Getting a glimpse of the dataset comes in handy when writing data manipulation code. This function generates a glimpse of a data.frame (similar to str()) using tibble::glimpse and writes it to a text file. Using the same file name for different datasets appends the outputs to the same file.
glimpse_to_file(dataset, output_filename = "")
dataset |
A data.frame object |
output_filename |
Name of the output text file (preferably ending in ".txt"; the function appends the extension if it is missing). The function's default is "glimpse_<system_time>.txt" |
Does not return a value; writes the output to disk instead.
## Not run:
glimpse_to_file(dataset = mtcars, output_filename = "glimpse_mtcars.txt")
glimpse_to_file(dataset = iris, output_filename = "C:/Users/Desktop/glimpse_iris.txt")
## End(Not run)
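The append-to-file behavior can be illustrated with base R alone. This sketch swaps in utils::str() for tibble::glimpse so that it runs without extra packages; the helper name `write_structure` is an assumption, and the output format will differ from what glimpse_to_file produces.

```r
# Hypothetical helper (not part of dataframeexplorer): captures the structure
# summary of a data frame and appends it to a text file.
write_structure <- function(df, output_filename) {
  summary_lines <- utils::capture.output(utils::str(df))
  # append = TRUE keeps earlier summaries already stored in the file
  cat(summary_lines, file = output_filename, sep = "\n", append = TRUE)
  invisible(NULL)
}

out_file <- tempfile(fileext = ".txt")
write_structure(mtcars, out_file)
write_structure(iris, out_file)  # second call appends to the same file

written <- readLines(out_file)
print(head(written, 3))
```

Because calls accumulate in one file, a fresh file name per run avoids mixing summaries from different sessions.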
Knowing the level of a dataset (the set of columns that uniquely identifies each record) is paramount to manipulating data effectively and efficiently, yet the level is often unknown. This function counts the unique records in all possible column combinations to determine the level of the dataset. See the generated text file for the column combinations that have unique records.
level_of_data(dataset, output_filename = "", verbose = TRUE)
dataset |
A data.frame |
output_filename |
Name of the output text file (should end in ".txt"; the function appends the extension if it is missing). The function's default is "level_of_dataset_<system_time>.txt" |
verbose |
Pass TRUE for detailed output |
Does not return a value to the calling function; writes the output to the file system instead.
## Not run:
# mpg, cyl, disp and hp are columns of mtcars, not iris
level_of_data(dataset = mtcars[, c("mpg", "cyl", "disp", "hp")], output_filename = "level_mtcars.txt")
## End(Not run)
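The search over column combinations can be sketched in base R as follows. This is an illustration under assumptions, not the package's implementation: the helper name `find_key_combinations` and the toy `orders` data are invented, and the result is returned as a list instead of being written to a text file.

```r
# Hypothetical helper (not part of dataframeexplorer): returns every column
# combination whose rows are all distinct.
find_key_combinations <- function(df) {
  column_names <- names(df)
  keys <- list()
  for (size in seq_along(column_names)) {
    for (combination in utils::combn(column_names, size, simplify = FALSE)) {
      # anyDuplicated() returns 0 when no row repeats within these columns
      if (anyDuplicated(df[, combination, drop = FALSE]) == 0) {
        keys[[length(keys) + 1]] <- combination
      }
    }
  }
  keys
}

# Toy data: each order_id/item pair occurs once, so that pair is the level
orders <- data.frame(
  order_id = c(1, 1, 2, 2),
  item = c("a", "b", "a", "b"),
  qty = c(1, 1, 1, 3)
)

keys <- find_key_combinations(orders)
print(keys)
```

A dataframe with n columns has 2^n - 1 non-empty column combinations, so an exhaustive search like this one is only practical on a modest number of columns; selecting candidate columns first, as the example above does with mtcars, keeps it manageable.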
This function generates percentiles for all numeric columns in the dataframe. This comes in handy for understanding the distribution of the data and for outlier treatment.
percentiles_table(
  dataset,
  output_filename = "",
  percentiles = c(0:10, seq(10, 90, 10), seq(25, 75, 25), 91:100),
  format_width = TRUE,
  sd_required = TRUE,
  min_required = TRUE,
  max_required = TRUE,
  mean_required = TRUE,
  missing_percentage_required = TRUE,
  class_required = TRUE
)
dataset |
A data.frame |
output_filename |
Name of the output excel file (should end in ".xlsx") Strongly advised to pass this parameter, else the function's default is "percentiles_table_<system_time>.xlsx" |
percentiles |
Numeric vector of percentiles to compute, with values in [0, 100] |
format_width |
Boolean input indicating whether the column widths in the output Excel file should be set to "auto" |
sd_required |
Boolean input indicating if standard deviation column needs to be present in output excel |
min_required |
Boolean input indicating if minimum column needs to be present in output excel |
max_required |
Boolean input indicating if maximum column needs to be present in output excel |
mean_required |
Boolean input indicating if mean column needs to be present in output excel |
missing_percentage_required |
Boolean input indicating if missing percentage column needs to be present in output excel |
class_required |
Boolean input indicating if datatype column should be the last column in output excel |
Does not return a value to the calling function; writes the output to the file system instead.
## Not run:
percentiles_table(mtcars, output_filename = "percentiles_table_mtcars.xlsx")
percentiles_table(iris, output_filename = "C:/Users/Desktop/percentiles_table_iris.xlsx")
## End(Not run)
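The core computation can be approximated with stats::quantile. This is a minimal sketch, not the package's code: the helper name `numeric_percentiles`, the `p<percentile>` column labels, and the choice to drop NA values are all assumptions, and the result is a matrix instead of an Excel file.

```r
# Hypothetical helper (not part of dataframeexplorer): one row per numeric
# column, one column per requested percentile.
numeric_percentiles <- function(df, percentiles = c(0, 25, 50, 75, 100)) {
  numeric_columns <- df[vapply(df, is.numeric, logical(1))]
  # quantile() expects probabilities in [0, 1], so rescale the percentiles
  result <- t(vapply(
    numeric_columns,
    stats::quantile,
    numeric(length(percentiles)),
    probs = percentiles / 100,
    na.rm = TRUE,
    names = FALSE
  ))
  colnames(result) <- paste0("p", percentiles)
  result
}

# Toy data: `label` is skipped because it is not numeric
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  label = letters[1:5],
  y = c(10, 20, 30, 40, NA)
)

pct <- numeric_percentiles(toy)
print(pct)
```

quantile() uses its default interpolation method (type 7) here; a different `type` can shift the values slightly on small samples.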