Introduction
Dirty data problems: missing values, data manipulation, duplicates, forms of dates, outliers, spelling
Missing Values in R:
is.na() Function for Finding Missing values:
This function returns a logical vector of the same length as its input. Each element is TRUE
where the corresponding value is NA and FALSE otherwise. Note that is.na() also returns TRUE for
NaN values.
example:
x<- c(NA, 3, 4, NA, NA, NA)
is.na(x)
output:
[1] TRUE FALSE FALSE TRUE TRUE TRUE
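Once missing values have been located, is.na() is commonly combined with sum() or logical indexing to count or drop them. A minimal sketch, reusing the vector x from the example above (the names n_missing and x_clean are chosen here only for illustration):

```r
x <- c(NA, 3, 4, NA, NA, NA)

# Count the missing values (each TRUE counts as 1)
n_missing <- sum(is.na(x))
n_missing
# [1] 4

# Keep only the non-missing values
x_clean <- x[!is.na(x)]
x_clean
# [1] 3 4

# Many summary functions can skip NA values through na.rm = TRUE
mean(x, na.rm = TRUE)
# [1] 3.5
```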
is.nan() Function for Finding NaN values:
This function returns a logical vector of the same length as its input. Each element is TRUE
where the corresponding value is NaN (Not a Number, for example the result of 0 / 0) and FALSE
otherwise. Unlike is.na(), it returns FALSE for NA.
example:
x<- c(NA, 3, 4, NA, NA, 0 / 0, 0 / 0)
is.nan(x)
output:
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE
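The two functions overlap but are not interchangeable: is.na() flags both NA and NaN, while is.nan() flags only NaN. A short sketch of the difference (the names na_flags, nan_flags and only_na are illustrative):

```r
x <- c(NA, 3, 0 / 0)

na_flags  <- is.na(x)   # TRUE for both NA and NaN
nan_flags <- is.nan(x)  # TRUE only for NaN

na_flags
# [1]  TRUE FALSE  TRUE
nan_flags
# [1] FALSE FALSE  TRUE

# Elements that are NA but not NaN
only_na <- is.na(x) & !is.nan(x)
only_na
# [1]  TRUE FALSE FALSE
```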
DUPLICATE VALUES IN R:
1.duplicated():
The R function duplicated() returns a logical vector in which TRUE marks the elements of a
vector (or the rows of a data frame) that repeat an earlier element. The first occurrence of each
value is marked FALSE.
example:
x <- c(1, 1, 4, 5, 4, 6)
duplicated(x)
output:
[1] FALSE TRUE FALSE FALSE TRUE FALSE
Extract duplicate elements:
x <- c(1, 1, 4, 5, 4, 6)
x[duplicated(x)]
output:
[1] 1 4
REMOVING DUPLICATES IN R:
To remove duplicated elements, use !duplicated(), where ! is logical negation. This keeps the
first occurrence of each value.
example:
x <- c(1, 1, 4, 5, 4, 6)
x[!duplicated(x)]
output:
[1] 1 4 5 6
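The same idea applies to data frames, where duplicated() compares whole rows. The sketch below uses a small made-up data frame (df and its columns are hypothetical, for illustration only) and also shows unique(), which for a vector gives the same result as x[!duplicated(x)]:

```r
# Hypothetical data frame used only for illustration
df <- data.frame(
  id   = c(1, 2, 2, 3),
  name = c("a", "b", "b", "c")
)

# Rows that repeat an earlier row
dup_rows <- duplicated(df)
dup_rows
# [1] FALSE FALSE  TRUE FALSE

# Drop the repeated rows, keeping the first occurrence
df_unique <- df[!duplicated(df), ]
nrow(df_unique)
# [1] 3

# For a vector, unique() matches x[!duplicated(x)]
x <- c(1, 1, 4, 5, 4, 6)
unique(x)
# [1] 1 4 5 6
```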
FORMS OF DATES IN R:
1) Sys.Date():
The Sys.Date() function returns the current system date.
syntax:
Sys.Date()
output:
[1] "2022-04-20"
2) Sys.timezone():
The Sys.timezone() function returns the name of the time zone that the system running the code
is set to.
syntax:
Sys.timezone()
output:
[1] "Asia/Calcutta"
3) Sys.time():
The Sys.time() function returns the current date and time of the system, including the time
zone.
syntax:
Sys.time()
output:
[1] "2022-04-20 11:02:56 IST"
4) as.Date():
The as.Date() function creates a date value (without a time component). It accepts various input
formats through the format = argument; without it, strings in the form "YYYY-MM-DD" or
"YYYY/MM/DD" are expected. Note the capital D: R is case-sensitive.
example:
mydate <- as.Date("2014-04-30")
mydate
output:
[1] "2014-04-30"
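When the input is not in the default year-month-day layout, the format = argument describes how the string is laid out. A minimal sketch, assuming a day/month/year input string (d1 and d2 are illustrative names):

```r
# Parse a day/month/year string by describing its layout with format =
d1 <- as.Date("30/04/2014", format = "%d/%m/%Y")
d1
# [1] "2014-04-30"

# Convert a Date back to text in another numeric layout
format(d1, "%d-%m-%Y")
# [1] "30-04-2014"

# Dates support arithmetic: number of days between two dates
d2 <- as.Date("2014-05-10")
as.numeric(d2 - d1)
# [1] 10
```

Here %d is the day of the month, %m the month number and %Y the four-digit year.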