ref_are_duplicatesR Documentation

Help to identify if two HPs (pairwise) are real duplicates or not

Description

values coming from different records will be put one against another to facilitate the comparisons between possible duplicates. A fuzzy matching between these values is computed and stored in the column 'dist' to resume the information. The distance calculation uses the 'stringdist()' function from the 'stringdist' package. The lower is this value, the closer are the compared values (probable duplicates).

Usage

ref_are_duplicates(
  db.con = NA,
  d = NA,
  field = "are_duplicates",
  hp.list = c("EAMENA-0207209", "EAMENA-0182057"),
  selected.fields = c("Assessment Investigator - Actor", "Assessment Activity Date",
    "Resource Name"),
  geojson.path = paste0(system.file(package = "eamenaR"),
    "/extdata/caravanserail.geojson"),
  dist.method = "jw",
  round.dist = 2,
  export.table = FALSE,
  dirOut = paste0(system.file(package = "eamenaR"), "/results/"),
  fileOut = "duplicates.xlsx",
  verbose = TRUE
)

Arguments

db.con

the parameters for the DB, in a RPostgres::dbConnect() format. If NA (by default), will read a GeoJSON file.

d

a hash() object (a Python-like dictionary).

field

the field name that will be created in the a hash() object.

hp.list

a list with the HP IDs to compare. By default: 'c("EAMENA-0207209", "EAMENA-0182057")'

selected.fields

the list of fields that will be selected to compare their values between different potential duplicates. By default the GeoJSON geometry is also selected.

dist.method

'stringdist' method to calculate the pairwise distance (see the function documentation), by default '"jw"'

round.dist

an integer for the number of digit to preserve in the distance computing. By default: 2.

export.table

if TRUE will export the table of duplicates (FALSE by default).

dirOut

the folder where the outputs will be saved. By default: '/results'. If it doesn't exist, it will be created. Only useful is export.plot.g is TRUE.

fileOut

the output file name. It could be an XLSX or a CSV file. Only useful is export.plot.g is TRUE. By default "duplicates.xlsx".

verbose

if TRUE (by default), print messages.

Value

a matrix stored in hash() object. This matrix has the ResourceID of the compared HP in column. If export.table is set to TRUE it will also create an CSV or XLSX table with the potential duplicates, and the fuzzy matching value (column 'dist')

Examples


export as CSV in the default folder
d <- hash::hash()
d <- ref_are_duplicates(d = d,
                        export.table = T,
                        fileOut = "duplicates.csv")

export as XLSX in another folder
d <- ref_are_duplicates(d = d,
                        export.table = T,
                        fileOut = "test_duplicates.xlsx",
                        dirOut = "C:/Rprojects/eamenaR/inst/extdata/")