ref_are_duplicates | R Documentation |
values coming from different records will be put one against another to facilitate the comparisons between possible duplicates. A fuzzy matching between these values is computed and stored in the column 'dist' to resume the information. The distance calculation uses the 'stringdist()' function from the 'stringdist' package. The lower is this value, the closer are the compared values (probable duplicates).
ref_are_duplicates( db.con = NA, d = NA, field = "are_duplicates", hp.list = c("EAMENA-0207209", "EAMENA-0182057"), selected.fields = c("Assessment Investigator - Actor", "Assessment Activity Date", "Resource Name"), geojson.path = paste0(system.file(package = "eamenaR"), "/extdata/caravanserail.geojson"), dist.method = "jw", round.dist = 2, export.table = FALSE, dirOut = paste0(system.file(package = "eamenaR"), "/results/"), fileOut = "duplicates.xlsx", verbose = TRUE )
db.con |
the parameters for the DB, in a RPostgres::dbConnect() format. If NA (by default), will read a GeoJSON file. |
d |
a hash() object (a Python-like dictionary). |
field |
the field name that will be created in the a hash() object. |
hp.list |
a list with the HP IDs to compare. By default: 'c("EAMENA-0207209", "EAMENA-0182057")' |
selected.fields |
the list of fields that will be selected to compare their values between different potential duplicates. By default the GeoJSON geometry is also selected. |
dist.method |
'stringdist' method to calculate the pairwise distance (see the function documentation), by default '"jw"' |
round.dist |
an integer for the number of digit to preserve in the distance computing. By default: 2. |
export.table |
if TRUE will export the table of duplicates (FALSE by default). |
dirOut |
the folder where the outputs will be saved. By default: '/results'. If it doesn't exist, it will be created. Only useful is export.plot.g is TRUE. |
fileOut |
the output file name. It could be an XLSX or a CSV file. Only useful is export.plot.g is TRUE. By default "duplicates.xlsx". |
verbose |
if TRUE (by default), print messages. |
a matrix stored in hash() object. This matrix has the ResourceID of the compared HP in column. If export.table is set to TRUE it will also create an CSV or XLSX table with the potential duplicates, and the fuzzy matching value (column 'dist')
export as CSV in the default folder d <- hash::hash() d <- ref_are_duplicates(d = d, export.table = T, fileOut = "duplicates.csv") export as XLSX in another folder d <- ref_are_duplicates(d = d, export.table = T, fileOut = "test_duplicates.xlsx", dirOut = "C:/Rprojects/eamenaR/inst/extdata/")