people in the EDH dataset

Install and load a version of the "sdam" package.
install.packages("sdam") # from CRAN
devtools::install_github("sdam-au/sdam") # development version
devtools::install_github("mplex/cedhar", subdir="pkg/sdam") # legacy version R 3.6.x
# load the package and check the installed version
library("sdam")
packageVersion("sdam")
[1] '1.1.4'
EDH is a dataset in "sdam" that contains
the texts of Latin and Latin-Greek inscriptions of the Roman Empire,
retrieved from the Epigraphic Database Heidelberg API repository
through the routines get.edh() and get.edhw().

Since 2022, the API repository has not supported people variables,
and the EDH dataset serves as an alternative for the analysis of
people-related inscriptions.
One challenge with people variables in EDH is that some
records contain Greek and Latin Extended characters that need
re-encoding to render and display properly.
people in EDH

Ancient inscriptions in some Roman provinces are written with Greek
characters and, due to encoding and decoding steps during the
extraction, loading, and transformation of the data (perhaps UTF-8
bytes treated as Windows-1252?), Greek and some Latin characters are not
displayed properly in the current version of the EDH
dataset. Most of the encoding issues are in variables related to people;
some examples with inscriptions from Roman provinces follow.
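The suspected mechanism (UTF-8 bytes read as Windows-1252) can be reproduced and reversed in base R. The following is a minimal sketch under that assumption only: it does not show how cln() works internally, the sample name is an illustrative value rather than a record from EDH, and the encoding name "CP1252" is assumed to be known to the local iconv implementation.

```r
# Illustrative Greek name written with escapes so the script stays ASCII
# (theta, epsilon, omicron with tonos, delta, omicron, tau, omicron, final sigma)
greek <- "\u0398\u03b5\u03cc\u03b4\u03bf\u03c4\u03bf\u03c2"

# Simulate the suspected fault: read the UTF-8 bytes as if they were
# Windows-1252 and convert that reading to UTF-8 (this yields mojibake)
garbled <- iconv(greek, from = "CP1252", to = "UTF-8")

# Undo it: convert back to Windows-1252 to recover the original bytes,
# then declare those bytes to be UTF-8
repaired <- iconv(garbled, from = "UTF-8", to = "CP1252")
Encoding(repaired) <- "UTF-8"

# Each Greek letter occupies two bytes, so the garbled string has twice
# as many characters as the original
nchar(greek)
nchar(garbled)

# The repaired string has the same bytes as the original
identical(charToRaw(repaired), charToRaw(greek))
```

The round trip only works when every byte has a Windows-1252 mapping; bytes such as 0x81 or 0x9D are undefined in that code page, so some letters cannot be recovered this way.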
The Roman province of Achaia in the EDH
dataset has inscriptions related to people.
Roman province of Achaia (ca 117 AD).
Function edhw() obtains the available inscriptions
per province in the EDH dataset. The result is a list that is then
passed to the same function to extract the people variables
cognomen and nomen. In this case, the
'province' argument is Ach, which stands for
Achaia.
# select two people variables from Achaia
Ach <- edhw(province="Ach") |>
  edhw(vars="people", select=c("cognomen","nomen"))
Error in `is.null(unlist(w)) == FALSE && is.na(unlist(w)) == FALSE`:
! 'length = 17' in coercion to 'logical(1)'
There are 1539 records with people in Ach, which
corresponds to the number of rows in this data frame.
However, some records either have missing data or are inscriptions where cognomen and nomen are not available.
# also remove NAs
Ach <- edhw(province="Ach") |>
  edhw(vars="people", select=c("cognomen","nomen"), na.rm=TRUE)
nrow(Ach)
Error in `is.null(unlist(w)) == FALSE && is.na(unlist(w)) == FALSE`:
! 'length = 17' in coercion to 'logical(1)'
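The text does not say whether removing NAs drops records where any of the selected people variables is missing or only those where all of them are. As a rough base R sketch of the two readings, using a small hypothetical data frame (the records and the names people, vars, drop_all, and drop_any are illustrative and not part of "sdam"):

```r
# Hypothetical people records; NA marks an unavailable value
people <- data.frame(
  id       = c("r1", "r2", "r3", "r4"),
  cognomen = c("Rufus", NA, NA, "Felix"),
  nomen    = c("Iulius", "Claudius", NA, NA),
  stringsAsFactors = FALSE
)
vars <- c("cognomen", "nomen")

# Number of missing values per record among the selected variables
n_missing <- rowSums(is.na(people[vars]))

# Reading 1: drop records where every selected variable is missing
drop_all <- people[n_missing < length(vars), ]

# Reading 2: drop records where any selected variable is missing
drop_any <- people[n_missing == 0, ]

nrow(drop_all)
nrow(drop_any)
```

Comparing both row counts against the unfiltered data frame is a quick way to check which behaviour a given filter has.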
Working with people attribute variables often requires
re-encoding, which is one option in function cln(). For
instance, the values of cognomen in the first entries of
Ach are likely in Greek.
Function cln() serves to re-encode Greek and Latin
characters to render Greek, Greek extended, and Latin extended
glyphs.
The same applies to cognomen in the last people entries in
Achaia.
After re-encoding the last records in Ach with
cln(), it is easier to see, for example, that some share an
identical cognomen, and that entries shown as <NA>
in the input become NA.
In the case of the province of Aegyptus, three people variables mix Greek and Latin scripts and need re-encoding as well.
Roman province of Aegyptus (ca 117 AD).
Error in `is.null(unlist(w)) == FALSE && is.na(unlist(w)) == FALSE`:
! 'length = 4' in coercion to 'logical(1)'
For people in Aegyptus, columns three, five, and six
correspond to cognomen, name, and nomen,
and the output from cln() in the console is a
data frame.
Some entries in Aeg have Greek extended characters, and
one entry in Latin has a special character at the end
(Sulpicius*), which can be omitted for further computations
by raising the cleaning level to 2.
The benefits of re-encoding and cleaning text from the EDH
dataset are evident, for example, when counting occurrences in the
different attribute variables, as with nomen in Aeg.
By raising the cleaning level to 2, all special
characters are removed from the end of each entry, and it is possible
to see that, in the Roman province of Aegyptus, Sempronius,
Sentius, and Valerius are the three most common
nomen in inscriptions, with four occurrences each.
# raise cleaning level and remove NAs
Aeg$nomen |>
  cln(level=2, na.rm=TRUE) |>
  table() |>
  sort(decreasing=TRUE)
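Why trailing markers distort counts can be sketched in base R without the package. The vector below is hypothetical and only mimics the Sulpicius* case; the regular expression is a simple stand-in for the cleaning step and is not meant to reproduce what cln() does at level 2.

```r
# Hypothetical nomen values: one with a trailing marker, one missing
nomen <- c("Valerius", "Sulpicius*", "Sulpicius", NA, "Valerius", "Sentius")

# Counting raw values treats "Sulpicius*" and "Sulpicius" as two names
raw_counts <- table(nomen)

# Drop NAs, strip trailing non-letter characters, then count and sort
cleaned <- sub("[^[:alpha:]]+$", "", nomen[!is.na(nomen)])
clean_counts <- sort(table(cleaned), decreasing = TRUE)

raw_counts
clean_counts
```

After cleaning, both spellings of Sulpicius fall into the same category, so the frequency table has one category fewer and the tally for that name is no longer split.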
See the Warnings section in the manual.