An R package for reading environmental data from raw formats into data.frames.
This project was initiated at the inaugural IMCR Hackathon hosted by the Environmental Data Institute. The end product of this effort will be an R package on CRAN. The package will primarily deal with reading data from files, though there will be some utilities for initial cleanup of files, such as removing blank rows and columns at the end of a CSV file. Our work at the hackathon focused on package infrastructure, standardization, and template construction.
The guiding principles of ingestr are that
You can install ingestr from GitHub with:
# install.packages("devtools")
devtools::install_github("jpshanno/ingestr")
# or from source
ingestr_source <- file.path(tempdir(), "ingestr-master.zip")
download.file("https://github.com/jpshanno/ingestr/archive/master.zip",
              ingestr_source)
unzip(ingestr_source,
      exdir = dirname(ingestr_source))
# Strip the literal ".zip" extension to get the unzipped source directory
install.packages(sub("\\.zip$", "", ingestr_source),
                 repos = NULL,
                 type = "source")
Each ingestr function that reads in data starts with ingest_ to make autocomplete easier. We are targeting any source of environmental data that returns data in a standard format: native sensor files, delimited outputs, HTML tables, PDF tables, Excel sheets, API returns, …
Running any ingest function will read in the data and format the data into a clean R data.frame. Column names are taken directly from the data file, and users have the option to read the header information into a temporary file that can then be loaded using ingest_header().
All data and header data that are read in will have the data source appended to the data as a column called input_source.
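The idea behind the input_source column can be illustrated with a minimal base R sketch. The helper read_with_source() and the example file below are hypothetical and are not part of ingestr; they only show how a provenance column might be appended when a file is read.

```r
# Hypothetical helper (not an ingestr function): read a delimited file and
# record where each row came from in an input_source column.
read_with_source <- function(input.source, ...) {
  data <- utils::read.csv(input.source, stringsAsFactors = FALSE, ...)
  # rep() also works for files that contain no data rows
  data$input_source <- rep(input.source, nrow(data))
  data
}

# Write a small example file so the sketch runs on its own
example_file <- file.path(tempdir(), "example_sensor.csv")
write.csv(data.frame(timestamp = c("2017-06-01 00:00", "2017-06-01 00:30"),
                     temperature = c(12.1, 11.8)),
          example_file,
          row.names = FALSE)

example_data <- read_with_source(example_file)
str(example_data)
```

Because the source travels with every row, data from several files can be combined later without losing track of where each observation originated.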
Many sensors provide their output as delimited files with header information contained above the recorded data.
campbell_file <-
  system.file("example_data",
              "campbell_scientific_tao5.dat",
              package = "ingestr")
campbell_data <-
  ingest_campbell(input.source = campbell_file,
                  export.header = TRUE,
                  add.units = TRUE,
                  add.measurements = TRUE)
str(campbell_data)
campbell_header <-
  ingest_header(input.source = campbell_file)
str(campbell_header)
Some environmental data are published online as HTML elements. These data can be difficult to read into R directly from the websites where they are hosted. To facilitate access, we have created functions that parse the HTML so that these data can be downloaded directly in R. To track the provenance of these data, the column input_source is populated with the URL from which the data were downloaded.
PDO_data <-
  ingest_PDO(input.source = "http://jisao.washington.edu/pdo/PDO.latest",
             end.year = NULL,
             export.header = TRUE)
#> Header info for http://jisao.washington.edu/pdo/PDO.latest has been saved to a temporary file. Run ingest_header('http://jisao.washington.edu/pdo/PDO.latest') to load the header data.
str(PDO_data)
#> 'data.frame': 121 obs. of 14 variables:
#> $ YEAR : chr "1901" "1902" "1903" "1904" ...
#> $ JAN : chr "0.79" "0.82" "0.86" "0.63" ...
#> $ FEB : chr "-0.12" "1.58" "-0.24" "-0.91" ...
#> $ MAR : chr "0.35" "0.48" "-0.22" "-0.71" ...
#> $ APR : chr "0.61" "1.37" "-0.50" "-0.07" ...
#> $ MAY : chr "-0.42" "1.09" "0.43" "-0.22" ...
#> $ JUN : chr "-0.05" "0.52" "0.23" "-1.53" ...
#> $ JUL : chr "-0.60" "1.58" "0.40" "-1.58" ...
#> $ AUG : chr "-1.20" "1.57" "1.01" "-0.64" ...
#> $ SEP : chr "-0.33" "0.44" "-0.24" "0.06" ...
#> $ OCT : chr "0.16" "0.70" "0.18" "0.43" ...
#> $ NOV : chr "-0.60" "0.16" "0.08" "1.45" ...
#> $ DEC : chr "-0.14" "-1.10" "-0.03" "0.06" ...
#> $ input_source: chr "http://jisao.washington.edu/pdo/PDO.latest" "http://jisao.washington.edu/pdo/PDO.latest" "http://jisao.washington.edu/pdo/PDO.latest" "http://jisao.washington.edu/pdo/PDO.latest" ...
PDO_header <-
  ingest_header(input.source = "http://jisao.washington.edu/pdo/PDO.latest")
#> Header data was loaded from cached results created when http://jisao.washington.edu/pdo/PDO.latest was ingested previously in this R session.
str(PDO_header)
#> 'data.frame': 1 obs. of 2 variables:
#> $ header_text : chr "PDO INDEX If the columns of the table appear without formatting on your browser, use http://research.jisao.wash"| __truncated__
#> $ input_source: chr "http://jisao.washington.edu/pdo/PDO.latest"
Sensor data stored in folders is available for batch import using ingest_ functions, or any other function that reads in data (e.g. read.csv, readr::read_csv). When a directory is read in, file names are checked for duplicates, and imported data is checked for duplicate file contents. The user is warned and can choose to suppress the warning or remove the duplicates. Parallel processing is supported for large batches (requires the parallel package).
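The batch workflow described above might look roughly like the following base R sketch. The function read_directory() and its arguments are hypothetical and do not reflect ingestr's actual implementation; in particular, comparing MD5 checksums with tools::md5sum() is an assumed stand-in for however the package detects duplicate contents.

```r
# Hypothetical helper (not ingestr's implementation): read every matching file
# in a directory with a user-supplied reader and warn about duplicates.
read_directory <- function(directory,
                           read.fun = utils::read.csv,
                           pattern = "\\.csv$") {
  files <- list.files(directory,
                      pattern = pattern,
                      full.names = TRUE,
                      recursive = TRUE)

  # Duplicate file names can occur when subfolders are read recursively
  duplicate_names <- duplicated(basename(files))
  if (any(duplicate_names)) {
    warning("Duplicate file names: ",
            paste(unique(basename(files)[duplicate_names]), collapse = ", "))
  }

  # Assumption: identical MD5 checksums are treated as duplicate contents
  duplicate_contents <- duplicated(tools::md5sum(files))
  if (any(duplicate_contents)) {
    warning("Files with contents identical to an earlier file: ",
            paste(basename(files)[duplicate_contents], collapse = ", "))
  }

  imported <- lapply(files, function(f) {
    data <- read.fun(f)
    data$input_source <- rep(f, nrow(data))
    data
  })
  do.call(rbind, imported)
}

# Build a small example directory containing two files with identical contents
batch_dir <- file.path(tempdir(), "batch_example")
dir.create(batch_dir, showWarnings = FALSE)
readings <- data.frame(depth = c(10, 20), moisture = c(0.31, 0.28))
write.csv(readings, file.path(batch_dir, "152-soil_moisture-2017.csv"), row.names = FALSE)
write.csv(readings, file.path(batch_dir, "140-soil_moisture-2017.csv"), row.names = FALSE)

# The duplicate-contents warning is suppressed here, as a user might choose to do
batch_data <- suppressWarnings(read_directory(batch_dir))
str(batch_data)
```

For large directories, parallel::parLapply() or, on Unix-alikes, parallel::mclapply() could stand in for lapply() in a sketch like this one.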
Filenames generally include information about the data collected: site, sensor, measurement type, date collected, etc. We are working on a generalized approach (probably just a function or two) that would split the filename into data columns using a template.
For example, if a set of file names read as “site-variable-year” (152-soil_moisture-2017.csv, 152-soil_temperature-2017.csv, 140-soil_moisture-2017.csv, etc.), then the function would take an argument supplying the template as column headers: “site-variable-year”, with either delimiters or the length of each variable to enable splitting. These functions will likely build off of the great work done on tidyr::separate(), and we suggest using that until we have incorporated a solution.
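Until such a function exists, the template idea can be approximated in a few lines of base R. The helper split_filename() below is hypothetical and only handles the delimiter case, not fixed-width splitting; it is a sketch of the concept rather than the planned ingestr interface.

```r
# Hypothetical helper (not an ingestr function): split file names into columns
# named by a delimited template such as "site-variable-year".
split_filename <- function(file.names,
                           template = "site-variable-year",
                           delimiter = "-") {
  columns <- strsplit(template, delimiter, fixed = TRUE)[[1]]
  # Drop the directory and the file extension before splitting
  stems <- tools::file_path_sans_ext(basename(file.names))
  pieces <- strsplit(stems, delimiter, fixed = TRUE)

  # Stop early if any file name does not have one piece per template field
  mismatched <- lengths(pieces) != length(columns)
  if (any(mismatched)) {
    stop("File names that do not match the template: ",
         paste(file.names[mismatched], collapse = ", "))
  }

  parsed <- as.data.frame(do.call(rbind, pieces), stringsAsFactors = FALSE)
  names(parsed) <- columns
  parsed$input_source <- file.names
  parsed
}

parsed_names <- split_filename(c("152-soil_moisture-2017.csv",
                                 "152-soil_temperature-2017.csv"))
parsed_names
```

With tidyr installed, a call along the lines of tidyr::separate(data, col, into = c("site", "variable", "year"), sep = "-") performs the same split on a column that already holds the file name stems.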
Basic data cleaning utilities will be included in ingestr. These will include identifying duplicate rows, empty rows, empty columns, and columns that contain suspicious entries (e.g. “.”). These utilities will be able to flag or correct the problems depending upon user preference. In keeping with our commitment to data provenance and reproducibility, all cleaning utilities will provide a record of identified and corrected issues, which can be saved by the user and stored with the final dataset.
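As a rough illustration of what such checks could look like, the base R sketch below flags each kind of problem and returns the findings as a list that can serve as the record. The function flag_issues() and its suspicious.values argument are hypothetical; the eventual ingestr utilities may be designed quite differently.

```r
# Hypothetical helper (not an ingestr function): flag common problems and
# return them as a list that can be saved alongside the cleaned data.
flag_issues <- function(data, suspicious.values = ".") {
  is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""
  # One logical column per data column; TRUE marks a blank or missing cell
  blank <- do.call(cbind, lapply(data, is_blank))

  list(
    duplicate_rows = which(duplicated(data)),
    empty_rows = which(rowSums(blank) == ncol(data)),
    empty_columns = names(data)[colSums(blank) == nrow(data)],
    suspicious_columns = names(data)[vapply(
      data,
      function(x) any(as.character(x) %in% suspicious.values),
      logical(1)
    )]
  )
}

raw <- data.frame(site = c("152", "152", NA, "140"),
                  moisture = c("0.31", "0.31", NA, "."),
                  notes = NA,
                  stringsAsFactors = FALSE)

issues <- flag_issues(raw)
str(issues)

# Correct the flagged problems; setdiff() is safe when nothing was flagged
keep_rows <- setdiff(seq_len(nrow(raw)),
                     c(issues$duplicate_rows, issues$empty_rows))
keep_columns <- setdiff(names(raw), issues$empty_columns)
cleaned <- raw[keep_rows, keep_columns, drop = FALSE]
```

The issues list could then be written out with saveRDS() and stored with the final dataset so the cleaning steps remain traceable.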
While ingestr is focused on getting data into R and running preliminary checks, another group at the IMCR Hackathon focused on quality assurance checks for environmental data. qaqcr provides a simple, standard way to apply the quality control checks that are applicable to your data.
These packages are the start of a larger ecosystem, including EMLassemblyline for environmental data management, that aims to create a convenient, reproducible workflow moving from raw data to archived datasets with rich EML metadata.