# Parallel Downloads

## Overview
Large fisheries logbook tables can be slow to download over a single
connection. pifsc.odbc speeds this up by partitioning
tables by year and downloading chunks in parallel across multiple
database connections.
Make sure you have completed the one-time setup steps described in
`vignette("connections")` before using the download functions.
## Downloading a single table

### Parallel download
Use `parallel_download()` to download one table. The function
automatically detects available years and selects an appropriate number
of cores:
```r
library(pifsc.odbc)

hdr <- parallel_download(
  table = "LLDS_HDR_20240315HAC",
  year_col = "LANDYR"
)
```

You can restrict the download to specific years or control the number of parallel workers:
```r
hdr_recent <- parallel_download(
  table = "LLDS_HDR_20240315HAC",
  year_col = "LANDYR",
  years = 2015:2023,
  n_cores = 4
)
```

### Single-threaded download
For small tables or environments where parallel connections are not
available, use `simple_download()`:
```r
# Download an entire table (no year filtering)
ref <- simple_download(table = "LLDS_SOME_REF_TABLE")

# Download specific years
hdr_recent <- simple_download(
  table = "LLDS_HDR_20240315HAC",
  year_col = "LANDYR",
  years = 2020:2023
)
```

`simple_download()` opens a single connection, downloads the data, and
disconnects. It accepts the same `year_col`, `years`, and
`connection_args` arguments as `parallel_download()`, so switching
between the two is straightforward.
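Because the argument names line up, one way to make the switch explicit is to hold the shared arguments in a list (a sketch using `do.call()`, reusing the header table from the examples above):

```r
# Sketch: the same argument list works with either function, so you can
# choose a download strategy without repeating the table details.
args <- list(
  table = "LLDS_HDR_20240315HAC",
  year_col = "LANDYR",
  years = 2020:2023
)

hdr <- do.call(simple_download, args)
# hdr <- do.call(parallel_download, args)  # same call, parallel workers
```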
## Downloading multiple tables
`download_tables()` processes a list of table definitions sequentially,
downloading each one in parallel. Provide an `output_dir` to save
results as CSVs:
```r
tables <- list(
  list(table = "LLDS_HDR_20240315HAC", year_col = "LANDYR"),
  list(table = "LLDS_DETAIL_20240315HAC", year_col = "HDR_LANDYR")
)

results <- download_tables(tables, output_dir = "logbook-data")
```

The return value is a named list of `data.table` objects, so you can
also work with the data directly in R:
```r
hdr <- results[["LLDS_HDR_20240315HAC"]]
detail <- results[["LLDS_DETAIL_20240315HAC"]]
```

If you omit `output_dir`, data is returned but not written to disk:

```r
results <- download_tables(tables)
```

## Core selection
By default, `parallel_download()` and `download_tables()` use
`optimal_cores()` to pick the number of workers. This uses 75% of your
available cores, bounded between 2 and 8 to avoid overloading the
database:

```r
optimal_cores()              # uses default max of 8
optimal_cores(max_cores = 4) # cap at 4
```

You can override this per download with the `n_cores` argument.
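For example, to force a conservative worker count for a single download (reusing the header table from earlier):

```r
# n_cores overrides the optimal_cores() default for this call only
hdr <- parallel_download(
  table = "LLDS_HDR_20240315HAC",
  year_col = "LANDYR",
  n_cores = 2
)
```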
## Custom connection parameters
If you use non-default connection settings (e.g., custom keyring service
names or a different driver), pass them via `connection_args`:
```r
results <- download_tables(
  tables,
  connection_args = list(
    uid_service = "MY_DB_user",
    pwd_service = "MY_DB_pwd",
    driver = "Oracle in instantclient_21_3"
  )
)
```

These arguments are forwarded to `create_connection()` in every worker.
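Since each worker will open its own connection with these settings, it can save time to smoke-test them once before launching a long run. A sketch, assuming `create_connection()` accepts the same arguments directly (which the forwarding above implies):

```r
# Sketch: verify custom settings with a single connection before a
# long parallel download; an error here surfaces misconfiguration early
con <- create_connection(
  uid_service = "MY_DB_user",
  pwd_service = "MY_DB_pwd",
  driver = "Oracle in instantclient_21_3"
)
DBI::dbDisconnect(con)
```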
## How it works
The parallel download strategy:

- Open a single connection to query the distinct years available in the table
- Split years evenly across workers
- Each worker opens its own database connection, downloads its assigned years, and disconnects
- Results are combined into a single `data.table`

This approach keeps the total number of concurrent database connections equal to the number of cores, which is friendlier to the database than opening one connection per year.
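The year-splitting step can be illustrated in plain base R (a sketch of the idea, not the package's internal code):

```r
# Sketch: distribute 14 years across 4 workers round-robin, so each
# worker receives a roughly equal share
years <- 2010:2023
n_cores <- 4
chunks <- split(years, rep_len(seq_len(n_cores), length(years)))

lengths(chunks)
#> 1 2 3 4
#> 4 4 3 3
```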
## Downloading observer data
In addition to logbook data, you can download observer data from the
PIRO LOTUS database. By default, observer tables are downloaded in
parallel by partitioning on the `HAULBEGIN_YR` column, just as logbook
tables are partitioned by their year columns.
### Using the convenience function
The `download_observer_tables()` function provides a simple workflow for
downloading the most commonly used observer tables in parallel:
```r
# Download default tables in parallel (LDS_SET_ENVIRON_V, LDS_CATCH_V, LDS_GEAR_CFG_V)
obs <- download_observer_tables()

# Access individual tables from the returned list
catch <- obs[["LDS_CATCH_V"]]
environ <- obs[["LDS_SET_ENVIRON_V"]]
```

You can control the number of parallel workers:
```r
obs <- download_observer_tables(n_cores = 4)
```

To save results to disk:

```r
# Save with timestamps (TABLE_YYYYMMDDHHMMSS.csv)
obs <- download_observer_tables(
  output_dir = "obs-data",
  timestamp = TRUE
)

# Save without timestamps (TABLE.csv)
obs <- download_observer_tables(
  output_dir = "obs-data",
  timestamp = FALSE
)
```

You can also specify custom tables:
```r
obs <- download_observer_tables(
  tables = c("LDS_CATCH_V", "LDS_SET_ENVIRON_V"),
  output_dir = "obs-data"
)
```

### Single-threaded fallback
If parallel connections are not available, set `year_col = NULL` to
fall back to sequential downloads over a single connection:

```r
obs <- download_observer_tables(year_col = NULL)
```

### Using `parallel_dsn_download()` directly
For more control over individual observer tables, use
`parallel_dsn_download()` directly. This works like
`parallel_download()` but uses DSN connections instead of Oracle
connections:
```r
# Download a single observer table in parallel
catch <- parallel_dsn_download(
  table = "LDS_CATCH_V",
  year_col = "HAULBEGIN_YR"
)

# Download specific years with custom core count
environ <- parallel_dsn_download(
  table = "LDS_SET_ENVIRON_V",
  year_col = "HAULBEGIN_YR",
  years = 2015:2023,
  n_cores = 4
)
```

### Using `simple_download()` with a DSN connection
For small tables or custom queries, you can use `simple_download()`
with a DSN connection:
```r
# Open a connection
con <- create_dsn_connection("PIRO LOTUS")

# Download multiple tables using the same connection
catch <- simple_download(
  table = "LDS_CATCH_V",
  schema = "newobs",
  con = con
)

environ <- simple_download(
  table = "LDS_SET_ENVIRON_V",
  schema = "newobs",
  con = con
)

# Close connection when finished
DBI::dbDisconnect(con)
```

## Important notes
- Observer tables are partitioned by `HAULBEGIN_YR` for parallel downloads
- You must have the PIRO LOTUS DSN configured (see `vignette("connections")`)
- You must be on the PIFSC network or NOAA VPN
- Tables are in the `newobs` schema, not `llds`