# Parallel Downloads

## Overview
Large fisheries logbook tables can be slow to download over a single
connection. pifsc.odbc speeds this up by partitioning
tables by year and downloading chunks in parallel across multiple
database connections.
Make sure you have completed the one-time setup steps described in
`vignette("connections")` before using the download functions.
## Downloading a single table

### Parallel download
Use `parallel_download()` to download one table. The function
automatically detects available years and selects an appropriate number
of cores:
```r
library(pifsc.odbc)

hdr <- parallel_download(
  table = "LLDS_HDR_20240315HAC",
  year_col = "LANDYR"
)
```

You can restrict the download to specific years or control the number of parallel workers:
```r
hdr_recent <- parallel_download(
  table = "LLDS_HDR_20240315HAC",
  year_col = "LANDYR",
  years = 2015:2023,
  n_cores = 4
)
```

### Single-threaded download
For small tables or environments where parallel connections are not
available, use `simple_download()`:
```r
# Download an entire table (no year filtering)
ref <- simple_download(table = "LLDS_SOME_REF_TABLE")

# Download specific years
hdr_recent <- simple_download(
  table = "LLDS_HDR_20240315HAC",
  year_col = "LANDYR",
  years = 2020:2023
)
```

`simple_download()` opens a single connection, downloads the data, and
disconnects. It accepts the same `year_col`, `years`, and
`connection_args` arguments as `parallel_download()`, so switching
between the two is straightforward.
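Because the argument names line up, one way to make the switch explicit is to hold the shared arguments in a list (a sketch using `do.call()`, reusing the header table from the examples above):

```r
# Sketch: the same argument list works with either function, so you can
# choose a download strategy without repeating the table details.
args <- list(
  table = "LLDS_HDR_20240315HAC",
  year_col = "LANDYR",
  years = 2020:2023
)

hdr <- do.call(simple_download, args)
# hdr <- do.call(parallel_download, args)  # same call, parallel workers
```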
## Downloading multiple tables
`download_tables()` processes a list of table definitions sequentially,
downloading each one in parallel. Provide an `output_dir` to save
results as CSVs:
```r
tables <- list(
  list(table = "LLDS_HDR_20240315HAC", year_col = "LANDYR"),
  list(table = "LLDS_DETAIL_20240315HAC", year_col = "HDR_LANDYR")
)

results <- download_tables(tables, output_dir = "logbook-data")
```

The return value is a named list of `data.table` objects, so you can
also work with the data directly in R:
```r
hdr <- results[["LLDS_HDR_20240315HAC"]]
detail <- results[["LLDS_DETAIL_20240315HAC"]]
```

If you omit `output_dir`, data is returned but not written to disk:

```r
results <- download_tables(tables)
```

## Core selection
By default, `parallel_download()` and `download_tables()` use
`optimal_cores()` to pick the number of workers. This uses 75% of your
available cores, bounded between 2 and 8 to avoid overloading the
database:

```r
optimal_cores()              # uses default max of 8
optimal_cores(max_cores = 4) # cap at 4
```

You can override this per download with the `n_cores` argument.
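For example, to force a conservative worker count for a single download (reusing the header table from earlier):

```r
# n_cores overrides the optimal_cores() default for this call only
hdr <- parallel_download(
  table = "LLDS_HDR_20240315HAC",
  year_col = "LANDYR",
  n_cores = 2
)
```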
## Custom connection parameters
If you use non-default connection settings (e.g., custom keyring service
names or a different driver), pass them via `connection_args`:
```r
results <- download_tables(
  tables,
  connection_args = list(
    uid_service = "MY_DB_user",
    pwd_service = "MY_DB_pwd",
    driver = "Oracle in instantclient_21_3"
  )
)
```

These arguments are forwarded to `create_connection()` in every worker.
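Since each worker will open its own connection with these settings, it can save time to smoke-test them once before launching a long run. A sketch, assuming `create_connection()` accepts the same arguments directly (which the forwarding above implies):

```r
# Sketch: verify custom settings with a single connection before a
# long parallel download; an error here surfaces misconfiguration early
con <- create_connection(
  uid_service = "MY_DB_user",
  pwd_service = "MY_DB_pwd",
  driver = "Oracle in instantclient_21_3"
)
DBI::dbDisconnect(con)
```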
## How it works
The parallel download strategy:

- Open a single connection to query the distinct years available in the table
- Split years evenly across workers
- Each worker opens its own database connection, downloads its assigned years, and disconnects
- Results are combined into a single `data.table`

This approach keeps the total number of concurrent database connections equal to the number of cores, which is friendlier to the database than opening one connection per year.
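The year-splitting step can be illustrated in plain base R (a sketch of the idea, not the package's internal code):

```r
# Sketch: distribute 14 years across 4 workers round-robin, so each
# worker receives a roughly equal share
years <- 2010:2023
n_cores <- 4
chunks <- split(years, rep_len(seq_len(n_cores), length(years)))

lengths(chunks)
#> 1 2 3 4
#> 4 4 3 3
```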
## Downloading observer data
In addition to logbook data, you can download observer data from the
PIRO LOTUS database. By default, observer tables are downloaded in
parallel by partitioning on the `HAULBEGIN_YR` column, just as logbook
tables are partitioned by their year columns.
### Using the convenience function
The `download_observer_tables()` function provides a simple workflow for
downloading the most commonly used observer tables in parallel:
```r
# Download default tables in parallel (LDS_SET_ENVIRON_V, LDS_CATCH_V, LDS_GEAR_CFG_V)
obs <- download_observer_tables()

# Access individual tables from the returned list
catch <- obs[["LDS_CATCH_V"]]
environ <- obs[["LDS_SET_ENVIRON_V"]]
```

You can control the number of parallel workers:
```r
obs <- download_observer_tables(n_cores = 4)
```

To save results to disk:

```r
# Save with timestamps (TABLE_YYYYMMDDHHMMSS.csv)
obs <- download_observer_tables(
  output_dir = "obs-data",
  timestamp = TRUE
)

# Save without timestamps (TABLE.csv)
obs <- download_observer_tables(
  output_dir = "obs-data",
  timestamp = FALSE
)
```

You can also specify custom tables:
```r
obs <- download_observer_tables(
  tables = c("LDS_CATCH_V", "LDS_SET_ENVIRON_V"),
  output_dir = "obs-data"
)
```

### Single-threaded fallback
If parallel connections are not available, set `year_col = NULL` to
fall back to sequential downloads over a single connection:

```r
obs <- download_observer_tables(year_col = NULL)
```

### Using `parallel_dsn_download()` directly
For more control over individual observer tables, use
`parallel_dsn_download()` directly. This works like
`parallel_download()` but uses DSN connections instead of Oracle
connections:
```r
# Download a single observer table in parallel
catch <- parallel_dsn_download(
  table = "LDS_CATCH_V",
  year_col = "HAULBEGIN_YR"
)

# Download specific years with custom core count
environ <- parallel_dsn_download(
  table = "LDS_SET_ENVIRON_V",
  year_col = "HAULBEGIN_YR",
  years = 2015:2023,
  n_cores = 4
)
```

### Using `simple_download()` with a DSN connection
For small tables or custom queries, you can use `simple_download()`
with a DSN connection:
```r
# Open a connection
con <- create_dsn_connection("PIRO LOTUS")

# Download multiple tables using the same connection
catch <- simple_download(
  table = "LDS_CATCH_V",
  schema = "newobs",
  con = con
)

environ <- simple_download(
  table = "LDS_SET_ENVIRON_V",
  schema = "newobs",
  con = con
)

# Close connection when finished
DBI::dbDisconnect(con)
```

## Important notes
- Observer tables are partitioned by `HAULBEGIN_YR` for parallel downloads
- You must have the PIRO LOTUS DSN configured (see `vignette("connections")`)
- You must be on the PIFSC network or NOAA VPN
- Tables are in the `newobs` schema, not `llds`