---
title: "Accessing Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Accessing Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Databrary is a data library, one specialized for storing and sharing video with a restricted audience of institutionally approved researchers.
In this vignette, we'll see how to use `databraryr` to access data.

We'll start simply.
Let's download a test video from [volume 1 on Databrary](https://databrary.org/volume/1).

The `download_video()` function handles this for us.
Running it with the default parameters downloads a simple test video with numbers
than increment.
The file is stored in a temporary directory created by the file system using the function `tempdir()`.
The `download_video()` function returns a character string with the full file name.

```{r, eval=FALSE}
databraryr::download_video()
```

The function returns a path to the video on your file system.
The file name chosen, since we didn't specify one, is the session number `9807`, an underscore, the file/asset number `1`, another underscore, and a time stamp.

**Note**: You can navigate to the location where we're downloading files by opening the following URL in your browser: <https://databrary.org/volume/1/slot/9807>

Depending on your operating system, the following commands may open the file so that you can play it with your default video player.

```{r, eval=FALSE}
nums_vid <- download_video()
system(paste0("open ", nums_vid))
```

Or, you can navigate to the temporary directory to open and play the video manually.
Use `tempdir()` to find the directory where `test.mp4` is stored.

Now, let's see what other files are shared in volume 1, not just those in session (slot) 9807.
This takes a moment to run because there are *many* files in this volume.

```{r}
databraryr::list_volume_assets()
```

**NOTE**: These commands return public data, that is, data we do not need an account or log-in to see.
We have not provided an `httr2` request parameter, so the function generates a default one.
We can see this happening if we set `vb = TRUE`.

```{r}
databraryr::list_volume_assets(vb = TRUE)
```

If we log-in using the commands described in [authorized users](authorized-users.Rmd), and provide the function with a valid (non-NULL) `httr2` request parameter, the following function call would show files and data that are restricted to authorized users:

```{r, eval=FALSE}
databraryr::list_volume_assets(vol_id = `<SOME_OTHER_VOLUME_ID>`, rq = rq)
```

Obviously, you would need to supply a `vol_id` for some other non-public dataset for this to return useful information.

The `list_volume_assets()` command returns a data frame we can manipulate using standard R commands.
Here are the variables in the data frame.

```{r}
vol1_assets <- databraryr::list_volume_assets()
names(vol1_assets)
```

Or, if you use the R (> version 4.3) 'pipe' syntax:

```{r}
databraryr::list_volume_assets() |>
  names()
```

The `magrittr` package pipe ('%>%') also works (as of databraryr v0.6.2).

The `asset_format_id` variable tells us information about the type of the data file.

```{r}
unique(vol1_assets$asset_format_id)
```

But this isn't especially informative since the `asset_format` is a code, and we don't really know what `-800` or `6` or the other numbers refer to.
To decode it, we create a data frame of all the file formats Databrary currently recognizes.

```{r}
db_constants <- databraryr::assign_constants()

formats_df <- purrr::map(db_constants$format, as.data.frame) |>
  purrr::list_rbind()

formats_df
```

From this, we see that format `-800` is an MP4-formatted video. 
There are lots of those in Volume 1.
We see that`6` is a PDF document.

As of 0.6.0, `list_volume_assets()` adds this information to the data frame.

We can summarize the number of files using the `stats::xtabs()` function:

```{r}
stats::xtabs(~ format_name, data = vol1_assets)
```
So, there are lots of videos and PDFs to examine in volume 1.
Here is a table of the ten longest videos.

```{r}
vol1_assets |>
  dplyr::filter(format_name == "MPEG-4 video") |>
  dplyr::select(asset_name, asset_duration) |>
  dplyr::mutate(hrs = asset_duration/(60*60*1000)) |>
  dplyr::select(asset_name, hrs) |>
  dplyr::arrange(desc(hrs)) |>
  head(n = 10) |>
  knitr::kable(format = 'html')
```

## Accessing metadata

Imagine you are interested in knowing more about this volume, the people who created it, or the agencies that funded it.

The `list_volume_owners()` function returns a data frame with information about the people who created and "own" this particular dataset.
The function has a parameter `this_vol_id` which is an integer, unique across Databrary, that refers to the specific dataset.
The `list_volume_owners()` function uses volume 1 as the default.

```{r}
databraryr::list_volume_owners()
```

The command (and many like it) can be "vectorized" using the `purrr` package.
This let's us generate a tibble with the owners of the first fifteen volumes.

```{r}
purrr::map(1:15, databraryr::list_volume_owners) |> 
  purrr::list_rbind()
```

As of 0.6.0, the `get_volume_by_id()` returns a list of all data about a volume that is accessible to a particular user.
The default is volume 1.

```{r}
vol1_list <- databraryr::get_volume_by_id()
names(vol1_list)
```
Let's create our own tibble/data frame with a subset of these variables.

```{r}
vol1_df <- tibble::tibble(id = vol1_list$id,
                          name = vol1_list$name,
                          doi = vol1_list$creation,
                          permission = vol1_list$permission)
vol1_df
```

The `permission` variable indicates whether a volume is visible by you, and if so with what privileges.

So, if you are not logged-in to Databrary, only data that are visible to the public will be returned.
Assuming you are *not* logged-in, the above commands will show volumes with `permission` equal to 1.
The `permission` field derives from a set of constants the system uses.

```{r}
db_constants <- databraryr::assign_constants()
db_constants$permission
```

The `permission` array is indexed beginning with 0.
So the 1th (1st) value is "`r db_constants$permission[2]`".
So, the `1` means that the volumes shown above are all visible to the public, and to you.

Volumes that you have not shared and are not visible to the public, will have `permission` equal to 5, or "`r db_constants$permission[6]`".
We can't demonstrate this to you because we don't have privileges on the same unshared volume, but you can try it on a volume you've created but not yet shared.

Other functions with the form `list_volume_*()` provide information about Databrary volumes.
For example, the `list_volume_funding()` command returns information about any funders listed for the project.
Again, the default volume is 1.

```{r}
databraryr::list_volume_funding()
```

The `list_volume_links()` command returns information about any external (web) links that have been added to a volume, such as to related publications or a GitHub repo.

```{r}
databraryr::list_volume_links()
```

There's much more to learn about accessing Databrary information using `databraryr`, but this should get you started.

## Downloading multiple files

As of 0.6.3, it's possible to download multiple files.
The following set of commands downloads all of the 'csv' files in volume 1 using the output from `list_volume_session_assets()` (when you have a volume ID and a session ID) or `list_session_assets()` when you have only the session ID.
The code below creates a new directory based on the session/slot ID (9807).
The function returns the file path to the downloaded files.

```{r}
vol1_assets |> 
  dplyr::filter(format_extension == "csv") |>
  databraryr::download_session_assets_fr_df()
```