```r
library(purrr)
library(tidyr)
library(tibble)

flights_schema <- tibble::lst(
  airlines_labelled,
  airports_labelled,
  flights_labelled,
  planes_labelled,
  weathers_labelled
)

# Creating a data dictionary
flights_dictionary <- flights_schema |>
  map(labelled::generate_dictionary) |>
  enframe() |>
  unnest(cols = value)
```
Sessions (Day 1)
Day 1 - 08/13/2024 | [GitHub]
Introducing Positron, a new data science IDE
Tuesday, Aug 13 10:20 AM - 11:40 AM PDT
Positron discussions: https://github.com/posit-dev/positron/discussions
Introducing Positron
Session Info
- Presenter: Julia Silge
- Slides
- Positron
- Next generation data science IDE
- Multilingual/polyglot IDE (R & Python support currently)
- Built on VS Code
Exploratory Data Analysis with Python in Positron
Session Info
- Presenter: Isabel Zimmerman
- Viewer pane
- Connections pane
- Help pane
- Plots pane
Debugging Data with the Positron Data Explorer
Session Info
- Presenter: Tom Mock
- A next-gen data explorer
- Data viewer grid
- Summary panel (summary statistics appropriate for the data: measures of central tendency for numeric data, unique values for categorical data, missing data, etc.)
- Filter bar
- Grid design
- Polyglot tool for in-memory dataframes
- Highly scalable
- Grid sort and multi-sort
- Automatic column width
- Monospaced fonts in the data grid
- Filter design (outside of data viewer)
- Dedicated filter bar
- Quick add filter
- Special filters by type
- Summary panel design
- Type and name
- Missing data percentage
- Summary statistics
Architecture and Design of Positron
Session Info
- Presenter: Jonathan McPherson
- Positron is a polyglot IDE focused on data science.
- It’s a Code OSS fork, like Visual Studio Code is.
- Positron’s language features exist in extensions.
- The language features use existing standards, like the Jupyter Protocol and the Language Server Protocol.
- Positron has a public API that lets anyone add new languages or features.
- The way Positron is built allows it to be extended to other languages in the future (a foundation upon which more data science tooling can be built)
Beautiful And Effective Tables
Tuesday, Aug 13 1:00 PM - 2:20 PM PDT
Adequate Tables? No, We Want Great Tables
Session Info
- Presenter: Richard Iannone
- gt R package
- Initial goals
- Comprehensive table structuring
- Large selection of functions for formatting values (29 formatters that cover a ton of formatting tasks)
- Flexible and easy-to-use methods for table styling
- 3 methods: tab_style(), tab_options(), opt_stylize()
- Table rendering to multiple output types
- table data -> gt object -> output table
- You shouldn’t have to change the table code depending on the output type
- Three output types were targeted: HTML, LaTeX, RTF
  - LaTeX and RTF outputs underwent improvements over several releases
  - HTML changes: speed and accessibility
  - .docx/Word output type was implemented (currently improving)
  - Looking to add Excel and PowerPoint output
- Later goals
  - Useful across many disciplines and use cases
    - Added formatters both general and domain-specific
  - Localization options for users all over the world
    - Most formatting options have a locale option that ensures that numbers, dates/times, and even words fit the language and region
  - Good documentation to get you building quickly
  - Useful for Pharma's specific table-making needs
- The future of gt
- Row ordering functionality - so you don’t have to order rows beforehand
- Improvements to footnotes - flexibility for affixing the marks
- Excel output tables - it’s a popular file format
- Overall better table-splitting - better options, more dependable
- Integration with database tables - so it can be used as input data
- Ways to better style text - e.g., styling spans of text in a cell
- Methods for merging cells - right now: no way to merge adjacent cells
- Great Tables Python package
- Important features are there, secondary features coming later
- Python version examples
- reactable package, available in R and Python (newly released) - interactive data tables
- Questions
  - HTML tables are translated to Typst
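The table data -> gt object -> output table pipeline can be sketched with gt's built-in exibble dataset; the specific formatting choices here are illustrative, not from the talk:

```r
library(gt)

# table data -> gt object -> output table
exibble |>
  gt(rowname_col = "row") |>
  fmt_number(columns = num, decimals = 2) |>          # one of gt's many formatters
  fmt_currency(columns = currency, locale = "fr") |>  # locale-aware formatting
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  )
```

The same object renders to HTML, LaTeX, RTF, or Word without changing the table code.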
Context is King
Session Info
- Presenter: Shannon Pileggi
- Data stewardship: How do we ensure data integrity?
- Source data context can and should be embedded in your data
- Using variable labels (like in SAS/SPSS/Stata); the haven R package allows us to import SAS/SPSS/Stata data files and retain variable labels
- Assigning variable labels with labelled::set_variable_labels() or using base R
- Blog post: The case for variable labels in R
- Applications
  - Data dictionary
  - Figures, labelled: ggeasy::easy_labs() automatically substitutes variable labels into ggplots
  - Tabling, labelled: gtsummary::tbl_summary() will use your variable label instead of the variable name in output (gt has the same behavior)
- In practice
  - Maintain a csv with metadata
  - Apply a custom function for bulk label assignment (e.g., croquet::set_derived_variable_labels()), in an iterative framework if you have many lists
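A minimal sketch of assigning and retrieving variable labels with the labelled package (the data frame and label text are made up for illustration):

```r
library(labelled)

df <- data.frame(age = c(34, 52), trt = c("A", "B"))

# Bulk label assignment; columns not mentioned stay unlabelled
df <- set_variable_labels(
  df,
  age = "Age at enrollment (years)",
  trt = "Treatment arm"
)

var_label(df$age)
generate_dictionary(df)  # data dictionary with names, labels, and types
```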
gtsummary: Streamlining Summary Tables for Research and Regulatory Submissions
Session Info
- Presenter: Daniel Sjoberg
```r
# Simple, customizable code
trial |>
  tbl_summary(
    by = trt, # group by treatment (Drug A vs. Drug B)
    include = c(age, grade, response), # characteristics to summarize
    statistic = all_continuous() ~ "{mean} ({sd})" # instead of default median
  )

trial |>
  tbl_summary(
    by = trt,
    include = c(marker, response),
    missing = "no",
    statistic = list(marker = "{mean} ({sd})", response = "{p}%")
  ) |>
  add_difference()

# Regression model summaries
mod <- glm(response ~ trt + marker, trial, family = binomial)
summary(mod)
tbl_mod <- tbl_regression(mod, exponentiate = TRUE)
tbl_mod

# Table cobbling (merging tables side by side; tbl_uni built earlier in the talk)
list(tbl_uni, tbl_mod) |>
  tbl_merge(tab_spanner = c("**Univariable**", "**Multivariable**"))
```
Lightning Talks
Tuesday, Aug 13 1:00 PM - 2:20 PM PDT
Templated Analyses within R Packages for Collaborative, Reproducible Research
- Our approach: Using package architecture and templates
- Set up a simple package
- The package handles the research code and project management tools
- Junior members run devtools::load_all() to access everything
- See it in action: https://alarm-redist.org/fifty-states/
Why’d you load that package for?
- You should include code comments to explain why you're loading each package
- The annotater R package is a solution that adds information to library() calls through functions or addins
- Automates informative comments by
  - Leveraging built-in package descriptions
  - Checking code for which package components are being used
DataPages for interactive data sharing using Quarto
- datapages.github.io: Tools and templates for rich data sharing
- Create a website for your data (interactive visualizations, documentations, data preview, etc.)
Automated Reporting With Quarto: Beyond Copy And Paste
Tuesday, Aug 13 2:40 PM - 4:00 PM PDT
Beyond Dashboards: Dynamic Data Storytelling with Python, R, and Quarto Emails
Session Info
- Presenter: Sean Nguyen
- Remove friction
- Logging in can create barriers
- Meet executives where they are (e.g., Slack, Gmail, Outlook)
- “No-Click” Insights
- Add key metrics/alerts in the subject line or notification
- Dynamic emails: Insights in your Inbox
- Concise, personalized content
- Automated with Python, R, Quarto
- Powered by Posit Connect
- Personalize the delivery
- When: Send email only when needed
- So What: Deliver insights to deliver action
- Tools: Quarto Emails, Pins, Posit Connect
- Streamline your data with Pins
  - Data Sources -> Data Warehouse -> create_pins.qmd
  - Save modeled data (.csv files) as a pin
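The pin step could be sketched like this, using a temporary board for illustration (on Posit Connect you would use board_connect(); the data and pin name are made up):

```r
library(pins)

board <- board_temp()  # stand-in for board_connect() on Posit Connect

# Save modeled data as a csv-backed pin, then read it back in the email report
board |> pin_write(head(mtcars), name = "modeled_leads", type = "csv")
board |> pin_read("modeled_leads")
```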
- Steps for dynamic emails in Quarto
  - Setup YAML (format: email)
  - Run code (R/Python code that you would normally run)
  - Email logic (establish logic to deliver the email or not)
  - Email content
- Components
  - Parametrize your Quarto doc
    - Create variables in the YAML (e.g., different variations for different departments)
    - Define your parameters
    - Reference them with params$
    - Generate multiple Quarto docs with purrr
    - posit::conf(2023) talk: Parameterized Quarto reports improve understanding of soil health
- Run your code
- Can include plots, metrics, etc.
- Conditional email logic
  - Different .qmd files to house the emails you send
  - Marketing example: send an email when we have leads in Q3 2024
- Quarto email content
  - YAML (format: email)
  - Body (:::{.email} div)
  - Subject (:::{.subject} div nested within the email div)
  - Schedule logic: return something "truthy" or "falsy"
  - Requires Quarto v1.4+
- Example Email: Conditional Logic
```r
# Example conditional logic to send email
num_leads <- nrow(leads_data)

send_email_with_leads <- function(num_leads, system_date) {
  system_date <- Sys.Date()
  if (num_leads > 0 && system_date > ymd("2024-07-01")) {
    return("yes")
  } else {
    return("no")
  }
}
```
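Putting the YAML, body, and subject components together, a Quarto email document might be laid out like this (the subject line and body text are placeholders):

```markdown
---
format: email
---

Content out here renders to the report page on Posit Connect.

::: {.email}

::: {.subject}
New leads this week
:::

Email body: key metrics, a static plot, a small table.

:::
```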
- Posit Connect
- Can control email recipients with groups
- Can schedule - can specify when it renders (email will only be sent if it meets your conditional logic step)
- Considerations with Quarto emails
- Static visualizations
- Table constraints
- Limited interactivity
- Quarto email tips
- Focus on key metrics
- Use dynamic subject lines
- Be selective with delivery criteria
- Think about “so what”
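The parameterized-report step described above could be sketched as follows; the file name, parameter name, and department values are hypothetical:

```r
library(purrr)

departments <- c("sales", "marketing", "finance")

# Render one output document per parameter value
walk(departments, function(dept) {
  quarto::quarto_render(
    input = "report.qmd",
    output_file = paste0("report-", dept, ".html"),
    execute_params = list(department = dept)
  )
})
```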
Automated Reporting With Quarto: Beyond Copy and Paste
Session Info
- Presenter: Orla Doyle
- A pharmaceutical company (Novartis) created a package ({rdocx}) that generates reproducible Word documents from R Markdown using a company template
- Mandatory elements are represented as R6 classes where each attribute has a unique input requirement that goes through a series of checks
- Embedded good software development practices
Quarto: A Multifaceted Publishing Powerhouse for Medical Researchers
Session Info
- Presenter: Joshua Cook
- Quarto allows us to efficiently create various polished formats from a single source document (in this case, stored in the /Analysis folder)
- Never include all of your code in one Quarto markdown; use shortcodes instead to import analyses, figures, etc., so that you can output the same figure in multiple formats (e.g., PPT, PDF manuscript)
- If an error is identified or something about the data, tables, or figures needs to be changed, simply altering the files in /Analysis will prompt all the documents to update when they are next rendered.
- Quarto shortcodes: special markdown directives that generate content
- Key shortcodes:
  - embed: embeds cells from a Quarto markdown (.qmd) file or a Jupyter (.ipynb) notebook; {{< embed Analysis/data_missing.qmd#fig-missing-1 >}} will only include a SPECIFIC figure
  - include: direct copy/paste of content from another Quarto markdown or Jupyter notebook; {{< include Analysis/data.processing.qmd >}} will include EVERYTHING from the file
- Advanced features
- In-line programming
- Dynamically updating: set execute: freeze: auto in _quarto.yml to re-render a document only when changes are detected in its source files
- Templates
- Collaboration tools available - notes, highlighting, commenting with Hypothesis
- Works with Zotero libraries
Is It Supposed To Hurt This Much?
Tuesday, Aug 13 2:40 PM - 4:00 PM PDT
“Please Let Me Merge Before I Start Crying”: And Other Things I’ve Said at the Git Terminal
Session Info
- Presenter: Meghan Harris
- Git =/= GitHub
- Git: Version control system
- GitHub: Developer platform that uses Git software
- Three ways (R users) can interact with Git
- A CLI terminal
- RStudio GUI
- A Third Party Client (e.g., GitHub Desktop)
- What is Git merge?
- Join two or more development histories (branches) together
- What is a Git merge conflict?
- Content conflict: Competing changes are made to the same line of a file
- Structure conflict: When someone edits a file and someone else deletes the same file
- Resolving Git merge conflicts
- Don’t panic and abort the merge
  - Terminal: git merge --abort
- Assess the damage
  - Terminal: git status
- Choose your own adventure - you are in control of choosing what version you want
- Merge conflicts are not Git problems, but are communication problems, workflow problems, knowledge gap problems
- Communication
- Talking with others
- Naming/styling conventions
- Consistent formatting
- Leverage developer platforms
- “Pull request templates”
- “Labels”
- “Issues”
- “Pull Request Comments”
- “Branch Rules and Protection”
- Workflow
  - Before you code:
    - Check Git environment
    - Check branch status and "drift"
    - ALWAYS pull first before touching ANYTHING
  - During your coding:
    - Commit often (with repeated amends)
    - Push thoughtfully but consider "branch drift" risk
    - Use git stashes when there are "emergencies"
  - After you code:
    - Leave nothing behind
    - You are reviewer #1
Easing the pain of connecting to databases
Session Info
- Presenter: Edgar Ruiz
- Improvements to the odbc R package
  - Connecting to Databricks/Snowflake becomes simpler: odbc::databricks(), odbc::snowflake()
  - Flexible ways to authenticate
    - Use browser-based SSO (currently in the development version; only for desktops)
    - Use a traditional username & password
- DB Connection Pane in Positron
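A sketch of the simplified connection call; the httpPath value is a placeholder, and the workspace host and credentials are typically picked up from environment variables:

```r
library(DBI)

# odbc::databricks() handles driver discovery and authentication defaults
con <- dbConnect(
  odbc::databricks(),
  httpPath = "/sql/1.0/warehouses/<warehouse-id>"
)

dbListTables(con)
dbDisconnect(con)
```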
Auth is the product, making data access simple with Posit Workbench
Session Info
- Presenter: Aaron Jacobs
- Posit Public Package Manager (P3M)
- A free, hosted service based on our professional Posit Package Manager
- Provides everything in a complete CRAN mirror, plus many additional features
- Used widely in the R community since 2020, with over 40 million package installs each month
- Designed to address common pain points in using public mirrors
- Package pain: My code used to work, but now it doesn’t
- Why does this happen?
- New versions may have changes that break old code
- Dependent packages may be out-of-date
- CRAN only guarantees package compatibility at any given “latest” point in time
- Packages are no longer available
- How P3M helps
- Posit takes daily snapshots of CRAN, PyPI, and Bioconductor
- Users can configure R to install packages from a specific date, ensuring all packages installed from that snapshot are compatible with each other
- How can I use snapshots?
- Recreate the package environment for an old project easily by knowing when the project was originally written
- Future-proof your work by tying it to today’s snapshot when sharing with others or reproducing it later
- Package pain: Installing packages is really slow
- Why does this happen?
- Installing packages from source is slow
- Any C, C++, or Fortran code needs to be compiled
- Many packages require additional libraries and build tools
- Binary packages reduce installation time by pre-building the package
- CRAN only provides binary packages for Windows and macOS on the current and previous version of R
- How P3M helps
- Pre-built binary packages for all of CRAN
- On the current plus 5 previous R versions
- For Windows, macOS, and 12 Linux distributions
- Also ideal for cloud compute and containerized environments where package installation is repetitive and automated
- Package binary builds
- p3m.dev
- Using P3M as your CRAN repository
  - Set options("repos") in R
  - RStudio -> Global Options -> Packages
  - Get detailed instructions specifically for your R environment on the Public Package Manager Setup page
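Pinning a project to a dated snapshot could look like this; the date is an example, and any snapshot date available on p3m.dev works:

```r
# Install from the P3M snapshot taken on a specific date,
# so every package version is mutually compatible
options(repos = c(CRAN = "https://packagemanager.posit.co/cran/2024-08-13"))

install.packages("dplyr")  # resolves to the version in that snapshot
```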