R for Statistical Analysis — 4-Day Workshop
4-day workshop: Tidyverse, ggplot2, statistical modelling, R Markdown.
Instructor: Dr. Yaé Ulrich Gaba Duration: 4 days (24 hours) Level: Beginner to Intermediate Language: English
Overview
This workshop provides a comprehensive introduction to R for data analysis and statistical modelling. Participants learn the Tidyverse ecosystem, create publication-quality visualizations with ggplot2, and build reproducible reports with R Markdown. The workshop serves as a practical companion to The Shape of Data (No Starch Press).
Prerequisites
- Basic statistics (mean, variance, hypothesis testing concepts)
- No prior R experience required
- Laptop with internet access
Learning Objectives
By the end of this workshop, participants will be able to:
- Navigate RStudio and write clean, readable R code
- Wrangle data efficiently with dplyr and tidyr
- Create compelling statistical visualizations with ggplot2
- Perform common statistical analyses (t-tests, ANOVA, regression)
- Produce reproducible reports with R Markdown
Software Requirements
- R 4.3+ (CRAN)
- RStudio Desktop (posit.co)
- Packages: tidyverse, rmarkdown, knitr, broom, palmerpenguins
Day-by-Day Program
Day 1: R Fundamentals & RStudio
Objectives: Set up the environment, learn R syntax, understand data types and structures.
| Time | Topic |
|---|---|
| 09:00–10:30 | Setup & RStudio Tour — Installing R & RStudio, console, scripts, projects, environment pane, help system |
| 10:30–10:45 | Break |
| 10:45–12:30 | R Basics — Variables, vectors, types (numeric, character, logical, factor), indexing, vectorized operations |
| 12:30–14:00 | Lunch |
| 14:00–15:30 | Data Structures — Matrices, lists, data frames, tibbles. Creating, subsetting, modifying |
| 15:30–15:45 | Break |
| 15:45–17:00 | Control Flow & Functions — if/else, for loops, apply family, writing custom functions, pipes (|> and %>%) |
Lab 1: Load the built-in iris dataset, compute summary statistics per species, and write a function that classifies a flower based on petal measurements.
Homework: Explore the mtcars dataset — compute grouped statistics and answer 3 questions about the data.
Day 2: Data Wrangling with the Tidyverse
Objectives: Master data manipulation with dplyr and tidyr.
| Time | Topic |
|---|---|
| 09:00–09:30 | Homework Review |
| 09:30–10:30 | Reading Data — read_csv, read_excel, read_delim, readr options, handling encodings and messy files |
| 10:30–10:45 | Break |
| 10:45–12:30 | dplyr Core Verbs — filter(), select(), mutate(), arrange(), summarise(), group_by(), across() |
| 12:30–14:00 | Lunch |
| 14:00–15:30 | tidyr & Reshaping — pivot_longer(), pivot_wider(), separate(), unite(), handling NAs (drop_na, replace_na) |
| 15:30–15:45 | Break |
| 15:45–17:00 | Joining & Combining — left_join, inner_join, anti_join, bind_rows, bind_cols. Relational data patterns |
Lab 2: Work with a multi-file dataset (e.g., WHO health statistics for African countries): read multiple CSVs, join them, reshape from wide to long, clean missing values, and produce a tidy analysis-ready dataset.
Homework: Using the cleaned dataset, compute 5 summary tables using dplyr pipelines.
Day 3: Visualization with ggplot2
Objectives: Build a complete visualization toolkit using the grammar of graphics.
| Time | Topic |
|---|---|
| 09:00–09:30 | Homework Review |
| 09:30–10:30 | Grammar of Graphics — Aesthetics, geoms, scales, coordinate systems, the layered approach |
| 10:30–10:45 | Break |
| 10:45–12:30 | Core Geoms — geom_point, geom_line, geom_bar, geom_histogram, geom_boxplot, geom_violin, geom_density |
| 12:30–14:00 | Lunch |
| 14:00–15:30 | Customization — Themes (theme_minimal, theme_classic, custom themes), color palettes (viridis, brewer), labels, annotations, legends |
| 15:30–15:45 | Break |
| 15:45–17:00 | Advanced Techniques — Faceting (facet_wrap, facet_grid), stat layers, coordinate transformations, combining plots (patchwork), saving with ggsave |
Lab 3: Create a visual report with 6+ publication-quality figures exploring the health dataset. Include: distribution plots, trend lines, faceted comparisons, and a correlation heatmap.
Homework: Reproduce a figure from a published paper (provided) using ggplot2.
Day 4: Statistical Modelling & Reproducible Reports
Objectives: Perform common statistical analyses and produce reproducible reports.
| Time | Topic |
|---|---|
| 09:00–09:30 | Homework Review |
| 09:30–10:30 | Descriptive & Inferential Statistics — Summary statistics, t-tests, chi-squared tests, confidence intervals, p-values and their interpretation |
| 10:30–10:45 | Break |
| 10:45–12:30 | Linear Models — lm(), interpreting coefficients, R², residual diagnostics, multiple regression, ANOVA with aov() |
| 12:30–14:00 | Lunch |
| 14:00–15:00 | Model Diagnostics & Beyond — Residual plots, Q-Q plots, influence measures, logistic regression with glm(), broom for tidy model output |
| 15:00–15:15 | Break |
| 15:15–16:15 | R Markdown — YAML header, code chunks, inline code, tables (kable), figure options, output formats (HTML, PDF, Word) |
| 16:15–17:00 | Capstone & Wrap-Up — Mini-project presentations, resources, Q&A, certificates |
Lab 4 (Capstone): Produce a complete reproducible R Markdown report: load data, perform EDA with ggplot2, fit a linear model, diagnose it, and present findings with narrative, tables, and figures. Topics:
- Health outcome modelling (life expectancy ~ GDP, education, etc.)
- Agricultural yield analysis
- Financial indicator analysis
Assessment
- Daily labs (50%) — Completion and quality of exercises
- Capstone report (30%) — R Markdown document produced on Day 4
- Participation (20%) — Engagement and homework
Resources
- R for Data Science (2e) — Hadley Wickham & Garrett Grolemund
- ggplot2 Book
- The Shape of Data — Companion book with R implementations
- RStudio Cheat Sheets
- CRAN Task Views
Certificate
Participants who complete all labs and the capstone report receive a certificate of completion.