class: left, middle, inverse, title-slide

# Exploring and understanding the individual experience from longitudinal data, or…
## How to make better spaghetti (plots)
### Nicholas Tierney, Monash University
### OzVis
Thursday 21st November, 2019
bit.ly/njt-ozvis
nj_tierney
---
layout: true

<div class="my-footer"><span>bit.ly/njt-ozvis • @nj_tierney</span></div>

---
class: inverse, middle, center

.huge[
(My) Background
]

???

I want to talk a bit about where I've started from, because I think it might be useful to understand my perspective, and why I'm interested in doing these things.

---
## Background: Undergraduate

.pull-left.large[
Undergraduate in Psychology

- Statistics
- Experiment Design
- Cognitive Theory
- Neurology
- Humans
]

.pull-right[
<img src="imgs/psy_brain.png" width="80%" style="display: block; margin: auto;" />
]

---
## Background: Honours

.left-code.medium.center[
Psychophysics: illusory contours in 3D

<img src="imgs/phil-grove.jpeg" width="40%" style="display: block; margin: auto;" />

Phil Grove
]

.right-plot[
<img src="imgs/kanizsa-triangle.png" width="444" style="display: block; margin: auto;" />
]

---
## Background: Honours

.left-code.medium.center[
<img src="imgs/christina-lee.jpeg" width="60%" style="display: block; margin: auto;" />

Christina Lee
]

.right-plot.medium[
> If every psychologist in the world delivered gold standard smoking cessation therapy, the rate of smoking would still increase. You need to change policy to make change. To make effective policy, you need to have good data, and do good statistics.
]

???

I discovered an interest in public health and statistics.

---
## Background: PhD

.left-code.medium.center[
<img src="imgs/kerrie.jpg" width="60%" style="display: block; margin: auto;" />

Kerrie Mengersen
]

.right-plot.medium[
- _Statistical Approaches to Revealing Structure in Complex Health Data_
- Exploring missing values
- Bayesian Models of people's health over time
- Geospatial statistics of cardiac arrest
- Fun, applied, real data, real people
]

---
## Background: PhD

.pull-left.large[
- "Ah, statistics, everything is black and white!"
- "There's always an answer"
- "data in, answer out"
]

.pull-right[
<img src="imgs/bw-mountains.jpg" width="1778" style="display: block; margin: auto;" />
]

???

I started a PhD in statistics at QUT, under (now distinguished) Professor Kerrie Mengersen, looking at people's health over time.

- There were several things that I noticed:
- There were equations, but not as many clear-cut, black and white answers

---
## 😱 Missing Values

<img src="imgs/fig-1-miss-map-1.png" width="469" style="display: block; margin: auto;" />

???

## Journey of Analytic Discovery - map

---
## Background: PhD

.pull-left.large[
- Data is really messy
- Missing values are frustrating
- How to explore data?
]

.pull-right[
<img src="imgs/explorer.jpg" width="569" style="display: block; margin: auto;" />
]

---
## Background: PhD - But in psych

.pull-left.large[
- Focus on experiment design
- No focus on exploring data
- Exploring data felt...wrong?
- But it was so critical.
]

.pull-right[
<img src="imgs/sound-barrier.jpg" width="853" style="display: block; margin: auto;" />
]

---
## (My personal) motivation

.large[
A lot of research in new statistical methods - imputation, inference, prediction
]

--

.large[
Not much research on how we explore our data, and the methods that we use to do this.
]

---
## (My personal) motivation

.pull-left[
<img src="imgs/bw-bridge.jpg" width="2560" style="display: block; margin: auto;" />
]

.pull-right.large[
> Focus on building a bridge across a river. Less focus on **how** it is built, and the **tools** used.
]

???

- I became very interested in how we explore our data - exploratory data analysis.

---
class: inverse, middle, center

.huge[
My research: Design and improve tools for (exploratory) data analysis
]

---
# EDA: Exploratory Data Analysis

.large[
> ...EDA is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
(Wikipedia)

John Tukey, Frederick Mosteller, Bill Cleveland, Dianne Cook, Heike Hofmann, Rob Hyndman, Hadley Wickham
]

---
# EDA: Why it's worth it

--

<img src="gifs/dino-saurus.gif" style="display: block; margin: auto;" />

--

From ["Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing"](https://www.autodeskresearch.com/publications/samestats)

---
class: inverse, middle, center

<img src="imgs/hex-visdat-and-naniar.png" width="70%" style="display: block; margin: auto;" />

.pull-left[
.center[
.hugew[
[visdat.njtierney.com](https://visdat.njtierney.com)
]
]
]

.pull-right[
.center[
.hugew[
[naniar.njtierney.com](https://naniar.njtierney.com)
]
]
]

---
## `visdat`

.pull-left[
<img src="imgs/hex-visdat.png" width="231" style="display: block; margin: auto;" />
]

.pull-right[
[](https://github.com/ropensci/onboarding/issues/87)[](http://joss.theoj.org/papers/c85f57adbc565b064fb4bfc9b59a1b2a)[](https://zenodo.org/badge/latestdoi/50553382)

> published under "visdat: Visualising Whole Data Frames" in the Journal of Open Source Software
]

---
## `visdat::vis_dat(airquality)`

<img src="figures/show-visdat-1.png" width="150%" style="display: block; margin: auto;" />

---
## `visdat::vis_miss(airquality)`

<img src="figures/show-vis-miss-1.png" width="150%" style="display: block; margin: auto;" />

---
## `naniar`

.left-code[
<img src="imgs/hex-naniar.png" width="231" style="display: block; margin: auto;" />
]

.right-plot.large[
Tierney, NJ. Cook, D. "Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations."
[[Pre-print](https://arxiv.org/abs/1809.02264)]
]

---
## `naniar::gg_miss_var(airquality)`

<img src="figures/gg-miss-var-1.png" width="150%" style="display: block; margin: auto;" />

---
### `naniar::gg_miss_var(airquality, facet = Month)`

<img src="figures/gg-miss-var-facet-1.png" width="150%" style="display: block; margin: auto;" />

---
## `naniar::gg_miss_upset(airquality)`

<img src="figures/gg-miss-upset-1.png" width="150%" style="display: block; margin: auto;" />

---
class: middle, center, inverse

.vhuge[Current work:]

.huge[
How to explore longitudinal data effectively
]

---
class: inverse, middle

# What is longitudinal data?

.huge[
> Something observed sequentially over time
]

---
# What is longitudinal data?

.large[
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> country </th>
   <th style="text-align:right;"> year </th>
   <th style="text-align:right;"> height_cm </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> Australia </td>
   <td style="text-align:right;"> 1910 </td>
   <td style="text-align:right;"> 172.7 </td>
  </tr>
 </tbody>
</table>
]

---
# What is longitudinal data?

.large[
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> country </th>
   <th style="text-align:right;"> year </th>
   <th style="text-align:right;"> height_cm </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> Australia </td>
   <td style="text-align:right;"> 1910 </td>
   <td style="text-align:right;"> 172.700 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Australia </td>
   <td style="text-align:right;"> 1920 </td>
   <td style="text-align:right;"> 172.846 </td>
  </tr>
 </tbody>
</table>
]

---
# What is longitudinal data?
.large[
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> country </th>
   <th style="text-align:right;"> year </th>
   <th style="text-align:right;"> height_cm </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> Australia </td>
   <td style="text-align:right;"> 1910 </td>
   <td style="text-align:right;"> 172.700 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Australia </td>
   <td style="text-align:right;"> 1920 </td>
   <td style="text-align:right;"> 172.846 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Australia </td>
   <td style="text-align:right;"> 1960 </td>
   <td style="text-align:right;"> 176.300 </td>
  </tr>
 </tbody>
</table>
]

---
# What is longitudinal data?

.large[
<table>
 <thead>
  <tr>
   <th style="text-align:left;"> country </th>
   <th style="text-align:right;"> year </th>
   <th style="text-align:right;"> height_cm </th>
  </tr>
 </thead>
 <tbody>
  <tr>
   <td style="text-align:left;"> Australia </td>
   <td style="text-align:right;"> 1910 </td>
   <td style="text-align:right;"> 172.700 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Australia </td>
   <td style="text-align:right;"> 1920 </td>
   <td style="text-align:right;"> 172.846 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Australia </td>
   <td style="text-align:right;"> 1960 </td>
   <td style="text-align:right;"> 176.300 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Australia </td>
   <td style="text-align:right;"> 1970 </td>
   <td style="text-align:right;"> 178.400 </td>
  </tr>
 </tbody>
</table>
]

---
<img src="figures/reveal-height-1.gif" width="150%" style="display: block; margin: auto;" />

---
<img src="figures/gg-example-1.png" width="150%" style="display: block; margin: auto;" />

---
# All of Australia

<img src="figures/gg-all-australia-1.png" width="936" style="display: block; margin: auto;" />

---
# ...And New Zealand

<img src="figures/gg-show-a-few-countries-1.png" width="936" style="display: block; margin: auto;" />

---
# ... And Afghanistan and Albania

<img src="figures/sample-more-heights-1.png" width="936" style="display: block; margin: auto;" />

---
# And the rest?

<img src="figures/animate-all-data-1.gif" style="display: block; margin: auto;" />

---
# And the rest?

<img src="figures/gg-show-all-1.png" width="936" style="display: block; margin: auto;" />

---
<img src="gifs/noodle-explode.gif" width="50%" style="display: block; margin: auto;" />

---
# Problems:

.pull-left.large[
- Overplotting
- We don't see the individuals
- We could look at 144 individual plots, but this doesn't help.
]

.pull-right[
<img src="figures/gg-heights-heights-again-1.png" width="936" style="display: block; margin: auto;" />
]

---
# Does transparency help?

<img src="figures/gg-show-all-w-alpha-1.png" width="936" style="display: block; margin: auto;" />

---
# Does transparency + a model help?

<img src="figures/gg-show-all-w-model-1.png" width="936" style="display: block; margin: auto;" />

???

- This helps reduce the overplotting
- We only get the overall average. We don't get the rest of the information

---
# This is still useful

<img src="figures/show-height-lm-1.png" width="936" style="display: block; margin: auto;" />

???

- We get information on what the average is, and how that behaves
- But we don't get the full story
- So, it depends on your need. If you have a designed experiment, where you stated that you would run some analysis, then you are doing this.
- ... But even then, wouldn't you rather _explore_ the data?
- Who fits the models well / worst / best?

---
# But we forget about the **individuals**

<img src="figures/heights-dec-1.png" width="936" style="display: block; margin: auto;" />

???

- The model might make some good overall predictions
- But it can be really _ill suited_ for some individuals
- Exploring this is somewhat clumsy - we need another way to explore

---
# How do we get the most out of this plot?
<img src="figures/gg-height-highlight-2-1.png" width="936" style="display: block; margin: auto;" />

---
# How do I even get started?!

<img src="figures/gg-height-highlight-3-1.png" width="936" style="display: block; margin: auto;" />

---
class: inverse, middle

.large[
Problem #1: How do I look at **some** of the data?
]

--

.large[
Problem #2: How do I find **interesting** observations?
]

--

.large[
Problem #3: How do I **understand** my statistical model
]

---
# Introducing `brolgar`: brolgar.njtierney.com

.pull-left.large[
* **br**owsing
* **o**ver
* **l**ongitudinal data
* **g**raphically, and
* **a**nalytically, in
* **r**
]

.pull-right[
<img src="imgs/brolga-bird.jpg" width="569" style="display: block; margin: auto;" />
]

???

* It's a crane, it fishes, and it's a native Australian bird

---
<img src="figures/gg-remind-spaghetti-1.png" width="200%" style="display: block; margin: auto;" />

---
class: inverse, middle, center

# What is longitudinal data?

.vlarge[
> Something observed sequentially over time
]

---
class: inverse, middle, center

# What is longitudinal data?

.vlarge[
> ~~Something~~ **Anything that is** observed sequentially over time **is a time series**
]

---
class: inverse, middle, center

# ~~What is longitudinal data?~~ Longitudinal data is a time series.

.vlarge[
> ~~Something~~ **Anything that is** observed sequentially over time **is a time series**
]

.large[
[-- Rob Hyndman and George Athanasopoulos, Forecasting: Principles and Practice](https://otexts.com/fpp2/data-methods.html)
]

---
# Longitudinal data as a time series <img src="https://tsibble.tidyverts.org/reference/figures/logo.png" align="right" height=140/>

```r
heights <- as_tsibble(heights,
                      index = year,
                      key = country,
*                     regular = FALSE)
```

1. **index**: Your time variable
2. **key**: Variable(s) defining individual groups (or series)

`1. + 2.` determine distinct rows in a tsibble.
(From Earo Wang's talk: [Melt the clock](https://slides.earo.me/rstudioconf19/#8))

---
# Longitudinal data as a time series <img src="https://tsibble.tidyverts.org/reference/figures/logo.png" align="right" height=140/>

## Key Concepts:

.large[
Record important time series information once, and use it many times in other places

- We add information about **index** + **key**:
- Index = Year
- Key = Country
]

---
.large[
```
## # A tsibble: 1,490 x 3 [!]
## # Key:       country [144]
##   country      year height_cm
##   <chr>       <dbl>     <dbl>
## 1 Afghanistan  1870      168.
## 2 Afghanistan  1880      166.
## 3 Afghanistan  1930      167.
## 4 Afghanistan  1990      167.
## 5 Afghanistan  2000      161.
## 6 Albania      1880      170.
## # … with 1,484 more rows
```
]

---
class: inverse, middle, center

.huge[
Remember: **key** = variable(s) defining individual groups (or series)
]

---
# Problem #1: How do I look at **some** of the data?

--

.pull-left.large[
Look at only a sample of the data:
]

.pull-right.large[
<img src="figures/ggplot-sample-keys-1.png" width="936" style="display: block; margin: auto;" />
]

---
# Sample `n` rows with `sample_n()`

--

```r
heights %>% sample_n(5)
```

--

```
## # A tsibble: 5 x 3 [!]
## # Key:       country [5]
##   country           year height_cm
##   <chr>            <dbl>     <dbl>
## 1 Cambodia          1860      165.
## 2 Bolivia           1890      164.
## 3 Macedonia         1930      169.
## 4 United States     1920      173.
## 5 Papua New Guinea  1880      152.
```

---
# Sample `n` rows with `sample_n()`

<img src="figures/plot-sample-n-1.png" width="936" style="display: block; margin: auto;" />

---
# Sample `n` rows with `sample_n()`

```
## # A tsibble: 5 x 3 [!]
## # Key:       country [5]
##   country           year height_cm
##   <chr>            <dbl>     <dbl>
## 1 Cambodia          1860      165.
## 2 Bolivia           1890      164.
## 3 Macedonia         1930      169.
## 4 United States     1920      173.
## 5 Papua New Guinea  1880      152.
```

--

.large[
... sampling needs to select not random rows of the data, but the **keys - the countries**.
]

---
# `sample_n_keys()` to sample ... **keys**

```r
sample_n_keys(heights, 5)
```

```
## # A tsibble: 56 x 3 [!]
## # Key:       country [5]
##   country  year height_cm
##   <chr>   <dbl>     <dbl>
## 1 Hungary  1730      167.
## 2 Hungary  1740      168.
## 3 Hungary  1750      167.
## 4 Hungary  1760      167
## 5 Hungary  1770      162.
## 6 Hungary  1780      163.
## # … with 50 more rows
```

---
# `sample_n_keys()` to sample ... **keys**

<img src="figures/ggplot-sample-keys-2-1.png" width="936" style="display: block; margin: auto;" />

---
# Problem #1: How do I look at **some** of the data?

.large.left[
~~Look at subsamples~~
]

.large.right[
Sample **keys**
]

--

.large.left[
Look at **many** subsamples
]

--

.large.right[
**?**
]

---
# Look at many subsamples

<img src="figures/all-heights-1.png" width="936" style="display: block; margin: auto;" />

---
# Look at many subsamples

<img src="figures/all-heights-samples-1.png" width="936" style="display: block; margin: auto;" />

---
# Look at many subsamples

<img src="figures/heights-strata-1.png" width="936" style="display: block; margin: auto;" />

---
# **How** to look at many subsamples

.large[
- How many facets to look at? (2, 4, ... 16?)
]

--

.large[
- How many keys per facet?
- 144 keys into 16 facets = 9 each
]

--

.large[
- Randomly pick 16 groups of size 9.
]

--

.large[
- This might not look like much extra work, but it hits the **distraction threshold** quite quickly.
]

---
# Distraction threshold (time to rabbit hole)

--

.large[
(Something I made up)
]

--

.large[
> If you have to solve 3+ substantial smaller problems in order to solve a larger problem, your focus shifts from the current goal to something else. You are distracted.
]

???

- Task one
- Task one being overshadowed slightly by minor task 1
- Task one being overshadowed slightly by minor task 2
- Task one being overshadowed slightly by minor task 3

---
# Distraction threshold (time to rabbit hole)

**I want to look at many subsamples of the data**

--

How many keys are there?
--

How many facets do I want to look at?

--

How many keys per facet should I look at?

--

How do I ensure there are the same number of keys per plot?

--

What are `rep`, `rep.int`, and `rep_len`?

--

Do I want `length.out` or `times`?

---
# Avoiding the rabbit hole

--

.large[
We can blame ourselves when we are distracted for not being better.
]

--

.large[
It's not that we should be better, rather **with better tools we could be more efficient**.
]

--

.large[
We need to make things **as easy as reasonable**, with the least amount of distraction.
]

---
# Removing the distraction threshold means asking the most relevant question

--

.large[
> How many plots do I want to look at?
]

--

```r
heights_plot +
  facet_sample(
*    n_per_facet = 3,
*    n_facets = 9
  )
```

---
<img src="figures/show-facet-sample-print-1.png" width="100%" style="display: block; margin: auto;" />

---
# `facet_sample()`: See more individuals

```r
ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line()
```

<img src="figures/gg-facet-sample-all-1.png" width="60%" style="display: block; margin: auto;" />

---
# `facet_sample()`: See more individuals

```r
ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
*  facet_sample()
```

---
# `facet_sample()`: See more individuals

<img src="figures/print-gg-facet-sample-1.png" width="936" style="display: block; margin: auto;" />

---
# `facet_strata()`: See all individuals

```r
ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
*  facet_strata()
```

---
# `facet_strata()`: See all individuals

<img src="figures/print-gg-facet-strata-1.png" width="936" style="display: block; margin: auto;" />

---
## Can we re-order these facets in a meaningful way?

<img src="figures/print-gg-facet-strata-again-1.png" width="936" style="display: block; margin: auto;" />

???
In asking these questions we can solve something else interesting

---
## `facet_strata(along = -year)`: see all individuals **along** some variable

```r
ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
*  facet_strata(along = -year)
```

---
## `facet_strata(along = -year)`: see all individuals **along** some variable

<img src="figures/print-gg-facet-strata-along-1.png" width="936" style="display: block; margin: auto;" />

---
# Focus on answering relevant questions instead of the minutiae:

.pull-left[
"How many lines per facet?"

"How many facets?"

```r
ggplot +
  facet_sample(
*    n_per_facet = 10,
*    n_facets = 12
  )
```
]

--

.pull-right[
"How many facets to shove all the data in?"

```r
ggplot +
  facet_strata(
*    n_strata = 10
  )
```
]

---
# `facet_strata()` & `facet_sample()` Under the hood

.large[
using `sample_n_keys()` & `stratify_keys()`
]

--

.large[
You can still get at data and do manipulations
]

---
## Problem #1: How do I look at some of the data?

--

.left-code.large[
`as_tsibble()`

`sample_n_keys()`

`facet_sample()`

`facet_strata()`
]

--

.right-plot.large[
Store useful information

View subsamples of data

View many subsamples

View all subsamples
]

---
## ~~Problem #1: How do I look at some of the data?~~

.left-code.large[
`as_tsibble()`

`sample_n_keys()`

`facet_sample()`

`facet_strata()`
]

.right-plot.large[
Store useful information

View subsamples of data

View many subsamples

View all subsamples
]

---
## Problem #2: How do I find **interesting** observations?

<img src="figures/quite-interesting-obs-1.png" width="936" style="display: block; margin: auto;" />

---
## Problem #2: How do I find **interesting** observations?

<img src="figures/quite-interesting-obs-2-1.png" width="936" style="display: block; margin: auto;" />

---
class: inverse, center, middle

.huge[
**Define** interesting?
]

---
## Identify features: summarise down to one observation

<img src="figures/anim-line-flat-max-1.gif" style="display: block; margin: auto;" />

---
## Identify features: summarise down to one observation

<img src="figures/show-line-range-point-1.png" width="936" style="display: block; margin: auto;" />

---
## Identify features: summarise down to one observation

<img src="figures/gg-show-point-1.png" width="936" style="display: block; margin: auto;" />

---
## Identify important features and decide how to filter

<img src="figures/gg-show-red-points-1.png" width="936" style="display: block; margin: auto;" />

---
## Identify important features and decide how to filter

<img src="figures/gg-just-red-points-1.png" width="936" style="display: block; margin: auto;" />

---
## Join this feature back to the data

<img src="figures/gg-join-red-1.png" width="936" style="display: block; margin: auto;" />

---
## Join this feature back to the data

<img src="figures/gg-join-red-show-all-1.png" width="936" style="display: block; margin: auto;" />

---
## 🎉 Countries with smallest and largest max height

<img src="figures/show-red-all-again-1.png" width="936" style="display: block; margin: auto;" />

---
class: inverse, middle, center

.vhuge[
Let's see that **one more time**, but with the data
]

---
## Identify features: summarise down to one observation

```
## # A tsibble: 1,490 x 3 [!]
## # Key:       country [144]
##    country      year height_cm
##    <chr>       <dbl>     <dbl>
##  1 Afghanistan  1870      168.
##  2 Afghanistan  1880      166.
##  3 Afghanistan  1930      167.
##  4 Afghanistan  1990      167.
##  5 Afghanistan  2000      161.
##  6 Albania      1880      170.
##  7 Albania      1890      170.
##  8 Albania      1900      169.
##  9 Albania      2000      168.
## 10 Algeria      1910      169.
## # … with 1,480 more rows
```

---
## Identify features: summarise down to one observation

```
## # A tibble: 144 x 6
##    country       min   q25   med   q75   max
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Afghanistan  161.  164.  167.  168.  168.
##  2 Albania      168.  168.  170.  170.  170.
##  3 Algeria      166.  168.  169   170.  171.
##  4 Angola       159.  160.  167.  168.  169.
##  5 Argentina    167.  168.  168.  170.  174.
##  6 Armenia      164.  166.  169.  172.  172.
##  7 Australia    170   171.  172.  173.  178.
##  8 Austria      162.  164.  167.  169.  179.
##  9 Azerbaijan   170.  171.  172.  172.  172.
## 10 Bahrain      161.  161.  164.  164.  164
## # … with 134 more rows
```

---
## Identify important features and decide how to filter

```r
# heights_five holds the per-country five-number summary shown on the previous slide
heights_five %>%
  filter(max == max(max) | max == min(max))
```

```
## # A tibble: 2 x 6
##   country            min   q25   med   q75   max
##   <chr>            <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Denmark           165.  168.  170.  178.  183.
## 2 Papua New Guinea  152.  152.  156.  160.  161.
```

---
## Join summaries back to data

```r
heights_five %>%
  filter(max == max(max) | max == min(max)) %>%
  left_join(heights, by = "country")
```

```
## # A tibble: 21 x 8
##    country   min   q25   med   q75   max  year height_cm
##    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
##  1 Denmark  165.  168.  170.  178.  183.  1820      167.
##  2 Denmark  165.  168.  170.  178.  183.  1830      165.
##  3 Denmark  165.  168.  170.  178.  183.  1850      167.
##  4 Denmark  165.  168.  170.  178.  183.  1860      168.
##  5 Denmark  165.  168.  170.  178.  183.  1870      168.
##  6 Denmark  165.  168.  170.  178.  183.  1880      170.
##  7 Denmark  165.  168.  170.  178.  183.  1890      169.
##  8 Denmark  165.  168.  170.  178.  183.  1900      170.
##  9 Denmark  165.  168.  170.  178.  183.  1910      170
## 10 Denmark  165.  168.  170.  178.  183.  1920      174.
## # … with 11 more rows
```

---
<img src="gifs/dog-solve-problem.gif" width="70%" style="display: block; margin: auto;" />

---
## Identify features: one per **key** <img src="https://feasts.tidyverts.org/reference/figures/logo.png" align="right" height=140/>

```r
heights %>%
*  features(height_cm,
*           feat_five_num)
```

```
## # A tibble: 144 x 6
##   country       min   q25   med   q75   max
##   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan  161.  164.  167.  168.  168.
## 2 Albania      168.  168.  170.  170.  170.
## 3 Algeria      166.  168.  169   170.  171.
## 4 Angola       159.  160.  167.  168.  169.
## 5 Argentina    167.  168.  168.  170.  174.
## 6 Armenia      164.  166.  169.  172.  172.
## # … with 138 more rows
```

---
## features: Summaries that are **aware of data structure** <img src="https://feasts.tidyverts.org/reference/figures/logo.png" align="right" height=140/>

--

```r
heights %>%
  features(height_cm, #<< # variable we want to summarise
           feat_five_num) #<< # feature to calculate
```

--

```
## # A tibble: 144 x 6
##   country       min   q25   med   q75   max
##   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan  161.  164.  167.  168.  168.
## 2 Albania      168.  168.  170.  170.  170.
## 3 Algeria      166.  168.  169   170.  171.
## 4 Angola       159.  160.  167.  168.  169.
## 5 Argentina    167.  168.  168.  170.  174.
## 6 Armenia      164.  166.  169.  172.  172.
## # … with 138 more rows
```

---
class: middle, center

# Other available `features()` in `brolgar`

---
## What is the range of the data? `feat_ranges`

```r
heights %>%
  features(height_cm, feat_ranges)
```

```
## # A tibble: 144 x 5
##    country       min   max range_diff   iqr
##    <chr>       <dbl> <dbl>      <dbl> <dbl>
##  1 Afghanistan  161.  168.       7     3.27
##  2 Albania      168.  170.       2.20  1.53
##  3 Algeria      166.  171.       5.06  2.15
##  4 Angola       159.  169.      10.5   7.87
##  5 Argentina    167.  174.       7     2.21
##  6 Armenia      164.  172.       8.82  5.30
##  7 Australia    170   178.       8.4   2.58
##  8 Austria      162.  179.      17.2   5.35
##  9 Azerbaijan   170.  172.       1.97  1.12
## 10 Bahrain      161.  164        3.3   2.75
## # … with 134 more rows
```

---
## Does it only increase or decrease? `feat_monotonic`

```r
heights %>%
  features(height_cm, feat_monotonic)
```

```
## # A tibble: 144 x 5
##    country     increase decrease unvary monotonic
##    <chr>       <lgl>    <lgl>    <lgl>  <lgl>
##  1 Afghanistan FALSE    FALSE    FALSE  FALSE
##  2 Albania     FALSE    TRUE     FALSE  TRUE
##  3 Algeria     FALSE    FALSE    FALSE  FALSE
##  4 Angola      FALSE    FALSE    FALSE  FALSE
##  5 Argentina   FALSE    FALSE    FALSE  FALSE
##  6 Armenia     FALSE    FALSE    FALSE  FALSE
##  7 Australia   FALSE    FALSE    FALSE  FALSE
##  8 Austria     FALSE    FALSE    FALSE  FALSE
##  9 Azerbaijan  FALSE    FALSE    FALSE  FALSE
## 10 Bahrain     TRUE     FALSE    FALSE  TRUE
## # … with 134 more rows
```

---
## What is the spread of my data? `feat_spread`

```r
heights %>%
  features(height_cm, feat_spread)
```

```
## # A tibble: 144 x 5
##    country        var    sd   mad   iqr
##    <chr>        <dbl> <dbl> <dbl> <dbl>
##  1 Afghanistan  7.20  2.68  1.65   3.27
##  2 Albania      0.950 0.975 0.667  1.53
##  3 Algeria      3.30  1.82  0.741  2.15
##  4 Angola      16.9   4.12  3.11   7.87
##  5 Argentina    2.89  1.70  1.36   2.21
##  6 Armenia     10.6   3.26  3.60   5.30
##  7 Australia    7.63  2.76  1.66   2.58
##  8 Austria     26.6   5.16  3.93   5.35
##  9 Azerbaijan   0.516 0.718 0.621  1.12
## 10 Bahrain      3.42  1.85  0.297  2.75
## # … with 134 more rows
```

---
## features: MANY more features in `feasts` <img src="https://feasts.tidyverts.org/reference/figures/logo.png" align="right" height=140/>

.large[
Such as:

- `feat_acf`: autocorrelation-based features
- `feat_stl`: STL (Seasonal, Trend, and Remainder by LOESS) decomposition
]

---
## features: what is `feat_five_num`? <img src="https://feasts.tidyverts.org/reference/figures/logo.png" align="right" height=140/>

--

```r
feat_five_num
```

```
## function(x, ...) {
##   list(
##     min = b_min(x, ...),
##     q25 = b_q25(x, ...),
##     med = b_median(x, ...),
##     q75 = b_q75(x, ...),
##     max = b_max(x, ...)
##   )
## }
## <bytecode: 0x7fa11b5b9d28>
## <environment: namespace:brolgar>
```

---
## features: what is `feat_five_num`? <img src="https://feasts.tidyverts.org/reference/figures/logo.png" align="right" height=140/>

.huge[
[Create your own features (from brolgar.njtierney.com)](http://brolgar.njtierney.com/articles/finding-features.html#creating-your-own-features)
]

---
# ~~Problem #1: How do I look at **some** of the data?~~

--

# Problem #2: How do I find **interesting** observations?

--

.large[
- Decide what features are interesting
- Summarise down to one observation
- Decide how to filter
- Join this feature back to the data
]

---
# ~~Problem #1: How do I look at **some** of the data?~~

# ~~Problem #2: How do I find **interesting** observations?~~

--

# Problem #3: How do I **understand** my statistical model

---
# Problem #3: How do I **understand** my statistical model

.medium[
Let's fit a simple mixed effects model to the data

Fixed effect of year + Random intercept for country
]

```r
# lmer() comes from lme4; add_predictions() and add_residuals() come from modelr
heights_fit <- lmer(height_cm ~ year + (1|country), heights)
heights_aug <- heights %>%
  add_predictions(heights_fit, var = "pred") %>%
  add_residuals(heights_fit, var = "res")
```

---
# Problem #3: How do I **understand** my statistical model

```
## # A tsibble: 1,490 x 5 [!]
## # Key:       country [144]
##    country      year height_cm  pred    res
##    <chr>       <dbl>     <dbl> <dbl>  <dbl>
##  1 Afghanistan  1870      168.  164.  4.59
##  2 Afghanistan  1880      166.  164.  1.52
##  3 Afghanistan  1930      167.  166.  0.823
##  4 Afghanistan  1990      167.  168. -1.04
##  5 Afghanistan  2000      161.  169. -7.10
##  6 Albania      1880      170.  168.  2.39
##  7 Albania      1890      170.  168.  1.73
##  8 Albania      1900      169.  168.  0.769
##  9 Albania      2000      168.  172. -4.14
## 10 Algeria      1910      169.  168.  1.28
## # … with 1,480 more rows
```

---
# Problem #3: How do I **understand** my statistical model

<img src="figures/fits-1.png" width="936" style="display: block; margin: auto;" />

---
## Look at subsamples?
--

<img src="figures/subsamples-heights-1.png" width="936" style="display: block; margin: auto;" />

---
## Look at **many** subsamples?

--

```r
gg_heights_fit + facet_sample()
```

<img src="figures/heights-fit-facet-sample-1.png" width="70%" style="display: block; margin: auto;" />

---
## Look at **all** subsamples?

```r
gg_heights_fit + facet_strata()
```

<img src="figures/heights-fit-facet-strata-1.png" width="70%" style="display: block; margin: auto;" />

---
## Look at **all** subsamples **along** residuals?

```r
gg_heights_fit + facet_strata(along = -res)
```

<img src="figures/heights-fit-facet-strata-res-1.png" width="70%" style="display: block; margin: auto;" />

---
# Look at the predictions with the data?

```r
# Note: 2019-11-13 is evaluated as arithmetic, so the seed is 1995
set.seed(2019-11-13)
heights_sample <- heights_aug %>%
  sample_n_keys(size = 9) %>% #<< sample the data
  ggplot(aes(x = year,
             y = pred,
             group = country)) +
  geom_line() +
  facet_wrap(~country)
heights_sample
```

---
# Look at the predictions with the data?

<img src="figures/small-sample-out-1.png" width="100%" style="display: block; margin: auto;" />

---
# Look at the predictions with the data?

```r
heights_sample +
  geom_point(aes(y = height_cm))
```

<img src="figures/small-sample-add-real-data-1.png" width="70%" style="display: block; margin: auto;" />

---
# What if we grabbed a sample of those who have the best, middle, and worst residuals?

--

```r
summary(heights_aug$res)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -8.1707 -1.6202 -0.1558  0.0000  1.3545 12.1729
```

--

.large[
Which countries are **nearest** to these statistics?
]

---
# Use `keys_near()`

.pull-left[
```r
heights_aug %>%
  keys_near(key = country,
            var = res)
```

```
## # A tibble: 6 x 5
##   country       res stat  stat_value stat_diff
##   <chr>       <dbl> <fct>      <dbl>     <dbl>
## 1 Ireland    -8.17  min       -8.17   0
## 2 Azerbaijan -1.62  q_25      -1.62   0.000269
## 3 Laos       -0.157 med       -0.156  0.00125
## 4 Mongolia   -0.155 med       -0.156  0.00125
## 5 Egypt       1.35  q_75       1.35   0.000302
## 6 Poland     12.2   max       12.2    0
```
]

.pull-right[
> This shows us the keys that closely match the five number summary.
]

---
## Show data by joining it with residuals, to explore spread

<img src="figures/plot-join-aug-1.png" width="936" style="display: block; margin: auto;" />

---
# Take homes

## Problem #1: How do I look at **some** of the data?

.large[
1. Longitudinal data is a time series
2. Specify structure once, get a free lunch.
3. Look at as much of the raw data as possible
4. Use `facet_sample()` / `facet_strata()` to look at data
]

---
# Take homes

## Problem #2: How do I find **interesting** observations?

.large[
1. Decide what features are interesting
2. Summarise down to one observation
3. Decide how to filter
4. Join this feature back to the data
]

---
# Take homes

## Problem #3: How do I **understand** my statistical model

.large[
1. Look at (one, more or all!) subsamples
1. Arrange subsamples
1. Find keys near some summary
1. Join keys to data to explore representatives
]

---
# Thanks

.large[
- Di Cook
- Tania Prvan
- Stuart Lee
- Mitchell O'Hara-Wild
- Earo Wang
- Rob Hyndman
- Miles McBain
- Monash University
]

---
# Resources

.large[
- [feasts](http://feasts.tidyverts.org/)
- [tsibble](http://tsibble.tidyverts.org/)
- [Time series graphics using feasts](https://robjhyndman.com/hyndsight/feasts/)
- [Feature-based time series analysis](https://robjhyndman.com/hyndsight/fbtsa/)
]

---
# Colophon

.large[
- Slides made using [xaringan](https://github.com/yihui/xaringan)
- Extended with [xaringanthemer](https://github.com/gadenbuie/xaringanthemer)
- Colours taken + modified from [lorikeet theme from ochRe](https://github.com/ropenscilabs/ochRe)
- Header font is **Josefin Sans**
- Body text font is **Montserrat**
- Code font is **Fira Mono**
]

---
# Learning more

.large[
[brolgar.njtierney.com](http://brolgar.njtierney.com/)
[bit.ly/njt-ozvis](https://bit.ly/njt-ozvis)
nj_tierney
njtierney
nicholas.tierney@gmail.com
]

---
.vhuge[
**End.**
]

???

- now let's go through these same principles:
- Sample the fits
- Many samples of the fits
- Explore the residuals
- Find the best, worst, middle of the ground residuals
- follow the "gliding" process again.
- Which are most similar to which stats?
- "who is similar to me?"
- "who is the most average?"
- "who is the most extreme?"
- "Who is the most different to me?"

# EDA: Why it's worth it

Anscombe's quartet
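---
# Appendix: summarise, filter, join (a base R sketch)

The four steps for finding interesting observations (decide on a feature, summarise down to one observation per key, filter, join back) can be sketched on a small example. This is a minimal illustration only: the `toy` data frame, its countries and its heights are invented, and the code uses base R rather than showing how `brolgar` or `feasts` implement `features()`.

```r
# Toy longitudinal data (invented values): one row per country and year.
toy <- data.frame(
  country   = rep(c("A", "B", "C"), each = 3),
  year      = rep(c(1900, 1950, 2000), times = 3),
  height_cm = c(165, 168, 172, 150, 152, 155, 170, 175, 183)
)

# 1. Summarise down to one observation per key: the maximum height.
toy_max <- aggregate(height_cm ~ country, data = toy, FUN = max)
names(toy_max)[names(toy_max) == "height_cm"] <- "max"

# 2. Decide how to filter: keep the smallest and the largest maximum.
toy_extreme <- toy_max[toy_max$max %in% range(toy_max$max), ]

# 3. Join the feature back to the data to recover every row for those keys.
toy_joined <- merge(toy_extreme, toy, by = "country")
toy_joined
```

Swapping `max` for another summary function changes what counts as "interesting"; the filter and join steps stay the same.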