(My) Background
I want to talk a bit about where I've started from, because I think it might be useful to understand my perspective, and why I'm interested in doing these things.
Undergraduate in Psychology
Psychophysics: illusory contours in 3D
Phil Grove
Christina Lee
If every psychologist in the world delivered gold standard smoking cessation therapy, the rate of smoking would still increase. You need to change policy to make change. To make effective policy, you need to have good data, and do good statistics.
I discovered an interest in public health and statistics.
Kerrie Mengersen
I started a PhD in statistics at QUT, under (now Distinguished) Professor Kerrie Mengersen, looking at people's health over time.
A lot of research in new statistical methods - imputation, inference, prediction
Not much research on how we explore our data, and the methods that we use to do this.
Focus on building a bridge across a river. Less focus on how it is built, and the tools used.
My research:
Design and improve tools for (exploratory) data analysis
...EDA is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. (Wikipedia)
John Tukey, Frederick Mosteller, Bill Cleveland, Dianne Cook, Heike Hofmann, Rob Hyndman, Hadley Wickham
visdat::vis_dat(airquality)
visdat::vis_miss(airquality)
naniar
Tierney, NJ. Cook, D. "Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations." [Pre-print]
naniar::gg_miss_var(airquality)
naniar::gg_miss_var(airquality, facet = Month)
naniar::gg_miss_upset(airquality)
Current work:
How to explore longitudinal data effectively
Something observed sequentially over time
country | year | height_cm
--- | --- | ---
Australia | 1910 | 172.700
Australia | 1920 | 172.846
Australia | 1960 | 176.300
Australia | 1970 | 178.400
Problem #1: How do I look at some of the data?
Problem #2: How do I find interesting observations?
Problem #3: How do I understand my statistical model?
brolgar
brolgar: brolgar.njtierney.com

Anything that is observed sequentially over time is a time series.
heights <- as_tsibble(heights, index = year, key = country, regular = FALSE)
Together, the key and the index determine distinct rows in a tsibble.
(From Earo Wang's talk: Melt the clock)
Record important time series information once, and use it many times in other places
## # A tsibble: 1,490 x 3 [!]
## # Key:       country [144]
##    country      year height_cm
##    <chr>       <dbl>     <dbl>
##  1 Afghanistan  1870      168.
##  2 Afghanistan  1880      166.
##  3 Afghanistan  1930      167.
##  4 Afghanistan  1990      167.
##  5 Afghanistan  2000      161.
##  6 Albania      1880      170.
## # … with 1,484 more rows
Remember:
key = variable(s) defining individual groups (or series)
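As a minimal sketch (assuming the tsibble package and the heights tsibble created above), you can query the recorded key and index at any time, rather than re-stating them in every call:

```r
library(tsibble)

# key_vars() and index_var() report what was recorded at creation;
# n_keys() counts the distinct series (here, the countries).
key_vars(heights)   # "country"
index_var(heights)  # "year"
n_keys(heights)     # 144
```

This is the "record once, use many times" idea: downstream functions read this metadata instead of asking the user for it again.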
Look at only a sample of the data:
Sample n rows with sample_n():

heights %>% sample_n(5)

## # A tsibble: 5 x 3 [!]
## # Key:       country [5]
##   country           year height_cm
##   <chr>            <dbl>     <dbl>
## 1 Cambodia          1860      165.
## 2 Bolivia           1890      164.
## 3 Macedonia         1930      169.
## 4 United States     1920      173.
## 5 Papua New Guinea  1880      152.
Sampling needs to select not random rows of the data, but random keys - the countries.
Use sample_n_keys() to sample keys:

sample_n_keys(heights, 5)
## # A tsibble: 56 x 3 [!]
## # Key:       country [5]
##   country  year height_cm
##   <chr>   <dbl>     <dbl>
## 1 Hungary  1730      167.
## 2 Hungary  1740      168.
## 3 Hungary  1750      167.
## 4 Hungary  1760      167
## 5 Hungary  1770      162.
## 6 Hungary  1780      163.
## # … with 50 more rows
Look at subsamples: sample keys.

Look at many subsamples: ?
(Something I made up)
If you have to solve 3+ substantial smaller problems in order to solve a larger problem, your focus shifts from the current goal to something else. You are distracted.
(Figure: the main task, slightly overshadowed by a minor sub-task)
I want to look at many subsamples of the data:

How many keys are there?
How many facets do I want to look at?
How many keys per facet should I look at?
How do I ensure there are the same number of keys per plot?
What is rep, rep.int, and rep_len?
Do I want length.out or times?
We can blame ourselves when we are distracted for not being better.

It's not that we should be better; rather, with better tools we could be more efficient.
We need to make things as easy as reasonable, with the least amount of distraction.
How many plots do I want to look at?
heights_plot +
  facet_sample(n_per_facet = 3,
               n_facets = 9)
facet_sample(): see more individuals

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line()

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
  facet_sample()
facet_strata(): see all individuals

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
  facet_strata()
In asking these questions we can solve something else interesting.
facet_strata(along = -year): see all individuals along some variable

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
  facet_strata(along = -year)

"How many lines per facet?" "How many facets?"

ggplot + facet_sample(n_per_facet = 10, n_facets = 12)

"How many facets to shove all the data in?"

ggplot + facet_strata(n_strata = 10)
facet_strata() and facet_sample() use sample_n_keys() and stratify_keys() under the hood.
You can still get at data and do manipulations
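For example, a sketch (assuming brolgar, dplyr, and the heights data from above): stratify_keys() adds a .strata column to the data, which you can then manipulate like any other variable before plotting:

```r
library(brolgar)
library(dplyr)

# stratify_keys() assigns each key (country) to one of n_strata groups,
# adding a .strata column; the result is still a tsibble, so you can
# filter, summarise, or pass it to ggplot2 yourself.
heights %>%
  stratify_keys(n_strata = 4) %>%
  filter(.strata == 1)
```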
as_tsibble(): store useful information
sample_n_keys(): view subsamples of data
facet_sample(): view many subsamples
facet_strata(): view all subsamples
Define interesting?
Let's see that one more time, but with the data
## # A tsibble: 1,490 x 3 [!]
## # Key:       country [144]
##    country      year height_cm
##    <chr>       <dbl>     <dbl>
##  1 Afghanistan  1870      168.
##  2 Afghanistan  1880      166.
##  3 Afghanistan  1930      167.
##  4 Afghanistan  1990      167.
##  5 Afghanistan  2000      161.
##  6 Albania      1880      170.
##  7 Albania      1890      170.
##  8 Albania      1900      169.
##  9 Albania      2000      168.
## 10 Algeria      1910      169.
## # … with 1,480 more rows
## # A tibble: 144 x 6
##    country       min   q25   med   q75   max
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Afghanistan  161.  164.  167.  168.  168.
##  2 Albania      168.  168.  170.  170.  170.
##  3 Algeria      166.  168.  169   170.  171.
##  4 Angola       159.  160.  167.  168.  169.
##  5 Argentina    167.  168.  168.  170.  174.
##  6 Armenia      164.  166.  169.  172.  172.
##  7 Australia    170   171.  172.  173.  178.
##  8 Austria      162.  164.  167.  169.  179.
##  9 Azerbaijan   170.  171.  172.  172.  172.
## 10 Bahrain      161.  161.  164.  164.  164
## # … with 134 more rows
heights_five %>% filter(max == max(max) | max == min(max))
## # A tibble: 2 x 6
##   country            min   q25   med   q75   max
##   <chr>            <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Denmark           165.  168.  170.  178.  183.
## 2 Papua New Guinea  152.  152.  156.  160.  161.
heights_five %>%
  filter(max == max(max) | max == min(max)) %>%
  left_join(heights, by = "country")
## # A tibble: 21 x 8
##    country   min   q25   med   q75   max  year height_cm
##    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
##  1 Denmark  165.  168.  170.  178.  183.  1820      167.
##  2 Denmark  165.  168.  170.  178.  183.  1830      165.
##  3 Denmark  165.  168.  170.  178.  183.  1850      167.
##  4 Denmark  165.  168.  170.  178.  183.  1860      168.
##  5 Denmark  165.  168.  170.  178.  183.  1870      168.
##  6 Denmark  165.  168.  170.  178.  183.  1880      170.
##  7 Denmark  165.  168.  170.  178.  183.  1890      169.
##  8 Denmark  165.  168.  170.  178.  183.  1900      170.
##  9 Denmark  165.  168.  170.  178.  183.  1910      170
## 10 Denmark  165.  168.  170.  178.  183.  1920      174.
## # … with 11 more rows
heights %>% features(height_cm, feat_five_num)
## # A tibble: 144 x 6
##   country       min   q25   med   q75   max
##   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan  161.  164.  167.  168.  168.
## 2 Albania      168.  168.  170.  170.  170.
## 3 Algeria      166.  168.  169   170.  171.
## 4 Angola       159.  160.  167.  168.  169.
## 5 Argentina    167.  168.  168.  170.  174.
## 6 Armenia      164.  166.  169.  172.  172.
## # … with 138 more rows
heights %>%
  features(height_cm,     # variable we want to summarise
           feat_five_num) # feature to calculate

## # A tibble: 144 x 6
##   country       min   q25   med   q75   max
##   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan  161.  164.  167.  168.  168.
## 2 Albania      168.  168.  170.  170.  170.
## 3 Algeria      166.  168.  169   170.  171.
## 4 Angola       159.  160.  167.  168.  169.
## 5 Argentina    167.  168.  168.  170.  174.
## 6 Armenia      164.  166.  169.  172.  172.
## # … with 138 more rows
features() in brolgar
feat_ranges
heights %>% features(height_cm, feat_ranges)
## # A tibble: 144 x 5
##    country       min   max range_diff   iqr
##    <chr>       <dbl> <dbl>      <dbl> <dbl>
##  1 Afghanistan  161.  168.       7     3.27
##  2 Albania      168.  170.       2.20  1.53
##  3 Algeria      166.  171.       5.06  2.15
##  4 Angola       159.  169.      10.5   7.87
##  5 Argentina    167.  174.       7     2.21
##  6 Armenia      164.  172.       8.82  5.30
##  7 Australia    170   178.       8.4   2.58
##  8 Austria      162.  179.      17.2   5.35
##  9 Azerbaijan   170.  172.       1.97  1.12
## 10 Bahrain      161.  164        3.3   2.75
## # … with 134 more rows
feat_monotonic
heights %>% features(height_cm, feat_monotonic)
## # A tibble: 144 x 5
##    country     increase decrease unvary monotonic
##    <chr>       <lgl>    <lgl>    <lgl>  <lgl>
##  1 Afghanistan FALSE    FALSE    FALSE  FALSE
##  2 Albania     FALSE    TRUE     FALSE  TRUE
##  3 Algeria     FALSE    FALSE    FALSE  FALSE
##  4 Angola      FALSE    FALSE    FALSE  FALSE
##  5 Argentina   FALSE    FALSE    FALSE  FALSE
##  6 Armenia     FALSE    FALSE    FALSE  FALSE
##  7 Australia   FALSE    FALSE    FALSE  FALSE
##  8 Austria     FALSE    FALSE    FALSE  FALSE
##  9 Azerbaijan  FALSE    FALSE    FALSE  FALSE
## 10 Bahrain     TRUE     FALSE    FALSE  TRUE
## # … with 134 more rows
feat_spread
heights %>% features(height_cm, feat_spread)
## # A tibble: 144 x 5
##    country        var    sd   mad   iqr
##    <chr>        <dbl> <dbl> <dbl> <dbl>
##  1 Afghanistan  7.20  2.68  1.65   3.27
##  2 Albania      0.950 0.975 0.667  1.53
##  3 Algeria      3.30  1.82  0.741  2.15
##  4 Angola      16.9   4.12  3.11   7.87
##  5 Argentina    2.89  1.70  1.36   2.21
##  6 Armenia     10.6   3.26  3.60   5.30
##  7 Australia    7.63  2.76  1.66   2.58
##  8 Austria     26.6   5.16  3.93   5.35
##  9 Azerbaijan   0.516 0.718 0.621  1.12
## 10 Bahrain      3.42  1.85  0.297  2.75
## # … with 134 more rows
The feasts package provides many more features, such as:

feat_acf: autocorrelation-based features
feat_stl: STL (Seasonal, Trend, and Remainder by LOESS) decomposition
What is feat_five_num?

feat_five_num
## function(x, ...) {
##   list(
##     min = b_min(x, ...),
##     q25 = b_q25(x, ...),
##     med = b_median(x, ...),
##     q75 = b_q75(x, ...),
##     max = b_max(x, ...)
##   )
## }
## <bytecode: 0x7fa11b5b9d28>
## <environment: namespace:brolgar>
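Since a feature is just a function returning a named list of summaries, you can write your own and pass it to features() the same way. A minimal sketch (feat_mid is a made-up name, not part of brolgar):

```r
# A hypothetical feature returning the mean and median of a series.
feat_mid <- function(x, ...) {
  list(
    mean = mean(x, na.rm = TRUE),
    med  = median(x, na.rm = TRUE)
  )
}

# Used exactly like feat_five_num:
heights %>% features(height_cm, feat_mid)
```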
Let's fit a simple mixed effects model to the data
Fixed effect of year + Random intercept for country
heights_fit <- lmer(height_cm ~ year + (1|country), heights)

heights_aug <- heights %>%
  add_predictions(heights_fit, var = "pred") %>%
  add_residuals(heights_fit, var = "res")
## # A tsibble: 1,490 x 5 [!]
## # Key:       country [144]
##    country      year height_cm  pred    res
##    <chr>       <dbl>     <dbl> <dbl>  <dbl>
##  1 Afghanistan  1870      168.  164.  4.59
##  2 Afghanistan  1880      166.  164.  1.52
##  3 Afghanistan  1930      167.  166.  0.823
##  4 Afghanistan  1990      167.  168. -1.04
##  5 Afghanistan  2000      161.  169. -7.10
##  6 Albania      1880      170.  168.  2.39
##  7 Albania      1890      170.  168.  1.73
##  8 Albania      1900      169.  168.  0.769
##  9 Albania      2000      168.  172. -4.14
## 10 Algeria      1910      169.  168.  1.28
## # … with 1,480 more rows
gg_heights_fit + facet_sample()
gg_heights_fit + facet_strata()
gg_heights_fit + facet_strata(along = -res)
set.seed(2019-11-13)

heights_sample <- heights_aug %>%
  sample_n_keys(size = 9) %>% # sample the data
  ggplot(aes(x = year,
             y = pred,
             group = country)) +
  geom_line() +
  facet_wrap(~country)

heights_sample
heights_sample + geom_point(aes(y = height_cm))
summary(heights_aug$res)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -8.1707 -1.6202 -0.1558  0.0000  1.3545 12.1729
Which countries are nearest to these statistics?
keys_near()
heights_aug %>% keys_near(key = country, var = res)
## # A tibble: 6 x 5
##   country        res stat  stat_value stat_diff
##   <chr>        <dbl> <fct>      <dbl>     <dbl>
## 1 Ireland     -8.17  min       -8.17   0
## 2 Azerbaijan  -1.62  q_25      -1.62   0.000269
## 3 Laos        -0.157 med       -0.156  0.00125
## 4 Mongolia    -0.155 med       -0.156  0.00125
## 5 Egypt        1.35  q_75       1.35   0.000302
## 6 Poland      12.2   max       12.2    0
This shows us the keys that closely match the five number summary.
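To see those countries in context, one option (a sketch assuming the heights_aug object and packages loaded above) is to join the keys_near() result back to the data and plot only the matched series:

```r
library(dplyr)
library(ggplot2)

# Keep only the countries flagged by keys_near(), then plot their
# residuals over time, one facet per summary statistic.
heights_aug %>%
  keys_near(key = country, var = res) %>%
  select(country, stat) %>%          # drop res to avoid join-name clashes
  left_join(heights_aug, by = "country") %>%
  ggplot(aes(x = year, y = res, group = country)) +
  geom_line() +
  facet_wrap(~stat)
```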
Use facet_sample() / facet_strata() to look at the data.

End.
Now let's go through these same principles:
Which are most similar to which stats?
Anscombe's quartet