
Exploring and understanding the individual experience from longitudinal data, or…

How to make better spaghetti (plots)

Nicholas Tierney, Monash University

1 / 131

(My) Background

2 / 131

I want to talk a bit about where I've started from, because I think it might be useful to understand my perspective, and why I'm interested in doing these things.

Background: Undergraduate

Undergraduate in Psychology

  • Statistics
  • Experiment Design
  • Cognitive Theory
  • Neurology
  • Humans

3 / 131

Background: Honours

Psychophysics: illusory contours in 3D

Phil Grove

4 / 131

Background: Honours

Christina Lee

If every psychologist in the world delivered gold standard smoking cessation therapy, the rate of smoking would still increase. You need to change policy to make change. To make effective policy, you need to have good data, and do good statistics.

5 / 131

I discovered an interest in public health and statistics.

Background: PhD

Kerrie Mengersen

  • Statistical Approaches to Revealing Structure in Complex Health Data
  • Exploring missing values
  • Bayesian Models of people's health over time
  • Geospatial statistics of cardiac arrest
  • Fun, applied, real data, real people
6 / 131

Background: PhD

  • "Ah, statistics, everything is black and white!
  • "There's always an answer"
  • "data in, answer out"

7 / 131

I started a PhD in statistics at QUT, under (now Distinguished) Professor Kerrie Mengersen, looking at people's health over time.

  • There were several things that I noticed:
    • There were equations, but not as many clear-cut, black-and-white answers

😱 Missing Values

8 / 131

Journey of Analytic Discovery

  • map

Background: PhD

  • Data is really messy
  • Missing values are frustrating
  • How to explore data?

9 / 131

Background: PhD - But in psych

  • Focus on experiment design
  • No focus on exploring data
  • Exploring data felt...wrong?
  • But it was so critical.

10 / 131

(My personal) motivation

A lot of research goes into new statistical methods: imputation, inference, prediction

11 / 131

(My personal) motivation

A lot of research goes into new statistical methods: imputation, inference, prediction

Not much research goes into how we explore our data, and the methods we use to do this.

11 / 131

(My personal) motivation

Focus on building a bridge across a river. Less focus on how it is built, and the tools used.

12 / 131
  • I became very interested in how we explore our data - exploratory data analysis.

My research:

Design and improve tools for (exploratory) data analysis

13 / 131

EDA: Exploratory Data Analysis

...EDA is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. (Wikipedia)

John Tukey, Frederick Mosteller, Bill Cleveland, Dianne Cook, Heike Hofmann, Rob Hyndman, Hadley Wickham

14 / 131

EDA: Why it's worth it

Anscombe's quartet: the same summary statistics, very different data.

15 / 131

16 / 131

visdat

rOpenSci badge | JOSS status | DOI

Published as "visdat: Visualising Whole Data Frames" in the Journal of Open Source Software.

17 / 131

visdat::vis_dat(airquality)

18 / 131

visdat::vis_miss(airquality)

19 / 131

naniar

Tierney, NJ and Cook, D. "Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations." [Pre-print]

20 / 131

naniar::gg_miss_var(airquality)

21 / 131

naniar::gg_miss_var(airquality, facet = Month)

22 / 131

naniar::gg_miss_upset(airquality)

23 / 131

Current work:

How to explore longitudinal data effectively

24 / 131

What is longitudinal data?

Something observed sequentially over time

25 / 131

What is longitudinal data?

country year height_cm
Australia 1910 172.7
26 / 131

What is longitudinal data?

country year height_cm
Australia 1910 172.700
Australia 1920 172.846
27 / 131

What is longitudinal data?

country year height_cm
Australia 1910 172.700
Australia 1920 172.846
Australia 1960 176.300
28 / 131

What is longitudinal data?

country year height_cm
Australia 1910 172.700
Australia 1920 172.846
Australia 1960 176.300
Australia 1970 178.400
29 / 131

30 / 131

31 / 131

All of Australia

32 / 131

...And New Zealand

33 / 131

... And Afghanistan and Albania

34 / 131

And the rest?

35 / 131

And the rest?

36 / 131

37 / 131

Problems:

  • Overplotting
  • We don't see the individuals
  • We could look at 144 individual plots, but this doesn't help.

38 / 131

Does transparency help?

39 / 131

Does transparency + a model help?

40 / 131
  • This helps reduce the overplotting
  • We only get the overall average; we don't get the rest of the information

This is still useful

41 / 131
  • We get information on what the average is, and how it behaves
  • But we don't get the full story
  • So, it depends on your need. If you have a designed experiment, where you stated in advance which analysis you would run, then that is what you do.
  • ...But even then, wouldn't you rather explore the data?
  • Which individuals does the model fit best, and which worst?

But we forget about the individuals

42 / 131
  • The model might make some good overall predictions
  • But it can be really ill-suited to some individuals
  • Exploring this is somewhat clumsy - we need another way to explore

How do we get the most out of this plot?

43 / 131

How do I even get started?!

44 / 131

Problem #1: How do I look at some of the data?

45 / 131

Problem #1: How do I look at some of the data?

Problem #2: How do I find interesting observations?

45 / 131

Problem #1: How do I look at some of the data?

Problem #2: How do I find interesting observations?

Problem #3: How do I understand my statistical model?

45 / 131

Introducing brolgar: brolgar.njtierney.com

  • browsing
  • over
  • longitudinal data
  • graphically, and
  • analytically, in
  • r

46 / 131
  • It's a crane, it fishes, and it's a native Australian bird

47 / 131
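
As a quick setup sketch (not from the slides): brolgar can be installed in the usual way, assuming it is available on CRAN, or as the development version from GitHub at njtierney/brolgar.

# Setup sketch (availability on CRAN / GitHub is an assumption)
# install.packages("brolgar")
# remotes::install_github("njtierney/brolgar")

library(brolgar)
library(tsibble)
library(dplyr)
library(ggplot2)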

What is longitudinal data?

Something observed sequentially over time

48 / 131

What is longitudinal data?

Anything that is observed sequentially over time is a time series

49 / 131

What is longitudinal data? Longitudinal data is a time series.

Anything that is observed sequentially over time is a time series

50 / 131

Longitudinal data as a time series

heights <- as_tsibble(heights,
                      index = year,
                      key = country,
                      regular = FALSE)
  1. index: Your time variable
  2. key: Variable(s) defining individual groups (or series)

1. + 2. determine distinct rows in a tsibble.

(From Earo Wang's talk: Melt the clock)

51 / 131

Longitudinal data as a time series

Key Concepts:

Record important time series information once, and use it many times in other places

  • We add information about index + key:
    • Index = Year
    • Key = Country
52 / 131
## # A tsibble: 1,490 x 3 [!]
## # Key: country [144]
## country year height_cm
## <chr> <dbl> <dbl>
## 1 Afghanistan 1870 168.
## 2 Afghanistan 1880 166.
## 3 Afghanistan 1930 167.
## 4 Afghanistan 1990 167.
## 5 Afghanistan 2000 161.
## 6 Albania 1880 170.
## # … with 1,484 more rows
53 / 131

Remember:

key = variable(s) defining individual groups (or series)

54 / 131
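
A small aside (not in the slides): tsibble has helpers for checking this structure. A minimal sketch, assuming the heights tsibble defined earlier:

library(tsibble)

key_vars(heights)  # which variable defines the series: "country"
n_keys(heights)    # how many series (countries) there are: 144
index_var(heights) # the time index: "year"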

Problem #1: How do I look at some of the data?

55 / 131

Problem #1: How do I look at some of the data?

Look at only a sample of the data:

55 / 131

Sample n rows with sample_n()

56 / 131

Sample n rows with sample_n()

heights %>% sample_n(5)
56 / 131

Sample n rows with sample_n()

heights %>% sample_n(5)
## # A tsibble: 5 x 3 [!]
## # Key: country [5]
## country year height_cm
## <chr> <dbl> <dbl>
## 1 Cambodia 1860 165.
## 2 Bolivia 1890 164.
## 3 Macedonia 1930 169.
## 4 United States 1920 173.
## 5 Papua New Guinea 1880 152.
56 / 131

Sample n rows with sample_n()

57 / 131

Sample n rows with sample_n()

## # A tsibble: 5 x 3 [!]
## # Key: country [5]
## country year height_cm
## <chr> <dbl> <dbl>
## 1 Cambodia 1860 165.
## 2 Bolivia 1890 164.
## 3 Macedonia 1930 169.
## 4 United States 1920 173.
## 5 Papua New Guinea 1880 152.
58 / 131

Sample n rows with sample_n()

## # A tsibble: 5 x 3 [!]
## # Key: country [5]
## country year height_cm
## <chr> <dbl> <dbl>
## 1 Cambodia 1860 165.
## 2 Bolivia 1890 164.
## 3 Macedonia 1930 169.
## 4 United States 1920 173.
## 5 Papua New Guinea 1880 152.

... sampling needs to select not random rows of the data, but the keys - the countries.

58 / 131

sample_n_keys() to sample ... keys

sample_n_keys(heights, 5)
## # A tsibble: 56 x 3 [!]
## # Key: country [5]
## country year height_cm
## <chr> <dbl> <dbl>
## 1 Hungary 1730 167.
## 2 Hungary 1740 168.
## 3 Hungary 1750 167.
## 4 Hungary 1760 167
## 5 Hungary 1770 162.
## 6 Hungary 1780 163.
## # … with 50 more rows
59 / 131

sample_n_keys() to sample ... keys

60 / 131
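
The resulting plot isn't reproduced in this export; a minimal sketch of code that would draw something similar, assuming ggplot2 and the heights tsibble:

library(brolgar)
library(dplyr)
library(ggplot2)

# Sample 5 countries (keys) and plot their full height series
heights %>%
  sample_n_keys(size = 5) %>%
  ggplot(aes(x = year,
             y = height_cm,
             group = country)) +
  geom_line()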

Problem #1: How do I look at some of the data?

Look at subsamples

Sample keys

61 / 131

Problem #1: How do I look at some of the data?

Look at subsamples

Sample keys

Look at many subsamples

61 / 131

Problem #1: How do I look at some of the data?

Look at subsamples

Sample keys

Look at many subsamples

?

61 / 131

Look at many subsamples

62 / 131

Look at many subsamples

63 / 131

Look at many subsamples

64 / 131

How to look at many subsamples

  • How many facets to look at? (2, 4, ... 16?)
65 / 131

How to look at many subsamples

  • How many facets to look at? (2, 4, ... 16?)
  • How many keys per facet?
    • 144 keys into 16 facets = 9 each
65 / 131

How to look at many subsamples

  • How many facets to look at? (2, 4, ... 16?)
  • How many keys per facet?
    • 144 keys into 16 facets = 9 each
  • Randomly pick 16 groups of size 9.
65 / 131

How to look at many subsamples

  • How many facets to look at? (2, 4, ... 16?)
  • How many keys per facet?
    • 144 keys into 16 facets = 9 each
  • Randomly pick 16 groups of size 9.
  • This might not look like much extra work, but it hits the distraction threshold quite quickly.
65 / 131

Distraction threshold (time to rabbit hole)

66 / 131

Distraction threshold (time to rabbit hole)

(Something I made up)

66 / 131

Distraction threshold (time to rabbit hole)

(Something I made up)

If you have to solve 3+ substantial smaller problems in order to solve a larger problem, your focus shifts from the current goal to something else. You are distracted.

66 / 131
  • Task one

  • Task one being overshadowed slightly by minor task 1

  • Task one being overshadowed slightly by minor task 2
  • Task one being overshadowed slightly by minor task 3

Distraction threshold (time to rabbit hole)

I want to look at many subsamples of the data

67 / 131

Distraction threshold (time to rabbit hole)

I want to look at many subsamples of the data

How many keys are there?

67 / 131

Distraction threshold (time to rabbit hole)

I want to look at many subsamples of the data

How many keys are there?

How many facets do I want to look at?

67 / 131

Distraction threshold (time to rabbit hole)

I want to look at many subsamples of the data

How many keys are there?

How many facets do I want to look at?

How many keys per facet should I look at?

67 / 131

Distraction threshold (time to rabbit hole)

I want to look at many subsamples of the data

How many keys are there?

How many facets do I want to look at?

How many keys per facet should I look at?

How do I ensure there are the same number of keys per plot?

67 / 131

Distraction threshold (time to rabbit hole)

I want to look at many subsamples of the data

How many keys are there?

How many facets do I want to look at?

How many keys per facet should I look at?

How do I ensure there are the same number of keys per plot?

What is rep, rep.int, and rep_len?

67 / 131

Distraction threshold (time to rabbit hole)

I want to look at many subsamples of the data

How many keys are there?

How many facets do I want to look at?

How many keys per facet should I look at?

How do I ensure there are the same number of keys per plot?

What is rep, rep.int, and rep_len?

Do I want length.out or times?

67 / 131
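
For contrast, a rough sketch (not brolgar code) of that manual rabbit hole: assigning the 144 countries to 16 random groups of about 9 by hand, with rep() and length.out, before you can even facet.

library(dplyr)
library(ggplot2)

set.seed(2019) # arbitrary seed, just for the sketch

# Assign each country to one of 16 facet groups of roughly equal size
country_groups <- heights %>%
  as_tibble() %>%
  distinct(country) %>%
  mutate(group = sample(rep(1:16, length.out = n())))

heights %>%
  left_join(country_groups, by = "country") %>%
  ggplot(aes(x = year,
             y = height_cm,
             group = country)) +
  geom_line() +
  facet_wrap(~group)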

Avoiding the rabbit hole

68 / 131

Avoiding the rabbit hole

When we get distracted, we can blame ourselves for not being better.

68 / 131

Avoiding the rabbit hole

When we get distracted, we can blame ourselves for not being better.

It's not that we should be better; rather, with better tools we could be more efficient.

68 / 131

Avoiding the rabbit hole

When we get distracted, we can blame ourselves for not being better.

It's not that we should be better; rather, with better tools we could be more efficient.

We need to make things as easy as is reasonable, with the least amount of distraction.

68 / 131

Removing the distraction threshold means asking the most relevant question

69 / 131

Removing the distraction threshold means asking the most relevant question

How many plots do I want to look at?

69 / 131

Removing the distraction threshold means asking the most relevant question

How many plots do I want to look at?

heights_plot +
  facet_sample(
    n_per_facet = 3,
    n_facets = 9
  )
69 / 131

70 / 131

facet_sample(): See more individuals

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line()

71 / 131

facet_sample(): See more individuals

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
  facet_sample()
72 / 131

facet_sample(): See more individuals

73 / 131

facet_strata(): See all individuals

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
  facet_strata()
74 / 131

facet_strata(): See all individuals

75 / 131

Can we re-order these facets in a meaningful way?

76 / 131

In asking these questions we can solve something else interesting

facet_strata(along = -year): see all individuals along some variable

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
  facet_strata(along = -year)
77 / 131

facet_strata(along = -year): see all individuals along some variable

78 / 131

Focus on answering relevant questions instead of the minutiae:

"How many lines per facet?"

"How many facets?"

ggplot +
  facet_sample(
    n_per_facet = 10,
    n_facets = 12
  )
79 / 131

Focus on answering relevant questions instead of the minutiae:

"How many lines per facet?"

"How many facets?"

ggplot +
  facet_sample(
    n_per_facet = 10,
    n_facets = 12
  )

"How many facets to shove all the data in?"

ggplot +
  facet_strata(
    n_strata = 10
  )
79 / 131

facet_strata() & facet_sample() Under the hood

using sample_n_keys() & stratify_keys()

80 / 131

facet_strata() & facet_sample() Under the hood

using sample_n_keys() & stratify_keys()

You can still get at the data and do the manipulations yourself

80 / 131
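
For example, a minimal sketch (assuming stratify_keys() adds a .strata column, the same grouping used by facet_strata()):

library(brolgar)
library(dplyr)

# Assign each country to one of 10 strata, then manipulate the groups directly;
# the grouping column name .strata is an assumption
heights %>%
  stratify_keys(n_strata = 10) %>%
  as_tibble() %>%
  group_by(.strata) %>%
  summarise(n_countries = n_distinct(country))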

Problem #1: How do I look at some of the data?

81 / 131

Problem #1: How do I look at some of the data?

as_tsibble()

sample_n_keys()

facet_sample()

facet_strata()

81 / 131

Problem #1: How do I look at some of the data?

as_tsibble()

sample_n_keys()

facet_sample()

facet_strata()

Store useful information

View subsamples of data

View many subsamples

View all subsamples

81 / 131

Problem #1: How do I look at some of the data?

as_tsibble()

sample_n_keys()

facet_sample()

facet_strata()

Store useful information

View subsamples of data

View many subsamples

View all subsamples

82 / 131

Problem #2: How do I find interesting observations?

83 / 131

Problem #2: How do I find interesting observations?

84 / 131

Define interesting?

85 / 131

Identify features: summarise down to one observation

86 / 131

Identify features: summarise down to one observation

87 / 131

Identify features: summarise down to one observation

88 / 131

Identify important features and decide how to filter

89 / 131

Identify important features and decide how to filter

90 / 131

Join this feature back to the data

91 / 131

Join this feature back to the data

92 / 131

🎉 Countries with smallest and largest max height

93 / 131

Let's see that one more time, but with the data

94 / 131

Identify features: summarise down to one observation

## # A tsibble: 1,490 x 3 [!]
## # Key: country [144]
## country year height_cm
## <chr> <dbl> <dbl>
## 1 Afghanistan 1870 168.
## 2 Afghanistan 1880 166.
## 3 Afghanistan 1930 167.
## 4 Afghanistan 1990 167.
## 5 Afghanistan 2000 161.
## 6 Albania 1880 170.
## 7 Albania 1890 170.
## 8 Albania 1900 169.
## 9 Albania 2000 168.
## 10 Algeria 1910 169.
## # … with 1,480 more rows
95 / 131

Identify features: summarise down to one observation

## # A tibble: 144 x 6
## country min q25 med q75 max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 161. 164. 167. 168. 168.
## 2 Albania 168. 168. 170. 170. 170.
## 3 Algeria 166. 168. 169 170. 171.
## 4 Angola 159. 160. 167. 168. 169.
## 5 Argentina 167. 168. 168. 170. 174.
## 6 Armenia 164. 166. 169. 172. 172.
## 7 Australia 170 171. 172. 173. 178.
## 8 Austria 162. 164. 167. 169. 179.
## 9 Azerbaijan 170. 171. 172. 172. 172.
## 10 Bahrain 161. 161. 164. 164. 164
## # … with 134 more rows
96 / 131
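
The heights_five object filtered on the next slide isn't defined in this export; presumably it is this per-country five-number summary, computed with features() (the call is shown in full a few slides later). A minimal sketch:

library(brolgar)
library(dplyr)

# Assumed definition: one five-number-summary row per country
heights_five <- heights %>%
  features(height_cm, feat_five_num)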

Identify important features and decide how to filter

heights_five %>%
  filter(max == max(max) | max == min(max))
## # A tibble: 2 x 6
## country min q25 med q75 max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Denmark 165. 168. 170. 178. 183.
## 2 Papua New Guinea 152. 152. 156. 160. 161.
97 / 131

Join summaries back to data

heights_five %>%
  filter(max == max(max) | max == min(max)) %>%
  left_join(heights, by = "country")
## # A tibble: 21 x 8
## country min q25 med q75 max year height_cm
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Denmark 165. 168. 170. 178. 183. 1820 167.
## 2 Denmark 165. 168. 170. 178. 183. 1830 165.
## 3 Denmark 165. 168. 170. 178. 183. 1850 167.
## 4 Denmark 165. 168. 170. 178. 183. 1860 168.
## 5 Denmark 165. 168. 170. 178. 183. 1870 168.
## 6 Denmark 165. 168. 170. 178. 183. 1880 170.
## 7 Denmark 165. 168. 170. 178. 183. 1890 169.
## 8 Denmark 165. 168. 170. 178. 183. 1900 170.
## 9 Denmark 165. 168. 170. 178. 183. 1910 170
## 10 Denmark 165. 168. 170. 178. 183. 1920 174.
## # … with 11 more rows
98 / 131

99 / 131

Identify features: one per key

heights %>%
  features(height_cm,
           feat_five_num)
## # A tibble: 144 x 6
## country min q25 med q75 max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 161. 164. 167. 168. 168.
## 2 Albania 168. 168. 170. 170. 170.
## 3 Algeria 166. 168. 169 170. 171.
## 4 Angola 159. 160. 167. 168. 169.
## 5 Argentina 167. 168. 168. 170. 174.
## 6 Armenia 164. 166. 169. 172. 172.
## # … with 138 more rows
100 / 131

features: Summaries that are aware of data structure

101 / 131

features: Summaries that are aware of data structure

heights %>%
  features(height_cm,     #<< # variable we want to summarise
           feat_five_num)  #<< # feature to calculate
101 / 131

features: Summaries that are aware of data structure

heights %>%
  features(height_cm,     #<< # variable we want to summarise
           feat_five_num)  #<< # feature to calculate
## # A tibble: 144 x 6
## country min q25 med q75 max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 161. 164. 167. 168. 168.
## 2 Albania 168. 168. 170. 170. 170.
## 3 Algeria 166. 168. 169 170. 171.
## 4 Angola 159. 160. 167. 168. 169.
## 5 Argentina 167. 168. 168. 170. 174.
## 6 Armenia 164. 166. 169. 172. 172.
## # … with 138 more rows
101 / 131

Other available features() in brolgar

102 / 131

What is the range of the data? feat_ranges

heights %>%
  features(height_cm, feat_ranges)
## # A tibble: 144 x 5
## country min max range_diff iqr
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 161. 168. 7 3.27
## 2 Albania 168. 170. 2.20 1.53
## 3 Algeria 166. 171. 5.06 2.15
## 4 Angola 159. 169. 10.5 7.87
## 5 Argentina 167. 174. 7 2.21
## 6 Armenia 164. 172. 8.82 5.30
## 7 Australia 170 178. 8.4 2.58
## 8 Austria 162. 179. 17.2 5.35
## 9 Azerbaijan 170. 172. 1.97 1.12
## 10 Bahrain 161. 164 3.3 2.75
## # … with 134 more rows
103 / 131

Does it only increase or decrease? feat_monotonic

heights %>%
  features(height_cm, feat_monotonic)
## # A tibble: 144 x 5
## country increase decrease unvary monotonic
## <chr> <lgl> <lgl> <lgl> <lgl>
## 1 Afghanistan FALSE FALSE FALSE FALSE
## 2 Albania FALSE TRUE FALSE TRUE
## 3 Algeria FALSE FALSE FALSE FALSE
## 4 Angola FALSE FALSE FALSE FALSE
## 5 Argentina FALSE FALSE FALSE FALSE
## 6 Armenia FALSE FALSE FALSE FALSE
## 7 Australia FALSE FALSE FALSE FALSE
## 8 Austria FALSE FALSE FALSE FALSE
## 9 Azerbaijan FALSE FALSE FALSE FALSE
## 10 Bahrain TRUE FALSE FALSE TRUE
## # … with 134 more rows
104 / 131
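
A small sketch (not from the slides) of how a feature like this feeds the "decide how to filter" step, e.g. keeping only the countries whose recorded heights only ever increase:

library(brolgar)
library(dplyr)

# Countries where height_cm only ever increases across the years measured
heights %>%
  features(height_cm, feat_monotonic) %>%
  filter(increase)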

What is the spread of my data? feat_spread

heights %>%
  features(height_cm, feat_spread)
## # A tibble: 144 x 5
## country var sd mad iqr
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 7.20 2.68 1.65 3.27
## 2 Albania 0.950 0.975 0.667 1.53
## 3 Algeria 3.30 1.82 0.741 2.15
## 4 Angola 16.9 4.12 3.11 7.87
## 5 Argentina 2.89 1.70 1.36 2.21
## 6 Armenia 10.6 3.26 3.60 5.30
## 7 Australia 7.63 2.76 1.66 2.58
## 8 Austria 26.6 5.16 3.93 5.35
## 9 Azerbaijan 0.516 0.718 0.621 1.12
## 10 Bahrain 3.42 1.85 0.297 2.75
## # … with 134 more rows
105 / 131

features: MANY more features in feasts

Such as:

  • feat_acf: autocorrelation-based features
  • feat_stl: features from an STL decomposition (Seasonal and Trend decomposition using Loess)
106 / 131

features: what is feat_five_num?

107 / 131

features: what is feat_five_num?

feat_five_num
## function(x, ...) {
##   list(
##     min = b_min(x, ...),
##     q25 = b_q25(x, ...),
##     med = b_median(x, ...),
##     q75 = b_q75(x, ...),
##     max = b_max(x, ...)
##   )
## }
## <bytecode: 0x7fa11b5b9d28>
## <environment: namespace:brolgar>
107 / 131
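
Since a feature is just a function that returns a named list of summaries, you can pass your own to features(). A minimal sketch (feat_min_max is a hypothetical name, not part of brolgar):

library(brolgar)
library(dplyr)

# Hypothetical custom feature: just the minimum and maximum per key
feat_min_max <- function(x, ...) {
  list(
    min = min(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE)
  )
}

heights %>%
  features(height_cm, feat_min_max)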

features: what is feat_five_num?

108 / 131

Problem #1: How do I look at some of the data?

109 / 131

Problem #1: How do I look at some of the data?

Problem #2: How do I find interesting observations?

109 / 131

Problem #1: How do I look at some of the data?

Problem #2: How do I find interesting observations?

  • Decide what features are interesting
  • Summarise down to one observation
  • Decide how to filter
  • Join this feature back to the data
109 / 131

Problem #1: How do I look at some of the data?

Problem #2: How do I find interesting observations?

110 / 131

Problem #1: How do I look at some of the data?

Problem #2: How do I find interesting observations?

Problem #3: How do I understand my statistical model?

110 / 131

Problem #3: How do I understand my statistical model?

Let's fit a simple mixed-effects model to the data

Fixed effect of year + Random intercept for country

library(lme4)   # lmer()
library(modelr) # add_predictions(), add_residuals()

heights_fit <- lmer(height_cm ~ year + (1|country), heights)

heights_aug <- heights %>%
  add_predictions(heights_fit, var = "pred") %>%
  add_residuals(heights_fit, var = "res")
111 / 131

Problem #3: How do I understand my statistical model?

## # A tsibble: 1,490 x 5 [!]
## # Key: country [144]
## country year height_cm pred res
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 1870 168. 164. 4.59
## 2 Afghanistan 1880 166. 164. 1.52
## 3 Afghanistan 1930 167. 166. 0.823
## 4 Afghanistan 1990 167. 168. -1.04
## 5 Afghanistan 2000 161. 169. -7.10
## 6 Albania 1880 170. 168. 2.39
## 7 Albania 1890 170. 168. 1.73
## 8 Albania 1900 169. 168. 0.769
## 9 Albania 2000 168. 172. -4.14
## 10 Algeria 1910 169. 168. 1.28
## # … with 1,480 more rows
112 / 131

Problem #3: How do I understand my statistical model?

113 / 131
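
The gg_heights_fit object used on the next few slides isn't defined in this export; presumably it plots the model predictions per country from heights_aug. A minimal sketch of one way it might be built:

library(ggplot2)

# Assumed definition: one predicted line per country from the augmented data
gg_heights_fit <- ggplot(heights_aug,
                         aes(x = year,
                             y = pred,
                             group = country)) +
  geom_line()

gg_heights_fit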

Look at subsamples?

114 / 131

Look at subsamples?

114 / 131

Look at many subsamples?

115 / 131

Look at many subsamples?

gg_heights_fit + facet_sample()

115 / 131

Look at all subsamples?

gg_heights_fit + facet_strata()

116 / 131

Look at all subsamples along residuals?

gg_heights_fit + facet_strata(along = -res)

117 / 131

Look at the predictions with the data?

set.seed(2019-11-13)

heights_sample <- heights_aug %>%
  sample_n_keys(size = 9) %>% #<< sample the data
  ggplot(aes(x = year,
             y = pred,
             group = country)) +
  geom_line() +
  facet_wrap(~country)

heights_sample
118 / 131

Look at the predictions with the data?

119 / 131

Look at the predictions with the data?

heights_sample + geom_point(aes(y = height_cm))

120 / 131

What if we grabbed a sample of those who have the best, middle, and worst residuals?

121 / 131

What if we grabbed a sample of those who have the best, middle, and worst residuals?

summary(heights_aug$res)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.1707 -1.6202 -0.1558 0.0000 1.3545 12.1729
121 / 131

What if we grabbed a sample of those who have the best, middle, and worst residuals?

summary(heights_aug$res)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.1707 -1.6202 -0.1558 0.0000 1.3545 12.1729

Which countries are nearest to these statistics?

121 / 131

use keys_near()

heights_aug %>%
  keys_near(key = country,
            var = res)
## # A tibble: 6 x 5
## country res stat stat_value stat_diff
## <chr> <dbl> <fct> <dbl> <dbl>
## 1 Ireland -8.17 min -8.17 0
## 2 Azerbaijan -1.62 q_25 -1.62 0.000269
## 3 Laos -0.157 med -0.156 0.00125
## 4 Mongolia -0.155 med -0.156 0.00125
## 5 Egypt 1.35 q_75 1.35 0.000302
## 6 Poland 12.2 max 12.2 0

This shows us the keys that most closely match the five-number summary.

122 / 131

Show the data by joining it back to the residual summaries, to explore the spread

123 / 131
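
A minimal sketch of one way to do this, reusing the keys_near() output from the previous slide (the colouring is just illustrative):

library(brolgar)
library(dplyr)
library(ggplot2)

# Keep the countries that best match the five-number summary of the residuals,
# join them back to the augmented data, and plot their raw heights
heights_aug %>%
  keys_near(key = country, var = res) %>%
  select(country, stat) %>%
  left_join(heights_aug, by = "country") %>%
  ggplot(aes(x = year,
             y = height_cm,
             colour = stat,
             group = country)) +
  geom_line()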

Take homes

Problem #1: How do I look at some of the data?

  1. Longitudinal data is a time series
  2. Specify structure once, get a free lunch.
  3. Look at as much of the raw data as possible
  4. Use facet_sample() / facet_strata() to look at data
124 / 131

Take homes

Problem #2: How do I find interesting observations?

  1. Decide what features are interesting
  2. Summarise down to one observation
  3. Decide how to filter
  4. Join this feature back to the data
125 / 131

Take homes

Problem #3: How do I understand my statistical model?

  1. Look at (one, more, or all!) subsamples
  2. Arrange subsamples
  3. Find keys near some summary
  4. Join keys to data to explore representatives
126 / 131

Thanks

  • Di Cook
  • Tania Prvan
  • Stuart Lee
  • Mitchell O'Hara-Wild
  • Earo Wang
  • Rob Hyndman
  • Miles McBain
  • Monash University
127 / 131

Resources

128 / 131

Colophon

129 / 131

Learning more

brolgar.njtierney.com

bit.ly/njt-ozvis

nj_tierney

njtierney

nicholas.tierney@gmail.com

130 / 131

End.

131 / 131
  • Now let's go through these same principles:

    • Sample the fits
    • Many samples of the fits
    • Explore the residuals
    • Find the best, worst, and middle-of-the-road residuals
    • Follow the "gliding" process again.
  • Which are most similar to which stats?

    • "Who is similar to me?"
    • "Who is the most average?"
    • "Who is the most extreme?"
    • "Who is the most different to me?"
