(My) Background
I want to talk a bit about where I've started from, because I think it might be useful to understand my perspective, and why I'm interested in doing these things.
Undergraduate in Psychology
Psychophysics: illusory contours in 3D
Phil Grove
Christina Lee
If every psychologist in the world delivered gold standard smoking cessation therapy, the rate of smoking would still increase. You need to change policy to make change. To make effective policy, you need to have good data, and do good statistics.
I discovered an interest in public health and statistics.
Kerrie Mengersen
I started a PhD in statistics at QUT, under (now Distinguished) Professor Kerrie Mengersen, looking at people's health over time.
A lot of research in new statistical methods - imputation, inference, prediction
Not much research on how we explore our data, and the methods that we use to do this.
Focus on building a bridge across a river. Less focus on how it is built, and the tools used.
My research:
Design and improve tools for (exploratory) data analysis
...EDA is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. (Wikipedia)
John Tukey, Frederick Mosteller, Bill Cleveland, Dianne Cook, Heike Hofmann, Rob Hyndman, Hadley Wickham
visdat::vis_dat(airquality)
visdat::vis_miss(airquality)
naniar
Tierney, NJ. Cook, D. "Expanding tidy data principles to facilitate missing data exploration, visualization and assessment of imputations." [Pre-print]
naniar::gg_miss_var(airquality)
naniar::gg_miss_var(airquality, facet = Month)
naniar::gg_miss_upset(airquality)
Current work:
How to explore longitudinal data effectively
Something observed sequentially over time
country | year | height_cm
--- | --- | ---
Australia | 1910 | 172.700
Australia | 1920 | 172.846
Australia | 1960 | 176.300
Australia | 1970 | 178.400
Problem #1: How do I look at some of the data?
Problem #2: How do I find interesting observations?
Problem #3: How do I understand my statistical model?
brolgar
brolgar: brolgar.njtierney.com

Anything that is observed sequentially over time is a time series.
heights <- as_tsibble(heights, index = year, key = country, regular = FALSE)
Together, the key and the index determine distinct rows in a tsibble.
(From Earo Wang's talk: Melt the clock)
Record important time series information once, and use it many times in other places
## # A tsibble: 1,490 x 3 [!]
## # Key:       country [144]
##    country      year height_cm
##    <chr>       <dbl>     <dbl>
##  1 Afghanistan  1870      168.
##  2 Afghanistan  1880      166.
##  3 Afghanistan  1930      167.
##  4 Afghanistan  1990      167.
##  5 Afghanistan  2000      161.
##  6 Albania      1880      170.
## # … with 1,484 more rows
Remember:
key = variable(s) defining individual groups (or series)
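As a minimal sketch (assuming the tsibble package and the heights tsibble created above), you can query the recorded key and index at any time, rather than re-stating them in every call:

```r
library(tsibble)

# key_vars() and index_var() report what was recorded at creation;
# n_keys() counts the distinct series (here, the countries).
key_vars(heights)   # "country"
index_var(heights)  # "year"
n_keys(heights)     # 144
```

This is the "record once, use many times" idea: downstream functions read this metadata instead of asking the user for it again.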
Look at only a sample of the data:
Sample n rows with sample_n():

heights %>% sample_n(5)

## # A tsibble: 5 x 3 [!]
## # Key:       country [5]
##   country           year height_cm
##   <chr>            <dbl>     <dbl>
## 1 Cambodia          1860      165.
## 2 Bolivia           1890      164.
## 3 Macedonia         1930      169.
## 4 United States     1920      173.
## 5 Papua New Guinea  1880      152.
Sampling needs to select not random rows of the data, but random keys - the countries.
Use sample_n_keys() to sample keys:

sample_n_keys(heights, 5)
## # A tsibble: 56 x 3 [!]
## # Key:       country [5]
##   country  year height_cm
##   <chr>   <dbl>     <dbl>
## 1 Hungary  1730      167.
## 2 Hungary  1740      168.
## 3 Hungary  1750      167.
## 4 Hungary  1760      167
## 5 Hungary  1770      162.
## 6 Hungary  1780      163.
## # … with 50 more rows
Look at subsamples: sample keys.

Look at many subsamples: ?
(Something I made up)
If you have to solve 3+ substantial smaller problems in order to solve a larger problem, your focus shifts from the current goal to something else. You are distracted.
(Figure: the main task, slightly overshadowed by a minor sub-task)
I want to look at many subsamples of the data:

How many keys are there?
How many facets do I want to look at?
How many keys per facet should I look at?
How do I ensure there are the same number of keys per plot?
What is rep, rep.int, and rep_len?
Do I want length.out or times?
We can blame ourselves when we are distracted for not being better.

It's not that we should be better; rather, with better tools we could be more efficient.
We need to make things as easy as reasonable, with the least amount of distraction.
How many plots do I want to look at?
heights_plot +
  facet_sample(n_per_facet = 3,
               n_facets = 9)
facet_sample(): see more individuals

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line()

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
  facet_sample()
facet_strata(): see all individuals

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
  facet_strata()
In asking these questions we can solve something else interesting.
facet_strata(along = -year): see all individuals along some variable

ggplot(heights,
       aes(x = year,
           y = height_cm,
           group = country)) +
  geom_line() +
  facet_strata(along = -year)

"How many lines per facet?" "How many facets?"

ggplot + facet_sample(n_per_facet = 10, n_facets = 12)

"How many facets to shove all the data in?"

ggplot + facet_strata(n_strata = 10)
facet_strata() and facet_sample() use sample_n_keys() and stratify_keys() under the hood.
You can still get at data and do manipulations
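For example, a sketch (assuming brolgar, dplyr, and the heights data from above): stratify_keys() adds a .strata column to the data, which you can then manipulate like any other variable before plotting:

```r
library(brolgar)
library(dplyr)

# stratify_keys() assigns each key (country) to one of n_strata groups,
# adding a .strata column; the result is still a tsibble, so you can
# filter, summarise, or pass it to ggplot2 yourself.
heights %>%
  stratify_keys(n_strata = 4) %>%
  filter(.strata == 1)
```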
as_tsibble(): store useful information
sample_n_keys(): view subsamples of data
facet_sample(): view many subsamples
facet_strata(): view all subsamples
Define interesting?
Let's see that one more time, but with the data
## # A tsibble: 1,490 x 3 [!]
## # Key:       country [144]
##    country      year height_cm
##    <chr>       <dbl>     <dbl>
##  1 Afghanistan  1870      168.
##  2 Afghanistan  1880      166.
##  3 Afghanistan  1930      167.
##  4 Afghanistan  1990      167.
##  5 Afghanistan  2000      161.
##  6 Albania      1880      170.
##  7 Albania      1890      170.
##  8 Albania      1900      169.
##  9 Albania      2000      168.
## 10 Algeria      1910      169.
## # … with 1,480 more rows
## # A tibble: 144 x 6
##    country       min   q25   med   q75   max
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Afghanistan  161.  164.  167.  168.  168.
##  2 Albania      168.  168.  170.  170.  170.
##  3 Algeria      166.  168.  169   170.  171.
##  4 Angola       159.  160.  167.  168.  169.
##  5 Argentina    167.  168.  168.  170.  174.
##  6 Armenia      164.  166.  169.  172.  172.
##  7 Australia    170   171.  172.  173.  178.
##  8 Austria      162.  164.  167.  169.  179.
##  9 Azerbaijan   170.  171.  172.  172.  172.
## 10 Bahrain      161.  161.  164.  164.  164
## # … with 134 more rows
heights_five %>% filter(max == max(max) | max == min(max))
## # A tibble: 2 x 6
##   country            min   q25   med   q75   max
##   <chr>            <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Denmark           165.  168.  170.  178.  183.
## 2 Papua New Guinea  152.  152.  156.  160.  161.
heights_five %>%
  filter(max == max(max) | max == min(max)) %>%
  left_join(heights, by = "country")
## # A tibble: 21 x 8
##    country   min   q25   med   q75   max  year height_cm
##    <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
##  1 Denmark  165.  168.  170.  178.  183.  1820      167.
##  2 Denmark  165.  168.  170.  178.  183.  1830      165.
##  3 Denmark  165.  168.  170.  178.  183.  1850      167.
##  4 Denmark  165.  168.  170.  178.  183.  1860      168.
##  5 Denmark  165.  168.  170.  178.  183.  1870      168.
##  6 Denmark  165.  168.  170.  178.  183.  1880      170.
##  7 Denmark  165.  168.  170.  178.  183.  1890      169.
##  8 Denmark  165.  168.  170.  178.  183.  1900      170.
##  9 Denmark  165.  168.  170.  178.  183.  1910      170
## 10 Denmark  165.  168.  170.  178.  183.  1920      174.
## # … with 11 more rows
heights %>% features(height_cm, feat_five_num)
## # A tibble: 144 x 6
##   country       min   q25   med   q75   max
##   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan  161.  164.  167.  168.  168.
## 2 Albania      168.  168.  170.  170.  170.
## 3 Algeria      166.  168.  169   170.  171.
## 4 Angola       159.  160.  167.  168.  169.
## 5 Argentina    167.  168.  168.  170.  174.
## 6 Armenia      164.  166.  169.  172.  172.
## # … with 138 more rows
heights %>%
  features(height_cm,     # variable we want to summarise
           feat_five_num) # feature to calculate

## # A tibble: 144 x 6
##   country       min   q25   med   q75   max
##   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan  161.  164.  167.  168.  168.
## 2 Albania      168.  168.  170.  170.  170.
## 3 Algeria      166.  168.  169   170.  171.
## 4 Angola       159.  160.  167.  168.  169.
## 5 Argentina    167.  168.  168.  170.  174.
## 6 Armenia      164.  166.  169.  172.  172.
## # … with 138 more rows
features() in brolgar
feat_ranges
heights %>% features(height_cm, feat_ranges)
## # A tibble: 144 x 5
##    country       min   max range_diff   iqr
##    <chr>       <dbl> <dbl>      <dbl> <dbl>
##  1 Afghanistan  161.  168.       7     3.27
##  2 Albania      168.  170.       2.20  1.53
##  3 Algeria      166.  171.       5.06  2.15
##  4 Angola       159.  169.      10.5   7.87
##  5 Argentina    167.  174.       7     2.21
##  6 Armenia      164.  172.       8.82  5.30
##  7 Australia    170   178.       8.4   2.58
##  8 Austria      162.  179.      17.2   5.35
##  9 Azerbaijan   170.  172.       1.97  1.12
## 10 Bahrain      161.  164        3.3   2.75
## # … with 134 more rows
feat_monotonic
heights %>% features(height_cm, feat_monotonic)
## # A tibble: 144 x 5
##    country     increase decrease unvary monotonic
##    <chr>       <lgl>    <lgl>    <lgl>  <lgl>
##  1 Afghanistan FALSE    FALSE    FALSE  FALSE
##  2 Albania     FALSE    TRUE     FALSE  TRUE
##  3 Algeria     FALSE    FALSE    FALSE  FALSE
##  4 Angola      FALSE    FALSE    FALSE  FALSE
##  5 Argentina   FALSE    FALSE    FALSE  FALSE
##  6 Armenia     FALSE    FALSE    FALSE  FALSE
##  7 Australia   FALSE    FALSE    FALSE  FALSE
##  8 Austria     FALSE    FALSE    FALSE  FALSE
##  9 Azerbaijan  FALSE    FALSE    FALSE  FALSE
## 10 Bahrain     TRUE     FALSE    FALSE  TRUE
## # … with 134 more rows
feat_spread
heights %>% features(height_cm, feat_spread)
## # A tibble: 144 x 5
##    country        var    sd   mad   iqr
##    <chr>        <dbl> <dbl> <dbl> <dbl>
##  1 Afghanistan  7.20  2.68  1.65   3.27
##  2 Albania      0.950 0.975 0.667  1.53
##  3 Algeria      3.30  1.82  0.741  2.15
##  4 Angola      16.9   4.12  3.11   7.87
##  5 Argentina    2.89  1.70  1.36   2.21
##  6 Armenia     10.6   3.26  3.60   5.30
##  7 Australia    7.63  2.76  1.66   2.58
##  8 Austria     26.6   5.16  3.93   5.35
##  9 Azerbaijan   0.516 0.718 0.621  1.12
## 10 Bahrain      3.42  1.85  0.297  2.75
## # … with 134 more rows
The feasts package provides many more features, such as:

feat_acf: autocorrelation-based features
feat_stl: STL (Seasonal, Trend, and Remainder by LOESS) decomposition
What is feat_five_num?

feat_five_num
## function(x, ...) {
##   list(
##     min = b_min(x, ...),
##     q25 = b_q25(x, ...),
##     med = b_median(x, ...),
##     q75 = b_q75(x, ...),
##     max = b_max(x, ...)
##   )
## }
## <bytecode: 0x7fa11b5b9d28>
## <environment: namespace:brolgar>
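Since a feature is just a function returning a named list of summaries, you can write your own and pass it to features() the same way. A minimal sketch (feat_mid is a made-up name, not part of brolgar):

```r
# A hypothetical feature returning the mean and median of a series.
feat_mid <- function(x, ...) {
  list(
    mean = mean(x, na.rm = TRUE),
    med  = median(x, na.rm = TRUE)
  )
}

# Used exactly like feat_five_num:
heights %>% features(height_cm, feat_mid)
```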
Let's fit a simple mixed effects model to the data
Fixed effect of year + Random intercept for country
heights_fit <- lmer(height_cm ~ year + (1|country), heights)

heights_aug <- heights %>%
  add_predictions(heights_fit, var = "pred") %>%
  add_residuals(heights_fit, var = "res")
## # A tsibble: 1,490 x 5 [!]
## # Key:       country [144]
##    country      year height_cm  pred    res
##    <chr>       <dbl>     <dbl> <dbl>  <dbl>
##  1 Afghanistan  1870      168.  164.  4.59
##  2 Afghanistan  1880      166.  164.  1.52
##  3 Afghanistan  1930      167.  166.  0.823
##  4 Afghanistan  1990      167.  168. -1.04
##  5 Afghanistan  2000      161.  169. -7.10
##  6 Albania      1880      170.  168.  2.39
##  7 Albania      1890      170.  168.  1.73
##  8 Albania      1900      169.  168.  0.769
##  9 Albania      2000      168.  172. -4.14
## 10 Algeria      1910      169.  168.  1.28
## # … with 1,480 more rows
gg_heights_fit + facet_sample()
gg_heights_fit + facet_strata()
gg_heights_fit + facet_strata(along = -res)
set.seed(2019-11-13)

heights_sample <- heights_aug %>%
  sample_n_keys(size = 9) %>% # sample the data
  ggplot(aes(x = year,
             y = pred,
             group = country)) +
  geom_line() +
  facet_wrap(~country)

heights_sample
heights_sample + geom_point(aes(y = height_cm))
summary(heights_aug$res)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -8.1707 -1.6202 -0.1558  0.0000  1.3545 12.1729
Which countries are nearest to these statistics?
keys_near()
heights_aug %>% keys_near(key = country, var = res)
## # A tibble: 6 x 5
##   country        res stat  stat_value stat_diff
##   <chr>        <dbl> <fct>      <dbl>     <dbl>
## 1 Ireland     -8.17  min       -8.17   0
## 2 Azerbaijan  -1.62  q_25      -1.62   0.000269
## 3 Laos        -0.157 med       -0.156  0.00125
## 4 Mongolia    -0.155 med       -0.156  0.00125
## 5 Egypt        1.35  q_75       1.35   0.000302
## 6 Poland      12.2   max       12.2    0
This shows us the keys that closely match the five number summary.
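To see those countries in context, one option (a sketch assuming the heights_aug object and packages loaded above) is to join the keys_near() result back to the data and plot only the matched series:

```r
library(dplyr)
library(ggplot2)

# Keep only the countries flagged by keys_near(), then plot their
# residuals over time, one facet per summary statistic.
heights_aug %>%
  keys_near(key = country, var = res) %>%
  select(country, stat) %>%          # drop res to avoid join-name clashes
  left_join(heights_aug, by = "country") %>%
  ggplot(aes(x = year, y = res, group = country)) +
  geom_line() +
  facet_wrap(~stat)
```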
Use facet_sample() / facet_strata() to look at the data.

End.
Now let's go through these same principles:
Which are most similar to which stats?
Anscombe's quartet