Data Science Education

The Outreach Course

Given the broader objectives of the project, this intensive one-week summer course aims to give students a taste of data science as an emerging field that is likely to have a great impact on many aspects of our daily lives in the years to come. While the course does not assume prior background in mathematics or statistics, we aim to teach a set of skills and principles that are often considered fundamental to the field. Nevertheless, we view data science as an evolving field and believe that no single discipline should determine its course of development. We think we all, including the younger generations we teach, have a responsibility to make this field as inclusive and vibrant as possible.

Prerequisites & Logistics 

The course is open to teenagers who are nearing the end of their secondary education and have an interest in data science in general and in developing some analytical skills with R in particular. Ideally, they should be close to Exeter so that they can come and join the sessions with us in Exeter for three hours each day for a week in late August and early September 2019. However, students who are unable to join us on campus can take the course via live web conferences, which will be recorded and made available as screencasts on this website afterwards.

At the end of the course, there will be a one-day event on the beautiful Streatham campus of the University of Exeter. We will invite real-world data scientists and academics to come and talk to the students on this course, and we will ask the students to present, in groups, what they have learnt with us. To generate some impact beyond the end of the project, we will record the one-week intensive course using Zoom and make it available online as screencasts, so that students of similar backgrounds elsewhere can study the materials themselves and learn more about data science in future.

Throughout the week, we will not only teach students some essential technical and analytical skills that will prove useful for (further) studies and (future) work, but we'll also emphasise statistical re-thinking and other soft skills, such as communication, teamwork, and storytelling. We will analyse real-world research data in multiple ways, observe how the findings resemble and/or differ from one another, and eventually raise awareness of the importance of design in data-intensive research and decision-making. Together, we aim to construct a (hopefully) better (shared) understanding of a particular research design, namely, Randomised Controlled Trials (RCTs), by approaching (research) data with an open and enabling mindset and analysing them in principled ways.

Each three-hour session on campus consists of lectures and hands-on-the-keyboard exercises. These sessions will take place in BC217 of Baring Court on St Luke's Campus between 10:30 am and 2:30 pm (with a one-hour lunch break) each day for five days. Tea, coffee, and lunch will be served in BC03 of St Luke's for those who join the live sessions with us on campus. The lectures will introduce students to some core topics related to data science, and the workshops aim to help students complete an end-of-course group project. Depending on the number of students who take part in the course, at least two group projects will be made available for students to choose from. Teaching assistants and school teachers who know the students best will be available online and on campus to help answer questions.

As we really want students to leave the course with a set of skills that will be helpful to them and a better understanding of data science for causal inference as a field, we expect them to follow the course on campus, online, or both, and complete a piece of group work. It is therefore very important that students engage with the materials and the team who deliver the course throughout the week. However, we will make sure that help is always available when needed, either on campus or online.

Topics to be Covered

We dedicate the whole week to the analysis of data from one particular type of design, Randomised Controlled Trials (RCTs), which are widely adopted in many scientific disciplines but are also controversial among researchers from diverse backgrounds. However, we do not assume you have any prior knowledge about RCTs either. In the course, we will apply both conventional and machine learning techniques to one real-world research dataset, which you can download from this website. You will need to select your college name and create an account with the UK Data Service before you can download it.

Please note that this is not a toy dataset to simply play with: it comes from a study funded by the Economic and Social Research Council (ESRC), which cost £621,022 and took over two years (2012-2014) to complete. Please read as much as possible about the project before you come here, but do not worry if you cannot understand much about it yet.

Although this is a research dataset, the models you will learn during the week can be applied elsewhere too. More importantly, you will learn that real-world data analysis is rarely, if ever, straightforward, particularly when the stakes are high and you need to generate evidence for decision-making.


Day One (27/08/19)
1. Introduction to R and RStudio using the MOVE data.
2. Basic ideas about RCTs: Randomisation, potential outcomes, and causal inference.
3. Researcher degrees of freedom in data pre-processing and subset construction for final analysis.
Screencasts: AM, PM 
Please note that we encountered many technical difficulties on the first morning, as we had anticipated. As a result, the recording quality for the morning session was very poor.
R codes and slides.
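
To make the ideas of randomisation and potential outcomes listed above a little more concrete, here is a minimal sketch. It is written in Python purely for illustration (the course itself uses R) and it does not use the MOVE data: the sample size, the outcome distribution, and the true effect of 2.0 are all made-up assumptions, and the function name simulate_rct is hypothetical.

```python
import random
import statistics

def simulate_rct(n=1000, true_effect=2.0, seed=1):
    """Simulate an RCT in which both potential outcomes are known (made-up data)."""
    rng = random.Random(seed)
    # Potential outcome without treatment (y0) and with treatment (y1).
    y0 = [rng.gauss(10.0, 3.0) for _ in range(n)]
    y1 = [value + true_effect for value in y0]
    # Randomisation: shuffle the participant indices and treat the first half.
    order = list(range(n))
    rng.shuffle(order)
    treated = set(order[: n // 2])
    # In a real trial only one potential outcome is observed per participant.
    observed_treated = [y1[i] for i in range(n) if i in treated]
    observed_control = [y0[i] for i in range(n) if i not in treated]
    # Difference in means between the two randomised groups.
    return statistics.mean(observed_treated) - statistics.mean(observed_control)

estimate = simulate_rct()
print(round(estimate, 2))
```

Because treatment is assigned at random, the difference in group means should land close to the assumed true effect, even though no participant ever reveals both potential outcomes. This is one way to see why causal inference can be viewed as a missing data problem.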

Day Two (28/08/19)
1. The promise and perils of statistical modelling (together with Larisa Seward from Exeter College).
2. Missing data and multiple imputation using Amelia.
3. Causal inference as a missing data problem.
Screencasts: AM, PM
Because the guest lecturer used a different computer to present the topic, the screencast for the morning session is not available. However, we do have an audio recording of the session, which we will upload to this website later. As we learned through the process, the afternoon session was much better than before.
R codes and slides.

Day Three (29/08/19)
1. Statistical hypothesis testing (together with Vicky Crockett-Matthews from Exeter College).
2. Diversity in analytical approaches to a single dataset: Difference-in-means, linear regression, and multilevel modelling.
3. Let data speak for itself?: Multiple ways of seeing.
Screencasts: AM, PM
Because the guest lecturer used a different computer to present the topic, the screencast for the morning session is not available. However, we do have an audio recording of the session, which we will upload to this website later. We also have a video backup of the morning session, but we need to remove student images before we upload it as a screencast here.
R codes and slides.

Day Four (30/08/19)
1. Introduction to statistical learning: From inference to prediction.
2. The concepts of training, testing, and cross-validation.
3. Prediction using simple linear regression, multilevel modelling, and random forests.
Screencasts: AM, PM
The morning session went exceptionally well.
R codes and slides.
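
As a rough illustration of the training, testing, and cross-validation ideas listed above, here is a minimal sketch of k-fold cross-validation for simple linear regression. It is written in Python for illustration only (the course itself uses R), the data are simulated rather than taken from the course dataset, and the function names fit_line and cross_validated_mse are hypothetical.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def cross_validated_mse(xs, ys, k=5, seed=1):
    """Average held-out mean squared error over k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    # Split the shuffled indices into k roughly equal folds.
    folds = [indices[i::k] for i in range(k)]
    errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        # Fit on the training folds only, then score on the held-out fold.
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in fold]
        errors.append(statistics.mean(squared))
    return statistics.mean(errors)

# Made-up data: y = 1 + 2x plus noise with standard deviation 1.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]
cv_mse = cross_validated_mse(xs, ys)
print(round(cv_mse, 2))
```

Each observation is used for testing exactly once and never to fit the model that predicts it, so the averaged error is an estimate of how well the model predicts new data rather than how well it fits the data it has already seen.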

Day Five (02/09/19)
1. Philosophical underpinnings of social science research (together with James Adams).
2. Predicting counterfactual outcomes for individuals in an RCT.
3. Limitations of data-driven approaches to data analytics.
Screencasts: AM, PM

Final delivery day of the course! Many students enjoyed the morning session James delivered.
R codes and slides.

Day Six (03/09/19)
Final event day held in Reed Hall of Streatham Campus. On this day, we invite the students to campus again to present their group work and talk to some real world data scientists and academics who will come and present their work and research in an easy-to-understand way.

As we promised not to film student presentations, we did not record them in any format. However, the students did really well, and four of them received extra Amazon vouchers for their active participation and impressive contribution to the group work.

Video recordings: Welcome from Professor Susan Banducci, RCTs in Education by Dr Hugues Lortie-Forgues from the University of York, Real world data analytics by Sarah Littler from Select Statistics.

Slides from the final event day.