Study Heart Disease using PySyft

Welcome!

Can we run Machine Learning on medical data without seeing the data?

In this tutorial we will learn how to use PySyft to run Machine learning experiments on multiple medical datasets distributed across four Datasite servers, while maintaining privacy!

Read more in this post on the OpenMined blog.

Let's get started! 🚀

Heart Disease Study 🫀

We will use the full version of the Heart Disease dataset, as available on the UCI ML repository.

The database is the result of a study for the diagnosis of coronary artery disease, as described in this paper.

The dataset contains data collected from patients in four different hospitals, in 1988:

  • Cleveland Clinic in Cleveland, Ohio (303 patients);
  • Hungarian Institute of Cardiology in Budapest, Hungary (425 patients);
  • Veterans Administration Medical Center in Long Beach, California (200 patients);
  • University Hospitals in Zurich and Basel, Switzerland (143 patients).

💡 Each hospital will be mapped to a single PySyft Datasite, hosting its own version of the Heart Study data. We will pretend that these data are not public, as is most likely the case with real medical data. Therefore, our main focus in the tutorial will be to learn how to work with non-public data while maintaining privacy.
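As a rough illustration of this setup, the sketch below uses a plain-Python stand-in for a Datasite. The names `ToyDatasite`, `run_aggregate` and `age_summary` are hypothetical and invented here; they are not part of the PySyft API, and the records are made up. The toy class enforces nothing: it only shows the shape of the interaction, where the analysis runs where the data lives and only aggregates come back.

```python
from statistics import mean


class ToyDatasite:
    """Hypothetical stand-in for a Datasite (NOT the PySyft API)."""

    def __init__(self, name, records):
        self.name = name
        self._records = records  # private rows: stay on the site

    def run_aggregate(self, func):
        """Run func next to the private records and return its result."""
        return func(self._records)


def age_summary(records):
    """Return aggregate statistics only, never individual rows."""
    ages = [r["age"] for r in records]
    return {"n": len(ages), "mean_age": round(mean(ages), 1)}


# Made-up toy records, for illustration only.
sites = [
    ToyDatasite("Cleveland", [{"age": 63}, {"age": 41}, {"age": 57}]),
    ToyDatasite("Budapest", [{"age": 49}, {"age": 52}]),
]

# The caller only ever receives one small summary per site.
summaries = {site.name: site.run_aggregate(age_summary) for site in sites}
print(summaries)
```

In this sketch, comparing the per-site summaries is enough to spot demographic differences between hospitals, even though no patient row ever leaves its site.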

What you will learn 🎓

In this tutorial, you will learn how to…
  1. work remotely with non-public medical data.
  2. use PySyft to run Machine learning experiments on non-public and distributed medical datasets.
  3. take advantage of getting access to multiple medical datasets for better Machine learning modelling.

Materials 🧑‍💻

The tutorial is organised into multiple Jupyter notebooks that will guide you through the different steps of our Machine learning experiment, using PySyft.

  • 🧭 (Intro) Setup Datasites

    This introductory notebook will help you to get your bearings with the data, and the PySyft Datasites.

  • 📊 1. Compare Demographics

    In this notebook we will start our data science journey by analysing the distribution of demographics in the data. Why is this interesting? We will learn how to get our first insights about the data, without ever seeing the data!

  • 🤖 2. ML Model Training Experiment

    Let's use PySyft to train a Machine learning classifier using data across the four distributed datasites, while also keeping the non-public data private! (This is going to be our 🌟 research experiment with PySyft!)

  • 3. ML Model Evaluation Experiment

    It's now time to assess the performance of the trained classifiers on each remote datasite. We will gather evaluation metrics for the trained models, and compare the results! We will learn how to create a specialised Syft function that guarantees control over the input/output policies of our remote code execution. (This will be our 🌟🌟 research experiment with PySyft!)

  • 🗳️ 4. Ensemble Learning Experiment

    In this last step, we will use an Ensemble as a strategy to combine the multiple ML models trained on the four medical datasets. In this way, our heart disease predictions will come from models that, together, have learned from the data of all four hospitals! (This will be our 🌟🌟🌟 and last research experiment with PySyft!)
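The last three steps (train one model per site, evaluate each model, combine them in an ensemble) can be sketched in plain Python. This is a toy under stated assumptions, not the tutorial's implementation: the data is made up, each "model" is a one-feature threshold rule, the ensemble is a simple majority vote, and all function names (`train_stump`, `accuracy`, `ensemble_predict`) are hypothetical. No PySyft calls are used.

```python
def train_stump(rows):
    """Fit a one-feature rule: predict 1 when x >= threshold.

    The threshold is the midpoint between the two class means, which
    assumes the positive class has the larger mean feature value.
    """
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x >= threshold)


def accuracy(model, rows):
    """Fraction of (x, y) rows the model labels correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


def ensemble_predict(models, x):
    """Majority vote over the models; ties go to the positive class."""
    votes = sum(m(x) for m in models)
    return int(2 * votes >= len(models))


# Made-up (feature, label) pairs, one small list per site.
site_data = {
    "cleveland": [(1.0, 0), (2.0, 0), (6.0, 1), (7.0, 1)],
    "budapest": [(2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1)],
    "long_beach": [(0.0, 0), (2.0, 0), (4.0, 1), (6.0, 1)],
    "zurich_basel": [(3.0, 0), (4.0, 0), (8.0, 1), (9.0, 1)],
}

# Step 2: train one model per site, each on that site's rows only.
models = {name: train_stump(rows) for name, rows in site_data.items()}

# Step 3: evaluate each model on the data of its own site.
local_scores = {name: accuracy(models[name], rows)
                for name, rows in site_data.items()}

# Step 4: combine the four models. Pooling the rows to score the ensemble
# is only possible because this is a toy; in the tutorial's setting the
# rows stay on each site.
pooled = [row for rows in site_data.values() for row in rows]
ensemble_score = accuracy(
    lambda x: ensemble_predict(list(models.values()), x), pooled
)

print(local_scores)
print(ensemble_score)
```

Only the trained models are combined here; no site needs another site's rows to take part in the vote, which is the property that makes an ensemble a natural fit for distributed, non-public data.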

Ready to get started?

Everything you need to start working with the tutorial is available on GitHub! You can start by cloning the repo, then follow the instructions in the README file:

$ git clone https://github.com/openmined/syft-heart-disease-tutorial

Feedback? Always welcome!

If you liked this tutorial, or if you have any questions or feedback, please feel free to use one of the options below:

  • Star the repository.
  • Open an issue.
  • Reach out in the #support channel on Slack.

This tutorial was authored by
Valerio Maggio
Hey there! 👋 I am Valerio, Computer Geek, Machine Learner, Community Advocate and Education Lead at OpenMined! I am also a fellow of the Software Sustainability Institute, and a long-standing member of the Python community.