Study Heart Disease using PySyft

Welcome!

Can we run Machine Learning on medical data without seeing the data?

In this tutorial we will learn how to use PySyft to run Machine learning experiments on multiple medical datasets distributed across four Datasite servers, while maintaining privacy!

Read more in this post on the OpenMined blog.

UPDATE

Two new notebooks have been added to the tutorial, demonstrating how to run a complete Federated Learning example using PySyft!
Check them out in the Materials section, and read more in ➡️ this new post ⬅️ on the OpenMined blog.

Let's get started! 🚀

Heart Disease Study 🫀

We will use the full version of the Heart Disease dataset, as available on the UCI ML repository.

The database is the result of a study for the diagnosis of coronary artery disease, as described in this paper.

The dataset contains data collected in 1988 from patients in four different hospitals:

  • Cleveland Clinic in Cleveland, Ohio (303 patients);
  • Hungarian Institute of Cardiology in Budapest, Hungary (294 patients);
  • Veterans Administration Medical Center in Long Beach, California (200 patients);
  • University Hospitals in Zurich and Basel, Switzerland (123 patients).

💡 Each hospital will be mapped to a single PySyft Datasite, hosting its own version of the Heart Study data. We will pretend that these data are not public, as is most likely the case with real medical data. Our main focus in the tutorial will therefore be to learn how to work with non-public data while maintaining privacy.

What you will learn 🎓

In this tutorial, you will learn how to…
  1. work remotely with non-public medical data.
  2. use PySyft to run Machine learning experiments on non-public and distributed medical datasets.
  3. take advantage of getting access to multiple medical datasets for better Machine learning modelling.

Materials 🧑‍💻

The tutorial is organised into multiple Jupyter notebooks that will guide you through the different steps of our Machine learning experiment, using PySyft.

  • 🧭 (Intro) Setup Datasites

    This introductory notebook will help you get your bearings with the data and the PySyft Datasites.

  • 📊 1. Compare Demographics

    In this notebook we will start our data science journey by analysing the distribution of demographics in the data. Why is this interesting? We will learn how to get our first insights about the data, without ever seeing the data!

  • 🤖 2. ML Model Training Experiment

    Let's use PySyft to train a Machine learning classifier using data across the four distributed datasites, while keeping the non-public data private! (This is going to be our 🌟 research experiment with PySyft!)

  • 📝 3. ML Model Evaluation Experiment

    It's now time to assess the performance of the trained classifiers on each remote datasite. We will gather evaluation metrics for the trained models, and compare the results! We will learn how to create a specialised Syft function that guarantees control over the input/output policies of our remote code execution. (This will be our 🌟🌟 research experiment with PySyft!)

  • 🗳️ 4. Ensemble Learning Experiment

    In this step, we will use an Ensemble as a strategy to combine the multiple ML models trained on the four medical datasets. In this way, our heart disease prediction will come from models that, together, have learned from all four datasets rather than just one! (This will be our 🌟🌟🌟 experiment with PySyft!)

  • ⚗️ 5. Federated Learning Experiment

    In this step, we will go a step further and learn how to run a full Federated Learning experiment using PySyft. We will train a linear classifier on each datasite and explore how to pass model parameters as inputs to a Syft function. (This will be our 🌟🌟🌟🌟 experiment with PySyft!)

  • 🔮 6. Federated Learning Experiment (with PyTorch)

    In this last step, we will run another complete Federated Learning experiment, but this time using PySyft and PyTorch. We'll train a non-linear Neural Network across multiple datasites and learn how to leverage PyTorch within PySyft to seamlessly execute FL experiments. (This will be our 🌟🌟🌟🌟🌟 and last research experiment with PySyft!)
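
To give a rough feel for the ideas behind steps 4 and 5, here is a minimal, plain-Python sketch of two common ways to combine models trained at separate sites: majority voting over per-site predictions (an ensemble) and sample-weighted averaging of model parameters (the aggregation step often used in federated learning). This is an illustration only: it is not taken from the notebooks, it does not use the PySyft API, and the function names and toy numbers are hypothetical.

```python
from collections import Counter


def majority_vote(site_predictions):
    """Combine the class labels predicted by each site's model for one patient.

    Returns the most frequent label; on a tie, the label seen first wins.
    """
    counts = Counter(site_predictions)
    return counts.most_common(1)[0][0]


def federated_average(site_params, site_sizes):
    """Average per-site parameter vectors, weighted by local sample count.

    site_params: list of equal-length lists of floats, one per site.
    site_sizes:  number of training samples held by each site.
    """
    total = sum(site_sizes)
    n_params = len(site_params[0])
    return [
        sum(params[i] * size for params, size in zip(site_params, site_sizes)) / total
        for i in range(n_params)
    ]


# Toy values (hypothetical, not from the Heart Disease data).
votes = [1, 0, 1, 1]                      # one prediction per site model
print(majority_vote(votes))               # 1

params = [[1.0, 2.0], [3.0, 4.0]]         # two sites, two parameters each
sizes = [300, 100]                        # local sample counts
print(federated_average(params, sizes))   # [1.5, 2.5]
```

In both cases only predictions or model parameters are combined, never raw patient records, which is what makes these strategies a natural fit for non-public data.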

Ready to get started?

Everything you need to start working with the tutorial is available on GitHub! You can start by cloning the repo and following the instructions in the README file:

$ git clone https://github.com/openmined/syft-heart-disease-tutorial

Feedback? Always welcome!

If you liked this tutorial, or if you have any questions or feedback, please feel free to use one of the options below:

  • Star the repository;
  • Open an issue;
  • Reach out in the #support channel on Slack.

This tutorial was authored by
Valerio Maggio
Hey there! 👋 I am Valerio, Computer Geek, Machine Learner, Community Advocate and Education Lead at OpenMined! I am also a fellow of the Software Sustainability Institute, and a long-standing member of the Python community.