Workshop Nolai

General approach and goals

In this workshop you will learn the basics of PEP. Step by step, we will guide you through the system. After completing this workshop you will have a basic understanding of:

  • Downloading the software and starting
  • Logging in
  • Downloading and uploading data
  • Repository management

For any questions, please contact support@pep.cs.ru.nl

Downloading the software and starting

For Windows

  • Download the installer from: Windows Installer
  • Click the downloaded file and go through the installer
  • Unmark "PEP Assessor starten" and finish
  • Then, go to your start menu and go to All Apps, look up and open the PEP folder, and click on PEP Opdrachtregel

For MacOS

  • Download the installer from: MacOS Installer
  • Click the downloaded file to unzip it, and go through the installer
  • Open the Go menu, then Applications (or press Shift+Cmd+A), and double-click on PEP Command Line Interface (nolai-sandbox acc)

You should now have a PEP command prompt window opened. We will refer to this window as the terminal.

Disclaimer: Feedback from the system

At times, the PEP system returns output that seems to indicate that a process has run with errors. Sometimes, this output can safely be ignored. The PEP team is working on solutions for this incorrect output. For now, please ignore the following output:

  • When logging in, the terminal might show:
Qt: Session management error: Could not open network socket
  • When using the pepcli (explained later), the terminal might show:
<error> [Cli] Unexpected problem shutting down SSL streams: boost::system::error_code: uninitialized (SSL routines) | Forcefully shutting down.

Logging in

The terminal is where all interaction with the PEP repository happens. At the moment we cannot do much, as we are not yet logged in. To log in, type pepLogon into your terminal and hit Enter. This will open an internet browser window with the SURFconext login screen. After logging in, you will see all UserGroups to which you have access. More on UserGroups later. For now, we will use the UserGroup WorkshopUser. After you click that UserGroup, you may close the browser tab/window and return to the terminal.

  • In your terminal, type the pepLogon command:
pepLogon

You are now logged in. The terminal should show you the message: Wrote enrollment result (keys) to "LOCATION/ON/YOUR/COMPUTER/ClientKeys.json". This file contains keys that PEP will automatically use.

Looking around

Now that we are logged in, we can have a first look around in the PEP repository. For this workshop, we have already prepared some data concerning the students Alice, Bob, Charles, and Danielle. We will refer to all data subjects (in this case the students) as participants. First, let's download the data for these participants with the following command (which we will explain later):

pepcli pull --participant-groups AllExampleParticipants --column-groups AllExampleColumns

After the process has finished, you should have a (new) folder called pulled-data in your current working directory. Open it in an explorer window and look around. You will see that we have four subfolders with the prefix NOLSU and then a number. These subfolder names are the participant identifiers for this specific UserGroup. Each datapoint is stored in a separate file. Look around in the subfolders.
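
If you would rather inspect the download from a script than from an explorer window, the folder layout described above can be read back into a table-like structure. The following is a minimal Python sketch, assuming only what is stated above: one subfolder per participant and one file per datapoint. The function name read_pulled_data is ours, and treating the file name (without its extension) as the column name, as well as reading every file as UTF-8 text, are assumptions; the files PEP actually writes may differ.

```python
from pathlib import Path


def read_pulled_data(root):
    """Return {participant_folder: {file_stem: text}} for a pulled-data folder.

    Assumption: every subfolder is one participant and every regular file
    inside it holds one datapoint as UTF-8 text.
    """
    table = {}
    for participant_dir in sorted(Path(root).iterdir()):
        if not participant_dir.is_dir():
            continue  # skip any loose files next to the participant subfolders
        row = {}
        for datapoint in sorted(participant_dir.iterdir()):
            if datapoint.is_file():
                row[datapoint.stem] = datapoint.read_text(encoding="utf-8").strip()
        table[participant_dir.name] = row
    return table
```

Calling read_pulled_data("pulled-data") from the directory in which you ran the pull command would then give one dictionary entry per participant subfolder.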

We can visualise these data points in a table:

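Name     | Address        | Class | French_Level | French_Grade | Maths_Level | Maths_Grade
---------|----------------|-------|--------------|--------------|-------------|------------
Alice    | Fruitstraat 12 | 2A    | A            | 8            | A           | 3
Bob      | Lange Laan 3   | 3B    | A            | 4            | C           | 7
Charles  | Lange Laan 3   | 2A    | C            | 7            | D           | 6
Danielle | Appelstraat 7  | 1C    | B            | 5            | A           | 8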

The table visualises the datapoints in rows (the participants Alice, Bob, Charles and Danielle), and columns (Name, Address, Class, etc.).

Imagine you are a researcher studying the achievements of students. This data would be useful, but it also contains a lot of information that you do not need. You have no need for the actual names of the students, let alone the addresses where they live. As long as you can consistently identify single participants yourself, you have enough information.

Intermezzo: Privacy by design

The ground rule for working with personal data (and all data in the table above is personal data, not just the names and the addresses) is that the person working with these data should only have access to the data that is required for the purpose. This concept is called data minimization.

Other important aspects to keep in mind are proportionality and purpose limitation. For example, it might be valuable to register mental conditions of participants when performing a study, but this has a much larger impact on the privacy of an individual than, for example, sharing a maths grade. Although efforts are made to prevent identification of individuals, this can never be completely prevented. Therefore the proportionality of using such details should be considered, and in the case of very sensitive data, (even) stronger ethical checks should be performed concerning the consent basis and the privacy and security measures taken. The bottom line about data sharing is: less is more, which concerns the amount of data as well as its resolution and sensitivity. Instead of, for example, sharing a full birth date, in many cases the year of birth may suffice. In those cases, only this year of birth (a lower resolution) should be shared/used.
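
To make "less is more" concrete, here is a small Python sketch of both ideas: keeping only the fields that are needed, and lowering the resolution of a date to its year. The function name minimise, the field names, and the record itself are made up for illustration; this is not part of PEP.

```python
from datetime import date


def minimise(record, needed_fields):
    """Return a copy holding only the needed fields, with any date
    reduced to its year (a lower resolution)."""
    reduced = {}
    for field in needed_fields:
        value = record[field]
        reduced[field] = value.year if isinstance(value, date) else value
    return reduced


# Made-up record, for illustration only.
record = {"name": "Example Student", "birth_date": date(2010, 5, 17), "maths_grade": 7}
print(minimise(record, ["birth_date", "maths_grade"]))
# prints {'birth_date': 2010, 'maths_grade': 7}
```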


Back to the tutorial

Considering the rules of privacy by design, access to personal data should be limited where possible. In the previous example more details on the participants were shared than required.

PEP helps you to follow the principles of privacy by design in multiple ways. Instead of giving downloaders access to all data, we can specify access exclusively to specific columns and participants for all members in a UserGroup.

This is done by grouping columns and participants in ColumnGroups and ParticipantGroups. The UserGroup WorkshopUser from the example has access to ParticipantGroup AllExampleParticipants and ColumnGroup AllExampleColumns, but also to the smaller ParticipantGroup MaleParticipants and ColumnGroup Grades. Let us see what happens if we use those parameters in the pull command.

  • Before we continue, delete the old downloaded data folder pulled-data for a clean start.

  • Then download the new data using the pull command:

pepcli pull --participant-groups MaleParticipants --column-groups Grades

You will see that the pulled-data folder now contains only two subfolders, one for each of the two participants in ParticipantGroup MaleParticipants, and that those only contain files with the French and Maths grades. A researcher would be able to directly use this data and never even know about the names of the students. Obviously, in a real-world situation, a researcher should only have access to these groups, and not to the AllExample... groups.
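
The effect of combining a ParticipantGroup with a ColumnGroup can be pictured as selecting rows and columns from the example table. The Python sketch below is a toy model only, not how PEP works internally. It assumes that MaleParticipants contains Bob and Charles and that Grades contains French_Grade and Maths_Grade; it uses names as row keys for readability, whereas a real download is keyed by identifiers. The function name pull_view is ours.

```python
# Part of the example table from this workshop.
TABLE = {
    "Alice": {"Class": "2A", "French_Grade": 8, "Maths_Grade": 3},
    "Bob": {"Class": "3B", "French_Grade": 4, "Maths_Grade": 7},
    "Charles": {"Class": "2A", "French_Grade": 7, "Maths_Grade": 6},
    "Danielle": {"Class": "1C", "French_Grade": 5, "Maths_Grade": 8},
}

# Assumed group contents (see the text above this block).
PARTICIPANT_GROUPS = {"MaleParticipants": ["Bob", "Charles"]}
COLUMN_GROUPS = {"Grades": ["French_Grade", "Maths_Grade"]}


def pull_view(participant_group, column_group):
    """Return only the rows and columns covered by the two groups."""
    columns = COLUMN_GROUPS[column_group]
    return {
        participant: {column: TABLE[participant][column] for column in columns}
        for participant in PARTICIPANT_GROUPS[participant_group]
    }


print(pull_view("MaleParticipants", "Grades"))
# prints {'Bob': {'French_Grade': 4, 'Maths_Grade': 7}, 'Charles': {'French_Grade': 7, 'Maths_Grade': 6}}
```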

Registering new participants and uploading data

Of course we should also be able to add new participants and upload data for them. Before we can add datapoints on a new student (participant), we must first register the new participant with the command:

pepcli register id

Make sure to write down the identifier (e.g. NOLS596389441202). We will use this identifier to store data in the columns we saw above.

Storing a value from the terminal prompt

We start by storing the name of this new participant. Using the --data parameter we can add the data directly from the terminal. Make up a nice student name and use it in the following command. NOTE: if you want to store a value that contains spaces, put double quotes around it (e.g.: --data "Henk de Vries"):

pepcli store --participant <IDENTIFIER> --column Name --data <PARTICIPANTS_NAME>

where <IDENTIFIER> should be replaced with the identifier you have just written down, and <PARTICIPANTS_NAME> should be replaced by the name you made up.

This should produce an output on your terminal screen similar to:

2024-04-09 16:16:24: <info> [Application binary] 02209646710da21f7dfc
2024-04-09 16:16:24: <info> [Application configuration] No version information available. Running a local build?
{
    "id": "0A526690753CDB47163B1AA25155082A2EF9C0276DBE10D33D9A99A4F2E6C3C46BE8291E3DE5D408C52F2705B443E77E81B44A71CFA8FFE5C6E8A8878337EE5CF20AFA9D2C3B986CAC705F03815E782E1101E6EA12109932F382082A18EE9A2E98DAD5974EF81A107FF9F29D100D38311553D85ED540D5A9"
}
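
If you ever want to run the store command from a script instead of typing it, passing the arguments as a list sidesteps the quoting issue with spaces altogether. The Python sketch below only uses the parameters shown above; the helper name store_command is ours, and the identifier is the example from this workshop. Nothing is sent to PEP by this block, it only prints the command.

```python
import shlex


def store_command(identifier, column, value):
    """Build the argument list for the `pepcli store ... --data` call shown above."""
    return ["pepcli", "store", "--participant", identifier,
            "--column", column, "--data", value]


args = store_command("NOLS596389441202", "Name", "Henk de Vries")
# shlex.join shows the command with POSIX-style (single) quotes.
print(shlex.join(args))
# prints pepcli store --participant NOLS596389441202 --column Name --data 'Henk de Vries'

# To actually run it (requires pepcli on your PATH and a valid login):
#   import subprocess
#   subprocess.run(args, check=True)
```

Because each list element is handed to the program as one argument, the name stays a single value without any extra quotes.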

Storing a value from a file

It is also possible to upload files, using the --input-path parameter and filling in the path to the file.

  • Go to your Desktop and make a file called participantAddress.txt.
  • Open it and type a nice address for this new participant to live at. Save the file.
  • Store the address data with the command:
pepcli store --participant <IDENTIFIER> --column Address --input-path <PATH/TO/participantAddress.txt>
  • Use one of the above methods to upload the Class for the new participant. Make sure they are in the second class! This will be important in a bit.
  • Use the commands above to fill in values for the columns Maths_Level and Maths_Grade. Since French is not very important anyway, we will skip those columns for now.

When we retry the pull command we used earlier, we will notice that the newly created student is not part of the data in pulled-data. The reason is that the participant is not part of any ParticipantGroup yet. We will address this in the next section.

Creating a more realistic administration

In the next section we will create a more realistic UserGroup that only has access to a limited selection of the data. This UserGroup will represent a research group that is interested in the Maths levels and grades of students in their second year.

Intermezzo: Four Eyes principle

The administrative superpowers in PEP are divided over two specific types of users. One is part of the Data Administrator UserGroup and the other in the Access Administrator UserGroup. Both administrators work together and need each other to administer the PEP repository.

The reason for this separation of tasks is that a potential breach of one of the administrator roles will not immediately lead to a data leak.

In this tutorial, you will act in both roles, although this should never be the case in operational environments. The Data Administrator, as the name suggests, deals with the data itself. They create columns and put those columns in ColumnGroups. They also add new participants to ParticipantGroups. Note that in the example above, the ColumnGroup name Grades and ParticipantGroup name MaleParticipants have semantic value, but which participants or columns are in a particular group is not based on data in the repository; it is defined by the Data Administrator. The Access Administrator creates UserGroups (such as WorkshopUser in the example) and grants these UserGroups access privileges to ColumnGroups and ParticipantGroups that were created by the Data Administrator.


Naming the new groups

In the next part we will create our own UserGroups, ParticipantGroups, and ColumnGroups. Since everybody following this tutorial may be using the same PEP repository, and to avoid naming collisions, we would like you to prefix these new groups with the first letters of your name(s). We (Mathijs and Joep) will make our own groups, and prefix them with MaJo, ending up with MaJo_User, MaJo_Participants, and MaJo_Columns. Please decide on your own PREFIX now.

Creating the ColumnGroup

At the moment, you are logged in as the WorkshopUser, but we want to switch to Data Administrator.

Use the pepLogon command to log in again and switch UserGroups:

pepLogon

Now, as Data Administrator, use the following command to create the ColumnGroup. Do not forget to prefix the ColumnGroup name with your abbreviated names:

pepcli ama columnGroup create "PREFIX_MathLevelsAndGrades"

Then, add the columns we are interested in:

pepcli ama column addTo "Maths_Level" "PREFIX_MathLevelsAndGrades"
pepcli ama column addTo "Maths_Grade" "PREFIX_MathLevelsAndGrades"

Creating the ParticipantGroup

The researchers are interested in the grades of second-year students, so let's make a ParticipantGroup containing only those participants:

pepcli ama group create "PREFIX_SecondYearStudents"

The Data Administrator needs to know which participants have to be in that ParticipantGroup. To do this, we look up which class every participant is in and add them when required. Remember that we (and everyone else doing this workshop) just created a new participant, who is not yet in any ParticipantGroup. To find them, we have a special ParticipantGroup: *. The * ParticipantGroup contains ALL participants in the repository. Access to this ParticipantGroup is very limited, and it should only be used sparingly.

Also, because we are dealing with a small set of data that we do not necessarily want to keep, instead of the pull command, we use the list command. This will not write the data to files, but show them in the terminal. Using the --columns parameter twice, we can specify multiple columns for viewing. The --group-output parameter makes sure that all data belonging to a participant is grouped and shown together.

  • Use the pepcli list command to show the columns Class and ParticipantIdentifier for all participants:
pepcli list --participant-groups "*" --columns "Class" --columns "ParticipantIdentifier" --group-output
  • We now have an overview of all ParticipantIdentifiers and Class values. Look for participants that are in their second year, and use the following command to add them to our new ParticipantGroup. (If you have added four of them, you can stop, no need for busy work).
pepcli ama group addTo "PREFIX_SecondYearStudents" <FOUND_IDENTIFIER>

Don't forget to replace PREFIX with your own prefix and <FOUND_IDENTIFIER> with the identifier you have selected.

Intermezzo: End-to-end encryption

The process just used is rather tedious. It would be nice if PEP supported data-driven processes such as the one above. However, there is a very good reason why it does not. PEP is end-to-end encrypted, another principle of Privacy by Design. The whole time data is in the PEP repository, it is encrypted, meaning that the PEP servers can only perform very limited logic on it. Things such as automatically assigning participants to ParticipantGroups based on data contents are, by design, impossible. Logic like this will have to be performed or scripted by the Data Administrator themselves.
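
As an illustration of what such a script by the Data Administrator could look like, here is a minimal Python sketch. It assumes that the identifiers and Class values have already been collected into a dictionary (turning the output of pepcli list into that dictionary is left out, because its format is not described here), and that a Class value starts with the year, as in 2A. The function names and the identifiers are made up; the block only prints commands and does not run them.

```python
def second_year_identifiers(class_by_identifier):
    """Select identifiers whose Class value starts with "2" (e.g. "2A")."""
    return [identifier for identifier, class_name in class_by_identifier.items()
            if class_name.startswith("2")]


def add_to_group_commands(group, identifiers):
    """Build one `pepcli ama group addTo` argument list per identifier."""
    return [["pepcli", "ama", "group", "addTo", group, identifier]
            for identifier in identifiers]


# Made-up identifiers and classes, for illustration only.
classes = {"ID_ONE": "2A", "ID_TWO": "3B", "ID_THREE": "2A"}
for command in add_to_group_commands("PREFIX_SecondYearStudents",
                                     second_year_identifiers(classes)):
    print(" ".join(command))
# prints:
# pepcli ama group addTo PREFIX_SecondYearStudents ID_ONE
# pepcli ama group addTo PREFIX_SecondYearStudents ID_THREE
```

The selection happens on the administrator's own computer, after the data has been decrypted there, which is consistent with the servers not being able to do it.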


Intermezzo: Polymorphic Encryption

Another thing you might have noticed is that, when you downloaded data as the Data Administrator, the subfolder names were different from when you downloaded data as the "WorkshopUser". This is a very important concept within PEP. So much so, that it is the namesake of the system: Polymorphic Encryption and Pseudonymisation. Every UserGroup in the repository has its own pseudonyms for the participants. These are consistent within a UserGroup, but differ from the pseudonyms used by other UserGroups. This makes it a lot harder for members of different UserGroups to share data and link different parts of the dataset.
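
The property described above (the same pseudonym within a UserGroup, a different one in another UserGroup) can be imitated with a keyed hash. Please note that the Python sketch below is a toy illustration of that property only. It is not the cryptography that PEP uses, and the keys, the function name pseudonym, and the 12-character length are made up.

```python
import hashlib
import hmac


def pseudonym(group_key: bytes, participant: str) -> str:
    """Derive a group-specific pseudonym with a keyed hash (illustration only)."""
    digest = hmac.new(group_key, participant.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


# Made-up keys; in this toy model each UserGroup has its own secret key.
keys = {"WorkshopUser": b"key-one", "Data Administrator": b"key-two"}

a = pseudonym(keys["WorkshopUser"], "Alice")
b = pseudonym(keys["WorkshopUser"], "Alice")
c = pseudonym(keys["Data Administrator"], "Alice")
print(a == b)  # prints True: consistent within a UserGroup
print(a == c)  # prints False: differs between UserGroups
```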


Creating the UserGroup

Now that the ParticipantGroup and ColumnGroup exist, let's create the UserGroup and put its access privileges in order. To do this, we need to log in as the "Access Administrator". Again, in a real-life scenario, this would be a different person from the "Data Administrator".

  • Use the pepLogon command to log in again and switch UserGroup to "Access Administrator".

  • Create the UserGroup:

pepcli asa group create "PREFIX_downloadUser"
  • Give the newly created UserGroup access to the ColumnGroup by creating ColumnGroupAccessRules (cgar):
pepcli ama cgar create "PREFIX_MathLevelsAndGrades" "PREFIX_downloadUser" read
  • Give the newly created UserGroup access to the ParticipantGroup by creating ParticipantGroupAccessRules (pgar). The UserGroup needs both access and enumerate rights to the ParticipantGroup (given the scope of this workshop, please just accept this for now):
pepcli ama pgar create "PREFIX_SecondYearStudents" "PREFIX_downloadUser" access
pepcli ama pgar create "PREFIX_SecondYearStudents" "PREFIX_downloadUser" enumerate
  • Add Users to the UserGroup. Use your login (U-number with a capital "U", including the @ru.nl domain, e.g. U123456@ru.nl):
pepcli asa user addTo "U_NUMBER" "PREFIX_downloadUser"
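
Because the same prefixed names come back in many commands, a typo is easily made. As a convenience, the Python sketch below generates the administration commands from the sections above for a given prefix and U-number, grouped by the role that has to run them. The helper name admin_commands is ours, and the block only prints the commands; it assumes nothing beyond the commands already shown in this workshop.

```python
def admin_commands(prefix, user):
    """Return the workshop's administration commands, grouped by role."""
    column_group = f"{prefix}_MathLevelsAndGrades"
    participant_group = f"{prefix}_SecondYearStudents"
    user_group = f"{prefix}_downloadUser"
    return {
        "Data Administrator": [
            ["pepcli", "ama", "columnGroup", "create", column_group],
            ["pepcli", "ama", "column", "addTo", "Maths_Level", column_group],
            ["pepcli", "ama", "column", "addTo", "Maths_Grade", column_group],
            ["pepcli", "ama", "group", "create", participant_group],
        ],
        "Access Administrator": [
            ["pepcli", "asa", "group", "create", user_group],
            ["pepcli", "ama", "cgar", "create", column_group, user_group, "read"],
            ["pepcli", "ama", "pgar", "create", participant_group, user_group, "access"],
            ["pepcli", "ama", "pgar", "create", participant_group, user_group, "enumerate"],
            ["pepcli", "asa", "user", "addTo", user, user_group],
        ],
    }


for role, commands in admin_commands("MaJo", "U123456@ru.nl").items():
    print(f"# as {role}")
    for command in commands:
        print(" ".join(command))
```

Adding the found participants to the ParticipantGroup is not included, since that step depends on the identifiers you looked up.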

Checking progress

Everything should be in order. To check, we can query the repository for the status of our administration. There will be quite a bit of text here. If it makes things easier, you can add > some_filename.txt to the end of the command to write the results to a .txt file, and then open that file in a text editor, for easier searching.

  • Display an overview of the ColumnGroups, ParticipantGroups, and Access Rules (AMA = Access Manager Administration):
pepcli ama query

or:

pepcli ama query > amaQuery.txt

The output comprises a few sections:

  1. Columns
  2. ColumnGroups
  3. ColumnGroupsAccessRules
  4. ParticipantGroups
  5. ParticipantGroupAccessRules

Look in all sections and see whether or not your newly created administration is present. You should also be able to see the administration made by your fellow workshoppers.
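
If you wrote the output to a file, a few lines of Python can pick out the lines that mention your prefix. This is a minimal sketch: the function name lines_mentioning is ours, and the sample text is made up and does not reproduce the real layout of the query output.

```python
def lines_mentioning(text, prefix):
    """Return (line number, line) for every line of text that contains prefix."""
    return [(number, line)
            for number, line in enumerate(text.splitlines(), start=1)
            if prefix in line]


# Made-up text standing in for the contents of amaQuery.txt.
sample = "first line\nsomething MaJo_MathLevelsAndGrades\nother line\nMaJo_SecondYearStudents here"
for number, line in lines_mentioning(sample, "MaJo_"):
    print(number, line)
# prints:
# 2 something MaJo_MathLevelsAndGrades
# 4 MaJo_SecondYearStudents here
```

For the real file, you would read its contents first (for example with open("amaQuery.txt").read()) and pass that text in along with your own prefix.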

  • Display an overview of the UserGroups (ASA = Authentication Server Administration). This output is much less elaborate:
pepcli asa query
  • Check whether your UserGroup exists and whether your U-number has been added to it.

Downloading the specific data

We now have a UserGroup that only has specific access to a small part of the data repository.

  • Use the pepLogon command to log in again and switch UserGroup to "PREFIX_downloadUser".
  • If needed, delete the old pulled-data folder.
  • Use the pull command to download the data, using the handy parameter --all-accessible:
pepcli pull --all-accessible

You should now only have the Maths levels and grades of the second-year students. We managed this without the downloader even having to know the names of the ColumnGroups and ParticipantGroups that are used behind the scenes.

How could PEP be used in your project?

What applications within your project could benefit from storing and sharing data using PEP? What kind of users do you expect could benefit from that? What would help you implement PEP in your project?

Please write this down on the provided form.

Wrapping up

In this workshop we have explored the basics of PEP. First, we downloaded the software and got it installed. We then logged in as a premade User and were able to download some data. We then registered a new participant and stored data for it. We made it possible for a new research group to download a very specific subset of the data by creating a new UserGroup, ParticipantGroup, and ColumnGroup, and made ColumnGroupAccessRules and ParticipantGroupAccessRules pertaining to those groups. Logging in as the new group, we saw that only the data that we specified was downloaded using a handy command, not needing to know anything about the rest of the administration of the repository.

Thank you for your participation in this workshop! If you have any questions or comments, please let us know.