Workshop Nolai
General approach and goals
In this workshop you will learn the basics of PEP. Step by step, we will guide you through the system. After completing this workshop you will have a basic understanding of:
- Downloading the software and starting
- Logging in
- Downloading and uploading data
- Repository management
For any questions, please contact support@pep.cs.ru.nl
Downloading the software and starting
For Windows
Download the installer from:
Windows Installer
- Click the downloaded file and go through the installer
- Uncheck "PEP Assessor starten" (Dutch for "Start PEP Assessor") and finish
- Then, open your Start menu, go to All Apps, open the PEP folder, and click on PEP Opdrachtregel (Dutch for "PEP Command Line")
For MacOS
Download the installer from:
MacOS Installer
- Click the downloaded file to unzip it, and go through the installer
- Open the Go menu, then Applications (or press Shift+Cmd+A), and double-click on PEP Command Line Interface (nolai-sandbox acc)
You should now have a PEP command prompt window opened. We will refer to this window as the terminal.
Disclaimer: Feedback from the system
At times, the PEP system returns output that seems to indicate that a process has run with errors. Sometimes, this output can safely be ignored. The PEP team is working on solutions for this incorrect output. For now, please ignore the following output:
- When logging in, the terminal might show:
Qt: Session management error: Could not open network socket
- When using the pepcli (explained later), the terminal might show:
<error> [Cli] Unexpected problem shutting down SSL streams: boost::system::error_code: uninitialized (SSL routines) | Forcefully shutting down.
Logging in
The terminal is where all interaction with the PEP repository happens. At the moment we cannot do much, as we are not yet logged in. To log in, type pepLogon into your terminal and hit Enter. This will open an internet browser window with the SURFconext login screen. After logging in, you will see all UserGroups to which you have access. More on UserGroups later. For now, we will use the UserGroup WorkshopUser. After you click that UserGroup, you may close the browser tab/window and return to the terminal.
- In your terminal, type the pepLogon command:
pepLogon
You are now logged in. The terminal should show you the message: Wrote enrollment result (keys) to "LOCATION/ON/YOUR/COMPUTER/ClientKeys.json"
This file contains keys that PEP will automatically use.
Looking around
Now that we are logged in, we can have a first look around in the PEP repository. For this workshop, we have already prepared some data concerning the students Alice, Bob, Charles, and Danielle. We will refer to all data subjects (in this case the students) as participants. First, let's download the data for these participants with the following command (which we will explain later):
pepcli pull --participant-groups AllExampleParticipants --column-groups AllExampleColumns
After the process has finished, you should have a (new) folder called pulled-data in your current working directory. Open it in a file explorer window and look around. You will see that we have four subfolders with the prefix NOLSU and then a number. These subfolder names are the participant identifiers for this specific UserGroup. Each data point is stored in a separate file. Look around in the subfolders.
We can visualise these data points in a table:
Name | Address | Class | French_Level | French_Grade | Maths_Level | Maths_Grade |
Alice | Fruitstraat 12 | 2A | A | 8 | A | 3 |
Bob | Lange Laan 3 | 3B | A | 4 | C | 7 |
Charles | Lange Laan 3 | 2A | C | 7 | D | 6 |
Danielle | Appelstraat 7 | 1C | B | 5 | A | 8 |
The table visualises the data points in rows (the participants Alice, Bob, Charles and Danielle) and columns (Name, Address, Class, etc.).
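To rebuild a table like the one above from a pulled-data folder, a short script could walk the subfolders. The following is a minimal sketch in Python and is not part of PEP. It assumes that each participant subfolder contains one file per column, that the file name (without extension) is the column name, and that the file content is the value. Check the actual layout of your own pulled-data folder before relying on it.

```python
from pathlib import Path


def load_pulled_data(root):
    """Return {participant_folder_name: {file_stem: file_text}} for a pulled-data folder."""
    table = {}
    for participant_dir in sorted(Path(root).iterdir()):
        if not participant_dir.is_dir():
            continue  # skip any loose files next to the participant folders
        row = {}
        for data_file in sorted(participant_dir.iterdir()):
            if data_file.is_file():
                # Assumption: the file name (without extension) names the column.
                row[data_file.stem] = data_file.read_text(encoding="utf-8").strip()
        table[participant_dir.name] = row
    return table
```

Calling load_pulled_data("pulled-data") from the folder where you ran the pull command would then give one dictionary entry per participant subfolder.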
Imagine you are a researcher studying the achievements of students. This data would be useful, but it also contains a lot of information that you do not need. You have no need for the actual names of the students, let alone the addresses where they live. As long as you can consistently identify single participants yourself, you have enough information.
Intermezzo: Privacy by design
The ground rule for working with personal data (and all data in the table above is personal data, not just the names and the addresses) is that the person working with these data should only have access to the data that is required for the purpose. This concept is called data minimization.
Other important aspects to keep in mind are proportionality and purpose limitation. For example: it might be valuable to register mental conditions of participants when performing a study, but this has a much larger impact on the privacy of an individual than, for example, sharing a Maths grade. Although efforts are made to prevent identification of individuals, this can never be completely prevented. Therefore the proportionality of using such details should be considered, and in the case of very sensitive data, (even) stronger ethical checks should be performed concerning the consent basis and the privacy and security measures taken. But the bottom line about data sharing is: less is more, which concerns both the amount of data and the resolution and sensitivity of data. Instead of, for example, sharing a full birth date, in many cases the year of birth may suffice. In those cases, only this year of birth (a lower resolution) should be shared/used.
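As an illustration of "less is more", the following Python sketch drops every field that is not needed and lowers the resolution of a birth date to a year before the data is shared. It is only a sketch: the field names birth_date and birth_year are hypothetical and are not columns in the workshop repository.

```python
from datetime import date


def minimise(record, needed_fields):
    """Keep only the needed fields and replace a full birth date by the year."""
    reduced = {key: value for key, value in record.items() if key in needed_fields}
    if isinstance(reduced.get("birth_date"), date):
        # Lower the resolution: share the year only, not the day and month.
        reduced["birth_year"] = reduced.pop("birth_date").year
    return reduced


# Made-up example record, not taken from the repository.
record = {"name": "Example Student", "birth_date": date(2010, 5, 17), "Maths_Grade": 6}
shared = minimise(record, {"birth_date", "Maths_Grade"})
```

Here shared contains only the Maths grade and the year of birth; the name and the exact birth date never leave the source.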
Back to the tutorial
Considering the rules of privacy by design, access to personal data should be limited where possible. In the previous example more details on the participants were shared than required.
PEP helps you to follow the principles of privacy by design in multiple ways. Instead of giving downloaders access to all data, we can specify access exclusively to specific columns and participants for all members in a UserGroup.
This is done by grouping columns and participants in ColumnGroups and ParticipantGroups. The UserGroup WorkshopUser from the example has access to ParticipantGroup AllExampleParticipants and ColumnGroup AllExampleColumns, but also to the smaller ParticipantGroup MaleParticipants and ColumnGroup Grades. Let us see what happens if we use those parameters in the pull command.
- Before we continue, delete the old downloaded data folder pulled-data for a clean start.
- Then download the new data using the pull command:
pepcli pull --participant-groups MaleParticipants --column-groups Grades
You see that we now only have two subfolders in the pulled-data folder, one for each of the two participants in ParticipantGroup MaleParticipants, and those only contain files with the French and Maths grades. A researcher would be able to use this data directly and never even know the names of the students. Obviously, in a real-world situation, a researcher should only have access to these smaller groups, and not to the AllExample... groups.
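The effect of combining a ParticipantGroup with a ColumnGroup can be mimicked on the example table with a few lines of Python. This is a toy model, not how PEP works internally. It assumes that MaleParticipants contains Bob and Charles and that Grades contains French_Grade and Maths_Grade, and it uses the student names as row keys purely for readability (a real download would show participant identifiers instead).

```python
# Part of the example table from this workshop.
TABLE = {
    "Alice": {"Class": "2A", "French_Grade": 8, "Maths_Grade": 3},
    "Bob": {"Class": "3B", "French_Grade": 4, "Maths_Grade": 7},
    "Charles": {"Class": "2A", "French_Grade": 7, "Maths_Grade": 6},
    "Danielle": {"Class": "1C", "French_Grade": 5, "Maths_Grade": 8},
}

# Assumed group contents (see the text above the code).
COLUMN_GROUPS = {"Grades": ["French_Grade", "Maths_Grade"]}
PARTICIPANT_GROUPS = {"MaleParticipants": ["Bob", "Charles"]}


def pull(table, participant_group, column_group):
    """Return only the rows and columns that belong to the two named groups."""
    columns = COLUMN_GROUPS[column_group]
    return {
        participant: {column: table[participant][column] for column in columns}
        for participant in PARTICIPANT_GROUPS[participant_group]
    }


subset = pull(TABLE, "MaleParticipants", "Grades")
```

The result holds two rows with two values each, which mirrors the two subfolders with two grade files.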
Registering new participants and uploading data
Of course we should also be able to add new participants and upload data for them. Before we can add datapoints on a new student (participant), we must first register the new participant with the command:
pepcli register id
Make sure to write down the identifier (e.g. NOLS596389441202). We will use this identifier to store data in the columns we saw above.
Storing a value from the terminal prompt
We start by storing the name of this new participant. Using the --data parameter we can add the data directly from the terminal. Make up a nice student name and use it in the following command. NOTE: if you want to store a value that contains spaces, put double quotes around it (e.g.: --data "Henk de Vries"):
pepcli store --participant <IDENTIFIER> --column Name --data <PARTICIPANTS_NAME>
where <IDENTIFIER> should be replaced with the identifier you have just written down, and <PARTICIPANTS_NAME> should be replaced by the name you made up.
This should produce an output on your terminal screen similar to:
2024-04-09 16:16:24: <info> [Application binary] 02209646710da21f7dfc
2024-04-09 16:16:24: <info> [Application configuration] No version information available. Running a local build?
{
"id": "0A526690753CDB47163B1AA25155082A2EF9C0276DBE10D33D9A99A4F2E6C3C46BE8291E3DE5D408C52F2705B443E77E81B44A71CFA8FFE5C6E8A8878337EE5CF20AFA9D2C3B986CAC705F03815E782E1101E6EA12109932F382082A18EE9A2E98DAD5974EF81A107FF9F29D100D38311553D85ED540D5A9"
}
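If you would rather run the store command from a script than type it by hand, something like the following Python sketch could work. It assumes that pepcli can be found on the PATH, as it is in the PEP command prompt, and it only uses the parameters shown above. Because the arguments are passed as a list, a value with spaces needs no extra quoting.

```python
import subprocess


def store_args(identifier, column, data):
    """Build the argument list for the pepcli store command with --data."""
    return ["pepcli", "store", "--participant", identifier,
            "--column", column, "--data", data]


def run_store(identifier, column, data):
    """Run the command; this only works where pepcli is available on the PATH."""
    # A list of arguments is passed without a shell, so spaces are kept as-is.
    return subprocess.run(store_args(identifier, column, data), check=True)
```

For example, run_store("NOLS596389441202", "Name", "Henk de Vries") would store the name for the example identifier used above.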
Storing a value from a file
It is also possible to upload files, using the --input-path parameter and filling in the path to the file.
- Go to your Desktop and make a file called participantAddress.txt.
- Open it and type a nice address for this new participant to live at. Save the file.
- Store the address data with the command:
pepcli store --participant <IDENTIFIER> --column Address --input-path <PATH/TO/participantAddress.txt>
- Use one of the above methods to upload the Class for the new participant. Make sure they are in a second-year class (such as 2A in the table above)! This will be important in a bit.
- Use the commands above to fill in values for the columns Maths_Level and Maths_Grade. Since French is not very important anyway, we will skip those columns for now.
When we retry the pull command we used earlier, we will notice that the newly created student is not part of the pulled-data folder. The reason is that the participant is not yet part of any ParticipantGroup. We will address this in the next section.
Creating a more realistic administration
In the next section we will create a more realistic UserGroup that only has access to a limited selection of the data. This UserGroup will represent a research group that is interested in the Maths levels and grades of students in their second year.
Intermezzo: Four Eyes principle
The administrative superpowers in PEP are divided over two specific types of users. One belongs to the Data Administrator UserGroup and the other to the Access Administrator UserGroup. Both administrators work together and need each other to administer the PEP repository.
The reason for this separation of tasks is that a potential breach of one of the administrator roles will not immediately lead to a data leak.
In this tutorial, you will act in both roles, although this should never be the case in operational environments. The Data Administrator, as the name suggests, deals with the data itself. They create columns and put those columns in ColumnGroups. They also add new participants to ParticipantGroups. Note that in the example above, the ColumnGroup name Grades and ParticipantGroup name MaleParticipants have semantic value, but which participants or columns are in a particular group is not derived from data in the repository; it is defined by the Data Administrator. The Access Administrator creates UserGroups (such as WorkshopUser in the example) and grants these UserGroups access privileges to ColumnGroups and ParticipantGroups that were created by the Data Administrator.
Naming the new groups
In the next part we will create our own UserGroups, ParticipantGroups, and ColumnGroups. Since everybody following this tutorial may be using the same PEP repository, and to avoid naming collisions, we would like you to prefix these new groups with the first letters of your name(s). We (Mathijs and Joep) will make our own groups, and prefix them with MaJo, ending up with MaJo_User, MaJo_Participants, and MaJo_Columns. Please decide on your own PREFIX now.
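If you like, the naming convention can be written down as a tiny Python helper. This is just an illustration of the MaJo example: taking the first two letters of each name is our reading of "first letters", and the helper names are made up for this sketch.

```python
def make_prefix(*first_names, letters=2):
    """Join the first letters of each name, e.g. Mathijs and Joep give MaJo."""
    return "".join(name[:letters].capitalize() for name in first_names)


def group_names(prefix):
    """Return the three example group names for a prefix."""
    return [f"{prefix}_User", f"{prefix}_Participants", f"{prefix}_Columns"]


prefix = make_prefix("Mathijs", "Joep")
names = group_names(prefix)
```

Any prefix works, as long as nobody else in the workshop is using the same one.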
Creating the ColumnGroup
At the moment, you are logged in as the WorkshopUser, but we want to switch to Data Administrator.
Use the pepLogon command again to log in and switch UserGroups:
pepLogon
Now, as Data Administrator, use the following commands to create the ColumnGroup. Do not forget to replace PREFIX in the ColumnGroup name with your own prefix:
pepcli ama columnGroup create "PREFIX_MathLevelsAndGrades"
Then, add the columns we are interested in:
pepcli ama column addTo "Maths_Level" "PREFIX_MathLevelsAndGrades"
pepcli ama column addTo "Maths_Grade" "PREFIX_MathLevelsAndGrades"
Creating the ParticipantGroup
The researchers are interested in the grades of second-year students, so let's make a ParticipantGroup containing only those participants:
pepcli ama group create "PREFIX_SecondYearStudents"
The Data Administrator needs to know which participants belong in that ParticipantGroup. To find out, we look up which class every participant is in and add the second-year students. Remember that we (and everyone else doing this workshop) just created a new participant, who is not yet in any ParticipantGroup. To find them, we have a special ParticipantGroup: *. The * ParticipantGroup contains ALL participants in the repository. Access to this ParticipantGroup is very limited, and it should be used only sparingly.
Also, because we are dealing with a small set of data that we do not necessarily want to keep, we use the list command instead of the pull command. This will not write the data to files, but show them in the terminal. Using the --columns parameter twice, we can specify multiple columns for viewing. The --group-output parameter makes sure that all data belonging to a participant is grouped and shown together.
- Use the pepcli list command to display the columns Class and ParticipantIdentifier for all participants:
pepcli list --participant-groups "*" --columns "Class" --columns "ParticipantIdentifier" --group-output
- We now have an overview of all ParticipantIdentifiers and Class values. Look for participants that are in their second year, and use the following command to add them to our new ParticipantGroup. (If you have added four of them, you can stop, no need for busy work).
pepcli ama group addTo "PREFIX_SecondYearStudents" <FOUND_IDENTIFIER>
Don't forget to replace PREFIX with your own prefix and <FOUND_IDENTIFIER> with the identifier you have selected.
Intermezzo: End-to-end encryption
The process just used is rather tedious. It would be nice if PEP supported data-driven processes such as the one above. However, there is a very good reason it does not. PEP is end-to-end encrypted, another principle of Privacy by Design. The whole time data is in the PEP repository, it is encrypted, meaning that the PEP servers can only perform very limited logic on it. Things such as automating ParticipantGroup assignment based on data contents are impossible by design. Logic like this will have to be performed or scripted by the Data Administrator themselves.
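Such a script could look roughly like the Python sketch below. It does not talk to PEP at all: it takes a mapping from participant identifier to Class value, which you would have to fill in yourself from the list output, and it only produces the addTo commands used earlier as text. It assumes that the Class value starts with the year, as in 2A, and the identifiers shown are placeholders.

```python
def second_year_identifiers(class_by_identifier):
    """Return the identifiers whose Class value starts with '2' (e.g. '2A')."""
    return sorted(
        identifier
        for identifier, class_value in class_by_identifier.items()
        if class_value.strip().startswith("2")
    )


def add_to_commands(group, identifiers):
    """Build one 'pepcli ama group addTo' command line per identifier."""
    return [f'pepcli ama group addTo "{group}" {identifier}' for identifier in identifiers]


# Placeholder identifiers; use the ones from your own list output.
classes = {"ID_A": "2A", "ID_B": "3B", "ID_C": "2A", "ID_D": "1C"}
commands = add_to_commands("PREFIX_SecondYearStudents", second_year_identifiers(classes))
```

The selection logic runs on the Data Administrator's own machine, on data they were allowed to decrypt, which is exactly where end-to-end encryption requires it to run.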
Intermezzo: Polymorphic Encryption
Another thing you might have noticed is that, when you listed data as the Data Administrator, the participant identifiers were different from the subfolder names you saw when you downloaded data as WorkshopUser. This is a very important concept within PEP, so much so that the system is named after it: Polymorphic Encryption and Pseudonymisation. Every UserGroup in the repository has its own pseudonyms for the participants. These are consistent within a UserGroup, but differ from the pseudonyms used by other UserGroups. This makes it much harder for members of different UserGroups to share data and link different parts of the dataset.
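The behaviour of these pseudonyms (stable within a UserGroup, different between UserGroups) can be imitated with a keyed hash. The Python sketch below is a toy stand-in only: PEP derives its pseudonyms with polymorphic encryption, not with an HMAC, and the keys and participant name here are invented for the example.

```python
import hashlib
import hmac


def toy_pseudonym(participant, user_group_key):
    """Derive a pseudonym that depends on both the participant and the group key."""
    digest = hmac.new(user_group_key, participant.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16].upper()


# Invented per-UserGroup keys, for illustration only.
workshop_key = b"key-for-WorkshopUser"
data_admin_key = b"key-for-Data-Administrator"

same_group_twice = (toy_pseudonym("Alice", workshop_key), toy_pseudonym("Alice", workshop_key))
other_group = toy_pseudonym("Alice", data_admin_key)
```

Within one group the pseudonym for Alice is always the same, while the other group sees a value that cannot be matched to it without the keys.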
Creating the UserGroup
Now that the ParticipantGroup and ColumnGroup exist, let's create the UserGroup and put its access privileges in order. To do this, we need to log in as the "Access Administrator". Again, in a real-life scenario, this would be a different person from the "Data Administrator".
- Use the pepLogon command again to log in and switch UserGroup to "Access Administrator".
- Create the UserGroup:
pepcli asa group create "PREFIX_downloadUser"
- Give the newly created UserGroup access to the ColumnGroup by creating ColumnGroupAccessRules (cgar):
pepcli ama cgar create "PREFIX_MathLevelsAndGrades" "PREFIX_downloadUser" read
- Give the newly created UserGroup access to the ParticipantGroup by creating ParticipantGroupAccessRules (pgar), giving the UserGroup both access and enumerate rights to the ParticipantGroup (given the scope of this workshop, please just accept this for now):
pepcli ama pgar create "PREFIX_SecondYearStudents" "PREFIX_downloadUser" access
pepcli ama pgar create "PREFIX_SecondYearStudents" "PREFIX_downloadUser" enumerate
- Add Users to the UserGroup. Use your login (U-number with a capital "U", including the @ru.nl domain, e.g. U123456@ru.nl):
pepcli asa user addTo "U_NUMBER" "PREFIX_downloadUser"
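To see how the groups and access rules fit together, here is a small Python model of the administration we just created. It is a simplification for illustration and not PEP's actual implementation: a UserGroup may read a value only if it has a read rule on a ColumnGroup containing the column and an access rule on a ParticipantGroup containing the participant. The enumerate right is left out of the model, and the identifiers are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Administration:
    column_groups: dict = field(default_factory=dict)       # ColumnGroup -> set of columns
    participant_groups: dict = field(default_factory=dict)  # ParticipantGroup -> set of identifiers
    cgars: set = field(default_factory=set)                 # (ColumnGroup, UserGroup, mode)
    pgars: set = field(default_factory=set)                 # (ParticipantGroup, UserGroup, mode)

    def may_read(self, user_group, participant, column):
        """True if the UserGroup has rules covering both the column and the participant."""
        column_ok = any(
            (group, user_group, "read") in self.cgars
            for group, columns in self.column_groups.items()
            if column in columns
        )
        participant_ok = any(
            (group, user_group, "access") in self.pgars
            for group, members in self.participant_groups.items()
            if participant in members
        )
        return column_ok and participant_ok


admin = Administration(
    column_groups={"PREFIX_MathLevelsAndGrades": {"Maths_Level", "Maths_Grade"}},
    participant_groups={"PREFIX_SecondYearStudents": {"ID_A"}},
    cgars={("PREFIX_MathLevelsAndGrades", "PREFIX_downloadUser", "read")},
    pgars={("PREFIX_SecondYearStudents", "PREFIX_downloadUser", "access")},
)
```

In this model, admin.may_read("PREFIX_downloadUser", "ID_A", "Maths_Grade") is allowed, while the Name column or a participant outside the group is not.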
Checking progress
Everything should be in order. To check, we can query the repository for the status of our administration. There will be quite a bit of text here. If it makes things easier, you can add > some_filename.txt to the end of the command to write the results to a .txt file, and then open that file in a text editor, for easier searching.
- Display an overview of the ColumnGroups, ParticipantGroups, and Access Rules (AMA = Access Manager Administration):
pepcli ama query
or:
pepcli ama query > amaQuery.txt
The output comprises a few sections:
- Columns
- ColumnGroups
- ColumnGroupAccessRules
- ParticipantGroups
- ParticipantGroupAccessRules
Look in all sections and see whether or not your newly created administration is present. You should also be able to see the administration made by your fellow workshoppers.
- Display an overview of the UserGroups (ASA = Authentication Server Administration). This output is much less elaborate:
pepcli asa query
- Check whether your UserGroup exists and whether your U-number has been added to it.
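If you wrote the query output to a file such as amaQuery.txt, a few lines of Python can pick out the lines that mention your prefix. This sketch makes no assumptions about the layout of the output beyond it being plain text. It does assume the file can be read as UTF-8; if your terminal wrote it in another encoding, adjust the encoding argument.

```python
from pathlib import Path


def lines_mentioning(path, prefix):
    """Return the stripped lines of a text file that contain the given prefix."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if prefix in line]
```

For example, lines_mentioning("amaQuery.txt", "MaJo_") would list every line in which one of the MaJo groups appears.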
Downloading the specific data
We now have a UserGroup that only has specific access to a small part of the data repository.
- Use the pepLogon command again to log in and switch UserGroup to "PREFIX_downloadUser".
- If needed, delete the old 'pulled-data' folder.
- Use the pull command to download the data, using the handy parameter --all-accessible:
pepcli pull --all-accessible
You should now only have the Maths grades and levels of the second-year students. We managed this without the downloader even having to know the names of the ColumnGroups and ParticipantGroups that are used behind the scenes.
How could PEP be used in your project?
What applications within your project could benefit from storing and sharing data using PEP? What kind of users do you expect could benefit from that? What would help you implement PEP in your project?
Please write this down on the provided form.
Wrapping up
In this workshop we have explored the basics of PEP. First, we downloaded the software and got it installed. We then logged in as a premade User and were able to download some data. We then registered a new participant and stored data for it. We made it possible for a new research group to download a very specific subset of the data by creating a new UserGroup, ParticipantGroup, and ColumnGroup, and made ColumnGroupAccessRules and ParticipantGroupAccessRules pertaining to those groups. Logging in as the new group, we saw that only the data that we specified was downloaded using a handy command, not needing to know anything about the rest of the administration of the repository.
Thank you for your participation in this workshop! If you have any questions or comments, please let us know.