De-anonymising Ukraine university entrance test results

Vadym Hudyma - May 26, 2017 in Data Blog

Authors: Vadym Hudyma, Pavlo Myronov. Part 2 of a series on Ukrainian student data.

Introduction

External Independent Evaluation Testing (EIT) is a single exam used nationwide for admission to all public universities.

As detailed in our previous article, the release of a poorly anonymised dataset by the organisation in charge of the External Independent Evaluation Testing (EIT) resulted in serious risks to the privacy of Ukrainian students. One of those risks was the unwanted mass disclosure of personal information with the help of a single additional dataset. We detail below how we reached our results.

The EIT dataset contains the following dimensions:

  • Unique identifier for every person
  • Year of birth
  • Sex
  • Test scores for every subject taken by the student (for those who scored 95% or more of the possible points, exact to decimals)
  • Place where the tests were taken

The dataset we used to de-anonymise the EIT results, on the other hand, was collected from the website vstup.info and gives us access to the following elements:

  • family name and initials of the applicant (also referred to below as the name)
  • university where the applicant was accepted
  • the combined EIT result scores per required subject, with a multiplier applied to each subject by the universities, depending on their priorities.

At first glance, since every university uses its own list of subject-specific multipliers to compute applicants’ combined EIT results, it should be impossible to recover their exact EIT scores, or to find exact score matches in the EIT dataset.

The only problem with that reasoning is that the law requires all the multipliers to be published on the same website as part of a corruption-prevention mechanism. This is a good thing in itself. But it also gives attackers enough data to calculate exact matches between the two datasets.

How we did it

Our calculations were based on the assumption that every EIT participant applied to universities in their local region. Of course, this assumption does not hold for every participant, but it usually does, and it is also one of the easiest ways to reduce the complexity of the calculations.

For every Ukrainian region, we isolated in the EIT dataset the subset of local test-takers and calculated the EIT ratings they would have if they had applied to every speciality at local universities. Then we merged this dataset of “potential enrollees” with the real enrollees’ dataset from the website vstup.info, which contains the real names of enrollees and their final ratings (i.e. scores multiplied by the subject- and university-specific multipliers), joining on university, speciality, and rating.
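The rating computation and the join can be sketched as follows. This is a minimal Python illustration (our actual analysis was done in R); all ids, names, scores and multipliers below are invented for the example:

```python
def combined_rating(scores, multipliers):
    """Aggregate EIT rating: sum of subject scores weighted by the
    speciality-specific multipliers published on vstup.info."""
    return round(sum(scores[subj] * mult for subj, mult in multipliers.items()), 2)

# Subset of local test-takers from the EIT dataset (id -> subject scores).
eit_local = {
    "id-001": {"math": 190.5, "ukr": 185.0},
    "id-002": {"math": 171.3, "ukr": 160.2},
}

# Multipliers for one speciality at one local university (invented).
multipliers = {"math": 0.5, "ukr": 0.3}

# "Potential enrollees": the rating each local test-taker would have
# if they had applied to this speciality.
potential = {pid: combined_rating(scores, multipliers)
             for pid, scores in eit_local.items()}

# Real enrollees for the same speciality, as scraped from vstup.info.
enrollees = [("Shevchenko T.H.", combined_rating(eit_local["id-001"], multipliers))]

# Join on the exact rating value: pairs (EIT id, name) -- the set A1B1.
a1b1 = [(pid, name) for pid, rating in potential.items()
        for name, enrollee_rating in enrollees if rating == enrollee_rating]
```

Because the published scores are exact to decimals, an exact equality join on the computed rating is enough to link the two datasets.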

By joining these datasets for every region, we obtained a first set of pairs in which test-takers’ ids correspond to enrollees’ names (dataset A1B1). In the resulting set, the number of EIT participants that correspond to only one name, i.e. those who can be unambiguously identified, is 20 637 (7.7% of all participants).

To expand the scope of our de-anonymisation, we used the fact that most enrollees try to increase their chances of getting accepted by applying to several universities. We therefore tested all pairs from the first merged dataset (A1B1) against the whole dataset of enrollees (B1), counting the number of matches by final rating for every pair. We then kept the pairs that were matched by at least two unique EIT rating values: if the same match occurs in two cases with different university/speciality coefficients forming the aggregate EIT rating, it is much less likely to be a false positive.
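The filtering step can be sketched like this (again a hypothetical Python illustration with invented names and ratings, not our actual R code):

```python
from collections import defaultdict

# All (name, rating) records scraped from vstup.info (set B1), across
# every speciality each person applied to.
b1 = [
    ("Shevchenko T.H.", 150.75),
    ("Shevchenko T.H.", 162.40),  # same person, different multipliers
    ("Koval O.P.", 150.75),
]

# Candidate pairs from the first join (A1B1), with the rating values the
# EIT id would produce under each speciality's multipliers.
candidate_ratings = {
    ("id-001", "Shevchenko T.H."): {150.75, 162.40},
    ("id-002", "Koval O.P."): {150.75},
}

# Count how many distinct rating values each pair matches on.
matches = defaultdict(set)
for (pid, name), ratings in candidate_ratings.items():
    for b_name, b_rating in b1:
        if b_name == name and b_rating in ratings:
            matches[(pid, name)].add(b_rating)

# Keep only pairs confirmed by >= 2 unique rating values (set C1).
c1 = {pair for pair, ratings in matches.items() if len(ratings) >= 2}
```

A pair confirmed under two independent sets of coefficients is very unlikely to be a coincidence, which is why the ≥ 2 threshold sharply cuts false positives.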

We thus formed a dataset in which each EIT participant’s id corresponds to one or more names, with the number of unique EIT rating values recorded for every correspondence (C1). Here, the number of EIT participants (unique identifiers from A1) that correspond to only one name, with more than one unique aggregate rating, is 50 845 (18.97%).

We also noted the possibility of false positives, namely situations where the same family name and initials from the enrollees dataset (B1) correspond to several ids from the EIT participants dataset (A1). This doesn’t necessarily mean we guessed a test-taker’s family name wrongly, especially in the case of a rather common family name: the more widespread a name is, the higher the probability that we have correctly identified several EIT participants sharing it. Still, it leaves the possibility of some false positives.

To separate the most reliable results from the others, we identified correspondences with unique names and counted the records where a unique id corresponds to a unique name.
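This reliability filter amounts to keeping only one-to-one correspondences between ids and names, sketched here with invented data:

```python
from collections import Counter

# (EIT id, enrollee name) correspondences from the matching step.
pairs = [
    ("id-001", "Shevchenko T.H."),
    ("id-002", "Koval O.P."),
    ("id-003", "Koval O.P."),   # common surname: two ids share one name
]

id_counts = Counter(pid for pid, _ in pairs)
name_counts = Counter(name for _, name in pairs)

# Most reliable matches: both the id and the name appear exactly once.
reliable = [(pid, name) for pid, name in pairs
            if id_counts[pid] == 1 and name_counts[name] == 1]
```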

Consequently, the results of our de-anonymization can be described by the following table.

Assumptions                                                                           | De-anonymised EIT participants with unique names | De-anonymised EIT participants (regardless of name uniqueness)
1) Every enrollee applied to at least one university in his/her region.               | 8 231 (3.07%)                                    | 20 637 (7.7%)
1) + Every enrollee applied to at least two specialities with different coefficients. | 31 418 (11.42%)                                  | 50 845 (18.97%)

In each row, false positive results can occur only if some of the enrollees broke basic assumption(s).

So far, we have been speaking about unambiguous identification of test-takers. But even narrowing the results down to a small number of possible variants makes subsequent identification, using any kind of background knowledge or other available datasets, trivial. In the end, we were able to narrow down to 10 or fewer possible name-variants for 43 825 EIT participants, and to only 2 possible name-variants for 19 976 test-takers.

Our method yields an assumed name (or names) for every EIT participant who applied to a university in the region where they took their tests and applied to at least two specialities with different multipliers. Though not 100% free from false positives, the results are precise enough to show that the external testing dataset provides all the identifiers necessary to de-anonymise a significant share of test-takers. Of course, those with a personal or business interest, rather than a purely research interest, in test-takers’ identities or enrollees’ external testing results would find multiple ways to make the de-anonymisation even more precise and wider in scope.

(NOTE: For example, one could cluster each speciality’s rating coefficients to reduce the number of calculations without relying on our basic assumption. It is also possible to take into account the locations of EIT centres and assume that test-takers would probably try to enrol at universities in nearby regions, or to estimate the real popularity of names among enrollees using the social network “Vkontakte” API, and so on.)

Using comparatively simple R algorithms and an old HP laptop, we found 20 637 exact matches (7.7% of all EIT participants), re-identifying the individuals behind anonymised records. More than 40 thousand further participants were effectively de-anonymised with less-than-perfect precision, but more than good enough for a motivated attacker.

What could be done about it?

After conducting our initial investigation, we reached out to CEQA for comment. This was their response:

“Among other things, Ukraine struggles with a high level of public distrust of government institutions. By publishing information about standardized external assessment results and the work we deliver, we try to lead by example and show our openness and readiness for public scrutiny…

At the same time, we understand that Ukraine has not yet formed a mature culture of robust data analysis and interpretation. Therefore, it is essential to be aware of all risks and think in advance about ways to mitigate adverse impact on individuals and the education system in general.”

So what could be done better with this particular dataset to mitigate at least the above-mentioned risks, while preserving its obvious research value? Well, a lot.

First of all, one part of the problem is easy to fix: the exact test scores. Simply rounding and bucketing them into small ranges (reporting, say, 172 for the range 171–173, 155 for the range 154–156, and so on) would make them reasonably k-anonymous. While this wouldn’t make massive de-anonymisation impossible, it would seriously reduce both the number of possible attack vectors and the precision of such breaches. “Barnardisation” (randomly adding 1 or -1 to each score) would also do the trick, though it should be combined with other anonymisation techniques.
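Both mitigations are simple to implement. Here is a minimal sketch; the bucket width of 3 and the exact band boundaries are illustrative choices, and the noise scheme shown is the common three-valued (-1, 0, +1) variant of Barnardisation:

```python
import random

def bucket(score, width=3):
    """Report the midpoint of a fixed-width band instead of the exact
    score: with width 3, every score in 171-173 becomes 172.
    (Band boundaries here are an illustrative assumption.)"""
    return (int(score) // width) * width + width // 2

def barnardise(score, rng=random):
    """Randomly perturb a score by -1, 0 or +1 ("Barnardisation")."""
    return score + rng.choice([-1, 0, 1])
```

With bucketing, every score in a band collapses to one published value, so an exact-score join like the one we performed no longer yields unique matches.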

The problem with background knowledge (as in the “nosy neighbour” scenario) is that it is impossible to mitigate without removing a huge number of outliers and special cases, such as small schools or uncommon test subjects in small communities, or without large bucketing steps for scores and generalised test locations. Some educational experts have raised concerns about the resulting huge loss in precision.

Still, CEQA could have considered releasing the dataset with generalised data and some added noise, and giving researchers access to more detailed information under a non-disclosure agreement.

This “partial release/controlled disclosure” scheme could also help to deal with the alarming problem of school ratings. For example, generalising the testing location from exact places to school districts or even regions would probably help. Local media are usually not interested in comparing EIT results outside their audience’s locations, and national media are much more reluctant to publish stories about differences in educational results between regions, for obvious discrimination and defamation concerns.

This kind of attack is not very dangerous in Ukraine at this particular moment: we don’t have a huge data-broker market (as in the US or UK), and our HR and insurance companies do not (yet) use sophisticated algorithms to determine the fate of people’s job applications or the final cost of life insurance. But the situation is changing quickly, and this kind of sensitive personal data, which isn’t worth much at this point, could easily be exploited at any moment in the near future. Both the speed and the low cost of this kind of attack make this dataset a very low-hanging fruit.

Conclusions

The current state of affairs in personal data protection in Ukraine, as well as the workload of the responsible government staff, doesn’t leave much hope for a swift change to any of the already-released datasets. Still, this case clearly demonstrates that anonymisation is a really hard problem to tackle, and that the benefits of microdata disclosure can quite easily be outweighed by the risks of unwanted personal data disclosure. All open data activists advocating for the disclosure of as much information as possible, as well as the government agencies responsible for releasing such sensitive datasets, should put serious effort into assessing the associated privacy risks.

We hope that our work will be helpful not just for future releases of external testing results, but for the wider open data community, both in Ukraine and throughout the world.


The lost privacy of Ukrainian students: a story of bad anonymisation

Vadym Hudyma - May 23, 2017 in Data Blog

Authors: Vadym Hudyma, Pavlo Myronov. Part 1 of a series on Ukrainian student data.

Introduction

Ukraine has long been plagued by corruption in university admissions due to a complicated and opaque admission process, especially for state-funded places. To get in, would-be students needed not just good grades from school (which were also subject to manipulation), but usually connections to, or bribes for, university admission boards.

Consequently, the adoption of External Independent Evaluation Testing (EIT) as the primary criterion for admission into universities is considered one of a handful of successful anticorruption reforms in Ukraine. The external independent evaluation is conducted once a year for a number of subjects, and anyone with a school diploma can participate. It is supervised by an independent government body, the Center for Educational Quality Assessment (CEQA), with no direct links to either the school system or the major universities. All participants’ names are replaced with unique codes to protect the results from forgery.

The EIT has not eradicated corruption, but it has reduced corruption in the university admissions system to a negligible level. While its impact on the school curriculum and evaluation is, and should be, critically discussed, its success in giving bright students a chance to choose between the best Ukrainian universities is beyond doubt. It also provides researchers and the general public with a very good tool to understand, at least at some level, what is going on in secondary education, based on a unique dataset of country-wide university admission test results.

Obviously, it is also crucial that the results of the admission tests, a potentially life-changing endeavour, be held as privately and securely as possible. Which is why we were struck when the Ukrainian Center for Educational Quality Assessment (CEQA), the body responsible for collecting and managing the EIT data, released this August a huge dataset of independent testing results from 2016.

This dataset includes individual records. Although the names and surnames of participants were de-identified using randomly assigned characters, the dataset was still full of other entries that could be linked to specific individuals. Those include the exact scores (with decimals) of every test subject taken, each participant’s year of birth, their gender, whether they graduated this year or not and, most damning, the name of the place where each subject of the external examination was taken – usually the school at which the participant received their secondary education.

I. Happy Experts

Of course, the first reaction from the Ukrainian Open Data community was overwhelmingly positive, helped with the fact that previous releases of EIT datasets were frustrating in their lack of precision and scope.

A Facebook post announcing the publication: “Here are the anonymized results of EIT in csv #opendata”


A Facebook comment reacting to the publication: “Super! Almost 80 thousand entries” (actually more)


A tweet discussing the data: “Some highly expected conclusions from EIT data from CEQA…”

As Igor Samokhin, one of the researchers who used the released EIT dataset in his studies, put it:

“[…This year’s] EIT result dataset allows for the first time to study the distribution of scores at all levels of aggregation (school, school type, region, sex) and to measure inequality in scores between students and between schools at different levels. […] The dataset is detailed enough that researchers can ask questions and quickly find answers without needing to request additional data from state agencies, which are usually very slow or totally unresponsive when data is needed below the regional level.”

Indeed, the dataset made possible some interesting visualisations and analysis.


A simple visualisation showing differences in test results between boys and girls


Quick analysis of the birth years of those who took the EIT in 2016

But that amount of data, and the variety of dimensions (characteristics) available, carries many risks, unforeseen by the data providers and overlooked by the hyped open data community and educational experts. We have made a short analysis of the most obvious threat scenarios.

II. What could go wrong?

As demonstrated by various past cases across the world, microdata disclosure, while extremely valuable for many types of research such as longitudinal studies, is highly susceptible to re-identification attacks.

To understand the risks involved, we went through a process called threat modelling. This consists of analysing all the potential weaknesses of a system (here, the anonymisation technique used on the dataset) from the point of view of a potential individual with malicious intentions (called an ‘attacker’). Three threat models emerged from this analysis:

The ‘Nosy neighbour’ scenario

The first and most problematic possibility is called the “nosy neighbour” scenario. This corresponds to the unexpected disclosure of results to relatives, neighbours, school teachers, classmates, or anyone with enough knowledge about an individual described in the dataset to recognise who the data describes, without having to look at the name. The risks involved in this scenario include possible online and offline harassment of people with test results deemed too low or too high, depending on the context.

Unwanted disclosure may happen because members of the subject’s close environment already have some additional information about the person. If you know that your classmate Vadym was one of the rare people in the village to take chemistry in the test, you can easily deduce which row of the data corresponds to him, thereby discovering all the details of his test results. And depending on what you (and others) discover about Vadym, the resulting social judgement could be devastating for him, all because of an improperly anonymised dataset.

This is a well-known anonymisation problem: it is really hard to achieve good anonymity with that many dimensions – in this case, the subjects and exact results of multiple tests and their primary examination location.

It’s an especially alarming problem for schools in small villages or specialised schools – where social pressure and subsequent risk of stigmatisation is already very high.

The ‘Ratings fever’ problem


Map of schools in Kiev, Ukraine’s capital, made by the most popular online media based on EIT results

The second problem with educational data is hardly new, and the release of this dataset has only made it worse. With added precision and targeting power, new fervour was lent to the media’s favourite exercise: grading schools according to the successes and failures of their students in external testing.

In previous years, many educational experts criticised ratings made by the media and various government authorities for their incompleteness: they were based either on a full dataset covering only one test subject, or on heavily aggregated and non-exhaustive data. But such visualisations can have consequences more problematic than misleading readers about the accuracy of the data.

The issue here is about the ethical use of the data, something often overlooked by the media in Ukraine, who happily jumped on the opportunity to make new ratings. As educational expert Iryna Kogut from CEDOS explains:

“EIT scores by themselves cannot be considered a sign of the quality of education in an individual school. The new dataset, and the subsequent school ratings based on it and republished by CEQA, only perpetuate this problem. Public opinion about the quality of teaching, and parental choice of school, rely on EIT results, but the authors of the ratings do not take into account parents’ education, family income, the effect of private tutoring and other out-of-school factors which have a huge influence on learning results. Besides, some schools are absolutely free to select better students (usually from families with higher socioeconomic status), and this process of selection into “elite” schools is usually neither transparent nor fair. So from the start they are not comparable with the schools having to teach ‘leftovers’.”

Even as people start to understand the possible harm of the “rate everything” mentality for both public policy and individual decisions, almost every local website and newspaper has made or republished school ratings for their cities and regions. In theory, the practice could have benefits, such as spurring efforts to improve school governance. Instead, what seems to happen is that more students from higher-income families migrate to private schools, and less wealthy parents are incentivised to use ‘unofficial’ methods to transfer their kids to public schools with better EIT records. Overall, this is a case where the principle “the more informed you are, the better” actually harms the common good, especially when there is no clear agenda or policy in place to create a fairer and more inclusive environment in Ukrainian secondary education.

Mass scale disclosure

The last and most long-term threat identified is the possible future negative impact on individuals’ personal lives due to the unwanted disclosure of test results. This scenario considers the possibility of mass-scale unwanted identity disclosure of the individuals whose data were included in the recent EIT dataset.

As our research has shown, this would be alarmingly easy to execute. The only thing one needs is already-published educational data. To demonstrate the existence of this threat, we only had to use one additional dataset: close to half of the EIT records could be de-anonymised with varying levels of certainty, meaning that we could find the identity of the individual behind the results (or narrow it down to a couple of individuals) for one hundred thousand individual records.

The additional dataset we used comes from another government website, vstup.info, which lists all applicants to every Ukrainian university. The data includes the family name and initials of each applicant, along with their combined EIT result scores. This data was published to make the acceptance process more transparent and reduce the room for manipulation.

But with some data wrangling and mathematical work, we were able to join this data with the EIT dataset, allowing mass-scale de-anonymisation.

So what should be the lessons learned from this?

First, while publishing microdata may bring enormous benefits to researchers, one should be conscious that anonymisation can be a really hard, non-trivial problem to solve. Sometimes less precision is the price of preserving the anonymity of the people whose data is included in the dataset.

Second, it is important to be aware of any other existing datasets which could be used for de-anonymisation. It is the data publisher’s responsibility to check this before sharing any information.

Third, it is not enough just to publish a dataset. It is important to make sure that your data will not be used in an obviously harmful or irresponsible manner, as in the various eye-catching but, in the long run, very damaging ratings and “comparisons”.


Data is a Team Sport

Dirk Slater - May 16, 2017 in Announcement

A series of online conversations examining the data literacy ecosystem.

In this series we seek to capture learnings about the ever-changing field of data literacy and how it is evolving in response to concepts like ‘big data’, ‘post-fact’ and ‘data confusion’. This open research project by School of Data, in collaboration with FabRiders, will produce a series of podcasts and blog posts as we engage in conversation with data literacy practitioners with particular expertise within the ecosystem (e.g. investigative journalism, advocacy and activism, academia, government, etc.).

You can join the conversation (see RSVP below) and provide inputs into the research we are conducting. During each online conversation we will give participants an opportunity to ask questions and share their own insights on the topic.

Our first conversation will take place on May 25th at 7:00 PDT, 10:00 EDT, 15:00 BST, 16:00 CEST, 17:00 EAT, 19:30 IST, and 21:00 Bangkok time with:

  • Rahul Bhargava, MIT Media Lab, will discuss his methodologies for taking individuals from spreadsheets to drawing, and how this led to the development of databasic.io
  • Lucy Chambers, School of Data Staff Alumni currently with Tech to Human, will reflect on the challenges faced when starting School of Data and her various roles as a data literacy practitioner.

Your hosts:

You can join the hangout and contribute questions and chat via text at:

https://hangouts.google.com/hangouts/_/fw4mr3pvd5flrhlkuhafmctpnqe

Or just view live on YouTube:

http://youtu.be/Cl7FGYNAmJc

 


[French] The School of Data Fellowship: Questions and Answers

Cedric Lombion - March 29, 2017 in Fellowship

In 2017, we are recruiting Fellows in three French-speaking countries: Haiti, Côte d’Ivoire and Senegal. The themes are as follows:

  • Haiti: data literacy fundamentals
  • Côte d’Ivoire, Senegal: extractive industry data.

See the main announcement

Not sure whether the Fellowship is right for you? Still have questions? This article gathers the most common questions and answers. We will update it as often as possible!

  • What is the School of Data Fellowship?

Fellowships are 9-month placements within the School of Data network for individuals who practise or are passionate about data literacy. During this period, Fellows work alongside the School of Data coordination team and network: you will learn a lot from us, and vice versa! We will work together to build an individual programme for your Fellowship, with the goal of acquiring the skills that will let you advance your data literacy work: training others, developing a network, organising events. Whatever the activity, our objective is to raise awareness of data literacy and to build communities which, together, can use data skills to drive change in the world.

The Fellowship aims to recruit and train the next generation of data leaders and trainers in order to extend the impact of our data literacy programme. Fellows provide training and ongoing support to journalists, civil society organisations and innovative individuals so that they can use data meaningfully within their community or country. We are looking for candidates who have existing ties to a network of data literacy advocates, or connections within a particular organisation working in this field.

We recruit our Fellows annually, and each cohort becomes an integral part of the international School of Data network. Fellows can thus draw on the strength of the network to share resources and knowledge, contributing to our shared understanding of the best strategies for delivering locally relevant training.

  • Does a Fellow have to live in, or be permanently present in, the country?

Fellows are expected to be available 10 days per month for the Fellowship. Most assignments will require a presence in the field, which will be easier if you live in the country at least 2 weeks per month. Moreover, we are looking for people who would like to remain active in the country over the long term, which means that candidates living there will be favoured. That said, we are flexible, and if a Fellow has a planned trip, we can find an arrangement.

  • Does a Fellow need to speak English fluently?

The Fellowship will be coordinated in French for French-speaking Fellows. That said, candidates who can speak English will have an advantage: it is important to be able to communicate with the rest of the School of Data community! There is no need to be bilingual, however; being able to speak simple English and understand English speakers is enough.

  • Will Fellows have to travel during the programme?

Yes. In May, for the School of Data Summer Camp, Fellows will join the community in South Africa to plan their Fellowship and be trained in School of Data methodologies. This requires having a passport, and starting the visa application process as soon as you are selected. Keep it in mind!

Questions without answers? Contact us via Twitter or our website!


Ask Your Questions to Former School of Data Fellows

Meg Foulkes - March 23, 2017 in Announcement, Events, Fellowship

 

Do you have questions about what it’s like to be a School of Data Fellow? What will I learn? How can I fit Fellowship work around other commitments like work and family? Will I need to travel a lot?

As part of our call for applications for the 2017 Fellowships and Data Experts programmes, we’re hosting a live, informal Question and Answer session next Monday, 27th March at 12.30 UTC, with two former Fellows:

  • Julio Lopez, a Fellow from the Class of 2015 from Ecuador
  • Sheena Carmel Opulencia-Calub, also from our Class of 2015, who’s based in the Philippines.

You can read more about both of their backgrounds and interests here.

The Q&A will be live on School of Data’s YouTube channel: link. We look forward to seeing you there!


[French] Apply now! Applications open for School of Data programmes

Cedric Lombion - March 21, 2017 in Announcement, Fellowship

School of Data invites journalists, civil society organisations, and anyone interested in promoting data literacy, to apply to its Fellowship programme. Applications for this programme, which runs from April to December 2017, will close on Sunday 16 April 2017. For the French-speaking Fellowship, School of Data is looking for candidates in three countries:

  • Senegal
  • Côte d’Ivoire
  • Haiti

Apply for the Fellowship or read the FAQ.

Note: if you are from another country, please refer to the main announcement, in English

The Fellowship

Fellowships are 9-month placements within the School of Data network for individuals who practise or are passionate about data literacy. During this period, Fellows work alongside the School of Data coordination team and network: you will learn a lot from us, and vice versa! We will work together to build an individual programme for your Fellowship, with the goal of acquiring the skills that will let you advance your data literacy work: training others, developing a network, organising events.

As in previous years, the objective of the Fellowship programme is to promote data literacy and build communities which, together, can use their data skills to create the change they want to see in the world.

The 2017 Fellowship continues the thematic approach begun with our 2016 recruitment process. We will therefore prioritise candidates who:

  • demonstrate experience of, and enthusiasm for, a specific data literacy theme.
  • can show ties to an organisation or a community of individuals working on this theme

We are looking for candidates with in-depth knowledge of the areas we are interested in, who have already thought about the data literacy issues within them. The goal is to get to the heart of the matter as quickly as possible: 9 months go by fast!

Read more about the Fellowship programme (in English)

The 2017 priority themes

This year we are collaborating with organisations interested in the following themes:

  • extractive industries data
  • data literacy fundamentals

Programme  | Theme                      | Countries
Fellowship | Extractive industries data | Senegal, Côte d’Ivoire
Fellowship | Data literacy fundamentals | Haiti

9 months to make an impact

The programme runs from April to December 2017 and requires an availability of 10 days per month. Fellows receive a stipend of US$1,000 per month to allow them to work in optimal conditions.

In May, Fellows will join the rest of the community at the School of Data Summer Camp (country to be confirmed). It will be an opportunity to meet the other Fellows and network members, plan your Fellowship, and learn from the other participants about best practices used across the School of Data network.

What are you waiting for?

Read the FAQ or Apply

Key information: the Fellowship

  • Application deadline: 16 April 2017, midnight GMT+0
  • Duration: 24 April 2017 to 31 December 2017
  • Availability required: 10 days per month
  • Stipend: US$1,000 per month

Diversity and inclusivity

We are committed to an inclusive recruitment process. Being inclusive means excluding no one on the grounds of ethnic origin, religion, appearance, sexual orientation, or gender. We actively seek to recruit individuals who differ from one another in these characteristics, because we are convinced that diversity enriches our work.


How do you become data literate? Part 1

helene.hahn - March 9, 2017 in Community, Data Blog

We at Open Knowledge Foundation Germany launched a new project this year that we’re very proud of: Datenschule (datenschule.de), the German version of School of Data. We want to encourage civil society organisations, journalists and human rights defenders to use data and technology effectively in their work to create positive social change.

But what does it actually mean to become ‘data literate’? Where do you start and how can you use data within your work and projects? To explore these questions, we would like to introduce some of our community members and data activists from around the world, who ended up working with data at some point in their lives. We were curious about how they actually got started and – looking back now – what they would recommend to data newbies.

Each month we will publish a new interview; this is no. 1. Got feedback? Have questions? Feel free to get in touch: helene.hahn@okfn.de

Camila Salazar

Who: Data-Journalist from Costa Rica, working at the newsroom La Nación, data trainer at School of Data

Topics: data-driven stories on society, economics, politics

Tweets: @milamila07

Hi Camila, please introduce yourself.

My name is Camila, I’m from Costa Rica and I’m a data journalist and an economist. I’m currently working at a newspaper called La Nación in the data unit. I’m also involved with the School of Data community and started as a fellow in 2015. This was the year when I started running data trainings and workshops. I was trying to build a community around data in Costa Rica, in Latin America, also a bit in Mexico and South America.

When was the first time you came across data and when did you start to use data in your work?

I started studying journalism, but after my second year I was disappointed with the university and I wasn’t really motivated. So I thought, maybe I could start studying something else besides journalism. I enrolled in Economics at my university and was taking the two courses simultaneously. In Economics it’s all about numbers and I really liked it. But when I was about to finish my journalism studies, I thought: do I want to be an economist and work in a bank, or do I want to write stories? How could I combine both? That’s how I got involved in data journalism. I found that this was an area where you could combine both in a good way. You can take all the methods and technical skills that you acquire in an economics degree and apply them to tell stories in the public interest. That’s how I combined the two, and so far it’s worked well.

What topics and projects are you currently working on?

At the data unit at La Nación we don’t focus on one major topic; it changes all the time. This year we ran projects about the municipal elections in Costa Rica. We collected data on the mayors that were running for the different counties. We also developed a project about live fact-checking the promises of the president. Every year, he gives a speech about the situation in the country. We built a platform where you could follow the speech live and see if the things that the president says are true or not. We try to look for all kinds of stories and narratives and see what kind of data is available on that topic. It could be a social topic, an economic one or something else. Now we are working on a project around wages. Within our unit we have the liberty to choose our topics and to see what’s interesting.

How would you explain data literacy?

I think to be data literate is to change the way you solve problems. You don’t have to be a pro in statistics. It’s about the way you approach questions and the way you solve them. For example, if you’re working in a social discipline, in economics or in science, you are used to solving problems with certain scientific methods: you ask a question, apply a method and then try to prove your point; you experiment a lot with data. That’s the way you become data literate. And this can work in any kind of field: in data journalism, in public policy, in economics, or if you are trying to introduce better solutions to improve efficiency in your business. Data literacy is about changing your way of thinking. It’s about trying to prove things and trying to find solutions with numbers and data. It’s a way of making things more methodical and reproducible.

What would you recommend to someone interested in data, but who does not know where to start?

If you really don’t know anything about data, don’t worry, it’s not that hard to get started. There are many learning resources available online. For a start, I would try and look for projects of people who already work with data – to get inspired. Then you can look for tools online, for example, on schoolofdata.org, there are courses, there are links to projects and it’s a good way to start. Don’t be afraid, and if you want to go super pro, I encourage you to do this. But it’s a process, you don’t need to expect to be modelling data in two weeks, but in two weeks you can learn the basics and start answering small questions with data.

Links:

Blog posts by Camila on School of Data

Data unit at the newspaper La Nación

Live fact-checking project on presidential promises http://www.nacion.com/gnfactory/investigacion/2016/promesas-presidente/index.html


Apply Now! School of Data’s Fellowship and Data Expert Programmes

Cedric Lombion - March 2, 2017 in Announcement, Fellowship


School of Data is inviting journalists, civil society advocates and anyone interested in pushing data literacy forward to apply for its 2017 Fellowship and Data Expert Programmes, which will run from April to December 2017. Up to 10 positions are open, with an application deadline of Sunday, April 16th, 2017.

Apply for the Fellowship Programme or Apply for the Data Expert Programme

The Fellowship

Fellowships are nine-month placements with School of Data for data-literacy practitioners or enthusiasts. During this time, Fellows work alongside School of Data to build an individual programme that will make use of both the collective experience of School of Data’s network to help Fellows gain new skills, and the knowledge that Fellows bring along with them, be it about a topic, a community or specific data literacy challenges.

Similarly to previous years, our aim with the Fellowship programme is to increase awareness of data literacy and build communities who together, can use data literacy skills to make the change they want to see in the world.

The 2017 Fellowship will continue the thematic approach pioneered by the 2016 class. As a result, we will be prioritising candidates who:

  • possess experience in, and enthusiasm for, a specific area of data literacy training

  • can demonstrate links with an organisation practising in this defined area and/or links with an established network operating in the field

We are looking for engaged individuals who already have in-depth knowledge of a given sector and have been reflecting on the data literacy challenges faced in the field. This will help Fellows get off to a running start and achieve the most during their time with School of Data: nine months fly by!

Read More about the Fellowship Programme

The Data Expert programme

Launched formally for the first time this year, the Data Expert programme aims to strengthen the ability of civil society organisations that are strategically positioned to bring about social change in their field of expertise to manage and deliver data-driven projects. The Data Expert programme was designed to complement the School of Data Fellowship, and for it we are recruiting a slightly different profile. Data Experts are expected to be more senior than Fellows, with demonstrable technical and project management skills. By matching these individuals with the selected partner organisations, while providing them with support through our network and partners, we expect to create a decisive impact on the use of data within key civil society organisations around the world.

We will consequently prioritise individuals who:

  • possess relevant experience and expertise in the technical areas our local partners need help with
  • can demonstrate a strong interest in the field of activity of the civil society organisation they will be supporting

Read More about the Data Expert Programme

The areas of focus in 2017

We have partnered with organisations interested in working on the following themes: Data Journalism, Procurement and Extractives Data. These amazing partner organisations will provide Fellows and Experts with guidance, mentorship and expertise in their respective domains.

Programme Theme Location Open slots
Fellowship Extractives Data Sénégal, Côte d’Ivoire Up to 2
Fellowship Procurement Data Worldwide Up to 1
Fellowship Data Journalism Worldwide Up to 2
Fellowship Own focus Worldwide Up to 3
Data Expert Extractives Data Uganda, Tanzania 2

9 months to make an impact

The two programmes will run from April to December 2017 and entail up to 10 days a month of work. While Fellows will focus on honing their skills as data trainers and building a community around them, Experts will focus on supporting and training a civil society organisation or newsroom on a specific project. Fellows will receive a stipend of $1,000 USD a month for their work. Experts, whose schedules will vary more, will receive a total stipend of $10,500 USD over the course of the programme.

In May, both Experts and Fellows will come together during an in-person Summer Camp (location to be decided) to meet their peers, build and share their skills, and learn about the School of Data way of training people on data skills.

What are you waiting for?

Read more about School of Data’s Fellowship or Apply now

Read more about School of Data’s Expert Programme or Apply now

Key Information: Fellowship

  • Available positions: up to 10 fellows. Learn more.
  • Application deadline: April 16th, 2017, midnight GMT+0
  • Duration: From April 24th, 2017 to December 31st, 2017
  • Level of activity: 10 days per month
  • Stipend: $1000 USD per month

Key Information: Data Expert Programme
  • Available positions: 2 Experts, in Uganda and Tanzania. Learn more.
  • Application deadline: April 16th, 2017, midnight GMT+0
  • Duration: From April 24th, 2017 to December 31st, 2017
  • Level of activity: up to 10 days per month
  • Stipend: $10,500 USD in total

About diversity and inclusivity

School of Data is committed to being inclusive in its recruitment practices. Inclusiveness means excluding no one because of race, age, religion, cultural appearance, sexual orientation, ethnicity or gender. We proactively seek to recruit individuals who differ from one another in these characteristics, in the belief that diversity enriches all that we do.

Finally, we are grateful to our partners and funders for making these programmes possible. The School of Data programme is funded through grants from the following institutions: Internews/USAID, Open Data for Development (World Bank & IDRC), the Hewlett Foundation, the Open Society Foundations, the Natural Resource Governance Institute and Publish What You Pay.


SNI 2016: ICT and Open Data for Sustainable Development

Malick Lingani - November 23, 2016 in Data Blog, Fellowship

The National ICT Week (SNI) is an annual event in Burkina Faso dedicated to promoting ICT. Each year, thousands of people are introduced to the basics of operating computers; impactful ICT initiatives are also rewarded with a host of prizes. This year’s event, the 12th edition, was hosted by the Ministry of Digital Economy from May 31st to June 4th under the theme of ICT and sustainable development.


The panelists of the conference

The Burkina Open Data Initiative (BODI) was represented by its Deputy Manager, Mr. Malick Tapsoba. He gave an introductory speech that gave the audience a general idea as to what open data is about. He then continued by presenting some of the key accomplishments of BODI so far:

  • NENDO, a web application developed with data on education available on the Burkina Faso open data portal, was presented as an example of how open data can be used to boost accountability in education systems

  • the GIS data collected on drinkable water wells has become a key decision-making tool toward the achievement of Sustainable Development Goal (SDG) 6: ‘Ensuring availability and sustainable management of water and sanitation for all.’

  • The open election project: a web platform that allowed the visualization of both the 2015 presidential and legislative election results. The visualizations were created almost in real-time, as fast as the data was released by the electoral commission. This project, initiated by BODI, has strongly contributed to the acceptance of the election’s results by all contenders.

Some ongoing projects of BODI were also presented:

  • Open Data and government procurement tracking project. This project aims to improve transparency in the government’s budget spending and to unlock opportunities for enterprises based on market competition.

  • Open Data to monitor both foreign funds and domestic funds: “When the data are not available and open, how can we measure progress toward Sustainable Development Goals?”, said Mr. Tapsoba.

Mr. Tapsoba also announced that a hackathon had been organised to showcase the use of open data and that the results would be revealed at the closing ceremony of SNI. One participant, a student who took part in the hackathon, called for more initiatives like these. He said that he strongly appreciated the way hackathons allow programmers and non-programmers to work together to build data applications and, for him, this helps to demystify ICT in general.

Mr. Sonde Amadou, CEO of Dunya Technology and one of the panelists, spoke about Smart Cities: African cities are growing fast, he said, and Ouagadougou, the capital city of Burkina Faso, is one of them. But the lack of open GIS data, he continued, is a stumbling block for Smart Cities, and work is needed in this area.

Dr. Moumini Savadogo, IUCN Head Country Programme, talked about the IUCN Red List of threatened, critically endangered, endangered and vulnerable species in Africa. This list helps raise awareness and encourages better informed decisions for the conservation of nature, something critical for sustainable development.

The 400 participants of the conference were well served, and I am confident that most of them can now be considered open data advocates. As a School of Data Fellow, I made sure to speak after the panelists, pointing out the importance of strong institutions supported by transparency and accountability (SDG 16) for achieving the 2030 Agenda in general. I encouraged the audience to take a look at open data portals, notably BODI and EITI, for transparency in the extractive industry, including its environmental impact. I also mentioned the GODAN initiative for SDG 2 and invited the panelist Malick Tapsoba to elaborate on that. That day, the open data community of Burkina Faso took one more step on its journey towards building a stronger community of open data and data literacy advocates.


Infobox
Event name: SNI 2016: ICT and Open Data for Sustainable Development
Event type: Conference
Event theme: ICT and Open Data for Sustainable Development
Description: The conference, part of Burkina Faso’s National ICT Week (SNI), aimed to showcase the role of ICT and Open Data in meeting the Sustainable Development Goals (SDGs). It was designed to bring together ICT specialists, academia and Open Data activists to explore and learn about the Sustainable Development Goals and how ICT and Open Data can contribute to that agenda.
Speakers: Pr. Jean Couldiaty (University of Ouagadougou) Facilitator, Mr. SONDE Amadou (CEO of Dunya Technology), Mr. Malick Tapsoba (BODI Deputy Manager), Dr. Moumini SAVADOGO (IUCN Head Country Programme)
Partners: Burkina Faso Ministry of Digital Economy, Burkina Faso Open Data Initiative (BODI), International Union for the Conservation of Nature (IUCN), University of Ouagadougou
Location: Ouagadougou, Burkina Faso
Date: May 31st 2016
Audience: ICT specialists, Open Data and Data Literacy enthusiasts, Students, Journalists
Number of attendees: 400
Gender split: 60% men, 40% women
Duration: 1 day
Link to the event website: http://www.sni.bf


Who works with data in El Salvador?

Omar Luna - November 16, 2016 in Data Blog, Fellowship

For five years, El Salvador has had the Public Information Access Law (PIAL), which requires all state, municipal and public-private entities to provide various kinds of information, such as statistics, contracts, agreements and plans, in an accurate and timely manner.

Alongside the social control exerted by Civil Society Organizations (CSOs) in El Salvador to ensure compliance with this law, the country’s public administration made space for the emergence of various bodies, such as the Institute of Access to Public Information (IAPI), the Secretariat of Transparency, Anti-Corruption and Citizen Participation, and the Open Government website, which compiles more than 92,000 official data documents, albeit without periodic revision of those documents by any government official.

During this five-year period, the government showed its discontent. Why? It didn’t expect that this legislation would strengthen the journalistic, activist and investigative powers of civil society, which took advantage of this time to improve and refine the techniques with which it requested information from the public administration.

Presently, there are few digital skills amongst these initiatives in the country. It has now become essential to ask the question: what is known about data in El Salvador? Are the initiatives that have emerged limited in the scope of their achievements? Can something be done to awaken or consolidate the interest of people in data? To answer these and other questions, I conducted a survey with different research and communication professionals in El Salvador and this is what I found.

The Scope

“I think [data work] has been explored very little (in journalism at least),” said Jimena Aguilar, Salvadoran journalist and researcher, who also assured me that working with data helps provide new perspectives to stories that have been written for some time. One example is Aguilar’s research for La Prensa Grafica (LPG) sections, such as transparency, legal work, social issues, amongst others.

Similarly, I discovered different initiatives that are making efforts to incorporate the data pipeline into their work. For two years, the digital newspaper ElFaro.net has explored various national issues (laws, homicides, deputies’ travel, pensions, etc.) using data. Over the same period, the Latitudes Foundation processed different aspects of gender issues under the “Háblame de Respeto” project, determining that violence against women is a multi-causal phenomenon in the country.

And although resistance persists in government administrations and related institutions to adequately provide the information requested by civil society —deputies, think tanks, Non-Governmental Organizations (NGOs), journalists, amongst others— more people and entities are interested in data work, performing the necessary steps to obtain information that allows them to know the level of pollution in the country, for instance, build socio-economic reports, uncover the history of Salvadoran political candidates and, more broadly, promote the examination of El Salvador’s past in order to understand the present and try to improve the country’s future.

 

The Limitations

“[Perhaps,] it is having to work from scratch. A lot of carpentry work [too much work for a media outlet professional],” says Edwin Segura, director of LPG Datos, one of the main data units in the country, for more than 15 years. He also told me that too much time and effort is often lost cleaning false or misleading data provided by different government offices, which is frequently incomplete or insufficient. Obviously, Segura says, this is done with the intention of hindering the work of those working with data in the country.

In addition, there’s something very important that Jimena told me about data work: “If you are not working as a team, it is difficult to do [data work] in a creative and attractive way.” What she said caught my attention for two reasons. First, although there are platforms that help create visualizations, such as Infogr.am and Tableau, you always need a multidisciplinary approach to jump-start a data project. That is the case of the El Diario de Hoy data unit, which is made up of eight people specialised in data editing, web design, journalism and other related areas.

And, on the other hand, although there are various national initiatives that work to obtain data, such as Fundación Nacional para el Desarrollo (FUNDE) and the Latitudes Foundation, their efforts to do something with the results are scattered: everyone takes on the challenge of working with databases individually, instead of pursuing common goals together.

Stones in the Road

When I asked Jimena about the negative implications of working with data, she was blunt: “(Working with data) is something that is not understood in newsrooms […] [it] takes a lot of time, something that they don’t like to give in newsrooms.” And not only newsrooms: NGOs and various civil society initiatives are also unaware of the skills needed to work with data.

Of the many internal and external factors affecting the construction of data stories, I would highlight the following. To begin with, there is widespread fear of, and ignorance towards, mathematics and basic statistics, so individuals across a wide variety of sectors don’t understand data work; to them, learning to use these skills in their work is a waste of time. They find it simpler to gather data from press conferences, institutional reports and official statements, which is a mistake, because they don’t see how data journalism can help them tell stories in a different way.

Another issue is the inconsistency in government actions: although the government supports transparency in its discourse, its actions are focused on answering requests vaguely rather than proactively releasing good-quality data, and opening data this way is hampered by delays. I experienced this first hand when, on many occasions, the information I received didn’t match what I had requested or, on the contrary, government officials sent me information that differed from what they sent in response to requests from other civil society sectors (journalists, researchers, etcetera).

Where Do We Go From Here?

In this context, it becomes essential to start making different sectors of civil society aware of the importance of data on specific issues. To that end, I am designing a series of events with multidisciplinary teams, workshops, activities and presentations that deconstruct the fear of numbers people currently have, through the exchange of experience and knowledge. Only then can our civil society groups make the invisible visible and explain the why behind all kinds of topics discussed in the country.

With this approach, I believe that not only future generations of data practitioners can benefit from my activities, but also those who currently have only indirect contact with it (editors, coordinators, journalists, etc.), whose work can be enhanced by an awareness of data methodologies; for example, by encouraging situational awareness of data in the country, time-saving tools and transcendence of traditional approaches to visualization.

After working for two years with gender issues and historical memory, I have realised that most data practitioners are self-taught; through trainings of various kinds we can overcome internal and external challenges and, in the end, reach common goals. But we don’t have any formal curricula, and everything we’ve learned so far comes from trial and error, something we have to improve over time.

We are also coping with the obstacles the government imposes on how data is requested and how the requested information is delivered, and we constantly have to justify our work in workplaces where data work is not appreciated. From NGOs to media outlets, data journalism is seen as a waste of time because they think we don’t produce material as fast as they would like; they don’t appreciate all the effort required to request, clean, analyse and visualise data.

As part of my School of Data Fellowship, I’m supporting the design of an educational curriculum specialising in data journalism for fellow journalists in Honduras, Guatemala and El Salvador, so they may acquire the skills and knowledge needed to produce data stories on specific issues in their home countries. This is a wonderful opportunity to awaken the persistence, passion and skills for doing things with data.

The outlook is challenging. But now that I’m aware of the limits, the scope and the stones in the road of data journalism in El Salvador, and of all that remains to be done, I want to move forward. I accept the challenge this fellowship has presented me, because as Sandra Crucianelli (2012) would say, “(…) in this blessed profession, it is not the well-connected who shine, nor even the brilliant minds: only the perseverant shine at this task. That’s the difference.”
