You are browsing the archive for Vadym Hudyma.

De-anonymising Ukraine university entrance test results

- May 26, 2017 in Data Blog

Authors: Vadym Hudyma, Pavlo Myronov. Part 2 of a series on Ukrainian student data.

Introduction

External Independent Evaluation Testing (EIT) is a single exam used nationwide for admission to all public universities.

As detailed in our previous article, the release of a poorly anonymised dataset by the organisation in charge of External Independent Evaluation Testing (EIT) resulted in serious risks to the privacy of Ukrainian students. One of those was the risk of unwanted mass disclosure of personal information with the help of a single additional dataset. We detail below how we reached our results.

The EIT dataset contains the following dimensions:

  • Unique identifier for every person
  • Year of birth
  • Sex
  • Test scores for every subject taken by the student (exact to decimals for those who scored 95% or more of the possible points)
  • Place where the tests were taken

On the other hand, the dataset we used to de-anonymise the EIT results was collected from the website vstup.info, and it gives us access to the following elements:

  • family name and initials of the applicant (also referred to below as 'name')
  • university where the applicant was accepted
  • the combined EIT result scores per required subject, with a multiplier applied to each subject by the universities, depending on their priorities.

At first glance, since every university uses its own list of subject-specific multipliers to compute the combined EIT results of applicants, it should be impossible to recover applicants' exact EIT scores, or to find matches with exact scores in the EIT dataset.

The only problem with that reasoning is that the law requires all the multipliers to be published on the same website as part of a corruption-prevention mechanism. And this is good. But it also provides attackers with enough data to use as a basis for calculations to find exact matches between the datasets.
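Since both the multipliers and the raw scores are public, reconstructing an applicant's combined rating is just a weighted sum. A minimal sketch in Python (the subject names, scores, and weights below are invented for illustration; our own analysis used R):

```python
# How a university's published multipliers turn raw EIT subject scores
# into the combined rating listed on vstup.info (illustrative values only).
def combined_rating(scores, multipliers):
    """scores: {subject: raw EIT score}; multipliers: {subject: weight}."""
    return round(sum(scores[s] * multipliers[s] for s in multipliers), 2)

# Invented example: a speciality that weights maths and physics over Ukrainian.
scores = {"ukrainian": 180.5, "maths": 172.0, "physics": 165.5}
multipliers = {"ukrainian": 0.2, "maths": 0.4, "physics": 0.4}
rating = combined_rating(scores, multipliers)
```

Running this computation for every test-taker against every speciality's published multiplier list is all an attacker needs to start searching for exact matches.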

How we did it

Our calculations were based on the assumption that every EIT participant applied to universities in their local region. This assumption may not hold for every participant, but it is usually the case, and it is also one of the easiest ways to reduce the complexity of the calculations.

For every Ukrainian region, we isolated in the EIT dataset the subset of local test-takers and calculated the EIT ratings they would have if they had applied to every speciality at local universities. Then we merged this dataset of “potential enrollees” with the real enrollees' dataset from the website vstup.info, which contains the real names of enrollees and their final rating (i.e. multiplied by the subject- and university-specific multipliers), matching on university, speciality, and rating.
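In pandas terms, this step looks roughly as follows (a sketch with invented column names and toy values; the real datasets are far larger and our implementation was in R):

```python
import pandas as pd

# One region: anonymised EIT subset (A1), published multipliers, and the
# enrollee lists scraped from vstup.info (B1). All values are invented.
eit = pd.DataFrame({
    "id": ["a1", "a2"],
    "maths": [190.0, 150.0],
    "ukrainian": [180.0, 160.0],
})
specialities = pd.DataFrame({
    "speciality": ["s1", "s2"],
    "m_maths": [0.6, 0.3],
    "m_ukr": [0.4, 0.7],
})
enrollees = pd.DataFrame({
    "name": ["Hudyma V.", "Myronov P."],
    "speciality": ["s1", "s2"],
    "rating": [186.0, 157.0],
})

# "Potential enrollees": the rating every local test-taker would have
# at every local speciality.
potential = eit.merge(specialities, how="cross")
potential["rating"] = (potential["maths"] * potential["m_maths"]
                       + potential["ukrainian"] * potential["m_ukr"]).round(2)

# A1B1: ids paired with real names via exact rating matches.
pairs = potential.merge(enrollees, on=["speciality", "rating"])
```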

By joining these datasets for every region, we obtained a first set of pairs in which test-takers' ids correspond to enrollees' names (dataset A1B1). In the resulting set, the number of EIT participants that correspond to only one name, i.e. those who can be unambiguously identified, is 20 637 (7.7% of all participants).

To expand the scope of our de-anonymisation, we used the fact that most enrollees try to increase their chances of getting accepted by applying to several universities. We therefore tested all pairs from the first merged dataset (A1B1) against the whole dataset of enrollees (B1), counting the number of matches by final rating for every pair. Then we kept the pairs that were matched by at least two unique values of EIT rating: if the same match occurs in two cases with different university/speciality coefficients forming the aggregate EIT rating, it is much less likely that we got a “false positive”.
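This filtering step can be sketched like so (again with invented column names and values): count the distinct rating values behind every id-name pair and keep only pairs confirmed at least twice.

```python
import pandas as pd

# All (id, name) matches found across the enrollee dataset (invented values).
matches = pd.DataFrame({
    "id":     ["a1", "a1", "a2", "a3"],
    "name":   ["Hudyma V.", "Hudyma V.", "Myronov P.", "Ivanov I."],
    "rating": [186.0, 171.5, 157.0, 157.0],
})

# C1: number of distinct aggregate-rating values behind each correspondence.
c1 = (matches.groupby(["id", "name"])["rating"]
             .nunique()
             .reset_index(name="n_ratings"))

# Keep pairs confirmed by at least two different combined ratings --
# far less likely to be coincidental than a single-rating match.
reliable = c1[c1["n_ratings"] >= 2]
```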

We thus formed a dataset in which each EIT participant's id corresponds to one or more names, and the number of unique EIT rating values is recorded for every correspondence (C1). In this case, the number of EIT participants (unique identifiers from A1) that correspond to only one name with more than one unique aggregate rating is 50 845 (18.97%).

We also noticed the possibility of false positives, namely situations where the same family name and initials from the enrollees dataset (B1) correspond to several ids from the EIT participants dataset (A1). This does not necessarily mean we guessed a test-taker's family name wrongly, especially in the case of a rather common family name: the more widespread a name is, the higher the probability that we correctly identified several EIT participants sharing it. But it still leaves the possibility of some false positives.

To separate the most reliable results from the others, we identified correspondences with unique names and counted the records where a unique id corresponds to a unique name.
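The one-to-one filtering can be sketched as follows (invented values): an id is kept only if it maps to a single name and that name maps back to a single id.

```python
import pandas as pd

# Candidate correspondences between EIT ids and enrollee names (invented).
pairs = pd.DataFrame({
    "id":   ["a1", "a2", "a3"],
    "name": ["Hudyma V.", "Shevchenko O.", "Shevchenko O."],
})

# How many names each id maps to, and how many ids each name maps to.
names_per_id = pairs.groupby("id")["name"].transform("nunique")
ids_per_name = pairs.groupby("name")["id"].transform("nunique")

# Most reliable subset: strictly one-to-one correspondences.
one_to_one = pairs[(names_per_id == 1) & (ids_per_name == 1)]
```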

Consequently, the results of our de-anonymization can be described by the following table.

Assumptions | De-anonymised EIT participants with unique names | De-anonymised EIT participants (regardless of name uniqueness)
1) Every enrollee applied to at least one university in his/her region. | 8 231 (3.07%) | 20 637 (7.7%)
1) + Every enrollee applied to at least two specialities with different coefficients. | 31 418 (11.42%) | 50 845 (18.97%)

In each row, false positive results can occur only if some of the enrollees broke the basic assumption(s).

So far we have been speaking about unambiguous identification of test-takers. But even narrowing the results down to a small number of possible variants makes subsequent identification trivial with any kind of background knowledge or other available datasets. In the end, we were able to narrow the field to 10 or fewer possible name-variants for 43 825 EIT participants, and to only 2 possible name-variants for 19 976 test-takers.

Our method provides an assumed name (or names) for every EIT participant who applied to a university in the region where they took their tests and applied to at least two specialities with different multipliers. Though not 100% free from false positives, the results are precise enough to show that the external testing dataset provides all the identifiers necessary to de-anonymise a significant share of test-takers. Of course, those with a personal or business, rather than purely research, interest in test-takers' identities or enrollees' external testing results would find multiple ways to make the de-anonymisation even more precise and wider in scope.

(NOTE: For example, one could use clustering of each speciality's rating coefficients to decrease the number of calculations while avoiding our basic assumption. It is also possible to take into account the locations of EIT centres and assume that test-takers would probably try to enrol at universities in nearby regions, or to estimate the real popularity of names among enrollees using the social network “Vkontakte” API, and so on.)

Using comparatively simple R algorithms and an old HP laptop, we found more than 20 637 exact matches (7.7% of all EIT participants), re-identifying the individuals behind anonymised records. And more than 40 thousand participants were effectively de-anonymised with less-than-perfect precision, but more than good enough for a motivated attacker.

What could be done about it?

After conducting our initial investigation, we reached out to CEQA for comment. This was their response:

“Among other things, Ukraine struggles with high level of public distrust to government institutions. By publishing information about standardized external assessment results and the work we deliver, we try to lead by example and show our openness and readiness for public scrutiny…

At the same time, we understand that Ukraine has not yet formed a mature culture of robust data analysis and interpretation. Therefore, it is essential to be aware of all risks and think in advance about ways to mitigate adverse impact on individuals and the education system in general.”

So what could be done better with this particular dataset to mitigate at least the above-mentioned risks, while preserving its obvious research value? Well, a lot.

First of all, one part of the problem is easy to fix: the exact test scores. Simply rounding and bucketing them into small ranges (like 172 instead of the range from 171 to 173, 155 for the range from 154 to 156, and so on) would make them reasonably k-anonymous. Whilst this wouldn't make massive de-anonymisation impossible, it could seriously reduce both the number of possible attack vectors and the precision of these breaches. “Barnardisation” (randomly adding 1 or -1 to each score) would also do the trick, though it should be combined with other anonymisation techniques.
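Both mitigations are cheap to implement. A sketch (the bucket width and alignment here are chosen for illustration):

```python
import random

def bucket(score, width=3):
    # Map a score to the midpoint of its fixed-width range,
    # e.g. with width=3 the scores 171, 172 and 173 all become 172.
    return (int(score) // width) * width + width // 2

def barnardise(score, rng=random):
    # Randomly add +1 or -1; on its own this is weak protection, so it
    # should be combined with other anonymisation techniques.
    return score + rng.choice([-1, 1])
```

Either transformation breaks the exact-score join used in the attack above: after bucketing, many test-takers share each published value, and after barnardisation the published score no longer equals the true one used in the combined ratings.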

Background knowledge (as in the “nosy neighbour” scenario) is a problem that would be impossible to mitigate without removing a huge number of outliers and specific cases – small schools, uncommon test subjects in small communities and so on – or without huge steps in bucketing scores or generalising test locations. Some educational experts have raised concerns about the projected huge loss in precision.

Still, CEQA could have considered releasing a dataset with generalised data and some added noise, and giving researchers access to more detailed information under a non-disclosure agreement.

This “partial release/controlled disclosure” scheme could also help to deal with the alarming problem of school ratings. For example, generalising the testing location from exact places to school districts or even regions would probably help. Usually, local media wouldn't be interested in comparing EIT results outside their audience's locations, and national media are much more reluctant to publish stories about differences in educational results between regions, for obvious discrimination and defamation concerns.

This kind of attack is not very dangerous at this particular moment in Ukraine: we don't have a huge data-broker market (as in the US or UK), and our HR and insurance companies do not (yet) use sophisticated algorithms to determine the fate of people's job applications or the final cost of life insurance. But the situation is changing quickly, and this kind of sensitive personal data, which isn't worth much at this point, could easily be exploited at any moment in the near future. And both the speed and the low cost of this kind of attack make this dataset very low-hanging fruit.

Conclusions

The current state of affairs in personal data protection in Ukraine, as well as the workload of the responsible government staff, doesn't leave much hope for a swift change to any already-released datasets. Still, this case clearly demonstrates that anonymisation is a really hard problem to tackle, and the benefits of microdata disclosure can quite easily be outweighed by the risks of unwanted personal data disclosure. So open data activists advocating for the disclosure of as much information as possible, as well as the government agencies responsible for releasing such sensitive datasets, should put serious effort into figuring out the privacy-related risks.

We hope that our work will be helpful not just for future releases of external testing results, but for the wider open data community – both in Ukraine and throughout the world.


The lost privacy of Ukrainian students: a story of bad anonymisation

- May 23, 2017 in Data Blog

Authors: Vadym Hudyma, Pavlo Myronov. Part 1 of a series on Ukrainian student data.

Introduction

Ukraine has long been plagued by corruption in the university admission process, due to a complicated and opaque admission procedure, especially for state-funded places. To get in, would-be students needed not just good grades from school (which were also subject to manipulation), but usually also connections or bribes to university admission boards.

Consequently, the adoption of External Independent Evaluation Testing (EIT) as the primary criterion for admission into universities is considered one of a handful of successful anti-corruption reforms in Ukraine. External independent evaluation is conducted once a year for a number of subjects, and anyone with a school diploma can participate. It is supervised by an independent government body (CEQA – Center for Educational Quality Assessment) with no direct links to either the school system or major universities, and all participant names are protected with a unique code to protect the results from forgery.

The EIT has not eradicated corruption, but it has reduced it to a negligible level in the university admissions system. While its impact on the school curriculum and evaluation is, and should be, critically discussed, its success in giving bright students a chance to choose between the best Ukrainian universities is beyond doubt. It also provides researchers and the general public with a very good tool to understand, at least on some level, what's going on in secondary education, based on a unique dataset of country-wide university admission test results.

Obviously, it is also crucial that the results of the admission tests, a potentially life-changing endeavour, be held as privately and securely as possible. Which is why we were struck when the Ukrainian Center for Educational Quality Assessment (CEQA), which is responsible for collecting and managing the EIT data, released this August a huge dataset of independent testing results from 2016.

In this case, the dataset includes individual records. Although the names and surnames of participants were de-identified using randomly assigned characters, the dataset was still full of multiple other entries that could be linked to exact individuals. These include the exact scores (with decimals) of every test subject taken, each participant's birth year, their gender, whether they graduated this year or not and, most damning, the name of the place where each subject of the external examination was taken – which is usually the school at which the participant received their secondary education.

I. Happy Experts

Of course, the first reaction from the Ukrainian Open Data community was overwhelmingly positive, helped by the fact that previous releases of EIT datasets were frustrating in their lack of precision and scope.

A Facebook post announcing the publication: “Here are the anonymized results of IET in csv #opendata”


A Facebook comment reacting to the publication: “Super! Almost 80 thousand entries” (actually more ;)


A tweet discussing the data: “Some highly expected conclusions from IET data from SECA…”

As Igor Samokhin, one of the researchers who used the released EIT dataset in his studies, put it:

“[…This year's] EIT result dataset allows for the first time to study the distribution of scores on all levels of aggregation (school, school type, region, sex) and to measure inequality in scores between students and between schools on different levels.[…] The dataset is detailed enough that researchers can ask questions and quickly find answers without the need to ask for additional data from the state agencies, which are usually very slow or totally unresponsive when data is needed on the level lower than regional.”

Indeed, the dataset made possible some interesting visualisations and analysis.


A simple visualisation showing differences in test results between boys and girls


Quick analysis of birth years of those who took IET in 2016

But that amount of data and the variety of dimensions (characteristics) available carry many risks, unforeseen by the data providers and overlooked by the hyped open data community and educational experts. We made a short analysis of the most obvious threat scenarios.

II. What could go wrong?

As demonstrated by various past cases across the world, microdata disclosure, while extremely valuable for many types of research such as longitudinal studies, is highly susceptible to re-identification attacks.

To understand the risks involved, we went through a process called threat modelling. This consists of analysing all the potential weaknesses of a system (here, the anonymisation technique used on the dataset) from the point of view of a potential individual with malicious intentions (called an 'attacker'). Three threat models emerged from this analysis:

The ‘Nosy neighbour’ scenario

The first and most problematic possibility is the “nosy neighbour” scenario. This corresponds to the unexpected disclosure of results to relatives, neighbours, school teachers, classmates, or anyone with enough knowledge about an individual described in the dataset to recognise who the data describes – without having to look at the name. The risks involved with this scenario include possible online and offline harassment of people with too low or too high – depending on context – test results.

Unwanted disclosure may happen because members of the subject's close environment already have some additional information about the person. If you know that your classmate Vadym was one of the rare people in the village to take chemistry in the test, you can easily deduce which line of the data corresponds to him, discovering along the way all the details of his test results. And depending on what you (and others) discover about Vadym, the resulting social judgement could be devastating for him, all because of an improperly anonymised dataset.

This is a well-known anonymisation problem: it is really hard to achieve good anonymity with this many dimensions – in this case, the subjects and exact results of multiple tests plus their primary examination location.
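This intuition can be checked directly by counting how many records share each combination of quasi-identifiers, i.e. the k in k-anonymity. A toy sketch (the columns loosely mirror the EIT release; all values are invented):

```python
import pandas as pd

# Toy records with EIT-like quasi-identifiers (all values invented).
df = pd.DataFrame({
    "birth_year": [1999, 1999, 1999, 2000],
    "sex":        ["f", "f", "f", "m"],
    "location":   ["Kyiv", "Kyiv", "Kyiv", "small village"],
    "chemistry":  [180.5, 180.5, 180.5, 192.0],
})

quasi = ["birth_year", "sex", "location", "chemistry"]

# k for each record: how many rows share its quasi-identifier combination.
k_per_record = df.groupby(quasi)["birth_year"].transform("size")

# Records with k == 1 are unique -- a nosy neighbour who knows these
# attributes can single them out immediately.
unique_records = int((k_per_record == 1).sum())
```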

It’s an especially alarming problem for schools in small villages or specialised schools – where social pressure and subsequent risk of stigmatisation is already very high.

The ‘Ratings fever’ problem


Map of schools in Kiev, Ukraine’s capital, made by the most popular online media based on EIT results

The second problem with educational data is hardly new, and the release of this dataset just made it worse. With added precision and targeting power, more fervour went into the media's favoured exercise of grading schools according to the successes and failures of their students' external testing results.

In previous years, many educational experts criticised the ratings made by media and various government authorities for their incompleteness: they were based either on a full dataset covering only one test subject, or on heavily aggregated and non-exhaustive data. But such visualisations can have consequences more problematic than misleading news readers about the accuracy of the data.

The issue here is about the ethical use of the data, something often overlooked by the media in Ukraine, who happily jumped on the opportunity to make new ratings. As educational expert Iryna Kogut from CEDOS explains:

“EIT scores by themselves can not be considered as a sign of the quality of education in a individual school. The new dataset and subsequent school ratings based on it and republished by CEQA only maintains this problem. Public opinion about the quality of teaching and parental choice of school relies on results of the EIT, but the authors of the rating do not take into account parents’ education, family income, the effect of private tutoring and others out-of-school factors which have a huge influence on learning results. Besides, some schools are absolutely free to select better students (usually from families with higher socioeconomic status), and this process of selection into “elite” schools is usually neither transparent nor fair. So they are from the start not comparable with the schools having to teach ‘leftovers’. ”

Even as people start understanding the possible harm of the “rate everything” mentality for determining both public policy and individual decisions, almost every local website and newspaper has made or republished school ratings from their cities and regions. In theory, there could be benefits to the practice, such as efforts to improve school governance. Instead, what seems to happen is that more students from higher-income families migrate to private schools and less wealthy parents are incentivised to use ‘unofficial’ methods to transfer their kids to public school with better EIT records. Overall, this is a case where the principle “the more informed you are the better” is actually causing harm to the common good – especially when there is no clear agenda or policy in place to create a fairer and more inclusive environment in Ukrainian secondary education.

Mass scale disclosure

The last and most long-term threat we identified is the possible future negative impact on individuals' personal lives due to the unwanted disclosure of test results. This scenario considers the possibility of mass-scale unwanted identity disclosure of the individuals whose data were included in the recent EIT dataset.

As our research has shown, this would be alarmingly easy to execute. The only thing one needs to look at is already-published educational data. To demonstrate the existence of this threat, we only had to use one additional dataset: close to half of the EIT records could be de-anonymised with varying levels of certainty, meaning that we could find the identity of the individual behind the results (or narrow it down to a couple of individuals) for one hundred thousand individual records.

The additional dataset we used comes from another government website – vstup.info – which lists all applicants to every Ukrainian university. The data includes the family names and initials of each applicant, along with their combined EIT result scores. The reason for publishing this data was to make the acceptance process more transparent and to cut the space for possible manipulations.

But with some data wrangling and mathematical work, we were able to join this data with the EIT dataset, allowing mass-scale de-anonymisation.

So what should be the lessons learned from this?

First, while publishing microdata may bring enormous benefits to researchers, one should be conscious that anonymisation can be a really hard and non-trivial problem to solve. Sometimes less precision is needed to preserve the anonymity of the persons whose data are included in the dataset.

Second – it is important to be aware of other existing datasets that could be used for de-anonymisation. It is the data publisher's responsibility to check for them before sharing any information.

Third – it is not enough just to publish a dataset. It is important to make sure that your data won't be used in an obviously harmful or irresponsible manner, as in the various eye-catching, but very damaging in the long run, ratings and “comparisons”.


Avoiding Harm While Pushing Good Stories

- September 5, 2016 in Event report, Fellowship


Working on Responsible Data is about asking some key questions: how can we ensure the right to consent for individuals and communities? How can we preserve privacy, security and ownership around their data? These issues should be balanced with the need to create meaningful impact with a project or a story – which makes journalists one of the prime audiences for Responsible Data training. So I was excited to be invited to hold a session at a big event for journalists and independent bloggers, organised by Sourcefabric in Odessa, Ukraine.

As news stories incorporate more (personal) data than ever, journalists face several challenges related to the responsible use of this data – sometimes without being aware of them, as the discussion with my audience showed. We explored three issues often found in popular stories of the past year: the need for informed consent, the risks of covering war casualties, and the issues related to public ratings.

Why we need informed consent

As social media becomes an attractive source of data and stories for news outlets, they are reminded that the rules of traditional reporting, such as informed consent, still apply – but the nature of social media as a medium makes it much more complicated than just reaching out to the heroes of your story. We discussed this issue using the example of Buzzfeed's article on sexual assault. In this case, the journalist embedded in her story several tweets from a Twitter thread on the topic and made sure to have the consent of those whose tweets were quoted. The problem was that it was extremely easy to get to the whole Twitter thread in one click and read the stories of those who did not want the “popularity” brought by an article on Buzzfeed. They couldn't reasonably expect such a high level of visibility after answering in a Twitter thread.

This is an issue explored by Helen Nissenbaum, who explains that privacy is not binary and should be understood in context: people have a certain expectation about the final use of the information they share. Once the receiver of that information (an individual on Twitter vs Buzzfeed readers) or the transmission principle (a Twitter thread vs a Buzzfeed article) changes, it creates a perceived violation of privacy.

As participants pointed out, getting informed consent is not always easy in this kind of reporting, which relies heavily on social media – even though using human faces and personal stories is crucial to creating impact with a story.

The risks of covering war deaths

Another example dealt with the potential issues of interactive maps when used as a medium for data stories: not just the usual complications of getting a complex story right, but also the connected problems of geolocation data as a possible privacy issue. There is also a need to consider the wider context – as with the reuse of CNN's War Casualties Map in stories about other armed conflicts, and the possible danger for relatives of deceased fighters who fought “for the wrong side”. We also looked into the problem of a false sense of accuracy in the highly uncertain situation of war casualty statistics, as with the civilian casualties of the Syrian conflict in the example below:

image alt text

The issues with public ratings

Finally, we spoke a bit about the sad example of the now-closed Schooloscope project. While there are many lessons to be learned from it, we spoke mainly about how the publication of school ratings, without any public policy in place to fix the problems revealed, was damaging to the communities involved. As a good counterexample of solution-driven, not just problem-driven, data journalism, I presented ProPublica's project on public school inequality.

As a speaker, working with a less-experienced audience and needing to situate my presentation in the wider context of a data literacy event was a challenging but extremely interesting task.


Infobox
Event name: Responsible Data in Data Journalism
Event type: workshop
Event theme: Responsible Data
Description: a part of 4-days training on creating data-driven stories
Speakers: Vadym Hudyma, Jacopo Ottaviani
Partners: Sourcefabric
Location: Ukraine, Odessa
Date: August 3, 2016
Audience: data journalists
Number of attendees: 17
Gender split: 50% female, 50% male
Duration: 1.5 hours
