De-anonymising Ukraine university entrance test results

Vadym Hudyma - May 26, 2017 in Data Blog

Authors: Vadym Hudyma, Pavlo Myronov. Part 2 of a series on Ukrainian student data.

Introduction

External Independent Evaluation Testing is a single exam used nationwide for admission to all public universities.

As detailed in our previous article, the release of a poorly anonymised dataset by the organisation in charge of the External Independent Evaluation Testing (EIT) resulted in serious risks to the privacy of Ukrainian students. One of those was the risk of unwanted mass disclosure of personal information with the help of a single additional dataset. We detail below how we reached our results.

The EIT dataset contains the following dimensions:

  • Unique identifier for every person
  • Year of birth
  • Sex
  • Test scores for every subject taken by the student (exact to decimals for those who scored 95% or more of the possible points)
  • Place where the tests were taken

On the other hand, the dataset we used to de-anonymise the EIT results was collected from the website vstup.info, and it gives us access to the following elements:

  • family name and initials of the applicant (also referred to below as name)
  • university where the applicant was accepted
  • the combined EIT result scores for the required subjects, with a multiplier applied to each subject by the university, depending on its priorities.

At first glance, since every university uses its own list of subject-specific multipliers to compute applicants’ combined EIT results, it should be impossible to recover their exact EIT scores, or to find matches with exact scores in the EIT dataset.

The only problem with that reasoning is that the law requires all the multipliers to be published on the same website as part of a corruption prevention mechanism. This is good in itself. But it also provides attackers with enough data to recompute combined scores and find exact matches between the datasets.
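
For illustration, here is the calculation in miniature (a sketch with invented numbers; actual multipliers vary by university and speciality):

```r
# Hypothetical multipliers published on vstup.info for one speciality.
multipliers <- c(maths = 0.5, ukrainian = 0.3, physics = 0.2)

# Exact subject scores for one participant, taken from the EIT dataset.
scores <- c(maths = 187.5, ukrainian = 172.0, physics = 190.5)

# The combined rating this participant would have at that speciality --
# directly comparable with the rating published next to applicants' names.
combined <- sum(multipliers * scores)
combined  # 183.45
```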

How we did it

Our calculations were based on the assumption that every EIT participant applied to universities in their local region. Of course, this assumption may not be true for every participant, but it usually holds, and it is also one of the easiest ways to decrease the complexity of the calculations.

For every Ukrainian region, we isolated in the EIT dataset the subset of local test-takers and calculated the EIT ratings they would have if they had applied for every speciality at local universities. We then merged this dataset of “potential enrollees” with the real enrollees’ dataset from the website vstup.info, which contained the real names of enrollees and their final ratings (i.e. EIT scores multiplied by the subject- and university-specific multipliers), matching on university, speciality, and rating.
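
A minimal sketch of this step in R with dplyr (all object and column names here are illustrative assumptions, not our actual scripts):

```r
library(dplyr)

# eit: one row per (id, region, subject) with the exact score.
# multipliers: one row per (university, speciality, region, subject)
# with the published multiplier.
# Compute the rating every local test-taker would have for every
# (university, speciality) combination in their region.
potential <- eit %>%
  inner_join(multipliers, by = c("region", "subject")) %>%
  group_by(id, university, speciality) %>%
  summarise(rating = sum(score * multiplier), .groups = "drop")

# enrollees: name, university, speciality, rating (scraped from vstup.info).
# Matching on university, speciality and the exact rating links ids to names.
pairs <- potential %>%
  inner_join(enrollees, by = c("university", "speciality", "rating"))
```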

By joining these datasets for every region, we obtained a first set of pairs in which test-takers’ ids correspond with enrollees’ names (dataset A1B1). In this resulting set, the number of EIT participants that correspond with only one name, i.e. those who can be unambiguously identified, is 20 637 (7.7% of all participants).

To expand the scope of our de-anonymisation, we used the fact that most enrollees try to increase their chances of getting accepted by applying to several universities. We consequently tested all pairs from the first merged dataset (A1B1) against the whole dataset of enrollees (B1), counting the number of matches by final rating for every pair. We then kept the pairs that were matched by at least two unique values of EIT rating: if the same match occurs in two cases with different university/speciality coefficients forming the aggregate EIT rating, it is much less likely that we got a “false positive”.
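
Continuing the hypothetical sketch above, the filtering step could look like this:

```r
# For each candidate (id, name) pair, count how many distinct combined
# ratings support it; a pair confirmed under two different sets of
# multipliers is far less likely to be a coincidence.
c1 <- pairs %>%
  group_by(id, name) %>%
  summarise(n_ratings = n_distinct(rating), .groups = "drop") %>%
  filter(n_ratings >= 2)
```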

We thus formed a dataset in which each EIT participant’s id corresponds with one or more names, and the number of unique EIT rating values is recorded for every correspondence (C1). In this case, the number of EIT participants (unique identifiers from A1) that correspond to only one name, with more than one unique aggregate rating, is 50 845 (18.97%).

We also noticed the possibility of false positive results, namely situations where the same family name and initials from the enrollees dataset (B1) correspond with several ids from the EIT participants dataset (A1). This doesn’t necessarily mean we guessed a test-taker’s family name wrongly, especially in the case of a rather common family name: the more widespread a name is, the higher the probability that we have correctly identified several EIT participants sharing it. But it still leaves the possibility of some number of false positives.

To separate the most reliable results from the others, we identified correspondences with unique names and counted the records where a unique id corresponds with a unique name.
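
In the same hypothetical code, this amounts to:

```r
# Keep ids matched to exactly one name and names matched to exactly one id;
# these one-to-one pairs are the most reliable identifications.
one_to_one <- c1 %>%
  group_by(id) %>% filter(n() == 1) %>% ungroup() %>%
  group_by(name) %>% filter(n() == 1) %>% ungroup()
```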

Consequently, the results of our de-anonymization can be described by the following table.

| Assumptions | De-anonymised EIT participants with unique names | De-anonymised EIT participants (regardless of name uniqueness) |
| --- | --- | --- |
| 1) Every enrollee applied to at least one university in his/her region. | 8 231 (3.07%) | 20 637 (7.7%) |
| 1) + Every enrollee applied to at least two specialities with different coefficients. | 31 418 (11.42%) | 50 845 (18.97%) |

In each row, false positives can occur only if some of the enrollees broke the basic assumption(s).

So far we have been speaking about unambiguous identification of test-takers. But even narrowing the results down to a small number of possible variants makes subsequent identification trivial with any kind of background knowledge or other available datasets. In the end, we were able to narrow down to 10 or fewer possible name variants for 43 825 EIT participants. Moreover, we established only 2 possible name variants for 19 976 test-takers.

Our method provides an assumed name (or names) for every EIT participant who applied to a university in the region where they took their tests and applied to at least two specialities with different multipliers. Though not 100% free from false positives, the results are precise enough to show that the external testing dataset provides all the identifiers necessary to de-anonymise a significant share of test-takers. Of course, those with a personal or business interest, rather than a purely research interest, in test-takers’ identities or enrollees’ external testing results would find multiple ways to make the de-anonymisation even more precise and wider in scope.

(NOTE: For example, one could cluster each speciality’s rating coefficients to decrease the number of calculations and avoid our basic assumption. It is also possible to take into account the locations of EIT centres and assume that test-takers would probably try to enrol at universities in nearby regions, or to estimate the real popularity of names among enrollees using the social network “Vkontakte” API, and so on.)

Using comparatively simple R algorithms and an old HP laptop, we found more than 20 637 exact matches (7.7% of all EIT participants), re-identifying the individuals behind anonymised records. And more than 40 thousand participants were effectively de-anonymised with less-than-perfect precision – still more than good enough for a motivated attacker.

What could be done about it?

After conducting our initial investigation, we reached out to CEQA for comment. This was their response:

“Among other things, Ukraine struggles with a high level of public distrust of government institutions. By publishing information about standardized external assessment results and the work we deliver, we try to lead by example and show our openness and readiness for public scrutiny…

At the same time, we understand that Ukraine has not yet formed a mature culture of robust data analysis and interpretation. Therefore, it is essential to be aware of all risks and think in advance about ways to mitigate adverse impact on individuals and the education system in general.”

So what could be done better with this particular dataset to mitigate at least the above-mentioned risks, while preserving its obvious research value? Well, a lot.

First of all, one part of the problem is easy to fix: the exact test scores. Simply rounding and bucketing them into small ranges (publishing 172 for anything from 171 to 173, 155 for anything from 154 to 156, and so on) would make them reasonably k-anonymous. While this wouldn’t make massive de-anonymisation impossible, it would seriously reduce both the number of possible attack vectors and the precision of these breaches. “Barnardisation” (randomly adding 1 or -1 to each score) would also do the trick, though it should be combined with other anonymisation techniques.
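
A minimal sketch of both techniques (the bucket width and the randomisation scheme here are our own assumptions, following the examples above):

```r
scores <- c(154.8, 171.5, 172.0, 173.0)  # hypothetical exact scores

# Bucketing: publish only the midpoint of a 3-point-wide bucket,
# so any score from 171 to 173 becomes 172.
bucketed <- floor(scores / 3) * 3 + 1

# Barnardisation (a common variant): perturb each rounded score by
# -1, 0 or +1 at random, so exact values no longer match across datasets.
set.seed(1)
barnardised <- round(scores) + sample(c(-1, 0, 1), length(scores), replace = TRUE)
```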

The problem with background knowledge (as in the “nosy neighbour” scenario) is that it is impossible to mitigate without removing a huge number of outliers and specific cases, such as small schools or uncommon test subjects in small communities, and without very coarse score buckets or generalised test locations. Some educational experts have raised concerns about the projected huge loss in precision.

Still, CEQA could have considered releasing a dataset with generalised data and some added noise, and giving researchers access to more detailed information under a non-disclosure agreement.

This “partial release/controlled disclosure” scheme could also help deal with the alarming problem of school ratings. For example, generalising the testing location from exact places to school districts or even regions would probably help. Local media are usually not interested in comparing EIT results outside the locations their audiences live in, and national media are much more reluctant to publish stories about differences in educational results between regions, for obvious discrimination and defamation concerns.

This kind of attack is not very dangerous at this particular moment in Ukraine – we don’t have a huge data-broker market (as in the US or UK), and our HR and insurance companies do not (yet) use sophisticated algorithms to determine the fate of people’s job applications or the cost of their life insurance. But the situation is changing quickly, and this kind of sensitive personal data, which isn’t worth much at this point, could easily be exploited at any moment in the near future. Both the speed and the low cost of this kind of attack make this dataset a very low-hanging fruit.

Conclusions

The current state of affairs in personal data protection in Ukraine, as well as the workload of the responsible government staff, doesn’t leave much hope for a swift change to any of the already released datasets. Still, this case clearly demonstrates that anonymisation is a really hard problem to tackle, and that the benefits of microdata disclosure can quite easily be outweighed by the risks of unwanted personal data disclosure. So all open data activists advocating for the disclosure of the maximum information possible, as well as the government agencies responsible for releasing such sensitive datasets, should put serious effort into figuring out the possible privacy-related risks.

We hope that our work will be helpful not just for future releases of external testing results, but for the wider open data community – both in Ukraine and throughout the world.


The lost privacy of Ukrainian students: a story of bad anonymisation

Vadym Hudyma - May 23, 2017 in Data Blog

Authors: Vadym Hudyma, Pavlo Myronov. Part 1 of a series on Ukrainian student data.

Introduction

Ukraine has long been plagued by corruption in university admissions due to a complicated and untransparent admission process, especially for state-funded seats. To secure a seat, would-be students needed not just good grades from school (which were also subject to manipulation), but usually connections or bribes to university admission boards.

Consequently, the adoption of External Independent Evaluation Testing (EIT) as the primary criterion for admission into universities is considered one of a handful of successful anticorruption reforms in Ukraine. External independent evaluation is conducted once a year for a number of subjects, and anyone with a school diploma can participate. It is supervised by an independent government body (CEQA – the Center for Educational Quality Assessment) with no direct links to either the school system or major universities, and each participant’s name is replaced with a unique code to protect results from forgery.

The EIT has not eradicated corruption, but it has reduced it to a negligible level in the university admissions system. While its impact on the school curriculum and evaluation is, and should be, critically discussed, its success in giving bright students a chance to choose among the best Ukrainian universities is beyond doubt. It also provides researchers and the general public with a very good tool to understand, at least at some level, what’s going on in secondary education, based on a unique dataset of country-wide university admission test results.

Obviously, it’s also crucial that the results of the admission tests, a potentially life-changing endeavour, be held as privately and securely as possible. This is why we were struck when the Ukrainian Center for Educational Quality Assessment (CEQA), also responsible for collecting and managing the EIT data, released this August a huge dataset of independent testing results from 2016.

This dataset includes individual records. Although the names and surnames of participants were de-identified using randomly assigned characters, the dataset was still full of other attributes that could be linked to exact individuals. These include the exact scores (with decimals) of every test subject taken, each participant’s year of birth, their gender, whether they graduated this year or not and, most damning, the name of the place where each external examination subject was taken – usually the school where the participant received their secondary education.

I. Happy Experts

Of course, the first reaction from the Ukrainian open data community was overwhelmingly positive, helped by the fact that previous releases of EIT datasets were frustrating in their lack of precision and scope.

A Facebook post announcing the publication: “Here are the anonymized results of EIT in csv #opendata”

A Facebook comment reacting to the publication: “Super! Almost 80 thousand entries” (actually more ;)

A tweet discussing the data: “Some highly expected conclusions from EIT data from CEQA…”

As Igor Samokhin, one of the researchers who used the released EIT dataset in his studies, put it:

“[..This year’s] EIT result dataset allows for the first time to study the distribution of scores on all levels of aggregation (school, school type, region, sex) and to measure inequality in scores between students and between schools on different levels.[…] The dataset is detailed enough that researchers can ask questions and quickly find answers without the need to ask for additional data from the state agencies, which are usually very slow or totally unresponsive when data is needed on the level lower than regional.”

Indeed, the dataset made possible some interesting visualisations and analysis.

A simple visualisation showing differences in test results between boys and girls

Quick analysis of the birth years of those who took the EIT in 2016

But that amount of data, and the variety of dimensions (characteristics) available, carries many risks, unforeseen by the data providers and overlooked by the hyped open data community and educational experts. We’ve made a short analysis of the most obvious threat scenarios.

II. What could go wrong?

As demonstrated by various past cases across the world, microdata disclosure, while extremely valuable for many types of research such as longitudinal studies, is highly susceptible to re-identification attacks.

To understand the risks involved, we went through a process called threat modelling. This consists of analysing all the potential weaknesses of a system (here, the anonymisation technique used on the dataset) from the point of view of a potential individual with malicious intentions (called an ‘attacker’). Three threat models emerged from this analysis:

The ‘Nosy neighbour’ scenario

The first and most problematic possibility is called the “nosy neighbour” scenario. This corresponds to an unexpected disclosure of results to relatives, neighbours, school teachers, classmates, or anyone with enough knowledge about an individual described in the dataset to recognise who the data describes – without having to look at the name. The risks involved in this scenario include possible online and offline harassment of people with test results deemed too low or too high, depending on the context.

Unwanted disclosure may happen because members of the subject’s close environment already have some additional information about the person. If you know that your classmate Vadym was one of the rare people in the village to take chemistry in the test, you can easily deduce which line of the data corresponds to him, discovering along the way all the details of his test results. And depending on what you (and others) discover about Vadym, the resulting social judgement could be devastating for him, all because of an improperly anonymised dataset.

This is a well-known anonymisation problem: it is really hard to achieve good anonymity with this many dimensions – in this case, the subjects and exact results of multiple tests and their primary examination location.

It’s an especially alarming problem for schools in small villages or specialised schools – where social pressure and subsequent risk of stigmatisation is already very high.

The ‘Ratings fever’ problem

Map of schools in Kiev, Ukraine’s capital, made by the most popular online media based on EIT results

The second problem with educational data is hardly new, and the release of this dataset just made it worse. With added precision and targeting power, more fervour was granted to the media’s favourite exercise: grading schools according to the successes and failures of their students in the external testing.

In previous years, many educational experts criticised the ratings made by media and various government authorities for their incompleteness: they were based either on a full dataset covering only one test subject, or on heavily aggregated and non-exhaustive data. But such visualisations can have consequences more problematic than misleading news readers about the accuracy of the data.

The issue here is about the ethical use of the data, something often overlooked by the media in Ukraine, who happily jumped on the opportunity to make new ratings. As educational expert Iryna Kogut from CEDOS explains:

“EIT scores by themselves cannot be considered a sign of the quality of education in an individual school. The new dataset, and the subsequent school ratings based on it and republished by CEQA, only maintains this problem. Public opinion about the quality of teaching, and parents’ choice of school, rely on the results of the EIT, but the authors of the ratings do not take into account parents’ education, family income, the effect of private tutoring and other out-of-school factors which have a huge influence on learning results. Besides, some schools are absolutely free to select better students (usually from families with higher socioeconomic status), and this process of selection into “elite” schools is usually neither transparent nor fair. So they are, from the start, not comparable with the schools having to teach ‘leftovers’.”

Even as people start to understand the possible harm of the “rate everything” mentality for both public policy and individual decisions, almost every local website and newspaper has made or republished school ratings for their cities and regions. In theory, the practice could have benefits, such as spurring efforts to improve school governance. Instead, what seems to happen is that more students from higher-income families migrate to private schools, and less wealthy parents are incentivised to use ‘unofficial’ methods to transfer their kids to public schools with better EIT records. Overall, this is a case where the principle “the more informed you are, the better” actually harms the common good – especially when there is no clear agenda or policy in place to create a fairer and more inclusive environment in Ukrainian secondary education.

Mass scale disclosure

The last and most long-term threat identified is the possible future negative impact on individuals’ personal lives due to the unwanted disclosure of their test results. This scenario considers the possibility of mass-scale unwanted identity disclosure of the individuals whose data were included in the recent EIT dataset.

As our research has shown, this would be alarmingly easy to execute. The only thing one needs to look at is already-published educational data. To demonstrate the existence of this threat, we only had to use one additional dataset: close to half of the EIT records could be de-anonymised with varying levels of certainty, meaning that we could find the identity of the individual behind the results (or narrow it down to a couple of individuals) for a hundred thousand individual records.

The additional dataset we used comes from another government website – vstup.info – which lists all applicants to every Ukrainian university. The data includes the family name and initials of each applicant, along with their combined EIT result scores. The reason for publishing this data was to make the acceptance process more transparent and to reduce the space for possible manipulation.

But with some data wrangling and mathematical work, we were able to join this data with the EIT dataset, enabling mass-scale de-anonymisation.

So what should be the lessons learned from this?

First, while publishing microdata may bring enormous benefits to researchers, one should be conscious that anonymisation can be a really hard and non-trivial problem to solve. Sometimes precision must be sacrificed to preserve the anonymity of the people whose data is included in the dataset.

Second, it’s important to be aware of any other existing datasets that could be used for de-anonymisation. It is the data publisher’s responsibility to check for them before sharing any information.

Third, it’s not enough just to publish a dataset. It’s important to make sure that your data won’t be used in an obviously harmful or irresponsible manner, as in the various eye-catching but, in the long run, very damaging ratings and “comparisons”.


How do you become data literate? Part 1

helene.hahn - March 9, 2017 in Community, Data Blog

We at Open Knowledge Foundation Germany launched a new project this year that we’re very proud of: Datenschule (datenschule.de), the German version of School of Data. We want to encourage civil society organisations, journalists and human rights defenders to use data and technology effectively within their work to create positive social change.

But what does it actually mean to become ‘data literate’? Where do you start and how can you use data within your work and projects? To explore these questions, we would like to introduce some of our community members and data activists from around the world, who ended up working with data at some point in their lives. We were curious about how they actually got started and – looking back now – what they would recommend to data newbies.

Each month we will publish a new interview; this is no. 1. Got feedback? Have questions? Feel free to get in touch: helene.hahn@okfn.de

Camila Salazar

Who: Data journalist from Costa Rica, working at the newsroom La Nación, data trainer at School of Data

Topics: data-driven stories on society, economics, politics

Tweets: @milamila07

Hi Camila, please introduce yourself.

My name is Camila, I’m from Costa Rica and I’m a data journalist and an economist. I’m currently working at a newspaper called La Nación in the data unit. I’m also involved with the School of Data community and started as a fellow in 2015. This was the year when I started running data trainings and workshops. I was trying to build a community around data in Costa Rica, in Latin America, also a bit in Mexico and South America.

When was the first time you came across data and when did you start to use data in your work?

I started studying journalism, but after my second year I was disappointed with the university and I wasn’t really motivated. So I thought, maybe I could start studying something else besides journalism. I enrolled in Economics at my university and took both courses simultaneously. In Economics it’s all about numbers, and I really liked it. But when I was about to finish my journalism studies, I thought: do I want to be an economist and work in a bank, or do I want to write stories? How could I combine both? That’s how I got involved in data journalism. I found that this was an area where you could combine both in a good way. You can take all the methods and technical skills that you acquire in an economics degree and apply them to tell stories of public interest. That’s how I mix both and, so far, it’s worked well.

What topics and projects are you currently working on?

At the data unit at La Nación we don’t focus on one major topic; it changes all the time. This year we ran projects about the municipal elections in Costa Rica. We collected data on the mayors that were running in the different counties. We also developed a project to live fact-check the promises of the president. Every year, he gives a speech about the situation of the country. We built a platform where you could follow the speech live and see if the things that the president says are true or not. We try to look for all kinds of stories and narratives and see what kind of data is available on each topic. It could be a social topic, an economic one or something else. Now we are working on a project around wages. Within our unit we have the liberty to choose our topics and see what’s interesting.

How would you explain data literacy?

I think to be data literate is to change the way you solve problems. You don’t have to be a super pro in statistics. It’s about the way you approach questions and the way you solve them. For example, if you’re working in a social discipline, in economics or in science, you are used to solving problems with certain scientific methods: you ask a question, apply a method and then try to prove your point; you experiment a lot with data. That’s the way you become data literate. And this can work in any kind of field – in data journalism, public policy, economics, or if you are trying to introduce better solutions to improve efficiency in your business. Data literacy is about changing your way of thinking. It’s about trying to prove things and find solutions with numbers and data. It’s a way of making things more methodical and reproducible.

What would you recommend to someone interested in data, but who does not know where to start?

If you really don’t know anything about data, don’t worry, it’s not that hard to get started. There are many learning resources available online. For a start, I would look for projects by people who already work with data – to get inspired. Then you can look for tools online; for example, on schoolofdata.org there are courses and links to projects, and it’s a good way to start. Don’t be afraid, and if you want to go super pro, I encourage you to do it. But it’s a process: you shouldn’t expect to be modelling data in two weeks, but in two weeks you can learn the basics and start answering small questions with data.

Links:

Blog posts by Camila on School of Data

Data unit at the newspaper La Nación

Live fact-checking project on presidential promises http://www.nacion.com/gnfactory/investigacion/2016/promesas-presidente/index.html


SNI 2016: ICT and Open Data for Sustainable Development

Malick Lingani - November 23, 2016 in Data Blog, Fellowship

The National ICT Week (SNI) is an annual event in Burkina Faso dedicated to promoting ICT. Each year, thousands of people are introduced to the basics of operating computers, and impactful ICT initiatives are rewarded with a host of prizes. This year’s event, the 12th edition, was hosted by the Ministry of Digital Economy from May 31st to June 4th under the theme of ICT and sustainable development.

The panelists of the conference

The Burkina Open Data Initiative (BODI) was represented by its Deputy Manager, Mr. Malick Tapsoba. He opened with a speech that gave the audience a general idea of what open data is about. He then presented some of BODI’s key accomplishments so far:

  • NENDO, a web application developed with data on education available on the Burkina Faso open data portal, was presented as an example of how open data can be used to boost accountability in education systems

  • the GIS data collected on drinkable water wells has become a key decision-making tool toward the achievement of Sustainable Development Goal (SDG) 6: ‘Ensuring availability and sustainable management of water and sanitation for all.’

  • The open election project: a web platform that allowed the visualization of both the 2015 presidential and legislative election results. The visualizations were created almost in real-time, as fast as the data was released by the electoral commission. This project, initiated by BODI, has strongly contributed to the acceptance of the election’s results by all contenders.

Some ongoing projects of BODI were also presented:

  • Open Data and government procurement tracking project. This project aims to improve transparency in the government’s budget spending and to unlock opportunities for enterprises based on market competition.

  • Open Data to monitor both foreign funds and domestic funds: “When the data are not available and open, how can we measure progress toward Sustainable Development Goals?”, said Mr. Tapsoba.

Mr. Tapsoba also announced that a hackathon had been organised to showcase the use of open data and that the results would be revealed at the closing ceremony of SNI. One participant, a student who took part in the hackathon, called for more initiatives like these. He said that he strongly appreciated the way hackathons allow programmers and non-programmers to work together to build data applications and, for him, this helps to demystify ICT in general.

Mr. Sonde Amadou, CEO of Dunya Technology and one of the panelists, spoke about Smart Cities: African cities are growing fast, he said, and Ouagadougou, the capital city of Burkina Faso, is one of them. But Open GIS Data, he continued, is a stumbling block for Smart Cities and work is needed in this area.

Dr. Moumini Savadogo, IUCN Head Country Programme, talked about the IUCN Red List of threatened, critically endangered, endangered and vulnerable species in Africa. This list helps raise awareness and encourages better informed decisions for the conservation of nature, something critical for sustainable development.

The 400 participants of the conference were well served, and I am confident that most of them can now be considered open data advocates. As a School of Data Fellow, I made sure to speak after the panelists, pointing out the importance of strong institutions supported by transparency and accountability (SDG 16) for achieving the 2030 agenda in general. I encouraged the audience to take a look at open data portals, notably BODI and EITI, for transparency in the extractive industry, including its environmental impact. I also mentioned the GODAN initiative for SDG 02 and called on the panelist Malick Tapsoba to expand on that. That day, the open data community of Burkina Faso made one more step on its journey towards a stronger community of open data and data literacy advocates.


Infobox
Event name: SNI 2016: ICT and Open Data for Sustainable Development
Event type: Conference
Event theme: ICT and Open Data for Sustainable Development
Description: The purpose of the conference, part of Burkina Faso’s National ICT Week (SNI), was to showcase the role of ICT and Open Data in meeting the Sustainable Development Goals (SDGs). It was designed to bring together ICT specialists, academia and Open Data activists to explore and learn about the SDGs and how ICT and Open Data can contribute to that agenda.
Speakers: Pr. Jean Couldiaty (University of Ouagadougou) Facilitator, Mr. SONDE Amadou (CEO of Dunya Technology), Mr. Malick Tapsoba (BODI Deputy Manager), Dr. Moumini SAVADOGO (IUCN Head Country Programme)
Partners: Burkina Faso Ministry of Digital Economy, Burkina Faso Open Data Initiative (BODI), International Union for the Conservation of Nature (IUCN), University of Ouagadougou
Location: Ouagadougou, Burkina Faso
Date: May 31st 2016
Audience: ICT specialists, Open Data and Data Literacy enthusiasts, Students, Journalists
Number of attendees: 400
Gender split: 60% men, 40% women
Duration: 1 day
Link to the event website: http://www.sni.bf


Who works with data in El Salvador?

Omar Luna - November 16, 2016 in Data Blog, Fellowship

For five years, El Salvador has had the Public Information Access Law (PIAL), which requires all state, municipal and public-private entities to release various kinds of information, such as statistics, contracts, agreements and plans. Under PIAL, these materials must be provided in an accurate and timely manner.

Alongside the social control exerted by Civil Society Organizations (CSOs) in El Salvador to ensure compliance with this law, the country’s public administration made space for the emergence of various bodies, such as the Institute of Access to Public Information (IAPI), the Secretariat of Transparency, Anti-Corruption and Citizen Participation, and the Open Government website, which compiles more than 92,000 official data documents (without periodic review of these documents by any government official).

In this five-year period, the government showed its discontent. Why? It didn’t expect that this legislation would strengthen the journalistic, activist and investigative powers of civil society, which took advantage of this time to improve and refine the techniques by which it requests information from the public administration.

Presently, there are few digital skills amongst these initiatives in the country. It has now become essential to ask: what is known about data in El Salvador? Are the initiatives that have emerged limited in the scope of their achievements? Can something be done to awaken or consolidate people’s interest in data? To answer these and other questions, I conducted a survey of different research and communication professionals in El Salvador, and this is what I found.

The Scope

“I think [data work] has been explored very little (in journalism at least),” said Jimena Aguilar, a Salvadoran journalist and researcher, who also assured me that working with data helps provide new perspectives on stories that have been written about for some time. One example is Aguilar’s research for La Prensa Grafica (LPG) sections covering transparency, legal work, social issues and other topics.

Similarly, I discovered different initiatives that are making efforts to incorporate the data pipeline into their work. For two years, the digital newspaper ElFaro.net has explored various national issues (laws, homicides, deputies’ travel, pensions, etc.) using data. Over the same period, the Latitudes Foundation processed different aspects of gender issues to determine that violence against women is a multi-causal phenomenon in the country, under the “Háblame de Respeto” project.

And although government administrations and related institutions still resist adequately providing the information requested by civil society (deputies, think tanks, Non-Governmental Organizations (NGOs), journalists, amongst others), more people and entities are interested in data work, taking the steps necessary to obtain information that allows them, for instance, to know the level of pollution in the country, to build socio-economic reports, to uncover the history of Salvadoran political candidates and, more broadly, to promote the examination of El Salvador’s past in order to understand the present and try to improve the country’s future.

The Limitations

“[Perhaps,] it is having to work from scratch. A lot of carpentry work [too much work for a media outlet professional],” says Edwin Segura, director for more than 15 years of LPG Datos, one of the main data units in the country. He also told me that too much time and effort is often lost cleaning false or malicious data provided by different government offices, which is frequently incomplete or insufficient. Obviously, Segura says, this is intended to hinder the work of those working with data in the country.

In addition, there’s something very important that Jimena told me about data work: “If you are not working as a team, it is difficult to do [data work] in a creative and attractive way.” What she said caught my attention for two reasons. First, although there are platforms that help create visualisations, such as Infogr.am and Tableau, you always need a multidisciplinary approach to jump-start a data project, as is the case with El Diario de Hoy’s data unit, which is made up of eight people specialised in data editing, web design, journalism and other related areas.

And, on the other hand, although there are various national initiatives that work to obtain data, such as Fundación Nacional para el Desarrollo (FUNDE) and the Latitudes Foundation, efforts to do something with the results are scattered: everyone takes on the challenge of working with databases individually instead of pursuing common goals.

Stones in the Road

When I asked Jimena about the negative implications of working with data, she was blunt: “(Working with data) is something that is not understood in newsrooms […] [it] takes a lot of time, something that they don’t like to give in newsrooms.” And not only newsrooms: NGOs and various civil society initiatives are also unaware of the skills needed to work with data.

Of the many internal and external factors affecting the construction of data-driven stories, I would highlight the following. To begin with, there is fear and widespread ignorance of mathematics and basic statistics, so individuals across a wide variety of sectors don’t understand data work; to them, learning to use data in their work is a waste of time. It seems much simpler to gather data from press conferences, institutional reports and official statements, which is a mistake, because they don’t see how data journalism could help them tell stories in a different way.

Another issue is the inconsistency in government actions: although the government discursively supports transparency, its actions focus on answering requests vaguely rather than proactively releasing good-quality data, and the process is hampered by delays. I experienced this first-hand: on many occasions, I received information that didn’t match what I had requested or, on the contrary, officials sent me information different from what they had sent in response to similar requests from other civil society sectors (journalists, researchers, etcetera).

Where Do We Go From Here?

In this context, it becomes essential to start making different sectors of civil society aware of the importance of data on specific issues. To that end, I find myself designing a series of events with multidisciplinary teams, workshops, activities and presentations that deconstruct the fear of numbers that people currently have, through the exchange of experience and knowledge. Only then can our civil society groups make the invisible visible and explain the why behind all kinds of topics that are discussed in the country.

With this approach, I believe that not only future generations of data practitioners can benefit from my activities, but also those who currently have only indirect contact with data (editors, coordinators, journalists, etc.), whose work can be enhanced by an awareness of data methodologies: for example, by encouraging situational awareness of data in the country, time-saving tools, and moving beyond traditional approaches to visualisation.

After working for two years with gender issues and historical memory, I have realised that most data practitioners are self-taught; through trainings of various kinds we can overcome internal and external challenges and, in the end, reach common goals. But we don’t have any formal curricula, and all we’ve learned so far comes from trial and error, something we have to improve with time.

We are also coping with the obstacles the government imposes on how data is requested and how the requested information is sent, and we have to constantly justify our work in workplaces where data work is not appreciated. From NGOs to media outlets, data journalism is seen as a waste of time because we don’t produce materials as fast as they would like; they don’t appreciate all the effort required to request, clean, analyse and visualise data.

As part of my School of Data Fellowship, I’m supporting the design of an educational curriculum specialising in data journalism for fellow journalists in Honduras, Guatemala and El Salvador, so they may acquire all the necessary skills and knowledge to produce data stories on specific issues in their home countries. This is a wonderful opportunity to awaken the persistence, passion and skills for doing things with data.

The outlook is challenging. But now that I’m aware of the limits, scope and stones in the way of data journalism in El Salvador and all that remains to be done, I want to move forward. I take up the challenge this fellowship has presented me, because as Sandra Crucianelli (2012) would say, “(…) in this blessed profession, it is not those with good connections, or even brilliant minds, who shine: only the perseverant shine at this task. That’s the difference.”


Understanding the extractives data community in Burkina Faso

Malick Lingani - November 15, 2016 in Data Blog, Fellowship

As a 2016 School of Data Fellow, my focus area of work is Extractives Data and I work with NRGI to advance data literacy in that sector in Africa, particularly in Burkina Faso.

Burkina Faso is experiencing a mining boom, mainly due to the exploitation of gold. From a production of 5.6 tons of gold in 2008, Burkina Faso rose to 36.5 tons exported in 2015, and the projection for 2016 is about 39.6 tons. In terms of revenue, the share of mining in 2015 budgetary revenues was estimated at about 170 billion CFA francs (about $280 million). This represents a major development challenge for the country, and in particular for the local communities around mining sites. Work to monitor and inform communities on these issues is of paramount importance, and it is necessary to identify the actors involved in that work.

So, a good starting point was to map the community around Extractives Data in Burkina Faso. From May 26th to July 12th 2016, I was able to achieve a clear understanding of who is involved in what and the challenges they face.

Burkina Faso main Civil Society Organizations Coalition (SPONG) Annual Assembly

The open data and Extractives community can be split into 3 categories: Government bodies and Institutes, media and individual data journalists, and Civil Society Organisations.

Government bodies and Institutes

The two main relevant government institutions are the Information and Communication Technology Promotion National Agency (ANPTIC) and the Burkina Open Data Initiative (BODI), which together lead the Open Data community in Burkina Faso. There are also institutes such as the National Institute of Statistics and Demography (INSD), the Research Institute for Development (IRD) and the Institute of Science of Population (ISSP), which undertake many socio-economic studies on the impact of mining in Burkina Faso. These government bodies and institutes are regularly invited by the ANPTIC to meet, in order to strengthen their relationships and encourage them to open the data in their possession.

Media

The media play a key role in covering all events related to extractives, but their work doesn’t stop there: some media organisations perform in-depth analysis of data to fully inform the country’s citizens. Among Burkinabe investigative journalists, known for their sharp insights on political issues, some have decided to take socio-economic courses, with the aim of becoming better armed for achieving transparency and fighting corruption. The gold mining sector, in particular, is a regular subject of investigation.

Lefaso.net, Burkina24, the blog “Le blog sam la touch” (which features all hot subjects in Burkina Faso), l’Indépendant, l’Économiste du Faso (the first economic weekly journal of Burkina Faso), l’Évènement (a monthly newspaper), and Le Pays are major media actors. Individual data journalists, specifically Justin Yarga and Stella Nana, are among those that are shaking the web with interesting insights around Extractives.

Their main data sources are research institutes and field surveys, and some of the journalists use the EITI data portal to communicate their findings.

Civil Society Organizations (CSOs)

The main CSOs involved in Extractives Data are:

  • “Chambre des mines” (CMB), a non-profit organization representing the mining private sector. CMB collects economic and environmental data from the mining companies;

  • EITI Burkina Faso. When it comes to advancing Open Data in Extractives, the Extractive Industries Transparency Initiative (EITI), supported by the World Bank Group through a Multi-Donor Trust Fund (MDTF), is a central actor. Burkina Faso produces annual EITI reports that disclose the production and revenue of the extractive industries. The latest report published covers 2013;

  • Open Burkina is a young organization of activists that are active in Burkina Faso’s Open Data community;

  • Open Street Map Burkina is concerned about the spatial distribution of the mining sites;

  • BEOG NEERE, an NGO working for human rights, transparency and accountability has conducted studies on gold mining and child labour in Burkina Faso;

  • Diakonia, Oxfam, Plan Burkina, SOS Sahel International Burkina Faso are major International NGOs that are interested in the social, economic and environmental impact of mining;

  • “Publish What You Pay – Burkina Faso”, a coalition of CSOs working for transparency and accountability, advocating at the policy level for people to get the most benefit from the flourishing mining sector;

  • ORCADE (Organization for the Reinforcement of Development Capacity) is also advocating at the policy level for open contracts and for the adoption of a mining code that is beneficial to local communities.

Data availability and trainings as the main challenge

Several events gather the community throughout the year. I was fortunate enough to attend some of this year’s major events:

  • SEMICA (the annual gathering of actors involved in mining and energy), held in Ouagadougou, Burkina Faso from May 26th to May 28th.

  • The SNI (the National ICT Week), which took place in Ouagadougou from the 31st of May to the 4th of June, was also a great place to meet journalists and data journalists interested in mining and sustainable development.

  • The 3rd event I attended was the General Assembly of SPONG (the main coalition of Civil Society Organizations in Burkina Faso), held in Ouagadougou on the 31st of May.

I used these opportunities to administer questionnaires to the people I met at these events, in order to gain insight into their relationships with other actors in the field, the challenges they face in their work, and their needs.

As revealed by the answers to the questionnaire, the main challenge for this otherwise vibrant community is still getting companies to release data. Building strong relationships among community members has helped overcome this issue. “We are facing difficulties convincing data producers to shake things up. But we take it more as a challenge :-)”, said Idriss Tinto, technical manager at BODI. Mr Tinto also pointed to the need for more funds to overcome that challenge.

Capacity building is also one of the recurrent needs expressed by members of the community. “Getting more access to data and more tailored and complete training on data processing and data analysis are the main needs,” said Abdou Zouré, editor-in-chief at Burkina24. Inna Guenda-Segueda, communication manager at CMB, pointed to the need for training on collecting data that is as disaggregated as possible from mining companies. Hence, specific trainings on the data pipeline are needed to support both data journalists and Civil Society Organizations.


The state of Open Data in Burkina Faso

Malick Lingani - November 5, 2016 in Data Blog, Fellowship

The Open Data Community

The Burkina Open Data Initiative (BODI), set up by the National Agency for the Promotion of ICT (ANPTIC), leads the open data community in Burkina Faso. Created in 2013, BODI has undertaken many activities, including building a strong community of local Civil Society Organizations (CSOs), Non-Governmental Organisations (NGOs), government bodies and international organizations. The Open Data Portal is also one of BODI’s key achievements: at present, nearly 200 datasets have been released.

Workshop at ANPTIC headquarters for the development of the NENDO Application

To engage more people around the Open Data Initiative, BODI organises workshops and conferences in order to strengthen its partnerships with relevant stakeholders.

Education data as a starting point

Data on education has been released in partnership with the Ministry of Education, and also in partnership with local communities. The whole process started in April 2014, in the coworking space Jokkolabs, with a workshop involving government bodies, the National Agency of Statistics and Demography, the World Bank, OpenStreetMap Burkina and many NGOs.

The Education dataset also serves as a case study for BODI to showcase the benefits of Open Data. For instance, a web application called NENDO is being developed. NENDO shows a multilayer digital map of primary schools and kindergartens with their characteristics. CSOs like Open Burkina, BEOG NEERE, JokkoLabs and OuagaLab, as well as the Ministry of Education, were involved in the development of NENDO through a series of workshops and bootcamps organised at ANPTIC headquarters.

Recently, in the same spirit of inclusiveness, BODI approached the agriculture, water and environment sectors. A workshop was conducted in December 2015 with the participation of relevant stakeholders from these sectors, including IRC Wash, the ministries in charge of agriculture, water and environment, and their respective specialised bodies. The aim was to present the work already achieved in releasing data on drinkable water wells. Malick Tapsoba, Deputy Manager of BODI, encouraged the ministries to release more of the data in their possession, in order to better inform decisions at all stages and trigger innovation for the benefit of all. The World Bank is the main funder of all the initiatives cited above.

Events and conferences as additional initiatives

Other BODI initiatives include the 2016 Open Data Day, co-organised with Open Knowledge International (OKI) Burkina Faso and the NGO BEOG NEERE. Open Street Map (OSM), Open Burkina, the Geek Developers Network (GDN) and the Fablab Ouagalab were also present on March 5th 2016.

The academic world cannot be left behind, so a series of conferences were held in the three major universities of the country with the theme “Open Data and Academia: Challenges and Opportunities”. The University Aube Nouvelle hosted its conference on April 13th, followed one week later by the University of Bobo Dioulasso and by the University of Ouagadougou on April 28th.

Some media, mainly online media like Burkina24 and LeFaso.net, are always associated with BODI activities. These media are not only covering BODI’s events but are participating as key data practitioners in the emergence of a stronger data journalism community. In fact, data journalism trainings are organised for media, including bloggers, by the ANPTIC and BODI’s team.

All these conferences and workshops have helped build and strengthen a diversified community around open data in Burkina Faso, but some major work remains to be done. For instance, a parallel open data community for Extractives has emerged around EITI Burkina Faso and the Burkina Chamber of Mines (CMB). I see my School of Data Fellowship as a huge opportunity to link those communities. In the next article, I will look at the Open Data and Extractives community in Burkina Faso.


The state of Open Data in Bolivia

Raisa Valda Ampuero - November 4, 2016 in Data Blog, Fellowship

To start my work as a School of Data fellow this year, I needed to survey the Open Data community in Bolivia. I wanted to rediscover and meet its members, those that are driving us towards the goal of making the data in our country open, and identify the concerns and the challenges we face.

Bolivia’s first experience with open data occurred in June 2013 with the first Data Bootcamp in Bolivia; this event was led by experts, including Michael Bauer, and was attended mainly by journalists and developers. From there, we moved forward (with small steps) on the open data path with a second edition of the Data Bootcamp, followed in 2015 by the first Data Journalism Accelerator. This helped launch several small journalistic and citizen-led projects.

In order to understand the community and how best to promote the results of our joint efforts in this field (although it is not easy to take a snapshot of an emerging movement), we can categorise the Bolivian Open Data community into three groups: citizen initiatives, journalism and government agencies.

Citizen Initiatives

Marco Antonio Frías, a member of the open software movement, started making maps nine years ago with Open Street Map (OSM). The project had two main goals: to move away from over-reliance on a single technology, and to gain a deeper understanding of how and where we live, understanding that maps are not simple photographs but abstractions, representations of a social situation shaped by our social and cultural precepts. It is a form of open data, he commented, since the Bolivian maps published by the Military Geographical Institute are expensive and of low quality. He highlighted the fact that the OSM community provides recent public maps of Cochabamba, a city in central Bolivia, whereas the last one published by the city is from 2003.

Mapillary, another example highlighted by him, is a service that enables collaboration and use of photographs taken at street level anywhere in the world, in any form of locomotion.

The challenges faced this year by Open Street Map Cochabamba relate to the workshops on geography and graphical representation that need to be run in educational settings. They will also have to bring a wider array of people into the movement, beyond cartography and geography specialists.

Mariana Leyton is another person who has followed the rise of open data and open government in the region since early 2015, both out of personal interest and professionally, as a GobApp communications manager. In August last year, she collaborated with another Bolivian open data activist, Fabian Soria, to add Bolivia to Open Knowledge International’s Global Open Data Index. Using Facebook to call for volunteers, she got people to participate using a Google Spreadsheet where they could upload information related to the 15 types of databases used to assess each country. Although this call did not attract many contributions, the result was a document centralising this information, which was published in September last year.

Other volunteer groups exist on Facebook, such as DatosAbiertosBo; Luis Rejas, the group's creator, believes that greater use of open data will push other institutions to open up their own.

The project "Cuántas Más" also plays a key role in the open data movement: the team behind it has been monitoring and collecting data on femicide cases in Bolivia for more than a year. Femicide has been officially recognised since Act 348, a law passed in March 2013. The Cuántas Más team created an open database and visualised the data using timelines and georeferencing, and also produced statistics and media files for every case. Continuous validation of the data allows them to produce accurate figures on various aspects of gender violence.

Another volunteer-led citizen project, "Que no te la charlen", was a winner of the first Data Journalism Accelerator challenge. It promoted greater transparency around universities by systematically collecting data from public and private universities in Bolivia, then georeferencing each university and its associated information to allow comparisons. The other aspect of the project focused on questioning the origins and uses of Bolivian public universities' budgets: for this purpose, the team analysed budget data from fourteen universities covering 2010 to 2014.

Journalism

On the journalism side, a key reference is LT-Data, a data-focused section of the website of Los Tiempos, the main newspaper in Cochabamba. Maintained by a team led by Fabiola Chambi, with the support of Mauricio Canelas, the project aims to produce a regular series of data journalism articles. Fabiola says she taught herself by studying examples and implementing projects, with the Argentinian newspaper La Nación as a major reference. She began with population data from the World Bank, then worked with data from the 2012 Bolivian census. The early results were mainly a way to find out what was available and what could work. However, she notes that a defining moment came at the second Bootcamp in La Paz with the "Elige Bien" project, a platform designed for citizen engagement that displayed information about candidates and parties in the 2014 elections. That said, the team is still not fully dedicated to LT-Data, sharing its time with other projects in the newsroom.

Also worth highlighting is the work of ED Data, from the Santa Cruz newspaper El Deber, which features a project called "Assets of Evo Morales' Cabinet during the last decade". This project required eight months of combined effort by journalist Nelfi Fernandez and developer Williams Chorolque, under the guidance of expert data journalists Sandra Crucianelli and David Dusster, from Argentina and Spain respectively.

A similar project is the DataBo initiative, run by the digital platform La Pública and the NGO Oxfam. As Javier Badani, director of La Pública, explained in an interview with the Inter-American Development Bank, "public institutions have not yet fully internalized the culture of information openness. However, Los Tiempos and La Pública have developed research projects using journalism and an open data philosophy."

Other challenges related to data journalism and open data remain, as Tonny Lopez, a journalist from El Alto, in La Paz, sees it: "Many want to do it, but internet access is still poor in newsrooms, and there is no clear process or means available. Introducing new journalists to digital journalism is an ongoing effort, and there is a need to train them in the logic of digital tools from the outset. Those efforts currently focus on free internet tools."

Bolivian Government Agencies

AGETIC, the Agency for Electronic Government and Information and Communication Technologies, was created by Supreme Decree No. 2514 on 9 September 2015. It is a decentralised entity under the Ministry of the Presidency. The same decree mandates the creation of the Council of Information Technology, which includes an open data working group. According to Wilfredo Jordan, the working group's head consultant, its main task in the coming months will be to standardise and release data from the Bolivian State through a web platform. This first phase includes a training component aimed at specific audiences (journalists and researchers, for example) to promote open data, as required by the decree that created the agency.

An analysis of the state of open data by the agency highlighted the fact that Bolivian public institutions, while accustomed to sharing information under the Transparency Act, usually share it in formats that are not easily reusable. This is the case for existing open data initiatives such as GEO Bolivia and the National Statistics Institute. It is therefore necessary to educate civil servants about open standards and to work on the standardisation of public data.

Bolivia is a newcomer to the Open Data movement, making it important for the country to learn from the experience of its neighbours; there is much to learn and understand, but luckily there is a genuine will to do so.

Thank you to everyone who contacted me, answered my calls, or sent information through social networks; not all contributions could be included in this article, but they are proof of the dynamism of this emerging community.


Data in December: Sharing Data Journalism Love in Tunisia

Ali Rebaie - January 11, 2016 in Data Blog, Data Expeditions, Data for CSOs

NRGI hosted the #DataMuseTunisia event in collaboration with Data Aurora and School of Data senior fellow Ali Rebaie on 11 December 2015 in beautiful Tunis, where a group of participants from various CSOs and NGOs met at the Burge Du Lac Hotel to learn how to craft their datasets and share their stories through creative visuals.

Bahia Halawi, one of the leading women data journalism practitioners in the MENA region and co-founder of Data Aurora, led the three-day workshop, which brought together professionals from different CSOs. NRGI has been working closely with School of Data to drive economic development and transparency through data in the extractive industries, and had already run similar events in Washington, Istanbul, the United Kingdom, Ghana, Tanzania, Uganda and elsewhere. The experience was unique, and the participants were excited to use open source tools and follow the data pipeline to produce interactive stories.

The first day started with an introduction to the world of data-driven journalism and storytelling. Participants then explored some of the most interesting stories from around the world before working through the different layers of the data pipeline. The technical part challenged participants to search for data related to their work and scrape it using Google Spreadsheets, web extensions and scrapers to automate the extraction phase. After that, each participant used Google Refine to filter and clean the datasets and remove redundancies, ending up with usable data formats. The datasets varied: some were placed on interactive maps with CartoDB, while some participants used Datawrapper to visualise them as interactive charts. The workshop also introduced participants to Tabula, giving them the ability to extract tables from PDFs into Excel.
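
For readers who would rather script these steps than use the point-and-click tools from the workshop, here is a minimal sketch of the same extract-clean-export sequence in Python; the URL and file name below are placeholder assumptions for illustration, not materials from the event.

```python
# A minimal sketch, in Python, of the scrape-and-clean steps the workshop
# performed with Google Spreadsheets and Google Refine. The URL and file
# name are placeholders, not materials from the event.
import pandas as pd

# Extract: read_html pulls every <table> from a page, much like the
# IMPORTHTML formula in Google Spreadsheets.
tables = pd.read_html("https://example.org/extractives-figures.html")
df = tables[0]

# Clean: normalise headers and drop redundant rows, the kind of tidying
# done in Google Refine.
df.columns = [str(col).strip().lower() for col in df.columns]
df = df.drop_duplicates().dropna(how="all")

# Export: a tidy CSV ready for CartoDB, Datawrapper or further analysis.
df.to_csv("extractives-figures.csv", index=False)
```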

Delegates also discussed some of the challenges each of them faces in different parts of Tunisia. It was very interesting to see participants share ideas on how to approach different datasets and how to feed them into an official open data portal that could hold all these datasets together. One participant, Aymen Latrach, discussed the problems his team faces regarding data transparency about extractives in Tataouine. Other participants, like Manel Ben Achour, Project Coordinator at the I WATCH organisation, already came from technical backgrounds and were happy to pick up new tools and techniques for working with their data.

Most of the delegates, however, did not come from technical backgrounds, and this was the real challenge. Some of the tools, even when they require no coding, demand familiarity with certain technical terms and ideas. Each phase of the data pipeline therefore started with a theoretical session to familiarise delegates with the concepts about to be covered. After that, Bahia demonstrated the steps and went around helping any delegates who ran into problems, so they could keep up with the rest of the group.

It was a little messy at the beginning, but the participants soon got used to it and started trying out the tools on their own. In reality, trial and error is crucial to developing data journalism skills; they can never be attained without practice.
Another important finding, according to Bahia, who discussed with the delegates how the skills they had learnt applied to their communities and workplaces, is that each of them had their own vision of how to use those skills. The delegates' solid professional experience gave them distinctive ideas about deploying what they had learnt in their own organisations. This, along with a strong belief in the change open data portals can drive in their country, is what motivates them to learn more tools and skills and to produce better visualisations and stories that make an impact on the people around them.

Three years ago, the data journalism community was still at an embryonic stage, with few practitioners and data initiatives in Africa and Asia. Today, with enthusiastic practitioners and a community like School of Data spreading the love of data and the spirit of change it can bring, the field has a very promising outlook. More initiatives and meetups to develop the skills of CSOs, in the extractive industries and beyond, remain a priority on the road to genuine transparency in every domain.

Thank you for reading. You can connect with Bahia on Twitter @HalawiBahia.


Making open data accessible to data science beginners

Nkechi Okwuone - November 6, 2015 in Data Blog, Fellowship

If you're reading this, I suspect you're already familiar with open data and data science and what they entail. But if that's not the case, fret not: here are a few beginner courses from School of Data to get you started.

As new data scientists, we need easy access to substantial, meaningful data without the restrictions of cost or licensing. It's the best way to hone our new skillset, get objective answers to our questions and provide solutions to problems, a fact acknowledged by leading data scientists. So how can new data scientists get easy and timely access to this type of data?

Open Data Companion (ODC) is a free mobile tool created to provide quick, easy and timely access to open data. ODC acts as a unified access point to over 120 open data portals and thousands of datasets from around the world, right from your mobile device, all crafted with mobile-optimised features and design.

ODC was created by Utopia Software, a developer company mentored by the Nigerian School of Data fellow in the open data community of SabiHub in Benin City, Nigeria.

We believe ODC successfully addresses some key problems facing open data adoption, particularly on the mobile platform.

  • With the growth of open data around the world, an ever-increasing number of individuals (open data techies, concerned citizens, software developers and enthusiasts), organisations (educational institutions, civic duty and civil society groups) and many more continually clamour for machine-readable data to be made available in the public domain. However, many of these interested individuals and organisations are unaware of the existence of relevant portals where these datasets can be accessed and only stumble across these portals after many hours of laborious searching. ODC solves this problem by providing an open repository of available open data portals through which portal datasets can be accessed in a reliable yet flexible manner.

  • That mobile platforms and mobile apps are now a dominant force in the computing world is beyond dispute: the number of apps in daily use, and how intensively they are used, continues to grow rapidly. Mobile devices are therefore one of the easiest and fastest means of accessing data and information. If more people are to be made aware of the vast array of open data producers, the open data at their disposal and how to use it, then open data needs a significant mobile presence with the features users have come to expect. ODC tackles this problem effectively by providing a fast mobile channel with a myriad of mobile-optimised features and a clean design.

What can ODC offer data scientists? Here’s a quick run-through of its features:

  • access datasets and their metadata from over 120 data portals around the world, and receive push notifications when new datasets are available from chosen portals. This feature not only ensures users get easy access to the data they need, it also provides timely announcements that such data exists. (For a rough idea of what this kind of portal access looks like programmatically, see the sketch after this list.)


  • preview data content, create data visualisations in-app and download data to your mobile device. The app goes beyond a simple "data browser" by incorporating productivity features that let users preview, search and filter datasets. Data scientists can also start working on data visualisations like maps and charts from within the app.


  • translate dataset details from various languages to your preferred language. This feature comes in really handy when users have to inspect datasets not provided in their native language. For instance, when investigating the state of agriculture and hunger across Africa, the available datasets (and metadata) may be in different languages (such as English, French or Swahili). ODC helps overcome this language barrier.

  • bookmark/save datasets for later viewing and share links to datasets on collaborative networks, social media, email, sms etc., right from the app.
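
ODC itself is a mobile app, but many open data portals of the kind it aggregates run on CKAN and expose a JSON action API. As a rough illustration of what fetching dataset metadata from a portal involves under the hood, here is a minimal sketch in Python; the portal URL and search term are assumptions for the example, not details of ODC's own implementation.

```python
# A minimal sketch of querying an open data portal's CKAN action API --
# roughly the kind of request an app like ODC makes behind the scenes.
# The portal URL and search term are illustrative assumptions.
import requests

PORTAL = "https://demo.ckan.org"  # hypothetical example portal

# package_search is a standard CKAN endpoint returning dataset metadata.
response = requests.get(
    f"{PORTAL}/api/3/action/package_search",
    params={"q": "health", "rows": 5},
    timeout=30,
)
response.raise_for_status()
result = response.json()["result"]

print(f"{result['count']} matching datasets")
for dataset in result["results"]:
    # Each dataset lists downloadable resources (CSV, JSON, etc.).
    formats = {r.get("format") or "?" for r in dataset.get("resources", [])}
    print(f"- {dataset['title']} ({', '.join(sorted(formats))})")
```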

Armed with this tool, novice data scientists, as well as our more experienced colleagues, can start wrangling data with greater ease. Do you have ideas or suggestions on how ODC could work better? Please do leave a reply!
