
Data is a Team Sport: One on One with Daniela Lepiz

Dirk Slater - July 3, 2017 in Community, Data Blog, Event report

Data is a Team Sport is a series of online conversations held with data literacy practitioners in mid-2017 that explores the ever-evolving data literacy ecosystem.

To subscribe to the podcast series, cut and paste the following link into your podcast manager: http://feeds.soundcloud.com/users/soundcloud:users:311573348/sounds.rss or find us in the iTunes Store and Stitcher.

This episode features a one-on-one conversation with Daniela Lepiz, a Costa Rican data journalist and trainer, who is currently the Investigation Editor for CENOZO, a West African Investigative Journalism Project that aims to promote and support cross-border data investigation and open data in the region. She has a master’s degree in data journalism from the Rey Juan Carlos University in Madrid, Spain. She was previously involved with OpenUP South Africa, working with journalists to produce data-driven stories. Daniela is also a trainer for the Tanzania Media Foundation and has been involved in many other projects with South African Media, La Nacion in Costa Rica and other international organisations.

Notes from the conversation

Daniela spoke to us from Burkina Faso and reflected on the importance of data-driven journalism in holding power to account. Her project aims to train and support journalists working across borders in West Africa to use data to expose corruption and human rights violations. To identify journalists to participate in the project, they seek individuals who are experienced, passionate and curious. The project engages media houses, such as Premium Times in Nigeria, to ensure that there are respected outlets to publish their stories. Daniela raised the following points:

  • As the media landscape continues to evolve, data literacy is increasingly becoming a required competency.
  • Journalists do not necessarily have a background in mathematics or statistics and are often intimidated by the idea of having to use these concepts in their stories.
  • Data stories are best done in teams of people with complementary skills. This can go against a traditional approach to journalism in which journalists work alone and tightly guard their sources.
  • It is important that data training programmes also work with journalists and better understand their needs.

Resources she finds inspiring

Her blog posts

The full online conversation:

Daniela’s bookmarks!

These are the resources she uses the most often.

.Rddj – Resources for doing data journalism with R
Comparing Columns in Google Refine | OUseful.Info, the blog…
Journalist datastores: where can you find them? A list. | Simon Rogers
AidInfoPlus – Mastering Aid Information for Change

Data skills

Mapping tip: how to convert and filter KML into a list with Open Refine | Online Journalism Blog
Mapbox + Weather Data
Encryption, Journalism and Free Expression | The Mozilla Blog
Data cleaning with Regular Expressions (NICAR) – Google Docs
NICAR 2016 Links and Tips – Google Docs
Teaching Data Journalism: A Survey & Model Curricula | Global Investigative Journalism Network
Data bulletproofing tips for NICAR 2016 – Google Docs
Using the command line tabula extractor tool · tabulapdf/tabula-extractor Wiki · GitHub
Talend Downloads

Github

Git Concepts – SmartGit (Latest/Preview) – Confluence
GitHub For Beginners: Don’t Get Scared, Get Started – ReadWrite
Kartograph.org
LittleSis – Profiling the powers that be

Tableau customized polygons

How can I create a filled map with custom polygons in Tableau given point data? – Stack Overflow
Using Shape Files for Boundaries in Tableau | The Last Data Bender
How to make custom Tableau maps
How to map geographies in Tableau that are not built in to the product (e.g. UK postcodes, sales areas) – Dabbling with Data
Alteryx Analytics Gallery | Public Gallery
TableauShapeMaker – Adding custom shapes to Tableau maps | Vishful thinking…
Creating Tableau Polygons from ArcGIS Shapefiles | Tableau Software
Creating Polygon-Shaded Maps | Tableau Software
Tool to Convert ArcGIS Shapefiles into Tableau Polygons | Tableau and Behold!
Polygon Maps | Tableau Software
Modeling April 2016
5 Tips for Making Your Tableau Public Viz Go Viral | Tableau Public
Google News Lab
HTML and CSS
Open Semantic Search: Your own search engine for documents, images, tables, files, intranet & news
Spatial Data Download | DIVA-GIS
Linkurious – Linkurious – Understand the connections in your data
Apache Solr –
Apache Tika – Apache Tika
Neo4j Graph Database: Unlock the Value of Data Relationships
SQL: Table Transformation | Codecademy
dc.js – Dimensional Charting Javascript Library
The People and the Technology Behind the Panama Papers | Global Investigative Journalism Network
How to convert XLS file to CSV in Command Line [Linux]
Intro to SQL (IRE 2016) · GitHub
Malik Singleton – SELECT needle FROM haystack;
Investigative Reporters and Editors | Tipsheets and links

SQL_PYTHON

More data

2016-NICAR-Adv-SQL/SQL_queries.md at master · taggartk/2016-NICAR-Adv-SQL · GitHub
advanced-sql-nicar15/stats-functions.sql at master · anthonydb/advanced-sql-nicar15 · GitHub
Malik Singleton – SELECT needle FROM haystack;
Statistical functions in MySQL • Code is poetry
Data Analysis Using SQL and Excel – Gordon S. Linoff – Google Books
Using PROC SQL to Find Uncommon Observations Between 2 Data Sets in SAS | The Chemical Statistician
mysql – Query to compare two subsets of data from the same table? – Database Administrators Stack Exchange
sql – How to add “weights” to a MySQL table and select random values according to these? – Stack Overflow
sql – Fast mysql random weighted choice on big database – Stack Overflow
php – MySQL: Select Random Entry, but Weight Towards Certain Entries – Stack Overflow
MySQL Moving average
Calculating descriptive statistics in MySQL | codediesel
Problem-Solving using Graph Traversals: Searching, Scoring, Ranking, …
R, MySQL, LM and quantreg
26318_AllText_Print.pdf
ddi-documentation-english-572 (1).pdf
Categorical Data — pandas 0.18.1+143.g3b75e03.dirty documentation
python – Loading STATA file: Categorial values must be unique – Stack Overflow
Using the CSV module in Python
14.1. csv — CSV File Reading and Writing — Python 3.5.2rc1 documentation
csvsql — csvkit 0.9.1 documentation
weight samples with python – Google Search
python – Weighted choice short and simple – Stack Overflow
7.1. string — Common string operations — Python v2.6.9 documentation
Introduction to Data Analysis with Python | Lynda.com
A Complete Tutorial to Learn Data Science with Python from Scratch
GitHub – fonnesbeck/statistical-analysis-python-tutorial: Statistical Data Analysis in Python
Verifying the email – Email Checker
A little tour of aleph, a data search tool for reporters – pudo.org (Friedrich Lindenberg)
Welcome – Investigative Dashboard Search
Investigative Dashboard
Working with CSVs on the Command Line
FiveThirtyEight’s data journalism workflow with R | useR! 2016 international R User conference | Channel 9
Six issue when installing package · Issue #3165 · pypa/pip · GitHub
python – Installing pip on Mac OS X – Stack Overflow
Source – Journalism Code, Context & Community – A project by Knight-Mozilla OpenNews
Introducing Kaggle’s Open Data Platform
NASA just made all the scientific research it funds available for free – ScienceAlert
District council code list | Statistics South Africa
How-to: Index Scanned PDFs at Scale Using Fewer Than 50 Lines of Code – Cloudera Engineering Blog
GitHub – gavinr/geojson-csv-join: A script to take a GeoJSON file, and JOIN data onto that file from a CSV file.
7 command-line tools for data science
Python Basics: Lists, Dictionaries, & Booleans
Jupyter Notebook Viewer

PYTHON FOR JOURNALISTS

New folder

Reshaping and Pivot Tables — pandas 0.18.1 documentation
Reshaping in Pandas – Pivot, Pivot-Table, Stack and Unstack explained with Pictures – Nikolay Grozev
Pandas Pivot-Table Example – YouTube
pandas.pivot_table — pandas 0.18.1 documentation
Pandas Pivot Table Explained – Practical Business Python
Pivot Tables In Pandas – Python
Pandas .groupby(), Lambda Functions, & Pivot Tables
Counting Values & Basic Plotting in Python
Creating Pandas DataFrames & Selecting Data
Filtering Data in Python with Boolean Indexes
Deriving New Columns & Defining Python Functions
Python Histograms, Box Plots, & Distributions
Resources for Further Learning
Python Methods, Functions, & Libraries
Python Basics: Lists, Dictionaries, & Booleans
Real-world Python for data-crunching journalists | TrendCT
Cookbook — agate 1.4.0 documentation
3. Power tools — csvkit 0.9.1 documentation
Tutorial — csvkit 0.9.1 documentation
4. Going elsewhere with your data — csvkit 0.9.1 documentation
2. Examining the data — csvkit 0.9.1 documentation
A Complete Tutorial to Learn Data Science with Python from Scratch
For Journalism
ProPublica Summer Data Institute
Percentage of vote change | CARTO
Data Science | Coursera
Data journalism training materials
Pythex: a Python regular expression editor
A secure whistleblowing platform for African media | afriLEAKS
PDFUnlock! – Unlock secured PDF files online for free.
The digital journalist’s toolbox: mapping | IJNet
Bulletproof Data Journalism – Course – LEARNO
Transpose columns across rows (grefine 2.5) ~ RefinePro Knowledge Base for OpenRefine
Installing NLTK — NLTK 3.0 documentation
1. Language Processing and Python
Visualize any Text as a Network – Textexture
10 tools that can help data journalists do better work, be more efficient – Poynter
Workshop Attendance
Clustering In Depth · OpenRefine/OpenRefine Wiki · GitHub
Regression analysis using Python
DataBasic.io
R for Every Survey Analysis – YouTube
Git – Book
NICAR17 Slides, Links & Tutorials #NICAR17 // Ricochet by Chrys Wu
Register for Anonymous VPN Services | PIA Services
The Bureau of Investigative Journalism
dtSearch – Text Retrieval / Full Text Search Engine
Investigation, Cybersecurity, Information Governance and eDiscovery Software | Nuix
How we built the Offshore Leaks Database | International Consortium of Investigative Journalists
Liz Telecom/Azimmo – Google Search
First Python Notebook — First Python Notebook 1.0 documentation
GitHub – JasonKessler/scattertext: Beautiful visualizations of how language differs among document types

 


Data is a Team Sport: Data-Driven Journalism

Dirk Slater - June 20, 2017 in Community, Data Blog, Event report

Data is a Team Sport is a series of online conversations held with data literacy practitioners in mid-2017 that explores the ever-evolving data literacy ecosystem.

Cut and paste this link into your podcast app to subscribe: http://feeds.soundcloud.com/users/soundcloud:users:311573348/sounds.rss or find us in the iTunes Store and Stitcher.

In this episode we speak with two veteran data literacy practitioners who have been involved with developing data-driven journalism teams.

Our guests:

  • Eva Constantaras is a data journalist who specializes in building data journalism teams in developing countries. These teams have reported from across Latin America, Asia and East Africa on topics ranging from displacement and kidnapping by organized crime networks to extractive industries and public health. As a Google Data Journalism Scholar and a Fulbright Fellow, she developed a course for investigative and data journalism in high-risk environments.
  • Natalia Mazotte is Program Manager of School of Data in Brazil and founder and co-director of the digital magazine Gender and Number. She has a Master’s degree in Communications and Culture from the Federal University of Rio de Janeiro and a specialization in Digital Strategy from Pompeu Fabra University (Barcelona/Spain). Natalia has been teaching data skills in different universities and newsrooms around Brazil. She also works as an instructor in online courses at the Knight Center for Journalism in the Americas, a project of the University of Texas, and writes for international publications such as SGI News, Bertelsmann-Stiftung, Euroactiv and Nieman Lab.

Notes from this episode

Our first conversation on Data-Driven Journalism featured Eva Constantaras, on her work in developing data-driven journalism teams in Afghanistan and Pakistan, and Natalia Mazotte, on her work in Brazil. They discussed what they have learned helping journalists think through how they can use data to drive social change. They agreed that good journalism necessarily includes data-driven approaches in order to uncover facts and the root causes of societal problems.

Eva strives to motivate journalists to look beyond the fact that corruption exists and dig deeper into the causes and impacts. She has seen data journalists in Europe and North America making a choice to focus, for example, on polling data rather than breaking down the data behind the candidates’ policies. Eva sees this as a mistake and is committed to making emerging data journalists understand why this is problematic. Finally, Eva critiqued the approach funders take in the field of data literacy, which often puts too much emphasis on short-term solutions rather than investing in long-term data capacity building programmes. This is something that School of Data has long struggled with from third-party funders and clients alike. It’s clear that more work needs to be done to explain what short-term programmes can and, more importantly, cannot achieve.

Natalia primarily discussed School of Data Brazil’s Gender and Number project. The project was designed to use data to move the discussion on gender equality past arguments based on traditional roles. She is concerned about the growing ‘data literacy’ gap between those with power (government and corporations) and those without power (people living in the favelas). In Brazil, the media landscape is changing: the mainstream media reports on ‘what’s happened’, while independent media does the more investigative reporting on ‘why it’s happened’.

They wanted to plug:

Readings/Resources they find inspiring for their work.

Resources contributed from the participants:

View the online conversation in full:


Who works with data in El Salvador?

Omar Luna - November 16, 2016 in Data Blog, Fellowship

For five years, El Salvador has had the Public Information Access Law (PIAL), which requires various kinds of information from all state, municipal and public-private entities —such as statistics, contracts, agreements, plans, etc. These inputs are all managed under the tutelage of PIAL, in an accurate and timely manner.

As well as the social control exerted by Civil Society Organizations (CSOs) in El Salvador, to ensure compliance with this law, the country’s public administration gave space for the emergence of various bodies, such as the Institute of Access to Public Information (IAPI), the Secretariat of Transparency, Anti-Corruption and Citizen Participation and the Open Government website, which compiles —without periodic revision of official documents and other resources by any government official— more than 92,000 official data documents.

In this five year period, the government showed discontent. Why? They didn’t expect that this legislation would strengthen the journalistic, activist and investigative powers of civil society, who took advantage of this period of time to improve and refine the techniques under which they requested information from the public administration.

Presently, there are few digital skills amongst these initiatives in the country. It has now become essential to ask the question: what is known about data in El Salvador? Are the initiatives that have emerged limited in the scope of their achievements? Can something be done to awaken or consolidate the interest of people in data? To answer these and other questions, I conducted a survey with different research and communication professionals in El Salvador and this is what I found.

The Scope

“I think [data work] has been explored very little (in journalism at least),” said Jimena Aguilar, Salvadoran journalist and researcher, who also assured me that working with data helps provide new perspectives to stories that have been written for some time. One example is Aguilar’s research for La Prensa Grafica (LPG) sections, such as transparency, legal work, social issues, amongst others.

Similarly, I discovered different initiatives that are making efforts to incorporate the data pipeline within their work. For two years, the digital newspaper ElFaro.net has explored various national issues (laws, homicides, deputies’ travel, pensions, etc.) using data. During the same period, Latitudes Foundation processed different aspects of gender issues under its “Háblame de Respeto” project, to determine that violence against women is a multi-causal phenomenon in the country.

And although resistance persists in government administrations and related institutions to adequately provide the information requested by civil society —deputies, think tanks, Non-Governmental Organizations (NGOs), journalists, amongst others— more people and entities are interested in data work, performing the necessary steps to obtain information that allows them to know the level of pollution in the country, for instance, build socio-economic reports, uncover the history of Salvadoran political candidates and, more broadly, promote the examination of El Salvador’s past in order to understand the present and try to improve the country’s future.

 

The Limitations

“[Perhaps,] it is having to work from scratch. A lot of carpentry work [too much work for a media outlet professional]”, says Edwin Segura, director for more than 15 years of LPG Datos, one of the main data units in the country, who also told me that often too much time and effort is lost in cleaning false, malicious data provided by different government offices, which often has incomplete or insufficient inputs. Obviously, Segura says, this is with the intention of hindering the work of those working with data in the country.

In addition, there’s something very important that Jimena told me about data work: “If you are not working as a team, it is difficult to do [data work] in a creative and attractive way.” What she said caught my attention for two reasons. First, although there are platforms that help create visualizations, such as Infogr.am and Tableau, you always need a multidisciplinary approach to jump-start a data project. This is the case at the El Diario de Hoy data unit, which is made up of eight people specialized in data editing, web design, journalism and other related areas.

Second, although there are various national initiatives that work to obtain data, such as Fundación Nacional para el Desarrollo (FUNDE), Latitudes Foundation, etc., the effort to do something with the results is scattered: everyone does what they can to take on the challenge of working with databases individually, instead of pursuing common goals together.

Stones in the Road

When I asked Jimena what are the negative implications of working with data, she was blunt: “(Working with data) is something that is not understood in newsrooms […] [it] takes a lot of time, something that they don’t like to give in newsrooms”. And not only newsrooms, because NGOs and various civil society initiatives are unaware of the skills needed to work with data.

Of the many different internal and external factors affecting the construction of stories with data, I would highlight the following. To begin with, there is a fear and widespread ignorance towards mathematics and basic statistics, so individuals across a wide variety of sectors don’t understand data work; to them, it is a waste of time to learn how to use them in their work. For them, it’s very simple to gather data in press conferences, institutional reports and official statements, which is a mistake because they don’t see how data journalism can help them to tell stories in a different way.

Another issue is that we have an inconsistency in government actions because, although the government discursively supports transparency, their actions are focused on answering requests vaguely rather than proactively releasing good quality data —opening data in this way is hampered with delays. I experienced this first hand when, on many occasions, I asked for information that didn’t match with what I requested or, on the contrary, the government officials sent me different information, in contrast with other information requests sent by other civil society sectors (journalists, researchers, etcetera).

Where Do We Go From Here?

With this context, it becomes essential to begin to make different sectors of civil society aware of the importance of data on specific issues. For that, I find myself designing a series of events with multidisciplinary teams, workshops, activities and presentations that deconstruct the fear of numbers that people currently have, through the exchange of experience and knowledge. Only then can our civil society groups make visible the invisible and explain the why in all kinds of topics that are discussed in the country.

With this approach, I believe that not only future generations of data practitioners can benefit from my activities, but also those who currently have only indirect contact with it (editors, coordinators, journalists, etc.), whose work can be enhanced by an awareness of data methodologies; for example, by encouraging situational awareness of data in the country, time-saving tools and transcendence of traditional approaches to visualization.

After working for two years with gender issues and historic memory, I have realized that most data practitioners are self-taught; through trainings of various kinds we can overcome internal and external challenges and, in the end, reach common goals. But we don’t have any formal curricula, and all we’ve learned so far comes from trial and error… something we have to improve with time.

We’re also coping with the obstacles imposed by the Government on how data is requested and how the requested information is sent, and we have to constantly justify our work in workplaces where data work is not appreciated. From NGOs to media outlets, data journalism is seen as a waste of time because we don’t produce materials as fast as they would like; they don’t appreciate all the effort required to request, clean, analyse and visualise data.

As part of my School of Data Fellowship, I’m supporting the design of an educational curriculum specialising in data journalism for fellow journalists in Honduras, Guatemala and El Salvador, so they may acquire all the necessary skills and knowledge to undertake data stories on specific issues in their home countries. This is a wonderful opportunity to awaken the persistence, passion and skills for doing things with data.

The outlook is challenging. But now that I’m aware of the limits, scope and stones in the way of data journalism in El Salvador and all that remains to be done, I want to move forward. I take the challenge this fellowship has presented me, because as Sandra Crucianelli (2012) would say, “(…) in this blessed profession, not only doesn’t shine people with good connections, even with brilliant minds: for this task only shine the perseverant ones. That’s the difference”.


Data Journalism for Beginners in Guatemala

Ximena Villagrán - September 6, 2016 in Event report, Fellowship


School of Data’s first data journalism workshop in Guatemala was a total success. We invited 14 journalists, video journalists and graphic designers in Guatemala to attend a four-hour workshop at “Casa de Cervantes” to learn the basic tools of data journalism. Journalists from the most important newspapers and magazines of the country attended: Soy502, elPeriódico, Prensa Libre, Contrapoder and Nómada.

In this first event, which will be followed by other regular workshops, the journalists were able to explore the data pipeline and work with a crime dataset to obtain news stories. The workshop was given by Ximena Villagrán, assisted by Daniel Villatoro.

The objective of the workshop was that after four hours, participants would be able to understand the basics of what data journalism is, when to use it and how to use it.

The workshop started with an exercise that involved only paper (not computers) to represent the data pipeline:

  • Collect individual information

  • Gather information

  • Organize the information in a database

  • Clean, normalize and standardize the data

  • Contextualize the data

  • Create a hypothesis

  • Obtain conclusions by interviewing the database

After this exercise, participants were given a PDF document about crime in Guatemala. We first showed them how to convert this document from PDF to Excel, before manually converting the resulting table to a database. Once the database building step was done, we started creating hypotheses and analyzing the data with Excel filters.
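
The same filter-and-count step can also be sketched in code. This is only a minimal Python illustration, assuming the converted table has been exported as CSV; the column names (`department`, `crime`, `year`) and all the rows are invented for the example and do not come from the workshop dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical rows standing in for the table converted from the PDF.
# Column names and values are invented for illustration only.
SAMPLE_CSV = """department,crime,year
Guatemala,robbery,2015
Guatemala,homicide,2015
Escuintla,homicide,2015
Guatemala,homicide,2014
"""

def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def filter_rows(rows, **criteria):
    """Keep rows whose columns match every given value, like a spreadsheet filter."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]

def count_by(rows, column):
    """Count how many rows fall under each value of a column."""
    return Counter(r[column] for r in rows)

rows = load_rows(SAMPLE_CSV)
homicides_2015 = filter_rows(rows, crime="homicide", year="2015")
by_department = count_by(homicides_2015, "department")
print(by_department)
```

Each call to `filter_rows` plays the role of one Excel filter, and `count_by` gives the kind of tally a hypothesis can be checked against.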


The PDF given to the participants


The data once converted into a database

We arranged with the group to follow up this workshop with several others, once a month, in order to learn more about data journalism, and to explore in depth the whole data pipeline.


Infobox
Event name: Easy recipes to take away (to the newsroom)
Event type: workshop
Event theme: Data Journalism
Description: an event focusing on training journalists in data journalism pipeline
Speakers: Ximena Villagrán, Daniel Villatoro
Partners: None
Location: Guatemala, Guatemala
Date: July 2, 2016
Audience: journalists
Number of attendees: 14
Gender split: 28% male 72% female
Duration: 4 hours


Avoiding Harm While Pushing Good Stories

Vadym Hudyma - September 5, 2016 in Event report, Fellowship


Working on Responsible Data is about asking some key questions: how can we ensure the right to consent for individuals and communities? How can we preserve privacy, security and ownership around their data? These issues should be balanced with the need to create meaningful impact with a project or a story. This makes journalists one of the prime audiences for Responsible Data training. So I was excited when I was invited to hold a session at a big event for journalists and independent bloggers, organised by Sourcefabric in Odessa, Ukraine.

As news stories incorporate more (personal) data than ever, journalists face several challenges related to the responsible use of this data, sometimes without being aware of them, as the discussion with my audience showed. We explored three issues often found in popular stories of the past year: the need for informed consent, the risks of covering war casualties, and the issues related to public ratings.

Why we need informed consent

As social media data becomes an attractive source of data and stories for news outlets, they are reminded that the rules of traditional reporting, such as informed consent, still apply, but the nature of social media as a medium makes this much more complicated than just reaching out to the heroes of your story. We discussed this issue using the example of Buzzfeed’s article on sexual assault. In this case, the journalist embedded several tweets from a Twitter thread on this topic in her story and made sure to have the consent of those whose tweets were quoted. The problem was that it was extremely easy to get to the whole Twitter thread in one click and read the stories of those who did not want the “popularity” brought by an article on Buzzfeed. They couldn’t reasonably expect such a high level of visibility after answering in a Twitter thread.

This is an issue explored by Helen Nissenbaum, who explains that privacy is not binary and should be understood in context: people have a certain expectation of the final use of the information that they share. Once the receiver of that information (an individual on Twitter vs Buzzfeed readers) or the transmission principle (Twitter thread vs Buzzfeed article) changes, it creates a perceived violation of privacy.

As pointed out by participants, getting informed consent is not always easy in the kind of reporting that relies heavily on social media, even though using human faces and personal stories is crucial to give a story impact.

The risks of covering war deaths

Another example dealt with the potential issues linked to interactive maps when used as a data story medium. These include not just the usual complications of getting a complex story right, but also the connected problem of geolocation data as a possible privacy issue. There is also a need to consider the wider context, as with the reuse of CNN’s War Casualties Map in stories about other armed conflicts, and the possible danger for relatives of deceased fighters who fought “for the wrong side”. We also looked into the problem of a false sense of accuracy in the highly uncertain situation of war casualty statistics, as in the example of civilian casualties during the Syrian conflict.


The issues with public rating

At the end, we spoke a bit about the sad example of the now closed Schooloscope project. While there are many lessons to be learned from this example, we spoke mainly about how the revelation of school ratings, without any public policy in place to fix the problem, was damaging to the communities involved. As a good counterexample of solution-driven, not just problem-driven, data journalism, I presented ProPublica’s project on public schools inequality.

As a speaker, working with a less-experienced audience and locating my presentation in the wider context of a data literacy event was a challenging but extremely interesting task.


Infobox
Event name: Responsible Data in Data Journalism
Event type: workshop
Event theme: Responsible Data
Description: part of a 4-day training on creating data-driven stories
Speakers: Vadym Hudyma, Jacopo Ottaviani
Partners: Sourcefabric
Location: Ukraine, Odessa
Date: August 3, 2016
Audience: data journalists
Number of attendees: 17
Gender split: 50% female, 50% male
Duration: 1.5 hours


Discover patterns in hundreds of documents with DocumentCloud

Daniel Villatoro - August 20, 2016 in Fellowship, HowTo

If you’re a journalist (or a researcher), say goodbye to printing all your documents, filing them in a folder, highlighting them with markers, and adding post-its and labels. The heavy burden of reading, finding repeated information and highlighting it can be done for you by DocumentCloud: it lets you reveal the names of the people, places and institutions mentioned in your documents, line up dates in a timeline, and save your docs in the cloud privately, with the option to make them public later.

DocumentCloud is an Open Source platform, and journalists and other media professionals have been using it as an online archive of digital documents and scanned text. It provides a space to share source documents.

A major feature of DocumentCloud is how well it works with printed files. When you upload a PDF scanned as an image, the platform will read it with Optical Character Recognition (OCR) to recognize the words in the file. This allows investigative journalists to upload documents from original sources and make them publicly accessible, and makes the documents much easier to process.

Some other features include:

  • Running every document through OpenCalais, a metadata technology from Thomson Reuters that adds contextual information to the uploaded files. It can take the dates from a document and graph them in a timeline or help you find other documents related to your story.

  • Annotating and highlighting important sections of your documents. Each note that you add will have its own unique URL so that you can keep everything in order.

  • Uploading files in a safe and private manner, with the option to share those documents, make them public, and embed them. The sources and evidence of an investigation don’t have to stay on a journalist’s computer or in the archives of a media organization – they can go public and become open.

  • Reviewing the documents that other people have uploaded, such as files, hearing transcripts, testimony, legislation, reports, declassified documents and correspondence.

The platform in action

A while ago, an investigation into the manipulation of the purchasing system of the Guatemalan social insurance revealed a network of attorneys, doctors, specialists and patient associations that forced the purchase of certain medicines for terminal patients. It was led by Oswaldo Hernández from *Plaza Pública*, and DocumentCloud was at the core of the investigation process.

“I searched for words like ‘Doctor’ or ‘Attorney’ to find out the names of the people involved. That way I was able to put together a database and the relationships between those involved. It’s like having a big text document where you can explore and search everything”, explains Hernández.
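The kind of search Hernández describes can also be tried offline on text that has already been extracted from your documents. The following is only a minimal sketch in Python using a regular expression. It is not DocumentCloud's own search or API, and the function name, the list of titles and the sample text are all hypothetical, invented for illustration.

```python
import re
from collections import Counter

# Titles to look for; extend this tuple for your own investigation.
TITLES = ("Doctor", "Attorney")


def find_titled_names(text, titles=TITLES):
    """Count capitalised names that directly follow one of the given titles."""
    title_group = "|".join(re.escape(t) for t in titles)
    # A title, then one to three capitalised words (some accented capitals allowed).
    pattern = re.compile(
        rf"\b({title_group})\s+((?:[A-ZÁÉÍÓÚÑ][\w'-]+)(?:\s+[A-ZÁÉÍÓÚÑ][\w'-]+){{0,2}})"
    )
    return Counter((m.group(1), m.group(2)) for m in pattern.finditer(text))


# Hypothetical sample text, invented for illustration only.
sample = (
    "Doctor Ana Morales signed the request. "
    "Attorney Luis Paz filed it, and Doctor Ana Morales approved the purchase."
)

for (title, name), count in find_titled_names(sample).most_common():
    print(f"{title} {name}: {count}")
```

A count like this is only a starting point for building a list of people to check. Names that appear without a title, or with a different spelling, will be missed, so the results still need to be verified against the documents themselves.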

When analysing one of the documents about medicines, DocumentCloud plots the names of people and institutions that are repeated in the text.

A screenshot of the graphic analysis that DocumentCloud plots from the uploaded files

Four creative uses of DocumentCloud

Below are some examples of how you can produce different types of content when you mix uploaded information, creativity and the functions of this tool.

The platform VozData, from the Argentinian newspaper La Nación, combines their own code with the technology of DocumentCloud to set up an openly collaborative platform that transforms Senate expense receipts into open and useful information by crowdsourcing it.

After its investigation into violence in a prison was published in The New York Times, *The Marshall Project* did a follow-up about how prison officers censored the names of some guards and inmates, as well as aerial photos of the prison, when the newspaper was distributed to prisoners.

The *International Consortium of Investigative Journalists* (ICIJ) uses DocumentCloud so that readers can access the original documents of the Luxembourg Leaks: secret agreements, approved by the Luxembourg authorities, that reduced taxes for 350 companies across the world.

*The Washington Post* used the software to explain the set of instructions that the US National Security Agency gives to its analysts, so that whenever they fill in a form to access databases and justify their research, they don’t reveal too much suspicious or illegal information.

So, next time you have to do tons of research using original documents, you can make them publicly available through DocumentCloud. And, even if you’re not a journalist, you can still use this tool to browse its extensive catalogue of documents uploaded by journalists across the world.

Call for a week-long data journalism training in Berlin

Nika Aleksejeva - August 18, 2016 in Events, Fellowship

Photo from a data visualization training in Istanbul, 2014. Author: Nika Aleksejeva

‘Data-driven journalism against prejudices about migration’ training course for young media-makers, human rights activists and developers, Berlin, 12 – 20 November 2016

Deadline for receiving applications is: 31st August 2016, 23:59h CET.


School of Data fellow, Nika Aleksejeva, in collaboration with European Youth Press (EYP), an umbrella association of young media-makers in Europe, is inviting young media-makers, designers/developers/programmers and human rights activists to participate in a week-long data journalism training. The training aims to produce impartial, data-driven reports on local migration issues using innovative storytelling forms. It will address the current European refugee crisis, from the perspective of 11 European countries (listed below).

What to expect?

The main objective of the training course is to increase data journalism skills through hands-on training and through working on a real story that will eventually be published in the media. During the project, EYP will partner up with established media organisations from the eleven listed countries, who will each send one journalist to attend the training. Working together, participants will learn data journalism skills and immediately apply them to practical scenarios. The finished results of their work will be published by media partners of the project. It is hoped that this broad public outreach will have a significant effect on the media’s treatment of the issue. This course will be an opportunity to strengthen an already-established international network of young media-makers, mid-career journalists and activists concerned with migration and refugee rights.

Participants of the training course will:

  • learn and practice data journalism techniques: finding the right data, scraping, compiling, cleaning, storytelling with data;

  • form teams and work on specific projects, with a view to publication in the national media of participants’ home countries;

  • make professional contacts in the field and obtain hands-on experience of working on a cross-border, data-driven investigation.

Financial Information

This training course is funded by the Erasmus+ grant. Participants will receive reimbursement of their travel costs **up to the amount indicated below**, according to their country of residence:

  • Armenia: 270 EUR

  • Belgium: 170 EUR

  • Czech Republic: 80 EUR

  • Denmark: 80 EUR

  • Germany (outside Berlin): 80 EUR

  • Italy: 170 EUR

  • Latvia: 170 EUR

  • Montenegro: 170 EUR

  • Slovakia: 170 EUR

  • Sweden: 170 EUR

  • Ukraine: 170 EUR

  • participants living in Berlin will not be eligible for reimbursement of any travel expenses.

Although travel costs will be reimbursed, participants are asked to make the travel bookings themselves, as soon as possible after being selected. Participants are also asked to take the most economical route from their place of residence to Berlin and use the following means of transportation:

  • Train: 2nd class ticket (normal as well as high-speed trains),

  • Flight: economy-class air ticket or cheaper,

  • Bus

Accommodation, meals and all necessary materials will be provided.

Who can apply?

Applicants must fulfil all the criteria below:

  • young media-makers, journalism students, bloggers and citizen journalists with a demonstrated interest in issues related to the rights of ethnic minorities, migrants and refugees; human rights activists working on refugee/migration issues; developers interested in the topic;

  • 18-30 year-olds;

  • residents of Czech Republic, Germany, Belgium, Italy, Sweden, Armenia, Ukraine, Montenegro, Slovakia, Denmark and Latvia;

  • proficient in English.

How to apply?

Interested candidates are invited to apply by completing this application form. Please also send your CV, in Europass format, by e-mail to applications@youthpress.org with ‘ddj on migration’ in the subject line.

The deadline for receiving completed applications (form and CV) is: 31st August, 23:59h CET.

Data in December: Sharing Data Journalism Love in Tunisia

Ali Rebaie - January 11, 2016 in Data Blog, Data Expeditions, Data for CSOs

NRGI hosted the event #DataMuseTunisia in collaboration with Data Aurora and School of Data senior fellow Ali Rebaie on the 11th of December 2015 in beautiful Tunis. A group of CSO representatives from different NGOs met in the Burge Du Lac Hotel to learn how to craft their datasets and share their stories through creative visuals.

Bahia Halawi, one of the leading women data journalism practitioners in the MENA region and the co-founder of Data Aurora, led this workshop for 3 days. The event featured a group of professionals from different CSOs. NRGI has been working closely with School of Data to drive economic development and transparency through data in the extractive industry. Earlier this year NRGI held similar events in Washington, Istanbul, the United Kingdom, Ghana, Tanzania, Uganda and many other places. The experience was unique, and the participants were very excited to use the open source tools and follow the data pipeline to end up with interactive stories.

The first day started with an introduction to the world of data-driven journalism and storytelling. Later on, participants checked out some of the most interesting stories worldwide before working with the different layers of the data pipeline. The technical part challenged the participants to search for data related to their work and then scrape it using Google Spreadsheets, web extensions and scrapers to automate the data extraction phase. After that, each of the participants used Google Refine to filter and clean the datasets and then remove redundancies, ending up with usable data formats. The datasets were varied: some of them were placed on interactive maps through CartoDB, while some of the participants used Datawrapper to visualize them interactively in charts. The workshop also exposed participants to Tabula, giving them the ability to transform documents from PDFs to Excel.
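For readers who want to see what the "clean and remove redundancies" step amounts to, here is a minimal sketch in plain Python. It is not what the workshop used (participants worked with Google Refine), and the function name and the sample table are hypothetical, invented for illustration. It trims stray whitespace, skips empty rows and drops rows that repeat an earlier record.

```python
import csv
import io


def clean_rows(rows):
    """Trim whitespace, drop empty rows and remove duplicates (ignoring case), keeping order."""
    seen = set()
    cleaned = []
    for row in rows:
        # Collapse runs of whitespace and strip the ends of every cell.
        normalised = tuple(" ".join(cell.split()) for cell in row)
        if not any(normalised):
            continue  # skip rows that are entirely empty
        key = tuple(cell.casefold() for cell in normalised)
        if key in seen:
            continue  # same record already kept, ignoring case differences
        seen.add(key)
        cleaned.append(list(normalised))
    return cleaned


# Hypothetical scraped table, invented for illustration only.
raw = """company,region,licence
 Acme Oil ,Region A,A-1
acme oil,Region A,A-1
,,
Beta Gas,  Region B ,B-7
"""

rows = list(csv.reader(io.StringIO(raw)))
for row in clean_rows(rows):
    print(row)
```

Treating rows that differ only in case or spacing as duplicates is a design choice that suits scraped tables. For data where case matters, compare the normalised cells directly instead.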

Delegates also discussed some of the challenges each of them faces at different locations in Tunisia. It was very interesting to see participants share their ideas on how to approach different datasets and how to feed them into an official open data portal that can hold all these datasets together. One of the participants, Aymen Latrach, discussed the problems his team faces when it comes to data transparency about extractives in Tataouine. Other CSO members, like Manel Ben Achour, a Project Coordinator at I WATCH Organization, already came from a technical background and were very happy to make use of new tools and techniques while working with their data.

Most of the delegates didn’t come from technical backgrounds, however, and this was the real challenge. Some of the tools, even when they do not require any coding, demand knowledge of some technical terms or ideas. Thus, each phase in the data pipeline started with a theoretical explanatory session to familiarize delegates with the technical concepts to be covered. After that, Bahia had to demonstrate the steps and go around assisting any delegates who faced problems, so that they could keep up with the rest of the group.

It was a little bit messy at the beginning, but soon the participants got used to it and started trying out the tools on their own. In reality, trial and error is crucial to developing data journalism skills. These skills can never be attained without practice.

Another important finding, according to Bahia, who discussed the importance of the learnt skills to the delegates’ communities and workplaces, is that each of them had his or her own vision of how to use them. The fact that the CSO members had very good work experience allowed them to form unique visions of how to deploy what they had learnt at their workplaces. This experience and the strong belief in the change open data portals can drive in their country are the only triggers for learning more tools and skills and bringing out better visualizations and stories that have an impact on the people around them.

Three years ago, the data journalism community was still at a very embryonic stage, with few practitioners and data initiatives taking place in Africa and Asia. Today, with enthusiastic practitioners and a community like School of Data spreading the love of data and the spirit of the change it can make, the data journalism field has very promising prospects. The need for more initiatives and meetups to develop the skills of CSOs in the extractive industries, as well as in other fields, remains a priority for reaching true transparency in every single domain.

Thank you,

You can connect with Bahia on Twitter @HalawiBahia.

What was the School of Data Network up to in 2015?

Marco Túlio Pires - December 28, 2015 in Community, Impact

The School of Data Network is formed by member organisations, individuals, fellows and senior fellows around the world


We just can’t believe it’s already the end of the year! I mean, every year you see people saying the months passed by so fast, but we really mean it! There was a lot going on in our community, from the second edition of our Fellowship Program to many exciting events and activities our members organised around the world.

Let’s start with folks at Code4SA. They coordinated the activities of three open data fellows and are organising the first physical Data Journalism School of the continent! Isn’t that amazing? They’re actually creating a space for people to work together with on-site support on data journalism skills. This is the first time this has happened in the School of Data network and we’re really proud Code4SA is taking the lead on that! But they didn’t stop there. They also participated in the Africa Open Data Conference, coordinated trainings and skillshares with NU & BlackSash and ran two three-day Bootcamps (Cape Town and Johannesburg). “One of our biggest challenges this year has been establishing a mandate to work with the government”, said Jennifer Walker, from Code4SA. “On the Data Journalism School, the challenge is really getting everything in place, the newsroom, the trainer etc.”

In Macedonia, this group will pursue the project of setting up the first data-journalism agency in the country (Dona Dzambaska – CC-by-sa 3.0)

In Macedonia, our friends at Metamorphosis Foundation had their second School of Data Fellow, Goran Rizaov. Together with Dona Dzambaska, senior School of Data Fellow (2014), they organised four open data meetups and two 2-day open data trainings, including a data journalism workshop with local journalists in Skopje. They also launched a call for applications that resulted in Goran supporting three local NGOs in open data projects, and they supported the Institute for Rural Communities and the PIU Institute with data clinics. And if that was not enough, Dona and Goran were special guest speakers at the TEDxBASSalon.

Open Knowledge Spain and Open Knowledge Greece were also busy coordinating School of Data in their respective countries. In Spain, Escuela de Datos participated in a data journalism conference, leading workshops for three days and a hackathon. They also ran monthly meetings with people interested in exploring data, which they call “open data maker nights”, as well as our own “data expeditions.” They will have a couple of meetings in early January to set the goals for 2016. Greece organised an open science training event and also serves as the intersection between open data and linked data, coming from people working at the University of Greece.

In France, Ecole des Données has organised three activities in Paris: a local urban data laboratory, a School of Data training and the Budget Democracy Laboratory, both for the city hall. They also developed a DatavizCard Game and coordinate a working group around data visualisation. Our French friends also took part in a series of events, such as workshops, conferences, debates and MeetUps. You can check out the list here. In 2016 they want to collaborate more with other countries and will participate in the SuperDemain (digital culture for children and families) and Futur en Seine 2016 events.

Camila Salazar & Julio Lopez, 2015 School of Data Fellows, organised a series of workshops in Latin America

Across the Atlantic we arrive in the Latin American Escuela de Datos, coordinated by SocialTIC, in Mexico. Camila Salazar and Julio Lopez, two fellows from the class of 2015, did amazing things in the region, such as organising 23 training events in four different countries (Ecuador, Costa Rica, Chile and Mexico), reaching out to more than 400 people. Julio is working with the Natural Resource Governance Institute on a major project about extractives data (stay tuned for news!) and Camila was hired by Costa Rica’s biggest data journalism team at La Nación, on top of developing a project about migrant data in the country. They’re on fire! You will hear more from them in our annual report that’s coming out early next year. “Our biggest challenge now will be having more trainers coming out of the community”, said Juan Manuel Casanueva, from SocialTIC.

Escola de Dados (Brazil) instructors and participants in a workshop about data journalism and government spending data, in São Paulo

Heading down to South America we see that brasileiros at Escola de Dados, in Brazil, are also on fire. They organised 22 workshops, trainings and talks/events, reaching out to over 760 people in universities, companies and even government agencies. Two of their instructors were invited by the Knight Center for Journalism in the Americas to organise and run the first 100% Portuguese-language MOOC about Data Journalism, with support from the National Newspaper Association and Google. In total, 4989 people enrolled for the course, which was a massive success. They also organised a data analysis course for Folha de S.Paulo, the biggest broadsheet newspaper in the country. Next year is looking even better, according to Natália Mazotte, Escola de Dados’ coordinator. “We will be offering more courses with the Knight Center, will create data labs inside Rio de Janeiro favelas and will run our own fellowship program”. Outstanding!

We have so much more to share with you in our annual report that’s coming up in a few weeks. 2015 has been a great year for School of Data in many, many aspects and we are eager to share all those moments with you!

Heads up for the first data journalism agency in Macedonia!

Marco Túlio Pires - December 3, 2015 in Event report

Developer Baze Petrushev showed participants how to use the Normal Distribution to find stories in data (Dona Dzambaska – CC-by-sa 3.0)

Data journalism in Macedonia just got a lot stronger: a group of journalists and programmers started what could become the first data-journalism agency in the country. The group was part of the two-day workshop organised by folks at School of Data Macedonia, from member organisation Metamorphosis Foundation, as part of the ongoing support the British Embassy is providing in the region.

Journalists, programmers and data enthusiasts got together in Skopje to talk about data journalism in Macedonia (Dona Dzambaska - CC-by-sa 3.0)


The rainy weekend (November 28th & 29th) didn’t stop 17 journalists from getting together to learn the basics of the Data Pipeline: getting, cleaning, validating, analysing and presenting data for different audiences. The workshop included group activities and hands-on sessions with tools such as OpenRefine for data cleaning, Google Sheets for analysis and IFTTT for scraping. Goran Rizaov, 2015 School of Data Fellow in Macedonia, was one of the trainers and organisers of the training experience. We also had support from senior fellow (2014) Dona Dzambaska, who took amazing pictures and gave general help during the sessions.

Participants went through group sessions and hands-on training about a variety of tools that are useful for working with data in journalism (Dona Dzambaska – CC-by-sa 3.0)

Even with such a short time together, participants formed three groups and came up with prototypes of projects with great potential for the region. One of them will monitor the sporting habits of Macedonians on Twitter. “Our idea is to use hashtags and the social media API to analyse many variables, such as time of the day, the weather, which activity people are doing at the moment of the tweet, their mood, age, gender etc”, said journo-coder Bozidar Hristov, one of the members of the group.

Another group wanted to take a look at the data about the turnout in Macedonian elections, using data analysis to draw conclusions about all of the regions in the country. “We’re wondering if the turnout rate has anything to do with the geographical location”, said the developer and data-wrangler Baze Petrushev.
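One simple way to start on a question like this, in the spirit of the Normal Distribution session at the workshop, is to measure how far each region's turnout sits from the average, in standard deviations (a z-score). The sketch below is only an illustration in Python and is not the group's actual method. The function name is hypothetical and the turnout figures are invented, not real Macedonian election data.

```python
from statistics import mean, stdev


def turnout_z_scores(turnout_by_region):
    """Return each region's z-score: its distance from the mean in standard deviations."""
    values = list(turnout_by_region.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; needs at least two regions
    return {region: (value - mu) / sigma for region, value in turnout_by_region.items()}


# Hypothetical turnout percentages, invented for illustration only.
turnout = {
    "Region A": 61.0,
    "Region B": 58.5,
    "Region C": 60.2,
    "Region D": 41.3,
    "Region E": 59.4,
}

for region, z in sorted(turnout_z_scores(turnout).items(), key=lambda item: item[1]):
    # The 1.5 threshold is an arbitrary choice for this sketch.
    flag = "  <- worth a closer look" if abs(z) > 1.5 else ""
    print(f"{region}: z = {z:+.2f}{flag}")
```

A region that stands out this way is a lead, not a conclusion. The sketch assumes the regions do not all have identical turnout (otherwise the standard deviation is zero), and any link to geography would still need to be checked against a map and local reporting.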

The group will pursue the project of setting up the first data-journalism agency in Macedonia (Dona Dzambaska - CC-by-sa 3.0)

Adriana Mijuskovic and Ivana Kostovska want to start a data journalism agency in Skopje to help newsrooms publish data-driven stories. “We also want to create opportunities for journalists and programmers to work together on projects with Macedonian data, also in cooperation with other networks in the Balkans”, said Adriana. The project was welcomed by the whole group and they will meet again in the coming weeks to plan next steps.
