In Support of the Bangladeshi Garment Industries Data Expedition

October 18, 2013 in Data Expeditions


A couple of quick things that may be of use in the current data expedition around the Bangladeshi garment industry…

Matching company names

One possible route for the expedition to take is to check whether a company referred to in one list (for example, a particular supplier’s list) is also mentioned in another list (for example, the accord on fire safety, or another supplier’s factory list or factory blacklist). It is quite possible that names won’t match exactly… Some clues for how to address this are described in the School of Data blog post Finding Matching Items on Separate Lists – Bangladeshi Garment Factories.

Other things you can try:

  • use a tool such as OpenRefine to fix the capitalisation, for example by converting names to titlecase (Edit Cells->Common Transforms->To titlecase);
  • a lossy approach (and so not ideal), though one that can help with exact string matching, is to get to the “core” of the company name, for example by stripping out mentions of words like “Limited”, or variants thereof:

example of pruning a company name

Here’s the command used in OpenRefine in that example:

value.toUppercase().replace(/[.;:'()`-]/,'').replace('LIMITED','LTD').replace(/(LTD$)/,'').replace(/\s+/,' ').strip()

A better approach might be to “normalise” mentions of “Ltd”, etc., using something like this:

value.toUppercase().replace(/[.;:'()`-]/,'').replace(' LTD',' LIMITED').replace(/\s+/,' ').strip()

normalising a company name
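If you would rather do this outside OpenRefine, the two expressions above can be approximated in Python. This is only a rough sketch: the function names are my own, and the whole-word handling of "LTD" is a small departure from the OpenRefine versions.

```python
import re

# Punctuation stripped by the OpenRefine expressions above.
PUNCTUATION = re.compile(r"[.;:'’()`-]")
WHITESPACE = re.compile(r"\s+")


def normalise_name(name: str) -> str:
    """Uppercase, drop punctuation, expand LTD to LIMITED, tidy whitespace."""
    cleaned = PUNCTUATION.sub("", name.upper())
    cleaned = WHITESPACE.sub(" ", cleaned).strip()
    # Whole-word match, so names that merely contain "LTD" are left alone.
    return re.sub(r"\bLTD\b", "LIMITED", cleaned)


def prune_name(name: str) -> str:
    """Lossy variant: remove a trailing LIMITED/LTD to leave the "core" name."""
    return re.sub(r"\s*\bLIMITED$", "", normalise_name(name)).strip()


print(normalise_name("Acme  Garments Ltd."))
print(prune_name("Acme Garments (Pvt) Limited"))
```

The company names here are made up for illustration. Normalising is the safer of the two because it keeps the company type in the name, whereas pruning throws it away.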

Corporate Identifiers

As described in the post On the Need for Corporate Identifiers, it’s often more convenient if we can represent companies using unique and unambiguous corporate identifiers.

OpenCorporates have recently uploaded details of Bangladeshi registered companies to their databases; using the OpenCorporates reconciliation API, mapping “messy” names to company identifiers can be handled by OpenCorporates automatically, with a confidence score describing the likelihood of a match being a “true” match.

The OpenCorporates reconciliation API can be used to match company names to just Bangladeshi registered companies using the reconciliation endpoint http://opencorporates.com/reconcile/bd
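If you want to call the endpoint from a script rather than from OpenRefine, a request might be put together along the following lines. This is a sketch only: I am assuming the endpoint accepts a batch `queries` parameter in the style of the OpenRefine reconciliation service protocol, and the function name is hypothetical. The code only builds the URL; it does not make the request.

```python
import json
from urllib.parse import urlencode

# Endpoint quoted in the post.
RECONCILE_URL = "http://opencorporates.com/reconcile/bd"


def build_reconcile_url(names):
    """Return a URL carrying a batch of name queries as a `queries` parameter."""
    # Keys q0, q1, ... are arbitrary labels used to pair results with inputs.
    queries = {f"q{i}": {"query": name} for i, name in enumerate(names)}
    return RECONCILE_URL + "?" + urlencode({"queries": json.dumps(queries)})


# Made-up company name, purely for illustration.
print(build_reconcile_url(["Acme Garments Ltd"]))
```

Check the OpenCorporates documentation for the exact parameters and any rate limits before running large batches.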

For convenience, I quickly ran the names of the companies on the fire safety accord through the reconciliation process (results here). Not all the company names are matched, but a good proportion are. (To match the rest, filter out the matched items, run the reconciliation search on what’s left and see if you can improve the matches.) Note that, having inspected the data, I can see that it needs a little cleaning! Also, the scraper used to get the data from the accord PDF is slightly broken.

Reconciliation is most easily achieved using OpenRefine. From the header of the company name column, select Reconcile -> Start reconciling...

openrefine add reconciliation service

Accept the defaults, and click on Start Reconciling to begin the matching process.

You can then set about accepting – or not – the matches… I tend to filter to just display the confident matches and accept them:

reconciliation - confident matching

Then I look at the middling matches and accept those by hand:

openrefine reconciliation

Then I’m often tempted to filter down to the results I’m not confident about and discard the reconciliation judgements:

reconciliation - filter and discard
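The same accept / check by hand / discard routine can be written down as a simple rule if you are processing scores in a script. The thresholds below are hypothetical; pick your own after looking at how the scores are distributed in your data.

```python
def triage(score, accept_at=80.0, review_at=40.0):
    """Sort a reconciliation confidence score into one of three buckets.

    The default thresholds are illustrative assumptions, not values
    recommended by OpenRefine or OpenCorporates.
    """
    if score >= accept_at:
        return "accept"   # confident match: accept automatically
    if score >= review_at:
        return "review"   # middling match: check by hand
    return "discard"      # weak match: drop the reconciliation judgement


for score in (95.0, 55.0, 10.0):
    print(score, triage(score))
```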

You can generate new columns based on the data pulled down from the reconciliation API:

openrefine - get ready to add a column

For example:

  • to get the reconciled company name, create a new column based on the reconciled column with the pattern cell.recon.match.name.

reconciliation name match

  • to get the reconciled company identifier, create a new column based on the reconciled column with the pattern cell.recon.match.id.

OpenCorporates id reconciliation match
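Outside OpenRefine, the equivalent of cell.recon.match.name and cell.recon.match.id is to pick the matched candidate out of the list the service returns. In this sketch I am assuming each candidate is a dictionary with `id`, `name` and `match` keys, as in the OpenRefine reconciliation service protocol; the function name and the sample values are hypothetical.

```python
def matched_columns(candidates):
    """Return (name, id) for the candidate flagged as a match, else (None, None)."""
    for candidate in candidates:
        if candidate.get("match"):
            return candidate["name"], candidate["id"]
    return None, None


# Placeholder candidates, not real OpenCorporates records.
candidates = [
    {"id": "example-id-1", "name": "ACME GARMENTS LIMITED", "match": True},
    {"id": "example-id-2", "name": "ACME GARMENTS EXPORT LIMITED", "match": False},
]
name, company_id = matched_columns(candidates)
print(name, company_id)
```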

If you reconcile names appearing in different lists, your data will be enriched with unambiguous names and identifiers that you can use to support company matching across different data files.
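Once two lists both carry identifiers, matching across them is just a lookup. A minimal sketch, assuming each row is a dictionary with a `company_id` key (a column name I have made up) that is missing or empty for unreconciled rows:

```python
def shared_companies(list_a, list_b, key="company_id"):
    """Return the rows of list_a whose identifier also appears in list_b."""
    ids_b = {row[key] for row in list_b if row.get(key)}
    return [row for row in list_a if row.get(key) in ids_b]


# Placeholder rows for illustration only.
supplier_list = [
    {"name": "Acme Garments Ltd", "company_id": "example-id-1"},
    {"name": "Unmatched Factory", "company_id": None},
]
accord_list = [
    {"name": "ACME GARMENTS LIMITED", "company_id": "example-id-1"},
]
print(shared_companies(supplier_list, accord_list))
```

Rows that were never reconciled are ignored rather than being matched to each other.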

Addresses

There may be some work to be done around the addresses associated with each company. For example, the Bangladeshi company descriptions on OpenCorporates seem to be lacking in address information. There may be some merit in treating OpenCorporates as the best place to store this information, and then retrieving it through the various OpenCorporates APIs as required. For now, volunteer effort to add address information to OpenCorporates may be one way forward?

Expedition activity?

A major use for the addresses is to support mapping activities. Some geocoders may give spurious locations for some addresses – try to force the issue by appending “, Bangladesh” to the end of every address you geocode.
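A small helper can take care of tidying an address and adding the country before it goes to the geocoder. The function name is my own, and the example address is only illustrative.

```python
def with_country(address, country="Bangladesh"):
    """Append the country to an address unless it already ends with it."""
    tidy = address.strip().rstrip(",").rstrip()
    if tidy.lower().endswith(country.lower()):
        return tidy
    return f"{tidy}, {country}"


print(with_country("Plot 12, Ashulia, Savar, "))
print(with_country("Tongi, Gazipur, Bangladesh"))
```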

It may be possible to augment addresses for geocoding using additional geographical region information. There is a list of Bangladesh postcodes by area published at http://www.bangladeshpost.gov.bd/postcode.asp which I have scraped and popped into a data file here:
https://github.com/psychemedia/ScoDa-GarmentsDataExpedition/tree/master/geocoding

It may be possible/appropriate to annotate addresses with additional information from this file, for example by matching on district and/or division and adding in more information to the geocoded address, where possible. (Phone city codes may also help identify additional address information – is there a mapping of these somewhere? Or could we construct one?)
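As a rough idea of what that annotation step might look like, the sketch below adds a division to an address when a known district is mentioned in it. The `district` and `division` keys are assumptions about how you load the postcode file, so check them against the actual columns, and note that plain substring matching will produce some false hits.

```python
def annotate(address, regions):
    """Append the division for the first district named in the address."""
    lowered = address.lower()
    for region in regions:
        district, division = region["district"], region["division"]
        # Skip if the division is already present in the address.
        if district.lower() in lowered and division.lower() not in lowered:
            return f"{address}, {division}"
    return address


# One illustrative row; in practice, load the rows from the postcode file.
regions = [{"district": "Gazipur", "division": "Dhaka"}]
print(annotate("Plot 5, Tongi, Gazipur", regions))
```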

If you do get latitude/longitude coordinates for an address, try to also associate the coordinates with an OpenCorporates company identifier to support unambiguous matching of coordinates/locations to names (although there may be multiple names/identifiers associated with a particular location).
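Because several companies can share a location, one way to hold this is an index from a rounded coordinate pair to a set of identifiers. The identifiers and coordinates below are placeholders, and the rounding precision is an arbitrary choice.

```python
from collections import defaultdict


def index_by_location(records, precision=4):
    """Map (lat, lon), rounded to `precision` places, to a set of company ids."""
    index = defaultdict(set)
    for company_id, lat, lon in records:
        index[(round(lat, precision), round(lon, precision))].add(company_id)
    return dict(index)


# Placeholder (company_id, latitude, longitude) records.
records = [
    ("example-id-1", 23.81, 90.41),
    ("example-id-2", 23.81, 90.41),
    ("example-id-3", 24.0, 90.0),
]
print(index_by_location(records))
```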

Other hints and tips

If you have any other tricks that may be useful to support the current data expedition, please feel free to add a link to, or description of, them as a comment to this post.

Or if you discover or develop a useful technique during the data expedition that you think would have helped had you come across it before you started the expedition, again, please link to or describe it in the comments below.

Data resources: data resources to support the investigation are available here:
http://datahub.io/dataset/bangladesh-garment-industry-dataset
Global Garment Supply Chain Data
