In Support of the Bangladeshi Garment Industries Data Expedition
A couple of quick things that may be of use in the current data expedition around the Bangladeshi garment industry…
Matching company names
One possible route for the expedition to take is to check whether a company referred to in one list (for example, a particular supplier’s list) is also mentioned in another list (for example, the accord on fire safety, another supplier’s factory list, or a factory blacklist). It is quite possible that names won’t match exactly… Some clues for how to address this are described in the School of Data blog post Finding Matching Items on Separate Lists – Bangladeshi Garment Factories.
Other things you can try:
- use a tool such as OpenRefine to fix the capitalisation, for example by converting names to titlecase (Edit Cells->Common Transforms->To titlecase);
- a lossy approach (and so not ideal), though one that can help with exact string matching, is to get to the “core” of the company name, for example by stripping out mentions of words like “Limited”, or variants thereof:
Here’s the command used in OpenRefine in that example:
value.toUppercase().replace(/[.;:'()`-]/,'').replace('LIMITED','LTD').replace(/(LTD$)/,'').replace(/\s+/,' ').strip()
A better approach might be to “normalise” mentions of “Ltd”, etc, using something like this:
value.toUppercase().replace(/[.;:'()`-]/,'').replace(' LTD',' LIMITED').replace(/\s+/,' ').strip()
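If you are working outside OpenRefine, the same idea can be sketched in Python. This is a rough equivalent rather than a line-for-line port: the function names are made up for illustration, the curly apostrophe is added to the stripped punctuation as an assumption, and “LTD” is only expanded when it appears as a whole word.

```python
import re

# Punctuation to strip, mirroring the character class in the OpenRefine expression
# (the curly apostrophe is an extra, added here as an assumption).
PUNCTUATION = re.compile(r"[.;:'’()`-]")
WHITESPACE = re.compile(r"\s+")


def normalise_name(name: str) -> str:
    """Upper-case a name, strip punctuation, tidy spaces and expand 'LTD'."""
    cleaned = PUNCTUATION.sub("", name.upper())
    cleaned = WHITESPACE.sub(" ", cleaned).strip()
    # Expand the abbreviation only when it stands alone as a word.
    return re.sub(r"\bLTD\b", "LIMITED", cleaned)


def core_name(name: str) -> str:
    """Lossy variant: drop a trailing 'LIMITED' to get at the core of the name."""
    return re.sub(r"\s*\bLIMITED$", "", normalise_name(name))
```

Comparing the output of normalise_name() across two lists keeps more information than core_name(), which is only worth trying when exact matching on the normalised names fails.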
Corporate Identifiers
As described in the post On the Need for Corporate Identifiers, it’s often more convenient if we can represent companies using unique and unambiguous corporate identifiers.
OpenCorporates have recently uploaded details of Bangladeshi registered companies to their databases; using the OpenCorporates reconciliation API, mapping “messy” names to company identifiers can be handled by OpenCorporates automatically, with a confidence score describing the likelihood of a match being a “true” match.
The OpenCorporates reconciliation API can be used to match company names to just Bangladeshi registered companies using the reconciliation endpoint http://opencorporates.com/reconcile/bd
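For anyone who wants to call the endpoint from a script rather than from OpenRefine, here is a minimal sketch of how a batch request might be put together. It assumes the service accepts the usual reconciliation-service convention of a “queries” parameter holding a JSON object of keyed queries; I have not checked that against the live endpoint, and the function name is made up. The sketch only builds the URL and does not send anything.

```python
import json
from urllib.parse import urlencode

# Endpoint quoted in the post; whether it still responds in this form is not verified here.
RECONCILE_URL = "http://opencorporates.com/reconcile/bd"


def build_reconcile_url(names: list[str]) -> str:
    """Build a batch reconciliation request URL for a list of company names."""
    # Keys q0, q1, ... are arbitrary labels used to pair results with queries.
    queries = {f"q{i}": {"query": name} for i, name in enumerate(names)}
    # Assumption: the service reads a JSON-encoded "queries" parameter.
    return f"{RECONCILE_URL}?{urlencode({'queries': json.dumps(queries)})}"
```

Batching several names into one request keeps the number of calls down, which matters if the service is rate limited.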
For convenience, I quickly ran the names of the companies on the fire safety accord through the reconciliation process (results here). Not all the company names are matched, but a good proportion are. (To match the rest, filter out the unmatched items, run the reconciliation search on what’s left and see if you can improve the matches.) Note that, having inspected that data, I can see it needs a little cleaning! Also, the scraper used to get the data from the accord PDF is slightly broken.
Reconciliation is most easily achieved using OpenRefine. From the header of the company name column, select Reconcile -> Start reconciling...
Accept the defaults, and click on Start Reconciling to begin the matching process.
You can then set about accepting – or not – the matches… I tend to filter to just display the confident matches and accept them:
Then I look at the middling matches and accept those by hand:
Then I’m often tempted to filter down to the results I’m not confident about and discard the reconciliation judgements:
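The same three-way triage can be sketched in code for results handled outside OpenRefine. This assumes each candidate is a dictionary with a numeric “score” field, as reconciliation services typically return; the function name and both threshold values are placeholders I have picked for illustration, and would need tuning against whatever scores the service actually produces.

```python
def triage(candidates, accept_at=80.0, review_at=40.0):
    """Sort the reconciliation candidates for one name into a decision bucket."""
    if not candidates:
        return "discard", None
    # Only the highest-scoring candidate is considered.
    best = max(candidates, key=lambda c: c["score"])
    if best["score"] >= accept_at:
        return "accept", best   # confident match: accept automatically
    if best["score"] >= review_at:
        return "review", best   # middling match: check by hand
    return "discard", None      # low confidence: drop the judgement
```

Anything that lands in the “review” bucket still needs a human to look at it; the thresholds only decide how much ends up there.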
You can generate new columns based on the data pulled down from the reconciliation API:
For example:
- to get the reconciled company name, create a new column based on the reconciled column with the pattern cell.recon.match.name.
- to get the reconciled company identifier, create a new column based on the reconciled column with the pattern cell.recon.match.id.
If you reconcile names appearing in different lists, your data will be enriched with unambiguous names and identifiers that you can use to support company matching across different data files.
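Once two lists both carry reconciled identifiers, matching them up is a simple join. Here is a minimal sketch, assuming each row is a dictionary with a “company_id” key (a made-up column name) that is empty or missing where reconciliation failed:

```python
def match_on_identifier(list_a, list_b):
    """Pair up rows from two lists that share a reconciled company identifier."""
    # Index the second list by identifier, skipping unreconciled rows.
    by_id = {}
    for row in list_b:
        if row.get("company_id"):
            by_id.setdefault(row["company_id"], []).append(row)
    # Emit one (row_a, row_b) pair for every shared identifier.
    return [
        (row, other)
        for row in list_a
        if row.get("company_id")
        for other in by_id.get(row["company_id"], [])
    ]
```

Rows without an identifier are left out of the result, so they can be picked up separately and matched by name instead.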
Addresses
There may be some work to be done around the addresses associated with each company. For example, the Bangladeshi company descriptions on OpenCorporates seem to be lacking in address information. There may be some merit in treating OpenCorporates as the best place to store this information, and then retrieving it through the various OpenCorporates APIs as required. At the current time, volunteer effort in terms of adding address information to OpenCorporates may be one way forward?
A major use for the addresses is to support mapping activities. Some geocoders may give spurious locations for some addresses – try to force the issue by adding “, Bangladesh” to the end of every address you geocode.
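A small helper along these lines can tidy each address before it is handed to a geocoder. It is only a sketch: the function name is made up, and it simply tidies whitespace and appends the country where the address doesn’t already end with it.

```python
def geocoding_query(address: str, country: str = "Bangladesh") -> str:
    """Append the country to an address unless it already ends with it."""
    # Collapse runs of whitespace and drop any trailing commas or spaces.
    cleaned = " ".join(address.split()).rstrip(" ,")
    if cleaned.lower().endswith(country.lower()):
        return cleaned
    return f"{cleaned}, {country}"
```

The resulting string is what you would pass to whichever geocoding service you are using.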
It may be possible to augment addresses for geocoding using additional geographical region information. There is a list of Bangladesh postcodes by area published at http://www.bangladeshpost.gov.bd/postcode.asp which I have scraped and popped into a data file here:
https://github.com/psychemedia/ScoDa-GarmentsDataExpedition/tree/master/geocoding
It may be possible/appropriate to annotate addresses with additional information from this file, for example by matching on district and/or division and adding in more information to the geocoded address, where possible. (Phone city codes may also help identify additional address information – is there a mapping of these somewhere? Or could we construct one?)
If you do get latitude/longitude coordinates for an address, try to also associate the coordinates with an OpenCorporates company identifier to support unambiguous matching of coordinates/locations to names (although there may be multiple names/identifiers associated with a particular location).
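One way of keeping track of several companies at the same location is to index identifiers by coordinate. The sketch below assumes each record is a dictionary with “lat”, “lon” and “company_id” keys (made-up names), and rounds coordinates to five decimal places so that near-identical geocoder results fall into the same bucket; the precision is an arbitrary choice.

```python
from collections import defaultdict


def index_by_location(records, precision=5):
    """Group company identifiers by rounded (lat, lon) so shared sites are visible."""
    index = defaultdict(set)
    for rec in records:
        key = (round(rec["lat"], precision), round(rec["lon"], precision))
        index[key].add(rec["company_id"])
    return dict(index)
```

Any location whose set holds more than one identifier is worth a closer look, since it may be a shared building or a geocoder falling back to a district centre.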
Other hints and tips
If you have any other tricks that may be useful to support the current data expedition, please feel free to add a link to, or description of, them as a comment to this post.
Or if you discover or develop a useful technique during the data expedition that you think would have helped had you come across it before you started the expedition, again, please link to or describe it in the comments below.
Data resources: data resources to support the investigation are available here:
– http://datahub.io/dataset/bangladesh-garment-industry-dataset
– Global Garment Supply Chain Data