Category Archives: Perl

Matters relating to the Perl programming (scripting) language

Perl vs Python

Here at Cranfield University we work a lot with data in our GIS and data-related teaching and research. A common challenge is transferring a complex dataset from one format into another to make it usable. Often there are tools, both proprietary and open source, that can help with that manipulation. For the spatial datasets we often work with, we can use the range of data converters in ArcGIS and QGIS; we can use the fantastic ‘Feature Manipulation Engine’ (FME) from Safe Software, or its manifestation in ArcGIS, the Data Interoperability tool; or we can look to libraries such as the Geospatial Data Abstraction Library (GDAL) for scripted functionality. As ever in computing, there are many ways of achieving our objectives.

However, sometimes there is nothing for it but to hack away in a favourite scripting language to make the conversion. Traditionally we used the wonderfully eclectic ‘Perl’ language (‘Pathologically Eclectic Rubbish Lister’ – look it up!!). More recently the emphasis has perhaps shifted to Python as the language of choice. Certainly, if our students ask which general-purpose programming language to use for data manipulation, we advise that Python is the one to have experience with on the CV.

Take a simple data challenge: we might want to convert an ASCII text file with data in one format to another format and write it out to a new file. Say we want to go from a file in this format (in ‘input.csv’):

AL1 1AG,1039499.00,0
AL1 1AG,383009.76,10251
etc......

To this format (in ‘output.csv’) …

UK,Item 1,R,AL1 1AG,,,,,,,,,,1039499.00,0,,,,
UK,Item 2,R,AL1 1AG,,,,,,,,,,383009.76,10251,,,,
etc......

For this Perl is a great solution, integrating the strengths of awk and sed. Perl can produce code which quickly chomps through huge data files. One has to be careful how the code is developed, to ensure its readability. Coming back to a piece of code, one can sometimes struggle for a while to remember how it works, especially where the code is highly compacted.

#!/usr/bin/env perl
# Call as 'perl script.pl <in_file> > <out_file>'
# e.g. perl script.pl input.csv > output.csv
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({ sep_char => ',' });
my $j = 1;    # running item number
while (<>) {
  chomp;
  if ($csv->parse($_)) {
    my @fields = $csv->fields();
    # $fields[n] is a single element; @fields[n] would be a one-element array slice
    printf("UK,Item %d,R,%s,,,,,,,,,,%s,%s,,,,\n", $j++, $fields[0], $fields[1], $fields[2]);
  }
}

The equivalent task in Python is equally simple, and perhaps a little more readable…

#!/usr/bin/env python3
# python3 code
# Call as 'python3 script.py'
import csv
# Open both files in one 'with' so each is closed when the block ends
with open('input.csv', 'r', newline='') as f, open('output.csv', 'w') as o:
   reader = csv.reader(f)
   # enumerate() numbers the items from 1, as in the Perl version
   for j, row in enumerate(reader, start=1):
      o.write('UK,Item {:d},R,{:s},,,,,,,,,,{:s},{:s},,,,\n'.format(j, row[0], row[1], row[2]))
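
A variation on the Python script, offered only as a sketch: instead of spelling out the runs of commas in a format string, the output row can be built as a list and handed to csv.writer, which adds the separators itself. The function name convert and the in-memory sample are illustrative assumptions, and the column counts (9 empty columns after the postcode, 4 at the end) are read off the target layout shown earlier.

```python
import csv
import io

def convert(in_stream, out_stream):
    """Read postcode,value,count rows and write them in the padded target layout."""
    reader = csv.reader(in_stream)
    writer = csv.writer(out_stream, lineterminator='\n')
    for item_no, row in enumerate(reader, start=1):
        postcode, value, count = row[:3]
        # 9 empty columns sit between the postcode and the value; 4 trail the row
        writer.writerow(['UK', f'Item {item_no}', 'R', postcode]
                        + [''] * 9 + [value, count] + [''] * 4)

# In-memory demonstration using the two sample rows from the post
sample = io.StringIO('AL1 1AG,1039499.00,0\nAL1 1AG,383009.76,10251\n')
result = io.StringIO()
convert(sample, result)
print(result.getvalue(), end='')
```

With real files, the two StringIO objects would be replaced by open('input.csv', newline='') and open('output.csv', 'w', newline='').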

Note the code above is Python 3, not Python 2. Like Perl (with CPAN), Python is extensible (with pip), and in fact one really needs to use modules and imported libraries to get the most out of it, and to avoid reinventing the wheel and introducing unnecessary errors. There is no need to write lots of code for handling CSV files, for example: the csv library above, part of Python's standard library, does this very efficiently. Likewise, if we want to write data back out as JSON (JavaScript Object Notation), the json library can come to the rescue:

import csv
import json
with open('/folderlocation/input.csv', newline='', encoding='utf-8-sig') as csvfile, \
     open('/folderlocation/output.json', 'w') as jsonfile:
    reader = csv.reader(csvfile)
    rows = []
    for row in reader:
        print(', '.join(row))
        rows.append(row)
    # Dump once at the end, so the file holds a single valid JSON array
    json.dump(rows, jsonfile)
There is probably not a lot of difference between the two languages; it all rather depends on one's preferences. However, for GIS professionals Python expertise is a must, as it has been adopted as the scripting language of choice in ArcGIS (and is even shipped with it). Other alternatives exist of course for these sorts of tasks: ‘R’ is one that comes to mind, again being equally extensible.

Corrupt Word docx document

We all use Word for report writing and general word processing, but arghhhh, what to do when the file becomes corrupted!

One of the students here at Cranfield University recently suffered the misfortune of a corrupted MS Word docx file. The file held 30+ pages of closely written text, ready for a thesis meeting, when disaster struck. Somehow the document became corrupted and opened as a blank document with no text. Inspecting it, we realised the file size was still 1.5 MB, so the text was probably still in the file, even if we couldn’t see it.

We tried all manner of tricks to cajole Word into opening the file and recovering the precious text, all to no avail. At a point, frankly, of some desperation, we remembered that the docx file format is a zipped XML structure. This was the saving grace: the earlier ‘doc’ format was a proprietary binary format, whereas the ‘docx’ format offered some hope.

We took a copy of the file and renamed it ‘document.zip’. This allowed us to open the zip archive and see the contents. Immediately we could see the hierarchical structure of the document and the multiple files it contains.
[Image: MS Word XML file structure]
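
Renaming the copy is not strictly necessary when scripting: Python's zipfile module will read a .docx directly. The following is a minimal sketch, not what we did at the time; the helper names list_parts and read_part and the file name report.docx are illustrative, and the in-memory archive merely stands in for a real document.

```python
import io
import zipfile

def list_parts(docx):
    """Return the names of the files stored inside a .docx (zip) container."""
    with zipfile.ZipFile(docx) as archive:
        return archive.namelist()

def read_part(docx, name='word/document.xml'):
    """Return the raw bytes of one part, by default the main document body."""
    with zipfile.ZipFile(docx) as archive:
        return archive.read(name)

# Stand-in for a real file: a tiny zip with a similar layout, built in memory.
# With a real document, pass its path instead, e.g. list_parts('report.docx').
demo = io.BytesIO()
with zipfile.ZipFile(demo, 'w') as archive:
    archive.writestr('[Content_Types].xml', '<Types/>')
    archive.writestr('word/document.xml', '<w:document/>')
    archive.writestr('word/media/image1.png', b'not a real image')

print(list_parts(demo))
print(read_part(demo))
```

If the zip container itself is damaged, zipfile raises zipfile.BadZipFile, and this route will not help.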

We could then open the folder ‘word’, which showed the principal contents of the document.
[Image: MS Word XML file structure]

Straight away, we could see the sub-folder ‘media’, which contained all the images that had been in the original document. Great – those were saved off. Now we needed to extract the text itself.

There was also the key file ‘document.xml’. XML is a software- and hardware-independent tool for storing and transporting data (http://www.w3schools.com/xml/xml_whatis.asp), stored in plain text. We extracted this file and loaded it into our favourite text editor (TextWrangler, for the Mac). Inspecting the XML file showed the usual structures of XML, all spun onto a single line, thus:
[Image: MS Word XML file structure]

We could then hunt through the file and locate the text in the XML, noting the tags within which the text was recorded. In this case, we could see the text tagged with <w:t>.
[Image: MS Word XML file structure]

So now we needed an automated method to extract all the text from the document. To start with, no such tool is present on the Mac. However, thanks to Kevin Peck’s excellent blog here (http://kevsaidwhat.blogspot.co.uk/2013/03/other-mac-things-i-have-learned.html) we found that the software xml_grep did exactly what we wanted. This uses the module XML::Twig for Perl (http://www.xmltwig.org/xmltwig/). Perl is a fantastic scripting language, well worth learning and excellent for file manipulation.

As Kevin notes, the tool was swiftly installed on the Mac thus:

cd XML-Twig-3.50    # use the latest version downloaded
perl Makefile.PL -y
make
make test
sudo make install

Once the tool was built and working, we could run the extraction we wanted, thus:

$> xml_grep --text_only --cond 'w:t' document.xml > extractedtext.txt
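
For anyone without xml_grep to hand, a similar extraction can be sketched with Python's standard library. This is an alternative under stated assumptions, not the tool we used: it assumes document.xml is still well-formed XML, and the function name extract_text and the hand-written sample are illustrative. Unlike the xml_grep call above, it joins the <w:t> runs of each <w:p> paragraph onto one line.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the 'w:' prefix in WordprocessingML documents
W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def extract_text(xml_bytes):
    """Return the document text, one line per <w:p> paragraph."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W + 'p'):
        # Join the text of every <w:t> run found inside this paragraph
        paragraphs.append(''.join(t.text or '' for t in para.iter(W + 't')))
    return '\n'.join(paragraphs)

# Tiny hand-written stand-in for the contents of word/document.xml
sample = (
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    b'<w:body>'
    b'<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world.</w:t></w:r></w:p>'
    b'<w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>'
    b'</w:body></w:document>'
)
text = extract_text(sample)
print(text)
```

Paragraphs nested inside other paragraphs (text boxes, for instance) would be picked up twice by this simple walk, so the result may need tidying by hand.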

This produced a file holding the text of the document, which at least allowed our student to carry on with their work, albeit that the report needed reconstructing. It was also interesting to see how Office files are held in a zipped XML format. In any case – phew! The learning point in all this of course is to BACK UP YOUR FILES!! This has of course been said many times 😉