Category Archives: Data formats

Matters relating to databases

Perl vs Python

Here at Cranfield University we work a lot with data in our GIS and data-related teaching and research. A common challenge is in transferring a complex dataset that is in one format into another format to make it useable. Many times there are tools we can use to help in that manipulation, both proprietary and open source. For the spatial datasets we often work with, we can use the range of data convertors in ArcGIS and QGIS, we can use the fantastic ‘Feature Manipulation Engine’ (FME) from Safe inc., or its manifestation in ArcGIS – the data interoperability tool, then again we can look to libraries such as the Geospatial Data Abstraction Library (gdal) for scripted functionality. As ever in computing, there are many ways of achieving our objectives.

However, sometimes there is nothing for it but to hack away in a favourite programming scripting language to make the conversion. Traditionally we used the wonderfully eclectic ‘Perl‘ language (pathologically eclectic rubbish lister – look it up!!) More recently the emphasis has perhaps shifted to Python as the language of choice. Certainly, if we are asked by our students which general purpose programming language to use for data manipulation, we advise Python is the one to have experience with on the CV.

If we have a simple data challenge, for example, we might want to convert an ASCII text file with data in one format to another format and write it out to a new file. We might want to go say from a file in this format (in ‘input.csv’):

AL1 1AG,1039499.00,0
AL1 1AG,383009.76,10251
etc......

To this format (in ‘output.csv’) …

UK,Item 1,R,AL1 1AG,,,,,,,,,,1039499.00,0,,,,
UK,Item 2,R,AL1 1AG,,,,,,,,,,383009.76,10251,,,,
etc......

For this Perl is a great solution – integration the strengths of awk and sed. Perl can produce code which quickly chomps through huge data files. One has to be careful as to how the code is developed, to ensure its readability. Sometimes, coming back to a piece of code one can struggle to remember how it works for a while – and this is especially so where the code is highly compacted.

#!/usr/bin/env perl
# Call as β€˜perl script.pl <in_file> > <out_file>β€˜
# e.g. perl script.pl input.csv > output.csv
use Text::CSV;
my $csv = Text::CSV->new({sep_char => ',' });
$j=1;
while (<>) {
  chomp;
  if ($csv->parse($_)) {
    my @fields = $csv->fields();
    printf("UK,Item %d,R,%s,,,,,,,,,,%s,%s,,,,\n",$j++,@fields[0],@fields[1],@fields[2]);
  }       
}

The equivalent task in Python is equally simple, and perhaps a little more readable…

#!/usr/bin/env python
# python3 code
# Call as 'python3 script.py'
import csv
o = open('output.csv','w')
with open('input.csv', 'r') as f:
   reader = csv.reader(f)
   mylist = list(reader)
j = 0
for row in mylist:
   j+=1
   o.write('UK,Item {:d},R,{:s},,,,,,,,,{:s},{:s},,,,\n'.format(j, row[0], row[1], row[2]$

Note the code above is Python3 not Python2. Like Perl (with cpan), Python is extensible (with pip) – and in fact one really needs to use extensions (modules, or imported libraries) to get the most out of it (and to help prevent you needing to reinvent the wheel and introducing unnecessary errors). There is no need to write lots of code for handling CSV files for example – the csv library above does this very efficiently in Python. Likewise, if say we want to write data back out to JSON (JavaScript Object Notation format), again the json library can come to the rescue:

import csv
import json
jsonfile = open('/folderlocation/output.json', 'w')
with open('/folderlocation/input.csv', newline='', encoding='utf-8-sig') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        print(', '.join(row))
        json.dump(row, jsonfile)

There is probably not really a lot in the difference between the two languages – it all rather depends on ones preferences. However, for GIS professionals, Python expertise is a must as it is adopted as the scripting language of choice in ArcGIS (in fact even being shipped with ArcGIS). Other alternatives exist of course for these sorts of tasks – ‘R‘ is one that comes to mind – again being equally extensible.

Exploring traffic times data

A recent investigation here at Cranfield University considered the sources of road journey traffic time data, and this blog recounts some of that investigation. First of all comes the sources of the data.

Highways England Data

Thanks to the fantastic open data revolution we now have a huge wealth of public data available via the www.data.gov.uk portal. Here for example we can source data on traffic times from the Highways England agency. Their traffic times data can be obtained from https://data.gov.uk/dataset/dft-eng-srn-routes-journey-times

This data series provides average journey time, speed and traffic flow information for 15-minute periods since April 2009 on all morotways and ‘A’ roads managed by the Highways Agency, known as the Strategic Road Network, in England, with journey times and speeds estimated using a combination of sources, including Automatic Number Plate Recognition (ANPR) cameras, in-vehicle Global Positioning Systems (GPS) and inductive loops built into the road surface.

For example, we downloaded the CSV file: ‘Feb15.csv‘ relating to  February 2015 data. The first line of which by example reads:

LinkRef Link Description Date Time Period AverageJT Average Speed Data Quality Link Length Flow
AL215 A120 between A133 and A1232 (AL215) 2015-02-10 00:00:00 67 305.47 105.12 1 8.9200000762939453 286.50

This line of data relates to a stretch of road north of Colchester, UK on the A120. Key information here being that on 10th February 2015, for this c.9km stretch of road, it took 287 seconds (c4.8mins) to drive. The time of day is given as 67. This number is one of 96 15-minute intervals in the day that the data refers to (0-95 where 0 indicates 00:00 to 00:15). 67 is therefore 4:45:00 PM to 5:00:00 PM (see a useful table at the end of this article for working this out).

Google Traffic Data

Another useful source of data is from Google. The Google routing and traffic functions can be used by making a call to the Google ‘Distance Matrix’ API, described here:

https://developers.google.com/maps/documentation/distance-matrix/intro#traffic-model

Using the excellent ‘Postman‘ tool, We can formulate and test a REST call to the Google distancematrix API.

https://maps.googleapis.com/maps/api/distancematrix/json?units=metric&origins=enc:{s{{H{ovD:&destinations=enc:g{t{HqtiE:&departure_time=now&traffic_model=best_guess&key=<API KEY>

Parameters for this API are as follows:
units = metric values (e.g. km)
origins = startint point (encoded)
destinations = finish point (encoded)
departure time = can’t be historical, ‘now’ = keyword
traffic model = best guess (not optimistic/pessimistic)
API key = the personal API

The parameters origins and  destinations hold locations in latitude and longitude. As an alternative to decimal degree values for these, there can be encoded values used in the URL. To encode loctions the polyline utility can be used: See https://developers.google.com/maps/documentation/utilities/polylineutility

The resultant response to this REST call, made using Postman to send query off (GET), is:

{
    "destination_addresses": [
        "A120, Colchester CO7, UK"
    ],
    "origin_addresses": [
        "A120, Ardleigh, Colchester CO7, UK"
    ],
    "rows": [
        {
            "elements": [
                {
                    "distance": {
                        "text": "8.0 km",
                        "value": 7993
                    },
                    "duration": {
                        "text": "5 mins",
                        "value": 278
                    },
                    "duration_in_traffic": {
                        "text": "5 mins",
                        "value": 301
                    },
                    "status": "OK"
                }
            ]
        }
    ],
    "status": "OK"
}

Key information here being that at the time of making the call (‘now’), for this c.8km stretch of road, it took between 278 to 301 seconds (c4.6 to 5.0 mins) to drive. Key to this is the difference between the ‘duration’ and ‘duration_in_traffic’ values. Google note the allows you to ‘receive a route and trip duration (response field: duration_in_traffic) that take traffic conditions into account’. Note that ‘the departure_time must be set to the current time or some time in the future. It cannot be in the past’.

So in this way the Google approach allows a definition of the delays in drive time caused by traffic conditions. Although this cannot be determined retrospectively, a speculative future date can be selected whereby a prediction is made based on previous traffic conditions.


Utilities

The table used to calculate the time period for the Highways England data, described above:

Period From To
0 12:00:00 AM 12:15:00 AM
1 12:15:00 AM 12:30:00 AM
2 12:30:00 AM 12:45:00 AM
3 12:45:00 AM 1:00:00 AM
4 1:00:00 AM 1:15:00 AM
5 1:15:00 AM 1:30:00 AM
6 1:30:00 AM 1:45:00 AM
7 1:45:00 AM 2:00:00 AM
8 2:00:00 AM 2:15:00 AM
9 2:15:00 AM 2:30:00 AM
10 2:30:00 AM 2:45:00 AM
11 2:45:00 AM 3:00:00 AM
12 3:00:00 AM 3:15:00 AM
13 3:15:00 AM 3:30:00 AM
14 3:30:00 AM 3:45:00 AM
15 3:45:00 AM 4:00:00 AM
16 4:00:00 AM 4:15:00 AM
17 4:15:00 AM 4:30:00 AM
18 4:30:00 AM 4:45:00 AM
19 4:45:00 AM 5:00:00 AM
20 5:00:00 AM 5:15:00 AM
21 5:15:00 AM 5:30:00 AM
22 5:30:00 AM 5:45:00 AM
23 5:45:00 AM 6:00:00 AM
24 6:00:00 AM 6:15:00 AM
25 6:15:00 AM 6:30:00 AM
26 6:30:00 AM 6:45:00 AM
27 6:45:00 AM 7:00:00 AM
28 7:00:00 AM 7:15:00 AM
29 7:15:00 AM 7:30:00 AM
30 7:30:00 AM 7:45:00 AM
31 7:45:00 AM 8:00:00 AM
32 8:00:00 AM 8:15:00 AM
33 8:15:00 AM 8:30:00 AM
34 8:30:00 AM 8:45:00 AM
35 8:45:00 AM 9:00:00 AM
36 9:00:00 AM 9:15:00 AM
37 9:15:00 AM 9:30:00 AM
38 9:30:00 AM 9:45:00 AM
39 9:45:00 AM 10:00:00 AM
40 10:00:00 AM 10:15:00 AM
41 10:15:00 AM 10:30:00 AM
42 10:30:00 AM 10:45:00 AM
43 10:45:00 AM 11:00:00 AM
44 11:00:00 AM 11:15:00 AM
45 11:15:00 AM 11:30:00 AM
46 11:30:00 AM 11:45:00 AM
47 11:45:00 AM 12:00:00 PM
48 12:00:00 PM 12:15:00 PM
49 12:15:00 PM 12:30:00 PM
50 12:30:00 PM 12:45:00 PM
51 12:45:00 PM 1:00:00 PM
52 1:00:00 PM 1:15:00 PM
53 1:15:00 PM 1:30:00 PM
54 1:30:00 PM 1:45:00 PM
55 1:45:00 PM 2:00:00 PM
56 2:00:00 PM 2:15:00 PM
57 2:15:00 PM 2:30:00 PM
58 2:30:00 PM 2:45:00 PM
59 2:45:00 PM 3:00:00 PM
60 3:00:00 PM 3:15:00 PM
61 3:15:00 PM 3:30:00 PM
62 3:30:00 PM 3:45:00 PM
63 3:45:00 PM 4:00:00 PM
64 4:00:00 PM 4:15:00 PM
65 4:15:00 PM 4:30:00 PM
66 4:30:00 PM 4:45:00 PM
67 4:45:00 PM 5:00:00 PM
68 5:00:00 PM 5:15:00 PM
69 5:15:00 PM 5:30:00 PM
70 5:30:00 PM 5:45:00 PM
71 5:45:00 PM 6:00:00 PM
72 6:00:00 PM 6:15:00 PM
73 6:15:00 PM 6:30:00 PM
74 6:30:00 PM 6:45:00 PM
75 6:45:00 PM 7:00:00 PM
76 7:00:00 PM 7:15:00 PM
77 7:15:00 PM 7:30:00 PM
78 7:30:00 PM 7:45:00 PM
79 7:45:00 PM 8:00:00 PM
80 8:00:00 PM 8:15:00 PM
81 8:15:00 PM 8:30:00 PM
82 8:30:00 PM 8:45:00 PM
83 8:45:00 PM 9:00:00 PM
84 9:00:00 PM 9:15:00 PM
85 9:15:00 PM 9:30:00 PM
86 9:30:00 PM 9:45:00 PM
87 9:45:00 PM 10:00:00 PM
88 10:00:00 PM 10:15:00 PM
89 10:15:00 PM 10:30:00 PM
90 10:30:00 PM 10:45:00 PM
91 10:45:00 PM 11:00:00 PM
92 11:00:00 PM 11:15:00 PM
93 11:15:00 PM 11:30:00 PM
94 11:30:00 PM 11:45:00 PM
95 11:45:00 PM 12:00:00 AM

Corrupt Word docx document

We all use Word for report writing and general word processing, but arghhhh, what to do when the file becomes corrupted!

One of the students here at Cranfield University, suffered a recent misfortune of corrupting a MS Word docx file. The file had been 30+ pages of closely written text, ready for a thesis meeting, when disaster struck. Somehow the document became corrupted and opened as a blank document with no text. Inspecting it, we realised the document size was still 1.5Mb – so the text was probably still in the file – even if we couldn’t see it.

We tried all means of tricks to cajole Word to open the file and recover the precious text, all to no avail. At a point frankly of some desparation, we remembered the docx file format is a zipped XML structure file. This is a saving grace – the earlier ‘doc’ format was just a proprietary binary format, now the ‘docx’ format offered some hope.

We took a copy of the file, and renamed it ‘document.zip’. This allowed us to open the zip archive, and to see the contents. Immediately we see the hierarchical structure of the document and the multiple files it contains.
MS Word XML file structure

We could then open the folder ‘word’, which showed the principle contents of the document.
MS Word XML file structure

Straight away, we can see the sub-folder ‘media’ – this folder contained all the images that had been in the original document. Great – those were saved off. Now we needed to extract the text itself.

There is also the key file ‘document.xml’ – XML is XML is a software- and hardware-independent tool for storing and transporting data (http://www.w3schools.com/xml/xml_whatis.asp), stored in plain text. We extracted the text and loaded in our favourite text editor (TextWrangler – for the Mac). Inspecting the XML file shows the usual structures of xml – all spun onto a single line, thus:
MS Word XML file structure

We could then hunt through the file and locate the text in the XML, noting the tags within which the text was recorded. In this case, we can see text tagged with <w:t>.
MS Word XML file structure

So now we needed an automated method to extract all the text from the document – to start with, no such tool is present on the Mac. However, thanks to Kevin Peck’s excellent blog here (http://kevsaidwhat.blogspot.co.uk/2013/03/other-mac-things-i-have-learned.html) we found that the software xml_grep did exactly what we wanted. This uses the module XML::Twig for Perl (http://www.xmltwig.org/xmltwig/). Perl is a fantastic scripting language – well worth learning, and excellent for file manipulation.

As Kevin notes, the tool was swiftly installed in the Mac thus:

cd XML-Twig-3.50 (use latest version downloaded here)
perl Makefile.PL -y
make
make test
sudo make install

Once the tool was built and working, we could run the extraction we wanted, thus:

$> xml_grep --text_only --cond 'w:t' document.xml > extractedtext.txt

This produced a file holding the text of the document – which at least allowed our student to carry on with their work – albeit that the report needed reconstructing. Also it was interesting to see how the Office files are held as zipped XML format. In any case – phew! The learning point in all this of course is the BACK UP YOUR FILES!! This has of course been said many times πŸ˜‰